## Finding useful data in the wild: perspectives from data journalism

[This](https://www.reddit.com/r/dataisbeautiful/comments/t9a764/oc_equipment_losses_in_the_ukrainerussian_war/) Reddit post is an interesting read. It is a submission to `r/dataisbeautiful`, a subreddit I subscribe to, and it tries to compare the military equipment losses on both sides of the Russian-Ukrainian conflict. A cursory look at the charts gives the impression that the numbers are the absolute truth. Only when one digs a little deeper and looks at the source cited [here](https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html?m=1) does one realise that:
1. Data is extremely dirty in the wild.
2. The biases and inaccuracies don't become apparent until one digs into the source.


### Getting the relevant information from data

Look at the exhibit below and you will see that the relevant data is not in a form that lends itself to analysis.

![](osint_main.png)

How does one even start putting this data into a reasonable structure?

One reasonable schema is the following:

| Vehicle Type | Vehicle Category | Country | Damaged | Captured | Abandoned | Destroyed |
| :----------- | :--------------- | :------ | :------ | :------- | :-------- | :-------- |
| T72          | Armour           | Russia  | 10      | 8        | 4         | 13        |
| ..           | ..               | ..      | ..      | ..       | ..        | ..        |
| ..           | ..               | ..      | ..      | ..       | ..        | ..        |
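
To make the schema concrete, here is a minimal sketch of one row expressed as a Python dataclass. The class name `LossRecord`, the field names, and the `total` helper are illustrative assumptions, not something the source page defines; the example values are the ones from the table above.

```python
from dataclasses import dataclass, asdict


@dataclass
class LossRecord:
    """One row of the proposed schema (hypothetical name and fields)."""
    vehicle_type: str       # e.g. "T72"
    vehicle_category: str   # e.g. "Armour"
    country: str            # e.g. "Russia"
    damaged: int = 0
    captured: int = 0
    abandoned: int = 0
    destroyed: int = 0

    @property
    def total(self) -> int:
        # Sum of the four status counts for this vehicle type
        return self.damaged + self.captured + self.abandoned + self.destroyed


# The example row from the table above
example = LossRecord("T72", "Armour", "Russia",
                     damaged=10, captured=8, abandoned=4, destroyed=13)
print(asdict(example))
print(example.total)
```

Keeping the counts as integers from the start means later aggregation does not have to deal with strings.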

But how does one go about populating such a schema? For starters, one can scrape the data and put it in tabular form.


In [1]:
import requests 
from bs4 import BeautifulSoup
url = 'https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html?m=1'

In [2]:
raw_html = requests.get(url).text
raw_html[0:1000]

'<!DOCTYPE html>\n<html class=\'v2\' dir=\'ltr\' xmlns=\'http://www.w3.org/1999/xhtml\' xmlns:b=\'http://www.google.com/2005/gml/b\' xmlns:data=\'http://www.google.com/2005/gml/data\' xmlns:expr=\'http://www.google.com/2005/gml/expr\' xmlns:og=\'http://ogp.me/ns#\'>\n<head>\n<link href=\'https://www.blogger.com/static/v1/widgets/1529571102-css_bundle_v2.css\' rel=\'stylesheet\' type=\'text/css\'/>\n<meta content=\'width=device-width, initial-scale=1, maximum-scale=1\' name=\'viewport\'/>\n<link href="//fonts.googleapis.com/css?family=Muli:700,700i,800%7CLora:400,400i,700,700i%7CPlayfair+Display:400,400i,700" media="all" rel="stylesheet" type="text/css">\n<link href=\'//maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css\' rel=\'stylesheet\'/>\n<meta content=\'text/html; charset=UTF-8\' http-equiv=\'Content-Type\'/>\n<meta content=\'blogger\' name=\'generator\'/>\n<link href=\'https://www.oryxspioenkop.com/favicon.ico\' rel=\'icon\' type=\'image/x-icon\'/>\n<link href=\'

In [3]:
soup = BeautifulSoup(raw_html,'html.parser')

In [8]:
main_categories = soup.find('div',attrs={'itemprop':'articleBody'}).find_all("h3")

In [13]:
cleaned_cats = []
for cat in main_categories:
    if cat.text.strip()!='':
        cleaned_cats.append(cat.text.strip())


In [23]:
cleaned_cats[1].split(",")

['Tanks (156',
 ' of which destroyed: 53',
 ' damaged: 2',
 ' abandoned: 30',
 ' captured: 71)']

In [17]:
cleaned_cats[0]

'Russia - 967, of which: destroyed: 386, damaged: 13, abandoned: 157, captured: 411'
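
The two outputs above show that a heading comes in two shapes: `Label - total, ...` for a country and `Label (total, ...)` for a category. As a rough sketch, a single helper can handle both with two regular expressions. The name `parse_summary` and both patterns are my own assumptions based only on the two strings shown here; other headings on the page may need different handling.

```python
import re

# "destroyed: 53", "damaged: 2", ... anywhere in the heading
STATUS_RE = re.compile(r'(destroyed|damaged|abandoned|captured):\s*(\d+)')
# Leading label followed by "-" or "(" and then the total
HEADER_RE = re.compile(r'^\s*(.+?)\s*(?:-|\()\s*(\d+)')


def parse_summary(text):
    """Parse one heading into a flat dict, or return None if it does not match."""
    header = HEADER_RE.match(text)
    if header is None:
        return None
    record = {'label': header.group(1), 'total': int(header.group(2))}
    for status, count in STATUS_RE.findall(text):
        record[status] = int(count)
    return record


country = parse_summary(
    'Russia - 967, of which: destroyed: 386, damaged: 13, abandoned: 157, captured: 411')
tanks = parse_summary(
    'Tanks (156, of which destroyed: 53, damaged: 2, abandoned: 30, captured: 71)')
print(country)
print(tanks)
```

Matching `status: count` pairs directly avoids depending on how the comma-separated fragments happen to be split.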

In [24]:
import re
pattern_key = re.compile(r'[a-zA-Z]+')
pattern_value = re.compile(r'\d+')

In [None]:
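cnt_dict = []  # one entry per heading: {label: total, 'details': {...}}
for cat in cleaned_cats:
    # Drop the "of which" filler so each fragment reads "<label> <count>"
    cat = cat.replace("of which", "")
    parts = [p.strip() for p in cat.split(",")]
    keys = []
    values = []
    details = {}  # status -> count for this heading
    for p in parts:
        key = " ".join(re.findall(pattern_key, p))
        numbers = re.findall(pattern_value, p)
        if not numbers:
            continue  # skip fragments that carry no count
        value = int(numbers[0])
        if keys:
            # every fragment after the first is a status breakdown
            details[key] = value
        keys.append(key)
        values.append(value)
    if keys:
        # the first fragment holds the heading label and its total
        cnt_dict.append({keys[0]: values[0], 'details': details})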
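
As a possible next step toward the tabular form, the sketch below flattens entries shaped like the ones the loop above is meant to build (`{label: total, 'details': {...}}`) into rows and writes them as CSV using only the standard library. The helper names `flatten` and `to_csv`, the column order, and the hard-coded `sample` entries (copied from the two outputs shown earlier) are assumptions for illustration. Note that these rows are per category, not per vehicle type, so they cover only part of the schema proposed above.

```python
import csv
import io

FIELDS = ['label', 'total', 'destroyed', 'damaged', 'abandoned', 'captured']


def flatten(entry):
    """Turn {label: total, 'details': {...}} into one flat row."""
    label, total = next((k, v) for k, v in entry.items() if k != 'details')
    row = {'label': label, 'total': int(total)}
    # int() accepts both strings and integers, so either form of count works
    row.update({k: int(v) for k, v in entry['details'].items()})
    return row


def to_csv(entries):
    """Return the entries as CSV text with a fixed column order."""
    buf = io.StringIO()
    # Missing statuses default to 0; unexpected keys are dropped
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval=0,
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    for entry in entries:
        writer.writerow(flatten(entry))
    return buf.getvalue()


# Sample entries taken from the two headings shown earlier
sample = [
    {'Russia': 967, 'details': {'destroyed': 386, 'damaged': 13,
                                'abandoned': 157, 'captured': 411}},
    {'Tanks': '156', 'details': {'destroyed': '53', 'damaged': '2',
                                 'abandoned': '30', 'captured': '71'}},
]
csv_text = to_csv(sample)
print(csv_text)
```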
        
        

