## Finding useful data in the wild: perspectives from data journalism

[This](https://www.reddit.com/r/dataisbeautiful/comments/t9a764/oc_equipment_losses_in_the_ukrainerussian_war/) Reddit post is an interesting read. It is a submission to `r/dataisbeautiful`, a subreddit that I am subscribed to, and it tries to compare the military losses on both sides of the Russian-Ukrainian conflict. A cursory look at the charts gives the impression that the numbers are the absolute truth. Dig a little deeper into the source cited [here](https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html?m=1), however, and one realises that:
1. Data is extremely dirty in the wild.
2. The biases and inaccuracies don't become apparent until one digs into the source.


### Getting the relevant information from data

Look at the exhibit below and you will see that the relevant data is not in a form that lends itself to analysis.

![](osint_main.png)

How does one even start putting this data into a reasonable structure?

One reasonable schema is the following:

| Vehicle Type | Vehicle Category | Country | Damaged | Captured | Abandoned | Destroyed |
| :----------- | :--------------- | :------ | :------ | :------- | :-------- | :-------- |
| T72          | Armour           | Russia  | 10      | 8        | 4         | 13        |
| ..           | ..               | ..      | ..      | ..       | ..        | ..        |
| ..           | ..               | ..      | ..      | ..       | ..        | ..        |
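
One way to make the schema above concrete is a small record type, with one instance per table row. This is only a sketch: the class name `LossRecord` and its field names are assumptions chosen for illustration, and the example values are the T72 row from the table.

```python
from dataclasses import dataclass, asdict


@dataclass
class LossRecord:
    """One row of the target schema (hypothetical name, for illustration)."""
    vehicle_type: str
    vehicle_category: str
    country: str
    damaged: int = 0
    captured: int = 0
    abandoned: int = 0
    destroyed: int = 0

    @property
    def total(self) -> int:
        # Assumes each lost item is counted under exactly one status
        return self.damaged + self.captured + self.abandoned + self.destroyed


# The example row from the schema table above
row = LossRecord("T72", "Armour", "Russia", damaged=10, captured=8, abandoned=4, destroyed=13)
print(asdict(row))
print(row.total)
```

Keeping the counts as integers from the start means the later aggregation steps need no string conversion.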

But how does one go about creating such a schema? For starters, one can scrape this data and put it in tabular form.


In [1]:
import requests 
from bs4 import BeautifulSoup
url = 'https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html?m=1'

In [53]:
raw_html = requests.get(url, timeout=30).text  # timeout so a stalled request cannot hang the notebook

In [3]:
soup = BeautifulSoup(raw_html,'html.parser')

In [8]:
main_categories = soup.find('div',attrs={'itemprop':'articleBody'}).find_all("h3")

In [13]:
# Keep the text of every non-empty <h3> heading
cleaned_cats = [cat.text.strip() for cat in main_categories if cat.text.strip()]


In [23]:
cleaned_cats[1].split(",")

['Tanks (156',
 ' of which destroyed: 53',
 ' damaged: 2',
 ' abandoned: 30',
 ' captured: 71)']

In [17]:
cleaned_cats[0]

'Russia - 967, of which: destroyed: 386, damaged: 13, abandoned: 157, captured: 411'
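
The two headings printed above have different shapes: a country heading looks like `Name - N, of which: ...`, while a category heading looks like `Name (N, of which ...)`. The `Country` column of the target schema can be filled by remembering the most recent country heading while walking the list. A minimal sketch, assuming every country heading follows the `Name - N` shape (the helper name `tag_with_country` is made up for this example):

```python
import re

# Assumed shape of a country heading: "<name> - <total>, ..."
COUNTRY_HEADING = re.compile(r'^(?P<country>[^,()]+?)\s+-\s+\d+')


def tag_with_country(headings):
    """Pair each category heading with the last country heading seen before it."""
    current_country = None
    tagged = []
    for heading in headings:
        match = COUNTRY_HEADING.match(heading)
        if match:
            # A country heading starts a new group; it is not a category itself
            current_country = match.group('country')
        else:
            tagged.append((current_country, heading))
    return tagged


# The two headings shown in the outputs above
headings = [
    'Russia - 967, of which: destroyed: 386, damaged: 13, abandoned: 157, captured: 411',
    'Tanks (156, of which destroyed: 53, damaged: 2, abandoned: 30, captured: 71)',
]
tagged = tag_with_country(headings)
print(tagged)
```

A category that appears before any country heading is paired with `None`, which makes such cases easy to spot.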

In [24]:
import re
pattern_key = re.compile(r'[a-zA-Z]+')
pattern_value = re.compile(r'\d+')

In [57]:
cnt_dict = []  # one entry per heading: {name: total, 'details': {status: count}}
for cat in cleaned_cats:
    # This category name contains commas, which would break the split below
    cat = cat.replace("Trucks, Vehicles and Jeeps", "Trucks Vehicles and Jeeps")
    cat = cat.replace("of which", "")
    parts = [p.strip() for p in cat.split(",")]
    keys = []
    values = []
    details = {}
    for idx, p in enumerate(parts):
        numbers = pattern_value.findall(p)
        if not numbers:
            continue  # skip fragments without a count instead of raising IndexError
        key = " ".join(pattern_key.findall(p))
        keys.append(key)
        values.append(numbers[0])
        if idx != 0:
            # every part after the first is a status count (destroyed, damaged, ...)
            details[key] = numbers[0]
    if keys:
        cnt_dict.append({keys[0]: values[0], 'details': details})

In [58]:
cnt_dict[0]

{'Russia': '967',
 'details': {'destroyed': '386',
  'damaged': '13',
  'abandoned': '157',
  'captured': '411'}}
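
Splitting on commas is what forces the special case for "Trucks, Vehicles and Jeeps". An alternative sketch that avoids the split is to take the first number in the heading as the total and then collect every `status: number` pair with one regular expression. The function name `parse_heading` is hypothetical, and the counts in the Trucks heading below are invented purely to exercise the comma case; they are not from the source page.

```python
import re

STATUS_PAIR = re.compile(r'(destroyed|damaged|abandoned|captured):\s*(\d+)')
FIRST_NUMBER = re.compile(r'\d+')


def parse_heading(heading):
    """Return (name, total, details) for one heading, or None if it has no count."""
    first = FIRST_NUMBER.search(heading)
    if first is None:
        return None
    # Everything before the first number is the name, minus separators like " - " or " ("
    name = heading[:first.start()].rstrip(' -(')
    details = {status: int(count) for status, count in STATUS_PAIR.findall(heading)}
    return name, int(first.group()), details


print(parse_heading('Tanks (156, of which destroyed: 53, damaged: 2, abandoned: 30, captured: 71)'))
# Invented counts, used only to show that a comma inside the name is harmless
print(parse_heading('Trucks, Vehicles and Jeeps (10, of which destroyed: 4, captured: 6)'))
```

Unlike the loop above, this keeps punctuation in the name. It would misread a category whose name itself contains a digit, so it is a trade-off and not a drop-in replacement.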

In [59]:
category = []
destroyed = []
damaged = []
abandoned = []
captured = []
total = []

for counts in cnt_dict:
    for key in counts:
        if key!='details':
            category.append(key)
            total.append(int(counts[key]))
        else:
            destroyed.append(int(counts['details'].get('destroyed',0)))
            damaged.append(int(counts['details'].get('damaged',0)))
            abandoned.append(int(counts['details'].get('abandoned',0)))
            captured.append(int(counts['details'].get('captured',0)))

In [47]:
import pandas as pd

In [60]:
result = pd.DataFrame()
result['Category'] = category
result['Total'] = total
result['Destroyed'] = destroyed
result['Damaged'] = damaged
result['Abandoned'] = abandoned
result['Captured'] = captured

In [61]:
result

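|     | Category                                      | Total | Destroyed | Damaged | Abandoned | Captured |
| :-- | :-------------------------------------------- | :---- | :-------- | :------ | :-------- | :------- |
| 0   | Russia                                        | 967   | 386       | 13      | 157       | 411      |
| 1   | Tanks                                         | 156   | 53        | 2       | 30        | 71       |
| 2   | Armoured Fighting Vehicles                    | 97    | 33        | 0       | 17        | 45       |
| 3   | Infantry Fighting Vehicles                    | 140   | 55        | 0       | 25        | 60       |
| 4   | Armoured Personnel Carriers                   | 55    | 16        | 0       | 10        | 29       |
| 5   | Mine Resistant Ambush Protected MRAP Vehicles | 6     | 2         | 0       | 1         | 3        |
| 6   | Infantry Mobility Vehicles                    | 35    | 18        | 1       | 2         | 12       |
| 7   | Communications Stations                       | 8     | 2         | 0       | 4         | 2        |
| 8   | Engineering Vehicles                          | 37    | 11        | 0       | 11        | 15       |
| 9   | Anti Tank Guided Missiles                     | 48    | 0         | 0       | 0         | 48       |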
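
Since the point here is that data in the wild is dirty, a cheap sanity check on the scraped table is worthwhile: does each total equal the sum of its four status counts? The sketch below hard-codes the rows printed above so that it runs on its own; the function name `find_mismatches` is an assumption for illustration.

```python
# Rows copied from the output above:
# (category, total, destroyed, damaged, abandoned, captured)
rows = [
    ("Russia", 967, 386, 13, 157, 411),
    ("Tanks", 156, 53, 2, 30, 71),
    ("Armoured Fighting Vehicles", 97, 33, 0, 17, 45),
    ("Infantry Fighting Vehicles", 140, 55, 0, 25, 60),
    ("Armoured Personnel Carriers", 55, 16, 0, 10, 29),
    ("Mine Resistant Ambush Protected MRAP Vehicles", 6, 2, 0, 1, 3),
    ("Infantry Mobility Vehicles", 35, 18, 1, 2, 12),
    ("Communications Stations", 8, 2, 0, 4, 2),
    ("Engineering Vehicles", 37, 11, 0, 11, 15),
    ("Anti Tank Guided Missiles", 48, 0, 0, 0, 48),
]


def find_mismatches(rows):
    """Return (category, total, sum_of_statuses) for rows whose parts do not add up."""
    mismatches = []
    for category, total, *parts in rows:
        if sum(parts) != total:
            mismatches.append((category, total, sum(parts)))
    return mismatches


for category, total, parts_sum in find_mismatches(rows):
    print(f"{category}: total={total}, sum of statuses={parts_sum}")
```

With the numbers as printed, two rows do not add up: Armoured Fighting Vehicles (97 against 95) and Infantry Mobility Vehicles (35 against 33). The check cannot say why; whether items are counted under more than one status or a count was lost during parsing would have to be verified against the source page.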
