## Finding useful data in the wild: Perspectives from data journalism 

[This](https://www.reddit.com/r/dataisbeautiful/comments/t9a764/oc_equipment_losses_in_the_ukrainerussian_war/) Reddit post is an interesting read. It is a submission to `r/dataisbeautiful`, a subreddit that I subscribe to, and it tries to compare the military losses on both sides of the Russian-Ukrainian conflict. A cursory look at the charts gives the impression that the numbers are the absolute truth! Dig a little deeper into the source cited [here](https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html?m=1), however, and one realises that:
1. Data is extremely dirty in the wild.
2. The biases and inaccuracies don't become apparent until one digs into the source.


### 1. Getting the relevant information from data

Look at the exhibit below and you will see that the relevant data is not in a form that lends itself to analysis.

![](osint_main.png)

How does one even start putting this data into a reasonable structure?

One reasonable schema is the following:

| Category | Total | Damaged | Destroyed | Captured | Abandoned |
| :------- | :---- | :------ | :-------- | :------- | :-------- |
| Armour   | 120   | 100     | 10        | 8        | 2         |
| ..       | ..    | ..      | ..        | ..       | ..        |
| ..       | ..    | ..      | ..        | ..       | ..        |

But how does one go about populating such a schema? For starters, one can scrape the data and put it in tabular form:


In [1]:
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html?m=1'
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

# The category summaries are read from the <h3> headings inside the article body.
main_categories = soup.find('div', attrs={'itemprop': 'articleBody'}).find_all('h3')
# Keep the heading text and drop empty headings.
cleaned_cats = [cat.text.strip() for cat in main_categories if cat.text.strip() != '']


In [2]:
pattern_key = re.compile(r'[a-zA-Z]+')
pattern_value = re.compile(r'\d+')
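
To see what these two patterns extract, here is a small sketch run on a hard-coded heading. The sample string is an assumption about how the headings on the page are formatted, put together from the Tanks row of the CSV output shown further down; the live page may well differ.

```python
import re

pattern_key = re.compile(r'[a-zA-Z]+')
pattern_value = re.compile(r'\d+')

# Assumed heading format; the numbers mirror the Tanks row of the CSV output.
sample = "Tanks (156, of which destroyed: 53, damaged: 2, abandoned: 30, captured: 71)"

parsed = {}
for part in sample.replace("of which", "").split(","):
    # All alphabetic words in the fragment form the key.
    key = " ".join(pattern_key.findall(part))
    # The first run of digits in the fragment is the count.
    numbers = pattern_value.findall(part)
    if numbers:  # skip fragments that carry no count
        parsed[key] = int(numbers[0])

print(parsed)
```

The first fragment yields the category and its total; every later fragment yields one status and its count.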

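In [3]:
cnt_dict = []
for cat in cleaned_cats:
    # This category name contains a comma, which would break the split below.
    if "Trucks, Vehicles and Jeeps" in cat:
        cat = cat.replace("Trucks, Vehicles and Jeeps","Trucks Vehicles and Jeeps")
    cat = cat.replace("of which","")
    parts = [p.strip() for p in cat.split(",")]
    if not re.search(pattern_value,parts[0]):
        # Heading without a total: nothing to record.
        continue
    keys = []
    values = []
    details = {}
    for idx, p in enumerate(parts):
        numbers = re.findall(pattern_value,p)
        if not numbers:
            # Fragment without a count: skip it rather than raise IndexError.
            continue
        key = " ".join(re.findall(pattern_key,p))
        keys.append(key)
        values.append(numbers[0])
        # The first fragment holds the category and its total; the rest are status counts.
        if idx!=0:
            details[key]=numbers[0]
    cnt_dict.append({keys[0]:values[0],'details':details})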
        
        

In [4]:
category = []
destroyed = []
damaged = []
abandoned = []
captured = []
total = []

for counts in cnt_dict:
    for key in counts:
        if key!='details':
            category.append(key)
            total.append(int(counts[key]))
        else:
            destroyed.append(int(counts['details'].get('destroyed',0)))
            damaged.append(int(counts['details'].get('damaged',0)))
            abandoned.append(int(counts['details'].get('abandoned',0)))
            captured.append(int(counts['details'].get('captured',0)))

In [5]:
result = pd.DataFrame()
result['Category'] = category
result['Total'] = total
result['Destroyed'] = destroyed
result['Damaged'] = damaged
result['Abandoned'] = abandoned
result['Captured'] = captured

In [11]:
# The row positions are hard-coded; they depend on the order of headings on the page when it was scraped.
russia = result.iloc[1:23]
ukraine = result.iloc[25:]
russia.to_csv("../../data/data_acquisition/russian_losses_aggregates.csv",index=False)
ukraine.to_csv("../../data/data_acquisition/ukrainian_losses_aggregates.csv",index=False)

In [15]:
!head -5 ../../data/data_acquisition/russian_losses_aggregates.csv ## windows users can use a different command here

Category,Total,Destroyed,Damaged,Abandoned,Captured
Tanks,156,53,2,30,71
Armoured Fighting Vehicles,98,33,0,18,45
Infantry Fighting Vehicles,141,55,0,25,61
Armoured Personnel Carriers,55,16,0,10,29
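
Since dirty data is the theme here, a quick sanity check is worth running on the scraped rows: does each Total equal the sum of its four status columns? The sketch below assumes that it should (the source may define the total differently) and uses the standard-library `csv` module rather than pandas so that it runs on its own. The rows are copied from the output above.

```python
import csv
import io

# First rows of russian_losses_aggregates.csv, copied from the output above.
csv_text = """Category,Total,Destroyed,Damaged,Abandoned,Captured
Tanks,156,53,2,30,71
Armoured Fighting Vehicles,98,33,0,18,45
Infantry Fighting Vehicles,141,55,0,25,61
Armoured Personnel Carriers,55,16,0,10,29
"""

status_columns = ["Destroyed", "Damaged", "Abandoned", "Captured"]
mismatches = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Add up the four status counts and compare them with the stated total.
    status_sum = sum(int(row[col]) for col in status_columns)
    if status_sum != int(row["Total"]):
        mismatches.append((row["Category"], int(row["Total"]), status_sum))

print(mismatches)
```

On these rows the check flags Armoured Fighting Vehicles, where the four columns add up to 96 against a stated total of 98. Whether that comes from a status the parser did not capture or from the source itself is worth checking before the numbers are charted.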


All of this juggling with code is needed just to fetch the relevant data. One can go a step further and get a detailed break-up of the numbers within each category. That is left as an exercise; a solution will be shared later.

What next?

We can look at two specific areas:
- How do we share this data once it has been scraped?
- What are some of the biases we need to inform the users of this data about?


**1. Sharing Data**

There are a couple of ways to go about sharing data:
1. Use a platform such as [Kaggle](https://www.kaggle.com/datasets) to host your datasets. Do provide an appropriate license.
2. Use GitHub to store not only the dataset but also the source code.
3. Create an API service through which others can consume the dataset.
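
For the third option, the core of such a service is turning the CSV into a payload a client can consume. Below is a minimal sketch of that step only, using the standard library; the function name `losses_as_json` and the payload layout are assumptions made for illustration, and a real service would wrap this in a web framework of your choice.

```python
import csv
import io
import json


def losses_as_json(csv_text, source_url):
    """Turn the aggregates CSV into a JSON string an API endpoint could return."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"Category": row["Category"]}
        # Every column other than Category holds an integer count.
        for column, value in row.items():
            if column != "Category":
                record[column] = int(value)
        rows.append(record)
    # Ship the source alongside the data so consumers can trace it back.
    return json.dumps({"source": source_url, "count": len(rows), "losses": rows})


# One row copied from the CSV output earlier in this section.
csv_text = """Category,Total,Destroyed,Damaged,Abandoned,Captured
Tanks,156,53,2,30,71
"""
payload = losses_as_json(
    csv_text,
    "https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html",
)
print(payload)
```

Including the source URL in every response is a deliberate choice here: it keeps the provenance attached to the numbers wherever they travel.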

**2. Biases in the collected data**

Most data collection exercises have some bias built into them. In the current case, for example, the following biases are present:
1. The data has been collected by volunteers sympathetic towards the Ukrainian cause, so Ukrainian losses may be under-reported and Russian losses over-reported.
2. The relative sizes of the Russian and Ukrainian militaries are missing, so there is little point in comparing absolute numbers of losses.
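
One way to address the second point is to express losses as a share of what each side started with. The sketch below shows only the arithmetic: the loss count is the Russian tank total from the scraped CSV, while the inventory figure is a made-up placeholder, not an estimate of any real force. The helper `loss_share` is likewise hypothetical.

```python
def loss_share(losses, inventory):
    """Return losses as a fraction of the starting inventory."""
    if inventory <= 0:
        raise ValueError("inventory must be positive")
    return losses / inventory


# 156 is the Russian tank total from the scraped CSV.
# The inventory below is a PLACEHOLDER for illustration, not a real estimate.
placeholder_inventory = 1000
share = loss_share(156, placeholder_inventory)
print(f"{share:.1%}")
```

With real inventory figures from a source you trust, the same calculation per side and per category gives numbers that can be compared more fairly than raw counts.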