In [None]:
import json
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
%matplotlib inline

# Milestone 2; 10,000 battles Project

## Tasks Description

Milestone 2 (20%): the project repo contains a notebook with data collection and descriptive analysis, properly commented, and the notebook ends with a more structured and informed plan for what comes next.

The tasks involving a large amount of data were pre-run and we simply describe their purposes and outputs while showing how to call them in comments.

## Data Collection

Data collection was a significant part of our work, given the nature of our original dataset. The task was to go from a 44 GB Wikipedia dump to clean, normalized features about each battle. As shown during the lecture, data collection is an iterative process: new needs in the analysis may require extracting new data or applying different transformations. We therefore organized data collection into a **pipeline** of 3 operations, in order to achieve **composability** and **reproducibility**. We now explain this pipeline as part of this notebook; the actual code is organized as **Python modules**, which are much better suited to data processing than notebooks.

Each step of the pipeline is a **python script** in the `processing` folder. They have the following usage:

```shell
python script.py file-name-in file-name-out
```
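
As an illustration of this calling convention, a step that reads and writes one JSON object per line could be structured as in the sketch below. This is not the actual code of the `processing` scripts: `process`, `run_step` and `main` are hypothetical names, and the sketch assumes line-delimited JSON on both sides (which does not apply to the first step, whose input is XML).

```python
import json
import sys


def process(record):
    """Placeholder transformation; a real step would parse or normalize here."""
    return record


def run_step(path_in, path_out):
    """Read one JSON object per line, transform it, and write one per line."""
    count = 0
    with open(path_in, encoding="utf-8") as fin, \
            open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            fout.write(json.dumps(process(json.loads(line))) + "\n")
            count += 1
    return count


def main(argv):
    """Mirror the usage: python script.py file-name-in file-name-out"""
    if len(argv) != 3:
        print("usage: python script.py file-name-in file-name-out", file=sys.stderr)
        return 1
    run_step(argv[1], argv[2])
    return 0

# A script built this way would end with: sys.exit(main(sys.argv))
```

Keeping every step behind the same two-argument interface is what lets the steps be chained and re-run independently.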

For the sake of reproducibility and organization, we used a naming convention for the output files that includes the pipeline step that was run and the version of the dataset. Each version of the dataset is associated with a git tag marking the state of the codebase that generated the file, so that we do not have to go back to the very beginning of the pipeline each time we have a doubt (see the README in the `datasets/` folder). We now describe each of the 3 pipeline operations.

### Page extraction

**Script**: `page_extraction.py`<br />
**Environment**: Cluster<br />
**Input**: `hdfs:///datasets/wikipedia/enwiki-latest-pages-articles-multistream.xml` (~44 GB)<br />
**Output**: `battle-pages-v.json` (~123 MB)<br />
**Description**:<br />
Page extraction has two main goals: it selects which pages are kept (in their entirety) for the next step, using a regular expression on the title (this may be refined later if we find a better way to isolate battle-related pages), and it translates the XML representation into JSON for easier Python processing. It leverages [pySpark's DataFrame](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html) and the custom [XML data source from Databricks](https://github.com/databricks/spark-xml) to provide an SQL-like, parallelized Spark job.
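
The selection logic can be illustrated locally, without Spark, on a handful of in-memory pages. The sketch below is only an approximation of the idea: the pattern `TITLE_PATTERN`, the helper names and the sample pages are all hypothetical, and the real job runs the equivalent filter as a parallelized Spark query.

```python
import json
import re

# Hypothetical pattern; the expression actually used by page_extraction.py
# is not reproduced here.
TITLE_PATTERN = re.compile(r"\b(battle|siege)\b", re.IGNORECASE)


def select_pages(pages, pattern=TITLE_PATTERN):
    """Keep whole pages whose title matches the pattern."""
    return [p for p in pages if pattern.search(p.get("title", ""))]


def to_json_lines(pages):
    """Serialize each kept page as one JSON object per line."""
    return "\n".join(json.dumps(p) for p in pages)


# Toy pages standing in for rows parsed from the XML dump.
pages = [
    {"title": "Battle of Hastings", "text": "{{Infobox military conflict ...}}"},
    {"title": "Hastings", "text": "Town in East Sussex."},
    {"title": "Siege of Orléans", "text": "..."},
]
kept = select_pages(pages)
print([p["title"] for p in kept])
print(to_json_lines(kept))
```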


### Fields extraction

**Script**: `fields_extraction.py`<br>
**Envir.**: Local<br>
**Output**: `battle-fields-v.json` (~14 MB)<br>
**Description**:<br>
This step's purpose is to extract key-value pairs from the raw page [Wikitext](https://www.mediawiki.org/wiki/Wikitext), where keys identify a piece of information contained in the page and values are either a `dict` of other key-value pairs, a `list` of strings, or a string (mainly strings that are themselves wikitext). In other words, we parse the page into a tree-like structure from which it will be easier to extract actual features afterward. We mainly (but not exclusively) relied on the presence of an `infobox` [template](https://en.wikipedia.org/wiki/Help:Template) in most of the battle pages. Each line of the output file contains the tree of one battle. Therefore:

In [None]:
# We can load all the battle trees like this (one JSON object per line)
with open('../datasets/battle-fields-1.json') as f:
    battles = [json.loads(line) for line in f]
print("Number of battle pages:", len(battles))
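
To give an idea of what the key-value extraction involves, the sketch below parses simple one-line fields from a toy infobox. It is a deliberately naive illustration, not the parser used in `fields_extraction.py`: `FIELD_RE` and `parse_infobox_fields` are hypothetical names, and nested templates or multi-line values, which real pages contain, are not handled.

```python
import re

# Matches lines such as "| date = 14 October 1066" inside an infobox template.
FIELD_RE = re.compile(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_infobox_fields(wikitext):
    """Return {key: raw wikitext value} for simple one-line infobox fields."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# Toy infobox; values are left as raw wikitext for a later step to interpret.
sample = """{{Infobox military conflict
| conflict = Battle of Hastings
| date = 14 October 1066
| combatant1 = [[Normans]]
| combatant2 = [[Anglo-Saxon England]]
}}"""
fields = parse_infobox_fields(sample)
print(fields)
```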

We then perform some preliminary assessments about the current state of the dataset.

In [None]:
errors = [b.get('infobox').get('error') for b in battles]
no_infoboxes = sum([1 for e in errors if e=="no infobox"])
more_infoboxes = sum([1 for e in errors if e=="more than one infobox"])

print("Number of pages that do not contain an infobox:", no_infoboxes)
print("Number of pages that contain more than one infobox:", more_infoboxes)
print("Number of pages that contain exactly one infobox:", len(battles)-no_infoboxes-more_infoboxes)

We immediately see that the number of pages we can extract information from is greatly reduced. It later turned out that many of the pages in the dump are aliases, redirects, and discussion pages. However, the remaining battle count is still comfortably high enough to support interesting analysis.

We then assess how well populated the extractable keys are, so that we know which fields to focus our feature extraction effort on:

In [None]:
df = pd.DataFrame([b["infobox"] for b in battles if not b["infobox"].get("error")])
f, ax = plt.subplots(figsize=(15, 20))
counts = df.count().sort_values(ascending=False)
counts = counts[counts > 20]
sns.barplot(x=counts, y=counts.index, ax=ax)
plt.show()

Based on these observations, we selected the following set of features to be extracted from the key-value pairs by the next pipeline step (we may add more of them later):
- date
- coordinates
- combatants (combatant1, ...)
- result
- strengths (strength1, ...) in terms of number of men
- casualties (casualties1, ...) in terms of number of men killed/wounded/captured

### Features extraction

**Script**: `features_extraction.py`<br />
**Envir.**: Local<br />
**Output**: `battle-features-v.json` (~4.1 MB)<br />
**Description**:<br>
The final step extracts a "flat" set of key-value pairs from the previous step's tree (i.e., values cannot be another `dict`), which we call features. It does so by further parsing and transforming the values from the previous step into a normalized representation, possibly combining several of them. This was the most time-consuming part of the current implementation, as it involves parsing free-text data that is sometimes highly variable and unnormalized. We elaborate on this aspect for some features in the following subsections. Each line of the output file contains one such set, so it can be conveniently imported into a `pandas.DataFrame` object.
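
As a small illustration of this kind of normalization, the sketch below turns free-text strength and casualty values into integers and flattens them into per-side features. It is a simplification under stated assumptions, not the logic of `features_extraction.py`: `parse_count` and `flatten` are hypothetical helpers, the output key `strength_1` is assumed by analogy with `casualties_1`, and keeping only the first number of a range is an arbitrary choice made for brevity.

```python
import re

# An integer with optional thousands separators, e.g. "7,000" or "2000".
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def parse_count(text):
    """Return the first integer found in a free-text value, or None."""
    if not text:
        return None
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


def flatten(tree, side):
    """Build flat features for one side from the tree's strength/casualties keys."""
    return {
        f"strength_{side}": parse_count(tree.get(f"strength{side}")),
        f"casualties_{side}": parse_count(tree.get(f"casualties{side}")),
    }


# Toy tree with the kind of variability found in infobox values.
tree = {"strength1": "7,000–8,000 men", "casualties1": "about 2000 killed"}
print(flatten(tree, 1))
print(flatten(tree, 2))  # missing keys become None rather than raising
```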

In [None]:
with open("../datasets/battle-features-0.json") as f:
    battles = pd.DataFrame([json.loads(line) for line in f])
print(battles.count())
demo_col = ["combatant_first_1", "combatant_first_2", "result_combatant_1", "result_combatant_2", "start_date", "end_date", "casualties_1", "casualties_2"]
battles[demo_col].head()

## Descriptive analysis

We provide a succinct descriptive analysis of the main features.

### Geolocation

We begin by showing the geographical coverage of our dataset. This seems to be consistent with the location of wars across history.

In [None]:
from folium.plugins import HeatMap
import folium as fl
m = fl.Map()
coord_df = battles[["latitude", "longitude"]].dropna()
coords = [[lat, long] for lat, long in zip(coord_df["latitude"], coord_df["longitude"])]
HeatMap(coords).add_to(m)
m

### Dates

We continue by looking at coverage in the time domain (BC dates are left out of the plot below). The coverage looks fairly uniform.

In [None]:
import datetime
dates = battles[battles.start_date.notna()].start_date
date_bc = battles[battles.start_date.notna()].dates_bc

X=[datetime.datetime.strptime(date, "%Y-%m-%d") for date, bc in zip(dates, date_bc) if not bc]
fig, ax = plt.subplots(figsize=(20,1))
ax.scatter(X, [1]*len(X),
           marker='|', s=100)

ax.yaxis.set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.xaxis.set_ticks_position('bottom')

ax.get_yaxis().set_ticklabels([])
day = datetime.timedelta(days=1)
# X is not sorted, so take the actual extremes for the axis limits
plt.xlim(min(X) - day, max(X) + day)
plt.show()
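
One way to check the time coverage more quantitatively, and to include BC dates, is to count battles per century. The sketch below is only a suggestion: `century` and `battles_per_century` are hypothetical helpers, and it assumes that dates are `YYYY-MM-DD` strings without a sign, with the era carried separately by a boolean flag as in the `dates_bc` column used above.

```python
from collections import Counter


def century(date_str, bc=False):
    """Map a 'YYYY-MM-DD' string to a signed century index (negative for BC)."""
    year = int(date_str.split("-")[0])
    index = (year - 1) // 100 + 1
    return -index if bc else index


def battles_per_century(dates, bc_flags):
    """Count battles per century, keeping BC and AD apart."""
    return Counter(century(d, bc) for d, bc in zip(dates, bc_flags))


# Toy input; with the DataFrame above one could pass the start_date and
# dates_bc columns after dropping missing dates.
counts = battles_per_century(
    ["1066-10-14", "1415-10-25", "0480-08-20"],
    [False, False, True],
)
print(sorted(counts.items()))
```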