# Milestone 2: 10,000 Battles Project

## Tasks Description

Milestone 2 (20%): the project repo contains a notebook with data collection and descriptive analysis, properly commented, and the notebook ends with a more structured and informed plan for what comes next.

The tasks involving large amounts of data were run beforehand; we only describe their purpose and output, and show in comments how to call them.

## Data Collection

In the data collection, one of our main goals is to remain flexible so that we can always go back in the pipeline to fetch more or less data. Our data collection is therefore organised in multiple steps:
1. fetching all the Wikipedia pages of interest
2. fetching the wanted information from each page

The data collection is first done on the cluster, using Spark jobs to read and parse the Wikipedia dump.
In this first step, we fetch all the Wikipedia pages whose title matches the following regex: `(((B|b)attle|(S|s)iege) (of|on))`.
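As a rough illustration of the title filter, the regex can be tried locally with Python's `re` module. This is only a sketch: the cluster job itself is not shown here, the helper name `is_battle_title` and the sample titles are made up for the example, and we assume the pattern may match anywhere in the title (`search`) rather than only at its start.

```python
import re

# Title pattern quoted in the text above
TITLE_PATTERN = re.compile(r"(((B|b)attle|(S|s)iege) (of|on))")

def is_battle_title(title):
    """Return True if the pattern occurs anywhere in the title (assumption)."""
    return TITLE_PATTERN.search(title) is not None

# Illustrative titles only, not taken from the dataset
titles = ["Battle of Waterloo", "Siege of Orleans", "Waterloo", "Battleship"]
selected = [t for t in titles if is_battle_title(t)]
print(selected)
```

Note that the pattern requires the word to be followed by " of" or " on", so a title such as "Battleship" is not selected.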


In [None]:
#TODO run command for pages extraction

#run pages_extraction.py ...

We retrieve 27255 pages with a title satisfying the regex.

### Fetching of the wanted information on each page

We have observed that Wikipedia pages about battles often contain an "infobox" summarizing the main information about the battle: title, place, coordinates, dates, casualties, combatants, victory type, etc. An example can be seen at https://en.wikipedia.org/wiki/Battle_of_Waterloo.
Thus, our goal is to use this structured data for our analysis, and the next step in our data collection pipeline is to fetch the infoboxes of all the pages we retrieved in the previous step.
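To make the idea concrete, here is a minimal sketch of how an infobox could be pulled out of raw wikitext with the standard library only. It is not the content of `fields_extraction.py`: the function names are hypothetical, the sample wikitext is hand-written, and real pages contain many edge cases (comments, `<ref>` tags, nested templates) that this sketch ignores. The two error strings mirror the ones used later in this notebook.

```python
import re

def find_infoboxes(wikitext):
    """Return the raw text of every {{Infobox ...}} template found."""
    boxes = []
    for match in re.finditer(r"\{\{\s*Infobox", wikitext, flags=re.IGNORECASE):
        start = match.start()
        depth, i = 0, start
        # Walk forward, tracking {{ ... }} nesting until the template closes
        while i < len(wikitext) - 1:
            pair = wikitext[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    boxes.append(wikitext[start:i])
                    break
            else:
                i += 1
    return boxes

def parse_fields(box):
    """Split an infobox on top-level '|' and return a {name: raw value} dict."""
    inner = box[2:-2]
    parts, current, depth, i = [], [], 0, 0
    while i < len(inner):
        pair = inner[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif inner[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:  # parts[0] is the template name
        if "=" in part:
            name, value = part.split("=", 1)
            fields[name.strip()] = value.strip()
    return fields

def extract_infobox(wikitext):
    """Return the fields of the single infobox, or an error marker."""
    boxes = find_infoboxes(wikitext)
    if not boxes:
        return {"error": "no infobox"}
    if len(boxes) > 1:
        return {"error": "more than one infobox"}
    return parse_fields(boxes[0])

# Hand-written sample, for illustration only
sample = (
    "Intro {{Infobox military conflict\n"
    "| conflict = Battle of Waterloo\n"
    "| place = [[Waterloo, Belgium|Waterloo]]\n"
    "| result = Coalition victory\n"
    "}} rest of the page"
)
print(extract_infobox(sample))
```

The field values are kept as raw wikitext on purpose: cleaning links and templates is a separate concern from locating the infobox.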

In [None]:
#run fields_extraction.py '/datasets/battle-pages-2.json' '/datasets/battle-fields-0.json'

Out of the 27255 pages, 7486 contain an infobox and only 17 contain two infoboxes (these numbers are obtained with the Python file of the next step but are reported here for the clarity of the analysis). We consider that we can do a first analysis on the 7486 pages containing an infobox, and we do not include pages without an infobox or with multiple infoboxes. These excluded pages appear to be mostly redirect pages (this remains to be verified) or pages containing incomplete information.

### Summary of these first two steps

##### Processing pipeline
This document describes the data processing pipeline steps that go from a Wikipedia dump to actionable datasets.

###### Pipeline steps

| Name | Job | Input | Output | Description |
| ---- | --- | ----- | ------ | ----------- |
| Pages extraction | `page_extraction.py` (Cluster)| XML page article dump file | `battle-pages-`**v**`.jsonlist` | Selects all battle-related pages and writes them in JSON format, one page per line |
| Fields extraction | `fields_extraction.py` (Local)| `battle-pages-`**v**`.json`  | `battle-fields-`**v**`.json`  | Parses the wikitext of the pages to extract the (as yet unparsed) fields |
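
A file with one JSON page per line, as described for the pages extraction output, can be read back lazily without loading everything in memory. The sketch below is only an assumption about how such a file might be consumed: the helper name `iter_pages` and the `title`/`text` keys in the sample are hypothetical.

```python
import io
import json

def iter_pages(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical two-page sample; the real field names may differ
sample = io.StringIO(
    '{"title": "Battle of Waterloo", "text": "..."}\n'
    '{"title": "Siege of Orleans", "text": "..."}\n'
)
pages = list(iter_pages(sample))
print(len(pages), pages[0]["title"])
```

With a real file, the same generator can be fed an open file handle instead of the `io.StringIO` sample.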

##### Datasets

The purpose of this page is to achieve strong reproducibility and clarity in dataset provenance by summarising the different pipeline steps and the dataset versions they produced.

###### Pages Extraction

Source file: `hdfs:///datasets/wikipedia/enwiki-latest-pages-articles-multistream.xml` (Cluster)

Location: `hdfs:///user/mouchet/datasets/` (Cluster) and `datasets/` (Git, zipped)

| File  | Version | Size | Number of battle pages | Comment |
| --- | --- | --- | --- | --- |
| battle-pages-0.xml | 0.1 | 127 MB | 24089 | Recursive "revision" schema |
| battle-pages-1.xml  | 0.2 | 131 MB | 27255 | Switched to flat schema, included sieges |
| battle-pages-2.json | 0.3 | 124 MB | 27255 | Set output to JSON for easier Python processing |

###### Fields Extraction

Location: `datasets/` (Git)

| Source | File | Version | Size | Infobox count | Coords count | Comment |
| ---- | --- | ------- | ---- | ------------- | ------------ | ------- |
| battle-pages-2.json | battle-fields-0.json | 0.4 | 13 MB | ? | ? | Extract infobox fields only |

## Descriptive analysis

Our first step is to determine which features are available for the battles: for example, for how many battles the date or the geographic coordinates are present.

### Available data

In [None]:
import json
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
%matplotlib inline

In [None]:
# Load the output of the fields extraction step
with open('../datasets/battle-fields-0.json') as f:
    battles = json.load(f)
print("Number of pages or battles ", len(battles))
# Pages flagged by the extraction step as having no infobox
no_infoboxes = [b.get('infobox').get('error') for b in battles if b.get('infobox').get('error') == 'no infobox']
print("Number of pages that do not contain an infobox ", len(no_infoboxes))
# Pages flagged as having more than one infobox
double_infoboxes = [b.get('infobox').get('error') for b in battles if b.get('infobox').get('error') == 'more than one infobox']
print("Number of pages that contain more than one infobox ", len(double_infoboxes))
# Usable infoboxes: entries without an error flag
infoboxes = [b["infobox"] for b in battles if not b["infobox"].get("error")]
print("Number of pages that do contain an infobox ", len(infoboxes))

In [None]:
battles = json.load(open("../datasets/battle-fields-0.json"))
df = pd.DataFrame([b["infobox"] for b in battles if not b["infobox"].get("error")])

In [None]:
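# Count the non-missing values of each infobox field, most frequent first
f, ax = plt.subplots(figsize=(6, 15))
counts = df.count().sort_values(ascending=False)
# Keep only the fields present in more than 20 infoboxes
counts = counts[counts > 20]
sns.barplot(x=counts, y=counts.index, ax=ax)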

In this first part, we consider the features of interest to be:
- place
- date
- combatants (combatant1, ...)
- result
- commanders (commander1, ...)
- conflict
- partof (if the battle is part of a bigger war)
- strength (strength1, ...), i.e. the number of soldiers
- casualties (casualties1, ...)
- coordinates

As a first analysis, we can see that when an infobox is present (7486 pages), it is often complete and contains information about most of the battles' features of interest, with the exception of coordinates. For coordinates, we will also search the complete Wikipedia page or use the place information (name of the city/region) to fill in the coordinates when possible.

After this step, we can confirm that we have almost complete information about 7486 battles, but we still have to see how this information can be retrieved, formatted and used for an exploratory analysis. Thus, we will now look at each feature in detail and try to map it to usable data. For missing features, we will also see if we can retrieve the information from the page or derive it from other existing features.
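
One possible starting point for completing missing coordinates from the full page is to look for a `{{coord|...}}` template in the wikitext and convert it to decimal degrees. The sketch below is an assumption about how this could be done, not an existing part of the pipeline: the function name `parse_coord` is hypothetical, only the first `coord` template is considered, and only the common decimal and degrees/minutes/seconds forms are handled.

```python
import re

def parse_coord(wikitext):
    """Return (lat, lon) in decimal degrees from the first {{coord}}, or None."""
    match = re.search(r"\{\{\s*coord\s*\|([^{}]*)\}\}", wikitext, flags=re.IGNORECASE)
    if match is None:
        return None
    params = [p.strip() for p in match.group(1).split("|")]
    # Keep positional parameters only (drops e.g. display=title, type:event)
    params = [p for p in params if p and "=" not in p and ":" not in p]

    def to_decimal(parts, hemisphere):
        # parts is [deg], [deg, min] or [deg, min, sec]
        value = sum(float(v) / 60 ** k for k, v in enumerate(parts))
        return -value if hemisphere in ("S", "W") else value

    try:
        ns = next((i for i, p in enumerate(params) if p in ("N", "S")), None)
        if ns is None:
            # Decimal form: {{coord|lat|lon}}
            if len(params) < 2:
                return None
            return float(params[0]), float(params[1])
        ew = next((i for i, p in enumerate(params) if p in ("E", "W")), None)
        if ew is None or ew < ns:
            return None
        lat = to_decimal(params[:ns], params[ns])
        lon = to_decimal(params[ns + 1:ew], params[ew])
        return lat, lon
    except ValueError:
        # A non-numeric parameter: give up rather than guess
        return None

# Hand-written sample, for illustration only
print(parse_coord("{{coord|50|40|48|N|4|24|43|E|display=title}}"))
```

Pages without such a template would still need the place name to be resolved to coordinates by other means.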

### Features exploration
