# Milestone 2: 10,000 Battles Project

## Tasks Description

Milestone 2 (20%): the project repo contains a notebook with data collection and descriptive analysis, properly commented, and the notebook ends with a more structured and informed plan for what comes next.

The tasks involving a large amount of data were pre-run and we simply describe their purposes and outputs while showing how to call them in comments.

## Data Collection

Data collection was a significant part of our work, given the nature of our original dataset. The task was to go from a 44 GB Wikipedia dump to clean, normalized features for each battle. As shown during the lecture, data collection is an iterative process: new needs in the analysis may require new data to be extracted or different transformations to be applied. The data collection was therefore organized into a **pipeline** of 3 operations, in order to achieve **composability** and **reproducibility**. We now explain this pipeline as part of this notebook; the actual code is organized as **python modules**, which are much better suited to data processing than notebooks.

Each step of the pipeline is a **python script** in the `processing` folder. They have the following usage:

```shell
python script.py file-name-in file-name-out
```
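
As an illustration of this usage, a pipeline step could be structured roughly as follows. This is a minimal sketch, not the code in the `processing` folder: the names `transform`, `run_step` and `main` are hypothetical, and the one-JSON-record-per-line file layout is an assumption.

```python
import argparse
import json


def transform(record):
    # Hypothetical placeholder for the step-specific logic.
    return record


def run_step(path_in, path_out):
    """Read one JSON record per line, transform it, write one record per line."""
    count = 0
    with open(path_in, encoding="utf-8") as fin:
        with open(path_out, "w", encoding="utf-8") as fout:
            for line in fin:
                if not line.strip():
                    continue  # skip empty lines
                fout.write(json.dumps(transform(json.loads(line))) + "\n")
                count += 1
    return count


def main(argv=None):
    # Mirrors the `file-name-in file-name-out` usage shown above.
    parser = argparse.ArgumentParser(description="One pipeline step")
    parser.add_argument("file_name_in")
    parser.add_argument("file_name_out")
    args = parser.parse_args(argv)
    return run_step(args.file_name_in, args.file_name_out)
```

In a real script, `main()` would be called under the usual `if __name__ == "__main__":` guard. Keeping every step to the same two-argument interface is what lets the steps be chained and re-run independently.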

For the sake of reproducibility and organization, we used a naming convention for the output files that includes the pipeline step that was run and the version of the dataset. Each version of the dataset is associated with a git tag marking the state of the codebase that generated the file, which avoids confusion and having to go back to the very beginning of the pipeline each time we have a doubt (see the README in the `datasets/` folder). We now describe each of the 3 pipeline operations.

### Page extraction

**Script**: `page_extraction.py`<br />
**Envir.**: Cluster<br />
**Input**: `hdfs:///datasets/wikipedia/enwiki-latest-pages-articles-multistream.xml` (~44 GB)<br />
**Output**: `battle-pages-v.json` (~123 MB)<br />
**Desc.**:<br />
Page extraction has two main goals: it selects which pages are kept (in their entirety) for the next step, and it translates them from an XML to a JSON representation for easier python processing.
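
As a rough single-machine illustration of these two goals (not the cluster job itself), the sketch below streams a tiny inline XML sample, keeps the pages whose title matches a pattern, and emits flat JSON records. The `TITLE_RE` pattern, the `extract_pages` name and the sample data are assumptions; the real dump also declares an XML namespace, which this sketch omits.

```python
import io
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical title filter; the regex actually used by page_extraction.py is not shown here.
TITLE_RE = re.compile(r"\b(Battle|Siege)\b")

# Tiny stand-in for the Wikipedia dump (assumption: no XML namespace).
SAMPLE_DUMP = """<mediawiki>
  <page><title>Battle of Waterloo</title><revision><text>{{Infobox military conflict}}</text></revision></page>
  <page><title>Belgium</title><revision><text>Country in Europe</text></revision></page>
  <page><title>Siege of Orleans</title><revision><text>Siege text</text></revision></page>
</mediawiki>"""


def extract_pages(xml_stream):
    """Yield one flat dict per page whose title matches TITLE_RE."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "page":
            continue
        title = elem.findtext("title", default="")
        if TITLE_RE.search(title):
            yield {
                "title": title,
                "text": elem.findtext("revision/text", default=""),
            }
        elem.clear()  # drop the page's children to keep memory usage low


pages = list(extract_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
print(json.dumps(pages, indent=2))
```

Streaming with `iterparse` avoids loading the whole file in memory, which matters for a dump of this size even when the work is distributed.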


### Fields extraction

**Script**: `fields_extraction.py`<br />
**Envir.**: Local<br />
**Output**: `battle-fields-v.json` (~14 MB)

### Features extraction

**Script**: `features_extraction.py`<br />
**Envir.**: Local<br />
**Output**: `battle-features-v.json` (~4.1 MB)



In [None]:
#TODO run command for pages extraction

#run page_extraction.py ...

We retrieve 27255 pages with a title satisfying the regex.

### Fetching of the wanted information on each page

We observed that Wikipedia pages about battles often contain an "infobox" summarizing the main information about the battle: title, place, coordinates, dates, casualties, combatants, victory type, etc. An example can be seen at https://en.wikipedia.org/wiki/Battle_of_Waterloo.
Our goal is therefore to use this structured data for our analysis, and the next step in our data collection pipeline is to fetch the infoboxes of all the pages retrieved in the previous step.
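
To make the idea concrete, here is a minimal sketch of how an infobox could be located and split into fields by matching braces in the raw wikitext. It is an illustration under assumptions, not the logic of `fields_extraction.py`: the function names are hypothetical, and only the two error strings (`no infobox`, `more than one infobox`) are taken from the dataset used later in this notebook.

```python
import re

INFOBOX_START = re.compile(r"\{\{\s*Infobox", re.IGNORECASE)


def find_infoboxes(wikitext):
    """Return the raw text of each top-level {{Infobox ...}} template."""
    boxes, pos = [], 0
    while True:
        match = INFOBOX_START.search(wikitext, pos)
        if match is None:
            return boxes
        depth, i = 0, match.start()
        while i < len(wikitext):
            if wikitext.startswith("{{", i):
                depth += 1
                i += 2
            elif wikitext.startswith("}}", i):
                depth -= 1
                i += 2
                if depth == 0:
                    break  # closing braces of the infobox itself
            else:
                i += 1
        boxes.append(wikitext[match.start():i])
        pos = i


def parse_fields(box):
    """Split an infobox on top-level '|' separators into a key/value dict."""
    inner = box[2:-2]
    parts, current, depth, i = [], [], 0, 0
    while i < len(inner):
        two = inner[i:i + 2]
        if two in ("{{", "[["):
            depth += 1
            current.append(two)
            i += 2
        elif two in ("}}", "]]"):
            depth -= 1
            current.append(two)
            i += 2
        elif inner[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(inner[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:  # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def extract_infobox(wikitext):
    """Return the infobox fields, or an error record when there is not exactly one."""
    boxes = find_infoboxes(wikitext)
    if not boxes:
        return {"error": "no infobox"}
    if len(boxes) > 1:
        return {"error": "more than one infobox"}
    return parse_fields(boxes[0])
```

Tracking the nesting depth is what keeps nested templates and piped links inside a field value from being split into separate fields.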

In [None]:
#run fields_extraction.py '/datasets/battle-pages-2.json' '/datasets/battle-fields-0.json'

Out of the 27255 pages, 7486 contain an infobox and only 17 contain two infoboxes (these numbers are obtained with the python file of the next step, but are reported here for clarity). We consider that a first analysis can be done on the 7486 pages containing an infobox, and we exclude pages without an infobox or with multiple infoboxes. The excluded pages appear to be mostly redirect pages (this remains to be verified) or pages containing incomplete information.

### Summary of these first two steps

##### Processing pipeline
This document describes the data processing pipeline steps that go from a Wikipedia dump to actionable datasets.

###### Pipeline steps


##### Datasets

The purpose of this page is to achieve strong reproducibility and clarity in dataset provenance by summarising the different pipeline steps and the dataset versions they produced.



###### Pages Extraction

Source file: `hdfs:///datasets/wikipedia/enwiki-latest-pages-articles-multistream.xml` (Cluster)

Location: `hdfs:///user/mouchet/datasets/` (Cluster) and `datasets/` (Git, zipped)

| File  | Version | Size | Number of battle pages | Comment |
| --- | --- | --- | --- | --- |
| battle-pages-0.xml | 0.1 | 127 MB | 24089 | Recursive "revision" schema |
| battle-pages-1.xml | 0.2 | 131 MB | 27255 | Switched to flat schema, included Sieges |
| battle-pages-2.json | 0.3 | 124 MB | 27255 | Set output to JSON for easier python processing |

###### Fields Extraction

Location: `datasets/` (Git)

| Source | File | Version | Size | Infobox count | Coords count | Comment |
| ---- | --- | ------- | ---- | ------------- | ------------ | ------- |
| battle-pages-2.json | battle-fields-0.json | 0.4 | 13 MB | ? | ? | Extract infobox fields only |
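
The file names listed in these tables (for example `battle-pages-2.json` or `battle-fields-0.json`) suggest a `battle-<step>-<version>.<extension>` pattern. A small helper along the following lines could build and parse such names. `dataset_name` and `parse_dataset_name` are hypothetical names, and the pattern is inferred from the file names above rather than taken from the README.

```python
import re

# Pattern inferred from names such as battle-pages-2.json (assumption).
NAME_RE = re.compile(r"^battle-(?P<step>[a-z]+)-(?P<version>\d+)\.(?P<ext>\w+)$")


def dataset_name(step, version, ext="json"):
    """Build a dataset file name from a pipeline step and a version number."""
    return f"battle-{step}-{version}.{ext}"


def parse_dataset_name(name):
    """Return (step, version, extension) for a dataset file name."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    return match["step"], int(match["version"]), match["ext"]
```

For instance, `dataset_name("fields", 0)` gives `battle-fields-0.json`, the file loaded in the descriptive analysis.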

## Descriptive analysis

Our first step is to determine which features are available for the battles, for example for how many battles the date or the geographic coordinates are present.

### Available data

In [None]:
import json
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
%matplotlib inline

In [None]:
with open('../datasets/battle-fields-0.json') as f:
    battles = json.load(f)
print("Number of pages or battles ", len(battles))
# Pages for which the extraction step reported a missing infobox
no_infoboxes = [b for b in battles if b['infobox'].get('error') == 'no infobox']
print("Number of pages that do not contain an infobox ", len(no_infoboxes))
# Pages for which the extraction step reported several infoboxes
double_infoboxes = [b for b in battles if b['infobox'].get('error') == 'more than one infobox']
print("Number of pages that contain more than one infobox ", len(double_infoboxes))
# Pages with a successfully extracted infobox (no error reported)
infoboxes = [b["infobox"] for b in battles if not b["infobox"].get("error")]
print("Number of pages that do contain an infobox ", len(infoboxes))

In [None]:
battles = json.load(open("../datasets/battle-fields-0.json"))
df = pd.DataFrame([b["infobox"] for b in battles if not b["infobox"].get("error")])

In [None]:
f, ax = plt.subplots(figsize=(6, 15))
counts = df.count().sort_values(ascending=False)
counts = counts[counts > 20]
sns.barplot(x=counts, y=counts.index, ax=ax)

In this first part, we consider the features of interest to be:
- place
- date
- combatants (combatant1, ...)
- result
- commanders (commander1, ...)
- conflict
- partof (if the battle is part of a bigger war)
- strength (strength1, ...) which are the number of soldiers
- casualties (casualties1, ...)
- coordinates

As a first analysis, we can see that when an infobox is present (7486 pages), it is often complete and contains information about most of the features of interest, with the exception of coordinates. For coordinates, we will also search the full Wikipedia page or use the place information (name of the city/region) to fill in the coordinates when possible.
After this step, we can confirm that we have almost complete information about 7486 battles, but we still have to see how this information can be used, retrieved and formatted for an exploratory analysis. We will therefore look at each feature in detail and try to map it to usable data. For missing features, we will also check whether the information can be retrieved from the page or derived from other existing features.
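
Searching the full page for coordinates could, for example, rely on the `{{coord|...}}` template that Wikipedia articles commonly use. The sketch below is a hedged illustration of that idea rather than the project's implementation: `parse_coord` and `to_decimal` are hypothetical names, and only the simple decimal and degrees/minutes/seconds forms of the template are handled.

```python
import re

# Matches a {{coord|...}} template that contains no nested templates.
COORD_RE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)
NUMBER_RE = re.compile(r"-?\d+(\.\d+)?")


def to_decimal(parts, hemisphere):
    """Convert [deg, min, sec] (min and sec optional) to signed decimal degrees."""
    values = [float(p) for p in parts] + [0.0, 0.0]
    degrees = values[0] + values[1] / 60 + values[2] / 3600
    return -degrees if hemisphere in ("S", "W") else degrees


def parse_coord(wikitext):
    """Return (latitude, longitude) from the first coord template, or None."""
    match = COORD_RE.search(wikitext)
    if match is None:
        return None
    # Keep the leading positional tokens: numbers and hemisphere letters.
    positional = []
    for token in match.group(1).split("|"):
        token = token.strip().upper()
        if token in ("N", "S", "E", "W") or NUMBER_RE.fullmatch(token):
            positional.append(token)
        else:
            break
    # Signed decimal form: {{coord|lat|lon|...}}
    if len(positional) == 2 and all(NUMBER_RE.fullmatch(t) for t in positional):
        return float(positional[0]), float(positional[1])
    # Hemisphere form: {{coord|d|m|s|N|d|m|s|E|...}} with m and s optional
    hemis = [i for i, t in enumerate(positional) if t in ("N", "S", "E", "W")]
    if len(hemis) != 2 or hemis[1] != len(positional) - 1:
        return None
    lat_parts = positional[:hemis[0]]
    lon_parts = positional[hemis[0] + 1:hemis[1]]
    lat_h, lon_h = positional[hemis[0]], positional[hemis[1]]
    if lat_h not in ("N", "S") or lon_h not in ("E", "W"):
        return None
    if not (1 <= len(lat_parts) <= 3 and 1 <= len(lon_parts) <= 3):
        return None
    return to_decimal(lat_parts, lat_h), to_decimal(lon_parts, lon_h)
```

Pages where no template is found would still need the fallback mentioned above, that is, resolving the place name to coordinates.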

### Features exploration


In [None]:
battles = [json.loads(line) for line in open("../datasets/battle-features-0.json")]
df = pd.DataFrame(battles)
df.count()