### Features exploration

We start with a descriptive analysis of the different features separately and then combined.
We first fetch the data from the end of our pipeline.

In [None]:
import json
import sys
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
%matplotlib inline

We group here all our functions to plot:

In [None]:
def densplot(columns, xlabel, title, axo):
    for i,v in enumerate(columns):
        sns.distplot(v, ax=axo, kde_kws={"label": i})
    axo.set_title(title)
    axo.set_xlabel(xlabel, fontsize=12)
    
def scatplot(xelem, yelem, xlabel, ylabel, title, axo, polyfit=None):
    axo.scatter(xelem, yelem)
    if polyfit:
        axo.plot(np.unique(xelem), np.poly1d(np.polyfit(xelem, yelem, polyfit))(np.unique(xelem)), 'C2')
    axo.set_title(title)
    axo.set_xlabel(xlabel, fontsize=12)
    axo.set_ylabel(ylabel, fontsize=12)

In [None]:
with open("../datasets/battle-features-0.json") as f:
    battles = [json.loads(line) for line in f]
# Treat zeros as missing values
df = pd.DataFrame(battles).replace(0, np.nan)

#### Casualties

A battle has casualties information for at most 4 different combatants ("casualties_1", "casualties_2", ...). Each casualties entry can contain a number of people "killed", "wounded", "missing" or "captured"; when possible, we parsed these values as well. For many battles, Wikipedia cites several sources for the same number, and we handle this by averaging the values. We also average the bounds of range values.
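
The averaging rule described above can be sketched as follows. This is only an illustration: the helper names `parse_count` and `average_sources` are hypothetical, and the actual parsing rules of the extraction pipeline are not shown in this notebook.

```python
import re
from statistics import mean

# Hypothetical helpers, not the pipeline's real code.
# Matches numbers with thousands separators ("1,000") or plain digits ("300").
_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")

def parse_count(text):
    """Average all numbers found in `text` (a range such as "1,000-2,000"
    yields the mean of its bounds); return None if no number is found."""
    values = [int(m.replace(",", "")) for m in _NUMBER.findall(text)]
    return mean(values) if values else None

def average_sources(texts):
    """Average the parsed values of several source strings, skipping
    the ones that contain no number."""
    parsed = [v for v in (parse_count(t) for t in texts) if v is not None]
    return mean(parsed) if parsed else None

print(parse_count("1,000-2,000"))
print(average_sources(["500", "700", "heavy"]))
```

Entries without any number return `None`, which would become NaN once loaded into the DataFrame.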

In [None]:

df[['killed_1', 'wounded_1', 'missing_1', 'captured_1', 'casualties_1']].head()

In [None]:
df[['killed_1', 'wounded_1', 'missing_1', 'captured_1', 'casualties_1', 'killed_2', 'wounded_2', 'missing_2', 'captured_2', 'casualties_2']].count()

We observe that the casualties are most of the time defined by the numbers of killed and wounded people.
We focus on the total number of casualties for this descriptive analysis and will go into more detail in the future exploratory analysis.

We observe that the dataset contains a lot of NaN values. This is mainly because there is no numeric information for the casualties. In our data extraction pipeline, we observed that only 79 (casualties_1) and 80 (casualties_2) lines (battles) contain numeric information that cannot be parsed.

In [None]:
sums_null = df[['casualties_1', 'casualties_2', 'casualties_3', 'casualties_4']].isnull().sum()
sums_non_null_percent = 100-(100*sums_null/len(df))
print("number of null values")
print(sums_null)
print("number of non null values")
print(len(df)-sums_null)
plt.bar(range(len(sums_null)), sums_non_null_percent)
plt.title("Percent of battles with numeric values for casualties")
plt.xticks(range(len(sums_null)), sums_null.keys())
plt.show()

We also observe that almost 60% of the battles have numeric values for two combatants ("casualties_1" and "casualties_2"), and that casualties_1 and casualties_2 have more data points than the others. This makes sense, as a battle usually opposes two sides.
Each casualties feature has an average of:

In [None]:
averages = df[['casualties_1', 'casualties_2', 'casualties_3', 'casualties_4']].mean()
print(averages)

We can see that the averages for casualties_1 and casualties_3 are close, while casualties_2 is higher. This may indicate that combatant 2 (to which casualties_2 corresponds) is usually the loser of the battle, although it is too early to draw conclusions.
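
Since a mean can be dominated by a few very large battles, comparing it with the median is one way to check such an observation. The sketch below uses toy data, and the helper `mean_vs_median` is a hypothetical name; on the real data it would be called with `df` and the casualties columns.

```python
import numpy as np
import pandas as pd

def mean_vs_median(frame, columns):
    """Return the mean, median and non-null count of each column, side by side."""
    subset = frame[columns]
    return pd.DataFrame({
        "mean": subset.mean(),
        "median": subset.median(),
        "count": subset.count(),
    })

# Toy data (not taken from the dataset): one outlier inflates the second mean.
toy = pd.DataFrame({
    "casualties_1": [100, 200, 300, np.nan],
    "casualties_2": [100, 200, 30000, 400],
})
summary = mean_vs_median(toy, ["casualties_1", "casualties_2"])
print(summary)
```

If the medians of two features are close while their means differ, the gap in the means is likely driven by outliers rather than by a systematic difference.
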
In the following, we observe the distributions of the features. Since casualties_4 only has 5 data points, we do not use it in our analysis.

In [None]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(nrows=3, ncols=2, sharey=False, figsize=(15,20))
densplot([df['casualties_1'].dropna()], 'casualties_1', "casualties_1 density", ax1)
densplot([df.query('2000 > casualties_1 >1')['casualties_1']], 'casualties_1', "casualties_1 density (ZOOM) ", ax2)

densplot([df['casualties_2'].dropna()], 'casualties_2', "casualties_2 density", ax3)
densplot([df.query('2000 > casualties_2 >1')['casualties_2']], 'casualties_2', "casualties_2 density (ZOOM) ", ax4)

densplot([df['casualties_3'].dropna()], 'casualties_3', "casualties_3 density", ax5)
densplot([df.query('2000 > casualties_3 >1')['casualties_3']], 'casualties_3', "casualties_3 density (ZOOM) ", ax6)

fig.tight_layout()
plt.show()

We observe that casualties_1, casualties_2 and casualties_3 all spread over a wide range of values, while each peaks at small casualties values. By "zooming", we observe that most of the values are between 0 and 2000 casualties. For casualties_4, which contains far fewer data points, the values are between 0 and 2000 too.
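
The statement that most values lie below 2000 can also be expressed as a number rather than read off a plot. The following is a minimal sketch on toy data; `share_below` is a hypothetical helper, and on the real data it would be applied to `df['casualties_1']` and the other casualties columns.

```python
import numpy as np
import pandas as pd

def share_below(series, threshold=2000):
    """Fraction of the non-null values that are strictly below `threshold`."""
    values = series.dropna()
    if values.empty:
        return float("nan")
    return float((values < threshold).mean())

# Toy values (not taken from the dataset)
toy = pd.Series([50, 500, 1500, 2500, 40000, np.nan])
print(share_below(toy))
# Quantiles give a complementary view of a heavy-tailed distribution
print(toy.quantile([0.5, 0.9]))
```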

For the remainder of our analysis of this feature, we focus on casualties_1 (c1) and casualties_2 (c2).
We first observe that 3782 out of the 4392 (min(#casualties_1, #casualties_2)) battles have information on both combatants' casualties.

In [None]:
print("number of battles with values for c1 and c2: ", (len(df.query('0 < casualties_1 and 0 < casualties_2')['casualties_1'])))

As a first step towards our future exploratory analysis, we can combine these two features to see whether, for example, high casualties for one combatant also mean high casualties for the other.

In [None]:
c1_zoom = df.query('25000 > casualties_1 >1 and 25000 > casualties_2 > 1')['casualties_1']
c2_zoom = df.query('25000 > casualties_1 >1 and 25000 > casualties_2 > 1')['casualties_2']

c1 = df.query('casualties_1 >1 and casualties_2>1')['casualties_1']
c2 = df.query('casualties_1 >1 and casualties_2>1')['casualties_2']


fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharey=False, figsize=(10,10))
scatplot(c1, c2, 'casualties_1', 'casualties_2', "Casualties 1 vs. 2", ax1)
scatplot(c1_zoom, c2_zoom, 'casualties_1', 'casualties_2', "Casualties 1 vs. 2 (ZOOM)", ax2)
densplot([abs(c1-c2)], '|c1 - c2|', "Difference between c1 and c2 in density", ax3)
densplot([abs(c1_zoom-c2_zoom)], '|c1 - c2|', "Difference between c1 and c2 in density (ZOOM)", ax4)
ax4.set_xlim(0,10000)
fig.tight_layout()
plt.show()

We observe that the difference between the two casualties values is concentrated around 0. From these plots alone, however, we cannot conclude whether high casualties on one side imply high casualties on the other.
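
One way to quantify the relation between the two sides, instead of judging it from the scatter plot, is a rank correlation, which is less sensitive to the very large battles than a linear correlation. The sketch below computes a Spearman correlation as the Pearson correlation of the ranks; `rank_correlation` is a hypothetical helper and the data is a toy example.

```python
import numpy as np
import pandas as pd

def rank_correlation(x, y):
    """Spearman rank correlation of two Series, computed as the Pearson
    correlation of their ranks on the rows where both are non-null."""
    pair = pd.concat([x, y], axis=1).dropna()
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())

# Toy data (not taken from the dataset): y grows with x, but not linearly.
x = pd.Series([10, 200, 3000, 40000, np.nan])
y = pd.Series([5, 150, 900, 100000, 70])
print(rank_correlation(x, y))
```

On the real data, this would be called as `rank_correlation(df['casualties_1'], df['casualties_2'])`. A value close to 1 would support the idea that high casualties on one side go with high casualties on the other.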

#### Strengths

Strength is the number of men involved in a battle. A battle has strength information for at most 3 different combatants ("strength_1", "strength_2", ...). As for casualties, we average range values.

In [None]:
df[['strength_1', 'strength_2', 'strength_3']].head()

We observe that the dataset contains a lot of NaN values. This is mainly because there is no numeric information for the strength. In our data extraction pipeline, we observed that only 120 (strength_1) and 110 (strength_2) lines (battles) contain numeric information that cannot be parsed. (It can also be the case that the numeric information is irrelevant, so it is not parsed on purpose.)

In [None]:
sums_null = df[['strength_1', 'strength_2', 'strength_3']].isnull().sum()
sums_non_null_percent = 100-(100*sums_null/len(df))
print("number of null values")
print(sums_null)
print("number of non null values")
print(len(df)-sums_null)
plt.bar(range(len(sums_null)), sums_non_null_percent)
plt.title("Percent of battles with numeric values for strength")
plt.xticks(range(len(sums_null)), sums_null.keys())
plt.show()

We also observe that almost 70% of the battles have numeric values for two combatants ("strength_1" and "strength_2"), and that strength_1 and strength_2 have more data points than strength_3. Again, this makes sense, as a battle usually opposes two sides.
Each strength feature has an average of:

In [None]:
averages_strength = df[['strength_1', 'strength_2', 'strength_3']].mean()
print(averages_strength)

We can see that the averages of the 3 features are fairly similar. This may indicate that battles usually oppose two sides with a similar number of fighters.
We now observe the distributions of the features. Again, we focus on the first two, as the last one only has 14 data points.

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharey=False, figsize=(15,10))
densplot([df['strength_1'].dropna()], 'strength_1', "strength_1 density", ax1)
densplot([df.query('2000 > strength_1 >1')['strength_1']], 'strength_1', "strength_1 density (ZOOM) ", ax2)

densplot([df['strength_2'].dropna()], 'strength_2', "strength_2 density", ax3)
densplot([df.query('2000 > strength_2 >1')['strength_2']], 'strength_2', "strength_2 density (ZOOM) ", ax4)

fig.tight_layout()
plt.show()

We observe similar results as for casualties: strength_1 and strength_2 both spread over a wide range of values, while each peaks at small strength values. By "zooming", we observe that most of the strength values are between 0 and 2000.

For the remainder of our analysis of this feature, we focus on strength_1 (s1) and strength_2 (s2).
We first observe that 4383 out of the 5104 (min(#strength_1, #strength_2)) battles have information on both combatants' strengths.

In [None]:
print("number of battles with values for s1 and s2: ", (len(df.query('1 < strength_1 and strength_2 > 1')['strength_1'])))

We now combine these two features to see whether, for example, high strength for one combatant also means high strength for the other, as the averages and distributions tend to show.

In [None]:
s1_zoom = df.query('25000 > strength_1 >1 and 25000 > strength_2 > 1')['strength_1']
s2_zoom = df.query('25000 > strength_1 >1 and 25000 > strength_2 > 1')['strength_2']

s1 = df.query('strength_1 >1 and strength_2>1')['strength_1']
s2 = df.query('strength_1 >1 and strength_2>1')['strength_2']


fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharey=False, figsize=(13,10))
scatplot(s1, s2, 'strength_1', 'strength_2', "Strengths 1 vs. 2", ax1)
scatplot(s1_zoom, s2_zoom, 'strength_1', 'strength_2', "strength 1 vs. 2 (ZOOM)", ax2)
densplot([abs(s1-s2)], 's1', "Difference between s1 and s2 in density", ax3)
densplot([abs(s1_zoom-s2_zoom)], 's1', "Difference between s1 and s2 in density (ZOOM)", ax4)
ax4.set_xlim(0,10000)
fig.tight_layout()
plt.show()

These first results tend to show that battles usually oppose two sides of similar strength. Indeed, the difference between the two strengths is usually small. We also observe that the numbers seem to be rounded quite often in Wikipedia: the top right graph shows grid-like patterns.
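
The grid-like pattern suggests a simple check: the fraction of values that are exact multiples of a round number. The sketch below is illustrative only; `share_round` is a hypothetical helper and the values are toy data.

```python
import numpy as np
import pandas as pd

def share_round(series, base=1000):
    """Fraction of the non-null values that are exact multiples of `base`."""
    values = series.dropna()
    if values.empty:
        return float("nan")
    return float((values % base == 0).mean())

# Toy values (not taken from the dataset)
toy = pd.Series([1000, 2500, 3000, 12000, 750, np.nan])
print(share_round(toy, base=1000))
print(share_round(toy, base=500))
```

On the real data, this would be applied to `df['strength_1']` and `df['strength_2']`. Note that averaging the bounds of a range can itself produce values that are no longer round, so the share measured this way is a lower bound on the rounding present in the sources.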

#### Combatants

A battle has at most 3 different combatants ("combatant_1", "combatant_2", ...). We noticed that Wikipedia often lists multiple combatants in one combatant feature. For example, during the world wars, battles usually opposed two sides, each made of an alliance of multiple combatants. Thus, for each combatant feature of each battle, we retrieve the main combatant and a list of all the combatants present.

In [None]:
df[['combatant_first_1', 'combatant_list_1', 'combatant_first_2', 'combatant_list_2']].head()

In [None]:
sums_null = df[['combatant_first_1', 'combatant_first_2', 'combatant_first_3']].isnull().sum()
sums_non_null_percent = 100-(100*sums_null/len(df))
print("number of null values")
print(sums_null)
print("number of non null values")
print(len(df)-sums_null)
plt.bar(range(len(sums_null)), sums_non_null_percent)
plt.title("Percent of battles with a value for combatants")
plt.xticks(range(len(sums_null)), sums_null.keys())
plt.show()

We observe that, in contrast to the previous features, most of the battles contain information about the combatants. Again, most of the battles are between two (groups of) combatants.

Since the number of different combatants is pretty high (4719), we show here the 50 combatants that participated in the highest number of battles. Not surprisingly, the U.S. comes first, right before France and Spain.

In [None]:
cbt_1 = np.array(df["combatant_list_1"].dropna())
cbt_2 = np.array(df["combatant_list_2"].dropna())
cbt_3 = np.array(df["combatant_list_3"].dropna())
print(len(cbt_1), len(cbt_2), len(cbt_3))

cbt_all = np.concatenate((cbt_1, cbt_2, cbt_3))

# Flatten the per-battle lists into one list of combatant names
all_cbt_names = [c for cl in cbt_all for c in cl]
names = pd.Series(all_cbt_names)
# Number of distinct combatant names
print("Number of different combatants ", len(names.value_counts()))

f, ax = plt.subplots(figsize=(6, 15))
counts = names.value_counts().sort_values(ascending=False)
counts = counts.head(50)
sns.barplot(x=counts, y=counts.index, ax=ax)
counts

#### Results

When parsing the battles' results, we mapped them to a qualifier and a result type.
A qualifier can be "decisive", "major", "crushing", "tactical" or "strategic", while the result type is one of "victory", "defeat" or "retreat".
Each result is then attached to a combatant (or group of combatants).
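
The mapping described above can be sketched as a keyword lookup. This is an assumption about how such a mapping could work, not the pipeline's actual code; `parse_result` is a hypothetical helper and the example strings are made up.

```python
import re

QUALIFIERS = ("decisive", "major", "crushing", "tactical", "strategic")
RESULT_TYPES = ("victory", "defeat", "retreat")

def parse_result(text):
    """Return (qualifier, result_type) for a free-text result.
    Either element is None when no known keyword is present."""
    words = re.findall(r"[a-z]+", text.lower())
    qualifier = next((w for w in words if w in QUALIFIERS), None)
    result_type = next((w for w in words if w in RESULT_TYPES), None)
    return qualifier, result_type

print(parse_result("Decisive French victory"))
print(parse_result("Inconclusive"))
```

Attaching the result to the right combatant would additionally require matching the combatant names in the text, which this sketch does not attempt.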

In [None]:
df[['result_combatant_1']]