# Filter UCDP dataset

(Starting from the UCDP dataset already parsed)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set_style('whitegrid')

%matplotlib inline

parsed_ucdp_dataset = '../data/parsed/parsed_ucdp.csv'
ucdp_df = pd.read_csv(parsed_ucdp_dataset, index_col='index', encoding='utf-8', compression='gzip')


The variables are the following:
* Year
* Type
    1. State-Based Violence
    2. Non-State Violence
    3. One-Sided Violence
* Country
* Date start
* Date end
* Casualties

Only look at _recent_ conflicts, i.e. those that take place from 2016 onwards (`Year > 2015` in the filter below).

I will focus on type 1 (state-based violence): it has the most conflicts and the highest total number of casualties, and we assume that this type of violence is the most likely to have an impact.
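
The two counts behind this choice can be checked directly against the data. The following is a minimal sketch, assuming the `Type` and `Casualties` columns listed above; `summarize_by_type` is a hypothetical helper and not part of the original notebook.

```python
import pandas as pd

def summarize_by_type(df):
    """Per violence type: number of conflicts and total casualties."""
    return df.groupby('Type').agg(
        # 'count' only counts rows whose Casualties value is not missing
        n_conflicts=('Casualties', 'count'),
        total_casualties=('Casualties', 'sum'),
    )
```

Calling `summarize_by_type(ucdp_df)` should return one row per type, which makes it easy to see whether type 1 really dominates on both measures.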

We also separate conflicts depending on the number of casualties: _few_ (0-3 casualties) or _many_ (4 or more casualties). We assume that a conflict resulting in more than three casualties likely corresponds to a more violent incident.

In [None]:
# Recent (Year > 2015) state-based (type 1) conflicts
recent_state_based = (ucdp_df['Year'] > 2015) & (ucdp_df['Type'] == 1)
few_casualties_df = ucdp_df[recent_state_based & (ucdp_df['Casualties'] <= 3)]
many_casualties_df = ucdp_df[recent_state_based & (ucdp_df['Casualties'] > 3)]

Number of conflicts:

In [None]:
print(len(few_casualties_df))
print(len(many_casualties_df))

Number of different countries:

In [None]:
print(len(few_casualties_df['Country'].unique()))
print(len(many_casualties_df['Country'].unique()))

Number of conflicts per country:

In [None]:
grouped_few_df = few_casualties_df.groupby('Country').count()
grouped_many_df = many_casualties_df.groupby('Country').count()

plt.boxplot(grouped_few_df['Year'])
plt.title('Few casualties')
plt.show()

plt.boxplot(grouped_many_df['Year'])
plt.title('Many casualties')
plt.show()
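
The boxplots can be complemented with a numeric summary of the same distribution. This is a sketch under the assumption that each row is one conflict and that `Country` is populated; `conflicts_per_country_summary` is a hypothetical helper name.

```python
import pandas as pd

def conflicts_per_country_summary(df):
    """Descriptive statistics of the number of conflicts per country."""
    # size() counts rows per group, independent of missing values
    counts = df.groupby('Country').size()
    return counts.describe()
```

For example, `conflicts_per_country_summary(many_casualties_df)` gives the median, quartiles and maximum that the boxplot shows graphically.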

I focus on the conflicts with _many_ casualties, since I expect those to result in a bigger 'response'.

-------------------

At this point:

#### Option 1
Keep the `many_casualties_df` as it is:
* Number of conflicts per country varies a lot! (see boxplot above)
* More conflicts to analyze
* It may not be a fair comparison, since people might stop paying attention to countries where there are many conflicts (i.e. we can't compare conflicts from countries where conflicts are frequent with conflicts from countries where they are unusual)

In [None]:
# Save to csv file (gzip-compressed). The result of reset_index() is not assigned
# back, so many_casualties_df keeps its original index for Option 2.
many_casualties_df.reset_index().to_csv('../data/parsed/parsed_filtered1_ucdp.csv', encoding='utf-8', index=False, compression='gzip')


#### Option 2 (to solve issue from Option 1)
Filter by the _total number of conflicts per country_ (this would be a _meta-feature_), keeping only countries with fewer than 10 conflicts:
* Solves the previous issue
* Also gives a more 'even' number of conflicts per country
* However, it reduces the total number of conflicts we consider

In [None]:
conflicts_per_country_s = many_casualties_df.groupby('Country').count()['Year']
countries_unusual_conflicts = list(conflicts_per_country_s[conflicts_per_country_s < 10].index)

many_casualties_filtered_df = many_casualties_df[many_casualties_df['Country'].isin(countries_unusual_conflicts)]
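
The cutoff of 10 conflicts per country is a choice, so it can help to see how many conflicts and countries survive under other cutoffs. This is a minimal sketch, not the notebook's own method: `filter_by_country_frequency` and `cutoff_sensitivity` are hypothetical helpers, and the cutoffs 5 and 20 are arbitrary illustration values.

```python
import pandas as pd

def filter_by_country_frequency(df, max_conflicts=10):
    """Keep conflicts from countries with fewer than `max_conflicts` conflicts."""
    # Map each row's country to that country's total number of conflicts
    n_per_country = df['Country'].map(df['Country'].value_counts())
    return df[n_per_country < max_conflicts]

def cutoff_sensitivity(df, cutoffs=(5, 10, 20)):
    """For each cutoff, report how many conflicts and countries remain."""
    rows = []
    for cutoff in cutoffs:
        kept = filter_by_country_frequency(df, cutoff)
        rows.append({'cutoff': cutoff,
                     'n_conflicts': len(kept),
                     'n_countries': kept['Country'].nunique()})
    return pd.DataFrame(rows).set_index('cutoff')
```

With the default cutoff, `filter_by_country_frequency(many_casualties_df)` should select the same rows as the cell above, provided `Country` and `Year` have no missing values.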

Result: 70 conflicts from 19 different countries, with the number of _conflicts per country_ shown in the boxplot below.

In [None]:
len(many_casualties_filtered_df)

In [None]:
len(countries_unusual_conflicts)

In [None]:
plt.boxplot(many_casualties_filtered_df.groupby('Country').count()['Year'])
plt.show()

In [None]:
# Save to csv file
many_casualties_filtered_df = many_casualties_filtered_df.reset_index()
many_casualties_filtered_df.to_csv('../data/parsed/parsed_filtered2_ucdp.csv', encoding='utf-8', index=False, compression='gzip')