In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import seaborn as sns
from datetime import datetime
import statistics
import re

* Data dictionary (UCDP GED codebook): https://ucdp.uu.se/downloads/ged/ged211.pdf
* UCDP Website: https://ucdp.uu.se/encyclopedia

Dyad = 'A dyad consists of two opposing actors in an armed conflict where at least one party is the government of a state.'

# Introduction

## Project Goal
The goal of this project is to create a classification model that assigns an individual event of organised violence, in which at least one fatality occurred, to one of the following classes:

* Low Fatality Incident (2 or fewer fatalities)
* Moderate Fatality Incident (more than 2 and up to 10 fatalities)
* High Fatality Incident (more than 10 and up to 100 fatalities)
* Very High Fatality Incident (over 100 fatalities)
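
As a minimal sketch of how these classes could be derived from a fatality count, the example below uses `pd.cut` on a small made-up series. The right-inclusive bin edges and the label for the top class ("Very High") are assumptions made for illustration; they are not taken from the dataset or its codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical per-event fatality counts (not real GED rows).
fatalities = pd.Series([1, 2, 3, 10, 11, 100, 101, 48183])

# Assumed right-inclusive bins: (0, 2], (2, 10], (10, 100], (100, inf).
bins = [0, 2, 10, 100, np.inf]
labels = ["Low", "Moderate", "High", "Very High"]  # top label is an assumed name

fatality_class = pd.cut(fatalities, bins=bins, labels=labels, right=True)
print(fatality_class.tolist())
```

With these bins an event recorded with zero fatalities falls outside every interval and becomes `NaN`, so such rows would need to be handled separately.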

A model which can accurately predict whether an incident of fatal violence is likely to escalate into a larger conflict with mass casualties would be of great value in efforts to protect civilian populations and reduce loss of life. Hypothetical end users could be organisations such as the United Nations and the World Health Organisation, as well as other aid agencies and communities themselves in areas where regional conflicts are taking place.


## Acknowledgements & Assumptions
This project touches on very serious topics which require the utmost sensitivity, respect and understanding. For many people, the topics and conflicts covered here are deeply personal and carry a significance that I have neither experienced nor can fully understand. As such, I have made every effort to approach this project as respectfully and impartially as possible, focusing entirely on what the dataset reports and working hard not to impose my own personal bias on any of the analysis.

With that said, it is rarely possible to remove bias entirely, so I would like to further acknowledge:

* My limitations in fully understanding the nature of these events
* I have not personally endured any suffering or consequences as part of the events touched on in this project
* My openness to hearing other perspectives on this project and the events it touches on
* I am approaching this project from a position of privilege and plenty


## Data Source & Overview
The base data I used for this project was sourced from the Uppsala Conflict Data Program (UCDP) at the Department of Peace and Conflict Research, Uppsala University in Sweden. The UCDP is the world’s main provider of data on organized violence and the oldest ongoing data collection project for civil war, with a history of almost 40 years.

The specific dataset used was the “UCDP Georeferenced Event Dataset (GED) Global version 21.1”, which, according to the UCDP, is their “most disaggregated dataset, covering individual events of organized violence (phenomena of lethal violence occurring at a given time and place). These events are sufficiently fine-grained to be geo-coded down to the level of individual villages, with temporal durations disaggregated to single, individual days.”

* Pettersson, Therese, Shawn Davis, Amber Deniz, Garoun Engström, Nanar Hawach, Stina Högbladh, Margareta Sollenberg & Magnus Öberg (2021). Organized violence 1989-2020, with a special emphasis on Syria. Journal of Peace Research 58(4).
* Sundberg, Ralph and Erik Melander (2013) Introducing the UCDP Georeferenced Event Dataset. Journal of Peace Research 50(4).


## Hypothesis
The hypothesis for this project is that it is possible to create a predictive model which is able to outperform a baseline model, through feature engineering with external data on country level indicators and national statistics.
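
The hypothesis does not specify which baseline is meant. One common choice, assumed here purely for illustration, is a majority-class baseline that always predicts the most frequent class. Its accuracy can be read straight from the class shares, as in this sketch on made-up labels:

```python
import pandas as pd

# Hypothetical class labels; in the project these would be the fatality classes.
y = pd.Series(["Low", "Low", "Low", "Moderate", "High", "Low", "Moderate", "Low"])

# A majority-class baseline always predicts the most frequent class,
# so its accuracy equals that class's share of the observations.
class_shares = y.value_counts(normalize=True)
baseline_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()
print(baseline_class, baseline_accuracy)
```

Any trained model would then need to beat `baseline_accuracy` on held-out data for the hypothesis to be supported.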


In [3]:
df = pd.read_csv('ged211.csv', low_memory=False)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261864 entries, 0 to 261863
Data columns (total 49 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 261864 non-null  int64  
 1   relid              261864 non-null  object 
 2   year               261864 non-null  int64  
 3   active_year        261864 non-null  int64  
 4   code_status        261864 non-null  object 
 5   type_of_violence   261864 non-null  int64  
 6   conflict_dset_id   261864 non-null  int64  
 7   conflict_new_id    261864 non-null  int64  
 8   conflict_name      261864 non-null  object 
 9   dyad_dset_id       261864 non-null  int64  
 10  dyad_new_id        261864 non-null  int64  
 11  dyad_name          261864 non-null  object 
 12  side_a_dset_id     261864 non-null  int64  
 13  side_a_new_id      261864 non-null  int64  
 14  side_a             261864 non-null  object 
 15  side_b_dset_id     261864 non-null  int64  
 16  si

In [197]:
df.sample(10)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length
150894,28262,2008,Part of an Ongoing Conflict,State-Based Conflict,337,Somalia: Government,750,Government of Somalia - Al-Shabaab,95,Government of Somalia,717,Al-Shabaab,Shabelle 2008-01-21; Garowe 2008-01-22,residents,Mogadishu city,Mogadishu city,Banaadir region,2.066667,45.366667,132931,Somalia,Africa,2008-01-21,2008-01-21,0,0,3,0,3,0 days
178293,361368,2012,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""VDC,2012-10-05,Abdow Haj Ahmad""",VDC,Latakia governorate,Lattakia,Latakia governorate,35.566667,36.033333,181153,Syria,Middle East,2012-10-05,2012-10-05,0,1,0,0,1,0 days
244939,308412,2019,Part of an Ongoing Conflict,Non-State Conflict,14803,SNA - SDF,16135,SNA - SDF,7514,SNA,6288,SDF,"""SOHR,2019-10-10,Renewed clashes between the K...",SOHR,Maranaz village,Mar’anaz town area in the northern countryside...,Aleppo governorate,36.551056,37.018678,182595,Syria,Middle East,2019-10-09,2019-10-09,2,2,0,0,4,0 days
156589,14951,1993,Part of an Ongoing Conflict,One-Sided Violence,1909,IFP - Civilians,2391,IFP - Civilians,1117,IFP,1,Civilians,TRC Report,,East Rand,East Rand,Transvaal province,-26.183333,28.25,91857,South Africa,Africa,1993-08-01,1993-08-01,0,0,1,0,1,0 days
122874,102005,2002,Part of an Ongoing Conflict,One-Sided Violence,490,Government of Nepal - Civilians,957,Government of Nepal - Civilians,146,Government of Nepal,1,Civilians,"INSEC ""Human Rights Yearbook 2003",,Bhaktapur district,Bhaktapur district,Bagmati zone,27.66734,85.41673,169731,Nepal,Asia,2002-06-04,2002-06-04,0,0,6,0,6,0 days
210631,266207,2014,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""SOHR,2014-10-23,138 died yesterday, including...",SOHR,Taftanaz town,town of Tufnaz.,Idlib governorate,35.99832,36.78579,181154,Syria,Middle East,2014-10-22,2014-10-22,0,0,1,0,1,0 days
171507,46928,2012,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""Agence France Presse,2012-01-21,Roadside bomb...","Syrian Observatory for Human Rights, VDC",Al Mastumah town,on the road between Idlib town and the village...,Idlib governorate,35.8755,36.62967,181154,Syria,Middle East,2012-01-21,2012-01-21,3,0,0,11,14,0 days
231059,313812,2019,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""SOHR,2019-11-11,13 civilian casualties and wo...","SOHR, airwars",Kafr Ruma town,Kafruma town in the southern countryside of Idlib,Idlib governorate,35.63661,36.63293,181154,Syria,Middle East,2019-11-10,2019-11-10,0,0,7,0,7,0 days
98324,114854,2014,Part of an Ongoing Conflict,Non-State Conflict,13265,IS - JRTN,14152,IS - JRTN,234,IS,4359,JRTN,"""Agence France Presse,2014-06-21,Sunni militan...","Security official, witnesses",Ḩawījah town,Hawija town,Kirkūk province,35.324934,43.768621,180448,Iraq,Middle East,2014-06-20,2014-06-20,0,0,0,17,17,0 days
190952,331235,2013,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""SOHR,2013-05-23,About 220 were killed yesterd...",SOHR; VDC,Ar Rastan town,Rastan,Homs governorate,34.926532,36.732381,179714,Syria,Middle East,2013-05-22,2013-05-22,0,0,1,0,1,0 days


In [30]:
df.describe()

Unnamed: 0,id,year,conflict_new_id,dyad_new_id,side_a_new_id,side_b_new_id,latitude,longitude,priogrid_gid,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities
count,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0,261864.0
mean,204135.24098,2009.569567,2211.954511,5569.123557,387.742737,1689.069211,25.887871,39.369586,166985.823699,1.901403,2.460002,3.992794,1.920928,10.275128
std,111157.063675,8.352765,4370.887413,5843.424747,1083.003802,2139.891098,14.751878,41.364117,21254.925047,58.489905,32.157393,173.874671,115.988807,222.556417
min,4.0,1989.0,205.0,406.0,3.0,1.0,-37.813611,-117.3,75530.0,0.0,0.0,0.0,0.0,0.0
25%,104387.75,2004.0,299.0,735.0,118.0,234.0,14.60063,34.495142,150836.0,0.0,0.0,0.0,0.0,1.0
50%,211886.5,2013.0,333.0,973.0,121.0,341.0,33.340582,37.616667,177556.0,0.0,0.0,0.0,0.0,2.0
75%,299287.25,2016.0,506.0,11973.0,146.0,4456.0,35.335876,66.889427,180441.0,1.0,2.0,1.0,0.0,5.0
max,390297.0,2020.0,15263.0,16692.0,7984.0,8006.0,68.97917,155.896681,228667.0,14288.0,9505.0,40000.0,48183.0,48183.0


### Dropping Variables

The data includes variables which are either not useful, outside the scope of this project, or likely to skew the model results, so I am dropping them below as a first step.

In [6]:
df.drop('relid', axis=1, inplace=True)

In [7]:
df.drop('code_status', axis=1, inplace=True)

In [8]:
df.drop('conflict_dset_id', axis=1, inplace=True)

In [9]:
df.drop('dyad_dset_id', axis=1, inplace=True)
df.drop('side_a_dset_id', axis=1, inplace=True)
df.drop('side_b_dset_id', axis=1, inplace=True)

In [10]:
df.drop('number_of_sources', axis=1, inplace=True)
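
The single-column drops above could also be collected into one call. The sketch below runs on a toy frame (not the real GED data) and passes `errors="ignore"` so that re-running the cell after the columns are gone does not raise an error.

```python
import pandas as pd

# Toy frame with a few of the GED column names used above.
toy = pd.DataFrame({
    "id": [1, 2],
    "relid": ["A", "B"],
    "code_status": ["Clear", "Clear"],
    "best": [3, 7],
})

cols_to_drop = ["relid", "code_status", "number_of_sources"]

# errors="ignore" skips names that are absent, so the cell can be re-run safely.
toy = toy.drop(columns=cols_to_drop, errors="ignore")
print(toy.columns.tolist())
```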

In [11]:
df.source_date.isnull().value_counts()

False    157303
True     104561
Name: source_date, dtype: int64

In [12]:
df.drop('source_date', axis=1, inplace=True)

In [31]:
## Keeping as possibly interesting from an NLP angle.

df.source_original.value_counts()

SOHR                                                                               38956
VDC                                                                                20296
police                                                                              9393
SOHR, VDC                                                                           5417
SOHR; VDC                                                                           5335
                                                                                   ...  
Antara News Agency quaoting local sources                                              1
Garoweonline.com in Somali 26 May 12                                                   1
Defence and Police sources                                                             1
Fisseha Tekle, a Nairobi-based researcher with Amnesty International, witnesses        1
Reports/ Police                                                                        1
Name: source_original

In [14]:
df.drop('geom_wkt', axis=1, inplace=True)

In [15]:
df.gwnoa.isnull().value_counts()

False    207473
True      54391
Name: gwnoa, dtype: int64

In [16]:
df.gwnob.isnull().value_counts()

True     260447
False      1417
Name: gwnob, dtype: int64

In [17]:
df.drop('gwnoa', axis=1, inplace=True)
df.drop('gwnob', axis=1, inplace=True)

In [18]:
df.drop('country_id', axis=1, inplace=True)

In [32]:
## Changing to a categorical variable as it makes it easier to understand for now.

df.loc[df['type_of_violence'] == 1, 'type_of_violence'] = 'State-Based Conflict'
df.loc[df['type_of_violence'] == 2, 'type_of_violence'] = 'Non-State Conflict'
df.loc[df['type_of_violence'] == 3, 'type_of_violence'] = 'One-Sided Violence'
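
An alternative to the three `.loc` assignments is a single dictionary lookup with `Series.map`. This is only a sketch on a toy frame; the labels are the ones used in the cell above.

```python
import pandas as pd

violence_labels = {
    1: "State-Based Conflict",
    2: "Non-State Conflict",
    3: "One-Sided Violence",
}

toy = pd.DataFrame({"type_of_violence": [1, 3, 2, 1]})

# map() builds a new string column in one pass; codes missing from the
# dictionary become NaN, which makes unexpected values easy to spot.
toy["type_of_violence"] = toy["type_of_violence"].map(violence_labels)
print(toy["type_of_violence"].tolist())
```

Because the result is a fresh column of strings, this avoids writing text values into an integer column row by row.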

In [20]:
df.drop('adm_2', axis=1, inplace=True)

In [21]:
## Dropping both 'High' and 'Low' fatality estimates as I will be using the 'Best' estimate.

df.drop('high', axis=1, inplace=True)
df.drop('low', axis=1, inplace=True)

In [22]:
df.loc[df['active_year'] == 0, 'active_year'] = 'Isolated Incident'
df.loc[df['active_year'] == 1, 'active_year'] = 'Part of an Ongoing Conflict'

In [23]:
df.drop('where_prec', axis=1, inplace=True)

In [24]:
df.drop('event_clarity', axis=1, inplace=True)

In [25]:
## The below values represent how precise the information is regarding the date ranges given for the conflicts.
## Vast majority of dates are very precise, so will be dropping as not very relevant.

df.date_prec.value_counts()

1    222727
2     28298
4      6250
5      2795
3      1790
0         4
Name: date_prec, dtype: int64

In [26]:
df.drop('date_prec', axis=1, inplace=True)

#### Cleaning Up the Column Names

In [27]:
df.columns

Index(['id', 'year', 'active_year', 'type_of_violence', 'conflict_new_id',
       'conflict_name', 'dyad_new_id', 'dyad_name', 'side_a_new_id', 'side_a',
       'side_b_new_id', 'side_b', 'source_article', 'source_office',
       'source_headline', 'source_original', 'where_coordinates',
       'where_description', 'adm_1', 'latitude', 'longitude', 'priogrid_gid',
       'country', 'region', 'date_start', 'date_end', 'deaths_a', 'deaths_b',
       'deaths_civilians', 'deaths_unknown', 'best'],
      dtype='object')

In [28]:
## Only 'best' needs renaming; every other column name stays the same.
df.rename(columns={'best': 'best_est_fatalities'}, inplace=True)

In [29]:
df.sample(10)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,...,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities
252392,159177,2004,Part of an Ongoing Conflict,State-Based Conflict,354,Turkey: Kurdistan,781,Government of Turkey - PKK,115,Government of Turkey,...,184048,Turkey,Middle East,2004-07-25 00:00:00.000,2004-07-25 00:00:00.000,0,0,0,0,0
711,146053,1994,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,734,Government of Afghanistan - Junbish-i Milli-yi...,130,Government of Afghanistan,...,179059,Afghanistan,Asia,1994-06-28 00:00:00.000,1994-06-28 00:00:00.000,0,0,0,50,50
24779,273978,2018,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,178339,Afghanistan,Asia,2018-12-15 00:00:00.000,2018-12-15 00:00:00.000,0,0,3,0,3
22802,260097,2018,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,174732,Afghanistan,Asia,2018-04-16 00:00:00.000,2018-04-17 00:00:00.000,0,10,0,6,16
158635,14541,1989,Part of an Ongoing Conflict,Non-State Conflict,4841,Supporters of IFP - Supporters of UDF,5451,Supporters of IFP - Supporters of UDF,620,Supporters of IFP,...,86822,South Africa,Africa,1989-07-09 00:00:00.000,1989-07-09 00:00:00.000,1,0,0,0,1
185793,305351,2013,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,176833,Syria,Middle East,2013-02-23 00:00:00.000,2013-02-25 00:00:00.000,0,2,0,0,2
45392,386823,1992,Part of an Ongoing Conflict,State-Based Conflict,389,Bosnia-Herzegovina: Serb,835,Government of Bosnia-Herzegovina - Serbian Rep...,50,Government of Bosnia-Herzegovina,...,193359,Bosnia-Herzegovina,Europe,1992-07-05 00:00:00.000,1992-07-05 00:00:00.000,1,0,0,0,1
225178,235223,2016,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,176833,Syria,Middle East,2016-11-26 00:00:00.000,2016-11-26 00:00:00.000,0,0,0,1,1
163348,72709,2008,Part of an Ongoing Conflict,State-Based Conflict,352,Sri Lanka (Ceylon): Eelam,776,Government of Sri Lanka - LTTE,145,Government of Sri Lanka,...,143081,Sri Lanka,Asia,2008-07-17 00:00:00.000,2008-07-17 00:00:00.000,5,20,0,0,25
194116,332140,2013,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,180441,Syria,Middle East,2013-07-31 00:00:00.000,2013-07-31 00:00:00.000,0,0,1,0,1


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261864 entries, 0 to 261863
Data columns (total 31 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   261864 non-null  int64  
 1   year                 261864 non-null  int64  
 2   active_year          261864 non-null  object 
 3   type_of_violence     261864 non-null  object 
 4   conflict_new_id      261864 non-null  int64  
 5   conflict_name        261864 non-null  object 
 6   dyad_new_id          261864 non-null  int64  
 7   dyad_name            261864 non-null  object 
 8   side_a_new_id        261864 non-null  int64  
 9   side_a               261864 non-null  object 
 10  side_b_new_id        261864 non-null  int64  
 11  side_b               261864 non-null  object 
 12  source_article       261862 non-null  object 
 13  source_office        157303 non-null  object 
 14  source_headline      157303 non-null  object 
 15  source_original  

In [34]:
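## Keeping only events with a known first-order administrative division (adm_1).
## .copy() avoids chained-assignment warnings in the in-place edits below.

df = df[df['adm_1'].notnull()].copy()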

In [35]:
df.source_article[242153]

'"Reuters News,2016-07-04,Seventeen suspects to appear in court over Istanbul airport bombing-media"'

In [36]:
df.source_headline[242153]

'Seventeen suspects to appear in court over Istanbul airport bombing-media'

In [37]:
## Dropping 'Source Headline' as it is only available for data added in 2013 and 2014 and has many NaNs,
## whereas 'Source Article' has data for all remaining observations.

df.drop('source_headline', axis=1, inplace=True)
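
The example in cells 35 and 36 suggests that, at least for some rows, the headline is also embedded in `source_article` as `outlet,date,headline`. The sketch below assumes that three-part layout; `split_source_article` is a hypothetical helper, and entries in any other layout (for example `TRC Report`) simply return `None`.

```python
def split_source_article(entry):
    """Split one 'outlet,date,headline' entry into its three parts.

    Assumes the layout seen in the Reuters example above; returns None
    for entries that do not have exactly that shape.
    """
    # Remove the literal double quotes that wrap the stored value.
    parts = entry.strip().strip('"').split(",", 2)
    if len(parts) != 3:
        return None
    outlet, date, headline = (part.strip() for part in parts)
    return outlet, date, headline


example = ('"Reuters News,2016-07-04,Seventeen suspects to appear in court '
           'over Istanbul airport bombing-media"')
print(split_source_article(example))
```

An outlet name that itself contains a comma would be split incorrectly, so this should be treated as a starting point rather than a reliable parser.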

In [38]:
df.source_office.isnull().value_counts()

False    151106
True      98684
Name: source_office, dtype: int64

In [39]:
## Dropping 'Source Office' as I don't believe it will add much value to the analysis, and it is
## missing for roughly 40% of the observations.

df.drop('source_office', axis=1, inplace=True)

In [40]:
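## No effect at this point: 'best' was already renamed to 'best_est_fatalities' above,
## and rename() silently ignores keys that are not present.
df.rename(columns = {'best':'fatalities'}, inplace = True)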

In [41]:
df['date_start'] = pd.to_datetime(df['date_start'], errors='coerce')
df['date_end'] = pd.to_datetime(df['date_end'], errors='coerce')

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 249790 entries, 0 to 261863
Data columns (total 29 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   id                   249790 non-null  int64         
 1   year                 249790 non-null  int64         
 2   active_year          249790 non-null  object        
 3   type_of_violence     249790 non-null  object        
 4   conflict_new_id      249790 non-null  int64         
 5   conflict_name        249790 non-null  object        
 6   dyad_new_id          249790 non-null  int64         
 7   dyad_name            249790 non-null  object        
 8   side_a_new_id        249790 non-null  int64         
 9   side_a               249790 non-null  object        
 10  side_b_new_id        249790 non-null  int64         
 11  side_b               249790 non-null  object        
 12  source_article       249790 non-null  object        
 13  source_origina

In [43]:
df.sample(10)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,...,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities
42197,20487,1999,Part of an Ongoing Conflict,State-Based Conflict,327,Angola: Government,714,Government of Angola - UNITA,99,Government of Angola,...,111992,Angola,Africa,1999-02-06,1999-02-07,0,0,0,25,25
23551,264807,2018,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,175455,Afghanistan,Asia,2018-07-22,2018-07-24,4,28,0,0,32
178848,311084,2012,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,181153,Syria,Middle East,2012-10-16,2012-10-16,0,0,1,0,1
194196,364594,2013,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,180441,Syria,Middle East,2013-08-02,2013-08-02,0,1,2,0,3
212371,340143,2015,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,181875,Syria,Middle East,2015-01-22,2015-01-22,0,1,0,0,1
153220,229365,2016,Part of an Ongoing Conflict,State-Based Conflict,337,Somalia: Government,750,Government of Somalia - Al-Shabaab,95,Government of Somalia,...,132931,Somalia,Africa,2016-08-12,2016-08-12,0,0,1,0,1
111093,277049,2018,Part of an Ongoing Conflict,Non-State Conflict,4979,Jalisco Cartel New Generation - Los Zetas,5589,Jalisco Cartel New Generation - Los Zetas,1151,Jalisco Cartel New Generation,...,157124,Mexico,Americas,2018-07-01,2018-07-01,0,0,0,4,4
105298,40966,2011,Part of an Ongoing Conflict,State-Based Conflict,11346,Libya: Government,11980,Government of Libya - NTC,111,Government of Libya,...,176071,Libya,Africa,2011-03-16,2011-03-16,0,2,2,14,18
3380,127057,2007,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,177619,Afghanistan,Asia,2007-10-20,2007-10-20,2,0,0,0,2
139261,125542,2009,Isolated Incident,State-Based Conflict,308,Philippines: Mindanao,656,Government of Philippines - MNLF,154,Government of Philippines,...,141715,Philippines,Asia,2009-08-19,2009-08-19,0,7,0,0,7


In [44]:
df['conflict_length'] = df.date_end - df.date_start

In [45]:
df.sample(10)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,...,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length
86844,97340,2009,Part of an Ongoing Conflict,One-Sided Violence,523,ULFA - Civilians,990,ULFA - Civilians,326,ULFA,...,India,Asia,2009-07-06,2009-07-06,0,0,2,0,2,0 days
245079,312579,2019,Part of an Ongoing Conflict,Non-State Conflict,14803,SNA - SDF,16135,SNA - SDF,7514,SNA,...,Syria,Middle East,2019-11-04,2019-11-04,0,0,0,0,0,0 days
12365,78665,2013,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,Afghanistan,Asia,2013-03-13,2013-03-13,1,0,0,0,1,0 days
149123,38937,1997,Part of an Ongoing Conflict,State-Based Conflict,382,Sierra Leone: Government,818,Government of Sierra Leone - RUF,80,Government of Sierra Leone,...,Sierra Leone,Africa,1997-04-06,1997-04-06,0,0,0,3,3,0 days
257538,239232,2017,Part of an Ongoing Conflict,State-Based Conflict,13306,Ukraine: Novorossiya,15101,Government of Ukraine - LPR,61,Government of Ukraine,...,Ukraine,Europe,2017-04-23,2017-04-23,0,0,1,0,1,0 days
107493,239864,2017,Isolated Incident,Non-State Conflict,13988,Dozos (Mali) - JNIM,15162,Dozos (Mali) - JNIM,6769,Dozos (Mali),...,Mali,Africa,2017-03-22,2017-03-22,0,0,0,11,11,0 days
201504,273111,2014,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,Syria,Middle East,2014-01-30,2014-01-30,2,2,0,0,4,0 days
139721,124668,2000,Part of an Ongoing Conflict,State-Based Conflict,308,Philippines: Mindanao,658,Government of Philippines - ASG,154,Government of Philippines,...,Philippines,Asia,2000-05-03,2000-05-03,0,0,0,0,0,0 days
224793,233650,2016,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,Syria,Middle East,2016-11-02,2016-11-02,0,0,0,3,3,0 days
217925,215364,2015,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,Syria,Middle East,2015-09-20,2015-09-20,0,1,0,0,1,0 days


In [46]:
df.conflict_length.value_counts()

0 days      213530
1 days       17244
2 days        6397
3 days        1640
6 days        1593
             ...  
157 days         1
351 days         1
261 days         1
247 days         1
306 days         1
Name: conflict_length, Length: 296, dtype: int64
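
`conflict_length` is stored as a timedelta, which most modelling libraries cannot use directly. A minimal sketch of converting it to a whole number of days, on a toy frame and with a hypothetical column name `conflict_length_days`:

```python
import pandas as pd

# Toy dates only; the real frame has one row per event.
toy = pd.DataFrame({
    "date_start": pd.to_datetime(["2018-04-16", "2013-02-23", "2019-11-04"]),
    "date_end": pd.to_datetime(["2018-04-17", "2013-02-25", "2019-11-04"]),
})

toy["conflict_length"] = toy["date_end"] - toy["date_start"]

# .dt.days turns each timedelta into an integer number of days.
toy["conflict_length_days"] = toy["conflict_length"].dt.days
print(toy["conflict_length_days"].tolist())
```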

## Feature Engineering - National Indicators

In order to add predictive power to the model, and to test the project hypothesis, below I prepare external data to engineer into the main dataset as new features. This was a challenging process, mainly because of the time it took to restructure the data so that it could feasibly be joined to the main dataset. As the transposition was complex, I was unable to achieve the desired result in Python and instead did some basic restructuring in Excel.

Once the various datasets had been structured in Excel, I imported them into pandas dataframes and merged them for further analysis and processing.
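
The layout of the original indicator files is not shown here. If they were in a wide layout with one column per year, which is an assumption, the transposition could be sketched with `DataFrame.melt`. All names and values below are made up.

```python
import pandas as pd

# Hypothetical wide layout: one row per country, one column per year.
wide = pd.DataFrame({
    "Country": ["A", "B"],
    "2018": [1.0, 3.0],
    "2019": [2.0, 4.0],
})

# melt() stacks the year columns into Country / Year / value rows.
long = wide.melt(id_vars="Country", var_name="Year", value_name="Unemployment rate")
long["Year"] = long["Year"].astype(int)
long = long.sort_values(["Country", "Year"]).reset_index(drop=True)
print(long)
```

The result has one row per country and year, which is the shape needed for a join on `Country` and `Year`.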

In [47]:
one = pd.read_csv('one.csv')
two = pd.read_csv('two.csv')
three = pd.read_csv('three.csv')
four = pd.read_csv('four.csv')
five = pd.read_csv('five.csv')
fragile_state_index = pd.read_csv('fragile_state_index.csv')

In [48]:
main_feat = pd.merge(
    one.reset_index(),
    two.reset_index(),
    on=['Year', 'Country'],
    how='outer'
)

In [49]:
main_feat.drop('index_x', axis=1, inplace=True)

In [50]:
main_feat.drop('index_y', axis=1, inplace=True)

In [51]:
main_feat

Unnamed: 0,Country,Year,Economic growth: the rate of change of real GDP,GDP per capita current U.S. dollars,Capital investment as percent of GDP,Household consumption as percent of GDP,Inflation: percent change in the Consumer Price Index,Unemployment rate,Unemployment rate for females,Unemployment rate for males,...,Voice and accountability index (-2.5 weak; 2.5 strong),Political stability index (-2.5 weak; 2.5 strong),Corruption Perceptions Index 100 = no corruption,Political rights index 7 (weak) - 1 (strong),Civil liberties index 7 (weak) - 1 (strong),Internet users percent of population,Gasoline prices at the pump in dollars per liter,Access to electricity percent of the population,Oil reserves billion barrels,Oil production thousand barrels per day
0,Afghanistan,1991,,,,,,11.38,14.36,10.82,...,,,,7.0,7.0,0.00,,,0.0,0.0
1,Afghanistan,1992,,,,,,11.46,14.61,10.88,...,,,,6.0,6.0,0.00,,,0.0,0.0
2,Afghanistan,1993,,,,,,11.61,14.61,11.06,...,,,,7.0,7.0,0.00,,,0.0,0.0
3,Afghanistan,1994,,,,,,11.65,14.76,11.09,...,,,,7.0,7.0,0.00,,,0.0,0.0
4,Afghanistan,1995,,,,,,11.65,14.95,11.04,...,,,,7.0,7.0,0.00,,,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3655,Zimbabwe,2016,0.76,1464.58,9.86,83.35,-1.5,5.24,5.87,4.58,...,-1.18,-0.62,22.0,5.0,5.0,23.12,1.34,39.68,0.0,0.0
3656,Zimbabwe,2017,4.70,1548.17,8.97,79.36,0.9,5.15,5.76,4.52,...,-1.19,-0.71,22.0,5.0,5.0,27.06,,40.14,0.0,0.0
3657,Zimbabwe,2018,4.83,1683.74,5.71,77.04,10.6,5.07,5.62,4.50,...,-1.12,-0.71,22.0,6.0,5.0,,,40.62,0.0,0.0
3658,Zimbabwe,2019,-8.10,1463.99,,,,5.02,5.55,4.46,...,-1.14,-0.92,24.0,5.0,5.0,,,41.09,0.0,0.0


In [52]:
main_feat = pd.merge(
    main_feat.reset_index(),
    three.reset_index(),
    on=['Year', 'Country'],
    how='outer'
)

In [53]:
main_feat.drop('index_x', axis=1, inplace=True)

In [54]:
main_feat.drop('index_y', axis=1, inplace=True)

In [55]:
main_feat

Unnamed: 0,Country,Year,Economic growth: the rate of change of real GDP,GDP per capita current U.S. dollars,Capital investment as percent of GDP,Household consumption as percent of GDP,Inflation: percent change in the Consumer Price Index,Unemployment rate,Unemployment rate for females,Unemployment rate for males,...,Property rights index (0-100),Freedom from corruption index (0-100),Fiscal freedom index (0-100),Business freedom index (0-100),Labor freedom index (0-100),Monetary freedom index (0-100),Trade freedom index (0-100),Investment freedom index (0-100),Financial freedom index (0-100),Economic freedom overall index (0-100)
0,Afghanistan,1991,,,,,,11.38,14.36,10.82,...,,,,,,,,,,
1,Afghanistan,1992,,,,,,11.46,14.61,10.88,...,,,,,,,,,,
2,Afghanistan,1993,,,,,,11.61,14.61,11.06,...,,,,,,,,,,
3,Afghanistan,1994,,,,,,11.65,14.76,11.09,...,,,,,,,,,,
4,Afghanistan,1995,,,,,,11.65,14.95,11.04,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3655,Zimbabwe,2016,0.76,1464.58,9.86,83.35,-1.5,5.24,5.87,4.58,...,10.0,21.0,61.0,38.0,30.0,79.1,50.0,10.0,10.0,38.0
3656,Zimbabwe,2017,4.70,1548.17,8.97,79.36,0.9,5.15,5.76,4.52,...,27.0,15.0,61.0,36.0,33.0,76.5,53.0,25.0,10.0,44.0
3657,Zimbabwe,2018,4.83,1683.74,5.71,77.04,10.6,5.07,5.62,4.50,...,28.0,19.0,61.0,37.0,39.0,76.9,69.0,25.0,10.0,44.0
3658,Zimbabwe,2019,-8.10,1463.99,,,,5.02,5.55,4.46,...,30.0,16.0,62.0,33.0,43.0,72.4,70.0,25.0,10.0,40.0


In [56]:
main_feat = pd.merge(
    main_feat.reset_index(),
    four.reset_index(),
    on=['Year', 'Country'],
    how='outer'
)

In [57]:
main_feat.drop('index_x', axis=1, inplace=True)
main_feat.drop('index_y', axis=1, inplace=True)

In [58]:
main_feat = pd.merge(
    main_feat.reset_index(),
    five.reset_index(),
    on=['Year', 'Country'],
    how='outer'
)

In [59]:
main_feat.drop('index_x', axis=1, inplace=True)
main_feat.drop('index_y', axis=1, inplace=True)

In [60]:
main_feat = pd.merge(
    main_feat.reset_index(),
    fragile_state_index.reset_index(),
    on=['Year', 'Country'],
    how='outer'
)

In [61]:
main_feat.drop('index_x', axis=1, inplace=True)
main_feat.drop('index_y', axis=1, inplace=True)
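
The repeated merge-then-drop pattern above could be folded into a single step with `functools.reduce`. Merging the frames directly, without `reset_index()`, also means no `index_x` / `index_y` columns are created in the first place. The frames below are toy stand-ins for `one`, `two` and `three`.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the indicator files; values are made up.
one = pd.DataFrame({"Country": ["A", "B"], "Year": [2019, 2019], "x": [1.0, 2.0]})
two = pd.DataFrame({"Country": ["A", "C"], "Year": [2019, 2019], "y": [3.0, 4.0]})
three = pd.DataFrame({"Country": ["B"], "Year": [2019], "z": [5.0]})

frames = [one, two, three]

# Outer-merge the frames pairwise, left to right, on the shared keys.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["Year", "Country"], how="outer"),
    frames,
)
print(merged)
```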

In [62]:
main_feat

Unnamed: 0,Country,Year,Economic growth: the rate of change of real GDP,GDP per capita current U.S. dollars,Capital investment as percent of GDP,Household consumption as percent of GDP,Inflation: percent change in the Consumer Price Index,Unemployment rate,Unemployment rate for females,Unemployment rate for males,...,Group grievance index 0 (low) - 10 (high),Economic decline index 0 (low) - 10 (high),Uneven economic development index 0 (low) - 10 (high),Human flight and brain drain index 0 (low) - 10 (high),State legitimacy index 0 (high) - 10 (low),Public services index 0 (high) - 10 (low),Human rights and rule of law index 0 (high) - 10 (low),Demographic pressures 0 (low) - 10 (high),Refugees and displaced persons index 0 (low) - 10 (high),External interventions index 0 (low) - 10 (high)
0,Afghanistan,1991,,,,,,11.38,14.36,10.82,...,,,,,,,,,,
1,Afghanistan,1992,,,,,,11.46,14.61,10.88,...,,,,,,,,,,
2,Afghanistan,1993,,,,,,11.61,14.61,11.06,...,,,,,,,,,,
3,Afghanistan,1994,,,,,,11.65,14.76,11.09,...,,,,,,,,,,
4,Afghanistan,1995,,,,,,11.65,14.95,11.04,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3655,Zimbabwe,2016,0.76,1464.58,9.86,83.35,-1.5,5.24,5.87,4.58,...,7.5,8.3,8.2,8.1,8.9,8.5,8.4,8.6,8.7,7.7
3656,Zimbabwe,2017,4.70,1548.17,8.97,79.36,0.9,5.15,5.76,4.52,...,7.3,8.6,8.5,7.9,9.2,8.9,8.2,9.1,8.5,7.5
3657,Zimbabwe,2018,4.83,1683.74,5.71,77.04,10.6,5.07,5.62,4.50,...,7.0,8.6,8.2,7.6,9.7,8.9,8.5,8.9,8.2,7.6
3658,Zimbabwe,2019,-8.10,1463.99,,,,5.02,5.55,4.46,...,6.7,8.1,7.9,7.3,9.4,8.6,8.2,9.0,8.2,7.3


In [63]:
main_feat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3660 entries, 0 to 3659
Data columns (total 97 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Country                                                   3660 non-null   object 
 1   Year                                                      3660 non-null   int64  
 2   Economic growth: the rate of change of real GDP           3394 non-null   float64
 3   GDP per capita current U.S. dollars                       3425 non-null   float64
 4   Capital investment as percent of GDP                      3293 non-null   float64
 5   Household consumption as percent of GDP                   3131 non-null   float64
 6   Inflation: percent change in the Consumer Price Index     3164 non-null   float64
 7   Unemployment rate                                         3660 non-null   float64
 8   Unemployment rate 

#### Missing Values

There are many interesting features to consider here, but missing values are clearly a problem. As a first step, the features with very few non-null values are dropped.
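As a hedged alternative to dropping columns by position, the sketch below drops every column whose share of non-null values falls under a chosen threshold. The toy frame, the helper name `drop_sparse_columns` and the 0.5 threshold are illustrative assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, min_non_null_share=0.5):
    """Return a copy of `frame` without columns that are mostly empty."""
    # Share of non-null values per column, between 0 and 1.
    non_null_share = frame.notna().mean()
    keep = non_null_share[non_null_share >= min_non_null_share].index
    return frame.loc[:, keep].copy()

# Toy data standing in for the merged feature table (hypothetical values).
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2018, 2019, 2018, 2019],
    'dense_feature': [1.0, 2.0, np.nan, 4.0],
    'sparse_feature': [np.nan, np.nan, np.nan, 0.3],
})

trimmed = drop_sparse_columns(toy, min_non_null_share=0.5)
print(list(trimmed.columns))
```

Selecting columns by a rule like this keeps the result stable if the column order changes, which positional indices do not.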

In [64]:
main_feat.drop(main_feat.columns[62], axis=1, inplace=True)

In [65]:
main_feat.drop(main_feat.columns[62], axis=1, inplace=True)

In [66]:
main_feat.drop(main_feat.columns[62], axis=1, inplace=True)

In [67]:
main_feat.drop(main_feat.columns[58], axis=1, inplace=True)

In [68]:
main_feat.drop(main_feat.columns[37], axis=1, inplace=True)

In [69]:
main_feat.drop(main_feat.columns[51], axis=1, inplace=True)

In [70]:
main_feat.drop(main_feat.columns[56], axis=1, inplace=True)

In [71]:
main_feat.drop(main_feat.columns[56], axis=1, inplace=True)

In [72]:
main_feat.drop(main_feat.columns[55], axis=1, inplace=True)

In [73]:
main_feat.drop(main_feat.columns[55], axis=1, inplace=True)

In [74]:
main_feat.drop(main_feat.columns[55], axis=1, inplace=True)

In [75]:
main_feat.drop(main_feat.columns[62], axis=1, inplace=True)

In [76]:
main_feat.drop(main_feat.columns[62], axis=1, inplace=True)

In [77]:
main_feat.drop(main_feat.columns[63], axis=1, inplace=True)

In [78]:
main_feat.drop(main_feat.columns[66], axis=1, inplace=True)

In [79]:
main_feat.drop(main_feat.columns[66], axis=1, inplace=True)

In [80]:
main_feat.drop(main_feat.columns[66], axis=1, inplace=True)

In [81]:
main_feat.drop(main_feat.columns[31], axis=1, inplace=True)

In [82]:
main_feat.drop(main_feat.columns[34], axis=1, inplace=True)

In [83]:
main_feat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3660 entries, 0 to 3659
Data columns (total 78 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Country                                                   3660 non-null   object 
 1   Year                                                      3660 non-null   int64  
 2   Economic growth: the rate of change of real GDP           3394 non-null   float64
 3   GDP per capita current U.S. dollars                       3425 non-null   float64
 4   Capital investment as percent of GDP                      3293 non-null   float64
 5   Household consumption as percent of GDP                   3131 non-null   float64
 6   Inflation: percent change in the Consumer Price Index     3164 non-null   float64
 7   Unemployment rate                                         3660 non-null   float64
 8   Unemployment rate 

## Subsetting Feature Data - 2007 to Present

Missing feature data becomes more common the further back in time the records go. One way to reduce it is therefore to restrict the project to conflicts and features from 2007 onwards.
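One way to check this pattern is to compute the share of missing cells per year before choosing a cutoff. The sketch below uses a small made-up frame; the column names loosely mirror `main_feat`, but `feature_1`, `feature_2` and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2005, 2007, 2008, 2005, 2007, 2008],
    'feature_1': [np.nan, 1.0, 2.0, np.nan, 3.0, 4.0],
    'feature_2': [np.nan, 5.0, 6.0, 7.0, np.nan, 8.0],
})

feature_cols = ['feature_1', 'feature_2']
# 1.0 where a cell is missing, 0.0 otherwise.
missing_mask = toy[feature_cols].isna().astype(float)
# Average the mask per year, then across features: share of missing cells per year.
missing_share = missing_mask.groupby(toy['Year']).mean().mean(axis=1)
print(missing_share)

# Keep only the years from the chosen cutoff onwards.
recent = toy[toy['Year'] >= 2007]
```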


In [84]:

pd.set_option('display.max_rows', 100)

In [85]:
more_recent = main_feat[main_feat['Year'] >= 2007]

In [86]:
more_recent.columns = ['country', 'year', 'economic_growth',
       'gdp_per_capita',
       'capital_investment',
       'household_consumption',
       'inflation',
       'unemployment_rate', 'unemployment_rate_f',
       'unemployment_rate_males', 'youth_unemployment',
       'lbr_frce_part_rate',
       'trade_openness',
       'exports_goods_services',
       'imports_goods_services',  # assumed to be the imports column; column names must be unique
       'trade_balance', 'remittances',
       'external_debt',
       'gov_spending',
       'government_debt',
       'foreign_aid_received',
       'rule_of_law_index_score',
       'gov_effectiveness_index_score',
       'control_corruption_index_score',
       'Regulatory_qual_index_score',
       'accountability_index_score',
       'pol_stability_index_score',
       'corruption_perception_index_score',
       'pol_rights_index_score',
       'civil_liberties_index_score',
       'internet_users_pcnt_pop',
       'access_electricity_pcnt_pop',
       'oil_reserves_barrels',
       'oil_prod_barrels_daily', 'banking_z_score',
       'property_rights_index_score',
       'freedom_fr_corruption_index_score', 'fiscal_freedom_index_score',
       'biz_freedom_index_score', 'labor_freedom_index_score',
       'monetary_freedom_index_score', 'trade_freedom_index_score',
       'inv_freedom_index_score', 'fin_freedom_index_score',
       'econ_freedom_index_score', 'pcnt_urban_population',
       'pop_den_ppl_sqkm',
       'rur_pop_pcnt_tot_pop', 'refugee_pop',
       'pop_growth_pcnt', 'health_spend_pcnt_gdp',
       'life_expectancy_years', 'death_rate_p1000',
       'globalization_index_score', 'econ_globalization_index_score',
       'pol_globalization_index_score',
       'social_globalization_index_score', 'pcnt_world_oil_reserves',
       'savings_pcnt_gdp', 'pcnt_world_tourist_arrivals',
       'numb_prisoners_p100000', 'homicides_p100000',
       'military_spen_pcnt_gdp',
       'arms_imports_mils',
       'human_development_index_score',
       'fragile_state index_score',
       'security_threats_index_score',
       'factionalized_elitesindex_score',
       'group_grievance_index_score',
       'economic_decline_index_score',
       'uneven_economic_dev_index_score',
       'human_flight_brain_drain_index_score',
       'state_leg_index_score',
       'public_serv_index_score',
       'human_rights_rule_law index_score',
       'demographic_pressures_score',
       'refugees_displaced_pers_index_score',
       'ext_interventions_index_score']

In [87]:
more_recent.reset_index(inplace=True)

In [88]:
more_recent.drop('index', axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
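The warning above is raised because `more_recent` was created by filtering `main_feat` and is then modified in place. A common way to avoid it, sketched here on toy data (the frame and its values are assumptions), is to take an explicit copy when filtering and to reset the index in one step.

```python
import pandas as pd

toy_feat = pd.DataFrame({
    'Country': ['A', 'A', 'B'],
    'Year': [2006, 2007, 2008],
    'value': [1.0, 2.0, 3.0],
})

# .copy() makes the subset an independent frame, so later edits are unambiguous.
# reset_index(drop=True) renumbers the rows without adding an 'index' column.
toy_recent = toy_feat[toy_feat['Year'] >= 2007].copy().reset_index(drop=True)

# Modifying the copy leaves the original frame untouched.
toy_recent['value_doubled'] = toy_recent['value'] * 2
```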


In [89]:
more_recent.iloc[:30]

Unnamed: 0,country,year,economic_growth,gdp_per_capita,capital_investment,household_consumption,inflation,unemployment_rate,unemployment_rate_f,unemployment_rate_males,...,group_grievance_index_score,economic_decline_index_score,uneven_economic_dev_index_score,human_flight_brain_drain_index_score,state_leg_index_score,public_serv_index_score,human_rights_rule_law index_score,demographic_pressures_score,refugees_displaced_pers_index_score,ext_interventions_index_score
0,Afghanistan,2007,13.83,359.69,,,8.7,11.18,14.85,10.5,...,9.1,8.3,8.0,7.0,8.8,8.0,8.2,8.5,8.9,10.0
1,Afghanistan,2008,3.92,364.66,,,26.4,11.11,14.27,10.53,...,9.5,8.5,8.1,7.0,9.2,8.3,8.4,9.1,8.9,10.0
2,Afghanistan,2009,21.39,438.08,,,-6.8,11.46,14.99,10.81,...,9.6,8.3,8.4,7.2,9.8,8.9,8.8,9.3,8.9,10.0
3,Afghanistan,2010,14.36,543.3,,,2.2,11.52,14.76,10.92,...,9.7,8.3,8.2,7.2,10.0,8.9,9.2,9.5,9.2,10.0
4,Afghanistan,2011,0.43,591.16,,,11.8,11.51,14.79,10.9,...,9.3,8.0,8.4,7.2,9.7,8.5,8.8,9.1,9.3,10.0
5,Afghanistan,2012,12.75,641.87,,,6.4,11.52,14.86,10.88,...,9.4,7.7,8.1,7.4,9.5,8.5,8.5,8.9,9.0,10.0
6,Afghanistan,2013,5.6,637.17,,,7.4,11.54,14.7,10.89,...,9.2,8.2,7.8,7.2,9.4,8.8,8.4,9.3,9.2,10.0
7,Afghanistan,2014,2.72,613.86,,,4.7,11.45,14.53,10.78,...,8.7,8.3,7.5,7.8,9.5,9.0,8.3,8.8,9.3,9.9
8,Afghanistan,2015,1.45,578.47,,,-0.7,11.39,14.45,10.68,...,8.9,8.6,7.2,8.1,9.7,9.3,8.6,9.3,9.1,9.8
9,Afghanistan,2016,2.26,509.22,,,4.4,11.31,14.33,10.57,...,8.6,8.5,7.5,8.4,9.1,9.6,8.7,9.5,9.5,9.9


In [90]:
more_recent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1708 entries, 0 to 1707
Data columns (total 78 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   country                               1708 non-null   object 
 1   year                                  1708 non-null   int64  
 2   economic_growth                       1541 non-null   float64
 3   gdp_per_capita                        1554 non-null   float64
 4   capital_investment                    1540 non-null   float64
 5   household_consumption                 1437 non-null   float64
 6   inflation                             1551 non-null   float64
 7   unemployment_rate                     1708 non-null   float64
 8   unemployment_rate_f                   1586 non-null   float64
 9   unemployment_rate_males               1586 non-null   float64
 10  youth_unemployment                    1586 non-null   float64
 11  lbr_frce_part_rat

In [91]:
more_recent.drop(more_recent.columns[60], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [92]:
more_recent.drop(more_recent.columns[60], axis=1, inplace=True)

In [93]:
#more_recent.to_csv('more_recent.csv', encoding='utf-8')

In [94]:
more_recent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1708 entries, 0 to 1707
Data columns (total 76 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   country                               1708 non-null   object 
 1   year                                  1708 non-null   int64  
 2   economic_growth                       1541 non-null   float64
 3   gdp_per_capita                        1554 non-null   float64
 4   capital_investment                    1540 non-null   float64
 5   household_consumption                 1437 non-null   float64
 6   inflation                             1551 non-null   float64
 7   unemployment_rate                     1708 non-null   float64
 8   unemployment_rate_f                   1586 non-null   float64
 9   unemployment_rate_males               1586 non-null   float64
 10  youth_unemployment                    1586 non-null   float64
 11  lbr_frce_part_rat

In [95]:
df_recent = df[(df['year'] >= 2007)]

### Feature Selection & Merging with Core Data

Because there are so many features, and each has some missing data, I will handpick the features that correlate most strongly (positively or negatively) with the target ('best_est_fatalities'). These are the features used for modelling.
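A minimal sketch of ranking features by the strength of their correlation with the target, assuming a toy frame with a `best_est_fatalities` column; the feature names and values are made up for illustration. Sorting by absolute value puts strongly negative and strongly positive features side by side.

```python
import pandas as pd

toy = pd.DataFrame({
    'best_est_fatalities': [1, 2, 3, 4, 5],
    'feature_pos': [2.0, 4.1, 5.9, 8.2, 10.0],
    'feature_neg': [5.0, 4.2, 2.9, 2.1, 1.0],
    'feature_weak': [1.0, -1.0, 1.0, -1.0, 1.0],
})

# Correlate every candidate feature with the target column.
features = toy.drop(columns='best_est_fatalities')
correlations = features.corrwith(toy['best_est_fatalities'])

# Rank by magnitude so the sign does not affect the ordering.
ranked = correlations.abs().sort_values(ascending=False)
top_two = ranked.index[:2].tolist()
print(ranked)
```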

In [124]:
main = pd.merge(
    df_recent.reset_index(),
    more_recent.reset_index(),
    on=['country', 'year'],
    how='left'
)

In [125]:
main.drop('index_x', axis=1, inplace=True)
main.drop('index_y', axis=1, inplace=True)

In [126]:
main.sample(10)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,...,group_grievance_index_score,economic_decline_index_score,uneven_economic_dev_index_score,human_flight_brain_drain_index_score,state_leg_index_score,public_serv_index_score,human_rights_rule_law index_score,demographic_pressures_score,refugees_displaced_pers_index_score,ext_interventions_index_score
135074,287717,2014,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,10.0,6.7,6.9,6.9,9.8,7.2,9.9,6.0,10.0,8.6
15276,224844,2016,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,8.6,8.5,7.5,8.4,9.1,9.6,8.7,9.5,9.5,9.9
147790,206471,2015,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,10.0,7.5,7.0,7.4,9.9,8.2,10.0,8.1,10.0,9.9
89245,209490,2015,Part of an Ongoing Conflict,State-Based Conflict,337,Somalia: Government,750,Government of Somalia - Al-Shabaab,95,Government of Somalia,...,9.5,9.1,9.0,9.2,9.3,9.3,10.0,9.6,9.8,9.5
161574,271207,2014,Part of an Ongoing Conflict,State-Based Conflict,13604,Syria: Islamic State,14620,Government of Syria - IS,118,Government of Syria,...,10.0,6.7,6.9,6.9,9.8,7.2,9.9,6.0,10.0,8.6
10979,81247,2013,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,9.2,8.2,7.8,7.2,9.4,8.8,8.4,9.3,9.2,10.0
6596,171102,2011,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,...,9.3,8.0,8.4,7.2,9.7,8.5,8.8,9.1,9.3,10.0
116731,353013,2013,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,...,9.3,6.4,7.2,6.2,9.6,7.0,9.5,5.6,9.5,8.1
30429,338829,2020,Part of an Ongoing Conflict,One-Sided Violence,514,Taleban - Civilians,981,Taleban - Civilians,303,Taleban,...,7.5,8.3,7.7,7.5,9.0,9.5,7.6,9.0,9.3,8.6
162645,210387,2015,Part of an Ongoing Conflict,State-Based Conflict,13604,Syria: Islamic State,14620,Government of Syria - IS,118,Government of Syria,...,10.0,7.5,7.0,7.4,9.9,8.2,10.0,8.1,10.0,9.9


In [127]:
main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180861 entries, 0 to 180860
Columns: 104 entries, id to ext_interventions_index_score
dtypes: datetime64[ns](2), float64(76), int64(12), object(13), timedelta64[ns](1)
memory usage: 144.9+ MB


In [128]:
pd.set_option('display.max_columns', 500)

In [129]:
main.corrwith(main.best_est_fatalities).sort_values()

capital_investment                     -0.060968
id                                     -0.054562
priogrid_gid                           -0.050392
latitude                               -0.050244
economic_growth                        -0.046704
social_globalization_index_score       -0.045272
econ_globalization_index_score         -0.036509
inflation                              -0.035371
life_expectancy_years                  -0.035340
refugee_pop                            -0.034322
savings_pcnt_gdp                       -0.032791
longitude                              -0.032596
foreign_aid_received                   -0.029721
side_b_new_id                          -0.029562
pol_rights_index_score                 -0.028014
globalization_index_score              -0.026349
freedom_fr_corruption_index_score      -0.024520
civil_liberties_index_score            -0.024182
pop_den_ppl_sqkm                       -0.022823
pcnt_world_tourist_arrivals            -0.019167
unemployment_rate_f 

In [130]:
correlated_feat = more_recent[['country', 'year', 'capital_investment', 'economic_growth', 'savings_pcnt_gdp', 'inflation', 'pcnt_world_tourist_arrivals', 'death_rate_p1000', 'human_flight_brain_drain_index_score', 'government_debt', 'inv_freedom_index_score', 'external_debt', 'labor_freedom_index_score', 'remittances', 'pop_growth_pcnt', 'banking_z_score', 'oil_reserves_barrels', 'pcnt_world_oil_reserves', 'oil_prod_barrels_daily', 'trade_balance']]

###### Note: 
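These features were selected because they were the most positively or negatively correlated with the target variable. The correlations are weak in absolute terms: the strongest negative value in the output above is about -0.06.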

In [131]:
correlated_feat

Unnamed: 0,country,year,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
0,Afghanistan,2007,,13.83,,8.7,,9.29,7.0,20.14,,,,0.00,2.49,20.69,0.0,0.0,0.0,
1,Afghanistan,2008,,3.92,,26.4,,8.93,7.0,19.06,,,,0.89,2.27,12.91,0.0,0.0,0.0,
2,Afghanistan,2009,,21.39,,-6.8,,8.58,7.2,16.25,,20.00,,1.13,2.40,17.28,0.0,0.0,0.0,
3,Afghanistan,2010,,14.36,,2.2,,8.25,7.2,7.70,,15.33,,2.39,2.75,16.15,0.0,0.0,0.0,
4,Afghanistan,2011,,0.43,,11.8,,7.94,7.2,7.50,,13.97,,1.01,3.14,12.29,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1703,Zimbabwe,2016,9.86,0.76,-1.22,-1.5,0.18,8.29,8.1,48.49,10.0,61.24,30.0,,1.55,3.55,0.0,0.0,0.0,-11.33
1704,Zimbabwe,2017,8.97,4.70,13.48,0.9,0.18,8.04,7.9,43.73,25.0,62.45,33.0,,1.46,3.31,0.0,0.0,0.0,-9.91
1705,Zimbabwe,2018,5.71,4.83,,10.6,0.19,7.88,7.6,33.63,25.0,55.69,39.0,,1.41,,0.0,0.0,0.0,-7.33
1706,Zimbabwe,2019,,-8.10,,,,7.77,7.3,,25.0,61.86,43.0,,,,0.0,,0.0,


In [132]:
correlated_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1708 entries, 0 to 1707
Data columns (total 20 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   country                               1708 non-null   object 
 1   year                                  1708 non-null   int64  
 2   capital_investment                    1540 non-null   float64
 3   economic_growth                       1541 non-null   float64
 4   savings_pcnt_gdp                      1414 non-null   float64
 5   inflation                             1551 non-null   float64
 6   pcnt_world_tourist_arrivals           1320 non-null   float64
 7   death_rate_p1000                      988 non-null    float64
 8   human_flight_brain_drain_index_score  1708 non-null   float64
 9   government_debt                       1587 non-null   float64
 10  inv_freedom_index_score               1598 non-null   float64
 11  external_debt    

#### Missing Feature Data

With the most correlated features selected, the missing data is imputed below. First, a self-defined function iterates through each row of a given dataframe column. When it finds a NaN, it checks whether the previous row belongs to the same country; if so, it fills the gap with the mean of the two preceding rows of the original column. If the row is the first for a new country, no value is filled and a 'NaN' placeholder string is stored instead. NaNs that remain are then forward and backward filled within each country, and the placeholder strings are finally replaced with 0.

###### Note:
There are significant assumptions being made here, as well as imputations of non-real-world data, which almost certainly skew and bias the analysis.
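The imputation idea can also be sketched with grouped operations. This is a hedged variant rather than the function used in this notebook: it never averages across a country boundary and keeps missing values numeric instead of storing a placeholder string. The helper name `impute_within_country` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2007, 2008, 2009, 2010, 2007, 2008, 2009],
    'feature': [1.0, 3.0, np.nan, 5.0, np.nan, 4.0, np.nan],
})

def impute_within_country(frame, column, group='country'):
    """Fill gaps from a country's own history only."""
    grouped = frame.groupby(group)[column]
    # Mean of the two preceding original rows of the same country.
    prev_mean = (grouped.shift(1) + grouped.shift(2)) / 2
    filled = frame[column].fillna(prev_mean.round(2))
    # Remaining gaps: back-fill, then forward-fill, still within each country.
    filled = filled.groupby(frame[group]).bfill()
    filled = filled.groupby(frame[group]).ffill()
    return filled

toy['feature_imputed'] = impute_within_country(toy, 'feature')
print(toy)
```

A country with no observations at all for a feature would still be left as NaN by this sketch, which makes the remaining gaps visible rather than hiding them.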

In [133]:
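#values = []

def nan_filler(target_column, country_column=correlated_feat.country):
    # Appends to the global list `values`, which must be reset to [] before each call.
    #values = []
    
    # .items() replaces the deprecated .iteritems()
    for index, frame in target_column.items():
        if pd.notnull(frame):
            # Observed values are kept unchanged.
            values.append(frame)
        else:
            if country_column.iloc[index] != country_column.iloc[index - 1]:
                # First row of a new country: store a placeholder string instead of a value.
                values.append('NaN')
            else:
                # Mean of the two preceding rows of the original column.
                # For a country's second row, index - 2 belongs to the previous country.
                mean = (target_column.iloc[index - 2] + target_column.iloc[index - 1])/2
                values.append(mean.round(2))
    
    #target_column = values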

In [134]:
values = []
nan_filler(correlated_feat.capital_investment)
correlated_feat.capital_investment = values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [135]:
values = []
nan_filler(correlated_feat.economic_growth)
correlated_feat.economic_growth  = values

In [136]:
values = []
nan_filler(correlated_feat.savings_pcnt_gdp)
correlated_feat.savings_pcnt_gdp  = values

In [137]:
values = []
nan_filler(correlated_feat.inflation)
correlated_feat.inflation  = values

In [138]:
values = []
nan_filler(correlated_feat.pcnt_world_tourist_arrivals)
correlated_feat.pcnt_world_tourist_arrivals  = values

In [139]:
values = []
nan_filler(correlated_feat.death_rate_p1000)
correlated_feat.death_rate_p1000  = values

In [140]:
values = []
nan_filler(correlated_feat.government_debt)
correlated_feat.government_debt  = values

In [141]:
values = []
nan_filler(correlated_feat.inv_freedom_index_score)
correlated_feat.inv_freedom_index_score  = values

In [142]:
values = []
nan_filler(correlated_feat.external_debt)
correlated_feat.external_debt  = values

In [143]:
values = []
nan_filler(correlated_feat.labor_freedom_index_score)
correlated_feat.labor_freedom_index_score = values

In [144]:
values = []
nan_filler(correlated_feat.remittances)
correlated_feat.remittances = values

In [145]:
values = []
nan_filler(correlated_feat.pop_growth_pcnt)
correlated_feat.pop_growth_pcnt = values

In [146]:
values = []
nan_filler(correlated_feat.banking_z_score)
correlated_feat.banking_z_score = values

In [147]:
values = []
nan_filler(correlated_feat.oil_reserves_barrels)
correlated_feat.oil_reserves_barrels = values

In [148]:
values = []
nan_filler(correlated_feat.pcnt_world_oil_reserves)
correlated_feat.pcnt_world_oil_reserves = values

In [149]:
values = []
nan_filler(correlated_feat.trade_balance)
correlated_feat.trade_balance = values

In [150]:
correlated_feat

Unnamed: 0,country,year,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
0,Afghanistan,2007,,13.83,,8.7,,9.29,7.0,20.14,,,,0.0,2.49,20.69,0.0,0.0,0.0,
1,Afghanistan,2008,,3.92,,26.4,,8.93,7.0,19.06,,,,0.89,2.27,12.91,0.0,0.0,0.0,
2,Afghanistan,2009,,21.39,,-6.8,,8.58,7.2,16.25,,20.0,,1.13,2.4,17.28,0.0,0.0,0.0,
3,Afghanistan,2010,,14.36,,2.2,,8.25,7.2,7.7,,15.33,,2.39,2.75,16.15,0.0,0.0,0.0,
4,Afghanistan,2011,,0.43,,11.8,,7.94,7.2,7.5,,13.97,,1.01,3.14,12.29,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1703,Zimbabwe,2016,9.86,0.76,-1.22,-1.5,0.18,8.29,8.1,48.49,10.0,61.24,30.0,,1.55,3.55,0.0,0.0,0.0,-11.33
1704,Zimbabwe,2017,8.97,4.7,13.48,0.9,0.18,8.04,7.9,43.73,25.0,62.45,33.0,,1.46,3.31,0.0,0.0,0.0,-9.91
1705,Zimbabwe,2018,5.71,4.83,6.13,10.6,0.19,7.88,7.6,33.63,25.0,55.69,39.0,,1.41,3.43,0.0,0.0,0.0,-7.33
1706,Zimbabwe,2019,7.34,-8.1,,5.75,0.18,7.77,7.3,38.68,25.0,61.86,43.0,,1.44,,0.0,0.0,0.0,-8.62


In [151]:
correlated_feat['capital_investment'] = correlated_feat.groupby('country')['capital_investment'].fillna(method='bfill')
correlated_feat['capital_investment'] = correlated_feat.groupby('country')['capital_investment'].fillna(method='ffill')

correlated_feat['economic_growth'] = correlated_feat.groupby('country')['economic_growth'].fillna(method='bfill')
correlated_feat['economic_growth'] = correlated_feat.groupby('country')['economic_growth'].fillna(method='ffill')

correlated_feat['savings_pcnt_gdp'] = correlated_feat.groupby('country')['savings_pcnt_gdp'].fillna(method='bfill')
correlated_feat['savings_pcnt_gdp'] = correlated_feat.groupby('country')['savings_pcnt_gdp'].fillna(method='ffill')

correlated_feat['inflation'] = correlated_feat.groupby('country')['inflation'].fillna(method='bfill')
correlated_feat['inflation'] = correlated_feat.groupby('country')['inflation'].fillna(method='ffill')

correlated_feat['pcnt_world_tourist_arrivals'] = correlated_feat.groupby('country')['pcnt_world_tourist_arrivals'].fillna(method='bfill')
correlated_feat['pcnt_world_tourist_arrivals'] = correlated_feat.groupby('country')['pcnt_world_tourist_arrivals'].fillna(method='ffill')

correlated_feat['death_rate_p1000'] = correlated_feat.groupby('country')['death_rate_p1000'].fillna(method='bfill')
correlated_feat['death_rate_p1000'] = correlated_feat.groupby('country')['death_rate_p1000'].fillna(method='ffill')

correlated_feat['government_debt'] = correlated_feat.groupby('country')['government_debt'].fillna(method='bfill')
correlated_feat['government_debt'] = correlated_feat.groupby('country')['government_debt'].fillna(method='ffill')

correlated_feat['inv_freedom_index_score'] = correlated_feat.groupby('country')['inv_freedom_index_score'].fillna(method='bfill')
correlated_feat['inv_freedom_index_score'] = correlated_feat.groupby('country')['inv_freedom_index_score'].fillna(method='ffill')

correlated_feat['external_debt'] = correlated_feat.groupby('country')['external_debt'].fillna(method='bfill')
correlated_feat['external_debt'] = correlated_feat.groupby('country')['external_debt'].fillna(method='ffill')

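# From this column onwards both passes use 'ffill', so gaps at the start of a country's series are not back-filled.
correlated_feat['labor_freedom_index_score'] = correlated_feat.groupby('country')['labor_freedom_index_score'].fillna(method='ffill')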
correlated_feat['labor_freedom_index_score'] = correlated_feat.groupby('country')['labor_freedom_index_score'].fillna(method='ffill')

correlated_feat['remittances'] = correlated_feat.groupby('country')['remittances'].fillna(method='ffill')
correlated_feat['remittances'] = correlated_feat.groupby('country')['remittances'].fillna(method='ffill')

correlated_feat['pop_growth_pcnt'] = correlated_feat.groupby('country')['pop_growth_pcnt'].fillna(method='ffill')
correlated_feat['pop_growth_pcnt'] = correlated_feat.groupby('country')['pop_growth_pcnt'].fillna(method='ffill')

correlated_feat['banking_z_score'] = correlated_feat.groupby('country')['banking_z_score'].fillna(method='ffill')
correlated_feat['banking_z_score'] = correlated_feat.groupby('country')['banking_z_score'].fillna(method='ffill')

correlated_feat['oil_reserves_barrels'] = correlated_feat.groupby('country')['oil_reserves_barrels'].fillna(method='ffill')
correlated_feat['oil_reserves_barrels'] = correlated_feat.groupby('country')['oil_reserves_barrels'].fillna(method='ffill')

correlated_feat['pcnt_world_oil_reserves'] = correlated_feat.groupby('country')['pcnt_world_oil_reserves'].fillna(method='ffill')
correlated_feat['pcnt_world_oil_reserves'] = correlated_feat.groupby('country')['pcnt_world_oil_reserves'].fillna(method='ffill')

correlated_feat['trade_balance'] = correlated_feat.groupby('country')['trade_balance'].fillna(method='ffill')
correlated_feat['trade_balance'] = correlated_feat.groupby('country')['trade_balance'].fillna(method='ffill')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  correlated_feat['capital_investment'] = correlated_feat.groupby('country')['capital_investment'].fillna(method='bfill')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  correlated_feat['capital_investment'] = correlated_feat.groupby('country')['capital_investment'].fillna(method='ffill')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.htm
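The per-column fill statements above follow one pattern, so they can be written as a loop. In the cell above, the columns from `labor_freedom_index_score` onwards call `ffill` twice and are therefore never back-filled; the sketch below applies back-fill then forward-fill to every listed column. The toy frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'capital_investment': [np.nan, 2.0, np.nan, np.nan, np.nan, 6.0],
    'trade_balance': [1.0, np.nan, np.nan, np.nan, -3.0, np.nan],
})

fill_columns = ['capital_investment', 'trade_balance']

for column in fill_columns:
    # Back-fill first, then forward-fill what is left, never crossing countries.
    toy[column] = toy.groupby('country')[column].bfill()
    toy[column] = toy.groupby('country')[column].ffill()

print(toy)
```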

In [152]:
correlated_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1708 entries, 0 to 1707
Data columns (total 20 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   country                               1708 non-null   object 
 1   year                                  1708 non-null   int64  
 2   capital_investment                    1708 non-null   object 
 3   economic_growth                       1708 non-null   object 
 4   savings_pcnt_gdp                      1708 non-null   object 
 5   inflation                             1708 non-null   object 
 6   pcnt_world_tourist_arrivals           1708 non-null   object 
 7   death_rate_p1000                      1708 non-null   object 
 8   human_flight_brain_drain_index_score  1708 non-null   float64
 9   government_debt                       1708 non-null   object 
 10  inv_freedom_index_score               1708 non-null   object 
 11  external_debt    

In [153]:
pd.set_option('display.max_rows', 200)

In [154]:
correlated_feat = correlated_feat.replace('NaN', 0)
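The `'NaN'` entries replaced here are the placeholder strings written by `nan_filler`, which is also why the columns show dtype `object` in the previous `info()` output. Replacing them with 0 treats "no data" as a real zero. As a hedged alternative, the sketch below converts the placeholders to true missing values so they can be counted and handled explicitly; the toy column is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'A', 'B'],
    # Mixed strings and floats, so pandas stores the column as object dtype.
    'capital_investment': ['NaN', 9.86, 8.97],
})

# errors='coerce' turns anything that is not a number into a real NaN.
toy['capital_investment'] = pd.to_numeric(toy['capital_investment'], errors='coerce')

# Count the gaps that would otherwise have been recorded as zeros.
n_missing = int(toy['capital_investment'].isna().sum())
print(n_missing)
```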

In [155]:
correlated_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1708 entries, 0 to 1707
Data columns (total 20 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   country                               1708 non-null   object 
 1   year                                  1708 non-null   int64  
 2   capital_investment                    1708 non-null   float64
 3   economic_growth                       1708 non-null   float64
 4   savings_pcnt_gdp                      1708 non-null   float64
 5   inflation                             1708 non-null   float64
 6   pcnt_world_tourist_arrivals           1708 non-null   float64
 7   death_rate_p1000                      1708 non-null   float64
 8   human_flight_brain_drain_index_score  1708 non-null   float64
 9   government_debt                       1708 non-null   float64
 10  inv_freedom_index_score               1708 non-null   float64
 11  external_debt    

In [156]:
correlated_feat

Unnamed: 0,country,year,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
0,Afghanistan,2007,0.00,13.83,0.00,8.70,0.00,9.29,7.0,20.14,0.0,0.00,0.0,0.00,2.49,20.69,0.0,0.0,0.0,0.00
1,Afghanistan,2008,0.00,3.92,0.00,26.40,0.00,8.93,7.0,19.06,65.0,20.00,0.0,0.89,2.27,12.91,0.0,0.0,0.0,0.00
2,Afghanistan,2009,0.00,21.39,0.00,-6.80,0.00,8.58,7.2,16.25,65.0,20.00,0.0,1.13,2.40,17.28,0.0,0.0,0.0,0.00
3,Afghanistan,2010,0.00,14.36,0.00,2.20,0.00,8.25,7.2,7.70,65.0,15.33,0.0,2.39,2.75,16.15,0.0,0.0,0.0,0.00
4,Afghanistan,2011,0.00,0.43,0.00,11.80,0.00,7.94,7.2,7.50,65.0,13.97,0.0,1.01,3.14,12.29,0.0,0.0,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1703,Zimbabwe,2016,9.86,0.76,-1.22,-1.50,0.18,8.29,8.1,48.49,10.0,61.24,30.0,0.00,1.55,3.55,0.0,0.0,0.0,-11.33
1704,Zimbabwe,2017,8.97,4.70,13.48,0.90,0.18,8.04,7.9,43.73,25.0,62.45,33.0,0.00,1.46,3.31,0.0,0.0,0.0,-9.91
1705,Zimbabwe,2018,5.71,4.83,6.13,10.60,0.19,7.88,7.6,33.63,25.0,55.69,39.0,0.00,1.41,3.43,0.0,0.0,0.0,-7.33
1706,Zimbabwe,2019,7.34,-8.10,6.13,5.75,0.18,7.77,7.3,38.68,25.0,61.86,43.0,0.00,1.44,3.43,0.0,0.0,0.0,-8.62


In [157]:
df_recent

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length
0,244657,2017,Part of an Ongoing Conflict,State-Based Conflict,259,Iraq: Government,524,Government of Iraq - IS,116,Government of Iraq,234,IS,"""Agence France Presse,2017-08-01,Attackers tar...","IS, interior ministry, security source",Kabul city,Iraqi embassy in Kabul,Kabul province,34.531094,69.162796,179779,Afghanistan,Asia,2017-07-31,2017-07-31,0,4,0,2,6,0 days
294,132916,2008,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,726,Government of Afghanistan - Hizb-i Islami-yi A...,130,Government of Afghanistan,299,Hizb-i Islami-yi Afghanistan,"National Afghanistan TV, Kabul, in Dari and Pa...",HIG spokesman,Tagab district (Kapisa),Tagab district,Kapisa province,34.797169,69.679230,179780,Afghanistan,Asia,2008-03-11,2008-03-11,0,8,0,0,8,0 days
295,132925,2008,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,726,Government of Afghanistan - Hizb-i Islami-yi A...,130,Government of Afghanistan,299,Hizb-i Islami-yi Afghanistan,"Afghan Islamic Press news agency, Peshawar, in...",Police /HIG spokesman,Salar village,Salar area (on the Kabul-Kandahar highway),Wardak province,33.904587,68.668884,178338,Afghanistan,Asia,2008-03-13,2008-03-13,2,0,0,0,2,0 days
296,133006,2008,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,726,Government of Afghanistan - Hizb-i Islami-yi A...,130,Government of Afghanistan,299,Hizb-i Islami-yi Afghanistan,"Reuters, 2008-04-07, ""UPDATE 2-More than 20 sa...",US coalition,Kendar village,Kendal village,Nuristan province,35.240800,70.412200,180501,Afghanistan,Asia,2008-04-06,2008-04-06,0,0,0,8,8,0 days
297,133007,2008,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,726,Government of Afghanistan - Hizb-i Islami-yi A...,130,Government of Afghanistan,299,Hizb-i Islami-yi Afghanistan,"Reuters, 2008-04-07, ""UPDATE 2-More than 20 sa...",US coalition,Shok village,Shok village,Nuristan province,35.251500,70.429200,180501,Afghanistan,Asia,2008-04-06,2008-04-06,0,0,0,8,8,0 days
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261854,276685,2019,Isolated Incident,One-Sided Violence,476,Government of Zimbabwe (Rhodesia) - Civilians,943,Government of Zimbabwe (Rhodesia) - Civilians,101,Government of Zimbabwe (Rhodesia),1,Civilians,"""Agence France Presse,2019-01-15,Three killed,...",Amnesty,Chitungwiza,in Chitungwiza (on the outskirts of Harare),Harare province,-18.012741,31.075550,103383,Zimbabwe (Rhodesia),Africa,2019-01-14,2019-01-14,0,0,1,0,1,0 days
261855,276740,2019,Isolated Incident,One-Sided Violence,476,Government of Zimbabwe (Rhodesia) - Civilians,943,Government of Zimbabwe (Rhodesia) - Civilians,101,Government of Zimbabwe (Rhodesia),1,Civilians,"""Zimbabwe Human Rights NGO Forum,2019-02-06,On...",Zimbabwe Human Rights NGO Forum,Mutare town,Mutare,Manicaland province,-18.975973,32.650092,102666,Zimbabwe (Rhodesia),Africa,2019-01-22,2019-01-22,0,0,1,0,1,0 days
261856,326022,2019,Isolated Incident,One-Sided Violence,476,Government of Zimbabwe (Rhodesia) - Civilians,943,Government of Zimbabwe (Rhodesia) - Civilians,101,Government of Zimbabwe (Rhodesia),1,Civilians,"""Heal Zimbabwe,2020-01-01,HUMAN SECURITY EARLY...",Heal Zimbabwe,Harare city,Harare Central Business District,Harare province,-17.817777,31.044722,104103,Zimbabwe (Rhodesia),Africa,2019-10-12,2019-10-12,0,0,1,0,1,0 days
261857,325975,2019,Isolated Incident,One-Sided Violence,476,Government of Zimbabwe (Rhodesia) - Civilians,943,Government of Zimbabwe (Rhodesia) - Civilians,101,Government of Zimbabwe (Rhodesia),1,Civilians,"""AllAfrica,2020-01-17,Soldiers Mow Down Two Br...",AllAfrica,Mwenezi district,Mwenezi,Masvingo province,-21.358380,30.706680,99062,Zimbabwe (Rhodesia),Africa,2019-12-26,2019-12-26,0,0,2,0,2,0 days


### Merging Features and Core Data

In [158]:
recent_final = pd.merge(
    df_recent.reset_index(),
    correlated_feat.reset_index(),
    on=['country', 'year'],
    how='inner'
)

In [159]:
# Drop the helper index columns that reset_index() added to both inputs
recent_final = recent_final.drop(columns=['index_x', 'index_y'])
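
An inner join on `country` and `year` silently discards any event whose country-year pair has no match in the feature table. The outputs above suggest this can happen through naming differences alone (for example `Zimbabwe (Rhodesia)` in the event data versus `Zimbabwe` in the features). The following is a minimal sketch, on toy data rather than the real frames, of how such rows could be surfaced before committing to the inner join. The frame names `events` and `features` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the event table and the country-year feature table
# (illustrative values only, not taken from the real data).
events = pd.DataFrame({
    'id': [1, 2, 3],
    'country': ['Afghanistan', 'Afghanistan', 'Zimbabwe (Rhodesia)'],
    'year': [2017, 2017, 2019],
})
features = pd.DataFrame({
    'country': ['Afghanistan', 'Zimbabwe'],
    'year': [2017, 2019],
    'economic_growth': [2.65, -8.10],
})

# validate='m:1' raises a MergeError if a (country, year) pair is duplicated
# in the feature table; indicator=True adds a '_merge' column recording
# whether each row was found in the left frame only or in both.
checked = events.merge(
    features,
    on=['country', 'year'],
    how='left',
    validate='m:1',
    indicator=True,
)

# Rows flagged 'left_only' are the events an inner join would discard.
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['id', 'country', 'year']])
```

If the unmatched rows turn out to be naming mismatches, harmonising the country labels before the merge would keep those events in the final dataset.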

In [160]:
recent_final

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
0,244657,2017,Part of an Ongoing Conflict,State-Based Conflict,259,Iraq: Government,524,Government of Iraq - IS,116,Government of Iraq,234,IS,"""Agence France Presse,2017-08-01,Attackers tar...","IS, interior ministry, security source",Kabul city,Iraqi embassy in Kabul,Kabul province,34.531094,69.162796,179779,Afghanistan,Asia,2017-07-31,2017-07-31,0,4,0,2,6,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
1,233552,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-01,Gun battle leav...","provincial governor, Taleban",Taghaye Khwajasufla village,Taghai-Khwaj and Tabir localities of Sangchara...,Sari Pul province,35.965400,66.369100,181213,Afghanistan,Asia,2017-01-01,2017-01-01,1,3,0,0,4,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
2,233553,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-01,Gun battle leav...",provincial governor,Sari Pul-Jawzjan road (Sari Pul province),a road connecting Saripul to the neighboring J...,Sari Pul province,36.306238,65.890063,181932,Afghanistan,Asia,2017-01-01,2017-01-01,2,2,0,0,4,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
3,233554,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-02,21 Taliban mili...","Interior Ministry, governor’s spokesman, resident",Abjosh Bala village,Ab Josh area of Charkh district of Logar province,Logar province,33.847100,68.829600,178338,Afghanistan,Asia,2017-01-01,2017-01-01,0,18,0,0,18,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
4,237882,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Pajhwok News,2017-01-01,Helmand’s Sangin, Mar...","administrative chief of Marja district, Milita...",Marja town,"Marja district centre, Hilmand",Hilmand province,31.521994,64.118492,175449,Afghanistan,Asia,2017-01-01,2017-01-01,1,0,0,0,1,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171388,294414,2019,Isolated Incident,One-Sided Violence,447,Government of Venezuela - Civilians,914,Government of Venezuela - Civilians,18,Government of Venezuela,1,Civilians,"""BBC Monitoring Americas,2019-07-31,Colombia m...",El Tiempo,Zulia state,,Zulia state,10.000000,-72.166670,144216,Venezuela,Americas,2019-07-27,2019-07-27,0,0,2,0,2,0 days,26.04,-1.27,13.98,188.3,0.04,7.12,6.1,232.79,5.0,38.29,28.0,0.00,-1.66,3.51,302.81,18.23,876.82,-9.72
171389,285541,2019,Isolated Incident,One-Sided Violence,606,ELN - Civilians,1073,ELN - Civilians,744,ELN,1,Civilians,"""BBC Monitoring Americas,2019-04-29,Venezuela ...",El Nacional Online,Ureña town,"Urena and La Fria (Tachira), located on the Co...",Táchira state,7.916890,-72.441880,140616,Venezuela,Americas,2019-04-25,2019-04-26,0,0,3,0,3,1 days,26.04,-1.27,13.98,188.3,0.04,7.12,6.1,232.79,5.0,38.29,28.0,0.00,-1.66,3.51,302.81,18.23,876.82,-9.72
171390,285542,2019,Isolated Incident,One-Sided Violence,606,ELN - Civilians,1073,ELN - Civilians,744,ELN,1,Civilians,"""BBC Monitoring Americas,2019-04-29,Venezuela ...",El Nacional Online,La Fría town,"Urena and La Fria (Tachira), located on the Co...",Táchira state,8.215230,-72.248880,141336,Venezuela,Americas,2019-04-25,2019-04-26,0,0,2,0,2,1 days,26.04,-1.27,13.98,188.3,0.04,7.12,6.1,232.79,5.0,38.29,28.0,0.00,-1.66,3.51,302.81,18.23,876.82,-9.72
171391,349289,2020,Isolated Incident,One-Sided Violence,447,Government of Venezuela - Civilians,914,Government of Venezuela - Civilians,18,Government of Venezuela,1,Civilians,"""Reuters News,2020-07-18,Young man shot and ki...","Avilio Troconiz, an opposition deputy",Isla de Toas,,Zulia state,10.956510,-71.654690,144937,Venezuela,Americas,2020-07-17,2020-07-17,0,0,1,0,1,0 days,26.04,-1.27,13.98,188.3,0.04,7.05,6.4,304.13,5.0,38.29,28.0,0.00,-1.66,3.51,302.81,18.23,527.06,-9.72


#### Fixing Final Missing Values

In [161]:
# Counts the distinct row-wise missing-value patterns, not the number of missing cells
recent_final.isnull().value_counts().count()

4
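
The figure above is the number of distinct missing-value patterns across rows, which does not say which columns are affected. A per-column count is often easier to act on. The sketch below uses a small toy frame (the values are illustrative) to show both views side by side.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns (illustrative values only).
toy = pd.DataFrame({
    'source_original': ['police', np.nan, 'SOHR', np.nan],
    'where_description': ['Kandahar city', 'Douma city', np.nan, 'Mutare'],
    'best_est_fatalities': [1, 2, 3, 4],
})

# Number of missing cells per column, keeping only columns with any gaps.
missing_per_column = toy.isna().sum()
missing_per_column = missing_per_column[missing_per_column > 0]
print(missing_per_column)

# Number of distinct row-wise missing patterns (the quantity computed above).
n_patterns = toy.isna().value_counts().count()
print(n_patterns)
```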

In [162]:
recent_final1 = recent_final[recent_final.isna().any(axis=1)]

In [163]:
recent_final1

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
147,237210,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-27,Afghan troops k...",,Shahidi Hossas district (Charchino),"Charchino district, Uruzgan province",Uruzgan province,32.920909,65.474153,176891,Afghanistan,Asia,2017-01-26,2017-01-26,0,2,0,0,2,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
338,237601,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Bakhtar News Agency,2017-03-05,8 Taliban Comm...",,Alingar district,"Alinegar district highway, Laghman",Laghman province,34.822451,70.418087,179781,Afghanistan,Asia,2017-03-03,2017-03-05,0,3,0,0,3,2 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
364,238287,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Pajhwok News,2017-03-08,20 rebels killed in K...",,Kandahar town,limits of the fourth municipal district of Kan...,Kandahar province,31.611795,65.705795,175452,Afghanistan,Asia,2017-03-08,2017-03-08,1,0,0,0,1,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
612,239206,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-04-24,37 IS militants...",,Kandahar town,Kandahar city,Kandahar province,31.611795,65.705795,175452,Afghanistan,Asia,2017-04-23,2017-04-23,1,0,0,0,1,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
644,239913,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-05-01,Roundup: 2 civi...",,Baghlan province,along a main road in northern Baghlan province.,Baghlan province,35.750000,69.000000,181219,Afghanistan,Asia,2017-04-29,2017-04-29,0,0,2,0,2,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
171365,62009,2009,Isolated Incident,State-Based Conflict,415,Uzbekistan: Government,872,Government of Uzbekistan - JIG,133,Government of Uzbekistan,360,JIG,"AP 26 May 2009 ""Kyrgyzstan closes border with ...",,Andijan town,Andijan (town in Fergana Valley),Andijon Viloyati,40.783333,72.333333,188425,Uzbekistan,Asia,2009-05-26,2009-05-26,1,0,0,4,5,0 days,29.95,8.05,35.75,0.0,0.14,4.80,7.0,7.83,30.0,20.32,65.0,0.00,1.69,9.24,0.59,0.04,43.45,0.65
171373,252957,2017,Isolated Incident,One-Sided Violence,447,Government of Venezuela - Civilians,914,Government of Venezuela - Civilians,18,Government of Venezuela,1,Civilians,"""Telesur,2017-08-08,Here’s Your Guide to Under...","Telesur, Gobierno Bolivariano de Venezuela: Mi...",Palmira town,,Táchira state,7.836070,-72.227580,140616,Venezuela,Americas,2017-05-15,2017-05-15,0,0,1,0,1,0 days,26.04,-1.27,13.98,188.3,0.03,6.81,5.5,26.00,5.0,38.29,29.0,0.00,-1.54,2.64,300.88,18.27,1997.41,-9.72
171388,294414,2019,Isolated Incident,One-Sided Violence,447,Government of Venezuela - Civilians,914,Government of Venezuela - Civilians,18,Government of Venezuela,1,Civilians,"""BBC Monitoring Americas,2019-07-31,Colombia m...",El Tiempo,Zulia state,,Zulia state,10.000000,-72.166670,144216,Venezuela,Americas,2019-07-27,2019-07-27,0,0,2,0,2,0 days,26.04,-1.27,13.98,188.3,0.04,7.12,6.1,232.79,5.0,38.29,28.0,0.00,-1.66,3.51,302.81,18.23,876.82,-9.72
171391,349289,2020,Isolated Incident,One-Sided Violence,447,Government of Venezuela - Civilians,914,Government of Venezuela - Civilians,18,Government of Venezuela,1,Civilians,"""Reuters News,2020-07-18,Young man shot and ki...","Avilio Troconiz, an opposition deputy",Isla de Toas,,Zulia state,10.956510,-71.654690,144937,Venezuela,Americas,2020-07-17,2020-07-17,0,0,1,0,1,0 days,26.04,-1.27,13.98,188.3,0.04,7.05,6.4,304.13,5.0,38.29,28.0,0.00,-1.66,3.51,302.81,18.23,527.06,-9.72


In [164]:
recent_final[recent_final['source_original'].isna()]

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
147,237210,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-27,Afghan troops k...",,Shahidi Hossas district (Charchino),"Charchino district, Uruzgan province",Uruzgan province,32.920909,65.474153,176891,Afghanistan,Asia,2017-01-26,2017-01-26,0,2,0,0,2,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
338,237601,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Bakhtar News Agency,2017-03-05,8 Taliban Comm...",,Alingar district,"Alinegar district highway, Laghman",Laghman province,34.822451,70.418087,179781,Afghanistan,Asia,2017-03-03,2017-03-05,0,3,0,0,3,2 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
364,238287,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Pajhwok News,2017-03-08,20 rebels killed in K...",,Kandahar town,limits of the fourth municipal district of Kan...,Kandahar province,31.611795,65.705795,175452,Afghanistan,Asia,2017-03-08,2017-03-08,1,0,0,0,1,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
612,239206,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-04-24,37 IS militants...",,Kandahar town,Kandahar city,Kandahar province,31.611795,65.705795,175452,Afghanistan,Asia,2017-04-23,2017-04-23,1,0,0,0,1,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
644,239913,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-05-01,Roundup: 2 civi...",,Baghlan province,along a main road in northern Baghlan province.,Baghlan province,35.750000,69.000000,181219,Afghanistan,Asia,2017-04-29,2017-04-29,0,0,2,0,2,0 days,0.00,2.65,0.00,5.0,0.00,6.57,8.2,8.00,55.0,14.39,60.0,4.36,2.55,11.50,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170995,246818,2017,Part of an Ongoing Conflict,State-Based Conflict,13306,Ukraine: Novorossiya,15101,Government of Ukraine - LPR,61,Government of Ukraine,6712,LPR,"""International Crisis Group,2017-10-01,CrisisW...",,Stanytsia Luhanska village,Stanitsa Luhanska,Luhansk oblast,48.694329,39.453292,199879,Ukraine,Europe,2017-07-28,2017-07-28,2,0,0,0,2,0 days,19.95,2.47,17.78,14.4,0.00,14.50,5.2,71.80,25.0,106.90,49.0,0.00,0.00,4.57,0.40,0.02,56.63,-7.69
171143,304001,2019,Part of an Ongoing Conflict,State-Based Conflict,13306,Ukraine: Novorossiya,15100,Government of Ukraine - DPR,61,Government of Ukraine,6711,DPR,"""World Service Wire,2019-07-05,196 ceasefire v...",,Donetsk oblast,,Donetsk oblast,48.140000,37.740000,199156,Ukraine,Europe,2019-06-28,2019-07-05,0,1,0,0,1,7 days,14.89,3.23,14.07,7.9,0.00,14.70,5.2,50.30,35.0,78.12,47.0,0.00,0.00,4.06,0.40,0.02,48.98,-8.01
171357,332336,2019,Part of an Ongoing Conflict,State-Based Conflict,259,Iraq: Government,524,Government of Iraq - IS,116,Government of Iraq,234,IS,"""Agence France Presse,2019-11-30,Islamic State...",,London city,,England,51.508530,-0.125740,204120,United Kingdom,Europe,2019-11-29,2019-11-29,0,1,0,0,1,0 days,18.26,1.46,13.85,1.7,2.76,9.00,2.5,84.38,90.0,0.00,74.0,0.00,0.64,11.34,2.50,0.16,1026.33,-1.24
171365,62009,2009,Isolated Incident,State-Based Conflict,415,Uzbekistan: Government,872,Government of Uzbekistan - JIG,133,Government of Uzbekistan,360,JIG,"AP 26 May 2009 ""Kyrgyzstan closes border with ...",,Andijan town,Andijan (town in Fergana Valley),Andijon Viloyati,40.783333,72.333333,188425,Uzbekistan,Asia,2009-05-26,2009-05-26,1,0,0,4,5,0 days,29.95,8.05,35.75,0.0,0.14,4.80,7.0,7.83,30.0,20.32,65.0,0.00,1.69,9.24,0.59,0.04,43.45,0.65


In [165]:
recent_final['source_original'].value_counts()

SOHR                                                                                                                                    37766
VDC                                                                                                                                     20253
SOHR, VDC                                                                                                                                5414
SOHR; VDC                                                                                                                                5332
police                                                                                                                                   3975
                                                                                                                                        ...  
emergency services spokesman Osama Ali, Ghaniwa Brigade spokesperson, Tripoli Wounded Administration , Libya Security Studies, ACLED        1
accord

In [166]:
recent_final['source_original'].isna().value_counts()

False    160576
True      10817
Name: source_original, dtype: int64

In [167]:
## As 'source_original' has roughly 10.8k NaN values but carries useful context, I will be
## replacing the NaNs with 'No Source Specified' rather than dropping those rows.

In [168]:
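# Assign the result back instead of relying on inplace=True on a column selection
recent_final['source_original'] = recent_final['source_original'].fillna('No Source Specified')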

In [169]:
## As only about 3k rows have a NaN in 'where_description', I will be dropping those rows.

In [170]:
recent_final['where_description'].isna().value_counts()

False    168354
True       3039
Name: where_description, dtype: int64

In [171]:
recent_final = recent_final[recent_final['where_description'].notna()]

In [172]:
recent_final = recent_final.dropna()
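
The two decisions above (fill where the column carries context, drop where only a few rows are affected) can be expressed in one chain that also reports how many rows were lost. This is a sketch on toy data, not the notebook's actual frame.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the two columns handled above (illustrative values).
toy = pd.DataFrame({
    'source_original': ['police', np.nan, 'SOHR'],
    'where_description': ['Kandahar city', 'Douma city', np.nan],
})

rows_before = len(toy)

cleaned = (
    toy
    # Keep rows with an unknown source, but label them explicitly.
    .assign(source_original=toy['source_original'].fillna('No Source Specified'))
    # Drop only the rows that lack a location description.
    .dropna(subset=['where_description'])
)

rows_dropped = rows_before - len(cleaned)
print(f'Dropped {rows_dropped} of {rows_before} rows')
```

Restricting `dropna` to a named subset makes it explicit which column is responsible for each removed row, whereas a bare `dropna()` removes rows with a gap in any column.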

In [173]:
recent_final.head(10)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
0,244657,2017,Part of an Ongoing Conflict,State-Based Conflict,259,Iraq: Government,524,Government of Iraq - IS,116,Government of Iraq,234,IS,"""Agence France Presse,2017-08-01,Attackers tar...","IS, interior ministry, security source",Kabul city,Iraqi embassy in Kabul,Kabul province,34.531094,69.162796,179779,Afghanistan,Asia,2017-07-31,2017-07-31,0,4,0,2,6,0 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
1,233552,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-01,Gun battle leav...","provincial governor, Taleban",Taghaye Khwajasufla village,Taghai-Khwaj and Tabir localities of Sangchara...,Sari Pul province,35.9654,66.3691,181213,Afghanistan,Asia,2017-01-01,2017-01-01,1,3,0,0,4,0 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
2,233553,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-01,Gun battle leav...",provincial governor,Sari Pul-Jawzjan road (Sari Pul province),a road connecting Saripul to the neighboring J...,Sari Pul province,36.306238,65.890063,181932,Afghanistan,Asia,2017-01-01,2017-01-01,2,2,0,0,4,0 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
3,233554,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-02,21 Taliban mili...","Interior Ministry, governor’s spokesman, resident",Abjosh Bala village,Ab Josh area of Charkh district of Logar province,Logar province,33.8471,68.8296,178338,Afghanistan,Asia,2017-01-01,2017-01-01,0,18,0,0,18,0 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
4,237882,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Pajhwok News,2017-01-01,Helmand’s Sangin, Mar...","administrative chief of Marja district, Milita...",Marja town,"Marja district centre, Hilmand",Hilmand province,31.521994,64.118492,175449,Afghanistan,Asia,2017-01-01,2017-01-01,1,0,0,0,1,0 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
5,252386,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Xinhua News Agency,2017-01-01,Gun battle leav...","provincial governor, Taleban",Tabar village,Taghai-Khwaj and Tabir localities of Sangchara...,Sari Pul province,35.9894,66.3764,181213,Afghanistan,Asia,2017-01-01,2017-01-01,1,2,0,0,3,0 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
6,233586,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""BBC Monitoring South Asia,2017-01-03,BBCM Afg...","Local officials, Military, Taleban",Sangin district,Sangin and Marja districts,Hilmand province,32.120374,64.994325,176170,Afghanistan,Asia,2017-01-01,2017-01-02,4,17,0,0,21,1 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
7,233588,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""BBC Monitoring South Asia,2017-01-03,Afghan f...","Police, Taleban",Qara Ghoyli village,"Qara Ghuwaily area , Almar district, Faryab",Faryab province,35.814133,64.506638,181210,Afghanistan,Asia,2017-01-01,2017-01-02,1,12,0,0,13,1 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
8,236660,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""BBC Monitoring South Asia,2017-01-03,BBCM Afg...",Local officials,Nad Ali district (Marja),Sangin and Marja districts,Hilmand province,31.625941,63.861445,175448,Afghanistan,Asia,2017-01-01,2017-01-02,0,17,0,0,17,1 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0
9,237885,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""Pajhwok News,2017-01-02,52 Taliban killed in ...",Military,Garmser district,Garmser district,Hilmand province,30.852922,64.131675,174009,Afghanistan,Asia,2017-01-01,2017-01-02,0,13,0,0,13,1 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0


In [174]:
recent_final['source_original'] = recent_final['source_original'].replace('Syrian Observatory for Human Rights', 'Syrian Observatory for Human Rights (SOHR)')

In [175]:
recent_final['source_original'] = recent_final['source_original'].replace('Violations Documentation Center', 'Violations Documentation Center (VDC)')

In [176]:
recent_final['source_original'] = recent_final['source_original'].replace('SOHR, VDC', 'Violations Documentation Center')
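
The value counts above list `SOHR, VDC` and `SOHR; VDC` as separate categories, and a later sample also shows `SOHR\nVDC`. One possible way to collapse such variants is to split on the separators, trim, de-duplicate and re-join in a fixed order. The helper name `normalise_sources` is hypothetical, and this is a sketch rather than the approach used in this notebook.

```python
import re

import pandas as pd


def normalise_sources(value: str) -> str:
    """Split on ',', ';' or newline, trim, de-duplicate and sort the parts."""
    parts = [part.strip() for part in re.split(r'[,;\n]', value)]
    unique_parts = sorted({part for part in parts if part})
    return ', '.join(unique_parts)


# Illustrative labels covering the separator variants seen in the outputs.
sources = pd.Series(['SOHR, VDC', 'SOHR; VDC', 'SOHR\nVDC', 'VDC', 'police'])
normalised = sources.map(normalise_sources)
print(normalised.value_counts())
```

A caveat with this design: entries where a comma is part of a description rather than a separator between sources would also be split and reordered, so the result is worth spot-checking before use.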

## Creating Feature - Reporting Organisation

In [177]:
recent_final.sample(5)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance
133059,255665,2018,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""SOHR,2018-01-08,The Russian as well as the re...",SOHR\nVDC,Kafr Nabl town,Kafr Nubl town,Idlib governorate,35.613964,36.560703,181154,Syria,Middle East,2018-01-07,2018-01-07,0,0,3,0,3,0 days,26.16,5.7,29.46,20.75,0.73,5.37,8.1,0.0,15.0,0.0,59.0,2.55,-0.95,8.4,2.5,0.15,24.41,2.6
106539,341630,2014,Part of an Ongoing Conflict,One-Sided Violence,13432,PYD - Civilians,14377,PYD - Civilians,4163,PYD,1,Civilians,"""VDC,2014-01-07,VDC 2014-01-07""",VDC,Tall Birak town,Tal Brak,Al Hasakah governorate,36.683741,41.053507,182603,Syria,Middle East,2014-01-07,2014-01-07,0,0,2,0,2,0 days,26.16,5.7,29.46,20.75,0.73,5.7,6.9,0.0,15.0,0.0,55.0,2.55,-4.54,5.56,2.5,0.15,22.66,2.6
129807,253153,2017,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""SOHR,2017-12-16,Shelling by the regime forces...",SOHR,Duma town,Douma city,Rif Dimashq governorate,33.572264,36.401811,178273,Syria,Middle East,2017-12-16,2017-12-16,0,0,2,0,2,0 days,26.16,5.7,29.46,20.75,0.73,5.58,8.4,0.0,15.0,0.0,56.0,2.55,-2.24,6.27,2.5,0.15,18.0,2.6
84491,229461,2016,Part of an Ongoing Conflict,State-Based Conflict,337,Somalia: Government,750,Government of Somalia - Al-Shabaab,95,Government of Somalia,717,Al-Shabaab,"""All Africa,2016-09-30,Gunmen Shoot Dead Somal...",relatives,Mogadishu city,walked out of a mosque in Waberi district.,Banaadir region,2.066667,45.366667,132931,Somalia,Africa,2016-09-29,2016-09-29,0,0,1,0,1,0 days,0.0,0.0,0.0,0.0,0.0,11.29,9.5,0.0,0.0,0.0,0.0,0.0,2.78,0.0,0.0,0.0,0.0,-74.48
128762,243480,2017,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""SOHR,2017-06-15,The regime forces shell a tow...",SOHR,Damascus city,Jobar neighborhood at the outskirts of the cap...,Damascus governorate,33.513364,36.291575,178273,Syria,Middle East,2017-06-14,2017-06-14,0,0,0,0,0,0 days,26.16,5.7,29.46,20.75,0.73,5.58,8.4,0.0,15.0,0.0,56.0,2.55,-2.24,6.27,2.5,0.15,18.0,2.6


In [178]:
recent_final['media_org_source'] = recent_final.source_article.str.split('-').str[0]

In [179]:
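# Strip digits; the pattern is a regular expression, so regex=True is passed explicitly
recent_final['media_org_source'] = recent_final['media_org_source'].str.replace(r'\d', '', regex=True)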


In [180]:
recent_final['media_org_source'] = recent_final['media_org_source'].str.replace('"', '')

In [181]:
recent_final['media_org_source'] = recent_final['media_org_source'].str.replace(',', '')

In [182]:
recent_final['media_org_source']

0            Agence France Presse
1              Xinhua News Agency
2              Xinhua News Agency
3              Xinhua News Agency
4                    Pajhwok News
                   ...           
171386               Reuters News
171387                    Amnesty
171389    BBC Monitoring Americas
171390    BBC Monitoring Americas
171392               Reuters News
Name: media_org_source, Length: 168354, dtype: object
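
The four cleaning steps above can also be written as a single chain, which makes the order of operations easier to follow. The sketch below applies the same logic to a few strings shaped like the `source_article` values shown earlier (the headlines are abbreviated stand-ins).

```python
import pandas as pd

# Strings shaped like the source_article values (abbreviated stand-ins).
articles = pd.Series([
    '"Agence France Presse,2017-08-01,Attackers target embassy"',
    '"Xinhua News Agency,2017-01-01,Gun battle"',
    '"VDC,2014-01-07,VDC 2014-01-07"',
])

outlet = (
    articles
    .str.split('-').str[0]                   # keep the text before the first hyphen
    .str.replace(r'[\d",]', '', regex=True)  # remove digits, quotes and commas
    .str.strip()                             # trim any leftover whitespace
)
print(outlet.tolist())
```

Because the split is on the first hyphen and all digits are removed, an outlet whose name itself contains a hyphen or a number would be truncated or altered by this approach.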

## Creating Fatality Category as Target Variable

In [183]:
conditions = [(recent_final['best_est_fatalities'] <= 2), 
              ((recent_final['best_est_fatalities'] > 2) & (recent_final['best_est_fatalities'] <= 10)),
              ((recent_final['best_est_fatalities'] > 10) & (recent_final['best_est_fatalities'] <= 100)),
              ((recent_final['best_est_fatalities'] > 100))]

values = ['Low Fatality Incident', 'Moderate Fatality Incident', 'High Fatality Incident', 'Very High Fatality Incident']

In [184]:
recent_final['incident_classification'] = np.select(conditions, values)
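
As an alternative sketch, the same thresholds can be expressed with `pd.cut`, which keeps the bin edges in one list and returns an ordered categorical. The names `bins`, `demo_fatalities` and `demo_classes` are illustrative only; the edges are assumed from the conditions above (right-inclusive at 2, 10 and 100).

```python
import numpy as np
import pandas as pd

labels = ['Low Fatality Incident', 'Moderate Fatality Incident',
          'High Fatality Incident', 'Very High Fatality Incident']

# Right-inclusive edges: <= 2, (2, 10], (10, 100], > 100
bins = [-np.inf, 2, 10, 100, np.inf]

demo_fatalities = pd.Series([0, 2, 3, 10, 11, 100, 101])
demo_classes = pd.cut(demo_fatalities, bins=bins, labels=labels)
print(demo_classes.tolist())
```

One difference worth knowing: `pd.cut` leaves missing values as `NaN`, whereas `np.select` falls back to its `default` argument for rows that match no condition.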

In [185]:
recent_final.sample(10)

Unnamed: 0,id,year,active_year,type_of_violence,conflict_new_id,conflict_name,dyad_new_id,dyad_name,side_a_new_id,side_a,side_b_new_id,side_b,source_article,source_original,where_coordinates,where_description,adm_1,latitude,longitude,priogrid_gid,country,region,date_start,date_end,deaths_a,deaths_b,deaths_civilians,deaths_unknown,best_est_fatalities,conflict_length,capital_investment,economic_growth,savings_pcnt_gdp,inflation,pcnt_world_tourist_arrivals,death_rate_p1000,human_flight_brain_drain_index_score,government_debt,inv_freedom_index_score,external_debt,labor_freedom_index_score,remittances,pop_growth_pcnt,banking_z_score,oil_reserves_barrels,pcnt_world_oil_reserves,oil_prod_barrels_daily,trade_balance,media_org_source,incident_classification
112936,355047,2015,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""VDC,2015-07-23,Nasser Dayan""",VDC,Maliha suburb,Damascus Suburbs: Mleha,Rif Dimashq governorate,33.484534,36.374172,177553,Syria,Middle East,2015-07-23,2015-07-23,0,1,0,0,1,0 days,26.16,5.7,29.46,20.75,0.73,5.77,7.4,0.0,15.0,0.0,49.0,2.55,-3.91,12.02,2.5,0.15,30.0,2.6,VDC,Low Fatality Incident
67822,324784,2019,Isolated Incident,One-Sided Violence,488,Government of Myanmar (Burma) - Civilians,955,Government of Myanmar (Burma) - Civilians,144,Government of Myanmar (Burma),1,Civilians,"""The Irrawaddy Online,2019-02-22,Amid Disappea...",Two village administrators confirmed; wife of ...,Yan Aung Pyin village,a military unit near Yan Aung Myin village/on ...,Rakhine state,20.81049,93.107697,159667,Myanmar (Burma),Asia,2019-02-19,2019-02-20,0,0,0,0,0,1 days,29.16,2.89,32.93,8.8,0.26,8.22,6.9,37.0,30.0,15.16,66.0,3.18,0.62,3.08,0.14,0.01,8.84,4.04,The Irrawaddy Online,Low Fatality Incident
50174,132621,2008,Part of an Ongoing Conflict,State-Based Conflict,259,Iraq: Government,13891,Government of Iraq - al-Mahdi Army,116,Government of Iraq,5659,al-Mahdi Army,AFP 5/5,"US, hospital sources",Baghdād city,Baghdad city (Sadr city),Baghdād province,33.340582,44.400876,177569,Iraq,Middle East,2008-05-04,2008-05-04,0,3,0,0,3,0 days,15.18,8.23,52.51,12.7,0.1,0.0,9.3,74.17,0.0,0.0,0.0,0.05,1.69,14.55,115.0,8.65,2375.27,19.6,AFP /,Moderate Fatality Incident
5165,166379,2010,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"Afghan Islamic Press news agency, Peshawar, in...",Nato spokesman,Zhari district,Zherai District,Kandahar province,31.640454,65.397591,175451,Afghanistan,Asia,2010-02-15,2010-02-15,0,0,5,0,5,0 days,0.0,14.36,0.0,2.2,0.0,8.25,7.2,7.7,65.0,15.33,0.0,2.39,2.75,16.15,0.0,0.0,0.0,0.0,Afghan Islamic Press news agency Peshawar in P...,Moderate Fatality Incident
121933,223853,2016,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""SOHR,2016-06-07,Intense shelling on the north...",SOHR,Aleppo town,Hamdania neighborhood,Aleppo governorate,36.201241,37.161173,181875,Syria,Middle East,2016-06-07,2016-06-07,0,0,0,0,0,0 days,26.16,5.7,29.46,20.75,0.73,5.73,8.6,0.0,15.0,0.0,47.0,2.55,-3.07,10.52,2.5,0.15,28.67,2.6,SOHR,Low Fatality Incident
120518,217997,2016,Part of an Ongoing Conflict,State-Based Conflict,299,Syria: Government,11973,Government of Syria - Syrian insurgents,118,Government of Syria,4456,Syrian insurgents,"""Waha Report,2016-02-23,Syria Daily Report 22/...",Waha Report,Khan Tuman town,Khan Tuman – Zerbeh,Aleppo governorate,36.11696,37.05103,181875,Syria,Middle East,2016-02-22,2016-02-22,2,2,0,0,4,0 days,26.16,5.7,29.46,20.75,0.73,5.73,8.6,0.0,15.0,0.0,47.0,2.55,-3.07,10.52,2.5,0.15,28.67,2.6,Waha Report,Moderate Fatality Incident
64748,385014,2007,Part of an Ongoing Conflict,Non-State Conflict,4696,Gulf Cartel - Sinaloa Cartel,5306,Gulf Cartel - Sinaloa Cartel,782,Gulf Cartel,775,Sinaloa Cartel,"""CIDE-PPD,2007-12-31,Executions 2007""",CIDE-PPD,Culiacán municipality,Sinaloa Culiacán,Sinaloa state,24.70309,-107.35265,165026,Mexico,Americas,2007-07-04,2007-07-04,0,0,0,1,1,0 days,23.12,2.29,23.26,4.0,2.44,4.96,7.0,19.9,50.0,18.95,62.0,2.55,1.49,27.6,12.35,0.94,3142.95,-1.74,CIDE,Low Fatality Incident
88713,231620,2016,Part of an Ongoing Conflict,One-Sided Violence,480,Government of Sudan - Civilians,947,Government of Sudan - Civilians,112,Government of Sudan,1,Civilians,"""SUDO(UK),2016-07-21,HUMAN RIGHTS ABUSES IN SU...",SUDO(UK),El Fasher district,in Al-Wohda district of El Fasher\nNorth Darfur,North Darfur state,13.63333,25.35,149451,Sudan,Africa,2016-06-24,2016-06-24,0,0,1,0,1,0 days,9.83,4.7,5.14,17.8,0.07,7.34,9.1,109.94,10.0,41.11,44.0,0.3,2.4,13.94,5.0,0.3,104.87,-5.57,SUDO(UK),Low Fatality Incident
435,237763,2017,Part of an Ongoing Conflict,State-Based Conflict,333,Afghanistan: Government,735,Government of Afghanistan - Taleban,130,Government of Afghanistan,303,Taleban,"""BBC Monitoring South Asia,2017-03-18,Taliban ...",Taleban,Baghlani Jadid district,Manalo area in Markazi Baghlan District of nor...,Baghlan province,36.324882,68.6234,181938,Afghanistan,Asia,2017-03-17,2017-03-17,0,2,0,0,2,0 days,0.0,2.65,0.0,5.0,0.0,6.57,8.2,8.0,55.0,14.39,60.0,4.36,2.55,11.5,0.0,0.0,0.0,0.0,BBC Monitoring South Asia,Low Fatality Incident
32143,347408,2020,Part of an Ongoing Conflict,One-Sided Violence,514,Taleban - Civilians,981,Taleban - Civilians,303,Taleban,1,Civilians,"""NYTimes.com Feed,2020-06-04,Afghan War Casual...",NY Times,Malarghi village,"Malarghi area of Khan Abad District, Kunduz pr...",Kunduz province,36.6619,68.985,182658,Afghanistan,Asia,2020-06-12,2020-06-12,0,0,1,0,1,0 days,0.0,2.55,0.0,1.45,0.0,6.35,7.5,6.76,10.0,14.01,62.0,4.34,2.46,12.08,0.0,0.0,0.0,0.0,NYTimes.com Feed,Low Fatality Incident


In [186]:
recent_final['incident_classification'].value_counts()

Low Fatality Incident          95707
Moderate Fatality Incident     56026
High Fatality Incident         16113
Very High Fatality Incident      508
Name: incident_classification, dtype: int64
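
The counts above are heavily skewed towards the lowest class. A small sketch, using only the figures printed above, converts them to shares of the dataset; on the real frame, `value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    'Low Fatality Incident': 95707,
    'Moderate Fatality Incident': 56026,
    'High Fatality Incident': 16113,
    'Very High Fatality Incident': 508,
})

shares = counts / counts.sum()  # fraction of all incidents in each class
print(shares.round(4))
```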

In [187]:
recent_final.to_csv('recent_final.csv', encoding='utf-8')

### Final Preparation of Variables Pre-Modelling
Revisiting the data after the EDA and the first attempt at modelling highlighted a few further actions that needed to be taken:

* Dropping a number of further variables
* Changing the Target variable to a numerical variable
* Creating a final feature: length of conflict

In [188]:
recent_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168354 entries, 0 to 171392
Data columns (total 50 columns):
 #   Column                                Non-Null Count   Dtype          
---  ------                                --------------   -----          
 0   id                                    168354 non-null  int64          
 1   year                                  168354 non-null  int64          
 2   active_year                           168354 non-null  object         
 3   type_of_violence                      168354 non-null  object         
 4   conflict_new_id                       168354 non-null  int64          
 5   conflict_name                         168354 non-null  object         
 6   dyad_new_id                           168354 non-null  int64          
 7   dyad_name                             168354 non-null  object         
 8   side_a_new_id                         168354 non-null  int64          
 9   side_a                                168354 non

In [189]:
# Drop the per-side death counts, raw dates and raw source text in a single call
recent_final.drop(columns=['deaths_a', 'deaths_b', 'deaths_civilians',
                           'date_start', 'date_end', 'source_article'],
                  inplace=True)

In [190]:
recent_final.drop('best_est_fatalities', axis=1, inplace=True)

In [191]:
# Convert the timedelta column (e.g. "1 days") to a whole number of days
recent_final['conflict_length'] = recent_final['conflict_length'].dt.days
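
To illustrate what the `.dt.days` conversion does, here is a minimal sketch on a toy frame. It assumes `conflict_length` was derived as `date_end - date_start`, which is consistent with the sampled rows (for example 2019-02-19 to 2019-02-20 shown as `1 days`) but is not confirmed in this section; `demo` is an illustrative name.

```python
import pandas as pd

demo = pd.DataFrame({
    'date_start': pd.to_datetime(['2019-02-19', '2016-09-29']),
    'date_end': pd.to_datetime(['2019-02-20', '2016-09-29']),
})

# Assumed derivation: subtracting datetimes gives a timedelta column ("1 days", "0 days")
demo['conflict_length'] = demo['date_end'] - demo['date_start']

# .dt.days turns each timedelta into a plain integer day count
demo['conflict_length'] = demo['conflict_length'].dt.days
print(demo['conflict_length'].tolist())
```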

In [192]:
recent_final.incident_classification.value_counts()

Low Fatality Incident          95707
Moderate Fatality Incident     56026
High Fatality Incident         16113
Very High Fatality Incident      508
Name: incident_classification, dtype: int64

In [193]:
conditions = [(recent_final['incident_classification'] == 'Low Fatality Incident'), 
              (recent_final['incident_classification'] == 'Moderate Fatality Incident'),
              (recent_final['incident_classification'] == 'High Fatality Incident'),
              (recent_final['incident_classification'] == 'Very High Fatality Incident')]

values = [0, 1, 2, 3]

In [194]:
recent_final['incident_classification'] = np.select(conditions, values)

In [195]:
recent_final.incident_classification.value_counts()

0    95707
1    56026
2    16113
3      508
Name: incident_classification, dtype: int64
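
The label-to-integer step can also be sketched with a dictionary and `Series.map`. This is an alternative to the `np.select` cells above, not the approach used in the notebook; `class_codes` and `demo` are illustrative names.

```python
import pandas as pd

# Same ordering as the values list above: 0 = lowest, 3 = highest
class_codes = {
    'Low Fatality Incident': 0,
    'Moderate Fatality Incident': 1,
    'High Fatality Incident': 2,
    'Very High Fatality Incident': 3,
}

demo = pd.Series([
    'Low Fatality Incident',
    'Moderate Fatality Incident',
    'High Fatality Incident',
    'Very High Fatality Incident',
    'Unexpected label',  # deliberately not in the mapping
])
encoded = demo.map(class_codes)
print(encoded.tolist())
```

With `map`, a label missing from the dictionary becomes `NaN` and is easy to spot, whereas `np.select` would assign it the default value of 0, which here is also the code for the lowest class.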

In [196]:
recent_final.to_csv('to_model2.csv', encoding='utf-8')
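
Because `to_csv` is called without `index=False`, the DataFrame index is written as an unnamed first column. A minimal sketch of reading such a file back, using an in-memory buffer instead of `to_model2.csv` and a toy frame named `demo`:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, like recent_final after rows were filtered out
demo = pd.DataFrame({'incident_classification': [0, 1, 3]}, index=[0, 2, 5])

buffer = io.StringIO()
demo.to_csv(buffer)   # index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 restores it as the index rather than an 'Unnamed: 0' data column
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```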