# Ghana Conflict Event
## Data Preprocessing

**Author:** Abdel An'lah TIDJANI

**Date:** December 31, 2021

### Notebook Configuration

In [119]:
import numpy as np
import pandas as pd
from datetime import datetime

In [120]:
# Configure notebook output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Number of rows and columns
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 150)

### Load Data


In [121]:
# Load data
dataset_name = 'data/1980-11-04-2021-11-13-Ghana.csv'
df = pd.read_csv(dataset_name, sep=";")

### Inspect the Structure
The data frame contains 31 attributes and 1,276 observations.

In [122]:
print("Shape of the data:", df.shape)
Rows,Cols=df.shape
df.head()

Shape of the data: (1276, 31)


Unnamed: 0,data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,inter1,actor2,assoc_actor_2,inter2,interaction,region,country,admin1,admin2,admin3,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
0,8628752,288,GHA1295,1295,04 November 2021,2021,1,Protests,Peaceful protest,Protesters (Ghana),Christian Group (Ghana); Muslim Group (Ghana);...,6,,,0,60,Western Africa,Ghana,Eastern,Akwapem North,,Larteh,5.9395,-0.0689,1,GNA (Ghana),National,"On 4 November 2021, members of the Presbyteria...",0,1636405422,GHA
1,8628433,288,GHA1296,1296,03 November 2021,2021,1,Riots,Violent demonstration,Rioters (Ghana),Labour Group (Ghana),5,Police Forces of Ghana (2017-),,1,15,Western Africa,Ghana,Central,Assin North,,Assin Bereku,5.8678,-1.3389,1,Citi News; Ghana Web,National,"On 3 November 2021, workers of the Shimizu Dai...",0,1636405421,GHA
2,8628634,288,GHA1297,1297,03 November 2021,2021,1,Protests,Peaceful protest,Protesters (Ghana),,6,,,0,60,Western Africa,Ghana,Ashanti,Amansie West,,Manso Atwere,6.4574,-1.8575,1,Citi News,National,"On 3 November 2021, residents demonstrated in ...",0,1636405422,GHA
3,8628753,288,GHA1298,1298,01 November 2021,2021,1,Riots,Mob violence,Rioters (Ghana),,5,Police Forces of Ghana (2017-),Civilians (Ghana); Labour Group (Ghana),1,15,Western Africa,Ghana,Northern,Tamale,,Tamale,9.4008,-0.8393,2,GNA (Ghana),National,"On 1 November 2021, residents of Kukuo suburb ...",0,1636405422,GHA
4,8628434,288,GHA1299,1299,30 October 2021,2021,1,Battles,Armed clash,Nadu Warriors Communal Militia (Ghana),,4,Police Forces of Ghana (2017-),,1,14,Western Africa,Ghana,Eastern,Lower Manya,,Krobo,6.1299,0.0012,1,Ghana Web; Chronicle (Ghana),National,"On 30 October 2021, members of the Nadu Warrio...",0,1636405421,GHA


In [123]:
# Display a summary of the data frame
df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1276 entries, 0 to 1275
Data columns (total 31 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   data_id           1276 non-null   int64  
 1   iso               1276 non-null   int64  
 2   event_id_cnty     1276 non-null   object 
 3   event_id_no_cnty  1276 non-null   int64  
 4   event_date        1276 non-null   object 
 5   year              1276 non-null   int64  
 6   time_precision    1276 non-null   int64  
 7   event_type        1276 non-null   object 
 8   sub_event_type    1276 non-null   object 
 9   actor1            1276 non-null   object 
 10  assoc_actor_1     545 non-null    object 
 11  inter1            1276 non-null   int64  
 12  actor2            773 non-null    object 
 13  assoc_actor_2     288 non-null    object 
 14  inter2            1276 non-null   int64  
 15  interaction       1276 non-null   int64  
 16  region            1276 non-null   object 


### View Missing Data
Calculate the number and percentage of null values for each attribute. As the results show, more than 50% of the values in `assoc_actor_1`, `assoc_actor_2`, and `admin3` are missing.

In [124]:
# Check the number of missing values in each attribute
count = df.isnull().sum()
percent = round(count / Rows * 100, 2)
series = [count, percent]
result = pd.concat(series, axis=1, keys=['Count','Percent'])
result.sort_values(by='Count', ascending=False)

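| Attribute | Count | Percent |
| --- | --- | --- |
| admin3 | 1276 | 100.0 |
| assoc_actor_2 | 988 | 77.43 |
| assoc_actor_1 | 731 | 57.29 |
| actor2 | 503 | 39.42 |
| notes | 10 | 0.78 |
| data_id | 0 | 0.0 |
| latitude | 0 | 0.0 |
| admin2 | 0 | 0.0 |
| location | 0 | 0.0 |
| longitude | 0 | 0.0 |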


In [125]:
# Drop columns with more than 50% missing values
df.drop(columns = ["assoc_actor_1","assoc_actor_2","admin3"], inplace =True)
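The three column names above are hard-coded from the missing-value table. As a hedged alternative, the same 50% rule could be applied programmatically so that it adapts to other datasets. The helper name `drop_sparse_columns` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of `frame` without columns whose share of nulls exceeds `threshold`."""
    missing_share = frame.isnull().mean()  # fraction of null values per column
    sparse = missing_share[missing_share > threshold].index
    return frame.drop(columns=sparse)

# Toy frame standing in for the ACLED data (values are made up)
toy = pd.DataFrame({
    "actor1": ["A", "B", "C", "D"],
    "assoc_actor_1": [None, None, None, "X"],  # 75% missing
    "admin3": [None, None, None, None],        # 100% missing
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))  # ['actor1']
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison.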


### Looking at categorical features 
> Features with very low or very high cardinality (`event_id_cnty`, `iso3`, `region`, `country`) give no useful information for this analysis.

> `event_date` should be a datetime, not an object.

> According to the [codebook](ACLED_Codebook_2019FINAL.pdf), `admin1` is the largest sub-national administrative region, `admin2` is the second largest, and `location` is the place where the event occurred.

In [126]:
df.select_dtypes('object').nunique().to_frame()

| Attribute | Unique values |
| --- | --- |
| event_id_cnty | 1276 |
| event_date | 970 |
| event_type | 5 |
| sub_event_type | 16 |
| actor1 | 127 |
| actor2 | 87 |
| region | 1 |
| country | 1 |
| admin1 | 16 |
| admin2 | 175 |


In [127]:
# Drop columns with no useful information
df.drop(columns = ["event_id_cnty","region","iso3","country"], inplace=True)
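The columns dropped here are either constant (one unique value) or unique on every row. A minimal sketch of how such columns might be flagged automatically from the `nunique()` counts follows. The function name `uninformative_columns` and the toy data are assumptions for illustration, and the sketch selects non-numeric columns with `exclude="number"` rather than `'object'`.

```python
import pandas as pd

def uninformative_columns(frame: pd.DataFrame) -> list:
    """List non-numeric columns that are constant or unique on every row."""
    counts = frame.select_dtypes(exclude="number").nunique()
    constant = counts[counts <= 1].index                # e.g. a single country
    identifier_like = counts[counts == len(frame)].index  # e.g. an event ID
    return sorted(set(constant) | set(identifier_like))

# Toy frame with made-up values
toy = pd.DataFrame({
    "event_id_cnty": ["GHA1", "GHA2", "GHA3"],
    "country": ["Ghana", "Ghana", "Ghana"],
    "event_type": ["Protests", "Riots", "Protests"],
})
flagged = uninformative_columns(toy)
print(flagged)  # ['country', 'event_id_cnty']
```

A column flagged this way still deserves a manual look, since free-text fields such as `notes` can also be unique on every row while remaining useful.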

In [128]:
# Convert event_date from text (e.g. "04 November 2021") to datetime
dateFormatter = "%d %B %Y"
df['event_date']=pd.to_datetime(df["event_date"], format=dateFormatter)
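With an explicit `format`, `pd.to_datetime` raises an error on the first value that does not match. If another dataset might contain malformed dates, one option is `errors="coerce"`, which turns unparseable values into `NaT` so they can be counted and inspected. The sample strings below are made up, and `%B` assumes English month names under the default locale.

```python
import pandas as pd

# Two well-formed dates and one deliberately malformed value
raw = pd.Series(["04 November 2021", "30 October 2021", "not a date"])

# Unparseable entries become NaT instead of raising an exception
parsed = pd.to_datetime(raw, format="%d %B %Y", errors="coerce")

n_bad = int(parsed.isna().sum())
print(n_bad)           # 1
print(parsed.iloc[0])  # 2021-11-04 00:00:00
```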

### Looking at numeric features 
> `iso`, `time_precision`, `data_id`, `event_id_no_cnty`, `geo_precision`, and `timestamp` don't give us useful information.

> According to the [codebook](ACLED_Codebook_2019FINAL.pdf), `inter1`, `inter2`, and `interaction` shouldn't be considered numeric because they are coded labels.

> Good numeric features are `year`, `latitude`, `longitude`, and `fatalities`.

In [129]:
df.select_dtypes('number').describe()

Unnamed: 0,data_id,iso,event_id_no_cnty,year,time_precision,inter1,inter2,interaction,latitude,longitude,geo_precision,fatalities,timestamp
count,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0
mean,6721911.0,288.0,655.183386,2016.603448,1.158307,4.732759,2.826803,44.356583,6.954875,-0.787464,1.159875,0.528213,1598546000.0
std,1113512.0,0.0,376.729469,4.624347,0.371559,1.503426,3.001271,17.778258,1.772151,0.860652,0.418574,3.897911,25571570.0
min,4555287.0,288.0,1.0,1997.0,1.0,1.0,0.0,10.0,4.7949,-3.0454,1.0,0.0,1552576000.0
25%,5758440.0,288.0,330.75,2015.0,1.0,4.0,0.0,37.0,5.556,-1.6218,1.0,0.0,1572404000.0
50%,7060988.0,288.0,658.5,2018.0,1.0,5.0,1.0,50.0,6.35785,-0.4839,1.0,0.0,1618269000.0
75%,7764179.0,288.0,982.25,2020.0,1.0,6.0,7.0,60.0,7.643375,-0.1969,1.0,0.0,1618530000.0
max,8638703.0,288.0,1301.0,2021.0,3.0,8.0,8.0,78.0,11.0846,1.1901,3.0,126.0,1636413000.0


In [130]:
# drop numeric features with no useful information 
df.drop(columns = ["data_id", "iso", "event_id_no_cnty","time_precision","timestamp","geo_precision"] , inplace=True)
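Since `inter1` and `inter2` are coded labels rather than quantities, they could be converted to a categorical dtype. The sketch below is an assumption about how that might look: the label names are inferred from the actor names used in the `interaction` labels of this notebook, `0` is taken to mean that no second actor is recorded, and all of this should be checked against the codebook.

```python
import pandas as pd

# Assumed code-to-label mapping, inferred from the interaction labels in this notebook
INTER_LABELS = {
    0: "none",
    1: "military",
    2: "rebels",
    3: "political militia",
    4: "communal militia",
    5: "rioters",
    6: "protesters",
    7: "civilians",
    8: "other",
}

# Toy frame with made-up codes
toy = pd.DataFrame({"inter1": [6, 5, 4], "inter2": [0, 1, 1]})
for col in ["inter1", "inter2"]:
    toy[col] = toy[col].map(INTER_LABELS).astype("category")

print(toy["inter1"].tolist())  # ['protesters', 'rioters', 'communal militia']
print(toy["inter2"].dtype)     # category
```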

### Create a new variable `interact` with the labels of `interaction`

In [131]:
# new variable interact

key = [10,11,12,13,14,15,16,17,18,20,22,23,24,25,26,27,28,30,33,34,35,36,37,38,40,44,45,46,47,48,50,55,56,57,58,60,66,67,68,78,80]

value = ["SOLE MILITARY ACTION","MILITARY VERSUS MILITARY","MILITARY VERSUS REBELS","MILITARY VERSUS POLITICAL MILITIA",
  "MILITARY VERSUS COMMUNAL MILITIA","MILITARY VERSUS RIOTERS","MILITARY VERSUS PROTESTERS","MILITARY VERSUS CIVILIANS",
  "MILITARY VERSUS OTHER","SOLE REBEL ACTION","REBELS VERSUS REBELS","REBELS VERSUS POLITICAL MILITIA","REBELS VERSUS COMMUNAL MILITIA",
  "REBELS VERSUS RIOTERS","REBELS VERSUS PROTESTERS","REBELS VERSUS CIVILIANS","REBELS VERSUS OTHERS","SOLE POLITICAL MILITIA ACTION",
  "POLITICAL MILITIA VERSUS POLITICAL MILITIA","POLITICAL MILITIA VERSUS COMMUNAL MILITIA","POLITICAL MILITIA VERSUS RIOTERS",
  "POLITICAL MILITIA VERSUS PROTESTERS","POLITICAL MILITIA VERSUS CIVILIANS","POLITICAL MILITIA VERSUS OTHERS","SOLE COMMUNAL MILITIA ACTION",
  "COMMUNAL MILITIA VERSUS COMMUNAL MILITIA","COMMUNAL MILITIA VERSUS RIOTERS","COMMUNAL MILITIA VERSUS PROTESTERS","COMMUNAL MILITIA VERSUS CIVILIANS",
  "COMMUNAL MILITIA VERSUS OTHER","SOLE RIOTER ACTION","RIOTERS VERSUS RIOTERS","RIOTERS VERSUS PROTESTERS","RIOTERS VERSUS CIVILIANS",
  "RIOTERS VERSUS OTHERS","SOLE PROTESTER ACTION","PROTESTERS VERSUS PROTESTERS","PROTESTERS VERSUS CIVILIANS","PROTESTERS VERSUS OTHER",
  "OTHER ACTOR VERSUS CIVILIANS","SOLE OTHER ACTION"]

# Map each code to its lowercase label in one pass; codes without a label are kept as-is.
# This replaces the chained assignment that raised SettingWithCopyWarning.
labels = {code: label.lower() for code, label in zip(key, value)}
df["interact"] = df["interaction"].map(lambda code: labels.get(code, code))

df.drop(columns="interaction", inplace=True)


In [132]:
df.head()

Unnamed: 0,event_date,year,event_type,sub_event_type,actor1,inter1,actor2,inter2,admin1,admin2,location,latitude,longitude,source,source_scale,notes,fatalities,interact
0,2021-11-04,2021,Protests,Peaceful protest,Protesters (Ghana),6,,0,Eastern,Akwapem North,Larteh,5.9395,-0.0689,GNA (Ghana),National,"On 4 November 2021, members of the Presbyteria...",0,sole protester action
1,2021-11-03,2021,Riots,Violent demonstration,Rioters (Ghana),5,Police Forces of Ghana (2017-),1,Central,Assin North,Assin Bereku,5.8678,-1.3389,Citi News; Ghana Web,National,"On 3 November 2021, workers of the Shimizu Dai...",0,military versus rioters
2,2021-11-03,2021,Protests,Peaceful protest,Protesters (Ghana),6,,0,Ashanti,Amansie West,Manso Atwere,6.4574,-1.8575,Citi News,National,"On 3 November 2021, residents demonstrated in ...",0,sole protester action
3,2021-11-01,2021,Riots,Mob violence,Rioters (Ghana),5,Police Forces of Ghana (2017-),1,Northern,Tamale,Tamale,9.4008,-0.8393,GNA (Ghana),National,"On 1 November 2021, residents of Kukuo suburb ...",0,military versus rioters
4,2021-10-30,2021,Battles,Armed clash,Nadu Warriors Communal Militia (Ghana),4,Police Forces of Ghana (2017-),1,Eastern,Lower Manya,Krobo,6.1299,0.0012,Ghana Web; Chronicle (Ghana),National,"On 30 October 2021, members of the Nadu Warrio...",0,military versus communal militia


### Cleaning function
> With this function, we can process another ACLED dataset automatically.

In [133]:
# Define the cleaning function

def wrangle(file):
    
    # Load data
    dataset_name = file
    df = pd.read_csv(dataset_name, sep=";")
    
    # delete columns with more than 50% missing value
    df.drop(columns = ["assoc_actor_1","assoc_actor_2","admin3"], inplace =True)

    # delete columns with no useful information 
    df.drop(columns = ["event_id_cnty","region","iso3","country"], inplace=True)

    # convert event_date to datetime dtype
    dateFormatter = "%d %B %Y"
    df['event_date']=pd.to_datetime(df["event_date"], format=dateFormatter)

    # drop numeric features with no useful information 
    df.drop(columns = ["data_id", "iso", "event_id_no_cnty","time_precision","timestamp","geo_precision"] , inplace=True)
    
    # create new feature interact
    key = [10,11,12,13,14,15,16,17,18,20,22,23,24,25,26,27,28,30,33,34,35,36,37,38,40,44,45,46,47,48,50,55,56,57,58,60,66,67,68,78,80]

    value = ["SOLE MILITARY ACTION","MILITARY VERSUS MILITARY","MILITARY VERSUS REBELS","MILITARY VERSUS POLITICAL MILITIA",
  "MILITARY VERSUS COMMUNAL MILITIA","MILITARY VERSUS RIOTERS","MILITARY VERSUS PROTESTERS","MILITARY VERSUS CIVILIANS",
  "MILITARY VERSUS OTHER","SOLE REBEL ACTION","REBELS VERSUS REBELS","REBELS VERSUS POLITICAL MILITIA","REBELS VERSUS COMMUNAL MILITIA",
  "REBELS VERSUS RIOTERS","REBELS VERSUS PROTESTERS","REBELS VERSUS CIVILIANS","REBELS VERSUS OTHERS","SOLE POLITICAL MILITIA ACTION",
  "POLITICAL MILITIA VERSUS POLITICAL MILITIA","POLITICAL MILITIA VERSUS COMMUNAL MILITIA","POLITICAL MILITIA VERSUS RIOTERS",
  "POLITICAL MILITIA VERSUS PROTESTERS","POLITICAL MILITIA VERSUS CIVILIANS","POLITICAL MILITIA VERSUS OTHERS","SOLE COMMUNAL MILITIA ACTION",
  "COMMUNAL MILITIA VERSUS COMMUNAL MILITIA","COMMUNAL MILITIA VERSUS RIOTERS","COMMUNAL MILITIA VERSUS PROTESTERS","COMMUNAL MILITIA VERSUS CIVILIANS",
  "COMMUNAL MILITIA VERSUS OTHER","SOLE RIOTER ACTION","RIOTERS VERSUS RIOTERS","RIOTERS VERSUS PROTESTERS","RIOTERS VERSUS CIVILIANS",
  "RIOTERS VERSUS OTHERS","SOLE PROTESTER ACTION","PROTESTERS VERSUS PROTESTERS","PROTESTERS VERSUS CIVILIANS","PROTESTERS VERSUS OTHER",
  "OTHER ACTOR VERSUS CIVILIANS","SOLE OTHER ACTION"]

    # Map each code to its lowercase label in one pass; codes without a label are kept as-is.
    # This replaces the chained assignment that raised SettingWithCopyWarning.
    labels = {code: label.lower() for code, label in zip(key, value)}
    df["interact"] = df["interaction"].map(lambda code: labels.get(code, code))

    df.drop(columns="interaction", inplace=True)

    return df

In [135]:
data = wrangle('data/1980-11-04-2021-11-13-Ghana.csv')
data.head()


Unnamed: 0,event_date,year,event_type,sub_event_type,actor1,inter1,actor2,inter2,admin1,admin2,location,latitude,longitude,source,source_scale,notes,fatalities,interact
0,2021-11-04,2021,Protests,Peaceful protest,Protesters (Ghana),6,,0,Eastern,Akwapem North,Larteh,5.9395,-0.0689,GNA (Ghana),National,"On 4 November 2021, members of the Presbyteria...",0,sole protester action
1,2021-11-03,2021,Riots,Violent demonstration,Rioters (Ghana),5,Police Forces of Ghana (2017-),1,Central,Assin North,Assin Bereku,5.8678,-1.3389,Citi News; Ghana Web,National,"On 3 November 2021, workers of the Shimizu Dai...",0,military versus rioters
2,2021-11-03,2021,Protests,Peaceful protest,Protesters (Ghana),6,,0,Ashanti,Amansie West,Manso Atwere,6.4574,-1.8575,Citi News,National,"On 3 November 2021, residents demonstrated in ...",0,sole protester action
3,2021-11-01,2021,Riots,Mob violence,Rioters (Ghana),5,Police Forces of Ghana (2017-),1,Northern,Tamale,Tamale,9.4008,-0.8393,GNA (Ghana),National,"On 1 November 2021, residents of Kukuo suburb ...",0,military versus rioters
4,2021-10-30,2021,Battles,Armed clash,Nadu Warriors Communal Militia (Ghana),4,Police Forces of Ghana (2017-),1,Eastern,Lower Manya,Krobo,6.1299,0.0012,Ghana Web; Chronicle (Ghana),National,"On 30 October 2021, members of the Nadu Warrio...",0,military versus communal militia


In [138]:
df.shape==data.shape

True

In [139]:
df.compare(data, align_axis=1, keep_shape=False, keep_equal=False)
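`DataFrame.compare` returns only the cells that differ, so an empty result means the two frames match. As a hedged complement, `DataFrame.equals` and `pandas.testing.assert_frame_equal` give an explicit pass or fail. The frames `left` and `right` below are toy stand-ins with made-up values, not the notebook's `df` and `data`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"year": [2021, 2020], "fatalities": [0, 3]})
right = left.copy()

# Raises AssertionError with a detailed message if the frames differ
assert_frame_equal(left, right)
same = left.equals(right)
print(same)  # True

# Introduce one difference and list only the cells that changed
right.loc[1, "fatalities"] = 4
diff = left.compare(right)
print(diff.shape)  # (1, 2): one differing row, with 'self' and 'other' values
```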