# **Data Cleaning and Feature Engineering**

## Objectives

1. Clean dataset
2. Engineer new features
3. Transform features in preparation for ML modelling

## Inputs

- acled_original_optimised.csv file

## Outputs

1. Cleaned data set
2. ML modelling-ready input data set (with Feature Engineering)

## Additional Comments

<ins>Cleaning list from EDA notebook</ins>

1. actor2, inter2, population_1km, population_best: deal with missing values
2. disorder_type: has only one value, drop
3. Consider dropping:
    - actor1/actor2 (name of belligerent): many unique values, difficult to interpret
4. interaction: check whether values can be grouped together
5. source: try to recategorise values, too many in their current state
6. source_scale: try to recategorise values, note category 'Other'
7. Create an additional dataset for DBSCAN without geo_precision = 3
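
The last item in the list above could be sketched roughly as follows. The frame `toy` and the name `df_dbscan` are hypothetical stand-ins, and the values are made up. The sketch assumes that rows with `geo_precision` equal to 3 are simply filtered out and that only the coordinates are needed for clustering.

```python
import pandas as pd

# Toy stand-in for the ACLED frame; values are illustrative only
toy = pd.DataFrame({
    "latitude": [16.04, 2.25, 24.46, 14.51],
    "longitude": [98.12, 44.69, 89.71, 39.44],
    "geo_precision": [2, 1, 3, 2],
})

# Keep only rows whose location is not coded at precision level 3,
# and only the coordinate columns (an assumption about what DBSCAN needs)
df_dbscan = (
    toy.loc[toy["geo_precision"] != 3, ["latitude", "longitude"]]
    .reset_index(drop=True)
)
print(df_dbscan)
```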



In [1]:
# import libraries
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
# use the same loading procedure as in the previous notebook

# specify columns to keep
to_keep = ['event_id_cnty', 'event_date', 'disorder_type', 'event_type', 'sub_event_type', 'actor1',
           'inter1', 'actor2', 'inter2', 'interaction', 'region', 'country', 'latitude', 'longitude',
           'geo_precision', 'source', 'source_scale', 'notes', 'fatalities', 'population_1km',
           'population_best']

# define data types for each column
dtype_map = {
    "event_id_cnty": "string",           
    "disorder_type": "category",
    "event_type": "category",
    "sub_event_type": "category",
    "actor1": "category",
    "inter1": "category",                   
    "actor2": "category",
    "inter2": "category",                 
    "interaction": "category",               
    "region": "category",
    "country": "category",
    "latitude": "float32",
    "longitude": "float32",
    "geo_precision": "int8",           
    "source": "category",
    "source_scale": "category",
    "notes": "string",                   
    "fatalities": "int16",               
    "population_1km": "float32",       
    "population_best": "float32"    
}

# load the dataset with specified columns and data types
df = pd.read_csv(
    Path.cwd().parent / 'data/raw/original_acled.csv',
    usecols=to_keep,
    dtype=dtype_map,
    parse_dates=["event_date"],
    low_memory=False
)
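
The `category` dtypes in the mapping above are presumably chosen to reduce memory use for columns with few distinct values. A minimal sketch of that effect, on a made-up series rather than the real data:

```python
import pandas as pd

# Toy column with few distinct values repeated many times (illustrative only)
regions = pd.Series(["Eastern Africa", "Southeast Asia", "South Asia"] * 10_000)

as_object = regions.astype("object")
as_category = regions.astype("category")

# deep=True also counts the memory held by the Python string objects
mem_object = as_object.memory_usage(deep=True)
mem_category = as_category.memory_usage(deep=True)
print(f"object: {mem_object:,} bytes, category: {mem_category:,} bytes")
```

On the real frame, `df.memory_usage(deep=True)` gives the same per-column breakdown.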

# Data Cleaning

## Drop columns

Not all columns are useful for the upcoming analysis. A column may be dropped for one of three reasons:

1. It is irrelevant, i.e. it contains no useful information
2. It may contain relevant information, but that information cannot be extracted and the column cannot be used as it is
3. It is redundant, i.e. several columns carry roughly the same information. <ins>Redundant features add dimensionality without adding information and can harm the model!</ins>
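
The first two reasons can be screened for with a quick count of distinct values per column. The sketch below uses a made-up frame `toy`; the column names mirror the dataset, but the values and the idea of a "cardinality ratio" are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the ACLED frame; values are illustrative only
toy = pd.DataFrame({
    "disorder_type": ["Political violence"] * 4,
    "actor1": ["Group A", "Group B", "Group C", "Group D"],
    "event_type": ["Battles", "Battles", "Violence against civilians", "Battles"],
})

n_unique = toy.nunique()

# Columns with a single distinct value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()

# Share of distinct values per column; values near 1 suggest a very granular column
cardinality_ratio = (n_unique / len(toy)).round(2)

print(constant_cols)
print(cardinality_ratio)
```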

In [11]:
# 'disorder_type' is irrelevant as it has only one value, so it will be dropped
df.drop(columns=['disorder_type'], inplace=True)

In [None]:
# "actor1" and "actor2" provide the names of belligerents, e.g. 'Military Forces of Eritrea'
# these values are too granular (too many unique values) to be used as features
df.drop(columns=['actor1', 'actor2'], inplace=True)

In [7]:
# "inter1" and "inter2" provide the type of belligerents, e.g. 'State Forces'
# these values are more general and could be used, but they are redundant with "interaction"
# "interaction" combines both sides, e.g. 'State forces-Civilians'
# therefore, only "interaction" will be kept to reduce dimensionality
df.drop(columns=['inter1', 'inter2'], inplace=True)
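
The redundancy claim can be checked before dropping the columns. The sketch below is one possible check on a made-up frame `toy`: if every (`inter1`, `inter2`) pair maps to exactly one `interaction` label, then `interaction` carries no information beyond the pair. The names `labels_per_pair` and `is_redundant` are hypothetical, and on the real data the check would have to run before the drop.

```python
import pandas as pd

# Toy stand-in for the ACLED frame; values are illustrative only
toy = pd.DataFrame({
    "inter1": ["State forces", "Political militia", "External/Other forces", "State forces"],
    "inter2": ["Civilians", "Political militia", "State forces", "External/Other forces"],
    "interaction": [
        "State forces-Civilians",
        "Political militia-Political militia",
        "State forces-External/Other forces",
        "State forces-External/Other forces",
    ],
})

# For each (inter1, inter2) pair, count the distinct interaction labels
labels_per_pair = (
    toy.groupby(["inter1", "inter2"], observed=True)["interaction"].nunique()
)

# True when each pair always produces the same interaction label
is_redundant = bool((labels_per_pair == 1).all())
print(is_redundant)
```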

In [8]:
# "source" contains the source of the report, e.g. 'BBC'
# this is too granular to be useful for the analysis; 'source_scale' (e.g. 'National') is more relevant
df.drop(columns=['source'], inplace=True)

In [9]:
# "notes" contains unstructured text data about the event
# it may contain useful keywords, but for the scope of this project, it will be dropped
df.drop(columns=['notes'], inplace=True)

## Missing values

In [12]:
# review nans
df.isna().sum()

event_id_cnty          0
event_date             0
event_type             0
sub_event_type         0
interaction            0
region                 0
country                0
latitude               0
longitude              0
geo_precision          0
source_scale           0
fatalities             0
population_1km     48294
population_best    48294
dtype: int64
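
Both population columns show the same number of missing values, which suggests (but does not prove) that they are missing in the same rows. A minimal sketch of how that could be checked, together with the share of missing values, on a made-up frame `toy`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two population columns; values are illustrative only
toy = pd.DataFrame({
    "population_1km": [np.nan, 1200.0, np.nan, 530.0],
    "population_best": [np.nan, 45000.0, np.nan, 9800.0],
})
pop_cols = ["population_1km", "population_best"]

# Fraction of missing values per column
missing_share = toy[pop_cols].isna().mean()

# True when both columns are missing in exactly the same rows
same_rows = bool(
    (toy["population_1km"].isna() == toy["population_best"].isna()).all()
)

print(missing_share)
print(same_rows)
```

If the same check holds on the real data, both columns can be handled with a single rule, whichever strategy is chosen.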

In [4]:
df.head()

| | event_id_cnty | event_date | disorder_type | event_type | sub_event_type | actor1 | inter1 | actor2 | inter2 | interaction | ... | country | latitude | longitude | geo_precision | source | source_scale | notes | fatalities | population_1km | population_best |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MMR1 | 2010-01-01 | Political violence | Violence against civilians | Attack | Military Forces of Myanmar (1988-2011) | State forces | Civilians (Myanmar) | Civilians | State forces-Civilians | ... | Myanmar | 16.0408 | 98.123199 | 2 | Democratic Voice of Burma | National | On 1 January 2010, the Democratic Karen Buddhi... | 0 | NaN | NaN |
| 1 | SOM5580 | 2010-01-01 | Political violence | Battles | Armed clash | HI: Hizbul Islam | Political militia | Unidentified Armed Group (Somalia) | Political militia | Political militia-Political militia | ... | Somalia | 2.2524 | 44.690498 | 1 | Radio Gaalkacyo | National | Fighters loyal to Hisb Al-Islam group reported... | 7 | NaN | NaN |
| 2 | BGD7238 | 2010-01-01 | Political violence | Battles | Armed clash | BCL: Bangladesh Chhatra League | Political militia | BCL: Bangladesh Chhatra League | Political militia | Political militia-Political militia | ... | Bangladesh | 24.457701 | 89.708 | 1 | Right Vision News | International | Two factions of the BCL- one led by Kamal and ... | 0 | NaN | NaN |
| 3 | ETH1319 | 2010-01-01 | Political violence | Battles | Armed clash | Military Forces of Eritrea (1993-) | External/Other forces | Military Forces of Ethiopia (1991-2018) | State forces | State forces-External/Other forces | ... | Ethiopia | 14.5091 | 39.443699 | 2 | AFP | International | Eritrea accused arch-foe Ethiopia on Sunday of... | 10 | NaN | NaN |
| 4 | ETH1320 | 2010-01-01 | Political violence | Battles | Armed clash | Military Forces of Ethiopia (1991-2018) | State forces | Military Forces of Eritrea (1993-) | External/Other forces | State forces-External/Other forces | ... | Ethiopia | 14.5219 | 39.384998 | 2 | All Africa | International | Eritrean military claims Ethiopian troops atta... | 10 | NaN | NaN |


In [5]:
df['inter1'].value_counts()

inter1
State forces             301000
Political militia        290082
External/Other forces    254095
Rebel group              184177
Identity militia          35958
Name: count, dtype: int64

# Feature Engineering