# **Exploratory Data Analysis**

## Objectives

1. Load the raw dataset into a pandas DataFrame
2. Appraise data quality
3. Explore data using descriptive statistics and visualisations

## Inputs

- Raw ACLED dataset (CSV), obtained from https://acleddata.com/

## Outputs

1. Optimised dataset saved as a CSV file
2. Set of descriptive statistics and figures
3. Insights into data strengths and weaknesses
4. Data cleaning plan

## Additional Comments

- The loading operation presented a challenge: the raw dataset in CSV format is over 600 MB in size.



In [3]:
# import libraries
import pandas as pd
import numpy as np
from pathlib import Path

# Loading operation

At over 600 MB, the dataset may exhaust available RAM if loaded with pandas' default settings. To mitigate this, loading was done in two steps:

1. Load a preview of 10 rows for assessment. Choose which columns to keep and the smallest data type each one can be converted to.
2. Load the entire dataset using the columns and dtypes selected in Step 1.
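
The likely payoff of Step 1 can be estimated before the full load. The following is a minimal sketch on synthetic data, not on the ACLED file itself: the column name is borrowed from the dataset, but the values and row count are invented for illustration. It compares the memory footprint of a low-cardinality text column stored as `object` against the same column stored as `category`.

```python
import pandas as pd

# Synthetic stand-in for a low-cardinality text column (values are illustrative only)
sample = pd.DataFrame({
    "event_type": ["Battles", "Protests", "Riots", "Battles"] * 25_000,
})

# deep=True counts the Python string objects themselves, not just the pointers to them
bytes_as_object = sample["event_type"].astype("object").memory_usage(deep=True)
bytes_as_category = sample["event_type"].astype("category").memory_usage(deep=True)

print(f"object:   {bytes_as_object / 1e6:.2f} MB")
print(f"category: {bytes_as_category / 1e6:.2f} MB")
```

The saving depends on how often values repeat, so a column of mostly unique strings (such as `notes`) would gain little from `category`.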

In [None]:
# load a preview of the dataset to mitigate potential memory issues
df_preview = pd.read_csv(Path.cwd().parent / 'data/raw/original_acled.csv', nrows=10)
df_preview.head()

Unnamed: 0,event_id_cnty,event_date,year,time_precision,disorder_type,event_type,sub_event_type,actor1,assoc_actor_1,inter1,...,longitude,geo_precision,source,source_scale,notes,fatalities,tags,timestamp,population_1km,population_best
0,MMR1,2010-01-01,2010,1,Political violence,Violence against civilians,Attack,Military Forces of Myanmar (1988-2011),DKBA (Buddhist): Democratic Karen Buddhist Arm...,State forces,...,98.1232,2,Democratic Voice of Burma,National,"On 1 January 2010, the Democratic Karen Buddhi...",0,,1552577624,,
1,SOM5580,2010-01-01,2010,1,Political violence,Battles,Armed clash,HI: Hizbul Islam,,Political militia,...,44.6905,1,Radio Gaalkacyo,National,Fighters loyal to Hisb Al-Islam group reported...,7,,1572403772,,
2,BGD7238,2010-01-01,2010,1,Political violence,Battles,Armed clash,BCL: Bangladesh Chhatra League,Students (Bangladesh),Political militia,...,89.708,1,Right Vision News,International,Two factions of the BCL- one led by Kamal and ...,0,,1618526280,,
3,ETH1319,2010-01-01,2010,1,Political violence,Battles,Armed clash,Military Forces of Eritrea (1993-),,External/Other forces,...,39.4437,2,AFP,International,Eritrea accused arch-foe Ethiopia on Sunday of...,10,,1618529663,,
4,ETH1320,2010-01-01,2010,1,Political violence,Battles,Armed clash,Military Forces of Ethiopia (1991-2018),,State forces,...,39.385,2,All Africa,International,Eritrean military claims Ethiopian troops atta...,10,,1618529663,,


In [5]:
# specify columns to keep
to_keep = ['event_id_cnty', 'event_date', 'disorder_type', 'event_type', 'sub_event_type', 'actor1', 
           'inter1', 'actor2', 'inter2', 'interaction', 'region', 'country', 'latitude', 'longitude',
           'geo_precision', 'source', 'source_scale','notes',  'fatalities',  'population_1km', 
           'population_best']

# define data types for each column
dtype_map = {
    "event_id_cnty": "string",           
    "disorder_type": "category",
    "event_type": "category",
    "sub_event_type": "category",
    "actor1": "category",
    "inter1": "category",                   
    "actor2": "category",
    "inter2": "category",                 
    "interaction": "category",               
    "region": "category",
    "country": "category",
    "latitude": "float32",
    "longitude": "float32",
    "geo_precision": "int8",           
    "source": "category",
    "source_scale": "category",
    "notes": "string",                   
    "fatalities": "int16",               
    "population_1km": "float32",       
    "population_best": "float32"    
}

# load the dataset with specified columns and data types
df = pd.read_csv(
    Path.cwd().parent / 'data/raw/original_acled.csv',
    usecols=to_keep,
    dtype=dtype_map,
    parse_dates=["event_date"],
    low_memory=False
)
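
The narrow integer types in `dtype_map` assume that every value fits the target range and that the column has no missing values, since a plain NumPy integer dtype cannot hold `NaN`. A hedged sketch of a range check follows; `fits_dtype` is a hypothetical helper written for this illustration, and the sample values are invented rather than taken from the dataset.

```python
import numpy as np
import pandas as pd


def fits_dtype(values: pd.Series, dtype: str) -> bool:
    """Return True if every value lies within the range of the given integer dtype."""
    info = np.iinfo(dtype)
    return bool(values.min() >= info.min and values.max() <= info.max)


# Invented values standing in for a fatalities column
fatalities = pd.Series([0, 7, 10, 1350])

print(fits_dtype(fatalities, "int16"))  # int16 covers -32768 to 32767
print(fits_dtype(fatalities, "int8"))   # int8 covers -128 to 127
```

Running such a check on the preview alone is not conclusive, because 10 rows may not contain the extreme values present in the full file.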

In [7]:
# save optimised dataframe
df.to_csv(Path.cwd().parent / 'data/raw/acled_original_optimised.csv', index=False)
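
One caveat with this output format: CSV stores only text, so the dtypes chosen above are not saved with the file and have to be passed again when it is read back. The sketch below illustrates this with a small in-memory round trip; the two-row frame is invented for the example and an `io.StringIO` buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Tiny invented frame using two of the dtypes from the mapping above
df_small = pd.DataFrame({
    "event_type": pd.Series(["Battles", "Protests"], dtype="category"),
    "fatalities": pd.Series([7, 0], dtype="int16"),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)

# Reading without dtype lets pandas infer the types again
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# Reading with an explicit dtype mapping restores the compact types
buffer.seek(0)
reloaded_typed = pd.read_csv(
    buffer, dtype={"event_type": "category", "fatalities": "int16"}
)

print(reloaded_default.dtypes)
print(reloaded_typed.dtypes)
```

In later notebooks, reusing the same `dtype_map` and `parse_dates` arguments when reading the optimised file would keep the memory footprint consistent.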

# EDA

## Data quality appraisal

In [6]:
# view dataframe
df.head()

Unnamed: 0,event_id_cnty,event_date,disorder_type,event_type,sub_event_type,actor1,inter1,actor2,inter2,interaction,...,country,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,population_1km,population_best
0,MMR1,2010-01-01,Political violence,Violence against civilians,Attack,Military Forces of Myanmar (1988-2011),State forces,Civilians (Myanmar),Civilians,State forces-Civilians,...,Myanmar,16.0408,98.123199,2,Democratic Voice of Burma,National,"On 1 January 2010, the Democratic Karen Buddhi...",0,,
1,SOM5580,2010-01-01,Political violence,Battles,Armed clash,HI: Hizbul Islam,Political militia,Unidentified Armed Group (Somalia),Political militia,Political militia-Political militia,...,Somalia,2.2524,44.690498,1,Radio Gaalkacyo,National,Fighters loyal to Hisb Al-Islam group reported...,7,,
2,BGD7238,2010-01-01,Political violence,Battles,Armed clash,BCL: Bangladesh Chhatra League,Political militia,BCL: Bangladesh Chhatra League,Political militia,Political militia-Political militia,...,Bangladesh,24.457701,89.708,1,Right Vision News,International,Two factions of the BCL- one led by Kamal and ...,0,,
3,ETH1319,2010-01-01,Political violence,Battles,Armed clash,Military Forces of Eritrea (1993-),External/Other forces,Military Forces of Ethiopia (1991-2018),State forces,State forces-External/Other forces,...,Ethiopia,14.5091,39.443699,2,AFP,International,Eritrea accused arch-foe Ethiopia on Sunday of...,10,,
4,ETH1320,2010-01-01,Political violence,Battles,Armed clash,Military Forces of Ethiopia (1991-2018),State forces,Military Forces of Eritrea (1993-),External/Other forces,State forces-External/Other forces,...,Ethiopia,14.5219,39.384998,2,All Africa,International,Eritrean military claims Ethiopian troops atta...,10,,
