**DATA UNDERSTANDING**

**a) Imported Relevant Modules**

In [2]:
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

**b) Loading Dataset**

In [7]:
# Load the dataset
data = pd.read_csv('../../AfricaConflictDataset.csv')

# Display the first few rows of the dataset
data.head()

Unnamed: 0,event_id_cnty,event_date,year,time_precision,disorder_type,event_type,sub_event_type,actor1,assoc_actor_1,inter1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,tags,timestamp
0,MLI33921,28/02/2025,2025,1,Political violence,Explosions/Remote violence,Remote explosive/landmine/IED,JNIM: Group for Support of Islam and Muslims,,Rebel group,...,Ibdakan,18.9689,2.0041,2,Al Zallaqa,New media,"On 28 February 2025, JNIM claimed to have targ...",0,,1741039795
1,BFO13376,28/02/2025,2025,1,Political violence,Battles,Armed clash,JNIM: Group for Support of Islam and Muslims,,Rebel group,...,Gnangdin,11.3693,-0.36,1,Undisclosed Source,Local partner-Other,"On 28 February 2025, an armed group (likely JN...",0,,1741039795
2,MLI33922,28/02/2025,2025,1,Political violence,Violence against civilians,Attack,Military Forces of Mali (2021-),Wagner Group,State forces,...,Eghacher-Sediden,18.5558,1.1113,1,Twitter,New media,"On 28 February 2025, a FAMa and Wagner patrol ...",0,,1741039795
3,GHA2795,28/02/2025,2025,1,Political violence,Violence against civilians,Attack,Land Guards,,External/Other forces,...,Bibiani,6.4667,-2.3333,2,3 News,National,"On 28 February 2025, a land guard shot a farme...",0,,1741039795
4,GHA2800,28/02/2025,2025,1,Political violence,Riots,Mob violence,Rioters (Ghana),,Rioters,...,Accra,5.556,-0.1969,3,Ghana Web,National,"On 28 February 2025, citizens threw stones at ...",0,crowd size=no report,1741039795


**c) Data Shape**

In [8]:
print('Our data has {} rows and {} columns'.format(data.shape[0], data.shape[1]))

Our data has 413947 rows and 31 columns


**d) Data Description**


In [9]:
data.describe()

| Statistic | year | time_precision | iso | latitude | longitude | geo_precision | fatalities | timestamp |
|---|---|---|---|---|---|---|---|---|
| count | 413947.0 | 413947.0 | 413947.0 | 413947.0 | 413947.0 | 413947.0 | 413947.0 | 413947.0 |
| mean | 2017.527964 | 1.130737 | 510.399988 | 6.922263 | 21.8868 | 1.279074 | 2.439153 | 1676141000.0 |
| std | 6.60851 | 0.393477 | 250.075057 | 15.495063 | 16.776814 | 0.494604 | 24.320917 | 52636890.0 |
| min | 1997.0 | 1.0 | 12.0 | -34.7068 | -25.1631 | 1.0 | 0.0 | 1552576000.0 |
| 25% | 2015.0 | 1.0 | 231.0 | 0.3156 | 8.1555 | 1.0 | 0.0 | 1622068000.0 |
| 50% | 2020.0 | 1.0 | 566.0 | 6.6936 | 28.0436 | 1.0 | 0.0 | 1689711000.0 |
| 75% | 2022.0 | 1.0 | 710.0 | 13.5157 | 33.4833 | 2.0 | 1.0 | 1724714000.0 |
| max | 2025.0 | 3.0 | 894.0 | 37.2815 | 64.6832 | 3.0 | 1350.0 | 1741072000.0 |


Our data records armed-conflict and related disorder events across Africa from 1997 to 2025, together with their outcomes, such as the number of reported fatalities.


The columns include:

event_id_cnty:
A unique alphanumeric identifier combining a numeric ID with a country code (e.g., ETH9766).

event_date:
The date on which the event occurred. In this file it is stored as a DD/MM/YYYY string (e.g., 28/02/2025).

year:
The year during which the event took place.

time_precision:
A numeric code (1–3) indicating the precision of the event date, with 1 being the most precise.

disorder_type:
The overarching category of the event, such as Political violence, Demonstrations, or Strategic developments.

event_type:
The general classification of the event (e.g., Battles, Protests, Explosions/Remote violence).

sub_event_type:
A more specific classification within the event type (e.g., Armed clash, Peaceful protest).

actor1:
The primary actor involved in the event.

assoc_actor_1:
Additional or supporting actor(s) associated with actor1.

inter1:
A categorical code describing actor1’s type (e.g., Rebel group, State forces).

actor2:
The secondary actor involved in the event (which may represent a target or another involved party).

assoc_actor_2:
Additional or supporting actor(s) associated with actor2.

inter2:
A categorical code describing actor2’s type.

interaction:
A combined description derived from inter1 and inter2, indicating the nature of the interaction between the actors.

civilian_targeting:
An indicator specifying whether civilians were the primary target. In this file it is either “Civilian targeting” or missing.

iso:
A three-digit ISO numeric code representing the country where the event occurred.

region:
The geographic region in which the event took place (e.g., Eastern Africa).

country:
The country in which the event occurred.

admin1:
The primary sub-national administrative region (e.g., state or province).

admin2:
The secondary sub-national administrative region (if applicable).

admin3:
The tertiary sub-national administrative region (if available).

location:
The specific name of the place where the event occurred.

latitude:
The latitude coordinate of the event location in decimal degrees.

longitude:
The longitude coordinate of the event location in decimal degrees.

geo_precision:
A numeric code (1–3) indicating the precision of the geographic data, with 1 being the most precise.

source:
The source(s) used to report the event, which may include multiple sources separated by semicolons.

source_scale:
Indicates the geographic closeness of the source to the event (e.g., Local partner, National).

notes:
Additional descriptive information about the event.

fatalities:
The number of reported fatalities resulting from the event (0 if none are reported).

tags:
Keywords or structured tags that provide additional context to the event (e.g., “women targeted: politicians”).

timestamp:
A Unix timestamp representing when the event was uploaded to the ACLED API, capturing the exact date and time of data entry.
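
Since `event_date` is loaded as a string and `timestamp` as an integer, both usually need converting before any time-based analysis. The following is a minimal sketch on a toy frame whose values are copied from the first rows shown earlier; the `sample` frame and the `uploaded_at` column name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame mirroring two columns of the dataset (values taken from the
# first rows displayed by data.head()).
sample = pd.DataFrame({
    "event_date": ["28/02/2025", "28/02/2025"],
    "timestamp": [1741039795, 1741039795],
})

# Parse the day-first strings with an explicit format so that a value such
# as 01/02/2025 is read as 1 February, not 2 January.
sample["event_date"] = pd.to_datetime(sample["event_date"], format="%d/%m/%Y")

# Convert the Unix timestamp (seconds since the epoch) to a UTC datetime.
# "uploaded_at" is a hypothetical column name chosen for this example.
sample["uploaded_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)

print(sample.dtypes)
```

Giving an explicit `format` also makes parsing fail loudly if a row deviates from the expected layout, rather than silently guessing.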




**e) Datatypes**


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413947 entries, 0 to 413946
Data columns (total 31 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   event_id_cnty       413947 non-null  object 
 1   event_date          413947 non-null  object 
 2   year                413947 non-null  int64  
 3   time_precision      413947 non-null  int64  
 4   disorder_type       413947 non-null  object 
 5   event_type          413947 non-null  object 
 6   sub_event_type      413947 non-null  object 
 7   actor1              413947 non-null  object 
 8   assoc_actor_1       114747 non-null  object 
 9   inter1              413947 non-null  object 
 10  actor2              301956 non-null  object 
 11  assoc_actor_2       85527 non-null   object 
 12  inter2              301956 non-null  object 
 13  interaction         413947 non-null  object 
 14  civilian_targeting  123147 non-null  object 
 15  iso                 413947 non-nul

**f) Duplicates**


In [13]:
data.duplicated().sum()


0
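
`data.duplicated()` only flags rows that are identical in every column. Because `event_id_cnty` is described as a unique identifier, it may also be worth checking for repeated IDs. The sketch below uses hypothetical toy data to show the difference between the two checks; it does not claim that the real dataset contains repeated IDs.

```python
import pandas as pd

# Hypothetical toy data: the last two rows share an event ID but differ in
# another column, so a full-row comparison does not flag them.
toy = pd.DataFrame({
    "event_id_cnty": ["MLI1", "GHA2", "GHA2"],
    "fatalities": [0, 1, 2],
})

# Rows identical across every column
full_row_dupes = toy.duplicated().sum()

# Rows whose identifier has already appeared earlier in the frame
id_dupes = toy.duplicated(subset="event_id_cnty").sum()

print(f"Full-row duplicates: {full_row_dupes}, duplicate IDs: {id_dupes}")
```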


**g) Missing Values**

Next, we look at the percentage of missing values in each column.

In [15]:
data.isna().sum()/data.shape[0]*100


event_id_cnty          0.000000
event_date             0.000000
year                   0.000000
time_precision         0.000000
disorder_type          0.000000
event_type             0.000000
sub_event_type         0.000000
actor1                 0.000000
assoc_actor_1         72.279785
inter1                 0.000000
actor2                27.054430
assoc_actor_2         79.338659
inter2                27.054430
interaction            0.000000
civilian_targeting    70.250539
iso                    0.000000
region                 0.000000
country                0.000000
admin1                 0.004107
admin2                 0.922823
admin3                49.294958
location               0.000000
latitude               0.000000
longitude              0.000000
geo_precision          0.000000
source                 0.000000
source_scale           0.000000
notes                  0.000000
fatalities             0.000000
tags                  77.179204
timestamp              0.000000
dtype: float64
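
The same percentages can be computed with `isna().mean()`, since the mean of a boolean mask is the fraction of missing values. A minimal sketch on hypothetical toy data, which also hides fully populated columns and sorts the rest so the worst columns come first:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    "actor1": ["A", "B", "C", "D"],
    "actor2": ["X", np.nan, "Y", "Z"],
    "tags": [np.nan, np.nan, np.nan, "crowd size=no report"],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = toy.isna().mean().mul(100)

# Keep only columns with at least one missing value, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

print(missing_pct.round(2))
```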

We will start by inspecting the columns in which more than 50% of the values are missing.

In [18]:
# Percentage of missing values per column, computed once
missing_pct = data.isna().sum()/data.shape[0]*100
for col in data.columns:
    if missing_pct[col] > 50:
        print(f'{col}\n Missing value percent {missing_pct[col]:.2f}\n Unique values {data[col].unique()}')

assoc_actor_1
 Missing value percent 72.28
 Unique values [nan 'Wagner Group'
 'Al Adl Wa Al Ihssane; Labor Group (Morocco); LMDDH: Moroccan League for the Defense of Human Rights'
 ... 'San Ethnic Group (Namibia)'
 'Military Forces of Ethiopia (1991-2018); NDA: National Democratic Alliance (Sudan)'
 'Military Forces of the Central African Republic (1993-2003) Yakoma Faction']
assoc_actor_2
 Missing value percent 79.34
 Unique values ['Wagner Group' nan 'Farmers (Ghana)' ...
 'Aid Workers (Spain); Women (Spain); Aid Workers (United States)'
 'NADECO: National Democratic Coalition Militia'
 'Government of the United States (1993-2001)']
civilian_targeting
 Missing value percent 70.25
 Unique values [nan 'Civilian targeting']
tags
 Missing value percent 77.18
 Unique values [nan 'crowd size=no report' 'crowd size=scores' ...
 'crowd size=at least 35; sexual violence'
 'crowd size=3; women targeted: candidates for office'
 'crowd size=no report; sexual violence; women targeted: protesters

Each of these columns is more than 70% missing. They hold text values, so mean or median imputation does not apply, and filling the gaps with the mode would heavily distort their distributions. We therefore drop them.

In [20]:

# Collect every column with more than 50% missing values, then drop them
columns_to_drop = [col for col in data.columns if data[col].isna().sum()/data.shape[0]*100 > 50]
data = data.drop(columns=columns_to_drop)

print(data.isna().sum())

event_id_cnty          0
event_date             0
year                   0
time_precision         0
disorder_type          0
event_type             0
sub_event_type         0
actor1                 0
inter1                 0
actor2            111991
inter2            111991
interaction            0
iso                    0
region                 0
country                0
admin1                17
admin2              3820
admin3            204055
location               0
latitude               0
longitude              0
geo_precision          0
source                 0
source_scale           0
notes                  0
fatalities             0
timestamp              0
dtype: int64
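
Dropping `civilian_targeting` also discards the information carried by the presence or absence of a value, since the column only ever holds 'Civilian targeting' or nothing. If that signal is wanted later, one option is to convert it to a boolean flag before dropping. The sketch below uses hypothetical toy data and rests on the assumption, not confirmed in this notebook, that a missing value means the event was not coded as targeting civilians; `civilians_targeted` is an illustrative column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same two possible states as the real column
toy = pd.DataFrame({
    "event_id_cnty": ["MLI1", "GHA2", "BFO3"],
    "civilian_targeting": [np.nan, "Civilian targeting", np.nan],
})

# ASSUMPTION: a missing value means "no civilian targeting was coded".
# Under that reading, presence of any value becomes True.
toy["civilians_targeted"] = toy["civilian_targeting"].notna()

# The original sparse text column can then be dropped without losing the signal
toy = toy.drop(columns="civilian_targeting")

print(toy)
```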


**h) Variable Types**


In [21]:
# Separate categorical columns
categorical_columns = data.select_dtypes(include=['object']).columns.tolist()

# Separate continuous (numerical) columns
continuous_columns = data.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Display the separated columns
print(f"We have {len(categorical_columns)} categorical columns and {len(continuous_columns)} continuous columns")

We have 19 categorical columns and 8 continuous columns
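
Not every numeric column is a continuous measurement: `iso`, `time_precision` and `geo_precision` are numeric codes, so statistics such as their mean carry little meaning. A minimal sketch of splitting numeric columns into codes and measurements follows. The `code_like` set is a hand-picked assumption based on the column descriptions, and the toy values are illustrative.

```python
import pandas as pd

# Hypothetical toy frame with a mix of numeric codes, measurements and text
toy = pd.DataFrame({
    "iso": [466, 854],
    "geo_precision": [2, 1],
    "latitude": [18.9689, 11.3693],
    "fatalities": [0, 0],
    "country": ["Mali", "Burkina Faso"],
})

# All numeric columns, regardless of integer or float width
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# ASSUMPTION: these names are chosen by hand from the column descriptions
code_like = {"iso", "time_precision", "geo_precision"}

coded_columns = [c for c in numeric_columns if c in code_like]
measured_columns = [c for c in numeric_columns if c not in code_like]

print(f"Numeric codes: {coded_columns}")
print(f"Measurements: {measured_columns}")
```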
