## Udacity Project 1: Conflict Prediction with a Random Forest Model

#### Data and Package Loading

Import the necessary packages

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

Read in the data from a CSV file downloaded from the World Bank database

In [0]:
df = pd.read_csv(
    '/Workspace/Users/j60849@eon.com/Udacity_Project_1/Dataset_World_Bank.csv',
    sep='","',
    skipinitialspace=True,
    engine='python',
    quotechar='"'
)

# Print the data to evaluate pre-processing steps
df.head()

#### Data Pre-processing

Strip the column headers of their double quotes

In [0]:
df.columns = df.columns.str.strip('"')

Remove double quotes from the data

In [0]:
def quote_remover(df):
    """Remove double quotes from all entries of a pandas DataFrame whose columns hold strings."""
    for col in df.columns:
        # regex=False treats the double quote as a literal character
        df[col] = df[col].str.replace('"', '', regex=False)
    return df

df = quote_remover(df)

Separate the country name from the country code in the first column

In [0]:
df['Country_Name'] = df.iloc[:, 0].str.split(',').str[0]

Select the columns needed for modelling

In [0]:
df_model = df[['Country_Name', 'Series Name', 'Time', 'Value']]

Pivot the data so that each unique Series Name becomes a separate column

In [0]:
df_model = df_model.drop_duplicates(subset=['Country_Name', 'Time', 'Series Name'])
df_pivot = df_model.pivot(index=['Country_Name', 'Time'], columns='Series Name', values='Value').reset_index()
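
To illustrate the reshaping step, here is a minimal sketch on a made-up long-format frame. The countries, series names and values are invented for illustration and are not taken from the World Bank file.

```python
import pandas as pd

# Toy long-format data; all values are made up for illustration only
toy_long = pd.DataFrame({
    'Country_Name': ['A', 'A', 'A', 'B', 'B'],
    'Time': ['2020', '2020', '2020', '2020', '2020'],
    'Series Name': ['GDP growth', 'GDP growth', 'Population growth',
                    'GDP growth', 'Population growth'],
    'Value': ['1.5', '1.5', '0.8', '2.1', '1.1'],
})

# pivot() raises a ValueError on duplicate (index, columns) pairs,
# so the duplicated row for country A is removed first
toy_long = toy_long.drop_duplicates(subset=['Country_Name', 'Time', 'Series Name'])

toy_wide = toy_long.pivot(
    index=['Country_Name', 'Time'],
    columns='Series Name',
    values='Value',
).reset_index()
toy_wide.columns.name = None  # drop the leftover 'Series Name' axis label

print(toy_wide)
```

Each (country, year) pair ends up as one row, with one column per series.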

Replace empty or whitespace-only strings with null values

In [0]:
# np.nan is used instead of None, because older pandas versions ignore an explicit None here
df_pivot = df_pivot.replace(r'^\s*$', np.nan, regex=True)

Inspect the dataframe with .info() to see the number of missing values per column

In [0]:
df_pivot.info()

Convert all indicator columns to float

In [0]:
for column in df_pivot.columns:
    if df_pivot[column].dtype == 'object' and column != 'Country_Name' and column != 'Time':
        df_pivot[column] = df_pivot[column].astype('float')
df_pivot.info()
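
The `astype('float')` conversion above fails if any non-numeric token is left in a column. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors='coerce'` on toy data, assuming the export might mark missing values with a token such as `..` (the toy frame and the `id_cols` and `value_cols` names are illustrative, not part of the original notebook).

```python
import pandas as pd

# Toy data with a placeholder token and a blank entry
toy = pd.DataFrame({
    'Country_Name': ['A', 'B', 'C'],
    'Time': ['2020', '2020', '2020'],
    'GDP growth': ['1.5', '..', ' '],
})

id_cols = ['Country_Name', 'Time']
value_cols = [c for c in toy.columns if c not in id_cols]

# errors='coerce' turns anything that cannot be parsed as a number into NaN
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors='coerce')

print(toy.dtypes)
print(toy)
```

The trade-off is that coercion silently hides unexpected tokens, so it is worth checking the null counts before and after the conversion.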

Drop the columns with too many missing values

In [0]:
df_pivot.drop(
    [
        'Arms exports (SIPRI trend indicator values)',
        'Arms imports (SIPRI trend indicator values)',
        'Central government debt, total (% of GDP)',
        'International migrant stock (% of population)',
        'School enrollment, secondary, male (% net)',
    ],
    axis=1,
    inplace=True,
)

Drop the rows for 2024, since much of that year's data has not yet been fully processed into the World Bank database

In [0]:
df_pivot = df_pivot[df_pivot['Time'] != '2024']

Count the number of null values per country to see whether some countries have too little data to be eligible for the model

In [0]:
null_counts = df_pivot.groupby('Country_Name').apply(lambda x: x.isnull().sum().sum())
null_counts_df = null_counts.to_frame(name='null_count').reset_index()
display(null_counts_df)

Remove countries and territories that lack too much data (several of them small island nations)

In [0]:
# Countries and territories with too many missing values
excluded_countries = [
    'Comoros',
    'Eritrea',
    'Sao Tome and Principe',
    'Seychelles',
    'Somalia',
    'South Sudan',
    'Syrian Arab Republic',
    'West Bank and Gaza',
]
# .copy() avoids a SettingWithCopyWarning when columns are added later
df_pivot = df_pivot[~df_pivot['Country_Name'].isin(excluded_countries)].copy()
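
The exclusion list above is written out by hand. As a sketch of how it could instead be derived from the null counts, the example below filters a toy frame by the share of missing cells per country. The cut-off `MAX_NULL_SHARE` is a hypothetical value chosen for illustration, not the criterion used in this project.

```python
import numpy as np
import pandas as pd

# Toy wide-format data; values are made up for illustration only
toy = pd.DataFrame({
    'Country_Name': ['A', 'A', 'B', 'B'],
    'Time': ['2020', '2021', '2020', '2021'],
    'x1': [1.0, 2.0, np.nan, np.nan],
    'x2': [0.5, np.nan, np.nan, 3.0],
})

# Share of missing cells per country, over the indicator columns only
indicator_cols = toy.columns.difference(['Country_Name', 'Time'])
null_share = (
    toy[indicator_cols].isna().astype(int)
    .groupby(toy['Country_Name']).mean()  # missing share per country and column
    .mean(axis=1)                         # averaged over the columns
)

MAX_NULL_SHARE = 0.5  # hypothetical cut-off
keep = null_share[null_share <= MAX_NULL_SHARE].index
toy_filtered = toy[toy['Country_Name'].isin(keep)].copy()

print(null_share)
print(toy_filtered)
```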

Construct a conflict indicator from the Political Stability estimate. A country-year is flagged as conflict when the estimate is below -2, since the perception of political stability and absence of violence is then at least two standard deviations below the global mean.

In [0]:
df_pivot['IND_CONFLICT'] = np.where((df_pivot['Political Stability and Absence of Violence/Terrorism: Estimate'] < -2), 1, 0)
display(df_pivot)
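
One detail worth checking in this step: a comparison with NaN evaluates to False, so `np.where` labels rows with a missing estimate as 0 (no conflict). The sketch below shows this on toy values and offers one possible variant that keeps the label missing instead. The column names `estimate`, `IND_NAIVE` and `IND_MASKED` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy estimates, including one missing value
toy = pd.DataFrame({'estimate': [-2.4, -1.0, np.nan, 0.7]})

# Same rule as above: NaN < -2 is False, so the missing row becomes 0
toy['IND_NAIVE'] = np.where(toy['estimate'] < -2, 1, 0)

# Variant that leaves the label missing when the estimate is missing,
# using the nullable integer dtype 'Int64'
toy['IND_MASKED'] = (toy['estimate'] < -2).astype('Int64').mask(toy['estimate'].isna())

print(toy)
```

Whether missing estimates should count as "no conflict" or be dropped is a modelling decision; the sketch only makes the difference visible.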

#### Data Exploration

Generate some summary statistics of the data

In [0]:
df_pivot.describe()

Make histograms of all important variables to see their distributions

In [0]:
df_pivot.hist(bins=50, figsize=(25,15))
plt.show()

As the histograms above show, access to electricity is concentrated near 100, foreign direct investment contains some outliers, and all other variables except population growth, GDP growth and government effectiveness are right-skewed.
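
The visual impression of skew can be backed up with a number. The sketch below computes the sample skewness with `DataFrame.skew()` on synthetic columns (the data and column names are invented for illustration). Values near 0 indicate a roughly symmetric distribution and clearly positive values indicate right skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one symmetric and one right-skewed variable
toy = pd.DataFrame({
    'symmetric': rng.normal(size=1000),
    'right_skewed': rng.lognormal(size=1000),
})

# Sample skewness per column, largest first
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```

On the real data, the same call would be `df_pivot.skew(numeric_only=True)`, assuming a pandas version that supports the `numeric_only` argument.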

Calculate the correlation matrix to detect potential multicollinearity

In [0]:
plt.figure(figsize=(25, 15))
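# numeric_only=True skips the string columns (Country_Name, Time), which raise an error in pandas >= 2.0
correlation_matrix = df_pivot.corr(numeric_only=True)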
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

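Also check multicollinearity with the variance inflation factor (VIF), which captures linear dependence among several features at once and can therefore reveal collinearity that pairwise correlations miss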

In [0]:
# Multicollinearity Test: VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Ensure all columns are numeric and handle missing values
X = df_pivot.drop(['IND_CONFLICT', 'Country_Name', 'Time'], axis=1).apply(pd.to_numeric, errors='coerce').fillna(0)

# VIF dataframe
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns

# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

print(vif_data)

Show the correlation of each feature with the target to identify potentially strong and weak features

In [0]:
# numeric_only=True skips the string columns (Country_Name, Time)
correlation_matrix = df_pivot.corr(numeric_only=True)
correlation_matrix["IND_CONFLICT"].sort_values(ascending=False)

As the correlations above show, none of the individual variables except government effectiveness appears to be strongly correlated with the target variable IND_CONFLICT. The strong correlation with political stability is to be expected, since IND_CONFLICT is derived from that variable.

#### Modelling

A logistic regression could be used, since the target variable IND_CONFLICT is binary. A random forest is chosen instead because it is robust to outliers and to skewed feature distributions, and because it can cope with the class imbalance in the target, for example through class weighting.
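
As a minimal sketch of this setup, the example below fits a class-weighted random forest on synthetic, imbalanced data. The data from `make_classification` stands in for the real features and IND_CONFLICT, and the hyperparameters (`n_estimators=200`, `class_weight='balanced'`, the 25% test split) are illustrative assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real features and IND_CONFLICT
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# stratify keeps the share of the rare class equal in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# class_weight='balanced' reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)

# Per-class precision and recall are more informative than accuracy here
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```

When applying this to `df_pivot`, the Political Stability estimate would presumably need to be excluded from the features, since the target is derived directly from it.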