# Modeling our Data

In this notebook we attempt to model our data and see which patterns we can recognize in the dataset.

In [6]:
import pandas as pd
import numpy as np
import datetime as dt
from altair import *
from IPython.display import Image, display

In [7]:
from sklearn.linear_model import SGDRegressor, Lasso, SGDClassifier
from sklearn.metrics import r2_score, accuracy_score
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

In [8]:
data = pd.read_pickle('/data/augiedoebling/pickledGTD')

In [9]:
# Countries where more than 90% of the population is Muslim
islamic90 = ['Maldives','Mauritania','Afghanistan','Tunisia','Iran','Western Sahara','Morocco','Tajikistan','Yemen','Iraq','Somalia','Mayotte','Turkey','Azerbaijan','Comoros','Niger','Algeria','Palestine','Saudi Arabia','Djibouti','Sudan','Libya','Uzbekistan','Pakistan','Senegal','Kosovo','Gambia','Mali','Jordan','Turkmenistan','Egypt','Syria']

***
## Predicting Region

In [10]:
Xregion = pd.DataFrame()
Xregion['islamic90'] = np.where(data.country.isin(islamic90), 1, 0)
Xregion['onUS'] = data.inUS_yes
# REMINDER: PERS = Political, Economic, Religious or Social goal
Xregion['PERS'] = data.PERS
# REMINDER: CIP = goal to Coerce, Intimidate or Publicize to an audience
Xregion['CIP'] = data.CIP
Xregion['outsidehumanlaw'] = data.outsidehumlaw
Xregion['multiple'] = data.multiple
Xregion['suicide'] = data.suicide
Xregion['international'] = data.international.replace(np.nan, 0).astype(int)
Xregion['fatalities'] = data.fatalities
Xregion['success'] = data.success
Xregion.head()

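| | islamic90 | onUS | PERS | CIP | outsidehumanlaw | multiple | suicide | international | fatalities | success |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1.0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0.0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1.0 | 1 |
| 3 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0.0 | 1 |
| 4 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0.0 | 1 |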


In [11]:
region = data.region

In [12]:
Xtrain, Xtest, ytrain, ytest = train_test_split(Xregion, region, test_size=0.20)
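
The split above is random on every run, so the accuracy reported later will vary slightly between executions. A minimal sketch of a repeatable, class-balanced split follows. It uses a small synthetic frame (`X_demo`, `y_demo`, both hypothetical stand-ins for `Xregion` and `region`) and assumes the region labels are imbalanced, which has not been checked here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Xregion / region (hypothetical values, not the GTD data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "suicide": rng.integers(0, 2, size=200),
    "success": rng.integers(0, 2, size=200),
})
# Imbalanced labels: region 1 is far more common than regions 2 and 3.
y_demo = pd.Series(np.repeat([1, 2, 3], [120, 60, 20]), name="region")

# random_state makes the split repeatable; stratify keeps the class
# proportions the same in the train and test portions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Applied to the notebook, this would amount to passing `random_state=...` and `stratify=region` to the existing `train_test_split` call.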

In [17]:
model = SGDClassifier()

In [14]:
model = model.fit(Xtrain, ytrain)

In [15]:
ypred = model.predict(Xtest)

In [16]:
accuracy_score(ytest, ypred)

0.34874820602774675

### Region results

The accuracy score of about 0.35 is well above the roughly 0.083 (about 1 in 12) we could expect from uniform random guessing, so the model is doing fairly well relative to chance. However, we will look at one additional target to see what we can predict there.
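
Uniform guessing is the weakest possible baseline. If some regions are much more common than others, always predicting the most frequent region would already beat 0.083. The sketch below compares both baselines with scikit-learn's `DummyClassifier` on synthetic labels. The 12 label values and their frequencies in `probs` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical, uneven frequencies for 12 region labels (sums to 1).
probs = np.array([0.25, 0.20, 0.15, 0.10, 0.08, 0.06,
                  0.05, 0.04, 0.03, 0.02, 0.01, 0.01])
y = rng.choice(np.arange(1, 13), size=5000, p=probs)
X = np.zeros((len(y), 1))  # dummy estimators ignore the features

# Baseline 1: always predict the most common label.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
# Baseline 2: guess uniformly at random among the labels.
uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)

majority_acc = accuracy_score(y, majority.predict(X))
uniform_acc = accuracy_score(y, uniform.predict(X))
print("most_frequent accuracy:", majority_acc)
print("uniform accuracy:", uniform_acc)
```

Running the same two baselines on `Xtrain`/`ytrain` and scoring on `Xtest`/`ytest` would show how much of the 0.35 is due to the features rather than to class frequencies.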

***
## Predicting Fatality Count

In [12]:
Xfate = pd.DataFrame()
Xfate['islamic90'] = np.where(data.country.isin(islamic90), 1, 0)
Xfate['onUS'] = data.inUS_yes
# REMINDER: PERS = Political, Economic, Religious or Social goal
Xfate['PERS'] = data.PERS
# REMINDER: CIP = goal to Coerce, Intimidate or Publicize to an audience
Xfate['CIP'] = data.CIP
Xfate['outsidehumanlaw'] = data.outsidehumlaw
Xfate['multiple'] = data.multiple
Xfate['suicide'] = data.suicide
Xfate['international'] = data.international.replace(np.nan, 0).astype(int)
Xfate['region'] = data.region_code
Xfate['success'] = data.success
Xfate['propvalue'] = data.propvalue
Xfate['wounded'] = data.wounded
Xfate.head()

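| | islamic90 | onUS | PERS | CIP | outsidehumanlaw | multiple | suicide | international | region | success | propvalue | wounded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 1 | 0.0 | 0.0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0.0 | 0.0 |
| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 5 | 1 | 0.0 | 0.0 |
| 3 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 8 | 1 | 0.0 | 0.0 |
| 4 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 4 | 1 | 0.0 | 0.0 |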
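
The `region` column above holds integer codes. Assuming those codes are arbitrary labels rather than an ordered quantity, a linear model would treat region 8 as "four times" region 2, which is not meaningful. One common alternative is one-hot encoding, sketched here on a tiny hypothetical frame (`X_demo`) that mirrors the first five `region` values shown above.

```python
import pandas as pd

# Hypothetical miniature of Xfate with only two of its columns.
X_demo = pd.DataFrame({
    "suicide": [0, 0, 1, 0, 0],
    "region": [2, 1, 5, 8, 4],
})

# Replace the single integer column with one 0/1 indicator column per region.
X_encoded = pd.get_dummies(X_demo, columns=["region"], prefix="region", dtype=int)
print(X_encoded)
```

Each row ends up with exactly one `region_*` column set to 1, so the model can learn a separate coefficient per region.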


In [13]:
fatalities = data.fatalities

In [14]:
Xtrain, Xtest, ytrain, ytest = train_test_split(Xfate, fatalities, test_size=0.20)

In [15]:
model = SGDRegressor()

In [16]:
model = model.fit(Xtrain, ytrain)

In [17]:
ytrainpred = model.predict(Xtrain)
ypred = model.predict(Xtest)

In [18]:
print("Train r2:", r2_score(ytrain, ytrainpred), "Test r2:", r2_score(ytest, ypred))

Train r2: -1.99744579558e+47 Test r2: -2.36345614809e+46


R² values this far below zero show that the SGDRegressor is not working at all, so let's try a different regression model.
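
A likely cause of scores this extreme is feature scale: stochastic gradient descent is sensitive to features with very different magnitudes, and a column such as `propvalue` may be orders of magnitude larger than the 0/1 flags (an assumption, since the first rows shown are all 0.0). The sketch below standardizes the features before `SGDRegressor` in a pipeline. The data is synthetic and the column meanings are hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic features: one binary flag and one large-magnitude column
# (standing in for something like a dollar amount).
small = rng.integers(0, 2, size=n).astype(float)
large = rng.uniform(0, 1e6, size=n)
X = np.column_stack([small, large])

# Target depends linearly on both features plus a little noise.
y = 2.0 * small + 3e-6 * large + rng.normal(0, 0.1, size=n)

# StandardScaler rescales each column to zero mean and unit variance
# before the SGD updates see it.
scaled_model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_model.fit(X, y)

scaled_r2 = r2_score(y, scaled_model.predict(X))
print("Scaled SGDRegressor R^2:", scaled_r2)
```

Whether scaling alone would rescue the SGDRegressor on the real features is untested here; it only removes one known failure mode.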

In [19]:
model = Lasso(alpha=0.01)
model = model.fit(Xtrain, ytrain)

In [20]:
ytrainpred = model.predict(Xtrain)
ypred = model.predict(Xtest)

In [22]:
print("Train r2:", r2_score(ytrain, ytrainpred), "Test r2:", r2_score(ytest, ypred))

Train r2: 0.0486737152258 Test r2: 0.0680202086616


### Fatalities Results
Although the Lasso model worked significantly better than the SGDRegressor, a test R² of about 0.07 means it explains very little of the variance, so we still cannot predict fatality counts with our data.
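
To put the R² scores above in context: R² is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. A score of 0 therefore matches the mean predictor, and a negative score is worse than it. The sketch below computes this by hand on a few made-up values (`y_true` is hypothetical, not from the dataset) and checks it against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean predictor's squared error
    return 1.0 - ss_res / ss_tot

y_true = np.array([0.0, 1.0, 0.0, 3.0, 1.0])  # hypothetical fatality counts

mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = y_true + 10.0                         # systematically far off

r2_mean = r2_by_hand(y_true, mean_pred)
r2_bad = r2_by_hand(y_true, bad_pred)
print("mean predictor R^2:", r2_mean)
print("bad predictor R^2:", r2_bad)
```

Seen this way, the Lasso result is only slightly better than predicting the average fatality count for every attack.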

***
# Modeling Conclusions
While we could not predict fatality counts with any useful accuracy, we were reasonably successful in predicting region. An accuracy of just under 35% is far from perfect, but it is significantly better than random guessing.