# GTD Challenge
## Challenge task: 

__*Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2014. Some portion of the attacks have not been attributed to a particular terrorist group.*__


__*Use attack type, weapons used, description of the attack, etc. to build a model that can predict what group may have been responsible for an incident.*__ 

To obtain the most accurate predictions possible, I have chosen to investigate the following learning models:

- Decision Trees
- Random Forest
- Extra Trees
- Extreme Gradient Boosting
- Support Vector Machine
- Light Gradient Boosting
- K-Nearest-Neighbours

***

## Preprocessing the dataframe:

In [1]:
#Import all packages used in analysis:
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from operator import itemgetter
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import ExtraTreesClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.preprocessing import StandardScaler  
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt 
from sklearn.svm import SVC 
from sklearn import preprocessing

In [2]:
#Import the dataset 
dataset=pd.read_csv('globalterrorismdb_0718dist.csv', encoding = "ISO-8859-1")
dataset.head()
#The warning shown below does not affect our analysis: every column it mentions is
#dropped from the final dataset during preprocessing.

  interactivity=interactivity, compiler=compiler, result=result)


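| | eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | addnotes | scite1 | scite2 | scite3 | dbsource | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | related |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 197000000001 | 1970 | 7 | 2 | | 0 | | 58 | Dominican Republic | 2 | ... | | | | | PGIS | 0 | 0 | 0 | 0 | |
| 1 | 197000000002 | 1970 | 0 | 0 | | 0 | | 130 | Mexico | 1 | ... | | | | | PGIS | 0 | 1 | 1 | 1 | |
| 2 | 197001000001 | 1970 | 1 | 0 | | 0 | | 160 | Philippines | 5 | ... | | | | | PGIS | -9 | -9 | 1 | 1 | |
| 3 | 197001000002 | 1970 | 1 | 0 | | 0 | | 78 | Greece | 8 | ... | | | | | PGIS | -9 | -9 | 1 | 1 | |
| 4 | 197001000003 | 1970 | 1 | 0 | | 0 | | 101 | Japan | 4 | ... | | | | | PGIS | -9 | -9 | 1 | 1 | |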


In [3]:
print('Properties of the dataframe: ')
dataset.info() 
print(dataset.shape)

Properties of the dataframe: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Columns: 135 entries, eventid to related
dtypes: float64(55), int64(22), object(58)
memory usage: 187.1+ MB
(181691, 135)


The dataframe has 135 variables, so we need to remove those that are unnecessary for a predictive model. These include columns that are merely the text version of another column, and columns that are subcategories of other, more critical variables. Note also that many columns are not available for every event (i.e. they contain nulls).

In [4]:
#We keep the numeric equivalents of the text columns, as these are already in a
#numeric format that our classification techniques can use, and drop the '_txt' versions.

dataset = dataset[dataset.columns.drop(list(dataset.filter(regex='_txt')))]

print("After removing the columns which have a complementary translated numerical column we obtain a \
dataframe with ", len(dataset.columns.values), " columns.")

After removing the columns which have a complementary translated numerical column we obtain a dataframe with  107  columns.


In [5]:
#Delete the 'eventid' column: it is a unique identifier for each event, so it carries no predictive signal.

del dataset['eventid'] 

In [6]:
#Drop every column that contains missing values, since many of the classifiers used below cannot handle nulls
dataset = dataset.dropna(axis=1)
print("After removing the columns that include missing entries (null) we obtain a \
dataframe with ", len(dataset.columns.values), " columns.")

After removing the columns that include missing entries (null) we obtain a dataframe with  23  columns.


In [7]:
print("There are ", dataset['gname'].nunique()-1, " unique elements in the 'gname' (perpetrator) category \
in our ", len(dataset['gname']), " element dataset (excluding 'Unknown' class). This means our target variable has \
high cardinality, which will indeed affect the speed and memory usage of our classifiers.")

There are  3536  unique elements in the 'gname' (perpetrator) category in our  181691  element dataset (excluding 'Unknown' class). This means our target variable has high cardinality, which will indeed affect the speed and memory usage of our classifiers.


In [8]:
print("The columns that are non-numeric and therefore need to be converted into a numeric \
form (aiding the classification models) are the following: ")
dataset.select_dtypes(exclude=['int64', 'int32', 'float64']).head()

The columns that are non-numeric and therefore need to be converted into a numeric form (aiding the classification models) are the following: 


| | gname | dbsource |
|---|---|---|
| 0 | MANO-D | PGIS |
| 1 | 23rd of September Communist League | PGIS |
| 2 | Unknown | PGIS |
| 3 | Unknown | PGIS |
| 4 | Unknown | PGIS |


Note that both variables above are nominal, so there is no inherent order among their values that the encoding needs to preserve.

In [9]:
#First translate dbsource column using Label encoding:
h=dataset['dbsource'].values #need to convert dataframe to array format to fit encoder
label_encoder = LabelEncoder()
label_encoderdb = label_encoder.fit(h)
label_encoded_h = label_encoderdb.transform(h)
#Delete the original 'dbsource' column from dataset and add the newly encoded 'dbsource' columns to original dataset:
del dataset['dbsource']
dataset = pd.concat([dataset,pd.DataFrame(label_encoded_h)],axis=1)
dataset.rename(columns={0: 'dbsource'}, inplace=True)
dataset.head()

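| | iyear | imonth | iday | extended | country | region | vicinity | crit1 | crit2 | crit3 | ... | targtype1 | gname | individual | weaptype1 | property | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | dbsource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970 | 7 | 2 | 0 | 58 | 2 | 0 | 1 | 1 | 1 | ... | 14 | MANO-D | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 13 |
| 1 | 1970 | 0 | 0 | 0 | 130 | 1 | 0 | 1 | 1 | 1 | ... | 7 | 23rd of September Communist League | 0 | 13 | 0 | 0 | 1 | 1 | 1 | 13 |
| 2 | 1970 | 1 | 0 | 0 | 160 | 5 | 0 | 1 | 1 | 1 | ... | 10 | Unknown | 0 | 13 | 0 | -9 | -9 | 1 | 1 | 13 |
| 3 | 1970 | 1 | 0 | 0 | 78 | 8 | 0 | 1 | 1 | 1 | ... | 7 | Unknown | 0 | 6 | 1 | -9 | -9 | 1 | 1 | 13 |
| 4 | 1970 | 1 | 0 | 0 | 101 | 4 | 0 | 1 | 1 | 1 | ... | 7 | Unknown | 0 | 8 | 1 | -9 | -9 | 1 | 1 | 13 |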


The 'dbsource' column is nominal, so its values have no inherent order. In theory this means the column should be one-hot encoded. However, having tried both approaches, I found that keeping the label-encoded 'dbsource' variable improved the accuracy by roughly 2-3%, so I shall not one-hot encode this variable.
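
The difference between the two encodings can be seen on a toy column. The sketch below is only an illustration and not part of the pipeline above: `toy` is a hypothetical dataframe, and `source_b` and `source_c` are placeholder category names rather than real GTD sources.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'dbsource' column; 'source_b' and 'source_c' are placeholders.
toy = pd.DataFrame({'dbsource': ['PGIS', 'source_b', 'PGIS', 'source_c']})

# Label encoding: a single integer column. The integers follow the sorted order of
# the category names, which is an arbitrary order for a nominal variable.
label_encoded = LabelEncoder().fit_transform(toy['dbsource'])

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(toy['dbsource'], prefix='dbsource')

print(label_encoded)   # one integer per row
print(one_hot.shape)   # (rows, number of distinct categories)
```

Tree-based models split on thresholds, so they can often cope with the arbitrary integer order of a label-encoded column, which may be one reason the label-encoded version performed no worse here.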

In [10]:
#Before we continue to encode the 'gname' column, we must extract the 'Unknown' perpetrator rows from the dataset
#and split the dataset into features and targets:

#let dfunknown be the dataset containing all the rows with 'Unknown' perpetrators (ready to predict from, once models are built).
dfunknown=dataset.loc[dataset['gname'] == 'Unknown'].reset_index(drop=True)

#all rows with 'Unknown' perpetrators must be excluded when making the training and test sets,
#so we remove them from dataset.
dataset=dataset[dataset.gname != 'Unknown'].reset_index(drop=True)

#dfgname is the 'gname' column on its own, ready to be used as target values for the training and test sets.
dfgname=dataset.pop('gname') 
#This leaves dataset as the features dataframe for the training and test sets (without unknown targets).

In [11]:
#Now translate the gname category from names to integers using Label Encoding.

y=dfgname.values #need to convert dataframe to array format to fit encoder
label_encodergn = label_encoder.fit(y)
label_encoded_y = label_encodergn.transform(y) 

In [12]:
#reformat the array of label encoded gname column into a data frame and name the column 'gname'.
dfgname=pd.DataFrame(label_encoded_y)
dfgname.columns=['gname']

In [13]:
#Before encoding further, we can split the data into training and test sets:

#X=features , y = targets i.e gname column

X_train, X_test, y_train, y_test = train_test_split(dataset, label_encoded_y, test_size=0.2, random_state=123)

#the following lengths should be the same as that of the targets:

print('Number of observations in the training data:', len(X_train)) 
print('Number of observations in the test data:',len(X_test))

Number of observations in the training data: 79127
Number of observations in the test data: 19782


Due to the high cardinality of the 'gname' target variable, I shall not one-hot encode it: doing so would add 3536 extra columns to the dataset and cause a memory error. Instead, I shall apply a binary encoder to this variable.
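
To make the idea concrete, here is a minimal sketch of binary encoding written directly in NumPy. It is an illustration of the principle only: the helper names `binary_encode` and `binary_decode` are hypothetical, and the exact column layout and category numbering produced by `ce.BinaryEncoder` may differ from this.

```python
import numpy as np

def binary_encode(labels, n_bits):
    """Return an (n_samples, n_bits) array of 0/1 values, most significant bit first."""
    labels = np.asarray(labels)
    shifts = np.arange(n_bits - 1, -1, -1)
    return (labels[:, None] >> shifts) & 1

def binary_decode(bits):
    """Invert binary_encode by weighting each column with its power of two."""
    bits = np.asarray(bits)
    weights = 1 << np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ weights

n_classes = 3536                         # number of known groups reported above
n_bits = (n_classes - 1).bit_length()    # columns needed to cover labels 0..n_classes-1

labels = np.array([0, 5, 3535])
bits = binary_encode(labels, n_bits)

print(n_bits)                # 12
print(bits[1])               # bit pattern for label 5
print(binary_decode(bits))   # recovers the original labels
```

Twelve columns replace the 3536 that a one-hot encoding would need. Because several bits can be set at once, recovering a label means reading the whole bit pattern; the position of a single set bit is not enough.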

In [14]:
#To encode the gname (without 'unknown') column further, first need to reformat the arrays of target \
#training and testing sets built above into data frames ready to encode:

y_traindf=pd.DataFrame(y_train)
y_traindf.columns=['gname']

y_testdf=pd.DataFrame(y_test)
y_testdf.columns=['gname']

y=dfgname
X=dfgname

# use binary encoding to encode categorical feature gname:
enc = ce.BinaryEncoder(cols=['gname']).fit(X, y)

# transform the dataset
numeric_train = enc.transform(y_traindf)
numeric_test = enc.transform(y_testdf)

Now that we have preprocessed our dataset, we can begin the classification process, starting with Decision Trees.

***
## Decision Tree Classification

In [None]:
#Train the model on the binary-encoded targets (one output column per bit)
tree = DecisionTreeClassifier(criterion = 'gini').fit(X_train,numeric_train)

#Predict the classes of new, unseen data
prediction = tree.predict(X_test)

#Check the accuracy (a prediction only counts as correct if every bit matches)
print("The prediction accuracy is: ",tree.score(X_test,numeric_test)*100,"%")

#Binary encoding is not one-hot, so argmax cannot recover the class. Instead, build a lookup
#from each bit pattern to its integer label using the fitted encoder:
labels = np.arange(len(label_encoder.classes_))
codebook = enc.transform(pd.DataFrame({'gname': labels})).values.astype(int)
pattern_to_label = {tuple(bits): label for bits, label in zip(codebook, labels)}

def decode(bit_rows):
    #Bit patterns that match no known group are decoded as -1
    return np.array([pattern_to_label.get(tuple(row), -1) for row in np.asarray(bit_rows).astype(int)])

cm = confusion_matrix(y_test, decode(prediction))
print("confusion matrix: ",cm)

print("10-Fold Cross Validation Score :", 100 * np.mean(cross_val_score(tree, X_train, numeric_train, cv=10, n_jobs=-1)), "%")
print("i.e according to accuracy_score, \
about ",accuracy_score(numeric_test, prediction, normalize=False), " of the ", len(prediction), "predictions were correct.")

# View a list of the features and their importance scores
feature_list = list(X_train.columns)
# Get numerical feature importances
importances = list(tree.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 4)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:22} Importance: {}'.format(*pair))

print("From the importance scores above, the feature",feature_importances[0][0],"was more important \
in classification than any other feature of the attack.")

#Before using our newly built tree to predict from our unknown data frame we need to remove the gname column of unknowns:
df_features = dfunknown.drop(['gname'], axis=1)

#now we can use the model to predict the unknown rows in df dataset:
predictiontree = tree.predict(df_features)

#Translate these predictions into names of perpetrators (binary --> label --> name);
#bit patterns that match no known group are reported as 'Unknown'.
decoded = decode(predictiontree)
resultstree = np.where(decoded >= 0, label_encoder.inverse_transform(np.clip(decoded, 0, None)), 'Unknown')
#np.set_printoptions(threshold=np.inf)  #If one needs to see whole array of translated results
print("Translated prediction (tree): ",resultstree)

***
## Random Forest Classification

To help balance bias and variance in our model, we shall next investigate bagging. This technique reduces the variance of the predictions by combining the results of multiple classifiers, each trained on a different sub-sample of the same data set. It can be implemented using Random Forests:
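
As a small, self-contained illustration of bagging, the sketch below compares a single decision tree with a bagged ensemble of trees on synthetic data. The data set from `make_classification` and all parameter values are assumptions chosen so that the example runs quickly; they are not the GTD features. A Random Forest goes one step further than plain bagging by also considering only a random subset of features at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features (assumed: 5 classes, 20 features).
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=8,
                                     n_classes=5, random_state=123)

single_tree = DecisionTreeClassifier(random_state=123)
# Each of the 50 trees is fit on a bootstrap sample of the rows; their votes are combined.
# The default base estimator of BaggingClassifier is a decision tree.
bagged_trees = BaggingClassifier(n_estimators=50, random_state=123)

tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=5)
bagged_scores = cross_val_score(bagged_trees, X_demo, y_demo, cv=5)

print("single tree : mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("bagged trees: mean %.3f, std %.3f" % (bagged_scores.mean(), bagged_scores.std()))
```

On data like this the bagged ensemble typically scores higher and varies less across folds than the single tree, although the exact numbers depend on the data and the random seed.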

In [None]:
#Build a forest:
#Here I have set the forest to use 500 trees, each considering a third of the features at every split
forest = RandomForestClassifier(n_estimators = 500, oob_score = True, n_jobs = -1,random_state =123, max_features=0.33).fit(X_train,numeric_train)
prediction= forest.predict(X_test)
print('Prediction (NUMERIC): ', prediction)
print("The prediction accuracy is: ",forest.score(X_test,numeric_test)*100,"%")
print("Out-of-bag Score :",forest.oob_score_) #out-of-bag score

#Binary encoding is not one-hot, so argmax cannot recover the class. Instead, build a lookup
#from each bit pattern to its integer label using the fitted encoder:
labels = np.arange(len(label_encoder.classes_))
codebook = enc.transform(pd.DataFrame({'gname': labels})).values.astype(int)
pattern_to_label = {tuple(bits): label for bits, label in zip(codebook, labels)}

def decode(bit_rows):
    #Bit patterns that match no known group are decoded as -1
    return np.array([pattern_to_label.get(tuple(row), -1) for row in np.asarray(bit_rows).astype(int)])

cm = confusion_matrix(y_test, decode(prediction))
print("confusion matrix: ",cm)
#Cross-validate the forest itself (not the earlier decision tree):
print("10-Fold Cross Validation Score :", 100 * np.mean(cross_val_score(forest, X_train, numeric_train, cv=10, n_jobs=-1)), "%")
print("i.e according to accuracy_score, \
about ",accuracy_score(numeric_test, prediction, normalize=False), " of the ", len(prediction), "predictions were correct.")

feature_list = list(X_train.columns)
# Get numerical feature importances
importances = list(forest.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 4)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
print("Feature importances according to the random forest:")
for pair in feature_importances:
    print('Variable: {:22} Importance: {}'.format(*pair))

print("From the importance scores above, the feature",feature_importances[0][0],"was still more important \
in classification than any other feature of the attack.")

#now we have constructed a forest we can use this to predict the unknown rows in df
predictionforest= forest.predict(df_features)

#Translate these predictions into names of perpetrators (binary --> label --> name);
#bit patterns that match no known group are reported as 'Unknown'.
decoded = decode(predictionforest)
resultsforest = np.where(decoded >= 0, label_encoder.inverse_transform(np.clip(decoded, 0, None)), 'Unknown')
#np.set_printoptions(threshold=np.inf)  #If one needs to see whole array of translated results
print("Translated prediction (forest): ",resultsforest)

We can see here that the prediction accuracy has increased from approximately 75.5% to 77.5% when moving from the decision tree model to the random forest model. I believe that further tuning of the forest's parameters could yield a higher level of accuracy.
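
One common way to tune a forest is a cross-validated grid search. The following is a minimal sketch on synthetic data: the parameter grid, the data set and the variable names (`X_demo`, `param_grid`, `search`) are all assumptions for illustration, kept deliberately small so that the search finishes quickly. On the full GTD data a much smaller grid or a randomised search would probably be needed to keep the running time manageable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features (assumed: 3 classes, 12 features).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, n_informative=6,
                                     n_classes=3, random_state=123)

# Candidate values to try; every combination is evaluated with 3-fold cross-validation.
param_grid = {
    'n_estimators': [25, 50],
    'max_features': [0.33, 'sqrt'],
    'min_samples_leaf': [1, 3],
}

search = GridSearchCV(RandomForestClassifier(random_state=123), param_grid,
                      cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)
```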

***
## Extra Trees Classification

In [None]:
etclf = ExtraTreesClassifier(n_estimators=1000, max_depth=8, random_state=0, n_jobs = -1)
etclf = etclf.fit(X_train, y_train)
prediction = etclf.predict(X_test)

In [None]:
print("The prediction accuracy is: ",etclf.score(X_test,y_test)*100,"%")

In [None]:
#Use the extra-trees model (not the random forest) to predict the unknown rows
predictionextratrees = etclf.predict(df_features)

In [None]:
#want to translate these predictions into names of perpetrators.
#etclf was trained on the label-encoded targets, so its predictions can be inverse-transformed directly:
resultsextratrees = label_encoder.inverse_transform(predictionextratrees)
#np.set_printoptions(threshold=np.inf)  #If one needs to see whole array of translated results
print("Translated prediction (extra trees): ",resultsextratrees)

***
_Because of issues ranging from memory failures and long running times to other, more technical failures, the following models could not be completed within the time available._

***
### XGBoost Classification

In [None]:
#nc=df.nunique()
#xg = xgb.XGBClassifier(objective ='multi:softmax', colsample_bytree = 0.3, learning_rate = 0.1,
#                max_depth = 3, alpha = 10, n_estimators = 10, subsample =0.8, gamma=1)
#xg.fit(X_train, y_train)  #early stopping would additionally need an eval_set
#
#predictions = xg.predict(X_test)
#
## evaluate predictions
#accuracy = accuracy_score(y_test, predictions)
#print("Accuracy: %.2f%%" % (accuracy * 100.0))
#
##convert dataset to array
#df_featuresarr=df_features.values
#
#y_predictionxg = xg.predict(df_features)
#predictionxg = [round(value) for value in y_predictionxg]
#predictionxg=label_encoder.inverse_transform(predictionxg)
#predictionxg

***
### SVM Classification

In [None]:
## training a linear SVM classifier 
#svm_model_linear = SVC(kernel='linear').fit(X_train, y_train) 
#svm_pred = svm_model_linear.predict(X_test) 
#
## model accuracy for X_test 
#accuracylin = svm_model_linear.score(X_test, y_test) 
#print(accuracylin)
#
## creating a confusion matrix 
#cmlin = confusion_matrix(y_test, svm_pred) 
#print(cmlin)
#
#print(confusion_matrix(y_test,svm_pred))  
#print(classification_report(y_test,svm_pred))  
#
#svm_model_poly = SVC(kernel='poly', random_state=0, degree=8).fit(X_train, y_train) 
#svm_pred_poly = svm_model_poly.predict(X_test)
#
#accuracy_poly = svm_model_poly.score(X_test, y_test) 
#print(accuracy_poly)
#
#cm_poly = confusion_matrix(y_test, svm_pred_poly) 
#print(cm_poly)
#
#svm_model_gauss = SVC(kernel='rbf').fit(X_train, y_train) 
#svm_pred_gauss = svm_model_gauss.predict(X_test)
#
#accuracy_gauss = svm_model_gauss.score(X_test, y_test) 
#print(accuracy_gauss)
# 
#cm_gauss = confusion_matrix(y_test, svm_pred_gauss) 
#print(cm_gauss)
#
#svm_model_sig = SVC(kernel='sigmoid').fit(X_train, y_train) 
#svm_pred_sig = svm_model_sig.predict(X_test)
#
#accuracy_sig = svm_model_sig.score(X_test, y_test) 
#print(accuracy_sig)
#
#cm_sig = confusion_matrix(y_test, svm_pred_sig) 
#print(cm_sig)
#
##Use the results above to gauge which of the kernels best suits the dataset and use this kernel to predict

***
### Light GBM Classification

In [None]:
# define dataset
#train_data = lgb.Dataset(X_train, label=y_train, free_raw_data=True)
#test_data = lgb.Dataset(X_test, label=y_test, reference=train_data, free_raw_data=True)

#params = {'task': 'train',
#    'boosting_type': 'gbdt',
#    'objective': 'multiclass',
#    'num_class':4000,
#    'metric': 'multi_logloss',
#    'learning_rate': 0.002296,
#    'max_depth': 7,
#    'num_leaves': 17,
#    'feature_fraction': 0.4,
#    'bagging_fraction': 0.6,
#    'bagging_freq': 17}

# evals_result = {}
# (nfold and shuffle are arguments of lgb.cv rather than lgb.train, so they are omitted here)
# gbm = lgb.train(params, train_data, num_boost_round=10000, valid_sets=[train_data, test_data], \
#                 valid_names = ['train', 'valid'], evals_result=evals_result, \
#                 early_stopping_rounds=10, verbose_eval=100) 

***
### K-Nearest-Neighbours Classification

In [None]:
#scaler = StandardScaler()  
#scaler.fit(X_train)
## Apply transform to both the training set and the test set.
#train = scaler.transform(X_train)
#test = scaler.transform(X_test)
#
#pca = PCA(n_components=.95, random_state=123, svd_solver='full', whiten=True)
#pca.fit(train)
#
#train = pca.transform(train)
#test = pca.transform(test)
#
#pca.explained_variance_ratio_
#
#
#scaler = preprocessing.MinMaxScaler() #found that adding this to preprocess increased the accuracy level by ~10%
#X = scaler.fit_transform(X)
#
#X_train, X_test, y_train, y_test = train_test_split(dataset, label_encoded_y, test_size=0.2, random_state=123)
#
#clf = KNeighborsClassifier(n_neighbors=3, p=1)
#clf.fit(train, y_train)
#print(clf)
#y_pred = clf.predict(test)
#
#accuracy = accuracy_score(y_pred, y_test)
#print("Accuracy: %.2f%%" % (accuracy * 100.0))
#
#
#print(confusion_matrix(y_test, y_pred))  
#print(classification_report(y_test, y_pred)) 
#
#error = []
#
## Calculating error for K values from 1 to 39
#for i in range(1, 40):  
#    knn = KNeighborsClassifier(n_neighbors=i, p=1)
#    knn.fit(train, y_train)
#    pred_i = knn.predict(test)
#    error.append(np.mean(pred_i != y_test))
#
#plt.figure(figsize=(12, 6))  
#plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',  
#         markerfacecolor='blue', markersize=10)
#plt.title('Error Rate K Value')  
#plt.xlabel('K Value')  
#plt.ylabel('Mean Error')

## Comments on the challenge:

•**Ask yourself why would they have selected this problem for the challenge? What are some gotchas in this domain I should know about?**

I believe this problem was selected for the challenge because it offers a lot of scope for further research using machine learning techniques for predictive data modelling. In addition, it comes with a large data set, with a dense set of fields and variables that the analyst needs to take into account and experiment with.
Some of the ‘gotchas’ in this domain are the irregularities in the data and their causes (documented in the GTD codebook), along with the parts of the data that could cause errors in the analysis if overlooked. The encoding of ‘object’ features before passing them to a classifier is one example. The sheer size of the dataset (181691 rows, 135 columns) makes it difficult to run through a classifier without hitting memory errors, and some of the algorithms simply take too long to complete.


•**What is the highest level of accuracy that others have achieved with this dataset or similar problems / datasets?**

The highest level of accuracy that others have achieved on similar predictive modelling tasks with this dataset is reported in 'Predictive Modeling of Terrorist Attacks Using Machine Learning', International Journal of Pure and Applied Mathematics, Volume 119, No. 15, 2018, 49-61 (https://acadpubl.eu/hub/2018-119-15/4/630.pdf).


•**What types of visualizations will help me grasp the nature of the problem / data?**

Decision tree and random forest classifiers helped me understand the relative importance of the features when predicting the perpetrator of an attack.
Because the target feature ‘gname’ (the perpetrators of the attacks) is a text column with a very high number of classes, I decided against plotting the relationship between the target and the features of the data set. However, the data is brilliantly visualised in the following Kaggle article by Laurenstc: https://www.kaggle.com/laurenstc/global-terrorism-analysis

•**What feature engineering might help improve the signal?**

- I discarded the eventid column, as keeping it would result in an overfitted model.

- I excluded the columns that had null values from the feature set, as many of the classifiers used above would otherwise have failed to run. As a potential improvement, I could fill these null elements with predictions from separate classification or regression models, thereby creating a larger, complete data frame to use.

- Applying K-fold cross-validation to the results (using the original dataframe, including the previously deleted features) would give me a better understanding of the error rate of the classifiers.
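
The idea of filling null elements with model predictions can be sketched as follows. This is a toy illustration under stated assumptions: the column names mirror GTD fields, but the values in `toy` are invented, and a decision tree is only one of many models that could be used for the imputation.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    'region':      [1, 1, 2, 2, 3, 3, 1, 2],
    'weaptype1':   [5, 5, 6, 6, 8, 8, 5, 6],
    'attacktype1': [2.0, 2.0, 3.0, np.nan, 7.0, 7.0, np.nan, 3.0],
})

predictors = ['region', 'weaptype1']
known = toy['attacktype1'].notna()

# Fit on the rows where the feature is present...
imputer_model = DecisionTreeClassifier(random_state=0)
imputer_model.fit(toy.loc[known, predictors], toy.loc[known, 'attacktype1'].astype(int))

# ...then predict it for the rows where it is missing.
filled = toy.copy()
filled.loc[~known, 'attacktype1'] = imputer_model.predict(toy.loc[~known, predictors])

print(filled['attacktype1'].tolist())
```

To avoid leaking information, the imputation model should be fitted on the training rows only and then applied to the test rows.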

•**Which modelling techniques are good at capturing the types of relationships I see in this data?**

I believe many incidents are driven by more complex factors that are not represented in the dataframe, such as the state of the economy at the time and political and religious tensions, so it is difficult to model this data realistically and accurately. Techniques such as neural networks may therefore prove useful for finding connections between the features and the target.

•**Now that I have a model, how can I be sure that I didn't introduce a bug in the code? If results are too good to be true, they probably are!**

- Pruning the models to avoid overfitting
- K-fold cross-validation
- Experiment with hyperparameters of the models
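
One simple check along these lines is to compare the cross-validated score of a model against a trivial baseline that always predicts the most frequent class. The sketch below uses synthetic data, and the names `model` and `baseline` as well as all parameter values are assumptions for illustration. A model that scores suspiciously far above what the features could plausibly support, or no better than the baseline, is worth a closer look.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features (assumed: 4 balanced classes, 15 features).
X_demo, y_demo = make_classification(n_samples=500, n_features=15, n_informative=6,
                                     n_classes=4, random_state=123)

# Limiting max_depth is a simple form of pre-pruning to reduce overfitting.
model = DecisionTreeClassifier(max_depth=6, random_state=123)
# Baseline that ignores the features and always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent')

model_scores = cross_val_score(model, X_demo, y_demo, cv=10)
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=10)

print("model   : mean %.3f, std %.3f" % (model_scores.mean(), model_scores.std()))
print("baseline: mean %.3f, std %.3f" % (baseline_scores.mean(), baseline_scores.std()))
```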

•**What are some of the weaknesses of the model and how can the model be improved with additional work?**

- To improve my models and increase their accuracy, I would further tune the classifiers' parameters and experiment with the combinations that give the highest accuracy. I would use cross-validation to gain a better understanding of the accuracy measure.
- Some of my models took too long to run. To address this, I could split the dataset into several dataframes, either by 'iyear' (using shorter time spans) or by 'country' (as this seems to be the most important feature among the few we have looked into). Each algorithm would then have a smaller dataframe to digest and would be more likely to finish and produce a prediction.
- Many of the original features were deleted because they contained null elements. Instead of removing all of these columns, I would build several models to predict and fill in the missing elements, creating a full dataset with no null values and therefore a larger choice of features to exploit when classifying the perpetrators.
- I would build a neural network to see if I can classify the unknown perpetrators with a higher level of accuracy.
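
The suggestion of splitting the data into smaller dataframes could look roughly like the sketch below, which trains one small model per country and routes a new event to the model for its country. The dataframe `toy`, the group names 'A' to 'D' and the `models` dictionary are hypothetical and exist only for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data; the group names 'A'..'D' and all values are invented for illustration.
toy = pd.DataFrame({
    'country':   [58, 58, 58, 130, 130, 130],
    'weaptype1': [13, 6, 13, 8, 6, 8],
    'targtype1': [14, 7, 14, 10, 7, 10],
    'gname':     ['A', 'B', 'A', 'C', 'D', 'C'],
})
features = ['weaptype1', 'targtype1']

# Train one classifier per country, each on a much smaller dataframe.
models = {}
for country, part in toy.groupby('country'):
    models[country] = DecisionTreeClassifier(random_state=0).fit(part[features], part['gname'])

# The same attack profile is attributed to different groups depending on the country.
new_event = pd.DataFrame({'weaptype1': [6], 'targtype1': [7]})
print(models[58].predict(new_event)[0])
print(models[130].predict(new_event)[0])
```

Each per-country model only has to separate the groups active in that country, which reduces both the number of classes and the number of rows each classifier sees.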