# Predicting Terrorist Attacks

### Data Preprocessing-1

#### Date : August 31, 2018

#### Notebook Configuration

In [1]:
import pandas as pd
import numpy as np

# Configure notebook output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Number of rows and columns
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 150)

#### Load the Datasets
This project uses the Global Terrorism Database (GTD) from https://www.start.umd.edu/gtd/, which includes information on terrorist events around the world from 1970 through 2017. The dataset was converted from xlsx to csv before loading.

In [2]:
# Load dataset
gtd_df = pd.read_csv('dataset/globalterrorismdb_0718dist.csv',encoding='ISO-8859-1')

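As a minimal sketch of why the `encoding` argument matters, the snippet below reads a tiny in-memory sample instead of the real file. The sample rows are hypothetical. `low_memory=False` is an optional addition that is not in the original cell; it asks pandas to infer each column's type from the whole file, which is one way to avoid a mixed-type `DtypeWarning` on wide CSV files.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the GTD export (not real GTD data).
raw = "eventid,city\n1,São Paulo\n2,Bogotá\n".encode("ISO-8859-1")

# Decoding with the matching encoding keeps accented place names intact.
# low_memory=False makes pandas read each column in one pass before inferring its dtype.
sample_df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1", low_memory=False)
print(sample_df["city"].tolist())
```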


#### Inspect the Structure
The data frame contains 135 attributes and 181,691 observations. Rows are indexed by a default integer range (0 to 181690).

In [4]:
# Display a summary of the data frame
gtd_df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 135 columns):
eventid               int64
iyear                 int64
imonth                int64
iday                  int64
approxdate            object
extended              int64
resolution            object
country               int64
country_txt           object
region                int64
region_txt            object
provstate             object
city                  object
latitude              float64
longitude             float64
specificity           float64
vicinity              int64
location              object
summary               object
crit1                 int64
crit2                 int64
crit3                 int64
doubtterr             float64
alternative           float64
alternative_txt       object
multiple              float64
success               int64
suicide               int64
attacktype1           int64
attacktype1_txt       object
attacktype2           floa

#### View Missing Data
Calculate the number and percentage of null values for each attribute. As the results show, many attributes are more than 50% missing.

In [5]:
# Count the missing values in each attribute and express them as a percentage of all rows
count = gtd_df.isnull().sum()
percent = round(count / len(gtd_df) * 100, 2)
series = [count, percent]
result = pd.concat(series, axis=1, keys=['Count','Percent'])
result.sort_values(by='Count', ascending=False)

| Attribute | Count | Percent |
|---|---|---|
| gsubname3 | 181671 | 99.99 |
| weapsubtype4_txt | 181621 | 99.96 |
| weapsubtype4 | 181621 | 99.96 |
| weaptype4 | 181618 | 99.96 |
| weaptype4_txt | 181618 | 99.96 |
| claimmode3 | 181558 | 99.93 |
| claimmode3_txt | 181558 | 99.93 |
| gsubname2 | 181531 | 99.91 |
| claim3 | 181373 | 99.83 |
| guncertain3 | 181371 | 99.82 |
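The same summary can be wrapped in a small helper so the row count is always taken from the data frame itself. This is a sketch only: `missing_summary` and the toy data frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column null counts and percentages, most-missing first."""
    count = df.isnull().sum()
    # Divide by the actual number of rows rather than a hard-coded constant.
    percent = (count / len(df) * 100).round(2)
    summary = pd.concat([count, percent], axis=1, keys=["Count", "Percent"])
    return summary.sort_values(by="Count", ascending=False)


# Toy data (hypothetical): 'a' is 75% missing, 'c' is 25% missing, 'b' is complete.
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1, 2, 3],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```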


#### Identify the First Pass of Target Attributes
Select the attributes that have less than 40% missing values and that are not duplicated by another attribute.

In [7]:
target_attrs = result[result['Percent'] < 40.0]
keep_attrs = target_attrs.index.values

# The nperps attribute contains 39.14% blank values. However, an additional 60.86% are
# coded as unknown (-99, -9), so it is dropped.
keep_attrs = keep_attrs[keep_attrs != 'nperps']
keep_attrs

# Remove attributes that duplicate another attribute
keep_attrs = keep_attrs[keep_attrs != 'country']
keep_attrs = keep_attrs[keep_attrs != 'region']
keep_attrs = keep_attrs[keep_attrs != 'attacktype1']
keep_attrs = keep_attrs[keep_attrs != 'targtype1']
keep_attrs = keep_attrs[keep_attrs != 'targsubtype1']
keep_attrs = keep_attrs[keep_attrs != 'natlty1']
keep_attrs = keep_attrs[keep_attrs != 'weaptype1']
keep_attrs = keep_attrs[keep_attrs != 'weapsubtype1']

array(['eventid', 'iyear', 'imonth', 'iday', 'extended', 'country',
       'country_txt', 'region', 'region_txt', 'provstate', 'city',
       'latitude', 'longitude', 'specificity', 'vicinity', 'summary',
       'crit1', 'crit2', 'crit3', 'doubtterr', 'multiple', 'success',
       'suicide', 'attacktype1', 'attacktype1_txt', 'targtype1',
       'targtype1_txt', 'targsubtype1', 'targsubtype1_txt', 'corp1',
       'target1', 'natlty1', 'natlty1_txt', 'gname', 'guncertain1',
       'individual', 'nperpcap', 'claimed', 'weaptype1', 'weaptype1_txt',
       'weapsubtype1', 'weapsubtype1_txt', 'weapdetail', 'nkill',
       'nkillus', 'nkillter', 'nwound', 'nwoundus', 'nwoundte',
       'property', 'ishostkid', 'scite1', 'dbsource', 'INT_LOG',
       'INT_IDEO', 'INT_MISC', 'INT_ANY'], dtype=object)
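The threshold filter and the removals can also be expressed as one function with an exclusion list, which avoids repeating the same filter line for each attribute. The sketch below is illustrative only: `select_attrs` is a hypothetical helper, and the toy percentages are made up apart from the 39.14% quoted for `nperps`.

```python
import pandas as pd


def select_attrs(summary, max_percent, exclude):
    """Keep attributes below the missing-value threshold, minus an exclusion list."""
    excluded = set(exclude)
    candidates = summary.index[summary["Percent"] < max_percent]
    return [name for name in candidates if name not in excluded]


# Toy missing-value summary (hypothetical values, indexed by attribute name).
toy_summary = pd.DataFrame(
    {"Percent": [0.0, 0.0, 39.14, 99.9]},
    index=["country", "country_txt", "nperps", "gsubname3"],
)

# 'nperps' passes the threshold but is excluded by name; 'country' duplicates 'country_txt'.
kept = select_attrs(toy_summary, 40.0, exclude=["nperps", "country"])
print(kept)
```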

#### Subset the Original Dataset
Only include the attributes in the target set of attributes.

In [8]:
# Take an explicit copy so later modifications do not touch (or warn about) gtd_df
subset_df = gtd_df.loc[:, keep_attrs].copy()
subset_df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 49 columns):
eventid             181691 non-null int64
iyear               181691 non-null int64
imonth              181691 non-null int64
iday                181691 non-null int64
extended            181691 non-null int64
country_txt         181691 non-null object
region_txt          181691 non-null object
provstate           181270 non-null object
city                181257 non-null object
latitude            177135 non-null float64
longitude           177134 non-null float64
specificity         181685 non-null float64
vicinity            181691 non-null int64
summary             115562 non-null object
crit1               181691 non-null int64
crit2               181691 non-null int64
crit3               181691 non-null int64
doubtterr           181690 non-null float64
multiple            181690 non-null float64
success             181691 non-null int64
suicide             181691 non-nul

#### Fix Missing Values
The code book is not consistent in how it classifies missing or unknown values. The original data includes blanks, -9, and -99. For consistency, -1 is used for categorical attributes that are numeric and UNKNOWN is used for categorical attributes that are text. Numeric attributes that contain coded missing values have those codes replaced with NaN.

In [10]:
# Categorical Variables
# ---------------------
# Assign the filled column back instead of calling fillna(inplace=True) on a
# column selection, which may not update subset_df in newer pandas versions.
subset_df['specificity'] = subset_df['specificity'].fillna(-1)

subset_df.loc[subset_df['vicinity'] == -9, 'vicinity'] = -1

subset_df.loc[subset_df['doubtterr'] == -9, 'doubtterr'] = -1

subset_df['targsubtype1_txt'] = subset_df['targsubtype1_txt'].fillna('UNKNOWN')

subset_df['natlty1_txt'] = subset_df['natlty1_txt'].fillna('UNKNOWN')

subset_df['guncertain1'] = subset_df['guncertain1'].fillna(-1)

subset_df['claimed'] = subset_df['claimed'].fillna(-1)
subset_df.loc[subset_df['claimed'] == -9, 'claimed'] = -1

subset_df['weapsubtype1_txt'] = subset_df['weapsubtype1_txt'].fillna('UNKNOWN')

subset_df.loc[subset_df['property'] == -9, 'property'] = -1

subset_df['ishostkid'] = subset_df['ishostkid'].fillna(-1)
subset_df.loc[subset_df['ishostkid'] == -9, 'ishostkid'] = -1

subset_df.loc[subset_df['INT_LOG'] == -9, 'INT_LOG'] = -1

subset_df.loc[subset_df['INT_IDEO'] == -9, 'INT_IDEO'] = -1

subset_df.loc[subset_df['INT_MISC'] == -9, 'INT_MISC'] = -1

subset_df.loc[subset_df['INT_ANY'] == -9, 'INT_ANY'] = -1


# Numeric Variables
# -----------------
subset_df.loc[subset_df['nperpcap'] == -9, 'nperpcap'] = np.nan
subset_df.loc[subset_df['nperpcap'] == -99, 'nperpcap'] = np.nan


# Text Variables
# --------------
# Assign the filled columns back so the change is applied to subset_df itself
subset_df['provstate'] = subset_df['provstate'].fillna('UNKNOWN')
subset_df['city'] = subset_df['city'].fillna('UNKNOWN')
subset_df.loc[subset_df['city'] == 'Unknown', 'city'] = 'UNKNOWN'
subset_df['summary'] = subset_df['summary'].fillna('UNKNOWN')
subset_df['corp1'] = subset_df['corp1'].fillna('UNKNOWN')
subset_df['target1'] = subset_df['target1'].fillna('UNKNOWN')
subset_df['scite1'] = subset_df['scite1'].fillna('UNKNOWN')
subset_df['weapdetail'] = subset_df['weapdetail'].fillna('UNKNOWN')
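The recoding rule described above (blanks and sentinel codes collapse to one "unknown" value) can be sketched as a single helper. This is an illustration rather than the notebook's own method: `recode_unknown` is a hypothetical function, and the toy series only borrow the `claimed` and `nperpcap` attribute names.

```python
import numpy as np
import pandas as pd


def recode_unknown(series, sentinels, unknown):
    """Replace blanks (NaN) and the given sentinel codes with one 'unknown' value."""
    is_unknown = series.isin(sentinels) | series.isna()
    return series.mask(is_unknown, unknown)


# Categorical attribute stored as numbers: blanks and -9 both become -1.
claimed = pd.Series([1.0, 0.0, -9.0, np.nan])
claimed_clean = recode_unknown(claimed, [-9], -1)

# Numeric attribute: the coded values -9 and -99 become NaN.
nperpcap = pd.Series([2.0, -99.0, -9.0, 0.0])
nperpcap_clean = recode_unknown(nperpcap, [-9, -99], np.nan)

print(claimed_clean.tolist())
print(nperpcap_clean.tolist())
```

Keeping the sentinel list as an argument makes it explicit which codes each attribute treats as unknown.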

#### Map Yes/No/Unknown Codes
Many attributes contain codes of 1, 0, -1 to represent Yes, No, and Unknown. Replace the codes with labels to improve exploratory data analysis.

In [11]:
# Map the codes to labels
ynu_map = {1: 'YES', 0: 'NO', -1: 'UNKNOWN'}

# List of target attributes to map
ynu_attrs =['extended', 'vicinity', 'crit1', 'crit2', 'crit3', 'doubtterr', 'multiple', 
            'success', 'suicide', 'guncertain1', 'individual', 'claimed', 'property', 
            'ishostkid', 'INT_LOG', 'INT_IDEO', 'INT_MISC', 'INT_ANY']

# Iterate over each target attribute and map it
for att in ynu_attrs:
    att_txt = att + '_txt'
    subset_df[att_txt] = subset_df[att].map(ynu_map)

# Get the list of attributes, dropping the coded attributes in favor of their labeled versions
final_attrs = []

for attr in subset_df.columns.values:
    if attr not in ynu_attrs:
        final_attrs.append(attr)
        
subset_df2 = subset_df.loc[:, final_attrs]
subset_df2.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 49 columns):
eventid             181691 non-null int64
iyear               181691 non-null int64
imonth              181691 non-null int64
iday                181691 non-null int64
country_txt         181691 non-null object
region_txt          181691 non-null object
provstate           181691 non-null object
city                181691 non-null object
latitude            177135 non-null float64
longitude           177134 non-null float64
specificity         181691 non-null float64
summary             181691 non-null object
attacktype1_txt     181691 non-null object
targtype1_txt       181691 non-null object
targsubtype1_txt    181691 non-null object
corp1               181691 non-null object
target1             181691 non-null object
natlty1_txt         181691 non-null object
gname               181691 non-null object
nperpcap            110336 non-null float64
weaptype1_txt       181691 no
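One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary, including NaN, becomes NaN in the result. The sketch below uses a hypothetical toy series to show how codes that slipped past the recoding step could be detected after mapping.

```python
import pandas as pd

ynu_map = {1: "YES", 0: "NO", -1: "UNKNOWN"}

# Toy codes (hypothetical): -9 was never recoded to -1, so it has no label.
codes = pd.Series([1, 0, -1, -9])
labels = codes.map(ynu_map)

# Values without a dictionary entry come back as NaN; list the offending codes.
unmapped = codes[labels.isna()].unique()
print(labels.tolist())
print(unmapped)
```

Running a check like this on each mapped attribute is a cheap way to confirm that every code was handled before the coded columns are dropped.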

In [12]:
subset_df2.to_csv("dataset/processed_1_globalterrorismdb_0718dist.csv", sep = ",")
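By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates this on a hypothetical toy frame using in-memory buffers. Whether to pass `index=False` depends on whether later notebooks expect that column.

```python
import io

import pandas as pd

toy = pd.DataFrame({"eventid": [1, 2], "city": ["A", "B"]})

# Default behaviour: the index is written as an extra, unnamed first column.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index, sep=",")
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# index=False writes only the data columns.
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, sep=",", index=False)
buf_without_index.seek(0)
cols_without_index = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```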