# Global Terrorism Database
## Data Pre-Processing

Prepared by: Keina Aoki

Date: January 2024

**Contents**
- [Data Dictionary](#Data_Dictionary)
- [Data Cleaning](#Data_Cleaning)
     - [Check Missing Values](#Check_Missing_Values)
     - [Fill Missing Values](#Fill_Missing_Values)
     - [Remove Irrelevant Data](#Remove_Irrelevant_Data)
     - [Duplicates](#Duplicates)

In [1]:
import numpy as np
import pandas as pd

### Load dataset

In [2]:
df = pd.read_csv("global_terrorism_dataset1.csv", low_memory=False)

### Data Cleaning
#### Inspect dataset

There are 135 columns and 209706 observations.

In [3]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209706 entries, 0 to 209705
Data columns (total 135 columns):
 #    Column              Dtype  
---   ------              -----  
 0    eventid             int64  
 1    iyear               int64  
 2    imonth              int64  
 3    iday                int64  
 4    approxdate          object 
 5    extended            int64  
 6    resolution          object 
 7    country             int64  
 8    country_txt         object 
 9    region              int64  
 10   region_txt          object 
 11   provstate           object 
 12   city                object 
 13   latitude            float64
 14   longitude           float64
 15   specificity         float64
 16   vicinity            int64  
 17   location            object 
 18   summary             object 
 19   crit1               int64  
 20   crit2               int64  
 21   crit3               int64  
 22   doubtterr           int64  
 23   alternative         float64
 24 

In [4]:
pd.set_option('display.max_columns', None)
df.tail()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,region_txt,provstate,city,latitude,longitude,specificity,vicinity,location,summary,crit1,crit2,crit3,doubtterr,alternative,alternative_txt,multiple,success,suicide,attacktype1,attacktype1_txt,attacktype2,attacktype2_txt,attacktype3,attacktype3_txt,targtype1,targtype1_txt,targsubtype1,targsubtype1_txt,corp1,target1,natlty1,natlty1_txt,targtype2,targtype2_txt,targsubtype2,targsubtype2_txt,corp2,target2,natlty2,natlty2_txt,targtype3,targtype3_txt,targsubtype3,targsubtype3_txt,corp3,target3,natlty3,natlty3_txt,gname,gsubname,gname2,gsubname2,gname3,gsubname3,motive,guncertain1,guncertain2,guncertain3,individual,nperps,nperpcap,claimed,claimmode,claimmode_txt,claim2,claimmode2,claimmode2_txt,claim3,claimmode3,claimmode3_txt,compclaim,weaptype1,weaptype1_txt,weapsubtype1,weapsubtype1_txt,weaptype2,weaptype2_txt,weapsubtype2,weapsubtype2_txt,weaptype3,weaptype3_txt,weapsubtype3,weapsubtype3_txt,weaptype4,weaptype4_txt,weapsubtype4,weapsubtype4_txt,weapdetail,nkill,nkillus,nkillter,nwound,nwoundus,nwoundte,property,propextent,propextent_txt,propvalue,propcomment,ishostkid,nhostkid,nhostkidus,nhours,ndays,divert,kidhijcountry,ransom,ransomamt,ransomamtus,ransompaid,ransompaidus,ransomnote,hostkidoutcome,hostkidoutcome_txt,nreleased,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
209701,202012310015,2020,12,31,2020-12-31,0,,228,Yemen,10,Middle East & North Africa,Al Hudaydah,Sabaa,15.305307,43.01949,2.0,0,,12/31/2020: Assailants fired mortar shells tar...,1,1,1,0,,,0.0,1,0,3,Bombing/Explosion,,,,,14,Private Citizens & Property,76.0,House/Apartment/Residence,Not Applicable,Residences,228.0,Yemen,,,,,,,,,,,,,,,,,Houthi extremists (Ansar Allah),,,,,,,0.0,,,0,-99.0,0.0,0.0,,,,,,,,,,6,Explosives,11.0,"Projectile (rockets, mortars, RPGs, etc.)",,,,,,,,,,,,,Mortars were used in the attack.,,0.0,0.0,,0.0,0.0,1,3.0,Minor (likely < $1 million),-99.0,Houses and buildings damaged,0.0,,,,,,,,,,,,,,,,,"""Al Houthi militia escalated in Hays and targe...",,,START Primary Collection,0,0,0,0,
209702,202012310016,2020,12,31,2020-12-31,0,,228,Yemen,10,Middle East & North Africa,Al Hudaydah,Beit Maghari,13.931337,43.478924,2.0,0,The incident occurred in the Hays district.,12/31/2020: Assailants attempted to plant expl...,1,1,1,0,,,0.0,1,0,3,Bombing/Explosion,,,,,14,Private Citizens & Property,76.0,House/Apartment/Residence,Not Applicable,Residences,228.0,Yemen,,,,,,,,,,,,,,,,,Houthi extremists (Ansar Allah),,,,,,,0.0,,,0,-99.0,0.0,0.0,,,,,,,,,,6,Explosives,8.0,Landmine,6.0,Explosives,16.0,Unknown Explosive Type,,,,,,,,,,,0.0,,,0.0,,0,,,,,0.0,,,,,,,,,,,,,,,,,"""Al Houthi militia escalated in Hays and targe...",,,START Primary Collection,0,0,0,0,
209703,202012310017,2020,12,31,,0,,75,Germany,8,Western Europe,Lower Saxony,Leipzig,51.342239,12.374772,1.0,0,,12/31/2020: Assailants set fire to German Army...,1,1,0,1,1.0,Insurgency/Guerilla Action,0.0,1,0,7,Facility/Infrastructure Attack,,,,,4,Military,35.0,Military Transportation/Vehicle (excluding con...,German Army,Wolf-Class Vehicles,75.0,Germany,,,,,,,,,,,,,,,,,Left-wing extremists,,,,,,,0.0,,,0,-99.0,0.0,1.0,7.0,"Posted to website, blog, etc.",,,,,,,,8,Incendiary,18.0,Arson/Fire,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,1,4.0,Unknown,-99.0,Military vehicles damaged,0.0,,,,,,,,,,,,,,,,,"""Far-left arson attack suspected on German asy...","""Fire of Bundeswehr vehicles in Leipzig, proba...","""Anarchist Antifa Take Credit for Arson Attack...",START Primary Collection,-9,-9,0,-9,
209704,202012310018,2020,12,31,,0,,4,Afghanistan,6,South Asia,Kabul,Kabul,34.523842,69.140304,1.0,0,The incident occurred in Khair Khana neighborh...,12/31/2020: Assailants shot and killed a civil...,1,1,1,0,,,0.0,1,0,2,Armed Assault,,,,,14,Private Citizens & Property,83.0,Protester,Not Applicable,Activist: Abdi Jahid,4.0,Afghanistan,,,,,,,,,,,,,,,,,Unknown,,,,,,,0.0,,,0,-99.0,0.0,0.0,,,,,,,,,,5,Firearms,5.0,Unknown Gun Type,,,,,,,,,,,,,,1.0,0.0,0.0,0.0,0.0,0.0,0,,,,,0.0,,,,,,,,,,,,,,,,,"""Civil society activist and tribal elder kille...","""Terrorism Digest: 1-2 Jan 21,"" BBC Monitoring...",,START Primary Collection,-9,-9,0,-9,
209705,202012310019,2020,12,31,,1,,33,Burkina Faso,11,Sub-Saharan Africa,Sahel,Kelbo,13.864252,-1.161453,1.0,0,,12/31/2020: Assailants attacked a Volunteers o...,1,1,0,1,1.0,Insurgency/Guerilla Action,0.0,1,0,2,Armed Assault,,,,,4,Military,39.0,Paramilitary,Volunteers of the Defence of the Fatherland (VDP),Paramilitary Position,33.0,Burkina Faso,,,,,,,,,,,,,,,,,Unknown,,,,,,,0.0,,,0,-99.0,0.0,0.0,,,,,,,,,,5,Firearms,5.0,Unknown Gun Type,,,,,,,,,,,,,,5.0,0.0,0.0,0.0,0.0,0.0,0,,,,,-9.0,,,,,,,,,,,,,,,,,"""Terrorism Digest: 3-4 Jan 21,"" BBC Monitoring...",,,START Primary Collection,-9,-9,0,-9,


#### Check Missing Values

In [5]:
pd.set_option('display.max_rows', None)
na = pd.DataFrame(df.isna().sum())
na = na.rename(columns={0:"Count"})

percent = pd.DataFrame(round((na["Count"]/df.shape[0]) * 100,2))
percent = percent.rename(columns={"Count":"Percent"})
na_df = pd.concat([na, percent], axis=1)

In [6]:
na_df.reset_index().sort_values(by="Count", ascending=False)

Unnamed: 0,index,Count,Percent
63,gsubname3,209683,99.99
95,weapsubtype4,209636,99.97
96,weapsubtype4_txt,209636,99.97
93,weaptype4,209633,99.97
94,weaptype4_txt,209633,99.97
78,claimmode3,209566,99.93
79,claimmode3_txt,209566,99.93
61,gsubname2,209522,99.91
114,divert,209368,99.84
77,claim3,209297,99.8


This dataset has several missing values. Columns with missing values may be redunant and skew our analysis later. We will remove columns as appropriate.

In [7]:
# Create an array that only includes columns with less than 40% missing values.
keep = na_df[na_df["Percent"] < 40]
keep = keep.index.values
keep

array(['eventid', 'iyear', 'imonth', 'iday', 'extended', 'country',
       'country_txt', 'region', 'region_txt', 'provstate', 'city',
       'latitude', 'longitude', 'specificity', 'vicinity', 'summary',
       'crit1', 'crit2', 'crit3', 'doubtterr', 'multiple', 'success',
       'suicide', 'attacktype1', 'attacktype1_txt', 'targtype1',
       'targtype1_txt', 'targsubtype1', 'targsubtype1_txt', 'corp1',
       'target1', 'natlty1', 'natlty1_txt', 'gname', 'guncertain1',
       'individual', 'nperps', 'nperpcap', 'claimed', 'weaptype1',
       'weaptype1_txt', 'weapsubtype1', 'weapsubtype1_txt', 'nkill',
       'nkillus', 'nkillter', 'nwound', 'nwoundus', 'nwoundte',
       'property', 'ishostkid', 'scite1', 'dbsource', 'INT_LOG',
       'INT_IDEO', 'INT_MISC', 'INT_ANY'], dtype=object)

In [8]:
# Keep these attributes from our original dataset
new_df = df.loc[:, keep]
new_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209706 entries, 0 to 209705
Data columns (total 57 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   eventid           209706 non-null  int64  
 1   iyear             209706 non-null  int64  
 2   imonth            209706 non-null  int64  
 3   iday              209706 non-null  int64  
 4   extended          209706 non-null  int64  
 5   country           209706 non-null  int64  
 6   country_txt       209706 non-null  object 
 7   region            209706 non-null  int64  
 8   region_txt        209706 non-null  object 
 9   provstate         209706 non-null  object 
 10  city              209279 non-null  object 
 11  latitude          205015 non-null  float64
 12  longitude         205014 non-null  float64
 13  specificity       209705 non-null  float64
 14  vicinity          209706 non-null  int64  
 15  summary           143586 non-null  object 
 16  crit1             20

#### Fill Missing Values

In [9]:
# Check which variables needed to be filled
pd.set_option('display.max_rows', None)
na = pd.DataFrame(new_df.isna().sum())
na = na.rename(columns={0:"Count"})

percent = pd.DataFrame(round((na["Count"]/df.shape[0]) * 100,2))
percent = percent.rename(columns={"Count":"Percent"})
na_df = pd.concat([na, percent], axis=1)

In [10]:
na_df.sort_values(by="Count", ascending=False)

Unnamed: 0,Count,Percent
nperps,71093,33.9
nwoundte,70906,33.81
nperpcap,69473,33.13
nkillter,68159,32.5
scite1,66182,31.56
summary,66120,31.53
claimed,66093,31.52
nwoundus,64697,30.85
nkillus,64437,30.73
corp1,42536,20.28


In [11]:
# Fill the missing values

# Numeric Variables
# To make data consistent, fill unknown numbers with NaN, rather than -9 or -99
new_df.loc[new_df["nperpcap"] == -9, "nperpcap"] = np.nan
new_df.loc[new_df["nperpcap"] == -99, "nperpcap"] = np.nan

new_df.loc[new_df["nperps"] == -9, "nperps"] = np.nan
new_df.loc[new_df["nperps"] == -99, "nperpcap"] = np.nan


# Categorical Variables
# To make data consistent, fill missing values with -9, each column should have 1="Yes", 0="No", -9="Uknown"
new_df["claimed"].fillna(-9, inplace=True)
new_df["weapsubtype1"].fillna(-9, inplace=True)
new_df["targsubtype1"].fillna(-9, inplace=True)
new_df["natlty1"].fillna(-9, inplace=True)
new_df["guncertain1"].fillna(-9, inplace=True)
new_df["multiple"].fillna(-9, inplace=True)
new_df["ishostkid"].fillna(-9, inplace=True)
new_df["doubtterr"].fillna(-9, inplace=True)


# Text Variable
# Fill missing values with "Uknown"
new_df["provstate"].fillna("Unknown", inplace=True)
new_df["city"].fillna("Unknown", inplace=True)
new_df["summary"].fillna("Unknown", inplace=True)
new_df["corp1"].fillna("Unknown", inplace=True)
new_df["weapsubtype1_txt"].fillna("Uknown", inplace=True)
new_df["targsubtype1_txt"].fillna("Unknown", inplace=True)
new_df["natlty1_txt"].fillna("Unknown", inplace=True)
new_df["target1"].fillna("Unknown", inplace=True)

In [None]:
#### Remove Irrelevant Data

In [12]:
# We will also drop columns that we will not use for analysis
new_df = new_df.drop(columns=["specificity", "vicinity", "scite1", "dbsource"])

In [13]:
new_df.tail()

Unnamed: 0,eventid,iyear,imonth,iday,extended,country,country_txt,region,region_txt,provstate,city,latitude,longitude,summary,crit1,crit2,crit3,doubtterr,multiple,success,suicide,attacktype1,attacktype1_txt,targtype1,targtype1_txt,targsubtype1,targsubtype1_txt,corp1,target1,natlty1,natlty1_txt,gname,guncertain1,individual,nperps,nperpcap,claimed,weaptype1,weaptype1_txt,weapsubtype1,weapsubtype1_txt,nkill,nkillus,nkillter,nwound,nwoundus,nwoundte,property,ishostkid,INT_LOG,INT_IDEO,INT_MISC,INT_ANY
209701,202012310015,2020,12,31,0,228,Yemen,10,Middle East & North Africa,Al Hudaydah,Sabaa,15.305307,43.01949,12/31/2020: Assailants fired mortar shells tar...,1,1,1,0,0.0,1,0,3,Bombing/Explosion,14,Private Citizens & Property,76.0,House/Apartment/Residence,Not Applicable,Residences,228.0,Yemen,Houthi extremists (Ansar Allah),0.0,0,-99.0,,0.0,6,Explosives,11.0,"Projectile (rockets, mortars, RPGs, etc.)",,0.0,0.0,,0.0,0.0,1,0.0,0,0,0,0
209702,202012310016,2020,12,31,0,228,Yemen,10,Middle East & North Africa,Al Hudaydah,Beit Maghari,13.931337,43.478924,12/31/2020: Assailants attempted to plant expl...,1,1,1,0,0.0,1,0,3,Bombing/Explosion,14,Private Citizens & Property,76.0,House/Apartment/Residence,Not Applicable,Residences,228.0,Yemen,Houthi extremists (Ansar Allah),0.0,0,-99.0,,0.0,6,Explosives,8.0,Landmine,,0.0,,,0.0,,0,0.0,0,0,0,0
209703,202012310017,2020,12,31,0,75,Germany,8,Western Europe,Lower Saxony,Leipzig,51.342239,12.374772,12/31/2020: Assailants set fire to German Army...,1,1,0,1,0.0,1,0,7,Facility/Infrastructure Attack,4,Military,35.0,Military Transportation/Vehicle (excluding con...,German Army,Wolf-Class Vehicles,75.0,Germany,Left-wing extremists,0.0,0,-99.0,,1.0,8,Incendiary,18.0,Arson/Fire,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,-9,-9,0,-9
209704,202012310018,2020,12,31,0,4,Afghanistan,6,South Asia,Kabul,Kabul,34.523842,69.140304,12/31/2020: Assailants shot and killed a civil...,1,1,1,0,0.0,1,0,2,Armed Assault,14,Private Citizens & Property,83.0,Protester,Not Applicable,Activist: Abdi Jahid,4.0,Afghanistan,Unknown,0.0,0,-99.0,,0.0,5,Firearms,5.0,Unknown Gun Type,1.0,0.0,0.0,0.0,0.0,0.0,0,0.0,-9,-9,0,-9
209705,202012310019,2020,12,31,1,33,Burkina Faso,11,Sub-Saharan Africa,Sahel,Kelbo,13.864252,-1.161453,12/31/2020: Assailants attacked a Volunteers o...,1,1,0,1,0.0,1,0,2,Armed Assault,4,Military,39.0,Paramilitary,Volunteers of the Defence of the Fatherland (VDP),Paramilitary Position,33.0,Burkina Faso,Unknown,0.0,0,-99.0,,0.0,5,Firearms,5.0,Unknown Gun Type,5.0,0.0,0.0,0.0,0.0,0.0,0,-9.0,-9,-9,0,-9


#### Duplicates

In [14]:
# Check for duplicates
new_df.duplicated().sum()

0

There are no duplicated rows in this dataset.

In [15]:
# Our final dataset after initial pre-processing
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209706 entries, 0 to 209705
Data columns (total 53 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   eventid           209706 non-null  int64  
 1   iyear             209706 non-null  int64  
 2   imonth            209706 non-null  int64  
 3   iday              209706 non-null  int64  
 4   extended          209706 non-null  int64  
 5   country           209706 non-null  int64  
 6   country_txt       209706 non-null  object 
 7   region            209706 non-null  int64  
 8   region_txt        209706 non-null  object 
 9   provstate         209706 non-null  object 
 10  city              209706 non-null  object 
 11  latitude          205015 non-null  float64
 12  longitude         205014 non-null  float64
 13  summary           209706 non-null  object 
 14  crit1             209706 non-null  int64  
 15  crit2             209706 non-null  int64  
 16  crit3             20

### Export Dataset into CSV

We have completed the pre-processing step. We can export the dataset into a csv for data exploration. 

In [16]:
new_df.to_csv(r"C:\Users\keinaaoki\Desktop\gtd_1.csv", index=False)