
## **GLOBAL TERRORISM ANALYSIS**

### **PART-2 : DATA PREPROCESSING PHASE**

> By Shashwat Dev



Importing the libraries required for data preprocessing

In [None]:
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
import matplotlib.ticker as ticker
from matplotlib import animation

Importing dataset

In [None]:
primary_df = pd.read_csv('globalterrorismdb.csv', sep=',', encoding="ISO-8859-1")

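Wide CSV files whose columns mix numbers and text can make `pd.read_csv` emit a `DtypeWarning`. A minimal sketch, assuming that is what happens with this file: passing `low_memory=False` makes pandas read each column in full before inferring its dtype. The in-memory sample below is a hypothetical stand-in for `globalterrorismdb.csv`, and its `summary` column is only illustrative.

```python
import io
import pandas as pd

# Hypothetical two-row sample encoded as ISO-8859-1, standing in for the real file.
sample_bytes = (
    "eventid,iyear,country_txt,summary\n"
    "197000000001,1970,Dominican Republic,\n"
    "197001020003,1970,United States,communiqué\n"
).encode("ISO-8859-1")

# low_memory=False reads whole columns before choosing a dtype,
# which avoids mixed-type guesses made chunk by chunk.
sample_df = pd.read_csv(
    io.BytesIO(sample_bytes),
    sep=",",
    encoding="ISO-8859-1",
    low_memory=False,
)
print(sample_df.dtypes)
```

The trade-off is higher peak memory while loading, which may matter for a file of this size.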


Initial layout of data

In [None]:
primary_df.head(10)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,
5,197001010002,1970,1,1,,0,,217,United States,1,...,"The Cairo Chief of Police, William Petersen, r...","""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,
6,197001020001,1970,1,2,,0,,218,Uruguay,3,...,,,,,PGIS,0,0,0,0,
7,197001020002,1970,1,2,,0,,217,United States,1,...,"Damages were estimated to be between $20,000-$...",Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,
8,197001020003,1970,1,2,,0,,217,United States,1,...,The New Years Gang issue a communiqué to a loc...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,
9,197001030001,1970,1,3,,0,,217,United States,1,...,"Karl Armstrong's girlfriend, Lynn Schultz, dro...",Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,


Shape of data

In [None]:
print ('dataframe shape: ', primary_df.shape)

dataframe shape:  (181691, 135)


In [None]:
print('Existence of null values:', primary_df.isnull().values.any())
print('Total number of null values in entire dataframe: ', primary_df.isnull().sum().sum())

Existence of null values: True
Total number of null values in entire dataframe:  13853997


In [None]:
print ('Number of null values in each column of dataframe: ', primary_df.isnull().sum())

Number of null values in each column of dataframe:  eventid            0
iyear              0
imonth             0
iday               0
approxdate    172452
               ...  
INT_LOG            0
INT_IDEO           0
INT_MISC           0
INT_ANY            0
related       156653
Length: 135, dtype: int64
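With 13,853,997 nulls across 181,691 × 135 = 24,528,285 cells, roughly 56% of the table is empty, so the per-column share of nulls is often more useful than the raw counts. A minimal sketch on a hypothetical toy frame (the column names are borrowed from the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: values are illustrative, not taken from the dataset.
toy_df = pd.DataFrame({
    "eventid": [1, 2, 3, 4],
    "approxdate": [np.nan, np.nan, np.nan, "1970-01-02"],
    "related": [np.nan, "x", np.nan, "y"],
})

# isnull().mean() gives the fraction of missing values per column;
# sorting puts the emptiest columns first.
null_share = toy_df.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Applied to `primary_df`, the same two calls would rank all 135 columns by how sparse they are.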


## **Changing the content and features of the dataset**

> Renaming certain columns to more readable names



In [None]:
# Rename terse GTD column names to more readable ones.
# Note: 'country' and 'region' already exist as numeric code columns, so mapping
# 'country_txt' and 'region_txt' onto those names leaves two columns sharing each name.
primary_df.rename(columns =
                  {'iyear':'year',
                   'imonth':'month',
                   'iday':'day',
                   'country_txt' : 'country',
                   'region_txt' : 'region',
                   'crit1' : 'crit',
                   'attacktype1_txt' : 'attacktype',
                   'targtype1_txt' : 'targettype',
                   'natlty1_txt' : 'nationalityofvic',
                   'gname' : 'organisation',
                   'claimed' : 'claimedresp',
                   'weaptype1_txt' : 'weapontype',
                   'nkill' : 'nkilled',
                   'nkillter' : 'nkillonlyter',
                   'nwound' : 'nwounded',
                   'propextent_txt' : 'propdamageextent',
                   'ishostkid' : 'victimkidnapped',
                   'ransom' : 'ransomdemanded',
                   }, inplace = True)
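Because the dataset already has numeric `country` and `region` columns, renaming the `_txt` variants onto those names produces duplicate column labels. A minimal sketch of one way to avoid that, assuming the numeric codes should be kept: rename them in the same mapping. The names `country_code` and `region_code` are hypothetical, and the toy rows are only illustrative.

```python
import pandas as pd

# Toy frame mirroring the code/text column pairs in the dataset.
toy_df = pd.DataFrame({
    "country": [58, 130],
    "country_txt": ["Dominican Republic", "Mexico"],
    "region": [2, 1],
    "region_txt": ["Central America & Caribbean", "North America"],
})

# rename() maps every existing label once, so the code columns can move
# out of the way in the same call that promotes the text columns.
renamed = toy_df.rename(columns={
    "country": "country_code",   # hypothetical name for the numeric code
    "region": "region_code",     # hypothetical name for the numeric code
    "country_txt": "country",
    "region_txt": "region",
})
print(renamed.columns.tolist())
```

With unique labels, `renamed["country"]` returns a single Series rather than a two-column DataFrame.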

In [None]:
# Add ncasualties (number of people killed or wounded) as nkilled + nwounded.
# The result is NaN whenever either count is missing.
primary_df['ncasualties'] = primary_df['nkilled'] + primary_df['nwounded']
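Adding two columns with `+` yields `NaN` if either value is missing, so an event with a known death toll but an unknown number of wounded gets no casualty figure. A minimal sketch of an alternative, assuming a missing count may be treated as zero when the other count is known; this is a design choice, not the approach used in this notebook. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame covering the four missing-value combinations.
toy_df = pd.DataFrame({
    "nkilled": [1.0, 0.0, np.nan, np.nan],
    "nwounded": [0.0, np.nan, 2.0, np.nan],
})

# Plain addition: NaN as soon as either operand is NaN.
plain_sum = toy_df["nkilled"] + toy_df["nwounded"]

# Row-wise sum that skips NaN but still returns NaN when both are missing.
lenient_sum = toy_df[["nkilled", "nwounded"]].sum(axis=1, min_count=1)

print(pd.DataFrame({"plain": plain_sum, "lenient": lenient_sum}))
```

The lenient version understates casualties when a count is truly unknown, so which variant fits depends on how the later analysis treats missing data.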

In [None]:
# Shorten long category labels
primary_df['weapontype'] = primary_df['weapontype'].replace(
    'Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)', 'Vehicle')

primary_df['propdamageextent'] = primary_df['propdamageextent'].replace({
    'Minor (likely < $1 million)': 'Minor',
    'Major (likely > $1 million but < $1 billion)': 'Major',
    'Catastrophic (likely > $1 billion)': 'Catastrophic',
})

### Glimpse of the final preprocessed data

In [None]:
primary_df.head(10)

Unnamed: 0,eventid,year,month,day,approxdate,extended,resolution,country,country.1,region,...,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related,ncasualties
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,PGIS,0,0,0,0,,1.0
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,PGIS,0,1,1,1,,0.0
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,PGIS,-9,-9,1,1,,1.0
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,PGIS,-9,-9,1,1,,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,PGIS,-9,-9,1,1,,
5,197001010002,1970,1,1,,0,,217,United States,1,...,"""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,,0.0
6,197001020001,1970,1,2,,0,,218,Uruguay,3,...,,,,PGIS,0,0,0,0,,0.0
7,197001020002,1970,1,2,,0,,217,United States,1,...,Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,,0.0
8,197001020003,1970,1,2,,0,,217,United States,1,...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,,0.0
9,197001030001,1970,1,3,,0,,217,United States,1,...,Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,,0.0


In [None]:
print ('final dataframe shape: ', primary_df.shape)

final dataframe shape:  (181691, 136)


In [None]:
# Summarise the index range, column count, dtypes, and memory usage
primary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Columns: 136 entries, eventid to ncasualties
dtypes: float64(56), int64(22), object(58)
memory usage: 188.5+ MB


In [None]:
# Save the preprocessed dataframe as BaseForAnalysis.csv (the row index is written as the first column)
primary_df.to_csv("BaseForAnalysis.csv", sep = ",")
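By default `to_csv` writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the difference, using a hypothetical two-row frame and in-memory buffers instead of a file on disk:

```python
import io
import pandas as pd

# Toy frame standing in for the preprocessed dataframe.
toy_df = pd.DataFrame({
    "eventid": [197000000001, 197000000002],
    "year": [1970, 1970],
})

# Default behaviour: the index is written and reappears as 'Unnamed: 0'.
with_index = io.StringIO()
toy_df.to_csv(with_index, sep=",")
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy_df.to_csv(without_index, sep=",", index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

Whether a later analysis phase expects the extra column is not shown here, so treat `index=False` as an option rather than a required change.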

### **End of Phase-2 (Data Preprocessing)**