**The Spark Foundation - Data Science & Business Analytics Internship**

TASK #4: Exploratory Data Analysis on ‘Global Terrorism’ dataset. 

Nemat Allah Aloush

December 2022


## Required libraries

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
import datetime as dt

## Reading dataset

In [3]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Spark Internship/Terrorism/data.csv",low_memory=False)

In [None]:
df.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,


In [None]:
df.shape

(181691, 135)

## A closer look at the dataset

In [None]:
# The columns names in our dataset
df.columns.values

array(['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended',
       'resolution', 'country', 'country_txt', 'region', 'region_txt',
       'provstate', 'city', 'latitude', 'longitude', 'specificity',
       'vicinity', 'location', 'summary', 'crit1', 'crit2', 'crit3',
       'doubtterr', 'alternative', 'alternative_txt', 'multiple',
       'success', 'suicide', 'attacktype1', 'attacktype1_txt',
       'attacktype2', 'attacktype2_txt', 'attacktype3', 'attacktype3_txt',
       'targtype1', 'targtype1_txt', 'targsubtype1', 'targsubtype1_txt',
       'corp1', 'target1', 'natlty1', 'natlty1_txt', 'targtype2',
       'targtype2_txt', 'targsubtype2', 'targsubtype2_txt', 'corp2',
       'target2', 'natlty2', 'natlty2_txt', 'targtype3', 'targtype3_txt',
       'targsubtype3', 'targsubtype3_txt', 'corp3', 'target3', 'natlty3',
       'natlty3_txt', 'gname', 'gsubname', 'gname2', 'gsubname2',
       'gname3', 'gsubname3', 'motive', 'guncertain1', 'guncertain2',
       'guncertain3', 'in

The dataset has 135 columns. Let's take a closer look at them and check what share of the values is missing in each.

In [4]:
# Checking the null values for each column, calculated as a percentage of all rows
null_count = round(df.isnull().sum()*100/ df.shape[0],2)

In [None]:
null_count.values

array([ 0.  ,  0.  ,  0.  ,  0.  , 94.91,  0.  , 98.78,  0.  ,  0.  ,
        0.  ,  0.  ,  0.23,  0.24,  2.51,  2.51,  0.  ,  0.  , 69.46,
       36.4 ,  0.  ,  0.  ,  0.  ,  0.  , 84.03, 84.03,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  , 96.52, 96.52, 99.76, 99.76,  0.  ,  0.  ,
        5.71,  5.71, 23.42,  0.35,  0.86,  0.86, 93.87, 93.87, 94.12,
       94.12, 94.43, 93.93, 94.04, 94.04, 99.35, 99.35, 99.4 , 99.4 ,
       99.44, 99.35, 99.37, 99.37,  0.  , 96.76, 98.89, 99.91, 99.82,
       99.99, 72.17,  0.21, 98.92, 99.82,  0.  , 39.14, 38.25, 36.39,
       89.5 , 89.5 , 98.96, 99.66, 99.66, 99.82, 99.93, 99.93, 97.34,
        0.  ,  0.  , 11.43, 11.43, 92.78, 92.78, 93.65, 93.65, 98.97,
       98.97, 99.07, 99.07, 99.96, 99.96, 99.96, 99.96, 37.24,  5.68,
       35.47, 36.85,  8.98, 35.61, 38.06,  0.  , 64.74, 64.74, 78.54,
       68.1 ,  0.1 , 92.53, 92.56, 97.76, 95.53, 99.82, 98.18, 57.41,
       99.26, 99.69, 99.57, 99.7 , 99.72, 93.95, 93.95, 94.28, 84.43,
       36.43, 57.66,

We can see that a lot of columns are missing more than 10% of their values.
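To make the 10% threshold explicit, the missing-value percentages can be filtered and sorted in one step. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "eventid": [1, 2, 3, 4],
    "approxdate": [np.nan, np.nan, np.nan, "1970-01-01"],
    "city": ["A", None, "C", "D"],
})

# Percentage of missing values per column, rounded to 2 decimals
null_pct = (toy.isnull().mean() * 100).round(2)

# Columns missing more than 10% of their values, worst first
mostly_missing = null_pct[null_pct > 10].sort_values(ascending=False)
print(mostly_missing)
```

Applied to the real frame, the same two lines would list the sparse columns by name instead of leaving them as an unlabeled array of percentages.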

In [None]:
# Listing the columns that have less than 10% null values
for k, v in null_count.items():
    if v < 10:
        print(k)

eventid
iyear
imonth
iday
extended
country
country_txt
region
region_txt
provstate
city
latitude
longitude
specificity
vicinity
crit1
crit2
crit3
doubtterr
multiple
success
suicide
attacktype1
attacktype1_txt
targtype1
targtype1_txt
targsubtype1
targsubtype1_txt
target1
natlty1
natlty1_txt
gname
guncertain1
individual
weaptype1
weaptype1_txt
nkill
nwound
property
ishostkid
dbsource
INT_LOG
INT_IDEO
INT_MISC
INT_ANY


After checking the values of the columns, I decided to keep the columns that seem informative and do not have many null values.

In [5]:
df = df[['iyear','imonth','iday','country_txt','region_txt','provstate','city','success', 'suicide',
         'attacktype1_txt','targtype1_txt','target1', 'natlty1_txt','nkill','nwound','summary','gname',
         'individual','weaptype1_txt','dbsource']].copy()  # .copy() so the in-place rename below acts on an independent frame
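Since only a subset of the columns is kept, the selection could also be made while reading the file. The sketch below assumes an in-memory CSV (`csv_text`) with made-up rows and a shortened, hypothetical column list (`wanted`); with the real file, the same `usecols` argument would be passed to the existing `pd.read_csv` call.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the real file has 135 columns
csv_text = "eventid,iyear,country_txt,nkill,extra\n1,1970,A,1,x\n2,1971,B,,y\n"

# Only these columns are parsed; everything else is skipped at read time
wanted = ["iyear", "country_txt", "nkill"]
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(subset)
```

Reading fewer columns reduces memory use, which can matter for a file with roughly 180,000 rows.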

Renaming the columns to more readable names :)

In [6]:
df.rename(columns={'iyear':'Year','imonth':'Month','iday':'Day','country_txt':'Country','region_txt':'Region', 
                   'provstate':'State', 'success' : 'Success', 'suicide':'Suicide','attacktype1_txt':'AttackType',
                   'targtype1_txt':'TargetType' , 'target1':'Target', 'natlty1_txt':'Nationality',
                  'nkill':'Loss','nwound':'Wounded','summary':'Summary','gname':'Group',
                  'weaptype1_txt':'Weapon_type', 'dbsource':'InfoSource'},inplace=True)

## Dataset Descriptive Analysis

#### Checking the available years in the dataset

In [None]:
df['Year'].value_counts().keys().sort_values()

Int64Index([1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980,
            1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
            1992, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
            2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
            2015, 2016, 2017],
           dtype='int64')

As we can see, the dataset covers the years from 1970 to 2017, with the exception of 1993.
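Gaps like this one can also be found programmatically instead of by reading the index. A minimal sketch, assuming a short hypothetical `years` series in place of `df['Year']`:

```python
import pandas as pd

# Hypothetical sample of years; the real values come from df['Year']
years = pd.Series([1990, 1991, 1992, 1994, 1995, 1994])

# Every year between the first and the last observed year
expected = set(range(int(years.min()), int(years.max()) + 1))

# Years in that range with no recorded event
missing_years = sorted(expected - set(int(y) for y in years.unique()))
print(missing_years)
```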

#### Analyzing the dataset regarding the countries 

In [None]:
# Let's find the number of terrorist events that happened in each country
countries_df = df.groupby(['Country']).size().sort_values(ascending=[False])\
         .to_frame().reset_index()\
        .rename(columns= {0: 'Events_Count'})

In [None]:
# Reformatting the dataframe to keep the details of only the top 15 countries and combine the other countries into one row 'other'

# This dataframe (countries_df_part) contains the top 15 countries
countries_df_part=countries_df[:15] 

# Finding the number of terrorist events that happened outside the top 15 countries
accidents_other = countries_df['Events_Count'][15:].sum() 
df2 = pd.DataFrame([['other', accidents_other]], columns=['Country','Events_Count'])

# The final dataframe to visualize (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
countries_df_part = pd.concat([countries_df_part, df2], ignore_index=True)
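The "top N plus 'other'" pattern used here recurs later in the notebook, so it could be wrapped in a small helper. The function name `top_n_with_other` and the toy counts are assumptions made for illustration; this is a sketch, not the code the notebook actually ran.

```python
import pandas as pd

def top_n_with_other(counts, n, label_col, value_col="Events_Count"):
    """Keep the n largest rows of a counts frame and sum the rest into one 'other' row."""
    top = counts.nlargest(n, value_col)
    rest_total = counts[value_col].sum() - top[value_col].sum()
    other = pd.DataFrame([["other", rest_total]], columns=[label_col, value_col])
    return pd.concat([top, other], ignore_index=True)

# Hypothetical counts, not taken from the dataset
toy_counts = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                           "Events_Count": [50, 30, 15, 5]})
result = top_n_with_other(toy_counts, 2, "Country")
print(result)
```

Because the helper returns a frame whose values still sum to the original total, the pie chart percentages remain shares of the whole dataset.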

The following pie chart shows the 15 countries where the most terrorist events happened.

In [None]:
fig = px.pie(countries_df_part, 
            values='Events_Count', 
            names='Country', 
            title='Top 15 countries where the most terrorist events happened')
fig.update_traces( textinfo='percent+label',textfont_size=10)
fig.show()

Iraq accounts for 13.6% of the terrorist events, Pakistan for 7.91%, Afghanistan for 7.01% and India for 6.58%. 71.48% occurred in the other countries, none of which accounts for more than 4.57%.
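Shares like these can be computed directly rather than read off the chart, which also makes it easy to check that they add up to 100%. A minimal sketch on hypothetical data (the frame `events` and its labels are made up):

```python
import pandas as pd

# Hypothetical event rows, one per attack
events = pd.DataFrame({"Country": ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"]})

# Share of events per country in percent, largest first
share_pct = (events["Country"].value_counts(normalize=True) * 100).round(2)
print(share_pct)

# Sanity check: the shares of all countries should sum to about 100
print(share_pct.sum())
```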

#### Analyzing the dataset regarding the countries and year

In [None]:
year_country= df.groupby(['Country','Year']).size().to_frame()\
    .reset_index().rename(columns={0: 'Events_Count'})

For a clearer visualization, let's plot only the top 10 countries in detail and combine the rest into an 'Other' category.


In [None]:
# Countries to be shown
year_country_vis=year_country[year_country['Country'].isin(countries_df_part['Country'][:10])]

In [None]:
# Countries to be combined: sum their events per year
year_other_country =year_country[~year_country['Country'].isin(countries_df_part['Country'][:10])]
year_other_country = year_other_country.groupby('Year')['Events_Count'].sum().reset_index()
year_other_country.insert(0, 'Country', 'Other')

In [None]:
# the final dataframe to be plotted.
year_country_finale = pd.concat([year_country_vis,year_other_country])

In [None]:
fig = px.bar(year_country_finale, x="Year", y="Events_Count", color="Country")
fig.show()
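When rows that already hold counts are collapsed into an 'Other' category, the counts have to be summed; calling `size()` on such a frame counts rows (here, countries) rather than events. The sketch below illustrates the difference on a hypothetical pre-aggregated frame, `year_country_toy`.

```python
import pandas as pd

# Hypothetical pre-aggregated counts: one row per (Country, Year)
year_country_toy = pd.DataFrame({
    "Country": ["A", "B", "C", "B", "C"],
    "Year": [1970, 1970, 1970, 1971, 1971],
    "Events_Count": [10, 4, 6, 3, 2],
})

# Everything except the countries shown individually (here only "A")
others = year_country_toy[~year_country_toy["Country"].isin(["A"])]

# size() counts how many countries have a row in each year
rows_per_year = others.groupby("Year").size()

# sum() adds up the events of those countries in each year
events_per_year = others.groupby("Year")["Events_Count"].sum()

print(rows_per_year)
print(events_per_year)
```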

#### Analyzing the dataset regarding the regions 

In [None]:
# Let's find the number of terrorist events that happened in each region
regions_df = df.groupby(['Region']).size().sort_values(ascending=[False])\
         .to_frame().reset_index()\
        .rename(columns= {0: 'Events_Count'})

In [None]:
fig = px.pie(regions_df, 
            values='Events_Count', 
            names='Region', 
            title='Regions where the most terrorist events happened')
fig.update_traces( textinfo='percent+label',textfont_size=10)
fig.show()

As can be seen from the pie chart, more than 50% of the terrorist attacks happened in the MENA region and South Asia.

#### Analyzing the dataset regarding the regions and the year of occurrence

In [None]:
year_region= df.groupby(['Region','Year']).size().to_frame()\
    .reset_index().rename(columns={0: 'Events_Count'})

In [None]:
fig = px.bar(year_region, x="Year", y = 'Events_Count', color="Region")
fig.show()

The graph shows that only after 1992 did terrorist attacks start to happen more in the MENA region and South Asia; before that they seemed to happen more in South America, Central America & Caribbean and Western Europe.
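A shift like this can be quantified by comparing each region's share of events before and after the cutoff year. The following is a sketch on made-up rows; the region labels `R1` to `R3` and the frame `toy` are placeholders, and 1992 is used as the cutoff only because the text above mentions it.

```python
import numpy as np
import pandas as pd

# Hypothetical event rows with placeholder region names
toy = pd.DataFrame({
    "Region": ["R1", "R1", "R2", "R3", "R3", "R3", "R1", "R3"],
    "Year": [1980, 1985, 1990, 1991, 1995, 2005, 2010, 2015],
})

# Label each event by period relative to the cutoff year
toy["Period"] = np.where(toy["Year"] <= 1992, "1992 and earlier", "after 1992")

# Share of events per region within each period, in percent (each row sums to 100)
share = pd.crosstab(toy["Period"], toy["Region"], normalize="index") * 100
print(share)
```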

#### Analyzing the dataset regarding the cities

In [None]:
# Let's find the number of terrorist events that happened in each city
city_df = df.groupby(['city']).size().sort_values(ascending=[False])\
         .to_frame().reset_index()\
        .rename(columns= {0: 'Events_Count'})

In [9]:
fig = px.pie(city_df[:15], 
            values='Events_Count', 
            names='city', 
            title='Top 15 cities where the most terrorist attacks happened')
fig.update_traces( textinfo='percent+label',textfont_size=10)
fig.show()

The pie chart covers only 20.6% of all events in the dataset; the rest happened in other cities, none of which accounts for more than 2.27% of the total.


We can see that the city that suffered the most terrorist attacks between 1970 and 2017 is Baghdad, which makes sense because Iraq is the country that suffered the most attacks in this period.

#### Analyzing the dataset regarding the number of killed and wounded people and the year of occurrence

In [None]:
victims_counts = df.groupby('Year').agg(
     Loss_count = ('Loss','sum'),
     Wounded_count = ('Wounded','sum'),
     ).reset_index()

In [None]:
fig = px.bar(victims_counts, x="Year", y=["Loss_count", "Wounded_count"])
fig.show()

The bar chart above shows the number of people killed and wounded in each year. Compared with the earlier charts of events per year, the number of attacks definitely affects the number of people killed or wounded, but it is certainly not the only factor.
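One way to separate the effect of the number of attacks from their severity is to look at casualties per attack. A minimal sketch, assuming a hypothetical frame `toy` with the same `Year`, `Loss` and `Wounded` columns; note that `sum` skips missing values, so attacks with unknown casualties contribute zero.

```python
import pandas as pd

# Hypothetical attack rows; None marks an unknown casualty figure
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2001],
    "Loss": [1.0, None, 10.0, 0.0, 2.0],
    "Wounded": [3.0, 1.0, None, 5.0, 4.0],
})

per_year = toy.groupby("Year").agg(
    Attacks=("Loss", "size"),          # number of attacks, including rows with missing values
    Loss_count=("Loss", "sum"),        # missing values are skipped
    Wounded_count=("Wounded", "sum"),
).reset_index()

# Average number of people killed per attack in each year
per_year["Loss_per_attack"] = per_year["Loss_count"] / per_year["Attacks"]
print(per_year)
```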

#### Analyzing the dataset regarding the number of killed and wounded people and the region where the terrorist attack happened


In [None]:
victims_region_counts = df.groupby(['Region']).agg(
     Loss_count = ('Loss','sum'),
     Wounded_count = ('Wounded','sum'),
     ).reset_index()

In [None]:
fig = px.bar(victims_region_counts, y="Region", x=["Loss_count", "Wounded_count"])
fig.show()

Since most terrorist attacks happened in the MENA region and South Asia, the number of people killed and wounded is significantly higher in those two regions.

#### Analyzing the dataset regarding the groups that carried out the terrorist attacks

In [45]:
victims_group_counts = df.groupby(['Group']).size().sort_values(ascending=[False])\
         .to_frame().reset_index()\
        .rename(columns= {0: 'Events_Count'})

In [46]:
fig = px.bar(victims_group_counts[1:15], y="Group", x = "Events_Count", title = 'The top 14 groups that organized terrorist attacks.' )
fig.show()

The figure above shows the top 14 known groups that carried out terrorist attacks between 1970 and 2017. The Taliban, followed by ISIL, was responsible for the most attacks.
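The slice `[1:15]` relies on the unattributed attacks being the first row after sorting. Filtering that label out by name is more robust. The sketch below assumes the label is the string `'Unknown'` and uses made-up group names `G1` to `G3`.

```python
import pandas as pd

# Hypothetical attack rows; 'Unknown' is assumed to mark unattributed attacks
toy = pd.DataFrame({"Group": ["Unknown"] * 5 + ["G1"] * 3 + ["G2"] * 2 + ["G3"]})

# Drop unattributed attacks explicitly instead of skipping the first row
known = toy[toy["Group"] != "Unknown"]

# Count events per group and keep the two most active known groups
top_known = (known.groupby("Group").size()
                  .sort_values(ascending=False)
                  .head(2)
                  .reset_index(name="Events_Count"))
print(top_known)
```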

#### Analyzing the dataset regarding the groups that carried out the terrorist attacks and the year of occurrence

In [47]:
year_group= df.groupby(['Group','Year']).size().to_frame()\
    .reset_index().rename(columns={0: 'attacks'})

In [56]:
year_group_vis=year_group[year_group['Group'].isin(victims_group_counts['Group'][1:14])]

In [57]:
fig = px.line(year_group_vis, x="Year", y="attacks", color="Group", line_group="Group", hover_name="Group",
        line_shape="spline", render_mode="svg")
fig.show()


Some groups, such as FMLN and SL, carried out their attacks before 1992; after that they performed almost no attacks compared to the other groups.
After 2004 the Taliban seems to be responsible for the most attacks.
After 2013 ISIL and the Taliban performed more terrorist attacks than any other group.

#### Analyzing the dataset regarding the weapon types used in those terrorist attacks

In [23]:
weapons_counts = df.groupby(['Weapon_type']).size().sort_values(ascending=[False])\
         .to_frame().reset_index()\
        .rename(columns= {0: 'Events_Count'})

In [38]:
# Reformatting the dataframe to keep the details of only the top 4 weapon types and combine the rest into one row 'other'

# This dataframe (weapons_counts_part) contains the top 4 weapon types
weapons_counts_part=weapons_counts[:4] 

# Finding the number of terrorist events that involved any other weapon type
accidents_other = weapons_counts['Events_Count'][4:].sum() 
df2 = pd.DataFrame([['other', accidents_other]], columns=['Weapon_type','Events_Count'])

# The final dataframe to visualize (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
weapons_counts_part = pd.concat([weapons_counts_part, df2], ignore_index=True)

In [40]:
fig = px.pie(weapons_counts_part, 
            values='Events_Count', 
            names='Weapon_type', 
            title='Weapons used during terrorist attacks')
fig.update_traces( textinfo='percent+label',textfont_size=10)
fig.show()

'Explosives' seems to be the most frequently used weapon type in terrorist attacks.

#### Analyzing the dataset regarding the Attack Type for those terrorist attacks

In [43]:
attacks_type = df.groupby(['AttackType']).size().sort_values(ascending=[False])\
         .to_frame().reset_index()\
        .rename(columns= {0: 'Events_Count'})

In [44]:
fig = px.pie(attacks_type, 
            values='Events_Count', 
            names='AttackType', 
            title='Attack types')
fig.update_traces( textinfo='percent+label',textfont_size=10)
fig.show()

Bombing/Explosion is the most common attack type.

#### Analyzing the dataset regarding the Target Type for those terrorist attacks

In [45]:
targert_type = df.groupby(['TargetType']).size().sort_values(ascending=[False])\
         .to_frame().reset_index()\
        .rename(columns= {0: 'Events_Count'})

In [47]:
fig = px.bar(targert_type, y="TargetType", x = "Events_Count", title = 'Target Types' )
fig.show()

Private citizens and property were the most frequent targets of the documented terrorist attacks.

### Findings

* Note: The dataset covers the years from 1970 to 2017, with the exception of 1993.

* More than 50% of the terrorist attacks happened in the MENA region and South Asia.

* Iraq accounts for 13.6% of the terrorist events, Pakistan for 7.91%, Afghanistan for 7.01% and India for 6.58%. 71.48% occurred in the other countries, none of which accounts for more than 4.57%.

* The city that suffered the most terrorist attacks between 1970 and 2017 is Baghdad, which makes sense because Iraq is the country that suffered the most attacks in this period.

* Only after 1992 did terrorist attacks start to happen more in the MENA region and South Asia; before that they seemed to happen more in South America, Central America & Caribbean and Western Europe.

* Since most terrorist attacks happened in the MENA region and South Asia, the number of people killed and wounded is significantly higher in those two regions.

* The Taliban, followed by ISIL, was responsible for the most terrorist attacks.

* Some groups, such as FMLN and SL, carried out their attacks before 1992; after that they performed almost no attacks compared to the other groups.

* After 2004 the Taliban seems to be responsible for the most attacks, and after 2013 ISIL and the Taliban performed more terrorist attacks than any other group.

* Private citizens and property were the most frequent targets of the documented terrorist attacks.

* 'Explosives' seems to be the most frequently used weapon type in terrorist attacks.

* Bombing/Explosion is the most common attack type.
