# Intro

## Abstract
Terrorism is widely covered in the media and, unfortunately, we have become accustomed to its presence worldwide, particularly over the last decade. Nevertheless, the problem we face today is not new. Some conflicts started several decades ago, and some of them are still ongoing. Our goal is to track and visualize the evolution of terrorism over the past 50 years based on "The Global Terrorism Database". There are many questions we can ask ourselves about terrorism, such as "Is the EU less safe nowadays?", "Did attack methods and motives change over the years?" or "Can we identify current and future conflict zones?". It would be presumptuous of us to claim that we are going to solve major issues, or even predict future attacks. However, by exploring the dataset and trying to answer these questions, we aim to get an overview and a better understanding of the evolution of terrorism.

## Plan

1. [Raw data understanding and cleaning](#raw_data)
    1. [Field selections using documentation](#fields_select)
    2. [Data exploration](#data_exploration)
2. [Data visualization](#data_viz)
    1. [Worldmap heatmap all-time & over the years](#world_overview)
    2. [Some global evolutions over the years](#attacks_casualities)
3. [Groups](#groups)
4. [Events that marked the world](#events_world)
    1. [North America Bombings (1970)](#NAB_1970)
    2. [Northern Ireland religious conflict (1972-1973)](#EU_1972)
    3. [Northern Ireland and Basque Country (1975-1977)](#EU_1975)
    4. [Salvadoran Civil War (1981-1983)](#CA_1981)
    5. [South America Conflicts (1984-1987)](#SA_1984)
    6. [Middle East (2003-2007)](#ME_2003)
    7. [South Asia (2008-2013)](#ME_2008)
    8. [Middle East (2013-Today)](#ME_2013)
5. [What's next](#whats_next)


# 1 Raw data understanding  <a id='raw_data'></a>

## 1.1 Field selections using documentation  <a id='fields_select'></a>

First of all, we need to look closely at our dataset to sort out the relevant data we will use for our observations. The Global Terrorism Database contains 135 features and approximately 170'000 entries. To select the features to keep, we used the official [documentation](http://start.umd.edu/gtd/downloads/Codebook.pdf) of the dataset, which describes each feature precisely. Here is a quick summary of the features we decided to use for our project.

* `eventid` : this is the id of any entry, written as 12 numbers (first 8 digits are the date of event and last 4 digits are a sequential case number for the given day). This will be used as our index too.
* `iyear`, `imonth`, `iday` : Year, month and day of the event. On rare occasions the month or the day is unknown.
* `country_txt` : name of the country where the event took place.
* `region_txt` : name of the region where the event took place.
* `city` : This field contains the name of the city, village, or town in which the incident occurred. If the city, village, or town for an incident is unknown, then this field contains the smallest administrative area below provstate which can be found for the incident (e.g., district).  
* `latitude` and `longitude` : Latitude and Longitude values where the event took place.
* `doubtterr` : boolean value set to 1 if there is doubt as to whether the incident is an act of terrorism, and 0 if there is no doubt.
* `success` : boolean value set as 1 if the incident was successful or 0 if it was not. As stated in the documentation, "Success of a terrorist strike is defined according to the tangible effects of the attack. Success is not judged in terms of the larger goals of the perpetrators. For example, a bomb that exploded in a building would be counted as a success even if it did not succeed in bringing the building down or inducing government repression." 
* `suicide` : boolean value set as 1 if the attack perpetrator did not intend to escape from the attack alive, 0 otherwise.
* `attacktype1_txt` : This field captures the general method of attack and often reflects the broad class of tactics used. It consists of nine categories, which are defined below :
    1. Assassination
    2. Armed Assault
    3. Bombing/Explosion
    4. Hijacking 
    5. Hostage taking (barricade incident) 
    6. Hostage taking (kidnapping)
    7. Facility/Infrastructure Attack
    8. Unarmed Assault
    9. Unknown 
* `targtype1_txt` : The target/victim type field captures the general type of target/victim. When a victim is attacked specifically because of his or her relationship to a particular person, such as a prominent figure, the target type reflects that motive. For example, if a family member of a government official is attacked because of his or her relationship to that individual, the type of target is “government.” This variable consists of the following 22 categories: <br>
    1. Business
    2. Government (General)
    3. Police
    4. Military
    5. Abortion related
    6. Airport & aircraft
    7. Government (Diplomatic), differs from the other government entry in that it covers representations of a government on foreign soil (embassy, consulate...)
    8. Educational institution
    9. Food or water supply
    10. Journalist & media
    11. Maritime facilities, including ports
    12. NGO
    13. Other
    14. Private citizens & property, includes attacks in a public area against private citizens
    15. Religious figures/institutions
    16. Telecommunication
    17. Terrorists/non-state militias
    18. Tourists
    19. Transportation (other than aviation)
    20. Unknown
    21. Utilities, facilities for generation or transmission of energy
    22. Violent political parties
* `gname` : This field contains the name of the group that carried out the attack. In order to ensure consistency in the usage of group names for the database, the GTD database uses a standardized list of group names that have been established by project staff to serve as a reference for all subsequent entries.  
* `gname2` : This field is used to record the name of the second perpetrator when responsibility for the attack is attributed to more than one perpetrator. Conventions follow “Perpetrator Group” field.  
* `gname3` : same as for gname2
* `nperps` : This field indicates the total number of terrorists participating in the incident. (In the instance of multiple perpetrator groups participating in one case, the total number of perpetrators, across groups, is recorded). There are often discrepancies in information on this value.   
* `weaptype1_txt` : This field records the general type of weapon used in the incident. It consists of the following categories: <br>
    1. Biological
    2. Chemical
    3. Radiological
    4. Nuclear
    5. Firearms
    6. Explosives/bombs/dynamite
    7. Fake weapons
    8. Incendiary
    9. Melee
    10. Vehicle
    11. Sabotage equipment 
    12. Other
    13. Unknown
* `nkill` : This field stores the number of total confirmed fatalities for the incident. The number includes all victims and attackers who died as a direct result of the incident.   
* `nkillter`: This field stores the number of confirmed terrorists fatalities.
* `nwound` : This field records the number of confirmed non-fatal injuries to both perpetrators and victims. 
* `nwoundte` : This field records the number of confirmed non-fatal terrorists injuries. 
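
The structure of `eventid` described in the list above (8 digits for the date, 4 digits for the sequence number) can be illustrated with a minimal sketch. The helper `split_eventid` and the id used below are hypothetical and only serve as an illustration; they are not part of the dataset or of its documentation.

```python
def split_eventid(eventid):
    """Split a 12-digit event id into (year, month, day, sequence number)."""
    text = str(eventid)
    if len(text) != 12 or not text.isdigit():
        raise ValueError('expected a 12-digit id, got {!r}'.format(eventid))
    # First 8 digits: date as yyyymmdd, last 4 digits: sequential case number
    return int(text[:4]), int(text[4:6]), int(text[6:8]), int(text[8:])

# Hypothetical id chosen for illustration
year, month, day, seq = split_eventid(199912310002)
print(year, month, day, seq)
```

Since the month or the day can be unknown, the date part of an id should probably not be assumed to always form a valid calendar date.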


We are now down to 22 features instead of the original 135 from the dataset. Apart from the features we kept, we explored some others, such as `weaptype2`, `weapsubtype` or `motive`, to see whether they would bring additional information and thus be relevant to use as well. However, we decided to drop them because they contain too many NaN or unknown entries. We focused on features that allow us to answer the questions asked in the abstract, as well as features relevant for a meaningful visualization of the data.


## 1.2 Data exploration  <a id='data_exploration'></a>

Let's begin by importing the libraries and creating a dataframe to explore the data further. As cautious aspiring data scientists, we will explore each field in detail and check the proportion of uncategorized or Unknown-labeled entries to make sure each feature we kept contains relevant data. <br>
*NOTE: in this section we only explore the data, without drawing conclusions or making assumptions about it. Those will come later in our analysis.*

In [None]:
import pandas as pd
import os
import numpy as np
import datetime
import time
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML

%pylab inline
%matplotlib inline

sns.set_context("notebook")

In [None]:
data_path = 'data'
gtd_path = os.path.join(data_path, 'globalterrorismdb_0617dist.csv')

In [None]:
fields = ['eventid', 'iyear', 'imonth', 'iday', 'country_txt', 'region_txt', 'city', 
          'latitude', 'longitude', 'doubtterr', 'attacktype1_txt',  'success', 
          'suicide', 'weaptype1_txt', 'targtype1_txt', 'gname', 'gname2', 
          'gname3', 'compclaim', 'nperps', 'nkill', 'nkillter', 'nwound', 'nwoundte']
date_fileds = ['iyear', 'imonth', 'iday']

df = pd.read_csv(gtd_path, encoding='latin', usecols=fields, index_col='eventid', low_memory=False)

In [None]:
print('Is index unique: {}'.format(df.index.is_unique))

In [None]:
df.dtypes

According to the documentation, the month or the day (or both) can be set to 0 if the exact date of the attack is unknown. We created a function that sets the field to 1 when it is unknown. We then compute the proportion of uncertain dates within the dataset, to make sure it is not too high.
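
As a possible alternative to a row-wise function, the same convention (unknown month or day replaced by 1) can be sketched in a vectorized way with `pd.to_datetime`. The toy frame below reuses the column names of the dataset, but its values are made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; values are made up
toy = pd.DataFrame({'iyear': [1970, 1985, 2001],
                    'imonth': [0, 6, 9],
                    'iday': [0, 0, 11]})

# Flag rows whose month or day is unknown (coded as 0)
uncertain = (toy[['imonth', 'iday']] == 0).any(axis=1)

# Replace the unknown parts by 1 and let pandas assemble the dates
parts = toy.rename(columns={'iyear': 'year', 'imonth': 'month', 'iday': 'day'})
parts[['month', 'day']] = parts[['month', 'day']].replace(0, 1)
toy['date'] = pd.to_datetime(parts)

print(toy)
print('Uncertain dates: {:.2f}%'.format(100 * uncertain.mean()))
```

This variant produces pandas timestamps rather than `datetime.date` objects, which may or may not suit the rest of an analysis.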

In [None]:
# According to the documentation both month and day can be 0 (if unknown); in that case we set them to 1
def parse_date(row):
    return datetime.date(int(row.iyear), int(row.imonth) if not np.isnan(row.imonth) else 1, 
                         int(row.iday) if not np.isnan(row.iday) else 1)

In [None]:
# Count number entries with uncertain date (either month or day)
df[date_fileds] = df[date_fileds].replace(0, np.nan)
n_uncertain = np.sum(np.sum(df[date_fileds].isnull(), axis=1) != 0 )
df['date'] =df.apply(lambda x: parse_date(x), axis=1)
print('Uncertain dates: {:.2f}%, ({}/{})'.format(100*n_uncertain/len(df), n_uncertain, len(df)))

We check the proportion of entries without geographic coordinates. When the coordinates are unknown, we decided to drop the row completely, because we need the location information to represent the data on maps.

In [None]:
n_geo = len(df)
df.dropna(subset=['latitude', 'longitude'], inplace=True)
print('Entries without geographic coordinates dropped: {:.2f}%, ({}/{})'.format(
    100*(n_geo-len(df))/n_geo, n_geo-len(df), n_geo))

We now check the number of attacks for which there is doubt as to whether they are terror attacks. We wanted to check this particular feature to make sure that a large majority of the entries are unambiguously terror attacks. Had this not been the case, the whole dataset, as well as our study, would not have been relevant. <br>
For 15% of the attacks, there is doubt as to whether they should be categorized as terror attacks. The value -9 represents cases for which the information was not available when the dataset was constructed. We decided to treat them as certain terror attacks.
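
The recoding described above can be sketched on a small made-up series. The values below are not taken from the dataset; only the codes (-9, 0, 1) follow the description, and the labels `N_DOUBT` and `DOUBT` are those used in this notebook.

```python
import pandas as pd

# Made-up values using the codes described above:
# -9 (not available), 0 (no doubt), 1 (doubt)
doubt = pd.Series([0, 1, -9, 0, 0, 1, -9, 0])

# Treat "not available" as "no doubt", then attach readable labels
labels = doubt.where(doubt >= 0, 0).map({0: 'N_DOUBT', 1: 'DOUBT'}).astype('category')

# Share of each label in percent
share = 100 * labels.value_counts(normalize=True)
print(share)
```

Mapping the codes explicitly through a dictionary avoids relying on the order in which the categories are stored.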

In [None]:
print('Distribution of data in %:\n{}'.format(100*df.doubtterr.value_counts()/len(df)))
df.loc[df.doubtterr < 0, 'doubtterr'] = 0
df.doubtterr = df.doubtterr.astype('category')
# Categories are sorted ([0, 1]), so 0 -> N_DOUBT and 1 -> DOUBT
df.doubtterr = df.doubtterr.cat.rename_categories(['N_DOUBT', 'DOUBT'])
print('\nDistribution of data in % (after cleaning):\n{}'.format(100*df.doubtterr.value_counts()/len(df)))

For the next two fields, the data is, as expected, binary and completely categorized.

In [None]:
print('Unique values in field: {}'.format(np.unique(df.success)))
print('Percentage of successful attacks: {:.2f}%'.format(100*df.success.mean()))

In [None]:
print('Unique values in field: {}'.format(np.unique(df.suicide)))
print('Percentage of suicide attacks: {:.2f}%'.format(100*df.suicide.mean()))

Time to explore the proportion of each attack type and see if there are any NaN values.

In [None]:
print('Type of attack and distribution in dataset in %')
100*df.attacktype1_txt.value_counts()/len(df)

We do the same for the distribution of target types and the distribution of weapon types.

In [None]:
print('Type of target and distribution in dataset in %')
100*df.targtype1_txt.value_counts()/len(df)

In [None]:
print('Distribution of weapon type in dataset in %')
100*df.weaptype1_txt.value_counts()/len(df)

We now check the number of unique entries in the field naming the group that carried out the attack. According to the documentation, work has been done to standardize the entries of this field, using a specific list of group names established by project staff.
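
Counting group names across the three perpetrator columns can be sketched on a toy frame. The group names below are placeholders, and treating `Unknown` as missing is the convention adopted in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the three perpetrator columns; group names are placeholders
toy = pd.DataFrame({'gname': ['Group A', 'Unknown', 'Group B'],
                    'gname2': [np.nan, 'Group A', np.nan],
                    'gname3': [np.nan, np.nan, 'Group C']})

# Flatten the three columns into one series of names
names = pd.Series(toy[['gname', 'gname2', 'gname3']].values.ravel())
# Treat 'Unknown' as missing and drop all missing entries
names = names.replace('Unknown', np.nan).dropna()

print(names.value_counts())
print('Number of unique group names: {}'.format(names.nunique()))
```

Using `nunique()` after `dropna()` keeps the missing value from being counted as a group of its own.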

In [None]:
pd.Series(df[['gname', 'gname2', 'gname3']].values.ravel('K')).value_counts().head(10)

In [None]:
df[['gname', 'gname2', 'gname3']] = df[['gname', 'gname2', 'gname3']].replace({'Unknown': np.nan})
n_group = len(pd.unique(df[['gname', 'gname2', 'gname3']].values.ravel('K')))
print('Number of unique group names: {}'.format(n_group))

We now explore the numerical fields. First, we look at the field giving the number of perpetrators of an attack. As expected, it contains a large number of unknown entries, as it is not easy to know how many perpetrators took part in an attack. We decided to keep this column anyway, since the number of perpetrators could give some insight into the evolution of terror attacks.

In [None]:
df.loc[df.nperps < 0, 'nperps'] = np.nan
print('Percentage of entries with unknown # Perpetrators {:.2f}%'.format(100*np.sum(df.nperps.isnull())/len(df)))
print('Range of # Perpetrators: {} up to {}'.format(int(df.nperps.min()), int(df.nperps.max())))

We now look at the number of victims of terror attacks. As the dataset counts the total number of fatalities, perpetrators included, we found it relevant to also keep the number of killed terrorists for our analysis. The same logic applies to the number of wounded.
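
The derivation of non-perpetrator fatalities from the two fields can be sketched on made-up numbers. The sketch assumes that a missing perpetrator count is treated as 0 and that negative differences (inconsistent records) are set to 0; the column name `nkillnter` is the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up counts: total fatalities and perpetrator fatalities
toy = pd.DataFrame({'nkill': [5, 2, np.nan, 1],
                    'nkillter': [1, np.nan, np.nan, 3]})

# Missing perpetrator deaths count as 0; negative differences are clipped to 0
toy['nkillnter'] = (toy.nkill - toy.nkillter.fillna(0)).clip(lower=0)
print(toy)
```

Rows where the total itself is missing stay missing, since nothing can be inferred for them.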

In [None]:
df['nkillnter'] = df.nkill-df.nkillter.fillna(0)
df.loc[df.nkillnter < 0, 'nkillnter'] = 0

print('Range total # of victims: [{}, {}]'.format(df.nkill.min(), df.nkill.max()))
print('Range # of non terrorists victims: [{}, {}]'.format(df.nkillnter.min(), df.nkillnter.max()))
print('Range # of terrorists victims: [{}, {}]'.format(df.nkillter.min(), df.nkillter.max()))

In [None]:
df['nwoundnter'] = df.nwound-df.nwoundte.fillna(0)
df.loc[df.nwoundnter < 0, 'nwoundnter'] = 0

print('Range total # of wounded: [{}, {}]'.format(df.nwound.min(), df.nwound.max()))
print('Range # of non terrorists wounded: [{}, {}]'.format(df.nwoundnter.min(), df.nwoundnter.max()))
print('Range # of terrorists wounded: [{}, {}]'.format(df.nwoundte.min(), df.nwoundte.max()))

---
# 2 Data Visualization  <a id='data_viz'></a>

## 2.1 Worldmap heatmap all-time & over the years  <a id='world_overview'></a>

Here we define the basic functions for plotting maps. The first function extracts `latitude` and `longitude` from the data to be plotted in folium maps. The next two functions plot the actual heat maps of attacks (per year and overall).

In [None]:
import folium
from folium.plugins import HeatMap, HeatMapWithTime
from folium.plugins import MarkerCluster

def get_data_longlat(df, val=None):
    if val is not None:
        df_t = df.loc[df[val] > 0]
        data_year = df_t[['latitude', 'longitude']].values
        return np.concatenate((data_year, np.expand_dims(df_t[val], axis=1)), axis=1)
    else:
        data_year = df[['latitude', 'longitude']].values
        return np.concatenate((data_year, np.ones((len(data_year), 1))), axis=1)

def get_heatmap_time(df, coord=[30., 5.], zoom=2):
    data_all = []
    year_label = []
    for year, d in df.groupby('iyear'):
        data_all.append(get_data_longlat(d).tolist())
        year_label.append('Year: {}'.format(year))
    m = folium.Map(coord, tiles='stamentoner', zoom_start=zoom)
    HeatMapWithTime(data_all, index=year_label, radius=10, max_opacity=1).add_to(m)
    return m
    
def get_heatmap(df, val=None, coord=[30., 5.], zoom=2, min_opacity=0.5, blur=5):
    data = get_data_longlat(df, val) 
    m = folium.Map(coord, tiles='stamentoner', zoom_start=zoom)
    HeatMap(data.tolist(), radius=5, min_opacity=min_opacity, blur=blur).add_to(m)
    return m
    
rand_seed = 0
np.random.seed(rand_seed)

Due to the large amount of data and the limitations of Folium, we decided to select a random subset of $n=50000$ samples. The first map displays these samples as a heat map. Although only about a third of the dataset is displayed, we assume its distribution is similar to that of the complete dataset. We can clearly see regions of conflict, for example in the Middle East or in India. However, this map has no temporal dimension, so we also display the evolution of the data over time.
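
A reproducible random subset can also be drawn with `DataFrame.sample`, as in the following sketch. The toy frame and the subset size are made up for illustration, and `random_state` plays the role of the fixed seed.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dataset; coordinates are made up
toy = pd.DataFrame({'latitude': np.linspace(-60, 60, 1000),
                    'longitude': np.linspace(-180, 180, 1000)})

# Draw 300 rows without replacement; random_state makes the draw reproducible
toy_sub = toy.sample(n=300, random_state=0)
print(len(toy_sub), toy_sub.index.is_unique)
```

Calling `sample` again with the same `random_state` returns the same rows, which keeps the maps identical from one run to the next.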

In [None]:
n = 50000
id_sub = np.random.permutation(len(df))[:n]
df_sub = df.iloc[id_sub]

In [None]:
m_overall = get_heatmap(df_sub, coord=[30., 5.])
m_overall_time = get_heatmap_time(df_sub, coord=[30., 5.])

display(HTML('<h4>{}</h4>'.format('50 years of terrorism worldwide')))
display(m_overall)
display(HTML('<h4>{}</h4>'.format('Yearly evolution of terrorism worldwide')))
display(m_overall_time)

## 2.2 Some global evolutions over the years   <a id='attacks_casualities'></a>

### 2.2.1 Number of Attacks by years

Here we focus on the evolution of the number of attacks. First, we look at the data from a worldwide perspective. We can notice that the year 1993 is missing from our data. It is unlikely that no terrorist attacks occurred during this period. According to the [Codebook](http://start.umd.edu/gtd/downloads/Codebook.pdf), the absence of data for this year is due to lost records.

> In addition, users familiar with the GTD’s Data Collection Methodology are aware that incidents 
of terrorism from 1993 are not present in the GTD because they were lost prior to START’s 
compilation of the GTD from multiple data collection efforts. 

Overall, we can clearly see that the number of attacks is increasing. However, the number of attacks seems to decrease during the period 2015-2017.

In [None]:
year_span = 1+df.iyear.max()-df.iyear.min()
plt.figure(figsize=(16,5))
df.date.hist(bins=year_span, label='# Attacks')
plt.xlabel('Time'); plt.ylabel('# Attack'); plt.legend(); 
plt.title('Evolution of number of attacks worldwide', fontsize=12, fontweight='bold')

The number of attacks alone is not always informative. Nowadays, terrorism is unfortunately associated with isolated events causing a huge number of casualties (e.g. Charlie Hebdo). We therefore also plot the number of deaths linked to terrorism over the years. The number of deaths has increased over the years to reach a peak of over 40'000 deaths in 2015. This represents an average of <b>110 deaths per day</b>. Note that the year 1993 is still missing from our data (the bars are contiguous here, so the gap is not visible).
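
The daily average quoted above follows from a simple division. The sketch below uses the rounded figure of 40'000 deaths from the text rather than a value recomputed from the dataset.

```python
# Rounded figure quoted in the text, not recomputed from the data
deaths_in_peak_year = 40000
days_per_year = 365

# Average number of deaths per day over the peak year
deaths_per_day = deaths_in_peak_year / days_per_year
print('{:.1f} deaths per day'.format(deaths_per_day))
```

The result is about 109.6, which rounds to the 110 deaths per day mentioned above.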

In [None]:
plt.figure(figsize=(16,5))
df.groupby('iyear').nkill.sum().plot(kind='bar', width=1)
plt.xlabel('Time'); plt.ylabel('# casualties'); plt.legend(); 
plt.title('Evolution of casualties over the year Worldwide', fontsize=12, fontweight='bold')

### 2.2.2 Number of Attacks by months by years

In [None]:
MONTHS = ['January', 'February',  'March', 'April', 'May', 'June', 'July', 
          'August', 'September', 'October', 'November', 'December']

def get_2d_comp(df, x_col, y_col, hue=None, normalize=False):
    if hue is None:
        df_month_year = df.groupby([x_col, y_col]).size().reset_index(name='frequ')
    else:
        df_month_year = df.groupby([x_col, y_col])[hue].sum().reset_index(name='frequ')
    df_month_year =  df_month_year.pivot(index=y_col, columns=x_col, values='frequ')
    if normalize:
        df_month_year = df_month_year.div(df_month_year.sum(axis=0), axis=1)
    return df_month_year

Here we look at the number of attacks as a function of the month of the year. There is no visible trend. It means that, from a worldwide perspective, there is no period of the year during which terrorist attacks are more frequent. We have to be careful in our analysis, since local patterns can still exist (e.g. in South America, the Middle East, etc.).

In [None]:
df_month_year = get_2d_comp(df,'iyear', 'imonth')
df_month_year.index = MONTHS
plt.figure(figsize=(16,4))
sns.heatmap(df_month_year)
plt.title('Number of attacks by year and month', fontsize=12, fontweight='bold')

### 2.2.3 Number of Attacks by regions by years

Another way to look at the data is to plot them by region. We chose to normalize our data: here it means that, for each year, the shares of all regions sum to 1. This keeps the comparison fair with respect to present-day peaks of attacks and highlights past periods of conflict. We can now distinguish, for example, periods of trouble in South America from 1980 to 1990. As expected, the Middle East and South Asia (Afghanistan, Pakistan, India) are the most dangerous areas nowadays.
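
The per-year normalization can be sketched on a small table of made-up counts, where rows are regions and columns are years. The numbers below are invented for illustration only.

```python
import pandas as pd

# Made-up counts: rows are regions, columns are years
counts = pd.DataFrame({1980: [30, 10, 60], 2015: [100, 800, 100]},
                      index=['South America', 'South Asia', 'Western Europe'])

# Divide each column by its total so that every year sums to 1
shares = counts.div(counts.sum(axis=0), axis=1)
print(shares)
```

After this step, a year with few attacks overall carries the same visual weight in the heatmap as a year with many, which is what makes older conflicts visible.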

In [None]:
df_reg_year = get_2d_comp(df,'iyear', 'region_txt', normalize=True)
plt.figure(figsize=(14,4))
sns.heatmap(df_reg_year.fillna(0))
plt.title('Region-wise share of attacks per year', fontsize=12, fontweight='bold')

We use the same approach here, but this time we consider the number of deaths rather than the frequency of events. The data are normalized in the same way. We can see a distinct peak for North America in 2001, which highlights the tragic events of 11 September 2001 in New York.

In [None]:
df_reg_year_ca = get_2d_comp(df,'iyear', 'region_txt', 'nkill', normalize=True)
plt.figure(figsize=(14,4))
sns.heatmap(df_reg_year_ca.fillna(0))
plt.title('Region-wise share of casualties per year', fontsize=12, fontweight='bold')

### 2.2.4 Weapons and Targets

In [None]:
# TODO Pouss

### 2.2.5 Top 25 most deadly

Here we focus on the deadliest attacks recorded in our dataset. We display them on an interactive map. Each event shows the date, the coordinates, the city and the registered number of deaths.

In [None]:
m = folium.Map(location=[30., 5.], tiles='Stamen Terrain', zoom_start=2)
marker_cluster = MarkerCluster().add_to(m)

top_25 = df.sort_values(by='nkill',  ascending=False).head(25)  # sort once instead of at every iteration
for _, row in top_25.iterrows():
    info = 'Terror attack in '+ str(row['city'])+\
    '<br>Date: '+str(row['iday'])+'.'+str(row['imonth'])+'.'+str(row['iyear'])+\
    '<br>Casualties: ' + str(int(row['nkill']))
    folium.Marker([row['latitude'], row['longitude']], popup=info, 
                  icon=folium.Icon(color='red', icon='info-sign')).add_to(marker_cluster)

display(HTML('<h4>{}</h4>'.format('Interactive map - Top 25 most deadly attacks recorded')))
display(m)

Note that <b>half</b> of the deadliest attacks occurred during the last 8 years!

In [None]:
data = df.sort_values(by='nkill',  ascending=False)[:25]
# Middle element (index 12) of the 25 attacks sorted by date
print('Median date: ', data.sort_values('date').iloc[len(data) // 2].date)

# 3. Groups  <a id='groups'></a>

To get better insight into the data, we want to be able to locate each terrorist group. Of course, it is not possible to get the exact location of a group, but we can estimate it. As a proof of concept, we define the location of each group as the median of its attacks, computing the latitude and longitude medians separately. The result can be odd coordinates located in the middle of the sea. Another idea would be to use k-medians (similar to k-means) to locate cluster centers. Note that we also want the name of the country the terrorist group belongs to. To do so, we use `geopy`, which relies on the `OpenStreetMap` API. According to the [documentation](https://operations.osmfoundation.org/policies/nominatim/), we have to limit our requests to 1 per second to avoid an IP ban.
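
The following sketch illustrates why a coordinate-wise median can fall on a point where no attack took place. The coordinates are made up and belong to one hypothetical group.

```python
from statistics import median

# Made-up attack coordinates (latitude, longitude) for one hypothetical group
attacks = [(10.0, 20.0), (12.0, 60.0), (50.0, 22.0)]

# Median of the latitudes and median of the longitudes, computed separately
home_lat = median(lat for lat, _ in attacks)
home_lon = median(lon for _, lon in attacks)

print(home_lat, home_lon)
# The estimated location is not one of the recorded attack locations
print((home_lat, home_lon) in attacks)
```

Because each coordinate is taken from a possibly different attack, the estimate combines them into a new point, which is how a location in the sea can appear.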


Here we build the function that returns, for a specific terrorist group: number of attacks (`frequ`), number of casualties (`nkill`), coordinates (`latitude`, `longitude`) and country (`country`).

In [None]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
from collections import OrderedDict
geolocator = Nominatim(user_agent='gtd_exploration')  # geopy >= 2.0 requires a user_agent; this name is a placeholder

def estimate_home(df, gname):
    # Get only attacks in which the group took part
    group_entries = np.logical_or(np.logical_or(df['gname'] == gname, df['gname2'] == gname), df['gname3'] == gname)
    # Extract number of attacks, number of casualties, and estimate coordinates (median)
    frequ = np.sum(group_entries)
    n_death = df.loc[group_entries, 'nkill'].fillna(0).sum()
    coord = df.loc[group_entries, ['latitude', 'longitude']].median(axis=0).values
    # Use geopy to get approximate position of the event. Try multiple times in case of network failure or bad request
    n_tries = 0
    while True:
        time.sleep(1)  # Avoid getting IP banned
        try:
            # Send coordinates and extract country name from result if it exists
            geo = geolocator.reverse('{:.5f}, {:.5f}'.format(coord[0], coord[1]), language='en').raw
            country = geo['address']['country'] if 'address' in geo and 'country' in geo['address'] else np.nan
            break
        except GeocoderTimedOut:
            # Too many tries: skip this entry and leave the field as an empty string
            if n_tries == 10:
                print('Unable to fetch coordinates for {} at: {} after {} tries'.format(gname, coord, n_tries))
                country = ''
                break
        n_tries += 1
    return frequ, n_death, coord[0], coord[1], country

We can now get our data. We first look for the unique group names in our dataset. Then we compute all the statistics for each group. Since this operation takes some time, we print the progress and save the results at the end.

In [None]:
# Look for unique names of groups in dataset
name_groups = pd.Series(df[['gname', 'gname2', 'gname3']].values.ravel('K')).value_counts().index.values
# Create empty dataframe that we will fill with group data
df_groups = pd.DataFrame(index = name_groups,  
                         data = OrderedDict(( ('frequ', np.nan), ('nkill', np.nan), 
                                             ('latitude', np.nan), ('longitude', np.nan), ('country', np.nan) )) )
# Compute statistics for each group
for i, gname in enumerate(name_groups):
    if i%500 == 0:
        print('{}/{} Computed homes'.format(i, len(name_groups)))
    df_groups.iloc[i] = estimate_home(df, gname)
# Save results to file to avoid recomputing them at each run
df_groups.to_csv(os.path.join(data_path, 'groups_stats_t.csv'))

We can see that we have for each group: the number of attacks (`frequ`), number of casualties (`nkill`), coordinates (`latitude`, `longitude`) and country (`country`).

In [None]:
df_groups = pd.read_csv(os.path.join(data_path, 'groups_stats_t.csv'), index_col=0)
df_groups.head()

We define here the basic functions to get a color according to the number of casualties, using a logarithmic scale of the values (see the next cell for an explanation).

In [None]:
from folium import LinearColormap

def get_ln_value(value, offset=0, factor=1):
    return np.log(1 + offset + factor*value)

def get_info(row):
    return  '<strong> {}</strong> <br># of attacks: {} <br>Casualties: {}'.format(
        row.name.replace('\'', ''), int(row.frequ), int(row.nkill))

kill_max = get_ln_value(df_groups.nkill.max())
linear_kill = LinearColormap(['green', 'yellow', 'red'], vmin=0, vmax=kill_max)

We can now display the group locations directly on a map. There are two important pieces of information we want to display: the number of casualties and the number of attacks. A group that performs many attacks might not actually try to hurt the population. Therefore we chose to set up the display as follows:

- The size of the circle gives an estimate of the total number of attacks. If the circle is small, the number of attacks is small.
- The color of the circle goes from green to red. If the circle is green, the group did not kill many people. Conversely, if the circle is red, we can expect a large number of casualties.

Note that we use a logarithmic scale for both circle size and color. This means that if we compare two groups and one of them performed twice as many attacks, its circle will not be twice as big.
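
The effect of the logarithmic scale can be checked with a short sketch. The function `ln_value` below repeats the formula of `get_ln_value` with `math.log`, and the attack counts 100 and 200 are made up for illustration.

```python
import math

def ln_value(value, offset=0, factor=1):
    # Same formula as get_ln_value, written with math.log to avoid dependencies
    return math.log(1 + offset + factor * value)

# Radii for two hypothetical groups, the second with twice as many attacks
r_100 = ln_value(100)
r_200 = ln_value(200)
print(r_100, r_200, r_200 / r_100)
```

Doubling the number of attacks increases the radius by roughly 15% in this example, far from a factor of two.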

In [None]:
m = folium.Map(location=[30., 5.], zoom_start=2, tiles='Stamen Toner')

for i, ids in enumerate(df_groups.loc[df_groups.frequ > 20].index):
    coord = df_groups.loc[ids, ['latitude', 'longitude']].values
    popup = get_info(df_groups.loc[ids])
    radius = get_ln_value(df_groups.loc[ids, 'frequ'])
    c_kill = linear_kill(get_ln_value(df_groups.loc[ids, 'nkill']))
    folium.CircleMarker(location=coord, radius=radius, 
                        color=c_kill, fill_color=c_kill, fill_opacity= 0.8, 
                        fill=True, popup=popup, weight=1).add_to(m)

display(HTML('<h3>{}</h3>'.format('Interactive map - Estimated location of terrorist groups')))
display(m)

# 4. Events that marked the world   <a id='events_world'></a>

In [None]:
def display_main_actors(df_period, n=5):
    frequ =  df_period.groupby('gname').size()
    n_death = df_period.groupby('gname').nkill.sum()
    res = pd.DataFrame({'Frequency of attacks': frequ, 'Number of casualties': n_death})
    display(res.sort_values('Frequency of attacks', ascending=False).head(n))

In [None]:
thresh = 0.8*df_reg_year.max(axis=1)
thresh[thresh < 0.20] = 0.2
thresh = df_reg_year.subtract(thresh, axis=0) >= 0

plt.figure(figsize=(14,4))
sns.heatmap(thresh); plt.title('Thresholded events', fontsize=12, fontweight='bold')

## 4.1 North America Bombings (1970)   <a id='NAB_1970'></a>

http://time.com/4501670/bombings-of-america-burrough/

In [None]:
df_NA_1970 = df.loc[df.region_txt=='North America']
df_NA_1970 = df_NA_1970.loc[np.logical_and(df_NA_1970.iyear >= 1970, df_NA_1970.iyear <= 1970)]

m_frequ = get_heatmap(df_NA_1970, coord=[35., -95.], zoom=4, min_opacity=0.9, blur=5)
m_kill = get_heatmap(df_NA_1970, val='nkill', coord=[35., -95.], zoom=4, min_opacity=0.9, blur=5)

display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

## 4.2 Northern Ireland religious conflict (1972-1973)    <a id='EU_1972'></a>

https://en.wikipedia.org/wiki/The_Troubles 

In [None]:
df_EU_1972 = df.loc[df.region_txt=='Western Europe']
df_EU_1972 = df_EU_1972.loc[np.logical_and(df_EU_1972.iyear >= 1972, df_EU_1972.iyear <= 1973)]

m_frequ = get_heatmap(df_EU_1972, coord=[50., -5.], zoom=5, min_opacity=0.5, blur=1)
m_kill = get_heatmap(df_EU_1972, val='nkill', coord=[50., -5.], zoom=5, min_opacity=0.5, blur=1)

display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

## 4.3 Northern Ireland and Basque Country (1975-1977)  <a id='EU_1975'></a>

In [None]:
df_EU_1975 = df.loc[df.region_txt=='Western Europe']
df_EU_1975 = df_EU_1975.loc[np.logical_and(df_EU_1975.iyear >= 1975, df_EU_1975.iyear <= 1977)]

m_frequ = get_heatmap(df_EU_1975, coord=[50., -5.], zoom=5, min_opacity=0.5, blur=1)
m_kill = get_heatmap(df_EU_1975, val='nkill', coord=[50., -5.], zoom=5, min_opacity=0.5, blur=1)

display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

## 4.4 Salvadoran Civil War (1981-1983)    <a id='CA_1981'></a>

In [None]:
df_CA_1981 = df.loc[df.region_txt=='Central America & Caribbean']
df_CA_1981 = df_CA_1981.loc[np.logical_and(df_CA_1981.iyear >= 1981, df_CA_1981.iyear <= 1983)]

m_frequ = get_heatmap(df_CA_1981, coord=[15., -85.], zoom=5)
m_kill= get_heatmap(df_CA_1981, val='nkill', coord=[15., -85.], zoom=5)

display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

## 4.5 South America Conflicts (1984-1987)    <a id='SA_1984'></a>

During the 80s, South America was hit by multiple conflicts. We can see three main zones on the first map: Colombia, Peru and Chile.

1. Colombia: This period of conflict was the result of the war between drug traffickers and the police. The names of Pablo Escobar and the Medellín Cartel are still known today. (More information: [Colombian conflict](https://en.wikipedia.org/wiki/Colombian_conflict#1980s)).
2. Peru: [The Shining Path](https://en.wikipedia.org/wiki/Shining_Path) was a communist militant group. They wanted to establish a dictatorship of the proletariat (communist ideology), including a [cultural revolution](https://en.wikipedia.org/wiki/Shining_Path). We can observe that they caused a large number of casualties.
3. Chile: ...

In [None]:
df_SA_1984 = df.loc[df.region_txt=='South America']
df_SA_1984 = df_SA_1984.loc[np.logical_and(df_SA_1984.iyear >= 1984, df_SA_1984.iyear <= 1987)]

m_frequ = get_heatmap(df_SA_1984, coord=[-20., -60.], zoom=3, min_opacity=0.3, blur=1)
m_kill = get_heatmap(df_SA_1984, val='nkill', coord=[-20., -60.], zoom=3, min_opacity=0.3, blur=1)

display('Main Actors during this period'); display_main_actors(df_SA_1984, 7)
display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

## 4.6 Middle East (2003-2007)  <a id='ME_2003'></a>

1. Algeria 
2. Israel and Palestine
3. Iraq

In [None]:
df_ME_2003 = df.loc[df.region_txt=='Middle East & North Africa']
df_ME_2003 = df_ME_2003.loc[np.logical_and(df_ME_2003.iyear >= 2003, df_ME_2003.iyear <= 2007)]

m_frequ = get_heatmap(df_ME_2003, coord=[32., 25.], zoom=4, min_opacity=0.5, blur=2)
m_kill = get_heatmap(df_ME_2003, val='nkill', coord=[32., 25.], zoom=4, min_opacity=0.5, blur=2)

display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

## 4.7 South Asia (2008-2013)  <a id='ME_2008'></a>

1. Northeast India (https://en.wikipedia.org/wiki/Insurgency_in_Northeast_India)
2. Afghanistan/Pakistan: Al-Qaeda, Taliban...
3. Southeast India?

In [None]:
df_SAsia_2008 = df.loc[df.region_txt=='South Asia']
df_SAsia_2008 = df_SAsia_2008.loc[np.logical_and(df_SAsia_2008.iyear >= 2008, df_SAsia_2008.iyear <= 2013)]

m_frequ = get_heatmap(df_SAsia_2008, coord=[20., 80.], zoom=4, min_opacity=0.3, blur=1)
m_kill = get_heatmap(df_SAsia_2008, val='nkill', coord=[20., 80.], zoom=4, min_opacity=0.3, blur=1)

display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

## 4.8 Middle East (2013-Today)    <a id='ME_2013'></a>

1. Libya
2. Egypt
3. Israel and Palestine
4. ISIS (Syria and Iraq)

In [None]:
df_ME_2013 = df.loc[df.region_txt=='Middle East & North Africa']
df_ME_2013 = df_ME_2013.loc[np.logical_and(df_ME_2013.iyear >= 2013, df_ME_2013.iyear <= 2017)]

m_frequ = get_heatmap(df_ME_2013, coord=[32., 25.], zoom=4, min_opacity=0.5, blur=2)
m_kill = get_heatmap(df_ME_2013, val='nkill', coord=[32., 25.], zoom=4, min_opacity=0.5, blur=2)

display('Attack frequencies'); display(m_frequ)
display('Attack casualties');display(m_kill)

# 5. What comes next : <a id='whats_next'></a>


* Apply NLP to answer the question of religious (or other) motives
* Interactive maps : one with weapon selection
* graph for interaction 
    * between group and second group
    * between group and weapon
    * between group and attack type