# Pakistan Drone Attacks - Data Exploration


Zeeshan-ul-Hassan Usmani's questions:
- How many people were killed and injured per year over the last 12 years?
- How many attacks killed actual terrorists from Al-Qaeda and the Taliban?
- How many attacks involved women and children?
- Visualize drone attacks on a timeline.
- Find any correlation between the number of drone attacks and specific dates or times. For example, are there more drone attacks in September?
- Find any correlation between drone attacks and major global events (e.g., US funding to Pakistan and/or Afghanistan, or talks between local or foreign governments and terrorist outfits).
- How does the number of drone attacks compare between the Bush and Obama tenures?
- How does the number of drone attacks compare with the global increase or decrease in terrorism?
- Is there a correlation between the number of drone strikes and suicide bombings in Pakistan?
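
Several of these questions (per-year totals, seasonality, Bush vs. Obama) depend on parsing the `Date` column first. The following is a minimal sketch, assuming dates follow the "Friday, June 18, 2004" pattern seen in the dataset preview. The `sample` frame, its death counts, and the `Year` column are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows that mimic the dataset's Date format;
# the death counts are invented for illustration
sample = pd.DataFrame({
    "Date": ["Friday, June 18, 2004", "Sunday, May 08, 2005", "Thursday, December 01, 2005"],
    "Total Died Max": [5.0, 2.0, 5.0],
})

# Parse "Weekday, Month DD, YYYY" strings; unparseable values become NaT
sample["Date"] = pd.to_datetime(sample["Date"], format="%A, %B %d, %Y", errors="coerce")
sample["Year"] = sample["Date"].dt.year

# Number of strikes and sum of reported deaths per calendar year
per_year = sample.groupby("Year").agg(
    strikes=("Date", "size"),
    deaths_max=("Total Died Max", "sum"),
)
print(per_year)
```

The same `Year` column (or `dt.month`) could then feed a timeline plot or a month-by-month comparison.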

## Loading Data

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [58]:
# Read the CSV with the unicode-escape encoding because it contains non-UTF-8 characters
raw_data = pd.read_csv('PakistanDroneAttacksWithTemp Ver 11 (November 30 2017).csv',
                       encoding='unicode-escape')

In [59]:
# What does the dataset look like?
raw_data.head()

Unnamed: 0,S#,Date,Time,Location,City,Province,No of Strike,Al-Qaeda,Taliban,Civilians Min,...,Injured Min,Injured Max,Women/Children,Special Mention (Site),Comments,References,Longitude,Latitude,Temperature(C),Temperature(F)
0,1.0,"Friday, June 18, 2004",22:00,Near Wana,south Waziristan,FATA,1.0,,1.0,0.0,...,,,N,Blast occured in courtyard of the house of lon...,Village in Wana,http://archives.dawn.com/2004/06/19/top1.htm,69.9,33.0333,28.475,83.255
1,2.0,"Sunday, May 08, 2005",23:30,Mir Ali (Near Afghan Border),North Waziristan,FATA,1.0,1.0,,0.0,...,,,N,Drone struck a car driven by local warlord- ki...,Civilian killied was Samiullah Khan who was a ...,http://www.msnbc.msn.com/id/7847008/,70.1455,32.9746,11.475,52.655
2,3.0,"Thursday, December 01, 2005",,Haisori- Miran Shah,North Waziristan,FATA,1.0,1.0,,0.0,...,,2.0,,Explosive occurred at a mud house,No. 3 Al-Qaeda's Leader AbuHamza Rabia killed ...,http://edition.cnn.com/2005/WORLD/asiapcf/12/0...,70.1455,32.9746,7.08,44.744
3,4.0,"Friday, January 06, 2006",,Saidgai village- 115km north of Wana,North Waziristan,FATA,1.0,,,,...,,2.0,,,,http://www.reuters.com/article/2007/04/27/us-p...,70.1455,32.9746,0.535,32.963
4,5.0,"Friday, January 13, 2006",3:00,Damadola Village,Bajaur Agency,FATA,1.0,,,0.0,...,,2.0,Y,Three houses were tarheted in Damadola village...,Masood Khan house was among those bombed. Want...,http://www.dailytimes.com.pk/default.asp?page=...,71.5,34.6833,10.025,50.045


In [60]:
# Data types?
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 25 columns):
S#                        405 non-null float64
Date                      405 non-null object
Time                      175 non-null object
Location                  404 non-null object
City                      405 non-null object
Province                  405 non-null object
No of Strike              405 non-null float64
Al-Qaeda                  98 non-null float64
Taliban                   142 non-null float64
Civilians Min             337 non-null float64
Civilians Max             360 non-null float64
Foreigners Min            94 non-null float64
Foreigners Max            141 non-null float64
Total Died Min            309 non-null float64
Total Died Max            403 non-null float64
Injured Min               146 non-null float64
Injured Max               277 non-null float64
Women/Children            337 non-null object
Special Mention (Site)    331 non-null object
Comments   

A lot of data is missing, especially in the Al-Qaeda and Taliban columns (98 and 142 non-null values out of 406 rows). One possible explanation is that these columns are left empty when a strike did not involve either organization. Possible preprocessing step: create an "other" category for strikes that involved neither of the two organizations.
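
A minimal sketch of how the missing-data share and the proposed "other" flag could be computed. The `toy` frame and its values are hypothetical stand-ins for `raw_data`, and treating "both columns blank" as "other" is only the assumption stated above, not a confirmed property of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a blank cell in the CSV
toy = pd.DataFrame({
    "Al-Qaeda": [np.nan, 1.0, 1.0, np.nan],
    "Taliban": [1.0, np.nan, np.nan, np.nan],
    "Total Died Max": [5.0, 2.0, 5.0, 8.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Flag rows where both group columns are blank (a candidate "other" category)
toy["other"] = toy[["Al-Qaeda", "Taliban"]].isna().all(axis=1)
print(toy)
```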

In [61]:
# Statistics for numeric columns
raw_data.describe()

Unnamed: 0,S#,No of Strike,Al-Qaeda,Taliban,Civilians Min,Civilians Max,Foreigners Min,Foreigners Max,Total Died Min,Total Died Max,Injured Min,Injured Max,Longitude,Latitude,Temperature(C),Temperature(F)
count,405.0,405.0,98.0,142.0,337.0,360.0,94.0,141.0,309.0,403.0,146.0,277.0,405.0,405.0,404.0,404.0
mean,203.017284,1.451852,1.0,9.338028,7.750742,14.133333,1.446809,5.276596,12.595469,18.168734,5.506849,9.595668,68.63646,34.455207,16.014691,60.813178
std,117.087203,1.117271,5.0889,55.482268,71.109052,133.966732,7.091744,31.332344,110.469782,182.093427,33.277568,79.729995,7.300586,7.157904,8.626755,15.556719
min,1.0,1.0,0.0,0.0,-4.0,-6.0,0.0,0.0,0.0,0.0,0.0,0.0,28.896179,25.67848,-14.155,6.521
25%,102.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,5.0,0.0,2.0,69.9,32.9746,9.12,48.416
50%,203.0,1.0,0.0,4.0,3.0,5.0,0.0,2.0,5.0,6.0,2.0,4.0,70.1455,32.9746,17.895,64.211
75%,304.0,1.0,0.0,7.0,5.0,8.25,0.75,4.0,8.0,10.5,4.0,6.0,70.1455,33.0333,23.705,74.669
max,406.0,8.0,49.0,663.0,1306.0,2544.0,68.0,372.0,1946.0,3661.0,402.0,1329.0,71.5,70.54072,29.485,85.073


Some observations:
- Why do the Taliban and Al-Qaeda columns have maximum values greater than 1? They looked like binary (one-hot encoded) categorical variables when `head()` was examined.
- The maximum values of Civilians Min and Civilians Max are implausibly high for a single drone strike. Need to identify this row and check whether it is erroneous. If it is not, need to check the reference.
- The Temperature(F) column is redundant as a feature if we keep the Temperature(C) column. Either way, temperature may not be useful for predicting drone strike casualties.
- Civilians Min's minimum reported value should not be -4.
- Civilians Max's minimum reported value should not be -6.
- Civilians Min's maximum reported value of 1306 looks like an outlier. Needs further examination.
- Civilians Max's maximum reported value of 2544 also looks like an outlier. Needs further examination.
- Total Died Min and Total Died Max seem to be sums of the corresponding civilian and foreigner counts. So if I change a min or max value for civilians or foreigners, I must also update the corresponding total died and injured counts.
- A minimum Temperature(C) of about -14 is a little out of the ordinary. Need to examine this.
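
Some of these checks can be automated. The sketch below flags negative counts, flags rows where a reported minimum exceeds the maximum, and tests whether Temperature(F) is fully derivable from Temperature(C). The temperature pairs are copied from the first three preview rows; the casualty values and the `toy` frame itself are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one clean, one with a negative count, one with min > max.
# Temperature pairs are taken from the first three rows of the preview.
toy = pd.DataFrame({
    "Civilians Min": [0.0, -4.0, 6.0],
    "Civilians Max": [3.0, 2.0, 5.0],
    "Temperature(C)": [28.475, 11.475, 7.08],
    "Temperature(F)": [83.255, 52.655, 44.744],
})

count_cols = ["Civilians Min", "Civilians Max"]

# Casualty counts can never be negative
has_negative = (toy[count_cols] < 0).any(axis=1)

# A reported minimum should not exceed the reported maximum
min_exceeds_max = toy["Civilians Min"] > toy["Civilians Max"]

# Temperature(F) is redundant if it always equals C * 9/5 + 32
f_from_c = toy["Temperature(C)"] * 9 / 5 + 32
f_is_redundant = bool(((toy["Temperature(F)"] - f_from_c).abs() < 1e-6).all())

print(toy[has_negative | min_exceeds_max])
print("Temperature(F) derivable from Temperature(C):", f_is_redundant)
```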

In [51]:
# Find the row with the lowest Civilians Min value
raw_data[raw_data['Civilians Min'] == raw_data['Civilians Min'].min()]

Unnamed: 0,S#,Date,Time,Location,City,Province,No of Strike,Al-Qaeda,Taliban,Civilians Min,...,Injured Min,Injured Max,Women/Children,Special Mention (Site),Comments,References,Longitude,Latitude,Temperature(C),Temperature(F)
374,375.0,"Friday, December 26, 2014",,Shawal Valley,North Waziristan,FATA,1.0,,5.0,-4.0,...,,,N,,,http://www.dawn.com/news/1153301,70.1455,32.9746,12.65,54.77


In [54]:
# Find the row with the highest Civilians Min value
raw_data[raw_data['Civilians Min'] == raw_data['Civilians Min'].max()]

Unnamed: 0,S#,Date,Time,Location,City,Province,No of Strike,Al-Qaeda,Taliban,Civilians Min,...,Injured Min,Injured Max,Women/Children,Special Mention (Site),Comments,References,Longitude,Latitude,Temperature(C),Temperature(F)
405,,,,,,,,49.0,663.0,1306.0,...,402.0,1329.0,,,,,,,,


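This row has no date, serial number, location, city, or reference, so it does not describe an individual strike and needs to be dropped from the dataset. Its values look like column totals rather than random corruption: for Al-Qaeda, `describe()` reports a count of 98 and a mean of 1.0, i.e. a column sum of 98, which is exactly twice this row's value of 49. The Taliban column shows the same pattern (142 × 9.338 ≈ 1326 = 2 × 663). Keeping the row would double every sum and inflate the mean, std, and max statistics.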
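
A minimal sketch of how such a row could be removed, assuming a valid strike record always has both `S#` and `Date`. It also checks whether the dropped row equals the column sum of the remaining rows, which would indicate a totals row rather than a corrupted record. The `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two strike rows plus a trailing row without identifiers
toy = pd.DataFrame({
    "S#": [1.0, 2.0, np.nan],
    "Date": ["Friday, June 18, 2004", "Sunday, May 08, 2005", np.nan],
    "Taliban": [1.0, 4.0, 5.0],
})

# Keep only rows that have both a serial number and a date
cleaned = toy.dropna(subset=["S#", "Date"]).reset_index(drop=True)

# Check whether the dropped row equals the column total of the kept rows
dropped = toy[toy["S#"].isna()]
looks_like_total = bool(dropped["Taliban"].iloc[0] == cleaned["Taliban"].sum())

print(cleaned)
print("Dropped row matches column sum:", looks_like_total)
```

Filtering on missing identifiers rather than on a hard-coded index such as 405 keeps the step valid if the file is later updated with more rows.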

In [55]:
# Same check for Civilians Max, starting with the row that holds its minimum value
raw_data[raw_data['Civilians Max'] == raw_data['Civilians Max'].min()]

Unnamed: 0,S#,Date,Time,Location,City,Province,No of Strike,Al-Qaeda,Taliban,Civilians Min,...,Injured Min,Injured Max,Women/Children,Special Mention (Site),Comments,References,Longitude,Latitude,Temperature(C),Temperature(F)
367,368.0,"Thursday, November 20, 2014",,Dattakhel,North Waziristan,FATA,2.0,,6.0,0.0,...,,3.0,Y,,,http://www.dawn.com/news/1145788/us-drone-stri...,70.1455,32.9746,12.325,54.185
