**Intro**

Since 2011 I have been building interactive dashboards using a commercial tool. I have always been fascinated by the open source world of data science and decided to give it a try, starting with Python.

I started by taking an online course to understand the syntax and the capabilities. I chose this simple dataset to start exploring data manipulation and analysis. Below are my thoughts during this journey.

Any comments and advice are much appreciated.

**1. Import libraries**

With every new release of a commercial software package, you go through the release notes hoping that the vendor has delivered cool new features or the functionality you have always hoped to see. Sometimes the expectations are met, other times you feel disappointed.

With open source languages the software is constantly growing as communities build new libraries or improve existing ones.

Spending time researching libraries is part of the fun. Some grow and become a must (e.g. Pandas for data manipulation); as for the others, you need to make choices.

Below is the import of the libraries I'm currently learning. I will mainly use Seaborn for visualisations, because I really like the look and feel of the charts.

In [1]:
#Import libraries
import os
import warnings

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

#plot in the browser for jupyter
%matplotlib inline
plt.rcParams['figure.figsize'] = 8,4
warnings.filterwarnings('ignore') #silence all warnings (e.g. division by zero)

print('Working dir: ', os.getcwd())
print(os.listdir("../input"))

**2. Load and clean the data**

Ok let's build some vis... oh wait, you still have to extract, transform and load the data.

In [2]:
#Extract the data
ms = pd.read_csv("../input/Mass Shootings Dataset Ver 5.csv", encoding = "ISO-8859-1", parse_dates=["Date"])
#Print number of columns and rows to have the cardinality
print("Rows, Columns: ", ms.shape)
print("Max date: ", max(ms.Date))

In [3]:
#Identify the columns
ms.columns

In [4]:
#Clean column names by removing spaces and special characters; this makes them easier to use later,
#since the df.ColumnName syntax is handier than df['Column Name']
ms.columns = ['S#', 'Title', 'Location', 'Date', 'IncidentArea', 'OpenCloseLocation', 'Target',
       'Cause', 'Summary', 'Fatalities', 'Injured', 'TotalVictims',
       'PolicemanKilled', 'Age', 'Employeed', 'EmployedAt',
       'MentalHealthIssues', 'Race', 'Gender', 'Latitude', 'Longitude']

In [5]:
ms.info()

In [6]:
#Show top rows
ms.head(5)

In [33]:
ms.tail(3)

This seems a pretty straightforward dataset: it is small (323 rows, 21 columns) and it covers a subject we are (sadly) aware of thanks to the news reports.
Looking at info(), head and tail, here are some initial considerations:

* S#: incremental counter, might come in handy or not
* Title: description of the event, can be used to identify specific occurrences in the analysis (e.g. Top x shootings)
* Location: includes City, State - it can be used for geographical analysis, I will split it into 2 fields: City and State
* Date: date of the event, this can be used for time analysis, additional aggregated fields can be derived such as year, month, day of the week, quarter
* IncidentArea: provides some additional information on the shooting, doesn't seem very relevant for analysis
* OpenCloseLocation: categorical information, can be used for analysis (needs some cleaning e.g. 'Open+CLose' contains a misspelling), see next section
* Target: categorical information, can be used for analysis, needs some cleaning
* Cause: categorical information, can be used for analysis, needs some cleaning
* Summary: provides some additional information on the shooting, doesn't seem very relevant for analysis
* Fatalities: number, can be used for analysis
* Injured: number, can be used for analysis
* TotalVictims: Fatalities + Injured
* PolicemanKilled: number, can be used for analysis
* Age: although an analysis of the age could be interesting, this has 45% of nulls - missing values could be populated with 'Unknown' or by manually searching the internet for the age of the shooter
* Employeed: this column has plenty of null values, I will ignore it
* EmployedAt: same as above
* MentalHealthIssues:  boolean, can be used for analysis, needs some cleaning          
* Race: categorical information, can be used for analysis, needs some cleaning          
* Gender:  categorical information, can be used for analysis, needs some cleaning          
* Latitude: This can be used to plot a map 
* Longitude: same as above
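
A couple of the considerations above can be checked quickly in code. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show how the share of nulls in a column and the date parts mentioned for Date could be derived with pandas.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-10-01", "2016-06-12", "2015-12-02", "2012-07-20"]),
    "Age": [64.0, np.nan, 28.0, np.nan],
})

# Share of missing values per column, as a percentage
null_pct = toy.isna().mean() * 100

# Date parts that can be used for time analysis
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Quarter"] = toy["Date"].dt.quarter
toy["DayOfWeek"] = toy["Date"].dt.day_name()

print(null_pct)
print(toy)
```

The same `.dt` accessor is what the Year column in this kernel relies on, so adding the other parts to the real frame would only take one line each.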

In [8]:
#Split location column
ms['City'] = ms.Location.str.split(', ').str.get(0)
ms['State'] = ms.Location.str.split(', ').str.get(1)
ms = ms.drop(['Location'], axis=1)

#Add Year columns
ms['Year'] = ms['Date'].dt.year
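
As a side note, the two str.split calls can be collapsed into one. This is a minimal sketch on made-up location strings, assuming the "City, State" pattern described above; with n=1 the split happens at the first comma only, and rows without a comma get a missing state.

```python
import pandas as pd

# Made-up examples following the "City, State" pattern
loc = pd.Series(["Las Vegas, Nevada", "Orlando, Florida", "Washington D.C."])

# One split, expanded into two columns; n=1 splits only at the first ", "
parts = loc.str.split(", ", n=1, expand=True)
parts.columns = ["City", "State"]

print(parts)
```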


In [9]:
#Convert the categorical columns from object to category type
ms.MentalHealthIssues = ms.MentalHealthIssues.astype('category')
ms.City = ms.City.astype('category')
ms.State = ms.State.astype('category')
ms.Race = ms.Race.astype('category')
ms.Gender = ms.Gender.astype('category')
ms.OpenCloseLocation = ms.OpenCloseLocation.astype('category')
ms.Target = ms.Target.astype('category')
ms.Cause = ms.Cause.astype('category')
ms.Employeed = ms.Employeed.astype('category')
ms.Year = ms.Year.astype('category')


In [10]:
#Check the content of each category
print(ms.MentalHealthIssues.cat.categories)
print(ms.Gender.cat.categories)
print(ms.Race.cat.categories)
print(ms.OpenCloseLocation.cat.categories)
print(ms.Target.cat.categories)
print(ms.Cause.cat.categories)
print(ms.Employeed.cat.categories)
print(ms.Year.cat.categories)

Some fields need to be cleaned as they contain obvious misspellings, like "CLose" with a capital "L" in OpenCloseLocation, or equivalent values like "M" and "Male" in Gender. As for Race, I will reduce the number of distinct values by aggregating some of them (e.g. Asian American and Asian will be considered just Asian).

In [11]:
#Add column for Counter (numeric, so it can be summed to count shootings)
ms['Counter'] = 1

In [12]:
#Clean category contents replacing similar values (e.g. Gender M and Male to Male)
#MentalHealthIssues
conditions = [
    (ms['MentalHealthIssues'] == 'Yes') ,
    (ms['MentalHealthIssues'] == 'No') ]
choices = ['Yes', 'No']
ms['MentalHealthIssues'] = np.select(conditions, choices, default='Unknown')

#Gender
conditions = [
    ((ms['Gender'] == 'M') | (ms['Gender'] == 'Male')) ,
    ((ms['Gender'] == 'F') | (ms['Gender'] == 'Female')) ,
    ((ms['Gender'] == 'M/F') | (ms['Gender'] == 'Male/Female')) ]
choices = ['Male', 'Female', 'Mixed']
ms['Gender'] = np.select(conditions, choices, default='Unknown')

#Race (isin keeps each group of equivalent spellings on one line)
conditions = [
    ms['Race'].isin(['Latino']),
    ms['Race'].isin(['Black', 'black', 'Black American or African American', 'Black American or African American/Unknown']),
    ms['Race'].isin(['White', 'White ', 'white', 'White American or European American', 'White American or European American/Some other Race']),
    ms['Race'].isin(['Asian', 'Asian American', 'Asian American/Some other race']),
    ms['Race'].isin(['Native American', 'Native American or Alaska Native']) ]

choices = ['Latino', 'Black', 'White', 'Asian', 'Native American']
ms['Race'] = np.select(conditions, choices, default='Unknown')

ms.MentalHealthIssues = ms.MentalHealthIssues.astype('category')
ms.Race = ms.Race.astype('category')
ms.Gender = ms.Gender.astype('category')

print(ms.MentalHealthIssues.cat.categories)
print(ms.Gender.cat.categories)
print(ms.Race.cat.categories)
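
An alternative to chaining conditions in np.select is a lookup dictionary. The sketch below is not the approach used in this kernel, just a possible variant on made-up values: map translates the known spellings, and anything missing from the dictionary falls back to 'Unknown'.

```python
import numpy as np
import pandas as pd

# Made-up raw values, similar to the spellings seen above
gender = pd.Series(["M", "Male", "F", "M/F", np.nan, "Male/Female"])

# Every known spelling points to its clean label
gender_map = {
    "M": "Male", "Male": "Male",
    "F": "Female", "Female": "Female",
    "M/F": "Mixed", "Male/Female": "Mixed",
}

# Values missing from the dictionary (including NaN) become 'Unknown'
gender_clean = gender.map(gender_map).fillna("Unknown").astype("category")

print(gender_clean.value_counts())
```

The dictionary keeps all spellings of one label in a single place, which may be easier to extend when a new variant shows up in the data.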


In [14]:
ms.info()

**3. Analyse the data**

**3a. Variables distributions and correlations**

In [15]:
# Let's check some statistical information on the numeric fields
ms.describe().transpose()

Initial considerations on the variables available:
* Fatalities: looking at the quartiles, this seems a right-skewed distribution (most occurrences on the left of the distribution); the max value of 59 appears to be an outlier
* Injured: as above, this looks like a right-skewed distribution; the max value of 527 appears to be an outlier
* TotalVictims: the min is 3, so it seems the dataset only considers an event a mass shooting when there are at least 3 victims (injured or killed); same as above regarding the distribution (expected, as TotalVictims is Fatalities + Injured)
* PolicemanKilled: includes some (6) NaN values, can be used for analysis (e.g. % of shootings where policemen were killed)
* S#, Latitude, Longitude: not relevant for calculations
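
The "right skewed" reading of the quartiles can be double-checked numerically. This is a small sketch on invented numbers (not the dataset): a positive skew() and a mean well above the median both point to a long right tail, and the common 1.5 x IQR rule of thumb flags the extreme value.

```python
import pandas as pd

# Invented counts with one extreme value, for illustration only
s = pd.Series([3, 3, 4, 4, 5, 5, 6, 7, 9, 59])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # common rule of thumb for outliers

outliers = s[s > upper_fence].tolist()

print("mean:", s.mean(), "median:", s.median())
print("skewness:", s.skew())
print("outliers:", outliers)
```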

Let's visualise (finally!) the distributions using the Seaborn library


In [16]:
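#histplot replaces distplot, which is deprecated in recent Seaborn versions
ml = sns.histplot(ms.Fatalities, bins=15, kde=True, stat="density")
plt.title("Distribution of Fatalities", fontsize = 15)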
plt.show()

In [17]:
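#histplot replaces distplot, which is deprecated in recent Seaborn versions
ml = sns.histplot(ms.Injured, bins=15, kde=True, stat="density")
plt.title("Distribution of Injured", fontsize = 15)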
plt.show()

In [18]:
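#histplot replaces distplot, which is deprecated in recent Seaborn versions
ml = sns.histplot(ms.TotalVictims, bins=15, kde=True, stat="density")
plt.title("Distribution of Total Victims", fontsize = 15)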
plt.show()

One of my favourite plots I saw on Kaggle when analysing datasets for machine learning is the correlation matrix.

Let's use it to check whether it drives any insights in this dataset.

In [19]:
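#Exclude non relevant columns
cols = [col for col in ms.columns if col not in ['S#', 'Counter']]

#Correlation matrix (numeric columns only)
f, ax = plt.subplots(figsize=(5, 4))
corr = ms[cols].select_dtypes(include='number').corr()
#np.bool no longer exists in recent NumPy versions, the builtin bool works everywhere
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)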
plt.show()

As expected, the total number of victims is highly correlated with the number of injured and, to a lesser extent, with the number of fatalities.
There is no other evident strong correlation between the remaining variables.
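
To see why TotalVictims tracks Injured more closely than Fatalities, it helps to remember that it is the sum of the two, so the component with the larger spread tends to dominate. The sketch below uses invented numbers. It also shows how an upper-triangle mask (a common choice for sns.heatmap, assumed here rather than taken from this kernel) could hide the redundant half of the matrix.

```python
import numpy as np
import pandas as pd

# Invented numbers: Injured varies much more than Fatalities
toy = pd.DataFrame({
    "Fatalities": [5, 3, 8, 4, 6],
    "Injured": [10, 2, 50, 0, 20],
})
toy["TotalVictims"] = toy["Fatalities"] + toy["Injured"]

corr = toy.corr()

# True marks the cells to hide: the strict upper triangle (k=1 keeps the diagonal visible)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

print(corr.round(3))
print(mask)
```

Passing such a mask as `mask=mask` to sns.heatmap would leave only the lower triangle and the diagonal visible.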

Let's visualize the relationship between Fatalities and Injured in a jointplot. Note: I have applied a limit to the axes that excludes 1 outlier (the Las Vegas Strip mass shooting, with 527 injured).

In [20]:
#Jointplot - exclude outliers
j = sns.jointplot(data=ms, x='Fatalities', y='Injured', xlim=(0,80), ylim=(0,80))


**3b. Trend analysis**

Let's look at a classic trend (I will use bars instead of lines, personal preference) of the number of shootings per year.

In [21]:
#Barchart 
#catplot is the current name of factorplot, and height replaces the old size parameter
g = sns.catplot(data=ms, x="Year", kind="count", color='#1e488f', height=6, aspect=1.5)
g.set_xticklabels(rotation = 90)
plt.title("Number of shootings by year", fontsize = 20)
plt.ylabel("Number of shootings", fontsize = 13)
plt.xlabel("Year", fontsize = 13)
plt.xticks(fontsize = 11)
plt.yticks(fontsize = 11)
plt.show()

2015 and 2016 registered a high number of shootings compared to the previous years (more than 3 times as many).

(note: further trends can be visualized and analysed, I will not focus on those in this kernel)

**3c. Categories analysis**

Let's now analyse the categories discussed above. For this I will use Seaborn categorical plots (classic bar charts), counting the number of shootings for each category.

To make the chart easier to replicate (and more fun to code), I will build a function. Note that these could nicely be put together in one visualisation using matplotlib subplots, however I couldn't find a way to rotate the axis labels...

In [22]:
#Let's build a function!
def Fzfactorplot (field):
    #catplot is the current name of factorplot
    g = sns.catplot(data=ms, x=field, kind="count", color="#1e488f", order=ms[field].value_counts().index)
    plt.title("Number of shootings by " + field, fontsize = 13)
    plt.ylabel("Number of shootings", fontsize = 10)
    g.set_xticklabels(rotation = 90)
    plt.show()
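
On the subplot idea: one possible way (a sketch, not tested against this dataset) is to draw each count chart on its own matplotlib axis and rotate the tick labels per axis with tick_params. The frame below is invented for illustration, and the non-interactive Agg backend is selected only so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # omit this line in a notebook, it only allows running as a script
import matplotlib.pyplot as plt
import pandas as pd

# Invented categorical data, for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Unknown"],
    "Race": ["White", "Black", "White", "Asian", "White"],
})

fields = ["Gender", "Race"]
fig, axes = plt.subplots(1, len(fields), figsize=(8, 4))

for ax, field in zip(axes, fields):
    counts = toy[field].value_counts()  # sorted by frequency, descending
    ax.bar(counts.index.tolist(), counts.values, color="#1e488f")
    ax.set_title("Number of shootings by " + field, fontsize=13)
    ax.set_ylabel("Number of shootings", fontsize=10)
    ax.tick_params(axis="x", labelrotation=90)  # rotate the x labels of this axis only

fig.tight_layout()
```

Plain matplotlib bars are used here instead of Seaborn because the figure-level Seaborn functions create their own figure and cannot be placed on an existing axis.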

In [23]:
Fzfactorplot("MentalHealthIssues")

Excluding the unknowns, more than half of the shootings were carried out by people affected by mental health issues. It looks like checking for these issues before selling guns is a good idea!

In [24]:
Fzfactorplot("Gender")

No doubt men are more violent than women!

In [25]:
Fzfactorplot("Race")

Whites show the highest frequency. It would be interesting to compare it with the overall US population by race.

In [26]:
Fzfactorplot("OpenCloseLocation")

Mass shooters often choose closed locations, possibly thinking it is more difficult for the victims to escape, or for the police to get them.

In [27]:
Fzfactorplot("Cause")

This is an interesting one: the first cause of shootings is psycho (again, better not to sell guns to them), followed by terrorism (as unfortunately expected) and a generic anger.

Let's look at the efficacy of the shooters for the top 8 causes calculated as number of fatalities over the total number of victims

In [28]:
#Efficacy
#Select the columns with a list (double brackets); the bare tuple form fails in recent pandas versions
ms_EfficacyByCause = ms.groupby(["Cause"])[['Counter','TotalVictims','Fatalities','Injured']].sum().reset_index().sort_values('TotalVictims')
ms_EfficacyByCause['Efficacy'] = ms_EfficacyByCause.Fatalities / ms_EfficacyByCause.TotalVictims * 100
ms_EfficacyByCauseFiltered = ms_EfficacyByCause[ms_EfficacyByCause.Cause.isin(['psycho','terrorism','anger','frustration','domestic dispute','unemployement','revenge', 'racism'])]

#Causes by total victims
ms_EfficacyByCauseFiltered.sort_values('TotalVictims', ascending=False)
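
To make the metric concrete, here is the same calculation on a tiny invented table (the numbers are made up, and the Shootings column name is only illustrative): sum the victims per cause first, then divide the totals. Summing before dividing weights every victim equally, which is not the same as averaging the per-incident ratios.

```python
import pandas as pd

# Invented incidents, for illustration only
toy = pd.DataFrame({
    "Cause": ["anger", "anger", "terrorism"],
    "Fatalities": [4, 1, 10],
    "Injured": [0, 5, 40],
})
toy["TotalVictims"] = toy["Fatalities"] + toy["Injured"]

# Totals per cause, plus the number of incidents behind each total
by_cause = toy.groupby("Cause")[["Fatalities", "TotalVictims"]].sum()
by_cause["Shootings"] = toy.groupby("Cause").size()

# Efficacy as a percentage: fatalities over total victims
by_cause["Efficacy"] = by_cause["Fatalities"] / by_cause["TotalVictims"] * 100

print(by_cause)
```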

In [29]:
#Efficacy
#catplot is the current name of factorplot
g = sns.catplot(data=ms_EfficacyByCauseFiltered, x='Cause', y='Efficacy', color="orange", kind='bar', order=ms_EfficacyByCauseFiltered.sort_values('Efficacy', ascending=False)['Cause'])
plt.title("Shootings efficacy (Fatalities / TotalVictims)", fontsize = 13)
plt.ylabel("Efficacy", fontsize = 10)
g.set_xticklabels(rotation = 90)
plt.show()

Although terrorism produces the largest number of victims, other causes are more effective in terms of fatalities.

In [30]:
#median number of victims by cause
ms_EfficacyByCause = ms.groupby(["Cause"])[['TotalVictims','Fatalities','Injured']].median().reset_index().sort_values('TotalVictims')
ms_EfficacyByCause[::-1]

**3d. Top 10 shootings by Total Victims**

Finally, let's draw a table with the top 10 shootings by total victims and a subset of dimensions.

In [35]:
result = ms[["Title","Year","Cause","Race","IncidentArea", "MentalHealthIssues","Injured","Fatalities","TotalVictims"]].sort_values(["TotalVictims"], ascending=False)
result.head(10)