# Group 24 - Gun Violence Analysis

### Shashank Shastry / Peiyu Si / Yihan Hu

Gun violence is a common problem in the USA, and our project aims to analyze the trends and causes of these incidents.

Data - Comprehensive record of over 260,000 gun violence incidents in the USA from 2013 to 2018

Data fields include Date, Location (State, Address, Latitude, Longitude), Casualties (Killed, Injured), Logistics (Guns stolen, Number of guns involved, Gun type) and Persons involved (Age, Gender, Type)

## 1.Dataset Preparation

In [1]:
import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
from collections import defaultdict
import calendar

import helper

init_notebook_mode(connected=True)

In [2]:
# Load dataset
path = "../gun-violence-data/gun-violence-data_01-2013_03-2018.csv"
df = helper.load_data(path)

In [3]:
# Print columns
print(df.keys())

Index(['incident_id', 'date', 'state', 'city_or_county', 'address', 'n_killed',
       'n_injured', 'incident_url', 'source_url',
       'incident_url_fields_missing', 'congressional_district', 'gun_stolen',
       'gun_type', 'incident_characteristics', 'latitude',
       'location_description', 'longitude', 'n_guns_involved', 'notes',
       'participant_age', 'participant_age_group', 'participant_gender',
       'participant_name', 'participant_relationship', 'participant_status',
       'participant_type', 'sources', 'state_house_district',
       'state_senate_district', 'year', 'month', 'monthday', 'weekday', 'loss',
       'month_day_comb'],
      dtype='object')


## 2.People Trends Analysis

In [4]:
# Visualize num of people killed, injured, both

print("Total number of incidents = {}".format(len(df['n_killed'])))

temp=[('n_killed','Number of People Killed'),('n_injured','Number of People Injured'),
     ('loss','Number of People Killed/Injured')]

for column,title in temp:
    labels,values=helper.get_bucketed_data(df,column,3)
    helper.plot_pie(labels,values,title)

Total number of incidents = 239678


* Nobody was killed in most incidents. 20% of incidents had 1 death. Very few incidents had more than 2 deaths.
* Most incidents have 0 injuries, and some have 1 injury.
* The fraction of incidents with 0 injuries is smaller than the fraction with 0 deaths: injuries are easier to cause than deaths.
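
`helper.get_bucketed_data` is not shown in this notebook. The sketch below illustrates one way such bucketing might work, assuming the third argument is the value at which counts are merged into a single "3+" bucket. The function name `bucket_counts` and the toy data are hypothetical.

```python
import pandas as pd

def bucket_counts(series, cap):
    """Count values of `series`, merging everything >= cap into one 'cap+' bucket."""
    clipped = series.clip(upper=cap)
    counts = clipped.value_counts().sort_index()
    labels = [f"{int(v)}+" if v == cap else str(int(v)) for v in counts.index]
    return labels, counts.tolist()

# Toy data standing in for df['n_killed'] (not real values)
toy = pd.Series([0, 0, 1, 0, 2, 5, 3, 1])
labels, values = bucket_counts(toy, 3)
print(labels, values)
```

The labels and values returned this way have the same shape as the two lists passed to `helper.plot_pie` above.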

In [5]:
# Print top 5 serious incidents

print("\n\nThe five most serious incidents (in terms of killed+injured)".upper())

df1 = df.sort_values(['loss'], ascending=[False])
df1[['year', 'state', 'city_or_county', 'n_killed', 'n_injured']].head(5)



THE FIVE MOST SERIOUS INCIDENTS (IN TERMS OF KILLED+INJURED)


Unnamed: 0,year,state,city_or_county,n_killed,n_injured
239677,2017,Nevada,Las Vegas,59,489
130448,2016,Florida,Orlando,50,53
217151,2017,Texas,Sutherland Springs,27,20
101531,2015,California,San Bernardino,16,19
232745,2018,Florida,Pompano Beach (Parkland),17,17


In [6]:
# Visualize age distribution of suspects and victims

for target_type in ['suspect','victim']:      
    age_groups=helper.get_age_distribution(df['participant_type'],df['participant_age'],target_type)
    helper.plot_histogram(age_groups,dict(range=[0, 100]),target_type+' age histogram')

* Most suspects are in the 17 to 40 age group, with a peak at 17 to 20. Counts are very low for ages below 12 or above 80, likely because of limited access to guns or limited capability to use them.
* The victim distribution is similar to the suspect distribution but, unfortunately, it is not very low even at very low and very high ages.
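
The participant columns pack several people into one string. The sketch below assumes the layout `index::value||index::value` (for example `0::Victim||1::Subject-Suspect`), which is consistent with the characters stripped in a later cell but is not confirmed by this notebook. `parse_field` and `ages_for_type` are hypothetical names and are not the actual `helper.get_age_distribution`.

```python
from collections import Counter

def parse_field(raw):
    """Turn '0::Victim||1::Subject-Suspect' into {0: 'Victim', 1: 'Subject-Suspect'}."""
    result = {}
    if not isinstance(raw, str):
        return result  # missing values (NaN) yield an empty mapping
    for item in raw.split("||"):
        index, sep, value = item.partition("::")
        if sep and index.isdigit():
            result[int(index)] = value
    return result

def ages_for_type(type_field, age_field, target):
    """Collect ages of participants whose type contains `target` (case-insensitive)."""
    types = parse_field(type_field)
    ages = parse_field(age_field)
    return [int(ages[i]) for i, t in types.items()
            if target.lower() in t.lower() and i in ages and ages[i].isdigit()]

# Toy strings in the assumed format (not taken from the dataset)
type_field = "0::Victim||1::Subject-Suspect||2::Subject-Suspect"
age_field = "0::25||1::19||2::31"
suspect_ages = ages_for_type(type_field, age_field, "suspect")
victim_ages = ages_for_type(type_field, age_field, "victim")
histogram = Counter(suspect_ages)  # age -> number of suspects of that age
```

Matching participants by index keeps each age attached to the right person even when some participants have no recorded age.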

In [7]:
# Visualize distribution of participant types

types=['ARRESTED','INJURED','KILLED','UNHARMED']
values=helper.get_person_type_counts(df,"participant_status",types)
helper.plot_pie(types,values,'Participant Type')

In [8]:
# Visualize how num of victims and num of guns varies with num of suspects

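# regex=True is required in recent pandas versions for the character class to be treated as a pattern
p_type = df["participant_type"].str.replace("[::0-9|,]", "", regex=True).str.upper()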
guns = df['n_guns_involved'][p_type.notnull()]
p_type = p_type[p_type.notnull()]
p_type = pd.DataFrame(p_type)
victims  = p_type["participant_type"].str.count("VICTIM")
suspects = p_type["participant_type"].str.count("SUBJECT-SUSPECT")

x,y1=helper.get_mean_vs_data(guns,suspects,9)
x,y2=helper.get_mean_vs_data(victims,suspects,9)

# y1 holds the mean number of guns, y2 the mean number of victims (per number of suspects)
temp=dict(zip([str(i) for i in x],y1))
helper.plot_histogram(temp,dict(range=[1,10]),'Mean num of guns vs num of suspects')

temp=dict(zip([str(i) for i in x],y2))
helper.plot_histogram(temp,dict(range=[1,10]),'Mean num of victims vs num of suspects')

* The mean number of guns involved increases with the number of suspects, as expected.
* The mean number of victims does not increase with the number of suspects the way the number of guns does, perhaps because of cases where a larger group of people attacks the same smaller group.
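
The body of `helper.get_mean_vs_data` is not shown. A minimal sketch of the underlying idea, assuming it averages the first series within each value of the second series up to a maximum group size, could look like this. `mean_by_group` and the toy series are hypothetical.

```python
import pandas as pd

def mean_by_group(values, groups, max_group):
    """Mean of `values` for each group size from 1 to max_group (inclusive)."""
    frame = pd.DataFrame({"value": values, "group": groups}).dropna()
    frame = frame[(frame["group"] >= 1) & (frame["group"] <= max_group)]
    means = frame.groupby("group")["value"].mean()
    return means.index.tolist(), means.tolist()

# Toy data: one row per incident (not real values)
victims = pd.Series([1, 2, 1, 3, 1])
suspects = pd.Series([1, 1, 2, 2, 12])  # the 12-suspect incident exceeds max_group
x, y = mean_by_group(victims, suspects, 9)
print(x, y)
```

Capping the group size avoids drawing conclusions from the handful of incidents with very many suspects.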

## 3.Time Related Trends of Gun Violence

Time trends are an important part of analyzing a dataset. They can reveal hidden patterns in past data, which can then serve as a historical baseline for predicting the future.

### Number of gun violence incidents by year

In [9]:
df.head()

Unnamed: 0,incident_id,date,state,city_or_county,address,n_killed,n_injured,incident_url,source_url,incident_url_fields_missing,...,sources,state_house_district,state_senate_district,year,month,monthday,weekday,loss,month_day_comb,temp
0,461105,2013-01-01,Pennsylvania,Mckeesport,1506 Versailles Avenue and Coursin Street,0,4,http://www.gunviolencearchive.org/incident/461105,http://www.post-gazette.com/local/south/2013/0...,False,...,http://pittsburgh.cbslocal.com/2013/01/01/4-pe...,,,2013,1,1,1,4,00-01-01,3+
1,460726,2013-01-01,California,Hawthorne,13500 block of Cerise Avenue,1,3,http://www.gunviolencearchive.org/incident/460726,http://www.dailybulletin.com/article/zz/201301...,False,...,http://losangeles.cbslocal.com/2013/01/01/man-...,62.0,35.0,2013,1,1,1,4,00-01-01,3+
2,478855,2013-01-01,Ohio,Lorain,1776 East 28th Street,1,3,http://www.gunviolencearchive.org/incident/478855,http://chronicle.northcoastnow.com/2013/02/14/...,False,...,http://www.morningjournal.com/general-news/201...,56.0,13.0,2013,1,1,1,4,00-01-01,3+
3,478925,2013-01-05,Colorado,Aurora,16000 block of East Ithaca Place,4,0,http://www.gunviolencearchive.org/incident/478925,http://www.dailydemocrat.com/20130106/aurora-s...,False,...,http://denver.cbslocal.com/2013/01/06/officer-...,40.0,28.0,2013,1,5,5,4,00-01-05,3+
4,478959,2013-01-07,North Carolina,Greensboro,307 Mourning Dove Terrace,2,2,http://www.gunviolencearchive.org/incident/478959,http://www.journalnow.com/news/local/article_d...,False,...,http://myfox8.com/2013/01/08/update-mother-sho...,62.0,27.0,2013,1,7,0,4,00-01-07,3+


In [10]:
helper.incidents_year_Barplot(df, title = 'Gun Violence Incidents by year')

From the above plot, we can see that the number of incidents increases every year. In particular, there is a small jump between 2015 and 2016 relative to the rise in other years. Since the data were collected only through March 2018, the 2018 count is expected to grow further.

### The average number of incidents per month from 2014 to 2017

In [11]:
helper.incidents_month_Barplot(df, title = 'The Average number of Gun Violence Incidents by month')

From the plot, we can observe that July and August have the highest average number of incidents, while February has the lowest. These findings inspired us to explore later whether there are specific dates on which the number of gun violence incidents is consistently higher.

### The average number of incidents by weekday

In [12]:
helper.incidents_weekday_lineplot(df, title = 'The Average number of Gun Violence Incidents by weekday')

From the plot, we can see that gun violence incidents across the US occur mostly on weekends.
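
The weekday averages come from `helper.incidents_weekday_lineplot`, whose body is not shown. As a sketch of one way to compute such an average with pandas, the example below counts incidents per calendar day and then averages those daily counts per weekday. The toy dates are made up, and the `weekday` convention (Monday = 0) is assumed to match the column in the dataset.

```python
import calendar
import pandas as pd

# Toy incident dates (not real data); one row per incident
toy = pd.DataFrame({"date": ["2013-01-01", "2013-01-05", "2013-01-05",
                             "2013-01-07", "2013-01-12"]})
toy["date"] = pd.to_datetime(toy["date"])
toy["weekday"] = toy["date"].dt.weekday  # Monday=0 ... Sunday=6

# Incidents per calendar day, then the mean of those daily counts per weekday
daily = toy.groupby(["date", "weekday"]).size().reset_index(name="incidents")
avg_by_weekday = daily.groupby("weekday")["incidents"].mean()
avg_by_weekday.index = [calendar.day_name[int(i)] for i in avg_by_weekday.index]
print(avg_by_weekday)
```

Days without any recorded incident do not appear in `daily`, so this sketch slightly overestimates the average for sparse data.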

### Time series plot of Total incidents, people killed and people injured

In [13]:
for year in [2014, 2015, 2016, 2017]:
    helper.time_series_plot(df, year, 'Gun Violence Incidents')

These time series plots cover four years, and we can observe that the number of incidents is relatively higher during certain periods. However, the pattern is not clear or direct enough from these plots alone.

Inspired by the earlier plot of the average number of incidents by month, we select the data and use another plot to explore the question: "What are the most dangerous dates for gun violence incidents?"

### Top 10 dates that Gun Violence Incidents happened

In [14]:
helper.top10_incidents(df, year = [2014, 2015, 2016, 2017], title = 'Top 10 dates that Gun Violence Incidents happened')

Each point represents one of the top 10 dates by number of incidents in the specified year. These data points cluster around January 1st, late May, early July, early September and late December, which coincide with federal holidays. Most of these points fall around Independence Day, from which we infer that July 4th and 5th are the dates on which incidents are most likely to happen.
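
`helper.top10_incidents` is not shown. The selection step can be sketched as a per-year ranking of calendar dates by incident count. The function `top_dates` and the toy dates are hypothetical.

```python
import pandas as pd

def top_dates(dates, n):
    """Return the n busiest calendar dates per year as (year, 'MM-DD', count) tuples."""
    dates = pd.to_datetime(pd.Series(dates))
    frame = pd.DataFrame({"year": dates.dt.year,
                          "month_day": dates.dt.strftime("%m-%d")})
    counts = frame.groupby(["year", "month_day"]).size().reset_index(name="incidents")
    counts = counts.sort_values(["year", "incidents"], ascending=[True, False])
    top = counts.groupby("year").head(n)
    return list(top.itertuples(index=False, name=None))

# Toy incident dates (not real data); one entry per incident
toy_dates = ["2016-07-04", "2016-07-04", "2016-01-01",
             "2017-07-04", "2017-07-05", "2017-07-05"]
result = top_dates(toy_dates, 1)
print(result)
```

Keeping the date as a month-day string makes the same holiday comparable across years.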

## 4.Gun Law and Registration related Analysis

### The number of guns registered by state

 **Source of Data** - https://www.thoughtco.com/gun-owners-percentage-of-state-populations-3325153
 
Analyzing the number of guns registered by state could help us answer whether more guns in a place mean more incidents there.

In [15]:
path = '../gun-violence-data/Gun_num_state.csv'
gun_registered = pd.read_csv(path)
gun_registered.head()

FileNotFoundError: File b'../gun-violence-data/Gun_num_state.csv' does not exist

In [16]:
state_df = df[df['year'] == 2017]['state'].value_counts()
statedf = pd.DataFrame()
statedf['state'] = state_df.index
statedf['counts'] = state_df.values
statedf = statedf.merge(gun_registered, on = 'state')
statedf.head()

Unnamed: 0,state,counts
0,Illinois,5089
1,California,4588
2,Florida,4156
3,Texas,2875
4,Ohio,2701


In [None]:
helper.guns_per_capita_plot(statedf,'guns per capita', 'Hot')

The point at the bottom right of the graph is Wyoming, which has the most registered guns per capita but relatively few incidents in 2017. Since Wyoming is the least populous state in the country, it can be regarded as an outlier. The points in the top left corner of the graph are California, Illinois and Florida. These states have fewer registered guns per capita but a considerably higher number of incidents.

In [None]:
helper.guns_per_capita_plot(statedf,'guns registered', 'RdBu')

In the plot of the total number of guns registered (instead of guns per capita), Texas has the highest number of registered guns in America, but its number of incidents is not much higher than that of other states. The data points for CA, IL and FL at the top suggest that more guns mean more incidents.

### Gun Laws on Gun Violence Incidents by state

 **Source of Data** - https://statefirearmlaws.org/national-data/


In [17]:
state_to_code = {'District of Columbia' : 'DC','Mississippi': 'MS', 'Oklahoma': 'OK', 'Delaware': 'DE', 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR', 'New Mexico': 'NM', 'Indiana': 'IN', 'Maryland': 'MD', 'Louisiana': 'LA', 'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona': 'AZ', 'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT', 'Virginia': 'VA', 'Oregon': 'OR', 'Connecticut': 'CT', 'Montana': 'MT', 'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV', 'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI', 'Vermont': 'VT', 'Georgia': 'GA', 'North Dakota': 'ND', 'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY', 'Hawaii': 'HI', 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH', 'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD', 'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA', 'North Carolina': 'NC', 'New York': 'NY', 'Texas': 'TX', 'Nevada': 'NV', 'Maine': 'ME'}

In [None]:
path = '../gun-violence-data/Gun_laws_data.csv'
gun_laws = pd.read_csv(path).rename(columns = {'Unnamed: 0': 'state'})
gun_laws.head()

In [18]:
gun_laws['state'] = [state_to_code[gun_laws['state'][i]] for i in range(50)]
gun_laws_max = gun_laws.nlargest(50,'2017')

NameError: name 'gun_laws' is not defined

In [None]:
helper.rise_of_laws(gun_laws_max, year = [2014, 2015, 2016, 2017], title = "Rise of Gun Violence Laws")

We can clearly observe that California and Illinois have a considerably higher number of gun laws as well as a higher number of incidents. At the same time, Texas and Florida have relatively fewer laws but a high number of incidents and guns registered.

## 5.Location Related Analysis
In this part we analyze the data by location, focusing on comparisons among US states.

### State-wise number of incidents

In [19]:
statedf['state_code'] = statedf['state'].apply(lambda x : state_to_code[x])
helper.state_wise_plot(statedf)

Illinois has the highest number of gun violence incidents over the past 5 years, with a total of over 17k. It is followed by California, Florida and Texas. However, these counts are not adjusted for population, and IL, CA and TX are among the most populous US states.

### State-wise number of incidents per 100k people
**Source of population Data** - https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population

In [20]:
# read and extract population
path = '../gun-violence-data/state_population.xlsx'
state_pop = pd.read_excel(path)
state_pop[2] = state_pop[2].apply(lambda x: x.strip())
df_pop = pd.DataFrame(state_pop[3].values,index=state_pop[2],columns=['pop' ])

In [21]:
# compute incidents rate
statedf['state_population'] = statedf['state'].apply(lambda x : df_pop.to_dict('dict')['pop'][x])
statedf['incidents_rate'] = statedf.eval('counts/state_population')
tempdf = statedf.sort_values('incidents_rate', ascending = False)[1:50]

In [22]:
helper.Barplot(tempdf.state,tempdf.incidents_rate*100000,'Gun Violence Incidents Per 100,000 people by State')

We then normalized the data by population. Alaska has the highest number of gun violence incidents per 100,000 people, while Arizona, Utah, Idaho and California are the states with the fewest incidents in the population-adjusted data. Interestingly, California, one of the top states by absolute number of gun violence incidents, ranks low after adjusting for population. However, Illinois still ranks in the top 5 states.

### Gun Violence Incidents to City Population Ratio
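
The normalization itself is a simple ratio. The sketch below shows the computation with `Series.map` on placeholder states "A", "B" and "C" with made-up counts and populations, purely to illustrate how a state with a large absolute count can rank low after adjustment.

```python
import pandas as pd

# Placeholder states and numbers (hypothetical, not from the dataset)
counts = pd.DataFrame({"state": ["A", "B", "C"], "counts": [500, 500, 50]})
population = pd.Series({"A": 10_000_000, "B": 1_000_000, "C": 500_000})

# Look up each state's population, then scale the rate to incidents per 100,000 people
counts["state_population"] = counts["state"].map(population)
counts["per_100k"] = counts["counts"] / counts["state_population"] * 100_000
ranked = counts.sort_values("per_100k", ascending=False).reset_index(drop=True)
print(ranked[["state", "per_100k"]])
```

Here "A" and "B" have the same number of incidents, yet "A" drops to last place once its larger population is taken into account.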
**Source of population Data** - https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

In [23]:
# import city population data
path = '../gun-violence-data/city_population.xlsx'
city_pop_df = pd.read_excel(path)
population = dict((city_pop_df.to_dict('split')['data']))
df['city_population'] = df['city_or_county'].apply(lambda x : int(population[x]) if x in population else 0)

In [24]:
# Prepare city-level incident and population data
i_p = helper.city_data_prepare(df)

In [25]:
helper.Barplot(i_p['city_or_county'],i_p['incidents_population_ratio'],'Gun Violence Incidents Per 1,000 people by City')

Baltimore has the highest ratio of gun violence incidents to city population, in contrast to Chicago, where the absolute number of gun violence incidents was highest. Baltimore had 3,943 total gun violence incidents, and its population in 2017 was 614,664. Chicago had more than 10,000 gun violence incidents in recent years, but it is one of the most populous cities in the US (2017 population: 2,704,958), and its ratio of incidents to population ranks fourth, after Baltimore, Washington and Milwaukee.

## 6.Cause Analysis
Here we analyze possible causes of rising gun violence, focusing on social inequality (Gini coefficient) and education (graduation rate).

### Gini coefficients vs gun violence
**Source of Data** - https://en.wikipedia.org/wiki/List_of_U.S._states_by_Gini_coefficient

In [26]:
# import gini data
path = '../gun-violence-data/gini.xlsx'
state_gini = pd.read_excel(path)
statedf = pd.merge(statedf, state_gini, on='state')

In [27]:
# scatter plot
helper.scatter_plot(statedf.gini,statedf['incidents_rate']*10000,statedf['state_code'],"Gun Violence Incidents per 10,000 vs Gini coefficient",'Gini coefficient','Gun Violence Incidents per 10,000')

We first explore the relationship between social inequality and the incident rate.
The Gini coefficient measures how wealth is distributed among a population. Intuitively, the larger the Gini coefficient, the more unequal the wealth distribution.

We collected Gini coefficients from the internet and plotted the incident rate against the Gini coefficient for each state. Overall, the data show a positive correlation between the two. In particular, Washington DC has both the highest incident rate and the highest Gini coefficient. Linear regression also shows a positive correlation, which means that greater wealth inequality is usually associated with a higher gun violence rate at the state level.
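
The regression behind `helper.scatter_plot` is not shown. A minimal sketch of fitting a straight line and computing the Pearson correlation with NumPy is given below. The function `fit_line` is hypothetical, and the Gini and rate values are invented for illustration only.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, plus the Pearson correlation r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r

# Invented values, one entry per state (not real data)
gini = [0.42, 0.44, 0.46, 0.48, 0.53]
rate = [1.0, 1.4, 1.5, 2.1, 4.0]  # incidents per 10,000 people
slope, intercept, r = fit_line(gini, rate)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A positive slope together with `r` close to 1 indicates a strong positive linear association, although it says nothing about causation.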

### Education vs gun violence
**Source of Data** - https://en.wikipedia.org/wiki/List_of_U.S._states_by_educational_attainment

In [28]:
# import education data
path = '../gun-violence-data/education.xlsx'
state_edu = pd.read_excel(path)
statedf = pd.merge(statedf, state_edu, on='state')

In [29]:
# scatter plot
helper.scatter_plot(statedf.bachelor,statedf['incidents_rate']*10000,statedf['state_code'],"Gun Violence Incidents per 10,000 vs bachelor's graduation rate",'bachelor\'s graduate rate','Gun Violence Incidents per 10,000')

Next, we explore the relationship between education and the incident rate. We collected graduation rate data from the internet and plotted the bachelor's graduation rate against the incident rate for each state. Except for the outlier Washington DC, the data show a negative correlation between incidents and education. This suggests that better education is associated with a lower gun violence rate, which makes sense.
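
A single extreme point can change the sign of a correlation, which is why the outlier is set aside here. The sketch below compares the Pearson correlation with and without one labeled outlier. All values and labels are invented for illustration.

```python
import numpy as np

# Invented values (not real data); the last point plays the role of the outlier
labels = ["A", "B", "C", "D", "E", "OUTLIER"]
bachelor = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 58.0])  # graduation rate in percent
rate = np.array([5.0, 4.5, 4.0, 3.0, 2.5, 12.0])           # incidents per 10,000 people

# Correlation over all points
r_all = np.corrcoef(bachelor, rate)[0, 1]

# Correlation after dropping the labeled outlier
keep = np.array([label != "OUTLIER" for label in labels])
r_trimmed = np.corrcoef(bachelor[keep], rate[keep])[0, 1]
print(f"with outlier: {r_all:.2f}, without outlier: {r_trimmed:.2f}")
```

In this toy example the correlation is positive with the outlier and negative without it, so reporting both values makes the effect of the exclusion explicit.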
