Hi! In this kernel, I perform an exploratory data analysis (EDA) of the [NY Shootings Dataset](https://www.kaggle.com/thaddeussegura/new-york-city-shooting-dataset), shared by Thaddeus Segura on Kaggle. The dataset covers the period from 2006 to 2019, and the source is NYC OpenData. As I don't live in New York, I wanted to know how violent the city is, whether some neighborhoods or groups of people are more affected than others, and whether I could identify patterns in shooting incidents over time.

I plan to add new graphs regularly, and maybe analyze other datasets related to the main one, so feel free to come back to this notebook to see what I have added!

## Table of Contents

0. [Data and Libraries](#data-libraries)
1. [Shootings over time](#shootings-over-time)
2. [Annual Patterns](#annual-patterns)
3. [Weekly Patterns](#weekly-patterns)
4. [Hourly Patterns](#hourly-patterns)
5. [Geographical Locations](#geographical-locations)
6. [Places prone to shootings](#places-of-shootings)
7. [Analysis of the shooter identity](#shooter-identity)
8. [Analysis of the victim identity](#victim-identity)

<a id = 'data-libraries'></a>
# Data and Libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.express as px
import geopandas as gpd
import json
import shapely
import folium
init_notebook_mode(connected = True)

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

In [None]:
shootings = pd.read_csv('../input/new-york-city-shooting-dataset/NYPD_Shooting_Incident_Data__Historic_.csv')

In [None]:
# NY Precincts GeoJSON file, will be used in part V (Geographical Locations)
json_file_path = "../input/ny-police-precincts-geojson/Police Precincts.geojson"

with open(json_file_path, 'r') as j:
    precincts = json.load(j)

In [None]:
shootings['date'] = pd.to_datetime(shootings['OCCUR_DATE'])
shootings['year'] = shootings['date'].dt.year
shootings['month'] = shootings['date'].dt.month
shootings['month_str'] = shootings['date'].dt.month_name()
shootings['day'] = shootings['date'].dt.day
shootings['weekdays'] = shootings['date'].dt.strftime('%A')  
shootings['hour'] = shootings['OCCUR_TIME'].apply(lambda date : int(date.split(':')[0]))

In [None]:
shootings.head()

<a id = 'shootings-over-time'></a>
# I - Shootings over time


In [None]:
monthly_df=shootings['date'].groupby(shootings.date.dt.to_period("M")).agg('count').to_frame(name="count").reset_index()
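# Select the flag column before summing, so that non-numeric columns are never aggregated
monthly_df['fatal shootings'] = shootings.groupby(shootings.date.dt.to_period("M"))['STATISTICAL_MURDER_FLAG'].sum().values
# Month labels such as '2006-01', used as the x-axis of the time series plots
month_year = monthly_df['date'].astype(str).tolist()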

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(
    x= month_year,
    y= monthly_df['count'],
    name="Monthly Shootings",
    mode='lines'))
fig.add_trace(go.Scatter(
    x= month_year,
    y= monthly_df['fatal shootings'],
    name="Monthly Fatal Shootings",
    mode='lines'))

fig.update_layout(title='Shootings in New York (2006-2019)') 
fig.update_xaxes(title_text="Years", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(title_text="Number of Shootings", showline=True, linewidth=2, linecolor='black', mirror=True)

fig.show()

### Observations

- We observe seasonal shooting peaks during summer (July and August in particular), although those peaks are less pronounced after 2013, and seasonal lows in January, February and March. Maybe cold weather reduces violence?
- The period 2006-2013 was far more violent than 2014-2019; it would be interesting to investigate what reduced the number of shootings, but at least it is good news!
- The number of fatal shootings stays between 5 and 50 per month, with the exception of July 2010, when 61 shootings resulted in the death of the victim.
- Let's note that fatal shootings represent around 20% of the total.
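
To check a figure like the "around 20%" above, here is a minimal sketch of how the fatal share could be computed overall and per year. It runs on a tiny made-up DataFrame (`toy`, not real data) that only mimics the `year` and `STATISTICAL_MURDER_FLAG` columns of `shootings`; the helper name `fatal_share` is my own.

```python
import pandas as pd

def fatal_share(df, flag_col="STATISTICAL_MURDER_FLAG"):
    """Return the percentage of rows whose murder flag is True."""
    return 100 * df[flag_col].astype(bool).mean()

# Toy stand-in for the `shootings` DataFrame (made-up values, not real data)
toy = pd.DataFrame({
    "year": [2006] * 5 + [2019] * 5,
    "STATISTICAL_MURDER_FLAG": [True, False, False, False, False,
                                False, False, True, False, False],
})

overall = fatal_share(toy)
# Same computation per year: the mean of a boolean column is the share of True values
per_year = toy.groupby("year")["STATISTICAL_MURDER_FLAG"].mean().mul(100)

print(f"{overall:.1f}% of the toy shootings are fatal")
print(per_year)
```

On the real data, the same two lines applied to `shootings` should show whether the share stays near 20% every year or drifts over time.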

<a id = 'annual-patterns'></a>
# II - Annual patterns

In [None]:
months = shootings['month_str'].groupby(shootings.date.dt.month_name()).agg('count').to_frame(name="count")
months['fatal shootings'] = shootings.groupby('month_str')['STATISTICAL_MURDER_FLAG'].sum()
months['% fatal'] = (100 * months['fatal shootings'] / months['count']).apply(round)

calendar = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
months = months.reindex(calendar, axis=0).reset_index()

In [None]:
fig = go.Figure(data=[go.Bar(
    x= months['date'],
    y= months['count'],
    name="Monthly Shootings"),
                     go.Bar(
    x= months['date'],
    y= months['fatal shootings'],
    name="Monthly Fatal Shootings",
    hovertext= months['% fatal'].apply(str) + ' % fatal')])

fig.update_layout(title='Annual trend in NY shootings')
fig.update_xaxes(title_text="Month", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(title_text="Number of Shootings", showline=True, linewidth=2, linecolor='black', mirror=True)

fig.show()

### Observations

- As we've seen previously, there's a peak in shootings during summer, and a low during winter.
- Fatal shootings follow a similar but weaker trend, with their count being more stable from one month to another.
- As a result, the share of fatal shootings increases during winter: the reduction in shootings from summer to winter is mainly due to the drop in non-fatal shootings.
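
The last point can be made explicit by counting fatal and non-fatal shootings per season. This is only a sketch on made-up data: the season definitions (`SUMMER`, `WINTER`) and the helper `season_counts` are my own assumptions, and the `month` column mimics the one created at the top of the notebook.

```python
import pandas as pd

SUMMER = {6, 7, 8}    # assumption: June to August
WINTER = {12, 1, 2}   # assumption: December to February

def season_counts(df):
    """Cross-tabulate season (summer/winter) against fatal/non-fatal shootings."""
    sub = df[df["month"].isin(SUMMER | WINTER)]
    season = sub["month"].isin(SUMMER).map({True: "summer", False: "winter"}).rename("season")
    kind = (sub["STATISTICAL_MURDER_FLAG"].astype(bool)
            .map({True: "fatal", False: "non-fatal"}).rename("kind"))
    return pd.crosstab(season, kind)

# Toy stand-in for `shootings` (made-up values, not real data)
toy = pd.DataFrame({
    "month": [7, 7, 7, 7, 8, 8, 1, 1, 2, 4],
    "STATISTICAL_MURDER_FLAG": [True, False, False, False, False, False,
                                True, False, False, False],
})

counts = season_counts(toy)
# Share of fatal shootings within each season, in percent
rate = 100 * counts["fatal"] / counts.sum(axis=1)
print(counts)
print(rate.round(1))
```

If the fatal column barely changes between the two rows while the non-fatal column drops, the winter increase in fatality comes from the denominator, which is what the bar chart suggests.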

<a id = 'weekly-patterns'></a>
# III - Weekly patterns

In [None]:
daily_df=shootings.groupby('date').agg('count')['INCIDENT_KEY'].to_frame(name="count")
daily_df['fatal shootings'] = shootings.groupby('date')['STATISTICAL_MURDER_FLAG'].sum()
daily_df = daily_df.reset_index()
missing_days = pd.DataFrame(data = {'date': pd.date_range(start = '2006-01-01', end = '2019-12-31' ).difference(daily_df['date']), 'count' : 0, 'fatal shootings': 0})
daily_df = pd.concat([daily_df, missing_days])
daily_df['weekdays'] = daily_df['date'].dt.strftime('%A')  

In [None]:
fig = make_subplots(rows=2, cols=1,subplot_titles=['Weekly shootings', 'Weekly fatal shootings'])

fig.add_trace(go.Box(x=daily_df['weekdays'], 
                     y=daily_df['count'], 
                     text=daily_df.apply(lambda row: f"{row['date']}<br>Shootings:{row['count']}<br>Of which fatal:{row['fatal shootings']}", axis=1),
                     hoverinfo="text",
                     name='Total Shootings'
                     ), row = 1, col = 1)

fig.add_trace(go.Box(x=daily_df['weekdays'], 
                     y=daily_df['fatal shootings'], 
                     text=daily_df.apply(lambda row: f"{row['date']}<br>Shootings:{row['count']}<br>Of which fatal:{row['fatal shootings']}", axis=1),
                     hoverinfo="text",
                     marker_color = 'red',
                     name='Fatal Shootings'
                     ), row = 2, col = 1)

fig.update_layout(height=1000) 
fig.update_xaxes(title_text="Week Days", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(title_text="Number of Shootings", showline=True, linewidth=2, linecolor='black', mirror=True)
              
fig.show()

In [None]:
no_shootings = pd.DataFrame(data= daily_df[daily_df['count'] == 0]['weekdays'].value_counts())
no_shootings.columns = ['# Days without shootings']
no_shootings

### Observations
- Shootings (whether fatal or not) are more common during weekends, while the most peaceful day is Thursday. Weekends can perhaps be explained (people have "more time to shoot"), but I would be curious to know why Thursdays are so peaceful!
- Let's also note that only Saturdays and Sundays have a median higher than 0 for fatal shootings.
- During the 14 years covered by the dataset (2006 to 2019), i.e. approximately 730 weeks, only 18 Saturdays were peaceful! Hence, only about 2.5% of the Saturdays of this period had no shooting at all, which is strikingly low.
- The day with the most shootings is Sunday, September 4, 2011, with 31 shootings in total, 4 of them fatal. I couldn't find any documentation on that day. Looking at the details, this high number was most likely not due to a coordinated group of shooters (the shootings happened in different places and at different times).
- Each day of the week has outliers with a high number of shootings ending in the death of the victim, the highest being 10 fatal shootings out of 11 in total, on Monday, 12th of December.
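
The "peaceful Saturdays" figure is simply 18 / 730 ≈ 2.5%. Below is a small sketch of how the share of shooting-free days per weekday could be computed directly, by filling the calendar with zeros as done for `daily_df`. The data is made up (two weeks of January 2019) and the helper name `quiet_day_share` is my own.

```python
import pandas as pd

def quiet_day_share(dates, start, end):
    """Percentage of days without any incident, per weekday, between start and end."""
    calendar = pd.date_range(start=start, end=end, freq="D")
    # Number of incidents per calendar day; days absent from the data get 0
    per_day = pd.Series(pd.to_datetime(dates)).dt.normalize().value_counts()
    counts = per_day.reindex(calendar, fill_value=0)
    quiet = counts.eq(0)
    return quiet.groupby(quiet.index.day_name()).mean().mul(100)

# Toy incident dates (made-up): 2019-01-05 is a Saturday, 2019-01-18 a Friday
toy_dates = ["2019-01-05", "2019-01-05", "2019-01-06",
             "2019-01-10", "2019-01-12", "2019-01-13"]

share = quiet_day_share(toy_dates, "2019-01-05", "2019-01-18")
print(share)
```

In this toy example both Saturdays and both Sundays have at least one incident (0% quiet), one of the two Thursdays is quiet (50%), and the days with no incident at all are 100% quiet.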

<a id = 'hourly-patterns'></a>
# IV - Hourly patterns

In [None]:
hourly_df=shootings.groupby('hour').agg('count')['INCIDENT_KEY'].to_frame(name="count").reset_index()
hourly_df['fatal shootings'] = shootings.groupby('hour')['STATISTICAL_MURDER_FLAG'].sum().values
hourly_df['% fatal'] = (100 * hourly_df['fatal shootings'] / hourly_df['count']).apply(round)

In [None]:
fig = go.Figure()

fig.add_trace(go.Bar(
    x= hourly_df['hour'],
    y= hourly_df['count'],
    name="Hourly Shootings"))
fig.add_trace(go.Bar(
    x= hourly_df['hour'],
    y= hourly_df['fatal shootings'],
    name="Hourly Fatal Shootings",
    hovertext= hourly_df['% fatal'].apply(str) + ' % fatal'))

fig.update_layout(title='Daily trend in NY shootings')
fig.update_xaxes(title_text="Hour", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(title_text="Number of Shootings", showline=True, linewidth=2, linecolor='black', mirror=True)


fig.show()

I have also made this "clock plot" to represent the shootings per hour. It is harder to read, but it was fun to make! :)

In [None]:
def hr_str(hr):
    # Normalize hr to be between 1 and 12
    hr_str = str(((hr-1) % 12) + 1)
    suffix = ' AM' if (hr % 24) < 12 else ' PM'
    return hr_str + suffix

data = [go.Barpolar(
        r = hourly_df['fatal shootings'],
        marker_color = 'red',
        name = 'fatal shootings'
    ),
    go.Barpolar(
        r = hourly_df['count'] - hourly_df['fatal shootings'],
        marker_color = 'blue',
        name = 'non-fatal shootings')]

layout = go.Layout(showlegend = False)

layout.polar.angularaxis.direction = 'clockwise'
layout.polar.angularaxis.tickvals = [(hr * 15) % 360 for hr in range(24)]
layout.polar.angularaxis.ticktext = [hr_str(hr) for hr in range(24)]

fig = go.FigureWidget(data=data, layout=layout)
fig.update_layout(title='Daily trend in NY shootings - Clock Plot')
fig.show()

### Observations
- As we could have expected, most shootings (fatal and non-fatal) occur at night, with a peak at midnight.
- The most peaceful hours are between 7 and 9 AM.
- This pattern is rather constant over the years, although the curve flattens as the number of shootings falls after 2014:

In [None]:
year_hour_df=shootings.groupby(['year', 'hour']).agg('count')['INCIDENT_KEY'].to_frame(name="count").reset_index()
year_hour_df['fatal shootings'] = shootings.groupby(['year', 'hour'])['STATISTICAL_MURDER_FLAG'].sum().values

fig = go.Figure()
for year in year_hour_df['year'].unique():
    fig.add_trace(go.Scatter(
    x= year_hour_df[year_hour_df['year'] == year]['hour'],
    y= year_hour_df[year_hour_df['year'] == year]['count'],
    name=str(year),
    mode='lines'))
    
fig.update_layout(title='Daily trend in NY shootings over years')
fig.update_xaxes(title_text="Hour", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(title_text="Number of Shootings", showline=True, linewidth=2, linecolor='black', mirror=True)

fig.show()

Finally, we can note that the drop in the number of shootings is mainly due to fewer shootings happening at night.
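
To put numbers on that last remark, one option is to split the shootings into "night" and "day" and compare the two periods. This is a sketch on made-up data; the definition of night (`NIGHT_HOURS`, 8 PM to 4:59 AM), the split year, and the helper `night_vs_day` are my own assumptions.

```python
import pandas as pd

NIGHT_HOURS = set(range(20, 24)) | set(range(0, 5))  # assumption: 8 PM to 4:59 AM

def night_vs_day(df, split_year=2014):
    """Count night and day shootings before and from `split_year` onward."""
    period = df["year"].ge(split_year).map({False: "before", True: "after"}).rename("period")
    part = df["hour"].isin(NIGHT_HOURS).map({True: "night", False: "day"}).rename("part")
    return pd.crosstab(period, part)

# Toy stand-in for `shootings` (made-up values, not real data)
toy = pd.DataFrame({
    "year": [2008, 2008, 2008, 2008, 2015, 2015],
    "hour": [23, 1, 22, 14, 23, 15],
})

table = night_vs_day(toy)
print(table)
```

On the real data the two periods have different lengths (8 and 6 years), so it is fairer to divide each row by its number of years before comparing.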

<a id = 'geographical-locations'></a>
# V - Geographical Locations

Let's start with an interactive map to see the location of those shootings over the years:

In [None]:
fig = px.scatter_mapbox(shootings.sort_values("year"), lat="Latitude", lon="Longitude", 
                        zoom=9, animation_frame="year", color = 'STATISTICAL_MURDER_FLAG',
                       labels={"STATISTICAL_MURDER_FLAG": "Fatal Shooting"})

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.update_layout(title='Locations of NY shootings (2006-2019)')
fig.show()

In [None]:
yearly_boro_df = shootings.groupby(['year', 'BORO']).agg('count')['INCIDENT_KEY'].to_frame(name="count").reset_index()
yearly_boro_df['fatal shootings'] = shootings.groupby(['year', 'BORO'])['STATISTICAL_MURDER_FLAG'].sum().values

In [None]:
fig = make_subplots(rows=3, cols=2,subplot_titles=yearly_boro_df['BORO'].unique())

for i, borough in enumerate(yearly_boro_df['BORO'].unique()):
    fig.add_trace(go.Bar(
        x= yearly_boro_df.loc[yearly_boro_df['BORO'] == borough, 'year'],
        y= yearly_boro_df.loc[yearly_boro_df['BORO'] == borough, 'count'],
        name = borough),
        row = i //2 + 1, col = i%2 + 1)
    
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(title_text='Shootings evolution per Borough', title_x=0.5,showlegend=False)

fig.show()

In [None]:
boro_df = shootings.groupby('BORO').agg('count')['INCIDENT_KEY'].to_frame(name="count").reset_index()
boro_df['fatal shootings'] = shootings.groupby('BORO')['STATISTICAL_MURDER_FLAG'].sum().values

In [None]:
fig = make_subplots(rows=1, cols=2, subplot_titles=['Distribution of total shootings in NY', 'Distribution of fatal shootings in NY'],
                    specs=[[{'type':'domain'}, {'type':'domain'}]])

fig.add_trace(go.Pie(
    labels = boro_df['BORO'],
    values = boro_df['count'],
    name = 'total shootings'),
    row = 1, col = 1)

fig.add_trace(go.Pie(
    labels = boro_df['BORO'],
    values = boro_df['fatal shootings'],
    name = 'fatal shootings'),
    row = 1, col = 2)

fig.show()

We can also plot this data per police precinct, which gives us a more precise view of the situation (a folium alternative is commented out in the code).

In [None]:
precincts_df = shootings.groupby('PRECINCT').count()['INCIDENT_KEY'].to_frame(name="count").reset_index()
precincts_df['fatal shootings'] = shootings.groupby('PRECINCT')['STATISTICAL_MURDER_FLAG'].sum().values
precincts_df['PRECINCT'] = precincts_df['PRECINCT'].astype(str)

In [None]:
#Folium version

# m = folium.Map(location=[40.7, -74], zoom_start = 10)

# folium.Choropleth(
#     geo_data=precincts,
#     name='Shootings in NY',
#     data=precincts_df,
#     columns=['PRECINCT', 'count'],
#     key_on="properties.precinct",
#     fill_color='PuRd',
#     fill_opacity=0.7,
#     line_opacity=0.2,
#     legend_name='Number of shootings',
# ).add_to(m)

# m

In [None]:
fig = go.Figure(data = go.Choroplethmapbox(geojson = precincts,
    locations = precincts_df['PRECINCT'],
    z = precincts_df['count'], 
    featureidkey = "properties.precinct", marker_opacity=0.8,
    text=precincts_df.apply(lambda row: f"Precinct n°{row['PRECINCT']}<br>Total Shootings:{row['count']}<br>Fatal Shootings:{row['fatal shootings']}", axis=1),
    hoverinfo="text"
))

fig.update_layout(mapbox_style="open-street-map",
                  mapbox_zoom=8.5, mapbox_center = {"lat": 40.7, "lon": -74})
fig.update_layout(title_text='Shootings in NY per police precinct', title_x=0.5)
fig

### Observations
- It is clear that most of the shootings happened in Brooklyn and in the Bronx (around 70% of the total).
- From the interactive map, we can see that even within boroughs, shootings are concentrated in specific areas; as an example, the south of Brooklyn is rather peaceful compared to the rest of the borough.
- All boroughs follow the same trend (i.e. a strong reduction in the number of shootings after 2010, and almost the same number of shootings in each of the last 3 years), except Staten Island, whose counts fluctuate more because so few shootings happened in that borough.
- The distributions of total and fatal shootings across boroughs are almost the same; no borough is more prone to shootings ending in death than any other.
- From the precincts map, we can see how shootings are concentrated in a few precincts: Brooklyn has the most violent ones (67, 73 and 75), which together account for more than 30% of the borough's shootings.
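
The concentration mentioned in the last point can be checked with a few lines. Here is a sketch, again on made-up data that only mimics the `BORO` and `PRECINCT` columns; the helper `top_precinct_share` is a name I introduce for illustration.

```python
import pandas as pd

def top_precinct_share(df, borough, n=3):
    """Return the n precincts with the most shootings in a borough and their combined share (%)."""
    counts = df.loc[df["BORO"] == borough, "PRECINCT"].value_counts()
    top = counts.head(n)  # value_counts sorts in descending order
    return top.index.tolist(), 100 * top.sum() / counts.sum()

# Toy stand-in for `shootings` (made-up counts, not real data)
toy = pd.DataFrame({
    "BORO": ["BROOKLYN"] * 10 + ["BRONX"] * 3,
    "PRECINCT": [75] * 4 + [73] * 3 + [67] * 2 + [60] + [44, 44, 46],
})

precinct_ids, share = top_precinct_share(toy, "BROOKLYN")
print(precinct_ids, f"{share:.0f}% of the borough's shootings")
```

With the real `shootings` DataFrame, `top_precinct_share(shootings, "BROOKLYN")` should return the precincts cited above if the map reading is right.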

<a id = 'places-of-shootings'></a>
# VI - Places prone to shootings

In [None]:
places_df = shootings.groupby('LOCATION_DESC').count()['INCIDENT_KEY'].sort_values(ascending = False).to_frame(name="count").reset_index()[:10]

In [None]:
fig = go.Figure()

fig.add_trace(go.Bar(
    x= places_df['LOCATION_DESC'],
    y= places_df['count']))

fig.update_layout(title='Places with the highest number of shootings')
fig.update_xaxes(title_text="Places", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(title_text="Number of Shootings", showline=True, linewidth=2, linecolor='black', mirror=True)

fig.show()

In [None]:
dwell_df = shootings[shootings['LOCATION_DESC'].str.startswith('MULTI DWELL', na = False)]
monthly_dwell_df=dwell_df['date'].groupby(dwell_df.date.dt.to_period("M")).agg('count').to_frame(name="count").reset_index()
monthly_dwell_df['fatal shootings'] = dwell_df.groupby(dwell_df.date.dt.to_period("M"))['STATISTICAL_MURDER_FLAG'].sum().values
    
fig = go.Figure()
# Use the months present in monthly_dwell_df as x, so x and y always have the same length
fig.add_trace(go.Scatter(
    x= monthly_dwell_df['date'].astype(str),
    y= monthly_dwell_df['count'],
    name="Shootings",
    mode='lines'))
fig.add_trace(go.Scatter(
    x= monthly_dwell_df['date'].astype(str),
    y= monthly_dwell_df['fatal shootings'],
    name="Fatal Shootings",
    mode='lines'))
fig.update_layout(title='Number of shootings in New York apartments per month',
                  yaxis_zeroline=False, xaxis_zeroline=False)
fig.update_xaxes(title_text="Time", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(title_text="Number of Shootings", showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

### Observations
- Among the shootings with a reported place, homes (multi-dwelling buildings) are by far the most common location.
- Total shootings in apartments follow the general pattern, with peaks in summer and lows in winter.
- Fatal shootings are more unpredictable, and all types of shootings in apartments have decreased since 2013.
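
Since the first point only holds "among the shootings with a reported place", it is worth knowing how many rows actually have one. A minimal sketch, assuming missing places show up as NaN in `LOCATION_DESC`; the data and labels below are illustrative only and the helper `location_coverage` is my own.

```python
import pandas as pd

def location_coverage(df, col="LOCATION_DESC"):
    """Return the share (%) of rows with a reported place and the counts per place."""
    reported = df[col].notna()
    return 100 * reported.mean(), df.loc[reported, col].value_counts()

# Toy stand-in for `shootings` (illustrative labels, not real counts)
toy = pd.DataFrame({
    "LOCATION_DESC": ["MULTI DWELL - PUBLIC HOUS", "MULTI DWELL - APT BUILD", None, None,
                      "BAR/NIGHT CLUB", None, "MULTI DWELL - PUBLIC HOUS", None],
})

coverage, by_place = location_coverage(toy)
print(f"{coverage:.0f}% of rows have a reported place")
print(by_place)
```

If the coverage is low on the real data, the ranking of places should be read as a ranking among documented cases only.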

<a id = 'shooter-identity'></a>
# VII - Analysis of the shooter identity

In [None]:
# Some age groups are unreadable, some cleaning is needed
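# Assign the result back instead of using inplace=True on a column selection
shootings['PERP_AGE_GROUP'] = shootings['PERP_AGE_GROUP'].fillna('UNKNOWN')
shootings['PERP_SEX'] = shootings['PERP_SEX'].fillna('U')
shootings['PERP_RACE'] = shootings['PERP_RACE'].fillna('UNKNOWN')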
shootings_age_clean = shootings[shootings['PERP_AGE_GROUP'].isin(['25-44', '18-24', '<18', 'UNKNOWN', '45-64', '65+'])]

In [None]:
profile_df = pd.pivot_table(shootings_age_clean, values = 'INCIDENT_KEY', index = 'PERP_AGE_GROUP', 
               columns = 'PERP_SEX', aggfunc=(lambda x: x.count().round()), margins = True).reindex(['<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN', 'All'])
profile_df

As men account for the vast majority of shootings, I will drop the sex criterion and introduce the racial one.

In [None]:
profile_df2 = pd.pivot_table(shootings_age_clean, values = 'INCIDENT_KEY', index = 'PERP_AGE_GROUP', 
               columns = 'PERP_RACE', aggfunc=(lambda x: x.count().round()), margins = True).reindex(['<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN', 'All'])
profile_df2

### Observations
- A lot of data is missing (labelled as 'UNKNOWN'), probably because some shooters were never identified, or because the identity details of the shooters aren't always reported. Thus, the following observations must be taken with caution.
- We see how much young men dominate this table: excluding the 'UNKNOWN' profiles, men younger than 45 have committed more than 70% of the shootings for which shooter information is available. That is huge!
- As we could have expected, younger people commit more shootings than their elders, and men far more than women.
- Let's also note the one woman aged 65+, the only one in her group to have committed a shooting in this time period. It occurred in March 2010 in the borough of Queens.
- We see that Blacks and Hispanics have committed more than 80% of the total shootings (excluding once again the 'UNKNOWN' profiles). With economic data, I am sure this could be linked to the poverty and social distress of these populations.
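
The percentages "excluding the UNKNOWN profiles" can be reproduced by dropping the unknown labels before normalizing. A sketch on made-up data follows; the helper `share_excluding_unknown` is my own name, and the labels in the toy column are only illustrative.

```python
import pandas as pd

def share_excluding_unknown(series, unknown=("UNKNOWN", "U")):
    """Percentage of each category, computed only over rows with a known value."""
    filled = series.fillna("UNKNOWN")
    known = filled[~filled.isin(unknown)]
    return known.value_counts(normalize=True).mul(100)

# Toy stand-in for shootings['PERP_RACE'] (made-up values, not real data)
toy_race = pd.Series(["BLACK", "BLACK", "BLACK", "WHITE HISPANIC",
                      "WHITE", None, "UNKNOWN"])

shares = share_excluding_unknown(toy_race)
print(shares)
```

It is also useful to print how many rows were dropped, since the more unknown profiles there are, the less these shares say about shooters as a whole.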

<a id = 'victim-identity'></a>
# VIII - Analysis of the victim identity

Let's perform the same analysis as in part VII with the victims, and see if we can observe some major differences...

In [None]:
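# Strangely, no data cleaning is needed for the victim columns.
# Note: shootings_age_clean is reused here, so rows with an unreadable shooter age group are excluded too.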
victim_df = pd.pivot_table(shootings_age_clean, values = 'INCIDENT_KEY', index = 'VIC_AGE_GROUP', 
               columns = 'VIC_SEX', aggfunc=(lambda x: x.count().round()), margins = True).reindex(['<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN', 'All'])
victim_df

In [None]:
# Use a separate name so the shooter table (profile_df2) is not overwritten
victim_df2 = pd.pivot_table(shootings_age_clean, values = 'INCIDENT_KEY', index = 'VIC_AGE_GROUP', 
               columns = 'VIC_RACE', aggfunc=(lambda x: x.count().round()), margins = True).reindex(['<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN', 'All'])
victim_df2

### Observations
- There is far less data labelled as 'UNKNOWN', hence we can be more confident in the results shown in these 2 tables.
- As with the shooters' identity, young people, men and Blacks/Hispanics have suffered the most from these shootings.
- This prevalence of Blacks/Hispanics confirms that the shootings mostly concern their communities; the hypothesis of a link with poverty and social distress is thus strengthened.
- Let's note that women are more present than in the shooter analysis: they represent around 10% of the victims. This may be due to the large amount of unknown data in the shooter fields, but we could also easily assume that women are more likely to be shot than to shoot. Unfortunately, the conclusion is hard to draw.
- Older people (45+ years old) are also more often victims than shooters, and we can draw the same conclusion as for women.
- I have performed the same analysis taking into account only fatal shootings, and the proportions are the same (that's why I don't display it here).
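
For readers who want to redo the comparison from the last point, here is one way it could be done: compute the distribution of a victim column over all shootings and over fatal ones only, side by side. This is a sketch on made-up data, not the code I ran; `compare_fatal_vs_all` is a hypothetical helper.

```python
import pandas as pd

def compare_fatal_vs_all(df, col, flag_col="STATISTICAL_MURDER_FLAG"):
    """Distribution (%) of `col` over all rows and over fatal rows, side by side."""
    all_share = df[col].value_counts(normalize=True).mul(100)
    fatal_rows = df[df[flag_col].astype(bool)]
    fatal_share = fatal_rows[col].value_counts(normalize=True).mul(100)
    # Categories absent from the fatal subset get 0 instead of NaN
    return pd.DataFrame({"all %": all_share, "fatal %": fatal_share}).fillna(0)

# Toy stand-in for `shootings` (made-up values, not real data)
toy = pd.DataFrame({
    "VIC_SEX": ["M", "M", "M", "F", "M", "M", "M", "F", "M", "M"],
    "STATISTICAL_MURDER_FLAG": [True, False, False, False, True,
                                False, False, True, False, False],
})

table = compare_fatal_vs_all(toy, "VIC_SEX")
print(table.round(1))
```

In this toy example the two columns differ on purpose; on the real data, the claim above is that they are nearly identical.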

That's it for now! I will add new graphs as soon as possible. If you liked this notebook, please upvote and share it, that helps me a lot :)