<a href="https://colab.research.google.com/github/Lucyxuyd/Intro-to-Data-Analytics/blob/main/BA780_B07_GroupProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boston Crime Analytics

<p align="right">BA780 CohortB Team7</p>

<p align="right">Jiadai Yu, Raj Patel, Yidan Xu, Yen-Chun Chen, Zhaner Sun</p>

**Summary**

Our goal for this project is to analyze which areas of Boston have the highest crime rate. We also want to narrow the analysis to specific variables, such as the type of crime, the district, and the day the crime occurred. With this, we can better predict possible incidents and suggest appropriate reinforcements to make the city of Boston safer.

**Data Source**

Crime Incident Reports (August 2015 - To Date)(Source: New System)

https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system

**1. Import and Data Cleaning**  
1.1 Data Description  
1.2 Data Cleaning  
**2. Descriptive Explorations**  
2.1 The Most Frequent Type of Incidents  
2.2 District Differences of Incidents  
2.3 Month with the Highest Crime Rate  
2.4 Day Crime is Reported the Most  
2.5 Time Crime is Reported the Most  
2.6 Most Dangerous Incidents: Shooting  
**3. Conclusion**

## 1. Import and Data Cleaning  
### 1.1 Data Description

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

crime2021 = pd.read_csv('https://raw.githubusercontent.com/JiadaiY/JiadaiYu/main/BA780%20GroupProject/Crime%20Incident%20Reports%20-%202021.csv')
crime2022 = pd.read_csv('https://raw.githubusercontent.com/JiadaiY/JiadaiYu/main/BA780%20GroupProject/Crime%20Incident%20Reports%20-%202022.csv')

crime2021.describe()

### 1.2 Data Cleaning 

In [None]:
crime2021.isna().sum()
crime2022.isna().sum()

In [None]:
import missingno as msno
msno.matrix(crime2021)
msno.matrix(crime2022)

<p>Steps for data cleaning:

1.   Drop the columns 'OFFENSE_CODE_GROUP' and 'UCR_PART', since they are entirely null.
2.   Rename the 'OFFENSE_CODE' column to 'CRIME_TYPE'.
3. Add a new column, 'DISTRICT_NAME', that maps each district code to its district name (https://bpdnews.com/districts).
4. Drop all rows with nulls in 'DISTRICT', 'DISTRICT_NAME', or 'STREET', which removes about 2.5% of the whole dataset.
5. Drop rows where 'Lat', 'Long', or 'Location' has the value 0, which removes about 3.0% of the whole dataset.
</p>
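The share of rows lost at each cleaning step can be checked with a small helper. This is a minimal sketch on a made-up toy frame; the helper name `report_row_loss` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def report_row_loss(total_rows: int, before: pd.DataFrame, after: pd.DataFrame, step: str) -> float:
    """Print and return the rows removed by one step, as a percentage of the whole dataset."""
    dropped = len(before) - len(after)
    lost_pct = dropped / total_rows * 100
    print(f"{step}: dropped {dropped} rows ({lost_pct:.1f}% of the dataset)")
    return lost_pct

# Toy frame standing in for the raw data (values are invented for illustration)
toy = pd.DataFrame({
    "DISTRICT": ["A1", None, "B2", "C11"],
    "STREET": ["MAIN ST", "ELM ST", None, "OAK ST"],
    "Lat": [42.35, 42.30, 42.33, 0.0],
    "Long": [-71.06, -71.10, -71.08, 0.0],
})
total = len(toy)

# Step: drop rows with missing district or street
no_nulls = toy.dropna(subset=["DISTRICT", "STREET"])
loss_nulls = report_row_loss(total, toy, no_nulls, "drop nulls")

# Step: drop rows whose coordinates are the placeholder value 0
no_zero_coords = no_nulls[(no_nulls["Lat"] != 0) & (no_nulls["Long"] != 0)]
loss_coords = report_row_loss(total, no_nulls, no_zero_coords, "drop zero coordinates")
```

Calling the same helper on `crime2021` and `crime2021_tidy` after each step would show how close the actual losses are to the percentages quoted in the list.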

In [None]:
crime2021 = pd.DataFrame(crime2021)
crime2021_tidy = crime2021.copy()
crime2021_tidy.drop(['OFFENSE_CODE_GROUP', 'UCR_PART'], axis=1, inplace=True)
crime2021_tidy.rename(columns = {'OFFENSE_CODE': 'CRIME_TYPE'}, inplace=True)

DISTRICT_NAME = {'A1':'Downtown_&_Charlestown', 'A15':'Downtown_&_Charlestown', 'A7':'East_Boston', 'B2':'Roxbury', 'B3':'Mattapan', 'C6':'South_Boston', 'C11':'Dorchester', 'D4':'South_End', 'D14':'Brighton', 'E5':'West_Roxbury', 'E13':'Jamaica_Plain', 'E18':'Hyde_Park'}
crime2021_tidy['DISTRICT_NAME'] = crime2021_tidy['DISTRICT'].map(DISTRICT_NAME)
crime2021_tidy.dropna(subset=['DISTRICT', 'DISTRICT_NAME', 'STREET'], inplace=True)
crime2021_tidy = crime2021_tidy[(crime2021_tidy.Lat != 0.000000) & (crime2021_tidy.Long != 0.000000) & (crime2021_tidy.Location != (0, 0))]

crime2021_tidy = pd.DataFrame(crime2021_tidy, columns=['INCIDENT_NUMBER', 'CRIME_TYPE', 'OFFENSE_DESCRIPTION', 'DISTRICT', 'DISTRICT_NAME', 'REPORTING_AREA', 'SHOOTING', 'OCCURRED_ON_DATE', 'YEAR', 'MONTH', 'DAY_OF_WEEK', 'HOUR', 'STREET', 'Lat', 'Long', 'Location'])
crime2021_tidy.head()

In [None]:
crime2022 = pd.DataFrame(crime2022)
crime2022_tidy = crime2022.copy()
crime2022_tidy.drop(['OFFENSE_CODE_GROUP', 'UCR_PART'], axis=1, inplace=True)
crime2022_tidy.rename(columns = {'OFFENSE_CODE': 'CRIME_TYPE'}, inplace=True)

DISTRICT_NAME = {'A1':'Downtown_&_Charlestown', 'A15':'Downtown_&_Charlestown', 'A7':'East_Boston', 'B2':'Roxbury', 'B3':'Mattapan', 'C6':'South_Boston', 'C11':'Dorchester', 'D4':'South_End', 'D14':'Brighton', 'E5':'West_Roxbury', 'E13':'Jamaica_Plain', 'E18':'Hyde_Park'}
crime2022_tidy['DISTRICT_NAME'] = crime2022_tidy['DISTRICT'].map(DISTRICT_NAME)
crime2022_tidy.dropna(subset=['DISTRICT', 'DISTRICT_NAME'], inplace=True)
crime2022_tidy = crime2022_tidy[(crime2022_tidy.Lat != 0.000000) & (crime2022_tidy.Long != 0.000000) & (crime2022_tidy.Location != (0, 0))]

crime2022_tidy = pd.DataFrame(crime2022_tidy, columns=['INCIDENT_NUMBER', 'CRIME_TYPE', 'OFFENSE_DESCRIPTION', 'DISTRICT', 'DISTRICT_NAME', 'REPORTING_AREA', 'SHOOTING', 'OCCURRED_ON_DATE', 'YEAR', 'MONTH', 'DAY_OF_WEEK', 'HOUR', 'STREET', 'Lat', 'Long', 'Location'])
crime2022_tidy.head()

In [None]:
msno.matrix(crime2021_tidy)
msno.matrix(crime2022_tidy)

1. `INCIDENT_NUMBER`: Identifier (ID) of each incident
1. `CRIME_TYPE`: Numeric code for the type of crime
1. `OFFENSE_DESCRIPTION`: Description of the offense in words
1. `DISTRICT`: District code where the incident occurred
1. `DISTRICT_NAME`: District name where the incident occurred
1. `REPORTING_AREA`: Reporting area of the incident
1. `SHOOTING`: '1' if the incident involved a shooting, '0' otherwise
1. `OCCURRED_ON_DATE`: Date and time the incident occurred
1. `YEAR`: Year the incident occurred
1. `MONTH`: Month the incident occurred
1. `DAY_OF_WEEK`: Day of the week the incident occurred
1. `HOUR`: Hour of the day the incident occurred
1. `STREET`: Street name where the incident occurred
1. `Lat`: Latitude of the incident
1. `Long`: Longitude of the incident
1. `Location`: Location (coordinates) of the incident

In [None]:
crime_total_2021 = crime2021_tidy['INCIDENT_NUMBER'].unique().size
print('This dataset includes ',crime_total_2021,' incidents in 2021.')
crime_total_2022 = crime2022_tidy['INCIDENT_NUMBER'].unique().size
print('This dataset includes ',crime_total_2022,' incidents in 2022, up to ',crime2022_tidy['OCCURRED_ON_DATE'].max())

## 2. Descriptive Explorations  
### 2.1 The Most Frequent Type of Incidents  
In this part, we analyze which types of incidents occur most often; the types that involve the most shootings are covered in section 2.6.  
First, let's take a closer look at the two variables that contain information about incident types: `CRIME_TYPE` and `OFFENSE_DESCRIPTION`.

In [None]:
print('There are ', crime2021_tidy['CRIME_TYPE'].unique().size,' unique types of crime,')
print('and ',crime2021_tidy['OFFENSE_DESCRIPTION'].unique().size,' unique detailed descriptions of crime.')

There is a one-to-one relationship between the crime type code and the offense description, so the two variables are aliases of each other. In the following analysis, we use only `OFFENSE_DESCRIPTION`, for conciseness and readability.
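Matching unique counts alone do not prove a one-to-one mapping, so it may be worth checking the pairs directly. The sketch below uses a hypothetical helper, `is_one_to_one`, and made-up codes on a toy frame; it is an illustration under those assumptions rather than the notebook's own check.

```python
import pandas as pd

def is_one_to_one(df: pd.DataFrame, left: str, right: str) -> bool:
    """True if each `left` value pairs with exactly one `right` value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Toy data with invented codes (not the real offense codes)
toy_ok = pd.DataFrame({
    "CRIME_TYPE": [101, 101, 102, 103],
    "OFFENSE_DESCRIPTION": ["TYPE_A", "TYPE_A", "TYPE_B", "TYPE_C"],
})
# Same data, but code 101 now carries two different descriptions
toy_broken = pd.DataFrame({
    "CRIME_TYPE": [101, 101, 102, 103],
    "OFFENSE_DESCRIPTION": ["TYPE_A", "TYPE_A2", "TYPE_B", "TYPE_C"],
})

ok_result = is_one_to_one(toy_ok, "CRIME_TYPE", "OFFENSE_DESCRIPTION")
broken_result = is_one_to_one(toy_broken, "CRIME_TYPE", "OFFENSE_DESCRIPTION")
print(ok_result, broken_result)
```

The same call on `crime2021_tidy` with the columns `CRIME_TYPE` and `OFFENSE_DESCRIPTION` would confirm or refute the relationship for the real data.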

In [None]:
crime_by_type_2021 = crime2021_tidy.groupby('OFFENSE_DESCRIPTION').count().reset_index()[['OFFENSE_DESCRIPTION','INCIDENT_NUMBER']]
crime_by_type_2021['RATE'] = crime_by_type_2021['INCIDENT_NUMBER']/crime_total_2021
top_10_types_2021 = crime_by_type_2021.sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:10]['OFFENSE_DESCRIPTION'].values

crime_by_type_2022 = crime2022_tidy.groupby('OFFENSE_DESCRIPTION').count().reset_index()[['OFFENSE_DESCRIPTION','INCIDENT_NUMBER']]
crime_by_type_2022['RATE'] = crime_by_type_2022['INCIDENT_NUMBER']/crime_total_2022
top_10_types_2022 = crime_by_type_2022.sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:10]['OFFENSE_DESCRIPTION'].values

top_10_types_shared = np.union1d(top_10_types_2021,top_10_types_2022)

crime_by_type = crime_by_type_2021.merge(crime_by_type_2022,on='OFFENSE_DESCRIPTION',suffixes=['_2021','_2022'])
crime_by_type.sort_values(by=['INCIDENT_NUMBER_2021'], ascending=False).head(len(top_10_types_shared))

In [None]:
# Keep the Axes returned by pandas so the labels are applied to this chart
ax = crime_by_type[crime_by_type['OFFENSE_DESCRIPTION'].isin(top_10_types_shared)].plot.bar(
      x='OFFENSE_DESCRIPTION',y=['INCIDENT_NUMBER_2021','INCIDENT_NUMBER_2022'],color=["#1f77b4","#ff0000"])
ax.set(xlabel='Offense Description',ylabel='Incidents Count',title='Frequency of Different Types of Incidents')

The ten most frequent types of crime in 2021 and 2022 are almost the same, differing in only one type, and the patterns are quite similar. Notably, because our 2022 data only run through November, most incident types show a higher count in 2021.
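One way to make the two years comparable despite the partial 2022 data is to restrict both to the months present in each. The following is a minimal sketch on toy data; the function name `counts_for_shared_months` and the output column names are assumptions made for illustration.

```python
import pandas as pd

def counts_for_shared_months(df_a: pd.DataFrame, df_b: pd.DataFrame,
                             key: str = "OFFENSE_DESCRIPTION") -> pd.DataFrame:
    """Count incidents per `key`, using only months that appear in both frames."""
    shared = set(df_a["MONTH"]) & set(df_b["MONTH"])
    a = df_a[df_a["MONTH"].isin(shared)][key].value_counts().rename("COUNT_A")
    b = df_b[df_b["MONTH"].isin(shared)][key].value_counts().rename("COUNT_B")
    # Types missing from one year get a count of 0
    return pd.concat([a, b], axis=1).fillna(0).astype(int)

# Toy frames: the first year covers months 1 and 12, the second only month 1
year_a = pd.DataFrame({
    "OFFENSE_DESCRIPTION": ["SICK ASSIST", "SICK ASSIST", "VANDALISM"],
    "MONTH": [1, 12, 1],
})
year_b = pd.DataFrame({
    "OFFENSE_DESCRIPTION": ["SICK ASSIST", "SICK ASSIST"],
    "MONTH": [1, 1],
})

comparison = counts_for_shared_months(year_a, year_b)
print(comparison)
```

Here the month-12 record of the first year is ignored, so the counts compare like with like.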

### 2.2 District Differences of Incidents 
Rank the number of crimes by district

In [None]:
district_code = crime2021_tidy[["DISTRICT","DISTRICT_NAME"]].value_counts().reset_index().sort_values('DISTRICT')

The data cover 12 police district codes (`DISTRICT`), each mapped to a `DISTRICT_NAME`. Codes A1 and A15 share the name Downtown_&_Charlestown, so there are 11 distinct district names.

In [None]:
crime_by_district_2021 = crime2021_tidy.groupby('DISTRICT_NAME').count()
crime_by_district_2021 = crime_by_district_2021[['INCIDENT_NUMBER']]
crime_by_district_2021 = crime_by_district_2021.rename(columns={"INCIDENT_NUMBER": "INCIDENT_NUMBER_2021"})

crime_by_district_2022 = crime2022_tidy.groupby('DISTRICT_NAME').count()
crime_by_district_2022 = crime_by_district_2022[['INCIDENT_NUMBER']]
crime_by_district_2022 = crime_by_district_2022.rename(columns={"INCIDENT_NUMBER": "INCIDENT_NUMBER_2022"})

crime_by_district_2021_2022 = pd.concat([crime_by_district_2021,crime_by_district_2022], axis=1, join="inner")
crime_by_district_2021_2022 = crime_by_district_2021_2022.sort_values(by='INCIDENT_NUMBER_2021', ascending=False).head(10)
crime_by_district_2021_2022.sort_values(by="INCIDENT_NUMBER_2021", ascending=False)

Roxbury, Downtown, and South End are the three districts with the most reported incidents.

In [None]:
# Use a list (not a set) for y so the column order matches the color order
ax = crime_by_district_2021_2022.plot.bar(y=["INCIDENT_NUMBER_2021","INCIDENT_NUMBER_2022"],color=["#1f77b4","#ff0000"])
ax.set(xlabel='District Name',ylabel='Incident Number', title='Top Ten Districts with the Most Crime Events')

In [None]:
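import urllib.request
from PIL import Image
# Recent Matplotlib releases no longer accept a URL in imread, so open the URL and read it with Pillow
map_url = 'https://github.com/JiadaiY/JiadaiYu/blob/main/BA780%20GroupProject/Boston%20map.png?raw=true'
map_img = np.array(Image.open(urllib.request.urlopen(map_url)))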

fig = plt.figure()
fig.set_size_inches(4.97,5.93)
g = sns.scatterplot(data=crime2021_tidy,x='Long',y='Lat',hue='DISTRICT_NAME',alpha=0.02,palette="blend:#1f77b4,#ff0000")
g.set_title('Geographic Distribution of Incidents')
g.set(xlabel="Longitude",ylabel="Latitude",xlim=(-71.20,-70.975),ylim=(42.21,42.41))
g.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
g.imshow(map_img,
         aspect = g.get_aspect(),
         extent = g.get_xlim() + g.get_ylim(),
          zorder = 0)
plt.show()

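The scatter plot overlays each 2021 incident on a map of Boston, colored by district name, so denser clusters of points indicate areas with more reported incidents.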

### 2.3 Month with the Highest Crime Rate
Rank the number of crimes by month and select the top three months with the highest number of incidents.

In [None]:
crime_by_month = crime2021_tidy.groupby('MONTH').count().reset_index()[['MONTH','INCIDENT_NUMBER']]
crime_by_month.sort_values(by="INCIDENT_NUMBER", ascending=False).head(3)

List the ten incident types with the highest number of reports in these top three months.

In [None]:
# Count incidents per offense type in each of the top three months (August, September, October)
month10 = crime2021_tidy[crime2021_tidy.MONTH==10]
month10 = month10.groupby('OFFENSE_DESCRIPTION').count()[['INCIDENT_NUMBER']]
month10 = month10.rename(columns={"INCIDENT_NUMBER": "Month_10_Count"})

month8 = crime2021_tidy[crime2021_tidy.MONTH==8]
month8 = month8.groupby('OFFENSE_DESCRIPTION').count()[['INCIDENT_NUMBER']]
month8 = month8.rename(columns={"INCIDENT_NUMBER": "Month_8_Count"})

month9 = crime2021_tidy[crime2021_tidy.MONTH==9]
month9 = month9.groupby('OFFENSE_DESCRIPTION').count()[['INCIDENT_NUMBER']]
month9 = month9.rename(columns={"INCIDENT_NUMBER": "Month_9_Count"})

# Keep only offense types present in all three months, then take the top ten by October count
result = pd.concat([month8, month9, month10], axis=1, join="inner")
result = result.sort_values(by='Month_10_Count', ascending=False).head(10)
# Use a list (not a set) for y so the column order matches the color order
ax = result.plot.bar(y=["Month_8_Count","Month_9_Count","Month_10_Count"],color=["#1f77b4","#a1a1a1","#ff0000"])
ax.set(xlabel='Top Ten Incidents',ylabel='Number of Reports', title='Top Three Months with the Most Incidents in 2021')

Find the most frequently reported incidents across all districts in October.

In [None]:
top_3_crimes_lst = crime_by_type_2021.sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:3]['OFFENSE_DESCRIPTION'].values

crime_by_type_month = crime2021_tidy.groupby(['OFFENSE_DESCRIPTION','MONTH','DISTRICT']).agg({'INCIDENT_NUMBER':'count'}).reset_index()
crime_by_top_type_month = crime_by_type_month[crime_by_type_month['OFFENSE_DESCRIPTION'].isin(top_3_crimes_lst)]
crime_by_top_type_month.sort_values(by="INCIDENT_NUMBER", ascending=False).head()

In [None]:
# Build the mask from the same frame that is being filtered
crime_by_top_type_month10 = crime_by_top_type_month[crime_by_top_type_month['MONTH']==10]

days_sumy = sns.catplot(x='DISTRICT', y='INCIDENT_NUMBER', data=crime_by_top_type_month10, kind='bar', hue='OFFENSE_DESCRIPTION',palette=["#1f77b4","#a1a1a1","#ff0000"])
plt.title('OCT.Number of Reported Cases by District')
plt.ylabel('INCIDENT')
plt.show()

Phase Summary: In 2021, the months with the most incidents were October, August, and September. The most reported incident types were investigate person, sick assist, and property damage. In October, investigate person reports were concentrated in B2, B3, C11, and D4; sick assist reports in A1, B2, C11, and D4; and property damage reports in B2 and C11. Overall, B2, B3, C11, and D4 had the most reported cases in October.

In [None]:
# repeat the analysis in 2022
crime_by_month = crime2022_tidy.groupby('MONTH').count().reset_index()[['MONTH','INCIDENT_NUMBER']]
crime_by_month.sort_values(by="INCIDENT_NUMBER", ascending=False).head(3)

In [None]:
top_3_crimes_lst = crime_by_type_2022.sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:3]['OFFENSE_DESCRIPTION'].values

crime_by_type_month = crime2022_tidy.groupby(['OFFENSE_DESCRIPTION','MONTH','DISTRICT']).agg({'INCIDENT_NUMBER':'count'}).reset_index()
crime_by_top_type_month = crime_by_type_month[crime_by_type_month['OFFENSE_DESCRIPTION'].isin(top_3_crimes_lst)]
crime_by_top_type_month.sort_values(by="INCIDENT_NUMBER", ascending=False).head()

In [None]:
# Build the mask from the same frame that is being filtered
crime_by_top_type_month7 = crime_by_top_type_month[crime_by_top_type_month['MONTH']==7]

days_sumy = sns.catplot(x='DISTRICT', y='INCIDENT_NUMBER', data=crime_by_top_type_month7, kind='bar', hue='OFFENSE_DESCRIPTION', palette=["#1f77b4","#a1a1a1","#ff0000"])
plt.title('JUL.Number of Reported Cases by District')
plt.ylabel('INCIDENT')
plt.show()

Comparing the types and numbers of incidents reported in 2021 and 2022 shows that most cases occurred from July to October of each year, and that the main types of reported cases were investigate person, sick assist, and property damage. Districts B2, B3, D4, and C11 are the high-frequency areas for incident reporting.

### 2.4 Day Crime is Reported the Most

In [None]:
crime2021_tidy["DAY_OF_WEEK"].unique()

On which day is crime most and least reported? Crime appears to be reported most on Fridays and least on Sundays.

In [None]:
crime_by_days_2021 = crime2021_tidy["DAY_OF_WEEK"].value_counts()
crime_by_days_2021

Surprisingly, the number of reports is fairly similar across most days. For example, Wednesday, Monday, Thursday, Saturday, and Tuesday vary by only about 300 reports over the whole year.  
In this analysis, we focused on the three most common crimes and looked at which days they were reported most. From 2.1, we found that the top 3 frequent types of incidents are: `INVESTIGATE PERSON`, `SICK ASSIST`, and `M/V - LEAVING SCENE - PROPERTY DAMAGE`. So let's break them down by day of the week to see on which days each type is most common.

In [None]:
top_3_crimes_lst = crime_by_type_2021.sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:3]['OFFENSE_DESCRIPTION'].values
top_3_crimes=crime2021_tidy[crime2021_tidy.OFFENSE_DESCRIPTION.isin(top_3_crimes_lst)]
top_3_crime_by_days=top_3_crimes.groupby(['OFFENSE_DESCRIPTION',"DAY_OF_WEEK"]).agg({'INCIDENT_NUMBER':'count'}).reset_index()
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
days_sumy = sns.catplot(x="DAY_OF_WEEK", y='INCIDENT_NUMBER', data=top_3_crime_by_days, kind='bar', order=day_order, 
                        col='OFFENSE_DESCRIPTION', col_order=top_3_crimes_lst, palette="blend:#1f77b4,#ff0000")
plt.show()

The graph shows that `INVESTIGATE PERSON` is most common on Wednesday, `SICK ASSIST` is most common on Thursday, and `M/V - LEAVING SCENE - PROPERTY DAMAGE` is most common on Friday.

In [None]:
crime_by_days_2022 = crime2022_tidy["DAY_OF_WEEK"].value_counts()
crime_by_days_2022

Friday was the most common day for crime reports in 2022, the same as in 2021. However, the second most common day in 2022 was Monday, while in 2021 it was Wednesday. Saturday, Tuesday, and Sunday were the three least common days in both years.

In [None]:
top_3_crimes_lst = crime_by_type_2022.sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:3]['OFFENSE_DESCRIPTION'].values
top_3_crimes=crime2022_tidy[crime2022_tidy.OFFENSE_DESCRIPTION.isin(top_3_crimes_lst)]
top_3_crime_by_days=top_3_crimes.groupby(['OFFENSE_DESCRIPTION',"DAY_OF_WEEK"]).agg({'INCIDENT_NUMBER':'count'}).reset_index()

days_sumy = sns.catplot(x="DAY_OF_WEEK", y='INCIDENT_NUMBER', data=top_3_crime_by_days, kind='bar', order=day_order, 
                        col='OFFENSE_DESCRIPTION', col_order=top_3_crimes_lst, palette="blend:#1f77b4,#ff0000")
plt.show()

In 2022, `INVESTIGATE PERSON` was most common on Friday, while in 2021 it was most common on Wednesday. `SICK ASSIST` was most common on Thursday in 2021, but on Friday in 2022. `M/V - LEAVING SCENE - PROPERTY DAMAGE` was most common on Friday in both 2021 and 2022.

### 2.5 Time Crime is Reported the Most

In [None]:
hour_day = pd.pivot_table(data=crime2021_tidy, index = "DAY_OF_WEEK",
                              columns = "HOUR", values = "INCIDENT_NUMBER", aggfunc = 'count')
hour_day.index = pd.CategoricalIndex(hour_day.index,categories=day_order)
hour_day.sort_index(level=0, inplace=True)
g = sns.heatmap(hour_day,cmap = 'Reds')
g.set_title('Time of Incident Reports (2021)')
g.set(xlabel='Hour',ylabel='Day of week')
plt.show()

This heatmap shows the hours at which crime was most commonly reported. Reports appear to peak at hour 17 (5 PM). 5 PM is also typically the busiest time on the roads, which may be one reason for the peak. Crime is least reported between 3 A.M. and 6 A.M.
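Reading the peak off a heatmap by eye can be backed up with a direct count. The sketch below finds the busiest and quietest hours, and the busiest day-and-hour cell, on a made-up toy frame; the variable names and values are assumptions for illustration.

```python
import pandas as pd

# Toy incidents (invented): three at 17:00 on Friday, one at 04:00, two at 09:00
toy = pd.DataFrame({
    "DAY_OF_WEEK": ["Friday", "Friday", "Friday", "Sunday", "Tuesday", "Tuesday"],
    "HOUR": [17, 17, 17, 4, 9, 9],
})

# Incidents per hour, across all days
by_hour = toy.groupby("HOUR").size()
peak_hour = by_hour.idxmax()
quiet_hour = by_hour.idxmin()

# Incidents per (day, hour) cell, matching the cells of the heatmap
by_cell = toy.groupby(["DAY_OF_WEEK", "HOUR"]).size()
peak_day, peak_hour_of_day = by_cell.idxmax()

print(peak_hour, quiet_hour, peak_day, peak_hour_of_day)
```

Replacing `toy` with `crime2021_tidy` would give the corresponding hours for the real data.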

In [None]:
hour_day = pd.pivot_table(data=crime2022_tidy, index = "DAY_OF_WEEK",
                              columns = "HOUR", values = "INCIDENT_NUMBER", aggfunc = 'count')
hour_day.index = pd.CategoricalIndex(hour_day.index,categories=day_order)
hour_day.sort_index(level=0, inplace=True)
g = sns.heatmap(hour_day,cmap = 'Reds')
g.set_title('Time of Incident Reports (2022)')
g.set(xlabel='Hour',ylabel='Day of week')
plt.show()

### 2.6 Most Dangerous Incidents: Shooting

In [None]:
shooting_crime_2021 = crime2021_tidy[crime2021_tidy['SHOOTING']==1]
shooting_total_2021 = shooting_crime_2021['INCIDENT_NUMBER'].unique().size
print('In 2021, there are ',shooting_total_2021,' incidents involving shootings.')
print('The overall probability of shooting is', round(shooting_total_2021/crime_total_2021*100,2), '%.')

shooting_type_2021 = shooting_crime_2021.groupby('OFFENSE_DESCRIPTION').count().reset_index()[['OFFENSE_DESCRIPTION','INCIDENT_NUMBER']]

In [None]:
crime_type_shooting_2021 = crime_by_type_2021.merge(shooting_type_2021,on='OFFENSE_DESCRIPTION',suffixes=('_TYPE','_SHOOT'),how='outer')
crime_type_shooting_2021['SHOOTING_PROBABILITY(%)'] = crime_type_shooting_2021['INCIDENT_NUMBER_SHOOT']/crime_type_shooting_2021['INCIDENT_NUMBER_TYPE']*100
# create filters of most frequent and dangerous incidents
top_10_types_2021 = crime_by_type_2021[['OFFENSE_DESCRIPTION','INCIDENT_NUMBER','RATE']].sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:10]['OFFENSE_DESCRIPTION'].values
top_10_shooting_types_2021 = shooting_type_2021[['OFFENSE_DESCRIPTION','INCIDENT_NUMBER']].sort_values(by=['INCIDENT_NUMBER'], ascending=False)[:10]['OFFENSE_DESCRIPTION'].values

top_crime_type_shooting_2021 = crime_type_shooting_2021[crime_type_shooting_2021['OFFENSE_DESCRIPTION'].isin(np.concatenate((top_10_types_2021,top_10_shooting_types_2021)))]

fig_2021 = plt.figure()
ax = fig_2021.add_subplot()
ax2 = ax.twiny()

top_crime_type_shooting_2021.plot.barh(y='INCIDENT_NUMBER_TYPE',ax=ax,label='Incident Count')
top_crime_type_shooting_2021.plot.scatter(x='SHOOTING_PROBABILITY(%)',y='OFFENSE_DESCRIPTION',marker='x',color='red',ax=ax2,label='Shooting Probability',xlim=(0,75))

ax.set_xlabel('Incident Count')
ax2.set_xlabel('Shooting Probability(%)')
ax2.set_title('Frequency and Shooting Probability of Different Types of Incidents in 2021')

The graph shows that shootings, an extreme public safety hazard, did not occur with high probability. Most types of incidents did not report any shooting, and where shootings did occur, there were usually only one or two cases. For example, `SICK ASSIST`, the second most frequent type, occurred 4969 times with no shooting reported.  
However, some types of incidents have a disproportionately high probability of shooting. `MURDER, NON-NEGLIGIENT MANSLAUGHTER` occurred only 29 times, yet 20 of those incidents involved shootings, a shooting probability of around 70%. `BALLISTICS EVIDENCE/FOUND` has a shooting probability of around 60%.
We can conclude that these types of incidents are extremely dangerous, and if you encounter one in progress, get away as soon as possible!
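The shooting probability per incident type is simply shootings divided by incidents. A minimal sketch on toy data follows, assuming `SHOOTING` is stored as numeric 0/1; the type names `TYPE_A` and `TYPE_B` are invented, and the last lines only redo the arithmetic for the 20-out-of-29 figure quoted above.

```python
import pandas as pd

# Toy data (invented): TYPE_A has 1 shooting in 4 incidents, TYPE_B has 2 in 2
toy = pd.DataFrame({
    "OFFENSE_DESCRIPTION": ["TYPE_A"] * 4 + ["TYPE_B"] * 2,
    "SHOOTING": [1, 0, 0, 0, 1, 1],
})

# Count incidents and shootings per type, then take the ratio as a percentage
summary = toy.groupby("OFFENSE_DESCRIPTION")["SHOOTING"].agg(incidents="count", shootings="sum")
summary["shooting_pct"] = summary["shootings"] / summary["incidents"] * 100
print(summary)

# Arithmetic for the figures quoted in the text: 20 shootings out of 29 incidents
murder_pct = 20 / 29 * 100
print(round(murder_pct, 1))
```

Types with very few incidents can show extreme percentages, so it helps to read the percentage alongside the incident count.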

In [None]:
district_order = ['A1', 'A15', 'A7', 'B2', 'B3', 'C6', 'C11', 'D4', 'D14', 'E5', 'E13', 'E18']
# The mean of SHOOTING per district is the share of incidents that involved a shooting
shooting_sumy_2021 = sns.catplot(x='DISTRICT', y='SHOOTING', data=crime2021_tidy, kind='point', order=district_order, col='DAY_OF_WEEK', col_order=day_order, ci=None)
shooting_sumy_2021.set(ylim=(0, 0.05))  # apply the same y-range to every panel
shooting_sumy_2022 = sns.catplot(x='DISTRICT', y='SHOOTING', data=crime2022_tidy, kind='point', order=district_order, col='DAY_OF_WEEK', col_order=day_order, ci=None)
shooting_sumy_2022.set(ylim=(0, 0.05))
plt.show()

In [None]:
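import urllib.request
from PIL import Image
# Recent Matplotlib releases no longer accept a URL in imread, so open the URL and read it with Pillow
map_url = 'https://github.com/JiadaiY/JiadaiYu/blob/main/BA780%20GroupProject/Boston%20map.png?raw=true'
map_img = np.array(Image.open(urllib.request.urlopen(map_url)))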

fig = plt.figure()
fig.set_size_inches(4.97,5.93)
g = sns.scatterplot(data=crime2021_tidy[crime2021_tidy['SHOOTING']==1],x='Long',y='Lat',hue='DISTRICT_NAME',alpha=0.5,palette="blend:#1f77b4,#ff0000")
g.set_title('Geographic Distribution of Shootings')
g.set(xlabel="Longitude",ylabel="Latitude",xlim=(-71.20,-70.975),ylim=(42.21,42.41))
g.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
g.imshow(map_img,
         aspect = g.get_aspect(),
         extent = g.get_xlim() + g.get_ylim(),
          zorder = 0)
plt.show()

In [None]:
hour_day_shooting = pd.pivot_table(data=crime2021_tidy[crime2021_tidy['SHOOTING']==1], index = "DAY_OF_WEEK",
                              columns = "HOUR", values = "INCIDENT_NUMBER", aggfunc = 'count')
hour_day_shooting.index = pd.CategoricalIndex(hour_day_shooting.index,categories=day_order)
hour_day_shooting.sort_index(level=0, inplace=True)
g = sns.heatmap(hour_day_shooting,cmap = 'Reds')
g.set_title('Time of Shooting Reports')
g.set(xlabel='Hour',ylabel='Day of week')
plt.show()

This heatmap shows when shootings are most commonly reported. Evenings are the most common time, and Saturday night and early Sunday morning (around 2 A.M.) in particular. The heatmap also shows that there are rarely any shootings during the daytime (6 A.M. - 12 P.M.).

## 3. Conclusion  

With our analysis, we can see that certain areas of Boston have high levels of crime. We also narrowed the analysis to specific variables, such as the type of crime, district, month, day, and time of day, to get a better understanding of crime in Boston. With this, we can better predict possible incidents and suggest appropriate reinforcements to make the city of Boston safer.