# Final Tutorial
### Niko Zhang and Sophie Tsai

## Introduction
### Note: this is generated by ChatGPT and is not final
Crime is a pervasive problem that affects communities throughout the United States. As data scientists, we have the opportunity to contribute to the fight against crime by analyzing data on crime patterns and trends. In this exploratory data analysis, we will focus on crime in the United States at the state and city levels, using data from the FBI Uniform Crime Reporting (UCR) program.

The UCR program is a national initiative that collects and disseminates data on crime across the United States. Law enforcement agencies across the country submit data on a range of crimes, including murder, rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. This data is used to inform policy decisions at the local, state, and national levels.

Our goal in this exploratory data analysis is to identify any significant trends or patterns in crime rates across the United States, as well as any differences in crime rates between states and cities. We will examine both the overall crime rate and rates for specific types of crimes to identify areas of concern and inform policy decisions aimed at reducing crime and improving public safety.

By analyzing crime data, we hope to provide useful information to communities and law enforcement agencies and to contribute to the ongoing effort to reduce crime in the United States.

## Imports and configurations

In [17]:
# Imports for reading in data
import pandas as pd

# Set max rows displayed in DataFrame
pd.set_option('display.max_rows', 100)

## Read in and clean NIBRS crime data for 2021

In [51]:
# Read in excel file for persons offenses
persons_offenses_2021 = pd.read_excel('crimes_by_state/2021/Persons_Offenses.xlsx')
# Read in excel file for property offenses
property_offenses_2021 = pd.read_excel('crimes_by_state/2021/Property_Offenses.xlsx')
# Read in excel file for society offenses
society_offenses_2021 = pd.read_excel('crimes_by_state/2021/Society_Offenses.xlsx')

# clean data for persons offenses
# set column names
persons_offenses_2021.columns = ['state','num_agencies','covered_pop','total_offenses','assault','homicide','trafficking','kidnapping','sex_offense']
# remove unnecessary rows
persons_offenses_2021 = persons_offenses_2021.iloc[5:-1, :]
persons_offenses_2021.head(5)

Unnamed: 0,state,num_agencies,covered_pop,total_offenses,assault,homicide,trafficking,kidnapping,sex_offense
5,Total,11794,215058917,2939412,2706772,16537,2141,36919,177043
6,Alabama,356,3734077,70855,68366,381,18,388,1702
7,Alaska,30,402557,6858,5945,31,2,29,851
8,Arizona,79,3949562,45372,40946,235,7,706,3478
9,Arkansas,285,2916168,66379,62317,339,6,658,3059


In [48]:
# clean data for property offenses
# set column names
property_offenses_2021.columns = ['state','num_agencies','covered_pop','total_offenses','arson','bribe','burglary','counterfeiting','vandalism','embezzlement','extortion','fraud','larceny','motor_theft','robbery','stolen_property']
# remove unnecessary rows
property_offenses_2021 = property_offenses_2021.iloc[5:-1, :]
property_offenses_2021.tail(5)

Unnamed: 0,state,num_agencies,covered_pop,total_offenses,arson,bribe,burglary,counterfeiting,vandalism,embezzlement,extortion,fraud,larceny,motor_theft,robbery,stolen_property
51,Vermont,88,645570,13817,43,0,1129,179,3249,59,17,1455,7006,450,65,165
52,Virginia,411,8640726,221013,647,30,10464,3489,47560,1569,814,36400,104537,11248,2929,1326
53,Washington,248,7700987,357551,1587,11,39475,3560,77062,140,718,24008,164309,35326,5680,5675
54,West Virginia,247,1575083,30705,237,4,3344,663,5807,101,13,2615,15198,1797,198,728
55,Wisconsin,323,5423821,131390,474,0,9298,2005,25830,540,275,15565,57234,15950,2648,1571
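
Because the number of reporting agencies and the covered population differ from state to state, raw offense counts are hard to compare directly. One possible way to put states on a common scale is to divide by `covered_pop`, sketched below on three rows copied from the persons-offenses output above. The helper name `add_rate_per_100k` and the derived column name are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd

# Three rows copied from the persons-offenses output shown earlier.
sample = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'covered_pop': [3734077, 402557, 3949562],
    'total_offenses': [70855, 6858, 45372],
})

def add_rate_per_100k(df, count_col='total_offenses', pop_col='covered_pop'):
    """Return a copy of df with a hypothetical '<count_col>_per_100k' column."""
    out = df.copy()
    # Columns read from Excel are often object dtype, so coerce to numbers first.
    counts = pd.to_numeric(out[count_col], errors='coerce')
    pop = pd.to_numeric(out[pop_col], errors='coerce')
    out[f'{count_col}_per_100k'] = counts / pop * 100_000
    return out

rates = add_rate_per_100k(sample)
print(rates[['state', 'total_offenses_per_100k']].round(1))
```

The same helper could be pointed at any single offense column, for example `add_rate_per_100k(df, count_col='homicide')`.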


## Read in UCR crime data (2000-2021) as DataFrames

In [3]:
# Read in the excel files as DataFrames
offenses2021df = pd.read_excel('offenses_by_city/2021offenses_by_state_and_city.xlsx')
offenses2020df = pd.read_excel('offenses_by_city/2020offenses_by_state_and_city.xlsx')
offenses2019df = pd.read_excel('offenses_by_city/2019offenses_by_state_and_city.xls')
offenses2018df = pd.read_excel('offenses_by_city/2018offenses_by_state_and_city.xls')
offenses2017df = pd.read_excel('offenses_by_city/2017offenses_by_state_and_city.xls')
offenses2016df = pd.read_excel('offenses_by_city/2016offenses_by_state_and_city.xls')
offenses2015df = pd.read_excel('offenses_by_city/2015offenses_by_state_and_city.xls')
offenses2014df = pd.read_excel('offenses_by_city/2014offenses_by_state_and_city.xls')
offenses2013df = pd.read_excel('offenses_by_city/2013offenses_by_state_and_city.xls')
offenses2012df = pd.read_excel('offenses_by_city/2012offenses_by_state_and_city.xls')
offenses2011df = pd.read_excel('offenses_by_city/2011offenses_by_state_and_city.xls')
offenses2010df = pd.read_excel('offenses_by_city/2010offenses_by_state_and_city.xls')
offenses2009df = pd.read_excel('offenses_by_city/2009offenses_by_state_and_city.xls')
offenses2008df = pd.read_excel('offenses_by_city/2008offenses_by_state_and_city.xls')
offenses2007df = pd.read_excel('offenses_by_city/2007offenses_by_state_and_city.xls')
offenses2006df = pd.read_excel('offenses_by_city/2006offenses_by_state_and_city.xls')
offenses2005df = pd.read_excel('offenses_by_city/2005offenses_by_state_and_city.xls')
offenses2004under10000df = pd.read_excel('offenses_by_city/2004offenses_by_state_and_city_pop_under_10000.xls')
offenses2004ge10000df = pd.read_excel('offenses_by_city/2004offenses_by_state_and_city_pop_ge_10000.xls')
offenses2003under10000df = pd.read_excel('offenses_by_city/2003offenses_by_state_and_city_pop_under_10000.xls')
offenses2003ge10000df = pd.read_excel('offenses_by_city/2003offenses_by_state_and_city_pop_ge_10000.xls')
offenses2002under10000df = pd.read_excel('offenses_by_city/2002offenses_by_state_and_city_pop_under_10000.xls')
offenses2002ge10000df = pd.read_excel('offenses_by_city/2002offenses_by_state_and_city_pop_ge_10000.xls')
offenses2001under10000df = pd.read_excel('offenses_by_city/2001offenses_by_state_and_city_pop_under_10000.xls')
offenses2001ge10000df = pd.read_excel('offenses_by_city/2001offenses_by_state_and_city_pop_ge_10000.xls')
offenses2000ge10000df = pd.read_excel('offenses_by_city/2000offenses_by_state_and_city_pop_ge_10000.xls')
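
The 26 near-identical `read_excel` calls above follow a regular naming pattern, so they could also be generated in a loop. The sketch below is one possible arrangement, not the notebook's own approach: `offense_files` and `load_raw_offenses` are hypothetical helpers, and the year ranges for each pattern are inferred only from the file names listed above.

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path('offenses_by_city')

def offense_files(year):
    """Return the file name(s) for one year, following the naming used above."""
    if year >= 2020:
        return [f'{year}offenses_by_state_and_city.xlsx']
    if year >= 2005:
        return [f'{year}offenses_by_state_and_city.xls']
    # 2001-2004 are split by population; for 2000 only the >= 10,000 file is read.
    suffixes = ['ge_10000'] if year == 2000 else ['under_10000', 'ge_10000']
    return [f'{year}offenses_by_state_and_city_pop_{s}.xls' for s in suffixes]

def load_raw_offenses(years=range(2000, 2022)):
    """Read every file into a dict keyed by (year, file name); no cleaning is done."""
    return {
        (year, name): pd.read_excel(DATA_DIR / name)
        for year in years
        for name in offense_files(year)
    }

print(offense_files(2004))
```

With this layout, `load_raw_offenses()[(2019, '2019offenses_by_state_and_city.xls')]` would hold the same raw table as `offenses2019df`.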

## Clean 2021 crime data (by state & city)

In [4]:
# Read in 2021 crime data as DataFrame
offenses2021df = pd.read_excel('offenses_by_city/2021offenses_by_state_and_city.xlsx')

# The US state column in the Excel file has merged cells. When the file is read
# into a DataFrame, the merged cells separate and every row after the first in
# each state gets NaN. Forward-filling restores the state name on those rows.
# Note that this forward-fills every column, not only the state column.
offenses2021df = offenses2021df.ffill(axis=0)

# Remove the first 2 rows and the last row as they are not needed
offenses2021df = offenses2021df.iloc[2:-1, :]

# Make the first row which contains the names of the features as the column names
header = offenses2021df.iloc[0] # takes the first row as the header for column names
header.name = '' # removes the name of the header (not needed)
offenses2021df = offenses2021df[1:]
offenses2021df.columns = header

# Reformat column names for readability
offenses2021df.columns = offenses2021df.columns.str.lower().str.replace('\n',' ').str.replace(' ','_').str.replace('-','')

# Reset the indices
offenses2021df.reset_index(drop=True, inplace=True)

# Save the column header for the rest of the crime datasets
header = offenses2021df.columns

# Display first couple rows of DataFrame
offenses2021df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Abbeville,2539,4,1,0,0,3,53,11,37,5,0
1,ALABAMA,Alabaster,33963,25,1,4,0,20,282,13,253,16,1
2,ALABAMA,Alexander City,14066,40,0,0,7,33,283,178,87,18,1
3,ALABAMA,Altoona,913,4,0,0,0,4,7,1,6,0,0
4,ALABAMA,Andalusia,8643,44,1,6,1,36,254,45,198,11,0
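
One way to sanity-check a cleaned table is to confirm that each total column equals the sum of its component columns. The sketch below does this for the five rows shown in the output above; the helper `rows_with_inconsistent_totals` is a hypothetical name, and leaving arson out of the property-crime sum is an assumption that happens to match these five rows.

```python
import pandas as pd

# The five rows shown in the 2021 output above.
sample = pd.DataFrame({
    'city': ['Abbeville', 'Alabaster', 'Alexander City', 'Altoona', 'Andalusia'],
    'violent_crime': [4, 25, 40, 4, 44],
    'murder_and_nonnegligent_manslaughter': [1, 1, 0, 0, 1],
    'rape': [0, 4, 0, 0, 6],
    'robbery': [0, 0, 7, 0, 1],
    'aggravated_assault': [3, 20, 33, 4, 36],
    'property_crime': [53, 282, 283, 7, 254],
    'burglary': [11, 13, 178, 1, 45],
    'larceny_theft': [37, 253, 87, 6, 198],
    'motor_vehicle_theft': [5, 16, 18, 0, 11],
})

VIOLENT_PARTS = ['murder_and_nonnegligent_manslaughter', 'rape',
                 'robbery', 'aggravated_assault']
PROPERTY_PARTS = ['burglary', 'larceny_theft', 'motor_vehicle_theft']

def rows_with_inconsistent_totals(df):
    """Return the rows whose totals differ from the sum of their components."""
    violent_ok = df[VIOLENT_PARTS].sum(axis=1) == df['violent_crime']
    property_ok = df[PROPERTY_PARTS].sum(axis=1) == df['property_crime']
    return df[~(violent_ok & property_ok)]

bad = rows_with_inconsistent_totals(sample)
print(len(bad))
```

On the full table, the columns would likely need `pd.to_numeric` first, since values read from Excel alongside header rows usually arrive as object dtype.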


## Clean 2020 crime data (by state & city)

In [5]:
# Read in the excel file
offenses2020df = pd.read_excel('offenses_by_city/2020offenses_by_state_and_city.xlsx')

# Remove the rows that are not part of the data table
offenses2020df = offenses2020df.iloc[5:7694, :]

# Use previous column header
offenses2020df.columns = header

# The US state column in the Excel file has merged cells. When the file is read
# into a DataFrame, the merged cells separate and every row after the first in
# each state gets NaN. Forward-filling restores the state name on those rows.
# Note that this forward-fills every column, not only the state column.
offenses2020df = offenses2020df.ffill(axis=0)

# Reset the indices
offenses2020df.reset_index(drop=True, inplace=True)

# Correctly format the state and city names
offenses2020df['state'] = offenses2020df['state'].str.replace(r'\d+', '', regex=True) # remove all numbers from state names
offenses2020df['city'] = offenses2020df['city'].str.replace(r'\d+', '', regex=True) # remove all numbers from city names
offenses2020df['state'] = offenses2020df['state'].str.upper() # capitalize all state names

# Replace any remaining NaN values in the arson column with 0
offenses2020df['arson'] = offenses2020df['arson'].fillna(0)

# Display the first couple rows of the DataFrame
offenses2020df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Cedar Bluff,1823,4,0,0,0,4,36,7,26,3,0.0
1,ALABAMA,Centre,3547,20,0,4,0,16,124,12,97,15,0.0
2,ALABAMA,Daleville,5080,16,0,0,1,15,98,19,72,7,0.0
3,ALABAMA,Enterprise,28569,128,2,17,9,100,715,97,570,48,0.0
4,ALABAMA,Eufaula,11568,95,3,9,15,68,456,95,318,43,0.0
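
The cleaning cells forward-fill the whole DataFrame, which also fills gaps in numeric columns with the previous city's value. The sketch below contrasts that with forward-filling only the state column. The city names are borrowed from this notebook, but the arson values are made up for illustration, and whether the real files contain such mid-table gaps is not confirmed here.

```python
import numpy as np
import pandas as pd

# Toy table: NaN in 'state' mimics unmerged cells; arson values are made up.
toy = pd.DataFrame({
    'state': ['ALABAMA', np.nan, 'ALASKA', np.nan],
    'city': ['Abbeville', 'Adamsville', 'Anchorage', 'Bethel'],
    'arson': [2.0, np.nan, 5.0, np.nan],
})

# Whole-frame forward fill: missing arson counts inherit the previous city's value.
filled_all = toy.ffill()

# Column-restricted forward fill: only the merged-cell column is repaired,
# so missing arson counts stay NaN until they are handled explicitly.
filled_state = toy.copy()
filled_state['state'] = filled_state['state'].ffill()
filled_state['arson'] = filled_state['arson'].fillna(0)

print(filled_all)
print(filled_state)
```

If the column-restricted variant were adopted, the later `fillna(0)` on `arson` would be the step that decides how missing counts are treated.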


## Clean 2019 crime data (by state & city)

In [6]:
# Read in excel file
offenses2019df = pd.read_excel('offenses_by_city/2019offenses_by_state_and_city.xls')

# Remove the rows that are not part of the data table
offenses2019df = offenses2019df.iloc[3:8108, :]

# Use previous column header
offenses2019df.columns = header

# The US state column in the Excel file has merged cells. When the file is read
# into a DataFrame, the merged cells separate and every row after the first in
# each state gets NaN. Forward-filling restores the state name on those rows.
# Note that this forward-fills every column, not only the state column.
offenses2019df = offenses2019df.ffill(axis=0)

# Reset the indices
offenses2019df.reset_index(drop=True, inplace=True)

# Correctly format the state and city names
offenses2019df['state'] = offenses2019df['state'].str.replace(r'\d+', '', regex=True) # remove all numbers from state names
offenses2019df['city'] = offenses2019df['city'].str.replace(r'\d+', '', regex=True) # remove all numbers from city names

# Display the first couple rows of the DataFrame
offenses2019df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Hoover,85670,114,4,15,27,68,1922,128,1694,100,2
1,ALASKA,Anchorage,287731,3581,32,540,621,2388,12261,1692,9038,1531,93
2,ALASKA,Bethel,6544,130,1,47,3,79,132,20,84,28,12
3,ALASKA,Bristol Bay Borough,852,2,0,0,0,2,20,5,8,7,0
4,ALASKA,Cordova,2150,0,0,0,0,0,7,1,6,0,0
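
The `\d+` pattern used above removes every digit, which is fine for footnote markers but would also alter a name that legitimately contains a number. The sketch below compares it with a narrower pattern that only strips digits attached to the end of a word. All names in the example are made up for illustration.

```python
import pandas as pd

# Made-up names: two carry footnote-style digits, one has a digit in the name itself.
names = pd.Series(['ALABAMA3', 'Hoover', 'Anchorage4', 'Example City 2'])

# The notebook's pattern: drop every run of digits anywhere in the string.
all_digits = names.str.replace(r'\d+', '', regex=True)

# Narrower alternative: drop only digits that directly follow a letter at the end.
attached = names.str.replace(r'(?<=[A-Za-z])\d+$', '', regex=True)

print(all_digits.tolist())
print(attached.tolist())
```

Either way, a final `.str.strip()` would remove any whitespace left behind.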


## Clean 2018 crime data (by state & city)

In [20]:
# Read in excel file
offenses2018df = pd.read_excel('offenses_by_city/2018offenses_by_state_and_city.xls')

# Remove the rows that are not part of the data table
offenses2018df = offenses2018df.iloc[2:, :]

offenses2018df.head(5)

Unnamed: 0,Table 8,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
2,State,City,Population,Violent\ncrime,Murder and\nnonnegligent\nmanslaughter,Rape1,Robbery,Aggravated\nassault,Property\ncrime,Burglary,Larceny-\ntheft,Motor\nvehicle\ntheft,Arson2
3,ALABAMA,Abbeville,2551,18,0,2,0,16,49,14,33,2,
4,,Adamsville,4323,19,0,1,4,14,289,42,230,17,
5,,Alabaster,33501,92,0,2,10,80,579,56,497,26,
6,,Albertville,21428,24,0,6,10,8,802,194,492,116,
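
The 2018 table still has its header in the first row, footnote digits in some column names (`Rape1`, `Arson2`), and NaN in the state column. A minimal sketch of how the remaining steps might look is given below, using a few cells copied from the output above. The function `promote_header_and_fill_state` is a hypothetical helper, and stripping digits from the header is an assumption about what those markers are.

```python
import numpy as np
import pandas as pd

def promote_header_and_fill_state(df):
    """Use the first row as column names, then forward-fill only the state column."""
    out = df.copy()
    names = (out.iloc[0].astype(str)
                .str.replace(r'\d+', '', regex=True)   # footnote markers: 'Rape1' -> 'Rape'
                .str.lower()
                .str.replace('-', '', regex=False)
                .str.split().str.join('_'))            # spaces and newlines -> '_'
    out = out.iloc[1:].reset_index(drop=True)
    out.columns = names.tolist()
    out['state'] = out['state'].ffill()
    return out

# A few cells copied from the 2018 output above (four of the thirteen columns).
raw = pd.DataFrame(
    [['State', 'City', 'Population', 'Larceny-\ntheft'],
     ['ALABAMA', 'Abbeville', 2551, 33],
     [np.nan, 'Adamsville', 4323, 230],
     [np.nan, 'Alabaster', 33501, 497]],
    columns=['Table 8', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 10'],
)

cleaned = promote_header_and_fill_state(raw)
print(cleaned)
```

Unlike the earlier cells, this version derives the column names from each file's own header row instead of reusing the saved `header`, which may be safer if column order differs between years.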


## 2019 crime by state (testing to see if we should use crime data by states instead)

In [19]:
# read in the data for 2019 crime data by state
offenses2019bystatedf = pd.read_excel('Table_5_Crime_in_the_United_States_by_State_2019.xls')

# Remove the rows that are not part of the data table
offenses2019bystatedf = offenses2019bystatedf.iloc[2:, :]

# Make the first row which contains the names of the features as the column names
# A separate name is used so the `header` saved for the city datasets is not overwritten
state_header = offenses2019bystatedf.iloc[0] # takes the first row as the header for column names
state_header.name = '' # removes the name of the header (not needed)
offenses2019bystatedf = offenses2019bystatedf[1:]
offenses2019bystatedf.columns = state_header

# Reformat column names for readability
offenses2019bystatedf.columns = offenses2019bystatedf.columns.str.lower().str.replace('\n','').str.replace(' ','_').str.replace('-','')
offenses2019bystatedf.columns = offenses2019bystatedf.columns.str.replace(r'\d','', regex=True)
offenses2019bystatedf.rename(columns = {'area':'unit_type'}, inplace = True)

# remove all numbers from state names
offenses2019bystatedf['state'] = offenses2019bystatedf['state'].str.replace(r'\d+', '', regex=True)

# The state and unit_type columns in the Excel file have merged cells. When the
# file is read into a DataFrame, the merged cells separate and the rows after the
# first in each group get NaN. The lines below forward-fill those NaN values
# with the correct state and unit type.
offenses2019bystatedf['state'] = offenses2019bystatedf['state'].ffill()
offenses2019bystatedf['unit_type'] = offenses2019bystatedf['unit_type'].ffill()

# Remove the rows that we don't need (we only need rows for state totals)
# .copy() makes the filtered result an independent DataFrame
offenses2019bystatedf = offenses2019bystatedf[offenses2019bystatedf['unit_type'] == 'State Total'].copy()

# Drop the NaN column
offenses2019bystatedf.drop(offenses2019bystatedf.columns[2], inplace=True, axis=1)

# Reset the indices
offenses2019bystatedf.reset_index(drop=True, inplace=True)

# Rename every other observation in the unit_type column to 'Rate per 100,000 inhabitants'
# A single .iloc assignment avoids the chained indexing that triggers SettingWithCopyWarning
unit_type_pos = offenses2019bystatedf.columns.get_loc('unit_type')
offenses2019bystatedf.iloc[1::2, unit_type_pos] = 'Rate per 100,000 inhabitants'

offenses2019bystatedf.head(100)


Unnamed: 0,state,unit_type,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larcenytheft,motor_vehicle_theft
0,ALABAMA,State Total,4903185.0,25046.0,358.0,2068.0,3941.0,18679.0,131133.0,26079.0,92477.0,12577.0
1,ALABAMA,"Rate per 100,000 inhabitants",,510.8,7.3,42.2,80.4,381.0,2674.4,531.9,1886.1,256.5
2,ALASKA,State Total,731545.0,6343.0,69.0,1088.0,826.0,4360.0,21294.0,3563.0,15114.0,2617.0
3,ALASKA,"Rate per 100,000 inhabitants",,867.1,9.4,148.7,112.9,596.0,2910.8,487.1,2066.0,357.7
4,ARIZONA,State Total,7278717.0,33141.0,365.0,3662.0,6410.0,22704.0,177638.0,28699.0,130788.0,18151.0
5,ARIZONA,"Rate per 100,000 inhabitants",,455.3,5.0,50.3,88.1,311.9,2440.5,394.3,1796.9,249.4
6,ARKANSAS,State Total,3017804.0,17643.0,242.0,2331.0,1557.0,13513.0,86250.0,18095.0,60735.0,7420.0
7,ARKANSAS,"Rate per 100,000 inhabitants",,584.6,8.0,77.2,51.6,447.8,2858.0,599.6,2012.6,245.9
8,CALIFORNIA,State Total,39512223.0,174331.0,1690.0,14799.0,52301.0,105541.0,921114.0,152555.0,626802.0,141757.0
9,CALIFORNIA,"Rate per 100,000 inhabitants",,441.2,4.3,37.5,132.4,267.1,2331.2,386.1,1586.3,358.8
