# Final Tutorial
### Niko Zhang and Sophie Tsai

## Introduction
### Note: this introduction was generated by ChatGPT and is not final
Crime is a pervasive problem that affects communities throughout the United States. As data scientists, we have the opportunity to contribute to the fight against crime by analyzing data on crime patterns and trends. In this exploratory data analysis, we will focus on crime in the United States at the state and city levels, using data from the FBI Uniform Crime Reporting (UCR) program.

The UCR program is a national initiative that collects and publishes crime data from across the United States. Participating law enforcement agencies submit counts for a range of offenses, including murder, rape, robbery, aggravated assault, burglary, larceny-theft, motor vehicle theft, and arson. These data inform policy decisions at the local, state, and national levels.

Our goal in this exploratory data analysis is to identify any significant trends or patterns in crime rates across the United States, as well as any differences in crime rates between states and cities. We will examine both the overall crime rate and rates for specific types of crimes to identify areas of concern and inform policy decisions aimed at reducing crime and improving public safety.

By analyzing these data, we hope to provide information that communities and law enforcement agencies can use in their efforts to reduce crime.
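
Because cities differ widely in population, raw offense counts are hard to compare directly. UCR publications report rates per 100,000 inhabitants, and the comparisons described above can use the same convention. Below is a minimal sketch, assuming the column names produced by the cleaning steps later in this tutorial (`population`, `violent_crime`). The helper name `rate_per_100k` is hypothetical, and the two rows are copied from the 2021 sample output.

```python
import pandas as pd

# Two rows copied from the 2021 sample output shown later in this tutorial
sample = pd.DataFrame({
    "city": ["Abbeville", "Alabaster"],
    "population": [2539, 33963],
    "violent_crime": [4, 25],
})


def rate_per_100k(counts, population):
    """Hypothetical helper: offenses per 100,000 inhabitants."""
    return counts / population * 100_000


# Add a rate column so cities of different sizes can be compared
sample["violent_crime_rate"] = rate_per_100k(
    sample["violent_crime"], sample["population"]
)
print(sample)
```

Abbeville has far fewer violent crimes than Alabaster in absolute terms, but a higher rate once population is taken into account.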

## Imports and configurations

In [1]:
# Imports for reading in data
import pandas as pd

# Set max rows displayed in DataFrame
pd.set_option('display.max_rows', 50)

## Read in UCR crime data (2000-2021) as DataFrames

In [2]:
# Read in the Excel files as DataFrames
offenses2021df = pd.read_excel('offenses/2021offenses_by_state_and_city.xlsx')
offenses2020df = pd.read_excel('offenses/2020offenses_by_state_and_city.xlsx')
offenses2019df = pd.read_excel('offenses/2019offenses_by_state_and_city.xls')
offenses2018df = pd.read_excel('offenses/2018offenses_by_state_and_city.xls')
offenses2017df = pd.read_excel('offenses/2017offenses_by_state_and_city.xls')
offenses2016df = pd.read_excel('offenses/2016offenses_by_state_and_city.xls')
offenses2015df = pd.read_excel('offenses/2015offenses_by_state_and_city.xls')
offenses2014df = pd.read_excel('offenses/2014offenses_by_state_and_city.xls')
offenses2013df = pd.read_excel('offenses/2013offenses_by_state_and_city.xls')
offenses2012df = pd.read_excel('offenses/2012offenses_by_state_and_city.xls')
offenses2011df = pd.read_excel('offenses/2011offenses_by_state_and_city.xls')
offenses2010df = pd.read_excel('offenses/2010offenses_by_state_and_city.xls')
offenses2009df = pd.read_excel('offenses/2009offenses_by_state_and_city.xls')
offenses2008df = pd.read_excel('offenses/2008offenses_by_state_and_city.xls')
offenses2007df = pd.read_excel('offenses/2007offenses_by_state_and_city.xls')
offenses2006df = pd.read_excel('offenses/2006offenses_by_state_and_city.xls')
offenses2005df = pd.read_excel('offenses/2005offenses_by_state_and_city.xls')
offenses2004under10000df = pd.read_excel('offenses/2004offenses_by_state_and_city_pop_under_10000.xls')
offenses2004ge10000df = pd.read_excel('offenses/2004offenses_by_state_and_city_pop_ge_10000.xls')
offenses2003under10000df = pd.read_excel('offenses/2003offenses_by_state_and_city_pop_under_10000.xls')
offenses2003ge10000df = pd.read_excel('offenses/2003offenses_by_state_and_city_pop_ge_10000.xls')
offenses2002under10000df = pd.read_excel('offenses/2002offenses_by_state_and_city_pop_under_10000.xls')
offenses2002ge10000df = pd.read_excel('offenses/2002offenses_by_state_and_city_pop_ge_10000.xls')
offenses2001under10000df = pd.read_excel('offenses/2001offenses_by_state_and_city_pop_under_10000.xls')
offenses2001ge10000df = pd.read_excel('offenses/2001offenses_by_state_and_city_pop_ge_10000.xls')
offenses2000ge10000df = pd.read_excel('offenses/2000offenses_by_state_and_city_pop_ge_10000.xls')
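
The file names above follow a regular pattern, so the same set of reads could also be generated in a loop. The sketch below is one possible way to do that, not the approach used in the rest of this tutorial, which keeps the individual variables. The helper names `offense_paths` and `load_offenses` and the dictionary labels are hypothetical; the paths themselves mirror the ones listed above.

```python
from pathlib import Path

import pandas as pd


def offense_paths(base_dir="offenses"):
    """Hypothetical helper: map a short label to each Excel file path."""
    base = Path(base_dir)
    paths = {}
    # 2005-2021: one file per year (.xlsx for 2020 and 2021, .xls before that)
    for year in range(2005, 2022):
        ext = "xlsx" if year >= 2020 else "xls"
        paths[str(year)] = base / f"{year}offenses_by_state_and_city.{ext}"
    # 2000-2004: files are split by city population; the list above has
    # no "under 10,000" file for 2000
    for year in range(2000, 2005):
        stem = f"{year}offenses_by_state_and_city_pop"
        paths[f"{year}ge10000"] = base / f"{stem}_ge_10000.xls"
        if year >= 2001:
            paths[f"{year}under10000"] = base / f"{stem}_under_10000.xls"
    return paths


def load_offenses(base_dir="offenses"):
    """Read every file into a dict of DataFrames keyed by label."""
    # pandas needs the xlrd package for .xls and openpyxl for .xlsx
    return {label: pd.read_excel(path)
            for label, path in offense_paths(base_dir).items()}
```

With this layout, `load_offenses()["2021"]` would play the role of `offenses2021df`.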

## Clean 2021 crime data

In [17]:
# Read in 2021 crime data as DataFrame
offenses2021df = pd.read_excel('offenses/2021offenses_by_state_and_city.xlsx')

# The state column in the Excel file uses merged cells. When the file is read into a
# DataFrame, only the first row of each merged block keeps the state name and the
# remaining rows become NaN. Forward-filling the first (state) column restores the
# state names without overwriting missing values in the other columns.
state_col = offenses2021df.columns[0]
offenses2021df[state_col] = offenses2021df[state_col].ffill()

# Remove the first 2 rows and the last row as they are not needed
offenses2021df = offenses2021df.iloc[2:-1, :]

# Use the first remaining row, which contains the feature names, as the column names
header = offenses2021df.iloc[0] # take the first row as the header for column names
header.name = '' # clear the header's name (not needed)
offenses2021df = offenses2021df[1:]
offenses2021df.columns = header

# Reformat column names for readability
offenses2021df.columns = (
    offenses2021df.columns.str.lower()
    .str.replace('\n', ' ')
    .str.replace(' ', '_')
    .str.replace('-', '')
)

# Reset the indices
offenses2021df.reset_index(drop=True, inplace=True)

# Save the column header for the rest of the crime datasets
header = offenses2021df.columns

# Display first couple rows of DataFrame
offenses2021df.head(5)



Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Abbeville,2539,4,1,0,0,3,53,11,37,5,0
1,ALABAMA,Alabaster,33963,25,1,4,0,20,282,13,253,16,1
2,ALABAMA,Alexander City,14066,40,0,0,7,33,283,178,87,18,1
3,ALABAMA,Altoona,913,4,0,0,0,4,7,1,6,0,0
4,ALABAMA,Andalusia,8643,44,1,6,1,36,254,45,198,11,0
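
The cleaning steps in this section (forward-filling the merged state cells, trimming the non-data rows, promoting the header row, and normalizing the column names) can be tried on a small stand-in for the raw sheet. The sketch below uses a made-up three-column layout that only imitates the real file. It forward-fills the state column alone so that a missing count stays missing.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the raw sheet: a title row, a blank row, the header
# row, data rows with merged-cell gaps in the state column, and a footnote row
raw = pd.DataFrame([
    ["Table title", np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    ["State", "City", "Larceny-\ntheft"],
    ["ALABAMA", "Abbeville", 37],
    [np.nan, "Alabaster", 253],
    ["ALASKA", "Anchorage", np.nan],
    ["footnote", np.nan, np.nan],
])

# Forward-fill only the first (state) column
raw[0] = raw[0].ffill()

# Drop the two leading rows and the trailing footnote row
body = raw.iloc[2:-1]

# Promote the first remaining row to the header and normalize the names
toy = body.iloc[1:].copy()
toy.columns = (
    body.iloc[0]
    .str.lower()
    .str.replace("\n", " ", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "", regex=False)
    .tolist()
)
toy = toy.reset_index(drop=True)
print(toy)
```

The header cell `Larceny-\ntheft` becomes `larceny_theft`: the newline turns into a space, the space into an underscore, and the hyphen is removed last.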


## Clean 2020 crime data

In [18]:
# Read in the Excel file
offenses2020df = pd.read_excel('offenses/2020offenses_by_state_and_city.xlsx')

# Remove the rows that are not part of the data table
offenses2020df = offenses2020df.iloc[5:7694, :].copy()

# Use previous column header
offenses2020df.columns = header

# The state column in the Excel file uses merged cells. When the file is read into a
# DataFrame, only the first row of each merged block keeps the state name and the
# remaining rows become NaN. Forward-filling the state column restores the state
# names without overwriting missing values in the other columns.
offenses2020df['state'] = offenses2020df['state'].ffill()

# Reset the indices
offenses2020df.reset_index(drop=True, inplace=True)

# Correctly format the state and city names
offenses2020df['state'] = offenses2020df['state'].str.replace(r'\d+', '', regex=True) # remove all numbers from state names
offenses2020df['city'] = offenses2020df['city'].str.replace(r'\d+', '', regex=True) # remove all numbers from city names
offenses2020df['state'] = offenses2020df['state'].str.upper() # capitalize all state names

# Replace missing arson counts (NaN) with 0
offenses2020df['arson'] = offenses2020df['arson'].fillna(0)

# Display the first couple rows of the DataFrame
offenses2020df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Cedar Bluff,1823,4,0,0,0,4,36,7,26,3,0.0
1,ALABAMA,Centre,3547,20,0,4,0,16,124,12,97,15,0.0
2,ALABAMA,Daleville,5080,16,0,0,1,15,98,19,72,7,0.0
3,ALABAMA,Enterprise,28569,128,2,17,9,100,715,97,570,48,0.0
4,ALABAMA,Eufaula,11568,95,3,9,15,68,456,95,318,43,0.0
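
The `\d+` replacement above removes footnote markers that the source tables attach to state and city names, but it also removes any digits that legitimately belong to a name. The sketch below compares it with a stricter variant that only strips digits, commas, and whitespace at the end of a name. The sample names are made up for illustration, and the assumption that footnote markers always appear as a trailing suffix has not been checked against the real files.

```python
import pandas as pd

# Made-up names imitating footnote markers attached to state and city names
names = pd.Series(["Alabama4", "Anchorage", "Mobile3, 5", " Dothan "])

# Pattern used in this tutorial: remove every run of digits anywhere
stripped = names.str.replace(r"\d+", "", regex=True)

# Stricter variant (assumption: markers only appear at the end of a name):
# remove trailing digits, commas, and whitespace, then trim what is left
tidy = names.str.replace(r"[\d,\s]+$", "", regex=True).str.strip()

print(stripped.tolist())
print(tidy.tolist())
```

In this toy input, the first pattern leaves a trailing comma and stray spaces behind, while the stricter variant returns clean names.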


## Clean 2019 crime data

In [20]:
# Read in the Excel file
offenses2019df = pd.read_excel('offenses/2019offenses_by_state_and_city.xls')

# Remove the rows that are not part of the data table
offenses2019df = offenses2019df.iloc[3:8108, :].copy()

# Use previous column header
offenses2019df.columns = header

# The state column in the Excel file uses merged cells. When the file is read into a
# DataFrame, only the first row of each merged block keeps the state name and the
# remaining rows become NaN. Forward-filling the state column restores the state
# names without overwriting missing values in the other columns.
offenses2019df['state'] = offenses2019df['state'].ffill()

# Reset the indices
offenses2019df.reset_index(drop=True, inplace=True)

# Correctly format the state and city names
offenses2019df['state'] = offenses2019df['state'].str.replace(r'\d+', '', regex=True) # remove all numbers from state names
offenses2019df['city'] = offenses2019df['city'].str.replace(r'\d+', '', regex=True) # remove all numbers from city names

# Display the first couple rows of the DataFrame
offenses2019df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Hoover,85670,114,4,15,27,68,1922,128,1694,100,2
1,ALASKA,Anchorage,287731,3581,32,540,621,2388,12261,1692,9038,1531,93
2,ALASKA,Bethel,6544,130,1,47,3,79,132,20,84,28,12
3,ALASKA,Bristol Bay Borough,852,2,0,0,0,2,20,5,8,7,0
4,ALASKA,Cordova,2150,0,0,0,0,0,7,1,6,0,0
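
The 2019 and 2020 cells repeat the same steps with different row bounds, so they could be folded into one function. The sketch below is one possible refactoring under that assumption, not code from the original notebook. The function name `clean_offenses` and its parameters are hypothetical, and the toy input reuses three rows from the 2019 output with made-up footnote digits added.

```python
import numpy as np
import pandas as pd


def clean_offenses(raw, header, first_row, last_row):
    """Hypothetical helper mirroring the 2019/2020 cleaning steps."""
    # Keep only the rows that belong to the data table
    df = raw.iloc[first_row:last_row, :].copy()
    df.columns = header
    # Restore state names lost to merged cells
    df["state"] = df["state"].ffill()
    # Remove footnote digits from state and city names
    for col in ("state", "city"):
        df[col] = df[col].str.replace(r"\d+", "", regex=True)
    df["state"] = df["state"].str.upper()
    return df.reset_index(drop=True)


# Toy sheet: three non-data rows, three data rows, one footnote row
toy_raw = pd.DataFrame([
    ["Table title", np.nan, np.nan],
    ["Subtitle", np.nan, np.nan],
    ["State", "City", "Population"],
    ["Alabama", "Hoover3", 85670],
    ["Alaska4", "Anchorage", 287731],
    [np.nan, "Bethel", 6544],
    ["1 footnote text", np.nan, np.nan],
])

cleaned = clean_offenses(toy_raw, ["state", "city", "population"], 3, 6)
print(cleaned)
```

For the real files, the call for 2019 would pass `header` from the 2021 cell along with the bounds `3` and `8108` used above.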


## Clean 2018 crime data