# Final Tutorial
### Niko Zhang and Sophie Tsai

## Introduction
With the use of prompt engineering, I am never going to write a paragraph ever again.
#### ChatGPT prompt:
You are a data scientist working in a team conducting exploratory data analysis on crime in the united states at the state level. Your task is to turn in a tutorial that will walk users through the entire data science pipeline: data curation, parsing, and management. You want to see how household income and climate affect the amount of crime rate. You have taken household income data from the census bureau, taken crime data from UCR, and taken climate data from NOAA. Write an introduction to this project so that it shows the purpose of the project, the relevance/importance of the project with supporting background information, and where the data is obtained. Use third person.

#### ChatGPT output with some editing:
The purpose of this tutorial is to guide readers through the entire data science pipeline, including data curation, parsing, management, and analysis. To provide users with a comprehensive understanding of the data science pipeline, this tutorial conducts exploratory data analysis on crime rates in the United States at the state level. Specifically, it aims to investigate the impact of household income and climate on crime rates.

The relevance and importance of this project lie in its potential to provide insights into the factors that contribute to crime rates and how policymakers can use this information to implement effective crime prevention measures. By identifying the key drivers of crime rates, this project could lead to the development of more targeted and effective strategies for reducing crime in the United States.

The project uses data from three sources: the U.S. Census Bureau for household income data, the FBI Uniform Crime Reporting (UCR) program for crime data, and the National Oceanic and Atmospheric Administration (NOAA) for climate data. Through this project, we hope to demonstrate the power of data-driven insights in understanding complex social phenomena like crime and informing evidence-based policy decisions.

## Imports and configurations

In [2]:
# Imports for reading in data
import pandas as pd

# Set max rows displayed in DataFrame
pd.set_option('display.max_rows', 100)

## Read in and clean UCR crime data (by state) for 2019

In [67]:
# read in the data for 2019 crime data by state
offenses2019bystatedf = pd.read_excel('data/crimes_by_state/2019/2019offenses_by_state.xls')

# Remove the rows that are not part of the data table
offenses2019bystatedf = offenses2019bystatedf.iloc[2:-5, :]

# Make the first row which contains the names of the features as the column names
header = offenses2019bystatedf.iloc[0] # takes the first row as the header for column names
header.name = '' # removes the name of the header (not needed)
offenses2019bystatedf = offenses2019bystatedf[1:]
offenses2019bystatedf.columns = header

# Reformat column names for readability
offenses2019bystatedf.columns = offenses2019bystatedf.columns.str.lower().str.replace('\n','').str.replace(' ','_').str.replace('-','')
offenses2019bystatedf.columns = offenses2019bystatedf.columns.str.replace(r'\d','', regex=True)
offenses2019bystatedf.rename(columns = {'area':'unit_type'}, inplace = True)

# remove all numbers from state names
offenses2019bystatedf['state'] = offenses2019bystatedf['state'].str.replace(r'\d+', '', regex=True)

# The state column in the Excel file has merged cells. When the file is read into a
# DataFrame, the merged cells separate and all but the first become NaN. The lines
# below forward-fill those NaN values with the correct state and unit type.
offenses2019bystatedf['state'] = offenses2019bystatedf['state'].ffill()
offenses2019bystatedf['unit_type'] = offenses2019bystatedf['unit_type'].ffill()

# Remove the rows that we don't need (we only need rows for state totals)
offenses2019bystatedf = offenses2019bystatedf[offenses2019bystatedf['unit_type'] == 'State Total']

# Drop the NaN column
offenses2019bystatedf.drop(offenses2019bystatedf.columns[2], inplace=True, axis=1)

# Reset the indices
offenses2019bystatedf.reset_index(drop=True, inplace=True)

# Relabel every other row (the rate rows) in the unit_type column as 'Rate per 100,000 inhabitants'
pd.options.mode.chained_assignment = None # turn off SettingWithCopyWarning
offenses2019bystatedf.iloc[1::2, offenses2019bystatedf.columns.get_loc('unit_type')] = 'Rate per 100,000 inhabitants'

# add column for year
offenses2019bystatedf.insert(0, 'year', 2019)

offenses2019bystatedf.head(2)

Unnamed: 0,year,state,unit_type,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larcenytheft,motor_vehicle_theft
0,2019,ALABAMA,State Total,4903185.0,25046.0,358.0,2068.0,3941.0,18679,131133.0,26079.0,92477.0,12577.0
1,2019,ALABAMA,"Rate per 100,000 inhabitants",,510.8,7.3,42.2,80.4,381,2674.4,531.9,1886.1,256.5
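
The two less obvious steps in the cell above are the forward-fill of the merged cells and the relabeling of every second row. The sketch below reproduces both on a toy DataFrame; the frame `toy` is hypothetical, and its numbers are borrowed from outputs in this tutorial purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spreadsheet after read_excel: merged cells in
# 'state' and 'unit_type' arrive as NaN on every row but the first.
toy = pd.DataFrame({
    'state': ['ALABAMA', np.nan, 'WYOMING', np.nan],
    'unit_type': ['State Total', np.nan, 'State Total', np.nan],
    'violent_crime': [25046, 510.8, 1376, 237.5],
})

# Forward-fill copies the last seen value down into the blank cells
toy['state'] = toy['state'].ffill()
toy['unit_type'] = toy['unit_type'].ffill()

# Every second row holds the rate, so relabel it with a single
# positional assignment instead of chained indexing
toy.iloc[1::2, toy.columns.get_loc('unit_type')] = 'Rate per 100,000 inhabitants'

print(toy)
```

Assigning through a single `.iloc` call writes to the DataFrame itself, whereas chained indexing may write to a temporary copy depending on the pandas version.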


## Read in and clean UCR crime data (by state) for 2018

In [68]:
# Read in the 2018 crime data by state
offenses2018bystatedf = pd.read_excel('data/crimes_by_state/2018/2018offenses_by_state.xls')

# Remove the rows that are not part of the data table
offenses2018bystatedf = offenses2018bystatedf.iloc[2:-6, :-2]

# Use previous header for column names
offenses2018bystatedf.columns = header

# Reformat column names for readability
offenses2018bystatedf.columns = offenses2018bystatedf.columns.str.lower().str.replace('\n','').str.replace(' ','_').str.replace('-','')
offenses2018bystatedf.columns = offenses2018bystatedf.columns.str.replace(r'\d','', regex=True)
offenses2018bystatedf.rename(columns = {'area':'unit_type'}, inplace = True)

# remove all numbers from state names
offenses2018bystatedf['state'] = offenses2018bystatedf['state'].str.replace(r'\d+', '', regex=True)

# The state column in the Excel file has merged cells. When the file is read into a
# DataFrame, the merged cells separate and all but the first become NaN. The lines
# below forward-fill those NaN values with the correct state and unit type.
offenses2018bystatedf['state'] = offenses2018bystatedf['state'].ffill()
offenses2018bystatedf['unit_type'] = offenses2018bystatedf['unit_type'].ffill()

# Remove the rows that we don't need (we only need rows for state totals)
offenses2018bystatedf = offenses2018bystatedf[offenses2018bystatedf['unit_type'] == 'State Total']

# Drop the NaN column
offenses2018bystatedf.drop(offenses2018bystatedf.columns[2], inplace=True, axis=1)

# Reset the indices
offenses2018bystatedf.reset_index(drop=True, inplace=True)

# Relabel every other row (the rate rows) in the unit_type column as 'Rate per 100,000 inhabitants'
pd.options.mode.chained_assignment = None # turn off SettingWithCopyWarning
offenses2018bystatedf.iloc[1::2, offenses2018bystatedf.columns.get_loc('unit_type')] = 'Rate per 100,000 inhabitants'

# add column for year
offenses2018bystatedf.insert(0, 'year', 2018)

offenses2018bystatedf.head(2)

Unnamed: 0,year,state,unit_type,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larcenytheft,motor_vehicle_theft
0,2018,ALABAMA,State Total,4887871.0,25399.0,383.0,1996.0,4076.0,18944.0,137700.0,28841.0,95747.0,13112.0
1,2018,ALABAMA,"Rate per 100,000 inhabitants",,519.6,7.8,40.8,83.4,387.6,2817.2,590.1,1958.9,268.3
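
As a sanity check on the cleaned table, the rate row can be recomputed from the count row. The sketch below assumes the rate is simply the count divided by the population, scaled to 100,000 inhabitants; the helper `rate_per_100k` is a hypothetical name, and the inputs are copied from the Alabama 2018 'State Total' row above.

```python
def rate_per_100k(count, population):
    """Return the number of offenses per 100,000 inhabitants."""
    return count / population * 100_000

# Alabama, 2018, taken from the 'State Total' row in the output above
population = 4887871
violent_crime = 25399
murder = 383

violent_rate = round(rate_per_100k(violent_crime, population), 1)
murder_rate = round(rate_per_100k(murder, population), 1)

print(violent_rate, murder_rate)
```

Both values can be compared against the `violent_crime` and `murder_and_nonnegligent_manslaughter` entries of the 'Rate per 100,000 inhabitants' row.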


## Read in and clean UCR crime data (by state) for 2017

In [72]:
# Read in the 2017 crime data by state
offenses2017bystatedf = pd.read_excel('data/crimes_by_state/2017/2017offenses_by_state.xls')

# Remove the rows that are not part of the data table
offenses2017bystatedf = offenses2017bystatedf.iloc[2:-5, :-2]

# Use previous header for column names
offenses2017bystatedf.columns = header

# Reformat column names for readability
offenses2017bystatedf.columns = offenses2017bystatedf.columns.str.lower().str.replace('\n','').str.replace(' ','_').str.replace('-','')
offenses2017bystatedf.columns = offenses2017bystatedf.columns.str.replace(r'\d','', regex=True)
offenses2017bystatedf.rename(columns = {'area':'unit_type'}, inplace = True)

# remove all numbers from state names
offenses2017bystatedf['state'] = offenses2017bystatedf['state'].str.replace(r'\d+', '', regex=True)

# The state column in the Excel file has merged cells. When the file is read into a
# DataFrame, the merged cells separate and all but the first become NaN. The lines
# below forward-fill those NaN values with the correct state and unit type.
offenses2017bystatedf['state'] = offenses2017bystatedf['state'].ffill()
offenses2017bystatedf['unit_type'] = offenses2017bystatedf['unit_type'].ffill()

# Remove the rows that we don't need (we only need rows for state totals)
offenses2017bystatedf = offenses2017bystatedf[offenses2017bystatedf['unit_type'] == 'State Total']

# Drop the NaN column
offenses2017bystatedf.drop(offenses2017bystatedf.columns[2], inplace=True, axis=1)

# Reset the indices
offenses2017bystatedf.reset_index(drop=True, inplace=True)

# Relabel every other row (the rate rows) in the unit_type column as 'Rate per 100,000 inhabitants'
pd.options.mode.chained_assignment = None # turn off SettingWithCopyWarning
offenses2017bystatedf.iloc[1::2, offenses2017bystatedf.columns.get_loc('unit_type')] = 'Rate per 100,000 inhabitants'

# add column for year
offenses2017bystatedf.insert(0, 'year', 2017)

offenses2017bystatedf.tail(2)

Unnamed: 0,year,state,unit_type,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larcenytheft,motor_vehicle_theft
98,2017,WYOMING,State Total,579315.0,1376.0,15.0,263.0,76.0,1022.0,10604.0,1593,8232,779.0
99,2017,WYOMING,"Rate per 100,000 inhabitants",,237.5,2.6,45.4,13.1,176.4,1830.4,275,1421,134.5
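
Each per-year frame carries a `year` column, which makes it possible to stack the years into one table. This section does not show that step, so the following is only a sketch under that assumption; `toy2018`, `toy2019`, and `combined` are hypothetical names, and the values are copied from the Alabama 'State Total' rows shown earlier.

```python
import pandas as pd

# Tiny stand-ins for two cleaned per-year frames
toy2019 = pd.DataFrame({'year': [2019], 'state': ['ALABAMA'], 'violent_crime': [25046]})
toy2018 = pd.DataFrame({'year': [2018], 'state': ['ALABAMA'], 'violent_crime': [25399]})

# Stack the years vertically; ignore_index renumbers the rows from 0
combined = pd.concat([toy2019, toy2018], ignore_index=True)

# Sort so that each state's rows appear in chronological order
combined = combined.sort_values(['state', 'year']).reset_index(drop=True)

print(combined)
```

`pd.concat` aligns on column names, so it relies on every yearly frame having been given the same header.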


# The code below is for figuring out what data tables we should use for our project

## Read in and clean NIBRS crime data for 2021

In [3]:
# Read in excel file for persons offenses
persons_offenses_2021 = pd.read_excel('crimes_by_state/2021/Persons_Offenses.xlsx')
# Read in excel file for property offenses
property_offenses_2021 = pd.read_excel('crimes_by_state/2021/Property_Offenses.xlsx')
# Read in excel file for society offenses
society_offenses_2021 = pd.read_excel('crimes_by_state/2021/Society_Offenses.xlsx')

# clean data for persons offenses
# set column names
persons_offenses_2021.columns = ['state','num_agencies','covered_pop','total_offenses','assualt','homicide','trafficking','kidnapping','sex_offense']
# remove unnecessary rows
persons_offenses_2021 = persons_offenses_2021.iloc[5:-1, :]

# clean data for property offenses
# set column names
property_offenses_2021.columns = ['state','num_agencies','covered_pop','total_offenses','arson','bribe','burglary','counterfeiting','vandalism','embezzlement','extortion','fraud','larceny','motor_theft','robbery','stolen_property']
# remove unnecessary rows
property_offenses_2021 = property_offenses_2021.iloc[5:-1, :]

# clean data for society offenses
# set column names
society_offenses_2021.columns = ['state','num_agencies','covered_pop','total_offenses','animal_cruelty','drug_offense','gambling','pornography','prostitution','weapon_violation']
# remove unnecessary rows
society_offenses_2021 = society_offenses_2021.iloc[5:-1, :]
persons_offenses_2021.head(5)

Unnamed: 0,state,num_agencies,covered_pop,total_offenses,assualt,homicide,trafficking,kidnapping,sex_offense
5,Total,11794,215058917,2939412,2706772,16537,2141,36919,177043
6,Alabama,356,3734077,70855,68366,381,18,388,1702
7,Alaska,30,402557,6858,5945,31,2,29,851
8,Arizona,79,3949562,45372,40946,235,7,706,3478
9,Arkansas,285,2916168,66379,62317,339,6,658,3059
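
Rows sliced out of a spreadsheet that also contains title and header rows typically end up in `object` columns rather than numeric ones. A minimal sketch of converting them, assuming the count columns should be numeric; the frame `toy` is hypothetical and reuses two rows from the output above, stored as strings to imitate the problem.

```python
import pandas as pd

# Stand-in for sliced spreadsheet rows: numbers held in object columns
toy = pd.DataFrame({
    'state': ['Alabama', 'Alaska'],
    'covered_pop': ['3734077', '402557'],
    'total_offenses': ['70855', '6858'],
}, dtype=object)

numeric_cols = ['covered_pop', 'total_offenses']

# errors='coerce' turns anything unparseable into NaN instead of raising
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(toy.dtypes)
```

After the conversion, arithmetic such as sums or per-capita rates works directly on these columns.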


## Read in UCR crime data (2000-2021) as DataFrames

In [4]:
# Read in the excel files as DataFrames
offenses2021df = pd.read_excel('offenses_by_city/2021offenses_by_state_and_city.xlsx')
offenses2020df = pd.read_excel('offenses_by_city/2020offenses_by_state_and_city.xlsx')
offenses2019df = pd.read_excel('offenses_by_city/2019offenses_by_state_and_city.xls')
offenses2018df = pd.read_excel('offenses_by_city/2018offenses_by_state_and_city.xls')
offenses2017df = pd.read_excel('offenses_by_city/2017offenses_by_state_and_city.xls')
offenses2016df = pd.read_excel('offenses_by_city/2016offenses_by_state_and_city.xls')
offenses2015df = pd.read_excel('offenses_by_city/2015offenses_by_state_and_city.xls')
offenses2014df = pd.read_excel('offenses_by_city/2014offenses_by_state_and_city.xls')
offenses2013df = pd.read_excel('offenses_by_city/2013offenses_by_state_and_city.xls')
offenses2012df = pd.read_excel('offenses_by_city/2012offenses_by_state_and_city.xls')
offenses2011df = pd.read_excel('offenses_by_city/2011offenses_by_state_and_city.xls')
offenses2010df = pd.read_excel('offenses_by_city/2010offenses_by_state_and_city.xls')
offenses2009df = pd.read_excel('offenses_by_city/2009offenses_by_state_and_city.xls')
offenses2008df = pd.read_excel('offenses_by_city/2008offenses_by_state_and_city.xls')
offenses2007df = pd.read_excel('offenses_by_city/2007offenses_by_state_and_city.xls')
offenses2006df = pd.read_excel('offenses_by_city/2006offenses_by_state_and_city.xls')
offenses2005df = pd.read_excel('offenses_by_city/2005offenses_by_state_and_city.xls')
offenses2004under10000df = pd.read_excel('offenses_by_city/2004offenses_by_state_and_city_pop_under_10000.xls')
offenses2004ge10000df = pd.read_excel('offenses_by_city/2004offenses_by_state_and_city_pop_ge_10000.xls')
offenses2003under10000df = pd.read_excel('offenses_by_city/2003offenses_by_state_and_city_pop_under_10000.xls')
offenses2003ge10000df = pd.read_excel('offenses_by_city/2003offenses_by_state_and_city_pop_ge_10000.xls')
offenses2002under10000df = pd.read_excel('offenses_by_city/2002offenses_by_state_and_city_pop_under_10000.xls')
offenses2002ge10000df = pd.read_excel('offenses_by_city/2002offenses_by_state_and_city_pop_ge_10000.xls')
offenses2001under10000df = pd.read_excel('offenses_by_city/2001offenses_by_state_and_city_pop_under_10000.xls')
offenses2001ge10000df = pd.read_excel('offenses_by_city/2001offenses_by_state_and_city_pop_ge_10000.xls')
offenses2000ge10000df = pd.read_excel('offenses_by_city/2000offenses_by_state_and_city_pop_ge_10000.xls')
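
The 26 `read_excel` calls above follow a regular naming pattern, so the paths could also be generated in a loop. The sketch below only builds the paths and does not read any files; the dictionary `paths` and its keys are hypothetical names, while the file-name patterns mirror the calls above.

```python
paths = {}

# 2005-2021: one file per year; 2020 and 2021 use .xlsx, earlier years .xls
for year in range(2005, 2022):
    ext = 'xlsx' if year >= 2020 else 'xls'
    paths[str(year)] = f'offenses_by_city/{year}offenses_by_state_and_city.{ext}'

# 2001-2004: two files per year, split by city population
for year in range(2001, 2005):
    for suffix in ('under_10000', 'ge_10000'):
        paths[f'{year}_{suffix}'] = (
            f'offenses_by_city/{year}offenses_by_state_and_city_pop_{suffix}.xls'
        )

# 2000: only the file for cities with population >= 10,000 is read above
paths['2000_ge_10000'] = 'offenses_by_city/2000offenses_by_state_and_city_pop_ge_10000.xls'

for key, path in paths.items():
    print(key, path)
```

With the paths in a dictionary, the frames could then be loaded with something like `{key: pd.read_excel(path) for key, path in paths.items()}`.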

## Clean 2021 crime data (by state & city)

In [5]:
# Read in 2021 crime data as DataFrame
offenses2021df = pd.read_excel('offenses_by_city/2021offenses_by_state_and_city.xlsx')

# The state column in the Excel file has merged cells. When the file is read into a
# DataFrame, the merged cells separate and all but the first become NaN. The line
# below forward-fills those NaN values with the correct state.
offenses2021df = offenses2021df.ffill()

# Remove the first 2 rows and the last row as they are not needed
offenses2021df = offenses2021df.iloc[2:-1, :]

# Make the first row which contains the names of the features as the column names
header = offenses2021df.iloc[0] # takes the first row as the header for column names
header.name = '' # removes the name of the header (not needed)
offenses2021df = offenses2021df[1:]
offenses2021df.columns = header

# Reformat column names for readability
offenses2021df.columns = offenses2021df.columns.str.lower().str.replace('\n',' ').str.replace(' ','_').str.replace('-','')

# Reset the indices
offenses2021df.reset_index(drop=True, inplace=True)

# Save the column header for the rest of the crime datasets
header = offenses2021df.columns

# Display first couple rows of DataFrame
offenses2021df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Abbeville,2539,4,1,0,0,3,53,11,37,5,0
1,ALABAMA,Alabaster,33963,25,1,4,0,20,282,13,253,16,1
2,ALABAMA,Alexander City,14066,40,0,0,7,33,283,178,87,18,1
3,ALABAMA,Altoona,913,4,0,0,0,4,7,1,6,0,0
4,ALABAMA,Andalusia,8643,44,1,6,1,36,254,45,198,11,0


## Clean 2020 crime data (by state & city)

In [6]:
# Read in the excel file
offenses2020df = pd.read_excel('offenses_by_city/2020offenses_by_state_and_city.xlsx')

# Remove the rows that are not part of the data table
offenses2020df = offenses2020df.iloc[5:7694, :]

# Use previous column header
offenses2020df.columns = header

# The state column in the Excel file has merged cells. When the file is read into a
# DataFrame, the merged cells separate and all but the first become NaN. The line
# below forward-fills only the state column, so that missing counts in other columns
# are not overwritten with the previous city's values.
offenses2020df['state'] = offenses2020df['state'].ffill()

# Reset the indices
offenses2020df.reset_index(drop=True, inplace=True)

# Correctly format the state and city names
offenses2020df['state'] = offenses2020df['state'].str.replace(r'\d+', '', regex=True) # remove all numbers from state names
offenses2020df['city'] = offenses2020df['city'].str.replace(r'\d+', '', regex=True) # remove all numbers from city names
offenses2020df['state'] = offenses2020df['state'].str.upper() # capitalize all state names

# Replace NaN values with 0
offenses2020df['arson'] = offenses2020df['arson'].fillna(0)

# Display the first couple rows of the DataFrame
offenses2020df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Cedar Bluff,1823,4,0,0,0,4,36,7,26,3,0.0
1,ALABAMA,Centre,3547,20,0,4,0,16,124,12,97,15,0.0
2,ALABAMA,Daleville,5080,16,0,0,1,15,98,19,72,7,0.0
3,ALABAMA,Enterprise,28569,128,2,17,9,100,715,97,570,48,0.0
4,ALABAMA,Eufaula,11568,95,3,9,15,68,456,95,318,43,0.0
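
The digit-stripping step exists because some names in the spreadsheet carry footnote markers. A small sketch of its effect; the frame `toy` and its strings are made up for illustration and are not taken from the data.

```python
import pandas as pd

# Trailing digits imitate footnote markers attached to names
toy = pd.DataFrame({
    'state': ['Alabama2', 'ALASKA'],
    'city': ['Hoover3', 'Anchorage'],
})

# r'\d+' matches one or more digits; regex=True enables pattern matching
toy['state'] = toy['state'].str.replace(r'\d+', '', regex=True).str.upper()
toy['city'] = toy['city'].str.replace(r'\d+', '', regex=True)

print(toy)
```

Note that this pattern removes every digit, so it would also alter a place name that legitimately contains a number.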


## Clean 2019 crime data (by state & city)

In [7]:
# Read in excel file
offenses2019df = pd.read_excel('offenses_by_city/2019offenses_by_state_and_city.xls')

# Remove the rows that are not part of the data table
offenses2019df = offenses2019df.iloc[3:8108, :]

# Use previous column header
offenses2019df.columns = header

# The state column in the Excel file has merged cells. When the file is read into a
# DataFrame, the merged cells separate and all but the first become NaN. The line
# below forward-fills those NaN values with the correct state.
offenses2019df = offenses2019df.ffill()

# Reset the indices
offenses2019df.reset_index(drop=True, inplace=True)

# Correctly format the state and city names
offenses2019df['state'] = offenses2019df['state'].str.replace(r'\d+', '', regex=True) # remove all numbers from state names
offenses2019df['city'] = offenses2019df['city'].str.replace(r'\d+', '', regex=True) # remove all numbers from city names

# Display the first couple rows of the DataFrame
offenses2019df.head(5)

Unnamed: 0,state,city,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larceny_theft,motor_vehicle_theft,arson
0,ALABAMA,Hoover,85670,114,4,15,27,68,1922,128,1694,100,2
1,ALASKA,Anchorage,287731,3581,32,540,621,2388,12261,1692,9038,1531,93
2,ALASKA,Bethel,6544,130,1,47,3,79,132,20,84,28,12
3,ALASKA,Bristol Bay Borough,852,2,0,0,0,2,20,5,8,7,0
4,ALASKA,Cordova,2150,0,0,0,0,0,7,1,6,0,0


## Clean 2018 crime data (by state & city)

In [8]:
# Read in excel file
offenses2018df = pd.read_excel('offenses_by_city/2018offenses_by_state_and_city.xls')

# Remove the rows that are not part of the data table
offenses2018df = offenses2018df.iloc[2:, :]

offenses2018df.head(5)

Unnamed: 0,Table 8,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
2,State,City,Population,Violent\ncrime,Murder and\nnonnegligent\nmanslaughter,Rape1,Robbery,Aggravated\nassault,Property\ncrime,Burglary,Larceny-\ntheft,Motor\nvehicle\ntheft,Arson2
3,ALABAMA,Abbeville,2551,18,0,2,0,16,49,14,33,2,
4,,Adamsville,4323,19,0,1,4,14,289,42,230,17,
5,,Alabaster,33501,92,0,2,10,80,579,56,497,26,
6,,Albertville,21428,24,0,6,10,8,802,194,492,116,
