# Final Tutorial
### Niko Zhang and Sophie Tsai

## Introduction
With the use of prompt engineering, I am never going to write a paragraph ever again.
#### ChatGPT prompt:
You are a data scientist working in a team conducting exploratory data analysis on crime in the united states at the state level. Your task is to turn in a tutorial that will walk users through the entire data science pipeline: data curation, parsing, and management. You want to see how household income and climate affect the amount of crime rate. You have taken household income data from the census bureau, taken crime data from UCR, and taken climate data from NOAA. Write an introduction to this project so that it shows the purpose of the project, the relevance/importance of the project with supporting background information, and where the data is obtained. Use third person.

#### ChatGPT output with some editing:
The purpose of this tutorial is to guide readers through the entire data science pipeline, including data curation, parsing, management, and analysis. To provide users with a comprehensive understanding of the data science pipeline, this tutorial conducts exploratory data analysis on crime rates in the United States at the state level. Specifically, it aims to investigate the impact of household income and climate on crime rates.

The relevance and importance of this project lie in its potential to provide insights into the factors that contribute to crime rates and how policymakers can use this information to implement effective crime prevention measures. By identifying the key drivers of crime rates, this project could lead to the development of more targeted and effective strategies for reducing crime in the United States.

The project uses data from three sources: the U.S. Census Bureau for household income data, the FBI Uniform Crime Reporting (UCR) program for crime data, and the National Oceanic and Atmospheric Administration (NOAA) for climate data. The project aims to demonstrate the power of data-driven insights in understanding complex social phenomena like crime and informing evidence-based policy decisions.

## Imports and configurations

In [1]:
# Imports for reading in data
import pandas as pd
import numpy as np

# Regex and strings
import re
import string

# Set max rows displayed in DataFrame
pd.set_option('display.max_rows', None)

# turn off SettingWithCopyWarning
pd.options.mode.chained_assignment = None
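
Turning off `SettingWithCopyWarning` hides the warning rather than removing its cause. As a minimal sketch of an alternative, assuming a toy DataFrame whose column names and values are made up for illustration, an explicit `.copy()` on a filtered frame makes later assignments unambiguous:

```python
import pandas as pd

# Toy data; names and values are hypothetical and not taken from the project files
toy = pd.DataFrame({"state": ["Alabama", "Alaska", "Arizona"],
                    "population": [100, 200, 300]})

# Filtering returns a new object that pandas may treat as derived from `toy`;
# .copy() makes the subset independent before it is modified
subset = toy[toy["population"] > 100].copy()
subset["population_thousands"] = subset["population"] / 1000

# The original frame is left untouched
print(list(toy.columns))
print(subset)
```

This tutorial keeps the warning disabled instead, which is a reasonable shortcut when every filtered frame is reassigned to the same name, as in the cleaning function further down.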

## Read in crime data by state (1979-2020)

In [2]:
# read in the csv file
crime_data = pd.read_csv('State_Crime_Rates.csv')
crime_data.head(5)

Unnamed: 0,year,state_abbr,state_name,population,violent_crime,property_crime,Violent_Crime_rate,Property_Crime_rate
0,1979,US,United States,220099000,1208030,11041500,548.85756,5016.606164
1,1979,AK,Alaska,406000,1994,23193,491.133005,5712.561576
2,1979,AL,Alabama,3769000,15578,144372,413.319183,3830.512072
3,1979,AR,Arkansas,2180000,7984,70949,366.238532,3254.541284
4,1979,AZ,Arizona,2450000,14528,177977,592.979592,7264.367347
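
The rate columns in this output appear to be incidents per 100,000 inhabitants. As a quick sanity check, the sketch below retypes three rows from the output above by hand and recomputes the violent crime rate; the names `sample`, `recomputed_rate`, and `max_diff` are introduced here for illustration only.

```python
import pandas as pd

# Three rows from the output above, retyped by hand
sample = pd.DataFrame({
    "state_abbr": ["US", "AK", "AL"],
    "population": [220099000, 406000, 3769000],
    "violent_crime": [1208030, 1994, 15578],
    "Violent_Crime_rate": [548.85756, 491.133005, 413.319183],
})

# Rate per 100,000 inhabitants = incidents / population * 100,000
sample["recomputed_rate"] = sample["violent_crime"] / sample["population"] * 100_000

# Largest absolute difference between the file's rate and the recomputed one
max_diff = (sample["recomputed_rate"] - sample["Violent_Crime_rate"]).abs().max()
print(sample[["state_abbr", "Violent_Crime_rate", "recomputed_rate"]])
print(max_diff)
```

If the difference stays within rounding error for these rows, the same formula can be used to cross-check rates in the other tables.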


In [3]:
# function to clean crime data
def clean_crime_data(df, year, version, state_names):
    
    # version 1 is for datatables in newer format (2005-2019)
    if (version == 1):

        # The US state column in the Excel file has merged cells. When the file is read
        # into a DataFrame, only the first row of each merged cell keeps its value and
        # the remaining rows become NaN. Forward-filling replaces each NaN with the
        # state name above it. (fillna(method='ffill') is deprecated in recent pandas.)
        df[df.columns[0]] = df[df.columns[0]].ffill()

        # Remove the rows that are not part of the data table
        df = df[df.isin(['State Total', 'Total', 'Rate per 100,000 inhabitants']).any(axis=1)]

        # Remove unnecessary columns
        df = df.iloc[:, :13]

        # combine columns at indices 1 and 2
        df.iloc[:,1:3] = df.iloc[:,1:3].fillna('')
        df.insert(1, 'unit_type', df.iloc[:,2]+df.iloc[:,1])
        df.drop(df.columns[2:4], axis=1, inplace=True)

        # if number of columns is greater than 12, it means there are 2 columns for rape,
        # since the definition of 'rape' was changed at some point in time
        if len(df.columns) > 12:
            # drop the column for the old definition of 'rape'
            df.drop(df.columns[6], axis=1, inplace=True)

        # set column names
        df.columns = ['state','unit_type','population','violent_crime','murder_and_nonnegligent_manslaughter','rape','robbery','aggravated_assault','property_crime','burglary','larcenytheft','motor_vehicle_theft']

        # add column for year
        df.insert(0, 'year', year)

        # remove digits and commas from state names, then convert to title case
        df['state'] = df['state'].str.replace(r'\d|,', '', regex=True).str.title()

        # Reset the indices
        df.reset_index(drop=True, inplace=True) 
    
    # version 2 is for datatables in older format (1999-2004)
    elif (version == 2):
        
        # Remove unnecessary columns
        if (year < 2003):
            df.replace(0, np.nan, inplace=True) # replace all zeros with NaN
            df=df.dropna(axis=1,how='all') # drop columns with all NaN values
            df.drop(df.columns[2:4], axis=1, inplace=True)
            # move total property crime column to correct position
            col = df[df.columns[3]]
            df.drop(df.columns[3], axis=1, inplace=True)
            df.insert(7, 'property_crime', col)
        df = df.iloc[:, :11]
        
        # set column names
        df.columns = ['unit_type','population','violent_crime','murder_and_nonnegligent_manslaughter','rape','robbery','aggravated_assault','property_crime','burglary','larcenytheft','motor_vehicle_theft']
        
        # remove all unnecessary rows
        df = df[df.isin(['State Total', 'Total', '   Rate per 100,000 inhabitants']).any(axis=1)]
        
        # Reset the indices
        df.reset_index(drop=True, inplace=True)
        
        # strip spaces from unit_type column from both ends
        df['unit_type'] = df['unit_type'].str.strip()
        
        # add column for states
        df.insert(0, 'state', state_names)
        
        # add column for year
        df.insert(0, 'year', year)
    return df
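
The two string-handling steps in the version 1 branch, forward-filling the merged state cells and stripping digits and commas from state names, can be tried in isolation. The sketch below uses a made-up toy frame, since the real Excel files are not reproduced here; the trailing digits stand in for what are presumably footnote markers, and the extra `.str.strip()` is an addition that the function above does not perform.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a sheet with merged state cells; all values are made up
raw = pd.DataFrame({
    "State": ["ALABAMA", np.nan, "ALASKA3, 4", np.nan],
    "Unit": ["State Total", "Rate per 100,000 inhabitants",
             "State Total", "Rate per 100,000 inhabitants"],
    "Violent crime": [100.0, 10.0, 50.0, 5.0],
})

# Forward-fill: each NaN takes the last non-missing value above it
raw["State"] = raw["State"].ffill()

# Drop digits and commas, trim leftover whitespace, then title-case
raw["State"] = (raw["State"]
                .str.replace(r"\d|,", "", regex=True)
                .str.strip()
                .str.title())

print(raw["State"].tolist())  # ['Alabama', 'Alabama', 'Alaska', 'Alaska']
```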

In [4]:
# For each crime datatable from 2005-2019, read in the Excel file, clean it, and
# append it to df_list; the 1999-2004 tables are then handled the same way
df_list = []
for i in range(2005,2020):
    df = pd.read_excel(f'data/crimes_by_state/{i}/{i}offenses_by_state.xls')
    df = clean_crime_data(df, i, 1, None)
    df_list.append(df)
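# The older tables have no state column, so reuse the first 104 state labels from the
# cleaned 2005 table (two rows, total and rate, per state-level entry)
state_names = df_list[0]['state'].head(104)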
for i in range(1999,2005):
    df = pd.read_excel(f'data/crimes_by_state/{i}/{i}offenses_by_state.xls')
    df = clean_crime_data(df, i, 2, state_names)
    df_list.append(df)    

crime_data = pd.concat(df_list)
crime_data = crime_data.sort_values(['year', 'state'])
# Reset the indices
crime_data.reset_index(drop=True, inplace=True)
crime_data.head(10)

Unnamed: 0,year,state,unit_type,population,violent_crime,murder_and_nonnegligent_manslaughter,rape,robbery,aggravated_assault,property_crime,burglary,larcenytheft,motor_vehicle_theft
0,1999,Alabama,State Total,4370000.0,21421.0,345.0,1513.0,5297.0,14266.0,171398.0,38648.0,119616.0,13134.0
1,1999,Alabama,"Rate per 100,000 inhabitants",,490.2,7.9,34.6,121.2,326.5,3922.2,884.4,2737.2,300.5
2,1999,Alaska,State Total,619000.0,3909.0,53.0,517.0,566.0,2773.0,23099.0,3787.0,16654.0,2658.0
3,1999,Alaska,"Rate per 100,000 inhabitants",,631.5,8.6,83.5,91.4,448.0,3731.7,611.8,2690.5,429.4
4,1999,Arizona,State Total,4778000.0,26334.0,384.0,1383.0,7288.0,17279.0,255401.0,49423.0,167731.0,38247.0
5,1999,Arizona,"Rate per 100,000 inhabitants",,551.2,8.0,28.9,152.5,361.6,5345.4,1034.4,3510.5,800.5
6,1999,Arkansas,State Total,2551000.0,10848.0,143.0,710.0,2024.0,7971.0,92283.0,21692.0,63927.0,6664.0
7,1999,Arkansas,"Rate per 100,000 inhabitants",,425.2,5.6,27.8,79.3,312.5,3617.5,850.3,2506.0,261.2
8,1999,California,State Total,33145000.0,207879.0,2005.0,9363.0,60039.0,136472.0,1053285.0,223814.0,660991.0,168480.0
9,1999,California,"Rate per 100,000 inhabitants",,627.2,6.0,28.2,181.1,411.7,3177.8,675.3,1994.2,508.3
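
Each state and year occupies two rows in this output: one for totals and one for rates. For analysis it is often easier to have a single row per state and year. The sketch below shows one possible way to reshape the data, using two states retyped by hand from the output above with a subset of the columns; the names `long_df`, `totals`, `rates`, and `wide` are hypothetical and not part of the tutorial's code.

```python
import pandas as pd

# Two states from the output above, retyped by hand with a subset of columns
long_df = pd.DataFrame({
    "year": [1999, 1999, 1999, 1999],
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "unit_type": ["State Total", "Rate per 100,000 inhabitants"] * 2,
    "population": [4370000.0, None, 619000.0, None],
    "violent_crime": [21421.0, 490.2, 3909.0, 631.5],
})

# Split the rows by what they measure
totals = long_df[long_df["unit_type"] == "State Total"]
rates = long_df[long_df["unit_type"] == "Rate per 100,000 inhabitants"]

# One row per (year, state): counts keep their names, rates get a "_rate" suffix
wide = totals.drop(columns="unit_type").merge(
    rates.drop(columns=["unit_type", "population"]),
    on=["year", "state"],
    suffixes=("", "_rate"),
)
print(wide)
```

Population is dropped from the rate rows before merging because it is only recorded on the total rows.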
