# Final Tutorial
### Niko Zhang and Sophie Tsai

## Introduction
### Note: this introduction was generated by ChatGPT and is not final
Crime affects communities throughout the United States, and publicly reported crime data makes it possible to study where and how it occurs. In this exploratory data analysis, we focus on crime in the United States at the state and city levels, using data from the FBI Uniform Crime Reporting (UCR) program.

The UCR program is a national initiative that collects and publishes crime data for the United States. Law enforcement agencies across the country submit counts for a range of offenses, including murder, rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. These data inform policy decisions at the local, state, and national levels.

Our goal in this exploratory data analysis is to identify any significant trends or patterns in crime rates across the United States, as well as any differences in crime rates between states and cities. We will examine both the overall crime rate and rates for specific types of crimes to identify areas of concern and inform policy decisions aimed at reducing crime and improving public safety.

By analyzing crime data, we hope to provide useful information to communities and law enforcement agencies working to reduce crime in the United States.

## Imports

In [1]:
# Imports for reading in data
import pandas as pd

## Create a DataFrame from UCR crime data (2021)

In [9]:
# Set max rows displayed in DataFrame
pd.set_option('display.max_rows', 10)

# Read in the Excel file
df = pd.read_excel('2021offenses_by_state_and_city.xlsx')

# The state column in the Excel file uses merged cells. When the file is read
# into a DataFrame, each merged cell becomes one value followed by NaN, so most
# rows have NaN in that column. Forward-filling copies each state name down
# into the rows below it. Note that this forward-fills every column, not only
# the state column.
df = df.ffill(axis=0)

# Remove the first 2 rows and the last row, as they are not needed
df = df.iloc[2:-1, :]

# Use the first remaining row, which holds the feature names, as the column names
header = df.iloc[0]  # row containing the column names
header.name = ''     # clear the Series name so it is not shown as the columns label
df = df[1:]          # drop the header row from the data
df.columns = header

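# Reformat column names for readability: lowercase, newlines and spaces to
# underscores, hyphens removed
df.columns = (
    df.columns.str.lower()
    .str.replace('\n', ' ', regex=False)
    .str.replace(' ', '_', regex=False)
    .str.replace('-', '', regex=False)
)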

# Reset the indices
df.reset_index(drop=True, inplace=True)

# Display the DataFrame for UCR crime data
df

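| | state | city | population | violent_crime | murder_and_nonnegligent_manslaughter | rape | robbery | aggravated_assault | property_crime | burglary | larceny_theft | motor_vehicle_theft | arson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALABAMA | Abbeville | 2539 | 4 | 1 | 0 | 0 | 3 | 53 | 11 | 37 | 5 | 0 |
| 1 | ALABAMA | Alabaster | 33963 | 25 | 1 | 4 | 0 | 20 | 282 | 13 | 253 | 16 | 1 |
| 2 | ALABAMA | Alexander City | 14066 | 40 | 0 | 0 | 7 | 33 | 283 | 178 | 87 | 18 | 1 |
| 3 | ALABAMA | Altoona | 913 | 4 | 0 | 0 | 0 | 4 | 7 | 1 | 6 | 0 | 0 |
| 4 | ALABAMA | Andalusia | 8643 | 44 | 1 | 6 | 1 | 36 | 254 | 45 | 198 | 11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5870 | WYOMING | Rock Springs | 22937 | 49 | 1 | 17 | 2 | 29 | 286 | 44 | 217 | 25 | 3 |
| 5871 | WYOMING | Sheridan | 18157 | 15 | 0 | 0 | 0 | 15 | 239 | 19 | 202 | 18 | 2 |
| 5872 | WYOMING | Thermopolis | 2747 | 2 | 0 | 0 | 0 | 2 | 25 | 2 | 23 | 0 | 0 |
| 5873 | WYOMING | Torrington | 6564 | 16 | 1 | 4 | 0 | 11 | 63 | 15 | 40 | 8 | 1 |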
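
The loading cell above repairs the merged state cells by forward-filling the whole DataFrame. The minimal sketch below shows what that does on a toy frame, and one possible variation that limits the fill to a single column. The `toy` frame, its `count` column, and the city labels are made-up illustrations, not UCR data, and the sketch assumes the state column is named `state`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking unmerged Excel cells (made-up values, not UCR data):
# the state appears only on the first row of each group, and one count
# is genuinely missing.
toy = pd.DataFrame({
    "state": ["ALABAMA", np.nan, "ALASKA", np.nan],
    "city": ["City A", "City B", "City C", "City D"],
    "count": [0, np.nan, 5, 2],
})

# Forward-filling every column repairs "state" but also fills the real gap
# in "count" with the value from the row above.
filled_all = toy.ffill()

# Forward-filling only "state" repairs the merged cells and leaves the
# missing count as NaN.
filled_state = toy.copy()
filled_state["state"] = filled_state["state"].ffill()

print(filled_all)
print(filled_state)
```

Whether the narrower fill matters for the 2021 file depends on whether any of its other columns contain blank cells, which is worth checking before relying on either approach.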
