# COGS 108 - Final Project

## Important

- ONE, and only one, member of your group should upload this notebook to TritonED. 
- Each member of the group will receive the same grade on this assignment. 
- Keep the file name the same: submit the file 'FinalProject.ipynb'.
- Upload only the .ipynb file to TED; do not upload any associated data. Make sure that any cells whose output you want graders to see have been executed.

## Group Members: Fill in the Student IDs of each group member here

Replace the lines below to list each person's full student ID, UCSD email, and full name.

-  
- 
- 
- 



# Part 0 - Importing the required packages

In [5]:
import pandas as pd # for use of dataframes
import numpy as np
import matplotlib.pyplot as plt # for plotting graphs 
import statsmodels.api as sm # for fitting an OLS model
import scipy.stats as stats # for access to tests for normal distribution
from scipy.stats import ttest_ind, normaltest # for access to t tests, normal test

# Part 1 - Introduction and Background

Chicago, otherwise known as 'The Windy City', has some of the worst crime rates in America. Part of this reputation is a statistical artifact: Chicago is the third-largest city in the U.S., so criminal activity of any type is more likely to be reported there. However, few people would dispute the real danger in some of the worse-off areas of Chicago.

We wanted to analyze crime in Chicago to help people in the city understand what problems exist and where they occur. In the end, we decided to ask the following question:

"Does the district in which a crime is committed, or the type of crime, more closely predict whether an arrest was made?"

Answering this question could prompt further analysis, e.g.:

1. If a particular crime is more likely to result in an arrest in one district than in another, why is that? Is it the police in that district? Is it the citizens? Or is it the geography?

2. If the type of crime affects whether an arrest is made, does that mean that we are more afraid of certain crimes than others? Is this a valid fear?
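
One rough way to make the question above concrete is to compare how much the arrest rate varies across districts with how much it varies across crime types. The sketch below uses a small made-up dataframe (the name `toy` and its values are hypothetical, not taken from the dataset), and the "spread" measure is only a crude descriptive comparison, not necessarily the method used in the analysis.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use the full crime dataframe.
toy = pd.DataFrame({
    'District': [1, 1, 2, 2, 3, 3],
    'Type': ['burglary', 'battery', 'burglary', 'battery', 'burglary', 'battery'],
    'Arrest': [False, True, False, True, False, True],
})

# Arrest rate within each district and within each crime type
rate_by_district = toy.groupby('District')['Arrest'].mean()
rate_by_type = toy.groupby('Type')['Arrest'].mean()

# A wider spread of rates suggests the grouping separates
# arrests from non-arrests more strongly (descriptive only).
spread_district = rate_by_district.max() - rate_by_district.min()
spread_type = rate_by_type.max() - rate_by_type.min()
print(spread_district, spread_type)
```

In this made-up example the arrest rate is identical in every district but differs completely between the two crime types, so type would be the stronger predictor. Real data will be far less clear-cut.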

# Part 2 - Data Description

We are using this dataset: https://www.kaggle.com/chicago/chicago-crime

This is Chicago crime data from 2001 to the present. It covers all reported instances of crime and records whether an arrest had been made as of the latest data release. Each row represents a crime, except for multiple homicides, which have a separate row for each victim.

Here are some basic stats:

In [29]:
crime = pd.read_csv('~/data/chicago_crime_data.csv')

In [7]:
crime.describe()

| | ID | Beat | District | Ward | Community Area | X Coordinate | Y Coordinate | Year | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 6802359.0 | 6802359.0 | 6802312.0 | 6187505.0 | 6188828.0 | 6740394.0 | 6740394.0 | 6802359.0 | 6740394.0 | 6740394.0 |
| mean | 6267879.0 | 1191.551 | 11.30251 | 22.68313 | 37.5862 | 1164517.0 | 1885721.0 | 2008.401 | 41.84201 | -87.6718 |
| std | 3063983.0 | 703.3569 | 6.945669 | 13.83246 | 21.54045 | 17166.83 | 32704.68 | 5.067564 | 0.08999481 | 0.0621243 |
| min | 634.0 | 111.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2001.0 | 36.61945 | -91.68657 |
| 25% | 3443594.0 | 622.0 | 6.0 | 10.0 | 23.0 | 1152934.0 | 1859189.0 | 2004.0 | 41.76892 | -87.71387 |
| 50% | 6254311.0 | 1111.0 | 10.0 | 22.0 | 32.0 | 1165991.0 | 1890586.0 | 2008.0 | 41.85537 | -87.66618 |
| 75% | 8909085.0 | 1731.0 | 17.0 | 34.0 | 58.0 | 1176352.0 | 1909313.0 | 2012.0 | 41.90686 | -87.62836 |
| max | 11589340.0 | 2535.0 | 31.0 | 50.0 | 77.0 | 1205119.0 | 1951622.0 | 2019.0 | 42.02291 | -87.52453 |


In [9]:
print('Total Number of Crimes:\t\t\t{0}'.format(len(crime)))
# note: .unique() treats NaN as a value, so this count includes missing districts
print('Total number of Unique Districts:\t{0}'.format(len(crime['District'].unique())))

Total Number of Crimes:			6802359
Total number of Unique Districts:	25
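
Because `District` has missing entries (its count in `describe()` is lower than the total row count), the figure of 25 may include NaN as one of the "unique" values. The sketch below illustrates the difference between `unique()` and `nunique()` on a small made-up Series (`districts` is a hypothetical stand-in for `crime['District']`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for crime['District'], with one missing value
districts = pd.Series([1.0, 2.0, 2.0, np.nan])

n_with_nan = len(districts.unique())  # NaN is returned as its own value
n_without_nan = districts.nunique()   # NaN is excluded by default (dropna=True)
print(n_with_nan, n_without_nan)
```

If the goal is to count actual districts, `nunique()` is probably the safer choice.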


# Part 3 - Data Cleaning and Preprocessing

This data is fairly clean: it is maintained by the Chicago Police Department, which has high organizational standards. Few values are invalid or unknown in the columns we use; for example, `District` is missing in only 47 of the 6,802,359 rows.

To clean the data, we will do the following (not necessarily in order):

1. Drop useless columns
2. Standardize the location description
3. Standardize the crime type: lowercase the labels and merge the non-criminal variants into one label
4. Remove NaN values

In [30]:
# first, drop the location-related columns (and Ward), which this analysis does not use
crime = crime.drop(
    ['Location Description', 'Community Area', 'Latitude', 'Longitude',
     'X Coordinate', 'Y Coordinate', 'Location', 'Ward'],
    axis=1,
)

In [31]:
def standardize_primary_type(string):
    # compile all non-criminal offenses into one label
    if string in ('NON-CRIMINAL (SUBJECT SPECIFIED)', 'NON - CRIMINAL', 'NON-CRIMINAL'):
        return 'non-criminal'
    if string == 'OTHER OFFENSE':
        return 'other'
    # rename "crim sexual assault" to just "sexual assault" to make it easier to read
    if string == 'CRIM SEXUAL ASSAULT':
        return 'sexual assault'
    # lowercase everything else so the labels are consistent
    return string.lower()

In [32]:
# Standardize the crime type
crime['Type'] = crime['Primary Type'].apply(standardize_primary_type)
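
On a dataframe with millions of rows, a row-by-row `apply` can be slow. As a possible alternative, the same standardization could be expressed with vectorized string operations. The sketch below is only an illustration on a few sample labels; the names `primary` and `special_cases` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw labels; the notebook would use crime['Primary Type'].
primary = pd.Series(['NON - CRIMINAL', 'OTHER OFFENSE', 'CRIM SEXUAL ASSAULT', 'BATTERY'])

# Lowercase everything first, then map the special cases to their standard labels.
# 'NON-CRIMINAL' needs no entry: lowercasing already yields 'non-criminal'.
special_cases = {
    'non-criminal (subject specified)': 'non-criminal',
    'non - criminal': 'non-criminal',
    'other offense': 'other',
    'crim sexual assault': 'sexual assault',
}
standardized = primary.str.lower().replace(special_cases)
print(standardized.tolist())
```

Unlike the function-based version, `.str.lower()` leaves missing values as NaN rather than raising an error.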

In [33]:
crime[crime.isnull().any(axis=1)]

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Arrest | Domestic | Beat | District | FBI Code | Year | Updated On | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 708934 | 11181991 | JA549530 | 12/14/2017 08:10:00 PM | 100XX W BRYN MAWR AVE | 0460 | BATTERY | SIMPLE | False | False | 1654 | NaN | 08B | 2017 | 05/04/2018 03:51:04 PM | battery |
| 1314608 | 4437079 | HL727137 | 11/10/2005 10:10:00 AM | 010XX E GRAND AVE | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | False | True | 1834 | NaN | 26 | 2005 | 02/10/2018 03:50:01 PM | other |
| 1813672 | 6376239 | HP461051 | 07/18/2008 08:00:00 PM | 010XX N SPRINGFIELD AVE | 1310 | CRIMINAL DAMAGE | TO PROPERTY | False | False | 1112 | NaN | 14 | 2008 | 02/10/2018 03:50:01 PM | criminal damage |
| 1821722 | 6420740 | HP503796 | 01/07/2007 05:00:00 AM | 030XX S ASHLAND AVE | 1140 | DECEPTIVE PRACTICE | EMBEZZLEMENT | False | False | 922 | NaN | 12 | 2007 | 02/10/2018 03:50:01 PM | deceptive practice |
| 1829184 | 6462999 | HP539483 | 08/23/2008 12:00:00 PM | 061XX S MICHIGAN AVE | 1122 | DECEPTIVE PRACTICE | COUNTERFEIT CHECK | False | False | 311 | NaN | 10 | 2008 | 02/10/2018 03:50:01 PM | deceptive practice |
| 1838356 | 6516041 | HP587362 | 09/22/2008 09:10:00 PM | 013XX N CALIFORNIA AVE | 0560 | ASSAULT | SIMPLE | False | False | 1423 | NaN | 08A | 2008 | 02/10/2018 03:50:01 PM | assault |
| 1841841 | 6534030 | HP607574 | 09/03/2008 12:00:00 PM | 003XX N WOLCOTT AVE | 0910 | MOTOR VEHICLE THEFT | AUTOMOBILE | False | False | 1333 | NaN | 07 | 2008 | 02/10/2018 03:50:01 PM | motor vehicle theft |
| 1846326 | 6555854 | HP629252 | 10/15/2008 04:10:00 PM | 021XX W NORTH AVE | 0610 | BURGLARY | FORCIBLE ENTRY | False | False | 1424 | NaN | 05 | 2008 | 02/10/2018 03:50:01 PM | burglary |
| 1847499 | 6562039 | HP633798 | 10/17/2008 10:39:00 PM | 025XX W HUTCHINSON ST | 0320 | ROBBERY | STRONGARM - NO WEAPON | False | False | 1912 | NaN | 03 | 2008 | 02/10/2018 03:50:01 PM | robbery |
| 1850390 | 6576124 | HP528527 | 08/21/2008 02:30:00 PM | 004XX N MC CLURG CT | 1152 | DECEPTIVE PRACTICE | ILLEGAL USE CASH CARD | False | False | 1834 | NaN | 11 | 2008 | 02/10/2018 03:50:01 PM | deceptive practice |
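
The rows listed above are all missing a `District`. Step 4 of the cleaning plan (removing NaN values) is not shown in the cells above; a minimal sketch of that step, assuming that only `District` needs to be complete for the analysis, could look like the following. The dataframe `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cleaned dataframe
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'District': [16.0, np.nan, 11.0],
    'Arrest': [False, True, False],
})

# Keep only rows with a known District; other columns are not checked
cleaned = toy.dropna(subset=['District'])
print(len(toy) - len(cleaned), 'row(s) dropped')
```

Restricting `dropna` to a `subset` avoids discarding rows that are missing values only in columns the analysis never reads.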


# Part 4 - Data Visualization

# Part 5 - Data Analysis and Results

# Part 6 - Privacy/Ethics Considerations

# Part 7 - Conclusions and Discussion