# COGS 108 - Final Project

## Important

- ONE, and only one, member of your group should upload this notebook to TritonED. 
- Each member of the group will receive the same grade on this assignment. 
- Keep the file name the same: submit the file 'FinalProject.ipynb'.
- Only upload the .ipynb file to TED; do not upload any associated data. Make sure that every cell whose output you want graders to see has been executed.

## Group Members: Fill in the Student IDs of each group member here

Replace the lines below to list each person's full student ID, UCSD email, and full name.

-  
- 
- 
- 



# Part 0 - Importing the required packages

In [1]:
import pandas as pd # for use of dataframes
import numpy as np
import matplotlib.pyplot as plt # for plotting graphs 
import statsmodels.api as sm # for fitting an OLS model
import scipy.stats as stats # for access to tests for normal distribution
from scipy.stats import ttest_ind, normaltest # for access to t tests, normal test

# Part 1 - Introduction and Background

Chicago, otherwise known as 'The Windy City', has some of the worst crime rates in America. Part of this is a statistical artifact: it is the third-largest city in the U.S., so any type of criminal activity is more likely to show up in the reported figures. However, few people would dispute that some of the worse-off areas of Chicago are genuinely dangerous.

We wanted to analyze the crime rates in Chicago in order to help people in the city understand what problems exist and where they are happening. In the end, we decided to ask the following question:

"Does the district a crime is committed in, or the type of crime, more closely predict whether an arrest was made?"

Answering this question could prompt further analysis, e.g.:

1. If a particular crime is more likely to result in an arrest in one district over the other, why is that? Is it the police in that district? Is it the citizens? Or is it the geography?

2. If the type of crime affects whether an arrest is made, does that mean that we are more afraid of certain crimes over others? Is this a valid fear? 
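
One possible way to make "more closely predict" concrete is to compare how much of the variation in the arrest outcome is explained by each grouping. The sketch below is only an illustration on invented toy rows, not the analysis used in this project; the helper name `correlation_ratio` and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy data (invented for illustration; not taken from the Chicago dataset).
toy = pd.DataFrame({
    'District': ['1', '1', '1', '1', '2', '2', '2', '2'],
    'Type': ['theft', 'theft', 'narcotics', 'narcotics',
             'theft', 'theft', 'narcotics', 'narcotics'],
    'Arrest': [0, 0, 1, 1, 0, 1, 1, 1],
})

def correlation_ratio(df, group_col, outcome_col='Arrest'):
    """Hypothetical helper: share of outcome variance explained by group means."""
    overall_mean = df[outcome_col].mean()
    total_ss = ((df[outcome_col] - overall_mean) ** 2).sum()
    # Replace each row's outcome with the mean of its group.
    group_means = df.groupby(group_col)[outcome_col].transform('mean')
    between_ss = ((group_means - overall_mean) ** 2).sum()
    return between_ss / total_ss

eta_district = correlation_ratio(toy, 'District')
eta_type = correlation_ratio(toy, 'Type')
print(eta_district, eta_type)
```

In this toy example the crime type explains more of the arrest variation than the district does. The real data has many more categories, so a proper comparison would also need to account for the number of groups.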

# Part 2 - Data Description

We are using this dataset: https://www.kaggle.com/chicago/chicago-crime

This is Chicago crime data from 2001 to the present. It covers all reported instances of crime and records whether an arrest had been made as of the last data release. Each row represents one crime, except for homicides with multiple victims, which have a separate row for each victim.
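
Because multi-victim homicides span several rows, the row count and the number of distinct incidents can differ. A minimal sketch of that distinction, on invented toy rows and under the assumption (not confirmed by the dataset description) that rows belonging to one incident share a `Case Number`:

```python
import pandas as pd

# Toy rows (invented); 'HY2' stands in for a homicide with two victims.
toy = pd.DataFrame({
    'Case Number': ['HY1', 'HY2', 'HY2', 'HY3'],
    'Primary Type': ['THEFT', 'HOMICIDE', 'HOMICIDE', 'BATTERY'],
})

n_rows = len(toy)                            # one per row, i.e. per victim
n_incidents = toy['Case Number'].nunique()   # one per distinct case number
print(n_rows, n_incidents)
```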

Here are some basic stats:

In [2]:
crime = pd.read_csv('~/data/CrimesSmall.csv')

In [3]:
crime.head()

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000092 | HY189866 | 03/18/2015 07:44:00 PM | 047XX W OHIO ST | 041A | BATTERY | AGGRAVATED: HANDGUN | STREET | False | False | ... | 28.0 | 25.0 | 04B | 1144606.0 | 1903566.0 | 2015 | 02/10/2018 03:50:01 PM | 41.891399 | -87.744385 | (41.891398861, -87.744384567) |
| 1 | 10000094 | HY190059 | 03/18/2015 11:00:00 PM | 066XX S MARSHFIELD AVE | 4625 | OTHER OFFENSE | PAROLE VIOLATION | STREET | True | False | ... | 15.0 | 67.0 | 26 | 1166468.0 | 1860715.0 | 2015 | 02/10/2018 03:50:01 PM | 41.773372 | -87.665319 | (41.773371528, -87.665319468) |
| 2 | 10000095 | HY190052 | 03/18/2015 10:45:00 PM | 044XX S LAKE PARK AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | False | True | ... | 4.0 | 39.0 | 08B | 1185075.0 | 1875622.0 | 2015 | 02/10/2018 03:50:01 PM | 41.813861 | -87.596643 | (41.81386068, -87.596642837) |
| 3 | 10000096 | HY190054 | 03/18/2015 10:30:00 PM | 051XX S MICHIGAN AVE | 0460 | BATTERY | SIMPLE | APARTMENT | False | False | ... | 3.0 | 40.0 | 08B | 1178033.0 | 1870804.0 | 2015 | 02/10/2018 03:50:01 PM | 41.800802 | -87.622619 | (41.800802415, -87.622619343) |
| 4 | 10000097 | HY189976 | 03/18/2015 09:00:00 PM | 047XX W ADAMS ST | 031A | ROBBERY | ARMED: HANDGUN | SIDEWALK | False | False | ... | 28.0 | 25.0 | 03 | 1144920.0 | 1898709.0 | 2015 | 02/10/2018 03:50:01 PM | 41.878065 | -87.743354 | (41.878064761, -87.743354013) |


In [4]:
crime.describe()

| | ID | Beat | District | Ward | Community Area | X Coordinate | Y Coordinate | Year | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 6802359.0 | 6802359.0 | 6802312.0 | 6187505.0 | 6188828.0 | 6740394.0 | 6740394.0 | 6802359.0 | 6740394.0 | 6740394.0 |
| mean | 6267879.0 | 1191.551 | 11.30251 | 22.68313 | 37.5862 | 1164517.0 | 1885721.0 | 2008.401 | 41.84201 | -87.6718 |
| std | 3063983.0 | 703.3569 | 6.945669 | 13.83246 | 21.54045 | 17166.83 | 32704.68 | 5.067564 | 0.08999481 | 0.0621243 |
| min | 634.0 | 111.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2001.0 | 36.61945 | -91.68657 |
| 25% | 3443594.0 | 622.0 | 6.0 | 10.0 | 23.0 | 1152934.0 | 1859189.0 | 2004.0 | 41.76892 | -87.71387 |
| 50% | 6254311.0 | 1111.0 | 10.0 | 22.0 | 32.0 | 1165991.0 | 1890586.0 | 2008.0 | 41.85537 | -87.66618 |
| 75% | 8909085.0 | 1731.0 | 17.0 | 34.0 | 58.0 | 1176352.0 | 1909313.0 | 2012.0 | 41.90686 | -87.62836 |
| max | 11589340.0 | 2535.0 | 31.0 | 50.0 | 77.0 | 1205119.0 | 1951622.0 | 2019.0 | 42.02291 | -87.52453 |


In [5]:
print('Total Number of Crimes:\t\t\t{0}'.format(len(crime)))
print('Total number of Unique Districts:\t{0}'.format(len(crime['District'].unique())))

Total Number of Crimes:			6802359
Total number of Unique Districts:	25
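
Note that the `describe()` output shows fewer non-null `District` values (6802312) than rows (6802359), so the column contains missing values. `Series.unique()` returns NaN as one of its values, which means the count above may include a "missing" entry alongside the real districts. A small sketch of the difference, on an invented toy series:

```python
import numpy as np
import pandas as pd

# Toy district column (invented) with one missing value.
districts = pd.Series([1.0, 2.0, 2.0, np.nan])

n_with_nan = len(districts.unique())   # NaN is returned as a distinct value
n_without_nan = districts.nunique()    # NaN is excluded by default
print(n_with_nan, n_without_nan)
```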


Some crime types appear far more often than others. We may later restrict the analysis to the most common types, so here are the top three:

In [6]:
crime['Primary Type'].value_counts().nlargest(3)

THEFT              1431614
BATTERY            1241949
CRIMINAL DAMAGE     776795
Name: Primary Type, dtype: int64
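
If the analysis is later restricted to the most common types, one way to do it is to take the index of the `value_counts()` result and filter with `isin`. The following is a sketch on invented toy rows; the variable names `top_types` and `top_only` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy rows (invented) using the dataset's column name.
toy = pd.DataFrame({
    'Primary Type': ['THEFT', 'THEFT', 'THEFT', 'BATTERY', 'BATTERY', 'ARSON'],
})

# Labels of the two most frequent types, most frequent first.
top_types = toy['Primary Type'].value_counts().nlargest(2).index

# Keep only the rows whose type is among the most frequent ones.
top_only = toy[toy['Primary Type'].isin(top_types)]
print(list(top_types), len(top_only))
```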

# Part 3 - Data Cleaning and Preprocessing

This data is fairly clean: it is maintained by the Chicago Police Department, and it contains very few invalid or unknown entries.

To clean the data, we will do the following (not necessarily in order):

1. Drop useless columns
2. Standardize the location description
3. Standardize the crime type to get rid of the uppercase, standardize the non-criminal type
4. Convert True to 1.0 and False to 0.0
5. Remove NaN values
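
As a rough illustration of how several of these steps fit together, the sketch below applies them to a few invented rows that reuse the dataset's column names. It is a simplified stand-in for the cells that follow (for example, it only lower-cases the type and does not merge the non-criminal labels).

```python
import numpy as np
import pandas as pd

# Toy rows (invented for illustration) with the dataset's column names.
toy = pd.DataFrame({
    'Primary Type': ['THEFT', 'NON - CRIMINAL', 'BATTERY'],
    'District': [11.0, 4.0, np.nan],
    'Arrest': [True, False, True],
    'Ward': [28.0, 4.0, 3.0],
})

cleaned = (
    toy.drop(columns=['Ward'])                        # step 1: drop unused columns
       .dropna(subset=['District', 'Primary Type'])   # step 5: remove rows with NaN
       .assign(
           Type=lambda d: d['Primary Type'].str.lower(),  # step 3 (simplified)
           Arrest=lambda d: d['Arrest'].astype(float),    # step 4: True -> 1.0
       )
)
print(cleaned)
```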

In [7]:
# first, drop the location description, community area, and other columns that this analysis does not use
crime = crime.drop(['Location Description', 'Block', 'Community Area', 'Latitude', 'Beat', 'Ward', 'Longitude', 'X Coordinate', 'Y Coordinate', 'Location'], axis=1)

In [8]:
def standardize_primary_type(string):
    """
    Author: James McDougall
    Param: string - the name of the Primary Type
    Returns: a lower-case str that represents a more standardized type
    """
    # compile all non-criminal offenses into one label
    if string in ('NON-CRIMINAL (SUBJECT SPECIFIED)', 'NON - CRIMINAL', 'NON-CRIMINAL'):
        return 'non-criminal'
    if string == 'OTHER OFFENSE':
        return 'other'
    # rename crim sexual assault to just sexual assault to make it easier to read
    if string == 'CRIM SEXUAL ASSAULT':
        return 'sexual assault'
    # everything else: lowercase it so we don't have to type in caps
    return string.lower()

Removing NaNs. Check that the two important columns don't have NaNs.

In [9]:
to_drop = crime[ crime['District'].isnull()]
print('Number of rows to drop with NaN district: ' + str(len(to_drop)))
crime = crime.drop(to_drop.index,axis=0)

to_drop = crime[crime['Primary Type'].isnull()]
print('Number of rows to drop with NaN Primary Type: ' + str(len(to_drop)))
crime = crime.drop(to_drop.index,axis=0)

Number of rows to drop with NaN district: 47
Number of rows to drop with NaN Primary Type: 0


To standardize the data, we do the following:
1. Standardize the type to be all lower case and merge similar crime types that are labeled inconsistently.
2. Convert district, which is a float, to a string, because district should be categorical, not numerical.
3. Convert the Arrest flag from a boolean to an integer.

In [10]:
crime['Type'] = crime['Primary Type'].apply(standardize_primary_type)
crime['District'] = crime['District'].astype(str)

def arrest_to_int(arrested):
    # map the boolean Arrest flag to 1 (arrest made) or 0 (no arrest)
    if arrested == True:
        return 1
    elif arrested == False:
        return 0
crime['Arrest'] = crime['Arrest'].apply(arrest_to_int)
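
One detail worth checking in the district conversion: calling `astype(str)` directly on a float column keeps the decimal part, so district 11 becomes the label `'11.0'`. If plain labels such as `'11'` are preferred, a possible variant is to go through `int` first, which works here because the NaN districts were already dropped. A sketch on an invented toy series:

```python
import pandas as pd

# Toy district values (invented), stored as floats like in the dataset.
district = pd.Series([11.0, 4.0, 25.0])

as_str = district.astype(str)                 # keeps the trailing '.0'
as_label = district.astype(int).astype(str)   # plain integer labels
print(list(as_str), list(as_label))
```

Either form works as a categorical label; the only requirement is that the same form is used consistently when filtering or grouping by district.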

# Part 4 - Data Visualization

# Part 5 - Data Analysis and Results

# Part 6 - Privacy/Ethics Considerations

The data is largely anonymous: it contains no names, only crime IDs and descriptions. The only potentially identifying information is the location. This could be an issue because some of the crimes occurred in apartments, so someone could conceivably use the location of a crime to find a person involved.

A potential ethical consideration is that we are publicizing the locations of high crime, which could be used to discriminate against communities. While this data is publicly available, we are organizing and formatting it in a way that can be used to show which districts have a high crime rate. People with ill intentions could use this to argue that no one should go near these areas, alienating those communities.

# Part 7 - Conclusions and Discussion