In [None]:
# Ref(install conda in google colab): https://medium.com/data-professor/how-to-install-conda-on-google-colab-e7bbf9036f76
# Ref(to install conda packages in google colab): https://towardsdatascience.com/conda-google-colab-75f7c867a522

In [1]:
#!pip install geopandas
#!pip install rtree

# **Crime Prediction & Forecasting towards 2022**

## **Business Case**
The Chicago State Police recently took part in the annual State Defense and Security Summit 2021 and identified the need to change their policies at the city level. The aims are to strengthen the prevention of serious crime in the city and to equip officers with the right skills and resources going forward.

As the first phase of modernising police work, they set up a new department, the PreCrime Task Force (PCTF), to represent the force and to inform and prepare frontline officers for potential crimes.

*‘Prevent, deter and detect crime’*


## **Problem Statement**
- To achieve this, we, the Data Forensic Scientists, have been engaged by the state police as the brains of the unit. Our role is to predict and forecast different types of crime, and we will be measured on the accuracy of those predictions.


## **Objectives**
 - Determine how a combination of machine learning algorithms can be used by the Chicago State Police to detect, prevent, and solve crimes more accurately and more quickly
 - Develop a model to predict and forecast monthly crime rates in order to anticipate the allocation of resources and equipment


### **Project Approach**
Our task is to identify known crimes and develop a model that predicts crime rates, so that state police officers can prepare accordingly.

The approach to this project is as follows:
1. Data acquisition from [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2)
2. Preliminary Analysis
3. Data Cleaning
4. Exploratory Data Analysis (EDA) 
5. Data Pre-processing
6. Modelling - Time Series & ML Regression
7. Conclusions & Recommendations






### Reference
- https://www.bestplaces.net/health/city/illinois/chicago
- https://www.nytimes.com/2021/06/16/upshot/murder-crime-trends-chicago.html
- https://en.wikipedia.org/wiki/Crime_in_Chicago#References



## **Preliminary Research**
Chicago's crime rate far exceeds the averages for both the United States and the state of Illinois in every category, from violent crimes such as murder and rape to more common crimes such as theft and assault.

Notably, Chicago's crime rates have trended gradually downward over the past decade, but the reasons are not known to either the government or the state police.

While Chicago is known for the violent and property crimes reported in the media, those percentage differences were measured for individual crime types. Other types of crime that go unmentioned may be greater in number and magnitude.

## **Summary Observations**
- The data contains surface-level information about each crime: its type and description, its location, and the mapping areas in which it was reported
- It contains a high number of null values, mainly in the location fields

### **Cleaning:**

Because every crime record carries important information, we imputed null values instead of removing rows:
- Latitude and longitude: geocoded from the block address
- District, ward and community area: derived from boundary data on the Chicago Data Portal
- Location description: null values assigned to an "other" category



In [None]:
# Import libraries and packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import re
import geopandas as gpd

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

pd.set_option('display.max_columns', None)

In [None]:
# Load crime data
df = pd.read_csv('drive/MyDrive/Colab Notebooks/capstone/assets/all_crimes.csv')
df.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10224738,HY411648,09/05/2015 01:30:00 PM,043XX S WOOD ST,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,924,9.0,12.0,61.0,08B,1165074.0,1875917.0,2015,02/10/2018 03:50:01 PM,41.815117,-87.67,"(41.815117282, -87.669999562)"
1,10224739,HY411615,09/04/2015 11:30:00 AM,008XX N CENTRAL AVE,870,THEFT,POCKET-PICKING,CTA BUS,False,False,1511,15.0,29.0,25.0,06,1138875.0,1904869.0,2015,02/10/2018 03:50:01 PM,41.89508,-87.7654,"(41.895080471, -87.765400451)"
2,11646166,JC213529,09/01/2018 12:01:00 AM,082XX S INGLESIDE AVE,810,THEFT,OVER $500,RESIDENCE,False,True,631,6.0,8.0,44.0,06,,,2018,04/06/2019 04:04:43 PM,,,
3,10224740,HY411595,09/05/2015 12:45:00 PM,035XX W BARRY AVE,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),SIDEWALK,True,False,1412,14.0,35.0,21.0,18,1152037.0,1920384.0,2015,02/10/2018 03:50:01 PM,41.937406,-87.71665,"(41.937405765, -87.716649687)"
4,10224741,HY411610,09/05/2015 01:00:00 PM,0000X N LARAMIE AVE,560,ASSAULT,SIMPLE,APARTMENT,False,True,1522,15.0,28.0,25.0,08A,1141706.0,1900086.0,2015,02/10/2018 03:50:01 PM,41.881903,-87.755121,"(41.881903443, -87.755121152)"
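
The full file holds several million rows, so declaring column types at load time can reduce memory use. A minimal sketch, assuming the column names shown above; the two-row inline CSV stands in for `all_crimes.csv`, and the dtype choices are suggestions rather than what this notebook ran.

```python
import io

import pandas as pd

# Two-row stand-in for all_crimes.csv (hypothetical sample, same column names).
sample_csv = io.StringIO(
    "ID,Case Number,Date,Primary Type,Arrest,District,Year\n"
    "10224738,HY411648,09/05/2015 01:30:00 PM,BATTERY,False,9.0,2015\n"
    "10224739,HY411615,09/04/2015 11:30:00 AM,THEFT,False,15.0,2015\n"
)

df_small = pd.read_csv(
    sample_csv,
    # Load only the columns that are needed downstream.
    usecols=["ID", "Case Number", "Date", "Primary Type", "Arrest", "District", "Year"],
    # Low-cardinality text becomes a category; small numbers get narrower types.
    dtype={"Primary Type": "category", "Year": "int16", "District": "float32"},
)
print(df_small.dtypes)
```

Categorical columns pay off most for fields such as `Primary Type`, where a small set of labels repeats across every row.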


---

# Data Dictionary

|S/N|Feature|Type|Dataset|Description|
|:---:|:---|:---|:---:|:---|
|1|ID|Int|df|Unique identifier for records|
|2|Case Number|Object|df|The Chicago Police Department RD Number (Records Division Number), which is unique to the incident|
|3|Date|Datetime|df|Date when the incident occurred (sometimes a best estimate)|
|4|Block|Object|df|The partially redacted address where the incident occurred, placing it on the same block as the actual address|
|5|IUCR|Object|df|The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e|
|6|Primary Type|Object|df|The primary description of the IUCR code|
|7|Description|Object|df|The secondary description of the IUCR code, a subcategory of the primary description|
|8|Location Description|Object|df|Description of the location where the incident occurred|
|9|Arrest|Boolean|df|Indicates whether the arrest was made|
|10|Domestic|Boolean|df|Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act|
|11|Beat|Int|df|Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74|
|12|District|Float|df|Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r|
|13|Ward|Float|df|The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76|
|14|Community Area|Float|df|Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6|
|15|FBI Code|Object|df|Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html|
|16|X Coordinate|Float|df|The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.|
|17|Y Coordinate|Float|df|The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.|
|18|Year|Int|df|Year the incident occurred|
|19|Updated On|Object|df|Date and time the record was last updated|
|20|Latitude|Float|df|The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block|
|21|Longitude|Float|df|The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block|
|22|Location|Object|df|The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block|

# Data Summary

In [None]:
# Data rows and columns
df.shape 

(7369258, 22)

In [None]:
# df summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7369258 entries, 0 to 7369257
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(1

In [None]:
# Check for null values
df.isnull().sum()

ID                           0
Case Number                  4
Date                         0
Block                        0
IUCR                         0
Primary Type                 0
Description                  0
Location Description      8410
Arrest                       0
Domestic                     0
Beat                         0
District                    47
Ward                    614837
Community Area          613481
FBI Code                     0
X Coordinate             73441
Y Coordinate             73441
Year                         0
Updated On                   0
Latitude                 73441
Longitude                73441
Location                 73441
dtype: int64

In [None]:
df.describe()

Unnamed: 0,ID,Beat,District,Ward,Community Area,X Coordinate,Y Coordinate,Year,Latitude,Longitude
count,7369258.0,7369258.0,7369211.0,6754421.0,6755777.0,7295817.0,7295817.0,7369258.0,7295817.0,7295817.0
mean,6706293.0,1187.754,11.29414,22.71887,37.55121,1164558.0,1885726.0,2009.271,41.84203,-87.67165
std,3318089.0,702.8956,6.946204,13.83099,21.53817,16857.35,32284.76,5.732384,0.08882348,0.06111225
min,634.0,111.0,1.0,1.0,0.0,0.0,0.0,2001.0,36.61945,-91.68657
25%,3633551.0,622.0,6.0,10.0,23.0,1152942.0,1859074.0,2004.0,41.76871,-87.71382
50%,6697120.0,1034.0,10.0,22.0,32.0,1166045.0,1890643.0,2009.0,41.85572,-87.666
75%,9606108.0,1731.0,17.0,34.0,57.0,1176363.0,1909222.0,2014.0,41.90668,-87.62828
max,12437200.0,2535.0,31.0,50.0,77.0,1205119.0,1951622.0,2021.0,42.02291,-87.52453
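
The summary shows a minimum latitude of about 36.62 and a minimum longitude of about -91.69, along with X/Y coordinates of 0, so some rows carry placeholder or erroneous positions. A minimal sketch for flagging them; the bounding box is an approximate assumption chosen for illustration, not an official city boundary, and `flag_out_of_bounds` is a hypothetical helper.

```python
import pandas as pd

# Approximate bounding box around Chicago (assumed values for illustration).
LAT_MIN, LAT_MAX = 41.6, 42.1
LON_MIN, LON_MAX = -88.0, -87.5


def flag_out_of_bounds(frame: pd.DataFrame) -> pd.Series:
    """Return True for rows whose coordinates are present but outside the box."""
    has_coords = frame["Latitude"].notna() & frame["Longitude"].notna()
    inside = (
        frame["Latitude"].between(LAT_MIN, LAT_MAX)
        & frame["Longitude"].between(LON_MIN, LON_MAX)
    )
    return has_coords & ~inside


demo = pd.DataFrame({
    "Latitude": [41.815117, 36.61945, None],
    "Longitude": [-87.67, -91.68657, None],
})
print(flag_out_of_bounds(demo).tolist())
```

Rows with missing coordinates are left unflagged here so they can be handled separately by the imputation step.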


In [None]:
# Change column names to lower and replace whitespace with _
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.columns

Index(['id', 'case_number', 'date', 'block', 'iucr', 'primary_type',
       'description', 'location_description', 'arrest', 'domestic', 'beat',
       'district', 'ward', 'community_area', 'fbi_code', 'x_coordinate',
       'y_coordinate', 'year', 'updated_on', 'latitude', 'longitude',
       'location'],
      dtype='object')

# Understanding Crime Data

## Missing Values

In [None]:
sorted(df['year'].unique())

[2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021]

In [None]:
# Check % of missing values (dataset)
def per_missing_values(df):
    for cols in df.columns:
        total_missing = df[cols].isnull().sum()
        if total_missing == 0:
            continue
        else:
            percentage = round((total_missing / df.shape[0]) * 100, 4)
            print (f'{cols} - {percentage}%')

# Check % of missing values per year (each year's nulls as a share of all rows in the dataset)
def per_missing_values_years(df):
    for year in sorted(df['year'].unique()):
        print (year)
        
        df_year = df[df['year'] == year]
        for cols in df_year.columns:
            total_missing = df_year[cols].isnull().sum()
            if total_missing == 0:
                continue
            else:
                percentage = round((total_missing / df.shape[0]) * 100, 4)
                print (f'{cols} - {percentage}%')
        print ('')

# Check total count of missing values (year)          
def count_missing_values_years(df):
    for year in sorted(df['year'].unique()):
        print (year)
        
        count_year = df[df['year'] == year]
        count_missing = count_year.isnull().sum()
        print (count_missing)
        print ('')

In [None]:
# Null values based on whole dataset
per_missing_values(df)

case_number - 0.0001%
location_description - 0.1141%
district - 0.0006%
ward - 8.3433%
community_area - 8.3249%
x_coordinate - 0.9966%
y_coordinate - 0.9966%
latitude - 0.9966%
longitude - 0.9966%
location - 0.9966%


In [None]:
# Null values based on years
per_missing_values_years(df)

2001
location_description - 0.0001%
ward - 6.5351%
community_area - 6.5082%
x_coordinate - 0.0398%
y_coordinate - 0.0398%
latitude - 0.0398%
longitude - 0.0398%
location - 0.0398%

2002
location_description - 0.0001%
ward - 1.8063%
community_area - 1.7999%
x_coordinate - 0.2068%
y_coordinate - 0.2068%
latitude - 0.2068%
longitude - 0.2068%
location - 0.2068%

2003
location_description - 0.0001%
ward - 0.0003%
community_area - 0.0007%
x_coordinate - 0.0533%
y_coordinate - 0.0533%
latitude - 0.0533%
longitude - 0.0533%
location - 0.0533%

2004
location_description - 0.0002%
district - 0.0%
ward - 0.0003%
community_area - 0.0009%
x_coordinate - 0.0299%
y_coordinate - 0.0299%
latitude - 0.0299%
longitude - 0.0299%
location - 0.0299%

2005
location_description - 0.0002%
district - 0.0%
ward - 0.0%
community_area - 0.0007%
x_coordinate - 0.052%
y_coordinate - 0.052%
latitude - 0.052%
longitude - 0.052%
location - 0.052%

2006
location_description - 0.0002%
district - 0.0%
ward - 0.0%
communi
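
The per-year function above divides by the row count of the whole dataset, so each figure is that year's contribution to the overall missing share. A vectorised sketch that reports both that figure and the share within each year's own rows; the toy frame and the names `missing_by_year`, `pct_of_dataset` and `pct_within_year` are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "year": [2001, 2001, 2001, 2002],
    "ward": [np.nan, np.nan, 5.0, 7.0],
    "latitude": [41.8, np.nan, 41.9, 41.7],
})

# Count missing values per column within each year.
missing_by_year = toy.drop(columns="year").isna().groupby(toy["year"]).sum()
year_sizes = toy.groupby("year").size()

# Share of the whole dataset (same denominator as the function above).
pct_of_dataset = (missing_by_year / len(toy) * 100).round(4)
# Share of that year's own rows.
pct_within_year = missing_by_year.div(year_sizes, axis=0).mul(100).round(4)

print(pct_of_dataset)
print(pct_within_year)
```

The within-year view makes it easier to see which years are affected most, independent of how many incidents each year contains.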

In [None]:
# Total count of null values
count_missing_values_years(df)

2001
id                           0
case_number                  0
date                         0
block                        0
iucr                         0
primary_type                 0
description                  0
location_description         4
arrest                       0
domestic                     0
beat                         0
district                     0
ward                    481586
community_area          479607
fbi_code                     0
x_coordinate              2932
y_coordinate              2932
year                         0
updated_on                   0
latitude                  2932
longitude                 2932
location                  2932
day                          0
month                        0
day_of_week                  0
time                         0
dtype: int64

2002
id                           0
case_number                  0
date                         0
block                        0
iucr                         0
primary_type   

In [None]:
df[['x_coordinate', 'y_coordinate', 'latitude', 'longitude', 'block', 'location']].head()

Unnamed: 0,x_coordinate,y_coordinate,latitude,longitude,block,location
0,1165074.0,1875917.0,41.815117,-87.67,043XX S WOOD ST,"(41.815117282, -87.669999562)"
1,1138875.0,1904869.0,41.89508,-87.7654,008XX N CENTRAL AVE,"(41.895080471, -87.765400451)"
2,,,,,082XX S INGLESIDE AVE,
3,1152037.0,1920384.0,41.937406,-87.71665,035XX W BARRY AVE,"(41.937405765, -87.716649687)"
4,1141706.0,1900086.0,41.881903,-87.755121,0000X N LARAMIE AVE,"(41.881903443, -87.755121152)"


In [None]:
df['location_description'].unique()

array(['RESIDENCE', 'CTA BUS', 'SIDEWALK', 'APARTMENT',
       'RESIDENCE-GARAGE', 'GROCERY FOOD STORE', 'STREET', nan,
       'PARKING LOT/GARAGE(NON.RESID.)', 'SMALL RETAIL STORE', 'OTHER',
       'VEHICLE NON-COMMERCIAL', 'RESTAURANT', 'RESIDENCE PORCH/HALLWAY',
       'ALLEY', 'POLICE FACILITY/VEH PARKING LOT', 'LIBRARY',
       'ATHLETIC CLUB', 'DRUG STORE', 'PARK PROPERTY',
       'CHA PARKING LOT/GROUNDS', 'NURSING HOME/RETIREMENT HOME',
       'DRIVEWAY - RESIDENTIAL', 'RESIDENTIAL YARD (FRONT/BACK)',
       'COMMERCIAL / BUSINESS OFFICE', 'DEPARTMENT STORE', 'HOTEL/MOTEL',
       'GAS STATION', 'BAR OR TAVERN',
       'CHURCH/SYNAGOGUE/PLACE OF WORSHIP', 'SPORTS ARENA/STADIUM',
       'CONSTRUCTION SITE', 'HOSPITAL BUILDING/GROUNDS', 'CTA STATION',
       'TAVERN/LIQUOR STORE', 'CHA HALLWAY/STAIRWELL/ELEVATOR',
       'CONVENIENCE STORE', 'WAREHOUSE', 'VACANT LOT/LAND',
       'CTA BUS STOP', 'CHA APARTMENT', 'TAXICAB', 'CTA TRAIN',
       'APPLIANCE STORE', 'BARBERSHOP', 'BAN

## Initial Pre-processing

- Engineer new columns from the date field

In [None]:
# Check duplicate data
df.duplicated().sum()

0
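
`df.duplicated()` only flags rows that match on every column. Since the data dictionary describes `ID` as a unique record identifier and `Case Number` as unique to the incident, a stricter check can look at those keys alone. A small sketch on made-up rows:

```python
import pandas as pd

rows = pd.DataFrame({
    "id": [1, 2, 3],
    "case_number": ["HY411648", "HY411648", "HY411615"],
    "updated_on": ["02/10/2018", "04/06/2019", "02/10/2018"],
})

# No two rows are identical across all columns.
full_row_dupes = rows.duplicated().sum()

# keep=False marks every member of a duplicated group, not just the repeats.
case_dupes = rows[rows.duplicated(subset=["case_number"], keep=False)]

print(full_row_dupes)
print(case_dupes)
```

Whether repeated case numbers are genuine duplicates or separate offences recorded under one incident is a judgment call that needs inspection of the matching rows.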

In [None]:
# Check date incident occurred and last updated on
df[['date', 'updated_on']].head()

Unnamed: 0,date,updated_on
0,09/05/2015 01:30:00 PM,02/10/2018 03:50:01 PM
1,09/04/2015 11:30:00 AM,02/10/2018 03:50:01 PM
2,09/01/2018 12:01:00 AM,04/06/2019 04:04:43 PM
3,09/05/2015 12:45:00 PM,02/10/2018 03:50:01 PM
4,09/05/2015 01:00:00 PM,02/10/2018 03:50:01 PM


In [None]:
# Change 'date' to datetime dtype
df['date'] = pd.to_datetime(df['date'])
df['date'].dtype

dtype('<M8[ns]')
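
Without a `format`, `pd.to_datetime` has to infer the layout, which can be slow on millions of rows. The strings shown above follow a month/day/year pattern with a 12-hour clock, so the format can be stated explicitly. A sketch, assuming every row uses that same layout:

```python
import pandas as pd

raw = pd.Series(["09/05/2015 01:30:00 PM", "09/01/2018 12:01:00 AM"])

# %I is the 12-hour clock hour and %p the AM/PM marker.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed)
```

An explicit format also raises an error on rows that do not match, which surfaces malformed dates early.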

In [None]:
# Create new columns day, month, day_of_week, time (hour of day)
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['time'] = df['date'].dt.strftime('%H')    # zero-padded hour as a string

In [None]:
# Checking date breakdown 
df[['date', 'day', 'month', 'year', 'day_of_week', 'time']].head()

Unnamed: 0,date,day,month,year,day_of_week,time
0,2015-09-05 13:30:00,5,9,2015,5,13
1,2015-09-04 11:30:00,4,9,2015,4,11
2,2018-09-01 00:01:00,1,9,2018,5,0
3,2015-09-05 12:45:00,5,9,2015,5,12
4,2015-09-05 13:00:00,5,9,2015,5,13
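
With a proper datetime column, the monthly counts that the forecasting objective calls for can be built by resampling. A minimal sketch on a few made-up timestamps; `monthly_counts` is an illustrative name.

```python
import pandas as pd

events = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-09-05 13:30:00",
        "2015-09-04 11:30:00",
        "2015-11-01 00:01:00",
    ]),
    "primary_type": ["BATTERY", "THEFT", "THEFT"],
})

# 'MS' labels each bin by month start; empty months appear with a count of 0.
monthly_counts = events.set_index("date").resample("MS").size()
print(monthly_counts)
```

Keeping empty months as explicit zeros matters for time series models, which generally expect a regular, gap-free index.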


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7369258 entries, 0 to 7369257
Data columns (total 26 columns):
 #   Column                Dtype         
---  ------                -----         
 0   id                    int64         
 1   case_number           object        
 2   date                  datetime64[ns]
 3   block                 object        
 4   iucr                  object        
 5   primary_type          object        
 6   description           object        
 7   location_description  object        
 8   arrest                bool          
 9   domestic              bool          
 10  beat                  int64         
 11  district              float64       
 12  ward                  float64       
 13  community_area        float64       
 14  fbi_code              object        
 15  x_coordinate          float64       
 16  y_coordinate          float64       
 17  year                  int64         
 18  updated_on            object        
 19  

In [None]:
df1 = df.copy()

# Cleaning & Transforming Chicago Crime Data (2010 - 2021)

## Filter & Divide Data (Individual Year)

- Because the raw dataset is large and computationally heavy to process, the data was split by year, transformed one year at a time, and then combined back into a single dataset

In [None]:
# Filter dataset based on year >= 2010
# reset_index() returns a new frame, so chain it and keep the result
df_ten = df1[df1['year'] >= 2010].reset_index(drop=True)
sorted(df_ten['year'].unique())

[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

In [None]:
df_ten.to_csv('drive/MyDrive/Colab Notebooks/capstone/assets/ten_years_crime.csv', index=False)

In [None]:
# Function - filter based on individual year and save as csv
def filter_data_year(year):
  # reset_index() returns a new frame, so chain it and keep the result
  df_filter_year = df1[df1['year'] == year].reset_index(drop=True)

  df_filter_year.to_csv(f'drive/MyDrive/Colab Notebooks/capstone/assets/{year}_crime.csv', index=False)

In [None]:
year_list = sorted(df_ten['year'].unique())

for yr in year_list:
  filter_data_year(yr)

print ('Successfully converted!')

Successfully converted!
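
Filtering `df1` once per year rescans the full frame each time. A `groupby` walks it once and yields each year's rows in turn. A sketch that writes to a temporary directory; the output folder is a placeholder for the Drive assets folder, and the file naming mirrors the pattern above.

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame({
    "year": [2009, 2010, 2010, 2021],
    "primary_type": ["THEFT", "BATTERY", "THEFT", "ASSAULT"],
})

out_dir = Path(tempfile.mkdtemp())  # stand-in for the Drive assets folder

written = []
# One pass over the filtered frame; groupby yields (year, rows) pairs in sorted order.
for year, year_rows in frame[frame["year"] >= 2010].groupby("year"):
    path = out_dir / f"{year}_crime.csv"
    year_rows.to_csv(path, index=False)
    written.append(path.name)

print(written)
```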


## Cleaning Functions

In [None]:
# 1) Function - replace null values with 0 for latitude, longitude and location
def replace_to_zero(df):
  # Assign the filled columns back; fillna(inplace=True) on a single column may not update df in newer pandas
  cols = ['latitude', 'longitude', 'location']
  df[cols] = df[cols].fillna(0)

# -----------------------------------------------------------------------------
# 2) Function - list of row labels with missing location data
def missing_row_loc(df):
  # Rows where replace_to_zero() left a 0 in 'location'.
  # Returning index labels keeps the list valid for the .loc lookups in impute_data()
  index_list = df.index[df['location'] == 0].tolist()

  print (len(index_list))
  return index_list

# -----------------------------------------------------------------------------
# 3) Function - strip the redacted house-number prefix (first 5 characters plus the space) from block
def clean_data(text):
    text = re.sub(r'[0-9]{3}[a-zA-Z0-9]{2}[\s]{1}', '', text)
    return text

# -----------------------------------------------------------------------------
# 4) Function - impute latitude, longitude and location
# Relies on the module-level `geocoder` (a geopy Nominatim instance) created before this is called
def impute_data(df, index_list):
  print (index_list)
  print('')

  for index in index_list:
    processed_data = clean_data(str(df['block'].loc[index]))
    print (processed_data)

    try:
      # Keep the lookup inside the try so timeouts and service errors are handled too
      g = geocoder.geocode(processed_data, timeout=10000)

      df.loc[index, 'latitude'] = g.latitude
      df.loc[index, 'longitude'] = g.longitude
      df.loc[index, 'location'] = f'({g.latitude}, {g.longitude})'

      print (df.loc[index, 'location'])
      print ('')
    except Exception:
      # No match (g is None) or lookup failure: mark the row as missing again
      df.loc[index, 'latitude'] = np.nan
      df.loc[index, 'longitude'] = np.nan
      df.loc[index, 'location'] = np.nan
  
# -----------------------------------------------------------------------------
# 5) Function - impute new values for ward

# Ref: https://stackoverflow.com/questions/61172069/extract-polygon-name-if-the-geo-point-is-inside-polygon
# Ward boundaries have been updated over time, so the ward recorded in the dataset reflects the boundaries at the time the crime was committed; we therefore re-derive the ward from present-day boundaries

def impute_ward(df):
  # Polygon data (GeoDataFrame)
  data_poly = gpd.read_file('/content/drive/MyDrive/Colab Notebooks/capstone/assets/ward.geojson')

  # Expects rows with non-null latitude/longitude (filtered before calling)
  # Convert DataFrame to GeoDataFrame
  gdf = gpd.GeoDataFrame(
      df, geometry=gpd.points_from_xy(df['longitude'], df['latitude']))

  # Spatial join: attach the ward polygon each point falls within
  # (newer geopandas releases name the `op` argument `predicate`)
  ward_gdf = gpd.sjoin(gdf, data_poly[['geometry', 'ward']], op='within')

  return ward_gdf

# -----------------------------------------------------------------------------
# 6) Function - impute new values for community_area

# Ref: https://stackoverflow.com/questions/61172069/extract-polygon-name-if-the-geo-point-is-inside-polygon
# As with wards, re-derive the community area from the boundary polygons so every row uses the same present-day geography

def impute_commareas(df):
  # Polygon data (GeoDataFrame)
  data_poly = gpd.read_file('/content/drive/MyDrive/Colab Notebooks/capstone/assets/comm_areas.geojson')

  # Expects rows with non-null latitude/longitude (filtered before calling)
  # Convert DataFrame to GeoDataFrame
  gdf = gpd.GeoDataFrame(
      df, geometry=gpd.points_from_xy(df['longitude'], df['latitude']))

  # Spatial join: attach the community area polygon each point falls within
  # (newer geopandas releases name the `op` argument `predicate`)
  commarea_gdf = gpd.sjoin(gdf, data_poly[['geometry', 'area_numbe']], op='within')

  return commarea_gdf

# -----------------------------------------------------------------------------
# 7) Function - impute new values for district

# Ref: https://stackoverflow.com/questions/61172069/extract-polygon-name-if-the-geo-point-is-inside-polygon
# As with wards, re-derive the district from the boundary polygons so every row uses the same present-day geography

def impute_district(df):
  # Polygon data (GeoDataFrame)
  data_poly = gpd.read_file('/content/drive/MyDrive/Colab Notebooks/capstone/assets/district.geojson')

  # Expects rows with non-null latitude/longitude (filtered before calling)
  # Convert DataFrame to GeoDataFrame
  gdf = gpd.GeoDataFrame(
      df, geometry=gpd.points_from_xy(df['longitude'], df['latitude']))

  # Spatial join: attach the district polygon each point falls within
  # (newer geopandas releases name the `op` argument `predicate`)
  district_gdf = gpd.sjoin(gdf, data_poly[['geometry', 'dist_num']], op='within')

  return district_gdf

# -----------------------------------------------------------------------------
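
A street name such as `W MADISON ST` exists in many cities, so a bare street query can resolve outside Chicago. One way to reduce that risk is to qualify the query with the city and reject results that fall outside a rough bounding box. A sketch under stated assumptions: `geocode_block`, the `, Chicago, IL, USA` suffix, and the box limits are illustrative choices, and `fake_geocode` stands in for a real geocoder so the example runs offline.

```python
import re
from collections import namedtuple

Point = namedtuple("Point", ["latitude", "longitude"])

# Rough Chicago bounding box (assumed limits, not an official boundary).
LAT_RANGE = (41.6, 42.1)
LON_RANGE = (-88.0, -87.5)


def strip_block_prefix(block):
    """Drop the redacted house-number prefix, e.g. '043XX S WOOD ST' -> 'S WOOD ST'."""
    return re.sub(r"^[0-9]{3}[a-zA-Z0-9]{2}\s", "", block)


def geocode_block(block, geocode):
    """Return (lat, lon) for a block, or None if lookup fails or lands outside the box.

    `geocode` is any callable that takes a query string and returns an object
    with .latitude and .longitude (or None), such as geopy's Nominatim.geocode.
    """
    query = f"{strip_block_prefix(block)}, Chicago, IL, USA"
    try:
        hit = geocode(query)
    except Exception:
        return None
    if hit is None:
        return None
    in_box = (LAT_RANGE[0] <= hit.latitude <= LAT_RANGE[1]
              and LON_RANGE[0] <= hit.longitude <= LON_RANGE[1])
    return (hit.latitude, hit.longitude) if in_box else None


def fake_geocode(query):
    """Offline stand-in for a geocoder: one in-box answer and one out-of-box answer."""
    table = {
        "S WOOD ST, Chicago, IL, USA": Point(41.815, -87.670),
        "W MADISON ST, Chicago, IL, USA": Point(41.448, -82.718),
    }
    return table.get(query)


print(geocode_block("043XX S WOOD ST", fake_geocode))
print(geocode_block("001XX W MADISON ST", fake_geocode))
print(geocode_block("0000X N LARAMIE AVE", fake_geocode))
```

With geopy, the callable could be `RateLimiter(Nominatim(user_agent=...).geocode, min_delay_seconds=1)`, which spaces out requests to the service. A street-level match is still only an approximation of the redacted block, so imputed coordinates are best treated as coarse.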

## Clean data (2021)

In [None]:
# from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent = 'crime_predictions')

In [None]:
df_2021 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2021_crime.csv')
df_2021.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,12260346,JE102126,2021-01-03 13:23:00,070XX S EGGLESTON AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,732,7.0,6.0,68.0,08B,1174496.0,1858251.0,2021,01/16/2021 03:49:23 PM,41.766435,-87.635964,"(41.766435144, -87.635963997)",3,1,6,13
1,12263464,JE105797,2021-01-03 06:59:00,080XX S YALE AVE,820,THEFT,$500 AND UNDER,RESIDENCE,False,False,623,6.0,17.0,44.0,06,1176011.0,1851718.0,2021,01/16/2021 03:49:23 PM,41.748474,-87.630607,"(41.748473982, -87.630606588)",3,1,6,6
2,12259990,JE101773,2021-01-03 00:20:00,056XX W WASHINGTON BLVD,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,1513,15.0,29.0,25.0,08B,1138722.0,1900183.0,2021,01/16/2021 03:49:23 PM,41.882224,-87.766076,"(41.88222427, -87.766076162)",3,1,6,0
3,12260669,JE102509,2021-01-03 20:47:00,057XX S RACINE AVE,2022,NARCOTICS,POSSESS - COCAINE,STREET,True,False,713,7.0,16.0,67.0,18,1169298.0,1866822.0,2021,01/16/2021 03:49:23 PM,41.790069,-87.654769,"(41.79006908, -87.654768679)",3,1,6,20
4,25702,JE102438,2021-01-03 20:09:00,068XX S STONY ISLAND AVE,110,HOMICIDE,FIRST DEGREE MURDER,STREET,False,False,332,3.0,5.0,43.0,01A,1188038.0,1860051.0,2021,01/10/2021 03:51:53 PM,41.771062,-87.586271,"(41.771062488, -87.586270811)",3,1,6,20


In [None]:
df_2021.isnull().sum()

id                        0
case_number               0
date                      0
block                     0
iucr                      0
primary_type              0
description               0
location_description    387
arrest                    0
domestic                  0
beat                      0
district                  0
ward                      7
community_area            0
fbi_code                  0
x_coordinate            695
y_coordinate            695
year                      0
updated_on                0
latitude                695
longitude               695
location                695
day                       0
month                     0
day_of_week               0
time                      0
dtype: int64

In [None]:
replace_to_zero(df_2021)

In [None]:
index_list = missing_row_loc(df_2021)

695
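
Filling missing locations with 0 and then scanning for 0 works, but it temporarily mixes numbers into a text column. The same list of rows can be taken directly from a null mask. A sketch on a toy frame; `missing_location_index` is an illustrative name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "block": ["043XX S WOOD ST", "082XX S INGLESIDE AVE", "035XX W BARRY AVE"],
    "location": ["(41.815117282, -87.669999562)", np.nan, "(41.937405765, -87.716649687)"],
})

# Index labels of rows with no location; these can be passed straight to .loc.
missing_location_index = toy.index[toy["location"].isna()].tolist()
print(len(missing_location_index), missing_location_index)
```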


In [None]:
impute_data(df_2021, index_list)

[9, 13, 15, 29, 31, 42, 46, 47, 62, 63, 66, 67, 68, 69, 70, 84, 85, 104, 108, 133, 140, 141, 142, 867, 868, 1300, 2972, 4483, 4569, 4946, 5023, 5188, 5241, 5373, 5486, 5497, 5572, 5704, 6067, 6138, 6216, 6247, 6431, 6640, 6740, 7059, 7656, 7866, 7990, 8182, 8237, 8279, 8468, 8526, 8564, 8581, 8726, 8820, 9018, 9230, 9231, 9487, 9543, 9628, 9695, 9790, 9806, 10027, 10195, 10238, 10627, 10764, 11182, 11238, 11350, 11564, 11844, 12069, 12151, 12190, 12237, 12238, 12266, 12357, 12372, 12444, 12485, 12535, 12789, 12969, 13082, 13127, 13478, 13514, 13860, 14039, 14505, 14572, 14644, 14656, 14733, 14766, 14822, 14852, 14882, 14917, 15254, 15264, 15846, 15850, 15951, 16033, 16538, 16711, 16738, 16751, 16847, 16851, 17462, 17826, 17859, 18062, 18075, 18119, 18149, 18286, 18288, 18289, 18297, 18306, 18614, 18744, 19819, 19928, 20235, 20370, 20680, 21135, 21312, 21941, 21945, 22225, 22255, 22304, 22341, 22494, 22649, 22679, 22820, 22958, 23508, 23675, 23844, 24181, 24407, 24747, 24972, 25116, 252

In [None]:
# Save Checkpoint 
df2021_geo = df_2021.copy()

In [None]:
# Filter location that is not null
df2021_geo = df2021_geo[df2021_geo['location'].notna()]

In [None]:
df2021_geo = impute_district(df2021_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326
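
The warning appears because the points are built without a coordinate reference system (`Left CRS: None`) while the boundary file is in EPSG:4326. Latitude and longitude are already expressed in that system, so declaring it when the points are created removes the mismatch. A sketch with a made-up square polygon in place of the boundary file; it also uses `predicate=`, the name recent geopandas releases give to the argument spelled `op=` in the functions above.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical boundary layer: one square "district" around the first point.
districts = gpd.GeoDataFrame(
    {"dist_num": ["9"]},
    geometry=[box(-87.70, 41.80, -87.60, 41.85)],
    crs="EPSG:4326",
)

crimes = pd.DataFrame({
    "id": [10224738, 10224739],
    "latitude": [41.815117, 41.895080],
    "longitude": [-87.670000, -87.765400],
})

# Declare the CRS up front so both sides of the join agree.
points = gpd.GeoDataFrame(
    crimes,
    geometry=gpd.points_from_xy(crimes["longitude"], crimes["latitude"]),
    crs="EPSG:4326",
)

# how="left" keeps points that fall outside every polygon (dist_num is NaN).
joined = gpd.sjoin(
    points, districts[["geometry", "dist_num"]], how="left", predicate="within"
)
print(joined[["id", "dist_num"]])
```

Note that `gpd.sjoin` defaults to `how="inner"`, which silently drops points that fall outside every polygon; `how="left"` is used here so unmatched rows stay visible.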



In [None]:
# Drop columns
df2021_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2021_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2021_geo = impute_ward(df2021_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2021_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2021_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2021_geo = impute_commareas(df2021_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2021_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2021_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Remaining clean-up of dataset after transformation
# Drop columns
df2021_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Assign null values in location description to 'OTHER (SPECIFY)'
df2021_geo['location_description'] = df2021_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2021_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2021_crime_clean.csv', index=False)

In [None]:
df2021_geo.isnull().sum()

id                        0
case_number               0
date                      0
block                     0
iucr                      0
primary_type              0
description               0
location_description    361
arrest                    0
domestic                  0
beat                      0
fbi_code                  0
x_coordinate            211
y_coordinate            211
year                      0
updated_on                0
latitude                  0
longitude                 0
location                  0
day                       0
month                     0
day_of_week               0
time                      0
geometry                  0
district                  0
ward                      0
community_area            0
dtype: int64
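
The counts above still list `x_coordinate`, `y_coordinate`, and 361 missing `location_description` values, which suggests this cell was run before the clean-up cell. A small check run just before saving can catch that kind of ordering slip. A sketch; `check_clean` and its arguments are illustrative.

```python
import numpy as np
import pandas as pd


def check_clean(frame, dropped_cols, no_null_cols):
    """Return a list of problems: columns that should be gone or still contain nulls."""
    problems = [
        f"{col} should have been dropped" for col in dropped_cols if col in frame.columns
    ]
    for col in no_null_cols:
        if col not in frame.columns:
            problems.append(f"{col} is missing from the frame")
            continue
        n_missing = int(frame[col].isna().sum())
        if n_missing > 0:
            problems.append(f"{col} has {n_missing} null values")
    return problems


toy = pd.DataFrame({
    "location_description": ["STREET", np.nan],
    "x_coordinate": [1174496.0, np.nan],
    "ward": ["6", "17"],
})

print(check_clean(
    toy,
    dropped_cols=["x_coordinate"],
    no_null_cols=["location_description", "ward"],
))
```

An empty list means the frame matches expectations and is safe to write out.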

## Clean data (2020)

In [None]:
# from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent = 'crime_predictions')

In [None]:
df_2020 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2020_crime.csv')
df_2020.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,12014684,JD189901,2020-03-17 21:30:00,039XX N LECLAIRE AVE,820,THEFT,$500 AND UNDER,STREET,False,False,1634,16.0,45.0,15.0,06,1141659.0,1925649.0,2020,03/25/2020 03:45:43 PM,41.952052,-87.75466,"(41.952051946, -87.754660372)",17,3,1,21
1,12012127,JD189186,2020-03-18 02:03:00,039XX W JACKSON BLVD,910,MOTOR VEHICLE THEFT,AUTOMOBILE,APARTMENT,False,True,1132,11.0,28.0,26.0,07,1150196.0,1898398.0,2020,03/25/2020 03:47:29 PM,41.87711,-87.72399,"(41.877110187, -87.723989719)",18,3,2,2
2,12012330,JD189367,2020-03-18 08:50:00,023XX N KEELER AVE,560,ASSAULT,SIMPLE,RESIDENCE,False,False,2525,25.0,35.0,20.0,08A,1147996.0,1915240.0,2020,03/25/2020 03:47:29 PM,41.923369,-87.731634,"(41.923368973, -87.731633833)",18,3,2,8
3,12014760,JD192130,2020-03-18 13:00:00,047XX W MONROE ST,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,OTHER (SPECIFY),False,False,1113,11.0,28.0,25.0,11,1144749.0,1899145.0,2020,03/25/2020 03:47:29 PM,41.879264,-87.743971,"(41.879264422, -87.743970898)",18,3,2,13
4,12012667,JD189808,2020-03-18 17:35:00,003XX S CICERO AVE,2017,NARCOTICS,MANUFACTURE / DELIVER - CRACK,SIDEWALK,True,False,1533,15.0,28.0,25.0,18,1144446.0,1898000.0,2020,03/25/2020 03:47:29 PM,41.876128,-87.745112,"(41.876128106, -87.745112291)",18,3,2,17


In [None]:
df_2020.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description    1206
arrest                     0
domestic                   0
beat                       0
district                   0
ward                       9
community_area             1
fbi_code                   0
x_coordinate            2659
y_coordinate            2659
year                       0
updated_on                 0
latitude                2659
longitude               2659
location                2659
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2020)

In [None]:
index_list = missing_row_loc(df_2020)

2659


In [None]:
impute_data(df_2020, index_list)

Streaming output truncated to the last 5000 lines.
(32.3874438, -80.5764049)

W NORTH SHORE AVE
(42.0027093, -87.744558)

E 56TH ST
(44.9015711, -93.2646052)

S KENWOOD AVE
(41.8060326, -87.5935411)

S INDIANA AVE
(41.5256039, -85.0627283)

N SHERIDAN RD
(40.3578419, -76.2273206)

W ADDISON ST
(42.226415, -84.406033)

W Foster Ave
(41.975528, -87.7281106)

W MADISON ST
(41.44819, -82.7183141)

W BELDEN AVE
(41.9187168, -87.9610749)

S STATE ST
(43.964385, -88.9429077)

S CICERO AVE
(41.8545413, -87.7443874)

N CAMPBELL AVE
(32.2527259, -110.9439424)

E 54TH PL
(39.851389, -86.050353)

S WENTWORTH AVE
(42.9981308, -87.8894434)

N MAGNOLIA AVE
(29.1969846, -82.1365046)

S FAIRFIELD AVE
(41.8828928, -88.0058916)

N MC CLURG CT
(30.332179, -82.799061)

N HOYNE AVE
(41.9550717, -87.6812061)

N MILWAUKEE AVE
(42.451391, -88.091261)

N MALDEN ST
(41.9687299, -87.6628804)

S COTTAGE GROVE AVE
(41.6856412, -87.6114137)

N WINTHROP AVE
(41.9873032, -87.6581226)

N JANSSEN AVE
(41.9
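
Several of the coordinates printed above fall far outside Chicago (for example `E 56TH ST` at `(44.9015711, -93.2646052)`), which suggests the street-name query matched a same-named street elsewhere. Below is a minimal sketch of a sanity check that could flag such results before they are written back. The bounding box values and the `in_chicago` helper are assumptions for illustration and are not part of the notebook's `impute_data`.

```python
# Rough bounding box around the City of Chicago (assumed, approximate values)
CHICAGO_BOUNDS = {
    "lat_min": 41.60, "lat_max": 42.10,
    "lon_min": -87.95, "lon_max": -87.50,
}

def in_chicago(lat, lon, bounds=CHICAGO_BOUNDS):
    """Return True if (lat, lon) lies inside the rough city bounding box."""
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lon_min"] <= lon <= bounds["lon_max"])

# Street/coordinate pairs copied from the cell output above
results = {
    "W NORTH SHORE AVE": (42.0027093, -87.744558),
    "E 56TH ST": (44.9015711, -93.2646052),
    "S KENWOOD AVE": (41.8060326, -87.5935411),
    "W BELDEN AVE": (41.9187168, -87.9610749),
}

# Collect the streets whose geocoded point falls outside the box
suspect = [street for street, (lat, lon) in results.items()
           if not in_chicago(lat, lon)]
print(suspect)
```

A box check is deliberately coarse; a point-in-polygon test against the city boundary would be stricter, at the cost of an extra spatial dependency.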

In [None]:
# Save Checkpoint 
df2020_geo = df_2020.copy()

In [None]:
# Keep only rows where location is not null
df2020_geo = df2020_geo[df2020_geo['location'].notna()]

In [None]:
df2020_geo = impute_district(df2020_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2020_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2020_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2020_geo = impute_ward(df2020_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2020_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2020_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2020_geo = impute_commareas(df2020_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2020_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2020_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Remaining cleanup after the spatial joins
# Drop the x/y coordinate columns
df2020_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2020_geo['location_description'] = df2020_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2020_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2020_crime_clean.csv', index=False)

In [None]:
df2020_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2019)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent = 'crime_predictions')

In [None]:
df_2019 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2019_crime.csv')
df_2019.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,11864018,JC476123,2019-09-24 08:00:00,022XX S MICHIGAN AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,COMMERCIAL / BUSINESS OFFICE,False,False,132,1.0,3.0,33.0,11,1177560.0,1889548.0,2019,10/20/2019 03:56:02 PM,41.852248,-87.623786,"(41.852248185, -87.623786256)",24,9,1,8
1,11859805,JC471592,2019-10-13 20:30:00,024XX W CHICAGO AVE,860,THEFT,RETAIL THEFT,GROCERY FOOD STORE,False,False,1221,12.0,26.0,24.0,06,1160005.0,1905256.0,2019,10/20/2019 04:03:03 PM,41.895732,-87.687784,"(41.895732399, -87.687784384)",13,10,6,20
2,11863808,JC476236,2019-10-05 18:30:00,0000X N LOOMIS ST,810,THEFT,OVER $500,RESIDENCE,False,False,1224,12.0,27.0,28.0,06,1166986.0,1900306.0,2019,10/20/2019 03:56:02 PM,41.882002,-87.662287,"(41.88200224, -87.662286977)",5,10,5,18
3,11859727,JC471542,2019-10-13 19:00:00,016XX W ADDISON ST,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,1922,19.0,47.0,6.0,14,1164930.0,1923972.0,2019,10/20/2019 04:03:03 PM,41.946987,-87.669164,"(41.946987144, -87.669163602)",13,10,6,19
4,11859656,JC471240,2019-10-13 14:10:00,051XX N BROADWAY,560,ASSAULT,SIMPLE,GAS STATION,False,False,2033,20.0,47.0,3.0,08A,1167380.0,1934505.0,2019,10/20/2019 04:03:03 PM,41.975838,-87.659854,"(41.975837637, -87.659853835)",13,10,6,14


In [None]:
df_2019.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description    1172
arrest                     0
domestic                   0
beat                       0
district                   0
ward                      15
community_area             0
fbi_code                   0
x_coordinate            1618
y_coordinate            1618
year                       0
updated_on                 0
latitude                1618
longitude               1618
location                1618
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2019)

In [None]:
index_list = missing_row_loc(df_2019)

1618


In [None]:
impute_data(df_2019, index_list)

[10, 18, 19, 23, 35, 37, 40, 43, 58, 69, 89, 92, 103, 121, 123, 125, 138, 160, 164, 188, 200, 218, 219, 224, 254, 276, 323, 336, 541, 542, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 603, 610, 618, 620, 622, 623, 624, 626, 627, 628, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 649, 653, 654, 658, 660, 662, 680, 729, 730, 737, 758, 779, 791, 812, 823, 1150, 1198, 1199, 1268, 1292, 1334, 1346, 1402, 1420, 1827, 1828, 1829, 1834, 1836, 1838, 1982, 2324, 2387, 2400, 2427, 2508, 2575, 2628, 2811, 2890, 2946, 2947, 3067, 3105, 3230, 3392, 3521, 3552, 3575, 3592, 3681, 3683, 3684, 3685, 3686, 3687, 3689, 3708, 3953, 4179, 4206, 4414, 4494, 4571, 4812, 4893, 5114, 5143, 5149, 5151, 5152, 5154, 5238, 5676, 5807, 5823, 5944, 5998, 6298, 13734, 24673, 25158, 29180, 2932

In [None]:
# Save Checkpoint 
df2019_geo = df_2019.copy()

In [None]:
# Keep only rows where location is not null
df2019_geo = df2019_geo[df2019_geo['location'].notna()]

In [None]:
df2019_geo = impute_district(df2019_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2019_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2019_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2019_geo = impute_ward(df2019_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2019_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2019_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2019_geo = impute_commareas(df2019_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2019_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2019_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Remaining cleanup after the spatial joins
# Drop the x/y coordinate columns
df2019_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2019_geo['location_description'] = df2019_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2019_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2019_crime_clean.csv', index=False)

In [None]:
df2019_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2018)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent = 'crime_predictions')

In [None]:
df_2018 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2018_crime.csv')
df_2018.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,11646166,JC213529,2018-09-01 00:01:00,082XX S INGLESIDE AVE,810,THEFT,OVER $500,RESIDENCE,False,True,631,6.0,8.0,44.0,06,,,2018,04/06/2019 04:04:43 PM,,,,1,9,5,0
1,11645648,JC212959,2018-01-01 08:00:00,024XX N MONITOR AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,2515,25.0,30.0,19.0,11,,,2018,04/06/2019 04:04:43 PM,,,,1,1,0,8
2,11645959,JC211511,2018-12-20 16:00:00,045XX N ALBANY AVE,2820,OTHER OFFENSE,TELEPHONE THREAT,RESIDENCE,False,False,1724,17.0,33.0,14.0,08A,,,2018,04/06/2019 04:04:43 PM,,,,20,12,3,16
3,11645557,JC212685,2018-04-01 00:01:00,080XX S VERNON AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,631,6.0,6.0,44.0,11,,,2018,04/06/2019 04:04:43 PM,,,,1,4,6,0
4,11646293,JC213749,2018-12-20 15:00:00,023XX N LOCKWOOD AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,False,2515,25.0,36.0,19.0,11,,,2018,04/06/2019 04:04:43 PM,,,,20,12,3,15


In [None]:
df_2018.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description    1022
arrest                     0
domestic                   0
beat                       0
district                   0
ward                       4
community_area             0
fbi_code                   0
x_coordinate            4972
y_coordinate            4972
year                       0
updated_on                 0
latitude                4972
longitude               4972
location                4972
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2018)

In [None]:
index_list = missing_row_loc(df_2018)

4972


In [None]:
impute_data(df_2018, index_list)

Streaming output truncated to the last 5000 lines.

W MORSE AVE
(42.0072486, -87.6859808)

S BURLEY AVE
(41.7376517, -87.5453086)

W SCHOOL ST
(33.3044468, -96.4028596)

W DRUMMOND PL
(41.9315035, -87.643809)

W DIVERSEY AVE
(41.9263942, -88.0053438)

N MARINE DR
(33.8164549, -117.8300018)

W MARQUETTE RD
(41.7726852, -87.6253898)

S ESCANABA AVE
(41.7482723, -87.55405)

S PULASKI RD
(41.6803208, -87.7198931)

N NORMANDY AVE
(32.3607574, -95.2956254)

W WASHINGTON BLVD
(43.0529509, -87.9892763)

E PEARSON ST
(43.053739, -87.900774)

W GRENSHAW ST
(41.8679092, -87.6630252)

S DR MARTIN LUTHER KING JR DR
(32.3874438, -80.5764049)

N OGDEN AVE
(41.2832412, -111.9678465)

S NORMAL AVE
(37.720105, -89.21769)

N DELPHIA AVE
(36.7559217, -119.7939473)

W DIVISION ST
(38.2597099, -87.9979169)

E 13TH ST
(40.1774606, -85.347364)

W 82ND PL
(41.4683852, -87.3613672)

W 47TH PL
(41.568397, -90.623596)

W ST
(55.8497133, -4.2656724)

N SACRAMENTO AVE
(39.341244, -74.483242)

W ST
(55
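
Queries such as `W ST` above resolve to `(55.8497133, -4.2656724)`, which is outside North America, presumably because the query carried neither a house number nor a city. Below is a minimal sketch of building a more specific query string from the `block` column, assuming the masked format seen in the data (for example `039XX N LECLAIRE AVE`). The `block_to_query` helper and the `Chicago, IL, USA` suffix are hypothetical and are not taken from the notebook's `impute_data`.

```python
import re

def block_to_query(block, city="Chicago, IL, USA"):
    """Turn a masked block such as '039XX N LECLAIRE AVE' into a geocoder query."""
    text = block.strip()
    # Leading house number, optionally masked with trailing X characters
    match = re.match(r"^(\d+X+|\d+)\s+(.*)$", text)
    if not match:
        # No leading house number: still append the city for context
        return f"{text}, {city}"
    number, street = match.groups()
    # Replace masked digits with 0 and drop leading zeros;
    # use 1 instead of 0 for the '0000X' block (an assumption)
    house_number = max(int(number.replace("X", "0")), 1)
    return f"{house_number} {street}, {city}"

print(block_to_query("039XX N LECLAIRE AVE"))
print(block_to_query("W ST"))
```

Anchoring every query to the city does not guarantee a correct match, but it removes the most obvious source of out-of-state results.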

In [None]:
# Save Checkpoint 
df2018_geo = df_2018.copy()

In [None]:
# Keep only rows where location is not null
df2018_geo = df2018_geo[df2018_geo['location'].notna()]

In [None]:
df2018_geo = impute_district(df2018_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2018_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2018_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2018_geo = impute_ward(df2018_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2018_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2018_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2018_geo = impute_commareas(df2018_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2018_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2018_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Remaining cleanup after the spatial joins
# Drop the x/y coordinate columns
df2018_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2018_geo['location_description'] = df2018_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2018_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2018_crime_clean.csv', index=False)

In [None]:
df2018_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2017)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent = 'crime_predictions')

In [None]:
df_2017 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2017_crime.csv')
df_2017.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,11227287,JB147188,2017-10-08 03:00:00,092XX S RACINE AVE,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,RESIDENCE,False,False,2222,22.0,21.0,73.0,2,,,2017,02/11/2018 03:57:41 PM,,,,8,10,6,3
1,11227583,JB147595,2017-03-28 14:00:00,026XX W 79TH ST,620,BURGLARY,UNLAWFUL ENTRY,OTHER,False,False,835,8.0,18.0,70.0,5,,,2017,02/11/2018 03:57:41 PM,,,,28,3,1,14
2,11227293,JB147230,2017-09-09 20:17:00,060XX S EBERHART AVE,810,THEFT,OVER $500,RESIDENCE,False,False,313,3.0,20.0,42.0,6,,,2017,02/11/2018 03:57:41 PM,,,,9,9,5,20
3,11227634,JB147599,2017-08-26 10:00:00,001XX W RANDOLPH ST,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,HOTEL/MOTEL,False,False,122,1.0,42.0,32.0,2,,,2017,02/11/2018 03:57:41 PM,,,,26,8,5,10
4,11227508,JB146365,2017-01-01 00:01:00,027XX S WHIPPLE ST,1754,OFFENSE INVOLVING CHILDREN,AGG SEX ASSLT OF CHILD FAM MBR,RESIDENCE,False,False,1033,10.0,12.0,30.0,2,,,2017,02/11/2018 03:57:41 PM,,,,1,1,6,0


In [None]:
df_2017.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description    1252
arrest                     0
domestic                   0
beat                       0
district                   1
ward                       1
community_area             0
fbi_code                   0
x_coordinate            3896
y_coordinate            3896
year                       0
updated_on                 0
latitude                3896
longitude               3896
location                3896
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2017)

In [None]:
index_list = missing_row_loc(df_2017)

3896


In [None]:
impute_data(df_2017, index_list)

Streaming output truncated to the last 5000 lines.

N SPRINGFIELD AVE
(41.9646307, -87.7254887)

S KEELER AVE
(36.7501253, -95.9789915)

E ST
(50.4738724, 4.2358174)

N AVERS AVE
(41.9628652, -87.724203)

W 65TH PL
(39.876, -86.192027)

S COLFAX AVE
(30.853988, -96.9914143)

N ROCKWELL ST
(33.3744844, -111.7302071)

S KARLOV AVE
(41.7139045, -87.7235728)

S COLFAX AVE
(30.853988, -96.9914143)

S DEARBORN ST
(47.5957569, -122.3197611)

S Stewart Ave
(28.290846, -81.4066339)

S INDIANA AVE
(41.5256039, -85.0627283)

N WASHTENAW AVE
(41.9253795, -87.6949927)

S MICHIGAN AVE
(42.59934, -83.9336293)

W Jackson Blvd
(37.377182, -89.6682716)

E MARQUETTE RD
(41.7727745, -87.6280678)

W WASHINGTON BLVD
(43.0529509, -87.9892763)

E PL
(14.8370417, -89.14104361631954)

E 67TH PL
(36.0649253, -95.9371847)

S SPAULDING AVE
(34.0288086, -118.3700242)

W ST
(55.8497133, -4.2656724)

N LAWNDALE AVE
(41.9592196, -87.720432)

S KEDVALE AVE
(41.6474182, -87.7222793)

W POTOMAC AVE
(43.0958

In [None]:
# Save Checkpoint 
df2017_geo = df_2017.copy()

In [None]:
# Keep only rows where location is not null
df2017_geo = df2017_geo[df2017_geo['location'].notna()]

In [None]:
df2017_geo = impute_district(df2017_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2017_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2017_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2017_geo = impute_ward(df2017_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2017_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2017_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2017_geo = impute_commareas(df2017_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2017_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2017_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Remaining cleanup after the spatial joins
# Drop the x/y coordinate columns
df2017_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2017_geo['location_description'] = df2017_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2017_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2017_crime_clean.csv', index=False)

In [None]:
df2017_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

In [None]:
df2017_geo['year'].head()

5       2017
172     2017
459     2017
2752    2017
8449    2017
Name: year, dtype: int64

## Clean data (2016)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent = 'crime_predictions')

In [None]:
df_2016 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2016_crime.csv')
df_2016.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,11645836,JC212333,2016-05-01 00:25:00,055XX S ROCKWELL ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,824,8.0,15.0,63.0,11,,,2016,04/06/2019 04:04:43 PM,,,,1,5,6,0
1,11043021,JA367631,2016-10-19 19:00:00,075XX S YATES BLVD,610,BURGLARY,FORCIBLE ENTRY,RESTAURANT,False,False,421,4.0,7.0,43.0,5,,,2016,08/05/2017 03:50:08 PM,,,,19,10,2,19
2,11243066,JB168427,2016-03-29 07:00:00,067XX S RIDGELAND AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,332,3.0,5.0,43.0,11,,,2016,03/01/2018 03:54:55 PM,,,,29,3,1,7
3,11243020,HZ184094,2016-03-11 23:00:00,052XX N ST LOUIS AVE,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,RESIDENCE PORCH/HALLWAY,False,False,1712,17.0,39.0,13.0,2,,,2016,03/01/2018 03:54:55 PM,,,,11,3,4,23
4,11227940,JB148122,2016-01-01 11:00:00,108XX S CALUMET AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,513,5.0,9.0,49.0,11,,,2016,02/12/2018 03:49:14 PM,,,,1,1,4,11


In [None]:
df_2016.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description    1252
arrest                     0
domestic                   0
beat                       0
district                   0
ward                       0
community_area             0
fbi_code                   0
x_coordinate            2348
y_coordinate            2348
year                       0
updated_on                 0
latitude                2348
longitude               2348
location                2348
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2016)

In [None]:
index_list = missing_row_loc(df_2016)

2348


In [None]:
impute_data(df_2016, index_list)

Streaming output truncated to the last 5000 lines.
W MONTROSE AVE
(41.9607197, -87.7466474)

W JACKSON BLVD
(37.377182, -89.6682716)

S SPAULDING AVE
(34.0288086, -118.3700242)

E 85TH PL
(36.0395364, -95.9347648)

S MICHIGAN AVE
(42.59934, -83.9336293)

W GLADYS AVE
(41.915226, -87.959294)

N KILDARE AVE
(41.9718061, -87.7355466)

N LAVERGNE AVE
(41.9090105, -87.9045527)

S KOLIN AVE
(41.6804633, -87.7282162)

N HOMAN AVE
(41.8919714, -87.711335)

S ROCKWELL ST
(33.3290958, -111.7293356)

W FULTON BLVD
(41.8864528, -87.7038266)

S BENNETT AVE
(39.047033, -82.645968)

W WASHINGTON BLVD
(43.0529509, -87.9892763)

W ST
(55.8497133, -4.2656724)

W ST
(55.8497133, -4.2656724)

S INDIANA AVE
(41.5256039, -85.0627283)

S DR MARTIN LUTHER KING JR DR
(32.3874438, -80.5764049)

S HOYNE AVE
(41.8228899, -87.6774807)

S FAIRFIELD AVE
(41.8828928, -88.0058916)

W 91ST ST
(41.7265995, -87.7948245)

W MARQUETTE RD
(41.7726852, -87.6253898)

E 53RD ST
(40.7580248, -73.9701634)

W GORDON

In [None]:
# Save Checkpoint 
df2016_geo = df_2016.copy()

In [None]:
# Keep only rows where location is not null
df2016_geo = df2016_geo[df2016_geo['location'].notna()]

In [None]:
df2016_geo = impute_district(df2016_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2016_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2016_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2016_geo = impute_ward(df2016_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2016_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2016_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2016_geo = impute_commareas(df2016_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2016_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2016_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Remaining cleanup after the spatial joins
# Drop the x/y coordinate columns
df2016_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2016_geo['location_description'] = df2016_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2016_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2016_crime_clean.csv', index=False)

In [None]:
df2016_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2015)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent = 'crime_predictions')

In [None]:
df_2015 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2015_crime.csv')
df_2015.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,10224738,HY411648,2015-09-05 13:30:00,043XX S WOOD ST,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,924,9.0,12.0,61.0,08B,1165074.0,1875917.0,2015,02/10/2018 03:50:01 PM,41.815117,-87.67,"(41.815117282, -87.669999562)",5,9,5,13
1,10224739,HY411615,2015-09-04 11:30:00,008XX N CENTRAL AVE,870,THEFT,POCKET-PICKING,CTA BUS,False,False,1511,15.0,29.0,25.0,06,1138875.0,1904869.0,2015,02/10/2018 03:50:01 PM,41.89508,-87.7654,"(41.895080471, -87.765400451)",4,9,4,11
2,10224740,HY411595,2015-09-05 12:45:00,035XX W BARRY AVE,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),SIDEWALK,True,False,1412,14.0,35.0,21.0,18,1152037.0,1920384.0,2015,02/10/2018 03:50:01 PM,41.937406,-87.71665,"(41.937405765, -87.716649687)",5,9,5,12
3,10224741,HY411610,2015-09-05 13:00:00,0000X N LARAMIE AVE,560,ASSAULT,SIMPLE,APARTMENT,False,True,1522,15.0,28.0,25.0,08A,1141706.0,1900086.0,2015,02/10/2018 03:50:01 PM,41.881903,-87.755121,"(41.881903443, -87.755121152)",5,9,5,13
4,10224742,HY411435,2015-09-05 10:55:00,082XX S LOOMIS BLVD,610,BURGLARY,FORCIBLE ENTRY,RESIDENCE,False,False,614,6.0,21.0,71.0,05,1168430.0,1850165.0,2015,02/10/2018 03:50:01 PM,41.744379,-87.658431,"(41.744378879, -87.658430635)",5,9,5,10


In [None]:
df_2015.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description     561
arrest                     0
domestic                   0
beat                       0
district                   0
ward                       2
community_area             0
fbi_code                   0
x_coordinate            6649
y_coordinate            6649
year                       0
updated_on                 0
latitude                6649
longitude               6649
location                6649
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2015)

In [None]:
index_list = missing_row_loc(df_2015)

6649


In [None]:
impute_data(df_2015, index_list)

Streaming output truncated to the last 5000 lines.

S HALSTED ST
(41.5420844, -87.6359687)

N LAKE SHORE DR
(28.905135, -81.763013)

W THOMAS ST
(43.230666, -75.469829)

S WABASH AVE
(34.1314901, -117.8625756)

S WESTERN AVE
(43.5185145, -96.751124)

W POTOMAC AVE
(43.0958445, -88.0105462)

S CALUMET AVE
(41.7034033, -86.871671)

S TORRENCE AVE
(41.6397957, -87.5592843)

W MAYPOLE AVE
(41.8835952, -87.7087902)

W 61ST ST
(44.89257, -93.315002)

S WELLS ST
(38.0067095, -89.2477996)

N CICERO AVE
(42.0121975, -87.7479383)

S WABASH AVE
(34.1314901, -117.8625756)

N SACRAMENTO AVE
(39.341244, -74.483242)

W 60TH ST
(44.89423, -93.301147)

N MAYFIELD AVE
(34.1105096, -117.2902727)

E BOWEN AVE
(46.8021128, -100.7877832)

S STONY ISLAND AVE
(41.772423, -87.5862745)

S LANGLEY AVE
(32.1874315, -110.8453564)

E 82ND ST
(41.5041895, -81.6308342)

N NEWCASTLE AVE
(41.9223699, -87.7970239)

W HARRISON ST
(47.453523, -116.785825)

S INDIANA AVE
(41.5256039, -85.0627283)

W PL
(39.04
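
Several of the coordinates printed above fall far outside Chicago (N LAKE SHORE DR, for example, resolves to roughly 28.9, -81.8), which suggests the geocoder matched same-named streets in other cities. Below is a minimal sketch of a sanity check that could run before an imputed coordinate is accepted. The bounding-box values are rough assumptions for the Chicago city limits, and `in_chicago_bbox` is a hypothetical helper that is not part of the notebook.

```python
# Approximate bounding box for Chicago (assumed values, not an official boundary)
CHICAGO_BBOX = {
    "lat_min": 41.60,
    "lat_max": 42.05,
    "lon_min": -87.95,
    "lon_max": -87.50,
}


def in_chicago_bbox(lat, lon, bbox=CHICAGO_BBOX):
    """Return True if (lat, lon) lies inside the rough Chicago bounding box."""
    return (bbox["lat_min"] <= lat <= bbox["lat_max"]
            and bbox["lon_min"] <= lon <= bbox["lon_max"])


# Coordinates taken from the geocoder output above
geocoded = {
    "W MAYPOLE AVE": (41.8835952, -87.7087902),
    "N LAKE SHORE DR": (28.905135, -81.763013),
    "S WABASH AVE": (34.1314901, -117.8625756),
}

# Keep only the matches that plausibly lie in Chicago
plausible = {street: xy for street, xy in geocoded.items() if in_chicago_bbox(*xy)}
print(plausible)
```

Rows whose geocoded point fails the check could be left as missing instead of being imputed with an out-of-state location.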

In [None]:
# Save checkpoint
df2015_geo = df_2015.copy()

In [None]:
# Keep only rows where location is not null
df2015_geo = df2015_geo[df2015_geo['location'].notna()]

In [None]:
df2015_geo = impute_district(df2015_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326
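
The `Left CRS: None` message indicates that the point layer built from the crime records carries no coordinate reference system when it is spatially joined with the boundary layer. The body of `impute_district` is not shown in this section, so the following is only a sketch under assumptions: the points are built from the `latitude` and `longitude` columns, the boundary layer is in EPSG:4326, and the polygons are made-up rectangles standing in for the real district shapes. It uses the `predicate` keyword of `geopandas.sjoin`, which is available in geopandas 0.10 and later.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical crime records with WGS84 coordinates
crimes = pd.DataFrame({
    "id": [1, 2],
    "latitude": [41.88, 41.75],
    "longitude": [-87.75, -87.65],
})

# Declare the CRS when building the point layer so it matches the boundary layer
points = gpd.GeoDataFrame(
    crimes,
    geometry=gpd.points_from_xy(crimes["longitude"], crimes["latitude"]),
    crs="EPSG:4326",
)

# Made-up rectangles standing in for district polygons (minx, miny, maxx, maxy)
districts = gpd.GeoDataFrame(
    {"dist_num": ["15", "6"]},
    geometry=[
        box(-87.80, 41.85, -87.70, 41.92),
        box(-87.70, 41.70, -87.60, 41.80),
    ],
    crs="EPSG:4326",
)

# Both layers now share a CRS, so the join runs without the mismatch warning
joined = gpd.sjoin(points, districts, how="left", predicate="within")
print(joined[["id", "dist_num"]])
```

Because the latitude and longitude values are already WGS84 degrees, declaring `crs="EPSG:4326"` at construction is enough here; `to_crs()` would only be needed if the two layers used different systems.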



In [None]:
# Drop columns
df2015_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2015_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2015_geo = impute_ward(df2015_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2015_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2015_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2015_geo = impute_commareas(df2015_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2015_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2015_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Final clean-up of the dataset after transformation
# Drop columns that are no longer needed
df2015_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2015_geo['location_description'] = df2015_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2015_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2015_crime_clean.csv', index=False)

In [None]:
df2015_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2014)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent='crime_predictions')

In [None]:
df_2014 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2014_crime.csv')
df_2014.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,10224788,HY410899,2014-03-27 00:01:00,035XX S MICHIGAN AVE,5111,OTHER OFFENSE,GUN OFFENDER: ANNUAL REGISTRATION,POLICE FACILITY/VEH PARKING LOT,True,False,213,2.0,3.0,35.0,26,1177772.0,1881665.0,2014,02/10/2018 03:50:01 PM,41.830612,-87.623247,"(41.830611847, -87.623247369)",27,3,3,0
1,11645601,JC212935,2014-06-01 00:01:00,087XX S SANGAMON ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,2222,22.0,21.0,71.0,11,,,2014,04/06/2019 04:04:43 PM,,,,1,6,6,0
2,10225562,HY412692,2014-12-29 08:00:00,035XX W 58TH PL,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,822,8.0,14.0,63.0,20,1153676.0,1865593.0,2014,02/10/2018 03:50:01 PM,41.787021,-87.712083,"(41.787020592, -87.712083182)",29,12,0,8
3,11028056,JA359834,2014-10-15 15:00:00,047XX S PULASKI RD,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,PARKING LOT/GARAGE(NON.RESID.),False,False,821,8.0,14.0,57.0,11,,,2014,07/24/2017 03:54:23 PM,,,,15,10,2,15
4,11227495,JB147292,2014-03-20 09:00:00,003XX E ERIE ST,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,1834,18.0,42.0,8.0,11,,,2014,02/11/2018 03:57:41 PM,,,,20,3,3,9


In [None]:
df_2014.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description     376
arrest                     0
domestic                   0
beat                       0
district                   0
ward                       2
community_area             0
fbi_code                   0
x_coordinate            1902
y_coordinate            1902
year                       0
updated_on                 0
latitude                1902
longitude               1902
location                1902
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2014)

In [None]:
index_list = missing_row_loc(df_2014)

1902


In [None]:
impute_data(df_2014, index_list)

Streaming output truncated to the last 5000 lines.

W 71ST ST
(39.882403, -86.227212)

W FILLMORE ST
(43.7105602, -92.2692278)

W VAN BUREN ST
(42.3609721, -85.8840144)

S ELLIS AVE
(35.309552, -78.616879)

W THOMAS ST
(43.230666, -75.469829)

W OHIO ST
(40.646058, -83.609337)

S LOOMIS BLVD
(41.7570599, -87.6587609)

W 40TH ST
(30.3041337, -97.7342249)

S COTTAGE GROVE AVE
(41.6856412, -87.6114137)

W 76TH ST
(33.9707255, -118.2915697)

N FRANCISCO AVE
(26.2217801, -98.3199932)

W 61ST ST
(44.89257, -93.315002)

S ST LAWRENCE AVE
(41.7682281, -87.6104913)

N FRANCISCO AVE
(26.2217801, -98.3199932)

S BURNHAM AVE
(41.6423104, -87.5396238)

S INDEPENDENCE BLVD
(36.8059, -76.116529)

N KILBOURN AVE
(41.9717611, -87.7405534)

S UNION AVE
(47.2491253, -122.4836389)

E 73RD ST
(32.0274706, -81.1097986)

N WALLER AVE
(41.8910824, -87.7677998)

S KOSTNER AVE
(41.6651721, -87.7289951)

S DANTE AVE
(41.5394098, -87.5855724)

S MICHIGAN AVE
(42.59934, -83.9336293)

W SUPERIOR ST
(40
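
The street names printed above carry no city or state, and several resolve to other states (N FRANCISCO AVE lands at about 26.2, -98.3). The body of `impute_data` is not shown in this section, so it is unclear how the query is built. If the bare street name is what gets sent, adding city context to the query is one possible remedy. The sketch below assumes a hypothetical helper, `geocode_in_chicago`, and uses a stand-in geocoder so it runs offline.

```python
from types import SimpleNamespace


def geocode_in_chicago(street, geocode):
    """Geocode a street name with explicit city context.

    `geocode` is any callable that takes a query string and returns an object
    with `latitude` and `longitude` attributes (or None), for example the
    `geocode` method of a geopy Nominatim instance.
    """
    query = f"{street}, Chicago, IL, USA"
    result = geocode(query)
    if result is None:
        return None
    return (result.latitude, result.longitude)


# Stand-in geocoder so the sketch runs offline; it records the queries it receives
seen_queries = []


def fake_geocode(query):
    seen_queries.append(query)
    return SimpleNamespace(latitude=41.9, longitude=-87.7)


coords = geocode_in_chicago("N FRANCISCO AVE", fake_geocode)
print(seen_queries[0])
print(coords)
```

With the notebook's `geocoder` object, the call would be `geocode_in_chicago(street, geocoder.geocode)`.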

In [None]:
# Save checkpoint
df2014_geo = df_2014.copy()

In [None]:
# Keep only rows where location is not null
df2014_geo = df2014_geo[df2014_geo['location'].notna()]

In [None]:
df2014_geo = impute_district(df2014_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2014_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2014_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2014_geo = impute_ward(df2014_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2014_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2014_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2014_geo = impute_commareas(df2014_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2014_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2014_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Final clean-up of the dataset after transformation
# Drop columns that are no longer needed
df2014_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2014_geo['location_description'] = df2014_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2014_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2014_crime_clean.csv', index=False)

In [None]:
df2014_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2013)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent='crime_predictions')

In [None]:
df_2013 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2013_crime.csv')
df_2013.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,11227517,JB138481,2013-02-10 00:00:00,071XX S LAFAYETTE AVE,266,CRIMINAL SEXUAL ASSAULT,PREDATORY,RESIDENCE,False,True,731,7.0,6.0,69.0,2,,,2013,08/30/2020 03:45:17 PM,,,,10,2,6,0
1,11042141,JA376559,2013-05-16 00:00:00,003XX W 64TH ST,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,HOSPITAL BUILDING/GROUNDS,False,False,722,7.0,20.0,68.0,11,,,2013,08/05/2017 03:50:08 PM,,,,16,5,3,0
2,11042759,JA376850,2013-07-08 16:10:00,056XX N KENMORE AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,2022,20.0,48.0,77.0,11,,,2013,08/05/2017 03:50:08 PM,,,,8,7,0,16
3,11042911,JA376915,2013-01-01 12:00:00,034XX N WESTERN AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,BANK,False,False,1921,19.0,47.0,5.0,11,,,2013,08/05/2017 03:50:08 PM,,,,1,1,1,12
4,11043823,JA378025,2013-08-11 17:00:00,004XX N CLARK ST,820,THEFT,$500 AND UNDER,RESTAURANT,False,False,1831,18.0,42.0,8.0,6,,,2013,08/06/2017 03:53:31 PM,,,,11,8,6,17


In [None]:
df_2013.isnull().sum()

id                         0
case_number                0
date                       0
block                      0
iucr                       0
primary_type               0
description                0
location_description     204
arrest                     0
domestic                   0
beat                       0
district                   0
ward                       3
community_area             4
fbi_code                   0
x_coordinate            1006
y_coordinate            1006
year                       0
updated_on                 0
latitude                1006
longitude               1006
location                1006
day                        0
month                      0
day_of_week                0
time                       0
dtype: int64

In [None]:
replace_to_zero(df_2013)

In [None]:
index_list = missing_row_loc(df_2013)

1006


In [None]:
impute_data(df_2013, index_list)

[0, 1, 2, 3, 4, 5, 6, 9, 11, 12, 15, 16, 20, 22, 23, 24, 29, 32, 33, 34, 36, 37, 42, 46, 53, 56, 89, 114, 117, 119, 127, 148, 149, 151, 160, 162, 164, 165, 166, 168, 169, 178, 180, 184, 186, 187, 188, 189, 190, 194, 196, 198, 199, 200, 201, 202, 203, 204, 206, 207, 208, 209, 210, 212, 214, 217, 218, 222, 225, 226, 227, 232, 233, 234, 235, 243, 249, 255, 256, 259, 260, 263, 264, 265, 266, 268, 271, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 285, 286, 287, 288, 289, 290, 291, 292, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 337, 338, 339, 342, 344, 345, 346, 349, 352, 353, 361, 362, 366, 372, 373, 374, 381, 382, 384, 389, 391, 398, 405, 411, 412, 413, 414, 415, 416, 417, 418, 419, 421, 422, 423, 424, 426, 428, 429, 430, 432, 433, 434, 435, 436, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 451, 452, 454, 455,

In [None]:
# Save checkpoint
df2013_geo = df_2013.copy()

In [None]:
# Keep only rows where location is not null
df2013_geo = df2013_geo[df2013_geo['location'].notna()]

In [None]:
df2013_geo = impute_district(df2013_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2013_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2013_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2013_geo = impute_ward(df2013_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2013_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2013_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2013_geo = impute_commareas(df2013_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2013_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2013_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Final clean-up of the dataset after transformation
# Drop columns that are no longer needed
df2013_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2013_geo['location_description'] = df2013_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2013_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2013_crime_clean.csv', index=False)

In [None]:
df2013_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2012)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent='crime_predictions')

In [None]:
df_2012 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2012_crime.csv')
df_2012.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,11645833,JC213044,2012-05-05 12:25:00,057XX W OHIO ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,1511,15.0,29.0,25.0,11,,,2012,04/06/2019 04:04:43 PM,,,,5,5,5,12
1,11227247,JB147078,2012-01-01 09:00:00,105XX S INDIANAPOLIS AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,432,4.0,10.0,52.0,11,,,2012,02/11/2018 03:57:41 PM,,,,1,1,6,9
2,10225605,HY412867,2012-07-11 09:00:00,017XX W ALBION AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,APARTMENT,False,False,2432,24.0,40.0,1.0,11,1163498.0,1943889.0,2012,02/10/2018 03:50:01 PM,42.00167,-87.673864,"(42.00167049, -87.673863642)",11,7,2,9
3,11228588,JB149037,2012-06-04 12:00:00,037XX W 85TH PL,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,834,8.0,18.0,70.0,11,,,2012,02/12/2018 03:49:14 PM,,,,4,6,0,12
4,10751224,HZ513641,2012-01-01 08:00:00,010XX S MAYFIELD AVE,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,1513,15.0,29.0,25.0,20,,,2012,07/27/2017 03:50:07 PM,,,,1,1,6,8


In [None]:
df_2012.isnull().sum()

id                        0
case_number               1
date                      0
block                     0
iucr                      0
primary_type              0
description               0
location_description    453
arrest                    0
domestic                  0
beat                      0
district                  0
ward                      7
community_area           26
fbi_code                  0
x_coordinate            731
y_coordinate            731
year                      0
updated_on                0
latitude                731
longitude               731
location                731
day                       0
month                     0
day_of_week               0
time                      0
dtype: int64

In [None]:
replace_to_zero(df_2012)

In [None]:
index_list = missing_row_loc(df_2012)

731


In [None]:
impute_data(df_2012, index_list)

[0, 1, 3, 4, 6, 8, 9, 10, 12, 13, 14, 15, 17, 18, 22, 26, 29, 45, 59, 61, 67, 70, 76, 87, 89, 92, 93, 97, 98, 99, 100, 101, 106, 110, 114, 116, 117, 118, 119, 121, 122, 123, 124, 126, 127, 128, 129, 132, 135, 136, 139, 140, 150, 157, 162, 163, 164, 165, 166, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 238, 242, 243, 248, 251, 253, 254, 255, 259, 263, 264, 265, 266, 269, 271, 272, 273, 274, 275, 277, 280, 281, 283, 284, 285, 287, 288, 289, 290, 291, 292, 296, 297, 300, 301, 302, 305, 326, 340, 345, 398, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 459, 460, 461, 462, 463, 464, 465, 467, 469, 470, 471, 476, 490, 493, 496, 497, 499, 502, 503, 504, 506, 507, 509, 510, 511, 687, 688, 691, 739, 8

In [None]:
# Save checkpoint
df2012_geo = df_2012.copy()

In [None]:
# Keep only rows where location is not null
df2012_geo = df2012_geo[df2012_geo['location'].notna()]

In [None]:
df2012_geo = impute_district(df2012_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2012_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2012_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2012_geo = impute_ward(df2012_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2012_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2012_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2012_geo = impute_commareas(df2012_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2012_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2012_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Final clean-up of the dataset after transformation
# Drop columns that are no longer needed
df2012_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2012_geo['location_description'] = df2012_geo['location_description'].fillna('OTHER (SPECIFY)')

# Drop the row with a null case_number (a single record is unlikely to be significant)
df2012_geo = df2012_geo.dropna(subset=['case_number'])
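
`DataFrame.dropna()` called without arguments removes a row if any column is null, whereas `subset=` restricts the check to the named columns. The toy frame below uses made-up values, not data from this project, to illustrate the difference when the intent is to remove only rows that lack a `case_number`.

```python
import pandas as pd

# Made-up frame: one row lacks case_number, another lacks an unrelated column
toy = pd.DataFrame({
    "case_number": ["HV100001", None, "HV100003"],
    "primary_type": ["THEFT", "BATTERY", "ASSAULT"],
    "notes": ["a", "b", None],
})

dropped_any = toy.dropna()                         # drops rows 1 and 2
dropped_case = toy.dropna(subset=["case_number"])  # drops row 1 only

print(len(dropped_any), len(dropped_case))
```

Restricting the check to `case_number` keeps the row count stable if other columns ever contain nulls at this stage.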

In [None]:
df2012_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2012_crime_clean.csv', index=False)

In [None]:
df2012_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2011)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent='crime_predictions')

In [None]:
df_2011 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2011_crime.csv')
df_2011.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,10225648,HY412901,2011-08-01 08:00:00,080XX S TRUMBULL AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,834,8.0,18.0,70.0,11,1154779.0,1851149.0,2011,02/09/2018 03:44:29 PM,41.747362,-87.708424,"(41.747362057, -87.708423712)",1,8,0,8
1,11042125,JA376558,2011-12-16 00:00:00,003XX W 64TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,HOSPITAL BUILDING/GROUNDS,False,False,722,7.0,20.0,68.0,11,,,2011,08/05/2017 03:50:08 PM,,,,16,12,4,0
2,11042582,JA377037,2011-01-01 00:01:00,054XX S CALIFORNIA AVE,1754,OFFENSE INVOLVING CHILDREN,AGG SEX ASSLT OF CHILD FAM MBR,APARTMENT,True,True,923,9.0,14.0,63.0,2,,,2011,08/13/2017 03:50:54 PM,,,,1,1,5,0
3,10230568,HY418187,2011-12-01 00:00:00,006XX N PARKSIDE AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,1511,15.0,29.0,25.0,11,1138569.0,1903659.0,2011,02/09/2018 03:44:29 PM,41.891766,-87.766554,"(41.891765632, -87.766553683)",1,12,3,0
4,10230575,HY418177,2011-10-01 00:01:00,008XX N PARKSIDE AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,1511,15.0,29.0,25.0,11,1138539.0,1904990.0,2011,02/09/2018 03:44:29 PM,41.895419,-87.766632,"(41.895418606, -87.766631575)",1,10,5,0


In [None]:
df_2011.isnull().sum()

id                        0
case_number               0
date                      0
block                     0
iucr                      0
primary_type              0
description               0
location_description    296
arrest                    0
domestic                  0
beat                      0
district                  0
ward                     14
community_area          181
fbi_code                  0
x_coordinate            600
y_coordinate            600
year                      0
updated_on                0
latitude                600
longitude               600
location                600
day                       0
month                     0
day_of_week               0
time                      0
dtype: int64

In [None]:
replace_to_zero(df_2011)

In [None]:
index_list = missing_row_loc(df_2011)

600


In [None]:
impute_data(df_2011, index_list)

[1, 2, 5, 8, 9, 10, 11, 18, 19, 22, 47, 59, 60, 69, 70, 71, 72, 73, 74, 75, 78, 80, 83, 84, 85, 86, 87, 89, 90, 91, 93, 94, 95, 97, 100, 106, 107, 109, 113, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 149, 150, 151, 152, 153, 154, 157, 158, 159, 165, 167, 171, 172, 175, 177, 178, 181, 184, 185, 186, 187, 188, 189, 190, 191, 193, 194, 195, 196, 197, 198, 199, 200, 201, 213, 230, 233, 259, 261, 263, 265, 284, 289, 403, 572, 625, 626, 627, 628, 629, 630, 631, 632, 635, 636, 637, 638, 640, 641, 642, 643, 644, 645, 646, 647, 648, 650, 652, 653, 654, 655, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 689, 690, 695, 696, 697, 703, 705, 708, 709, 710, 711, 712, 34381, 45452, 50261, 50263, 53287, 96592, 218872, 344239, 344241, 344386, 344387, 344388, 344389, 344393, 344396, 344397, 344398, 344399, 344400, 344401, 344403, 344404, 344405, 344406, 344407,

In [None]:
# Save checkpoint
df2011_geo = df_2011.copy()

In [None]:
# Keep only rows where location is not null
df2011_geo = df2011_geo[df2011_geo['location'].notna()]

In [None]:
df2011_geo = impute_district(df2011_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2011_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2011_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2011_geo = impute_ward(df2011_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2011_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2011_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2011_geo = impute_commareas(df2011_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2011_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2011_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Final clean-up of the dataset after transformation
# Drop columns that are no longer needed
df2011_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2011_geo['location_description'] = df2011_geo['location_description'].fillna('OTHER (SPECIFY)')


In [None]:
df2011_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2011_crime_clean.csv', index=False)

In [None]:
df2011_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Clean data (2010)

In [None]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(user_agent='crime_predictions')

In [None]:
df_2010 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2010_crime.csv')
df_2010.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,day,month,day_of_week,time
0,11042930,JA374409,2010-01-01 00:01:00,080XX S MARSHFIELD AVE,266,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,False,False,611,6.0,21.0,71.0,2,,,2010,08/05/2017 03:50:08 PM,,,,1,1,4,0
1,10230609,HY416556,2010-09-09 20:10:00,074XX S MARYLAND AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,323,3.0,6.0,69.0,11,1183191.0,1855830.0,2010,02/09/2018 03:44:29 PM,41.759594,-87.604169,"(41.759593809, -87.604169095)",9,9,3,20
2,11033011,JA365922,2010-01-01 12:00:00,053XX S HARPER AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,False,234,2.0,4.0,41.0,11,,,2010,07/28/2017 03:47:55 PM,,,,1,1,4,12
3,11649978,JC217323,2010-04-27 09:45:00,009XX W 72ND ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,733,7.0,6.0,68.0,11,,,2010,04/10/2019 04:14:14 PM,,,,27,4,1,9
4,11649965,JC217381,2010-11-09 03:30:00,045XX S CALUMET AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,215,2.0,3.0,38.0,11,,,2010,04/10/2019 04:14:14 PM,,,,9,11,1,3


In [None]:
df_2010.isnull().sum()

id                        0
case_number               1
date                      0
block                     0
iucr                      0
primary_type              0
description               0
location_description     68
arrest                    0
domestic                  0
beat                      0
district                  0
ward                     18
community_area          186
fbi_code                  0
x_coordinate            427
y_coordinate            427
year                      0
updated_on                0
latitude                427
longitude               427
location                427
day                       0
month                     0
day_of_week               0
time                      0
dtype: int64

In [None]:
replace_to_zero(df_2010)

In [None]:
index_list = missing_row_loc(df_2010)

427


In [None]:
impute_data(df_2010, index_list)

[0, 2, 3, 4, 5, 6, 13, 14, 18, 33, 35, 38, 47, 49, 51, 52, 54, 58, 59, 60, 67, 70, 71, 72, 73, 75, 77, 80, 83, 85, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121, 122, 123, 125, 126, 130, 134, 142, 143, 144, 148, 152, 153, 154, 155, 156, 158, 159, 160, 161, 163, 164, 166, 168, 169, 268, 285, 586, 587, 589, 590, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 604, 607, 608, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 631, 632, 633, 644, 646, 43270, 352579, 362529, 363199, 363941, 364406, 364407, 364409, 364411, 364412, 364413, 364414, 364415, 364416, 364419, 364427, 364429, 364430, 364432, 364434, 364436, 364438, 364439, 364441, 364446, 364447, 364448, 364452, 364461, 364462, 364464, 364465, 364474, 364479, 364481, 364486, 364487, 364490, 364497, 364498, 364499, 364500, 364504, 364505, 364510, 364511, 364513, 364514, 364515, 364522, 364526, 

In [None]:
# Save checkpoint
df2010_geo = df_2010.copy()

In [None]:
# Keep only rows where location is not null
df2010_geo = df2010_geo[df2010_geo['location'].notna()]

In [None]:
df2010_geo = impute_district(df2010_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2010_geo.drop(['district', 'index_right'], axis=1, inplace=True)

# Rename columns
df2010_geo.rename(columns={'dist_num':'district'}, inplace=True)

In [None]:
# Impute updated ward values
df2010_geo = impute_ward(df2010_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2010_geo.drop(['ward_left', 'index_right'], axis=1, inplace=True)

# Rename columns
df2010_geo.rename(columns={'ward_right':'ward'}, inplace=True)

In [None]:
# Impute updated community area values
df2010_geo = impute_commareas(df2010_geo)

Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: None
Right CRS: EPSG:4326



In [None]:
# Drop columns
df2010_geo.drop(['community_area', 'index_right'], axis=1, inplace=True)

# Rename columns
df2010_geo.rename(columns={'area_numbe':'community_area'}, inplace=True)

In [None]:
# Final clean-up of the dataset after transformation
# Drop columns that are no longer needed
df2010_geo.drop(['x_coordinate', 'y_coordinate'], axis=1, inplace=True)

# Fill null values in location_description with 'OTHER (SPECIFY)'
df2010_geo['location_description'] = df2010_geo['location_description'].fillna('OTHER (SPECIFY)')

# Drop the row with a null case_number (a single record is unlikely to be significant)
df2010_geo = df2010_geo.dropna(subset=['case_number'])

In [None]:
df2010_geo.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/2010_crime_clean.csv', index=False)

In [None]:
df2010_geo.isnull().sum()

id                      0
case_number             0
date                    0
block                   0
iucr                    0
primary_type            0
description             0
location_description    0
arrest                  0
domestic                0
beat                    0
fbi_code                0
year                    0
updated_on              0
latitude                0
longitude               0
location                0
day                     0
month                   0
day_of_week             0
time                    0
geometry                0
district                0
ward                    0
community_area          0
dtype: int64

## Transform & Load (Combined) Data

In [None]:
# Load all cleaned yearly datasets
df2021 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2021_crime_clean.csv')
df2020 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2020_crime_clean.csv')
df2019 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2019_crime_clean.csv')
df2018 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2018_crime_clean.csv')
df2017 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2017_crime_clean.csv')
df2016 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2016_crime_clean.csv')
df2015 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2015_crime_clean.csv')
df2014 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2014_crime_clean.csv')
df2013 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2013_crime_clean.csv')
df2012 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2012_crime_clean.csv')
df2011 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2011_crime_clean.csv')
df2010 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2010_crime_clean.csv')
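
The twelve reads above follow one naming pattern, so they could also be written as a loop. This is a minimal sketch, assuming every file is named `<year>_crime_clean.csv` inside a single folder. `load_clean_years` is a hypothetical helper, and the demo writes tiny invented files to a temporary folder rather than using the Drive path:

```python
import os
import tempfile

import pandas as pd


def load_clean_years(folder, years):
    """Read <year>_crime_clean.csv for each year and return {year: DataFrame}."""
    return {
        year: pd.read_csv(os.path.join(folder, f'{year}_crime_clean.csv'))
        for year in years
    }


# Demo only: create two small stand-in files in a temporary folder
demo_dir = tempfile.mkdtemp()
for year in (2010, 2011):
    pd.DataFrame({'id': [1, 2], 'year': [year, year]}).to_csv(
        os.path.join(demo_dir, f'{year}_crime_clean.csv'), index=False
    )

# Newest year first, matching the order used in the notebook
frames = load_clean_years(demo_dir, range(2011, 2009, -1))
print({year: len(df) for year, df in frames.items()})
```

With the real data the call would take the `clean_data` folder and `range(2021, 2009, -1)`, and `pd.concat(frames.values())` would then combine the years.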

In [None]:
df2017['year'].head()

0    2017
1    2017
2    2017
3    2017
4    2017
Name: year, dtype: int64

In [None]:
# Concatenate all yearly datasets into a single dataset for EDA
# Ref: https://stackoverflow.com/questions/53877687/how-can-i-concat-multiple-dataframes-in-python

crimelist = [df2021, df2020, df2019, df2018, df2017, df2016, df2015, df2014, df2013, df2012, df2011, df2010]  # List of yearly dataframes
df_crimes = pd.concat(crimelist)

In [None]:
# Preview the combined data with a fresh index
# (reset_index() returns a new DataFrame; df_crimes itself is not modified here)
df_crimes.reset_index()

Unnamed: 0,index,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,beat,fbi_code,year,updated_on,latitude,longitude,location,day,month,day_of_week,time,geometry,district,ward,community_area
0,0,12260346,JE102126,2021-01-03 13:23:00,070XX S EGGLESTON AVE,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,732,08B,2021,01/16/2021 03:49:23 PM,41.766435,-87.635964,"(41.766435144, -87.635963997)",3,1,6,13,POINT (-87.635963997 41.766435144),7,6,68
1,1,12284355,JE130457,2021-02-02 15:17:00,071XX S SANGAMON ST,0497,BATTERY,AGGRAVATED DOMESTIC BATTERY - OTHER DANGEROUS ...,RESIDENCE,False,True,733,04B,2021,02/09/2021 03:47:29 PM,41.764526,-87.648037,"(41.764526407, -87.648036798)",2,2,1,15,POINT (-87.64803679799999 41.764526407),7,6,68
2,2,12264568,JE107021,2021-01-08 15:06:00,070XX S SANGAMON ST,501A,OTHER OFFENSE,ANIMAL ABUSE / NEGLECT,RESIDENCE - PORCH / HALLWAY,False,False,733,26,2021,01/16/2021 03:49:23 PM,41.766363,-87.648087,"(41.766362642, -87.648086903)",8,1,4,15,POINT (-87.64808690299999 41.766362642),7,6,68
3,3,12264977,JE107479,2021-01-08 22:50:00,066XX S UNION AVE,0497,BATTERY,AGGRAVATED DOMESTIC BATTERY - OTHER DANGEROUS ...,STREET,False,True,723,04B,2021,01/16/2021 03:49:23 PM,41.773108,-87.642633,"(41.773108396, -87.64263345)",8,1,4,22,POINT (-87.64263345000001 41.773108396),7,6,68
4,4,12264092,JE106496,2021-01-07 23:53:00,067XX S GREEN ST,0453,BATTERY,AGGRAVATED POLICE OFFICER - OTHER DANGEROUS WE...,STREET,True,False,723,04B,2021,03/26/2021 04:58:11 PM,41.771796,-87.645797,"(41.771796387, -87.645796537)",7,1,3,23,POINT (-87.64579653700001 41.77179638699999),7,6,68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3263316,369095,9494365,HX148399,2010-03-23 09:00:00,052XX W 63RD ST,0841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,BANK,False,False,813,06,2010,02/04/2016 06:33:39 AM,41.778212,-87.753623,"(41.778211727, -87.753622799)",23,3,1,9,POINT (-87.753622799 41.778211727),8,23,64
3263317,369096,7299187,HS103020,2010-01-03 11:08:00,061XX S OAK PARK AVE,0890,THEFT,FROM BUILDING,APARTMENT,False,False,812,06,2010,02/10/2018 03:50:01 PM,41.780645,-87.791112,"(41.780645106, -87.791112009)",3,1,6,11,POINT (-87.791112009 41.780645106),8,23,64
3263318,369097,7461468,HS262174,2010-04-17 10:30:00,067XX W 64TH PL,0560,ASSAULT,SIMPLE,RESIDENCE,True,False,812,08A,2010,02/10/2018 03:50:01 PM,41.774817,-87.789541,"(41.774817017, -87.789540689)",17,4,5,10,POINT (-87.78954068899998 41.774817017),8,23,64
3263319,369098,7612740,HS417276,2010-07-18 15:00:00,071XX W 62ND ST,0810,THEFT,OVER $500,STREET,False,False,812,06,2010,02/10/2018 03:50:01 PM,41.779160,-87.799135,"(41.779159608, -87.799135466)",18,7,6,15,POINT (-87.79913546600002 41.779159608),8,23,64
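
Note that `reset_index()` returns a new DataFrame, and without `drop=True` it keeps the old per-year row labels as an `index` column, as in the preview above. Here is a minimal sketch on toy frames (invented values) of two ways to give the combined data a clean 0..n-1 index that persists:

```python
import pandas as pd

# Toy stand-ins for two yearly frames, each with its own 0..n-1 index
a = pd.DataFrame({'id': [10, 11], 'year': [2021, 2021]})
b = pd.DataFrame({'id': [20, 21], 'year': [2010, 2010]})

# Plain concat keeps each frame's original labels, so they repeat
combined = pd.concat([a, b])
print(combined.index.tolist())

# Option 1: assign the result back and drop the old labels
combined_reset = combined.reset_index(drop=True)

# Option 2: ask concat for a fresh index up front
combined_fresh = pd.concat([a, b], ignore_index=True)

print(combined_reset.index.tolist(), list(combined_reset.columns))
```

Either option leaves no extra `index` column behind, which matters if the frame is later filtered or joined by row label.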


In [None]:
df_crimes['year'].unique()

array([2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011,
       2010])

In [None]:
df_crimes.duplicated().sum()

0
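
With no arguments, `duplicated()` compares entire rows, so two records that share an `id` but differ in another column (for example `updated_on`) are not counted. A minimal sketch on invented data of an additional key-based check, assuming `id` is meant to be unique per record:

```python
import pandas as pd

# Toy data, invented for illustration: id 101 appears twice with different update stamps
toy = pd.DataFrame({
    'id': [101, 101, 102],
    'updated_on': ['02/04/2016', '02/10/2018', '02/10/2018'],
})

# Full-row comparison: the two id-101 rows differ, so nothing is flagged
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: the second id-101 row is flagged
id_dupes = int(toy.duplicated(subset='id').sum())

print(full_row_dupes, id_dupes)
```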

In [None]:
df_crimes.to_csv('/content/drive/MyDrive/Colab Notebooks/capstone/assets/clean_data/2010_2021_crime_clean.csv', index=False)