### Read in our raw data from the [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/data)
(This will take a while)

In [1]:
%%time
import requests
import time
import pandas as pd
import numpy as np

url = "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

cleaned_df_pt1 = pd.read_csv(url)

Wall time: 9min 26s
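
Since the full download takes several minutes, one option is to filter while reading instead of afterwards. The sketch below is an assumption, not what this notebook ran: it streams a CSV in chunks via the `chunksize` argument of `pd.read_csv` and keeps only rows at or after a chosen year. The names `load_filtered` and `min_year`, and the tiny in-memory CSV, are hypothetical and only there so the example runs without a download.

```python
import io

import pandas as pd


def load_filtered(source, min_year, chunksize=100_000):
    """Read a CSV in chunks and keep only rows with Year >= min_year."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Filter each chunk before storing it, so old rows never pile up in memory
        kept.append(chunk[chunk["Year"] >= min_year])
    return pd.concat(kept, ignore_index=True)


# Tiny stand-in for the real file (hypothetical rows)
sample_csv = io.StringIO(
    "ID,Primary Type,Year\n"
    "1,THEFT,2015\n"
    "2,BATTERY,2019\n"
    "3,ASSAULT,2021\n"
)
recent = load_filtered(sample_csv, min_year=2019, chunksize=2)
print(len(recent))
```

With the real URL in place of `sample_csv`, the download itself would likely take just as long, but peak memory use should stay closer to the size of the filtered result.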


In [2]:
# Get number of observations/rows
length_cleaned_df_pt1 = len(cleaned_df_pt1)
numbers_cleaned_df_pt1 = "{:,}".format(length_cleaned_df_pt1)
print(numbers_cleaned_df_pt1)

7,526,160
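
The count-and-format pattern above recurs several times in this notebook. A small helper could wrap it; the sketch below uses an f-string with the `,` format specifier, and `format_row_count` is a hypothetical name, not something defined elsewhere in the notebook.

```python
import pandas as pd


def format_row_count(df):
    """Return the number of rows in df as a string with thousands separators."""
    return f"{len(df):,}"


# Hypothetical demo frame; in the notebook this would be cleaned_df_pt1
demo_df = pd.DataFrame({"ID": range(12345)})
print(format_row_count(demo_df))
```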


### 7,526,160 observations is A LOT!
... too many for our future models and visuals to handle.

So let's reduce that by inspecting our data to see if we can:
 - keep only the relevant date range (2019-present), which still captures some pre-COVID data; the pandemic made such a huge impact on lives that we want a baseline to compare against
 - drop any unimportant or duplicate columns
 - drop any rows with null values
 - then, if it is still too large, take a random sample to get a more workable dataset

In [3]:
# View the data
cleaned_df_pt1.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10224738,HY411648,09/05/2015 01:30:00 PM,043XX S WOOD ST,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,...,12.0,61.0,08B,1165074.0,1875917.0,2015,02/10/2018 03:50:01 PM,41.815117,-87.67,"(41.815117282, -87.669999562)"
1,10224739,HY411615,09/04/2015 11:30:00 AM,008XX N CENTRAL AVE,870,THEFT,POCKET-PICKING,CTA BUS,False,False,...,29.0,25.0,06,1138875.0,1904869.0,2015,02/10/2018 03:50:01 PM,41.89508,-87.7654,"(41.895080471, -87.765400451)"
2,11646166,JC213529,09/01/2018 12:01:00 AM,082XX S INGLESIDE AVE,810,THEFT,OVER $500,RESIDENCE,False,True,...,8.0,44.0,06,,,2018,04/06/2019 04:04:43 PM,,,
3,10224740,HY411595,09/05/2015 12:45:00 PM,035XX W BARRY AVE,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),SIDEWALK,True,False,...,35.0,21.0,18,1152037.0,1920384.0,2015,02/10/2018 03:50:01 PM,41.937406,-87.71665,"(41.937405765, -87.716649687)"
4,10224741,HY411610,09/05/2015 01:00:00 PM,0000X N LARAMIE AVE,560,ASSAULT,SIMPLE,APARTMENT,False,True,...,28.0,25.0,08A,1141706.0,1900086.0,2015,02/10/2018 03:50:01 PM,41.881903,-87.755121,"(41.881903443, -87.755121152)"


In [4]:
# View columns & their datatypes
cleaned_df_pt1.dtypes

ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                     object
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                 object
X Coordinate            float64
Y Coordinate            float64
Year                      int64
Updated On               object
Latitude                float64
Longitude               float64
Location                 object
dtype: object

In [5]:
# Keep only relevant data: 2019 and later (Year > 2018)
cleaned_df_pt1 = cleaned_df_pt1[cleaned_df_pt1.Year>2018]
cleaned_df_pt1.Year.unique()

array([2020, 2019, 2021, 2022], dtype=int64)

In [6]:
# Get number of observations/rows
length_cleaned_df_pt1 = len(cleaned_df_pt1)
numbers_cleaned_df_pt1 = "{:,}".format(length_cleaned_df_pt1)
print(numbers_cleaned_df_pt1)

735,615



### What exactly do some of these columns mean?  
Take a look at the source's descriptions to determine if we can drop any unimportant or redundant columns: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2

| Column_Name | Data_Type | Description
| :--- | :--- | :--- |     
| ID | Number | Unique identifier for the record
| Case Number | Plain Text | The Chicago Police Department RD Number (Records Division Number), which is unique to the incident
| Date | Floating Timestamp | Date when the incident occurred. This is sometimes a best estimate
| Block | Plain Text | The partially redacted **address where the incident occurred**, placing it on the same block as the actual address
| IUCR | Plain Text | The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the [list of IUCR codes](https://data.cityofchicago.org/d/c7ck-438e)
| Primary Type | Plain Text | The primary description of the IUCR code
| Description | Plain Text | The secondary description of the IUCR code, a subcategory of the primary description
| Location Description | Plain Text | Description of the location where the incident occurred
| Arrest | Checkbox | Indicates whether an arrest was made
| Domestic | Checkbox | Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act
| Beat | Plain Text | Indicates the beat where the incident occurred. A beat is the **smallest police geographic area** – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the [beats](https://data.cityofchicago.org/d/aerh-rz74)
| District | Plain Text | Indicates the **police district** where the incident occurred. See the [districts](https://data.cityofchicago.org/d/fthy-xz3r)
| Ward | Number | The ward (**City Council district**) where the incident occurred. See the [wards](https://data.cityofchicago.org/d/sp34-6z76)
| Community Area | Plain Text | Indicates the community area where the incident occurred. Chicago has 77 community areas. See the [community areas](https://data.cityofchicago.org/d/cauq-8yn6)
| FBI Code | Plain Text | Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these [FBI code classifications](http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html)
| X Coordinate | Number | The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block
| Y Coordinate | Number | The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block
| Year | Number | Year the incident occurred
| Updated On | Floating Timestamp | Date and time the record was last updated
| Latitude | Number | The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block
| Longitude | Number | The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block
| Location | Location | The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block

### Drop Unimportant & Duplicate Columns: 
- Case Number (unnecessary for analysis)
- Block (we looked at the map and compared district, community area, ward, and block; block is too small a unit and also redundant)
- IUCR (the reporting code is unnecessary when we already have the description of the crime)
- Beat (same reasoning as Block: too small a unit and also redundant)
- Ward (redundant like Block, and too politically divided)
- FBI Code (same reasoning as IUCR)
- X Coordinate (shifted from the actual location; redundant with Latitude and Longitude)
- Y Coordinate (same as X Coordinate)
- Year (redundant: we already have a Date column, which can easily be grouped by year if necessary)
- Updated On (unnecessary because we don't need to know when a record was last updated for our analysis)
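
Dropping `Year` relies on the `Date` column being usable for grouping, but `Date` is loaded as plain text (`object`). A minimal sketch of how it could be parsed and grouped by year follows. The format string is inferred from the sample rows shown earlier (for example `03/17/2020 09:30:00 PM`), and the three-row frame is hypothetical demo data.

```python
import pandas as pd

# Hypothetical demo rows using the same timestamp layout as the dataset
demo_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Date": [
        "03/17/2020 09:30:00 PM",
        "09/24/2019 08:00:00 AM",
        "10/13/2019 08:30:00 PM",
    ],
})

# Parse the text into datetimes; %I with %p handles the 12-hour clock
demo_df["Date"] = pd.to_datetime(demo_df["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Count incidents per year without needing a separate Year column
per_year = demo_df.groupby(demo_df["Date"].dt.year).size()
print(per_year)
```

Passing an explicit `format` avoids per-row format inference, which tends to matter on a table of this size.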


In [7]:
# Make list of columns to drop
columns_to_drop = ['Case Number',
           'Block',
           'IUCR',
           'Beat',
           'Ward',
           'FBI Code',
           'X Coordinate',
           'Y Coordinate',
           'Year',
           'Updated On']

# Pass in list to df.drop function
cleaned_df_pt1 = cleaned_df_pt1.drop(columns_to_drop, axis='columns')

cleaned_df_pt1.head()

Unnamed: 0,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,District,Community Area,Latitude,Longitude,Location
90,12014684,03/17/2020 09:30:00 PM,THEFT,$500 AND UNDER,STREET,False,False,16.0,15.0,41.952052,-87.75466,"(41.952051946, -87.754660372)"
183,11864018,09/24/2019 08:00:00 AM,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,COMMERCIAL / BUSINESS OFFICE,False,False,1.0,33.0,41.852248,-87.623786,"(41.852248185, -87.623786256)"
235,11859805,10/13/2019 08:30:00 PM,THEFT,RETAIL THEFT,GROCERY FOOD STORE,False,False,12.0,24.0,41.895732,-87.687784,"(41.895732399, -87.687784384)"
420,12012127,03/18/2020 02:03:00 AM,MOTOR VEHICLE THEFT,AUTOMOBILE,APARTMENT,False,True,11.0,26.0,41.87711,-87.72399,"(41.877110187, -87.723989719)"
446,11863808,10/05/2019 06:30:00 PM,THEFT,OVER $500,RESIDENCE,False,False,12.0,28.0,41.882002,-87.662287,"(41.88200224, -87.662286977)"


### Drop Rows with Null values
...because they don't play nice when analyzing.

In [8]:
# Are there any null values in our dataset?

for column in cleaned_df_pt1.columns:
    print(f"Column {column} has {cleaned_df_pt1[column].isnull().sum()} null values")

Column ID has 0 null values
Column Date has 0 null values
Column Primary Type has 0 null values
Column Description has 0 null values
Column Location Description has 3387 null values
Column Arrest has 0 null values
Column Domestic has 0 null values
Column District has 0 null values
Column Community Area has 1 null values
Column Latitude has 9749 null values
Column Longitude has 9749 null values
Column Location has 9749 null values
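
The per-column counts overlap, because a single row can be null in several columns at once, so they do not add up to the number of rows that `dropna()` removes. A sketch of how to count affected rows directly, using a hypothetical four-row frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: one row missing a description, one missing coordinates
demo_df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Location Description": ["STREET", None, "APARTMENT", "RESIDENCE"],
    "Latitude": [41.95, 41.85, np.nan, 41.88],
    "Longitude": [-87.75, -87.62, np.nan, -87.66],
})

# Nulls per column, computed in one vectorized call instead of a loop
null_counts = demo_df.isnull().sum()

# Rows with at least one null: exactly the rows dropna() would remove
rows_with_any_null = int(demo_df.isnull().any(axis=1).sum())
share_lost = rows_with_any_null / len(demo_df)

print(null_counts)
print(f"{rows_with_any_null} of {len(demo_df)} rows ({share_lost:.1%}) contain a null")
```

Here the column counts sum to 3, but only 2 rows are affected, since the row missing `Latitude` is also missing `Longitude`.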


In [9]:
# Drop all null rows
cleaned_df_pt1 = cleaned_df_pt1.dropna()

# Count the number of observations/rows remaining
length_cleaned_df_pt1 = len(cleaned_df_pt1)
numbers_cleaned_df_pt1 = "{:,}".format(length_cleaned_df_pt1)
print(numbers_cleaned_df_pt1)

723,136


### 723,136 Observations is STILL A LOT!
Let's take a sample so the data is easier to work with while still holding enough observations for analysis.

In [10]:
# Shrink our dataset to save time and money when running models: create a sample dataset
cleaned_df_pt1 = cleaned_df_pt1.sample(n=50000)

In [11]:
# Count the number of observations/rows remaining
length_cleaned_df_pt1 = len(cleaned_df_pt1)
numbers = "{:,}".format(length_cleaned_df_pt1)
print(numbers)

50,000
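
As written, `sample(n=50000)` draws a different set of rows on every run, so downstream results will shift each time the notebook is re-executed. If reproducibility matters, `DataFrame.sample` accepts a `random_state` seed. The sketch below shows the effect on a hypothetical demo frame; the seed value 42 is arbitrary and not something the original notebook used.

```python
import pandas as pd

# Hypothetical demo frame standing in for cleaned_df_pt1
demo_df = pd.DataFrame({"ID": range(1000)})

# The same seed yields the same rows, in the same order, on every run
sample_a = demo_df.sample(n=50, random_state=42)
sample_b = demo_df.sample(n=50, random_state=42)

print(sample_a.equals(sample_b))
```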


### Save the sample cleaned data to a csv called "Chicago_Crimes_cleaned_pt1"
...so it can be referenced for the rest of the cleaning. If adjustments are ever made to the cleaning, we don't have to wait for the 7 million rows to load into our Jupyter notebook with each tweak.

In [12]:
# Save smaller cleaned data to a csv
cleaned_df_pt1.to_csv("Resources/Chicago_Crimes_cleaned_pt1.csv", index=False)
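
Two details of the save step are worth a quick check. First, `to_csv` does not create missing folders, so `Resources/` has to exist beforehand. Second, `index=False` keeps the old row index from coming back as an extra `Unnamed: 0` column on reload. The sketch below demonstrates the round trip in a temporary directory with hypothetical demo rows, so it does not touch the real `Resources` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical demo rows with a non-default index, like the sampled data
demo_df = pd.DataFrame(
    {
        "ID": [12014684, 11864018],
        "Primary Type": ["THEFT", "DECEPTIVE PRACTICE"],
        "Arrest": [False, False],
    },
    index=[90, 183],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "Resources")
    os.makedirs(out_dir, exist_ok=True)  # to_csv will not create this folder itself
    path = os.path.join(out_dir, "Chicago_Crimes_cleaned_pt1.csv")

    demo_df.to_csv(path, index=False)  # index=False: do not write the row index
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```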