
# Project Description 

In this project, we will explore the Chicago Crime dataset and implement a relational database to store the data. The key tasks for this project are as follows:

1. Identify the features (attributes) in the Chicago Crime dataset and design an entity-relationship model
2. Refine the model and convert each relation to 3NF (if required)
3. Using DDL, implement the relations on a PostgreSQL server
4. Load the given data into the relations
5. Execute some interesting queries on the relations


## Dataset

* Dataset URL: **/dsa/data/DSA-7030/Chicago-Crime-Sample-2012.csv**
* Dataset Description: [pdf](./ChicagoData-Description.pdf)

## Dataset exploration

In [1]:
import pandas as pd
datapath = "/dsa/data/DSA-7030/Chicago-Crime-Sample-2012.csv"
df = pd.read_csv(datapath, index_col=0)

In [2]:
# check columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 334715 entries, 47398 to 2743778
Data columns (total 22 columns):
ID                      334715 non-null int64
Case Number             334715 non-null object
Date                    334715 non-null object
Block                   334715 non-null object
IUCR                    334715 non-null object
Primary Type            334715 non-null object
Description             334715 non-null object
Location Description    334384 non-null object
Arrest                  334715 non-null bool
Domestic                334715 non-null bool
Beat                    334715 non-null int64
District                334715 non-null int64
Ward                    334708 non-null float64
Community Area          334689 non-null float64
FBI Code                334715 non-null object
X Coordinate            334132 non-null float64
Y Coordinate            334132 non-null float64
Year                    334715 non-null int64
Updated On              334715 non-null ob

In [4]:
df.tail()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
2742387,8951459,HW100757,12/31/2012 23:50,028XX N HALSTED ST,890,THEFT,FROM BUILDING,RESIDENCE,False,False,...,44.0,6.0,6,1170439.0,1919244.0,2012,2/4/2016 6:33,41.933894,-87.649053,"(41.933894393, -87.649052922)"
2741932,8950836,HW100039,12/31/2012 23:55,0000X E OHIO ST,2890,PUBLIC PEACE VIOLATION,OTHER VIOLATION,SIDEWALK,True,False,...,42.0,8.0,26,1176775.0,1904213.0,2012,2/4/2016 6:33,41.892508,-87.626224,"(41.892507592, -87.626223996)"
2742001,8950918,HW100021,12/31/2012 23:55,035XX W MONTROSE AVE,610,BURGLARY,FORCIBLE ENTRY,OTHER,False,False,...,33.0,16.0,5,1152066.0,1929015.0,2012,2/4/2016 6:33,41.961089,-87.716315,"(41.961089289, -87.716314748)"
2743949,8954299,HW100700,12/31/2012 23:55,058XX S MARYLAND AVE,890,THEFT,FROM BUILDING,HOSPITAL BUILDING/GROUNDS,False,False,...,5.0,41.0,6,1182887.0,1866434.0,2012,2/4/2016 6:33,41.788699,-87.604954,"(41.788699253, -87.604954085)"
2743778,8953937,HW102973,12/31/2012 23:58,037XX N NORA AVE,610,BURGLARY,FORCIBLE ENTRY,RESIDENCE-GARAGE,False,False,...,36.0,17.0,5,1128745.0,1924002.0,2012,2/4/2016 6:33,41.947762,-87.802171,"(41.947761848, -87.802170774)"


In [8]:
df['ID'].nunique()

334715

In [9]:
df['Case Number'].nunique()

334715

ID and Case Number each contain 334,715 unique values, one per row, so either can serve as the primary key.
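
The uniqueness check above can be made a little more explicit. The following is a minimal sketch, not part of the original analysis: `key_report` is a hypothetical helper, and the toy values are made up for illustration. It reports missing values and duplicates separately, since a primary key must have neither.

```python
from collections import Counter

def _is_missing(value):
    # None or NaN counts as missing (NaN is the only value not equal to itself)
    return value is None or value != value

def key_report(values):
    """Summarize whether a column could serve as a primary key."""
    values = list(values)
    missing = sum(1 for v in values if _is_missing(v))
    counts = Counter(v for v in values if not _is_missing(v))
    # Values that appear more than once, in first-seen order
    duplicated = [v for v, n in counts.items() if n > 1]
    return {
        "rows": len(values),
        "distinct": len(counts),
        "missing": missing,
        "duplicated": duplicated,
        "is_key": missing == 0 and not duplicated,
    }

# Toy data (hypothetical, not taken from the dataset)
toy_ids = [101, 102, 103, 104]
toy_case_numbers = ["HW1", "HW2", "HW2", None]

print(key_report(toy_ids)["is_key"])           # prints True
print(key_report(toy_case_numbers)["is_key"])  # prints False
```

Because the helper accepts any iterable, it should also work on a DataFrame column directly, for example `key_report(df['ID'])`.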

In [10]:
df[['Year', 'Date']]

Unnamed: 0,Year,Date
47398,2012,1/1/2012 0:00
47420,2012,1/1/2012 0:00
802910,2012,1/1/2012 0:00
803605,2012,1/1/2012 0:00
831733,2012,1/1/2012 0:00
...,...,...
2742387,2012,12/31/2012 23:50
2741932,2012,12/31/2012 23:55
2742001,2012,12/31/2012 23:55
2743949,2012,12/31/2012 23:55
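
Displaying Year next to Date suggests that Year may simply be derivable from Date, which would matter when normalizing. A minimal sketch of how that could be checked follows. The helper `year_mismatches` is hypothetical, and the date format string is an assumption based on the values displayed above.

```python
from datetime import datetime

def year_mismatches(rows, fmt="%m/%d/%Y %H:%M"):
    """Return the (year, date_string) pairs whose Year differs from the year in Date."""
    return [
        (year, date)
        for year, date in rows
        if datetime.strptime(date, fmt).year != int(year)
    ]

# A few (Year, Date) pairs copied from the output above
sample = [
    (2012, "1/1/2012 0:00"),
    (2012, "12/31/2012 23:50"),
    (2012, "12/31/2012 23:55"),
]

print(year_mismatches(sample))  # prints []
```

On the full data this could be called as `year_mismatches(zip(df['Year'], df['Date']))`. An empty result would indicate that Year carries no information beyond Date.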


In [11]:
df[['Primary Type', 'Description']]

Unnamed: 0,Primary Type,Description
47398,SEX OFFENSE,AGG CRIMINAL SEXUAL ABUSE
47420,SEX OFFENSE,SEXUAL EXPLOITATION OF A CHILD
802910,SEX OFFENSE,CRIMINAL SEXUAL ABUSE
803605,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300
831733,SEX OFFENSE,CRIMINAL SEXUAL ABUSE
...,...,...
2742387,THEFT,FROM BUILDING
2741932,PUBLIC PEACE VIOLATION,OTHER VIOLATION
2742001,BURGLARY,FORCIBLE ENTRY
2743949,THEFT,FROM BUILDING


In [12]:
df[['Location', 'Location Description']]

Unnamed: 0,Location,Location Description
47398,,RESIDENCE
47420,,RESIDENCE
802910,,APARTMENT
803605,,RESIDENCE
831733,,RESIDENCE
...,...,...
2742387,"(41.933894393, -87.649052922)",RESIDENCE
2741932,"(41.892507592, -87.626223996)",SIDEWALK
2742001,"(41.961089289, -87.716314748)",OTHER
2743949,"(41.788699253, -87.604954085)",HOSPITAL BUILDING/GROUNDS


In [None]:
# Based on what I have read in the Slack channel, every set of coordinates (with one exception) corresponds to exactly one beat,
# district, ward, and community area. After looking at the maps for each subdivision, I have determined that block, ward, and
# community area are independent of every other city subdivision, but beats are a subdivision of districts.
# As a result, the table containing coordinates will connect to tables for block, beat, ward, and community area. The table for
# beats will connect to a table for districts, similar to the film rental database from earlier in the class, where
# country connected to city, which connected to address.

# The one exceptional set of coordinates corresponds to a location in Missouri rather than Chicago. Records with these
# coordinates do not follow the usual rules (e.g., they have multiple values for district),
# so they must be thrown out.
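
The dependencies described above (coordinates determine beat, and beat determines district) could be verified against the data rather than taken from the maps alone. The sketch below is one possible way to do that, not the method used in this notebook: `fd_violations` is a hypothetical helper, and the toy beat, district, and coordinate values are invented for illustration.

```python
from collections import defaultdict

def fd_violations(pairs):
    """Check the functional dependency determinant -> dependent.

    `pairs` is an iterable of (determinant, dependent) tuples.
    Returns {determinant: set_of_dependents} for every determinant
    that maps to more than one dependent value.
    """
    seen = defaultdict(set)
    for determinant, dependent in pairs:
        seen[determinant].add(dependent)
    return {d: deps for d, deps in seen.items() if len(deps) > 1}

# Toy (beat, district) pairs: each beat belongs to a single district
beat_district = [(1111, 11), (1112, 11), (2411, 24), (1111, 11)]
print(fd_violations(beat_district))  # prints {}

# Toy ((x, y), district) pairs: one coordinate pair maps to two districts
coord_district = [((10.0, 20.0), 1), ((10.0, 20.0), 2), ((30.0, 40.0), 3)]
print(fd_violations(coord_district))
```

On the full data, a call such as `fd_violations(zip(df['Beat'], df['District']))` would test whether beats nest inside districts, and `fd_violations(zip(zip(df['X Coordinate'], df['Y Coordinate']), df['District']))` would surface coordinate pairs with more than one district, such as the out-of-state records mentioned above.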



## 1.1 Design an Entity Relationship Model for the Chicago Crime Dataset

* List all the entities with associated attributes
* Identify primary and foreign keys

## 1.2 If required, refine your initial set of relations and convert each of the relations to 3NF

While converting a relation to 3NF, please write down the process in the following cell. 

## 1.3 Final ERD

* Draw an entity-relationship diagram once you are done with 1.1 and 1.2
* Use crow's foot notation to specify the cardinality 
* Show the primary and foreign keys in the diagram

Please upload your ERD to the Module 8 exercises folder. Link the file [./dsa7030_module7_ERD.png](./dsa7030_module7_ERD.png). Once you are done, change this cell type to Markdown and execute. 

## <center> Part-I ends here</center>

To access Part II, use this link: [Final Project Part II](./Final-Project-Part-II.ipynb)