# Part 1: Identify features of the dataset and design entity relationship model

To develop an effective entity-relationship model for the Chicago Crime dataset, it is important to determine all entities with their associated attributes and to identify primary and foreign keys. These can be used to develop an initial set of relations. Once an initial set of relations is determined, those relations can be converted to 3NF.

3NF is a database schema design approach for relational databases that uses normalization to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management.

To convert the relations to 3NF, they must first meet the criteria of 1NF and 2NF. The criteria for 1NF, 2NF, and 3NF are:

1NF: Each table cell should contain a single value, and each record needs to be unique.

2NF: Each table must be in 1NF and there should be no partial dependencies. A partial dependency occurs when a table has a composite primary key and one or more of the non-key columns is functionally dependent on some, but not all, of the columns in the composite primary key.

3NF: Each table must be in 2NF and there should be no transitive dependencies. A transitive dependency occurs when a non-key column depends on the primary key only through another non-key column: if P -> Q and Q -> R, then P -> R, where P is the key and Q and R are non-key columns.
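
To make the dependency criteria concrete, the sketch below tests whether a functional dependency is consistent with a table by checking that each left-hand value maps to a single right-hand value. The helper `holds_fd` and the toy rows are hypothetical and only loosely modelled on the crime data. A check like this can refute a dependency, but it cannot prove that the dependency holds for data not yet seen.

```python
import pandas as pd


def holds_fd(frame, lhs, rhs):
    """Return True if every combination of `lhs` values maps to one `rhs` value."""
    # Keep each distinct (lhs..., rhs) combination once.
    pairs = frame[lhs + [rhs]].drop_duplicates()
    # If an lhs combination still appears twice, it maps to two rhs values.
    return not pairs.duplicated(subset=lhs).any()


# Hypothetical toy rows: id plays the role of the key (P),
# IUCR is a non-key column (Q) and primary_type another non-key column (R).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "IUCR": ["1153", "1153", "1154", "1752"],
    "primary_type": [
        "DECEPTIVE PRACTICE",
        "DECEPTIVE PRACTICE",
        "DECEPTIVE PRACTICE",
        "OFFENSE INVOLVING CHILDREN",
    ],
})

id_to_iucr = holds_fd(toy, ["id"], "IUCR")              # P -> Q
iucr_to_type = holds_fd(toy, ["IUCR"], "primary_type")  # Q -> R
type_to_iucr = holds_fd(toy, ["primary_type"], "IUCR")  # R -> Q (refuted here)

print(id_to_iucr, iucr_to_type, type_to_iucr)
```

In this toy table both id -> IUCR and IUCR -> primary_type are consistent with the rows, so primary_type depends on id transitively through IUCR. This is the pattern that 3NF removes by moving IUCR and primary_type into their own relation.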

Once all relations are converted to 3NF, an entity-relationship diagram can be developed, which will act as a guide to table implementation.

## Dataset Exploration

The dataset will be explored and any data carpentry will be done here.

In [2]:
import pandas as pd

In [3]:
# path to data
datapath = "Crimes_2012.csv"

# read data
df = pd.read_csv(datapath, index_col = False)

In [4]:
# check out the first few rows
df.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11645833,JC213044,05/05/2012 12:25:00 PM,057XX W OHIO ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,...,29.0,25.0,11,,,2012,04/06/2019 04:04:43 PM,,,
1,11227247,JB147078,01/01/2012 09:00:00 AM,105XX S INDIANAPOLIS AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,10.0,52.0,11,,,2012,02/11/2018 03:57:41 PM,,,
2,10225605,HY412867,07/11/2012 09:00:00 AM,017XX W ALBION AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,APARTMENT,False,False,...,40.0,1.0,11,1163498.0,1943889.0,2012,02/10/2018 03:50:01 PM,42.00167,-87.673864,"(42.00167049, -87.673863642)"
3,11228588,JB149037,06/04/2012 12:00:00 PM,037XX W 85TH PL,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,18.0,70.0,11,,,2012,02/12/2018 03:49:14 PM,,,
4,10751224,HZ513641,01/01/2012 08:00:00 AM,010XX S MAYFIELD AVE,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,...,29.0,25.0,20,,,2012,07/27/2017 03:50:07 PM,,,


In [5]:
# rename columns to lowercase names without spaces so they are easier to work with (IUCR keeps its name)
df = df.rename(columns = {'ID': 'id', 'Case Number': 'case_number', 'Date': 'date', 'Block': 'block', 'IUCR': 'IUCR',
              'Primary Type': 'primary_type', 'Description': 'description', 'Location Description': 'location_description',
              'Arrest': 'arrest', 'Domestic': 'domestic', 'Beat': 'beat', 'District': 'district', 'Ward': 'ward',
              'Community Area': 'community_area', 'FBI Code': 'fbi_code', 'X Coordinate': 'x_coordinate',
              'Y Coordinate': 'y_coordinate', 'Year': 'year', 'Updated On': 'updated_on', 'Latitude': 'latitude',
              'Longitude': 'longitude', 'Location': 'location'})
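
As an alternative to spelling out every mapping by hand, the same renaming could be sketched with a small function passed to `rename`. The helper `to_snake_case`, the `keep_as_is` set, and the four-column `raw` frame are hypothetical stand-ins for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical empty frame standing in for a few of the raw CSV column labels.
raw = pd.DataFrame(columns=["Case Number", "Primary Type", "IUCR", "FBI Code"])


def to_snake_case(name):
    """Lower-case a column label and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


# Labels listed here are left untouched, mirroring how IUCR keeps its name.
keep_as_is = {"IUCR"}

renamed = raw.rename(columns=lambda c: c if c in keep_as_is else to_snake_case(c))
print(list(renamed.columns))
```

The explicit dictionary used in the notebook is easier to audit. The function-based form is shorter and keeps working if columns are added later.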

In [6]:
# check columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336194 entries, 0 to 336193
Data columns (total 22 columns):
id                      336194 non-null int64
case_number             336193 non-null object
date                    336194 non-null object
block                   336194 non-null object
IUCR                    336194 non-null object
primary_type            336194 non-null object
description             336194 non-null object
location_description    335741 non-null object
arrest                  336194 non-null bool
domestic                336194 non-null bool
beat                    336194 non-null int64
district                336194 non-null int64
ward                    336187 non-null float64
community_area          336168 non-null float64
fbi_code                336194 non-null object
x_coordinate            335452 non-null float64
y_coordinate            335452 non-null float64
year                    336194 non-null int64
updated_on              336194 non-null object


In [7]:
# location holds two values (latitude and longitude) in one cell, which would prevent the table from being in 1NF
# year isn't needed, and x_coordinate and y_coordinate are also redundant attributes, so they are dropped along with location
df = df.drop(['location', 'year', 'x_coordinate', 'y_coordinate'], axis = 1)
df.head()

Unnamed: 0,id,case_number,date,block,IUCR,primary_type,description,location_description,arrest,domestic,beat,district,ward,community_area,fbi_code,updated_on,latitude,longitude
0,11645833,JC213044,05/05/2012 12:25:00 PM,057XX W OHIO ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,1511,15,29.0,25.0,11,04/06/2019 04:04:43 PM,,
1,11227247,JB147078,01/01/2012 09:00:00 AM,105XX S INDIANAPOLIS AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,432,4,10.0,52.0,11,02/11/2018 03:57:41 PM,,
2,10225605,HY412867,07/11/2012 09:00:00 AM,017XX W ALBION AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,APARTMENT,False,False,2432,24,40.0,1.0,11,02/10/2018 03:50:01 PM,42.00167,-87.673864
3,11228588,JB149037,06/04/2012 12:00:00 PM,037XX W 85TH PL,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,834,8,18.0,70.0,11,02/12/2018 03:49:14 PM,,
4,10751224,HZ513641,01/01/2012 08:00:00 AM,010XX S MAYFIELD AVE,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,1513,15,29.0,25.0,20,07/27/2017 03:50:07 PM,,
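
Dropping `location` loses no information only if it repeats what `latitude` and `longitude` already store. The sketch below shows one way to check that assumption before dropping the column. The two toy rows are illustrative values rather than rows read from the CSV, and the regular expression assumes the `(lat, lon)` layout seen in the preview.

```python
import numpy as np
import pandas as pd

# Illustrative rows: one with coordinates, one with all three fields missing.
toy = pd.DataFrame({
    "latitude": [42.00167049, np.nan],
    "longitude": [-87.673863642, np.nan],
    "location": ["(42.00167049, -87.673863642)", np.nan],
})

# Pull the two numbers out of "(lat, lon)"; missing strings stay NaN.
parsed = toy["location"].str.extract(r"\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)").astype(float)
parsed.columns = ["latitude", "longitude"]

# Compare with a tolerance and treat NaN in both places as a match.
redundant = bool(
    np.isclose(
        parsed.to_numpy(),
        toy[["latitude", "longitude"]].to_numpy(),
        equal_nan=True,
    ).all()
)
print(redundant)
```

If the check returned False on the full data, some rows would carry location text that disagrees with the numeric columns and would be worth inspecting before the drop.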


In [8]:
# check columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336194 entries, 0 to 336193
Data columns (total 18 columns):
id                      336194 non-null int64
case_number             336193 non-null object
date                    336194 non-null object
block                   336194 non-null object
IUCR                    336194 non-null object
primary_type            336194 non-null object
description             336194 non-null object
location_description    335741 non-null object
arrest                  336194 non-null bool
domestic                336194 non-null bool
beat                    336194 non-null int64
district                336194 non-null int64
ward                    336187 non-null float64
community_area          336168 non-null float64
fbi_code                336194 non-null object
updated_on              336194 non-null object
latitude                335452 non-null float64
longitude               335452 non-null float64
dtypes: bool(2), float64(4), int64(3), object(9)

A location_id will also be introduced to avoid a composite key of 8 attributes. It will be added in Part 2 when developing the relations. With the addition of location_id, the data should be in 3NF.
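
As a rough preview of how such a surrogate key might be generated, the sketch below numbers each distinct combination of location attributes and replaces those columns with the new key. The choice of the eight columns in `location_cols`, the toy rows, and the names `locations` and `crimes` are assumptions made for illustration. The toy rows contain no missing values, so the handling of rows without coordinates would still need to be checked on the real data.

```python
import pandas as pd

# Assumed set of eight location attributes that would otherwise form the composite key.
location_cols = [
    "block", "location_description", "beat", "district",
    "ward", "community_area", "latitude", "longitude",
]

# Hypothetical toy rows: the first two crimes share one location.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "block": ["017XX W ALBION AVE", "017XX W ALBION AVE", "057XX W OHIO ST"],
    "location_description": ["APARTMENT", "APARTMENT", "RESIDENCE"],
    "beat": [2432, 2432, 1511],
    "district": [24, 24, 15],
    "ward": [40.0, 40.0, 29.0],
    "community_area": [1.0, 1.0, 25.0],
    "latitude": [42.00167, 42.00167, 41.89],
    "longitude": [-87.673864, -87.673864, -87.77],
})

# One row per distinct location, numbered from 1.
locations = toy[location_cols].drop_duplicates().reset_index(drop=True)
locations.insert(0, "location_id", locations.index + 1)

# Attach the surrogate key to each crime and drop the repeated attributes.
crimes = toy.merge(locations, on=location_cols, how="left").drop(columns=location_cols)
print(crimes)
```

The `crimes` frame then references a location through a single `location_id` column, and the `locations` frame holds each attribute combination once.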