# Final Project Part - II

In this part, we implement the tables designed in Part - I and load data into them.

## 3.1 Using DDL, create each of the relations in the postgres server. 

* Use `dsa_student` database
* You are free to use any of the following tools:
  * psql
      * If you use psql, copy and paste your query in the following cell
  * sql magic
  * psycopg2
  * SQLAlchemy
* Add additional cells if required

In [1]:
import pandas as pd
import getpass
mypasswd = getpass.getpass()
username = 'bmgwd9'
host = 'pgsql.dsa.lan'
database = 'dsa_student'

# Connect to the database
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine

# SQLAlchemy Connection Parameters
postgres_db = {'drivername': 'postgresql',  # 'postgres' is a deprecated alias in SQLAlchemy
               'username': username,
               'password': mypasswd,
               'host': host,
               'database' :database}
engine = create_engine(URL(**postgres_db), echo=False)
del mypasswd

········


In [73]:
query = """
DROP TABLE IF EXISTS location CASCADE;
DROP TABLE IF EXISTS crime_code CASCADE;
"""

with engine.connect() as connection:
    res = connection.execute(query)
    print(res)

<sqlalchemy.engine.result.ResultProxy object at 0x7f03ccbffb38>


In [74]:
query = """
DROP TABLE IF EXISTS location;
CREATE TABLE location (
    location_id             INT,
    longitude               REAL, -- Floating point number
    latitude                REAL, 
    location_description    varchar(100), -- Character String, varied length
    block                   varchar(100), 
    beat                    INT,
    ward                    REAL,
    community_area          REAL, 
    district                INT,
    PRIMARY KEY (location_id)
)
"""

with engine.connect() as connection:
    res = connection.execute(query)
    print(res)

<sqlalchemy.engine.result.ResultProxy object at 0x7f03cfffc828>


In [75]:
query = """
DROP TABLE IF EXISTS crime_code;
CREATE TABLE crime_code (
    iucr                    varchar(100),
    primary_type            varchar(100),
    description             varchar(100),
    fbi_code                varchar(100),
    PRIMARY KEY (iucr, fbi_code)
)
"""

with engine.connect() as connection:
    res = connection.execute(query)
    print(res)

<sqlalchemy.engine.result.ResultProxy object at 0x7f03cfffc9e8>


In [76]:
query = """
DROP TABLE IF EXISTS record;
CREATE TABLE record (
    id                      INT, -- Integer
    case_number             varchar(100),
    date                    timestamp,
    iucr                    varchar(100), 
    arrest                  BOOL,
    domestic                BOOL,
    location_id             INT,
    updated_on              timestamp,
    fbi_code                varchar(100),
    PRIMARY KEY (id),
    FOREIGN KEY (iucr, fbi_code) REFERENCES crime_code(iucr, fbi_code),
    FOREIGN KEY (location_id) REFERENCES location(location_id)
)
"""

with engine.connect() as connection:
    res = connection.execute(query)
    print(res)

<sqlalchemy.engine.result.ResultProxy object at 0x7f03ccc6b390>
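
Because `record` declares foreign keys to `crime_code` and `location`, rows in those parent tables have to exist before the matching `record` rows can be inserted. The following is a minimal sketch of that behaviour using an in-memory SQLite database as a stand-in for the PostgreSQL server (an assumption made so that the example runs anywhere); the table definitions are trimmed to the key columns and the value `42` is made up.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.executescript("""
CREATE TABLE location (location_id INTEGER PRIMARY KEY);
CREATE TABLE record (
    id          INTEGER PRIMARY KEY,
    location_id INTEGER,
    FOREIGN KEY (location_id) REFERENCES location(location_id)
);
""")

# Loading the child row first violates the foreign key.
try:
    conn.execute("INSERT INTO record VALUES (1, 42)")
    child_first_ok = True
except sqlite3.IntegrityError:
    child_first_ok = False

# Loading the parent row first, then the child row, succeeds.
conn.execute("INSERT INTO location VALUES (42)")
conn.execute("INSERT INTO record VALUES (1, 42)")
loaded = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]

print(child_first_ok, loaded)
conn.close()
```

This is why the load in section 4.1 fills `location` and `crime_code` before `record`.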


## 3.2 Show the table definitions using psql or querying information_schema.columns catalog
* Add additional cells if required
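
One possible way to answer this from inside the notebook is to query `information_schema.columns` for the three tables created above. The sketch below only builds the query text; `table_definition_query` is a hypothetical helper name, and the `%(schema)s` placeholder assumes the psycopg2 driver's parameter style.

```python
# Tables created in section 3.1.
TABLES = ("location", "crime_code", "record")

def table_definition_query(tables=TABLES):
    """Return SQL that lists the column definitions of the given tables."""
    # The table names are fixed identifiers from this notebook, so they are
    # inlined as quoted literals; the schema stays a bind parameter.
    table_list = ", ".join(f"'{name}'" for name in tables)
    return f"""
SELECT table_name, column_name, data_type,
       character_maximum_length, is_nullable
FROM information_schema.columns
WHERE table_schema = %(schema)s
  AND table_name IN ({table_list})
ORDER BY table_name, ordinal_position;
"""

definition_sql = table_definition_query()
print(definition_sql)

# In the notebook, with `engine` and `username` from the first cell
# (assumes pandas hands the parameters through to psycopg2 unchanged):
# pd.read_sql(definition_sql, engine, params={"schema": username})
```

In psql, the equivalent is `\d location`, `\d crime_code`, and `\d record`.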

## 4.1 Load the data from the given csv file to the relations



* Because the design contains more than one relation, you need to extract subsets of the CSV data. Python need not be your tool of choice; you can use any language to create the subsets. Store the resulting data in the M8 exercises folder.
* After curating the data, use any of the following tools to load the data into the tables:
  * psql
      * If you use psql, copy and paste your command/query in the following cell
  * sql magic
  * psycopg2
  * SQLAlchemy
* Add additional cells if required

In [64]:
df = pd.read_csv("/dsa/data/DSA-7030/Chicago-Crime-Sample-2012.csv")
df['location_id'] = df.index + 1 # Create a location ID column to avoid a large composite PK
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,...,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,location_id
0,47398,10433096,HZ170962,1/1/2012 0:00,026XX N MC VICKER AVE,1562,SEX OFFENSE,AGG CRIMINAL SEXUAL ABUSE,RESIDENCE,True,...,19.0,17,,,2012,5/11/2016 15:48,,,,1
1,47420,10433124,HZ170983,1/1/2012 0:00,026XX N MC VICKER AVE,1544,SEX OFFENSE,SEXUAL EXPLOITATION OF A CHILD,RESIDENCE,True,...,19.0,17,,,2012,5/11/2016 15:48,,,,2
2,802910,10532867,HZ276514,1/1/2012 0:00,036XX S RHODES AVE,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,APARTMENT,False,...,35.0,17,,,2012,5/26/2016 15:51,,,,3
3,803605,10536876,HZ280873,1/1/2012 0:00,062XX S ROCKWELL ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,...,66.0,11,,,2012,5/27/2016 15:48,,,,4
4,831733,9581929,HX232501,1/1/2012 0:00,006XX W 66TH ST,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,RESIDENCE,False,...,68.0,17,,,2012,8/17/2015 15:03,,,,5


In [65]:
df.dtypes

Unnamed: 0                int64
ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                     object
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
District                  int64
Ward                    float64
Community Area          float64
FBI Code                 object
X Coordinate            float64
Y Coordinate            float64
Year                      int64
Updated On               object
Latitude                float64
Longitude               float64
Location                 object
location_id               int64
dtype: object

In [77]:
location=pd.DataFrame({'location_id': df['location_id'],
                       'longitude': df['Longitude'],
                       'latitude': df['Latitude'],
                       'location_description': df['Location Description'],
                       'block': df['Block'],
                       'beat': df['Beat'],
                       'ward': df['Ward'],
                       'community_area': df['Community Area'],
                       'district': df['District']
                })

location.to_sql('location', # The table to load
          engine,             # The engine created above
          schema=username,    # The schema where the table lives (our pawprint)
          if_exists='append', # If the table already exists, append rows to it
          index=False,        # Do not write the DataFrame index as a column
          chunksize=20)       # Insert 20 rows from the DataFrame at a time

In [78]:
crime_code=pd.DataFrame({'iucr': df['IUCR'],
                       'primary_type': df['Primary Type'],
                       'description': df['Description'],
                       'fbi_code': df['FBI Code']
                })

In [83]:
crime_code[crime_code['iucr'] == '1563'].head(5) # This shows that the dataset contained duplicate rows

Unnamed: 0,iucr,primary_type,description,fbi_code
2,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,17
4,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,17
15,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,17
22,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,17
33,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,17


In [84]:
# Remove the duplicate rows so the primary key is unique
crime_code = crime_code.drop_duplicates(subset=['iucr', 'fbi_code'])
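
`drop_duplicates(subset=['iucr', 'fbi_code'])` keeps the first row for each key, so a key that appeared with two different descriptions would lose one of them without any warning. A minimal plain-Python sketch of a check for that case follows; `find_conflicting_keys` is an illustrative name, the first three sample rows mirror rows shown in the output above, and the last row is hypothetical rather than taken from the dataset.

```python
from collections import defaultdict

# (iucr, primary_type, description, fbi_code)
rows = [
    ("1563", "SEX OFFENSE", "CRIMINAL SEXUAL ABUSE", "17"),
    ("1563", "SEX OFFENSE", "CRIMINAL SEXUAL ABUSE", "17"),
    ("1562", "SEX OFFENSE", "AGG CRIMINAL SEXUAL ABUSE", "17"),
    # HYPOTHETICAL row, added only to show what a conflict looks like.
    ("1562", "SEX OFFENSE", "AGGRAVATED CRIMINAL SEXUAL ABUSE", "17"),
]

def find_conflicting_keys(rows):
    """Return the (iucr, fbi_code) keys that map to more than one
    distinct (primary_type, description) pair."""
    attributes_by_key = defaultdict(set)
    for iucr, primary_type, description, fbi_code in rows:
        attributes_by_key[(iucr, fbi_code)].add((primary_type, description))
    return {key: attrs for key, attrs in attributes_by_key.items()
            if len(attrs) > 1}

conflicts = find_conflicting_keys(rows)
print(sorted(conflicts))
```

An empty result on the real data would mean that dropping duplicates on the key discards only exact repeats.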

In [85]:
crime_code.to_sql('crime_code', # The table to load
          engine,             # The engine created above
          schema=username,    # The schema where the table lives (our pawprint)
          if_exists='append', # If the table already exists, append rows to it
          index=False,        # Do not write the DataFrame index as a column
          chunksize=20)       # Insert 20 rows from the DataFrame at a time

In [87]:
record=pd.DataFrame({'id': df['ID'],
                    'case_number': df['Case Number'],
                    'date': df['Date'],
                    'iucr': df['IUCR'],
                    'arrest': df['Arrest'],
                    'domestic': df['Domestic'],
                    'location_id': df['location_id'],
                    'updated_on': df['Updated On'],
                    'fbi_code': df['FBI Code']
                })

record.to_sql('record', # The table to load
          engine,             # The engine created above
          schema=username,    # The schema where the table lives (our pawprint)
          if_exists='append', # If the table already exists, append rows to it
          index=False,        # Do not write the DataFrame index as a column
          chunksize=20)       # Insert 20 rows from the DataFrame at a time
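
The `Date` and `Updated On` columns have dtype `object`, so the load above sends them as text such as `1/1/2012 0:00` and leaves the conversion to `timestamp` to the server. A hedged alternative is to parse them explicitly before loading. The sketch below uses only the standard library; the format string is inferred from the sample rows shown earlier, and `parse_csv_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Month/day/year with a 24-hour clock, inferred from values such as
# "5/26/2016 15:51" in the sample output (26 cannot be a month).
CSV_TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"

def parse_csv_timestamp(value):
    """Parse one timestamp string from the CSV into a datetime."""
    return datetime.strptime(value, CSV_TIMESTAMP_FORMAT)

parsed = parse_csv_timestamp("5/11/2016 15:48")
print(parsed.isoformat())

# Likely pandas equivalent for a whole column (an assumption, not run here):
# df['Date'] = pd.to_datetime(df['Date'], format=CSV_TIMESTAMP_FORMAT)
```

Parsing up front makes a malformed value fail in Python, where it is easy to inspect, rather than partway through the insert.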

## 4.2 For each of the tables, show the number of rows in the table using a SQL query

* Add additional cells if required

In [89]:
SSO="bmgwd9"
hostname='pgsql.dsa.lan'
database='dsa_student'

# Read the password into memory to build the connection string
import getpass
read_password = getpass.getpass("Type Password and hit enter")

connection_string = f"postgres://{SSO}:{read_password}@{hostname}/{database}"
    
%load_ext sql
%sql $connection_string 

Type Password and hit enter········


'Connected: bmgwd9@dsa_student'

In [90]:
%%sql
SELECT COUNT(*) FROM record

 * postgres://bmgwd9:***@pgsql.dsa.lan/dsa_student
1 rows affected.


count
334715


In [92]:
%%sql
SELECT COUNT(*) FROM crime_code

 * postgres://bmgwd9:***@pgsql.dsa.lan/dsa_student
1 rows affected.


count
319


In [93]:
%%sql
SELECT COUNT(*) FROM location

 * postgres://bmgwd9:***@pgsql.dsa.lan/dsa_student
1 rows affected.


count
334715
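
The three counts can also be collected with a single `UNION ALL` query. The sketch below runs that query against an in-memory SQLite database with a few made-up rows, as a stand-in for the PostgreSQL tables, so that it is runnable on its own; the same `SELECT` should also work in a `%%sql` cell.

```python
import sqlite3

COUNT_ALL_SQL = """
SELECT 'location' AS table_name, COUNT(*) AS row_count FROM location
UNION ALL
SELECT 'crime_code', COUNT(*) FROM crime_code
UNION ALL
SELECT 'record', COUNT(*) FROM record
"""

# Tiny stand-in tables with made-up rows (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (location_id INTEGER PRIMARY KEY);
CREATE TABLE crime_code (iucr TEXT, fbi_code TEXT, PRIMARY KEY (iucr, fbi_code));
CREATE TABLE record (id INTEGER PRIMARY KEY, location_id INTEGER);
INSERT INTO location VALUES (1), (2);
INSERT INTO crime_code VALUES ('1563', '17');
INSERT INTO record VALUES (10, 1), (11, 2);
""")

# Collect the (table_name, row_count) pairs into a dict.
counts = dict(conn.execute(COUNT_ALL_SQL).fetchall())
print(counts)
conn.close()
```

Since every record receives its own `location_id` in this design, `record` and `location` are expected to have the same count, which matches the 334715 rows reported for both above.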
