# Midterm Project - Miranda Khoury (mrk6xcb)

## Overview

For this project, I use the Sakila sample database. The fact table, **fact_rentals**, contains information about movie rentals at the fictional Sakila movie rental store. It combines Sakila's **rental**, **inventory**, and **payment** tables, which streamlines the dataset and reduces the number of dimension tables in the final database. The data warehouse also includes four dimension tables: **dim_customers**, **dim_films**, **dim_staff**, and **dim_date**.

## Prerequisites


#### Import the necessary libraries

In [None]:
import os
import numpy
import mysql.connector
import pandas as pd
from sqlalchemy import create_engine
import json
import datetime
import pymongo

#### Declare Connection Variables for the MySQL Server, MongoDB Cluster, and Databases

In [None]:
host_name = "localhost"
host_ip = "127.0.0.1"
port = "3306"

# my credentials to log into my MySQL Workbench
mysql_uid = "root"
mysql_pwd = "huntersMark6%"

# my credentials to log into my personal MongoDB cluster
atlas_cluster_name = "testCluster"
atlas_user_name = "main_user"
atlas_password = "butterfly"

# connection string to connect to my MongoDB cluster
my_conn_str = "mongodb+srv://main_user:butterfly@testcluster.gymn7h2.mongodb.net/?retryWrites=true&w=majority"

mysql_src_db = "sakila"
mongo_src_db = "sakila_mongo"
dst_db = "sakila_dw_final"
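
The connection variables above are hard-coded, and the MySQL password contains a `%`, which is a reserved character in URLs. As a minimal sketch (not part of the original notebook), credentials could instead be read from environment variables and percent-encoded before they go into a connection string. The helper name `build_mysql_url` and the environment variable names `DW_MYSQL_USER` and `DW_MYSQL_PWD` are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, db_name=""):
    """Build a SQLAlchemy-style MySQL URL with percent-encoded credentials."""
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{db_name}"


# Hypothetical environment variable names; the fallbacks are placeholders only.
user = os.environ.get("DW_MYSQL_USER", "root")
password = os.environ.get("DW_MYSQL_PWD", "example%pass")

url = build_mysql_url(user, password, "localhost", "sakila")
# Avoid printing the full URL, since it contains the password.
print(url.split("@")[-1])
```

Encoding the password keeps characters such as `%` and `@` from being misread as part of the URL structure.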

#### Define Functions for Getting Data From and Setting Data Into Databases

In [None]:
def get_sql_dataframe(user_id, pwd, db_name, sql_query):
    '''Run sql_query against a MySQL database and return the result as a Pandas DataFrame.'''
    # Create a connection to the MySQL database
    conn_str = f"mysql+pymysql://{user_id}:{pwd}@localhost/{db_name}"
    sqlEngine = create_engine(conn_str, pool_recycle=3600)
    
    # Invoke the pd.read_sql() function to query the database, and fill a Pandas DataFrame
    conn = sqlEngine.connect()
    dframe = pd.read_sql(sql_query, conn)
    conn.close()
    
    return dframe


def get_mongo_dataframe(connect_str, db_name, collection, query):
    '''Run query against a MongoDB collection and return the result as a Pandas DataFrame.'''
    # Create a connection to MongoDB
    client = pymongo.MongoClient(connect_str)
    
    # Query MongoDB, and fill a Python list with documents to create a DataFrame
    db = client[db_name]
    dframe = pd.DataFrame(list(db[collection].find(query)))
    # Drop MongoDB's internal document id, which is not part of the source data
    dframe.drop(['_id'], axis=1, inplace=True)
    client.close()
    return dframe


def set_dataframe(user_id, pwd, db_name, df, table_name, pk_column, db_operation):
    '''Write a DataFrame to a MySQL table: "insert" replaces the table, "update" appends to it.'''
    # Create a connection to the MySQL database
    conn_str = f"mysql+pymysql://{user_id}:{pwd}@localhost/{db_name}"
    sqlEngine = create_engine(conn_str, pool_recycle=3600)
    
    # engine.begin() commits on success and closes the connection when the block exits
    with sqlEngine.begin() as connection:
        # Invoke the Pandas DataFrame .to_sql() function to either create, or append to, a table
        if db_operation == "insert":
            df.to_sql(table_name, con=connection, index=False, if_exists='replace')
            # Engine.execute() was removed in SQLAlchemy 2.0, so run the DDL on the connection
            connection.exec_driver_sql(f"ALTER TABLE {table_name} ADD PRIMARY KEY ({pk_column});")
            
        elif db_operation == "update":
            df.to_sql(table_name, con=connection, index=False, if_exists='append')
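
To make the difference between the "insert" and "update" operations concrete, here is a small sketch of the two `if_exists` modes that `set_dataframe()` relies on. It uses an in-memory SQLite database as a stand-in for the MySQL destination, and the table name `dim_staff_demo` and its rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL destination (an assumption for illustration).
conn = sqlite3.connect(":memory:")

first_batch = pd.DataFrame({"staff_key": [1, 2], "first_name": ["Mike", "Jon"]})
second_batch = pd.DataFrame({"staff_key": [3], "first_name": ["Ana"]})

# "insert" in set_dataframe(): drop any existing table and recreate it from the DataFrame.
first_batch.to_sql("dim_staff_demo", con=conn, index=False, if_exists="replace")

# "update" in set_dataframe(): keep the existing table and add the new rows to it.
second_batch.to_sql("dim_staff_demo", con=conn, index=False, if_exists="append")

row_count = pd.read_sql("SELECT COUNT(*) AS n FROM dim_staff_demo", conn)["n"].iloc[0]
print(row_count)
conn.close()
```

Because "insert" replaces the table, re-running a load cell with that operation starts from a clean table rather than duplicating rows.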

#### Create the New Data Warehouse Database and Switch the Connection Context to It

In [None]:
conn_str = f"mysql+pymysql://{mysql_uid}:{mysql_pwd}@{host_name}"
sqlEngine = create_engine(conn_str, pool_recycle=3600)

# Run the statements on one explicit connection; Engine.execute() was removed in SQLAlchemy 2.0
with sqlEngine.connect() as connection:
    connection.exec_driver_sql(f"DROP DATABASE IF EXISTS `{dst_db}`;")
    connection.exec_driver_sql(f"CREATE DATABASE `{dst_db}`;")
    connection.exec_driver_sql(f"USE `{dst_db}`;")

## Sourcing Data from A SQL Server (MySQL)

First, I read in some of the data I'll be working with from MySQL: the **rental**, **staff**, and **customer** tables from the Sakila database loaded on my MySQL server.

#### Extract Data from the Source Database Tables

In [None]:
sakila_customers = "SELECT * FROM sakila.customer;"
df_customers = get_sql_dataframe(mysql_uid, mysql_pwd, mysql_src_db, sakila_customers)
df_customers.head(2)

In [None]:
sakila_rentals = "SELECT * FROM sakila.rental;"
df_rentals = get_sql_dataframe(mysql_uid, mysql_pwd, mysql_src_db, sakila_rentals)
df_rentals.head(3)

In [None]:
sakila_staff = "SELECT * FROM sakila.staff;"
df_staff = get_sql_dataframe(mysql_uid, mysql_pwd, mysql_src_db, sakila_staff)
df_staff.head(2)

## Sourcing Data from A NoSQL Server (MongoDB)

Next, I read in the **inventory** and **film** tables from Sakila. First, I load the JSON versions of these tables into MongoDB, as if the data had only ever been available there. Then, I read the collections back into this Jupyter Notebook as DataFrames so that I can transform them.

#### Populate MongoDB with Source Data

In [None]:
client = pymongo.MongoClient(my_conn_str)
db = client[mongo_src_db]

# Note that the data files are all in the same folder as this notebook, so the current
# working directory is the data path needed to access them.
data_dir = os.getcwd()

json_files = {"inventory" : "inventory.json",
              "film" : "sakila_film.json"
             }

for collection_name, file_name in json_files.items():
    # Drop any existing copy so the notebook can be re-run without duplicating documents
    db.drop_collection(collection_name)
    json_file = os.path.join(data_dir, file_name)
    with open(json_file, 'r') as openfile:
        json_object = json.load(openfile)
        result = db[collection_name].insert_many(json_object)
        print(f"{collection_name} was successfully loaded.")

client.close()


#### Extract Data from the Source MongoDB Collections Into DataFrames

In [None]:
query = {}
collection = "inventory"

df_inventory = get_mongo_dataframe(my_conn_str, mongo_src_db, collection, query)
df_inventory.head(2)

In [None]:
query = {}
collection = "film"

df_films = get_mongo_dataframe(my_conn_str, mongo_src_db, collection, query)
df_films.head(2)
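
The conversion step inside `get_mongo_dataframe()` can be illustrated without a MongoDB connection. The sketch below applies the same idea to a plain list of dictionaries shaped like MongoDB documents. The helper name `documents_to_dataframe` and the sample documents are hypothetical, and `errors="ignore"` is an optional variation that avoids a `KeyError` when a query matches no documents.

```python
import pandas as pd


def documents_to_dataframe(documents):
    """Turn a list of MongoDB-style dicts into a DataFrame without the _id column."""
    dframe = pd.DataFrame(list(documents))
    # errors="ignore" keeps this from failing when the result set is empty
    return dframe.drop(columns=["_id"], errors="ignore")


# Made-up documents that mimic the shape of the inventory collection
docs = [
    {"_id": "a1", "inventory_id": 1, "film_id": 1, "store_id": 1},
    {"_id": "a2", "inventory_id": 2, "film_id": 1, "store_id": 1},
]

df_demo = documents_to_dataframe(docs)
print(df_demo)
```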

## Sourcing Data from A File System

For the last step of the Extract phase, I read in the final pieces of data from a local file system. The file I'll be reading in is in CSV format.

In [None]:
# note that the data are all in the same folder that this JN file is in, so the default data path 
# is the data path needed to access the files
data_dir = os.getcwd()
data_file = os.path.join(data_dir, 'sakila_payment.csv')

df_payments = pd.read_csv(data_file, header=0, index_col=0)
df_payments.head(3)
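
One detail worth checking in the cell above is `index_col=0`, which turns the first CSV column into the DataFrame index rather than a regular column. The sketch below shows the effect on a small made-up CSV. I do not know the actual layout of `sakila_payment.csv`, so whether its first column is `payment_id` or an exported row index is an assumption to verify against the file.

```python
import io

import pandas as pd

# Made-up CSV text; the real file's columns may differ.
csv_text = "payment_id,rental_id,amount\n1,76,2.99\n2,573,0.99\n"

# With index_col=0 the first column becomes the index, not a regular column
with_index = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

# Without index_col the first column stays available for drop() and rename()
as_column = pd.read_csv(io.StringIO(csv_text), header=0)

print(list(with_index.columns))
print(list(as_column.columns))
```

If the first column is a real key, a later `rename(columns=...)` will not see it while it sits in the index, so reading it as a regular column (or calling `reset_index()`) may be the safer choice.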

## Transformation of Data Using Pandas Dataframes

Now that all the raw tables (**rental**, **staff**, **customer**, **film**, **inventory**, and **payment**) have been read in from their various sources, I can perform the necessary transformations on them and combine some of them to form the fact table.

First, I'll perform some transformations on the dimension tables.

#### Transform the Customers Dimension Table

In [None]:
drop_cols = ['store_id','address_id','active','create_date','last_update']
df_customers.drop(drop_cols, axis=1, inplace=True)
df_customers.rename(columns={"customer_id":"customer_key"}, inplace=True)

df_customers.head(2)

#### Transform the Films Dimension Table

In [None]:
drop_cols = ['language_id','original_language_id']
df_films.drop(drop_cols, axis=1, inplace=True)
df_films.rename(columns={"film_id":"film_key"}, inplace=True)

df_films.head(2)

#### Transform the Staff Dimension Table

In [None]:
drop_cols = ['address_id','picture','password', 'last_update', 'active', 'store_id']
df_staff.drop(drop_cols, axis=1, inplace=True)
df_staff.rename(columns={"staff_id":"staff_key"}, inplace=True)

df_staff.head(2)
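
The three dimension transformations above follow the same pattern: drop the columns the warehouse does not need, then rename the business id to a `*_key` column. As a sketch, that pattern could be captured in one helper. The name `to_dimension` and the sample rows are hypothetical, and unlike the cells above, this version returns a new DataFrame instead of modifying the input in place.

```python
import pandas as pd


def to_dimension(df, drop_cols, key_map):
    """Drop unneeded columns and rename id columns to warehouse key names."""
    return df.drop(columns=drop_cols).rename(columns=key_map)


# Made-up rows shaped like a few of the staff table's columns
raw_staff = pd.DataFrame({
    "staff_id": [1, 2],
    "first_name": ["Mike", "Jon"],
    "store_id": [1, 2],
    "last_update": ["2006-02-15", "2006-02-15"],
})

dim_staff_demo = to_dimension(raw_staff, ["store_id", "last_update"], {"staff_id": "staff_key"})
print(dim_staff_demo)
```

Returning a copy leaves the raw DataFrame untouched, which makes it easier to re-run a single transformation cell.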

Then, I prepare the inventory, payment, and rental tables to be joined into the **fact_rentals** table.

#### Pre-transforming the Tables that Will Become Fact Rentals

In [None]:
drop_cols = ['last_update']
df_rentals.drop(drop_cols, axis=1, inplace=True)
df_rentals.rename(columns={"rental_id":"rental_key", "inventory_id":"inventory_key", "customer_id":"customer_key","staff_id":"staff_key"}, inplace=True)

df_rentals.head(2)

In [None]:
drop_cols = ['last_update', "store_id"]
df_inventory.drop(drop_cols, axis=1, inplace=True)
df_inventory.rename(columns={"inventory_id":"inventory_key"}, inplace=True)

df_inventory.head(2)

In [None]:
drop_cols = ['last_update', "customer_id", "staff_id"]
df_payments.drop(drop_cols, axis=1, inplace=True)
df_payments.rename(columns={"rental_id":"rental_key", "payment_id":"payment_key"}, inplace=True)

df_payments.head(2)

Now I can merge the tables together.

#### Merging Rentals, Payment, and Inventory into Fact Rentals

In [None]:
# First, join payments to rentals on the rental key.
# Each rental is expected to have at most one payment, so the join should not duplicate rentals.
# A left join keeps every rental, including any that have no matching payment row.
df_fact_rentals1 = pd.merge(df_rentals, df_payments, on='rental_key', how='left')
df_fact_rentals1.rename(columns={"amount":"payment_amount"}, inplace=True)

df_fact_rentals1.head(2)

In [None]:
df_fact_rentals1.shape[0]
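
The row count above is one way to check that the join did not duplicate rentals. As a sketch of a stricter check, `pd.merge()` can verify the expected relationship itself through its `validate` parameter and flag unmatched rows through `indicator`. The sample DataFrames below are made up for illustration.

```python
import pandas as pd

# Made-up rows: rental 3 has no payment
rentals = pd.DataFrame({"rental_key": [1, 2, 3], "customer_key": [10, 11, 10]})
payments = pd.DataFrame({"rental_key": [1, 2], "amount": [2.99, 0.99]})

# validate="one_to_one" raises MergeError if rental_key repeats on either side;
# indicator=True adds a _merge column showing where each row came from
merged = pd.merge(rentals, payments, on="rental_key", how="left",
                  validate="one_to_one", indicator=True)

# Rentals that found no payment row
unpaid = merged[merged["_merge"] == "left_only"]
print(unpaid[["rental_key", "customer_key"]])
```

For the inventory join, `validate="many_to_one"` would express the expected relationship in the same way.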

In [None]:
# Then, join inventory to rentals on the inventory key. This is a many-to-one relationship:
# the same inventory item can be rented more than once (e.g. rentals 3 and 13 could both
# correspond to inventory id 26), so a left join is used. Finally, drop the inventory key.
df_fact_rentals = pd.merge(df_fact_rentals1, df_inventory, on='inventory_key', how='left')
df_fact_rentals.drop(['inventory_key'], axis=1, inplace=True)
df_fact_rentals.rename(columns={"film_id":"film_key"}, inplace=True)

df_fact_rentals.head(10).sort_values(by=['rental_key'])

## Making A Date Dimension Table Using SQL

Now that the raw data tables have all been read in and mostly transformed, I'll make one table from scratch: the new date dimension, **dim_date**, that I'm adding to the data warehouse. I'll create it using SQL commands, then integrate it into the data warehouse by looking up its keys for the fact table.
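
The SQL that builds the date dimension is not shown in this notebook. Purely as an illustration of what such a table contains, here is a Pandas sketch of a small date dimension. Only `date_key` and `full_date` come from the notebook. The `YYYYMMDD` integer key format, the extra columns, and the helper name `build_date_dimension` are assumptions.

```python
import pandas as pd


def build_date_dimension(start, end):
    """Build one row per calendar day between start and end (inclusive)."""
    dates = pd.date_range(start=start, end=end, freq="D")
    return pd.DataFrame({
        # Assumed surrogate key format: the date written as a YYYYMMDD integer
        "date_key": dates.strftime("%Y%m%d").astype(int),
        "full_date": dates,
        "year": dates.year,
        "month": dates.month,
        "day_name": dates.day_name(),
    })


df_dim_date_demo = build_date_dimension("2005-05-24", "2005-05-26")
print(df_dim_date_demo)
```

The date range has to cover every date that appears in the fact table, otherwise the key lookups in the following cells find no match for some rows.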

##### Get the Data from the Date Dimension Table
First, fetch the surrogate primary key (**date_key**) and the business key (**full_date**) from the date dimension table using the **get_sql_dataframe()** function. Also, cast the **full_date** column to the **datetime64** data type using the **.astype()** method of the Pandas DataFrame column.

In [None]:
sql_dim_date = "SELECT date_key, full_date FROM dim_date;"
df_dim_date = get_sql_dataframe(mysql_uid, mysql_pwd, dst_db, sql_dim_date)
df_dim_date.full_date = df_dim_date.full_date.astype('datetime64[ns]')
df_dim_date.head(2)

##### Look Up the Date Keys from the Date Dimension Table
Next, for each date-typed column in the fact table (**rental_date**, **return_date**, and **payment_date**), look up the corresponding surrogate primary key.

In [None]:
# Look up the surrogate primary key (date_key) that corresponds to the "rental_date" column.
# The time of day is truncated so the value can match full_date; a left join keeps every rental.
df_fact_rentals['rental_date'] = pd.to_datetime(df_fact_rentals['rental_date']).dt.normalize()
df_dim_rental_date = df_dim_date.rename(columns={"date_key" : "rental_date_key", "full_date" : "rental_date"})
df_fact_rentals = pd.merge(df_fact_rentals, df_dim_rental_date, on='rental_date', how='left')
df_fact_rentals.drop(['rental_date'], axis=1, inplace=True)
df_fact_rentals.head(2)

In [None]:
# Look up the surrogate primary key (date_key) that corresponds to the "return_date" column.
# Rentals that have not been returned have no return_date, so their key stays empty.
df_fact_rentals['return_date'] = pd.to_datetime(df_fact_rentals['return_date']).dt.normalize()
df_dim_return_date = df_dim_date.rename(columns={"date_key" : "return_date_key", "full_date" : "return_date"})
df_fact_rentals = pd.merge(df_fact_rentals, df_dim_return_date, on='return_date', how='left')
df_fact_rentals.drop(['return_date'], axis=1, inplace=True)
df_fact_rentals.head(2)

In [None]:
# Look up the surrogate primary key (date_key) that corresponds to the "payment_date" column.
df_fact_rentals['payment_date'] = pd.to_datetime(df_fact_rentals['payment_date']).dt.normalize()
df_dim_payment_date = df_dim_date.rename(columns={"date_key" : "payment_date_key", "full_date" : "payment_date"})
df_fact_rentals = pd.merge(df_fact_rentals, df_dim_payment_date, on='payment_date', how='left')
df_fact_rentals.drop(['payment_date'], axis=1, inplace=True)
df_fact_rentals.head(2)
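
The key lookup repeats the same steps for each date column, so it can be sketched as a single helper. The example below runs on made-up rows. The helper name `add_date_key`, the `YYYYMMDD` key values, and the use of the nullable `Int64` dtype for rows without a date are assumptions for illustration.

```python
import pandas as pd

# Made-up date dimension and fact rows; rental 2 has not been returned yet
dim_date = pd.DataFrame({
    "date_key": [20050524, 20050525],
    "full_date": pd.to_datetime(["2005-05-24", "2005-05-25"]),
})
fact = pd.DataFrame({
    "rental_key": [1, 2],
    "return_date": pd.to_datetime(["2005-05-25 19:40:33", None]),
})


def add_date_key(fact_df, dim_df, date_col):
    """Replace a datetime column in fact_df with its surrogate key from dim_df."""
    key_col = f"{date_col}_key"
    lookup = dim_df.rename(columns={"date_key": key_col, "full_date": date_col})
    out = fact_df.copy()
    # Truncate the time of day so the value can match the dimension's full_date
    out[date_col] = pd.to_datetime(out[date_col]).dt.normalize()
    # A left join keeps fact rows whose date is missing or not in the dimension
    out = out.merge(lookup, on=date_col, how="left").drop(columns=[date_col])
    # Nullable integer dtype keeps the keys as integers even when some are missing
    out[key_col] = out[key_col].astype("Int64")
    return out


result = add_date_key(fact, dim_date, "return_date")
print(result)
```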

## Loading the Tables Into the Destination Database

Then, I'll load the finalized tables into my destination database, and the data warehouse is done!

In [None]:
# Loading in the dimension tables
db_operation = "insert"

tables = [('dim_customers', df_customers, 'customer_key'),
          ('dim_staff', df_staff, 'staff_key'),
          ('dim_films', df_films, 'film_key')]

for table_name, dataframe, primary_key in tables:
    set_dataframe(mysql_uid, mysql_pwd, dst_db, dataframe, table_name, primary_key, db_operation)

In [None]:
# Loading in the fact table

db_operation = "insert"

table_name = 'fact_rentals'
dataframe = df_fact_rentals
primary_key = 'rental_key'

set_dataframe(mysql_uid, mysql_pwd, dst_db, dataframe, table_name, primary_key, db_operation)
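
Because `set_dataframe()` adds a primary key after loading, the load fails if the key column contains duplicates or nulls. As a sketch, a quick check could be run on a DataFrame before loading it. The helper name `check_primary_key` and the sample rows are hypothetical.

```python
import pandas as pd


def check_primary_key(df, pk_column):
    """Return a list of problems that would make ADD PRIMARY KEY fail."""
    problems = []
    if df[pk_column].isna().any():
        problems.append("null values")
    if df[pk_column].duplicated().any():
        problems.append("duplicate values")
    return problems


# Made-up rows: rental_key 2 appears twice, as it would if a rental matched two payments
fact_demo = pd.DataFrame({"rental_key": [1, 2, 2], "payment_amount": [2.99, 0.99, 1.99]})

problems = check_primary_key(fact_demo, "rental_key")
print(problems)
```

An empty list means the column is safe to use as a primary key.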

## Querying the Finalized Data Warehouse
Finally, I'll write some queries to prove that my data warehouse was implemented successfully.

In [None]:
sql_query = '''
SELECT * FROM fact_rentals;
'''

df_test = get_sql_dataframe(mysql_uid, mysql_pwd, dst_db, sql_query)
df_test.head(5)
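
A query that joins the fact table to a dimension and aggregates gives stronger evidence that the star schema works than a plain `SELECT *`. The sketch below shows the shape of such a query. It runs against an in-memory SQLite database with made-up rows as a stand-in for the MySQL warehouse, so the data and results are illustrative only.

```python
import sqlite3

import pandas as pd

# In-memory SQLite with made-up rows stands in for the MySQL warehouse
conn = sqlite3.connect(":memory:")
pd.DataFrame({"customer_key": [1, 2],
              "last_name": ["SMITH", "JOHNSON"]}).to_sql("dim_customers", conn, index=False)
pd.DataFrame({"rental_key": [1, 2, 3],
              "customer_key": [1, 1, 2],
              "payment_amount": [2.99, 0.99, 4.99]}).to_sql("fact_rentals", conn, index=False)

# Join the fact table to a dimension, then aggregate per customer
sql_query = '''
SELECT c.last_name,
       COUNT(f.rental_key) AS rental_count,
       ROUND(SUM(f.payment_amount), 2) AS total_paid
FROM fact_rentals AS f
JOIN dim_customers AS c ON f.customer_key = c.customer_key
GROUP BY c.last_name
ORDER BY total_paid DESC;
'''

df_summary = pd.read_sql(sql_query, conn)
print(df_summary)
conn.close()
```

The same query text could be passed to `get_sql_dataframe()` against the destination database.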

In [None]:
d = {'store_key': [1, 2], 'address_key': [3, 4]}
df_test = pd.DataFrame(data=d)
df_test

In [None]:
# Loading in a small test dimension table

db_operation = "insert"

table_name = 'test_dim'
dataframe = df_test
primary_key = 'store_key'

set_dataframe(mysql_uid, mysql_pwd, dst_db, dataframe, table_name, primary_key, db_operation)

In [None]:
sql_query = '''
SELECT * FROM test_dim;
'''

df = get_sql_dataframe(mysql_uid, mysql_pwd, dst_db, sql_query)
df