# Lab Report 6: Creating and Connecting to Databases
## Name: Afnan Alabdulwahab

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

**Please note: you will not be able to use Rivanna for this lab as Rivanna is not set up to work with Docker or with Databases. If you need help getting your local system running, please let me know.**

In [3]:
import numpy as np
import pandas as pd
import wget
import sqlite3
import sqlalchemy
import requests
import json
import os
import sys
import dotenv
import mysql.connector
import psycopg
import pymongo
from sqlalchemy import create_engine


dotenv.load_dotenv()
POSTGRES_PASSWORD = os.getenv('POSTGRES_PASSWORD')
MONGO_INITDB_ROOT_USERNAME = os.getenv('MONGO_INITDB_ROOT_USERNAME')
MONGO_INITDB_ROOT_PASSWORD = os.getenv('MONGO_INITDB_ROOT_PASSWORD')
mongo_init_db = os.getenv('mongo_init_db')
MYSQL_ROOT_PASSWORD = os.getenv('MYSQL_ROOT_PASSWORD')

### Problem 1 
**This problem requires you to create Markdown tables** 

To create a table in a markdown cell, I recommend using the markdown table generator here: https://www.tablesgenerator.com/markdown_tables. This interface allows you to choose the number of rows and columns, fill in those rows and columns, and push the "generate" button. The website will display markdown table code that looks like:
```
| Day       | Temp | Rain |
|-----------|------|------|
| Monday    | 74   | No   |
| Tuesday   | 58   | Yes  |
| Wednesday | 76   | No   |
```
Copy the markdown code and paste it into a markdown cell in your notebook. Markdown will read the code and display a table that looks like this:

| Day       | Temp | Rain |
|-----------|------|------|
| Monday    | 74   | No   |
| Tuesday   | 58   | Yes  |
| Wednesday | 76   | No   |

Suppose that we have (fake) data on people who were hospitalized and received at least one prescription for a medication. Here are ten records in the data:

(If this table gets cut off in the PDF, please look at the .ipynb notebook file on the module 6 page on Canvas)

| patient_name       | date_of_birth | prescribed_drug | prior_conditions                     | patient_sex | patient_insurance      | drug_maker               | drug_cost | attending_physician | AP_medschool                      | AP_years_experience | hospital                       | hospital_location |
|--------------------|---------------|-----------------|--------------------------------------|-------------|------------------------|--------------------------|-----------|---------------------|-----------------------------------|--------------------|--------------------------------|-------------------|
| Nkemdilim Arendonk | 2/21/1962     | Amoxil          | [Pneumonia, Diabetes]                | M           | Aetna                  | USAntibiotics            | 14.62     | Earnest Caro        | University of California (Irvine) | 14                 | UPMC Presbyterian Shadyside    | Pittsburgh, PA    |
| Nkemdilim Arendonk | 2/21/1962     | Micronase       | [Pneumonia, Diabetes]                | M           | Aetna                  | Pfizer                   | 20.55     | Earnest Caro        | University of California (Irvine) | 14                 | UPMC Presbyterian Shadyside    | Pittsburgh, PA    |
| Raniero Coumans    | 8/15/1990     | Zosyn           | [Appendicitis, Crohn's disease]      | M           | Cigna                  | Baxter International Inc | 394.00    | Pamela English      | University of Michigan            | 29                 | Northwestern Memorial Hospital | Chicago, IL       |
| Raniero Coumans    | 8/15/1990     | Humira          | [Appendicitis, Crohn's disease]      | M           | Cigna                  | Abbvie                   | 7000.00   | Pamela English      | University of Michigan            | 29                 | Northwestern Memorial Hospital | Chicago, IL       |
| Mizuki Debenham    | 3/12/1977     | Inlyta          | [Kidney Cancer]                      | F           | Kaiser Permanente      | Pfizer                   | 21644.00  | Lewis Conti         | North Carolina State University   | 8                  | Houston Methodist Hospital     | Houston, TX       |
| Zoë De Witt        | 11/23/1947    | Atenolol        | [Cardiomyopathy, Diabetes, Sciatica] | F           | Medicare               | Mylan Pharmaceuticals    | 10.58     | Theresa Dahlmans    | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Zoë De Witt        | 11/23/1947    | Micronase       | [Cardiomyopathy, Diabetes, Sciatica] | F           | Medicare               | Pfizer                   | 20.55     | Theresa Dahlmans    | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Zoë De Witt        | 11/23/1947    | Demerol         | [Cardiomyopathy, Diabetes, Sciatica] | F           | Medicare               | Pfizer                   | 37.50     | Theresa Dahlmans    | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Bonnie Hooper      | 7/4/1951      | Xeloda          | [Pancreatic Cancer, Sciatica]        | F           | Blue Cross Blue Shield | Genentech                | 860.00    | Steven Garbutt      | Ohio State University             | 36                 | UCSF Medical Center            | San Francisco, CA |
| Bonnie Hooper      | 7/4/1951      | Demerol         | [Pancreatic Cancer, Sciatica]        | F           | Blue Cross Blue Shield | Pfizer                   | 37.50     | Steven Garbutt      | Ohio State University             | 36                 | UCSF Medical Center            | San Francisco, CA |

The columns in this dataset are:

* **patient_name**: The patient's name
* **date_of_birth**: The patient's date of birth
* **prescribed_drug**: The brand name of the medication that patient has been prescribed
* **prior_conditions**: A list of the conditions that the patient had been diagnosed with prior to the patient's hospitalization
* **patient_sex**: The patient's sex
* **patient_insurance**: The company responsible for the patient's health insurance coverage
* **drug_maker**: The company that manufactures the prescribed drug
* **drug_cost**: The cost of the prescribed drug
* **attending_physician**: The name of the attending physician for the patient
* **AP_medschool**: The name of the school where the attending physician got a medical degree
* **AP_years_experience**: The attending physician's number of years of experience post-residency
* **hospital**: The hospital where the attending physician is employed
* **hospital_location**: The location of the hospital

For this problem, assume that 

1. No two rows in this table share both the same patient and the same prescribed drug.
   
2. Some patients in the data share the same name, but no two patients in the data share the same name and date of birth.

3. No two different drugs share the same brand name.

4. No two attending physicians have the same name, and every attending physician is employed at only one hospital.

5. No two hospitals share the same name, and every hospital exists at only one location.
   
6. Each patient has only one attending physician. (In real-world applications we may want to design a database that allows for multiple hospitalizations for some patients, but here we'll keep it simpler by assuming each patient has one hospitalization with one attending physician.)

#### Part a 

To achieve First Normal Form (1NF) we need to ensure:
1. Every table has a primary key
2. All values are atomic
3. There are no repeating groups

Given these criteria, here's how I will arrange the data into 1NF:
1. Main Table:
    * **Primary Key: (patient_name, date_of_birth, prescribed_drug)**
    * All the columns from the original table, except **prior_conditions**
2. Diagnoses Table:
    * **Primary Key: (patient_name, date_of_birth, prior_condition)**
    * patient_name
    * date_of_birth
    * prior_condition

**prior_conditions** holds a list in a single cell, so I separate it into a new table with one condition per row to make the values atomic.

Main Table:

| patient_name       | date_of_birth | prescribed_drug | patient_sex | patient_insurance      | drug_maker               | drug_cost | attending_physician | AP_medschool                      | AP_years_experience | hospital                       | hospital_location |
|--------------------|---------------|-----------------|-------------|------------------------|--------------------------|-----------|---------------------|-----------------------------------|--------------------|--------------------------------|-------------------|
| Nkemdilim Arendonk | 2/21/1962     | Amoxil          | M           | Aetna                  | USAntibiotics            | 14.62     | Earnest Caro        | University of California (Irvine) | 14                 | UPMC Presbyterian Shadyside    | Pittsburgh, PA    |
| Nkemdilim Arendonk | 2/21/1962     | Micronase       | M           | Aetna                  | Pfizer                   | 20.55     | Earnest Caro        | University of California (Irvine) | 14                 | UPMC Presbyterian Shadyside    | Pittsburgh, PA    |
| Raniero Coumans    | 8/15/1990     | Zosyn           | M           | Cigna                  | Baxter International Inc | 394.00    | Pamela English      | University of Michigan            | 29                 | Northwestern Memorial Hospital | Chicago, IL       |
| Raniero Coumans    | 8/15/1990     | Humira          | M           | Cigna                  | Abbvie                   | 7000.00   | Pamela English      | University of Michigan            | 29                 | Northwestern Memorial Hospital | Chicago, IL       |
| Mizuki Debenham    | 3/12/1977     | Inlyta          | F           | Kaiser Permanente      | Pfizer                   | 21644.00  | Lewis Conti         | North Carolina State University   | 8                  | Houston Methodist Hospital     | Houston, TX       |
| Zoë De Witt        | 11/23/1947    | Atenolol        | F           | Medicare               | Mylan Pharmaceuticals    | 10.58     | Theresa Dahlmans    | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Zoë De Witt        | 11/23/1947    | Micronase       | F           | Medicare               | Pfizer                   | 20.55     | Theresa Dahlmans    | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Zoë De Witt        | 11/23/1947    | Demerol         | F           | Medicare               | Pfizer                   | 37.50     | Theresa Dahlmans    | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Bonnie Hooper      | 7/4/1951      | Xeloda          | F           | Blue Cross Blue Shield | Genentech                | 860.00    | Steven Garbutt      | Ohio State University             | 36                 | UCSF Medical Center            | San Francisco, CA |
| Bonnie Hooper      | 7/4/1951      | Demerol         | F           | Blue Cross Blue Shield | Pfizer                   | 37.50     | Steven Garbutt      | Ohio State University             | 36                 | UCSF Medical Center            | San Francisco, CA |

Diagnoses Table:

| patient_name       | date_of_birth | prior_condition  |
|--------------------|---------------|------------------|
| Nkemdilim Arendonk | 2/21/1962     | Pneumonia        |
| Nkemdilim Arendonk | 2/21/1962     | Diabetes         |
| Raniero Coumans    | 8/15/1990     | Appendicitis     |
| Raniero Coumans    | 8/15/1990     | Crohn's disease  |
| Mizuki Debenham    | 3/12/1977     | Kidney Cancer    |
| Zoë De Witt        | 11/23/1947    | Cardiomyopathy   |
| Zoë De Witt        | 11/23/1947    | Diabetes         |
| Zoë De Witt        | 11/23/1947    | Sciatica         |
| Bonnie Hooper      | 7/4/1951      | Pancreatic Cancer|
| Bonnie Hooper      | 7/4/1951      | Sciatica         |
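
The split of **prior_conditions** into atomic rows can be sketched in pandas. This is only an illustration on two patients copied from the table above; the frame name `wide` and the use of `DataFrame.explode()` are my own choices, not part of the lab instructions.

```python
import pandas as pd

# Two illustrative patients in the original wide layout (values copied from the table above).
wide = pd.DataFrame({
    "patient_name": ["Mizuki Debenham", "Bonnie Hooper"],
    "date_of_birth": ["3/12/1977", "7/4/1951"],
    "prior_conditions": [["Kidney Cancer"], ["Pancreatic Cancer", "Sciatica"]],
})

# explode() emits one row per list element, so every cell ends up atomic.
diagnoses = (
    wide.explode("prior_conditions")
        .rename(columns={"prior_conditions": "prior_condition"})
        .drop_duplicates()
        .reset_index(drop=True)
)
print(diagnoses)
```

Each (patient_name, date_of_birth, prior_condition) combination then appears exactly once, which is what the composite primary key of the Diagnoses Table requires.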

#### Part b 

To move to Second Normal Form (2NF), we need to eliminate partial dependencies. All non-key attributes must depend on the entire key, not just part of it.
To achieve 2NF, we need to separate out the attributes that depend on only part of the primary key in the main table. To do that, I will create the following tables:
1. Patient Table:
    * **Primary Key: (patient_name, date_of_birth)**
    * patient_name
    * date_of_birth
    * patient_sex
    * patient_insurance
    * attending_physician
2. Drugs Table:
    * **Primary Key: (prescribed_drug)**
    * prescribed_drug
    * drug_maker
    * drug_cost

And we're left with:
Main Table:
* **Primary Key: (patient_name, date_of_birth, prescribed_drug)**
* patient_name
* date_of_birth
* prescribed_drug
* AP_medschool
* AP_years_experience
* hospital
* hospital_location

And we carry the Diagnoses Table from part a (1NF):

Diagnoses Table (unchanged)

Patients Table (new):

| patient_name       | date_of_birth | patient_sex | patient_insurance      | attending_physician |
|--------------------|---------------|-------------|------------------------|---------------------|
| Nkemdilim Arendonk | 2/21/1962     | M           | Aetna                  | Earnest Caro        |
| Raniero Coumans    | 8/15/1990     | M           | Cigna                  | Pamela English      |
| Mizuki Debenham    | 3/12/1977     | F           | Kaiser Permanente      | Lewis Conti         |
| Zoë De Witt        | 11/23/1947    | F           | Medicare               | Theresa Dahlmans    |
| Bonnie Hooper      | 7/4/1951      | F           | Blue Cross Blue Shield | Steven Garbutt      |

Drugs Table (new):

| prescribed_drug | drug_maker               | drug_cost |
|-----------------|--------------------------|-----------|
| Amoxil          | USAntibiotics            | 14.62     |
| Micronase       | Pfizer                   | 20.55     |
| Zosyn           | Baxter International Inc | 394.00    |
| Humira          | Abbvie                   | 7000.00   |
| Inlyta          | Pfizer                   | 21644.00  |
| Atenolol        | Mylan Pharmaceuticals    | 10.58     |
| Demerol         | Pfizer                   | 37.50     |
| Xeloda          | Genentech                | 860.00    |

Main Table:

| patient_name       | date_of_birth | prescribed_drug | AP_medschool                      | AP_years_experience | hospital                       | hospital_location |
|--------------------|---------------|-----------------|-----------------------------------|--------------------|--------------------------------|-------------------|
| Nkemdilim Arendonk | 2/21/1962     | Amoxil          | University of California (Irvine) | 14                 | UPMC Presbyterian Shadyside    | Pittsburgh, PA    |
| Nkemdilim Arendonk | 2/21/1962     | Micronase       | University of California (Irvine) | 14                 | UPMC Presbyterian Shadyside    | Pittsburgh, PA    |
| Raniero Coumans    | 8/15/1990     | Zosyn           | University of Michigan            | 29                 | Northwestern Memorial Hospital | Chicago, IL       |
| Raniero Coumans    | 8/15/1990     | Humira          | University of Michigan            | 29                 | Northwestern Memorial Hospital | Chicago, IL       |
| Mizuki Debenham    | 3/12/1977     | Inlyta          | North Carolina State University   | 8                  | Houston Methodist Hospital     | Houston, TX       |
| Zoë De Witt        | 11/23/1947    | Atenolol        | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Zoë De Witt        | 11/23/1947    | Micronase       | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Zoë De Witt        | 11/23/1947    | Demerol         | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           | New York, NY      |
| Bonnie Hooper      | 7/4/1951      | Xeloda          | Ohio State University             | 36                 | UCSF Medical Center            | San Francisco, CA |
| Bonnie Hooper      | 7/4/1951      | Demerol         | Ohio State University             | 36                 | UCSF Medical Center            | San Francisco, CA |


Diagnoses Table (unchanged from 1NF):

| patient_name       | date_of_birth | prior_condition  |
|--------------------|---------------|------------------|
| Nkemdilim Arendonk | 2/21/1962     | Pneumonia        |
| Nkemdilim Arendonk | 2/21/1962     | Diabetes         |
| Raniero Coumans    | 8/15/1990     | Appendicitis     |
| Raniero Coumans    | 8/15/1990     | Crohn's disease  |
| Mizuki Debenham    | 3/12/1977     | Kidney Cancer    |
| Zoë De Witt        | 11/23/1947    | Cardiomyopathy   |
| Zoë De Witt        | 11/23/1947    | Diabetes         |
| Zoë De Witt        | 11/23/1947    | Sciatica         |
| Bonnie Hooper      | 7/4/1951      | Pancreatic Cancer|
| Bonnie Hooper      | 7/4/1951      | Sciatica         |

Now the data satisfies 2NF, as all non-key attributes are fully functionally dependent on their respective primary keys. Patient information and drug information were separated into their own tables because they depended on only part of the original primary key. The main table keeps the full composite key from the original table and includes all attributes that depend on the full key.
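
The same split can be sketched in pandas by projecting the columns for each new table and dropping duplicate rows. This is a minimal sketch on three rows taken from the tables above; the frame name `main` and the reduced set of columns are assumptions made to keep the example short.

```python
import pandas as pd

# Three illustrative prescription rows (values copied from the tables above).
main = pd.DataFrame({
    "patient_name": ["Zoë De Witt", "Zoë De Witt", "Bonnie Hooper"],
    "date_of_birth": ["11/23/1947", "11/23/1947", "7/4/1951"],
    "prescribed_drug": ["Micronase", "Demerol", "Demerol"],
    "patient_insurance": ["Medicare", "Medicare", "Blue Cross Blue Shield"],
    "drug_maker": ["Pfizer", "Pfizer", "Pfizer"],
    "drug_cost": [20.55, 37.50, 37.50],
})

# Attributes that depend only on the patient part of the key.
patients = (
    main[["patient_name", "date_of_birth", "patient_insurance"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Attributes that depend only on the drug part of the key.
drugs = (
    main[["prescribed_drug", "drug_maker", "drug_cost"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# The key columns alone still record which patient received which drug.
prescriptions = main[["patient_name", "date_of_birth", "prescribed_drug"]]
```

Dropping duplicates is what removes the redundancy: Demerol's maker and cost are stored once instead of once per prescription.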

#### Part c 

To meet the requirements of Third Normal Form (3NF), we need to eliminate **transitive dependencies**. 
Looking at the 2NF tables, we can identify the following transitive dependencies in the Main Table:
1. AP_medschool, AP_years_experience, and hospital depend on attending_physician.
2. hospital_location depends on hospital.

To resolve this, I will create two new tables for the physicians and hospitals.

Physicians Table:
* **Primary Key: attending_physician**
* attending_physician
* AP_medschool
* AP_years_experience
* hospital

Hospital Table:
* **Primary Key: hospital**
* hospital
* hospital_location


This leaves the Main Table with the following items (nothing but the key), so it is essentially a **prescription table**:

Main/Prescription Table:
* **Primary Key: (patient_name, date_of_birth, prescribed_drug)**
* patient_name
* date_of_birth
* prescribed_drug

So now we have the following tables:

Patients Table (unchanged from 2NF):

| patient_name       | date_of_birth | patient_sex | patient_insurance      | attending_physician |
|--------------------|---------------|-------------|------------------------|---------------------|
| Nkemdilim Arendonk | 2/21/1962     | M           | Aetna                  | Earnest Caro        |
| Raniero Coumans    | 8/15/1990     | M           | Cigna                  | Pamela English      |
| Mizuki Debenham    | 3/12/1977     | F           | Kaiser Permanente      | Lewis Conti         |
| Zoë De Witt        | 11/23/1947    | F           | Medicare               | Theresa Dahlmans    |
| Bonnie Hooper      | 7/4/1951      | F           | Blue Cross Blue Shield | Steven Garbutt      |

Drugs Table (unchanged from 2NF):

| prescribed_drug | drug_maker               | drug_cost |
|-----------------|--------------------------|-----------|
| Amoxil          | USAntibiotics            | 14.62     |
| Micronase       | Pfizer                   | 20.55     |
| Zosyn           | Baxter International Inc | 394.00    |
| Humira          | Abbvie                   | 7000.00   |
| Inlyta          | Pfizer                   | 21644.00  |
| Atenolol        | Mylan Pharmaceuticals    | 10.58     |
| Demerol         | Pfizer                   | 37.50     |
| Xeloda          | Genentech                | 860.00    |

Diagnoses Table (unchanged from 1NF):

| patient_name       | date_of_birth | prior_condition  |
|--------------------|---------------|------------------|
| Nkemdilim Arendonk | 2/21/1962     | Pneumonia        |
| Nkemdilim Arendonk | 2/21/1962     | Diabetes         |
| Raniero Coumans    | 8/15/1990     | Appendicitis     |
| Raniero Coumans    | 8/15/1990     | Crohn's disease  |
| Mizuki Debenham    | 3/12/1977     | Kidney Cancer    |
| Zoë De Witt        | 11/23/1947    | Cardiomyopathy   |
| Zoë De Witt        | 11/23/1947    | Diabetes         |
| Zoë De Witt        | 11/23/1947    | Sciatica         |
| Bonnie Hooper      | 7/4/1951      | Pancreatic Cancer|
| Bonnie Hooper      | 7/4/1951      | Sciatica         |

Main/Prescription Table (modified):

| patient_name       | date_of_birth | prescribed_drug |
|--------------------|---------------|-----------------|
| Nkemdilim Arendonk | 2/21/1962     | Amoxil          |
| Nkemdilim Arendonk | 2/21/1962     | Micronase       |
| Raniero Coumans    | 8/15/1990     | Zosyn           |
| Raniero Coumans    | 8/15/1990     | Humira          |
| Mizuki Debenham    | 3/12/1977     | Inlyta          |
| Zoë De Witt        | 11/23/1947    | Atenolol        |
| Zoë De Witt        | 11/23/1947    | Micronase       |
| Zoë De Witt        | 11/23/1947    | Demerol         |
| Bonnie Hooper      | 7/4/1951      | Xeloda          |
| Bonnie Hooper      | 7/4/1951      | Demerol         |

Physicians Table (new):

| attending_physician | AP_medschool                      | AP_years_experience | hospital                       |
|---------------------|-----------------------------------|--------------------|--------------------------------|
| Earnest Caro        | University of California (Irvine) | 14                 | UPMC Presbyterian Shadyside    |
| Pamela English      | University of Michigan            | 29                 | Northwestern Memorial Hospital |
| Lewis Conti         | North Carolina State University   | 8                  | Houston Methodist Hospital     |
| Theresa Dahlmans    | Lake Erie College of Medicine     | 17                 | Mount Sinai Hospital           |
| Steven Garbutt      | Ohio State University             | 36                 | UCSF Medical Center            |


Hospitals Table (new):

| hospital                       | hospital_location |
|--------------------------------|-------------------|
| UPMC Presbyterian Shadyside    | Pittsburgh, PA    |
| Northwestern Memorial Hospital | Chicago, IL       |
| Houston Methodist Hospital     | Houston, TX       |
| Mount Sinai Hospital           | New York, NY      |
| UCSF Medical Center            | San Francisco, CA |
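
One way to sanity-check the dependencies named in this part is a small helper that tests whether one set of columns determines another. The helper name `determines` and the three sample rows are hypothetical and only meant as a sketch; a check on sample rows can refute a dependency but cannot prove that it holds for all future data.

```python
import pandas as pd

def determines(df, lhs, rhs):
    """Return True if each combination of `lhs` values maps to exactly one combination of `rhs` values."""
    return len(df[lhs].drop_duplicates()) == len(df[lhs + rhs].drop_duplicates())

# Three illustrative rows (values copied from the tables above).
rows = pd.DataFrame({
    "attending_physician": ["Earnest Caro", "Earnest Caro", "Pamela English"],
    "hospital": [
        "UPMC Presbyterian Shadyside",
        "UPMC Presbyterian Shadyside",
        "Northwestern Memorial Hospital",
    ],
    "hospital_location": ["Pittsburgh, PA", "Pittsburgh, PA", "Chicago, IL"],
    "prescribed_drug": ["Amoxil", "Micronase", "Zosyn"],
})

physician_to_hospital = determines(rows, ["attending_physician"], ["hospital"])
hospital_to_location = determines(rows, ["hospital"], ["hospital_location"])
hospital_to_drug = determines(rows, ["hospital"], ["prescribed_drug"])

print(physician_to_hospital, hospital_to_location, hospital_to_drug)
```

The first two checks correspond to the chain attending_physician → hospital → hospital_location that motivates the Physicians and Hospitals tables; the third shows a pair of columns with no such dependency.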

### Problem 2
For this problem, create ER diagrams of the database you created in problem 1, part c using https://dbdocs.io/. Make sure you install DBDocs on your system by following these instructions: https://dbdocs.io/docs

#### Part a 
My code using the [database markup language](https://dbml.dbdiagram.io/home/) (DBML) that represents all of the tables I created above and the connections between the tables: 

```
Project Patients {
  database_type: ''
  Note: '''
    # (Fake) Data on people who were hospitalized and received at least one prescription for a medication.
    **Database created for DS6001: Lab 6 July 2024**
  '''
}
Table PATIENTS as P {
    patient_name varchar [pk]
    date_of_birth varchar [pk]
    patient_sex varchar
    patient_insurance varchar
    attending_physician varchar
    note: "table 'PATIENTS' contains individual patient information"
}
Table DIAGNOSES as D {
    patient_name varchar [pk]
    date_of_birth varchar [pk]
    prior_condition varchar [pk]
    note: "table 'DIAGNOSES' contains patients' condition diagnoses prior to hospitalization"
}
Table DRUGS as DR {
    prescribed_drug varchar [pk]
    drug_maker varchar
    drug_cost float
    note: "table 'DRUGS' contains patients' prescribed drugs information: maker and cost"
}
Table PRESCRIPTIONS as PR {
    patient_name varchar [pk]
    date_of_birth varchar [pk]
    prescribed_drug varchar [pk]
    note: "table 'PRESCRIPTIONS' contains patients' prescribed drugs"
}
Table PHYSICIANS as PH {
    attending_physician varchar [pk]
    AP_medschool varchar
    AP_years_experience int
    hospital varchar
    note: "table 'PHYSICIANS' contains patients' attending physicians information"
}
Table HOSPITALS as H {
    hospital varchar [pk]
    hospital_location varchar
    note: "table 'HOSPITALS' contains the hospital location"
}

Ref: P.(patient_name, date_of_birth) < PR.(patient_name, date_of_birth)
Ref: P.(patient_name, date_of_birth) < D.(patient_name, date_of_birth)
Ref: PR.prescribed_drug > DR.prescribed_drug
Ref: PH.attending_physician < P.attending_physician
Ref: H.hospital < PH.hospital
```
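
To illustrate what a relationship on the composite patient key means once the schema lives in a real database, here is a minimal sketch using `sqlite3` with only two of the six tables. The column types and the in-memory database are assumptions made for the example; this is not the schema generated by dbdocs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE patients (
    patient_name  TEXT,
    date_of_birth TEXT,
    PRIMARY KEY (patient_name, date_of_birth)
);
CREATE TABLE prescriptions (
    patient_name    TEXT,
    date_of_birth   TEXT,
    prescribed_drug TEXT,
    PRIMARY KEY (patient_name, date_of_birth, prescribed_drug),
    FOREIGN KEY (patient_name, date_of_birth)
        REFERENCES patients (patient_name, date_of_birth)
);
""")

conn.execute("INSERT INTO patients VALUES ('Mizuki Debenham', '3/12/1977')")
conn.execute("INSERT INTO prescriptions VALUES ('Mizuki Debenham', '3/12/1977', 'Inlyta')")

# A prescription for a patient who is not in the patients table should be rejected.
try:
    conn.execute("INSERT INTO prescriptions VALUES ('Unknown Patient', '1/1/2000', 'Amoxil')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

n_prescriptions = conn.execute("SELECT COUNT(*) FROM prescriptions").fetchone()[0]
conn.close()
```

The foreign key spans both key columns together, so a prescription must match an existing (patient_name, date_of_birth) pair rather than each column separately.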

#### Part b

Link to the ER diagram generated using dbdocs:

https://dbdocs.io/AfnanAbdul/Patients?view=relationships

### Problem 3
For this problem, you will download the individual CSV files that comprise a relational database on album reviews from [Pitchfork Magazine](https://pitchfork.com/), collected via webscraping by [Nolan B. Conaway](https://github.com/nolanbconaway/pitchfork-data), and use them to initialize local databases using SQLite, MySQL, and PostgreSQL. 

The following block of code will download the CSV files. Please run this as is:

In [4]:
url = "https://github.com/nolanbconaway/pitchfork-data/raw/master/pitchfork.db"
pfork = wget.download(url)
pitchfork = sqlite3.connect(pfork)
for t in ['artists','content','genres','labels','reviews','years']:
    datatable = pd.read_sql_query("SELECT * FROM {tab}".format(tab=t), pitchfork)
    datatable.to_csv("{tab}.csv".format(tab=t))

100% [....................................................] 83585024 / 83585024

Note: this code downloaded a SQLite database and extracted the tables, saving each one as a CSV. That seems backwards, as the purpose of this exercise is to create databases. But the point here is to practice creating databases from individual data frames. Next we load the CSVs to create the data frames in Python:

In [5]:
reviews = pd.read_csv("reviews.csv")
artists = pd.read_csv("artists.csv")
content = pd.read_csv("content.csv")
genres = pd.read_csv("genres.csv")
labels = pd.read_csv("labels.csv")
years = pd.read_csv("years.csv")
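
Because the download cell calls `.to_csv()` with its default of writing the row index, each CSV carries the index as an unnamed first column, which `pd.read_csv()` then loads as a column called `Unnamed: 0`. The sketch below reproduces this on a made-up two-row frame and shows `index_col=0` as one possible way to avoid the extra column; whether to keep it is a judgment call, and the lab code above is left unchanged.

```python
import io
import pandas as pd

# Made-up stand-in for one of the Pitchfork tables.
df = pd.DataFrame({"title": ["kid a", "animals"], "score": [10.0, 10.0]})

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as a first column with an empty header

buffer.seek(0)
with_index_column = pd.read_csv(buffer)  # the index comes back as 'Unnamed: 0'

buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)  # treat the first column as the index instead

print(list(with_index_column.columns))
print(list(clean.columns))
```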

#### Part a
#### Initializing a new database using SQLite and the `sqlite3` library:

Setting the working directory to where I want to save the database:

In [None]:
os.chdir("/Users/afnan/Documents/DS6001/databases/ds6001databases/M06/Lab")

Creating the database file (calling it 'albumreviews') and establishing a connection to the database with the `.connect()` method:

In [None]:
album_db = sqlite3.connect("albumreviews.db")

To add the six dataframes as entities in the database I just created, I'll use the `.to_sql()` method:

In [10]:
reviews.to_sql('reviews', album_db, index=False, chunksize=1000, if_exists='replace')
artists.to_sql('artists', album_db, index=False, chunksize=1000, if_exists='replace')
content.to_sql('content', album_db, index=False, chunksize=1000, if_exists='replace')
genres.to_sql('genres', album_db, index=False, chunksize=1000, if_exists='replace')
labels.to_sql('labels', album_db, index=False, chunksize=1000, if_exists='replace')
years.to_sql('years', album_db, index=False, chunksize=1000, if_exists='replace')

19108
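
A quick way to confirm that the tables actually landed in the database is to list them from SQLite's built-in `sqlite_master` catalog and count the rows. The sketch below uses an in-memory database and a made-up two-row frame instead of the real `albumreviews.db`, so the names `toy` and `conn` are assumptions for the example only.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Made-up stand-in for the reviews data frame.
toy = pd.DataFrame({"reviewid": [1, 2], "score": [10.0, 8.5]})
toy.to_sql("reviews", conn, index=False, if_exists="replace")

# sqlite_master lists every table stored in the database.
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
n_rows = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
conn.close()

print(tables, n_rows)
```

Running the same two queries against `album_db` should list all six tables.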

### Query using the `.cursor()` Method:
To issue queries of a database, the first step is to create a cursor for the database using the `.cursor()` method:

In [11]:
album_cursor = album_db.cursor()

The next step is to use the `.execute()` method with a string containing the query in SQL code. Then I call `.fetchall()` to get the output and pass it to `pd.DataFrame()` to arrange the output in a dataframe:

In [15]:
query = '''
SELECT title, artist, score FROM reviews WHERE score=10
'''
album_cursor.execute(query)
reviews_df1 = album_cursor.fetchall()
colnames = [x[0] for x in album_cursor.description]
pd.DataFrame(reviews_df1, columns=colnames)

|     | title                                         | artist                                       | score |
|-----|-----------------------------------------------|----------------------------------------------|-------|
| 0   | metal box                                     | public image ltd                             | 10.0  |
| 1   | blood on the tracks                           | bob dylan                                    | 10.0  |
| 2   | another green world                           | brian eno                                    | 10.0  |
| 3   | songs in the key of life                      | stevie wonder                                | 10.0  |
| 4   | in concert                                    | nina simone                                  | 10.0  |
| ... | ...                                           | ...                                          | ...   |
| 71  | source tags and codes                         | ...and you will know us by the trail of dead | 10.0  |
| 72  | the olatunji concert: the last live recording | john coltrane                                | 10.0  |
| 73  | kid a                                         | radiohead                                    | 10.0  |
| 74  | animals                                       | pink floyd                                   | 10.0  |
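
When the score threshold comes from a variable rather than being typed into the SQL string, a parameterized query is the safer pattern. The following is a minimal sketch with `sqlite3` placeholders (`?`) on an in-memory table; the third row is made up so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (title TEXT, artist TEXT, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("kid a", "radiohead", 10.0),
        ("animals", "pink floyd", 10.0),
        ("toy album", "toy artist", 6.5),  # made-up row for illustration
    ],
)

cur = conn.cursor()
# The value is passed separately from the SQL text, so it is never spliced into the string.
cur.execute("SELECT title, artist, score FROM reviews WHERE score = ?", (10.0,))
rows = cur.fetchall()
colnames = [d[0] for d in cur.description]
conn.close()

print(colnames, rows)
```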


### Query using the `pd.read_sql_query()` Method:

`pd.read_sql_query()` takes the SQL query as a string in the first argument and the database connection in the second argument, and returns a dataframe:

In [16]:
reviews_df2 = pd.read_sql_query("SELECT title, artist, score FROM reviews WHERE score=10", album_db)
reviews_df2

|     | title                                         | artist                                       | score |
|-----|-----------------------------------------------|----------------------------------------------|-------|
| 0   | metal box                                     | public image ltd                             | 10.0  |
| 1   | blood on the tracks                           | bob dylan                                    | 10.0  |
| 2   | another green world                           | brian eno                                    | 10.0  |
| 3   | songs in the key of life                      | stevie wonder                                | 10.0  |
| 4   | in concert                                    | nina simone                                  | 10.0  |
| ... | ...                                           | ...                                          | ...   |
| 71  | source tags and codes                         | ...and you will know us by the trail of dead | 10.0  |
| 72  | the olatunji concert: the last live recording | john coltrane                                | 10.0  |
| 73  | kid a                                         | radiohead                                    | 10.0  |
| 74  | animals                                       | pink floyd                                   | 10.0  |
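
The two query styles should return the same rows. A small sketch that checks this on an in-memory database with a made-up two-row table (the variable names are my own):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"title": ["kid a", "animals"], "score": [10.0, 10.0]}).to_sql(
    "reviews", conn, index=False
)

sql = "SELECT title, score FROM reviews WHERE score = 10"

# Style 1: cursor, fetchall, then build the data frame by hand.
cur = conn.cursor()
cur.execute(sql)
via_cursor = pd.DataFrame(cur.fetchall(), columns=[d[0] for d in cur.description])

# Style 2: let pandas run the query and build the data frame.
via_pandas = pd.read_sql_query(sql, conn)
conn.close()

same_columns = list(via_cursor.columns) == list(via_pandas.columns)
same_rows = via_cursor.to_numpy().tolist() == via_pandas.to_numpy().tolist()
print(same_columns, same_rows)
```

`pd.read_sql_query()` is shorter; the cursor route is useful when the rows are needed as plain Python tuples.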


To keep track of the changes made and save the new version of the database, I'll use the `.commit()` method. To free up the resources the database is using on my machine, I'll use the `.close()` method:

In [17]:
album_db.commit()
album_db.close()
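
An alternative to calling `.commit()` and `.close()` by hand is to let context managers do it. In `sqlite3`, using the connection in a `with` block commits on success and rolls back if an exception is raised, but it does not close the connection; `contextlib.closing` handles that part. A minimal sketch on an in-memory database, with a deliberately simulated failure:

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # commits when the block finishes without an error
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")

    try:
        with conn:  # rolls back because the block raises
            conn.execute("INSERT INTO t VALUES (2)")
            raise RuntimeError("simulated failure")
    except RuntimeError:
        pass

    remaining = [row[0] for row in conn.execute("SELECT x FROM t ORDER BY x")]

print(remaining)
```

Only the first insert survives, since the second transaction was rolled back.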

#### Part b
#### Initializing a new database using MySQL and the `mysql.connector` library:

I have a MySQL server running in a Docker container and I've installed `mysql.connector`. Next, I will access the server with the username 'root' and my password, which is loaded from `.env` and saved into the `MYSQL_ROOT_PASSWORD` variable:

In [34]:
dbserver = mysql.connector.connect(
    user='root',
    passwd=MYSQL_ROOT_PASSWORD,
    host="localhost"
)

To work with this server, I need to create a cursor, which uses the `.execute()` and `.fetchall()` methods similar to SQLite:

In [35]:
cursor = dbserver.cursor()

To create a new database within my MySQL server:

In [36]:
# Drop any existing copy first so this cell can be re-run without errors
cursor.execute("DROP DATABASE IF EXISTS albumdb")
cursor.execute("CREATE DATABASE albumdb")

Before we can place data into this empty database, we have to connect to it. This time specifying the database I want to connect to:

In [37]:
albumdb = mysql.connector.connect(
    user='root',
    passwd=MYSQL_ROOT_PASSWORD,
    host="localhost",
    database="albumdb"
)

To input data into the local MySQL database, I will use the `create_engine()` function from the `sqlalchemy` library to interface with the database.
Setting up the engine:

In [38]:
dbms = 'mysql'
connector = 'mysqlconnector'
user = 'root'
pw = MYSQL_ROOT_PASSWORD
host = 'localhost'
database = 'albumdb'
engine_string = f'{dbms}+{connector}://{user}:{pw}@{host}/{database}'

In [39]:
engine = create_engine(engine_string)
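
One caveat with building the engine string by hand: if the password contains characters such as `@` or `/`, they need to be URL-encoded or the string will be parsed incorrectly. A minimal sketch using `urllib.parse.quote_plus` from the standard library; the helper name `build_engine_string` and the example password are hypothetical.

```python
from urllib.parse import quote_plus

def build_engine_string(user, password, host, database,
                        dbms="mysql", connector="mysqlconnector"):
    """Assemble a SQLAlchemy-style URL, escaping special characters in the credentials."""
    return f"{dbms}+{connector}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"

# Hypothetical password containing characters that would otherwise break the URL.
example = build_engine_string("root", "p@ss/word", "localhost", "albumdb")
print(example)
```

The resulting string can be passed to `create_engine()` exactly like `engine_string` above.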

This engine allows us to use the `.to_sql()` method from pandas to place the data frames into the album database. 
First we set a name for each entity, then we pass the sqlalchemy engine, then we specify that the entity should be overwritten with new data if it already exists:

In [40]:
reviews.to_sql('reviews', con=engine, index=False, chunksize=1000, if_exists='replace')
artists.to_sql('artists', con=engine, index=False, chunksize=1000, if_exists='replace')
content.to_sql('content', con=engine, index=False, chunksize=1000, if_exists='replace')
genres.to_sql('genres', con=engine, index=False, chunksize=1000, if_exists='replace')
labels.to_sql('labels', con=engine, index=False, chunksize=1000, if_exists='replace')
years.to_sql('years', con=engine, index=False, chunksize=1000, if_exists='replace')

19108

### Query using the `.cursor()` Method:
To query the database, the first step is to create a cursor for the database using the `.cursor()` method. The next step is to use the `.execute()` method with a string containing the query in SQL code. Then I call `.fetchall()` to get the output and pass it to `pd.DataFrame()` to arrange the output in a dataframe:

In [41]:
cursor = albumdb.cursor()

In [42]:
cursor.execute("SELECT title, artist, score FROM reviews WHERE score=10")
reviews_df1 = cursor.fetchall()
colnames = [x[0] for x in cursor.description]
pd.DataFrame(reviews_df1, columns=colnames)

|     | title                                         | artist                                       | score |
|-----|-----------------------------------------------|----------------------------------------------|-------|
| 0   | metal box                                     | public image ltd                             | 10.0  |
| 1   | blood on the tracks                           | bob dylan                                    | 10.0  |
| 2   | another green world                           | brian eno                                    | 10.0  |
| 3   | songs in the key of life                      | stevie wonder                                | 10.0  |
| 4   | in concert                                    | nina simone                                  | 10.0  |
| ... | ...                                           | ...                                          | ...   |
| 71  | source tags and codes                         | ...and you will know us by the trail of dead | 10.0  |
| 72  | the olatunji concert: the last live recording | john coltrane                                | 10.0  |
| 73  | kid a                                         | radiohead                                    | 10.0  |
| 74  | animals                                       | pink floyd                                   | 10.0  |
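
When the filter value comes from a variable, it is generally safer to pass it as a query parameter than to paste it into the SQL string. The sketch below shows the cursor workflow with a parameter, using the standard-library `sqlite3` module and a made-up two-row table as a stand-in for the MySQL database. Note that `sqlite3` uses `?` as its placeholder, whereas `mysql.connector` uses `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (title TEXT, artist TEXT, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [("kid a", "radiohead", 10.0), ("hypothetical album", "some artist", 7.5)],
)

cursor = conn.cursor()
# The value is passed separately from the SQL text
cursor.execute("SELECT title, artist, score FROM reviews WHERE score = ?", (10.0,))
# cursor.description holds one tuple per column; its first item is the column name
colnames = [col[0] for col in cursor.description]
records = [dict(zip(colnames, row)) for row in cursor.fetchall()]
print(records)
# [{'title': 'kid a', 'artist': 'radiohead', 'score': 10.0}]
conn.close()
```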


### Query using the `pd.read_sql_query()` Method:
Here I pass the SQL query as a string in the first argument and the engine interfacing with the database as the `con` argument. The function returns a data frame:

In [43]:
reviews_df2 = pd.read_sql_query("SELECT title, artist, score FROM reviews WHERE score=10", con=engine)
reviews_df2

|     | title                                         | artist                                       | score |
|-----|-----------------------------------------------|----------------------------------------------|-------|
| 0   | metal box                                     | public image ltd                             | 10.0  |
| 1   | blood on the tracks                           | bob dylan                                    | 10.0  |
| 2   | another green world                           | brian eno                                    | 10.0  |
| 3   | songs in the key of life                      | stevie wonder                                | 10.0  |
| 4   | in concert                                    | nina simone                                  | 10.0  |
| ... | ...                                           | ...                                          | ...   |
| 71  | source tags and codes                         | ...and you will know us by the trail of dead | 10.0  |
| 72  | the olatunji concert: the last live recording | john coltrane                                | 10.0  |
| 73  | kid a                                         | radiohead                                    | 10.0  |
| 74  | animals                                       | pink floyd                                   | 10.0  |


To save any changes made on the MySQL server, I'll use the `.commit()` method, and to free up the resources the database is using on my machine, I'll use the `.close()` method:

In [44]:
dbserver.commit()
dbserver.close()
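
If a cell fails before `.close()` is reached, the connection stays open. One way to guard against that is to wrap the connection in `contextlib.closing`, which calls `.close()` when the block exits, even after an error. This is a sketch that uses an in-memory `sqlite3` database as a stand-in. The same pattern should apply to any connection object that has a `.close()` method.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE demo (x INTEGER)")
    conn.execute("INSERT INTO demo VALUES (1)")
    conn.commit()  # save the change before the connection is closed
    total = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

# After the block the connection is closed, so further use raises an error
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(total, still_open)
# 1 False
```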

#### Part c
#### Initializing a new database using PostgreSQL and the `psycopg` library:

I have a PostgreSQL server running in my Docker container and I've installed the `psycopg` library. To connect to the PostgreSQL server, we can use the `.connect()` method, supplying a username and a password (which is loaded from `.env` and saved in the `POSTGRES_PASSWORD` variable) and setting `host="localhost"` to refer to the server:

In [73]:
# connect to postgres server
dbserver = psycopg.connect(
    user='postgres', 
    password=POSTGRES_PASSWORD, 
    host='localhost',
    port = '5432'
)
dbserver.autocommit = True

After establishing a connection with the local PostgreSQL server, I'll create a cursor that points to the server. I can then use the `.execute()` method to create an empty database for the album data with a SQL statement:

In [74]:
cursor = dbserver.cursor()
try:
    cursor.execute('CREATE DATABASE albumdb')
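except psycopg.errors.DuplicateDatabase:
    # The database already exists, so drop it and start from an empty one
    cursor.execute('DROP DATABASE albumdb')
    cursor.execute('CREATE DATABASE albumdb')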

To connect to the "albumdb" database, I can use the `.connect()` method and specify `dbname="albumdb"` (in psycopg 3, the parameter for specifying the database is `dbname` rather than `database`):

In [83]:
albumdb = psycopg.connect(
    user='postgres', 
    password=POSTGRES_PASSWORD, 
    host='localhost',
    port = '5432',
    dbname="albumdb"
)
albumdb.autocommit = True

To upload the data frames into the database, I'll create an engine with `sqlalchemy` to enable us to use the `.to_sql()` method in pandas:

In [84]:
dbms = 'postgresql'
connector = 'psycopg'
user = 'postgres'
pw = POSTGRES_PASSWORD
host = 'localhost'
port = '5432'
database = 'albumdb'
engine_string = f'{dbms}+{connector}://{user}:{pw}@{host}:{port}/{database}'

In [85]:
engine = create_engine(engine_string)

Now I can pass this engine to the `to_sql()` method to place the data frames into `albumdb`:

In [86]:
reviews.to_sql('reviews', con=engine, index=False, chunksize=1000, if_exists='replace')
artists.to_sql('artists', con=engine, index=False, chunksize=1000, if_exists='replace')
content.to_sql('content', con=engine, index=False, chunksize=1000, if_exists='replace')
genres.to_sql('genres', con=engine, index=False, chunksize=1000, if_exists='replace')
labels.to_sql('labels', con=engine, index=False, chunksize=1000, if_exists='replace')
years.to_sql('years', con=engine, index=False, chunksize=1000, if_exists='replace')

-20

### Query using the `.cursor()` Method:
To query the database, the first step is to create a cursor for the database using the `.cursor()` method. The next step is to use the `.execute()` method with a string containing the query in SQL code. Then I use `.fetchall()` to get the output and pass it to `pd.DataFrame()` to arrange the output in a data frame:

In [87]:
cursor = albumdb.cursor()
cursor.execute("SELECT title, artist, score FROM reviews WHERE score=10")
reviews_df1 = cursor.fetchall()
colnames = [x[0] for x in cursor.description]
pd.DataFrame(reviews_df1, columns=colnames)

|     | title                                         | artist                                       | score |
|-----|-----------------------------------------------|----------------------------------------------|-------|
| 0   | metal box                                     | public image ltd                             | 10.0  |
| 1   | blood on the tracks                           | bob dylan                                    | 10.0  |
| 2   | another green world                           | brian eno                                    | 10.0  |
| 3   | songs in the key of life                      | stevie wonder                                | 10.0  |
| 4   | in concert                                    | nina simone                                  | 10.0  |
| ... | ...                                           | ...                                          | ...   |
| 71  | source tags and codes                         | ...and you will know us by the trail of dead | 10.0  |
| 72  | the olatunji concert: the last live recording | john coltrane                                | 10.0  |
| 73  | kid a                                         | radiohead                                    | 10.0  |
| 74  | animals                                       | pink floyd                                   | 10.0  |


### Query using the `pd.read_sql_query()` Method:
Alternatively, I can use the `sqlalchemy` engine to issue the same query with the `pd.read_sql_query()` function:

In [88]:
reviews_df2 = pd.read_sql_query("SELECT title, artist, score FROM reviews WHERE score=10", con=engine)
reviews_df2

|     | title                                         | artist                                       | score |
|-----|-----------------------------------------------|----------------------------------------------|-------|
| 0   | metal box                                     | public image ltd                             | 10.0  |
| 1   | blood on the tracks                           | bob dylan                                    | 10.0  |
| 2   | another green world                           | brian eno                                    | 10.0  |
| 3   | songs in the key of life                      | stevie wonder                                | 10.0  |
| 4   | in concert                                    | nina simone                                  | 10.0  |
| ... | ...                                           | ...                                          | ...   |
| 71  | source tags and codes                         | ...and you will know us by the trail of dead | 10.0  |
| 72  | the olatunji concert: the last live recording | john coltrane                                | 10.0  |
| 73  | kid a                                         | radiohead                                    | 10.0  |
| 74  | animals                                       | pink floyd                                   | 10.0  |


To save any changes made on the PostgreSQL server, I'll use the `.commit()` method, and to free up the resources the database is using on my machine, I'll use the `.close()` method:

In [90]:
dbserver.commit()
dbserver.close()

### Problem 4
[Colin Mitchell](http://muffinlabs.com/) is a web-developer and artist who has a bunch of [cool projects](http://muffinlabs.com/projects.html) that play with what data can do on the internet. One of his projects is [Today in History](https://history.muffinlabs.com/), which provides an API to access all the Wikipedia pages for historical events that happened on this day in JSON format. The records in this JSON are stored in the `['data']['Events']` path. Here's the first listing for today:

In [89]:
history = requests.get("https://history.muffinlabs.com/date")
history_json = json.loads(history.text)
events = history_json['data']['Events']
events[0]

{'year': '70',
 'text': 'Siege of Jerusalem: Titus, son of emperor Vespasian, storms the Fortress of Antonia north of the Temple Mount. The Roman army is drawn into street fights with the Zealots.',
 'html': '70 - <a href="https://wikipedia.org/wiki/Siege_of_Jerusalem_(AD_70)" class="mw-redirect" title="Siege of Jerusalem (AD 70)">Siege of Jerusalem</a>: <a href="https://wikipedia.org/wiki/Titus" title="Titus">Titus</a>, son of emperor <a href="https://wikipedia.org/wiki/Vespasian" title="Vespasian">Vespasian</a>, storms the <a href="https://wikipedia.org/wiki/Antonia_Fortress" title="Antonia Fortress">Fortress of Antonia</a> north of the <a href="https://wikipedia.org/wiki/Temple_Mount" title="Temple Mount">Temple Mount</a>. The <a href="https://wikipedia.org/wiki/Roman_army" title="Roman army">Roman army</a> is drawn into street fights with the <a href="https://wikipedia.org/wiki/Zealots_(Judea)" class="mw-redirect" title="Zealots (Judea)">Zealots</a>.',
 'no_year_html': '<a href="ht
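
Dictionary keys in the parsed JSON are case-sensitive, so the records have to be read from `'Events'` with a capital E. The short sketch below illustrates this on a made-up payload that mimics only the nesting of the real response. The single record in it is a shortened placeholder, not actual API output.

```python
import json

# Made-up payload that mimics only the nesting of the API response
sample_text = '{"data": {"Events": [{"year": "70", "text": "Siege of Jerusalem"}]}}'
sample_json = json.loads(sample_text)

print(list(sample_json["data"].keys()))
# ['Events']

sample_events = sample_json["data"].get("Events", [])
missing = sample_json["data"].get("events", [])  # the lowercase key does not exist
print(len(sample_events), len(missing))
# 1 0
```

Using `.get()` with a default returns an empty list for a missing key, whereas indexing with `["events"]` would raise a `KeyError`.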

I have a MongoDB server up and running in my Docker container and I've installed the `pymongo` library.

First, I am constructing the `mongo_uri` using the environment variables loaded from my `.env` file in this format: `mongodb://username:password@host:port/database?authSource=admin`. Then, to access the local MongoDB server, I am using `pymongo.MongoClient()` and passing `mongo_uri`:

In [None]:
# Create the connection URI
mongo_uri = f"mongodb://{MONGO_INITDB_ROOT_USERNAME}:{MONGO_INITDB_ROOT_PASSWORD}@localhost:27017/{mongo_init_db}?authSource=admin"

# Create the MongoClient
myclient = pymongo.MongoClient(mongo_uri)
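
To see how the pieces of the URI format fit together, the standard-library `urllib.parse` module can split an example URI back into its parts. This is only an illustration of the format. The credentials `demo_user` and `demo_pw` are made up, and `pymongo` does its own parsing of the URI.

```python
from urllib.parse import urlsplit, parse_qs

# Made-up credentials in the same format as mongo_uri
example_uri = "mongodb://demo_user:demo_pw@localhost:27017/history?authSource=admin"
parts = urlsplit(example_uri)

print(parts.scheme, parts.username, parts.hostname, parts.port)
# mongodb demo_user localhost 27017

database_name = parts.path.lstrip("/")   # the part after host:port/
options = parse_qs(parts.query)          # the part after the question mark
print(database_name, options)
# history {'authSource': ['admin']}
```

The `authSource=admin` option names the database that holds the user's credentials, which here differs from the database named in the path.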

To create a database on this server, I am indexing `myclient` with a string containing the name of the database I want to create:

In [130]:
historydb = myclient["history"]

Here, I am setting up a new collection named "today" on the MongoDB server:

In [131]:
try:
    todaycol = historydb["today"]
except:
    # check if this collection already exists in the "historydb" database
    collist = historydb.list_collection_names()
    if "today" in collist:
        # if so, drop this collection before creating a new
        historydb["today"].drop()
    todaycol = historydb["today"]

To insert **all** the records in `events` (which is already in JSON format) into this collection, I am using the `.insert_many()` method applied to `todaycol` and passing `events`:

In [132]:
# Remove any existing _id field before insertion so that MongoDB generates new, unique _id values
for event in events:
    if '_id' in event:
        del event['_id']
        
todayevents = todaycol.insert_many(events)
todaycol.count_documents({})

268
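
The `_id` clean-up matters because `insert_many()` adds an `_id` key to each dictionary it inserts, so running the cell a second time with the same `events` list would otherwise reuse the old identifiers. A possible alternative, sketched below, is to build cleaned copies and leave the original list untouched. `without_ids` is a hypothetical helper and the sample documents are made up.

```python
def without_ids(documents):
    """Return shallow copies of the documents with any '_id' key removed."""
    return [{k: v for k, v in doc.items() if k != "_id"} for doc in documents]


sample_events = [
    {"_id": "left over from an earlier insert", "year": "1189", "text": "hypothetical event"},
    {"year": "1969", "text": "another hypothetical event"},
]
clean_events = without_ids(sample_events)

print([sorted(doc) for doc in clean_events])
# [['text', 'year'], ['text', 'year']]
print("_id" in sample_events[0])  # the original list is not modified
# True
```

The cleaned list could then be passed to `.insert_many()` in place of `events`.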

To find all of the records whose text contains the word "England", I am using the `.find()` method applied to `todaycol` and passing the following query:

In [133]:
query = {
    "text":{
        "$regex": 'England'
    }
}
engevents = todaycol.find(query)
for event in engevents:
    print(event)

{'_id': ObjectId('669c1cfeb37ee024c41a5c82'), 'year': '1189', 'text': 'Richard I of England officially invested as Duke of Normandy.', 'html': '1189 - <a href="https://wikipedia.org/wiki/Richard_I_of_England" title="Richard I of England">Richard I of England</a> officially invested as <a href="https://wikipedia.org/wiki/Duke_of_Normandy" title="Duke of Normandy">Duke of Normandy</a>.', 'no_year_html': '<a href="https://wikipedia.org/wiki/Richard_I_of_England" title="Richard I of England">Richard I of England</a> officially invested as <a href="https://wikipedia.org/wiki/Duke_of_Normandy" title="Duke of Normandy">Duke of Normandy</a>.', 'links': [{'title': 'Richard I of England', 'link': 'https://wikipedia.org/wiki/Richard_I_of_England'}, {'title': 'Duke of Normandy', 'link': 'https://wikipedia.org/wiki/Duke_of_Normandy'}]}
{'_id': ObjectId('669c20a7b37ee024c41a5cc8'), 'year': '1189', 'text': 'Richard I of England officially invested as Duke of Normandy.', 'html': '1189 - <a href="https
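
The `$regex` filter is case-sensitive by default, so it matches "England" but not "england". The sketch below imitates that behaviour in plain Python with the `re` module on made-up documents. It is an analogue of the filter, not the MongoDB query engine itself. In MongoDB, adding `"$options": "i"` next to `"$regex"` makes the match case-insensitive.

```python
import re

# Made-up documents; only the first one mirrors a record from the output above
sample_docs = [
    {"year": "1189", "text": "Richard I of England officially invested as Duke of Normandy."},
    {"year": "1900", "text": "A hypothetical event in england, written in lowercase."},
    {"year": "1969", "text": "A hypothetical event elsewhere."},
]

# Case-sensitive search, like {"$regex": "England"}
case_sensitive = [d["year"] for d in sample_docs if re.search("England", d["text"])]
# Case-insensitive search, like {"$regex": "England", "$options": "i"}
ignore_case = [
    d["year"] for d in sample_docs if re.search("England", d["text"], re.IGNORECASE)
]
print(case_sensitive, ignore_case)
# ['1189'] ['1189', '1900']
```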

To display the number of documents that match this query, I am using the `.count_documents()` method on the collection and passing the query:

In [134]:
todaycol.count_documents(query)

4

To collect this data and output a JSON file with these records, I am using the `dumps` and `loads` functions from the `bson.json_util` module.

In [135]:
from bson.json_util import dumps, loads

To convert the query output to plain text, I am passing the cursor returned by `.find()` to `dumps()`. Then, to register the text as JSON-formatted data, I am passing the result to `loads()`:

In [136]:
engevents_text = dumps(todaycol.find(query))
engevents_records = loads(engevents_text)
engevents_records 

[{'_id': ObjectId('669c1cfeb37ee024c41a5c82'),
  'year': '1189',
  'text': 'Richard I of England officially invested as Duke of Normandy.',
  'html': '1189 - <a href="https://wikipedia.org/wiki/Richard_I_of_England" title="Richard I of England">Richard I of England</a> officially invested as <a href="https://wikipedia.org/wiki/Duke_of_Normandy" title="Duke of Normandy">Duke of Normandy</a>.',
  'no_year_html': '<a href="https://wikipedia.org/wiki/Richard_I_of_England" title="Richard I of England">Richard I of England</a> officially invested as <a href="https://wikipedia.org/wiki/Duke_of_Normandy" title="Duke of Normandy">Duke of Normandy</a>.',
  'links': [{'title': 'Richard I of England',
    'link': 'https://wikipedia.org/wiki/Richard_I_of_England'},
   {'title': 'Duke of Normandy',
    'link': 'https://wikipedia.org/wiki/Duke_of_Normandy'}]},
 {'_id': ObjectId('669c20a7b37ee024c41a5cc8'),
  'year': '1189',
  'text': 'Richard I of England officially invested as Duke of Normandy.',
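
To actually write the records to a JSON file, the text produced by `dumps()` can be saved with an ordinary file handle. The sketch below assumes a short stand-in string in place of `engevents_text` so that it runs without a MongoDB server. In the text form, `dumps()` represents each `ObjectId` as a `{"$oid": ...}` object. The file name `england_events.json` is a hypothetical choice.

```python
import json
import os
import tempfile

# Stand-in for the string returned by dumps(); the ObjectId appears as {"$oid": ...}
engevents_text_demo = '[{"_id": {"$oid": "669c1cfeb37ee024c41a5c82"}, "year": "1189"}]'

# Hypothetical file name inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "england_events.json")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(engevents_text_demo)

# Reading the file back with the standard json module confirms it is valid JSON
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded[0]["_id"]["$oid"])
# 669c1cfeb37ee024c41a5c82
```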
  

Lastly, to end the session, I am closing the connection to the MongoDB server:

In [137]:
myclient.close()