


# HW 5 - Building a normalized RDB

The goal of this homework is to take a semi-structured non-normalized CSV file and turn it into a set of normalized tables that you then push to your MySQL database on AWS (or to your local MySQL).

The original dataset contains 100k district court decisions, but it has been downsampled to only 1000 rows to make the uploads faster. Each row contains info about a judge, their demographics, party affiliation, etc. Rows also contain information about the case they were deciding on. Was it a criminal or civil case? What year was it? Was the direction of the decision liberal or conservative?

While the current denormalized format is fine for analysis, it's not fine for a database as it violates many normalization rules. Your goal is to normalize it by designing a simple schema, then wrangling it into the proper dataframes, then pushing it all to a database server.

For the first part of this assignment you should wind up with four tables: one with case information, one with judge information, one with casetype information, and one with category information. Each table should be reduced so that it has no repeated rows, and a primary key should be assigned in each. These tables should be called 'casedb_case', 'casedb_judge', 'casedb_casetype', and 'casedb_category'.

For the last part you should make a rollup table that calculates the percent of liberal decisions for each party level and each case category. This gives a quick look at how the political party affiliation of judges impacts the direction of a decision for different case categories (e.g. criminal, civil, labor).

**Submission**

1) Make a copy and replace blank with your name

2) Complete and run all cells. (For DDL and DML cells, re-running will result in an error unless you drop your table first.)

3) Download the .ipynb of the notebook (make sure all cells have appropriate output).

4) Submit on Gradescope


## Bring in data, explore, make schema

Start by bringing in your data to `all_cases_df`. Call `.head()` on it to see what columns are there and what they contain.

In [None]:
import pandas as pd
all_cases_df = pd.read_csv('https://docs.google.com/spreadsheets/d/1AWLK06JOlSKImgoHNTbj7oXR5mRfsL2WWeQF6ofMq1g/gviz/tq?tqx=out:csv')

In [None]:
all_cases_df.head()

### Make schema

OK, given that head, you need to make four related tables that will make up a normalized database. Those tables are 'casedb_case', 'casedb_judge', 'casedb_category', and 'casedb_casetype'. If it's not clear what info should go into each, explore the data more. Find the functional dependencies, and create the tables based on those.

Remember that you might not have keys, and you will need to reduce the rows, select certain columns, etc. There isn't a defined path here.
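
One way to look for functional dependencies is to check whether each value in one column always maps to a single value in another. The sketch below does this on a small made-up dataframe. The helper name `determines` and the toy values are assumptions for illustration only, not part of the assignment data.

```python
import pandas as pd

# Toy data standing in for the raw file; all values here are made up.
toy_df = pd.DataFrame({
    "casetype_id": [1, 1, 2, 3],
    "casetype_name": ["type a", "type a", "type b", "type c"],
    "case_year": [1990, 1995, 1990, 2001],
})

def determines(df, lhs, rhs):
    """Return True if every value of `lhs` maps to at most one value of `rhs`."""
    return bool(df.groupby(lhs)[rhs].nunique().le(1).all())

# An id should determine its name, but it does not determine the year of a case.
print(determines(toy_df, "casetype_id", "casetype_name"))
print(determines(toy_df, "casetype_id", "case_year"))
```

Columns that are determined by the same key are good candidates to live together in one table.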



Let's start by bringing in the connection info, `run_query`, and `sql_head`.

In [None]:
!pip install mysql-connector-python

In [None]:
import mysql.connector

In [None]:
#get_conn_cur/run_query/sql_head


## Make casetype - 5 points


We start with tables that do not have foreign keys. First create a table that contains just the casetype info. I would call the dataframe that you're going to upload `casetype_df` so you don't overwrite your raw data.

Go make the casetype table. This should have only two columns that allow you to link the casetype name back to the ID in the cases table. Note that when you select attributes from `all_cases_df` there will be many duplicated rows, so you have to remove them using `drop_duplicates`. In the end, there should be only 27 rows for casetype.
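
As a rough illustration of the select-then-deduplicate pattern, here is a minimal sketch on a made-up dataframe. The names `raw_df` and `lookup_df` and the toy values are assumptions; your real columns come from `all_cases_df`.

```python
import pandas as pd

# Toy stand-in for the raw data: four cases that share two casetypes.
raw_df = pd.DataFrame({
    "case_id": [101, 102, 103, 104],
    "casetype_id": [1, 2, 1, 2],
    "casetype_name": ["type a", "type b", "type a", "type b"],
})

# Keep only the lookup columns, drop repeated rows, and tidy the index.
lookup_df = (
    raw_df[["casetype_id", "casetype_name"]]
    .drop_duplicates()
    .sort_values("casetype_id")
    .reset_index(drop=True)
)
print(lookup_df)
```

Checking that the id column is unique afterwards (for example with `lookup_df["casetype_id"].is_unique`) is a quick way to confirm it can serve as a primary key.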



In [None]:
# Make casetype_df



### Make casetype table in your database

Put the helper function to create the connection here.
Once you do that you'll need to do the following

* Connect, make a table called 'casedb_casetype' with the correct column names, data types, and primary key. Be sure to execute and commit the table.
* Make tuples of your data
* Write a SQL string that allows you to insert each tuple of data into the correct columns
* Execute the string many times to fill out 'casedb_casetype'
* Commit changes and check the table.

I'm not going to leave a full roadmap beyond this. Feel free to add cells as needed to do the above.
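
The tuple and insert-string steps above can be sketched as follows. This is only one possible shape: the helper names `df_to_tuples`, `make_insert_sql`, and `load_dataframe` are hypothetical, and `conn` and `cur` are assumed to be a mysql-connector connection and cursor that come from your own connection helper. The `%s` placeholder style is the one mysql-connector uses.

```python
import pandas as pd

def df_to_tuples(df):
    """Convert each dataframe row to a plain tuple, in column order."""
    return list(df.itertuples(index=False, name=None))

def make_insert_sql(table_name, columns):
    """Build a parameterized INSERT with one %s placeholder per column."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table_name} ({col_list}) VALUES ({placeholders})"

def load_dataframe(conn, cur, table_name, df):
    """Insert every row of `df`, then commit. Not called here: it needs a live connection."""
    cur.executemany(make_insert_sql(table_name, list(df.columns)), df_to_tuples(df))
    conn.commit()

# Toy data with made-up values, just to show what the pieces look like.
toy_df = pd.DataFrame({"casetype_id": [1, 2], "casetype_name": ["type a", "type b"]})
rows = df_to_tuples(toy_df)
insert_sql = make_insert_sql("casedb_casetype", list(toy_df.columns))
print(rows)
print(insert_sql)
```

Passing the values separately from the SQL string lets the driver handle quoting, which is safer than formatting values into the string yourself.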

In [None]:
#create casedb_casetype table


In [None]:
#run this cell
sql_head(table_name='casedb_casetype')

In [None]:
#load data into casedb_casetype


In [None]:
#TEST #this must return 27
run_query("""SELECT COUNT(*) FROM casedb_casetype;""")

In [None]:
#TEST #this must return contempt of court
run_query("""SELECT casetype_name FROM casedb_casetype WHERE casetype_id = 4;""")

## Make category - 5 points

Do the same to create the `casedb_category` table and load the data.

In [None]:
#create category_df



In [None]:
#create table


In [None]:
#load data


In [None]:
#TEST
run_query("SELECT COUNT(*) FROM casedb_category;")

In [None]:
#TEST
run_query("SELECT category_name FROM casedb_category WHERE category_id = 3;")

In [None]:
#[-2] each failed test
#[-1] missing primary key
#[-4] only create dataframe

## Make judges - 5 points

Now make your judges table from the original `all_cases_df` dataframe.

Judges should have five columns, including the `judge_id` that you have to create using `pd.factorize` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html) on judge name. There should be 553 rows after you drop duplicates (remember that judges may have had more than one case).
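
For reference, `pd.factorize` returns a pair: an array of integer codes and the unique values they stand for. A minimal sketch on made-up names (the toy dataframe is an assumption, not the real data):

```python
import pandas as pd

# Toy data: "Judge A" appears twice and should get the same id both times.
toy_df = pd.DataFrame({"judge_name": ["Judge A", "Judge B", "Judge A", "Judge C"]})

# codes: one integer per row; uniques: the distinct names in order of first appearance.
codes, uniques = pd.factorize(toy_df["judge_name"])
toy_df["judge_id"] = codes

print(toy_df)
print(list(uniques))
```

Note that the codes start at 0 and are assigned in order of first appearance.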

After you make the dataframe, push it to a SQL table called 'casedb_judge'.

In [None]:
#Create judge id and assign to judge_id attribute on all_cases_df


In [None]:
#create judge_df



In [None]:
#create table casedb_judge


In [None]:
#load data


In [None]:
#TEST
run_query("SELECT COUNT(*) FROM casedb_judge")

In [None]:
#TEST
run_query('SELECT judge_name FROM casedb_judge WHERE judge_id = 2')

## Make cases table - 5 points

Finally, we create the table that contains each case's info: `casedb_case`.

This table should have five columns and 1000 rows.

Note that one of these columns should be a `judge_id` that links to the judges table. You'll need to make this a foreign key. You have two other foreign keys as well.
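
A foreign key is declared in the `CREATE TABLE` statement by pointing a column at the primary key of another table. The string below is a hedged sketch of what that could look like in MySQL. The `INT` types and the `category_id` column name are assumptions, and any remaining columns are deliberately left out, so adapt it to your own schema.

```python
# Sketch of a CREATE TABLE statement with three foreign keys, held as a Python string.
# Column types other than BIGINT are assumptions; remaining columns are omitted.
create_case_sql = """
CREATE TABLE IF NOT EXISTS casedb_case (
    case_id BIGINT NOT NULL,
    judge_id INT,
    casetype_id INT,
    category_id INT,
    PRIMARY KEY (case_id),
    FOREIGN KEY (judge_id) REFERENCES casedb_judge (judge_id),
    FOREIGN KEY (casetype_id) REFERENCES casedb_casetype (casetype_id),
    FOREIGN KEY (category_id) REFERENCES casedb_category (category_id)
);
"""
print(create_case_sql)
```

The referenced tables have to exist before this statement runs, which is why the tables without foreign keys were created first.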



In [None]:
# select necessary columns to make cases_df


In [None]:
#create table casedb_case (note that case_id requires larger data type than INT)



In [None]:
#load data into cases


In [None]:
#TEST
run_query("SELECT COUNT(*) FROM casedb_case;")

In [None]:
#TEST
run_query("SELECT * FROM casedb_case WHERE case_id = 15660871")

## A quick test of your tables - 3 points

Below is a query to get the number of unique judges that have ruled on criminal court motion cases. You should get a value of 119 as your return if your database is set up correctly!

In [None]:
run_query("""SELECT COUNT(DISTINCT(casedb_judge.judge_id)) FROM casedb_case
    JOIN casedb_judge ON casedb_case.judge_id = casedb_judge.judge_id
        WHERE casetype_id = (SELECT casetype_id FROM casedb_casetype
                  WHERE casetype_name = 'criminal court motions'); """)


## Make rollup table - 7 points

Now let's make that rollup table! The goal here is to make a summary table that is easily accessed. We're going to roll the whole thing up by the judge's party and the category, but you could imagine doing this for each judge to track how they make decisions over time, which would then be useful for an analytics database. The one we're making could also be used as a dimension table where we needed overall party averages.

We want to get a percentage of liberal decisions at each grouping level (party_name, category_name). To do this we need, first, the number of cases seen at each level and, second, the number of liberal decisions made at each level. The cases data contains the column `libcon_id`, which is 0 if the decision was conservative in its ruling and 1 if it was liberal in its ruling. Thus, you can get a percentage of liberal decisions if you divide the sum of that column by the total observations. Your `agg()` can get both the sum and the count.

After you groupby you'll need to reset the index, rename the columns, then make the percentage.
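
To make the sequence concrete, here is a minimal sketch of the groupby, reset, rename, and percentage steps on a tiny made-up dataframe. The party and category values are invented for illustration; only the column names follow the assignment.

```python
import pandas as pd

# Toy data with made-up values: five decisions across two parties and two categories.
toy_df = pd.DataFrame({
    "party_name": ["Democrat", "Democrat", "Republican", "Republican", "Republican"],
    "category_name": ["civil", "civil", "civil", "criminal", "criminal"],
    "libcon_id": [1, 0, 0, 1, 1],
})

# Count the cases and sum the liberal decisions per group, then flatten and rename.
rollup = (
    toy_df.groupby(["party_name", "category_name"])["libcon_id"]
    .agg(["count", "sum"])
    .reset_index()
    .rename(columns={"count": "total_cases", "sum": "num_lib_decisions"})
)

# Share of liberal decisions as a whole-number percentage.
rollup["percent_liberal"] = (
    rollup["num_lib_decisions"] / rollup["total_cases"] * 100
).round()
print(rollup)
```

Because `libcon_id` is 0 or 1, its sum is the number of liberal decisions and its count is the number of cases in each group.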

Once you do that you can push to a SQL table called 'casedb_rollup'.

Let's get started

In [None]:
# Make a groupby called cases_rollup. This should group by party_name and category_name. It should aggregate the count and sum of libcon_id


In [None]:
# reset your index


In [None]:
# rename your columns now. Keep the first two the same but call the last two 'total_cases' and 'num_lib_decisions'


Now make a new column called 'percent_liberal'

This should calculate the percentage of decisions that were liberal in nature. Multiply it by 100 so that it's a full percent. Also use the `round()` function on the whole thing to keep it in whole percentages.

In [None]:
# make your metric called 'percent_liberal'



Now go and push the whole thing to a table called 'casedb_rollup'.

There should be five columns and nine rows.

In [None]:
#create casedb_rollup table



In [None]:
#load data


In [None]:
# Run this cell
sql_head('casedb_rollup')