To start, we'll load a few modules with the `import` statement. Once a module is imported, its functions are available through the module's name (for example, `np.sort` after `import numpy as np`). Modules help programmers avoid naming conflicts: because each name lives inside its module, you can use short, straightforward names for functions and variables without worrying that they're already taken. In MATLAB, by contrast, functions traditionally share a single search path, which can make it harder to tell where a function comes from.

In [4]:
import sqlite3
import pandas as pd
import re
import numpy as np
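
As a small illustration of how module names keep functions apart, the sketch below uses two standard-library modules (not used elsewhere in this notebook) that both define a function called `sqrt`. Because each call is prefixed with its module name, the two never collide.

```python
import math
import cmath

# Both modules define sqrt, but the module prefix says which one we mean.
real_root = math.sqrt(4)        # square root for real numbers
complex_root = cmath.sqrt(-4)   # square root that also handles negative input

print(real_root, complex_root)
```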

# Getting and cleaning Data

### Download Data

Notes:

1) Optional: to inspect the database outside of Python, download SQLite from https://www.sqlite.org/download.html or the SQLite browser from https://sqlitebrowser.org/dl/
2) You need the `sqlite3` module (part of the Python standard library) and pandas

### Look at the Data

Let's load the data using the `read_csv` function from the pandas package, which we've abbreviated as `pd`. Calling `pd.read_csv` gives us back what is called a pandas DataFrame. A DataFrame can be thought of as a 2D table in which all of the values within a column share the same datatype. For example, every entry in the `Year` column is an integer, while every entry in the `Cause Name` column is a string.

In [5]:
# Use a # to "comment out" anything in a code block - this is a nice way to take notes and document your code!

# Data sets available at https://catalog.data.gov/dataset
data = pd.read_csv('NCHS_-_Leading_Causes_of_Death__United_States.csv', sep=',')


The first thing to do with our DataFrame is to look at its first few rows using `head()`. We often do this just to confirm that we loaded the data correctly (that it has the correct column names).

In [20]:
# Look at data
data.head()

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2012,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,21,2.6
1,2016,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,30,3.7
2,2013,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,30,3.8
3,2000,"Intentional self-harm (suicide) (*U03,X60-X84,...",Suicide,District of Columbia,23,3.8
4,2014,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Arizona,325,4.1


In [23]:
# Describe returns summary statistics on numerical variables
data.describe()

Unnamed: 0,Year,Deaths,Age-adjusted Death Rate
count,10296.0,10296.0,10296.0
mean,2007.5,15367.93,128.037383
std,5.188379,112145.7,224.381865
min,1999.0,21.0,2.6
25%,2003.0,606.0,19.2
50%,2007.5,1704.5,35.8
75%,2012.0,5678.0,153.025
max,2016.0,2744248.0,1087.3


In [8]:
# What are the column names?
print([col for col in data.columns])

['Year', '113 Cause Name', 'Cause Name', 'State', 'Deaths', 'Age-adjusted Death Rate']


### Clean Data

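Spaces, hyphens, and names that start with a number cause problems in pandas and SQL. In this section we replace whitespace and hyphens in the column names with underscores, and then rename the one column whose name starts with a number.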

In [9]:
# Clean names
def clean_names(df):
    L = []
    for col in df.columns:
        L.append(re.sub(r"\s+|-", '_', col))
    df.columns = L    

In [10]:
# Clean names and print out new names
clean_names(data)
print([col for col in data.columns])

['Year', '113_Cause_Name', 'Cause_Name', 'State', 'Deaths', 'Age_adjusted_Death_Rate']


In [11]:
#113 Cause Name will still cause trouble
cols = [col for col in data.columns]
cols[1] = 'Cause_Description'
data.columns = cols
print([col for col in data.columns])

['Year', 'Cause_Description', 'Cause_Name', 'State', 'Deaths', 'Age_adjusted_Death_Rate']
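
The two cleaning steps above can also be folded into one helper. The sketch below is only an illustration: the function name `to_identifier` and the `col_` prefix are made up for this example and are not used elsewhere in the notebook.

```python
import re

def to_identifier(name):
    """Return a version of `name` that is safer to use as a column name."""
    # Replace every run of characters that are not letters, digits or "_"
    cleaned = re.sub(r"\W+", "_", name.strip())
    # Names that start with a digit get a (hypothetical) "col_" prefix
    if cleaned[:1].isdigit():
        cleaned = "col_" + cleaned
    return cleaned

names = ["113 Cause Name", "Age-adjusted Death Rate", "Year"]
cleaned_names = [to_identifier(n) for n in names]
print(cleaned_names)
```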


# Making the database

In [12]:
# Make year table
conn = sqlite3.connect("leading_cases_of_death.sqlite")
cur = conn.cursor()
# Sort list of unique years
yr_unique = np.sort(data.Year.unique()).tolist()
sql_statement1 = ("DROP TABLE IF EXISTS Year")
sql_statement2 = '''CREATE TABLE Year(
                    YearID INTEGER PRIMARY KEY,
                    Year INTEGER NOT NULL
                    )'''
cur.execute(sql_statement1)
cur.execute(sql_statement2)
for yr in yr_unique:
    cur.execute("INSERT INTO Year (Year) VALUES (?)", (yr,))
    conn.commit()
conn.close()
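
The cell above commits once per inserted row. One possible alternative, sketched here against a throwaway in-memory database (an assumption made so the example runs on its own), is to insert all rows with `executemany` and commit a single time.

```python
import sqlite3

years = list(range(1999, 2017))  # 1999 through 2016, the range in the data

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real file
cur = conn.cursor()
cur.execute("CREATE TABLE Year (YearID INTEGER PRIMARY KEY, Year INTEGER NOT NULL)")
# One call inserts every row; each year is wrapped in a 1-tuple
cur.executemany("INSERT INTO Year (Year) VALUES (?)", [(yr,) for yr in years])
conn.commit()  # a single commit for the whole batch

first_rows = cur.execute(
    "SELECT YearID, Year FROM Year ORDER BY YearID LIMIT 3"
).fetchall()
conn.close()
print(first_rows)
```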


In [13]:
# Check year table created
conn = sqlite3.connect("leading_cases_of_death.sqlite")
df_year = pd.read_sql_query("SELECT * FROM Year", conn)
conn.close()
df_year.head()

Unnamed: 0,YearID,Year
0,1,1999
1,2,2000
2,3,2001
3,4,2002
4,5,2003


In [14]:
# Create Cause Table
conn = sqlite3.connect("leading_cases_of_death.sqlite")
cur = conn.cursor()
# Create list of unique causes
cause_unique = data.Cause_Name.unique().tolist()
maxLengthCause = max([len(item) for item in cause_unique])
cause_desc_unique = data.Cause_Description.unique().tolist()
maxLengthDesc = max([len(item) for item in cause_desc_unique])
sql_statement1 = ("DROP TABLE IF EXISTS Cause")
sql_statement2 = '''CREATE TABLE Cause(
                    CauseID INTEGER PRIMARY KEY,
                    Cause_Name VARCHAR({0}),
                    Cause_Description VARCHAR({1})
                    )'''.format(maxLengthCause,maxLengthDesc)
cur.execute(sql_statement1)
cur.execute(sql_statement2)
for i in range(len(cause_unique)):
    cur.execute("INSERT INTO Cause (Cause_Name,Cause_Description) VALUES (?,?)", 
                (cause_unique[i], cause_desc_unique[i]))
    conn.commit()
conn.close()

In [15]:
# Check cause table created
conn = sqlite3.connect("leading_cases_of_death.sqlite")
df_cause = pd.read_sql_query("SELECT * FROM Cause", conn)
conn.close()
df_cause.head()

Unnamed: 0,CauseID,Cause_Name,Cause_Description
0,1,Kidney disease,"Nephritis, nephrotic syndrome and nephrosis (N..."
1,2,Suicide,"Intentional self-harm (suicide) (*U03,X60-X84,..."
2,3,Alzheimer's disease,Alzheimer's disease (G30)
3,4,Influenza and pneumonia,Influenza and pneumonia (J09-J18)
4,5,Diabetes,Diabetes mellitus (E10-E14)


In [16]:
# Make state table
conn = sqlite3.connect("leading_cases_of_death.sqlite")
cur = conn.cursor()
# Sort list of unique states
state_unique = np.sort(data.State.unique()).tolist()
maxLength = max([len(item) for item in state_unique])
sql_statement1 = ("DROP TABLE IF EXISTS State")
sql_statement2 = '''CREATE TABLE State(
                    StateID INTEGER PRIMARY KEY,
                    State VARCHAR({0})
                    )'''.format(maxLength,)
cur.execute(sql_statement1)
cur.execute(sql_statement2)
for state in state_unique:
    cur.execute("INSERT INTO State (State) VALUES (?)", (state,))
    conn.commit()
conn.close()

In [17]:
# Check state table created
conn = sqlite3.connect("leading_cases_of_death.sqlite")
df_state = pd.read_sql_query("SELECT * FROM State", conn)
conn.close()
df_state.head()

Unnamed: 0,StateID,State
0,1,Alabama
1,2,Alaska
2,3,Arizona
3,4,Arkansas
4,5,California


In [21]:
# Make deaths table
conn = sqlite3.connect("leading_cases_of_death.sqlite")
cur = conn.cursor()
# Merge ID values to dataframe
merged = pd.merge(data,df_state,how='outer',on=['State'])
merged = pd.merge(merged,df_cause,how='outer',on=['Cause_Name'])
merged = pd.merge(merged, df_year, how ='outer',on=['Year'])

sql_statement1 = ("DROP TABLE IF EXISTS Deaths")
sql_statement2 = '''CREATE TABLE Deaths (
                    ID INTEGER PRIMARY KEY,
                    YearID INTEGER,
                    CauseID INTEGER,
                    StateID INTEGER,
                    Deaths INTEGER,
                    Age_adjusted_Death_Rate FLOAT,
                    FOREIGN KEY (CauseID) REFERENCES Cause(CauseID),
                    FOREIGN KEY (YearID) REFERENCES Year(YearID),
                    FOREIGN KEY (StateID) REFERENCES State(StateID)
                    );'''
cur.execute(sql_statement1)
cur.execute(sql_statement2)

# Get data into right format
deaths = merged['Deaths'].values.tolist()
age_adj_rate = merged['Age_adjusted_Death_Rate'].values.tolist()
causeID = merged['CauseID'].values.tolist()
yearID = merged['YearID'].values.tolist()
stateID = merged['StateID'].values.tolist()

for i in range(len(merged)):
    cur.execute('''INSERT INTO Deaths 
    (Deaths,Age_adjusted_Death_Rate,CauseID,YearID,StateID) 
    VALUES (?,?,?,?,?)''',
    (deaths[i],age_adj_rate[i],causeID[i],yearID[i],stateID[i]))
    conn.commit()
conn.close()

In [22]:
# Check death table created
conn = sqlite3.connect("leading_cases_of_death.sqlite")
df_death = pd.read_sql_query("SELECT * FROM Deaths", conn)
conn.close()
df_death.head()

Unnamed: 0,ID,YearID,CauseID,StateID,Deaths,Age_adjusted_Death_Rate
0,1,14,1,47,21,2.6
1,2,14,1,9,44,7.2
2,3,14,1,3,414,5.7
3,4,14,1,42,58,5.6
4,5,14,1,49,482,6.6


# Querying the database

## Background

### About Relational Databases

Unlike an Excel spreadsheet or a `pandas` dataframe, data is typically spread across multiple tables in a relational
database. The process of spreading data across multiple tables is called `normalization`. Normalization reduces
redundancies in the database (making the normalized database more compact in terms of disk space), makes it easier
(and safer) to change the value of a cell in a database, and can optimize the queries (or searches).

### About SQLite

Most relational databases require a separate server process, which means you have to connect to that server in order to interact with the database. Here we're using `SQLite`, a relational database with features similar to those of server-based databases (such as `MySQL`, `PostgreSQL`, or `SQL Server`) that doesn't require a server and is essentially plug-and-play. Most importantly, SQLite uses a query language similar to that of the other databases. The language used to query all of these databases is based upon `SQL`, or `Structured Query Language`. We will sometimes point out equivalent commands for MySQL.

### Using Python to Interact With Relational Databases

`sqlite3` is a module within Python that allows you to interact with `SQLite` from the comfort of Python. Alternatively, you could interact with SQLite through a command line interface, but Python makes it easier to run, store, and alter your queries. The `sqlite3` module will establish a connection with the SQLite database and assign this connection to an object, which we are calling `connection`:
```python
connection = sqlite3.connect("my_database")
```
We will use a nice feature of `pandas` which allows us to run a query and load it as a dataframe after we've opened up a connection with the database:
```python
result_df = pd.read_sql_query("My SQLite Query", connection)
```
The `pd` is how we tell the `pandas` module that we are talking to it, and `read_sql_query` is the command we're giving it: it tells `pandas` to use the connection we've given it (inside of the parentheses) to run the query we've also given it, and ultimately to load the results of the query into a dataframe. Notice in the cells below that we type the name of the dataframe on the last line of the code block to see the results of the query. Finally, it is important to remember to close the database when you're done querying it, which you do by telling the connection to close:
```python
connection.close()
```
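If you are worried about forgetting the `close()` call, one option is the standard library's `contextlib.closing`, which closes the connection when the `with` block ends. The sketch below uses plain `sqlite3` and an in-memory database with one made-up row so that it runs on its own; it is an alternative pattern, not the one used in the rest of this notebook.
```python
import sqlite3
from contextlib import closing

# closing(...) calls connection.close() when the with-block is left
with closing(sqlite3.connect(":memory:")) as connection:
    connection.execute("CREATE TABLE State (StateID INTEGER PRIMARY KEY, State TEXT)")
    connection.execute("INSERT INTO State (State) VALUES ('Alabama')")
    rows = connection.execute("SELECT StateID, State FROM State").fetchall()

print(rows)
```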
Lastly, do not worry if everything you just read sounded like gobbledygook. It's sufficient to just think of the commands as a series of spells that will work if you say them in the proper order.

## Getting to Know Your Database

One of the first things you'll want to do with a database with which you're unfamiliar is find out the names of the
`tables` within the database.

In [23]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", connection)
connection.close()
result_df

Unnamed: 0,name
0,Year
1,Cause
2,State
3,Deaths


So our table names are `Year`, `Cause`, `State`, and `Deaths`.

_As an aside:_ Although hopefully the command we used might make a little more sense as you continue with this notebook, it is fairly esoteric. In MySQL the equivalent command is
```SQL
SHOW tables;
```

Next you'll want to find out what are the `column` names in each table.

_As an aside_: In MySQL you can use the command
```SQL
DESCRIBE my_table;
```
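
SQLite has no `DESCRIBE` command, but `PRAGMA table_info(...)` returns similar information. The sketch below runs it against an in-memory copy of the `Cause` table's layout (recreated here with `TEXT` columns as a simplifying assumption so the example runs on its own).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real file
conn.execute(
    "CREATE TABLE Cause (CauseID INTEGER PRIMARY KEY, "
    "Cause_Name TEXT, Cause_Description TEXT)"
)
# Each returned row is (cid, name, type, notnull, dflt_value, pk)
info = conn.execute("PRAGMA table_info(Cause)").fetchall()
conn.close()

column_names = [row[1] for row in info]
print(column_names)
```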

### Year Table

In [24]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT * FROM Year LIMIT 5", connection)
connection.close()
result_df

Unnamed: 0,YearID,Year
0,1,1999
1,2,2000
2,3,2001
3,4,2002
4,5,2003


### Cause Table

In [25]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT * FROM Cause LIMIT 5", connection)
connection.close()
result_df

Unnamed: 0,CauseID,Cause_Name,Cause_Description
0,1,Kidney disease,"Nephritis, nephrotic syndrome and nephrosis (N..."
1,2,Suicide,"Intentional self-harm (suicide) (*U03,X60-X84,..."
2,3,Alzheimer's disease,Alzheimer's disease (G30)
3,4,Influenza and pneumonia,Influenza and pneumonia (J09-J18)
4,5,Diabetes,Diabetes mellitus (E10-E14)


### State Table

In [26]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT * FROM State LIMIT 5", connection)
connection.close()
result_df

Unnamed: 0,StateID,State
0,1,Alabama
1,2,Alaska
2,3,Arizona
3,4,Arkansas
4,5,California


### Deaths Table

In [27]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT * FROM Deaths LIMIT 5", connection)
connection.close()
result_df

Unnamed: 0,ID,YearID,CauseID,StateID,Deaths,Age_adjusted_Death_Rate
0,1,14,1,47,21,2.6
1,2,14,1,9,44,7.2
2,3,14,1,3,414,5.7
3,4,14,1,42,58,5.6
4,5,14,1,49,482,6.6


## Comparison to Original Dataframe

Let's take a look at our original dataframe, which we can do by typing its name, `data`. We're going to once again use the `head` method so that we don't overwhelm ourselves with the full table.

In [28]:
data.head()

Unnamed: 0,Year,Cause_Description,Cause_Name,State,Deaths,Age_adjusted_Death_Rate
0,2012,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,21,2.6
1,2016,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,30,3.7
2,2013,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,30,3.8
3,2000,"Intentional self-harm (suicide) (*U03,X60-X84,...",Suicide,District of Columbia,23,3.8
4,2014,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Arizona,325,4.1


Comparing the original dataframe (which has a format identical to the `csv` file you opened earlier in `Excel`) to the `Deaths` table, we see that the year, cause name, and state are recoded as numbers. This prevents the text Vermont from being stored in the database again and again (once for every Vermont row); instead, a single number (47) is stored in the `Deaths` table. This is exactly what we meant by `normalization`: we've broken up the original table into several tables in order to store the data in a more efficient manner. We can figure out which state the number 47 refers to by looking in the `State` table. We can figure out which year the number 14 refers to by looking in the `Year` table. We can also figure out what cause of death the number 1 refers to by looking in the `Cause` table.
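
To make the lookups concrete, here is a minimal sketch that rebuilds just the relevant rows in an in-memory database (an assumption made so the example runs on its own; the ID values are the ones shown in the tables above) and then decodes the numbers 47, 14, and 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real file
conn.executescript("""
    CREATE TABLE State (StateID INTEGER PRIMARY KEY, State TEXT);
    CREATE TABLE Year  (YearID  INTEGER PRIMARY KEY, Year INTEGER);
    CREATE TABLE Cause (CauseID INTEGER PRIMARY KEY, Cause_Name TEXT);
    INSERT INTO State VALUES (3, 'Arizona'), (47, 'Vermont');
    INSERT INTO Year  VALUES (14, 2012);
    INSERT INTO Cause VALUES (1, 'Kidney disease');
""")

# Look up each ID from the first row of the Deaths table
state = conn.execute("SELECT State FROM State WHERE StateID = ?", (47,)).fetchone()[0]
year = conn.execute("SELECT Year FROM Year WHERE YearID = ?", (14,)).fetchone()[0]
cause = conn.execute("SELECT Cause_Name FROM Cause WHERE CauseID = ?", (1,)).fetchone()[0]
conn.close()

print(state, year, cause)
```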

## Query Language Basics

Now we're going to introduce you to the SQLite query language. This will make some of the queries we ran in the section **Getting To Know Your Database** a little easier to understand, and we invite you to review that section after completing this one.

### The Select Statement

A very common SQL statement is the `SELECT` statement which allows you to select columns from a particular table:
```SQL
SELECT Column Names
FROM Table Names
```

In [29]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT Cause_Name FROM Cause", connection)
connection.close()
result_df

Unnamed: 0,Cause_Name
0,Kidney disease
1,Suicide
2,Alzheimer's disease
3,Influenza and pneumonia
4,Diabetes
5,CLRD
6,Unintentional injuries
7,Stroke
8,Heart disease
9,Cancer


If you don't want to see all of the output, you can add a `LIMIT` clause at the end of the query:
```SQL
SELECT Column Names
FROM Table Names
LIMIT n
```

In [30]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT Cause_Name FROM Cause LIMIT 3", connection)
connection.close()
result_df

Unnamed: 0,Cause_Name
0,Kidney disease
1,Suicide
2,Alzheimer's disease


You can select all of the columns either by typing them out as a comma-separated list or by using `*`:
```SQL
SELECT *
FROM Table Names
LIMIT n
```

In [31]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT Cause_Name, Cause_Description FROM Cause LIMIT 3", connection)
connection.close()
result_df

Unnamed: 0,Cause_Name,Cause_Description
0,Kidney disease,"Nephritis, nephrotic syndrome and nephrosis (N..."
1,Suicide,"Intentional self-harm (suicide) (*U03,X60-X84,..."
2,Alzheimer's disease,Alzheimer's disease (G30)


**Exercise**: Modify the query above so that you get the same result using `*` instead of the column names.

You can also specify a condition using a `WHERE` clause:
```SQL
SELECT Column Names
FROM Table Names
WHERE condition
```

In [32]:
connection = sqlite3.connect("leading_cases_of_death.sqlite")
result_df = pd.read_sql_query("SELECT Cause_Description FROM Cause WHERE Cause_Name = 'CLRD'", connection)
connection.close()
result_df

Unnamed: 0,Cause_Description
0,Chronic lower respiratory diseases (J40-J47)
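
When the value in a `WHERE` condition comes from a Python variable, it can be passed separately with a `?` placeholder, the same mechanism used for the `INSERT` statements earlier. The sketch below uses plain `sqlite3` and an in-memory table holding two rows copied from the `Cause` table so that it runs on its own; the variable name `wanted` is just for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real file
conn.execute(
    "CREATE TABLE Cause (CauseID INTEGER PRIMARY KEY, "
    "Cause_Name TEXT, Cause_Description TEXT)"
)
conn.executemany(
    "INSERT INTO Cause (Cause_Name, Cause_Description) VALUES (?, ?)",
    [("Diabetes", "Diabetes mellitus (E10-E14)"),
     ("CLRD", "Chronic lower respiratory diseases (J40-J47)")],
)

wanted = "CLRD"
# The ? is filled in by sqlite3, so no quoting is needed inside the query
rows = conn.execute(
    "SELECT Cause_Description FROM Cause WHERE Cause_Name = ?", (wanted,)
).fetchall()
conn.close()
print(rows)
```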


**Exercise**: What is the description of unintentional injuries?

### Joins

In [35]:
# Helper that opens the database, runs a query, and returns the result as a DataFrame
def query(sql_statement, db = "leading_cases_of_death.sqlite"):
    conn = sqlite3.connect(db)
    df = pd.read_sql_query(sql_statement, conn)
    conn.close()
    return df

In [36]:
'''Select statement syntax: SELECT [DISTINCT] column 
    FROM table
    WHERE condition'''
question = "DESCRIBE Cause"
answer = query(question)
answer

DatabaseError: Execution failed on sql 'DESCRIBE Cause': near "DESCRIBE": syntax error

In [182]:
# What were the leading causes of death in Oregon in 2016?
db = "leading_cases_of_death.sqlite"
df = query('''SELECT State.State, Deaths.Deaths, Year.Year, Cause.Cause_Name 
      FROM Deaths LEFT OUTER JOIN State
      ON Deaths.StateID = State.StateID
      LEFT OUTER JOIN Year ON Deaths.YearID = Year.YearID
      LEFT OUTER JOIN Cause on Deaths.CauseID = Cause.CauseID
      WHERE State.State = 'Oregon' AND Year.Year = 2016;''', db)
# Year AS y, Cause as c
# y.Year, c.Cause_Name

In [184]:
df

Unnamed: 0,State,Deaths,Year,Cause_Name
0,Oregon,398,2016,Kidney disease
1,Oregon,452,2016,Influenza and pneumonia
2,Oregon,772,2016,Suicide
3,Oregon,1240,2016,Diabetes
4,Oregon,1786,2016,Alzheimer's disease
5,Oregon,1943,2016,Stroke
6,Oregon,2105,2016,Unintentional injuries
7,Oregon,2080,2016,CLRD
8,Oregon,6968,2016,Heart disease
9,Oregon,8078,2016,Cancer
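
To see what a join does on a small scale, the sketch below builds a miniature in-memory version of three of the tables. The two death counts are taken from the Oregon output above, while the ID values are made up for the example. It then joins the tables back together and sorts by the number of deaths.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real file
conn.executescript("""
    CREATE TABLE State  (StateID INTEGER PRIMARY KEY, State TEXT);
    CREATE TABLE Cause  (CauseID INTEGER PRIMARY KEY, Cause_Name TEXT);
    CREATE TABLE Deaths (ID INTEGER PRIMARY KEY, CauseID INTEGER,
                         StateID INTEGER, Deaths INTEGER);
    INSERT INTO State  VALUES (1, 'Oregon');
    INSERT INTO Cause  VALUES (1, 'Heart disease'), (2, 'Cancer');
    INSERT INTO Deaths (CauseID, StateID, Deaths)
        VALUES (1, 1, 6968), (2, 1, 8078);
""")

# Each JOIN swaps an ID in Deaths for the matching row of a lookup table
rows = conn.execute("""
    SELECT State.State, Cause.Cause_Name, Deaths.Deaths
    FROM Deaths
    JOIN State ON Deaths.StateID = State.StateID
    JOIN Cause ON Deaths.CauseID = Cause.CauseID
    ORDER BY Deaths.Deaths DESC
""").fetchall()
conn.close()
print(rows)
```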


end with super complicated query