# Dataset Exploration
----------

## Introduction

In an ideal world, we would have all of the data we want, with all of the desirable properties (no missing values, no errors, standard formats, and so on). 
However, that is hardly ever true, so we have to use the datasets we do have to answer questions of interest as intelligently as possible. 

In this notebook, we will explore our datasets to answer some questions of interest. 

### Learning Objectives

This notebook will give you the opportunity to spend some hands-on time with the data. 

This notebook walks you through several ways to analyze your data: looking at basic metrics in the larger dataset, taking a random sample, creating derived variables, making sense of the missing values, and so on. 

This will be done using both SQL and `pandas` in Python. The `sqlite3` Python package lets you run SQL against the database and pull the results into Python. Some additional manipulations will be handled by `pandas` (by converting your datasets into DataFrames).

This notebook will provide an introduction and examples for: 

- How to create new tables from the larger tables in the database (sometimes called the "analytical frame")
- How to explore different variables of interest
- How to explore aggregate metrics
- How to handle missing values
- How to join newly created tables

### Methods

We will be using the `sqlite3` Python package to access tables in our SQLite3 database. 

To read the results of our queries, we will be using the `pandas` Python package, which has the ability to read tabular data from SQL queries into a pandas DataFrame object. Within `pandas`, we will use various commands to:

- Create statistical summaries
- Create subsets of the data

Within SQL, we will use various queries to:

- Select data subsets
- Sum over groups
- Create new tables
- Count distinct values of desired variables
- Order data by chosen variables
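
As a rough preview of these query types, the sketch below runs each one against a tiny in-memory SQLite table. The table `demo_sentences`, its columns, and its values are made up for illustration and are not part of the course database.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_sentences (person_id TEXT, county TEXT, months INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO demo_sentences VALUES (?, ?, ?)",
    [("A1", "WAKE", 12), ("A1", "WAKE", 6), ("B2", "DURHAM", 24), ("C3", "WAKE", 3)],
)

# Select a subset of the data
subset = demo_conn.execute(
    "SELECT person_id, months FROM demo_sentences WHERE county = 'WAKE'"
).fetchall()

# Sum over groups, ordered by a chosen variable
totals = demo_conn.execute(
    "SELECT county, SUM(months) FROM demo_sentences GROUP BY county ORDER BY county"
).fetchall()

# Count distinct values of a variable
n_people = demo_conn.execute(
    "SELECT COUNT(DISTINCT person_id) FROM demo_sentences"
).fetchone()[0]

# Create a new table from a query
demo_conn.execute(
    "CREATE TABLE wake_only AS SELECT * FROM demo_sentences WHERE county = 'WAKE'"
)
n_wake = demo_conn.execute("SELECT COUNT(*) FROM wake_only").fetchone()[0]

print(subset, totals, n_people, n_wake)
```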

## Python Setup

In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. Among the most famous Python packages:
- `numpy` is short for "numerical Python". `numpy` is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object and a large suite of functions for doing numerical computing. 
- `pandas` is a library in Python for data analysis that uses the DataFrame object (modeled after R data frames, for those familiar with that language), which is similar to a spreadsheet but allows you to do your analysis programmatically rather than with the point-and-click of Excel. It is a lynchpin of the PyData stack and is built on top of `numpy`.  
- `sqlite3` is a library that helps us connect to an SQLite3 database.

In [None]:
# pandas-related imports
import pandas as pd

# database interaction imports
import sqlite3

__When in doubt, use shift + tab to read the documentation of a method.__

__The `help()` function provides information on what you can do with a function.__

In [None]:
# for example
help(sqlite3.connect)

## Load the Data

We can execute SQL queries using Python to get the best of both worlds. For example, Python - and pandas in particular - make it much easier to calculate descriptive statistics of the data. Additionally, as we will see in the Data Visualization exercises, it is relatively easy to create data visualizations using Python. 

Pandas provides many ways to load data. It allows the user to read the data from a local CSV or Excel file, pull the data from a relational database, or read directly from a URL (when you have internet access). Since we are working with an SQLite3 database, we will demonstrate how to use pandas to read data from a relational database. For examples of reading data from a CSV file, refer to the pandas documentation [Getting Data In/Out](pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function to run a SQL query and pull the data into a pandas DataFrame (more to come) is `pd.read_sql()`. Just like running a SQL query from a database client such as pgAdmin, this function needs two things: a connection to the database and the query you would like to run. Let's walk through the example below.

### Establish a Connection to the Database

The first parameter is the connection to the database. To create a connection we will use the `sqlite3` package and tell it which database file we want to connect to. Additional details on creating a connection to the database are provided in the [Databases](02_1_Databases.ipynb) notebook.

__Parameter 1: Connection__

In [None]:
# to create a connection to the database, 
# we need to pass the name of the database 

DB = 'testing/ncdoc.db'

conn = sqlite3.connect(DB)

### Formulate Data Query

Depending on what data we are interested in, we can use different queries to pull different data. In this example, we will pull the first 20 rows of the `inmate` table.

__Create a query as a `string` object in Python__

In [None]:
query = '''
SELECT *
FROM inmate
LIMIT 20;
'''

Note:

- The three quotation marks surrounding the query body create a multi-line string. This is quite handy for writing SQL queries because each newline character becomes part of the string instead of ending it.

In [None]:
# Now that we have defined a variable `query`, we can call it in the code
print(query)

> Note that the `LIMIT` provides one simple way to get a "sample" of data; however, using `LIMIT` does **not provide a _random_** sample. You may get different samples of data than others using just the `LIMIT` clause, but it is just based on what is fastest for the database to return.
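
One common way to get a random sample in SQLite is to shuffle the rows with `ORDER BY RANDOM()` before applying `LIMIT`. The sketch below uses a made-up in-memory table named `people`; it is an illustration under that assumption, not a query against the course database.

```python
import sqlite3

# Hypothetical in-memory table with 1,000 rows (ids 0..999)
sample_conn = sqlite3.connect(":memory:")
sample_conn.execute("CREATE TABLE people (id INTEGER)")
sample_conn.executemany("INSERT INTO people VALUES (?)", [(i,) for i in range(1000)])

# LIMIT alone: returns whichever rows the database reaches first
first_rows = sample_conn.execute("SELECT id FROM people LIMIT 5").fetchall()

# ORDER BY RANDOM() assigns a random sort key to every row, then LIMIT keeps 5
random_rows = sample_conn.execute(
    "SELECT id FROM people ORDER BY RANDOM() LIMIT 5"
).fetchall()

print(first_rows)
print(random_rows)
```

Because `ORDER BY RANDOM()` has to sort every row, it can be slow on very large tables. An alternative is to pull the data into pandas first and use `DataFrame.sample()`.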

### Pull Data from the Database

Now that we have the two parameters (database connection and query), we can pass them to the `pd.read_sql()` function, and obtain the data.

In [None]:
# here we pass the query and the connection to the pd.read_sql() function and assign
# the DataFrame returned by the function to the variable `df`
df = pd.read_sql(query, conn)
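
To see the two parameters working together without the course database, here is a minimal sketch that uses an in-memory SQLite database. The table `inmate_demo` and its columns are hypothetical. It also shows the optional `params` argument of `pd.read_sql()`, which fills in the `?` placeholder instead of pasting values into the query string.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database; table and column names are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE inmate_demo (doc_number TEXT, gender TEXT)")
demo_conn.executemany(
    "INSERT INTO inmate_demo VALUES (?, ?)",
    [("0001", "MALE"), ("0002", "FEMALE"), ("0003", "MALE")],
)
demo_conn.commit()

# Parameter 1 is the query (with a ? placeholder), parameter 2 is the connection
demo_query = "SELECT * FROM inmate_demo WHERE gender = ?"
demo_df = pd.read_sql(demo_query, demo_conn, params=("MALE",))

print(type(demo_df))
print(demo_df.shape)
```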

## Analysis: Using Python and SQL

__What are the characteristics of inmates/offenders in North Carolina?__

Before we go any further, let's take a look at some of the data that we're working with.

__North Carolina Department of Corrections Data__:
- `inmate`: Characteristics about each inmate.
- `offender`: Characteristics about each offender.
- `sentences`: Transactional-level data about sentences.

Note that each row in both the `inmate` and `offender` tables represents one person. However, the same person can have multiple sentences. Information about each person is kept in the `inmate` and `offender` tables, separate from the actual sentences, because we don't want to repeat the information about each person multiple times, as would be the case if we had just one large table with all of the relevant information. 
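
The one-person, many-sentences layout can be sketched with two toy tables. The tables `person` and `sentence` and their columns are hypothetical stand-ins, not the actual schema. Person-level details are stored once, and a join brings them together with the sentence rows when needed.

```python
import sqlite3

# Hypothetical schema: one row per person, many rows per person in `sentence`
rel_conn = sqlite3.connect(":memory:")
rel_conn.execute("CREATE TABLE person (person_id TEXT PRIMARY KEY, birth_year INTEGER)")
rel_conn.execute("CREATE TABLE sentence (person_id TEXT, months INTEGER)")
rel_conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("P1", 1960), ("P2", 1975), ("P3", 1980)],
)
rel_conn.executemany(
    "INSERT INTO sentence VALUES (?, ?)",
    [("P1", 12), ("P1", 36), ("P2", 6)],
)

# Count sentences per person; LEFT JOIN keeps people with no sentence rows
sentences_per_person = rel_conn.execute("""
SELECT p.person_id, p.birth_year, COUNT(s.person_id) AS n_sentences
FROM person AS p
LEFT JOIN sentence AS s ON s.person_id = p.person_id
GROUP BY p.person_id, p.birth_year
ORDER BY p.person_id
""").fetchall()

print(sentences_per_person)
```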

Let's bring in a subset of the offender data to explore it.

In [None]:
query = '''
SELECT *
FROM offender
limit 100;
'''
offender = pd.read_sql(query, conn)

In [None]:
offender.head()

Here, we use the `head()` method to look at the top few rows of the offender data. As you can see, we have lots of information about the person, such as date of birth, gender, height, weight, hair color and so on. Let's see all of the types of variables that we have in this table.

In [None]:
offender.columns

## Identifying Missing Values

We might be concerned about missing values in our data. Let's take a look at some inmate data to show an example of how we might find them.

In [None]:
query = '''
SELECT *
FROM inmate
limit 10000;
'''
inmate = pd.read_sql(query, conn)

In [None]:
inmate.head()

Some values seem to be missing. We don't really care as much about a missing middle initial, but we might be concerned that the inmate's race might be missing. Let's see if we can identify if there are any missing in that variable.

In [None]:
# Missing values
inmate['INMATE_RACE_CODE'].value_counts()

It looks like there's one missing value out of the 10,000 sample that we took from the inmate table.
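
Keep in mind that "missing" can show up in more than one form. By default, `value_counts()` leaves out `NaN`/`None` entries, while an empty string is counted as its own category. The sketch below uses a small made-up Series (not the actual inmate data) to show how to count both kinds.

```python
import pandas as pd

# Hypothetical values: one empty string and one None stand in for missing data
race = pd.Series(["WHITE", "BLACK", "", None, "WHITE"])

counts_default = race.value_counts()          # NaN/None are excluded
counts_all = race.value_counts(dropna=False)  # NaN/None are included

n_null = race.isna().sum()     # true missing values (NaN/None)
n_empty = (race == "").sum()   # empty strings

# Treat empty strings as missing, then count everything that is missing
race_clean = race.mask(race == "")
n_missing = race_clean.isna().sum()

print(counts_all)
print(n_null, n_empty, n_missing)
```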

Also, some offenders are missing the NC County where they were born. Let's see how many.

In [None]:
offender['NC_COUNTY_WHERE_OFFENDER_BORN'].value_counts() # some are missing

This is just for the sample. What about for the whole dataset?

In [None]:
# define the SQL query (single quotes mark string literals in SQL)
query = '''
SELECT count(distinct OFFENDER_NC_DOC_ID_NUMBER)
FROM offender
WHERE NC_COUNTY_WHERE_OFFENDER_BORN = ''
'''
# read the query into a DataFrame
missing_county = pd.read_sql(query, conn)
# print the resulting DataFrame
missing_county
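
A filter on the empty string only catches values stored as `''`; it does not catch SQL `NULL`s. If a column might contain both, they can be counted separately in one query. The sketch below assumes a made-up table `offender_demo` with a `birth_county` column.

```python
import sqlite3

# Hypothetical table with two empty strings and one NULL in birth_county
miss_conn = sqlite3.connect(":memory:")
miss_conn.execute("CREATE TABLE offender_demo (doc_id TEXT, birth_county TEXT)")
miss_conn.executemany(
    "INSERT INTO offender_demo VALUES (?, ?)",
    [("1", "WAKE"), ("2", ""), ("3", None), ("4", "DURHAM"), ("5", "")],
)

# In SQLite a comparison evaluates to 1 or 0 (or NULL), so SUM() counts matches
n_empty, n_null, n_total = miss_conn.execute("""
SELECT SUM(birth_county = ''),
       SUM(birth_county IS NULL),
       COUNT(*)
FROM offender_demo
""").fetchone()

print(n_empty, n_null, n_total)
```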

For reference, we can also find the total number of people. We count both the distinct inmate DOC numbers and the number of rows to make sure that we don't have any duplicates.

In [None]:
# define the SQL query
query = '''
SELECT count(distinct INMATE_DOC_NUMBER), count(*)
FROM inmate
'''
# read the query into a DataFrame
unique_offender = pd.read_sql(query, conn)
# print the resulting DataFrame
unique_offender
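
If the two counts ever differ, a `GROUP BY ... HAVING` query can list which IDs are repeated. This is a sketch on a made-up table `inmate_demo`, in which one ID is deliberately duplicated.

```python
import sqlite3

# Hypothetical table where doc_number '0002' appears twice
dup_conn = sqlite3.connect(":memory:")
dup_conn.execute("CREATE TABLE inmate_demo (doc_number TEXT, last_name TEXT)")
dup_conn.executemany(
    "INSERT INTO inmate_demo VALUES (?, ?)",
    [("0001", "SMITH"), ("0002", "JONES"), ("0002", "JONES"), ("0003", "LEE")],
)

# Distinct IDs versus total rows: a mismatch signals duplicates
n_distinct, n_rows = dup_conn.execute(
    "SELECT COUNT(DISTINCT doc_number), COUNT(*) FROM inmate_demo"
).fetchone()

# List the IDs that occur more than once
duplicates = dup_conn.execute("""
SELECT doc_number, COUNT(*) AS n
FROM inmate_demo
GROUP BY doc_number
HAVING COUNT(*) > 1
""").fetchall()

print(n_distinct, n_rows, duplicates)
```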

## Date Variables

SQL and Python have specific ways of dealing with date variables so that we can use them in intuitive ways. For example, we can extract the year from a date and use it separately from the date itself. Suppose we want to find everyone whose sentence ended during the 1980s.

In [None]:
# Let's look at every sentence that ended in the 1980s

# set the SQL query
query ="""
SELECT *, CAST(strftime('%Y', ACTUAL_SENTENCE_END_DATE) AS integer) AS release_year
FROM sentences
WHERE release_year >= 1980 AND release_year < 1990
"""

# print the query for reference
print(query)

# read the query 

in80 = pd.read_sql(query, conn)

In [None]:
in80.shape
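
The same year extraction can also be done on the Python side. The sketch below uses a small made-up Series of date strings (not the actual sentence data) and assumes the dates are written as `YYYY-MM-DD`. With `errors="coerce"`, values that cannot be parsed become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical date strings, including an empty value and an invalid one
end_dates = pd.Series(["1985-06-30", "1992-01-15", "", "1980-01-01", "not a date"])

# Parse to datetimes; unparseable entries become NaT
parsed = pd.to_datetime(end_dates, errors="coerce")

# Extract the year and flag the 1980s (NaT rows are flagged False)
release_year = parsed.dt.year
in_1980s = release_year.between(1980, 1989)

print(parsed)
print(in_1980s.sum())
```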

## Summary Statistics

In this section, let's start looking at aggregate statistics on the data. 

In [None]:
qry = """
SELECT *
FROM sentences
"""
# read the full sentences table into a DataFrame
sentences = pd.read_sql(qry,conn)

In [None]:
sentences.head()

Note the `INMATE_SENTENCE_COMPONENT` column. This shows that there might be multiple rows for multi-part sentences, and the sentence end date is the same for each of these separate rows. Since we want to make sure they are treated as one whole sentence, we can simply take the first component of each sentence (since we are only interested in the sentence end date for right now). We'll make sure to do this for all future queries.

Let's look at how many sentences ended in the 1980s.

In [None]:
# Note that we're using a slightly different way of determining which sentences ended in the 1980s:
# comparing the date strings directly instead of extracting the year
qry = """
SELECT count(*)
FROM sentences
WHERE ACTUAL_SENTENCE_END_DATE >= '1980-01-01' AND ACTUAL_SENTENCE_END_DATE < '1990-01-01'
AND INMATE_SENTENCE_COMPONENT == '001'
"""
# print results
print(pd.read_sql(qry, conn))
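
Two things make this query work. First, dates stored as `YYYY-MM-DD` text sort in chronological order, so plain string comparison gives the right date range. Second, filtering on the first component keeps one row per sentence. The sketch below illustrates both on a made-up table `sentence_demo`; the column names and values are assumptions for illustration only.

```python
import sqlite3

# Hypothetical rows: inmate 0001 has a two-component sentence with one end date
comp_conn = sqlite3.connect(":memory:")
comp_conn.execute(
    "CREATE TABLE sentence_demo (doc_number TEXT, component TEXT, end_date TEXT)"
)
comp_conn.executemany(
    "INSERT INTO sentence_demo VALUES (?, ?, ?)",
    [
        ("0001", "001", "1985-03-01"),
        ("0001", "002", "1985-03-01"),
        ("0002", "001", "1989-12-31"),
        ("0003", "001", "1990-01-01"),
        ("0004", "001", "1979-12-31"),
    ],
)

date_filter = "end_date >= '1980-01-01' AND end_date < '1990-01-01'"

# Without the component filter, the two-part sentence is counted twice
n_rows = comp_conn.execute(
    f"SELECT COUNT(*) FROM sentence_demo WHERE {date_filter}"
).fetchone()[0]

# With the component filter, each sentence is counted once
n_sentences = comp_conn.execute(
    f"SELECT COUNT(*) FROM sentence_demo WHERE {date_filter} AND component = '001'"
).fetchone()[0]

print(n_rows, n_sentences)
```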

In [None]:
# Let's get this in a data frame to explore further

qry = """
SELECT *
FROM sentences
WHERE ACTUAL_SENTENCE_END_DATE >= '1980-01-01' AND ACTUAL_SENTENCE_END_DATE < '1990-01-01'
AND INMATE_SENTENCE_COMPONENT == '001'
"""
# read the results into a DataFrame
df = pd.read_sql(qry, conn)

In [None]:
# we can get descriptive stats from the DataFrame:
df.describe(include='all')

In [None]:
# check how many records from our inmate data match the offender data

qry = """
SELECT *
FROM offender
JOIN inmate
ON offender.OFFENDER_NC_DOC_ID_NUMBER = inmate.INMATE_DOC_NUMBER
"""
inoff = pd.read_sql(qry, conn)
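
An inner join keeps only the matching rows, so it does not show how many records failed to match. One way to see both sides is a pandas outer merge with `indicator=True`, which adds a `_merge` column labeling each row as `both`, `left_only`, or `right_only`. The two small DataFrames below are made up for illustration.

```python
import pandas as pd

# Hypothetical tables: IDs 2 and 3 appear in both, 1 and 4 in only one
offender_demo = pd.DataFrame({"OFFENDER_ID": ["1", "2", "3"], "height": [70, 65, 72]})
inmate_demo = pd.DataFrame({"INMATE_ID": ["2", "3", "4"], "gender": ["F", "M", "M"]})

# The outer merge keeps unmatched rows from both sides and labels their origin
merged = offender_demo.merge(
    inmate_demo,
    left_on="OFFENDER_ID",
    right_on="INMATE_ID",
    how="outer",
    indicator=True,
)
match_counts = merged["_merge"].value_counts()

print(match_counts)
```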

In [None]:
# what is the distribution of height in our sample?
height = inoff['OFFENDER_HEIGHT_(IN_INCHES)'].astype(float)
inoff['OFFENDER_HEIGHT_(IN_INCHES)'] = height
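
If a column that should be numeric contains empty strings or other non-numeric text, `pd.to_numeric(..., errors="coerce")` is a more forgiving alternative: it converts those entries to `NaN` instead of raising an error. The sketch below uses a made-up Series rather than the actual height column.

```python
import pandas as pd

# Hypothetical raw values read in as text, with an empty string and a None
raw_height = pd.Series(["70", "65", "", "72", None])

# Convert to numbers; anything that cannot be parsed becomes NaN
height_num = pd.to_numeric(raw_height, errors="coerce")

print(height_num)
print(height_num.mean())  # NaN values are skipped when computing the mean
```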

In [None]:
# Percentiles of height
inoff['OFFENDER_HEIGHT_(IN_INCHES)'].describe(percentiles=[0.1,0.25,0.5, 0.75, 0.9])

In [None]:
# Percentiles of height by gender
inoff.groupby('INMATE_GENDER_CODE')['OFFENDER_HEIGHT_(IN_INCHES)'].describe(percentiles=[0.1,0.25,0.5, 0.75, 0.9])
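
When only the percentiles are needed, `groupby(...).quantile()` is a lighter alternative to `describe()`. The sketch below uses a small made-up DataFrame, with hypothetical column names, to show the shape of the result.

```python
import pandas as pd

# Hypothetical heights for two groups
demo = pd.DataFrame(
    {
        "gender": ["F", "F", "F", "M", "M", "M"],
        "height": [60.0, 64.0, 68.0, 66.0, 70.0, 74.0],
    }
)

# One row per group, one column per requested percentile
pct = demo.groupby("gender")["height"].quantile([0.1, 0.5, 0.9]).unstack()

print(pct)
```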