# Intro to Python and Pandas

## Table of Contents
1. [Overview of Python](#Overview-of-Python)
    1. [Loading Libraries](#Loading-Libraries)
    1. [Loading Data](#Load-the-Data)
1. [Data Analysis in Pandas](#Data-Analysis-in-Pandas)
    1. [Displaying Data](#Displaying-Data)
    1. [Subsetting Data](#Subsetting-Data)
    1. [Statistics](#Statistics)
    1. [Adding and Updating Data](#Adding-and-Updating-Data)
    1. [Grouping and Aggregating Data](#Grouping-and-Aggregating-Data)
    1. [Data Cleaning](#Data-Cleaning)
    1. [Saving Data](#Saving-Data)
    1. [Loading CSV Files](#Loading-CSV-Files)
    1. [Merging Dataframes](#Merging-Dataframes)
1. [Visualization](#Visualization)
    1. [Plotting](#Plotting)
    1. [Geospatial Visualization](#Geospatial-Visualization)

## Overview of Python
---
- Back to [Table of Contents](#Table-of-Contents)


> Before coming to class, you should have completed the [DataCamp Intro to Python for Data Science](https://www.datacamp.com/courses/intro-to-python-for-data-science) course. It is free and takes about four hours. 


Python is a high-level, interpreted, general-purpose programming language named after the British comedy troupe Monty Python. It was created by Guido van Rossum (Python's "benevolent dictator for life") and is maintained by an international group of enthusiasts. 

As of this writing (10/2016), Python is the fifth most popular programming language. It is popular for data science because it is powerful and fast, "plays well" with other languages, runs everywhere, is easy to learn, is highly readable and open-source, and allows faster development than many other languages. Because of its general-purpose nature and its ability to call compiled languages like FORTRAN or C, it can be used in full-stack development. There is a growing and always-improving list of open-source libraries for scientific programming, data manipulation, and data analysis (e.g., NumPy, SciPy, Pandas, Scikit-Learn, Statsmodels, Matplotlib, Seaborn, PyTables).

[IPython](http://www.ipython.org) is an enhanced, interactive Python interpreter that started as a grad school project by Fernando Perez. The project evolved into the IPython notebook, which allowed users to archive their code, figures, and analysis in a single document, making it much easier to do reproducible research and to share it. The creators of the IPython notebook quickly realized that the "notebook" aspects were agnostic with respect to programming language, and extended the notebook to support other languages, including but not limited to Julia, Python and R. This then led to a rebranding known as the Jupyter Project. 

This tutorial will go over the basics of Data Analysis in Python using the PyData stack. 

## Loading Libraries
- Back to [Table of Contents](#Table-of-Contents)

### Python Setup

- In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. 
- NumPy is short for numerical Python. NumPy is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object and a large suite of functions for doing numerical computing. 
- Pandas is a Python library for data analysis built around the DataFrame object, which is modeled on the data frame in R. A DataFrame is similar to a spreadsheet, but it lets you do your analysis programmatically rather than with the point-and-click of Excel. It is a lynchpin of the PyData stack.  
- Psycopg2 is a Python library for interfacing with a PostgreSQL database. 
- Matplotlib is the standard plotting library in Python. 
- `%matplotlib inline` is a so-called "magic" function of Jupyter that enables plots to be displayed inline with the code and text of a notebook. 

In [None]:
# use the __future__ version of division and print
# (a __future__ import must be the first statement in the cell)
from __future__ import division, print_function

# remember to put this line in your notebook, otherwise the visualization won't show up
%pylab inline
# import the packages
# numpy for array and matrix computation
import numpy as np

# pandas for data analysis
import pandas as pd

# matplotlib and seaborn are the data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# sqlalchemy and psycopg2 are sql connection packages
from sqlalchemy import create_engine

# GeoPandas for spatial data manipulation
import geopandas as gpd
# PySAL for spatial statistics
import pysal as ps
# shapely for specific spatial data tasks (GeoPandas uses Shapely objects)
from shapely.geometry import Point, LineString, Polygon

# configure pandas display: set the maximum number of columns displayed to 25
pd.options.display.max_columns = 25

In practice we typically load libraries like `numpy` and `pandas` with shortened aliases, e.g., `import numpy as np`. This is like saying, "`import numpy`, and wherever you see `np`, read it as `numpy`." Similarly, you'll often see `import pandas as pd`, or `import matplotlib.pyplot as plt`. Aliases exist because it's faster to type `plt.plot()` than `matplotlib.pyplot.plot()`, and even programmers don't like to type more than they have to. 

Another shortcut is `%pylab inline`. This command includes both `import numpy as np` and `import matplotlib.pyplot as plt`. 

In documentation and in examples, you will frequently see `numpy` commands starting with the alias `np` rather than `numpy` (e.g., `np.array()` or `np.argsort()`) and `pandas` commands starting with `pd` (e.g., `pd.DataFrame()` or `pd.concat()`).
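
As a small illustration of the aliases in action, here is a sketch on made-up values (the names `values`, `toy` and `combined` are hypothetical and not part of the course data):

```python
import numpy as np
import pandas as pd

# np.array() is numpy.array(); the alias only shortens the name
values = np.array([3, 1, 2])
order = np.argsort(values)          # indices that would sort the array

# pd.DataFrame() is pandas.DataFrame()
toy = pd.DataFrame({"x": values})
combined = pd.concat([toy, toy], ignore_index=True)   # stack two copies

print(order)
print(combined.shape)
```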

#### When in doubt, use shift + tab to read the documentation of a method.
#### The `help()` function provides information on what you can do with a function.

## Load the Data
- Back to [Table of Contents](#Table-of-Contents)

Instead of using pgAdmin or the command-line SQL tool directly, we can also carry out SQL queries using Python. The real power of Python and pandas, though, is that they greatly simplify computing descriptive statistics, which is complicated, if not impossible, to do in SQL alone. Moreover, Python and pandas together with the matplotlib package can create data visualizations that greatly help data analysis. We will see some of these advantages in the following sections.

Pandas provides many ways to load data. It can read data from a local CSV or Excel file, or pull data from a relational database. Since we are working with the relational database appliedda in this course, we will demonstrate how to use pandas to read data from a relational database. For examples of reading data from a CSV file, refer to the pandas documentation [Getting Data In/Out](http://pandas.pydata.org/pandas-docs/stable/10min.html#getting-data-in-out).

The function that runs a SQL query and puts the result into a pandas dataframe (more on dataframes below) is `pd.read_sql()`. Just like running a query from pgAdmin, this function needs some information about the database and the query you would like to run. Let's walk through the example below.

### Establish a connection to the appliedda database
In the simplest case, `pd.read_sql()` requires only 2 parameters to pull data: a connection to the database and a query. We prepare the connection first. To create it, we use the `create_engine` function from SQLAlchemy (which talks to PostgreSQL through psycopg2) and tell it which database and which host we want to connect to, just like in pgAdmin.

#### Parameter 1: connection

In [None]:
# to connect to the database, we pass the host and the name of the database
# to SQLAlchemy's create_engine function as a connection URL
DB_NAME = "appliedda" # specify the name `appliedda`
DB_HOST = "10.10.2.10" # specify the host address
# build the connection URL, and save the resulting engine to a variable (sql_connection)
# note: create_engine does not connect yet; the connection is opened when the first query runs
sql_connection = create_engine('postgresql://{}/{}'.format(DB_HOST, DB_NAME))
print("success")

Note:

- a good practice in Python variable naming is to use all upper case letters for variables whose value you don't plan to change throughout the program

#### Parameter 2: query
This part is similar to writing a SQL query in pgAdmin. Depending on what data we are interested in, we can use different queries to pull different data. In this example, we will pull all columns of the `idhs.hh_indcase_spells` table for spells that started on or after January 1, 2015 and ended on or before March 31, 2015.

In [None]:
QUERY = '''
SELECT *
FROM idhs.hh_indcase_spells h
WHERE h.start_date >= '2015-01-01' AND h.end_date <= '2015-03-31'
'''

Note:

- the three quotation marks surrounding the query body create a multi-line string. It is quite handy for writing SQL queries because newline characters are kept as part of the string instead of ending it

### Pull data from the database
Now that we know what the arguments are for the query, we can pass them to the `pd.read_sql()` function, and obtain the data.

In [None]:
# here we pass the query and the connection to the pd.read_sql() function and assign the dataframe
# returned by the function to the variable `member_spell`
member_spell = pd.read_sql(QUERY, con=sql_connection, parse_dates=['start_date', 'end_date'])
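
The same `pd.read_sql()` pattern can be tried without access to the course database. The sketch below is only an illustration: it assumes a throwaway in-memory SQLite database in place of appliedda, and the table `spells` and its three rows are made up.

```python
import sqlite3
import pandas as pd

# stand-in for the course database: a throwaway in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spells (recptno INTEGER, start_date TEXT, end_date TEXT)")
conn.executemany(
    "INSERT INTO spells VALUES (?, ?, ?)",
    [(1, "2015-01-05", "2015-02-10"),
     (2, "2015-02-01", "2015-03-15"),
     (3, "2014-12-01", "2015-01-20")],   # starts before 2015, so it is filtered out
)

TOY_QUERY = '''
SELECT *
FROM spells
WHERE start_date >= '2015-01-01' AND end_date <= '2015-03-31'
'''

# parse_dates converts the listed columns from text to datetime values
toy_spell = pd.read_sql(TOY_QUERY, con=conn, parse_dates=["start_date", "end_date"])
print(toy_spell.dtypes)
conn.close()
```

Without `parse_dates`, the two date columns would come back as plain strings here, and the date comparisons used later in this notebook would not behave like date comparisons.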

By now we have finished loading the data, and we are ready to do some data analysis.

## Data Analysis in Pandas
- Back to [Table of Contents](#Table-of-Contents)

Before diving into the data analysis, let's get a basic understanding of the dataframe. It is the pandas representation of a spreadsheet or SQL table. It contains information such as column names, row indices (starting from 0), and the actual data. Dataframes are the basic objects on which we will perform our data analysis.

### Displaying Data

#### The shape of the dataframe
When we get the data, we usually want to know how many rows and columns are there in the data. We can find out the row and column numbers by calling the shape instance variable with a dot operator.

In [None]:
# shape of a dataframe (row number, column number)
member_spell.shape
# our member_spell dataframe contains 187371 rows, and 23 columns

Note:

- pay attention to `shape`. It is an _instance variable_ (attribute), so you access it with the name of the dataframe followed by a dot and `shape`, without a pair of parentheses. Parentheses are only used to call _methods_. Since the concepts of _instance variable_ and _method_ in Object-Oriented Programming are beyond the scope of this course, just keep in mind that when using `shape`, you don't add parentheses at the end.

#### The head and tail of the dataframe
It is also helpful to look at the first or last few rows of the data for a first impression, as well as a sanity check. We can call the `head()` and `tail()` methods to do so.

In [None]:
# display the first few rows of the dataframe
# we can also specify how many lines we would like to see in the parentheses at the end. We choose to display 10.
# If not specified, by default the first 5 lines will be returned
member_spell.head(10)

In [None]:
# last few rows of the dataframe
# the syntax is similar to head
member_spell.tail()
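
To make the difference between the `shape` attribute and the `head()`/`tail()` methods concrete, here is a minimal sketch on a made-up dataframe (`toy` and its values are hypothetical):

```python
import pandas as pd

# a small made-up dataframe with 12 rows and 2 columns
toy = pd.DataFrame({"recptno": range(100, 112), "sex": [1, 2] * 6})

n_rows, n_cols = toy.shape     # attribute: no parentheses
first_three = toy.head(3)      # method: parentheses, with an optional row count
last_default = toy.tail()      # without an argument, 5 rows are returned

print(n_rows, n_cols)
print(first_three)
```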

### Columns, rows, data selection

#### Column names
To see which columns are in the data, use the following syntax. Notice that `columns` is also an instance variable, not a method, so there are no parentheses at the end.

In [None]:
member_spell.columns

#### Single column selection
If we want to select a specific column, we can use the following syntax:

In [None]:
# select a single column: the dataframe variable name, followed by square brackets, and then put
# the column name between quotes (either single or double). We can save this column to a new variable `race`
race = member_spell['rootrace']
race.head()

In [None]:
member_spell.head()

#### Multiple-column selection
To select multiple columns, wrap the column names in a Python list, then put the list between the brackets after the dataframe.

In [None]:
# here we select the columns recptno, sex, rootrace, start_date and end_date in the member_spell dataframe
# and assign them to a new variable some_attri
# (.copy() makes some_attri an independent dataframe, so adding columns to it later is safe)
some_attri = member_spell[['recptno', 'sex', 'rootrace',
                           'start_date','end_date']].copy()
some_attri.head()


#### Single/multiple cell selection
Use the `loc` indexer for cell selection. Pass the row and column labels in _square brackets_ after `loc`: the row label first, then the column name, separated by a comma. Note that when you select a range with a colon, both endpoints are included.

In [None]:
# single cell selection
# select the cell in row 5 and column start_date
cell = some_attri.loc[5, 'start_date']
cell

In [None]:
# multiple cells selection
# option 1: use a python list to explicitly list the rows/columns
cell = some_attri.loc[[5, 7, 9], 'start_date']
cell

In [None]:
# option 2: use colon to indicate contiguous selection
cell = some_attri.loc[5:10, 'start_date']

#### Row selection
We can also use the `loc` indexer to select row(s).

In [None]:
# if we want to select all columns, we can use a colon symbol :.
row5 = some_attri.loc[5, :]
row5
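
Because `loc` works with labels, a range selected with it includes both endpoints, unlike ordinary Python slicing. The sketch below contrasts `loc` with the position-based `iloc` indexer, which this notebook does not otherwise use; the dataframe `toy` and its values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {"sex": [1, 2, 2, 1, 2, 1], "rootrace": [0, 1, 2, 0, 3, 1]},
    index=[10, 11, 12, 13, 14, 15],   # labels deliberately differ from positions
)

by_label = toy.loc[11:13, "sex"]      # labels 11, 12 and 13: both ends included
by_position = toy.iloc[1:3, 0]        # positions 1 and 2: the end is excluded
one_row = toy.loc[12, :]              # a single row comes back as a Series

print(by_label.tolist())
print(by_position.tolist())
```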

### Subsetting Data
#### Subsetting numerical data
Similar to the `WHERE` clause in SQL, we can select only the data that meet a certain condition. The syntax depends on whether the data is numerical or string. For example, if we would like to select rows whose start date is on or after March 1, 2015 (let's consider datetime data as numerical for the moment), we can subset with a greater-than-or-equal-to condition.

In [None]:
# conditional subsetting: put the conditional statement within the square brackets after the dataframe
# the conditional statement here is that we want the date to be later than or equal to March 1st.
# (pd.Timestamp gives a value that compares cleanly with a datetime column)

march_start = some_attri[some_attri['start_date'] >= pd.Timestamp(2015, 3, 1)]
march_start.head()

In [None]:
some_attri

#### Subsetting string/categorical data
When the column contains string or categorical data, comparison operators may not be the right tool for selecting data. Instead, we can check each value in a column against a target list to see whether it is included in the list. This is done by calling the `isin` method.

In [None]:
# select race whose value is 1 or 2
# we specify the target list within the parentheses of the `isin` method
rac1_2 = some_attri[some_attri['rootrace'].isin([1, 2])]
rac1_2.head()

#### Subsetting with multiple conditions
If we want to subset the data with more than one condition, we can specify all the conditions and combine them with the `&` operator (element-wise "and"). Remember to put every single condition within a pair of parentheses.

In [None]:
# select data whose start and end date both fall within Feb
feb = some_attri[(some_attri['start_date'] >= pd.Timestamp(2015, 2, 1)) &
                 (some_attri['end_date'] <= pd.Timestamp(2015, 2, 28))]
feb.head()
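
Each condition produces a column of True/False values (a boolean mask), and masks can be stored in variables and combined. The following sketch shows this on made-up data; the names `in_feb` and `race_1_2` are hypothetical, and the date bounds are written with `pd.Timestamp`.

```python
import pandas as pd

toy = pd.DataFrame({
    "rootrace": [0, 1, 2, 3, 1],
    "start_date": pd.to_datetime(
        ["2015-01-10", "2015-02-03", "2015-02-15", "2015-02-20", "2015-03-01"]),
    "end_date": pd.to_datetime(
        ["2015-01-31", "2015-02-20", "2015-03-10", "2015-02-25", "2015-03-20"]),
})

# each comparison yields one True/False value per row
in_feb = (toy["start_date"] >= pd.Timestamp(2015, 2, 1)) & \
         (toy["end_date"] <= pd.Timestamp(2015, 2, 28))
race_1_2 = toy["rootrace"].isin([1, 2])

# masks combine with & just like the conditions themselves
feb_race_1_2 = toy[in_feb & race_1_2]
print(feb_race_1_2)
```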

### Statistics
- Back to [Table of Contents](#Table-of-Contents)
#### Descriptive stats
Pandas has some very useful built-in tools to help us understand the distribution of the data. The `describe` method computes the most commonly used descriptive statistics, such as count, mean, standard deviation and quantiles, for a dataframe. Since the welfare data contains few numerical columns, we will use the Illinois wage data to demonstrate the corresponding methods. For example, if we want to know the wage distribution of Illinois, we can use the `describe` method on the wage data.

In [None]:
# if we are interested in the summary statistics of the second quarter of 2015,
# it will take a more complicated SQL query. 
# Instead, we can retrieve the data of the second quarter of 2015 (limited here for example)
# from a SQL database to pandas, and use pandas to describe the data
WAGE_QUERY = '''
SELECT ssn, year, quarter, wage
FROM ides.il_wage
WHERE year = 2015 AND quarter = 2
LIMIT 100;
'''
wage = pd.read_sql(WAGE_QUERY, con=sql_connection)

In [None]:
# see the descriptive statistics of the wage column
wage['wage'].describe()

We can see that the average income in the second quarter of 2015 is around XXX, and the median income is around XXX. Because the mean is larger than the median, we know the wage data is right-skewed.
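
The link between skew and the gap between mean and median is easy to check on a tiny made-up series, where a single large value pulls the mean up but leaves the median alone (the numbers below are invented and are not course data):

```python
import pandas as pd

# four modest wages and one very large one
toy_wage = pd.Series([1000, 2000, 3000, 4000, 40000], name="wage")
summary = toy_wage.describe()

mean_wage = summary["mean"]      # pulled upward by the large value
median_wage = summary["50%"]     # the 50% quantile is the median

print(mean_wage, median_wage)
print(mean_wage > median_wage)
```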

#### Value counts and unique values
For categorical values, it is often helpful to find the unique values of a given column and how many times each one occurs. Let's go back to the welfare data.

In [None]:
# find out which different race values are in the data

some_attri['rootrace'].unique()

In [None]:
# count how many times each race value appears in the data
some_attri['rootrace'].value_counts()

We know from the data dictionary that 0 represents not applicable, 1 represents white, not of Hispanic origin, 2 represents black, not of Hispanic origin, etc. So we can see from the data that most race information is not available in the welfare data.
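
When the question is what share of records falls in each category, `value_counts` can return fractions instead of counts. A small sketch on invented codes (`toy_race` is hypothetical):

```python
import pandas as pd

toy_race = pd.Series([0, 0, 1, 2, 0, 1], name="rootrace")

distinct = toy_race.unique()                      # values in order of first appearance
counts = toy_race.value_counts()                  # sorted by count, largest first
shares = toy_race.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(shares)
```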

### Adding and Updating Data
- Back to [Table of Contents](#Table-of-Contents)
#### Creating columns
We sometimes need to create a new column, either to save the result of a calculation on other columns or to add new information to the dataframe. The syntax is:

`dataframe['column_name'] = value`

where:
- `dataframe` is the dataframe in which the new column is created;
- `column_name` is the name of the new column, as a string;
- `value` is the value of each cell.

Let's see how to create a new column in the wage dataframe that contains the average monthly wage of the second quarter of 2015.

In [None]:
# step1: we can start by creating a new column named monthly, and pass a default None value to every cell
wage['monthly'] = None
# step2: we can then calculate the monthly wage by dividing the wage column by 3, and assign this newly computed
# column to the monthly column
wage['monthly'] = wage['wage']/3

# wage['monthly'].head()
wage.head()
# note: we can skip step1 after getting more familiar with the dataframe

In [None]:
# let's see the descriptive stats of the newly created monthly wage column
wage.describe()

### Grouping and Aggregating Data
- Back to [Table of Contents](#Table-of-Contents)
#### Group by and aggregation functions
Like in SQL, it is also possible to group the dataframe by a column, apply an aggregation function to each group, and sort the result.

In [None]:
# calculate how many cases each head of the household experienced
# step1: in the groupby method, we pass the column we want to group by, we can also select what columns
# we want to carry out the operation
# step2: use the count method to count the number of cases
# step3: sort the value in descending order (set the ascending parameter to False)
some_attri.groupby('recptno')['start_date']\
          .count()\
          .sort_values(ascending=False)

# Note: since groupby and aggregation functions usually involve several steps, it is good practice to use one line
# for each step, followed by a backslash in the unfinished lines

Other useful aggregation functions are:
- `sum()`: sum
- `mean()`: average
- `agg()`: use a Python dictionary to specify an aggregation function for each column
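
As an illustration of the dictionary form of `agg()`, the sketch below computes a different statistic for each column of a made-up dataframe (`toy` and `per_person` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    "recptno": [1, 1, 2, 2, 2],
    "wage": [100.0, 300.0, 50.0, 150.0, 100.0],
    "quarter": [1, 2, 1, 2, 3],
})

# dictionary keys are column names, values are the aggregation to apply to that column
per_person = toy.groupby("recptno").agg({"wage": "mean", "quarter": "count"})
per_person = per_person.sort_values("wage", ascending=False)

print(per_person)
```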

We can also look at an individual group if we know the value of the key on which we grouped the data.

In [None]:
# 151724812 is a recipient number
# NOTE: depending on what data was returned you may have to change this value
some_attri.groupby('recptno').get_group(151724812)

### Data Cleaning
- Back to [Table of Contents](#Table-of-Contents)

More often than not, the data we receive are not perfect. Duplicated rows and N/A values, if present, will affect our analysis if we don't handle them. Here are some simple examples of basic data cleaning. The syntax for dropping duplicates is:

`dataframe.drop_duplicates()`

where `dataframe` is the target dataframe. By default this method compares all the columns of each row and keeps only the first copy if there are multiple copies of the same row. We can also set the `subset` parameter to compare only certain columns, or `keep='last'` to keep the last copy instead of the first.

In [None]:
# drop the duplicates in the member_spell dataframe
# step1: see how many rows are there before dropping the duplicates
before_drop = member_spell.shape[0]
# step2: drop duplicates, and save the result as a new dataframe
member_spell_clean = member_spell.drop_duplicates()
# step3: see how many rows are there after dropping the duplicates
after_drop = member_spell_clean.shape[0]
# step4: compare the number before and after. If this prints True, there were no duplicated rows. Hurray!
print(before_drop == after_drop)

There are also occasions when there are many N/A values in a column. Depending on the situation, we can choose to either drop the N/As or fill them with a specific value such as 0 or the mean.
The syntax for dropping N/A values is:

`dataframe.dropna()`

where:
dataframe is the target dataframe. By default it drops every row that contains an N/A value. It can also drop columns instead if you explicitly set the `axis` parameter to 1.

The syntax for filling the N/A values with a specific value is:

`dataframe.fillna(value)`

where:
dataframe is the target dataframe;
value is the value to substitute for N/A.
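
The following sketch shows both methods on a small made-up dataframe with a few missing values (`toy` and the result names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "recptno": [1, 2, 3],
    "wage": [100.0, np.nan, 300.0],
    "sex": [1, 2, np.nan],
})

rows_dropped = toy.dropna()            # keeps only rows without any N/A
cols_dropped = toy.dropna(axis=1)      # keeps only columns without any N/A
filled_zero = toy.fillna(0)            # every N/A becomes 0
filled_mean = toy["wage"].fillna(toy["wage"].mean())   # N/A becomes the column mean

print(rows_dropped)
print(filled_mean.tolist())
```

Neither method changes `toy` itself; each returns a new object, so the result has to be assigned to a variable to be kept.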

## Saving Data
- Back to [Table of Contents](#Table-of-Contents)

Besides saving visualizations to the local machine (covered in the Visualization section), we can save the new data (dataframes) we created during the exploration process to the relational database for future reference.

The method to save a pandas dataframe to a table in SQL is `to_sql()`. It has a syntax very similar to `read_sql()`. For example, it also requires a connection to the SQL database. We can use the one we created at the beginning of this notebook: `sql_connection`.

The difference between `to_sql()` and `read_sql()` is that instead of passing a query to the database, `to_sql()` requires a name for the new table and the schema in which to put it. Let's see how it is done with the some_attri dataframe.


In [None]:
# to save the some_attri dataframe to a new table in the group
# only one person in your team needs to run this cell (you only need to create the table once)
# change the schema to your group name before running this cell

some_attri.to_sql(name='test_welf',
                 con=sql_connection,
                 schema='m6') #update this to your group name
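
To try the write-then-read round trip without touching the course database, the sketch below assumes a throwaway in-memory SQLite database instead of appliedda. The table name `toy_welf` is made up, and SQLite has no schemas, so the `schema` argument is left out.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")     # stand-in for the course database
toy = pd.DataFrame({"recptno": [1, 2], "sex": [1, 2]})

# index=False keeps the row index out of the table;
# if_exists="replace" overwrites the table if it is already there
toy.to_sql(name="toy_welf", con=conn, index=False, if_exists="replace")

round_trip = pd.read_sql("SELECT * FROM toy_welf", con=conn)
conn.close()

print(round_trip)
```

By default `to_sql()` raises an error when the table already exists, which is why only one person per team needs to run the cell above.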

## Loading CSV Files
- Back to [Table of Contents](#Table-of-Contents)

For some of the datasets, you may need to download copies of the data from ADRF Explorer; many of these files are available as .csv (comma separated values) files. After saving the file, you can load these files as dataframes using Pandas.

In [None]:
# Load a sample of 1000 rows from New York dataset and display first 5 rows
ny_df = pd.read_csv("data/adrf-000064-df_hra_caselist_1_linked_sample.csv")
ny_df.head()

In [None]:
ny_df.shape

In [None]:
# Load New York wage data and display first 5 rows
ny_wage_df = pd.read_csv("data/adrf-000064-df_dol_wages_linked.csv")
ny_wage_df.head()

In [None]:
ny_wage_df.shape

## Merging Dataframes
- Back to [Table of Contents](#Table-of-Contents)

Pandas provides an ability to merge (join) two datasets together, like you can do in SQL. You can store the results in a new dataframe.

In [None]:
ny_merge_df = pd.merge(ny_df, ny_wage_df, on="linked_unique_id", how="inner")
ny_merge_df.head()

In [None]:
ny_merge_df.shape
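
The `how` argument decides which rows survive the merge. The sketch below compares an inner and a left merge on two made-up tables; the column `benefit` and all values are invented, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

cases = pd.DataFrame({"linked_unique_id": [1, 2, 3], "benefit": ["a", "b", "c"]})
wages = pd.DataFrame({"linked_unique_id": [2, 3, 3, 4], "wage": [10, 20, 30, 40]})

# inner: only ids present in both tables; id 3 appears twice because it has two wage rows
inner = pd.merge(cases, wages, on="linked_unique_id", how="inner")

# left: every row of cases is kept; ids without a wage record get N/A in the wage column
left = pd.merge(cases, wages, on="linked_unique_id", how="left", indicator=True)

print(inner)
print(left)
```

Comparing the row counts before and after a merge, as done with `shape` above, is a quick way to notice duplicated or dropped records.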

### Saving a CSV

In addition to saving tables back into the ADRF database, you can save a copy of your dataframe as a .csv file.

In [None]:
ny_merge_df.to_csv("data/merged_ny_data.csv")
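
By default `to_csv()` also writes the row index as an unnamed first column, which then shows up as an extra column when the file is read back. A minimal sketch of avoiding that with `index=False`, using an in-memory text buffer instead of a file on disk (the data are made up):

```python
import io
import pandas as pd

toy = pd.DataFrame({"linked_unique_id": [1, 2], "wage": [10.5, 20.0]})

buffer = io.StringIO()                 # behaves like a file, but lives in memory
toy.to_csv(buffer, index=False)        # index=False leaves the row index out
csv_text = buffer.getvalue()

reloaded = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```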

## Visualization
- Back to [Table of Contents](#Table-of-Contents)

A picture is worth a thousand words. Visualization, if created accurately, can greatly help us understand the meaning of the otherwise abstract numbers. Therefore, visualization is widely used in the data analysis field today. Some of the most commonly used visualizations in data analysis are:

- histogram
- bar chart
- line chart
- pie chart
- area chart

In this session, we will focus on how to create the first three types of visualizations to help us better understand the welfare data.

### Plotting
- Back to [Table of Contents](#Table-of-Contents)

#### Histogram
A histogram represents the distribution of a variable. According to wikipedia, it is 'an estimate of the probability distribution of a continuous variable'. Let's visualize the Illinois wage data in a histogram and see the distribution.

A common practice in pandas visualization is to create a figure and an ax on which the plots can be drawn. This will come in handy if we want to draw several plots on the same ax, or save the plot for presentation.
`fig, ax = plt.subplots(figsize=(8, 6))`

where:
the argument `figsize` takes a 2-item tuple: the first item specifies the width of the figure and the second its height, both in inches

Then we can move on to create the plot. Depending on the `kind` of graph, such as `hist`, `bar` or `line`, we can pass different arguments to the plot.

After creating the graph, we can choose to save the plot to the local repository for future reference.

In [None]:
# create figure and ax
fig, ax = plt.subplots(figsize=(8, 6))
# plot the data on the ax (consider data that are not too extreme)
wage['wage'][wage['wage']<=100000].plot(kind='hist', ax=ax)
# add an x-axis label and a title
ax.set_xlabel('wage in $')
ax.set_title('wage_distribution_2edquarter_2015', fontsize=14)
# save the figure. dpi stands for dots per inch. It is a measure of resolution.
fig.savefig('wage_distribution_2edquarter_2015.jpg', dpi=600)

#### Bar chart
The bar chart is used to represent the frequency of each _category_ of a _categorical variable_. In the welfare data, the year or quarter in which a case opened or closed, race, gender, and type of benefit are all categorical variables, because each can take only a finite set of values (for example, white, black, etc. for race). Let's see a few examples below.

First, let's visualize how many new cases were opened in each quarter.

Note: remember our dataframe member_spell contains only cases that *both* started *and* ended in the first quarter of 2015 (January 1 through March 31). Therefore, cases that started in that period and continued further, or cases that started earlier and ended in that period, are not accounted for.

In [None]:
# create a new column: quarter
some_attri['start_quarter'] = some_attri['start_date'].dt.to_period('Q')
# create figure and ax
fig, ax = plt.subplots(figsize=(8, 6))
# group data by quarter, count the numbers and visualize (the argument rot will rotate the ticks on the x label)
some_attri.groupby('start_quarter')['recptno']\
          .count()\
          .plot(kind='bar', ax=ax, rot=45)
# add title
ax.set_title('welfare: new case count by quarter in 2015', fontsize=14)

Next we can explore the gender ratio of all the new cases in 2015.

In [None]:
# create fig and ax
fig, ax = plt.subplots(figsize=(8,6))
# count the cases by gender, and plot the bar chart
some_attri.groupby('sex')['start_date']\
          .count()\
          .sort_values(ascending=True).plot(kind='bar', ax=ax)
# set title
ax.set_title('welfare: new case count by gender (2015)', fontsize=14)

# save the graph
fig.savefig('welfare_count_by_gender.jpg',dpi=600)
# 1 stands for male, 2 stands for female

We can move one step further to combine the previous 2 pieces of information, and visualize the new case count for each quarter broken down by gender.

In [None]:
# create figure and ax
fig, ax = plt.subplots(figsize=(8, 6))
# group by both start_quarter and sex, count the case number, expand the groupby object to a full dataframe and plot
some_attri.groupby(['start_quarter', 'sex'])['start_quarter']\
          .count()\
          .unstack()\
          .plot(kind='bar', ax=ax,
                rot=45, legend=True)
# add title
ax.set_title('welfare: quarterly new case count broken down by gender (2015)',
             fontsize=14)    

We can visualize the same data with stacked bar charts if we need to compare the total number of cases by quarter.

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
some_attri.groupby(['start_quarter', 'sex'])['start_quarter']\
          .count()\
          .unstack()\
          .plot(kind='bar', ax=ax, stacked=True,
                rot=45, legend=True)
ax.set_title('welfare: quarterly new case count broken down by gender (stacked, 2015)',
             fontsize=14) 
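
The `unstack()` step in the two cells above is what turns the grouped counts into a table that can be plotted as grouped or stacked bars. A sketch of the reshaping alone, on made-up data and using `size()` to count the rows in each group:

```python
import pandas as pd

toy = pd.DataFrame({
    "start_quarter": ["2015Q1", "2015Q1", "2015Q1", "2015Q2", "2015Q2"],
    "sex": [1, 2, 2, 1, 1],
})

# long format: one count per (quarter, sex) pair that occurs in the data
long_counts = toy.groupby(["start_quarter", "sex"]).size()

# wide format: quarters as rows, sexes as columns; missing pairs become 0
wide_counts = long_counts.unstack(fill_value=0)

print(long_counts)
print(wide_counts)
```

Each column of the wide table becomes one bar series in the chart, which is why the legend lists the values of `sex`.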

#### Line chart
A line chart depicts how one variable changes as another variable changes. Assume we want to find out how many new welfare cases were created each month in 2015, and how many cases ended in each month. Let's see how to create the line chart.

In [None]:
### what does this cell do?

# first we need to group the data by month and count the number of cases for each month
# grouping for date and time is easier if we use the date as index
# so we create a new dataframe using date as index
# first let's select data from January 2015
#jan2015 = some_attri[(some_attri['start_date']>=datetime.date(2015,1,1)) &
#                          (some_attri['start_date']<=datetime.date(2015,1,31))]

In [None]:
# then we create a pandas Series (a single column of data) whose index is of the special type DatetimeIndex
# building a new Series leaves the recptno column of member_spell untouched
new_case = pd.Series(member_spell['recptno'].values,
                     index=pd.DatetimeIndex(member_spell['start_date']),
                     name='new_case')

In [None]:
# we group the new data by month, and count the occurrences for each month
# resample can be considered a special group by operation for datetime indices
start2015 = new_case.resample('M').count()
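
Monthly counts can also be computed with an ordinary `groupby` on the month of each date, which makes the "special group by" idea explicit. This is an alternative sketch on made-up dates, not the cell above; the names `toy_case` and `by_month` are hypothetical.

```python
import pandas as pd

starts = pd.to_datetime(
    ["2015-01-05", "2015-01-20", "2015-02-11", "2015-03-02", "2015-03-30"])
toy_case = pd.Series([101, 102, 103, 104, 105], index=starts, name="new_case")

# to_period("M") maps every date to its calendar month, which then serves as the group key
by_month = toy_case.groupby(toy_case.index.to_period("M")).count()

print(by_month)
```

One difference worth knowing: `resample` also returns months that have no records (with a count of 0), whereas this `groupby` only returns months that occur in the data.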

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
# plot the line chart, also add legend
start2015.plot(kind='line', ax=ax, legend=True)
# set title
ax.set_title('welfare: new case count by month in Q1 2015', fontsize=14);
# save graph
#fig.savefig('welfare_new_case_count_by_month.jpg', dpi=600)

We can repeat the process for cases that ended in 2015.

In [None]:
march2015_end = some_attri[(some_attri['end_date'] >= pd.Timestamp(2015, 1, 1)) &
                           (some_attri['end_date'] <= pd.Timestamp(2015, 12, 31))]

In [None]:
# build a new Series indexed by end date, leaving march2015_end itself untouched
end_case = pd.Series(march2015_end['recptno'].values,
                     index=pd.DatetimeIndex(march2015_end['end_date']),
                     name='end_case')

In [None]:
end2015 = end_case.resample('M').count()

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
end2015.plot(kind='line', ax=ax, legend=True)
ax.set_title('welfare: ended case count by month in 2015',
             fontsize=14)
# save graph
# fig.savefig('welfare_closed_case_count_by_month.jpg', dpi=600)

We can visualize the 2 line charts on the same ax.

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
start2015.plot(kind='line', ax=ax, legend=True)
end2015.plot(kind='line', ax=ax, legend=True)
ax.set_title('welfare: new case count and ended case count by month in 2015',
             fontsize=14)
# save graph
fig.savefig('welfare_cases_count_by_month.jpg', dpi=600)

In [None]:
# and if we want to see the difference between the number of new cases and cases ended per month:
# fig, ax = plt.subplots(figsize=(8,6))
# (start2015 - end2015).plot(kind='line')
# ax.set_title('welfare: count difference',fontsize=14)

### Geospatial Visualization
- Back to [Table of Contents](#Table-of-Contents)

We can also visualize geospatial information with python, specifically the geopandas package. We will create a choropleth map of new cases created in 2015 by zipcode. For more details about geospatial analysis, please refer to the [exploratory_spatial_data_analysis](../spatial_notebooks/exploratory_spatial_data_analysis.ipynb) notebook.

First we retrieve from the database the zipcodes that overlap Cook County, Illinois (the `&&` operator in the query compares bounding boxes).

In [None]:
# query for list of zipcodes for cook county (FIPS code: 17031)
query = """SELECT DISTINCT z.geoid10 as zipcode 
FROM public.tl_2016_us_zcta510 z JOIN public.tl_2016_us_county c 
ON z.geom && c.geom 
WHERE c.geoid = '17031'; """

# get the zipcodes as a pandas dataframe
zipcodes = pd.read_sql(query, sql_connection)
zipcodes.head()

In [None]:
# now convert the zipcode dataframe into a single string list to select out just those zipcodes
zipcodes = ','.join("'" + zipcode + "'" for zipcode in zipcodes.zipcode)
# view first 20 characters to see if it looks right
zipcodes[:20]
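
The string-building step can be checked in isolation with plain Python. In this sketch the three zipcodes and the table name `some_table` are made up; only the quoting and joining logic mirrors the cell above.

```python
# a made-up sample of zipcodes, standing in for the column returned by the database
zip_list = ["60601", "60602", "60603"]

# wrap each zipcode in single quotes and separate them with commas
query_zips = ",".join("'" + z + "'" for z in zip_list)

# drop the list into the IN (...) clause of a query template
toy_qry = "SELECT * FROM some_table WHERE zipcode IN ({query_zips});".format(
    query_zips=query_zips)

print(query_zips)
print(toy_qry)
```

Building SQL by string formatting is reasonable here because the values come from our own database; for values typed in by users, query parameters are the safer choice.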

In [None]:
# set up the query
# and we'll subset to only those zipcodes we want
qry = '''
SELECT z.gid, z.geoid10 AS zipcode, COUNT(ch_dpa_caseid) AS new_case, z.geom_2163
FROM tl_2016_us_zcta510 AS z
JOIN (SELECT h.ch_dpa_caseid, geom_2163
          FROM idhs.hh_indcase_spells h
          JOIN idhs.case_geocode g
          ON h.ch_dpa_caseid = g.ch_dpa_caseid
          WHERE h.start_date >= '2015-01-01' AND h.end_date <= '2015-03-31'
) AS case_tract
ON ST_Contains(z.geom_2163, case_tract.geom_2163)
WHERE z.geoid10 in ({query_zips})
GROUP BY z.gid, zipcode, z.geom_2163;
'''.format(query_zips=zipcodes)

# print it out to see if it looks right
#print(qry)

> NOTE: this query will take a long time to run. We do not recommend running this during class time.

In [None]:
# Get data - total number of new cases by zipcode for 2015 joined to zipcode polygons
welfare_zip = gpd.read_postgis(qry, sql_connection,
                               geom_col='geom_2163',
                               index_col='gid',
                               crs='+init=epsg:2163')

welfare_zip.info()

In [None]:
# display the map
welfare_zip.plot(column='new_case', scheme='QUANTILES', 
                 cmap='YlGnBu', linewidth=0.1, edgecolor='grey', legend=True);