# Introduction to Data Manipulation and Plotting in Python

ECON 3127/4414/8014 Computational methods in economics  
Week 3  
Fedor Iskhakov  
<img src="../img/lecture.png" width="64px"/>

📖 Kevin Sheppard "Introduction to Python for Econometrics, Statistics and Data Analysis."
*Chapters: 9, 15*

## What is pandas?

* Pandas provides structures for working with data (`Series`, `DataFrame`)

* Data structures have **methods** for manipulating data, e.g. indexing, sorting, grouping, and filling in missing data

* Pandas does not provide modeling tools, e.g. regression or prediction
    * These tools are found in packages such as `scikit-learn` and `statsmodels`, which work with pandas data structures
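
As a minimal sketch of the two structures and a few of the methods listed above, the example below builds a small `DataFrame` by hand. The data and the name `toy` are hypothetical and are not part of the course data set.

```python
import pandas as pd

# Hypothetical toy data (not from the lecture's CSV file)
toy = pd.DataFrame(
    {'Median': [50000.0, 62000.0, None], 'Total': [1200, 800, 450]},
    index=['Econ', 'Stats', 'Art'],
)

# Selecting a single column returns a Series
median = toy['Median']

# Fill the missing value with the column mean (the mean skips missing values)
filled = median.fillna(median.mean())

# Sort rows by a column, largest first
ranked = toy.sort_values(by='Total', ascending=False)
```

Each method returns a new object, so `toy` itself is left unchanged.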

## DataFrames

A `DataFrame` combines multiple 'columns' of data into a two-dimensional object, similar to a spreadsheet

In [None]:
from IPython.display import Image
Image('img/dataframe.jpg')

We will create a `DataFrame` by reading in a CSV file and assigning it to the variable name `majors`

### Info on the data set

* The data come from ['The Economic Guide to Picking a College Major'](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/), published by FiveThirtyEight
* Other interesting datasets can be found on the FiveThirtyEight GitHub: https://github.com/fivethirtyeight/data/

In [None]:
import pandas as pd

majors = pd.read_csv('recent-grads.csv')
majors.head()

In [None]:
majors.info()

We can access individual columns of data, returning a `Series`

In [None]:
majors['Major'].head()

We can also select multiple columns, returning a new dataframe

In [None]:
majors[['Major', 'ShareWomen']].head()

We can add a new column to our dataframe like so

In [None]:
majors['Employment rate'] = majors['Employed'] / majors['Total']
majors.head()

To find the average unemployment rate in percent, we take the mean of the column and multiply by 100

In [None]:
majors['Unemployment_rate'].mean() * 100

`.describe()` returns useful summary statistics 

In [None]:
majors['Unemployment_rate'].describe()

Pandas also provides a simple way to generate matplotlib plots

In [None]:
import matplotlib.pyplot as plt

majors.plot(x='ShareWomen', y='Median', kind='scatter', figsize=(10, 8), color='red')
plt.xlabel('Share of women')
plt.ylabel('Median salary')
plt.show()

## Selecting and filtering

We can use integer slicing to select rows as follows

In [None]:
majors[:3]

We might want to find the majors with the highest share of women

First we will sort our values by a column in the dataframe

In [None]:
majors.sort_values(by='ShareWomen', ascending=False)[:3]
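
Sorting and then slicing can also be written with `nlargest`, which returns the top rows directly. The sketch below uses hypothetical toy data with a column named like the one in the lecture's data set.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lecture's data set
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'ShareWomen': [0.35, 0.91, 0.12, 0.77],
})

# Sort in descending order, then keep the first two rows
top_sorted = toy.sort_values(by='ShareWomen', ascending=False)[:2]

# Equivalent shortcut: the two rows with the largest values
top_nlargest = toy.nlargest(2, 'ShareWomen')
```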

Another way to select rows is by row label, i.e. by setting a row index

Just as columns have labels, rows can be given labels (the index)

In [None]:
majors.set_index('Major_code').head()

Note: we haven't actually changed the DataFrame `majors`

In [None]:
majors.head()

We need to overwrite `majors` with the new copy

In [None]:
majors = majors.set_index('Major_code')   # Can also use majors.set_index('Major_code', inplace=True)
majors.head()
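
The same behaviour can be checked on a small hand-made example. This is only a sketch with hypothetical data: `set_index` returns a new `DataFrame`, and `reset_index` moves the index back into a regular column.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lecture's data set
toy = pd.DataFrame({'Major_code': [2405, 6102], 'Major': ['X', 'Y']})

# set_index returns a new DataFrame; toy keeps its original columns
indexed = toy.set_index('Major_code')

# reset_index undoes the operation, turning the index back into a column
restored = indexed.reset_index()
```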

In [None]:
majors.loc[2405]

In [None]:
code_list = [6102, 5001]

majors.loc[code_list]
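
Note that `.loc` selects by index *label*, not by position. The sketch below, on hypothetical toy data, contrasts it with `.iloc`, which selects by integer position.

```python
import pandas as pd

# Hypothetical toy data with major codes as the row index
toy = pd.DataFrame({'Major': ['X', 'Y', 'Z']}, index=[6102, 2405, 5001])
toy.index.name = 'Major_code'

by_label = toy.loc[2405]           # row whose index label is 2405
by_position = toy.iloc[0]          # first row, whatever its label
several = toy.loc[[6102, 5001]]    # a list of labels returns a DataFrame
```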

We can also sort our index (this is recommended for efficient selecting and filtering)

In [None]:
majors.sort_index(inplace=True)
majors.head()

Alternatively, we can filter our dataframe (select rows) using *boolean conditions*

In [None]:
majors['Major_category'] == 'Business'

Selecting rows with a boolean condition returns only the rows of the dataframe where the condition, here `Major_category == 'Business'`, is `True`

In [None]:
majors[majors['Major_category'] == 'Business']

In [None]:
majors[(majors['Major_category'] == 'Business') & (majors['Total'] > 100000)]
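
Conditions can be combined with `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses when written inline. The sketch below uses hypothetical toy data and also shows `isin` for matching against several values.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lecture's data set
toy = pd.DataFrame({
    'Major_category': ['Business', 'Arts', 'Business', 'Education'],
    'Total': [150000, 30000, 20000, 120000],
})

# Each condition is a boolean Series with one entry per row
is_business = toy['Major_category'] == 'Business'
is_large = toy['Total'] > 100000

both = toy[is_business & is_large]      # rows where both conditions hold
either = toy[is_business | is_large]    # rows where at least one holds
not_business = toy[~is_business]        # rows where the condition is False

# isin tests membership in a list of values
selected = toy[toy['Major_category'].isin(['Arts', 'Education'])]
```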

## Grouping and aggregating data

We might want to summarize our data by grouping it by major categories

To do this, we will use the `.groupby()` method

In [None]:
grouped = majors.groupby('Major_category')
grouped

In [None]:
grouped.groups

To return an *aggregated* dataframe, we need to specify the function we would like pandas to use to aggregate our groups

In [None]:
grouped.mean(numeric_only=True)   # Average only the numeric columns; text columns cannot be averaged

In [None]:
grouped['Median'].mean()

In [None]:
grouped['Median'].agg(['mean', 'median', 'std'])
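
Different columns can also be aggregated with different functions in one call. The sketch below uses pandas' named aggregation on hypothetical toy data; the output column names `mean_median` and `total_grads` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lecture's data set
toy = pd.DataFrame({
    'Major_category': ['Arts', 'Arts', 'Business', 'Business'],
    'Median': [30000, 34000, 40000, 50000],
    'Total': [100, 300, 1000, 3000],
})

# Each keyword is an output column: (input column, aggregation function)
summary = toy.groupby('Major_category').agg(
    mean_median=('Median', 'mean'),
    total_grads=('Total', 'sum'),
)
```

The result has one row per group and one column per keyword argument.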

A list of built-in aggregation functions can be found in the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics)

Pandas provides methods to plot from GroupBy objects

In [None]:
import matplotlib.pyplot as plt

grouped['Median'].mean().plot(kind='bar', figsize=(10, 8))
plt.show()

## Another plotting library: `seaborn`

* `seaborn` is a plotting library built on top of `matplotlib`

* It is geared towards producing pretty plots for statistical applications

* You can find an example gallery of `seaborn` plots [here](https://seaborn.pydata.org/examples/index.html)

In [None]:
import seaborn as sns  # Import the package

In [None]:
sns.lmplot(x="Median", y="ShareWomen", hue="Major_category", height=9, data=majors, fit_reg=False)  # `height` replaces the old `size` argument
plt.xlabel('Median salary')
plt.ylabel('Share of women')
plt.show()

In [None]:
plt.figure(figsize=(15, 7))
sns.boxplot(x='Major_category', y='Median', data=majors)
plt.xticks(rotation=90)
plt.xlabel('Major category')
plt.show()

## Bokeh example

Bokeh is a Python library that makes creating interactive plots super easy - an example gallery is [here](http://bokeh.pydata.org/en/latest/docs/gallery.html)

In [None]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool, NumeralTickFormatter
from bokeh.palettes import Category20

output_notebook()

# Add title tooltips
hover = HoverTool(tooltips=[
    ("Title", "@Major"),
    ("Share", "$y"),
    ("Median salary", "$x{$0,}")
])


# Create figure
p = figure(tools=[hover, 'pan', 'wheel_zoom'])

# List of majors
majors_list = majors['Major_category'].unique()

# Plot scatter
for major, color in zip(majors_list, Category20[20]):
    
    # Filter data to the current major category
    source_major = majors[majors['Major_category'] == major]
    
    # Create data source
    source = ColumnDataSource(source_major[['Major', 'Major_category', 'ShareWomen', 'Median']])
    # legend_label replaces the `legend` argument, which newer Bokeh versions no longer accept
    p.scatter(x='Median', y='ShareWomen', source=source, 
              size=10, legend_label=major,
              fill_color=color, line_color='grey')

p.legend.click_policy = 'hide'
p.legend.location = "top_right"
p.legend.label_text_font_size = "8pt"
p.xaxis.axis_label = 'Median salary'
p.yaxis.axis_label = 'Share of Women'
p.xaxis[0].formatter = NumeralTickFormatter(format="$0,")

show(p)

## Further learning resources
* QuantEcon lectures: [Pandas](https://lectures.quantecon.org/py/pandas.html), [Pandas for Panel Data](https://lectures.quantecon.org/py/pandas_panel.html), [Matplotlib](https://lectures.quantecon.org/py/matplotlib.html)
* QuantEcon [Stata-R-Pandas cheatsheet](https://cheatsheets.quantecon.org/stats-cheatsheet.html)
* SciPy 2017: [Anatomy of Matplotlib](https://www.youtube.com/watch?v=rARMKS8jE9g)
* Coursera/University of Michigan: [Introduction to Data Science in Python](https://www.coursera.org/learn/python-data-analysis)