https://github.com/dunovank/jupyter-themes

# Pandas

If you've never used `pandas` before: it's amazingly useful and, at times, frustrating.

Recommended links: 

 - [Jake VanderPlas's Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
 - [pandas documentation: Gotchas](http://pandas.pydata.org/pandas-docs/stable/gotchas.html)
 - [Python Pitfalls notebook](https://github.com/dwhitena/blog-content/blob/master/python_pitfalls/Python-Pitfalls.ipynb)
 - [pandas documentation: Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html)

Read through this full series of excellent blog posts by [Tom Augspurger](http://tomaugspurger.github.io/modern-1.html).

High-level tip:

 - Try to represent data in the proper format.
   - Floats as floats, ints as ints, etc.
   - Especially if you have dates or timestamps, keep them in a datetime format.
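
As a rough illustration of the tip above, here is a minimal sketch that converts string columns to proper types. The frame `raw` and its column names are made up for this example and are not part of the coal dataset used later.

```python
import pandas as pd

# Hypothetical raw data as it often arrives from a CSV: every value is a string.
raw = pd.DataFrame({
    "date": ["2015-01-01", "2015-02-01", "2015-03-01"],
    "employees": ["10", "12", "15"],
    "hours": ["1600.5", "1820.0", "2400.25"],
})

# Convert each column to the type that matches its meaning.
typed = raw.assign(
    date=pd.to_datetime(raw["date"]),
    employees=raw["employees"].astype("int64"),
    hours=raw["hours"].astype("float64"),
)

print(typed.dtypes)

# With a real datetime column, the .dt accessor works directly.
print(typed["date"].dt.month.tolist())
```

Once the columns carry the right types, sums, comparisons, and date arithmetic behave as expected instead of operating on strings.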
      
The paper [Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham is an excellent read; much of it applies to data analysis in any language.
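
One idea from that paper is that each variable should be a column and each observation a row. A small sketch of reshaping a "wide" table into that form with `DataFrame.melt` follows; the table and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, so the variable "year"
# is hidden in the column headers.
wide = pd.DataFrame({
    "mine": ["A", "B"],
    "2013": [100, 250],
    "2014": [120, 240],
})

# Melt into tidy form: one row per (mine, year) observation.
tidy = wide.melt(id_vars="mine", var_name="year", value_name="production")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```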

In [None]:
from __future__ import absolute_import, division, print_function

%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns
sns.set_context('poster')
sns.set_style('whitegrid') 
# sns.set_style('darkgrid') 
plt.rcParams['figure.figsize'] = 12, 8  # plotsize 

In [None]:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release

import warnings
warnings.filterwarnings('ignore')

### Note

This uses the cleaned data from the [Data Cleaning](Data%20Cleaning.ipynb) notebook; see that notebook for details.

In [None]:
df = pd.read_csv("../data/coal_prod_cleaned.csv")

In [None]:
df.head()

In [None]:
plt.scatter(df['Average_Employees'], 
            df.Labor_Hours)
plt.xlabel("Number of Employees")
plt.ylabel("Total Hours Worked");

In [None]:
colors = sns.color_palette(n_colors=df.Year.nunique())

In [None]:
color_dict = dict(zip(sorted(df.Year.unique()), colors))  # year -> RGB tuple

In [None]:
color_dict

In [None]:
for year in sorted(df.Year.unique()[[0, 2, -1]]):
    plt.scatter(df[df.Year == year].Labor_Hours,
                df[df.Year == year].Production_short_tons, 
                color=color_dict[year],  # one RGB tuple; `c=` would be ambiguous
                s=50,
                label=year,
               )
plt.xlabel("Total Hours Worked")
plt.ylabel("Total Amount Produced")
plt.legend()
plt.savefig("ex1.png")

In [None]:
import matplotlib as mpl

In [None]:
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-colorblind'
mpl.style.use('seaborn-colorblind')

In [None]:
plt.style.available

In [None]:
for year in sorted(df.Year.unique()[[0, 2, -1]]):
    plt.scatter(df[df.Year == year].Labor_Hours,
                df[df.Year == year].Production_short_tons, 
                color=color_dict[year],  # one RGB tuple; `c=` would be ambiguous
                s=50,
                label=year,
               )
plt.xlabel("Total Hours Worked")
plt.ylabel("Total Amount Produced")
plt.legend();
# plt.savefig("ex1.png")

In [None]:
df_dict = load_boston()
features = pd.DataFrame(data=df_dict.data, columns = df_dict.feature_names)
target = pd.DataFrame(data=df_dict.target, columns = ['MEDV'])
df = pd.concat([features, target], axis=1)
df.head()

In [None]:
# Target variable
fig, ax = plt.subplots(figsize=(6, 4))
sns.kdeplot(df.MEDV, ax=ax)  # replaces the deprecated sns.distplot(..., hist=False)
sns.rugplot(df.MEDV, ax=ax)  # rug marks, formerly rug=True

In [None]:
fig, ax = plt.subplots(figsize=(10,7))
sns.kdeplot(x=df.LSTAT,  # seaborn >= 0.11 takes x/y as keywords
            y=df.MEDV,
            ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(df[['MEDV', 'LSTAT', 'CRIM', 'RM', 'NOX', 'DIS']], alpha=0.2, diagonal='hist', ax=ax);

In [None]:
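# pd.cut needs data and bins; 4 equal-width bins is an arbitrary illustrative choice
pd.cut(df.MEDV, bins=4).value_counts()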
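
`pd.cut` bins continuous values into discrete intervals. Here is a minimal sketch on made-up values; the series, bin edges, and labels are illustrative assumptions and are not taken from the data above.

```python
import pandas as pd

# Hypothetical values standing in for a numeric column.
values = pd.Series([5.0, 12.5, 21.0, 33.0, 48.0])

# Explicit bin edges with labels. Intervals are right-closed by default,
# so 12.5 falls in (10, 25].
binned = pd.cut(values, bins=[0, 10, 25, 50], labels=["low", "mid", "high"])

print(binned.tolist())
print(binned.value_counts().sort_index())
```

Passing an integer for `bins` instead of explicit edges asks for that many equal-width bins over the range of the data.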