<img style="float: left;" src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png">
<br>
<br>
#   Pandas
### Exploratory Data Analysis (EDA) and Visualization Cheatsheet
<br>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# display plots in the notebook
%matplotlib inline
# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14


In [None]:
# read in the drinks data
# (na_filter=False keeps the continent code 'NA' from being read as a missing value)
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
url = 'https://github.com/JamesByers/Datasets/raw/master/drinks.csv'
drinks = pd.read_csv(url, header=0, names=drink_cols, na_filter=False)

In [None]:
# sort the beer column and mentally split it into 3 groups
drinks.beer.sort_values().values

In [None]:
# compare with histogram
drinks.beer.plot(kind='hist', bins=3)

In [None]:
# try more bins
drinks.beer.plot(kind='hist', bins=20)

In [None]:
# add title and labels
drinks.beer.plot(kind='hist', bins=20, title='Histogram of Beer Servings')
plt.xlabel('Beer Servings')
plt.ylabel('Frequency')

In [None]:
# compare with density plot (smooth version of a histogram)
drinks.beer.plot(kind='density', xlim=(0, 500))

In [None]:
# select the beer and wine columns and sort by beer
drinks[['beer', 'wine']].sort_values('beer').values

In [None]:
# compare with scatter plot
drinks.plot(kind='scatter', x='beer', y='wine')

In [None]:
# add transparency
drinks.plot(kind='scatter', x='beer', y='wine', alpha=0.3)

In [None]:
# vary point color by spirit servings
drinks.plot(kind='scatter', x='beer', y='wine', c='spirit', colormap='Blues')

In [None]:
# scatter matrix of three numerical columns
pd.plotting.scatter_matrix(drinks[['beer', 'spirit', 'wine']])

In [None]:
# increase figure size
pd.plotting.scatter_matrix(drinks[['beer', 'spirit', 'wine']], figsize=(10, 8))

In [None]:
# count the number of countries in each continent
drinks.continent.value_counts()

In [None]:
# compare with bar plot
drinks.continent.value_counts().plot(kind='bar')

In [None]:
# calculate the mean alcohol amounts for each continent
drinks.groupby('continent').mean(numeric_only=True)

In [None]:
# side-by-side bar plots
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')

In [None]:
# drop the liters column
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar')

In [None]:
# stacked bar plots
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar', stacked=True)

### Box Plot: show quartiles (and outliers) for one or more numerical variables

Five-number summary:

* min = minimum value
* 25% = first quartile (Q1) = median of the lower half of the data
* 50% = second quartile (Q2) = median of the data
* 75% = third quartile (Q3) = median of the upper half of the data
* max = maximum value

(More useful than the mean and standard deviation for describing skewed distributions)

Interquartile Range (IQR) = Q3 - Q1

Outliers:

* below Q1 - 1.5 * IQR
* above Q3 + 1.5 * IQR
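The outlier rule above can be checked by hand. The sketch below uses a small made-up Series (the values and the names `values`, `lower_fence`, and `upper_fence` are illustrative and not taken from the drinks data). Note that `quantile` interpolates linearly by default, so Q1 and Q3 can differ slightly from the "median of each half" definition.

```python
import pandas as pd

# hypothetical sample values (not from the drinks data)
values = pd.Series([0, 2, 3, 5, 8, 12, 20, 35, 60, 150, 400])

# first and third quartiles (pandas interpolates linearly by default)
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)

# interquartile range
iqr = q3 - q1

# anything outside these fences is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# keep only the values that fall outside the fences
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr)
print(outliers.tolist())
```

These are the points a box plot would typically draw individually beyond the whiskers.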

In [None]:
# sort the spirit column
drinks.spirit.sort_values().values

In [None]:
# show "five-number summary" for spirit
drinks.spirit.describe()

In [None]:
# compare with box plot
drinks.spirit.plot(kind='box')

In [None]:
# include multiple variables
drinks.drop('liters', axis=1).plot(kind='box')

### Line Plot: show the trend of a numerical variable over time

In [None]:
# read in the ufo data
url = 'https://github.com/JamesByers/Datasets/raw/master/ufo.csv'

ufo = pd.read_csv(url)
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year

In [None]:

# count the number of ufo reports each year (and sort by year)
ufo.Year.value_counts().sort_index()

In [None]:
# compare with line plot
ufo.Year.value_counts().sort_index().plot()
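One thing to watch for with yearly counts: years with no reports are simply absent from `value_counts()`, so a line plot draws straight across the gap. A minimal sketch with made-up timestamps (the `reports` frame and its values are hypothetical, not the ufo data) shows one way to fill the missing years with zeros before plotting.

```python
import pandas as pd

# hypothetical report timestamps, made up for illustration
reports = pd.DataFrame({'Time': ['6/1/1930 22:00', '6/30/1930 20:00',
                                 '2/15/1931 14:00', '6/1/1931 13:00',
                                 '4/18/1933 19:00']})

# parse the strings and extract the year
reports['Time'] = pd.to_datetime(reports['Time'], format='%m/%d/%Y %H:%M')
reports['Year'] = reports['Time'].dt.year

# count reports per year, sorted by year (1932 is missing entirely)
counts = reports['Year'].value_counts().sort_index()

# reindex over every year in the range so missing years show up as 0
full_counts = counts.reindex(range(1930, 1934), fill_value=0)
print(full_counts)
```

Calling `full_counts.plot()` would then dip to zero in 1932 instead of skipping it.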

In [None]:
# don't use a line plot when there is no logical ordering
drinks.continent.value_counts().plot()

### Grouped Box Plots: show one box plot for each group

In [None]:
# reminder: box plot of beer servings
drinks.beer.plot(kind='box')

In [None]:
# box plot of beer servings grouped by continent
drinks.boxplot(column='beer', by='continent')

In [None]:
# box plot of all numeric columns grouped by continent
drinks.boxplot(by='continent')

### Grouped Histograms: show one histogram for each group

In [None]:
# reminder: histogram of beer servings
drinks.beer.plot(kind='hist')

In [None]:
# histogram of beer servings grouped by continent
drinks.hist(column='beer', by='continent')

In [None]:
# share the x axes
drinks.hist(column='beer', by='continent', sharex=True)

In [None]:
# share the x and y axes
drinks.hist(column='beer', by='continent', sharex=True, sharey=True)

In [None]:
# change the layout
drinks.hist(column='beer', by='continent', sharex=True, layout=(2, 3))

### Assorted Functionality

In [None]:
# saving a plot to a file (the 'assets' directory must already exist)
drinks.beer.plot(kind='hist', bins=20, title='Histogram of Beer Servings')
plt.xlabel('Beer Servings')
plt.ylabel('Frequency')
plt.savefig('assets/beer_histogram.png')

In [None]:
# list available plot styles
plt.style.available

In [None]:
# change to a different style (set it before plotting so it takes effect)
plt.style.use('fivethirtyeight')
drinks.beer.plot(kind='hist', bins=20, title='Histogram of Beer Servings')
plt.xlabel('Beer Servings')
plt.ylabel('Frequency')
plt.savefig('assets/beer_histogram.png')