![ADSA Logo](http://i.imgur.com/BV0CdHZ.png?2 "ADSA Logo")

# ADSA Workshop 4 - Introduction to Pandas and Matplotlib
Workshop content adapted from:
* [Data Science from Scratch - First Principles with Python](http://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X)
* [Greg Reda's Intro to pandas data structures](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/)

This workshop will dive into data processing and visualization with NumPy, Pandas, and Matplotlib.

***

# Pandas

As stated on the official [pandas site](http://pandas.pydata.org/), "pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language." Pandas is built on top of NumPy, and provides two key data structures for processing data: Series and DataFrames.

To begin, we first need to import pandas, numpy, and (for future use) matplotlib.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Show up to 50 columns, for easiest visual display later on.
pd.set_option('display.max_columns', 50)

# This line is Jupyter Notebook specific and allows for graphs
# to be displayed in the notebook.
%matplotlib inline

## Series

A Series is a one-dimensional object containing a sequence of items, and is similar to an array or list in Python. A pandas Series assigns a labeled index to every entry and, by default, uses the integers 0 through n - 1, where n is the length of the series.

To make a series, we can pass in a Python list to the pd.Series() function. Note the convenient printing format and indices given when we print the series.

In [None]:
s = pd.Series(['ADSA', 5, True, -3.14 ])
print(s)
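
As a quick check of the default labels, the sketch below rebuilds the same values and inspects the index and dtype. The name `mixed` is ours and not part of the workshop. Because the list mixes strings, numbers, and a boolean, pandas should fall back to the generic `object` dtype.

```python
import pandas as pd

# Hypothetical name `mixed`; same values as the series above
mixed = pd.Series(['ADSA', 5, True, -3.14])

# Default labels are the integers 0 through len(mixed) - 1
print(list(mixed.index))

# Mixed value types are stored with the generic object dtype
print(mixed.dtype)
```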

If you want, you can specify index labels to use instead of the default 0 through n - 1 by passing in an index list. Note that the index list must be the same length as the series.

In [None]:
s = pd.Series(['ADSA', 5, True, -3.14 ], index=['A', 'B', 'C', 'D'])
print(s)
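
To illustrate the length requirement, here is a minimal sketch that passes three labels for four values. The names `values` and `mismatch_raised` are illustrative only. pandas is expected to raise a `ValueError` rather than guess at the missing label.

```python
import pandas as pd

values = ['ADSA', 5, True, -3.14]

try:
    # Three labels for four values: pandas rejects the mismatch
    pd.Series(values, index=['A', 'B', 'C'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```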

We can also take an existing Python dictionary and convert it to a series by passing it into the pd.Series() function.

In [None]:
# Let's assume we have a dictionary of cities and weather data
d = {'Chicago': 75,
     'Boston': 65,
     'New York': 70,
     'San Francisco': 80,
     'Los Angeles': 82,
     'Austin': None
}

weather = pd.Series(d)
print(weather)
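
Note that the `None` entry for Austin does not survive as `None`. The sketch below, using a smaller hypothetical dictionary and the name `temps`, suggests what happens: pandas converts the missing value to NaN, which in turn makes the whole series floating point.

```python
import pandas as pd

# Smaller, hypothetical version of the weather dictionary
temps = pd.Series({'Chicago': 75, 'Austin': None})

# None is stored as NaN, so the integers are upcast to floats
print(temps.dtype)
print(temps.isnull())
```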

We can then access data for specific indices by passing either a single index, or a list of indices in brackets.

In [None]:
print(weather['Chicago'])

print('\n')

print(weather[['Chicago', 'Austin']])
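
Bracket access assumes the label exists. As a hedged aside, the sketch below shows one way to handle a label that may be missing. The series `sample` and the label `'Miami'` are made up for illustration; `Series.get` returns a default value instead of raising.

```python
import pandas as pd

# Hypothetical series; 'Miami' is deliberately absent
sample = pd.Series({'Chicago': 75, 'Boston': 65})

try:
    sample['Miami']
    missing_raised = False
except KeyError:
    # Bracket access on an unknown label raises KeyError
    missing_raised = True

# .get returns the supplied default instead of raising
fallback = sample.get('Miami', -1)
print(missing_raised, fallback)
```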

We can use boolean expressions involving our series to check whether a label is in the series' index, or to generate a series of True and False values marking the entries that satisfy a condition.

In [None]:
print('Chicago' in weather)

print('\n')

weather_less_than_80 = weather < 80
print(weather_less_than_80)

By passing such a boolean series in brackets, we can query the series for only the entries that satisfy the condition.

In [None]:
print(weather[weather < 80])

print('\n')

print(weather[weather <= 70])

print('\n')

print(weather[weather < 65])
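
One detail worth knowing when filtering: comparisons against NaN evaluate to False, so an entry with missing data is excluded from a query and from its opposite. The following sketch uses a small hypothetical series named `temps` to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Austin has no reading
temps = pd.Series({'Chicago': 75, 'Los Angeles': 82, 'Austin': np.nan})

below = temps[temps < 80]
at_or_above = temps[temps >= 80]

# Austin appears in neither result because NaN comparisons are False
print(list(below.index))
print(list(at_or_above.index))
```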

We can also perform scalar multiplication and division, and numpy operations on series.

In [None]:
print(weather / 3)

print('\n')

print(weather * 2)

print('\n')

print(np.square(weather))
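
These operations are applied element by element, and the result keeps the original index labels. A minimal sketch, with hypothetical names `temps`, `doubled`, and `squared`:

```python
import numpy as np
import pandas as pd

temps = pd.Series({'Chicago': 75, 'Boston': 65})

doubled = temps * 2
squared = np.square(temps)

# NumPy functions return a Series that is still labeled by city
print(type(squared).__name__)
print(doubled['Chicago'], squared['Boston'])
```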

We can also add two series together. If the same index label exists in both series, their values are added; otherwise, a Null/NaN (Not a Number) value is assigned to that label in the resulting series.

In [None]:
# Note that the two dictionaries share the key 'New York' but not Chicago or Boston
d1 = {'Chicago': 65, 
      'New York': 55
     }

d2 = {'New York': 10,
      'Boston': 60
     }

s1 = pd.Series(d1)
s2 = pd.Series(d2)

# The value for New York will be added, but the values for Chicago and Boston are indeterminate and marked as NaN
s3 = s1 + s2
print(s3)
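
If NaN is not what you want for labels that appear in only one series, one option is the `Series.add` method with `fill_value`, which treats a missing entry as the given value before adding. This is a sketch of an alternative, not part of the original workshop, and the names `left`, `right`, and `total` are ours.

```python
import pandas as pd

left = pd.Series({'Chicago': 65, 'New York': 55})
right = pd.Series({'New York': 10, 'Boston': 60})

# Labels missing from one side are treated as 0 instead of producing NaN
total = left.add(right, fill_value=0)
print(total)
```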

Finally, to tell whether values in a series are Null (NaN) or not, we can use the functions .isnull() and .notnull() respectively. Note that we can use the same boolean logic as before to either display True and False values for every index, or query for indices which are Null.

In [None]:
print(s3.isnull())

print('\n')

print(s3[s3.isnull()])
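
The complementary query with `.notnull()` works the same way. The sketch below rebuilds the sum under hypothetical names (`left`, `right`, `total`, `present`) and keeps only the labels that have a value.

```python
import pandas as pd

left = pd.Series({'Chicago': 65, 'New York': 55})
right = pd.Series({'New York': 10, 'Boston': 60})
total = left + right

# Keep only the entries that are not NaN
present = total[total.notnull()]
print(present)
```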

# Data Visualization

"A fundamental part of the data scientist’s toolkit is data visualization. Although it is
very easy to create visualizations, it’s much harder to produce good ones.
There are two primary uses for data visualization:
* To explore data
* To communicate data"

-Joel Grus, Data Science from Scratch

There are many tools that we can use to visualize data, but one of the most widely used is the [matplotlib](http://matplotlib.org/) library. While other libraries such as [d3.js](https://d3js.org/) are more commonly used for web visualizations, the matplotlib.pyplot module does an excellent job of quickly producing bar charts, line charts, and scatterplots in Python.

To begin, we will first import the pyplot module from matplotlib.

In [None]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

## Bar Charts

A bar chart can be a very helpful, simple visualization when you need to illustrate quantities of a discrete set of items.