# Table of Contents
* [Pandas Data Structures](#Pandas-Data-Structures)
	* [Learning Objectives:](#Learning-Objectives:)
	* [Overview](#Overview)
	* [Set-up](#Set-up)
* [Pandas Series](#Pandas-Series)
	* [Creating a Series](#Creating-a-Series)
	* [Pandas Index](#Pandas-Index)
	* [Iterating over a Series](#Iterating-over-a-Series)
	* [Pandas Slicing](#Pandas-Slicing)
	* [Warning: Slicing compared:  python, Numpy, Pandas](#Warning:-Slicing-compared:--python,-Numpy,-Pandas)
	* [Series Arithmetic](#Series-Arithmetic)
	* [Series with NumPy](#Series-with-NumPy)
	* [Series with Matplotlib](#Series-with-Matplotlib)
	* [Series Operations](#Series-Operations)
* [Pandas DataFrame](#Pandas-DataFrame)
	* [Creating a DataFrame](#Creating-a-DataFrame)
	* [DataFrame Indexing](#DataFrame-Indexing)
	* [Iterating over a DataFrame](#Iterating-over-a-DataFrame)
	* [DataFrame Manipulation](#DataFrame-Manipulation)
* [Section Review](#Section-Review)


# Pandas Data Structures

## Learning Objectives:

After completion of this module, learners should be able to:

* describe 3 differences between (`pandas`) data structures (`Index`, `Series`, and `DataFrame`) and (`numpy`) `ndarrays`
* load, create, manipulate, and examine a `Series` container
* use *slicing*, *broadcasting*, *boolean/fancy indexing*, & *data alignment* in the context of `DataFrames`

## Overview

Pandas provides two primary data structures:
* `Series` is built for 1-dimensional series data
* `DataFrame` is built for 2-dimensional collections of tabular data

These data structures differ and are related as follows:
* **Series**: 1-dimensional array of **homogeneous** data
* **DataFrame**: 2-dimensional table of **heterogeneous** data, composed of multiple **Series**

Each of these also contains an integrated **`index`**:
* **Series** objects include a second array called the **index** that can contain homogeneous values of any type like integers, strings or datetime objects. 
* **DataFrame** objects include both a row **index** and a column **index**. The **index** objects allow for very sophisticated selection operations using square brackets (`[]`) along with some specialized selection functions.
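
The distinction between homogeneous and heterogeneous data, and the role of each index, can be seen in a small sketch. The names `temps` and `table` and their values are made up for illustration; they are not part of the lesson data.

```python
import pandas as pd

# A Series holds values of a single dtype plus an index of labels.
temps = pd.Series([20.5, 21.0, 19.8], index=["mon", "tue", "wed"])
print(temps.dtype)          # one dtype for the whole Series
print(list(temps.index))    # the row labels

# A DataFrame holds one Series per column; each column keeps its own dtype.
table = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_c": [4.0, 19.5, 31.2]},
    index=["r1", "r2", "r3"],
)
print(table.dtypes)         # per-column dtypes
print(list(table.columns))  # the column index
print(list(table.index))    # the row index
```

Selecting a single column, for example `table["temp_c"]`, gives back a Series that shares the row index of the DataFrame.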

*side note: older versions of Pandas also provided a 3-dimensional data structure called `Panel`; it was rarely used and was removed in pandas 0.25.*

## Set-up

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd

# Set Jupyter notebook display options
pd.options.display.max_rows = 8
pd.options.display.max_columns = 8

# Check version number
pd.__version__

If you have an older version, use the following conda command to upgrade pandas:

In [None]:
## To upgrade pandas, uncomment and run the following:

# !conda update pandas

# Pandas Series

* Designed for 1-dimensional, homogeneous series data
* Composed of an array of data and a special Index object
* Access to data contents can be done by `position` index (0,1,2,...) or by `label` index ('a','b','c',...)
* Index label type can be integers, strings, or datetime objects. 
* Operations like *retrieving*,  *joining*, and *aligning* data are made easy.

## Creating a Series

In [None]:
# Example: Creating a Series from a data column from file

file   = "data/exoplanets.csv"
data   = pd.read_csv(file)
series = data['NAME']

print( "\n" )
print( series )

print( "\n" )
print( type( series ) )

In [None]:
# Example: Creating a Series from a list

new_list = [5,10,15,20,25]
pd.Series(new_list)

In [None]:
# Example: Creating a Series from a dictionary...
#          older pandas versions sort the index by key;
#          pandas >= 0.23 on Python >= 3.6 keeps the insertion order

new_dictionary = {'b': 100, 'a': 200.0, 'd':450, 'c':700}
pd.Series(new_dictionary)

In [None]:
# Example: Creating a Series from an ORDERED dictionary
#          ... to preserve the order of the indices.

from collections import OrderedDict
od = OrderedDict([('b', 100), ('a',200), ('d',450), ('c',700)])
pd.Series(od)

In [None]:
# Example: Creating a Series from a NumPy array

array1  = np.arange(1,6)*10.0
series1 = pd.Series(array1)

print( "\n" )
print( array1 )
print( type(array1) )

print( "\n" )
print( series1 )
print( type(series1) )

## Pandas Index

The pandas Series data structure always has an `Index`. When no index is given, pandas creates a default ``RangeIndex``.

The ``RangeIndex`` is analogous to the ``range`` (or ``xrange`` in py2) object. It is an efficient and lazy way to represent a sequence of integers.
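
As a minimal sketch of that analogy, the default index can be inspected much like a `range` object. The variable name `default_index` is chosen here for illustration only.

```python
import pandas as pd

# A Series built without an explicit index gets a RangeIndex.
default_index = pd.Series([10.0, 20.0, 30.0]).index

print(type(default_index).__name__)

# Like range, it is described by start, stop, and step
# rather than by a stored array of every label.
print(default_index.start, default_index.stop, default_index.step)

# The labels are only materialized when asked for.
print(list(default_index))
```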

In [None]:
# The default Index is just a sequence of integers:
array1  = np.arange(1,6)*10.0
series1 = pd.Series(array1)

series1.index

In [None]:
# The Index is used for indexing

print( series1.iloc[0] )  # "Positional" indexing

In [None]:
# The Index can be set to a sequence of labels:

index2 = ['a','b','c','d','a']
series2 = pd.Series(np.arange(1,6)*10.0, index=index2)

In [None]:
# The Index labels are used for indexing

series2.loc['b']  # "label" indexing

In [None]:
# Indexing when the label is not unique returns every matching entry

series2.loc['a']
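
The difference between a unique and a repeated label can be checked explicitly. This is a small self-contained sketch that rebuilds the same labelled data under the illustrative name `s`; `Index.is_unique` reports whether any label repeats.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c", "d", "a"]   # 'a' appears twice
s = pd.Series(np.arange(1, 6) * 10.0, index=labels)

unique_hit = s.loc["b"]       # unique label -> a single scalar
repeated_hit = s.loc["a"]     # repeated label -> a Series with every match

print(s.index.is_unique)
print(unique_hit)
print(repeated_hit)
```

Code that expects a scalar from `.loc` may therefore want to check `is_unique` first.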

## Iterating over a Series

In [None]:
for key, value in series2.items():  # iteritems() was removed in pandas 2.0
    print(key, value)

## Pandas Slicing

In [None]:
# Positional indexing and slicing

series1.iloc[1:3]

In [None]:
# Be careful when slicing with integers versus labels

print( series2.iloc[1:2]     )  # Positional indexing is permitted with slicing
print("\n")
print( series2.loc['b':'c'] )  # Ranges of labels can be sliced (slicing here is *inclusive*)

In [None]:
# Example: slice from the end
series2.iloc[-1:]

In [None]:
# Example: slice from the beginning
series2.iloc[:-3]

In [None]:
# Example: slice with labels, but with integer step size
series2.loc['d':'b':-2]

## Warning: Slicing compared:  python, Numpy, Pandas

In [None]:
# "Native" Python
py_list = [1, 2, 3, 4]
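py_slice = py_list[1:3]   # slicing a list makes a copy
py_slice[1] = 99          # so py_list is left unchanged

# NumPy arrays
np_array = np.array([1,2,3,4])
np_slice = np_array[1:3]  # slicing an ndarray returns a view
np_slice[1] = 99          # so np_array is modified as well

# Pandas Series
pd_series = pd.Series([1,2,3,4])
pd_slice = pd_series.iloc[1:3]  # a view in older pandas versions
pd_slice.iloc[1] = 99           # modifies pd_series too, unless Copy-on-Write is enabled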

# Afterwards
print("py_list:\n", py_list, sep='')
print("\nnp_array:\n", np_array, sep='')
print("\npd_series:\n", pd_series, sep='')
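
When the original data must stay untouched, one option is to take an explicit copy of the slice. The sketch below assumes that intent; the names `np_copy` and `pd_copy` are illustrative.

```python
import numpy as np
import pandas as pd

np_array = np.array([1, 2, 3, 4])
np_copy = np_array[1:3].copy()        # explicit copy, independent of np_array
np_copy[1] = 99

pd_series = pd.Series([1, 2, 3, 4])
pd_copy = pd_series.iloc[1:3].copy()  # explicit copy, independent of pd_series
pd_copy.iloc[1] = 99

print(np_array)            # original array is unchanged
print(pd_series.tolist())  # original Series is unchanged

# A plain NumPy slice, by contrast, shares memory with its parent.
print(np.shares_memory(np_array, np_array[1:3]))
```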

## Series Arithmetic

Series support the numpy array protocol, meaning that you can pass a Series wherever you would pass a numpy array and it will 'just work'.

In [None]:
# Arithmetic with Series is as with numpy arrays
print(series1)

series2 = series1 * 2 
print(series2)

series3 = series2 * 3
print(series3)

## Series with NumPy

In [None]:
# numpy ufuncs can be applied elementwise to Series

series_2pi = pd.Series( np.linspace(-1*np.pi, +1*np.pi, 100) )
series_2pi

Even though pandas is built on numpy, numpy functions applied to a Series return a Series. This makes sense because a Series is *more* than just a numpy array: it *contains* a numpy array AND an index.



In [None]:
series_sin = np.sin( series_2pi )
print( series_sin )
print( type( series_sin ) )

## Series with Matplotlib

This is a further demonstration of the array protocol. The Series is automatically converted to its ``.values`` when passed where a numpy array is expected.

In [None]:
plt.plot(series_2pi, series_sin)
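
The conversion that array-consuming code relies on can also be done by hand. This is a minimal sketch using `np.asarray` and `Series.to_numpy()`; the Series `s` is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.5, 1.0], index=["x", "y", "z"])

as_array = np.asarray(s)   # what a function expecting an ndarray receives
via_method = s.to_numpy()  # the explicit pandas spelling

print(type(as_array))
print(as_array)            # the values only; the index labels are dropped
```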

## Series Operations

In [None]:
# You can add two series

series1 = pd.Series([1,2,3,4,5])
print(series1)

series2 = pd.Series([10,20,30,40,50])
print(series2)

series3 = series1 + series2
print(series3)

This is ``alignment``. It is a key element of pandas' data friendliness. Operations involving pandas objects ``align`` so that the ``indexes`` match up. Missing elements are filled with ``NaN``.

In [None]:
# Even better when you assign labels to your data!

series1 = pd.Series([1,2,3,4,5],      index=['a','b','c','d','e'])
series2 = pd.Series([10,20,30,40,50], index=['c','d','e','f','g'])

print( series1 + series2 )

Missing value handling is another key element of pandas. NaNs *propagate* across operations. Reduction operations will *skipna*, meaning that they exclude ``NaN``s by default. This is usually the most natural and 'obvious' behavior.

In [None]:
# So when you apply a method like mean(), the NaNs are handled intelligently...

(series1 + series2).mean()

By default, ``numpy`` does NOT ignore ``NaN``:

In [None]:
# Contrast how NumPy and Pandas handle NaN:

print( "Pandas mean (default is nan skipping): ", pd.Series( [1.0, 2.0, np.nan] ).mean() )
print( "Pandas mean (no nan skipping): ", pd.Series( [1.0, 2.0, np.nan] ).mean(skipna=False) )
print( "Numpy mean:  ", np.array(  [1.0, 2.0, np.nan] ).mean() )

Numpy provides a second set of nan-aware functions (note that these are functions, and NOT methods on ``ndarray``):

In [None]:
print( "Numpy mean:  ", np.nanmean(np.array(  [1.0, 2.0, np.nan] )) )

In [None]:
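# Exercise: Write code using Numpy to get the same mean() behavior:
# Note: np.isfinite() masks out inf as well as NaN;
#       use ~np.isnan() to skip only NaN, as pandas does.

my_array = np.array([1.0, 2.0, np.nan, np.inf])
print(np.isfinite(my_array))
my_array[np.isfinite(my_array)].mean()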

# Pandas DataFrame


What is a DataFrame?

* 2-dimensional labeled tabular data (think spreadsheet or SQL table)
* **heterogeneous** data, composed of multiple **Series**
* an **Index** or "row label" for selection operations

DataFrame inputs:

* python dict of 1D sequences (e.g. ndarrays, lists, dicts, or Series)
* 2-D numpy.ndarray
* pandas Series
* another DataFrame
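
The inputs above that are not demonstrated elsewhere in this lesson can be sketched briefly. All names and values here are illustrative.

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray: one row per array row, columns named explicitly.
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From a Series: it becomes a single column named after the Series.
df_series = pd.DataFrame(pd.Series([1.5, 2.5, 3.5], name="value"))

# From another DataFrame: copy=True requests an independent copy of the data.
df_copy = pd.DataFrame(df_array, copy=True)

print(df_array)
print(df_series)
print(df_copy.equals(df_array))
```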

DataFrame Operations:
- Columns of a DataFrame can be accessed by name:
    - dictionary style: df['DATE']
    - attribute style: df.DATE
- DataFrame binary operations (+ - / *) default to an outer join on both the columns and the index
- NaN values produced by the join can be handled afterwards
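
A small sketch of the outer-join behavior, using two made-up frames `left` and `right` that overlap in only one row label and one column label. The `fill_value` argument of `DataFrame.add` is one way to handle the resulting gaps.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
right = pd.DataFrame({"b": [10, 20], "c": [30, 40]}, index=["r2", "r3"])

# The result covers the union of row labels and the union of column labels.
# Only cells present in BOTH frames get a number; the rest are NaN.
total = left + right
print(total)

# With fill_value=0, a cell missing from just one frame is treated as 0.
# Cells missing from both frames remain NaN.
filled = left.add(right, fill_value=0)
print(filled)
```

In `total`, only the cell at row `r2`, column `b` holds a number, because it is the only cell present in both inputs.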

## Creating a DataFrame

In [None]:
# Creating a DataFrame from a file read:

file    = "data/exoplanets.csv"
df_file = pd.read_csv(file)
df_file

In [None]:
# Creating a DataFrame from a dictionary

dictionary = {'a':[1,2,3,4,5,6,7,8,9,10],
              'b':np.linspace(0,np.pi,10),
              'c':0.0,
              'd':["a","b","c","a","b","c","a","b","c","a"]}

df_dict = pd.DataFrame(dictionary)
df_dict

In [None]:
# Inspect the DataFrame columns and index
df_dict.columns

In [None]:
df_dict.index

Now use the Pandas `DataFrame.info()` method to see a bit more detail, including:
* the name of each column
* the Numpy dtype of each column
* how many non-null values are present in each column
* amount of memory used

*side note: Pandas has excellent support for not-a-number (NaN) entries in DataFrames and Series.*

In [None]:
df_dict.info()

You can go to and from a numpy recarray easily. Note that the index is just treated as a regular column, so we are excluding it here.

In [None]:
# to/from a np.recarray
ra = df_dict.to_records(index=False)
ra

In [None]:
pd.DataFrame.from_records(ra)

## DataFrame Indexing

* Pandas DataFrames have a special Index that goes beyond integer indexing
* With square brackets, Pandas DataFrames are indexed by column label first, then by (row) Index:

In [None]:
df_grades = pd.DataFrame({'Alice' : [1,2,3,4,5], 
                         'Bob':np.random.random(5), 
                         'Chuck' : np.random.random(5)},
                         index=['Jan', 'Feb', 'Mar', 'Apr', 'May'])
df_grades

In [None]:
# Indexing is by column first (gives a Series)
series_a = df_grades['Alice']

print( series_a )
print( type(series_a) )

In [None]:
# Indexing the column name, gives a Series, then 'positional' indexing
df_grades['Alice'].iloc[1]

In [None]:
# 'Label' indexing with .loc takes the row label first, then the column label
df_grades.loc['Jan','Alice']

## Iterating over a DataFrame

Generally, you *don't* want to iterate explicitly over pandas objects. Pandas will iterate for you when you do vectorized operations!
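
To make the contrast concrete, here is a minimal sketch that computes the same row-wise sum twice, once with an explicit loop and once as a vectorized expression. The frame `scores` and its values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    {"Alice": [1, 2, 3, 4, 5], "Bob": [5, 4, 3, 2, 1]},
    index=["Jan", "Feb", "Mar", "Apr", "May"],
)

# Explicit iteration: one Python-level step per row.
loop_total = []
for _, row in scores.iterrows():
    loop_total.append(row["Alice"] + row["Bob"])

# Vectorized: a single expression over whole columns; pandas does the looping.
vector_total = scores["Alice"] + scores["Bob"]

print(loop_total)
print(vector_total)
```

The vectorized form also keeps the row labels, since the result is a Series aligned on the same index.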

As seen with Series, we can iterate over a DataFrame column:

In [None]:
for key, value in df_grades['Alice'].items():  # iteritems() was removed in pandas 2.0
    print('Key:', '%s' % key, ', Value:', '%s' % value)

DataFrames support iteration over rows with `iterrows()`, which yields each `Index` label together with the row as a Series:

In [None]:
for index, row in df_grades.iterrows():
    print('Index:', '%s' % index, ', Row:', '%s' % row['Alice'])

## DataFrame Manipulation

This is just a brief introduction to working with DataFrames. More detail is to come in the lessons that follow on using Pandas for examining and cleaning data.

In [None]:
# Extract a column as a new DataFrame
df_col = df_grades[['Alice']]
print( df_col, "\n", type(df_col) )

In [None]:
# Extract a row as a new DataFrame
df_row = df_grades.loc['Jan':'Jan']
print( df_row, "\n", type(df_row) )

In [None]:
# Slicing along rows works just as with Series
df_grades.loc['Jan':'Mar','Alice']

This is an *inclusive* action; in other words, *both* the start and the stop endpoints will be included!
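
The contrast with positional slicing is easy to miss, so here is a small sketch on a made-up Series named `months`: label slices with `.loc` include the stop label, while position slices with `.iloc` exclude the stop position, as in plain Python.

```python
import pandas as pd

months = pd.Series([1, 2, 3, 4, 5], index=["Jan", "Feb", "Mar", "Apr", "May"])

by_label = months.loc["Jan":"Mar"]  # label slice: 'Mar' IS included
by_position = months.iloc[0:2]      # position slice: position 2 is NOT included

print(by_label)
print(by_position)
```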

In [None]:
# For a column range, use the .loc attribute of the DataFrame
df_grades.loc[:,'Alice':'Bob']

In [None]:
# For extracting row and column ranges
df_grades.loc['Mar':'Apr', 'Bob':'Chuck']

In [None]:
# Boolean masks work too
df_grades.loc[df_grades['Bob'] < .5]

In [None]:
# A DataFrame column (a Series) can be converted to other types

my_list  = list(df_grades['Alice'])
print( my_list, type(my_list) )

my_array = np.array(df_grades['Alice'])
print( my_array, type(my_array) )

# Section Review

Summary of what we've just seen:

Pandas Data Structures
* Series
* Index
* DataFrame

Pandas Series
* Creating a Series
* Pandas Index
* Pandas Slicing
* Series Arithmetic
* Series with NumPy
* Series with Matplotlib
* Series Operations

Pandas DataFrame
* Creating a DataFrame
* DataFrame Indexing
* DataFrame Manipulation


***