In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Table of Contents
* [Lecture 2A - Introduction to working with data using Python Pandas*](#Lecture-2A---Introduction-to-working-with-data-using-Python-Pandas*)
	* [Content](#Content)
	* [Learning Outcomes](#Learning-Outcomes)
	* [Pandas Data Structures](#Pandas-Data-Structures)
		* [Series - Univariate Data](#Series---Univariate-Data)
		* [Selecting values based on index](#Selecting-values-based-on-index)
		* [Basic plotting](#Basic-plotting)
		* [Selecting data points based on values](#Selecting-data-points-based-on-values)
		* [Indices and alignment](#Indices-and-alignment)
	* [File input](#File-input)


# Lecture 2A - Introduction to working with data using Python Pandas*


---

### Content

1. Why pandas...
2. pandas' Series data structure
3. Selecting and filtering data from Series
4. Basic plotting with Series
5. Series indices and alignment
6. Open files and load file data into Series

\* This notebook material is adapted from Assoc. Prof. Fonnesbeck's tutorial on statistical data analysis in Python and closely follows "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython" By Wes McKinney.

### Learning Outcomes

At the end of this lecture, you should be able to:

* describe the reasons for using the Series data type  
* select data from Series using the index 
* filter data points from Series using their values 
* perform plotting of Series data at an introductory level
* explain the role of indices in Series and how operations on multiple Series objects use indices
* load univariate data into a Series object from a flat file


**pandas** is a fundamental package for working with structured data in Python in a fast, easy and expressive manner. It provides powerful data structures together with the key manipulation operations: slicing and dicing, aggregating, integrating and extracting data. pandas is particularly well suited to **two-dimensional data** arranged like SQL tables or Excel spreadsheets, where the columns can be of **different types**.

Key features of pandas:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, JSON, CSV, XML formats
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In [None]:
from IPython.display import HTML
HTML("<iframe src=http://pandas.pydata.org width=800 height=350></iframe>")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
pd.__version__

In [None]:
# import seaborn and matplotlib's rcParams to adjust the default look of the plots
import seaborn as sns
from matplotlib import rcParams

rcParams['figure.dpi'] = 350
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['patch.edgecolor'] = 'white'
rcParams['font.family'] = 'StixGeneral'

In [None]:
#this line enables the plots to be embedded into the notebook
%matplotlib inline

In [None]:
# Set some Pandas options for displaying the data as you like
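# Use the full option names; the short forms are ambiguous in recent pandas versions
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 20)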

## Pandas Data Structures

The real power of pandas lies in its data structures. The two structures we will look at are the Series and the DataFrame; the DataFrame is the central and most powerful structure in pandas.

### Series - Univariate Data

A **Series** is a single vector of data (like a list or array) with an *index* that labels each element in the vector. Series holds univariate data which represents multiple observations of some phenomenon, viewed from a single perspective (one variable).

In [None]:
#by calling the Series class and placing a list inside its parentheses, we create a series object
my_series = pd.Series([3778000, 19138000, 20000, 447000])
my_series

If an index is not specified, a default sequence of integers is assigned as the index seen above.

We can extract all the values in a series object. 

In [None]:
my_series.values

Individual values can be accessed much like in a list. With the default integer index, the label of each value coincides with its position:

In [None]:
my_series[1]

The entire index of a series object can also be extracted.

In [None]:
my_series.index

In [None]:
list(my_series.index)

We can assign meaningful labels to the index, if they are available:

In [None]:
country = pd.Series(
                    [3778000, 19138000, 20000, 447000], 
                    index=['New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands']
                    )
country

To quickly plot and visualize the data in a Series, we simply call its `plot()` method, which produces a line graph by default:

In [None]:
country.plot()

### Selecting values based on index

The new index values we created can now be used to refer to the values in the series object.

In [None]:
country['Australia']

In order to access multiple values, we pass a **list** of index values:

In [None]:
country[ ['New Zealand', 'Australia'] ]

**Exercise**: Display the values for 'Australia', 'Cook Islands' and the 'Solomon Islands' using the above method.

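However, we do not have to use the index labels to access values; we can also select by position using `.iloc`. (Passing integer positions directly to `[]` on a label-indexed Series is deprecated in recent pandas versions.)

In [None]:
country.iloc[ [0, 1] ]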

We can also use slicing and avoid explicitly typing in the names or indexes of values we need to extract.

In [None]:
country[ country.index[0:2] ]
# or
# country.iloc[0:2]
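
The two access styles can be made explicit with `.loc` (labels) and `.iloc` (positions). This is a small sketch on a rebuilt copy of the data, using the illustrative name `demo` so the cell runs on its own; note in particular that label slices include their end point while positional slices do not.

```python
import pandas as pd

# Same data as the country Series above; `demo` is an illustrative name
demo = pd.Series(
    [3778000, 19138000, 20000, 447000],
    index=['New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands'],
)

by_label = demo.loc[['New Zealand', 'Australia']]   # select by index label
by_position = demo.iloc[[0, 1]]                     # select by integer position

# Label slices include the end label; positional slices exclude the end position
label_slice = demo.loc['New Zealand':'Cook Islands']
position_slice = demo.iloc[0:2]

print(by_label.equals(by_position))
print(len(label_slice), len(position_slice))
```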

**Exercise**: Display the values for 'Australia', 'Cook Islands' and the 'Solomon Islands' using slicing.

Additionally, we can pass a list of booleans to select values; only the positions marked `True` are returned.

In [None]:
country[ [False, False, True, True] ]

This becomes handy when the values we wish to extract depend on the result of some boolean test.

In [None]:
list_of_values = []
for val in country.index:
    if 'Islands' in val:
        list_of_values.append(True)
    else:
        list_of_values.append(False)

print(list_of_values)
country[ list_of_values ]

There is almost always more than one way to code a particular requirement. The loop above can be shortened to a single line using a list comprehension:

In [None]:
country[['Islands' in name for name in country.index]]
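
pandas also offers vectorised string methods on the index through the `.str` accessor, which build the same kind of boolean mask without an explicit loop. The sketch below uses a rebuilt copy of the data under the illustrative name `demo`.

```python
import pandas as pd

demo = pd.Series(
    [3778000, 19138000, 20000, 447000],
    index=['New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands'],
)

# One boolean per index label: does the label contain the substring 'Islands'?
mask = demo.index.str.contains('Islands')

# Use the mask to keep only the matching entries
islands = demo[mask]

print(mask)
print(islands)
```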

**Exercise**: Using the long-form construct above, write code below to list the values for countries whose name is shorter than 12 characters.

**Exercise**: Try to replicate the same logic using a list comprehension. Hint: use the 'if' keyword inside the list comprehension to define the condition.

Both the array of values and the index can themselves be given meaningful names:

In [None]:
country.name = 'POPULATION'
country.index.name = 'NATION'
country

Python's NumPy package adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. NumPy's math functions and other operations can be applied to Series objects at an element-wise level without losing the data structure.

In [None]:
#import numpy as np
np.sqrt(country)
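
"Without losing the data structure" means the result is still a Series carrying the original index. A minimal check, on a rebuilt copy of the data under the illustrative name `demo`, also shows that ordinary arithmetic operators behave the same way:

```python
import numpy as np
import pandas as pd

demo = pd.Series(
    [3778000, 19138000, 20000, 447000],
    index=['New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands'],
)

# A NumPy function applied to a Series returns a Series with the same index
roots = np.sqrt(demo)

# Arithmetic with a scalar is also applied element by element
millions = demo / 1_000_000

print(type(roots).__name__)
print(millions)
```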

### Basic plotting

We can also easily plot the result of NumPy maths operations on the Series:

In [None]:
np.log(country).plot()

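We can easily change to a different type of graph using the `kind` argument. For a Series the options include 'line', 'bar', 'barh', 'hist', 'box', 'kde', 'density', 'area' and 'pie'; 'scatter' needs two columns and is only available for a DataFrame.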

In [None]:
_ = country.plot(kind='bar',rot=0)

**Exercise**: Plot the country data using horizontal bar graphs (barh).

### Selecting data points based on values

We can filter a Series by comparing its values against a threshold, and combine several such comparisons into more complex conditions:

In [None]:
country[country <= 20000]

In [None]:
country[(country > 10000) & (country < 500000)]
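
Each comparison produces a boolean Series, and these masks can be stored and combined with `&` (and), `|` (or) and `~` (not). The parentheses around each comparison are needed because these operators bind more tightly than `<` and `>`. The sketch below uses a rebuilt copy of the data; `demo`, `small` and `large` are illustrative names and the thresholds are arbitrary.

```python
import pandas as pd

demo = pd.Series(
    [3778000, 19138000, 20000, 447000],
    index=['New Zealand', 'Australia', 'Cook Islands', 'Solomon Islands'],
)

small = demo < 100_000        # boolean Series with the same index as demo
large = demo > 10_000_000

extremes = demo[small | large]     # entries that are either small OR large
middle = demo[~(small | large)]    # ~ negates the mask: everything else

print(small)
print(extremes)
print(middle)
```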

**Exercise**: Plot a bar graph for countries with a population of less than 1,000,000.

### Indices and alignment

A `Series` can be thought of as an ordered **key-value** data structure. In fact, we can create one from a `dict`:

In [None]:
country_dict = {'New Zealand': 3778000, 
                'Australia': 19138000, 
                'Cook Islands': 20000, 
                'Solomon Islands': 447000}
type(country_dict)

In [None]:
pd.Series(country_dict)

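Notice the order of the resulting `Series`: recent pandas versions keep the insertion order of the dict keys, whereas older versions returned the entries in key-sorted order.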
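
The ordering can be checked directly on whatever pandas version is installed. The sketch below rebuilds the dict under the illustrative names `demo_dict` and `demo`, compares the index order with the dict's insertion order, and uses `sort_index()` to obtain key-sorted order explicitly.

```python
import pandas as pd

# Same data as country_dict above; the names here are illustrative only
demo_dict = {'New Zealand': 3778000,
             'Australia': 19138000,
             'Cook Islands': 20000,
             'Solomon Islands': 447000}
demo = pd.Series(demo_dict)

# True if this pandas version kept the dict's insertion order
print(list(demo.index) == list(demo_dict))

# sort_index() returns a new Series ordered by label, whatever the original order
demo_sorted = demo.sort_index()
print(list(demo_sorted.index))
```

Calling `sort_index()` explicitly is a safe choice whenever later code relies on alphabetical order.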

If we pass a custom index to `Series`, it will select the corresponding values from the dict and treat index labels without corresponding values as **missing**. pandas uses the floating-point value `NaN` (Not a Number) to represent missing values.

In [None]:
country2 = pd.Series(country_dict, index=['Niue','New Zealand', 'Australia', 'Cook Islands'])
country2

We can find out which keys are associated with null values:

In [None]:
country2.isnull()
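
Because `isnull()` returns a boolean Series, it can be summarised or used to look up labels. The following is a minimal sketch on a rebuilt copy of the data (the names `demo_dict` and `demo2` are illustrative); it counts the missing entries, lists their labels, and fills them with a placeholder value of 0, which is an arbitrary choice for demonstration.

```python
import pandas as pd

demo_dict = {'New Zealand': 3778000,
             'Australia': 19138000,
             'Cook Islands': 20000,
             'Solomon Islands': 447000}
demo2 = pd.Series(demo_dict,
                  index=['Niue', 'New Zealand', 'Australia', 'Cook Islands'])

# True counts as 1, so summing the mask counts the missing entries
n_missing = int(demo2.isnull().sum())

# Index labels whose value is missing
missing_labels = demo2.index[demo2.isnull()].tolist()

# fillna returns a new Series; demo2 itself is left unchanged
filled = demo2.fillna(0)

print(n_missing, missing_labels)
```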

**Exercise**: plot a bar graph on country2 for countries that do not have null/NaN for population

One of the great advantages of working with pandas data structures is their ability to ensure alignment. Critically, the labels are used to **align data** when a Series is used in operations with other Series objects:

In [None]:
country

In [None]:
country2

In [None]:
country + country2

This is in contrast with plain NumPy arrays, where arrays of the same length are combined **element-wise by position**. Adding two Series instead combines the values that share the **same label**, regardless of where they sit in each Series. Notice also that the **missing values were propagated by the addition**: any label that is missing from either Series yields `NaN` in the result.
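
The difference can be seen on a small made-up example. The Series `a` and `b` below are illustrative and not part of the lecture data; `add(..., fill_value=0)` is one option for treating a label missing from one side as zero instead of propagating `NaN`.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# NumPy arrays: values are paired up by position, labels play no role
positional = a.values + b.values

# Series: values are paired up by label; 'x' and 'w' have no partner and become NaN
aligned = a + b

# Treat a label missing from one Series as 0 before adding
filled = a.add(b, fill_value=0)

print(positional)
print(aligned)
print(filled)
```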

## File input

Up until this point, we have explored trivial datasets. Real-world datasets are much larger and are often stored in flat files that have to be opened and loaded into a program for processing. Below is an example of a file containing response times (in milliseconds) for queries against a web server or database, from the book "Data Analysis with Open Source Tools" by Philipp K. Janert.

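The data can be read using the pandas `pd.read_csv()` function, which returns a DataFrame. Passing `header=None` tells pandas that the file has no header row, and selecting column `0` turns the single column into a Series object: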

In [None]:
server_data = pd.read_csv("../datasets/ch02_serverdata", header=None)[0]
server_data.head()
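
To see what `header=None` and the trailing `[0]` do without needing the data file, the sketch below reads a few made-up numbers from an in-memory buffer. The values and the names `fake_file`, `frame` and `one_column` are assumptions for illustration only and are not taken from the server dataset.

```python
import io
import pandas as pd

# Made-up stand-in for a flat file with one number per line and no header row
fake_file = io.StringIO("452.42\n318.58\n144.82\n129.13\n")

# header=None: the first line is data, and the columns are labelled 0, 1, ...
frame = pd.read_csv(fake_file, header=None)

# Selecting the single column by its label 0 gives a Series
one_column = frame[0]

print(type(frame).__name__, type(one_column).__name__)
print(one_column.head(2))
```

Without `header=None`, the first line of the file would be consumed as the column name rather than treated as a data point.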

**Exercise**: From the output above, what is the first thing you can tell about the dataset that has been loaded?

**Exercise**: Call the *describe()* method on the server_data object in the cell below.