# Python Modules - pandas

## Objectives

* Know about data analysis functions provided by `pandas`
* Understand the concept of a `pandas.DataFrame`
* Know how to access summary statistics with `.describe()`
* Know how to access elements from a `pandas.DataFrame` by indices
* Know how to select elements from a `pandas.DataFrame` by comparison
* Know how to access the `pandas` documentation

**Time**: 30 minutes

## pandas

`pandas` is a large, well-developed module that provides data analysis functions and datatypes for Python. We will only introduce you to a few of its functions today.

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Create a new code cell beneath this cell and import the <code style="background-color:#cdefff">pandas</code> module. It is conventional to give the <code style="background-color:#cdefff">pandas</code> module the alias <code style="background-color:#cdefff">pd</code>.</div>

<br>
<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Find the pandas documentation online. Can you easily navigate the documentation to find useful functions?</div>

### DataFrames

One of the key features of `pandas` is the introduction of another new datatype: the `pandas.DataFrame`.

We will use the `pandas.DataFrame` for the mini-project this afternoon (on your own data if you brought it).

A `DataFrame` is a collection of `Series`. You can consider a `DataFrame` to be a table of data and a `Series` to be a column of data.
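The table-and-column picture can be sketched with a small hand-made example. The `plants` data below is hypothetical and only serves to show that a single column of a `DataFrame` is a `Series`.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column
plants = pd.DataFrame({
    "name": ["fern", "cactus", "ivy"],
    "height_cm": [30.0, 12.5, 45.0],
})

# Selecting one column by its label returns a Series
heights = plants["height_cm"]

print(type(plants))   # the table
print(type(heights))  # one column of the table
print(plants.shape)   # (number of rows, number of columns)
```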

`pandas` is built on top of NumPy and many NumPy array methods can be applied to `DataFrames` and `Series`.

If you're used to the `R` programming language then `DataFrames` and `Series` may already be familiar to you, although Python has its own ways of working with them.

There are many benefits to using a `DataFrame` instead of a NumPy array. These include the ability of `pandas` to deal with missing values and to perform relational database operations (such as joins) between DataFrames.
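As a minimal sketch of these two benefits, the example below uses two hypothetical tables, `measurements` and `locations`, that share a `plant_id` column. It assumes missing values are marked with `numpy.nan`, which `pandas` skips by default when computing statistics such as the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables sharing a 'plant_id' column
measurements = pd.DataFrame({
    "plant_id": [1, 2, 3],
    "height_cm": [30.0, np.nan, 45.0],  # np.nan marks a missing value
})
locations = pd.DataFrame({
    "plant_id": [1, 2, 3],
    "site": ["north", "south", "north"],
})

# Missing values are skipped by default when computing the mean
mean_height = measurements["height_cm"].mean()
print(mean_height)

# A database-style join on the shared 'plant_id' column
combined = measurements.merge(locations, on="plant_id")
print(combined)
```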

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Search the on-line pandas documentation to find the <code style="background-color:#cdefff">pandas.DataFrame</code> and <code style="background-color:#cdefff">pandas.Series</code> pages.</div>

## Loading CSV Data

The easiest way to load data as a DataFrame with `pandas` is to read a 'comma-separated values' (CSV) file. These can easily be exported from Excel or similar software if you already have data in a different format.

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Run the code cell beneath this one to see how to load a <code style="background-color:#cdefff">pandas.DataFrame</code> from a csv file (here from example data on the internet). Note how the DataFrame has an extra column at the start - this is an 'index'.</div>
<br>
Note that this dataset is about sepal and petal size in irises.

In [None]:
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
print("Our data is of type {0}".format(type(iris)))  # print the datatype
print(iris.head())  # print only the first five rows
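If you are working without an internet connection, the same function can be tried on CSV text held in memory. This is only a sketch: the `csv_text` contents are made up, and `io.StringIO` is used here to make the text behave like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line holds the column headers
csv_text = """name,height_cm,flowering
fern,30.0,False
cactus,12.5,True
ivy,45.0,False
"""

# read_csv accepts file-like objects as well as paths and URLs
plants = pd.read_csv(io.StringIO(csv_text))

print(plants)         # note the automatically created index on the left
print(plants.dtypes)  # the datatype pandas inferred for each column
```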

### Accessing Elements of a DataFrame - Selecting by Indexing

DataFrames are accessible by indexing (as are the `numpy.ndarray`, `list` and `string` datatypes).

However, unlike other datatypes we don't just use square brackets to select values by indexing. If we want to access elements of a `DataFrame` by index we must use the `.iloc` (Integer-LOCation) attribute.

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Run the code cells beneath this one to access elements of <code style="background-color:#cdefff">iris</code> by indexing. Add two <code style="background-color:#cdefff">print()</code> commands that display elements accessed by indexing:
    
* the last row, fifth column
* the last row and last two columns.
</div>

In [None]:
# Access the first row, first column
print(iris.iloc[0,0])

In [None]:
# Access the second and third rows, third column
print(iris.iloc[1:3,2])

In [None]:
# Access the first ten rows, last three columns
print(iris.iloc[0:10,2:5])

### Accessing Elements of a DataFrame - Selecting by Labels

DataFrames are also accessible by labels, i.e. column headers (this is unlike the `numpy.ndarray`, `list` and `string` datatypes).

If we want to access elements of a `DataFrame` by labels we must use the `.loc` (label-LOCation) attribute.

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Run the code cells beneath this one to access elements of <code style="background-color:#cdefff">iris</code> by label. Note how, because of the index column, the row labels happen to be the same as the row integer locations. Note also how slicing (<code style="background-color:#cdefff">:</code>) with labels includes the last value (unlike slicing with indices above).</div>

In [None]:
# Access the first row, 'sepal_length' column
print(iris.loc[0,'sepal_length'])

In [None]:
# Access the second and third rows, 'petal_length' column
print(iris.loc[1:2,'petal_length'])

In [None]:
# Access the first ten rows, last three columns
print(iris.loc[0:9,'petal_length':'species'])
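The difference between positions and labels is easier to see when the row labels are not integers. The sketch below uses a hypothetical `plants` table whose rows are labelled by name. It illustrates that `.iloc` slices exclude the end position, whereas `.loc` slices include the end label.

```python
import pandas as pd

# Hypothetical toy data with text row labels instead of 0, 1, 2, ...
plants = pd.DataFrame(
    {"height_cm": [30.0, 12.5, 45.0, 8.0]},
    index=["fern", "cactus", "ivy", "moss"],
)

# Positions 0 and 1; position 2 is excluded
by_position = plants.iloc[0:2]
print(by_position)

# Labels 'fern' through 'ivy'; 'ivy' is included
by_label = plants.loc["fern":"ivy"]
print(by_label)
```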

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Run the cell below - why does <code style="background-color:#cdefff">-1</code> not work when using <code style="background-color:#cdefff">.loc</code>?</div>

In [None]:
# Try to access the last row, 'petal_width' column, using -1 as the row label
print(iris.loc[-1,'petal_width'])

### Accessing Elements of a DataFrame - Selecting Whole Columns or Rows

To access a whole column (or row), we can use just a colon to indicate 'everything'.

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Run the cell below to see an example of printing a whole column. Note that <code style="background-color:#cdefff">print()</code> truncates the data, but the variable contains the whole column's data.</div>

In [None]:
# Access the whole 'species' column
myVariable = iris.loc[:,'species']
print(myVariable)
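The same colon works for whole rows, and with `.iloc` as well as `.loc`. A minimal sketch on a hypothetical `plants` table:

```python
import pandas as pd

# Hypothetical toy data
plants = pd.DataFrame({
    "name": ["fern", "cactus", "ivy"],
    "height_cm": [30.0, 12.5, 45.0],
})

# Every column of the row labelled 1
second_row = plants.loc[1, :]
print(second_row)

# Every row of the column at integer position 0
first_column = plants.iloc[:, 0]
print(first_column)
```

Note that a single row is returned as a `Series` whose labels are the column headers.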

### Describing a DataFrame

Often, we just want a quick summary of numerical data, e.g. the mean and standard deviation. `pandas.DataFrame` objects have a method to give you a quick overview: `.describe()`.

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> Run the cell below to see an example of the <code style="background-color:#cdefff">.describe()</code> method. Note how <code style="background-color:#cdefff">.describe()</code> only considers numerical <code style="background-color:#cdefff">Series</code> and ignores 'species'.</div>

In [None]:
print(iris.describe())
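The result of `.describe()` is itself a `DataFrame`, so a single statistic can be picked out by label. The sketch below shows this on a hypothetical `plants` table; the row labels of the summary (such as `'mean'`) are those produced by `.describe()` for numerical columns.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numerical columns
plants = pd.DataFrame({
    "name": ["fern", "cactus", "ivy"],
    "height_cm": [30.0, 12.5, 45.0],
    "leaf_count": [12, 0, 30],
})

# The summary only contains the numerical columns
summary = plants.describe()
print(summary)

# Pick out one statistic using row and column labels
mean_height = summary.loc["mean", "height_cm"]
print(mean_height)
```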

### Accessing Elements of a DataFrame - Selecting By Comparison

But what if we only want to see summary statistics for one species in our dataset? We could manually look through the `DataFrame` and pick the indices for each 'setosa' iris. In this case that will be indices `0` to `49`, but in many cases the ordering of our data may be random.

Luckily, DataFrames can be accessed not just by indices and labels but also by comparisons. Essentially, we create a Boolean 'mask' - a `True`/`False` value at every element, which tells the DataFrame what data to use.
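To make the idea of a mask concrete, here is a minimal sketch using a numerical comparison on a hypothetical `plants` table (not the iris data). The comparison produces one `True`/`False` value per row, and passing that mask to `.loc` keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical toy data
plants = pd.DataFrame({
    "name": ["fern", "cactus", "ivy", "moss"],
    "height_cm": [30.0, 12.5, 45.0, 8.0],
})

# Comparing a column with a value gives a Boolean Series: the mask
mask = plants.loc[:, "height_cm"] > 20
print(mask)

# Only rows where the mask is True are kept
tall = plants.loc[mask]
print(tall)
```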

<div style="background-color:#cdefff; border-radius: 5px; padding: 10pt"><strong>Task:</strong> 
Read the cell below. This cell aims to create a mask for only setosa iris data and print out summary statistics for that species. Replace all the gaps (<code style="background-color:#cdefff">____</code>) in the cell so that it runs without errors and produces the right output values.</div>

In [None]:
# Create a mask from the 'species' column
mask = iris.____[:,____]=='setosa'

# Print the masked DataFrame
print(iris.loc[____])

# Print summary statistics
print(iris.____[mask].____())

## Key Points

* pandas increases the functionality of Python for data analysis
* Whilst pandas documentation can look overwhelming, it can easily be interpreted
* `pandas.DataFrame` objects hold data in a table-like way and can be loaded from your existing data
* There are lots of ways to access data from your `pandas.DataFrame`, but not all will be appropriate for your scenario