# Table of Contents
 <p><div class="lev1"><a href="#Pandas-Python-Data-Analysis-Library"><span class="toc-item-num">1 - </span><a href="http://pandas.pydata.org" target="_blank">Pandas</a> Python Data Analysis Library</a></div><div class="lev2"><a href="#Data-Structures-in-Pandas"><span class="toc-item-num">1.1 - </span>Data Structures in Pandas</a></div><div class="lev3"><a href="#Series"><span class="toc-item-num">1.1.1 - </span>Series</a></div><div class="lev3"><a href="#DataFrame"><span class="toc-item-num">1.1.2 - </span>DataFrame</a></div><div class="lev3"><a href="#Panels-(3D,-4D,-ND)"><span class="toc-item-num">1.1.3 - </span>Panels (3D, 4D, ND)</a></div><div class="lev2"><a href="#IO-Tools"><span class="toc-item-num">1.2 - </span>IO Tools</a></div><div class="lev3"><a href="#pd.from_csv"><span class="toc-item-num">1.2.1 - </span>pd.from_csv</a></div><div class="lev3"><a href="#pd.DataFrame.to_csv"><span class="toc-item-num">1.2.2 - </span>pd.DataFrame.to_csv</a></div><div class="lev3"><a href="#pd.DataFrame.to_hdf"><span class="toc-item-num">1.2.3 - </span>pd.DataFrame.to_hdf</a></div><div class="lev2"><a href="#Reshaping"><span class="toc-item-num">1.3 - </span>Reshaping</a></div><div class="lev2"><a href="#Indexing-and-Selecting-Data"><span class="toc-item-num">1.4 - </span>Indexing and Selecting Data</a></div><div class="lev2"><a href="#Group-by-and-apply"><span class="toc-item-num">1.5 - </span>Group-by and apply</a></div><div class="lev3"><a href="#Applying-a-function"><span class="toc-item-num">1.5.1 - </span>Applying a function</a></div><div class="lev2"><a href="#Filtering-(Numeric-and-String)"><span class="toc-item-num">1.6 - </span>Filtering (Numeric and String)</a></div>

> `Usual stuff to import`

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from IPython.display import display, HTML

# [Pandas](http://pandas.pydata.org) Python Data Analysis Library

If you find manipulating dataframes in `R` a bit too cumbersome, why not give Pandas a chance? On top of easy and efficient table management, its plotting functionality is pretty great.

## Data Structures in Pandas

- Data alignment is intrinsic in pandas: operations between objects match values by label rather than by position.

### Series

One-dimensional labelled array which can hold any data type (even Python objects). 

In [None]:
series_one = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
series_one

In [None]:
series_two = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
series_two

- ***<span class="mark">Starting from version v0.8.0, pandas supports non-unique index values</span>***

Series is ndarray-like and dict-like, and it supports vectorized operations and label alignment.

In [None]:
series_one[2:4]

In [None]:
series_one['a']

In [None]:
series_one + series_two

In [None]:
series_one * 3
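
To make the label alignment a bit more concrete, here is a minimal sketch with two small Series whose labels only partly overlap. The names `left` and `right` and the numbers are made up for illustration. Labels that exist in only one operand come out as `NaN`, unless you ask for a fill value.

```python
import pandas as pd

# Two Series with partly overlapping labels (illustrative data)
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Addition aligns on labels, not on position;
# 'a' and 'd' exist on one side only, so they become NaN
total = left + right

# fill_value treats a label missing from one side as 0 instead
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```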

### DataFrame

DataFrame is a 2-dimensional labelled data structure, like a spreadsheet, a SQL table, or a dict of Series objects. It is the most commonly used data structure in Pandas and the one we'll be discussing most.

In [None]:
df_one = pd.DataFrame({'one': pd.Series(np.random.rand(5), 
                                        index=['a', 'b', 'c', 'd' , 'e']),
                     'two': pd.Series(np.random.rand(4), 
                                      index=['a', 'b', 'c', 'e'])})
df_one

There are several other constructors for creating a DataFrame object:
- `pd.DataFrame.from_records`
- `pd.DataFrame.from_dict`
- `pd.DataFrame.from_items` (deprecated in later releases and removed in pandas 1.0)
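
As a rough sketch of two of these constructors, the cell below builds one frame from a list of row tuples and another from a nested dict. The data and the variable names (`records`, `data`, `df_records`, `df_dict`) are invented for the example.

```python
import pandas as pd

# from_records: one tuple per row, column names given separately
records = [('a', 1, 0.5), ('b', 2, 1.5)]
df_records = pd.DataFrame.from_records(records,
                                       columns=['label', 'count', 'score'])

# from_dict with orient='index': each outer key becomes a row label,
# each inner key becomes a column
data = {'row1': {'x': 1, 'y': 2}, 'row2': {'x': 3, 'y': 4}}
df_dict = pd.DataFrame.from_dict(data, orient='index')

print(df_records)
print(df_dict)
```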

Other Pandas data objects, which we are not going to talk about here:

### Panels (3D, 4D, ND)

## IO Tools

The Pandas I/O API is a set of top-level reader functions, such as `pd.read_csv`, which generally return a pandas object.

### pd.read_csv

Some important parameters of `pd.read_csv`:
- **`sep`** - Delimiter
- **`index_col`** - Specifies which column to select as index
- **`usecols`** - Specify which columns to read when reading a file
- **`compression`** - Can handle gzip, bz2 compressed text files
- **`comment`** - Comment character
- **`names`** - If `header=None`, you can specify the names of columns 
- **`iterator`** - Return an iterator `TextFileReader` object
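
Here is a small sketch of how several of these parameters combine. It reads from an in-memory `io.StringIO` buffer instead of a real file, and the data, the `;` delimiter, and the column names are all made up for the example.

```python
import io
import pandas as pd

# Small in-memory stand-in for a file on disk (made-up data)
raw = io.StringIO(
    "# comment lines are skipped\n"
    "s1;1.5;2.5;x\n"
    "s2;3.5;4.5;y\n"
)

df = pd.read_csv(
    raw,
    sep=";",                        # field delimiter
    comment="#",                    # ignore everything after '#'
    header=None,                    # the data has no header row
    names=["id", "a", "b", "tag"],  # so supply the column names
    usecols=["id", "a", "b"],       # read only these columns
    index_col="id",                 # use 'id' as the row index
)
print(df)
```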

In [None]:
iris = pd.read_csv("iris.csv", index_col=0)
iris.head()

Let's see the power of pandas. To demonstrate, we'll read the [Gencode v24](http://www.gencodegenes.org/releases/24.html) annotation file straight from its URL, keeping only a few of its columns.

In [None]:
url = "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.primary_assembly.annotation.gtf.gz"
gencode = pd.read_csv(url, compression="gzip", iterator=True, header=None, 
                      sep="\t", comment="#", quoting=3, 
                      usecols=[0, 1, 2, 3, 4, 6])
gencode.get_chunk(10)
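
If you'd rather try `iterator=True` without downloading anything, the following sketch does the same kind of chunked read on ten rows of made-up tab-separated data held in memory. The column names are assumptions for the example, not the real GTF layout.

```python
import io
import pandas as pd

# Ten rows of made-up tab-separated data standing in for a large file
raw = io.StringIO(
    "".join("chr1\t{0}\t{1}\n".format(i, i + 100) for i in range(10))
)

reader = pd.read_csv(raw, sep="\t", header=None,
                     names=["chrom", "start", "end"], iterator=True)

first = reader.get_chunk(4)   # rows 0-3
second = reader.get_chunk(4)  # rows 4-7; the reader remembers its position
rest = reader.get_chunk(4)    # only two rows are left, so this returns rows 8-9

print(first)
print(second)
print(rest)
```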

### pd.DataFrame.to_csv

Writes the data to a CSV file. Optional parameters such as `sep`, `index`, and `header` let you save the file just the way you want.

```python
iris.to_csv("iris_copy.csv")
```
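
A quick way to check what `to_csv` writes is a round trip through an in-memory buffer. This is only a sketch: the tiny frame below is made up and stands in for `iris`.

```python
import io
import pandas as pd

# Tiny made-up frame standing in for a real dataset
df = pd.DataFrame({"length": [5.1, 4.9], "species": ["setosa", "setosa"]},
                  index=["s1", "s2"])

buffer = io.StringIO()
# Tab-separated output, keeping the index as the first column
df.to_csv(buffer, sep="\t", index=True)

# Rewind and read the text back with matching options
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```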

### pd.DataFrame.to_hdf

```python
iris.to_hdf("iris_copy.h5", key="df")
```

Creates an HDF5 file (a binary, indexed file format for faster loading and for filtering on the index at load time). Requires `pytables` as a dependency if you want to go full on with its functionality.

## Reshaping

Almost everyone is familiar with how much reshaping data needs before it can be plotted properly. This functionality is also pretty well covered in pandas.

- `pd.melt`

In [None]:
planets = pd.read_csv("planets.csv", index_col=0)
planets.head()

In [None]:
planets_melt = pd.melt(planets, id_vars="method")
planets_melt.head()
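
Since `planets.csv` may not be at hand, here is a minimal sketch of what `pd.melt` does to a wide table. The columns and numbers are invented, and `var_name`/`value_name` are optional arguments used only to label the output.

```python
import pandas as pd

# Wide table: one row per method, one column per measurement (made-up numbers)
wide = pd.DataFrame({
    "method": ["Transit", "Imaging"],
    "mass": [1.2, 3.4],
    "distance": [100.0, 50.0],
})

# Long table: one row per (method, measurement) pair
long = pd.melt(wide, id_vars="method",
               var_name="measurement", value_name="value")
print(long)
```

The long form is usually what plotting libraries want: one column to group by and one column of values.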

## Indexing and Selecting Data

`pd.DataFrame` and `pd.Series` support basic array-like indexing. For anything more detailed, it's better to use `.loc` (label-based) and `.iloc` (position-based).

In [None]:
heatmap = pd.read_csv("Heatmap.tsv", sep="\t", index_col=0)
heatmap.head(10)

In [None]:
heatmap.iloc[4:8]

In [None]:
heatmap.loc[['prisons', 'jacks', 'irons']]
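
The difference between the two indexers is easiest to see on a tiny frame. The sketch below uses made-up data. Note that a positional slice excludes its end, while a label slice includes it.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_position = df.iloc[1:3]        # positions 1 and 2; the end is excluded
by_label = df.loc["b":"c"]        # labels 'b' through 'c'; the end is included
picked = df.loc[["d", "a"], "x"]  # a list of labels keeps the requested order

print(by_position)
print(by_label)
print(picked)
```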

<span class="burk"><span class="girk">Almost forgot: HTML conditional formatting just made it into the latest release, `0.17.1`, and it's pretty awesome. Use a function to your liking or do it with a background gradient.</span></span>

In [None]:
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    values, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

# Apply the function like this
heatmap.head(10).style.applymap(color_negative_red)

In [None]:
heatmap.head(10).style.background_gradient(cmap="RdBu_r")

## Group-by and apply

You can group data (on both axes) based on a criterion. `groupby` returns a `GroupBy` object that you can iterate over, but you can also apply a function directly without the need to iterate.

Remember though, if you apply the function directly without iterating over the groups, the result gets a new index made of the keys you grouped by.

- `pd.DataFrame.groupby`

In [None]:
# No need to iterate to compute the mean per species
iris_species_grouped = iris.groupby('species')
iris_species_grouped.mean()

In [None]:
# A GroupBy object can be iterated more than once; it is re-created here only for clarity
iris_species_grouped = iris.groupby('species')

for species, group in iris_species_grouped:
    display(HTML(species))
    display(pd.DataFrame(group.mean(numeric_only=True)).T)
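
For a version that runs without `iris.csv`, here is a minimal sketch of both styles on a four-row frame with made-up values: aggregating directly, and iterating over the groups.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.0, 2.0, 5.0, 7.0],
})

grouped = df.groupby("species")

# Aggregating directly: the result is indexed by the group keys
means = grouped["petal_length"].mean()

# Iterating: each step yields the key and the matching sub-frame.
# The same GroupBy object is reused after the aggregation above.
sizes = {species: len(group) for species, group in grouped}

print(means)
print(sizes)
```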

### Applying a function

- `pd.DataFrame.apply`

In [None]:
pd.DataFrame(iris.iloc[:, [0, 1, 2, 3]].apply(np.std, axis=0)).T

In [None]:
def add_length_width(x):
    """
    Adds up the length and width of the features and returns
    a pd.Series object so as to get a pd.DataFrame
    """
    sepal_sum = x['sepal_length'] + x['sepal_width']
    petal_sum = x['petal_length'] + x['petal_width']
    return pd.Series([sepal_sum, petal_sum, x['species']], 
                    index=['sepal_sum', 'petal_sum', 'species'])

iris.apply(add_length_width, axis=1).head(5)
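
The `axis` argument decides what the function receives. As a small sketch on made-up data (the helper `area_and_ratio` is hypothetical), `axis=0` hands over one column at a time and `axis=1` one row at a time:

```python
import pandas as pd

df = pd.DataFrame({"length": [5.0, 6.0], "width": [3.0, 2.0]})

# axis=0: the function receives each column as a Series
column_max = df.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row as a Series; returning a
# Series gives one output column per entry
def area_and_ratio(row):
    return pd.Series({"area": row["length"] * row["width"],
                      "ratio": row["length"] / row["width"]})

derived = df.apply(area_and_ratio, axis=1)

print(column_max)
print(derived)
```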

## Filtering (Numeric and String)

There's always a need for filtering. Besides the obvious float and int filters, pandas has exceptional string filtering options baked in. So much good stuff.

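Inside `pd.DataFrame.loc`, you can combine conditions with *and* (`&`), *or* (`|`), and *not* (`~`) as element-wise logical operators; wrap each condition in its own parentheses. This stuff works and is tested ;) Conditions can be built from: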
- `>, <, >=, <=`
- `str.contains, str.startswith, str.endswith`

In [None]:
iris.loc[iris.sepal_width > 3.5]

In [None]:
iris.loc[(iris.sepal_width > 3.5) & (iris.species == 'virginica')]

In [None]:
heatmap.loc[heatmap.index.str.contains("due|ver|ap")]
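
Here is a self-contained sketch of the same filters on a made-up frame, so you can see which rows survive each one. The index labels are invented to give `str.contains` something to match.

```python
import pandas as pd

df = pd.DataFrame({"width": [3.0, 3.6, 3.8, 2.9],
                   "species": ["setosa", "setosa", "virginica", "virginica"]},
                  index=["apple", "overdue", "river", "stone"])

# Each comparison needs its own parentheses because & and |
# bind more tightly than > or ==
wide_virginica = df.loc[(df.width > 3.5) & (df.species == "virginica")]
not_setosa = df.loc[~(df.species == "setosa")]

# str.contains takes a regular expression by default, so "due|ver|ap"
# means: contains 'due' or 'ver' or 'ap'
matched = df.loc[df.index.str.contains("due|ver|ap")]

print(wide_virginica)
print(not_setosa)
print(matched)
```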

There is a ton of stuff that can be done in `Pandas`. The online docs are super detailed and amazing. Explore, search, Stack Overflow it, and most probably you'll get what you're looking for. The current docs (as of this talk) are for [Pandas v`0.17.1`](http://pandas.pydata.org/pandas-docs/version/0.17.1/).

Things that I can't cover because of time constraints:
- Plotting - Uses matplotlib as the backend and makes big data analyses/visualization quicker
- Lots of mathematical functions to easily use in day to day life