# Notebook 5: Pandas

In this notebook, we will look at the package `pandas`. `pandas` is built on top of `numpy` (see Notebook 4 for more on `numpy`) and specializes in dealing with heterogeneous and/or missing data, often stored in a labelled format.

As this is an introductory course we will only give a brief introduction to `pandas` and a flavour of why it may be useful to you. For more information see the [`pandas` documentation here](https://pandas.pydata.org/pandas-docs/stable/).

The strength of `pandas` lies in its ability to handle database and spreadsheet type operations that may be more familiar to users of languages such as `R` and `SQL`.

 > **Note:** By convention, `pandas` is often aliased as `pd`.

In [32]:
import pandas as pd
import numpy as np

## Pandas datatypes

The world of `pandas` revolves predominantly around the `Series` and `DataFrame` datatypes. We shall look at each one of these in turn now. 

### Pandas `Series`

A `pandas` `Series` object is an indexed one-dimensional array of data.

In [7]:
example_series = pd.Series([2.1,3.9,4.2])
print(example_series)

0    2.1
1    3.9
2    4.2
dtype: float64


We can convert a `pandas` `Series` to a `numpy` `array` very easily using the `values` attribute and we can access the elements of a `Series` using square brackets in the same way we would a one-dimensional `numpy` array.

In [51]:
print(example_series.values)
print(example_series[0])
print(example_series[0:2])
print(example_series[example_series>3])

[2.1 3.9 4.2]
2.1
0    2.1
1    3.9
dtype: float64
1    3.9
2    4.2
dtype: float64
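As a small aside, the `to_numpy` method is another way to get the same array as the `values` attribute. The sketch below rebuilds `example_series` so that it runs on its own; the name `as_array` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

example_series = pd.Series([2.1, 3.9, 4.2])

# to_numpy() returns the underlying data as a numpy array
as_array = example_series.to_numpy()
print(type(as_array))

# It holds the same numbers as the values attribute
print(np.array_equal(as_array, example_series.values))
```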


However, with a `pandas` `Series` we can change how we index the data. For example:

In [16]:
example_series = pd.Series([2.1,3.9,4.2],
                           index=['a', 'b', 'c'])

We can now access the data using both our own indices, much like with a Python `dict`, and numerical indices, like a Python `list`.

In [13]:
print(example_series['b'])
print(example_series[2])

3.9
4.2


In fact, we can even use `numpy` style array indexing syntax but on our own labels.

 > **Note**: This indexing, unlike `numpy`, is inclusive. For example, `'a':'b'` includes element `'b'`, unlike in `numpy` where `0:2` does not include element `2`.

In [19]:
print(example_series['a':'b'])

a    2.1
b    3.9
dtype: float64
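To make the difference in end-point handling concrete, here is a minimal sketch that slices the same `Series` once by label and once by position using the explicit `loc` and `iloc` accessors. The names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

example_series = pd.Series([2.1, 3.9, 4.2], index=['a', 'b', 'c'])

# Label slice: the end label 'b' is included
by_label = example_series.loc['a':'b']

# Position slice: the end position 1 (where 'b' lives) is excluded
by_position = example_series.iloc[0:1]

print(by_label)
print(by_position)
```

Although `'b'` sits at position `1`, the label slice returns two elements and the position slice returns only one.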


 > **Note:** If we use a number as an index label, our own labels take precedence over the built-in `numpy`-like positional indexing. For example:

In [14]:
example_series = pd.Series([2.1,3.9,4.2],
                           index=[2, 'b', 'c'])
print(example_series[2])

2.1
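When labels and positions could be confused like this, one option is to say explicitly which one we mean. The sketch below uses the `loc` accessor for labels and the `iloc` accessor for positions on the same `Series`.

```python
import pandas as pd

example_series = pd.Series([2.1, 3.9, 4.2], index=[2, 'b', 'c'])

# loc always looks up by label: the element labelled 2
print(example_series.loc[2])

# iloc always looks up by position: the third element
print(example_series.iloc[2])
```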


We can construct a `pandas` `Series` object from a `dict` like so:

In [15]:
example_dict = {'a': 1, 'b': 33, 'c': 2}
print(example_dict)
print(pd.Series(example_dict))

{'a': 1, 'b': 33, 'c': 2}
a     1
b    33
c     2
dtype: int64


 > **Note:** If we use the `index` keyword when creating a `Series` object from a `dict`, only the entries whose keys appear in our `index` list will be retained. For example:

In [20]:
example_dict = {'a': 1, 'b': 33, 'c': 2}
print(example_dict)
print(pd.Series(example_dict, index=['a','c']))

{'a': 1, 'b': 33, 'c': 2}
a    1
c    2
dtype: int64
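The reverse case is also worth a quick look: if the `index` list contains a label that is not a key in the `dict`, that label is kept and its value is missing. In this sketch, `'z'` is a made-up label chosen purely for illustration.

```python
import pandas as pd

example_dict = {'a': 1, 'b': 33, 'c': 2}

# 'z' is a hypothetical label with no matching key in example_dict
with_missing = pd.Series(example_dict, index=['a', 'c', 'z'])
print(with_missing)

# isna() marks which entries are missing
print(with_missing.isna())
```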


### Pandas `DataFrame`

A `pandas` `DataFrame` object can be thought of as a collection of aligned `pandas` `Series` objects, where by aligned we mean sharing the same index. We can construct a `DataFrame` from several `Series` using the `pandas.DataFrame` constructor. Labels that appear in only one of the `Series` are marked as missing in the other column. For example:

In [53]:
# Series of heights in inches
heights = pd.Series({'Pete': 69, 'Mo': 72, 'Katy': 64, 'Alex': 80})
# Series of weights in kg
weights = pd.Series({'Pete': 60, 'Mo': 80, 'Jay': 55, 'Claire': 70})

example_dataframe = pd.DataFrame({'heights': heights, 'weights': weights})

# In this notebook we can display a pandas DataFrame in a
# nice format without printing, just by typing the DataFrame's
# name.
example_dataframe

|        | heights | weights |
|--------|---------|---------|
| Alex   | 80.0    | NaN     |
| Claire | NaN     | 70.0    |
| Jay    | NaN     | 55.0    |
| Katy   | 64.0    | NaN     |
| Mo     | 72.0    | 80.0    |
| Pete   | 69.0    | 60.0    |


We can also construct a `DataFrame` from a `dict` of `dict`s like so:

In [54]:
# dict of heights in inches
heights = {'Pete': 69, 'Mo': 72, 'Katy': 64, 'Alex': 80}
# dict of weights in kg
weights = {'Pete': 60, 'Mo': 80, 'Jay': 55, 'Claire': 70}

example_dataframe = pd.DataFrame({'heights': heights, 'weights': weights})

example_dataframe

|        | heights | weights |
|--------|---------|---------|
| Alex   | 80.0    | NaN     |
| Claire | NaN     | 70.0    |
| Jay    | NaN     | 55.0    |
| Katy   | 64.0    | NaN     |
| Mo     | 72.0    | 80.0    |
| Pete   | 69.0    | 60.0    |
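A related layout is a list of `dict`s with one `dict` per row. The sketch below is only an illustration: the `'name'` key, the `records` variable and the `set_index` step are choices made for this example, not part of the data above.

```python
import pandas as pd

# One dict per person; keys become column names
records = [
    {'name': 'Pete', 'heights': 69, 'weights': 60},
    {'name': 'Mo', 'heights': 72, 'weights': 80},
    {'name': 'Katy', 'heights': 64},  # no weight recorded
]

# Use the illustrative 'name' column as the row index
records_dataframe = pd.DataFrame(records).set_index('name')
print(records_dataframe)
```

Keys missing from a row, such as Katy's weight here, end up as missing values in the resulting `DataFrame`.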


Alternatively we can construct a `DataFrame` from a `numpy` array and label the columns using the `columns` argument:

In [55]:
example_dataframe = pd.DataFrame(
    np.random.randint(60,80, size=(2, 3)),
    columns=['heights', 'weights', 'favourite number'],
    index=['Joe', 'Catelyn'])

example_dataframe

|         | heights | weights | favourite number |
|---------|---------|---------|------------------|
| Joe     | 65      | 72      | 67               |
| Catelyn | 79      | 65      | 79               |
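Because the array above is random, the numbers will differ on every run. If repeatable values are wanted, one possible approach is a seeded `numpy` random generator, sketched below. The seed value `0` and the name `seeded_dataframe` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same numbers each time it is created
rng = np.random.default_rng(seed=0)

seeded_dataframe = pd.DataFrame(
    rng.integers(60, 80, size=(2, 3)),  # integers from 60 up to 79
    columns=['heights', 'weights', 'favourite number'],
    index=['Joe', 'Catelyn'])

print(seeded_dataframe)
```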


## Getting indices

We can retrieve the indices from a `Series` or `DataFrame` easily using the `index` attribute.

In [40]:
print(example_series.index)
print(example_dataframe.index)

Index(['a', 'b', 'c'], dtype='object')
Index(['Joe', 'Catelyn'], dtype='object')


We can also retrieve the column headers from a `DataFrame` using the `columns` attribute.

In [41]:
print(example_dataframe.columns)

Index(['heights', 'weights', 'favourite number'], dtype='object')
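These row and column labels are what we use to pull data back out of a `DataFrame`. The following minimal sketch uses fixed numbers in place of the random ones above so that it runs on its own.

```python
import pandas as pd

example_dataframe = pd.DataFrame(
    {'heights': [65, 79], 'weights': [72, 65]},
    index=['Joe', 'Catelyn'])

# A column label in square brackets returns that column as a Series
print(example_dataframe['heights'])

# loc with a row label returns that row as a Series
print(example_dataframe.loc['Joe'])

# loc with a row label and a column label returns a single value
print(example_dataframe.loc['Joe', 'weights'])
```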


## Working with Pandas

The `pandas` datatypes are incredibly useful and efficient when dealing with heterogeneous data. In this section we will give a selection, by no means comprehensive, of the useful functions that `pandas` offers. We will use the `DataFrame` below in the following examples.

In [98]:
example_dataset = pd.DataFrame.from_dict({
                    'length of eyelashes': np.random.rand(30),
                    'subject_ID': np.arange(30),
                    'birth_year': np.random.randint(1980, 1990, size=30),
                    'group': list('aabcdeacbeabcabcdeeedcdaebcaed'),
                    'sex': list('MMMFMFFMFFMMMFMFFMFFMFMFFMFMMF')
})

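# Add in some missing data at 10 randomly chosen positions
# (positions may repeat, so fewer than 10 cells may end up missing)
rows = np.random.randint(0, 30, size=10)
cols = np.random.randint(0, 5, size=10)
for i in range(10):
    example_dataset.iloc[rows[i], cols[i]] = np.nan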

# Show our dataset
example_dataset

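|   | length of eyelashes | subject_ID | birth_year | group | sex |
|---|---------------------|------------|------------|-------|-----|
| 0 | 0.556329            | 0.0        | 1986.0     | a     | M   |
| 1 | 0.196549            | 1.0        | 1980.0     | a     | M   |
| 2 | 0.161299            | 2.0        | NaN        | b     | M   |
| 3 | 0.325355            | 3.0        | 1980.0     | c     | F   |
| 4 | 0.467345            | 4.0        | 1985.0     | d     | NaN |
| 5 | 0.69878             | NaN        | 1987.0     | e     | F   |
| 6 | 0.536562            | 6.0        | 1987.0     | a     | F   |
| 7 | 0.011462            | 7.0        | 1986.0     | c     | M   |
| 8 | 0.591013            | 8.0        | 1986.0     | b     | F   |
| 9 | 0.024704            | 9.0        | 1987.0     | e     | NaN |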


### Boolean indexing

We can use Boolean indexing to return subsets of the data. What is particularly nice is that in `pandas` Boolean logic is much easier to interpret, thanks to the use of column names. For example, the line of code below should hopefully be fairly intuitive:

In [99]:
example_dataset[(example_dataset['sex']=='M') & 
                (example_dataset['birth_year']>1987)]

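|    | length of eyelashes | subject_ID | birth_year | group | sex |
|----|---------------------|------------|------------|-------|-----|
| 22 | 0.273878            | 22.0       | 1988.0     | d     | M   |
| 25 | 0.393614            | 25.0       | 1988.0     | b     | M   |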


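 > **Note:** When combining multiple Boolean statements, always wrap each one in `()` brackets. The `&` and `|` operators bind more tightly than comparisons such as `==` and `>`, so leaving the brackets out changes the meaning of the expression and usually causes an error.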

The above is useful but the notation is a bit clunky: we had to type `example_dataset` three times to do this operation. We can do this a lot more easily with the `query` method.
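If we want to stay with Boolean indexing, one way to make it more readable is to give each condition a name before combining them. The sketch below uses a small stand-in `DataFrame` with made-up values; `small_dataset`, `is_male` and `born_after_1987` are names invented for this example.

```python
import numpy as np
import pandas as pd

# Small stand-in for example_dataset, with made-up values
small_dataset = pd.DataFrame({
    'birth_year': [1986.0, 1988.0, np.nan, 1989.0],
    'sex': ['M', 'M', 'M', 'F']})

# Each condition is a Boolean Series with one entry per row
is_male = small_dataset['sex'] == 'M'
born_after_1987 = small_dataset['birth_year'] > 1987  # NaN compares as False

# Combine the named masks and use them to select rows
print(small_dataset[is_male & born_after_1987])
```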

### Querying

We can perform the same operation as in the above section with much cleaner syntax using the `query` method like so:

In [100]:
example_dataset.query('(sex == "M") & (birth_year > 1987)')

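|    | length of eyelashes | subject_ID | birth_year | group | sex |
|----|---------------------|------------|------------|-------|-----|
| 22 | 0.273878            | 22.0       | 1988.0     | d     | M   |
| 25 | 0.393614            | 25.0       | 1988.0     | b     | M   |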


 > **Warning:** The `query` method may have problems with column names that can't be used as Python identifiers (for example, column names that include a space). This is a common cause of `SyntaxError`s for users new to `pandas`.
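In reasonably recent versions of `pandas`, a column name containing a space can be used inside `query` by wrapping it in backticks. The sketch below shows this on a small stand-in `DataFrame` with made-up values; `small_dataset` and `long_lashes` are illustrative names.

```python
import pandas as pd

# Small stand-in for example_dataset, with made-up values
small_dataset = pd.DataFrame({
    'length of eyelashes': [0.56, 0.20, 0.16],
    'birth_year': [1986, 1988, 1989]})

# Backticks let query refer to a column whose name contains spaces
long_lashes = small_dataset.query(
    '(`length of eyelashes` > 0.18) & (birth_year > 1987)')
print(long_lashes)
```

If backticks are not available in your version, renaming the column to something like `length_of_eyelashes` is a simple alternative.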

### Removing `NaN`s


When selecting data we may wish to first remove `NaN` values (missing data), which could interfere with our logic. We can locate and remove all rows with `NaN` values in a specified column by using the `isna` method like so:


In [102]:
# Remove all subjects for whom we didn't record sex
example_dataset[~example_dataset.sex.isna()]

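|    | length of eyelashes | subject_ID | birth_year | group | sex |
|----|---------------------|------------|------------|-------|-----|
| 0  | 0.556329            | 0.0        | 1986.0     | a     | M   |
| 1  | 0.196549            | 1.0        | 1980.0     | a     | M   |
| 2  | 0.161299            | 2.0        | NaN        | b     | M   |
| 3  | 0.325355            | 3.0        | 1980.0     | c     | F   |
| 5  | 0.69878             | NaN        | 1987.0     | e     | F   |
| 6  | 0.536562            | 6.0        | 1987.0     | a     | F   |
| 7  | 0.011462            | 7.0        | 1986.0     | c     | M   |
| 8  | 0.591013            | 8.0        | 1986.0     | b     | F   |
| 10 | 0.828239            | 10.0       | NaN        | a     | M   |
| 11 | 0.891091            | 11.0       | 1986.0     | b     | M   |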
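Before removing anything, it can help to see how much data is actually missing. The sketch below counts missing values per column and also shows `notna`, which gives the same selection as negating `isna`. It uses a small stand-in `DataFrame` with made-up values.

```python
import numpy as np
import pandas as pd

# Small stand-in for example_dataset, with made-up values
small_dataset = pd.DataFrame({
    'birth_year': [1986.0, np.nan, 1980.0],
    'sex': ['M', 'M', np.nan]})

# isna() gives True/False per cell; summing counts the True values per column
print(small_dataset.isna().sum())

# notna() is the opposite of isna(), so no ~ is needed
print(small_dataset[small_dataset['sex'].notna()])
```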


There is actually a method that lets us do this more neatly: the `dropna` method.

In [103]:
# Remove all subjects with missing sex or birth year information
example_dataset.dropna(subset=['birth_year', 'sex'])

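|    | length of eyelashes | subject_ID | birth_year | group | sex |
|----|---------------------|------------|------------|-------|-----|
| 0  | 0.556329            | 0.0        | 1986.0     | a     | M   |
| 1  | 0.196549            | 1.0        | 1980.0     | a     | M   |
| 3  | 0.325355            | 3.0        | 1980.0     | c     | F   |
| 5  | 0.69878             | NaN        | 1987.0     | e     | F   |
| 6  | 0.536562            | 6.0        | 1987.0     | a     | F   |
| 7  | 0.011462            | 7.0        | 1986.0     | c     | M   |
| 8  | 0.591013            | 8.0        | 1986.0     | b     | F   |
| 11 | 0.891091            | 11.0       | 1986.0     | b     | M   |
| 12 | 0.984716            | 12.0       | 1986.0     | c     | M   |
| 13 | 0.277757            | 13.0       | 1980.0     | a     | F   |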
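By default `dropna` removes a row if any of the listed columns is missing. Its `how` argument can relax this so that a row is removed only when all of the listed columns are missing. The sketch below compares the two on a small stand-in `DataFrame` with made-up values; `any_missing` and `all_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Small stand-in for example_dataset, with made-up values
small_dataset = pd.DataFrame({
    'birth_year': [1986.0, np.nan, np.nan],
    'sex': ['M', 'F', np.nan]})

# Default (how='any'): drop rows missing birth_year or sex
any_missing = small_dataset.dropna(subset=['birth_year', 'sex'])

# how='all': drop rows only when both birth_year and sex are missing
all_missing = small_dataset.dropna(subset=['birth_year', 'sex'], how='all')

print(any_missing)
print(all_missing)
```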


### Summarizing data

### Grouping by

### Sorting data

### Reshaping

### Reading/Writing files