
<a href="http://www.cosmostat.org/" target="_blank"><img align="left" width="300" src="http://www.cosmostat.org/wp-content/uploads/2017/07/CosmoStat-Logo_WhiteBK-e1499155861666.png" alt="CosmoStat Logo"></a>
<br>
<br>
<br>
<br>

# Pandas Intro

---

> Author: <a href="http://www.cosmostat.org/people/santiago-casas" target="_blank" style="text-decoration:none; color: #F08080">Santiago Casas</a>  (based on the Python Data Science Handbook)
> Email: <a href="mailto:santiago.casas@cea.fr" style="text-decoration:none; color: #F08080">santiago.casas@cea.fr</a>  
> Year: 2019  
> Version: 1.0


<a href="https://pandas.pydata.org/" target="_blank"><img align="left" width="500" src="https://files.realpython.com/media/Python-Pandas-10-Tricks--Features-You-May-Not-Know-Watermark.e58bb5ce9835.jpg" alt="Pandas"></a>

---
<br>

## Let's start by importing the necessary libraries


In [None]:
import numpy as np

In [None]:
import pandas as pd

# Numpy with structured arrays

Numpy is quite powerful because it can store structured (heterogeneous) data in a single array, with fields that are accessed by name, much like the keys of a dictionary.

Imagine we have this data about people and their age and weight.

In [None]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

We can create a structured array using a compound data type specification:

In [None]:
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'int32', 'float32')})
print(data.dtype)

We have a data type that contains unicode strings, ints and floats.

We can now fill this structured array using `keys`, as we did for dictionaries.

In [None]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

This allows us to access the array with indices

In [None]:
# Get first row of data
data[0]

In [None]:
#Get name of the last row of data
data[-1]['name']

### Masking with structured arrays

Using the powerful masking we saw before, we can ask questions such as: What are the names of all the people under the age of 30?

In [None]:
# Get names where age is under 30
data[data['age'] < 30]['name']

Despite the power of numpy, once we start dealing with more complicated structures, we need to resort to Pandas, a Python package specialized for databases, structured data and statistical analysis.
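
As a preview of what follows, here is a minimal sketch of the same "under 30" query once the structured array is handed to Pandas. It relies on `pd.DataFrame` accepting a numpy structured array and turning each field into a column; the names `people`, `df` and `young_names` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the structured array from the section above
people = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                            'formats': ('U10', 'int32', 'float32')})
people['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
people['age'] = [25, 45, 37, 19]
people['weight'] = [55.0, 85.5, 68.0, 61.5]

# Each field of the structured array becomes a labelled column
df = pd.DataFrame(people)

# Same mask as before, now applied to the DataFrame
young_names = df[df['age'] < 30]['name'].tolist()
print(young_names)
```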

# Enter Pandas

## Pandas Series

A Pandas `Series` is a one-dimensional array of indexed data. It is the most basic structure. It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

Notice how we now get an index column, next to the array values. This index column is by default just a numbering of the entries.

The values are simply a numpy array:

In [None]:
data.values

The index, in turn, is an array-like object specific to `Pandas`:

In [None]:
data.index

Data can be accessed like in a numpy array

In [None]:
data[1:3]

In [None]:
data[1]

The Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. For example, if we wish, we can use strings as an index:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
# Now we can access the data through an index
data['b']

## Series as specialized dictionaries

Instead of specifying the data and the indices separately, we can immediately create a series from a dictionary.

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

So we can simply ask, What is the population of California?

In [None]:
population['California']

And we can even perform slicing on the `'string'` indices:

In [None]:
population['California':'New York']
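
One detail worth checking in the slice above: when slicing by label, Pandas includes the end label, unlike ordinary Python slicing by position. A small sketch (the name `subset` is illustrative):

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: both end points are part of the result
subset = population['California':'New York']
print(subset.index.tolist())
print(len(subset))
```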

Notice that the indices can be non-contiguous numbers in any order and the values can be strings

In [None]:
series = pd.Series({2:'a', 1:'b', 3:'c', 6:'d'})

In [None]:
series[1]

This can lead to counter-intuitive slicing, because a single integer is looked up among the labels, while an integer slice is taken by position:

In [None]:
series[1:6].values

## Data Frames as generalized numpy arrays and special dictionaries

From several dictionaries one can also construct a `DataFrame` object, basically what we scientists call tables, with a row of headers and a column of indices.

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

In [None]:
# Create a Series object
area = pd.Series(area_dict)

In [None]:
area

Now use population and area as two different columns in a `"database"`, which corresponds to a DataFrame.

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})

In [None]:
states

> **<font color='red'>NOTE:</font>** Notice the nice printing of a DataFrame in the form of a table !

DataFrame also has an index attribute

In [None]:
states.index

The columns are now also stored in a generalized index:

In [None]:
states.columns

A `DataFrame` behaves like a dictionary in which the `keys` are the columns

In [None]:
states['area']

Unlike a `numpy` array, however, it does not return rows when indexed with integers:

In [None]:
# This will raise an error, uncomment to see it
#states[0]

A `DataFrame` can be created from a list of dictionaries and `Pandas` will merge the indices accordingly.

In [None]:
list_of_dicts = [{'a': 1, 'b': 2}, {'b': 3, 'a': 4}]

In [None]:
pd.DataFrame(list_of_dicts)

If some keys are missing from one or more of the dicts, `Pandas` will fill the corresponding entries with `NaN`s.

In [None]:
list_of_dicts = [{'a': 1, 'b': 2, 'c':3}, {'a': 3, 'd': 4, 'b':6}]
pd.DataFrame(list_of_dicts)
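
To see where the gaps ended up, one option is to count the missing entries per column. This is only a sketch of one way to inspect the result; `missing_per_column` is an illustrative name.

```python
import pandas as pd

list_of_dicts = [{'a': 1, 'b': 2, 'c': 3}, {'a': 3, 'd': 4, 'b': 6}]
df = pd.DataFrame(list_of_dicts)

# isna() marks the missing entries; summing counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Columns that contain NaN are stored as floats, the others stay integers
print(df.dtypes)
```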

A useful tool for scientists working with numerical data is to construct a `Pandas` `DataFrame` out of a numpy array, labelling the rows and columns with strings.

In [None]:
xy_data=pd.DataFrame(np.random.rand(3, 2),
             columns=['X', 'Y'],
             index=['a', 'b', 'c'])
xy_data

Just as dictionaries can be "filled up" by providing a new "key-value" pair, a `DataFrame` can be extended by assigning a new column:

In [None]:
xy_data['Z'] = pd.Series([1.,1.,1.], index=xy_data.index)  # We pass the same index of the original DataFrame
xy_data
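
The reason for passing the index explicitly is that Pandas aligns a `Series` with the `DataFrame` by label, not by position. The sketch below uses a hypothetical frame `xy` with fixed values to make this visible.

```python
import numpy as np
import pandas as pd

xy = pd.DataFrame(np.zeros((3, 2)),
                  columns=['X', 'Y'],
                  index=['a', 'b', 'c'])

# Labels given in a different order: values are matched by label
xy['Z'] = pd.Series([1., 2., 3.], index=['c', 'b', 'a'])

# Default integer index (0, 1, 2): no label matches, so the column is all NaN
xy['W'] = pd.Series([1., 2., 3.])

print(xy)
```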

> **Puzzle 5:** Do you think data frames can also be transposed like numpy arrays? 
What is the output of `(xy_data.T).columns[2]` ?

  * Option a): `c`
  * Option b): `X`
  * Option c): `0.303641`

In [None]:
#Answer Puzzle 5:
#Uncomment to see the result
#(xy_data.T).columns[2]

## Manipulating Data Frames

A very useful tool for scientists is to be able to compute derived quantities from given data.
In our `states` DataFrame:

In [None]:
states

We can quickly create a new `density` column by operating on the available data columns:

In [None]:
states['density'] = states['population']/states['area']

In [None]:
states

Notice that the columns are also accessible as attributes of the `DataFrame`, each returning a `Series`.

In [None]:
states.area

In [None]:
states.density

*Masking* also works for `DataFrames`

In [None]:
# Which states have a density larger than 100?
states[states.density > 100]
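
Masks can also be combined with `&` (and) and `|` (or), as long as each condition is wrapped in parentheses. A minimal sketch that rebuilds `states` from the data above; the population threshold is arbitrary and chosen only for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})
states['density'] = states['population'] / states['area']

# Single condition
dense = states[states['density'] > 100]

# Two conditions combined element-wise with &
dense_and_big = states[(states['density'] > 100)
                       & (states['population'] > 19_600_000)]

print(dense.index.tolist())
print(dense_and_big.index.tolist())
```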

As you can easily see, these properties are quite powerful when one has to analyze large datasets with thousands of columns and rows.

## Indexing with integers: loc and iloc

In [None]:
data = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 5, 6])

Notice that when the indices are integers, some confusion can arise.

In [None]:
# explicit index when indexing
data[2]

In [None]:
# This will raise an error: (uncomment to see it and comment back)
#data[3]

> **Puzzle 6:** What do you expect from the following slice: `data[2:4]` ? Which values will be returned?

In [None]:
#Answer Puzzle 6:
#Uncomment to see the answer
#data[2:4]

The perhaps surprising result arises because `Pandas` uses the explicit labels in `index` when indexing with a single integer, but the implicit (standard numpy) positions when slicing.

Due to this confusion, it is better to use the attributes `loc`, `iloc`.

### loc

First, the `loc` attribute allows indexing and slicing that always references the ***explicit*** index:

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

In [None]:
states.loc['California':'Texas']

### iloc

The `iloc` attribute allows indexing and slicing that always references the implicit Python-style index:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

> **<font color='red'>NOTE:</font>** Notice the difference with `loc` above !

In [None]:
states.iloc[0:2]

One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of `loc` and `iloc` make them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention. (from the PDSH)
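
To make the contrast concrete, the sketch below applies the same slice bounds through both attributes; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 5, 6])

# loc: bounds are labels, and the end label is included
by_label = data.loc[2:5].tolist()

# iloc: bounds are positions, and the end position is excluded
by_position = data.iloc[2:5].tolist()

print(by_label)
print(by_position)
```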

### Subtleties of indexing

For `DataFrames`, the situation is a bit more complicated. Indexing the underlying `values` array works as in numpy, so integer indices and slices select rows:

In [None]:
states.values[0:2]

Passing a column name as a key, returns a column:

In [None]:
states['area']

But slicing is always performed on the rows:

In [None]:
states['California':'Texas']

In [None]:
states.iloc[:3, :2]


Equivalently:

In [None]:
states.loc[:'New York',:'area']
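
`loc` also accepts a boolean mask for the rows together with a list of column labels, which combines masking and column selection in one step. A small sketch, rebuilding `states` from the data above:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})
states['density'] = states['population'] / states['area']

# Rows where density > 100, restricted to two of the columns
selection = states.loc[states['density'] > 100, ['population', 'density']]
print(selection)
```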

## Operations between DataFrames and Series

In the same way as broadcasting in `numpy`, where we could have operations between arrays of different dimensions, here we can operate between `DataFrames` and `Series`.

This is quite useful again for treatment of numerical and statistical data.

Create a `DataFrame` of random ints of size (3,4), with given column names.

In [None]:
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

Apply a complicated function on the whole `DataFrame`

In [None]:
np.sin(df * np.pi / 4)

Internally, this has been done using `ufuncs` and `broadcasting`.

For numpy arrays, the operation works `row-wise` by default (notice that the first row of the result is always zero):

In [None]:
A = np.random.randint(10, size=(3, 4))
A - A[0]

Create a `DataFrame` from `A`:

In [None]:
df = pd.DataFrame(A, columns=list('ABCD'))
df

Subtract the `0` row:

In [None]:
df - df.iloc[0]

If you wish to operate column-wise, use the corresponding method (here `subtract`) and specify the axis along which the `Series` should be matched.

In [None]:
df.subtract(df['A'], axis=0)
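
Because the arrays above are random, the effect can be easier to follow on fixed numbers. The sketch below uses a hypothetical 3x4 array to compare the two operations.

```python
import numpy as np
import pandas as pd

A = np.array([[1, 2, 3, 4],
              [5, 7, 9, 11],
              [2, 2, 2, 2]])
df = pd.DataFrame(A, columns=list('ABCD'))

# Row-wise: the first row is subtracted from every row
row_wise = df - df.iloc[0]

# Column-wise: column 'A' is subtracted from every column
col_wise = df.subtract(df['A'], axis=0)

print(row_wise)
print(col_wise)
```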

Pandas has much more functionality than we can cover in this session. A look at the Python Data Science Handbook reveals many more methods, attributes and tricks.

# Importing CSV files

A useful property of Pandas is that it allows us to work with spreadsheets in a natural way.

In [None]:
#Import CSV data of average monthly Temperatures in Turrialba, Costa Rica from 1958 until 2016

In [None]:
cr_temp = pd.read_csv('./materials/CR_Temp.csv')

In [None]:
# Look at the first 5 rows with head()
cr_temp.head()

If we want to avoid having an index column which is just an integer count of the rows, we can specify it at import time.

In [None]:
cr_temp = pd.read_csv('./materials/CR_Temp.csv', index_col=1)
cr_temp.head()
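
If the file is not at hand, the effect of `index_col` can be tried on a small in-memory CSV. The sketch below is only an illustration: the column layout and the temperature values are invented and are not taken from `CR_Temp.csv`. `index_col` accepts either a column position (as above) or a column name (as here).

```python
import io
import pandas as pd

# Invented toy data, NOT the real Turrialba measurements
csv_text = """year,ENE,FEB,MAR
1964,21.0,21.5,22.0
1965,20.5,21.0,22.5
"""

# Use the 'year' column as the row index instead of a 0, 1, 2, ... count
toy = pd.read_csv(io.StringIO(csv_text), index_col='year')

print(toy)
print(toy.loc[1965, 'ENE'])
```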

In [None]:
#Get the columns
cr_temp.columns

In [None]:
#Get the index
cr_temp.index

In [None]:
#Get the temperature of the 'ENE' month (January) in 1965
cr_temp.ENE[1965]

> **Exercise 1:** Compute the yearly average and check if it matches what is given in the last column of the csv file. You can drop the `N` column with the `DataFrame` method `drop(columns=...)`.

> **Exercise 2:** Averaging over all the years, which is the hottest month in Turrialba, CR? *Hints:* np.mean, np.max, `loc` and slicing.

> **Exercise 3:** Using the built-in plotting routine of Pandas, can you notice any effect of climate change over the years? *Hint:* plot is a built-in method of a Pandas `DataFrame`. You can plot any column or any row using the masking and slicing seen above.

---

> **[Answers](./Answers.ipynb) to puzzles and exercises**