# A quick introduction to the Pandas library

Based on https://github.com/mcrovella/CS506-Computational-Tools-for-Data-Science/blob/master/02B-Pandas.ipynb

## Overview

Pandas is the Python Data Analysis Library, used for loading, processing and generally manipulating datasets efficiently.

It can also be used with matplotlib and other plotting libraries to create nice data visualisations.
Internally, it uses arrays provided by the NumPy library for efficient operation.

The most important tool provided by Pandas is the **data frame** (Class `DataFrame`).

A data frame is a **table** in which each row and column is given a label.

Pandas DataFrames are documented at:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html

## Getting started

Import the pandas library as `pd`:

In [None]:
import pandas as pd


## Fetching, storing and retrieving your data

You can fetch data directly from a URL:

In [None]:
df_remote = pd.read_csv(r'https://gist.githubusercontent.com/ariskou/6441e960da38d395b1305e9d388f92d5/raw/62c459a79631efb35ca7000c3fcdf66d2f4a2cc5/height_weight.csv')

Print the first 5 rows of the data

In [None]:
df_remote.head(5)


Print some information about the structure of the data.

Note the type of the columns. One is an integer and the other two are floats.

**Each column has a single data type (dtype), and pandas works efficiently only when all the values in a column share that type.**

In [None]:
df_remote.info()

Add a new column named `WeightCopy` which is just a copy of the `Weight` column.

It will have the same type (`float64`).

In [None]:
df_remote['WeightCopy'] = df_remote['Weight']
df_remote.info()

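Change **one** of the values in this new column to a string and show the DataFrame again. (Recent pandas versions warn about this kind of assignment, or may reject it, because it changes the column's type.)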

In [None]:
df_remote.loc[3, 'WeightCopy'] = "I'm a string"
df_remote.head(5)

Now show the DataFrame structure information again, and note that the column type has changed to `object`. An `object` column can hold any Python value, which makes it much less efficient than a numeric column.

In [None]:
df_remote.info()
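
If a column has ended up as `object` because of a few stray strings, one possible way to recover a numeric type is `pd.to_numeric`. The sketch below uses a small made-up Series (not the tutorial data) and assumes you are happy to replace unparseable values with `NaN`.

```python
import pandas as pd

# Made-up values for illustration: one stray string forces dtype `object`.
mixed = pd.Series([112.99, 136.49, 153.03, "I'm a string", 144.30])
print(mixed.dtype)

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
cleaned = pd.to_numeric(mixed, errors="coerce")
print(cleaned.dtype)
print(cleaned.isna().sum())  # number of values that could not be converted
```

Whether silently turning bad values into `NaN` is acceptable depends on your data; inspecting the offending rows first is usually a good idea.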

### Reading data from a local .csv file

You can download the same file https://moodle.imt-atlantique.fr/pluginfile.php/75403/mod_folder/content/0/height_weight.csv?forcedownload=1 from Moodle and put it in the same folder as this current ipynb file.

After this you can read the file locally, without needing access to a network.

In [None]:
df = pd.read_csv(r'height_weight.csv')

In [None]:
df.head(5)
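
If you have neither network access nor the file at hand, you can still experiment with `read_csv` by parsing CSV text held in memory. This is only a sketch: the rows are made up, and the column names (`Index`, `Height`, `Weight`) are assumed to match those of the tutorial file.

```python
import io

import pandas as pd

# Made-up sample rows; column names are an assumption about the real file.
csv_text = """Index,Height,Weight
1,65.78,112.99
2,71.52,136.49
3,69.40,153.03
"""

# read_csv accepts any file-like object, not only paths and URLs.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.dtypes)
```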

## Simple examples with pandas

Get the number of rows in the DataFrame:

In [None]:
len(df)

Get the shape of the DataFrame (rows, columns):

In [None]:
df.shape

## Working with data columns

The columns or "features" in your data

In [None]:
df.columns

Selecting a single column from your data

In [None]:
df['Height']

Another way of selecting a single column from your data

In [None]:
df.Height

Select two columns

In [None]:
df[['Height','Weight']].head()
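
The two selection styles return different kinds of objects, which matters when you chain further operations. A minimal sketch on a made-up table (the name `toy` and its values are purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"Height": [65.8, 71.5, 69.4], "Weight": [113.0, 136.5, 153.0]})

one_column = toy["Height"]       # a single label gives a Series
one_column_df = toy[["Height"]]  # a list of labels gives a DataFrame, even with one label

print(type(one_column).__name__)
print(type(one_column_df).__name__)
print(one_column_df.shape)
```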

Get the first 10 values of the `Weight` column

In [None]:
df.Weight.head(10)

Get the last 10 values of the `Weight` column

In [None]:
df.Weight.tail(10)

Changing the column names:

In [None]:
new_column_names = [x.lower().replace('ght','GHT') for x in df.columns]
df.columns = new_column_names
df.info()

Make all the names lowercase

In [None]:
new_column_names = [x.lower() for x in df.columns]
df.columns = new_column_names
df.info()
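
Assigning to `df.columns` replaces every name at once. When you only want to change some of them, `rename` is an alternative. The sketch below uses a made-up table, and the new names `height_in` and `weight_lb` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Index": [1, 2], "Height": [65.8, 71.5], "Weight": [113.0, 136.5]})

# rename returns a new DataFrame; columns missing from the mapping keep their name.
renamed = toy.rename(columns={"Height": "height_in", "Weight": "weight_lb"})
print(list(renamed.columns))

# A function can be passed instead of a mapping; it is applied to every name.
lowered = toy.rename(columns=str.lower)
print(list(lowered.columns))
```

Because `rename` returns a new DataFrame by default, `toy` itself keeps its original column names here.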

Drop a column (delete it)

In [None]:
df.drop('index', axis='columns', inplace=True)
df.head()

## Data Frame methods

A DataFrame object has many useful methods.

Get the averages for all the columns

In [None]:
df.mean()

Get the standard deviations for all the columns

In [None]:
df.std()

Get the medians for all the columns

In [None]:
df.median()

Get the average for just one column

In [None]:
df.weight.mean()
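
The calls above work because every column in `df` is numeric at this point. As a sketch of what you might do when a table also contains text, the made-up example below restricts `mean` to numeric columns and uses `describe` to get several statistics at once.

```python
import pandas as pd

# Made-up table with one non-numeric column.
toy = pd.DataFrame({
    "height": [65.0, 70.0, 72.0],
    "weight": [110.0, 150.0, 160.0],
    "name": ["a", "b", "c"],
})

# numeric_only=True skips the text column instead of failing on it.
means = toy.mean(numeric_only=True)
print(means)

# describe() summarises the numeric columns: count, mean, std, min, quartiles, max.
summary = toy.describe()
print(summary)
```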

The **values** property of the column returns the column's values as a NumPy array.

In [None]:
first_weight = df.weight.values[0]
first_weight

You can apply a function to all the values in a column and store the result in another column (or in the same one).

In [None]:
df['height_in_cm'] = df.height.apply(lambda h: h*2.54)
df.head()

In [None]:
df['weight_in_kg'] = df.weight.apply(lambda w: w*0.453592)
df.head()
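
For simple arithmetic such as these unit conversions, `apply` with a lambda is not the only option: operating on the whole column at once gives the same values and is usually faster. A minimal sketch on made-up heights:

```python
import pandas as pd

toy = pd.DataFrame({"height": [65.0, 70.0, 72.0]})  # made-up heights in inches

via_apply = toy.height.apply(lambda h: h * 2.54)  # one Python call per value
via_arithmetic = toy["height"] * 2.54             # one vectorised operation

# Both approaches produce the same values.
print((via_apply == via_arithmetic).all())
```

`apply` remains useful when the per-value logic cannot be expressed with column arithmetic.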

Each row in a DataFrame is associated with an index label, which identifies the row. With the default index, the labels are the integers 0, 1, 2, ... and are unique.

In [None]:
df.index

### Accessing rows of the DataFrame

So far we've seen how to access a column of the DataFrame.  To access a row we use a different notation.

To access a row by its index label, use the **`.loc[]`** indexer (note the square brackets).

In [None]:
df.loc[5]

To access a row by its position (i.e., like an array index), use **`.iloc[]`** ('integer location').

In [None]:
df.iloc[0,:]
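
With the default index, labels and positions coincide, so the difference between label-based and position-based access is easy to miss. The sketch below uses a made-up table with a custom index to make it visible.

```python
import pandas as pd

# Made-up data; the index labels 10, 20, 30 are chosen to differ from positions 0, 1, 2.
toy = pd.DataFrame({"height": [65.0, 70.0, 72.0]}, index=[10, 20, 30])

by_label = toy.loc[10]     # the row whose index label is 10
by_position = toy.iloc[0]  # the first row, whatever its label

print(by_label.equals(by_position))
print(toy.iloc[-1]["height"])  # negative positions count from the end

# toy.loc[0] would raise a KeyError here, because no row is labelled 0.
```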

To iterate over the rows, use **`.iterrows()`**

In [None]:
average_height = df.height_in_cm.mean()  # compute the mean once, outside the loop
num_taller_than_average_people = 0
for idx, row in df.iterrows():
    if row.height_in_cm > average_height:
        num_taller_than_average_people += 1

print(f"The total number of people taller than the average is {num_taller_than_average_people}")
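
Iterating row by row is easy to read but slow on large tables. The same count can be obtained without an explicit loop by comparing the whole column to its mean and summing the resulting Boolean values. A sketch on made-up heights:

```python
import pandas as pd

toy = pd.DataFrame({"height_in_cm": [160.0, 170.0, 180.0, 190.0]})  # made-up values

# The comparison yields one True/False per row; True counts as 1 when summed.
taller = toy["height_in_cm"] > toy["height_in_cm"].mean()
count = int(taller.sum())
print(f"The total number of people taller than the average is {count}")
```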

## Filtering

It is very easy to select interesting rows from the data.  

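A comparison on a column returns a Boolean Series with one `True`/`False` value per row. Indexing the DataFrame with that Series returns a new DataFrame, which can be treated the same way as all the DataFrames we have seen so far.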

In [None]:
tmp_high = df.height_in_cm >= 185
tmp_high.head()

Summing a Boolean array is the same as counting the number of **`True`** values.

In [None]:
tmp_high.sum()

Now, let's select only the rows of **`df`** that correspond to **`tmp_high`**

In [None]:
df[tmp_high]

Putting it all together, we have the following commonly-used patterns:

In [None]:
very_tall_people = df[df.height_in_cm >= 185]
very_tall_people

In [None]:
import numpy as np
very_close_to_average_people = df[np.abs(df.height_in_cm - df.height_in_cm.mean()) < 0.5]
very_close_to_average_people
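
Several conditions can be combined in a single filter with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below uses made-up data and thresholds chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "height_in_cm": [160.0, 186.0, 190.0, 175.0],
    "weight_in_kg": [55.0, 80.0, 95.0, 70.0],
})

tall_and_light = toy[(toy.height_in_cm >= 185) & (toy.weight_in_kg < 90)]
short_or_heavy = toy[(toy.height_in_cm < 165) | (toy.weight_in_kg >= 90)]
not_tall = toy[~(toy.height_in_cm >= 185)]

print(tall_and_light.index.tolist())  # [1]
print(short_or_heavy.index.tolist())  # [0, 2]
print(not_tall.index.tolist())        # [0, 3]
```

Python's `and`, `or` and `not` keywords do not work element-wise on Series, which is why the symbolic operators are used here.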

## Creating new columns

To create a new column, simply assign values to it.  Think of the columns as a dictionary. Calculate the BMI, which is the weight in kg divided by the square of the height in m.

In [None]:
df['bmi'] = (df.weight_in_kg / (df.height_in_cm/100)**2)
df.head()

You can also create new categorical columns like this, based on the values of other columns:

In [None]:
for idx, row in df.iterrows():
    if row.bmi < 18.5:
        df.loc[idx,'category']='under'
    elif row.bmi >= 18.5 and row.bmi < 25.0:
        df.loc[idx,'category']='average'
    else:
        df.loc[idx,'category']='over'
df.head()

Here is another, more "functional", way to accomplish the same thing.

Define a function that classifies rows, and **`apply`** it to each row.

In [None]:
def namerow(row):
    if row.bmi < 18.5:
        return 'under'
    elif row.bmi >= 18.5 and row.bmi < 25.0:
        return 'average'
    else:
        return 'over'

df['test_category'] = df.apply(namerow, axis = 1)


In [None]:
df.head()

OK, delete that extraneous `test_category`:

In [None]:
df.drop('test_category', axis = 1, inplace=True)
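
A third option for this kind of binning is `pd.cut`, which assigns each value to an interval. The sketch below reuses the same thresholds on made-up BMI values; `right=False` makes each interval include its lower bound and exclude its upper bound, matching the conditions used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"bmi": [17.0, 18.5, 22.0, 25.0, 31.0]})  # made-up values

# Intervals: [-inf, 18.5) -> under, [18.5, 25.0) -> average, [25.0, inf) -> over
toy["category"] = pd.cut(
    toy["bmi"],
    bins=[-np.inf, 18.5, 25.0, np.inf],
    labels=["under", "average", "over"],
    right=False,
)
print(toy)
```

Note that `pd.cut` produces a column of categorical dtype rather than plain strings.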

## Grouping

An **extremely** powerful DataFrame method is **`groupby()`**. 

This is entirely analogous to **`GROUP BY`** in SQL.

It will group the rows of a DataFrame by the values in one (or more) columns, and let you iterate through each group.

Here we will group by the BMI categories we defined above and stored in the `category` column.

In [None]:
category_groups = df.groupby('category')

Essentially, **`category_groups`** behaves like a dictionary
* whose keys are the unique values found in the `category` column, and 
* whose values are DataFrames that contain only the rows having the corresponding unique values.

In [None]:
for category, category_data in category_groups:
    print(category)
    print(category_data.head())
    print('=============================')

In [None]:
for category, category_data in df.groupby("category"):
    print('The average weight value for the {} group is {} kg'.format(category,
                                                           category_data.weight_in_kg.mean()))
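
Looping over the groups is not required when all you need is a statistic per group: an aggregation can be applied directly to the grouped object. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["under", "average", "average", "over"],
    "weight_in_kg": [50.0, 60.0, 70.0, 95.0],
})

# One mean per category, returned as a Series indexed by category.
mean_weight = toy.groupby("category")["weight_in_kg"].mean()
print(mean_weight)

# Several statistics at once, returned as a DataFrame.
stats = toy.groupby("category")["weight_in_kg"].agg(["mean", "count"])
print(stats)
```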

# Your turn now