# Lightspeed introduction to `pandas`

Pandas is a library that provides tools for loading, manipulating and summarizing tabular data.

Its central object is the `DataFrame`.

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({ 
    'A' : 1.,
    'B' : pd.Timestamp('20130102'),
    # index=[0, 2] leaves NaN in rows 1 and 3 of this column
    'C' : pd.Series(1, index=[0, 2], dtype='float32'),
    'D' : pd.Series([1, 2, 3, 4], dtype='int32'),
    'E' : pd.Categorical(["test", "train", "test", "train"]),
    'F' : 'foo',
    'G' : np.random.randn(4)
})
df

## Basic things

In [None]:
# Columns can be accessed as attributes (when the name is a valid Python
# identifier and does not clash with a DataFrame attribute or method)
df.B

In [None]:
# Or like keys in a dict
df['B']

In [None]:
type(df.B)

In [None]:
# To select a list of columns
df[['A', 'C']]
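
Attribute access does not always return a column. The sketch below uses a small made-up frame (the name `toy` and its values are only for illustration) with a column called `size`, as in the `tips` dataset used later: `toy.size` resolves to the `DataFrame.size` property, while bracket access returns the column.

```python
import pandas as pd

# Made-up frame: one column name clashes with a DataFrame attribute,
# the other is not a valid Python identifier
toy = pd.DataFrame({"size": [2, 3, 4], "total bill": [10.0, 20.0, 30.0]})

# Bracket access always returns the column
size_column = toy["size"]
bill_column = toy["total bill"]

# Attribute access resolves to the DataFrame.size property
# (number of cells in the frame), not to the 'size' column
n_cells = toy.size

print(type(size_column).__name__, n_cells)
```

Bracket access is therefore the safer default when column names are not known in advance.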

In [None]:
df.dtypes

In [None]:
df.info()

## Groupby and aggregations

Basic syntax: group the rows by the categories of a column, then apply an aggregation to each group.

In [None]:
df

In [None]:
# Compute the sum of D for each category in E
# (select D before aggregating so that only this column is summed)
df.groupby('E')['D'].sum().reset_index()
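
Several aggregations can be computed in one pass with named aggregation in `agg`. This is a minimal sketch on a made-up frame that reuses the values of columns `D` and `E` above; the output names `D_sum` and `D_mean` are chosen here for illustration.

```python
import pandas as pd

# Same D and E values as in the DataFrame above, rebuilt to stay self-contained
toy = pd.DataFrame({
    "E": ["test", "train", "test", "train"],
    "D": [1, 2, 3, 4],
})

# One output column per (input column, aggregation) pair
summary = (
    toy.groupby("E")
       .agg(D_sum=("D", "sum"), D_mean=("D", "mean"))
       .reset_index()
)
print(summary)
```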

# Lightspeed introduction to `seaborn`

It's a graphics library built on top of `matplotlib` which
- works pretty neatly with `pandas` `DataFrame`s
- provides simpler ways to make nice visualizations of datasets

Let's illustrate this using the toy `tips` dataset that comes with `seaborn`.

In [None]:
import seaborn as sns

# Load one of the data sets that comes with seaborn
tips = sns.load_dataset("tips")

# First 10 rows of the dataframe
tips.head(n=10)

In [None]:
tips.describe(include='all')

In [None]:
tips['day'].unique()

In [None]:
tips.dtypes

In [None]:
tips.info()

In [None]:
# x and y must be passed as keyword arguments in recent seaborn versions
sns.jointplot(x="total_bill", y="tip", data=tips)

## Exercise 1

Compute the average tip percentage at Dinner versus Lunch for each day of the week.

In [None]:
tips.head()

### Answer

In [None]:
tips['tip_percentage'] = 100 * tips['tip'] / tips['total_bill']
# Select the column before aggregating: a mean is not defined for the categorical columns
tips.groupby(['time', 'day'])[['tip_percentage']].mean()
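
To read Dinner and Lunch side by side, the grouped result can be reshaped with `unstack`. The sketch below uses made-up bills (not the `tips` data) so that it runs without downloading anything; only the reshaping step is the point.

```python
import pandas as pd

# Made-up bills, only to illustrate the reshaping step
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner"],
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [1.0, 3.0, 4.0, 10.0],
})
toy["tip_percentage"] = 100 * toy["tip"] / toy["total_bill"]

# One row per day, one column per meal time
by_day = toy.groupby(["day", "time"])["tip_percentage"].mean().unstack("time")
print(by_day)
```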

## Exercise 2

Convert `size` to a categorical variable.

### Answer

In [None]:
# We want to deal with size as a categorical variable
tips['size'] = tips['size'].astype('category')
tips.head(5)
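
A categorical column stores the distinct values once and an integer code per row. A small sketch on a made-up series (the name `party_size` is hypothetical) shows what the conversion produces:

```python
import pandas as pd

# Made-up party sizes, standing in for the 'size' column
party_size = pd.Series([2, 3, 2, 4, 3], name="size")
as_category = party_size.astype("category")

categories = list(as_category.cat.categories)  # distinct values, sorted
codes = list(as_category.cat.codes)            # position of each row's value in categories

print(as_category.dtype, categories, codes)
```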

## Exercise 3

One-hot encode (or "create dummies" or "binarize") the categorical variables. This can be easily achieved with the `pandas.get_dummies` function.

### Answer

In [None]:
data = pd.get_dummies(tips, prefix_sep='#')
data.head(5)

Only the categorical columns have been "binarized". For instance, the `'day'` column is replaced by 4 columns named `'day#Thur'`, `'day#Fri'`, `'day#Sat'`, `'day#Sun'`, since `'day'` has 4 categories (see the next cell).

In [None]:
tips['day'].unique()
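
The naming scheme can be checked on a tiny made-up frame (the values are for illustration only): numeric columns are left untouched and each category becomes a `<column>#<category>` dummy.

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "tip": [1.0, 2.0, 3.0],
    "day": pd.Categorical(["Sat", "Sun", "Sat"]),
})

dummies = pd.get_dummies(toy, prefix_sep="#")
print(list(dummies.columns))
```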

## Remark

For every row, the dummies of each categorical feature (`sex`, `smoker`, `day`, `time` and `size`) sum to one.

- This leads to collinearities, hence bad conditioning of the features matrix
- This can be checked through an SVD (but don't compute the SVD of a large matrix!)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Recent pandas versions return boolean dummies, so cast to float explicitly
s = np.linalg.svd(data.to_numpy(dtype=float), compute_uv=False)
plt.yscale('log')
plt.title('Spectrum of the features matrix', fontsize=16)
_ = plt.stem(s)
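
The collinearity from the remark above can also be seen on a tiny made-up example with two categorical features (values chosen for illustration): both groups of dummies sum to the same all-ones vector, so the matrix loses one rank.

```python
import numpy as np
import pandas as pd

# Made-up frame with two categorical features
toy = pd.DataFrame({
    "sex": pd.Categorical(["F", "M", "F", "M", "F"]),
    "time": pd.Categorical(["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"]),
})
dummies = pd.get_dummies(toy, prefix_sep="#")

# Within each feature, the dummies of a row sum to one
sex_sum = dummies.filter(like="sex#").sum(axis=1)
time_sum = dummies.filter(like="time#").sum(axis=1)

# Hence (sex columns) - (time columns) = 0: the columns are linearly dependent
X = dummies.to_numpy(dtype=float)
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)
```

Here the matrix has 4 columns but rank 3, which is the small-scale version of the near-zero singular values in the spectrum plot.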

In [None]:
data = pd.get_dummies(tips, prefix_sep='#', drop_first=True)
data.head()

Now, if a categorical feature has $K$ categories, we use only $K-1$ dummies: the dropped category is encoded by all the remaining dummies being zero.

In [None]:
data.head()
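
A minimal sketch of the effect of `drop_first=True`, on a made-up single-column frame whose category order is fixed explicitly (an assumption made here so that `Thur` is the first, dropped category):

```python
import pandas as pd

# Made-up frame; the category order is set explicitly for the illustration
days = ["Thur", "Fri", "Sat", "Sun"]
toy = pd.DataFrame({"day": pd.Categorical(days, categories=days)})

full = pd.get_dummies(toy, prefix_sep="#")
reduced = pd.get_dummies(toy, prefix_sep="#", drop_first=True)

# K = 4 categories give 4 dummies, or K - 1 = 3 with drop_first=True
print(list(full.columns))
print(list(reduced.columns))

# The dropped category ('Thur', first row) is encoded as all zeros
thur_row_sum = int(reduced.iloc[0].astype(int).sum())
```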

## Exercise 4

Normalize the continuous features (here, with min-max scaling).

### Answer

In [None]:
def normalize_min_max(columns, data):
    """Min-max scale columns in data

    Parameters
    ----------
    columns : `List[str]`
        A list of columns to min-max scale

    data : `pandas.DataFrame`
        A dataframe containing the given columns

    Returns
    -------
    output : `None`
        data is modified in place and is not returned by the function
    """
    min_max = data[columns].agg(['min', 'max'])
    for col in columns:
        data[col] -= min_max.loc['min', col]
        data[col] /= (min_max.loc['max', col] - min_max.loc['min', col])    

In [None]:
normalize_min_max(['total_bill', 'tip'], data)

In [None]:
data[['total_bill', 'tip']].describe()
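
Min-max scaling maps each value $x$ of a column to $(x - \min) / (\max - \min)$, so the scaled column lies in $[0, 1]$. As an alternative to the in-place function above, here is a sketch of a variant that returns a scaled copy instead; the name `min_max_scaled` and the toy values are hypothetical, and a constant column (where $\max = \min$) is not handled.

```python
import pandas as pd


def min_max_scaled(data, columns):
    """Return a copy of data with the given columns min-max scaled."""
    scaled = data.copy()
    col_min = data[columns].min()
    col_range = data[columns].max() - col_min
    # Column-wise broadcasting: each column uses its own min and range
    scaled[columns] = (data[columns] - col_min) / col_range
    return scaled


# Made-up values, only to check the behavior
toy = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0], "tip": [1.0, 2.0, 5.0]})
scaled = min_max_scaled(toy, ["total_bill", "tip"])
print(scaled)
```

Returning a copy keeps the original values available, which makes re-running a notebook cell harmless; the in-place version changes `data` each time it is called.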