In [1]:
from IPython.display import Markdown, display

display(Markdown("header.md"))

<div>
    <img src="images/emlyon.png" style="height:60px; float:left; padding-right:10px; margin-top:5px" />
    <span>
        <h1 style="padding-bottom:5px;"> AI Booster Week 02 - Python for Data Science </h1>
        <a href="https://masters.em-lyon.com/fr/msc-in-data-science-artificial-intelligence-strategy">[Emlyon]</a> MSc in Data Science & Artificial Intelligence Strategy (DSAIS) <br/>
         Paris | © Antoine SCHERRER
    </span>
</div>

Please make sure you have a working installation of Jupyter Notebook / Jupyter Lab, with Python 3.6+ up and running.

## Naming conventions

Since we will implement functions that are already available in Python's standard library or in other libraries, you must *prefix* every function name with `msds_`.

For instance, the function implementing the `mean` function should be named `msds_mean`.

For every function you write, **you will need to write a test function** that should be named `test_[function_name]`.

For instance, the test function for `msds_mean` will be: `test_msds_mean`.

**Don't forget to document all your functions with a Python docstring.**

For instance:
```
def msds_my_awesome_function():
    """
    This function computes an awesome function
    """
    # Awesome code
    ...
```

All function names should be in snake case (no camel case!).

When creating classes, follow these rules:
 - class names should be in camel case
 - method names should be in snake case
 - attribute names should be in snake case
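
As a minimal sketch of these naming rules, here is a small class. The class `RunningMean` and its members are hypothetical names chosen for illustration; they are not part of the exercises.

```python
class RunningMean:
    """Accumulates values and exposes their running mean (illustrative only)."""

    def __init__(self):
        # Attribute names: snake case
        self.value_count = 0
        self.value_sum = 0.0

    def add_value(self, value):
        """Adds one value to the accumulator (method name: snake case)."""
        self.value_count += 1
        self.value_sum += value

    def current_mean(self):
        """Returns the mean of the values seen so far, or None if empty."""
        if self.value_count == 0:
            return None
        return self.value_sum / self.value_count


running_mean = RunningMean()
running_mean.add_value(1)
running_mean.add_value(3)
print(running_mean.current_mean())
```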

## Exercise difficulty

Every exercise will be prefixed with an indication of its difficulty:
 - [easy]: easy exercise, should be pretty straightforward for you
 - [moderate]: intermediate level exercise, you all should manage to solve them
 - [advanced]: for advanced students who want to go deeper/further

**Advanced exercises are not mandatory.**

## Required libraries

These are the libraries we will use (to check our computations, for instance). The third-party ones need to be installed in your virtual environment; `statistics` and `unittest` are part of Python's standard library and need no installation:

 - `pandas`: data manipulation library
 - `scipy`: scientific computing library
 - `numpy`: vector/matrix computations
 - `statistics`: statistics library (standard library)
 - `matplotlib`: plotting library
 - `seaborn`: alternative plotting library (based on matplotlib)
 - `jupyter_black`: Jupyter plugin that runs the `black` code formatter
 - `unittest`: testing library (standard library)




## Session 01 - Introduction, central tendencies and variations

In [2]:
# Initial imports
import unittest
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sb

# Ignore warnings from seaborn
import warnings

warnings.filterwarnings("ignore")

import jupyter_black

jupyter_black.load()

### Central tendency

#### [easy] write a function that computes the mean of an iterable given as parameter
This should be pretty straightforward. Since the formulas for the population mean and the sample mean are the same, you only need to define one function.

In [3]:
def msds_sum(data):
    """
    Computes the sum of every element of the iterable
    """
    result = 0
    for value in data:
        result += value
    return result


def msds_mean(data):
    """
    Computes the average of every element of the iterable
    """
    if len(data) > 0:
        return msds_sum(data) / len(data)


def test_msds_mean():
    assert msds_mean([]) is None
    assert msds_mean([1]) == 1
    assert msds_mean([1, 2, 3]) == 2
    assert msds_mean([1, 2, 2, 3, 4]) == 2.4
    tc = unittest.TestCase()
    with tc.assertRaises(Exception):
        msds_mean("sdfds")


test_msds_mean()
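
Note that `msds_mean` calls `len(data)`, so it handles lists and tuples but raises a `TypeError` on a generator, which has no length. As a sketch of one possible alternative, the variant below counts the elements while summing them. The name `msds_mean_any_iterable` is hypothetical and is not required by the exercise.

```python
def msds_mean_any_iterable(data):
    """
    Computes the mean of any iterable, including generators
    (hypothetical variant, shown for illustration)
    """
    total = 0
    count = 0
    for value in data:
        total += value
        count += 1  # count elements ourselves instead of calling len()
    if count > 0:
        return total / count


# A generator expression has no len(), but can still be averaged
print(msds_mean_any_iterable(x * x for x in [1, 2, 3]))
```

As with `msds_mean`, an empty input returns `None` rather than raising an error.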

#### [moderate] write a function that computes the median of an iterable given as parameter.
Be careful to refer to the definition of the median!

In [4]:
def msds_median(data):
    """
    Computes the median value of an iterable
    """
    data = sorted(data)
    n = len(data)
    if n > 0:
        if n % 2 == 0:
            # n is even: average the two middle values
            return (data[(n - 1) // 2] + data[n // 2]) / 2.0
        else:
            # n is odd: return the middle value
            return data[(n - 1) // 2]


def test_msds_median():
    assert msds_median([]) is None
    assert msds_median([1]) == 1
    assert msds_median([1, 2, 2, 3, 4]) == 2
    assert msds_median([1, 2, 2, 3, 3, 4]) == 2.5
    assert msds_median("sdfds") == "f"


test_msds_median()
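
To make the definition concrete, the two cases (odd and even number of data points) can be checked against the standard library's `statistics.median`. The two small datasets are made up for illustration.

```python
import statistics

odd_data = [4, 1, 3, 2, 2]  # sorted: [1, 2, 2, 3, 4]
even_data = [4, 1, 3, 2, 2, 3]  # sorted: [1, 2, 2, 3, 3, 4]

# Odd number of points: the median is the middle value of the sorted data
print(statistics.median(odd_data))  # 2

# Even number of points: the median is the average of the two middle values
print(statistics.median(even_data))  # 2.5
```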

#### [moderate] write a function that computes the mode of an iterable given as parameter.
Be careful to refer to the definition of the mode!

In [5]:
def msds_mode(data):
    """
    Computes the mode value of an iterable (most frequent value)
    """
    freq = {}
    for value in data:
        freq[value] = freq.get(value, 0) + 1
    if len(freq) > 0:
        return max(freq, key=freq.get)


def test_msds_mode():
    assert msds_mode([]) is None
    assert msds_mode([1]) == 1
    assert msds_mode([1, 2, 2, 3, 4]) == 2
    assert msds_mode([1, 2, 2, 3, 3, 4]) == 2
    assert msds_mode("sdfdss") == "s"


test_msds_mode()
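
A dataset can have several modes when values tie for the highest frequency. `msds_mode` above returns the first one encountered. As a sketch of one other possible convention, the hypothetical helper `msds_all_modes` below returns every tied value, in order of first appearance.

```python
from collections import Counter


def msds_all_modes(data):
    """
    Returns the list of all most frequent values
    (hypothetical helper, shown for illustration)
    """
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    # Counter keeps insertion order, so modes come out in order of first appearance
    return [value for value, count in counts.items() if count == highest]


print(msds_all_modes([1, 2, 2, 3, 3, 4]))  # [2, 3]
```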

### Dispersion and variation

#### [easy] write a function that computes the range of an iterable given as parameter.

You can use Python's built-in `max` and `min` functions!<br>
Don't hesitate to write your own min/max functions if you like!

In [6]:
def msds_range(data):
    """
    Computes the range value of an iterable
    """
    if len(data) > 0:
        return max(data) - min(data)


def test_msds_range():
    assert msds_range([]) is None
    assert msds_range([1]) == 0
    assert msds_range([1, 2, -10, 10, 3]) == 20
    tc = unittest.TestCase()
    with tc.assertRaises(Exception):
        msds_range("sdfds")


test_msds_range()

#### [easy] write a function that computes the sample variance of an iterable

Remember, the sample variance is defined as:
$\Large \sum_{i=1}^{n}\frac{(X_i-\bar{X})^2}{n-1}$

In [7]:
def msds_sample_variance(data):
    """
    Computes sample variance of an iterable
    """
    n = len(data)
    if n > 1:
        m = msds_mean(data)
        return msds_sum([(d - m) ** 2 for d in data]) / (n - 1)


def test_msds_sample_variance():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    import statistics

    assert msds_sample_variance([]) is None
    assert msds_sample_variance([1]) is None
    assert msds_sample_variance([1, 2, 3]) == 1
    assert (
        abs(msds_sample_variance(test_data1) - statistics.variance(test_data1)) < 1e-10
    )
    assert (
        abs(msds_sample_variance(test_data2) - statistics.variance(test_data2)) < 1e-10
    )


test_msds_sample_variance()

#### [easy] write a function that computes the population variance of an iterable

The formula is very close to the sample variance, but the denominator is slightly different:

$\Large \sum_{i=1}^{n}\frac{(X_i-\bar{X})^2}{n}$

In [8]:
def msds_variance(data):
    """
    Computes population variance of an iterable
    """
    if len(data) > 0:
        m = msds_mean(data)
        return msds_sum([(d - m) ** 2 for d in data]) / len(data)


def test_msds_variance():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    import statistics

    assert msds_variance([]) is None
    assert abs(msds_variance([1, 2, 3]) - 0.666666666) < 1e-5
    assert abs(msds_variance(test_data1) - statistics.pvariance(test_data1)) < 1e-10
    assert abs(msds_variance(test_data2) - statistics.pvariance(test_data2)) < 1e-10


test_msds_variance()
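
Because the two formulas differ only in their denominator, the sample variance equals the population variance multiplied by $n/(n-1)$. The sketch below checks this relationship with the standard library's `statistics.variance` and `statistics.pvariance`, on one of the test datasets used above.

```python
import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
n = len(data)

sample_var = statistics.variance(data)  # denominator n - 1
population_var = statistics.pvariance(data)  # denominator n

# Same sum of squared deviations, different denominators
assert abs(sample_var - population_var * n / (n - 1)) < 1e-9

# The sample variance is always the larger of the two (for non-constant data)
print(sample_var > population_var)  # True
```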

#### [easy] write a function that computes the sample standard deviation of an iterable

Standard deviation is the square root of the variance.

In [9]:
def msds_sample_std(data):
    """
    Computes sample standard deviation of an iterable
    """
    variance = msds_sample_variance(data)
    if variance is not None:
        return math.sqrt(variance)


def test_msds_sample_std():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    import statistics

    assert msds_sample_std([]) is None
    assert msds_sample_std([1]) is None
    assert msds_sample_std([1, 2, 3]) == 1
    assert abs(msds_sample_std(test_data1) - statistics.stdev(test_data1)) < 1e-5
    assert abs(msds_sample_std(test_data2) - statistics.stdev(test_data2)) < 1e-5


test_msds_sample_std()

#### [easy] write a function that computes the population standard deviation of an iterable

Same as the previous one, but on the population variance!

In [10]:
def msds_std(data):
    """
    Computes population standard deviation of an iterable
    """
    variance = msds_variance(data)
    if variance is not None:
        return math.sqrt(variance)


def test_msds_std():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    import statistics

    assert msds_std([]) is None
    assert abs(msds_std([1, 2, 3]) - 0.816496580927726) < 1e-5
    assert abs(msds_std(test_data1) - statistics.pstdev(test_data1)) < 1e-5
    assert abs(msds_std(test_data2) - statistics.pstdev(test_data2)) < 1e-5


test_msds_std()

#### [easy] write a function that computes the coefficient of variation of an iterable

The coefficient of variation is the ratio of the sample standard deviation to the sample mean:

$\Large \frac{s}{\bar{X}}$

In [11]:
def msds_coef_var(data):
    """
    Compute coefficient of variation of an iterable
    """
    if len(data) > 1:
        return msds_sample_std(data) / msds_mean(data)


def test_msds_coef_var():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    import statistics

    assert msds_coef_var([]) is None
    assert msds_coef_var([1]) is None
    assert msds_coef_var([1, 2, 3]) == 0.5


test_msds_coef_var()
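
One useful property of the coefficient of variation is that it has no unit: rescaling the data (for instance, converting meters to centimeters) leaves it unchanged. The sketch below illustrates this with the standard library. The helper name `coef_var` and the data are assumptions made for the example.

```python
import statistics


def coef_var(data):
    """Sample standard deviation divided by the sample mean (illustrative helper)."""
    return statistics.stdev(data) / statistics.mean(data)


heights_m = [1.0, 2.0, 3.0]
heights_cm = [100 * h for h in heights_m]

# Standard deviation and mean are both multiplied by 100, so the ratio is unchanged
print(coef_var(heights_m))
print(coef_var(heights_cm))
```

Note that the ratio is undefined when the mean is zero, a case the sketch does not handle.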

#### [moderate] write a function that computes a quartile

The function should take two parameters: an iterable with the data, and a number indicating which quartile to compute (1,2,3).

There are multiple ways to deal with the fact that, most of the time, $kn/4$ with $k \in \{1,2,3\}$ is not an integer. We will use the same method as numpy's default implementation (linear interpolation).

To do so, we will compute the integer part ($i$) and fractional part ($f$) of $k(n-1)/4$.

The result will be computed as:

$\Large Q_k = X_i + f (X_{i+1} - X_{i})$

$X_i$ is the $i$-th data point in the sorted dataset, with indices starting at 0.



In [12]:
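def msds_quartile(data, p):
    """
    Computes p-th quartile (p=1,2,3) of given data,
    using linear interpolation (numpy's default method)
    """
    sorted_data = sorted(data)
    n = len(sorted_data)
    if n == 0:
        return None
    # Integer part (k) and fractional part (alpha) of p * (n - 1) / 4
    k = p * (n - 1) // 4
    alpha = p * (n - 1) / 4 - k
    if k + 1 >= n:
        # No right neighbour to interpolate with (e.g. a single data point)
        return sorted_data[k]
    return sorted_data[k] + alpha * (sorted_data[k + 1] - sorted_data[k])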


def test_msds_quartile():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    assert msds_quartile(test_data1, 1) == np.quantile(test_data1, 0.25)
    assert msds_quartile(test_data1, 2) == np.quantile(test_data1, 0.5)
    assert msds_quartile(test_data1, 3) == np.quantile(test_data1, 0.75)

    assert msds_quartile(test_data2, 1) == np.quantile(test_data2, 0.25)
    assert msds_quartile(test_data2, 2) == np.quantile(test_data2, 0.5)
    assert msds_quartile(test_data2, 3) == np.quantile(test_data2, 0.75)


test_msds_quartile()
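
The interpolation rule can also be traced by hand. The sketch below computes $Q_1$ step by step for one of the test datasets, then compares it with `statistics.quantiles(..., method="inclusive")`, which uses the same positions $k(n-1)/4$. This standard-library function requires Python 3.8+, which is an assumption beyond the 3.6+ stated at the top of this notebook.

```python
import statistics

data = sorted([7, 15, 36, 39, 40, 41])
n = len(data)

position = 1 * (n - 1) / 4  # k = 1 (first quartile): 5 / 4 = 1.25
i = int(position)  # integer part: 1
f = position - i  # fractional part: 0.25

# Q1 = X_1 + 0.25 * (X_2 - X_1) = 15 + 0.25 * (36 - 15)
q1 = data[i] + f * (data[i + 1] - data[i])
print(q1)  # 20.25

# Cross-check with the standard library (returns [Q1, Q2, Q3])
print(statistics.quantiles(data, n=4, method="inclusive")[0])  # 20.25
```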

#### [easy] write a function that computes the inter-quartile range (IQR)

IQR is simply the difference between $Q_3$ and $Q_1$!

In [13]:
def msds_iqr(data):
    """
    Computes the inter-quartile range (IQR)
    """
    return msds_quartile(data, 3) - msds_quartile(data, 1)


def test_msds_iqr():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    assert msds_iqr(test_data1) == 17, f"{msds_iqr(test_data1)}"
    assert msds_iqr(test_data2) == 19.5, f"{msds_iqr(test_data2)}"


test_msds_iqr()

#### [moderate] write a function that computes the Nth-order moment

The function should take the data and $N$ as parameters.
The Nth-order (central) moment is defined as:

$\Large m_N = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^N$

where:
 - $\bar{X}$ is the mean
 - $X_i$ is the $i$-th data point
 - $n$ is the number of data points



In [14]:
def msds_moment(data, N):
    """
    Computes the N-th order moment
    """
    mu = msds_mean(data)
    moment_data = [((x - mu)) ** N for x in data]
    return msds_sum(moment_data) / len(data)


def test_msds_moment():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]
    assert abs(msds_moment(test_data1, 2) - msds_variance(test_data1)) < 1e-5
    assert abs(msds_moment(test_data2, 2) - msds_variance(test_data2)) < 1e-5
    assert abs(msds_moment(test_data1, 3) - -3163.492111194589) < 1e-5
    assert abs(msds_moment(test_data1, 4) - 113326.19657127243) < 1e-5


test_msds_moment()
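
Two special cases help to sanity-check any moment implementation: the first-order central moment is always zero, and the second-order one is the population variance. The sketch below verifies both with a self-contained helper. The name `central_moment` is hypothetical and only used here.

```python
import statistics


def central_moment(data, order):
    """Mean of the deviations from the mean, raised to the given order."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


data = [7, 15, 36, 39, 40, 41]

# Order 1: deviations from the mean cancel out
assert abs(central_moment(data, 1)) < 1e-9

# Order 2: this is the population variance (denominator n)
assert abs(central_moment(data, 2) - statistics.pvariance(data)) < 1e-9
```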

#### [moderate] write a function that computes the Nth-order standard moment

The function should take the data and $N$ as parameters. The data should be standardized before computing the moment.

The 3rd standard moment is the **skewness**, an indicator of the **symmetry** of the underlying distribution.<br>
The 4th standard moment is the **kurtosis**, an indicator of the **tailedness** of the underlying distribution.

Write helper functions to compute skewness and kurtosis.

In [15]:
from scipy.stats import skew, kurtosis


def msds_standard_moment(data, N):
    """
    Computes the N-th standard moment
    """
    mu = msds_mean(data)
    sigma = msds_std(data)
    moment_data = [((x - mu) / sigma) ** N for x in data]
    return msds_sum(moment_data) / len(data)


def msds_skewness(data):
    """
    Helper function for skewness
    """
    return msds_standard_moment(data, 3)


def msds_kurtosis(data):
    """
    Helper function for kurtosis
    """
    return msds_standard_moment(data, 4)


def test_msds_standard_moment():
    test_data1 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    test_data2 = [7, 15, 36, 39, 40, 41]

    assert abs(msds_standard_moment(test_data1, 3) - skew(test_data1)) < 1e-5
    assert (
        abs(msds_standard_moment(test_data1, 4) - kurtosis(test_data1, fisher=False))
        < 1e-5
    )
    assert abs(msds_standard_moment(test_data2, 3) - skew(test_data2)) < 1e-5
    assert (
        abs(msds_standard_moment(test_data2, 4) - kurtosis(test_data2, fisher=False))
        < 1e-5
    )


test_msds_standard_moment()
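
The sign of the skewness shows which side of the distribution has the longer tail. The sketch below illustrates this on three made-up datasets, using a self-contained helper. The name `standardized_moment` is hypothetical and the data is purely illustrative.

```python
import math


def standardized_moment(data, order):
    """Moment of the given order, computed on standardized data."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)  # population std
    return sum(((x - mean) / std) ** order for x in data) / n


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]  # one large value stretches the right tail
left_tailed = [-20, -3, -2, -2, -1, -1]  # mirror image of the previous dataset

print(standardized_moment(symmetric, 3))  # close to 0: symmetric data
print(standardized_moment(right_tailed, 3))  # positive: longer right tail
print(standardized_moment(left_tailed, 3))  # negative: longer left tail
```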

### Data cleaning

#### [moderate] identify and fix issues in a dataset 

Load the `err_salary.csv` dataset.
It contains several mistakes, identify and fix them!

To keep it simple, remove all rows that contain erroneous values!

In [35]:
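df = pd.read_csv("data/err_salary.csv")
# dropna() returns a new DataFrame, so the result must be assigned back
df = df.dropna(subset=["Age", "Years of Experience", "Salary"])

# Keep only rows with a positive age and a salary below one million
df = df.loc[(df.Age > 0) & (df.Salary < 1e6)]
# Age_Months duplicates the information in Age
df = df.drop("Age_Months", axis=1)
df.describe()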

| | Age | Years of Experience | Salary |
|---|---|---|---|
| count | 376.0 | 376.0 | 376.0 |
| mean | 37.425532 | 10.896277 | 100426.462766 |
| std | 7.139358 | 11.464665 | 48293.870948 |
| min | 23.0 | 0.0 | 350.0 |
| 25% | 31.0 | 4.0 | 55000.0 |
| 50% | 36.0 | 9.0 | 95000.0 |
| 75% | 44.0 | 16.0 | 140000.0 |
| max | 53.0 | 126.0 | 250000.0 |
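
The summary still shows suspicious extremes (a minimum salary of 350 and a maximum of 126 years of experience), so further plausibility checks may be worthwhile. The sketch below shows one possible way to express such checks as a boolean mask. The rows are made up for illustration and are not taken from `err_salary.csv`, and every threshold is an assumption to adapt to the real data.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from err_salary.csv
toy = pd.DataFrame(
    {
        "Age": [32, -5, 41, 29, None],
        "Years of Experience": [5, 3, 126, 2, 4],
        "Salary": [90000, 65000, 120000, 350, 80000],
    }
)

# Hypothetical plausibility rules; comparisons with missing values are False,
# so rows with a missing Age are dropped as well
valid = (
    toy["Age"].between(16, 100)
    & (toy["Years of Experience"] >= 0)
    & (toy["Years of Experience"] < toy["Age"])
    & toy["Salary"].between(10_000, 1_000_000)
)

clean = toy.loc[valid]
print(clean)
```

Keeping the rules in a single mask makes it easy to inspect the rejected rows with `toy.loc[~valid]` before deleting anything.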
