In [11]:
from IPython.display import Markdown, display

display(Markdown("header.md"))

<div>
    <img src="images/emlyon.png" style="height:60px; float:left; padding-right:10px; margin-top:5px" />
    <span>
        <h1 style="padding-bottom:5px;"> AI Booster Week 02 - Python for Data Science </h1>
        <a href="https://masters.em-lyon.com/fr/msc-in-data-science-artificial-intelligence-strategy">[Emlyon]</a> MSc in Data Science & Artificial Intelligence Strategy (DSAIS) <br/>
         Paris | © Antoine SCHERRER
    </span>
</div>

Please make sure you have a working installation of Jupyter Notebook / Jupyter Lab, with Python 3.6+ up and running.

## Naming conventions

Since we will implement functions that are already available in python standard library or other libraries, you will have to *prefix* every function with `msds_` prefix.

For instance, the function implementing the `mean` function should be named `msds_mean`.

For every function you write, you will need to write a test function that should be names `test_msds_[function_name]`.

For instance, the test function for the mean will be: `test_msds_mean`.

All function should be in snake case (no Camel case!)

When creating classes, then follow these rules:
 - class names should be in camel case
 - method names should be in snake case
 - attribute names should be in 

## Exercise's difficulty

Every exercise will be prefixed with an indication of its difficulty:
 - [easy]: for very easy exercise
 - [moderate]: for intermediate level exercise
 - [advanced]: for advanced students

Advanced exercises are not mandatory.


In [30]:
# Initial imports
import unittest
import math
import pandas as pd

# This imports all functions defined in the Session_01 notebook!
import import_ipynb
from Session_01_solved import *

importing Jupyter notebook from Session_01_solved.ipynb


<div>
    <img src="images/emlyon.png" style="height:60px; float:left; padding-right:10px; margin-top:5px" />
    <span>
        <h1 style="padding-bottom:5px;"> AI Booster Week 02 - Python for Data Science </h1>
        <a href="https://masters.em-lyon.com/fr/msc-in-data-science-artificial-intelligence-strategy">[Emlyon]</a> MSc in Data Science & Artificial Intelligence Strategy (DSAIS) <br/>
         Paris | © Antoine SCHERRER
    </span>
</div>

Please make sure you have a working installation of Jupyter Notebook / Jupyter Lab, with Python 3.6+ up and running.

## Naming conventions

Since we will implement functions that are already available in python standard library or other libraries, you will have to *prefix* every function with `msds_` prefix.

For instance, the function implementing the `mean` function should be named `msds_mean`.

For every function you write, **you will need to write a test function** that should be named `test_[function_name]`.

For instance, the test function for `msds_mean` will be: `test_msds_mean`.

**don't forget to document all you function with Python docstring**

For instance:
```
def msds_my_awesome_function():
    """
    This function computes an awesome function
    """
    # Awesome code
    ...
```

All function should be in snake case (no Camel case!)

When creating classes, then follow these rules:
 - class names should be in camel case
 - method names should be in snake case
 - attribute names should be in snake case

## Exercise's difficulty

Every exercise will be prefixed with an indication of its difficulty:
 - [easy]: easy exercise, should be pretty straightforward for you
 - [moderate]: intermediate level exercise, you all should manage to solve them
 - [advanced]: for advanced students who want to go deeper/further

**Advanced exercises are not mandatory.**

## Required libraries

These are the libraries we will use (to check our computations for instance), you need to install them in your virtual environment:

 - `pandas`: data manipulation library
 - `scipy`: scientific library in Python
 - `numpy`: vector/matrix computations
 - `statistics`: statistics library
 - `matplotlib`: plotting lib
 - `seaborn`: alternative plotting lib (based on matplotlib)
 - `jupyter_black`: plugin for jupyter to allow `black` (code formatter) to run
 - `unittest`: testing library 




# Session 01 - Introduction - Practice

Solving **all** question from Session_01 is mandatory before solving these practice exercises.

### [easy] write a function that computes the geometric mean of an iterable given as parameter.

The geometric mean is defined as:

$\Large {\displaystyle \left(\prod{X_i}\right)^{1/n}}$

$X_i$ is $i$-th data point


In [13]:
def msds_geo_mean(data):
    try:
        n = len(data)
        print(n)
        result = 1
        for element in data:
            result *= element
        result = pow(result,1/n)
        return result
    except Exception as e:
        print(e)

In [14]:
msds_geo_mean([1,2,3])

3


1.8171205928321397

In [None]:
def test_msds_mean():
    assert msds_geo_mean([]) is None
    assert msds_geo_mean([1]) == 1
    assert msds_geo_mean([1, 2, 3]) == 2
    assert msds_geo_mean([1, 2, 2, 3, 4]) == 2.4
    tc = unittest.TestCase()
    with tc.assertRaises(Exception):
        msds_geo_mean("sdfds")

### [easy] write a function that computes the harmonic mean of an iterable given as parameter.

The harmonic mean is defined as:

$\Large {\displaystyle \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}}$

$X_i$ is $i$-th data point


In [17]:
def msds_harm_mean(data):
    try:
        n = len(data)
        result = 0
        for element in data:
            result += (1/element)
        result = n / result
        return result
    except Exception as e:
        print(e)

In [18]:
msds_harm_mean([1,2,3])

1.6363636363636365

### [moderate] write a function that computes all data needed for a box-plot

Function should output two lists, the first list is the data needed to plot a box plot (in this order)
 - lower fence ($Q1 - 1.5\times IQR$)
 - 1st quartile ($Q1$) 
 - median ($Q2$)
 - 3rd quartile ($Q3$)
 - upper fence ($Q3 + 1.5\times IQR$)

The second list should contain all outliers ($v$ < lower fence or $v$ > upper fence)

In [67]:
def msds_box_fence(data):
    iqr = msds_iqr(data)
    q1 = msds_quartile(data, 1)
    q2 = msds_quartile(data, 2)
    q3 = msds_quartile(data, 3)
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return [lower_fence, q1, q2, q3, upper_fence], outliers

In [21]:
# def msds_box_fence(data):
#     try:
#         # Sort the data in ascending order
#         data_sorted = sorted(data)

#         # Calculate quartiles (Q1 and Q3)
#         n = len(data_sorted)
#         q1_index = int(0.25 * n)
#         q3_index = int(0.75 * n)
#         q1 = data_sorted[q1_index]
#         q3 = data_sorted[q3_index]

#         # Calculate the interquartile range (IQR)
#         iqr = q3 - q1

#         # Calculate the lower and upper fences
#         lower_fence = q1 - 1.5 * iqr
#         upper_fence = q3 + 1.5 * iqr

#         # Find outliers
#         outliers = [v for v in data_sorted if v < lower_fence or v > upper_fence]
        
#         # Return the box plot statistics and outliers
#         return [lower_fence, q1, (q1 + q3) / 2, q3, upper_fence], outliers
#     except Exception as e:
#         print(e)

In [68]:
# Example usage:
data = [15, 20, 21, 22, 22, 23, 24, 25, 30, 200]
box_plot_stats, outliers = calculate_box_plot_statistics(data)

print("Box Plot Statistics:")
print("Lower Fence:", box_plot_stats[0])
print("1st Quartile (Q1):", box_plot_stats[1])
print("Median (Q2):", box_plot_stats[2])
print("3rd Quartile (Q3):", box_plot_stats[3])
print("Upper Fence:", box_plot_stats[4])

print("\nOutliers:", outliers)

Box Plot Statistics:
Lower Fence: 15.0
1st Quartile (Q1): 21
Median (Q2): 23.0
3rd Quartile (Q3): 25
Upper Fence: 31.0

Outliers: [200]


### [moderate] write a function that computes all deciles

Function should output the list of 9 deciles (D1 .. D9).

You should write a test function and compare your results to the one from a statistics package.

Use `salary.csv` and `heights_weights.csv` datasets to check your results.

In [65]:
def msds_decile(data, p):
    """
    Computes p-th decile (p=1,2,3) of given data
    Alternative implementation
    """
    sorted_data = sorted(data)
    n = len(data)
    if n == 0:
        return None
    k = p * (n - 1) // 10
    alpha = p * (n - 1) / 10 - k
    return sorted_data[k] + alpha * (sorted_data[k + 1] - sorted_data[k])


def msds_idr(data):
    """
    Computes the inter-quartile range (IDR)
    """
    return msds_decile(data, 9) - msds_decile(data, 1)


def msds_fence_decile(data):
    idr = msds_idr(data)
    d1 = msds_decile(data, 1)
    d2 = msds_decile(data, 2)
    d3 = msds_decile(data, 3)
    d4 = msds_decile(data, 4)
    d5 = msds_decile(data, 5)
    d6 = msds_decile(data, 6)
    d7 = msds_decile(data, 7)
    d8 = msds_decile(data, 8)
    d9 = msds_decile(data, 9)

    lower_fence = d1 - 1.5 * idr
    upper_fence = d9 + 1.5 * idr

    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return [lower_fence, d1, d2, d3, d4, d5, d6, d7, d8, d9, upper_fence], outliers

In [62]:
# def msds_decile(data):
#     try:
#         # Sort the data in ascending order
#         data_sorted = sorted(data)

#         # Calculate deciles (d1 and d9)
#         n = len(data_sorted)
#         print(n)
#         d1_index = int(0.1 * n)
#         d2_index = int(0.2 * n)
#         d3_index = int(0.3 * n)
#         d4_index = int(0.4 * n)
#         d5_index = int(0.5 * n)
#         d6_index = int(0.6 * n)
#         d7_index = int(0.7 * n)
#         d8_index = int(0.8 * n)
#         d9_index = int(0.9 * n)

#         d1 = data_sorted[d1_index]
#         d2 = data_sorted[d2_index]
#         d3 = data_sorted[d3_index]
#         d4 = data_sorted[d4_index]
#         d5 = data_sorted[d5_index]
#         d6 = data_sorted[d6_index]
#         d7 = data_sorted[d7_index]
#         d8 = data_sorted[d8_index]
#         d9 = data_sorted[d9_index]

#         # Calculate the interdecile range (IDR)
#         idr = d9 - d1

#         # Calculate the lower and upper fences
#         lower_fence = d1 - 1.5 * idr
#         upper_fence = d9 + 1.5 * idr

#         # Find outliers
#         outliers = [v for v in data_sorted if v < lower_fence or v > upper_fence]

#         # Return the box plot statistics and outliers
#         return [lower_fence, d1, d2, d3, d4, d5, d6, d7, d8, d9, upper_fence], outliers
#     except Exception as e:
#         print(e)

In [42]:
salary = pd.read_csv("data/salary.csv")
salary["Age"]

0      32.0
1      28.0
2      45.0
3      36.0
4      52.0
       ... 
370    35.0
371    43.0
372    29.0
373    34.0
374    44.0
Name: Age, Length: 375, dtype: float64

In [71]:
age_decile, outliers = msds_fence_decile(salary["Salary"])
age_decile

[-115000.0,
 50000.0,
 78999.99999999999,
 105000.0,
 140000.0,
 nan,
 50000.0,
 90000.0,
 120000.0,
 160000.0,
 325000.0]

### [moderate] Application to data

Using the functions you implemented today, perform a basic statistical description of `highest_mountains.csv`, `salary.csv` and `heights_wweights.csv` datasets.

In [None]:
## YOUR CODE HERE

### [advanced] Deciles, percentiles, etc.

Write a function that takes a data set and an integer `P` as parameter.
`P` will be the number of segment in which we want to slice our data (4 means quartiles, 10 deciles, 100 percentiles).
Function should output all values (N-1).
Function should also print N messages like this, replacing `[...]` by the correct value:
```
[...]% of data values are below [...]
[...]% of data values are between [...] and [...]
...
[...]% percent of data values are above [...]
```
For instance, if `P=4`, the message should look like this (replacing `[Q1]` by actual values):
```
25% of data values are below [Q1]
50% of data values are below [Q2]
75% of data values are below [Q3]
```

Validate on weights dataset.

In [None]:
## YOUR CODE HERE

### [advanced] Compute empirical critical values

When performing statistical tests or building confidence intervals, 
you will sometimes need to compute the critical values for a given significance level $\alpha$  (typically $\alpha=0.05$).

Using function from previous question, write a function that will compute this critical values (lower and upper), for two-tailed tests (lower critical value $c_l$, such that $P(X < c_l) = \alpha / 2$, upper critical value is $c_h$, such that $P(X < c_h) = 1-\alpha / 2$


In [None]:
## YOUR CODE HERE

## Object-oriented programming

### Q6 [advanced] Convert all your functions and organize them in classes using OOP

The idea is that using OOP you can build your own statistics package, tailored to your needs.
Think carefully how you will organize your package, how you want to use it, etc.


In [2]:
## YOUR CODE HERE