# STATISTICS Applied to data science

## Exercises PART 1: Descriptive statistics and data exploration

Employing descriptive statistics is one of the main steps of the POC stage (proof-of-concept) and extremely helpful during model evaluation.  
A sound knowledge of statistics will help you design your machine learning experiments and interpret the results easily.   
In this notebook you'll find some common routines for descriptive statistics in Python, and exercises about data transformation and scaling. 

![Image](../images/data_1.jpg)

### Libraries and configs

In [None]:
import numpy as np
from numpy import random
import pandas as pd
from numpy.random import seed, randn
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 5)
#%matplotlib inline

from scipy import stats

# jupyter lab configs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# precision options
pd.set_option('display.float_format', lambda x: '%.2f' % x)
%precision 4
np.set_printoptions(precision=4, suppress=True)

# Exercise 1 - Write your own summary statistics and descriptors

Implement code for the functions below. In each function, make sure you call the functions written before it. E.g., in `my_rmse()` use the value returned by `my_mse()`. The aim of this exercise is to understand how these different metrics are related, and which aspect of the data each one represents.

**You can use the map below to see the relationships between metrics and then plan how to structure your functions** 

![Image](../images/map.png)

In [None]:
def my_mean(x):
    if len(x)>0:
        return sum(x)/len(x)

def my_sum_squares():
    pass

def my_mse():
    # mean squared error
    pass

def my_rmse():
    # root mean squared error
    pass

def my_variance():
    pass

def my_std_dev():
    pass

def my_std_error():
    pass

def my_confidence_95():
    pass
    
def my_covariance():
    pass

def my_coeficient_variation():
    pass

### Make sure it works!! In Python use `assert`

In [None]:
x = random.randint(500, size=(32))
# compare floats with a tolerance instead of exact equality
assert np.isclose(my_mean(x), np.mean(x))
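
One possible way to get reference values for the other assertions is to chain the metrics with NumPy. This is only a minimal sketch: it assumes the sample (n - 1) denominator for the variance and the normal approximation (1.96) for the 95% interval, and the variable names are illustrative. The map above may use different conventions, so adjust accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 500, size=32).astype(float)

n = len(x)
mean = x.sum() / n
sum_squares = ((x - mean) ** 2).sum()      # sum of squared deviations from the mean
variance = sum_squares / (n - 1)           # assumption: sample variance (ddof=1)
std_dev = np.sqrt(variance)                # standard deviation is the root of the variance
std_error = std_dev / np.sqrt(n)           # standard error of the mean
ci95 = (mean - 1.96 * std_error,           # assumption: normal approximation
        mean + 1.96 * std_error)
coef_var = std_dev / mean                  # coefficient of variation

# NumPy reference values to assert against
assert np.isclose(variance, np.var(x, ddof=1))
assert np.isclose(std_dev, np.std(x, ddof=1))
```

Each quantity reuses the one computed before it, which is the same dependency structure the exercise asks you to build with function calls.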

---

# Exercise 2. Practice data description and summarization with pandas

### Here's a collection of `pandas` functions I find most useful during the data exploration stage:
* `.describe()` and `.describe(include=object)`
* `.info()`
* `.unique()` and `.nunique()`
* `.value_counts()`
* `.groupby().agg()`
* `pd.cut()` and `pd.qcut()` for binning continuous variables into discrete ones
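
To see what these calls return, here is a small sketch on a made-up DataFrame. The `toy` data, its column names and the bin edges are invented for illustration and are not part of the exercise dataset.

```python
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "price": [10.0, 12.0, 20.0, 22.0, 30.0, 35.0],
})

numeric_summary = toy.describe()             # count, mean, std, quartiles of numeric columns
city_summary = toy["city"].describe()        # count, unique, top, freq for a text column
n_cities = toy["city"].nunique()             # number of distinct values
counts = toy["city"].value_counts()          # rows per distinct value
by_city = toy.groupby("city")["price"].agg(["mean", "max"])

# pd.cut uses fixed bin edges, pd.qcut uses quantiles (equal-sized groups)
toy["price_band"] = pd.cut(toy["price"], bins=[0, 15, 25, 40],
                           labels=["low", "mid", "high"])
toy["price_half"] = pd.qcut(toy["price"], q=2, labels=["bottom", "top"])
print(toy)
```

`pd.cut` is the better choice when the thresholds have a meaning of their own, while `pd.qcut` is handy when you want groups of roughly equal size.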

In [None]:
# load a dataset
# note: load_boston was removed in scikit-learn 1.2, so this cell needs scikit-learn < 1.2
from sklearn.datasets import load_boston
dt = load_boston(return_X_y=False)

BOSTON DATASET  
**TARGET**  
`MEDV` Median value of owner-occupied homes in $1000s (stored as `MED_VALUE` in the DataFrame below)

**POSSIBLE FACTORS**  
`CRIM` per capita crime rate by town    
`ZN` proportion of residential land zoned for lots over 25,000 sq.ft.    
`INDUS` proportion of non-retail business acres per town  
`CHAS` Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)   
`NOX` nitric oxides concentration (parts per 10 million)   
`RM` average number of rooms per dwelling      
`AGE` proportion of owner-occupied units built prior to 1940    
`DIS` weighted distances to five Boston employment centres       
`RAD` index of accessibility to radial highways    
`TAX` full-value property-tax rate per $10,000     
`PTRATIO` pupil-teacher ratio by town  

In [None]:
# Load Boston house prices data - CONTINUOUS DATA
dt = load_boston(return_X_y=False)
df = pd.DataFrame(data = np.c_[dt['data'],dt['target']])
df.columns = np.append(dt['feature_names'], 'MED_VALUE')
df.drop(['B', 'LSTAT'], inplace=True, axis=1)

Use pandas `describe()` for continuous data and `describe(include=object)` for categorical data.

In [None]:
df.describe()

Check the number of unique values per variable to understand which are continuous and which are discrete

In [None]:
for c in df.columns:
    print(c, 'has',  df[c].nunique(), 'unique values')

Look at the table above and pay attention to the continuous variables you identified.   
Just looking at the relationship between the **mean** and **std**, which variables seem to be normally distributed?

Which seem to be not normally distributed?
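
As a rough aid for this kind of eyeballing, the sketch below compares a symmetric and a right-skewed synthetic sample. It is a heuristic, not a formal test, and the samples are generated here purely for illustration. For a variable that cannot be negative, a std close to or larger than the mean hints at right skew, because a normal distribution with those parameters would put a noticeable share of values below zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
symmetric = rng.normal(loc=50, scale=5, size=1000)   # roughly normal
skewed = rng.exponential(scale=2.0, size=1000)       # non-negative, right-skewed

def shape_summary(values):
    """Return a few descriptors that hint at the shape of a distribution."""
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values)),
        "skew": float(stats.skew(values)),
    }

sym = shape_summary(symmetric)
skw = shape_summary(skewed)
print(sym)
print(skw)
```

In the skewed sample the mean sits well above the median and the std is about as large as the mean, while in the symmetric sample mean and median nearly coincide.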

In [None]:
df.DIS.plot.hist(bins=30)

Let's create some categories in the data using `pd.cut()`.     
Check the variable `NOX` that indicates a measure of pollution.    
How many categories could we extract from this data?

In [None]:
df.NOX.plot.hist(bins=30)

In [None]:
bins = [0, 0.48, 0.58, 0.68, 0.78, 1]
labels = ['level1', 'level2', 'level3', 'level4', 'level5']
df['NOX_categories'] = pd.cut(df['NOX'], labels=labels, bins=bins)
df[['NOX', 'NOX_categories']].head(10)

Overview of the new variable `NOX_categories`

In [None]:
df.NOX_categories.describe()
df.NOX_categories.value_counts()

---

# Working with probability distributions

Examples using the **stats** module of **scipy**:  

Probability mass function: `stats.poisson.pmf()`  
Cumulative distribution function: `stats.poisson.cdf()`  
Generate samples: `stats.poisson.rvs(3, size=5000)`  
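
A short sketch of how these three calls relate to each other, using an arbitrary rate of 3 chosen only for illustration:

```python
from scipy import stats

lam = 3  # illustrative rate (lambda)

p_exactly_2 = stats.poisson.pmf(2, lam)    # P(X == 2)
p_at_most_2 = stats.poisson.cdf(2, lam)    # P(X <= 2)
p_more_than_2 = stats.poisson.sf(2, lam)   # P(X > 2), the complement of the cdf

# the cdf is the running sum of the pmf
cdf_from_pmf = sum(stats.poisson.pmf(k, lam) for k in range(3))

# random draws; their mean should be close to lam
samples = stats.poisson.rvs(lam, size=5000, random_state=42)
print(p_exactly_2, p_at_most_2, p_more_than_2, samples.mean())
```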


In [None]:
import math

The number of items per order follows a Poisson distribution with lambda = 2.  
What is the probability of having an order with exactly 6 items?

In [None]:
stats.poisson.pmf(6, 2)

The number of clicks per ad follows a Poisson distribution with lambda = 10.  
What is the probability of having exactly 15 clicks on one ad?

In [None]:
stats.poisson.pmf(15, 10)
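
The same probabilities can be computed by hand from the Poisson formula P(X = k) = exp(-lambda) * lambda^k / k!. The helper below, `poisson_pmf`, is a hypothetical name introduced here only to cross-check the scipy results:

```python
import math
from scipy import stats

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events for rate lam, from the closed formula."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

by_hand_items = poisson_pmf(6, 2)     # exactly 6 items per order, lambda = 2
by_hand_clicks = poisson_pmf(15, 10)  # exactly 15 clicks per ad, lambda = 10

print(by_hand_items, stats.poisson.pmf(6, 2))
print(by_hand_clicks, stats.poisson.pmf(15, 10))
```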

---

# Three ways to check the distribution of your data

## 1. Use histograms

You can show the frequency as absolute counts:

In [None]:
df.CRIM.plot.hist(bins=20)

...or you can show it as a density, where the bar areas sum to 1 (note that these are not percentages):

In [None]:
df.CRIM.plot.hist(bins=10, density=True)
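
To make the difference between counts, density and percentages concrete, here is a sketch with `np.histogram` on synthetic data (the sample is generated here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.exponential(scale=2.0, size=1000)

counts, edges = np.histogram(values, bins=10)                # absolute counts per bin
density, _ = np.histogram(values, bins=10, density=True)     # normalized so the area is 1

widths = np.diff(edges)
area = (density * widths).sum()            # bar height times bar width, summed over bins

percent = counts / counts.sum() * 100      # share of observations per bin, in percent
print(area, percent.sum())
```

If you want a histogram whose bar heights are percentages, one option is to pass per-observation weights to matplotlib, e.g. `plt.hist(values, bins=10, weights=np.ones(len(values)) / len(values) * 100)`.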

## 2. Use hypothesis tests

A common way of testing if a variable has a normal distribution is to use the **Shapiro-Wilk Test**.        
In this test, the null hypothesis is that the data comes from a normal distribution. When **p < 0.05** we can reject this hypothesis. When **p >= 0.05** we cannot reject it, which is not the same as proving that the data is normal.

In [None]:
# import the test from scipy
from scipy.stats import shapiro

# create a variable by drawing from a normal distribution
normal_data = np.random.normal(8, 3.3, 100)
# apply the test, which returns the statistic and the p-value
shapiro(normal_data)

The p-value is > 0.05 (by far), so what do we do? 

**Now repeat the test using one of the dataset's variables:**

Is this variable normally distributed?  
Try it yourself using the other variables in the dataset

In [None]:
shapiro(df.RM)
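
A minimal sketch of how the result can be turned into a decision. It uses a synthetic right-skewed sample and the conventional 0.05 threshold; `alpha` and the sample itself are choices made for this illustration:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(7)
skewed = rng.exponential(scale=1.0, size=200)   # clearly non-normal sample

stat, p_value = shapiro(skewed)                 # test statistic W and p-value

alpha = 0.05
if p_value < alpha:
    decision = "reject the null hypothesis of normality"
else:
    decision = "cannot reject the null hypothesis of normality"
print(stat, p_value, decision)
```

Keep in mind that with large samples the test becomes very sensitive, so even small departures from normality can produce a tiny p-value.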

## 3. Use QQ-plots

In [None]:
from statsmodels.graphics.gofplots import qqplot

# example of a normally distributed variable
p = qqplot(normal_data, line='s')


# example of a variable that approaches normality but has outliers
q = qqplot(df.RM, line='s')

# example of a clearly non-normal variable
r = qqplot(df.CRIM, line='s')
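
A QQ-plot pairs the sorted sample values with the quantiles expected under a normal distribution. The sketch below uses `scipy.stats.probplot` without plotting to show the numbers behind the picture; the two samples are synthetic and only illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_sample = rng.normal(loc=8, scale=3.3, size=500)
skewed_sample = rng.exponential(scale=2.0, size=500)

# probplot returns (theoretical quantiles, ordered values) and a least-squares fit
(theo_n, ordered_n), (slope_n, intercept_n, r_normal) = stats.probplot(normal_sample, dist="norm")
(theo_s, ordered_s), (slope_s, intercept_s, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# r measures how closely the points follow the straight line
print(r_normal, r_skewed)
```

Points that hug the line give a correlation `r` near 1, while systematic curvature (as in the skewed sample) lowers it.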

---

# Data transformations

## The most common procedures are *feature scaling* and *linearization*:

1. `Feature scaling` means you transform the data so all quantitative features are, let's say, *speaking the same language*.   
Common scaling techniques are:
* Min-max (a.k.a. **normalization**)
* z-score (a.k.a. **standardization**)  

Personally, I always use z-score, and this transformation is also the most common method employed in *unsupervised learning* such as PCA, clustering, etc.

2. `Linearization` is usually needed to transform the `target`, or `dependent` variable, i.e., what you are trying to model
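
The two scaling techniques listed above can be sketched as follows. The vector `x` is made up for illustration, and the z-score uses the `scipy.stats.zscore` helper so the hand-written version is still left as an exercise:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # illustrative feature

# min-max (normalization): rescales to the [0, 1] interval
x_minmax = (x - x.min()) / (x.max() - x.min())

# z-score (standardization): mean 0 and standard deviation 1
x_z = stats.zscore(x)

print(x_minmax)
print(x_z.mean(), x_z.std())
```

Min-max keeps the original shape but is sensitive to extreme values, since a single outlier defines the range. The z-score expresses each value as a distance from the mean in standard deviations.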

# Feature scaling (a.k.a. standardization, normalization)

## Z-score transformation 

![Image](../images/zscore.gif)   



You can use `scipy.stats.zscore()` or write your own function, which is way more fun:

In [None]:
def my_z_score(data):
    """ Applies z-score transformation to a vector"""
    return data

In [None]:
# generate some data and check the mean and sd before transformation
data = randn(5)
np.mean(data), np.std(data)

Now check what happens to the mean and std after the z-score transformation:

In [None]:
data_std = my_z_score(data)
np.mean(data_std), np.std(data_std)

# Linearization
## *Dealing with non-gaussian data* 

There are usually four ways of carrying on the analysis if you are working with regression problems and quantitative **target** variables that are not normally distributed.
1. Look for models that don't need linear relationships in the data (E. g. random forests, boosted trees)
2. Look for models that can handle different distributions, like Poisson or Binomial (a.k.a. Generalized Linear Models)
3. If you are using a hypothesis test, use bootstrapping to generate the null model
4. Apply transformations (log, sqrt, box-cox)
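
As an illustration of option 3, here is one possible sketch of a bootstrap null model for a difference in means between two skewed groups. The groups, the number of resamples and the two-sided p-value formula are assumptions made for this example, not a prescribed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=100)   # synthetic, right-skewed
group_b = rng.exponential(scale=3.0, size=100)   # synthetic, larger mean

observed = group_b.mean() - group_a.mean()

# null model: both groups come from the same (pooled) distribution
pooled = np.concatenate([group_a, group_b])
n_boot = 2000
null_diffs = np.empty(n_boot)
for i in range(n_boot):
    resample_a = rng.choice(pooled, size=len(group_a), replace=True)
    resample_b = rng.choice(pooled, size=len(group_b), replace=True)
    null_diffs[i] = resample_b.mean() - resample_a.mean()

# share of null differences at least as extreme as the observed one (two-sided)
p_value = (np.sum(np.abs(null_diffs) >= abs(observed)) + 1) / (n_boot + 1)
print(observed, p_value)
```

No normality assumption is needed here, because the reference distribution is built from the data itself.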

**Warning!**  

Log-transformation is a common tool in statistics. However, there is a pitfall in using log transformation of your data.  
Especially if you have a wide numerical range in a feature, keep in mind that log will "compress" the data significantly more, and this can prevent the identification of interesting patterns.

In [None]:
# difference between the log and sqrt transformation of a "big" value
np.sqrt(34565)
np.log(34565)

# difference between the log and sqrt transformation of a "small" value
np.sqrt(107)
np.log(107)

The function below draws **QQ-plots** for two sets of values, such as raw (untransformed) and transformed data, side by side for comparison.

In [None]:
def plot_compare_transformations(raw_data, transformed_data, transformation_used):
    fig = plt.figure(figsize = (18, 7))
    ax1 = fig.add_subplot(121)  # left panel: raw data
    prob = stats.probplot(raw_data, dist=stats.norm, plot=ax1)
    ax1.set_xlabel('')
    ax1.set_title('Probplot against the normal distribution (line) ')
    ax2 = fig.add_subplot(122)  # right panel: transformed data
    prob = stats.probplot(transformed_data, dist=stats.norm, plot=ax2)
    ax2.set_title('Probplot after ' + transformation_used + ' transformation')
    #plt.show()

### Example: 
Try **box-cox** (available in **scipy**)

In [None]:
# generate some data with noise
raw_data = stats.loggamma.rvs(14, size=50) + 250

# apply box-cox
transformed_data, _ = stats.boxcox(raw_data)

# plot and compare 
plot_compare_transformations(raw_data, transformed_data, 'box-cox')
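
For reference, box-cox applies (x^lambda - 1) / lambda for lambda != 0 and log(x) for lambda = 0, and it requires strictly positive data. When no lambda is given, `stats.boxcox` estimates it and returns it as the second value. The sketch below checks this on a synthetic log-normal sample generated only for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.8, size=300)   # strictly positive, right-skewed

transformed, lam = stats.boxcox(x)                 # lam is the estimated lambda

# the same transformation written out by hand
manual = np.log(x) if lam == 0 else (x ** lam - 1) / lam

skew_before = stats.skew(x)
skew_after = stats.skew(transformed)
print(lam, skew_before, skew_after)
```

A lambda close to 0 means the transformation behaves much like a log, and a lambda close to 0.5 behaves much like a square root.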

In [None]:
# apply box-cox
transformed_data, _ = stats.boxcox(df.RM)

# plot and compare 
plot_compare_transformations(df.RM, transformed_data, 'box-cox')

# apply log
transformed_data = np.log(df.RM)

# plot and compare 
plot_compare_transformations(df.RM, transformed_data, 'log')

Now let's see the same effect in numbers:

Example with the **Shapiro-Wilk** test:

In [None]:
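# Shapiro-Wilk test on the raw (untransformed) sample
stat, p_value = stats.shapiro(raw_data)
print('Shapiro-Wilk test, raw data')
print('stat:', stat, 'p-value:', p_value)

In [None]:
# Shapiro-Wilk test on the same sample after box-cox
# (recomputed here because `transformed_data` was overwritten with log(df.RM) above)
boxcox_data, _ = stats.boxcox(raw_data)
stat, p_value = stats.shapiro(boxcox_data)
print('Shapiro-Wilk test, box-cox transformed data')
print('stat:', stat, 'p-value:', p_value)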

----

<a href='https://www.freepik.com/vectors/data'>Data vector created by stories - www.freepik.com</a>