# Stats

## Libraries

In [10]:
import numpy as np
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

# Set the default Seaborn style and show every column in notebook output
sns.set(style="whitegrid")
pd.set_option("display.max_columns", None)
pd.set_option("display.notebook_repr_html", True)

In [11]:
# Loading built-in Datasets:
iris = sns.load_dataset("iris")

In [15]:
iris  # sns.load_dataset already returns a DataFrame, so no conversion is needed

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [13]:
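# Without parentheses, iris.info is the bound method, so this prints its repr; call iris.info() for the dtype summary
print(iris.info)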

<bound method DataFrame.info of      sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]>


# Measures of Central Tendency
The measures of central tendency show the central or middle values of datasets. There are several definitions of what’s considered to be the center of a dataset. In this tutorial, you’ll learn how to identify and calculate these measures of central tendency:

* Mean
* Weighted mean
* Geometric mean
* Harmonic mean
* Median
* Mode
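
The measures listed above can be sketched on a handful of values. This is a minimal example, assuming the first five `sepal_length` and `petal_length` rows shown earlier as sample data; the `weights` list is hypothetical and only illustrates how a weighted mean is computed.

```python
import statistics

import numpy as np

# First five rows of the iris table shown earlier
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
petal_length = [1.4, 1.4, 1.3, 1.5, 1.4]

# Hypothetical weights, chosen only for illustration
weights = [1, 2, 3, 2, 1]

mean = statistics.mean(sepal_length)
weighted_mean = np.average(sepal_length, weights=weights)
geometric_mean = statistics.geometric_mean(sepal_length)
harmonic_mean = statistics.harmonic_mean(sepal_length)
median = statistics.median(sepal_length)
mode = statistics.mode(petal_length)  # most frequent value

print("mean:", mean)
print("weighted mean:", weighted_mean)
print("geometric mean:", geometric_mean)
print("harmonic mean:", harmonic_mean)
print("median:", median)
print("mode:", mode)
```

For positive data, the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean, so comparing the three is a quick sanity check.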


## Mean absolute deviation

In [9]:
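num = iris.select_dtypes("number")  # DataFrame.mad() was removed in pandas 2.0
(num - num.mean()).abs().mean()  # mean absolute deviation around the mean, per column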

sepal_length    0.687556
sepal_width     0.336782
petal_length    1.562747
petal_width     0.658133
dtype: float64

## Variance

In [14]:
iris.var(numeric_only=True)  # skip the non-numeric species column

sepal_length    0.685694
sepal_width     0.189979
petal_length    3.116278
petal_width     0.581006
dtype: float64

# Measures of Variability
The measures of central tendency aren’t sufficient to describe data. You’ll also need the measures of variability that quantify the spread of data points. In this section, you’ll learn how to identify and calculate the following variability measures:

* Variance
* Standard deviation
* Skewness
* Percentiles
* Ranges
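
Percentiles and ranges from the list above can be sketched with NumPy. This is a minimal example, assuming the ten `sepal_length` values visible in the table printed earlier as a stand-in sample; it is not the full dataset, so the numbers differ from the full-data results.

```python
import numpy as np

# Ten sepal_length values visible in the printed table (rows 0-4 and 145-149)
sample = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 6.7, 6.3, 6.5, 6.2, 5.9])

# Range: distance between the largest and smallest value
value_range = sample.max() - sample.min()

# Percentiles (linear interpolation is NumPy's default)
q25, q50, q75 = np.percentile(sample, [25, 50, 75])

# Interquartile range: spread of the middle half of the data
iqr = q75 - q25

print("range:", value_range)
print("quartiles:", q25, q50, q75)
print("IQR:", iqr)
```

The interquartile range is less sensitive to extreme values than the full range, which is why the two are often reported together.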

## Coefficient of variation

In [8]:
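num = iris.select_dtypes("number")  # skip the non-numeric species column
num.std(ddof=0) / num.mean()  # population standard deviation (ddof=0) relative to the mean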


sepal_length    0.141238
sepal_width     0.142088
petal_length    0.468176
petal_width     0.633429
dtype: float64

## Standard deviation

In [29]:
iris.std(numeric_only=True)  # sample standard deviation (ddof=1), numeric columns only

sepal_length    0.828066
sepal_width     0.435866
petal_length    1.765298
petal_width     0.762238
dtype: float64

## Skewness

In [33]:
iris.skew(numeric_only=True)  # skip the non-numeric species column

sepal_length    0.314911
sepal_width     0.318966
petal_length   -0.274884
petal_width    -0.102967
dtype: float64

## Quantiles

In [39]:
quantiles = iris.quantile([0.25, 0.5, 0.75], numeric_only=True)

print(quantiles)

      sepal_length  sepal_width  petal_length  petal_width
0.25           5.1          2.8          1.60          0.3
0.50           5.8          3.0          4.35          1.3
0.75           6.4          3.3          5.10          1.8


## Covariance

### Measures of Correlation Between Pairs of Data
You’ll often need to examine the relationship between the corresponding elements of two variables in a dataset. Say there are two variables, $x$ and $y$, with an equal number of elements, $n$. Let $x_1$ from $x$ correspond to $y_1$ from $y$, $x_2$ from $x$ to $y_2$ from $y$, and so on. You can then say that there are $n$ pairs of corresponding elements: $(x_1, y_1)$, $(x_2, y_2)$, and so on.

You’ll see the following kinds of correlation between pairs of data:

* Positive correlation exists when larger values of $x$ correspond to larger values of $y$ and vice versa.
* Negative correlation exists when larger values of $x$ correspond to smaller values of $y$ and vice versa.
* Weak or no correlation exists if there is no such apparent relationship.

The two statistics that measure the correlation between datasets are covariance and the correlation coefficient.
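
The three cases can be illustrated with toy data. This is a minimal sketch: the arrays below are made up for illustration and are not taken from the iris dataset.

```python
import numpy as np

# Made-up data: one x series paired with three different y series
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cases = {
    "positive": np.array([2.0, 4.0, 6.0, 8.0, 10.0]),   # y grows with x
    "negative": np.array([10.0, 8.0, 6.0, 4.0, 2.0]),   # y shrinks as x grows
    "none": np.array([1.0, -1.0, 0.0, -1.0, 1.0]),      # no linear relationship
}

results = {}
for label, y in cases.items():
    cov_xy = np.cov(x, y)[0, 1]      # sample covariance (ddof=1)
    r_xy = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
    results[label] = (cov_xy, r_xy)
    print(f"{label}: covariance={cov_xy:.3f}, correlation={r_xy:.3f}")
```

The sign of the covariance gives the direction of the relationship, while the correlation coefficient also bounds its strength between -1 and 1.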

<p style="
   color: #32516b;
   background-color: #dfebf5;
   border-color: #d3e3f1;">There’s one important thing to keep in mind when working with correlation between a pair of variables: correlation is not a measure or indicator of causation, but only of association!</p>


In [43]:
covariance = iris.cov(numeric_only=True)  # skip the non-numeric species column
print(covariance)

              sepal_length  sepal_width  petal_length  petal_width
sepal_length      0.685694    -0.042434      1.274315     0.516271
sepal_width      -0.042434     0.189979     -0.329656    -0.121639
petal_length      1.274315    -0.329656      3.116278     1.295609
petal_width       0.516271    -0.121639      1.295609     0.581006
