In [None]:
# import the pandas and numpy libraries
import pandas as pd
import numpy as np

In [None]:
import matplotlib.pyplot as plt

# Correlation analysis

Consider an array $X$ of size $n\times p$ containing numerical data.

The dimension $n$ (the number of rows in the table) gives the number of *individuals*.
The dimension $p$ (the number of columns in the table) gives the number of *variables*.

- A row $i$ of table $X$ describes an individual using $p$ measurements: it is a **datum**.
- A column $j$ of table $X$ is a series, or statistical variable.



## Mean vector

The variables are denoted $X^1$, $X^2$,..., $X^{p}$. These are the columns of the matrix $X$.
They are therefore vectors of $\mathbb{R}^{n}$.
The mean of the $j$th variable is denoted $\bar{X}^j$ and is calculated by the following formula
$$
\bar{X}^j = \frac{1}{n} \sum_{i=1}^n x_{ij}.
$$
The vector of averages is the vector
$$
\bar{X} = (\bar{X}^1,\ldots,\bar{X}^p)^T.
$$
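It can be calculated using the numpy function ``np.mean`` with ``axis=0``, so that the average is taken over the rows (the individuals) and one mean is returned per column.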
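
As a minimal sketch, assuming a small made-up array ``X_demo`` (not the Olympic data), the mean vector can be obtained as follows. The name ``X_demo`` and its values are purely illustrative.

```python
import numpy as np

# Hypothetical 4 x 3 data array: 4 individuals (rows), 3 variables (columns)
X_demo = np.array([[1.0, 10.0, 100.0],
                   [2.0, 20.0, 200.0],
                   [3.0, 30.0, 300.0],
                   [4.0, 40.0, 400.0]])

# axis=0 averages over the rows, giving one mean per variable
mu_demo = np.mean(X_demo, axis=0)
print(mu_demo)  # the three means are 2.5, 25 and 250
```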

## Variances and covariances

The **variance-covariance matrix** is a table which gives **second order** *statistics* of the observations contained
in the array $X$ of size $n\times p$.

If $X^j$ and $X^{j\prime}$ are two statistical variables of dimension $n$, the empirical variance and covariance
are given by the formulas
$$
\mathrm{Var}(X^j) = \frac{1}{n} \sum_{i=1}^n (x_{ij}-\bar{X}^j)^2 = \frac{1}{n} \sum_{i=1}^n x_{ij}^2 -(\bar{X}^j)^2
$$
$$
\mathrm{Cov}(X^j, X^{j\prime}) = \frac{1}{n} \sum_{i=1}^n (x_{ij}-\bar{X}^j)(x_{ij\prime}-\bar{X}^{j\prime})
= \frac{1}{n} \sum_{i=1}^n x_{ij}x_{ij\prime}-\bar{X}^j\bar{X}^{j\prime}
$$
The variance-covariance matrix, denoted $\Sigma$, is a matrix of size $p\times p$ which contains
the following elements
$$
\sigma_{jj\prime} = \left\{
\begin{array}{lcl}
\mathrm{Cov}(X^j, X^{j\prime}) & \text{if} & j\neq j\prime \\
\mathrm{Var}(X^j) &\text{otherwise}
\end{array}
\right..
$$
The diagonal of $\Sigma$ contains the variances of the variables, and outside the diagonal
we find the covariances between the variables.

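The variance-covariance matrix of a numpy array can be calculated using the following code.
Note that ``np.cov`` divides by $n-1$ by default; pass ``bias=True`` to obtain the $1/n$ normalization used in the formulas above.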
```
cov = np.cov(X, rowvar=False)
cov
```
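
The sketch below, on made-up random data (``X_demo`` is a hypothetical array, not the Olympic data), compares the $1/n$ formula written with centered columns to the output of ``np.cov`` under both normalizations.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))   # hypothetical data: n=50, p=3
n = X_demo.shape[0]

# Center each column, then apply the 1/n formula in matrix form
Xc = X_demo - X_demo.mean(axis=0)
cov_formula = Xc.T @ Xc / n

# bias=True makes np.cov divide by n; the default divides by n - 1
cov_biased = np.cov(X_demo, rowvar=False, bias=True)
cov_default = np.cov(X_demo, rowvar=False)

print(np.allclose(cov_formula, cov_biased))                 # True
print(np.allclose(cov_default, cov_formula * n / (n - 1)))  # True
```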

## Correlations

The **correlation matrix** is a table that gives the *correlations* between the
$p$ variables of an array $X$ of size $n\times p$.

If $X^j$ and $X^{j\prime}$ are two statistical variables of dimension $n$, the empirical correlation
between these two variables is given by the following formula
$$
\rho(X^j, X^{j\prime}) = \frac{\mathrm{Cov}(X^j, X^{j\prime})}{\sqrt{\mathrm{Var}(X^j)\mathrm{Var}(X^{j\prime})}}
$$
The correlation matrix, denoted $R$, is a matrix of size $p\times p$ which contains
the following elements
$$
r_{jj\prime} = \left\{
\begin{array}{lcl}
\rho(X^j, X^{j\prime}) & \text{if} & j\neq j\prime \\
1 &\text{otherwise}
\end{array}
\right..
$$
The diagonal of $R$ contains only 1's, because the correlation of a variable with itself is perfect.

The correlation matrix of a numpy array can be calculated using the following code
```
cor = np.corrcoef(X, rowvar=False)
cor
```
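
To connect the formula with ``np.corrcoef``, here is a minimal sketch on made-up data (``X_demo`` is hypothetical): the correlation matrix is the covariance matrix divided entrywise by the product of the standard deviations. The normalization ($1/n$ or $1/(n-1)$) cancels in the ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))        # hypothetical data
X_demo[:, 1] += 0.8 * X_demo[:, 0]       # make two columns correlated

cov_demo = np.cov(X_demo, rowvar=False)
std_demo = np.sqrt(np.diag(cov_demo))    # standard deviation of each variable

# r_jj' = Cov(X^j, X^j') / sqrt(Var(X^j) Var(X^j'))
cor_formula = cov_demo / np.outer(std_demo, std_demo)
cor_numpy = np.corrcoef(X_demo, rowvar=False)

print(np.allclose(cor_formula, cor_numpy))   # True
print(np.allclose(np.diag(cor_numpy), 1.0))  # True
```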

## Training: Olympic data file

Let's read a data file and try to interpret these statistics.

The file *olympic.csv* contains the men's decathlon results from the
1988 Olympic Games. The variables are

- IDEN: athlete number
- C100: the 100 m race,
- SLONG: long jump,
- LWEIGHT: shot put,
- SHAUT: high jump,
- C400: 400 m race,
- C110: 110 m hurdles,
- LDISQ: discus throw,
- SPERCH: pole vault,
- LJAVE: javelin throw,
- C1500: 1500 m race,
- SCORE: total score obtained.

These data were published as example no. 357 in the book
*A Handbook of Small Data Sets* by Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E.,
published by Chapman & Hall, London (1994).

The example was itself reproduced from
Lunn, A. D. & McNeil, D.R., *Computer-Interactive Data Analysis*, Wiley, New York (1991).

For races, the common unit is the second; for jumps and throws, it's the meter.
Finally, the score is given in points.


In [None]:
df = pd.read_csv("olympic.csv",delimiter=";")
df = df.rename(columns=lambda x: x.strip()) # remove whitespaces
df

In [None]:
df.columns

In [None]:
# Export values to numpy array X, y
# ncols : number of columns of the dataframe
# X : the series
# y : the score
ncols  = df.shape[1]
sports = df.columns[1:ncols-1]
score  = "SCORE"
X = df[sports].to_numpy()
y = df["SCORE"].to_numpy()
X.shape, y.shape

There are $n=34$ athletes (individuals) and $p=10$ sports disciplines (variables). The SCORE is not
a sport discipline. It's a variable calculated from the other variables.

### 1. Compute the sample mean vector, and the sample covariance and correlation matrices

For easier reading, we recommend transforming the numpy vector of means and both
matrices into a dataframe. The following code converts the numpy vector ``mu`` and the matrices
``cov`` and ``cor`` into a DataFrame.

```
pd.DataFrame(mu.reshape((1,10)), columns = sports, index=['Means'])
pd.DataFrame(cov, columns = sports, index=sports)
pd.DataFrame(cor, columns = sports, index=sports)
```

#### Computation of the sample mean vector

In [None]:
# Compute the means using numpy (mu)


In [None]:
# copy the result in a dataframe


#### Computation of the variance-covariance matrix

In [None]:
# compute the variance-covariance matrix (cov)


In [None]:
# put the result in a dataframe


#### Computation of the correlation matrix

In [None]:
# compute the correlation matrix (cor)


In [None]:
# put the result in a dataframe


### 2. Interpretations

Make scatterplots of pairs of variables and interpret the strongest positive and negative correlations. In particular, try to answer the following questions:

- Which sports are strongly positively or negatively correlated? How do you interpret these correlations?
- In particular, why is the long jump negatively correlated with the 100m race?

The following code draws four scatterplots, each one plotting two disciplines against each other:

```
fig, ax = plt.subplots(2,2,figsize=(16,10)) 

ax[0,0].scatter(x=X[:,0], y=X[:,5])
ax[0,0].set_xlabel(sports[0])
ax[0,0].set_ylabel(sports[5])

ax[0,1].scatter(X[:,0], X[:,1])
ax[0,1].set_xlabel(sports[0])
ax[0,1].set_ylabel(sports[1])

ax[1,0].scatter(X[:,4], X[:,1])
ax[1,0].set_xlabel(sports[4])
ax[1,0].set_ylabel(sports[1])

ax[1,1].scatter(X[:,2], X[:,8])
ax[1,1].set_xlabel(sports[2])
ax[1,1].set_ylabel(sports[8])

plt.show()
```

Using the library ``seaborn``, you can obtain a scatterplot matrix of all the sports

```
import seaborn as sns
```
and use

```
g = sns.pairplot(df, vars=list(sports) + [score], kind="scatter", diag_kind="hist", height=1.2)
g.fig.suptitle("Scatterplot matrix", y=1.05)
```

In [None]:
# import seaborn here


In [None]:
# scatterplot matrix here


# Linear Regression

We don't know how the ``SCORE`` is computed! As a first guess, we may assume that it depends linearly on the
results of the athletes. In this part, you will try to recover the variable ``SCORE`` by fitting a linear model of the form
$$
 \text{SCORE} = \beta_0 + \beta_1 \text{C100} + \beta_2 \text{SLONG} +\ldots+\beta_{10} \text{C1500} + \text{error}
$$
with $\beta_0,\ldots,\beta_{10}$ unknown parameters to determine.


The first step is to implement the computation of the least-squares estimator (which is also the ML estimator under Gaussian errors)
$$
\hat{\boldsymbol\beta} = (X^TX)^{-1} X^T Y
$$
using ``numpy``.

1. First, add a column of ones to the matrix $X$. You can use
  ```
  n, p = X.shape
  one = np.ones( (n, 1), dtype=X.dtype)
  X = np.hstack( [one, X] )
  ```
  in order to obtain a matrix of size $(n,p+1)$.


In [None]:
# add CONST to variables names
coefnames = pd.Index(["CONST"]).append(sports)
coefnames

In [None]:
# get n and p values here


In [None]:
# add column with constant one to X here


2. Next, implement the computation of the vector $\hat{\boldsymbol\beta}$ using the ``T`` attribute, the ``dot`` method
   and the ``np.linalg.inv`` function.
   
   Using
   ```
      pd.DataFrame(beta.reshape((1,p+1)), columns = coefnames, index=['Beta'])   
   ```
   you should get the following result:


|      | CONST | C100 | SLONG | LPOIDS | SHAUT | C400 | C110 | LDISQ |SPERCH | LJAVE | C1500 |
|------|-------|------|-------|--------|-------|------|------|-------|-------|-------|-------|
| **Beta**     | 9493.686831 | -198.535205 | 207.018676 | 60.120172 | 958.607525 | -57.153815| -129.792204| 18.248976 |258.625203| 14.617478| -6.259179 |
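
As a sanity check of the formula $\hat{\boldsymbol\beta} = (X^TX)^{-1} X^T Y$, the sketch below applies it to made-up data generated from known coefficients. All names (``X_raw``, ``X_design``, ``beta_true``, ``y_demo``) are hypothetical and unrelated to the Olympic data. It also compares the result with ``np.linalg.lstsq``, a least-squares solver that avoids forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
n_demo, p_demo = 40, 3
X_raw = rng.normal(size=(n_demo, p_demo))            # hypothetical predictors
beta_true = np.array([5.0, 1.0, -2.0, 0.5])          # intercept + 3 slopes
X_design = np.hstack([np.ones((n_demo, 1)), X_raw])  # add the constant column
y_demo = X_design @ beta_true + 0.01 * rng.normal(size=n_demo)

# Normal equations: beta = (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X_design.T.dot(X_design)).dot(X_design.T).dot(y_demo)

# Same estimate from a least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))  # True
```

With very little noise, the estimate should land close to ``beta_true``; on real data, the explicit inverse can be numerically fragile when columns are nearly collinear, which is why a solver is often preferred in practice.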



In [None]:
# compute beta in a numpy array here


In [None]:
# copy the result in a dataframe here


3. Compute the **predicted** values of the model, i.e.

   $$
     \hat{Y} = X \hat{\boldsymbol\beta}
   $$
   and the **residuals** of the model, i.e.
   $$
     \hat{\boldsymbol\epsilon} = Y - \hat{Y}
   $$
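
A minimal sketch on made-up data (``X_design`` and ``y_demo`` are hypothetical) illustrates both quantities, together with a useful property of least-squares residuals: they are orthogonal to every column of $X$. Because $X$ contains a constant column, the residuals also sum to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical design matrix: a constant column plus two predictors
X_design = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
y_demo = rng.normal(size=30)

beta_demo, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
yhat_demo = X_design @ beta_demo   # predicted values
res_demo = y_demo - yhat_demo      # residuals

# Residuals are orthogonal to the columns of X, so X^T res = 0
print(np.allclose(X_design.T @ res_demo, 0.0))  # True
print(np.isclose(res_demo.sum(), 0.0))          # True
```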


In [None]:
# compute `yhat` here


In [None]:
# compute the residual `res` here


4. Show the scatterplots of $(y,\hat{y})$ and $(y,\hat{\boldsymbol\epsilon})$ and interpret them.

In [None]:
fig, ax = plt.subplots(2,1,figsize=(16,10)) 

ax[0].scatter(x=yhat, y=y)
ax[0].set_xlabel("Predicted SCORE")
ax[0].set_ylabel("SCORE")

ax[1].scatter(x=res, y=y)
ax[1].set_xlabel("Residuals")
ax[1].set_ylabel("SCORE")

plt.show()

5. Compute the $R^2$ value and conclude

   Three sums of squares are commonly used to measure how well the regression model fits the data:

- Sum of Squares Total (SST): the sum of squared differences between the observed values $y_i$ and the mean $\bar{y}$ of the response variable.

    $$ \text{SST} = \sum (y_i -\bar{y})^2$$

- Sum of Squares Regression (SSR): the sum of squared differences between the predicted values $\hat{y}_i$ and the mean $\bar{y}$ of the response variable.

    $$ \text{SSR} = \sum (\hat{y}_i -\bar{y})^2$$

- Sum of Squares Error (SSE): the sum of squared differences between the predicted values $\hat{y}_i$ and the observed values $y_i$.

    $$ \text{SSE} = \sum (\hat{y}_i -y_i)^2 = \sum \hat{\epsilon}_i^2$$

- According to the Pythagorean theorem, we have the following equality:
$$ \begin{aligned}
  ||Y-\bar{Y}||^2 &= ||\hat{Y}-\bar{Y}||^2 + ||\hat{\epsilon}||^2\\
  \text{SST} &= \text{SSR} + \text{SSE}
  \end{aligned}  
$$
i.e. "Total variance" = "Explained variance" + "Residual variance"

- The $R^2$ is computed as the ratio of explained variance to total variance
$$
R^2 = \frac{\text{SSR}}{\text{SST}}
$$

The closer it is to 1, the more relevant the model, and the closer it is to 0, the less convincing the model.
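
The decomposition and the $R^2$ can be checked numerically. The sketch below uses made-up data (``X_design`` and ``y_demo`` are hypothetical) and a model that includes a constant column, which is what makes the identity $\text{SST} = \text{SSR} + \text{SSE}$ hold.

```python
import numpy as np

rng = np.random.default_rng(4)
n_demo = 50
# Hypothetical design matrix with a constant column and two predictors
X_design = np.hstack([np.ones((n_demo, 1)), rng.normal(size=(n_demo, 2))])
y_demo = X_design @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n_demo)

beta_demo, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
yhat_demo = X_design @ beta_demo
ymean_demo = y_demo.mean()

sst = np.sum((y_demo - ymean_demo) ** 2)     # total
ssr = np.sum((yhat_demo - ymean_demo) ** 2)  # explained
sse = np.sum((y_demo - yhat_demo) ** 2)      # residual

r2 = ssr / sst
print(np.isclose(sst, ssr + sse))       # True
print(np.isclose(r2, 1.0 - sse / sst))  # True
```

Both expressions, $\text{SSR}/\text{SST}$ and $1 - \text{SSE}/\text{SST}$, give the same value here; they can differ if the model has no intercept.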


In [None]:
# Compute ymean here


In [None]:
# compute the R2 here
