# Basic statistics

Here, the goal is to become familiar with several basic statistical concepts that are needed later on for Machine Learning.
We are going to generate a simple dataset ourselves and calculate some standard quantities that describe it.

First, let's import some standard libraries and generate the dataset.

In [1]:
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
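rng = np.random.RandomState(0)  # fixed seed, so the dataset is reproducible
n_samples = 10000
# Covariance matrix: the diagonal holds the variances (standard deviations 1, 1.3, 1.5),
# the off-diagonal entries hold the covariances (e.g. 0.7*1*1.3 is correlation 0.7
# times the two standard deviations).
# Note: the (x1, x3) entry is 0.39, i.e. a correlation of 0.39/(1*1.5) = 0.26.
cov = [[1,         0.7*1*1.3, 0.3*1.3*1],
       [0.7*1*1.3, 1.3**2,    0],
       [0.3*1.3*1, 0,         1.5**2]]
data = rng.multivariate_normal(mean=[1, -1, 0], cov=cov, size=n_samples)
data = pd.DataFrame(data, columns=["x1", "x2", "x3"])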

Next, make scatter plots of this dataset using matplotlib to visualise the range of values taken by $x_1$, $x_2$ and $x_3$.
A 2D plot of $x_1$ versus $x_2$, $x_2$ versus $x_3$ or $x_1$ versus $x_3$ shows how the variables relate to one another.

This quickly becomes cumbersome when there are many variables, because the number of pairs grows quadratically with the number of variables. We are going to discuss methods to get around this later in the lecture.

Tip: use `plt.scatter` for each pair of variables, or use the seaborn module to plot all pairs in one go. See the documentation for `sns.pairplot`.
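
A minimal sketch of the `plt.scatter` approach, one panel per pair of variables. It regenerates the dataset with the same settings as the cell above so that it runs on its own; the figure size, marker size and transparency are arbitrary choices, not part of the exercise.

```python
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Regenerate the dataset with the same settings as the cell above.
rng = np.random.RandomState(0)
cov = [[1, 0.7 * 1 * 1.3, 0.3 * 1.3 * 1],
       [0.7 * 1 * 1.3, 1.3**2, 0],
       [0.3 * 1.3 * 1, 0, 1.5**2]]
data = pd.DataFrame(
    rng.multivariate_normal(mean=[1, -1, 0], cov=cov, size=10000),
    columns=["x1", "x2", "x3"],
)

# All unordered pairs of columns: (x1, x2), (x1, x3), (x2, x3).
pairs = list(itertools.combinations(data.columns, 2))

# One scatter plot per pair, side by side.
fig, axes = plt.subplots(1, len(pairs), figsize=(15, 4))
for ax, (a, b) in zip(axes, pairs):
    ax.scatter(data[a], data[b], s=2, alpha=0.3)  # small, transparent markers
    ax.set_xlabel(a)
    ax.set_ylabel(b)
fig.tight_layout()
```

In a notebook the figure is rendered automatically; in a script, call `plt.show()` afterwards. With seaborn, `sns.pairplot(data)` draws the same pairs in a single call and adds a per-variable distribution on the diagonal.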

What are the mean and variance of each variable, and the covariance of each pair of variables?
Can you explain how they can be interpreted geometrically from the plots above?
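
One possible way to estimate these quantities is with the pandas methods `DataFrame.mean`, `DataFrame.cov` and `DataFrame.corr`. The sketch below regenerates the dataset with the same settings as above so that it is self-contained. As a hint for the geometric interpretation: the means locate the centre of each point cloud, the variances set its spread along each axis, and the sign and size of a covariance determine how strongly the cloud is tilted.

```python
import numpy as np
import pandas as pd

# Regenerate the dataset with the same settings as the cell above.
rng = np.random.RandomState(0)
cov = [[1, 0.7 * 1 * 1.3, 0.3 * 1.3 * 1],
       [0.7 * 1 * 1.3, 1.3**2, 0],
       [0.3 * 1.3 * 1, 0, 1.5**2]]
data = pd.DataFrame(
    rng.multivariate_normal(mean=[1, -1, 0], cov=cov, size=10000),
    columns=["x1", "x2", "x3"],
)

means = data.mean()       # sample mean of each column
covariance = data.cov()   # 3x3 sample covariance matrix (normalised by N - 1)
correlation = data.corr() # covariance rescaled by the standard deviations

print("Means:\n", means)
print("Covariance matrix:\n", covariance)
print("Correlation matrix:\n", correlation)
```

The estimates should be close to, but not exactly equal to, the `mean` and `cov` values used to generate the data, since they are computed from a finite sample.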

Tip 1: Try standardising the variables: subtract the mean, divide by the standard deviation (the square root of the variance), and plot the variables again. If you fit a line to the data, what is the slope? What is the correlation coefficient?
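
A sketch of Tip 1 for the pair $x_1$, $x_2$, assuming an ordinary least-squares straight-line fit with `np.polyfit` (the tip does not prescribe a fitting method). The dataset is regenerated with the same settings as above so that the block is self-contained.

```python
import numpy as np
import pandas as pd

# Regenerate the dataset with the same settings as the cell above.
rng = np.random.RandomState(0)
cov = [[1, 0.7 * 1 * 1.3, 0.3 * 1.3 * 1],
       [0.7 * 1 * 1.3, 1.3**2, 0],
       [0.3 * 1.3 * 1, 0, 1.5**2]]
data = pd.DataFrame(
    rng.multivariate_normal(mean=[1, -1, 0], cov=cov, size=10000),
    columns=["x1", "x2", "x3"],
)

# Standardise: zero mean and unit standard deviation for every column.
z = (data - data.mean()) / data.std()

# Least-squares fit of z2 = slope * z1 + intercept.
slope, intercept = np.polyfit(z["x1"].to_numpy(), z["x2"].to_numpy(), deg=1)

# Pearson correlation coefficient of the original (unstandardised) variables.
r = np.corrcoef(data["x1"], data["x2"])[0, 1]

print("slope     =", slope)
print("intercept =", intercept)
print("corr(x1, x2) =", r)
```

For standardised variables, the least-squares slope coincides with the correlation coefficient and the intercept vanishes, which is one way to read the correlation directly off the plot.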

Tip 2: Try using `np.linalg.eig(covariance_matrix)` to decompose the covariance matrix into eigenvalues and eigenvectors, and try interpreting them.
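
A sketch of Tip 2. It uses `np.linalg.eigh` rather than `np.linalg.eig`; this is a substitution made here, not part of the tip. `eigh` is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order. The dataset is regenerated as a plain NumPy array with the same settings as above.

```python
import numpy as np

# Regenerate the dataset with the same settings as the cell above.
rng = np.random.RandomState(0)
cov = [[1, 0.7 * 1 * 1.3, 0.3 * 1.3 * 1],
       [0.7 * 1 * 1.3, 1.3**2, 0],
       [0.3 * 1.3 * 1, 0, 1.5**2]]
samples = rng.multivariate_normal(mean=[1, -1, 0], cov=cov, size=10000)

# Sample covariance matrix (columns are variables, rows are observations).
sample_cov = np.cov(samples, rowvar=False)

# Eigenvalues (ascending) and eigenvectors (as columns of eigvecs).
eigvals, eigvecs = np.linalg.eigh(sample_cov)

# Rotate the centred data onto the eigenvector axes.
projected = (samples - samples.mean(axis=0)) @ eigvecs

# In the rotated coordinates the covariance matrix is diagonal,
# with the eigenvalues as the variances along each new axis.
projected_cov = np.cov(projected, rowvar=False)

print("Eigenvalues:", eigvals)
print("Eigenvectors (columns):\n", eigvecs)
print("Covariance in the eigenvector basis:\n", np.round(projected_cov, 6))
```

The eigenvectors give the directions of the principal axes of the point cloud, and each eigenvalue is the variance of the data along the corresponding axis. In the rotated coordinates the variables are uncorrelated.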
