One of the most important parts of data science is understanding your data. To do so, it is useful to visualize it and explore the links between the different variables.

This notebook introduces some visualization techniques and a way to quantify mathematically the link between variables: the Pearson correlation. You will also discover what Principal Component Analysis (PCA) is.

As in the previous notebook, we will describe the square data example and then you will have to explore the house price dataset by yourself. 


In [None]:
# Import libraries
import pandas as pd # You begin to know this library
import numpy as np # This one too
import matplotlib.pyplot as plt # This is a new one !
from mpl_toolkits.mplot3d import Axes3D # To plot in 3d
import seaborn as sns # And a second new one

[Matplotlib](https://matplotlib.org/contents.html) will allow you to make clear visualizations of your data.

[Seaborn](https://seaborn.pydata.org/) makes them look nicer and provides some additional tools.

# Square example

In [None]:
# Read the file
squares = pd.read_csv('data/squares.csv') 

## Visualization of distribution

Before visualizing, it is always interesting to hypothesize what we should observe. This way it is simpler to detect errors but also counterintuitive behavior of the data.

So for this toy dataset, I expect the first point to have coordinates between 0 and 1 on the x and y axes (because I built the dataset this way !), spread in a totally random way (close to uniform, meaning every location in that range has the same probability of being sampled). The area of the square should not depend on the first point, so the area color should also appear random.

In [None]:
plt.figure() # That allows you to create a figure
squares.plot.scatter(x='x0', y='y0', c='area', colormap='viridis') 
# c is optional, it colors each point by another column to add one more dimension
plt.show() # It allows to display what you have drawn

In order to check the uniform distribution, it can be interesting to display the histogram which reflects the empirical distribution.

In [None]:
plt.figure()
squares.x0.plot.hist(bins = 20)
plt.show()
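To make "empirical distribution" concrete, here is a minimal sketch of what a histogram computes. It does not use the squares file: `sample` is a hypothetical stand-in for a column such as `squares.x0`, drawn uniformly between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is reproducible
sample = rng.uniform(0, 1, size=10_000)   # hypothetical stand-in for squares.x0

# Raw counts: how many points fall in each of the 20 bins
counts, edges = np.histogram(sample, bins=20)

# Density: the same counts rescaled so that the bars integrate to 1
density, _ = np.histogram(sample, bins=20, density=True)
bin_width = np.diff(edges)

print(counts.sum())                   # total number of points
print((density * bin_width).sum())    # area under the histogram
print(density.round(2))               # bar heights, close to 1 for a uniform sample
```

For a uniform sample on [0, 1], every bar of the density histogram should stay close to 1; large systematic deviations would suggest that the uniform hypothesis is wrong.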

Is the side of the square uniformly distributed ?

In [None]:
# Compute side

# Display the histogram

plt.show()

This visual check is not a real statistical test. There are tests to verify this kind of assumption; we will see one later for the normal distribution.
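As an illustration of what such a test can look like, here is a rough sketch of the Kolmogorov-Smirnov distance between a sample and the uniform distribution on [0, 1]. The function name `ks_statistic_uniform` and the simulated samples are assumptions made for this example, and the threshold `1.36 / sqrt(n)` is only the usual large-sample approximation of the 5% critical value.

```python
import numpy as np

def ks_statistic_uniform(sample):
    """Largest gap between the empirical CDF and the CDF of U(0, 1)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    cdf = np.clip(x, 0.0, 1.0)                       # CDF of U(0, 1) at each point
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # empirical CDF above the theory
    d_minus = np.max(cdf - np.arange(0, n) / n)      # empirical CDF below the theory
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
n = 1000
uniform_sample = rng.uniform(0, 1, size=n)   # hypothetical uniform data
squared_sample = uniform_sample ** 2         # clearly not uniform

threshold = 1.36 / np.sqrt(n)                # approximate 5% critical value
print("threshold:", threshold)
print("uniform sample:", ks_statistic_uniform(uniform_sample))
print("squared sample:", ks_statistic_uniform(squared_sample))
```

A distance well above the threshold is evidence against uniformity; a distance below it only means that the test found no contradiction.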

In [None]:
plt.figure()
# Plot the histogram of the area

plt.show()

Interestingly, the square of the side is not uniformly distributed.

If you want to do some statistics, you can look at the following computation; otherwise skip it and let's verify that the density is $$f_{X^2}(x) = \dfrac1{2\sqrt{x}}$$

Here is the math that you can skip :)
$$F_{X^2}(x) = \mathbb{P}(X^2 \leq x) = \mathbb{P}(X \in [0,\sqrt{x}]) = \sqrt{x}$$
$$f_{X^2}(x) = \dfrac{dF_{X^2}(x)}{dx} = \dfrac1{2\sqrt{x}}$$
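If you prefer to check the first line of this computation numerically, the following sketch simulates it. It assumes, as the derivation does, that the side $X$ is uniform on [0, 1]; the variable names are made up for the example and no file is read.

```python
import numpy as np

rng = np.random.default_rng(0)
side = rng.uniform(0, 1, size=100_000)   # X ~ U(0, 1), simulated
area = side ** 2                         # X^2

# Compare the empirical CDF of X^2 with the theoretical sqrt(x) on a few points
grid = np.linspace(0.05, 1, 20)
empirical_cdf = np.array([(area <= x).mean() for x in grid])
theoretical_cdf = np.sqrt(grid)

max_gap = np.max(np.abs(empirical_cdf - theoretical_cdf))
print(max_gap)   # should be small if the derivation is right
```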

In [None]:
listX = np.linspace(10**-2, 1, 100)
listY = 1/(2*np.sqrt(listX))

In [None]:
plt.figure()
plt.plot(listX, listY) # This function plots the points of coordinates x, y
squares.area.plot.hist(bins = 100, density = True) # density=True rescales the bars so they integrate to 1
plt.show()

Using this plot, what is the theoretical mean value of the areas that we should observe ? Is it consistent with np.mean(squares.area) ?

## Principal Component Analysis

Wikipedia defines the [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) as "a **statistical procedure** that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of **linearly uncorrelated variables** called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components."

In simple words, PCA computes combinations of the different features that explain most of the variation in the data. This way, you can keep only the first few dimensions to get a good visualization of your data, or a strong compression of it.
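For intuition only, here is a minimal sketch of one standard way to obtain principal components: center the data, then take a singular value decomposition. The toy data and all variable names are hypothetical; the rest of this notebook relies on scikit-learn instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 3 features, the third being almost a copy of the first
a = rng.normal(size=500)
b = rng.normal(size=500)
data = np.column_stack([a, b, a + 0.05 * rng.normal(size=500)])

centered = data - data.mean(axis=0)                  # PCA works on centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

components = Vt                                      # each row is one principal direction
explained_variance = S**2 / (len(data) - 1)          # variance along each direction
explained_variance_ratio = explained_variance / explained_variance.sum()
projected = centered @ Vt.T                          # data expressed in the new axes

print(explained_variance_ratio.round(4))
```

Because the third feature is redundant, the last ratio should be close to 0: two components are enough to describe this toy dataset.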

In [None]:
# Import of the PCA because it would not be easy to compute it by hand !
from sklearn.decomposition import PCA

In [None]:
pca = PCA()
# Fit the pca to the given data and return the transformed data
transformed = pca.fit_transform(squares) 

In [None]:
plt.figure()
sns.barplot(x = np.arange(len(pca.explained_variance_)), 
            # np.arange(n) creates an array of the integers
            # from 0 to n - 1
            y = np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Principal Component") 
plt.ylabel("Cumulative Explained variance")
plt.axhline(0.95, linestyle=':')
plt.show()

The dotted line represents 95% of explained variance, a common threshold for deciding how many components to keep: in this case, 4 dimensions are enough to represent the data very accurately.
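Instead of reading the number of components off the plot, it can be computed from the cumulative sum. The ratios below are made-up values for illustration, not the ones of the squares dataset; with a fitted PCA you would use `pca.explained_variance_ratio_` instead.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order
ratios = np.array([0.60, 0.25, 0.11, 0.03, 0.01])
cumulative = np.cumsum(ratios)

# Index of the first component where the cumulative ratio reaches 95%, plus one
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)
```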

This is consistent with what we can expect: there is a lot of redundancy in describing a square by the coordinates of each of its points. Can you think of a simple representation of a square in 4 dimensions ?

Let's see what the PCA has found : 

In [None]:
def displayPC(pca, dataset, numberDim = 4):
    """
        Displays the first numberDim components of the 
        given pca computed on the given dataset
        
        No worries, we will explain for loops later
    """
    for dim in range(numberDim):
        print("PC {} :".format(dim) 
              + '+'.join([" {:.2f} {:s} ".format(pc, c)
                          for pc, c in zip(pca.components_[dim], 
                                           dataset.columns)]))

In [None]:
displayPC(pca, squares)

Why do you think that this representation is much more complex than the one that you were thinking about ?

However, it is still interesting to display the data in this representation. But because of the randomness of our dataset, we don't expect to see too much structure in the cloud of points.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d') # Create 3D axes attached to the figure
ax.scatter(transformed[:, 0], transformed[:, 1], transformed[:, 2],
           c=transformed[:, 3], cmap=plt.cm.viridis)
plt.show()

## Correlation

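Correlation is a **statistical relationship involving dependence** between two variables. The Pearson correlation used here measures how close this relationship is to a linear one, on a scale from -1 to 1.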
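For reference, the Pearson correlation of two samples $x$ and $y$ is
$$r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}$$
The sketch below implements this formula directly; the helper name `pearson` and the toy data are made up for the illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

x = np.linspace(-1, 1, 201)
r_linear = pearson(x, 3 * x + 2)    # exactly linear relationship
r_parabola = pearson(x, x**2)       # strong dependence, but not linear

print(r_linear, r_parabola)
print(np.corrcoef(x, 3 * x + 2)[0, 1])   # NumPy's built-in version, for comparison
```

The second case is worth remembering: `x**2` is entirely determined by `x`, yet the Pearson correlation is essentially 0 because the relationship is not linear.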

Correlations are useful because they can indicate a predictive relationship that can be **exploited in practice**. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example, there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (**correlation does not imply causation !**).

Here is an example extracted from the website [Spurious Correlation](http://tylervigen.com/spurious-correlations)

![Correlation](./img/SpuriousCorrelation.png)

It is sometimes possible to explain these correlations by a shared cause, but there is no causal connection between the two features.
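A small simulation can make the shared-cause situation concrete. Everything here is invented for the example: `temperature` plays the role of the shared cause, and the two other variables depend on it but not on each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
temperature = rng.normal(size=n)                      # hypothetical shared cause
ice_cream = temperature + 0.5 * rng.normal(size=n)    # depends on temperature only
sunburn = temperature + 0.5 * rng.normal(size=n)      # depends on temperature only

# The two effects are strongly correlated...
r = np.corrcoef(ice_cream, sunburn)[0, 1]

# ...but once the shared cause is removed, almost nothing is left
r_without_cause = np.corrcoef(ice_cream - temperature,
                              sunburn - temperature)[0, 1]

print(r, r_without_cause)
```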

Before computing the correlations of the squares dataset, do you have a hypothesis about what we should observe ?

In [None]:
correlation = squares.corr()
correlation

Hmm... Not really nice to look at. Seaborn allows us to make a much better representation:

In [None]:
plt.figure()
sns.heatmap(correlation, vmin=-1, vmax=1, cmap="bwr")
plt.show()

Seaborn also allows you to observe the distribution in parallel to the correlation

In [None]:
sns.pairplot(squares, kind="reg") # pairplot creates its own figure
# You can limit the number of plots by using vars=["var1", ...]
plt.show()

From this observation, can you make an assumption on how I created this dataset ?

What can you say about the distribution of the other points ?

# Real data

Now that you have played with these tools, let's apply them to our dataset.

In [None]:
houses = pd.read_csv('data/train.csv')

## Analysis of SalePrice

What are the mean, median, minimum and maximum sale prices, and the standard deviation ?

In [None]:
# Save the two following values:
mean = 
std = 

What would it mean if the std was 0 ?

Is the market stable over the studied period ? In order to answer this question, compare the SalePrice with the YrSold. Why is it important to look at this point ?

What is the most correlated variable with the SalePrice ?

Is the price normally distributed ?   
Define the function `normalDistribution`, the bell curve which is defined by :
$$N(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
with $\mu$ the mean and $\sigma$ the standard deviation

In [None]:
def normalDistribution(points, mean, std):
    """
        Computes the normal distribution of mean and std
        for the given points
    """
    values = 
    return values

In [None]:
plt.figure()
# Plot the distribution of SalePrice
houses.SalePrice.plot.hist(bins = 100, density=True)
# Plot the normal distribution
points = np.linspace(houses.SalePrice.min(), houses.SalePrice.max(), 10**4)
plt.plot(points, normalDistribution(points, mean, std), color='red')
plt.show()

Can you conclude ? Here a statistical tool would be great ! You will see how to correct this in the advanced notebook that deals with the improvement of models.

## Dimensionality reduction

In [None]:
numericalHouses = houses.select_dtypes(include="number")
numericalHouses = numericalHouses.fillna(-1) 
# Replace all absent data with -1
# => We will do a much better analysis in the next notebook

Let's reduce the data dimensionality

In [None]:
pca = PCA()
transformed = pca.fit_transform(numericalHouses)

In [None]:
# Plot the data on the two first components

In [None]:
# Analyze the first two components => Look at the maximum values
# How many dimensions do you need to explain 95 % of the variability ?
pca.components_[0]

Recompute a PCA without the SalePrice but plot the SalePrice as the
c color in order to know if the first components are linked with 
this price.