# Correlations


1. [Correlation computation and scatterplots](#section1)
2. [Scatterplot matrix](#section2)
3. [Heatmaps](#section3)

This notebook introduces an additional library, [seaborn](https://seaborn.pydata.org/), for statistical data visualization.


In [27]:
import pandas as pd
import numpy as np
import seaborn as sns

We'll work with the [California Housing data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)



In [None]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/housing.csv'
house_df = pd.read_csv(url)
house_df.head()

<a id='section1'></a>

### 1. Correlation computation and scatterplots

In [None]:
house_df.plot.scatter(x = 'median_house_value', y = 'median_income')

In [None]:
house_df[['median_income', 'median_house_value']].corr(method='pearson')
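To make clear what `method='pearson'` computes, here is a minimal sketch that derives the coefficient by hand and compares it with pandas. The `income` and `value` series are made-up illustrative numbers, not rows from the housing data.

```python
import numpy as np
import pandas as pd

# Small made-up sample (illustrative values, not taken from the housing data)
income = pd.Series([1.5, 2.0, 3.5, 4.0, 6.0])
value = pd.Series([90.0, 120.0, 180.0, 210.0, 330.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the column means
dx = income - income.mean()
dy = value - value.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient computed by pandas
r_pandas = income.corr(value, method='pearson')

print(r_manual, r_pandas)
```

The two numbers should agree up to floating-point error, and the result always lies between -1 and 1.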

In [None]:
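# numeric_only=True (pandas >= 1.5) restricts the computation to numeric columns;
# pandas 2.0+ raises an error if the frame contains non-numeric columns otherwise
house_df.corr(method='pearson', numeric_only=True)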

In [None]:
house_df.plot.scatter(x = 'total_bedrooms', y = 'households')

##### Nearly the same plot, using matplotlib's `plt.scatter` function:

In [None]:
import matplotlib.pyplot as plt 
plt.scatter(house_df['total_bedrooms'], house_df['households'])

##### Using seaborn:

In [None]:
sns.scatterplot(data=house_df, x='total_bedrooms', y='households')

##### Using seaborn with a regression line:

In [None]:
sns.regplot(data=house_df, x='total_bedrooms', y='households')

#### Scatterplots and correlations still work when some data are missing

In [None]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 3, np.nan, 8, 3, 18, 25])
example_df = pd.DataFrame({'x': x, 'y': y})
example_df

In [None]:
example_df.plot.scatter(x = 'x', y = 'y')

In [None]:
example_df.corr(method='pearson') 
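`.corr` skips missing values pairwise, so the result should match what we get after dropping the incomplete rows ourselves. A small sketch of that check, rebuilding the same `example_df` as above (the names `r_full`, `complete`, and `r_complete` are introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Same construction as the example above: y is shorter than x and contains a NaN,
# so after index alignment rows 3, 8 and 9 have a missing y
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 3, np.nan, 8, 3, 18, 25])
example_df = pd.DataFrame({'x': x, 'y': y})

# Correlation on the full frame (rows with a missing value are skipped)
r_full = example_df.corr(method='pearson').loc['x', 'y']

# Correlation after explicitly dropping the incomplete rows
complete = example_df.dropna()
r_complete = complete['x'].corr(complete['y'])

print(len(example_df), len(complete))
print(r_full, r_complete)
```

Only the rows where both `x` and `y` are present contribute to the coefficient.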

In [None]:
example_df.corr(method='spearman') 

In [None]:
example_df.corr(method='kendall') 
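The Spearman coefficient can be read as a Pearson correlation computed on ranks rather than on raw values, which is why it is less sensitive to outliers such as the large `y` values here. A minimal sketch of that relationship, assuming average ranks for ties (the default of `.rank()`); `rho_manual` and `rho_pandas` are illustrative names:

```python
import numpy as np
import pandas as pd

# Rebuild the example and keep only the complete rows
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 3, np.nan, 8, 3, 18, 25])
complete = pd.DataFrame({'x': x, 'y': y}).dropna()

# Rank each column (tied values share the average rank), then apply Pearson
ranks = complete.rank()
rho_manual = ranks.corr(method='pearson').loc['x', 'y']

# Spearman as computed directly by pandas
rho_pandas = complete.corr(method='spearman').loc['x', 'y']

print(rho_manual, rho_pandas)
```

Kendall's coefficient is different in kind: it compares the number of concordant and discordant pairs of observations instead of correlating ranks.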

<a id='section2'></a>

### 2. Scatterplot matrix

The diagonal shows the distribution (a histogram by default) of each of the three selected numeric variables.

Each off-diagonal cell of the plot matrix shows the scatterplot for one pair of those variables.

In [None]:
features = ['median_house_value', 'housing_median_age',
            'median_income']
pd.plotting.scatter_matrix(house_df[features])

In [None]:
#sns.set()
sns.pairplot(house_df[features], height = 2.5)
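The layout described above can be checked programmatically: `pd.plotting.scatter_matrix` returns an array of matplotlib axes, one per pair of columns. The sketch below uses made-up random data (`demo_df` is a hypothetical stand-in for the housing columns) and selects a non-interactive backend so it also runs as a plain script.

```python
import matplotlib
matplotlib.use('Agg')  # only for running outside a notebook; omit this line in Jupyter
import numpy as np
import pandas as pd

# Made-up data: 50 rows, 3 numeric columns
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])

# diagonal='hist' is the default; it is spelled out here to make the choice visible
axes = pd.plotting.scatter_matrix(demo_df, diagonal='hist', figsize=(6, 6))

# One subplot per (row variable, column variable) pair
print(axes.shape)
```

With three columns the grid is 3 by 3: three histograms on the diagonal and six scatterplots, each pair appearing twice with the axes swapped.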

<a id='section3'></a>

### 3. Heatmaps

##### Pandas has no built-in heatmap function, but we can approximate one by coloring the cells of the correlation table:

In [None]:
correlation_matrix = house_df[features].corr()
correlation_matrix.style.background_gradient(cmap='coolwarm')

In [None]:
correlation_matrix.style.background_gradient(cmap='Blues')

##### Or we can use seaborn

In [None]:
features = ['median_house_value', 'housing_median_age', 'median_income',
            'total_bedrooms', 'population']
correlation_matrix = house_df[features].corr().round(2)
sns.heatmap(data=correlation_matrix, cmap='Greens', annot=True)
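A correlation matrix is symmetric, so half of the heatmap repeats the other half. One common refinement, sketched here on made-up data (`demo_df` and `lower_only` are hypothetical names), is to build a boolean mask that hides the cells above the diagonal.

```python
import numpy as np
import pandas as pd

# Made-up numeric data standing in for a few housing columns
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(100, 4)), columns=['a', 'b', 'c', 'd'])
correlation_matrix = demo_df.corr().round(2)

# True marks the cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1)

# Preview of what remains: masked cells become NaN
lower_only = correlation_matrix.mask(mask)
print(lower_only)
```

`sns.heatmap` accepts a `mask` argument, so passing `mask=mask` alongside `data=correlation_matrix` should leave the hidden cells blank in the plot.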

---
> ##### Summary
>
>* `.corr` - compute pairwise correlation of columns, excluding NA/null values. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
>
>* `.corr().style.background_gradient` - color the cell backgrounds by value; the `cmap` argument selects the colormap (e.g. `'coolwarm'`, `'Blues'`)
>
>* `.plotting.scatter_matrix` - draw a matrix of scatter plots. [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html?highlight=scatter_matrix)
>
>* `.plot.scatter` - plot a scatter plot
>
> Seaborn package:
>
>* `sns.scatterplot` - a scatter plot
>
>* `sns.regplot` - a scatter plot with a regression line
>
>* `sns.pairplot` - scatter plot matrix
>
>* `sns.heatmap` - a heatmap. Pass `annot=True` to print the values inside each square
>
---