# Exercise 4 : Linear Regression

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

## Setup : Import the Dataset

Dataset from Kaggle : The **"House Prices"** competition     
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is `train.csv`; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
houseData = pd.read_csv('train.csv')
houseData.head()

---

## Problem 1 : Predicting SalePrice using GrLivArea

Extract the required variables from the dataset, as mentioned in the problem.     

In [None]:
houseGrLivArea = pd.DataFrame(houseData['GrLivArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])

Plot `houseSalePrice` against `houseGrLivArea` using a standard Seaborn `jointplot`.

In [None]:
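# Recent Seaborn versions accept x and y only as keyword arguments
sb.jointplot(x = houseGrLivArea['GrLivArea'], y = houseSalePrice['SalePrice'], height = 12, color = "coral")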

Import the `LinearRegression` model from `sklearn.linear_model`.

In [None]:
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
linreg = LinearRegression()

Prepare both datasets by splitting them into Train and Test sets.   
The Train Set takes the first 1100 samples and the Test Set takes the last 360 samples.

In [None]:
# Split the dataset into Train and Test       
houseGrLivArea_train = pd.DataFrame(houseGrLivArea[:1100])
houseGrLivArea_test  = pd.DataFrame(houseGrLivArea[-360:])
houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
houseSalePrice_test  = pd.DataFrame(houseSalePrice[-360:])

# Check the sample sizes
print("Train Set :", houseGrLivArea_train.shape, houseSalePrice_train.shape)
print("Test Set  :", houseGrLivArea_test.shape, houseSalePrice_test.shape)

Fit the Linear Regression model on `houseGrLivArea_train` and `houseSalePrice_train`.

In [None]:
linreg.fit(houseGrLivArea_train, houseSalePrice_train)

#### Visual Representation of the Linear Regression Model

Check the coefficients of the Linear Regression model you just fit.

In [None]:
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)
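
For a single predictor, the intercept `b` and the coefficient `a` have a closed form, which can help in reading the printed values. The sketch below is only an illustration: it uses a small synthetic dataset (not the House Prices data), and the names `x`, `y`, `a`, `b` are hypothetical stand-ins.

```python
import numpy as np

# Synthetic data for illustration only (NOT the House Prices dataset)
rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, size=200)                  # stand-in for a predictor
y = 20000 + 100 * x + rng.normal(0, 5000, size=200)   # stand-in for the response

# Closed-form least-squares estimates for the line y = a * x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

print('Intercept \t: b = ', b)
print('Coefficient \t: a = ', a)
```

Read this way, `a` is the estimated change in the response per unit increase in the predictor, and `b` is the predicted response when the predictor is zero.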

Plot the regression line based on the coefficients-intercept form.

In [None]:
# Formula for the Regression line
regline_x = houseGrLivArea_train
regline_y = linreg.intercept_ + linreg.coef_ * houseGrLivArea_train

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(houseGrLivArea_train, houseSalePrice_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

#### Prediction of Response based on the Predictor

Predict `SalePrice` given `GrLivArea` in the Test dataset.

In [None]:
# Predict SalePrice values corresponding to GrLivArea
houseSalePrice_test_pred = linreg.predict(houseGrLivArea_test)

# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(houseGrLivArea_test, houseSalePrice_test, color = "green")
plt.scatter(houseGrLivArea_test, houseSalePrice_test_pred, color = "red")
plt.show()

#### Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Train Set.    
Metric : Explained Variance or R^2 on the Train Set.

In [None]:
print("Explained Variance (R^2) \t:", linreg.score(houseGrLivArea_train, houseSalePrice_train))

Check how good the predictions are on the Test Set.    
Metric : Explained Variance or R^2 on the Test Set.

In [None]:
print("Explained Variance (R^2) \t:", linreg.score(houseGrLivArea_test, houseSalePrice_test))
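
The value returned by `score` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a minimal sketch of that formula, the hypothetical helper `r_squared` below computes it with NumPy on a few made-up numbers.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot for true and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])

print(r_squared(y_true, y_true))                       # perfect predictions
print(r_squared(y_true, np.full(4, y_true.mean())))    # always predicting the mean
print(r_squared(y_true, y_true[::-1]))                 # worse than predicting the mean
```

The last case shows that R^2 can be negative, which may happen on a Test Set when the model predicts worse than simply using the mean of the true values.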

#### You should also try the following

* Convert `SalePrice` to `log(SalePrice)` in the beginning and then use it for Regression     
  Code : `houseSalePrice = pd.DataFrame(np.log(houseData['SalePrice']))`    
  
* Perform a *Random Train-Test Split* on the dataset before you start with the Regression      
  Note : Check the preparation notebook `M3 LinearRegression.ipynb` for the code
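
The two suggestions above can be combined in a short sketch. It assumes scikit-learn's `train_test_split` for the random split (the preparation notebook may do this differently) and uses a synthetic stand-in `demoData` instead of `houseData`, so the numbers carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for houseData (illustration only, 1460 rows)
rng = np.random.default_rng(1)
demoData = pd.DataFrame({"GrLivArea": rng.uniform(500, 4000, size=1460)})
demoData["SalePrice"] = 20000 + 100 * demoData["GrLivArea"] + rng.normal(0, 5000, size=1460)

X = pd.DataFrame(demoData["GrLivArea"])
y = pd.DataFrame(np.log(demoData["SalePrice"]))   # log-transformed response

# Random split that keeps the 1100 / 360 sizes; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=360, random_state=42)

print("Train Set :", X_train.shape, y_train.shape)
print("Test Set  :", X_test.shape, y_test.shape)
```

With a log-transformed response, the model predicts `log(SalePrice)`, so predictions need `np.exp` to be read as prices.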

---

## Problem 2 : Predicting SalePrice using LotArea

Extract the required variables from the dataset, as mentioned in the problem.     

In [None]:
housePredictor = pd.DataFrame(houseData['LotArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
# Recent Seaborn versions accept x and y only as keyword arguments
sb.jointplot(x = housePredictor['LotArea'], y = houseSalePrice['SalePrice'], height = 12)

#### Linear Regression on SalePrice vs Predictor

In [None]:
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Split the dataset into Train and Test       
housePredictor_train = pd.DataFrame(housePredictor[:1100])
housePredictor_test  = pd.DataFrame(housePredictor[-360:])
houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
houseSalePrice_test  = pd.DataFrame(houseSalePrice[-360:])

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)

#### Visual Representation of the Linear Regression Model

In [None]:
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * housePredictor_train

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

#### Prediction of Response based on the Predictor

In [None]:
# Predict SalePrice values corresponding to LotArea
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color = "red")
plt.show()

#### Goodness of Fit of the Linear Regression Model

In [None]:
print("Explained Variance (R^2) on Train Set \t:", linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:", linreg.score(housePredictor_test, houseSalePrice_test))

---

## Problem 2 : Predicting SalePrice using TotalBsmtSF

Extract the required variables from the dataset, as mentioned in the problem.     

In [None]:
housePredictor = pd.DataFrame(houseData['TotalBsmtSF'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
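# Recent Seaborn versions accept x and y only as keyword arguments
sb.jointplot(x = housePredictor['TotalBsmtSF'], y = houseSalePrice['SalePrice'], height = 12)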

#### Linear Regression on SalePrice vs Predictor

In [None]:
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Split the dataset into Train and Test       
housePredictor_train = pd.DataFrame(housePredictor[:1100])
housePredictor_test  = pd.DataFrame(housePredictor[-360:])
houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
houseSalePrice_test  = pd.DataFrame(houseSalePrice[-360:])

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)

#### Visual Representation of the Linear Regression Model

In [None]:
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * housePredictor_train

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

#### Prediction of Response based on the Predictor

In [None]:
# Predict SalePrice values corresponding to TotalBsmtSF
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color = "red")
plt.show()

#### Goodness of Fit of the Linear Regression Model

In [None]:
print("Explained Variance (R^2) on Train Set \t:", linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:", linreg.score(housePredictor_test, houseSalePrice_test))

---

## Problem 2 : Predicting SalePrice using GarageArea

Extract the required variables from the dataset, as mentioned in the problem.     

In [None]:
housePredictor = pd.DataFrame(houseData['GarageArea'])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])
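# Recent Seaborn versions accept x and y only as keyword arguments
sb.jointplot(x = housePredictor['GarageArea'], y = houseSalePrice['SalePrice'], height = 12)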

#### Linear Regression on SalePrice vs Predictor

In [None]:
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Split the dataset into Train and Test       
housePredictor_train = pd.DataFrame(housePredictor[:1100])
housePredictor_test  = pd.DataFrame(housePredictor[-360:])
houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
houseSalePrice_test  = pd.DataFrame(houseSalePrice[-360:])

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)

#### Visual Representation of the Linear Regression Model

In [None]:
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line
regline_x = housePredictor_train
regline_y = linreg.intercept_ + linreg.coef_ * housePredictor_train

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_train, houseSalePrice_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

#### Prediction of Response based on the Predictor

In [None]:
# Predict SalePrice values corresponding to GarageArea
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(housePredictor_test, houseSalePrice_test, color = "green")
plt.scatter(housePredictor_test, houseSalePrice_test_pred, color = "red")
plt.show()

#### Goodness of Fit of the Linear Regression Model

In [None]:
print("Explained Variance (R^2) on Train Set \t:", linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:", linreg.score(housePredictor_test, houseSalePrice_test))

---

## Extra : Predicting SalePrice using Multiple Variables

Extract the required variables from the dataset, and then perform Multiple Linear Regression (one response, several predictors).     

In [None]:
housePredictor = pd.DataFrame(houseData[['GrLivArea','LotArea','TotalBsmtSF','GarageArea']])
houseSalePrice = pd.DataFrame(houseData['SalePrice'])

#### Linear Regression on SalePrice vs Predictor

In [None]:
# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Split the dataset into Train and Test       
housePredictor_train = pd.DataFrame(housePredictor[:1100])
housePredictor_test  = pd.DataFrame(housePredictor[-360:])
houseSalePrice_train = pd.DataFrame(houseSalePrice[:1100])
houseSalePrice_test  = pd.DataFrame(houseSalePrice[-360:])

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(housePredictor_train, houseSalePrice_train)

#### Coefficients of the Linear Regression Model

Note that you CANNOT visualize this model as a line on a 2D plot: with four predictors, the fitted model is a hyperplane in a five-dimensional space (four predictors plus the response).

In [None]:
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)
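
With several predictors, `coef_` holds one coefficient per predictor column, in column order, and the intercept is estimated alongside them. As a rough sketch of the underlying least-squares problem, the example below uses NumPy's `lstsq` on synthetic, noise-free data with hypothetical coefficients; it is not a description of how scikit-learn works internally.

```python
import numpy as np

# Synthetic, noise-free data for illustration only
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 4))           # four synthetic predictors
true_coef = np.array([3.0, -1.0, 0.5, 2.0])    # hypothetical coefficients
true_intercept = 7.0                           # hypothetical intercept
y = true_intercept + X @ true_coef

# Prepend a column of ones so the intercept is estimated with the coefficients
design = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)

b, a = solution[0], solution[1:]
print('Intercept \t: b = ', b)
print('Coefficients \t: a = ', a)
```

Each coefficient is the estimated change in the response per unit increase in its predictor while the other predictors are held fixed, so coefficients of predictors on different scales are not directly comparable.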

#### Prediction of Response based on the Predictor

In [None]:
# Predict SalePrice values corresponding to the four predictors
houseSalePrice_train_pred = linreg.predict(housePredictor_train)
houseSalePrice_test_pred = linreg.predict(housePredictor_test)

# Plot the Predictions vs the True values
f, axes = plt.subplots(1, 2, figsize=(24, 12))
axes[0].scatter(houseSalePrice_train, houseSalePrice_train_pred, color = "blue")
axes[0].plot(houseSalePrice_train, houseSalePrice_train, 'w-', linewidth = 1)
axes[0].set_xlabel("True values of the Response Variable (Train)")
axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
axes[1].scatter(houseSalePrice_test, houseSalePrice_test_pred, color = "green")
axes[1].plot(houseSalePrice_test, houseSalePrice_test, 'w-', linewidth = 1)
axes[1].set_xlabel("True values of the Response Variable (Test)")
axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
plt.show()

#### Goodness of Fit of the Linear Regression Model

In [None]:
print("Explained Variance (R^2) on Train Set \t:", linreg.score(housePredictor_train, houseSalePrice_train))
print("Explained Variance (R^2) on Test Set \t:", linreg.score(housePredictor_test, houseSalePrice_test))

---

## Interpretation and Discussion

Now that you have performed Linear Regression of `SalePrice` against the four variables `GrLivArea`, `LotArea`, `TotalBsmtSF` and `GarageArea`, compare and contrast the Explained Variance (R^2) of the models to determine which one is best for predicting `SalePrice`. What do you think?
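
One way to organize this comparison is to collect the Train and Test R^2 of every model in a single table. The helper `compare_r2` below is a hypothetical sketch, demonstrated on synthetic columns (`strong`, `weak`, `target`) rather than the House Prices data; it reuses the same first-1100 / remaining-rows split as above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compare_r2(data, predictor_sets, response, n_train=1100):
    """Fit one model per predictor set and return Train/Test R^2 as a DataFrame."""
    rows = []
    for predictors in predictor_sets:
        X, y = data[predictors], data[response]
        # First n_train rows for training, the remaining rows for testing
        X_train, X_test = X.iloc[:n_train], X.iloc[n_train:]
        y_train, y_test = y.iloc[:n_train], y.iloc[n_train:]
        model = LinearRegression().fit(X_train, y_train)
        rows.append({
            "Predictors": ", ".join(predictors),
            "Train R^2": model.score(X_train, y_train),
            "Test R^2": model.score(X_test, y_test),
        })
    return pd.DataFrame(rows)

# Synthetic data for illustration only
rng = np.random.default_rng(3)
demo = pd.DataFrame({"strong": rng.normal(size=1460), "weak": rng.normal(size=1460)})
demo["target"] = 5 * demo["strong"] + 0.5 * demo["weak"] + rng.normal(size=1460)

result = compare_r2(demo, [["strong"], ["weak"], ["strong", "weak"]], "target")
print(result)
```

On the actual data, the call might look like `compare_r2(houseData, [['GrLivArea'], ['LotArea'], ['TotalBsmtSF'], ['GarageArea'], ['GrLivArea', 'LotArea', 'TotalBsmtSF', 'GarageArea']], 'SalePrice')`. Keep in mind that adding predictors never lowers the Train R^2 of a least-squares fit, so the Test R^2 is the fairer basis for comparison.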