# Multiple Regression

In this notebook we walk through the steps needed to fit a Multiple Linear Regression model to data on house sales in King County, USA (https://www.kingcounty.gov) in order to predict house prices. You will:
* Upload and preprocess the data
* Write a function to compute the Multiple Regression weights
* Write a function to make predictions of the output given the input features
* Compare different models for predicting house prices

# Import all required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

# Upload, preprocess and check the data

The dataset contains house sales in King County, USA (https://www.kingcounty.gov).

In [None]:
# Import dataset
sales = pd.read_csv('housing.csv')

# Look at the table to check potential features
sales[:10]

In [None]:
# Check if the dataset contains NaN (Not-a-Number) values
sales.isnull().any()

In [None]:
# Drop all rows that contain NaN (Not-a-Number) values; the cleaned copy is stored in sales_drop
sales_drop = sales.dropna()

In [None]:
# Check some statistics of the data
sales.describe()

# Note that some variables are not good choices for linear regression,
# e.g. zipcode, lat, long, yr_renovated etc.

In [None]:
# Plot a scatter matrix of the price and a few candidate features
from pandas.plotting import scatter_matrix
plot_data_new = sales[['price', 'bedrooms', 'sqft_living', 'sqft_lot', 'sqft_above']]
sm = scatter_matrix(plot_data_new, figsize=(15, 15))

In [None]:
# Plot some feature relations
sales.plot(x = 'bedrooms', y = 'price', style = 'o')
plt.title('price vs. bedrooms')
plt.xlabel('bedrooms')
plt.ylabel('price')
plt.show()

In [None]:
# Plot some feature relations
sales.plot(x = 'sqft_living', y = 'price', style = 'o')
plt.title('price vs. sqft_living')
plt.xlabel('sqft_living')
plt.ylabel('price')
plt.show()

In [None]:
# Divide the data into 'attributes' (X, the input features) and 'labels' (y, the output to predict).
X2 = sales[['bedrooms','bathrooms']]
y2 = sales['price']

# Split data into training and testing

In [None]:
# Split data set into 80% train and 20% test data 
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.2, random_state=0)
# Look at the shape to check the split ratio
X_train.shape, X_test.shape, y_train.shape, y_test.shape

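To make the split less of a black box, here is a minimal sketch of the idea: shuffle the row indices, then cut them into a test part and a train part. The function name `simple_train_test_split` and the toy arrays are assumptions made for illustration; this is a simplified stand-in, not scikit-learn's actual implementation of `train_test_split`.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices and split them into a train and a test part."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = X.shape[0]
    n_test = int(np.ceil(test_size * n_samples))
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 2 features (made up for illustration)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_toy, y_toy, test_size=0.2)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)  # (8, 2) (2, 2) (8,) (2,)
```

Fixing the seed plays the same role as `random_state=0` in the cell above: the split is random but reproducible.
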
# Use a pre-built multiple regression function



In [None]:
# Fit scikit-learn's built-in LinearRegression model on the training data
reg = LinearRegression()
reg.fit(X_train, y_train)

In [None]:
X_train.columns

In [None]:
# See intercept and coefficients chosen by the model
print('Intercept:', reg.intercept_)
coeff_df = pd.DataFrame({'Features': X2.columns, 'Coefficients': reg.coef_}).set_index('Features')
coeff_df

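The intercept and coefficients of an ordinary least-squares fit can also be computed directly with NumPy. The sketch below is one way to write the weight and prediction functions by hand; the names `fit_weights` and `predict` and the synthetic data are assumptions for illustration, and the solver used by `LinearRegression` internally may differ.

```python
import numpy as np

def fit_weights(X, y):
    """Return (intercept, coefficients) of the least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first weight acts as the intercept
    X_design = np.column_stack([np.ones(X.shape[0]), X])
    weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return weights[0], weights[1:]

def predict(X, intercept, coefficients):
    """Predict outputs as intercept + X @ coefficients."""
    return intercept + np.asarray(X, dtype=float) @ coefficients

# Synthetic, noise-free data: y = 10 + 2*x1 - 3*x2 (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 10.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

intercept, coefficients = fit_weights(X_demo, y_demo)
print('Intercept:', intercept)
print('Coefficients:', coefficients)
```

Because the synthetic data has no noise, the fit should recover the intercept and the two slopes up to floating-point error.
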
In [None]:
# Do prediction on test data
y_pred = reg.predict(X_test)

In [None]:
# Check the differences between actual and predicted values
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Difference': y_pred - y_test},
                   columns=['Actual', 'Predicted', 'Difference']).astype(int)
df.head()

In [None]:
# Evaluate the performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

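The three error measures are simple functions of the residuals (predicted minus actual). A minimal NumPy sketch, using a hypothetical helper `regression_errors` and three made-up values, shows what each one computes:

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    residuals = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mae = np.mean(np.abs(residuals))   # average absolute deviation
    mse = np.mean(residuals ** 2)      # average squared deviation
    rmse = np.sqrt(mse)                # back in the units of y
    return mae, mse, rmse

# Residuals here are -1, 0 and 2
mae, mse, rmse = regression_errors([3, 5, 7], [2, 5, 9])
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
```

MAE and RMSE are in the same units as the target (here: price), which makes them easier to interpret than MSE.
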
In [None]:
# Performance of the model => r2_score = 1 - (variation unexplained / total variation)
# => How much of the variation is explained?
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

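The formula in the comment above, r2 = 1 - (variation unexplained / total variation), can be spelled out in a few lines of NumPy. The helper name `r2_manual` and the toy values are assumptions for illustration:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

print(r2_manual([3, 5, 7], [2, 5, 9]))  # 1 - 5/8 = 0.375
```

A perfect prediction gives 1, always predicting the mean of `y_true` gives 0, and a model that does worse than the mean gives a negative value.
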
# Check for overfitting

In [None]:
# If the model scores much better on the train set than on the test set
# (lower RMSE, higher r2) => Overfitting!
# => Compare r2 and RMSE in Test and Train!

# "Prework" needed to do the comparison: predictions on both sets
y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

print('RMSE Train:', np.sqrt(metrics.mean_squared_error(y_train, y_pred_train)))
print('RMSE Test: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred_test)))
print('R-2 Train:', r2_score(y_train, y_pred_train))
print('R-2 Test: ', r2_score(y_test, y_pred_test))

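The same train/test comparison can be used to compare models built on different feature sets. The sketch below is only an illustration on synthetic data: the helper `fit_and_score`, the column choices and all numbers are assumptions, and a plain NumPy least-squares fit stands in for `LinearRegression`.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit least squares on the train part and return (rmse_train, rmse_test)."""
    def design(X):
        # Column of ones for the intercept, followed by the features
        return np.column_stack([np.ones(X.shape[0]), X])

    weights, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)

    def rmse(X, y):
        return np.sqrt(np.mean((design(X) @ weights - y) ** 2))

    return rmse(X_train, y_train), rmse(X_test, y_test)

# Synthetic stand-in for the housing data (all numbers are made up);
# column 2 has no influence on y at all
rng = np.random.default_rng(0)
n = 200
X_all = rng.normal(size=(n, 3))
y_all = 5.0 + 4.0 * X_all[:, 0] + 1.0 * X_all[:, 1] + rng.normal(scale=0.5, size=n)
X_tr, X_te, y_tr, y_te = X_all[:160], X_all[160:], y_all[:160], y_all[160:]

feature_sets = {'col 0 only': [0], 'cols 0 and 1': [0, 1], 'all three': [0, 1, 2]}
results = {}
for name, cols in feature_sets.items():
    results[name] = fit_and_score(X_tr[:, cols], y_tr, X_te[:, cols], y_te)
    print(name, '-> RMSE train/test:', results[name])
```

Adding features can never increase the train RMSE of a least-squares fit, so the test RMSE is the number to watch when deciding whether an extra feature really helps.
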
# For more information on performance evaluation see also
* https://en.wikipedia.org/wiki/Mean_absolute_error
* https://en.wikipedia.org/wiki/Mean_squared_error
* https://en.wikipedia.org/wiki/Root-mean-square_deviation
* https://en.wikipedia.org/wiki/Coefficient_of_determination