# 01 Linear Regression


### Dataset

* Predict house sale prices in King County, Washington State, USA. The dataset consists of historical data on houses sold between May 2014 and May 2015.
* https://www.kaggle.com/shivachandel/kc-house-data

## Analyse your Data

* Import the libraries that provide the tools we need

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# comment out the line above if it does not work (e.g. outside Jupyter)

* Load data from file

In [None]:
dataset = pd.read_csv("data/kc_house_data.csv")

In [None]:
# have a look at the first five rows
dataset.head()

In [None]:
dataset.info()

In [None]:
dataset.describe()

In [None]:
# Drop rows with missing values
dataset = dataset.dropna()

In [None]:
dataset.info()

Seaborn is another Python library that makes plotting graphs easier.
* https://seaborn.pydata.org/

In [None]:
# import the library
import seaborn as sns

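#### Plot the distribution of the target variable using histplot

In [None]:
# distplot is deprecated since seaborn 0.11; histplot (seaborn >= 0.11) is its axes-level replacement
sns.histplot(dataset['price'], kde=True)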

In [None]:
# simple analysis with one feature and one target
# we select only 2 columns: the feature and the target
df = dataset[["sqft_living", "price"]]

In [None]:
sns.lmplot(x="sqft_living", y="price", data=df, scatter_kws={'alpha':0.3})
# this seaborn plot already includes a regression line

* The plot uses a statistical method to find the regression line (see presentation).
* We want to use scikit-learn's linear regression to model more complex datasets with more features.
* For now, we implement the model for 1 feature and 1 target and check whether it matches the plot.
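
For a single feature, a least-squares regression line can be computed in closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below uses small made-up numbers (not the King County data) and a hypothetical helper name, `fit_line`. It only illustrates the idea; whether seaborn computes its line in exactly this way is not stated here.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = covariance(x, y) / variance(x)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the fitted line passes through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# made-up living areas (sqft) and prices, for illustration only
x_demo = np.array([1000, 1500, 2000, 2500, 3000])
y_demo = np.array([200000, 290000, 410000, 500000, 600000])

slope, intercept = fit_line(x_demo, y_demo)
print(slope, intercept)
```

On the same one-feature data, `LinearRegression` should give the same slope (`coef_[0]`) and intercept (`intercept_`) up to floating-point error.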

In [None]:
# create our reduced dataset with one feature and one target
x = df['sqft_living']
y = df['price']
       
print(x.shape)
np.array(x)

In [None]:
# Reshape array x into a column vector
# scikit-learn expects a 2-D feature array, so a single column must be reshaped with (-1, 1)
x = np.array(x).reshape(-1, 1)
print(x.shape, y.shape)

x
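
The reshape step is easier to see on a tiny array. A minimal sketch with made-up values: scikit-learn estimators expect features of shape `(n_samples, n_features)`, and the `-1` tells NumPy to infer the number of rows.

```python
import numpy as np

values = np.array([1180, 2570, 770])   # 1-D array of made-up living areas
column = values.reshape(-1, 1)         # -1 lets NumPy infer the number of rows

print(values.shape)   # (3,)
print(column.shape)   # (3, 1)
```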

##### Create the model, fit the model, prediction + visualisation (3 steps in one cell)

In [None]:
# Fit a simple linear regression (here on the whole dataset, without a train/test split)
from sklearn.linear_model import LinearRegression

# create the model
model = LinearRegression()

# train the model
model.fit(x, y)

# prediction
predictions = model.predict(x)

# Visualize the data and the predictions on top of each other
plt.scatter(x, y)
plt.plot(x, predictions, color='r')
plt.xlabel("sqft_living")
plt.ylabel("Price")
plt.show()


# <a id="4"></a><br> Examining and Adding More Features

In [None]:
dataset.columns

In [None]:
# we put the names of all the features in a list
features = ['bedrooms', 'bathrooms', 'sqft_living',
            'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
            'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
            'sqft_living15', 'sqft_lot15']


## <a id="6"></a><br> Complex Model 


In [None]:
# let's add all the features and set our target

In [None]:
x = dataset[features]
y = dataset['price']

In [None]:
x.shape, y.shape

In [None]:
# Split the data into train and test sets (by default 25% of the rows go to the test set)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=0)


In [None]:
# create a new model
complex_model_1 = LinearRegression()


In [None]:
complex_model_1.fit(xtrain,ytrain)

# Evaluate the model  

#### Make the predictions

In [None]:
pred1 = complex_model_1.predict(xtest)



In [None]:
from sklearn.metrics import mean_squared_error

print("\nCOMPLEX MODEL - 1")
print("features1: 'all the features'")
print('Intercept: {}'.format(complex_model_1.intercept_))
print('Coefficients: {}'.format(complex_model_1.coef_))
# root mean squared error on the test set, in the same units as the price
print('RMSE (testing): {}'.format(np.sqrt(mean_squared_error(ytest, pred1))))


( For information on printing and the .format command, see: https://pyformat.info/ )
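
A few common `.format` patterns, shown on a made-up number as a quick illustration. The f-string on the last line is an alternative spelling of the same format specification.

```python
price = 1234567.891   # made-up value, for illustration only

print('price: {}'.format(price))        # default formatting
print('price: {:.2f}'.format(price))    # two decimal places
print('price: {:,.0f}'.format(price))   # thousands separator, no decimals
print(f'price: {price:,.0f}')           # same format spec in an f-string (Python 3.6+)
```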

## R2 score - proportion of variance in the output explained by our model 


* 0% indicates that the model explains none of the variability of the response data around its mean.
* 100% indicates that the model explains all the variability of the response data around its mean.
* The r2-score is more useful for comparing models than as an accuracy metric for a single regressor.
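
The definition behind this score is R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A minimal sketch with made-up numbers; the helper name `r2_manual` is hypothetical, and its result should match `sklearn.metrics.r2_score` on the same inputs.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r2_manual(y_true, y_true)                        # every point predicted exactly
mean_only = r2_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
reversed_pred = r2_manual(y_true, y_true[::-1])            # worse than predicting the mean

print(perfect, mean_only, reversed_pred)
```

The last case shows that the score is not bounded below by 0: a model that does worse than always predicting the mean gets a negative value.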


In [None]:
print('score (training): {}'.format(complex_model_1.score(xtrain, ytrain)))

# model.score and r2_score return the same value (R2) on the test set
print('score (test): {}'.format(complex_model_1.score(xtest, ytest)))

from sklearn.metrics import r2_score
print('r2_score (test): {}'.format(r2_score(ytest, pred1)))

print('RMSE (testing): {}'.format(np.sqrt(mean_squared_error(ytest, pred1))))


#### Evaluate the Pearson cross-correlation of features

In [None]:
f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)

sns.heatmap(dataset[features].corr(), linewidths=0.25, vmax=1.0, square=True, cmap="BuGn_r", linecolor='k', annot=True)

In [None]:
sns.pairplot(dataset[['sqft_living', 'sqft_above']])
sns.pairplot(dataset[['sqft_living', 'sqft_living15']])

* sqft_above and sqft_living15 are correlated with sqft_living, so they add little new information to our model.
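
One way to spot such redundant features without reading the heatmap by eye is to list the feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below runs on synthetic data; the helper name `correlated_pairs` and the threshold of 0.8 are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, correlation) for pairs above the threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # only look at the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs

# synthetic data: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

print(correlated_pairs(demo))
```

Applied to this notebook, the call would be `correlated_pairs(dataset[features])`; which pairs it reports depends on the chosen threshold.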

In [None]:
# let's remove some features ('sqft_above' and 'sqft_living15')
features2 = ['bedrooms', 'bathrooms', 'sqft_living',
             'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
             'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
             'sqft_lot15']
x = dataset[features2]
y = dataset['price']

# Split the data into train and test sets (same random_state as before, so the same rows are held out)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=0)
complex_model_2 = LinearRegression()
complex_model_2.fit(xtrain, ytrain)
pred2 = complex_model_2.predict(xtest)

print("\nCOMPLEX MODEL - 2")
print("features2: without a couple of features")
print('Intercept: {}'.format(complex_model_2.intercept_))
print('Coefficients: {}'.format(complex_model_2.coef_))
print('score (training): {}'.format(complex_model_2.score(xtrain, ytrain)))
print('score (test): {}'.format(complex_model_2.score(xtest, ytest)))
print('RMSE (testing): {}'.format(np.sqrt(mean_squared_error(ytest, pred2))))
r2_score(ytest, pred2)

* Leaving out 2 features changed the r2_score only by a small amount, but it reduced the computational time.
* First rule: try to avoid features that don't add any information that helps the prediction.


## Coefficients (the slopes)
* Store the coefficients in a DataFrame so that the values are easier to read.
* We use pd.DataFrame, which creates a pandas DataFrame from a NumPy array.

In [None]:
coeff_df = pd.DataFrame(data = complex_model_2.coef_, index = x.columns,columns=['Coefficient'])
coeff_df

Interpreting the coefficients:

* Holding all other features fixed, a 1 unit increase in sqft_living is associated with an *increase of 17.19...* in the price
* Holding all other features fixed, a 1 unit increase in yr_built is associated with a *decrease of 357.66...* in the price
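
For a linear model this interpretation can be checked directly: raising one feature by 1 while keeping the others fixed changes the prediction by exactly that feature's coefficient. The sketch below uses a hypothetical intercept and hypothetical coefficients in plain NumPy, not the fitted `complex_model_2`.

```python
import numpy as np

# hypothetical intercept and coefficients for two features (illustration only)
intercept = 50000.0
coef = np.array([150.0, -300.0])

def predict(features):
    """Linear model: intercept + sum(coefficient * feature)."""
    return intercept + np.dot(features, coef)

house = np.array([2000.0, 1990.0])
bigger = house + np.array([1.0, 0.0])   # one extra unit of the first feature only

difference = predict(bigger) - predict(house)
print(difference)   # equals coef[0]
```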

### Residuals vs Predictions

In [None]:
residuals = ytest - pred2
plt.scatter(pred2, residuals)

In [None]:
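# distplot is deprecated since seaborn 0.11; histplot (seaborn >= 0.11) is its axes-level replacement
sns.histplot(residuals, bins=50, kde=True)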

### Evaluate the model by looking at the results instead of calculating the score

In [None]:
plt.scatter(ytest,pred2)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

Ideally, most of the points would lie along the diagonal, where the predicted value equals the true value.
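
Drawing the diagonal explicitly makes this easier to judge. A minimal sketch on made-up true values and noisy predictions (not the notebook's `ytest` and `pred2`); in the notebook the same two plotting steps would be applied to those arrays.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up true values and noisy predictions, for illustration only
rng = np.random.default_rng(0)
y_true = rng.uniform(100000, 1000000, size=100)
y_pred = y_true + rng.normal(scale=50000, size=100)

fig, ax = plt.subplots()
ax.scatter(y_true, y_pred, alpha=0.5)

# reference diagonal: points on this line are predicted exactly
low = min(y_true.min(), y_pred.min())
high = max(y_true.max(), y_pred.max())
ax.plot([low, high], [low, high], color='r')

ax.set_xlabel('Y Test')
ax.set_ylabel('Predicted Y')
```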

In [None]:
OneExample = x.loc[17384]
OneExample

In [None]:
# pass a one-row DataFrame so the feature names match those seen during fit
predictionFor17384 = complex_model_2.predict(x.loc[[17384]])

In [None]:
predictionFor17384 - y.loc[17384]  # prediction error in dollars

In [None]:
import matplotlib.pyplot as plt
# switch to an interactive backend so that we can zoom in on the plot
%matplotlib

In [None]:
plt.figure(figsize=(18, 8))
plt.plot(np.array(ytest), label='true price')
plt.plot(pred2, label='predicted price')
plt.legend()

plt.show()


* This confirms what was already visible in the distribution of prices and in the plot of predictions vs. true targets: there seems to be some non-linearity that the linear model does not capture.