# Preliminary Questions:

### What is the purpose of splitting your data into training, validation and test sets?

- The training set is used to fit the model's parameters.
- The validation set is used to tune the model, for example choosing the regularisation parameter lambda.
- The test set lets us check that the final model performs well on data it has not seen before. The test set should follow the same general distribution as the training set.
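As a rough illustration, a three-way split can be built from two calls to `train_test_split`. This is only a sketch on synthetic stand-in data: the names `X_demo`, `y_demo` and the 60/20/20 proportions are assumptions, not something used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; any feature matrix and target would do)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# First hold out 20% of the rows as the test set ...
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
# ... then take 25% of the remainder (20% of the total) as the validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(X_tr.shape, X_val.shape, X_te.shape)
```

The model would be fitted on `X_tr`, tuned against `X_val`, and scored once on `X_te` at the very end.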

### What is the cost function for linear regression? 

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

where $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$ and $m$ is the number of training examples.
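A minimal NumPy sketch of this cost function, assuming `theta[0]` is the intercept and the remaining entries are the feature weights. The function name and the toy data are made up for illustration.

```python
import numpy as np

def linear_regression_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((prediction - y)^2), with theta[0] as the intercept."""
    m = len(y)
    X_b = np.column_stack([np.ones(m), X])  # prepend a column of ones for theta_0
    residuals = X_b @ theta - y
    return float(residuals @ residuals) / (2 * m)

# Toy data lying exactly on the line y = 2x
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])

cost_perfect = linear_regression_cost(np.array([0.0, 2.0]), X_toy, y_toy)  # exact fit
cost_zero = linear_regression_cost(np.array([0.0, 0.0]), X_toy, y_toy)     # predicts 0 everywhere
print(cost_perfect, cost_zero)
```

The exact fit gives a cost of zero, while the all-zero parameters give (4 + 16 + 36) / 6.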

### How would you modify the cost function for linear regression to use regularisation?

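$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

where $\lambda$ is the regularisation parameter and $n$ is the number of features. The sum over $j$ starts at 1, so the intercept $\theta_0$ is not penalised.
We could use $|\theta_j|$ (L1 / lasso) rather than $\theta_j^2$ (L2 / ridge).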
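A small sketch of the regularised cost, assuming the common convention of scaling the penalty by lambda / (2m) and leaving the intercept unpenalised. Conventions for the scaling differ between textbooks and libraries, and the function name and `penalty` argument are hypothetical.

```python
import numpy as np

def regularised_cost(theta, X, y, lam, penalty="l2"):
    """Squared-error cost plus an L2 (ridge) or L1 (lasso-style) penalty on theta[1:]."""
    m = len(y)
    X_b = np.column_stack([np.ones(m), X])  # column of ones for the intercept theta_0
    residuals = X_b @ theta - y
    data_term = float(residuals @ residuals) / (2 * m)

    weights = theta[1:]  # theta_0 is deliberately left out of the penalty
    if penalty == "l2":
        reg_term = lam / (2 * m) * float(np.sum(weights ** 2))
    elif penalty == "l1":
        reg_term = lam / (2 * m) * float(np.sum(np.abs(weights)))
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return data_term + reg_term

X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
theta_fit = np.array([0.0, 2.0])  # fits the toy data exactly, so only the penalty remains

cost_l2 = regularised_cost(theta_fit, X_toy, y_toy, lam=3.0, penalty="l2")
cost_l1 = regularised_cost(theta_fit, X_toy, y_toy, lam=3.0, penalty="l1")
print(cost_l2, cost_l1)
```

Because the data term is zero here, the printed values are the penalty terms alone: 3/6 * 2^2 for L2 and 3/6 * |2| for L1.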


### How does the size of the regularisation parameter impact your model?

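A larger regularisation parameter shrinks the theta weights towards zero, giving a simpler model that is less likely to overfit. If the parameter is too large the model underfits; if it is close to zero the weights approach the unregularised fit.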
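The shrinking effect can be seen by fitting scikit-learn's `Ridge` with increasing `alpha` and watching the size of the coefficient vector. This is a sketch on synthetic data; the data, the alpha values and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data with known weights (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

coef_norms = []
for demo_alpha in [0.01, 1, 100, 10000]:
    ridge_demo = Ridge(alpha=demo_alpha).fit(X_demo, y_demo)
    # L2 norm of the fitted weights (the intercept is stored separately)
    coef_norms.append(np.linalg.norm(ridge_demo.coef_))
    print(demo_alpha, coef_norms[-1])
```

Each larger alpha should print a smaller norm than the one before it.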

### What metric should you use to evaluate the accuracy of a linear regression model?

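The value of the minimised cost function measured on held-out data, i.e. the mean squared error (MSE), or its square root (RMSE), which is in the same units as the target. RMSE is the metric used in the rest of this notebook.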
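For reference, a short sketch of how the root mean squared error can be computed by hand and checked against `mean_squared_error` from scikit-learn. The numbers are made-up toy values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy true and predicted values (made up for illustration)
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

# Errors are -1, 0, 2, 1, so the squared errors sum to 6
mse_manual = np.mean((y_pred_demo - y_true_demo) ** 2)
rmse_manual = np.sqrt(mse_manual)

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(mse_manual, rmse_manual, rmse_sklearn)
```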

# Data Import And Initial Investigation

In [None]:
from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

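# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook needs an older scikit-learn version to run as written.
boston = load_boston()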

In [None]:
boston.keys()

In [None]:
print(boston.DESCR)

- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per 10,000usd
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in 1000usd

In [None]:
y = boston.target
X = boston.data
X.shape
#print(y)

In [None]:
data   = pd.DataFrame(boston.data, columns = boston.feature_names)
target = pd.DataFrame(boston.target, columns = ["TARGET"])

data['TARGET'] = target

data.head()


In [None]:
target.describe()

In [None]:
data.describe()

In [None]:
# Count missing values per column
data.isnull().sum()

In [None]:
for i in boston.feature_names:
    plt.hist(data[i],bins = 100)
    plt.title(i)
    plt.show()
    plt.boxplot(data[i])
    plt.title(i)
    plt.show()
    plt.scatter(data[i], data['TARGET'])
    plt.title(i)
    plt.show()

In [None]:
# Scatter plot of every feature against every other feature
for i in boston.feature_names:
    for j in boston.feature_names:
        if i == j:
            continue  # skip plotting a feature against itself
        plt.scatter(data[i], data[j])
        plt.title(i+" vs "+j)
        plt.xlabel(i)
        plt.ylabel(j)
        plt.show()

In [None]:
plt.hist(target["TARGET"],bins = 100)
plt.show()
plt.boxplot(target["TARGET"])
plt.show()


# Initial Prediction with raw data

In [None]:
from sklearn.model_selection import train_test_split
 
# Split into training and testing datasets
# The random_state=0 kwarg ensures that the split is performed in a consistent manner between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
model = LinearRegression()
 
# Fit training set
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)
 
 
# Prediction metric
naive_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(naive_rmse)

In [None]:
# Scatter plot
plt.scatter(y_test, y_pred)
plt.xlabel('true price')
plt.ylabel('predicted price')
plt.title('Boston house price prediction with linear regression model')
plt.show()
 
# Distribution of errors
plt.hist(y_pred - y_test,bins = range(-20,11,1))
plt.xlabel('difference in true and predicted price')
plt.show()

# Feature Engineering

Attempts

- Remove outliers in target (values of 50)
  - RMSE improvement: 1.4568695846892066
- Add Standard Scaler
  - RMSE improvement: 1.456869584689211 (barely any improvement)
- Replace Standard with MinMax Scaler
  - RMSE improvement: 1.4568695846892137 (barely any improvement, but best scaling)
- Replace Standard with MaxAbs Scaler
  - RMSE improvement: 1.456869584689212 (worse than MinMax)
- Add categorisation on CHAS feature
  - RMSE improvement: 1.4568695847061575 (improvement)
- Loop through to remove features one by one
  - Removing CHAS or INDUS has a positive impact. Removing the two together leads to an RMSE of 4.304458421612283, an improvement of 1.4792003656685777.
  - Removing any other feature individually has a negative impact (or, for AGE, a negligible one).

| Removed Feature | New RMSE | Impact |
|---|---|---|
| CHAS | 4.313259982 | 0.01352922015 |
| INDUS | 4.318445876 | 0.008343326533 |
| AGE | 4.326751297 | 3.79E-05 |
| ZN | 4.33557273 | -0.008783527861 |
| NOX | 4.3659072 | -0.03911799755 |
| B | 4.393902229 | -0.06711302657 |
| CRIM | 4.405942413 | -0.07915321079 |
| RAD | 4.42633282 | -0.09954361702 |
| TAX | 4.447770195 | -0.1209809927 |
| DIS | 4.504328654 | -0.1775394516 |
| LSTAT | 4.590714818 | -0.2639256156 |
| PTRATIO | 4.705479732 | -0.3786905294 |
| RM | 4.985348228 | -0.6585590251 |

- Remove CHAS and INDUS and then repeat the process

| Removed Feature | New RMSE | Impact |
|---|---|---|
| AGE | 4.306313597 | -0.001855175676 |
| ZN | 4.312388958 | -0.00793053648 |
| NOX | 4.345536505 | -0.04107808292 |
| B | 4.372303799 | -0.06784537708 |
| CRIM | 4.383219661 | -0.07876123917 |
| RAD | 4.404340957 | -0.09988253532 |
| TAX | 4.440388143 | -0.1359297214 |
| DIS | 4.486801427 | -0.1823430053 |
| LSTAT | 4.56257363 | -0.2581152082 |
| PTRATIO | 4.709273147 | -0.4048147255 |
| RM | 4.982456124 | -0.6779977027 |
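The code that produced these drop-one-feature tables is not included in the notebook. One way such a loop could look is sketched below; the helper names `rmse_for` and `drop_one_feature_report` are hypothetical, the data is a synthetic stand-in, and the 80/20 split with `random_state=0` is assumed to match the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_for(X_df, y_vals):
    """Fit a linear regression on an 80/20 split and return the test RMSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_df, y_vals, random_state=0, test_size=0.2)
    fitted = LinearRegression().fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, fitted.predict(X_te)))

def drop_one_feature_report(X_df, y_vals):
    """Remove each feature in turn and report the change in RMSE (positive = better)."""
    baseline = rmse_for(X_df, y_vals)
    rows = []
    for feature in X_df.columns:
        new_rmse = rmse_for(X_df.drop(columns=[feature]), y_vals)
        rows.append({"Removed Feature": feature,
                     "New RMSE": new_rmse,
                     "Impact": baseline - new_rmse})
    report = pd.DataFrame(rows).sort_values("Impact", ascending=False)
    return report.reset_index(drop=True)

# Synthetic stand-in data: one strong feature, one weak feature, one pure-noise feature
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["useful", "weak", "noise"])
y_demo = 5.0 * X_demo["useful"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.5, size=200)

report = drop_one_feature_report(X_demo, y_demo)
print(report)
```

Features whose removal hurts the most end up at the bottom of the report, mirroring the ordering of the tables above.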


In [None]:
 
def basic_train_and_run(X_train, X_test, y_train, y_test):
    """Fit a LinearRegression on the training set and score it on the test set."""
    model = LinearRegression()

    # Fit training set
    model.fit(X_train, y_train)

    # Predict on test set
    y_pred = model.predict(X_test)

    # Prediction metric; improvement is measured against the RMSE
    # obtained after removing CHAS and INDUS (4.304458421612283)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    rmse_improvement = 4.304458421612283 - rmse
    return y_pred, rmse, rmse_improvement


In [None]:
def model_performance(rmse, rmse_improvement, y_test, y_pred, show_graphs):
    print("RMSE            : "+str(rmse))
    print("RMSE Improvement: "+str(rmse_improvement))
    if show_graphs:
        # Scatter plot
        plt.scatter(y_test, y_pred)
        plt.xlabel('true price')
        plt.ylabel('predicted price')
        plt.title('Boston house price prediction with linear regression model')
        plt.show()

        # Distribution of errors
        plt.hist(y_pred - y_test,bins = range(-20,11,1))
        plt.xlabel('difference in true and predicted price')
        plt.show()

In [None]:

from sklearn import preprocessing

y = boston.target
X = boston.data
data = pd.DataFrame(X, columns = boston.feature_names)
data['TARGET'] = y

# The target values of 50 look like outliers to me, so drop those rows
data.drop(data[data.TARGET == 50].index, inplace=True)

y = data.TARGET
X = data.drop(["TARGET","CHAS","INDUS"], axis = 1)


# Power transform of LSTAT with exponent -0.6
# (note: not a square root, despite the column name)
X["srtLSTAT"] = X["LSTAT"] ** -0.6

# Binary indicator features: 1 if the value is above the threshold, else 0
X["binaryRAD"] = (X["RAD"] > 10).astype(int)
X["binaryTAX"] = (X["TAX"] > 500).astype(int)
X["binaryB"] = (X["B"] > 335).astype(int)
X["binaryZN"] = (X["ZN"] > 20).astype(int)

# CRIM is replaced by its binary indicator
X["binaryCRIME"] = (X["CRIM"] > 20).astype(int)
X = X.drop("CRIM",axis = 1)


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)


    
scaler = preprocessing.MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
   
y_pred, rmse, rmse_improvement = basic_train_and_run(X_train, X_test, y_train, y_test)

model_performance(rmse,rmse_improvement, y_test,y_pred,True)
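The cell above fits the scaler on the training data only and then applies it to the test data. The same pattern can be expressed with a scikit-learn pipeline, sketched below on synthetic stand-in data (the data and all variable names here are assumptions). The sketch also compares against an unscaled fit: for plain least squares the predictions do not depend on per-feature rescaling, which is consistent with the scaler attempts listed earlier differing only by floating-point noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with features on very different scales (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.03, 50.0]) + rng.normal(scale=0.5, size=200)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0, test_size=0.2)

# The pipeline fits the scaler on the training data only, then reuses it at predict time
scaled_model = make_pipeline(MinMaxScaler(), LinearRegression()).fit(Xd_train, yd_train)
rmse_scaled = np.sqrt(mean_squared_error(yd_test, scaled_model.predict(Xd_test)))

# Same model without any scaling
raw_model = LinearRegression().fit(Xd_train, yd_train)
rmse_raw = np.sqrt(mean_squared_error(yd_test, raw_model.predict(Xd_test)))

print(rmse_scaled, rmse_raw)
```

Scaling starts to matter once a penalty on the weights is added, as with `Ridge`, because the penalty treats all coefficients on the same footing.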


In [None]:
 
from sklearn.linear_model import Ridge

def better_train_and_run(X_train, X_test, y_train, y_test, alpha):
    """Fit a Ridge regression with the given alpha and score it on the test set."""
    model = Ridge(alpha=alpha)

    # Fit training set
    model.fit(X_train, y_train)

    # Predict on test set
    y_pred = model.predict(X_test)

    # Prediction metric; improvement is measured against the RMSE
    # obtained after removing CHAS and INDUS (4.304458421612283)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    rmse_improvement = 4.304458421612283 - rmse
    return y_pred, rmse, rmse_improvement


In [None]:
# Try a range of regularisation strengths and report the RMSE for each
alphaVals = [0.001,0.003,0.01,0.03,0.1,0.3,1,3,10]
for alpha in alphaVals:
    print(alpha)
    y_pred, rmse, rmse_improvement = better_train_and_run(X_train, X_test, y_train, y_test, alpha)

    model_performance(rmse, rmse_improvement, y_test, y_pred, False)
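The loop above scores every alpha on the test set. In line with the first preliminary question, an alternative is to pick alpha using only the training data and keep the test set for one final check. A possible sketch uses scikit-learn's `RidgeCV`, which selects alpha by built-in cross-validation; the synthetic data and variable names below are assumptions, and only the alpha grid is taken from the cell above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same grid of alphas as in the loop above
alpha_grid = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]

# Synthetic stand-in data (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0, test_size=0.2)

# Choose alpha by cross-validation on the training data only
ridge_cv = RidgeCV(alphas=alpha_grid).fit(Xd_train, yd_train)
best_alpha = ridge_cv.alpha_

# The test set is used once, for the final score
test_rmse = np.sqrt(mean_squared_error(yd_test, ridge_cv.predict(Xd_test)))
print(best_alpha, test_rmse)
```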