## 0. Setting Up The Data

In [None]:
pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
real_estate_valuation = fetch_ucirepo(id=477) 
  
# data (as pandas dataframes) 
X = real_estate_valuation.data.features 
y = real_estate_valuation.data.targets 
  
# metadata 
print(real_estate_valuation.metadata) 
  
# variable information 
print(real_estate_valuation.variables) 


## 1. Business Understanding

The goal of this model is to predict housing prices using linear regression.

## 2. Data Understanding

In [None]:
X.info()

In [None]:
X.head(5)

In [None]:
X.describe()

## 3. Data Preparation

The geographical coordinates bear no clear linear relationship to the rest of the dataset or to the final price.  
In real life the physical location of an apartment can certainly affect its price, but because linear regression treats latitude and longitude as two separate linear terms, they are more likely to confuse the model than to provide meaningful input.  
In addition, if the data were standardised, any relation between the X and Y coordinates would be lost, as would their geographical placement.
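
The claim about linear relationships can be checked rather than assumed. The following is a minimal sketch that ranks features by their Pearson correlation with the target. The helper `linear_correlations` and the synthetic columns are hypothetical and only illustrate the idea; they are not part of the dataset.

```python
import numpy as np
import pandas as pd


def linear_correlations(features: pd.DataFrame, target: pd.Series) -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = features.corrwith(target)
    # Order by absolute value so strong negative relationships rank high too.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Synthetic stand-in data (hypothetical columns, not the UCI dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "distance": rng.uniform(0, 5000, 200),
    "latitude": rng.uniform(24.93, 25.01, 200),
})
price = pd.Series(50 - 0.005 * demo["distance"] + rng.normal(0, 2, 200), name="price")

result = linear_correlations(demo, price)
print(result)
```

In this notebook the same helper could be called as `linear_correlations(X, y.iloc[:, 0])` before the coordinate columns are dropped, assuming `y` is a one-column DataFrame. A weak Pearson correlation only rules out a linear relationship, not a non-linear one.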

In [None]:
X = X.drop(columns=['X5 latitude', 'X6 longitude'])
X.head(5)

## 4. Modeling

The first model is fitted without standardising the dataset.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)

The data is split, reserving 70% for training and 30% for testing.

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
b0 = model.intercept_ 
b1 = model.coef_ 
print(b0)
print(b1)

The intercept takes an unreasonable value, which could be explained by the varying scales of the features in the dataset.  
In addition, the negative coefficients are considerably smaller in magnitude than the positive ones.  
As for the coefficients themselves:  
Transaction date has the largest coefficient, and distance to the nearest station has the strongest negative effect on price.  
The distance to the station makes sense: transportation is one of the most important aspects of city life, so a longer distance to the nearest station lowering the price of the apartment is in line with this logic.  
However, transaction date carrying the highest weight on the apartment price seems suspicious.  
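
Reading the printed coefficient array is easier when each value is paired with its feature name. Below is a small sketch of one way to do that. The helper `coefficient_table` and the synthetic columns `a` and `b` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def coefficient_table(fitted_model, feature_names) -> pd.Series:
    """Return coefficients labelled by feature name, largest magnitude first."""
    # coef_ is 2-D when the target was passed as a one-column DataFrame, so flatten it.
    coefs = pd.Series(np.ravel(fitted_model.coef_), index=list(feature_names))
    return coefs.reindex(coefs.abs().sort_values(ascending=False).index)


# Synthetic stand-in data with hypothetical column names.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
demo_y = pd.DataFrame({"target": 3.0 * demo_X["a"] - 0.5 * demo_X["b"]})

demo_model = LinearRegression().fit(demo_X, demo_y)
table = coefficient_table(demo_model, demo_X.columns)
print(table)
```

With the notebook's objects this would be `coefficient_table(model, X_train.columns)`. Each raw coefficient is a change in price per one unit of that feature, so features measured in small units can show large coefficients without being more important.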

In [None]:
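import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardise every feature to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset, before the train/test split.
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X.head(5)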

The standardised dataset is split again, 70-30, using the same random seed.

In [None]:
X_train_st, X_test_st, y_train_st, y_test_st = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)

In [None]:
model_st = LinearRegression()
model_st.fit(X_train_st, y_train_st)

In [None]:
b0 = model_st.intercept_ 
b1 = model_st.coef_ 
print(b0)
print(b1)

These results appear much more reasonable.  
The intercept takes a plausible value and the coefficients are all on the same scale.  
Distance to the nearest station, already one of the most heavily weighted features, now carries the largest weight.  
Transaction date also drops below the number of convenience stores, which seems far more reasonable.  
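
One likely reason the two sets of coefficients look so different is that standardising only rescales them: for ordinary least squares, each standardised coefficient equals the raw coefficient multiplied by that feature's standard deviation. The sketch below illustrates this on synthetic data. The two features and their scales are assumptions chosen to mimic a date-like column and a distance-like column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two synthetic features on very different scales (hypothetical data).
raw = np.column_stack([rng.normal(2013, 0.3, 300), rng.normal(1000, 1200, 300)])
target = 5.0 * raw[:, 0] - 0.004 * raw[:, 1] + rng.normal(0, 1, 300)

scaler = StandardScaler().fit(raw)
scaled = scaler.transform(raw)

raw_model = LinearRegression().fit(raw, target)
scaled_model = LinearRegression().fit(scaled, target)

# Standardised coefficient = raw coefficient * feature standard deviation.
rescaled_coefs = raw_model.coef_ * scaler.scale_
same_coefs = np.allclose(rescaled_coefs, scaled_model.coef_)

# With centred features, the intercept is the mean of the training target.
intercept_is_mean = np.isclose(scaled_model.intercept_, target.mean())

print(same_coefs, intercept_is_mean)
```

Here the scaler and the model see exactly the same rows. In this notebook the scaler is fitted before the split, so the intercept would be expected to sit close to, but not exactly at, the mean training price.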

## 5. Evaluation

Evaluating non-standardised model

In [None]:
from sklearn.metrics import mean_absolute_error

preds = model.predict(X_test)

print("Mean absolute error: %.2f" % mean_absolute_error(y_test, preds))

Evaluating standardised model

In [None]:
preds = model_st.predict(X_test_st)

print("Mean absolute error: %.2f" % mean_absolute_error(y_test_st, preds))
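
An alternative way to evaluate the standardised model, sketched here on synthetic data, is to wrap the scaler and the regression in a scikit-learn pipeline so that the scaling statistics are learned from the training rows only. The data and variable names below are hypothetical. Because ordinary least squares with an intercept is unaffected by rescaling the features, the two MAE figures are expected to agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Three synthetic features on different scales (hypothetical data).
demo_X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
demo_y = demo_X @ np.array([2.0, 0.1, -0.003]) + rng.normal(0, 0.5, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42
)

# The scaler's mean and standard deviation come from the training rows only.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)
mae_scaled = mean_absolute_error(y_te, pipeline.predict(X_te))

# Same model without any scaling, for comparison.
plain = LinearRegression().fit(X_tr, y_tr)
mae_plain = mean_absolute_error(y_te, plain.predict(X_te))

print("MAE with scaling:    %.4f" % mae_scaled)
print("MAE without scaling: %.4f" % mae_plain)
```

Standardising therefore helps with interpreting the coefficients rather than with predictive accuracy, at least for plain linear regression without regularisation.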

## 6. Deployment