## Kaggle Competition - House Prices: Advanced Regression Techniques
# Linear Regression Exploration

This module creates a baseline linear regression model.

The code for this module is in: [src/models/LinearRegressionModel.py](../src/models/LinearRegressionModel.py)

In [1]:
import sys
import os
sys.path.append( os.path.abspath( os.path.join(os.getcwd(), ".." ))) 
from src.utils import reset_root_dir
reset_root_dir()

'/Users/jamie/Dropbox/Programming/Kaggle/kaggle-house-prices'

In [2]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

from src.utils.Charts import Charts
from src.models.LinearRegressionModel import LinearRegressionModel

## Baseline Linear Regression

The basic workflow for linear regression is: 
- split the dataset into test and training datasets
- create a 2-dimensional dataframe of X input features and a 1-dimensional vector of known Y outputs
- generate artificial features as required
- fit a linear model
- score and output predictions against the test dataset

As a naive baseline, we can use the raw data from all the numeric columns in the dataset as X to predict the Y price. Non-numeric fields must be excluded or encoded first, because scikit-learn's `LinearRegression` raises an error when it cannot convert a value to a float.
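The workflow above can be sketched with scikit-learn directly. This is a minimal illustration, not the contents of `LinearRegressionModel.py`: the small synthetic DataFrame, the `fillna(0)` imputation, and the 25% validation split are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle training data (assumption: the real
# dataset has many more columns and rows).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, n),
    "OverallQual": rng.integers(1, 11, n),
    "Neighborhood": rng.choice(["NAmes", "OldTown"], n),  # non-numeric column
})
df["SalePrice"] = (
    20000 + 5 * df["LotArea"] + 15000 * df["OverallQual"] + rng.normal(0, 5000, n)
)

# X: numeric columns only, with missing values filled (hypothetical imputation).
X = df.drop(columns=["SalePrice"]).select_dtypes(include="number").fillna(0)
# Y: the known output vector.
y = df["SalePrice"]

# Hold out 25% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows, then score and predict on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_valid, y_valid)
predictions = model.predict(X_valid)
```

Selecting columns with `select_dtypes(include="number")` is one simple way to drop the non-numeric fields before fitting; encoding them instead would be a separate feature-engineering step.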

In [3]:
linearRegressionModel = LinearRegressionModel()
linearRegressionModel.data["X_train"].head(3)

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
1292,1293,70,60.0,6600,5,4,1892,1965,0.0,0,...,432,0,287,0,0,0,0,0,12,2009
1018,1019,80,0.0,10784,7,5,1991,1992,76.0,0,...,402,164,0,0,0,0,0,0,5,2007
1213,1214,80,0.0,10246,4,9,1965,2001,0.0,648,...,364,88,0,0,0,0,0,0,5,2006


In [4]:
linearRegressionModel.data["Y_train"].head(3)

1292    107500
1018    160000
1213    145000
Name: SalePrice, dtype: int64

In [5]:
linearRegressionModel.execute()

{'class': 'LinearRegressionModel',
 'filename': './data/submissions/LinearRegressionModel.csv',
 'scores': {'R^2': 0.6889500856951878, 'RMSLE': 0.19416226032280556}}

# Scoring Methods

[sklearn.linear_model.LinearRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
uses [R^2](https://www.investopedia.com/terms/r/r-squared.asp) as its default scoring method. A score of 1 means the model's predictions match the observed outputs exactly.
- R^2 = 0.823 | using training / validation dataset splitting 
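For reference, R^2 is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a hand-written sketch of that definition (the name `r_squared` is hypothetical). It is meant to illustrate what the score measures, not to replace the value returned by `LinearRegression.score()`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the observed values scores 0, and a model that does worse than that scores below 0.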

[Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation) uses [Root-Mean-Squared-Error (RMSE)](https://www.statisticshowto.datasciencecentral.com/rmse/) between the logarithm of the predicted value and the logarithm of the observed sales price, referred to here as RMSLE.
- RMSLE = 0.18600 | local train      - without training / validation dataset splitting 
- RMSLE = 0.19416 | local validation - with    training / validation dataset splitting 
- RMSLE = 0.20892 | Kaggle test      - with    training / validation splitting
- RMSLE = 0.43452 | Kaggle test      - without training / validation splitting
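The Kaggle metric described above can be sketched in a few lines of NumPy. The function name `rmsle` is hypothetical, and the natural logarithm is assumed here, following Kaggle's wording; the implementation in `LinearRegressionModel.py` may differ (for example by using `log1p`).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log(predicted) and log(observed); inputs must be positive."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))
```

Because the error is taken on the log scale, it depends on the ratio between prediction and observation, so a 10% miss on a cheap house counts the same as a 10% miss on an expensive one.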

## Submit to Kaggle
- https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions

```
$ kaggle competitions submit -c house-prices-advanced-regression-techniques -f data/submissions/LinearRegressionModel.csv -m "LinearRegressionModel.py - raw numeric fields"
```
    
Before training / validation dataset splitting: 
- Your (Kaggle) submission scored 0.43452, which is an improvement of your previous score of 0.74279. Great job! 
- Kaggle Rank 4079 / 4339

After training / validation dataset splitting:
- Your (Kaggle) submission scored 0.20892, which is an improvement of your previous score of 0.43452. Great job! 
- Kaggle Rank 3751 / 4339

I am unable to explain why training / validation splitting has such a major impact on Kaggle test scores but only a minimal effect when applied locally. One possibility is that training on a smaller dataset leads to less overfitting.