# Using Linear Regression to predict house sale prices *(Work In Progress)*

XXXXX

## Setting up the workflow

In this section we define a general function to help set up the machine-learning workflow. 
We first import the modules we will need: 

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

We then define the function `train_and_test` which trains a model and returns the error between its prediction and the true data. 
Its parameters are: 
* `data`: the dataframe used to train and test the model, 
* `target`: the name of the target column, 
* `model`: the model to be trained and tested (by default, a linear regression model), 
* `error_function`: the error function to be used (by default, the root mean squared error), 
* `features`: the list of names of columns to be used as features (by default, all numerical columns except `target`), 
* `n_train`: the number of rows to be used for training (by default, approximately 80% of the dataset), 
* `randomize`: whether the rows of the dataframe should be randomly re-ordered before the separation into training and test sets (by default, `True`).
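
As a quick check of the default error function, the sketch below compares the root mean squared error computed with `mean_squared_error` to a hand computation. The arrays `y_true` and `y_pred` are hypothetical values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Same expression as the default error_function of train_and_test
rmse = lambda y1, y2: np.sqrt(mean_squared_error(y1, y2))

# Hypothetical values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Hand computation: the squared errors are 1, 0 and 4, so their mean is 5/3
by_hand = np.sqrt(((y_true - y_pred) ** 2).mean())

rmse_value = rmse(y_pred, y_true)
print(rmse_value, by_hand)
```

The mean squared error is symmetric in its two arguments, so the order in which predictions and true values are passed does not change the result.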

In [2]:
def train_and_test(data, target, 
                   model = LinearRegression(), 
                   error_function = lambda y1, y2: np.sqrt(mean_squared_error(y1,y2)), 
                   features = None, 
                   n_train = 0, 
                   randomize = True): 
    '''
    Divides data into a training set and a test set, trains model on the 
        training set, computes its predictions for the test set, and returns 
        the error. 
    
    If features is None, the features are all the numerical columns except 
        target.
    If n_train is 0, the training set contains close to 80% of the rows of data.
    If randomize is True, the rows of data are randomly reshuffled before the
        division into training and test sets. 
    
    data: pandas dataframe
    target: name of a numerical column in data
    model: sklearn model
    error_function: a function taking two series as arguments and returning a 
        float
    features: list of names of columns of data or None
    n_train: positive integer or 0
    randomize: bool
    '''
    
    # if randomize is True, randomly re-order the dataset
    if randomize:
        data = data.sample(frac=1)
    
    # if features is not given take all the numerical columns except target
    # (columns is a pandas Index, which has no remove method; drop returns 
    # a new Index without target)
    if features is None: 
        features = data.select_dtypes(include=[np.number]).columns.drop(target)
    
    # if n_train is not given, take 80% of the data
    if n_train == 0:
        n_train = int(0.8*data.shape[0])
    
    # features for the training set
    X_train = data[features].iloc[:n_train]
    
    # target for the training set
    y_train = data[target].iloc[:n_train]
    
    # features for the test set
    X_test = data[features].iloc[n_train:]
    
    # target for the test set
    y_test = data[target].iloc[n_train:] 
    
    # fit the model on the training set
    model.fit(X_train, y_train)
    
    # compute and return the error function
    return error_function(model.predict(X_test), y_test)
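
The steps performed by the function can be sketched on a small synthetic dataframe. This is a minimal illustration rather than a replacement for `train_and_test`: the dataframe `toy`, its columns `area`, `rooms`, `street` and `price`, and the use of `random_state` for reproducibility are all assumptions that do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: 'price' is an exact linear function of the features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'area': rng.uniform(50, 250, size=100),
    'rooms': rng.integers(1, 7, size=100),
    'street': ['Pave'] * 100,  # non-numerical column, skipped by select_dtypes
})
toy['price'] = 1000 * toy['area'] + 5000 * toy['rooms'] + 20000

# Numerical columns except the target (Index.drop returns a new Index)
features = toy.select_dtypes(include=[np.number]).columns.drop('price')

# Shuffle the rows; random_state is an assumption added for reproducibility
shuffled = toy.sample(frac=1, random_state=0)

# First 80% of the shuffled rows for training, the rest for testing
n_train = int(0.8 * shuffled.shape[0])
X_train = shuffled[features].iloc[:n_train]
y_train = shuffled['price'].iloc[:n_train]
X_test = shuffled[features].iloc[n_train:]
y_test = shuffled['price'].iloc[n_train:]

# Fit on the training set and compute the RMSE on the test set
model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(list(features), len(X_train), len(X_test), rmse)
```

Because the synthetic target is exactly linear in the features, the error on the test set should be zero up to floating-point precision, which makes this a convenient sanity check of the workflow.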

### Description of the dataset

The data we will use, in the file `AmesHousing.txt`, is a [dataset on house sales in Ames from 2006 to 2010](http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls). 
It was compiled by Dean De Cock from Truman State University and is described in detail in [this article of the Journal of Statistics Education](https://doi.org/10.1080/10691898.2011.11889627). 
It contains, according to the article, 80 columns related to house sales, among which are 20 continuous variables, 14 discrete ones, 23 ordinal ones, and 23 nominal ones.

Let us import it, check the number of columns, and print the first few lines:

In [4]:
df = pd.read_csv('../Data/AmesHousing/AmesHousing.txt', delimiter = '\t')
df.head()

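| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |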
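
The preview is truncated, so the number of columns is easier to read from the `shape` attribute of the dataframe. The sketch below shows this on a small in-memory, tab-separated excerpt; the string `excerpt` and the dataframe `sample` are hypothetical names, and the excerpt keeps only four columns and three rows taken from the preview.

```python
import io
import pandas as pd

# Hypothetical excerpt: four of the columns and three of the rows of the preview
excerpt = (
    "Order\tPID\tMS SubClass\tSalePrice\n"
    "1\t526301100\t20\t215000\n"
    "2\t526350040\t20\t105000\n"
    "3\t526351010\t20\t172000\n"
)

# Same delimiter as for the full file, read from memory instead of from disk
sample = pd.read_csv(io.StringIO(excerpt), delimiter='\t')

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
```

On the full dataframe, the same information is given by `df.shape`.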


It seems there are actually 82 columns. 
The reason for this difference is probably that the columns `Order` (which does not give information on the actual sale) and `SalePrice` (which will be the target) were not included in the count mentioned in the article.
Let us now determine the number of lines and see if there are missing values:

In [5]:
print('Number of lines: ' + str(df.shape[0]))
print()
print('Missing values:')
df.isnull().sum().sort_values(ascending=False)[:30]

Number of lines: 2930

Missing values:


Pool QC           2917
Misc Feature      2824
Alley             2732
Fence             2358
Fireplace Qu      1422
Lot Frontage       490
Garage Qual        159
Garage Yr Blt      159
Garage Cond        159
Garage Finish      159
Garage Type        157
Bsmt Exposure       83
BsmtFin Type 2      81
BsmtFin Type 1      80
Bsmt Cond           80
Bsmt Qual           80
Mas Vnr Type        23
Mas Vnr Area        23
Bsmt Full Bath       2
Bsmt Half Bath       2
Garage Area          1
Garage Cars          1
Total Bsmt SF        1
Bsmt Unf SF          1
BsmtFin SF 2         1
BsmtFin SF 1         1
Electrical           1
Exterior 2nd         0
Exterior 1st         0
Roof Matl            0
dtype: int64

The dataframe has 2930 lines. 
27 columns among the 82 have at least one missing value. 
Among them, 4 have more than half of their values missing (`Fireplace Qu`, with 1422 missing values out of 2930, falls just short of half).
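
Such counts can be computed directly instead of being read off the printed list. The following is a minimal sketch on a hypothetical dataframe `toy` with columns `a` to `d`; the same expressions should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],  # 3 of 4 values missing
    'b': [1.0, 2.0, np.nan, 4.0],        # 1 of 4 values missing
    'c': [1.0, 2.0, 3.0, 4.0],           # no missing value
    'd': ['x', None, None, 'y'],         # 2 of 4 values missing (exactly half)
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Columns with at least one missing value
n_with_missing = int((missing_counts > 0).sum())

# Columns with strictly more than half of their values missing
n_mostly_missing = int((missing_counts > toy.shape[0] / 2).sum())

print(n_with_missing, n_mostly_missing)
```

The strict inequality means that a column with exactly half of its values missing, like `d` here, is not counted.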

# Feature Engineering