# Project: Predicting house sale prices  
**Data:** housing data for the city of Ames, Iowa, USA, 2006 to 2010  
**Data description:** https://s3.amazonaws.com/dq-content/307/data_description.txt  
**Source:** https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627  

## Phase 1 - preparation

In [78]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from scipy import stats
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from itertools import combinations
%matplotlib inline

# Setting pandas display options
pd.options.display.max_columns = 200
pd.options.display.max_rows = 100

In [2]:
# Importing source dataset
ames_data = pd.read_csv('C:\\Users\\tgusc\\Documents\\GitHub\\Python\\GP23_PredictingHouseSalePrices\\AmesHousing.tsv', delimiter="\t")

In [None]:
# Converting object types to numbers and strings
ames_data = ames_data.convert_dtypes()
print(ames_data.info())

In [83]:
def transform_features(data_in, cutoff_missing = 0.25, cutoff_fill = 0.05, fill_method = 'mode'):
    """
    Transform features based on their characteristics.
    
    Parameters
    ----------
    data_in : DataFrame
        DataFrame to analyze.
    cutoff_missing : float, default = 0.25
        Fraction of missing values used as the cutoff for dropping a variable. If the missing fraction >
        cutoff_missing, the variable is dropped from the DataFrame.
    cutoff_fill : float, default = 0.05
        Fraction of missing values used as the cutoff for filling missing values. If 0 < missing fraction <=
        cutoff_fill, missing values are replaced using fill_method.
    fill_method : str, default = 'mode'
        Filling method for missing values when a variable meets the cutoff_fill criterion. One of 'mean', 'median', 'mode'.
        
    Returns
    -------
    data_out : DataFrame
        DataFrame with transformed features.
    """
    data_out = pd.DataFrame()
    missing_info = data_in.isna().sum()/len(data_in)
    dropped_cols = []
    for col in data_in.columns:
        p_miss = missing_info[col]
        if p_miss > cutoff_missing:
            print(col + ' - dropped because of missing values exceeding ' + str(round(cutoff_missing*100,2)) + '%.' + ' Missing values = '
                 + str(round(p_miss*100,2)) + '%.')
            dropped_cols.append(col)
    for col_n in data_in.select_dtypes('number'):
        if col_n in dropped_cols:
            continue  # skip columns already dropped for too many missing values
        p_miss = missing_info[col_n]
        if 0 < p_miss <= cutoff_fill:
            if fill_method == 'mode':
                fill=data_in[col_n].mode()[0]
            elif fill_method == 'mean':
                fill=data_in[col_n].mean()  # pandas skips missing values by default
            elif fill_method == 'median':
                fill=data_in[col_n].median()  # np.median does not skip missing values
            else:
                print(fill_method + ' is not known. Column will not be transformed')
                continue
            data_out[col_n] = data_in[col_n].fillna(value = fill)
            print(col_n + ' - ' + str(round(p_miss*100,4)) + '% of missing values. They are replaced with ' + fill_method + ' value - ' + str(fill))
        else :
            data_out[col_n] = data_in[col_n]
            print(col_n + ' - ' + str(round(p_miss*100,4)) + '% of missing values. Variable copied.')
    for col_c in data_in.select_dtypes(exclude='number'):
        if col_c not in dropped_cols:
            p_miss = missing_info[col_c]  # look up this column's own missing share
            data_out[col_c] = data_in[col_c].astype('category')
            print(col_c + ' - ' + str(round(p_miss*100,4)) + '% of missing values. Variable copied.')
    return data_out

In [84]:
ames_data_transformed = transform_features(data_in = ames_data)

Alley - dropped because of missing values exceeding 0.25%. Missing values = 93.24%.
Fireplace Qu - dropped because of missing values exceeding 0.25%. Missing values = 48.53%.
Pool QC - dropped because of missing values exceeding 0.25%. Missing values = 99.56%.
Fence - dropped because of missing values exceeding 0.25%. Missing values = 80.48%.
Misc Feature - dropped because of missing values exceeding 0.25%. Missing values = 96.38%.
Order - 0.0% of missing values. Variable copied.
PID - 0.0% of missing values. Variable copied.
MS SubClass - 0.0% of missing values. Variable copied.
Lot Frontage - 16.7235% of missing values. Variable copied.
Lot Area - 0.0% of missing values. Variable copied.
Overall Qual - 0.0% of missing values. Variable copied.
Overall Cond - 0.0% of missing values. Variable copied.
Year Built - 0.0% of missing values. Variable copied.
Year Remod/Add - 0.0% of missing values. Variable copied.
Mas Vnr Area - 0.785% of missing values. They are replaced with mode value - 

In [85]:
ames_data_transformed

Unnamed: 0,Order,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Yr Blt,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice,MS Zoning,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin Type 2,Heating,Heating QC,Central Air,Electrical,Kitchen Qual,Functional,Garage Type,Garage Finish,Garage Qual,Garage Cond,Paved Drive,Sale Type,Sale Condition
0,1,526301100,20,141,31770,6,5,1960,1960,112,639,0,441,1080,1656,0,0,1656,1,0,1,0,3,1,7,2,1960,2,528,210,62,0,0,0,0,0,5,2010,215000,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,Hip,CompShg,BrkFace,Plywood,Stone,TA,TA,CBlock,TA,Gd,Gd,BLQ,Unf,GasA,Fa,Y,SBrkr,TA,Typ,Attchd,Fin,TA,TA,P,WD,Normal
1,2,526350040,20,80,11622,5,6,1961,1961,0,468,144,270,882,896,0,0,896,0,0,1,0,2,1,5,0,1961,1,730,140,0,0,0,120,0,0,6,2010,105000,RH,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,Gable,CompShg,VinylSd,VinylSd,,TA,TA,CBlock,TA,TA,No,Rec,LwQ,GasA,TA,Y,SBrkr,TA,Typ,Attchd,Unf,TA,TA,Y,WD,Normal
2,3,526351010,20,81,14267,6,6,1958,1958,108,923,0,406,1329,1329,0,0,1329,0,0,1,1,3,1,6,0,1958,1,312,393,36,0,0,0,0,12500,6,2010,172000,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,TA,TA,CBlock,TA,TA,No,ALQ,Unf,GasA,TA,Y,SBrkr,Gd,Typ,Attchd,Unf,TA,TA,Y,WD,Normal
3,4,526353030,20,93,11160,7,5,1968,1968,0,1065,0,1045,2110,2110,0,0,2110,1,0,2,1,3,1,8,2,1968,2,522,0,0,0,0,0,0,0,4,2010,244000,RL,Pave,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,Hip,CompShg,BrkFace,BrkFace,,Gd,TA,CBlock,TA,TA,No,ALQ,Unf,GasA,Ex,Y,SBrkr,Ex,Typ,Attchd,Fin,TA,TA,Y,WD,Normal
4,5,527105010,60,74,13830,5,5,1997,1998,0,791,0,137,928,928,701,0,1629,0,0,2,1,3,1,6,1,1997,2,482,212,34,0,0,0,0,0,3,2010,189900,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,,TA,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Gd,Y,SBrkr,TA,Typ,Attchd,Fin,TA,TA,Y,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,2926,923275080,80,37,7937,6,6,1984,1984,0,819,0,184,1003,1003,0,0,1003,1,0,1,0,3,1,6,0,1984,2,588,120,0,0,0,0,0,0,3,2006,142500,RL,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Mitchel,Norm,Norm,1Fam,SLvl,Gable,CompShg,HdBoard,HdBoard,,TA,TA,CBlock,TA,TA,Av,GLQ,Unf,GasA,TA,Y,SBrkr,TA,Typ,Detchd,Unf,TA,TA,Y,WD,Normal
2926,2927,923276100,20,,8885,5,5,1983,1983,0,301,324,239,864,902,0,0,902,1,0,1,0,2,1,5,0,1983,2,484,164,0,0,0,0,0,0,6,2006,131000,RL,Pave,IR1,Low,AllPub,Inside,Mod,Mitchel,Norm,Norm,1Fam,1Story,Gable,CompShg,HdBoard,HdBoard,,TA,TA,CBlock,Gd,TA,Av,BLQ,ALQ,GasA,TA,Y,SBrkr,TA,Typ,Attchd,Unf,TA,TA,Y,WD,Normal
2927,2928,923400125,85,62,10441,5,5,1992,1992,0,337,0,575,912,970,0,0,970,0,1,1,0,3,1,6,0,,0,0,80,32,0,0,0,0,700,7,2006,132000,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,SFoyer,Gable,CompShg,HdBoard,Wd Shng,,TA,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,TA,Y,SBrkr,TA,Typ,,,,,Y,WD,Normal
2928,2929,924100070,20,77,10010,5,5,1974,1975,0,1071,123,195,1389,1389,0,0,1389,1,0,1,0,2,1,6,1,1975,2,418,240,38,0,0,0,0,0,4,2006,170000,RL,Pave,Reg,Lvl,AllPub,Inside,Mod,Mitchel,Norm,Norm,1Fam,1Story,Gable,CompShg,HdBoard,HdBoard,,TA,TA,CBlock,Gd,TA,Av,ALQ,LwQ,GasA,Gd,Y,SBrkr,TA,Typ,Attchd,RFn,TA,TA,Y,WD,Normal


Update transform_features() so that it drops any column with more than 25% (or another cutoff value) missing values. It also needs to remove any columns that leak information about the sale (e.g. the year the sale happened). In general, the goal of this function is to:

- remove features that we don't want to use in the model, based on the number of missing values or on data leakage
- transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc.)
- create new features by combining other features

The transform_features() function shouldn't modify the train data frame; it should return a new one entirely. This way, we can keep using train in the experimentation cells.
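
The two requirements above (drop columns over a missing-value cutoff, and leave the input untouched) can be sketched in a few lines. This is only an illustration on invented toy data; the helper name `drop_sparse_columns` is hypothetical and is not part of this notebook.

```python
import pandas as pd

def drop_sparse_columns(df, cutoff=0.25):
    """Return a new DataFrame without columns whose missing share exceeds cutoff."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= cutoff].index
    return df[keep].copy()  # copy so the caller's DataFrame is never modified

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250, 9550],
    "Alley": [None, None, None, "Pave"],  # 75% missing, above the cutoff
    "SalePrice": [208500, 181500, 223500, 140000],
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))  # columns that survived the cutoff
print(list(toy.columns))      # the original still has all of its columns
```

Working on a copy means the same source DataFrame can be passed through several variants of the transformation while experimenting.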
Which columns contain less than 5% missing values?

- For numerical columns that meet this criterion, let's fill in the missing values using the most popular value for that column.

What new features can we create that better capture the information in some of the features?

- An example of this would be the years_until_remod feature we created in the last lesson.

Which columns need to be dropped for other reasons?

- Which columns aren't useful for machine learning?
- Which columns leak data about the final sale?


In [4]:
def select_features(data):
    # Baseline selection: a single predictor plus the target column
    return ["Gr Liv Area","SalePrice"]

In [7]:
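def train_and_test(data, target):
    # Fixed holdout split: the first 1460 rows train, the remainder test
    train = data.iloc[:1460]
    test = data.iloc[1460:]
    # Use every selected column except the target as a predictor
    features = [col for col in select_features(train) if col != target]
    reg = LinearRegression()
    reg.fit(train[features], train[target])
    predictions = reg.predict(test[features])
    # scikit-learn expects (y_true, y_pred); RMSE is the square root of the MSE
    rmse = np.sqrt(mean_squared_error(test[target], predictions))
    return rmse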

In [8]:
#TEST functions
print(train_and_test(ames_data, "SalePrice"))

57088.25161263909
