In [7]:
#%pip install Pandas
#%pip install keras
#%pip install tensorflow
#%pip install scikit-learn

In [22]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy as sp
import seaborn as sns
import keras
import sklearn.decomposition
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.metrics


Contents 
- [1. Introduction](#1-introduction)
- [2. Dataset and initial inspection](#2-dataset-and-initial-inspection)
- [3. Analysing and formatting the dataset](#3-analysing-and-formatting-the-dataset)
  - [3.1 Formatting the dataset](#3-1-formatting-the-dataset)
  - [3.2 Analysing the dataset](#3-2-analysing-the-dataset)



# 1. Introduction

Predicting how much houses will sell for helps current and prospective homeowners navigate the cluttered real estate market. In this report we explore how the dataset is structured, how to format it so that a model can be fitted to it, and the different kinds of models we can fit.

# 2. Dataset and initial inspection

The dataset provided contains many parameters on which the price of a house can depend. A couple of examples, with their usual impact on the house price based on general knowledge and experience:

- LotArea: the area of the land on which a house stands. In general the higher this number the higher the price.
- Bedroom: number of bedrooms above ground. In general the higher the number of bedrooms the higher the price.
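Such an expectation can be checked with a correlation coefficient. The sketch below uses only the five LotArea and SalePrice values visible in the printed output of this notebook, so it illustrates the check and says nothing about the full dataset; the variable names are illustrative.

```python
import pandas as pd

# First five rows as printed in this notebook (not the full train.csv)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Pearson correlation between lot area and sale price
lot_area_corr = toy["LotArea"].corr(toy["SalePrice"])
print(lot_area_corr)
```

A value close to 1 supports "higher area, higher price"; a value near 0 means no linear relationship is visible in the sample.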

The dataset contains many more parameters. For details, consult the data_description.txt file.

The 79 explanatory variables quantify almost all physical properties of the houses in the dataset. To some extent they try to capture human preferences objectively, by rating the quality of materials and the functionality of the home. In the end these preferences differ from person to person, as everybody has a different eye for aesthetics. The dataset should therefore be interpreted as what it is, an objective assessment of the physical features of a house, and any model fitted to it should be treated as an objective estimator of the selling price.

# 3. Analysing and formatting the dataset

## 3.1 Formatting the dataset

Inspecting the dataset alongside the provided data_description.txt file shows that many columns hold non-numeric values. In addition, when the files are imported through pandas, some values are read as NaN. Before we can fit a model to the dataset, all of these values have to be converted to numbers. The data_description.txt file was used to collect the possible non-numeric values for each column, and a for loop then assigns a number to each of them. This ensures that linear regression can be used.
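One likely source of the NaN values is that `pd.read_csv` treats the string `NA` as missing by default, while data_description.txt uses `NA` as a regular category (for example "no alley access"). A minimal sketch on a hypothetical two-row CSV, not on the real files:

```python
import io
import pandas as pd

# Hypothetical two-row CSV in the style of train.csv
raw = "Id,Alley,LotArea\n1,NA,8450\n2,Grvl,9600\n"

# Default behaviour: the string "NA" is parsed as a missing value
default_df = pd.read_csv(io.StringIO(raw))

# keep_default_na=False keeps "NA" as an ordinary string
literal_df = pd.read_csv(io.StringIO(raw), keep_default_na=False)

print(default_df["Alley"].isna().sum())  # cells parsed as NaN
print(literal_df["Alley"].tolist())      # strings kept as written
```

Whether to keep `NA` as a label or fill it afterwards is a design choice; this notebook fills NaN with zero.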

The for loop is primitive and inefficient, but on modern hardware it still runs fast enough.
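A possible alternative, shown only as a sketch, is to map labels column by column instead of scanning the whole table once per label. The helper `encode_columns` and the dictionary `column_classes` are hypothetical names, and the two label lists are a small excerpt; the codes follow the same 1-based convention as the loop, with 0 for anything not listed.

```python
import pandas as pd

# Hypothetical per-column label lists, keyed by the real column name
column_classes = {
    "Street": ["Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2", "IR3"],
}

def encode_columns(df, column_classes):
    """Return a copy of df with each listed column mapped to 1-based codes."""
    out = df.copy()
    for column, labels in column_classes.items():
        codes = dict(zip(labels, range(1, len(labels) + 1)))
        # Labels missing from the list become NaN in map(), then 0
        out[column] = out[column].map(codes).fillna(0).astype(int)
    return out

toy = pd.DataFrame({
    "Street": ["Pave", "Grvl"],
    "LotShape": ["Reg", "IR1"],
    "LotArea": [8450, 9600],
})
encoded = encode_columns(toy, column_classes)
print(encoded)
```

Mapping per column also prevents a label that occurs in several columns (such as "Gd") from being tied to a single global code.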

After assigning numeric values to every data point, the entire dataset is columnwise mean normalized.
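The normalization step could look like the sketch below. It assumes that "mean normalized" means (x - mean) / (max - min) per column; if standardization by the standard deviation is intended, the denominator changes. The function name `mean_normalize` is illustrative.

```python
import numpy as np

def mean_normalize(x):
    """Column-wise (x - mean) / (max - min); constant columns map to 0."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    span = x.max(axis=0) - x.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (x - mean) / span

# Hypothetical 3x2 matrix; the second column is constant
toy = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
normalized = mean_normalize(toy)
print(normalized)
```

After this step every column has mean zero, and every non-constant column spans a range of one.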

In [23]:
# accessing the csv files

train_csv = pd.read_csv("./house-prices-advanced-regression-techniques/train.csv")
test_csv = pd.read_csv("./house-prices-advanced-regression-techniques/test.csv")

data_fields = train_csv.columns.values

print(train_csv)
print(len(data_fields))
print(test_csv)


        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60       RL         65.0     8450   Pave   NaN      Reg   
1        2          20       RL         80.0     9600   Pave   NaN      Reg   
2        3          60       RL         68.0    11250   Pave   NaN      IR1   
3        4          70       RL         60.0     9550   Pave   NaN      IR1   
4        5          60       RL         84.0    14260   Pave   NaN      IR1   
...    ...         ...      ...          ...      ...    ...   ...      ...   
1455  1456          60       RL         62.0     7917   Pave   NaN      Reg   
1456  1457          20       RL         85.0    13175   Pave   NaN      Reg   
1457  1458          70       RL         66.0     9042   Pave   NaN      Reg   
1458  1459          20       RL         68.0     9717   Pave   NaN      Reg   
1459  1460          20       RL         75.0     9937   Pave   NaN      Reg   

     LandContour Utilities  ... PoolArea PoolQC  Fe

In [19]:
# formatting the data

# change the NaN values to zero
train_csv = train_csv.fillna(0)
test_csv = test_csv.fillna(0)

# uniformly change non numeric values to a numeric value

classes = { 
    "mszoning_classes"          :      ["A","C (all)","FV","I","RH","RL","RP","RM"],
    "street_classes"            :      ["Grvl","Pave"],
    "alley_classes"             :      ["Grvl","Pave","NA"],
    "lotshape_classes"          :      ["Reg","IR1","IR2","IR3"],
    "landcontour_classes"       :      ["Lvl","Bnk","HLS","Low"],
    "utilities_classes"         :      ["AllPub","NoSewr","NoSeWa","ELO"],
    "lotconfig_classes"         :      ["Inside","Corner","CulDSac","FR2","FR3"],
    "landslope_classes"         :      ["Gtl","Mod","Sev"],
    "neighborhood_classes"      :      ["Blmngtn","Blueste","BrDale","BrkSide","ClearCr","CollgCr","Crawfor","Edwards","Gilbert","IDOTRR","MeadowV","Mitchel","NAmes","NoRidge","NPkVill","NridgHt","NWAmes","OldTown","SWISU","Sawyer","SawyerW","Somerst","StoneBr","Timber","Veenker"],
    "condition1_classes"        :      ["Artery","Feedr","Norm","RRNn","RRAn","PosN","PosA","RRNe","RRAe"],
    "condition2_classes"        :      ["Artery","Feedr","Norm","RRNn","RRAn","PosN","PosA","RRNe","RRAe"],
    "bldgtype_classes"          :      ["1Fam","2FmCon","Duplx","TwnhsE","TwnhsI"],
    "housestyle_classes"        :      ["1Story","1.5Fin","1.5Unf","2Story","2.5Fin","2.5Unf","SFoyer","SLvl"],
    "roofstyle_classes"         :      ["Flat","Gable","Gambrel","Hip","Mansard","Shed"],
    "roofmatl_classes"          :      ["ClyTile","CompShg","Membran","Metal","Roll","Tar","WdShake","WdShngl"],
    "exterior1st_classes"       :      ["AsbShng","AsphShn","BrkComm","BrkFace","CBlock","CemntBd","HdBoard","ImStucc","MetalSd","Other","Plywood","PreCast","Stone","Stucco","VinylSd","Wd Sdng","WdShing"],
    "exterior2nd_classes"       :      ["AsbShng","AsphShn","BrkComm","BrkFace","CBlock","CemntBd","HdBoard","ImStucc","MetalSd","Other","Plywood","PreCast","Stone","Stucco","VinylSd","Wd Sdng","WdShing"],
    "masvnrtype_classes"        :      ["BrkCmn","BrkFace","CBlock","None","Stone"],
    "exterqual_classes"         :      ["Ex","Gd","TA","Fa","Po"],
    "extercond_classes"         :      ["Ex","Gd","TA","Fa","Po"],
    "foundation_classes"        :      ["BrkTil","CBlock","PConc","Slab","Stone","Wood"],
    "bsmtqual_classes"          :      ["Ex","Gd","TA","Fa","Po","NA"],
    "bsmtcond_classes"          :      ["Ex","Gd","TA","Fa","Po","NA"],
    "bsmtexposure_classes"      :      ["Gd","Av","Mn","No","NA"],
    "bsmtfintype1_classes"      :      ["GLQ","ALQ","BLQ","Rec","LwQ","Unf","NA"],
    "bsmtfintype2_classes"      :      ["GLQ","ALQ","BLQ","Rec","LwQ","Unf","NA"],
    "heating_classes"           :      ["Floor","GasA","GasW","Grav","OthW","Wall"],
    "heatingqc_classes"         :      ["Ex","Gd","TA","Fa","Po"],
    "centralair_classes"        :      ["N", "Y"],
    "electrical_classes"        :      ["SBrkr","FuseA","FuseF","FuseP","Mix"],
    "kitchenqual_classes"       :      ["Ex","Gd","TA","Fa","Po"],
    "functional_classes"        :      ["Typ","Min1","Min2","Mod","Maj1","Maj2","Sev","Sal"],
    "fireplacequ_classes"       :      ["Ex","Gd","TA","Fa","Po","NA"],
    "garagetype_classes"        :      ["2Types","Attchd","Basment","BuiltIn","CarPort","Detchd","NA"],
    "garagefinish_classes"      :      ["Fin","RFn","Unf","NA"],
    "garagequal_classes"        :      ["Ex","Gd","TA","Fa","Po","NA"],
    "garagecond_classes"        :      ["Ex","Gd","TA","Fa","Po","NA"],
    "paveddrive_classes"        :      ["Y","P","N"],
    "poolqc_classes"            :      ["Ex","Gd","TA","Fa","NA"],
    "fence_classes"             :      ["GdPrv","MnPrv","GdWo","MnWw","NA"],
    "miscfeature_classes"       :      ["Elev","Gar2","Othr","Shed","TenC","NA"],
    "saletype_classes"          :      ["WD","CWD","VWD","New","COD","Con","ConLw","ConLI","ConLD","Oth"],
    "salecondition_classes"     :      ["Normal","Abnorml","AdjLand","Alloca","Family","Partial"]
}

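# Replace every class label with its 1-based position in its list.
# The comparison runs over the whole DataFrame, so a label shared by
# several lists (e.g. "Gd") takes the code from the first list that contains it.
for labels in classes.values():
    for code, label in enumerate(labels, start=1):
        train_csv = train_csv.mask(train_csv == label, code)
        test_csv = test_csv.mask(test_csv == label, code)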


print("train_csv")
print(train_csv)
print("test_csv")
print(test_csv)

train_csv
        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60        6         65.0     8450      2     0        1   
1        2          20        6         80.0     9600      2     0        1   
2        3          60        6         68.0    11250      2     0        2   
3        4          70        6         60.0     9550      2     0        2   
4        5          60        6         84.0    14260      2     0        2   
...    ...         ...      ...          ...      ...    ...   ...      ...   
1455  1456          60        6         62.0     7917      2     0        1   
1456  1457          20        6         85.0    13175      2     0        1   
1457  1458          70        6         66.0     9042      2     0        1   
1458  1459          20        6         68.0     9717      2     0        1   
1459  1460          20        6         75.0     9937      2     0        1   

     LandContour Utilities  ... PoolArea 

## 3.2 Analysing the dataset

We can check which features are most 

In [14]:
## Split x and y
train_x = train_csv.drop("SalePrice", axis=1)
train_x = train_x.drop("Id", axis=1).values
train_y = train_csv["SalePrice"].values
print(train_y)
print(train_x)

[208500 181500 223500 140000 250000 143000 307000 200000 129900 118000
 129500 345000 144000 279500 157000 132000 149000  90000 159000 139000
 325300 139400 230000 129900 154000 256300 134800 306000 207500  68500
  40000 149350 179900 165500 277500 309000 145000 153000 109000  82000
 160000 170000 144000 130250 141000 319900 239686 249700 113000 127000
 177000 114500 110000 385000 130000 180500 172500 196500 438780 124900
 158000 101000 202500 140000 219500 317000 180000 226000  80000 225000
 244000 129500 185000 144900 107400  91000 135750 127000 136500 110000
 193500 153500 245000 126500 168500 260000 174000 164500  85000 123600
 109900  98600 163500 133900 204750 185000 214000  94750  83000 128950
 205000 178000 118964 198900 169500 250000 100000 115000 115000 190000
 136900 180000 383970 217000 259500 176000 139000 155000 320000 163990
 180000 100000 136000 153900 181000  84500 128000  87000 155000 150000
 226000 244000 150750 220000 180000 174000 143000 171000 230000 231500
 11500

In [12]:
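# column-wise means of the features, labelled with the real column names
np.set_printoptions(threshold=np.inf)

feature_names = train_csv.drop(columns=["SalePrice", "Id"]).columns
means = []
for i in range(len(feature_names)):
    print(i, feature_names[i])
    print(train_x[:,i])

    # labels that were not converted to numbers become NaN here
    # instead of raising a TypeError, and are skipped by mean()
    column = pd.to_numeric(pd.Series(train_x[:,i]), errors="coerce")
    means.append(column.mean())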

0 mszoning_classes
[60 20 60 70 60 50 20 60 50 190 20 60 20 20 20 45 20 90 20 20 60 45 20 120
 20 20 20 20 20 30 70 20 20 20 120 60 20 20 20 90 20 20 85 20 20 120 50 20
 190 20 60 50 90 20 80 20 160 60 60 20 20 75 120 70 60 60 20 20 30 50 20
 20 60 20 50 180 20 50 90 50 60 120 20 20 80 60 60 160 50 20 20 20 30 190
 60 60 20 20 30 20 20 60 90 20 50 60 30 20 50 20 50 80 60 20 70 160 20 20
 60 60 80 50 20 120 20 190 120 45 60 20 60 60 20 20 20 20 20 90 60 60 20
 20 50 20 90 160 30 60 20 50 20 20 60 20 30 50 20 60 60 60 20 60 20 45 40
 190 20 60 60 20 50 20 160 20 20 20 60 50 20 30 160 70 20 50 50 75 80 50
 90 120 70 60 20 160 20 160 20 75 75 20 20 20 50 120 50 20 20 20 60 20 30
 20 60 20 60 20 20 70 50 120 20 60 60 20 20 160 60 160 20 120 20 60 160 20
 60 160 20 60 20 50 20 30 50 160 60 20 190 20 60 50 30 120 60 80 20 60 60
 20 60 20 80 60 80 50 30 20 60 75 30 20 60 20 60 20 20 50 20 20 20 60 60
 20 120 20 120 160 50 20 20 70 60 190 50 60 20 80 50 60 60 20 190 60 20 20
 75 20 60 50 30 20 

TypeError: unsupported operand type(s) for +: 'int' and 'str'
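The TypeError indicates that at least one column still holds string labels after the conversion, most likely because a spelling in the CSV differs from the one listed in `classes`. A small sketch for locating such leftovers; the helper name `columns_with_strings` and the toy frame (including the label "Duplex") are hypothetical:

```python
import pandas as pd

def columns_with_strings(df):
    """Map each column that still holds str values to the distinct strings in it."""
    leftovers = {}
    for column in df.columns:
        strings = sorted({v for v in df[column] if isinstance(v, str)})
        if strings:
            leftovers[column] = strings
    return leftovers

# Hypothetical frame after a partial label-to-number conversion
toy = pd.DataFrame({
    "Street": [2, 1],
    "BldgType": [1, "Duplex"],
    "LotArea": [8450, 9600],
})
print(columns_with_strings(toy))
```

Running the same helper on `train_csv` and `test_csv` would list the labels that still need an entry in `classes`.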

# N. References

- Predicting House Prices (Keras - ANN), Tomas Mantero, https://www.kaggle.com/code/tomasmantero/predicting-house-prices-keras-ann/notebook