# Importing dependencies

The first step is to import all the libraries that are going to be used: `pandas` for data manipulation, and `sklearn` for data preprocessing and for the regression algorithms.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import seaborn as snb
from sklearn.feature_selection import SequentialFeatureSelector, SelectKBest, f_regression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from statistics import mean
from matplotlib import pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression

# Split Train and Test Datasets

After importing the modules, it's time to load the CSV dataset as a pandas DataFrame and then separate it into training and validation sets. In this case, about 67% of the data goes to training and 33% to validation. The method tested was a simple `train_test_split`: since the dataset has a fair number of rows, a **k-fold** split may not be strictly necessary. Note that the `train_test_split` call is currently commented out in the cell below.

In [2]:
dataset = pd.read_csv("./train.csv")
print(dataset.head())
y = dataset["SalePrice"]
X = dataset.drop("SalePrice", axis = 1)
# xTrain, xVal, yTrain, yVal = train_test_split(X, y, test_size = 0.33)
print(X.shape)
X.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

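| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |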
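
Since the split described above is commented out in the cell, here is a minimal sketch of what it could look like. The toy DataFrame and its values are made up for illustration, and `random_state = 42` is my assumption so the split is reproducible; the original call did not set it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv (hypothetical values, 9 rows).
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000, 9600, 14000, 14100, 10000, 10400, 6100],
    "SalePrice": [200000, 180000, 220000, 140000, 250000,
                  145000, 300000, 205000, 130000],
})

toy_y = toy["SalePrice"]
toy_X = toy.drop("SalePrice", axis=1)

# test_size=0.33 holds out about one third of the rows for validation;
# random_state (an assumption here) makes the split reproducible.
x_train, x_val, y_train, y_val = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42
)

print(x_train.shape, x_val.shape)
```

The same call should work on the real `X` and `y`, as long as the preprocessing is fitted on the training part only.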


# Handle Null Entries

This is where the preprocessing stage starts. First, I find which features in the dataset have null values. There are different approaches for those features, such as simply removing the columns or imputing the mean value into the null entries.

## Verify Number of Null Occurrences

Here I'm verifying the number of null values in each column to decide which method to use. Columns where 70% or more of the values are null are dropped, because they carry too little information. For columns below 70%, the null entries are filled with the column's mean value.

In [3]:
nullFeat = [col for col in X.columns if X[col].isnull().any()]

nullEntriesHigherThan70Percent = [col for col in nullFeat 
                          if X[col].isnull().sum(axis = 0) / X.shape[0] >= .7]
nullEntriesLessThan70Percent = [col for col in nullFeat 
                          if X[col].isnull().sum(axis = 0) / X.shape[0] < .7]

print(nullEntriesHigherThan70Percent, nullEntriesLessThan70Percent)

['Alley', 'PoolQC', 'Fence', 'MiscFeature'] ['LotFrontage', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']
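
As an aside, the same 70% rule can be written more compactly, because the mean of a boolean mask is the fraction of `True` values. This is only a sketch on a made-up toy frame; the names `null_rate`, `to_drop` and `to_impute` are mine, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from the dataset.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],   # 75% missing
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],   # 25% missing
    "LotArea": [8450, 9600, 11250, 9550],        # nothing missing
})

# isnull() returns booleans, so the column mean is the fraction of nulls.
null_rate = toy.isnull().mean()

to_drop = null_rate[null_rate >= 0.7].index.tolist()
to_impute = null_rate[(null_rate > 0) & (null_rate < 0.7)].index.tolist()

print(to_drop, to_impute)
```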


## Remove Null Features

First I'm removing the columns where the rate of null values is 70% or higher, to see the impact on the result.

In [4]:
X = X.drop(nullEntriesHigherThan70Percent, axis = 1)

print(X.shape)

(1460, 76)


## Convert Categorical Features into Numeric

Next, the categorical features are converted into numeric ones. Among the categorical columns, I analyse which can be considered ordinal and which are nominal. Nominal features with low cardinality (at most 5 unique values in the code below; this threshold was chosen arbitrarily) are encoded with the `OneHotEncoder`, and the remaining nominal features are label-encoded. Ordinal values follow an order, so it's important to map them to numbers that respect that order. The ordinal features are: <b>"Street", "LotShape", "LandContour", "Utilities", "LandSlope", "ExterQual", "ExterCond", "HeatingQC", "KitchenQual", "Functional", "PavedDrive", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "FireplaceQu", "GarageFinish", "GarageQual", "GarageCond"</b>.

In [5]:
categoricalColumns = [col for col in X if X[col].dtype == "object"]

print("All categorical columns: ", categoricalColumns)

nominalFeat = ["MSZoning", "LotConfig", "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "MasVnrType", 
              "Foundation", "Heating", "CentralAir", "Electrical", "GarageType", "SaleType", "SaleCondition", "Neighborhood",
              "Exterior1st", "Exterior2nd"]
ordinalFeat = ["Street", "LotShape", "LandContour", "Utilities", "LandSlope", "ExterQual", "ExterCond", "HeatingQC", "KitchenQual",
              "Functional", "PavedDrive", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "FireplaceQu",
              "GarageFinish", "GarageQual", "GarageCond"]

All categorical columns:  ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition']


In [6]:
newValues = {"N" : 0, "P" : 1, "Y": 2}
X = X.replace({"PavedDrive": newValues})

newValues = {"Sal" : 0, "Sev" : 1, "Maj2": 2, "Maj1": 3, "Mod": 4, "Min2": 5, "Min1": 6, "Typ": 7}
X = X.replace({"Functional": newValues})

newValues = {"Po" : 0, "Fa" : 1, "TA": 2, "Gd": 3, "Ex": 4}
X = X.replace({"ExterQual": newValues, "ExterCond": newValues, "HeatingQC": newValues, "KitchenQual": newValues})

newValues = {"Sev" : 0, "Mod" : 1, "Gtl": 2}
X = X.replace({"LandSlope": newValues})

newValues = {"ELO" : 0, "NoSeWa" : 1, "NoSewr": 2, "AllPub": 3}
X = X.replace({"Utilities": newValues})

newValues = {"Low" : 0, "HLS" : 1, "Bnk": 2, "Lvl": 3}
X = X.replace({"LandContour": newValues})

newValues = {"Grvl" : 0, "Pave" : 1}
X = X.replace({"Street": newValues})

newValues = {"IR3" : 0, "IR2" : 1, "IR1": 2, "Reg": 3}
X = X.replace({"LotShape": newValues})

newValues = {"Ex" : 6, "Gd" : 5, "TA" : 4, "Fa" : 3, "Po" : 2, "NA" : 1}
X = X.replace({"BsmtQual": newValues, "BsmtCond": newValues, "FireplaceQu" : newValues, "GarageQual" : newValues, "GarageCond" : newValues})

newValues = {"Gd" : 5, "Av" : 4, "Mn" : 3, "No" : 2, "NA" : 1}
X = X.replace({"BsmtExposure": newValues})

newValues = {"GLQ" : 7, "ALQ" : 6, "BLQ" : 5, "Rec" : 4, "LwQ" : 3, "Unf" : 2, "NA" : 1}
X = X.replace({"BsmtFinType1": newValues, "BsmtFinType2" : newValues})

newValues = {"Fin" : 4, "RFn" : 3, "Unf" : 2, "NA" : 1}
X = X.replace({"GarageFinish": newValues})

print(X["BsmtCond"].head())

0    4.0
1    4.0
2    4.0
3    5.0
4    4.0
Name: BsmtCond, dtype: float64
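
One detail worth checking in the mappings above: by default `pd.read_csv` parses the string `NA` as a missing value, so the `"NA"` keys likely never match and those entries stay `NaN` (which would explain the `float64` dtype in the output). The sketch below shows one possible way to rank them explicitly. The three-row CSV is made up, and treating "no basement" as the lowest rank instead of imputing it later is my assumption, not something the notebook decides.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; "NA" here stands for "no basement".
csv_text = "Id,BsmtQual\n1,Gd\n2,NA\n3,TA\n"
toy = pd.read_csv(io.StringIO(csv_text))

quality_scale = {"Po": 2, "Fa": 3, "TA": 4, "Gd": 5, "Ex": 6}

# read_csv already turned "NA" into NaN, so map the known levels first
# and then give the missing entries the lowest rank explicitly.
toy["BsmtQual"] = toy["BsmtQual"].map(quality_scale).fillna(1).astype(int)

print(toy["BsmtQual"].tolist())
```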


In [7]:
# Note: the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2
oneHot = OneHotEncoder(handle_unknown = "ignore", sparse = False)

nominalLowCardinality = [col for col in nominalFeat if X[col].nunique() <= 5]
nominalHighCardinality = [col for col in nominalFeat if col not in nominalLowCardinality]

#Applying One-Hot-Encoder

oneHotTrainCol = pd.DataFrame(oneHot.fit_transform(X[nominalLowCardinality]), index = X.index)

X = X.drop(nominalLowCardinality, axis = 1)

X = pd.concat([X, oneHotTrainCol], axis = 1)

#Applying Label-Encoder

xCopy = X.copy()

labelEncoder = LabelEncoder()
for col in nominalHighCardinality :
    X[col] = labelEncoder.fit_transform(xCopy[col])

X.head()

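| | Id | MSSubClass | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LandSlope | Neighborhood | ... | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 1 | 3 | 3 | 3 | 2 | 5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 2 | 20 | 80.0 | 9600 | 1 | 3 | 3 | 3 | 2 | 24 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 3 | 60 | 68.0 | 11250 | 1 | 2 | 3 | 3 | 2 | 5 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 4 | 70 | 60.0 | 9550 | 1 | 2 | 3 | 3 | 2 | 6 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 5 | 60 | 84.0 | 14260 | 1 | 2 | 3 | 3 | 2 | 15 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |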


In [8]:
simpleImputer = SimpleImputer()

imputedX = pd.DataFrame(simpleImputer.fit_transform(X), columns = X.columns)
print(imputedX)
[col for col in imputedX.columns if imputedX[col].isnull().sum(axis = 0) > 0]

          Id  MSSubClass  LotFrontage  LotArea  Street  LotShape  LandContour  \
0        1.0        60.0         65.0   8450.0     1.0       3.0          3.0   
1        2.0        20.0         80.0   9600.0     1.0       3.0          3.0   
2        3.0        60.0         68.0  11250.0     1.0       2.0          3.0   
3        4.0        70.0         60.0   9550.0     1.0       2.0          3.0   
4        5.0        60.0         84.0  14260.0     1.0       2.0          3.0   
...      ...         ...          ...      ...     ...       ...          ...   
1455  1456.0        60.0         62.0   7917.0     1.0       3.0          3.0   
1456  1457.0        20.0         85.0  13175.0     1.0       3.0          3.0   
1457  1458.0        70.0         66.0   9042.0     1.0       3.0          3.0   
1458  1459.0        20.0         68.0   9717.0     1.0       3.0          3.0   
1459  1460.0        20.0         75.0   9937.0     1.0       3.0          3.0   

      Utilities  LandSlope 

[]
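
To make the imputation step easier to follow, here is a tiny sketch of what `SimpleImputer` does with its default `mean` strategy. The toy values are made up; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing LotFrontage entry (hypothetical values).
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "LotArea": [8450.0, 9600.0, 11250.0],
})

# strategy="mean" is the default; it is spelled out here for clarity.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# The missing entry is replaced by the mean of the observed values.
print(filled["LotFrontage"].tolist())
```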

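# Testing Features

At this step, the plan is to use the `SequentialFeatureSelector`, which tests different combinations of features and keeps the combination with the best cross-validated score. The cell below does not reach that point yet: the feature-selection loop (a `SelectKBest` sweep) is commented out, and what actually runs is a `GridSearchCV` over several models, each scored by 5-fold cross-validated mean absolute error.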
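
For reference, a minimal sketch of how `SequentialFeatureSelector` can be used is shown here. Everything in it is an assumption for illustration: the synthetic data, the `LinearRegression` estimator, and the choice of forward selection with two features. It is not the configuration used in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
# Only columns 0 and 3 drive the synthetic target.
target = (3.0 * features[:, 0] - 2.0 * features[:, 3]
          + rng.normal(scale=0.1, size=200))

# Forward selection: start with no features and greedily add the one
# that most improves the cross-validated score.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="neg_mean_absolute_error",
    cv=5,
)
selector.fit(features, target)

# get_support() returns a boolean mask over the input columns.
selected = np.flatnonzero(selector.get_support()).tolist()
print(selected)
```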

In [9]:
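# Note: RandomForestClassifier and LogisticRegression are classifiers, so they
# treat every distinct SalePrice as a class label; RandomForestRegressor would
# be the regression counterpart of the former.
randomForest = RandomForestClassifier(n_jobs = 4, random_state = 42)
gbr = GradientBoostingRegressor(random_state = 42)
svr = SVR()
logisticRegression = LogisticRegression(random_state = 42)
paramGrid = [{
        "model": gbr,
        "name": "gbr",
        "params": {
            "learning_rate": [0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8], 
            "n_estimators": [50, 100, 150, 200, 250, 300],
            # "ls" and "lad" were renamed "squared_error" and "absolute_error" in scikit-learn 1.0
            "loss": ["ls", "lad", "huber"],
            "max_leaf_nodes": [10, 20, 30, 40]    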
        }
    },
    {
        "model": randomForest,
        "name": "randomForest",
        "params": {
            "n_estimators": [50, 100, 150, 200, 250, 300],
            "max_leaf_nodes": [10, 20, 30, 40]    
        }
    },
    {
        "model": svr,
        "name": "svr",
        "params": {
            "kernel": ["linear", "poly", "sigmoid", "rbf"],
            "gamma": ["scale", "auto"],
            "C": [1, 5, 10, 15]    
        }
    },
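    {
        "model": logisticRegression,
        "name": "logisticRegression",
        "params": {
            # The default lbfgs solver only supports "l2" and "none", so the
            # "l1" and "elasticnet" candidates fail to fit and score nan
            "penalty": ["l1", "l2", "elasticnet", "none"],
            "C": [1, 5, 10, 15]    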
        }
    }
]
scores = []

for val in paramGrid:
    if val["name"] != "svr":
        gridGBR = GridSearchCV(val["model"], val["params"], cv = 5, scoring = "neg_mean_absolute_error", n_jobs = 4)
        gridGBR.fit(imputedX, y)
        scores.append({val["name"]: {
            "score": gridGBR.best_score_ * -1,
            "params": gridGBR.best_params_
        }})
    print(scores)

print(scores)
# for i in range(5, 96, 5) :
#     gridGBR = GridSearchCV(gbr, paramGrid, cv = 5, scoring = "neg_mean_absolute_error", n_jobs = 4)
#     skb = SelectKBest(score_func = f_regression, k = i)
#     xSKB = skb.fit_transform(imputedX, y)
#     gridGBR.fit(xSKB, y)
#     maeSKBGS = gridGBR.best_score_ * -1
#     maeSKB = -1 * cross_val_score(gbr, xSKB, y, cv = 5, scoring = "neg_mean_absolute_error")
#     gridDF = pd.DataFrame(SKBGS.cv_results_)
#     print(maeSKBGS, mean(maeSKB), gridGBR.best_params_)
#     scores.append([maeSKBGS, mean(maeSKB)])
    
# scoresDF = pd.DataFrame(scores, columns = ["scores SKBGS", "scores SKB"])

# fig, axes = plt.subplots(1, 2, figsize=(16, 8), sharey=True)
# fig.suptitle("Different features selectors scores (MAE)")

# snb.lineplot(ax = axes[0], data = scoresDF, x = scoresDF.index, y = "scores SKBGS")
# axes[0].set_title("Scores SKBGS")
# axes[0].set_xlabel("n features")
# axes[0].set_ylabel("scores")

# snb.lineplot(ax = axes[1], data = scoresDF, x = scoresDF.index, y = "scores SKB")
# axes[1].set_title("Scores SKB")
# axes[1].set_xlabel("n features")
# axes[1].set_ylabel("scores")

#     selectedFeats.append(skb.get_support(indices = True))

[{'gbr': {'score': 15857.756727143651, 'params': {'learning_rate': 0.2, 'loss': 'ls', 'max_leaf_nodes': 10, 'n_estimators': 200}}}]




[{'gbr': {'score': 15857.756727143651, 'params': {'learning_rate': 0.2, 'loss': 'ls', 'max_leaf_nodes': 10, 'n_estimators': 200}}}, {'randomForest': {'score': 31309.969178082196, 'params': {'max_leaf_nodes': 40, 'n_estimators': 250}}}]
[{'gbr': {'score': 15857.756727143651, 'params': {'learning_rate': 0.2, 'loss': 'ls', 'max_leaf_nodes': 10, 'n_estimators': 200}}}, {'randomForest': {'score': 31309.969178082196, 'params': {'max_leaf_nodes': 40, 'n_estimators': 250}}}]


             nan -41151.6239726              nan -40317.74246575
             nan -41471.07739726             nan -40317.74246575
             nan -41537.31712329             nan -40317.74246575]


[{'gbr': {'score': 15857.756727143651, 'params': {'learning_rate': 0.2, 'loss': 'ls', 'max_leaf_nodes': 10, 'n_estimators': 200}}}, {'randomForest': {'score': 31309.969178082196, 'params': {'max_leaf_nodes': 40, 'n_estimators': 250}}}, {'logisticRegression': {'score': 40317.74246575342, 'params': {'C': 1, 'penalty': 'none'}}}]
[{'gbr': {'score': 15857.756727143651, 'params': {'learning_rate': 0.2, 'loss': 'ls', 'max_leaf_nodes': 10, 'n_estimators': 200}}}, {'randomForest': {'score': 31309.969178082196, 'params': {'max_leaf_nodes': 40, 'n_estimators': 250}}}, {'logisticRegression': {'score': 40317.74246575342, 'params': {'C': 1, 'penalty': 'none'}}}]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
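
As a possible follow-up to the comparison above, the sketch below runs the same kind of grid search using regressors only, with the imputer inside a `Pipeline` so that it is fitted on the training folds of each split. To be clear about what is swapped in: it uses `RandomForestRegressor` instead of the `RandomForestClassifier` from the cell above, the data is synthetic, and the parameter grids are much smaller than the ones in the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with a few missing entries (illustration only).
rng = np.random.default_rng(0)
features = rng.normal(size=(120, 4))
target = (50.0 * features[:, 0] + 20.0 * features[:, 1]
          + rng.normal(scale=5.0, size=120))
features[::10, 2] = np.nan

# Parameters are prefixed with the pipeline step name ("model__").
candidates = {
    "gbr": (
        GradientBoostingRegressor(random_state=42),
        {"model__n_estimators": [50, 100], "model__learning_rate": [0.1, 0.2]},
    ),
    "randomForest": (
        RandomForestRegressor(random_state=42),
        {"model__n_estimators": [50, 100]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    pipeline = Pipeline([("impute", SimpleImputer()), ("model", model)])
    search = GridSearchCV(pipeline, grid, cv=5,
                          scoring="neg_mean_absolute_error")
    search.fit(features, target)
    # The scorer is negated MAE, so flip the sign to report a plain MAE.
    results[name] = {"mae": -search.best_score_, "params": search.best_params_}

for name, result in results.items():
    print(name, round(result["mae"], 2), result["params"])
```

Keying `results` by model name, instead of appending one-entry dictionaries to a list, makes it a bit easier to look up the best score of each model afterwards.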
