# Quick note

The final dataset contains 76 columns, while the dummy-encoded version of it has 258 columns. Asking the user to enter values for all 258 columns would be impractical and would greatly increase the chance of errors. I therefore created a *.py* file called `converter.py` with a function called *dummies_converter* that turns the user's input into dummy variables and reshapes it into the format the models expect. Specifically, it converts the user's 76 input values into 257 output values (one fewer than 258 because the *'SalePrice'* column is excluded, so the function's result can be passed directly to a model for price prediction).
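The idea behind such a converter can be sketched as follows. This is only a minimal illustration, not the actual code in `converter.py`: the helper name `encode_user_input`, the toy columns, and the use of `pd.get_dummies` followed by `reindex` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical helper, NOT the actual dummies_converter from converter.py.
def encode_user_input(user_row, expected_columns):
    """Turn one raw input row (a dict) into a one-row frame of model-ready columns."""
    row = pd.DataFrame([user_row])
    # One-hot encode the categorical columns of this single row.
    row_dummies = pd.get_dummies(row)
    # Align with the training layout: dummy columns absent from this row become 0,
    # and columns the model never saw are dropped.
    aligned = row_dummies.reindex(columns=expected_columns, fill_value=0)
    return aligned.astype(float)

# Toy stand-in for the raw training data (made-up values).
train_raw = pd.DataFrame({
    "Lot Area": [8000, 9500, 7200],
    "Sale Type": ["WD", "New", "WD"],
    "SalePrice": [200000, 250000, 180000],
})
# Column layout of the dummy-encoded features, with the target excluded.
expected_columns = pd.get_dummies(train_raw.drop(columns="SalePrice")).columns

encoded = encode_user_input({"Lot Area": 8800, "Sale Type": "New"}, expected_columns)
print(encoded)
```

The key design point is that the column layout comes from the training data, so a single input row always produces the same number of columns no matter which categories it happens to contain.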

The project is organized into sections: default, cross-validation, scaled default, scaled cross-validation and mean error. The goal of this structure is to improve the organization of the project and to facilitate comparisons between all types of models.

**Each step in the project has the following format:**

* The cell in which the operation was performed

* The result of that operation

* A textual explanation detailing what was done in the cell, the conclusion, the idea, and similar information.

# Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
pwd

'C:\\Users\\jovan\\Desktop\\HouseProject\\House_Price_Prediction_-_Classification'

# Importing data

In [None]:
df = pd.read_csv('C:\\Users\\jovan\\Desktop\\HouseProject\\House_Price_Prediction_-_Classification\\Data\\Housing_Data_Final.csv')
df = df.drop('Unnamed: 0', axis = 1)

In [3]:
df

Unnamed: 0,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,Sale Condition_AdjLand,Sale Condition_Alloca,Sale Condition_Family,Sale Condition_Normal,Sale Condition_Partial
0,526301100,20,141.000000,31770,6,5,1960,1960,112.0,639.0,...,0,0,0,0,1,0,0,0,1,0
1,526350040,20,80.000000,11622,5,6,1961,1961,0.0,468.0,...,0,0,0,0,1,0,0,0,1,0
2,526351010,20,81.000000,14267,6,6,1958,1958,108.0,923.0,...,0,0,0,0,1,0,0,0,1,0
3,526353030,20,93.000000,11160,7,5,1968,1968,0.0,1065.0,...,0,0,0,0,1,0,0,0,1,0
4,527105010,60,74.000000,13830,5,5,1997,1998,0.0,791.0,...,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2921,923275080,80,37.000000,7937,6,6,1984,1984,0.0,819.0,...,0,0,0,0,1,0,0,0,1,0
2922,923276100,20,75.144444,8885,5,5,1983,1983,0.0,301.0,...,0,0,0,0,1,0,0,0,1,0
2923,923400125,85,62.000000,10441,5,5,1992,1992,0.0,337.0,...,0,0,0,0,1,0,0,0,1,0
2924,924100070,20,77.000000,10010,5,5,1974,1975,0.0,1071.0,...,0,0,0,0,1,0,0,0,1,0


# Train-test split

In [None]:
X = df.drop('SalePrice', axis = 1)

In [None]:
y = df['SalePrice']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

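**MEAN ABSOLUTE ERROR / ROOT MEAN SQUARED ERROR FUNCTION**

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

def errors(preds):
    # Compare predictions against the held-out test targets (y_test is a global)
    MAE = mean_absolute_error(y_test, preds)
    # RMSE is the square root of the mean squared error
    RMSE = np.sqrt(mean_squared_error(y_test, preds))
    print('MAE: ', MAE)
    print('RMSE: ', RMSE)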
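To make the two metrics concrete, here is a small worked example on made-up numbers (the arrays `y_true_demo` and `y_pred_demo` are illustrative and unrelated to the housing data). It shows that a single large miss raises RMSE more than MAE, because errors are squared before averaging.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the absolute errors are 10, 10, 0 and 40.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 300.0, 360.0])

# MAE: (10 + 10 + 0 + 40) / 4 = 15
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
# RMSE: sqrt((100 + 100 + 0 + 1600) / 4) = sqrt(450), roughly 21.2
rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print('MAE: ', mae_demo)
print('RMSE: ', rmse_demo)
```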

# Default models

**LINEAR REGRESSION**

In [20]:
from sklearn.linear_model import LinearRegression

In [10]:
linear_default_model = LinearRegression()

In [11]:
linear_default_model.fit(X_train, y_train)

In [12]:
linear_default_model_preds = linear_default_model.predict(X_test)

In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [15]:
MAE = mean_absolute_error(y_test, linear_default_model_preds)

In [16]:
RMSE = np.sqrt(mean_squared_error(y_test, linear_default_model_preds))

In [17]:
MAE

16915.228718310234

In [18]:
RMSE

38883.47863056807

-----

**L1 REGULARIZATION (LASSO)**

In [21]:
from sklearn.linear_model import Lasso

In [23]:
l1_model = Lasso(alpha = 1, max_iter = 1000000)

In [24]:
l1_model.fit(X_train, y_train)

In [25]:
l1_model_preds = l1_model.predict(X_test)

In [26]:
MAE = mean_absolute_error(y_test, l1_model_preds)

In [27]:
RMSE = np.sqrt(mean_squared_error(y_test, l1_model_preds))

In [28]:
MAE

16742.092468052157

In [29]:
RMSE

38020.421597050205

-----

**L2 REGULARIZATION (RIDGE)**

In [22]:
from sklearn.linear_model import Ridge

In [31]:
ridge_model = Ridge(alpha = 1)

In [33]:
ridge_model.fit(X_train, y_train)

  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T


In [36]:
ridge_model_preds = ridge_model.predict(X_test)

In [40]:
errors(preds = ridge_model_preds)

MAE:  16005.498628765352
RMSE:  27687.06407201346


-----

**L1 AND L2 REGULARIZATION (ELASTIC NET)**

In [23]:
from sklearn.linear_model import ElasticNet

In [60]:
elastic_net_model = ElasticNet(alpha = 1, l1_ratio = 0.5, max_iter = 100000)

In [61]:
elastic_net_model.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(


In [62]:
elastic_net_model_preds = elastic_net_model.predict(X_test)

In [63]:
errors(elastic_net_model_preds)

MAE:  17645.784577904455
RMSE:  32559.857206931836


-----

**K-NEAREST NEIGHBORS (KNN)**

In [24]:
from sklearn.neighbors import KNeighborsRegressor

In [65]:
knn_model = KNeighborsRegressor()

In [67]:
knn_model.fit(X_train, y_train)

In [68]:
knn_model_preds = knn_model.predict(X_test)

In [69]:
errors(knn_model_preds)

MAE:  28053.61010928962
RMSE:  42151.42035699611


-----

**DECISION TREE**

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [71]:
decision_tree_model = DecisionTreeRegressor()

In [72]:
decision_tree_model.fit(X_train, y_train)

In [73]:
decision_tree_model_preds = decision_tree_model.predict(X_test)

In [74]:
errors(decision_tree_model_preds)

MAE:  24263.63661202186
RMSE:  37851.334880597926


-----

**SUPPORT VECTOR MACHINES**

In [26]:
from sklearn.svm import SVR

In [76]:
svr_model = SVR()

In [77]:
svr_model.fit(X_train, y_train)

In [78]:
svr_model_preds = svr_model.predict(X_test)

In [79]:
errors(svr_model_preds)

MAE:  57693.89306077866
RMSE:  85014.62078868182


-----

**RANDOM FOREST**

In [27]:
from sklearn.ensemble import RandomForestRegressor

In [81]:
rfg_model = RandomForestRegressor()

In [82]:
rfg_model.fit(X_train, y_train)

In [83]:
rfg_model_preds = rfg_model.predict(X_test)

In [84]:
errors(rfg_model_preds)

MAE:  15038.344489981786
RMSE:  23871.615601542093


-----

**GRADIENT BOOST**

In [28]:
from sklearn.ensemble import GradientBoostingRegressor

In [87]:
gbr_model = GradientBoostingRegressor()

In [88]:
gbr_model.fit(X_train, y_train)

In [89]:
gbr_model_preds = gbr_model.predict(X_test)

In [90]:
errors(gbr_model_preds)

MAE:  13774.358780739503
RMSE:  21959.29196661789


-----

**ADA BOOST**

In [29]:
from sklearn.ensemble import AdaBoostRegressor

In [92]:
abr_model = AdaBoostRegressor()

In [93]:
abr_model.fit(X_train, y_train)

In [94]:
abr_model_preds = abr_model.predict(X_test)

In [95]:
errors(abr_model_preds)

MAE:  22820.838776245942
RMSE:  30475.009598736055


-----

# Scaled default models

**SCALER**

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
scaled_X_train = scaler.fit_transform(X_train)

In [None]:
scaled_X_test = scaler.transform(X_test)
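The scaler is fitted on the training data only and then reused on the test data, so no information from the test set leaks into the scaling. As a possible alternative, the same behaviour can be bundled into a scikit-learn pipeline. The sketch below uses synthetic data and illustrative names (`X_demo`, `demo_pipeline`, and so on); it is not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Manual approach: fit the scaler on the training split, reuse it on the test split.
demo_scaler = StandardScaler()
manual_model = Ridge(alpha=1.0).fit(demo_scaler.fit_transform(X_tr), y_tr)
manual_preds = manual_model.predict(demo_scaler.transform(X_te))

# Pipeline approach: fit() scales with training statistics, predict() reuses them.
demo_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
demo_pipeline.fit(X_tr, y_tr)
pipeline_preds = demo_pipeline.predict(X_te)

# Both approaches should give the same predictions.
print(np.allclose(manual_preds, pipeline_preds))
```

A pipeline also keeps the scaling inside each fold when it is passed to a cross-validation tool such as `GridSearchCV`.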

**MODEL FUNCTION**

In [None]:
def function_model(model):
    model.fit(scaled_X_train, y_train)
    preds = model.predict(scaled_X_test)
    errors(preds)

In [None]:
def function_model_nonscaled(model):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    errors(preds)

**LINEAR REGRESSION**

In [30]:
scaled_linear_model = LinearRegression()

In [35]:
function_model(scaled_linear_model)

MAE:  1.8723673728087936e+16
RMSE:  2.939553528160561e+17


-----

**L2 REGULARIZATION (RIDGE)**

In [43]:
scaled_l2_model = Ridge()

In [40]:
function_model(scaled_l2_model)

MAE:  16784.480670065055
RMSE:  37795.13825926879


-----

**L1 REGULARIZATION (LASSO)**

In [46]:
scaled_l1_model = Lasso(max_iter = 1000000)

In [47]:
function_model(scaled_l1_model)

MAE:  16923.111314815244
RMSE:  38791.89698752861


-----

**L1 AND L2 REGULARIZATION (ELASTIC NET)**

In [49]:
scaled_elastic_net_model = ElasticNet(max_iter = 100000)

In [50]:
function_model(scaled_elastic_net_model)

MAE:  16142.665953997894
RMSE:  29078.586212506536


-----

**K-NEAREST NEIGHBORS (KNN)**

In [53]:
scaled_knn_model = KNeighborsRegressor()

In [54]:
function_model(scaled_knn_model)

MAE:  23346.025956284153
RMSE:  35742.20789135747


-----

**SUPPORT VECTOR MACHINES**

In [55]:
scaled_svm_model = SVR()

In [56]:
function_model(scaled_svm_model)

MAE:  57655.825796571386
RMSE:  84988.34978945108


-----

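**Algorithms like decision trees, random forests, gradient boosting, and AdaBoost do not need scaled features, because they split on feature thresholds and rescaling a feature does not change which splits are possible!**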
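A quick way to check that tree-based models are insensitive to feature scaling is to fit the same tree on raw and on standardized features and compare the predictions. The sketch below does this on synthetic data; all names (`X_raw`, `tree_raw`, `tree_scaled`) are illustrative, and `max_depth=4` is an arbitrary choice for the demo.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Three synthetic features on deliberately different scales.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])
y_toy = X_raw[:, 0] + X_raw[:, 1] / 10.0 + rng.normal(scale=0.1, size=300)
X_fit, X_new = X_raw[:200], X_raw[200:]
y_fit = y_toy[:200]

# Standardize using statistics from the fitting rows only.
toy_scaler = StandardScaler().fit(X_fit)

# Same tree settings and seed, once on raw and once on standardized features.
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_fit, y_fit)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(
    toy_scaler.transform(X_fit), y_fit
)

raw_preds = tree_raw.predict(X_new)
scaled_preds = tree_scaled.predict(toy_scaler.transform(X_new))

# Standardization keeps the ordering of values within each feature,
# so both trees partition the samples in the same way.
same_predictions = np.allclose(raw_preds, scaled_preds)
print(same_predictions)
```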

# Cross-Validation

For algorithms that have a built-in cross-validated version (such as LassoCV, RidgeCV and ElasticNetCV), I will just use that version of the model.
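As a rough illustration of what a built-in cross-validated estimator does, the sketch below tunes the same hyperparameter twice on synthetic data: once with `RidgeCV` and once with a generic `GridSearchCV` around `Ridge`. The data, the alpha grid and the variable names are assumptions made for the example. Note that `RidgeCV` uses an efficient leave-one-out scheme when `cv` is not given; here `cv=5` is set explicitly so that both searches use 5-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(1)
X_cv = rng.normal(size=(120, 5))
y_cv = X_cv @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)

# Illustrative grid of regularization strengths.
alpha_grid = [0.1, 1.0, 10.0, 100.0]

# Built-in version: the estimator searches over alphas itself.
ridge_cv = RidgeCV(alphas=alpha_grid, cv=5).fit(X_cv, y_cv)

# Generic version: wrap a plain Ridge in a grid search.
ridge_grid = GridSearchCV(Ridge(), {"alpha": alpha_grid}, cv=5).fit(X_cv, y_cv)

print(ridge_cv.alpha_, ridge_grid.best_params_["alpha"])
```

Either way, the selected value is available after fitting: `alpha_` on the built-in estimator and `best_params_` on the grid search.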

In [None]:
from sklearn.model_selection import GridSearchCV

**L1 REGULARIZATION (LASSO)**

In [59]:
from sklearn.linear_model import LassoCV

In [95]:
lasso_model_cv = LassoCV(eps = 0.1, n_alphas = 100, cv = 5)

In [61]:
function_model(lasso_model_cv)

MAE:  20690.468883656555
RMSE:  34722.30269635705


**L2 REGULARIZATION (RIDGE)**

In [62]:
from sklearn.linear_model import RidgeCV

In [91]:
alphas = np.arange(1, 1000)
ridge_model_cv = RidgeCV(alphas = alphas)

In [92]:
function_model(ridge_model_cv)

MAE:  15781.533515444198
RMSE:  28036.50467878595


In [93]:
ridge_model_cv.alpha_

354

**L1 AND L2 REGULARIZATION (ELASTIC NET)**

In [94]:
from sklearn.linear_model import ElasticNetCV

In [96]:
elastic_net_model_cv = ElasticNetCV(l1_ratio = [.1, .5, .7, .9, .95, .99, 1], eps = 0.001, n_alphas = 100, max_iter = 100000)

In [97]:
function_model(elastic_net_model_cv)

MAE:  15327.677874421217
RMSE:  27559.682062184882


In [98]:
elastic_net_model_cv.l1_ratio_

1.0

In [101]:
elastic_net_model_cv.alpha_

588.4058618218438

**K-NEAREST NEIGHBORS (KNN)**

In [104]:
knn_model = KNeighborsRegressor()

In [105]:
k_values = list(range(1, 20))

In [110]:
param_grid = {'n_neighbors' : k_values}

In [111]:
knn_grid_model = GridSearchCV(estimator = knn_model, param_grid = param_grid)

In [112]:
function_model(knn_grid_model)

MAE:  23417.368283242264
RMSE:  36888.81090163333


In [113]:
knn_grid_model.best_params_

{'n_neighbors': 12}

**DECISION TREE**

In [None]:
from sklearn.tree import plot_tree

In [None]:
decision_tree_model = DecisionTreeRegressor()

In [131]:
criterion = ['squared_error', 'absolute_error']
max_depth = list(range(1, 10))
# max_leaf_nodes must be at least 2 in scikit-learn, so the range starts at 2
max_leaf_node = list(range(2, 10))

In [132]:
param_grid = {'criterion' : criterion, 'max_depth' : max_depth, 'max_leaf_nodes' : max_leaf_node}

In [133]:
decision_tree_model_cv = GridSearchCV(estimator = decision_tree_model, param_grid = param_grid)

In [134]:
function_model_nonscaled(decision_tree_model_cv)

KeyboardInterrupt: 