# Housing Prediction

#### The project focuses on analyzing the most relevant features to carry out house price prediction using Machine Learning. Predicting house prices is valuable not only for the investment decision-making process but also for sellers, buyers, insurers and even the government in formulating housing policies. 

#### Because the target variable (price) is quantitative and continuous, this is a regression problem: the aim is to predict a continuous value (the price) from the provided input features.

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import warnings
#from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn import metrics
from sklearn.feature_selection import SelectKBest, f_regression
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('Housing.csv') 

## Characterization of the data 

#### * Let's check the dataset.

In [3]:
df.head(10)

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
5,10850000,7500,3,3,1,yes,no,yes,no,yes,2,yes,semi-furnished
6,10150000,8580,4,3,4,yes,no,no,no,yes,2,yes,semi-furnished
7,10150000,16200,5,3,2,yes,no,no,no,no,0,no,unfurnished
8,9870000,8100,4,1,2,yes,yes,yes,no,yes,2,yes,furnished
9,9800000,5750,3,2,4,yes,yes,no,no,yes,1,yes,unfurnished


#### * The dataset contains both categorical and numerical columns.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


In [5]:
df.shape

(545, 13)

#### * The dataset has 545 observations and 13 variables. 

In [6]:
df.drop_duplicates()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished


#### * There are no duplicate rows.
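#### `drop_duplicates()` returns a de-duplicated copy rather than a count, so the absence of duplicates is easier to confirm with `duplicated()`. The following is a minimal sketch on a toy frame (the values are illustrative and not taken from `Housing.csv`).

```python
import pandas as pd

# Toy frame standing in for the housing data (illustrative values only).
toy = pd.DataFrame({
    "price": [100, 200, 200, 300],
    "area": [50, 80, 80, 120],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())
print("duplicate rows:", n_duplicates)

# drop_duplicates() returns a new frame; assign the result to keep it.
deduped = toy.drop_duplicates()
print("shape after dropping duplicates:", deduped.shape)
```

#### On the real data, a count of zero from `df.duplicated().sum()` would support the statement above.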

In [7]:
df.isnull().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

#### * There are no missing values.

In [8]:
df.dtypes

price                int64
area                 int64
bedrooms             int64
bathrooms            int64
stories              int64
mainroad            object
guestroom           object
basement            object
hotwaterheating     object
airconditioning     object
parking              int64
prefarea            object
furnishingstatus    object
dtype: object

#### * Some columns are stored as strings. Let's convert them to numerical data for the machine learning models, since features must be in a suitable format for effective model training.

In [9]:
# Encode the yes/no columns as 0/1. Assigning the result back avoids
# in-place modification of a column selection, which newer pandas versions warn about.
binary_cols = ["mainroad", "guestroom", "basement", "hotwaterheating", "airconditioning", "prefarea"]
for col in binary_cols:
    df[col] = df[col].map({'no': 0, 'yes': 1})

# Encode furnishing status as ordered integers.
df["furnishingstatus"] = df["furnishingstatus"].map({'furnished': 0, 'semi-furnished': 1, 'unfurnished': 2})
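#### Coding `furnishingstatus` as 0, 1, 2 implies an order and equal spacing between the categories. An alternative, shown here only as a hedged sketch and not used in this notebook, is one-hot encoding with `pd.get_dummies`. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy column with the three furnishing categories.
toy = pd.DataFrame(
    {"furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"]}
)

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["furnishingstatus"], dtype=int)
print(encoded)
```

#### Each row then has exactly one indicator set to 1, so no ordering between categories is imposed on the model.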

In [10]:
df.head(10)

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,0
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,0
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,1
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,0
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,0
5,10850000,7500,3,3,1,1,0,1,0,1,2,1,1
6,10150000,8580,4,3,4,1,0,0,0,1,2,1,1
7,10150000,16200,5,3,2,1,0,0,0,0,0,0,2
8,9870000,8100,4,1,2,1,1,1,0,1,2,1,0
9,9800000,5750,3,2,4,1,1,0,0,1,1,1,2


In [11]:
df.describe()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
count,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.858716,0.177982,0.350459,0.045872,0.315596,0.693578,0.234862,1.069725
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.348635,0.382849,0.477552,0.209399,0.46518,0.861586,0.424302,0.761373
min,1750000.0,1650.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4340000.0,4600.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,2.0
max,13300000.0,16200.0,6.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0


In [12]:
df.dtypes.value_counts()

int64    13
Name: count, dtype: int64

#### * The dataset is now suitable for fitting the machine learning models. Before doing so, the price column is removed from the features and set as the target variable.

In [13]:
X = df.drop("price", axis=1)
y = df["price"]

#### * Let's check if the encoding works...

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    int64
 1   area              545 non-null    int64
 2   bedrooms          545 non-null    int64
 3   bathrooms         545 non-null    int64
 4   stories           545 non-null    int64
 5   mainroad          545 non-null    int64
 6   guestroom         545 non-null    int64
 7   basement          545 non-null    int64
 8   hotwaterheating   545 non-null    int64
 9   airconditioning   545 non-null    int64
 10  parking           545 non-null    int64
 11  prefarea          545 non-null    int64
 12  furnishingstatus  545 non-null    int64
dtypes: int64(13)
memory usage: 55.5 KB


In [15]:
print(X, y)

     area  bedrooms  bathrooms  stories  mainroad  guestroom  basement  \
0    7420         4          2        3         1          0         0   
1    8960         4          4        4         1          0         0   
2    9960         3          2        2         1          0         1   
3    7500         4          2        2         1          0         1   
4    7420         4          1        2         1          1         1   
..    ...       ...        ...      ...       ...        ...       ...   
540  3000         2          1        1         1          0         1   
541  2400         3          1        1         0          0         0   
542  3620         2          1        1         1          0         0   
543  2910         3          1        1         0          0         0   
544  3850         3          1        2         1          0         0   

     hotwaterheating  airconditioning  parking  prefarea  furnishingstatus  
0                  0              

#### * Here are some frequency histograms of the features.

In [16]:
df.hist(figsize = (20, 12))
plt.show()

In [None]:
df.corr()

#### * Features such as area, the number of bathrooms and bedrooms, and the availability of air conditioning correlate more strongly with the price than the furnishing status of the house or the installation of hot water heating. The next heatmap is plotted to check the correlations between all columns.

In [None]:
plt.figure(figsize=(20,12))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
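#### A heatmap can be hard to read for a single target, so it may help to rank the features by their correlation with price, or to score them with `SelectKBest` and `f_regression` (both imported at the top of this notebook). The sketch below uses synthetic data with made-up coefficients, so its output says nothing about the housing dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: price depends on area and bathrooms, not on "noise".
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.normal(5000, 2000, n),
    "bathrooms": rng.integers(1, 4, n),
    "noise": rng.normal(0, 1, n),
})
toy["price"] = (
    500 * toy["area"] + 600_000 * toy["bathrooms"] + rng.normal(0, 500_000, n)
)

# Rank features by their Pearson correlation with the target.
corr_with_price = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)

# Keep the k features with the highest univariate F-scores.
features = toy.drop(columns="price")
selector = SelectKBest(score_func=f_regression, k=2).fit(features, toy["price"])
selected = features.columns[selector.get_support()].tolist()
print(selected)
```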

In [None]:
X = df.drop("price", axis=1)
y = df["price"]

In [None]:
print(y)

## Preprocessing the data

## Splitting Data into Training and Testing

In [None]:
from sklearn.model_selection import train_test_split 

#### Splitting the data into training and testing sets helps to "estimate the performance of the machine learning algorithms when they are used to make predictions on data not used to train the model" (Brownlee, 2020). This technique can be used for both classification and regression problems, as well as for any supervised learning algorithm.

#### Keeping in mind that:

#### * The training dataset is used to fit the machine learning model, and
#### * The test dataset is used to evaluate the model's performance.

#### Model performance (measured here with R2, as this is a regression problem) is compared across the machine learning models for three train/test splits, as shown below:

* Train: 80%, Test: 20%

* Train: 75%, Test: 25%

* Train: 70%, Test: 30%
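#### The three splits listed above can be produced in a loop instead of editing `test_size` by hand. This is a minimal sketch with placeholder arrays that only share the row count (545) with the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the housing data.
X_demo = np.arange(545 * 2).reshape(545, 2)
y_demo = np.arange(545)

split_sizes = {}
for test_size in (0.20, 0.25, 0.30):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=42
    )
    # Record how many rows end up in each part of the split.
    split_sizes[test_size] = (len(X_tr), len(X_te))
    print(f"test_size={test_size:.2f}: train={len(X_tr)}, test={len(X_te)}")
```

#### Fixing `random_state` keeps each split reproducible, so differences between models are not caused by a different shuffle.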

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [None]:
X.shape, y.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape

#### The test_size variable is "where we actually specify the proportion of the test set" (Chauhan, 2019).

#### Before building a model, it is helpful to have an overview of the dataset to determine whether the dataset needs to be scaled or not. 

In [None]:
df.describe()

#### The features are on very different scales (area is in the thousands, while the other features range from 0 to 6), so the dataset is scaled to reduce the bias this can introduce in machine learning algorithms. Additionally, "neglecting scaling can unevenly influence regression problem, favoring some variables unfairly and disadvantaging certain classes during model training" (Cosgun, 2023). Although the Robust Scaler helps decrease the impact of outliers in features like area, StandardScaler is applied in this pre-processing step, considering that the rest of the features are on similar scales. This choice is consistent with Cosgun (2023), who states that "feature scaling relies on trial and error rather than a singular solution".
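#### To make the difference between the two scalers concrete, the sketch below applies `StandardScaler` and `RobustScaler` to a single illustrative column with one extreme value. The numbers are invented for the example and are not rows of the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative areas with one extreme value (not taken from the dataset).
area = np.array([[3000.0], [3500.0], [4000.0], [4500.0], [16000.0]])

# StandardScaler centres on the mean and divides by the standard deviation,
# so the extreme value shifts and compresses every other point.
standard = StandardScaler().fit_transform(area)

# RobustScaler centres on the median and divides by the interquartile range,
# so the typical points keep a comparable spread.
robust = RobustScaler().fit_transform(area)

print(np.round(standard.ravel(), 2))
print(np.round(robust.ravel(), 2))
```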

## Scaling the data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()

X_train_scaled= scaler.fit_transform(X_train)
X_test_scaled= scaler.transform(X_test)

## Machine learning approaches 

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

#### The first machine learning model to be applied is Random Forest:
* Commonly, "the feature importances given by this model are more reliable than the ones provided by a single tree" (Muller and Guido, 2017).
* "Regularly, it works well without heavy tuning of the parameters" (Muller and Guido, 2017).
* This model is flexible, versatile, and easy to use for capturing patterns in the data.

In [None]:
rf_model = RandomForestRegressor( max_features="sqrt", max_depth=40, random_state=42).fit(X_train,y_train)
y_pred = rf_model.predict(X_test)
y_train_pred = rf_model.predict(X_train)

In [None]:
mse=mean_squared_error(y_train, y_train_pred)
print(f'Mean Square Error_train: {mse}')

mse=mean_squared_error(y_test, y_pred)
print(f'Mean Square Error_test: {mse}')

r2=r2_score(y_train, y_train_pred)
print(f'R-squared_train: {r2}')

r2=r2_score(y_test, y_pred)
print(f'R-squared_test: {r2}')

#### After calculating measures of performance on the training set and the testing set, considering the difference between the metrics of each of them, it is clear that there is overfitting because the model performs very well for training data but not with the testing data. 
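#### The train/test comparison described above can be summarized as a single gap between the two R2 values. The sketch below uses synthetic data, and `min_samples_leaf=10` is an arbitrary illustrative setting (an assumption, not a tuned value from this project) that constrains tree growth.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: one informative feature plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fully grown trees versus trees constrained by a minimum leaf size.
deep = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
shallow = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=10, random_state=42
).fit(X_tr, y_tr)


def r2_gap(model):
    """Train R2 minus test R2; a large positive gap suggests overfitting."""
    return r2_score(y_tr, model.predict(X_tr)) - r2_score(y_te, model.predict(X_te))


gap_deep = r2_gap(deep)
gap_shallow = r2_gap(shallow)
print(f"gap (fully grown): {gap_deep:.3f}")
print(f"gap (min_samples_leaf=10): {gap_shallow:.3f}")
```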

In [None]:
predictions = rf_model.predict(X)
print(f'Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print(f'R-squared:', metrics.r2_score(y, predictions))

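#### Measured with R-squared on the full dataset, the performance of the model looks good. Note that this score includes the rows the model was trained on, so it is more optimistic than the test-set score.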

### Hyperparameter tuning techniques

#### Every machine learning model, as a mathematical model, has a number of parameters that are learned from the data. Hyperparameters, by contrast, are settings chosen before training; searching for the values that give the best performance of a model is a process called hyperparameter tuning. In this project, and particularly for this model, two hyperparameter tuning techniques are applied:

 * Random Hyperparameter Grid
 * Grid Search CV
 
#### Considering the explanation given in scikit-learn, "GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution" (scikit-learn). 

#### Random Hyperparameter Grid

In [None]:
from sklearn.model_selection import RandomizedSearchCV

#### To use RandomizedSearchCV, the first step is to create a parameter grid to sample from during fitting. In the case of a random forest, for instance, the number of decision trees is one hyperparameter.

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop= 600, num= 10)]
max_features= [1.0, 'sqrt']  # 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
max_depth = [int(x) for x in np.linspace (10, 200, num =11)]
max_depth.append(None)
min_samples_split= [2,5,8,10]
min_samples_leaf = [1,2,4,6]
bootstrap = [True, False]

##### Source: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

param_grid = {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth': max_depth, 'min_samples_split': min_samples_split, 'min_samples_leaf': min_samples_leaf, 'bootstrap':bootstrap}

print(param_grid)

rf = RandomForestRegressor()
rf_random=RandomizedSearchCV (estimator= rf, param_distributions=param_grid, n_iter= 100, cv=10, verbose=2, random_state=42)

rf_random.fit(X_train,y_train)
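#### To see why a random search is used for a grid of this size, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below rebuilds a grid with the same list lengths as the one above and compares the number of model fits for an exhaustive search with the `n_iter=100`, `cv=10` setting used here.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Grid with the same list lengths as the one defined in this notebook.
grid = {
    "n_estimators": [int(x) for x in np.linspace(50, 600, num=10)],
    "max_features": [1.0, "sqrt"],
    "max_depth": [int(x) for x in np.linspace(10, 200, num=11)] + [None],
    "min_samples_split": [2, 5, 8, 10],
    "min_samples_leaf": [1, 2, 4, 6],
    "bootstrap": [True, False],
}

# 10 * 2 * 12 * 4 * 4 * 2 distinct combinations.
n_combinations = len(ParameterGrid(grid))

exhaustive_fits = n_combinations * 10  # every combination, 10 folds each
randomized_fits = 100 * 10             # n_iter=100 sampled candidates, 10 folds each

print(n_combinations, exhaustive_fits, randomized_fits)
```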

In [None]:
rf_random.best_params_

#### The best parameters found by the hyperparameter tuning process are then used to fit the model again.

In [None]:
rf_random = RandomForestRegressor(n_estimators=477, min_samples_split=2, min_samples_leaf=1, max_features="sqrt", max_depth=67, random_state=42, bootstrap=True)
rf_random.fit(X_train,y_train)
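#### Instead of copying the best parameters by hand, a fitted search object also exposes the refitted model as `best_estimator_` (available because `refit=True` is the default). The sketch below shows this on synthetic data with a deliberately tiny, hypothetical search space so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=120)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    # Hypothetical, deliberately small search space.
    param_distributions={"n_estimators": [20, 40], "max_depth": [3, None]},
    n_iter=3,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# The winning configuration, already refitted on all the data passed to fit().
tuned_model = search.best_estimator_
print(search.best_params_)
```

#### Using `best_estimator_` avoids transcription errors and keeps the search object itself available under its original name.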

In [None]:
predictions = rf_random.predict(X)
print(f'Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print(f'R-squared:', metrics.r2_score(y, predictions))

#### The R2 increased only marginally, from 0.8517 to 0.8561.

#### Let's move on to another technique, one that "instead of sampling randomly from a distribution as RandomizedSearchCV does, evaluates all combinations we define" (Koehrsen, 2018).

#### GridSearchCV

#### To use this technique, it is necessary to define an additional grid, as shown below. 

In [None]:
from sklearn.model_selection import GridSearchCV

n_estimators =[int(x) for x in np.linspace(start = 50, stop= 700, num= 10)]

param_grid= {'bootstrap': [True], 'max_depth': [67], 'max_features': ['sqrt'], 'min_samples_leaf':[2], 'min_samples_split':[5], 'n_estimators': n_estimators}

rf=RandomForestRegressor()

rf_grid= GridSearchCV(estimator=rf, param_grid = param_grid, cv=10, verbose=2)

rf_grid.fit(X_train,y_train)

In [None]:
rf_grid.best_params_

#### The best parameters found by the hyperparameter tuning process are then used to fit the model again.

In [None]:
rf_grid = RandomForestRegressor(n_estimators=627, max_features="sqrt", max_depth=67, bootstrap= True, min_samples_leaf=2, min_samples_split=5, random_state=42)
rf_grid.fit(X_train,y_train)

In [None]:
predictions = rf_grid.predict(X)
print(f'Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print(f'R-squared:', metrics.r2_score(y, predictions))

#### Using the GridSearchCV technique, the R2 is even lower than with the first technique. However, to get a better estimate of the generalization performance than a single split into training and testing sets provides, cross-validation can be used to evaluate the performance of each model (Muller and Guido, 2017).

### Cross validation on the model

In [None]:
from sklearn.model_selection import cross_val_score

models = [{"model": rf_model, "name": "Base Model"}, {"model": rf_random, "name": "Random Model"}, {"model": rf_grid, "name": "Grid Model"}]

for model in models:
    scores = cross_val_score(model["model"], X_train,y_train, cv=5, scoring= 'r2')
    print("Model:", model ["name"])
    print("R2 Scores:", scores)
    print("Mean R2 score:", scores.mean())
    print("Standard Deviation of R2 scores", scores.std())
    print("\n")
    

#### Overall, a cross-validated R2 between 0.62 and 0.64 is not very good.

### Support Vector Regression

#### Let's try another model. Support vector regression is chosen because it is robust to outliers and can manage non-linear data by introducing a kernel function.

In [None]:
from sklearn.svm import SVR
svr_model = SVR()
svr_model.fit(X_train,y_train)

In [None]:
predictions = svr_model.predict(X)
print(f'Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print(f'R-squared:', metrics.r2_score(y, predictions))

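#### Although many factors might explain this result, a negative R2 means that the model predicts worse than simply using the mean price, so it does not follow the trend of the data. This points to underfitting rather than overfitting. A likely contributor is that SVR is sensitive to scale and was fit here on the unscaled features with default hyperparameters.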
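#### SVR is sensitive to the scale of both the features and the target, and the model above is fit on `X_train` rather than on the scaled arrays. The sketch below illustrates the effect on synthetic data with invented coefficients. Scaling the target with `TransformedTargetRegressor` is an assumption added for illustration and is not part of this notebook, and the scores it prints are not results for the housing data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with price-like magnitudes (invented coefficients).
rng = np.random.default_rng(0)
area = rng.uniform(1500, 16000, size=400)
bedrooms = rng.integers(1, 7, size=400)
X_demo = np.column_stack([area, bedrooms])
y_demo = 500.0 * area + 300_000.0 * bedrooms + rng.normal(scale=500_000.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Default SVR on raw features and a raw target in the millions.
raw_svr = SVR().fit(X_tr, y_tr)

# Same estimator, but with standardized features and a standardized target.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

r2_raw = r2_score(y_te, raw_svr.predict(X_te))
r2_scaled = r2_score(y_te, scaled_svr.predict(X_te))
print(f"R2 without scaling: {r2_raw:.3f}")
print(f"R2 with scaling:    {r2_scaled:.3f}")
```

#### Wrapping the scaler in a pipeline also ensures that it is fit only on the training portion of the data in each cross-validation fold.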

### Hyperparameter tuning techniques

#### GridSearchCV is applied to this model.

In [None]:
param_grid = { 'C': [1,5,10], 'gamma': ['auto', 'scale']}

model_svr = SVR()

svr_grid= GridSearchCV(model_svr, param_grid, cv=5)

svr_grid.fit(X_train, y_train)



In [None]:
svr_grid.best_params_

#### The best parameters found by the hyperparameter tuning process are then used to define the model again.

In [None]:
svr_model = SVR(C= 10, gamma = 'scale')

In [None]:
predictions = svr_grid.predict(X)
print(f'Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print(f'R-squared:', metrics.r2_score(y, predictions))

#### After finding and using the best parameters, the R2 is less negative. Thus, although this model clearly does not perform well, the usefulness of applying this kind of technique can be seen in practice.

### Cross validation on the model 

In [None]:
models = [{"model": svr_model, "name": "Base Model"}, {"model": svr_grid, "name": "Grid Model"}]

for model in models:
    scores = cross_val_score(model["model"], X_train,y_train, cv=20, scoring= 'r2')
    print("Model:", model ["name"])
    print("R2 Scores:", scores)
    print("Mean R2 score:", scores.mean())
    print("Standard Deviation of R2 scores", scores.std())
    print("\n")

#### With negative mean R2 scores in the cross validation, it is determined that this model performed poorly.
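#### For reference, R2 is computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the target from its mean. It is therefore 0 for a model that always predicts the mean and negative for anything worse. A small worked example on toy values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Always predicting the mean (3.0): SS_res equals SS_tot, so R2 is 0.
mean_prediction = np.full_like(y_true, y_true.mean())

# Always predicting 1.0: SS_res = 30 while SS_tot = 10, so R2 = 1 - 30/10 = -2.
off_prediction = np.full_like(y_true, 1.0)

r2_mean = r2_score(y_true, mean_prediction)
r2_off = r2_score(y_true, off_prediction)
print(r2_mean, r2_off)
```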

### KNN

#### Let's try another model, K-nearest neighbors (KNN), which makes predictions based on the similarity of data points in a dataset, is able to handle noisy data, is less sensitive to outliers, and requires few hyperparameters.
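#### The idea behind KNN regression can be written in a few lines: find the k training points closest to the query and average their targets. The function below is a hypothetical helper written for illustration (it is not the scikit-learn implementation), applied to toy values.

```python
import numpy as np


def knn_predict(X_ref, y_ref, x_query, k):
    """Average the targets of the k reference points closest to x_query."""
    distances = np.linalg.norm(X_ref - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    return y_ref[nearest].mean()


X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])

# The two nearest neighbours of 2.4 are 2.0 and 3.0, with targets 20 and 30.
prediction = knn_predict(X_toy, y_toy, np.array([2.4]), k=2)
print(prediction)  # 25.0
```

#### Because the prediction depends entirely on distances, features with large numeric ranges dominate the neighbour search unless the data is scaled first.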

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor(n_neighbors=2)
knn_reg.fit(X_train, y_train)

In [None]:
predictions = knn_reg.predict(X)
print(f'Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print(f'R-squared:', metrics.r2_score(y, predictions))


#### An R2 of around 0.61 is not very good.

In [None]:
param_grid= {'n_neighbors': [1,2,3,4,5,6,7,8,9,10]}

knn = KNeighborsRegressor()

knn_grid= GridSearchCV(knn, param_grid, cv=5)

knn_grid.fit(X_train, y_train)

In [None]:
knn_grid.best_params_

In [None]:
best_knn= KNeighborsRegressor(n_neighbors=7)
best_knn.fit(X_train, y_train)

In [None]:
predictions = knn_grid.predict(X)
print(f'Mean Squared Error:', metrics.mean_squared_error(y, predictions))
print(f'R-squared:', metrics.r2_score(y, predictions))

### Cross validation on the model

In [None]:
models = [{"model": knn_reg, "name": "Base Model"}, {"model": knn_grid, "name": "Grid Model"}]

for model in models:
    scores = cross_val_score(model["model"], X_train,y_train, cv=5, scoring= 'r2')
    print("Model:", model ["name"])
    print("R2 Scores:", scores)
    print("Mean R2 score:", scores.mean())
    print("Standard Deviation of R2 scores", scores.std())
    print("\n")
    

#### With lower mean R2 scores in the cross validation, it is determined that this model performed poorly.

#### The following are the cross-validated R2 results across the three train/test splits (by test size):

* Split 1: 20%
    * Random Forest: 0.6428
    * Support vector regression: -0.1104
    * KNN: 0.3250
* Split 2: 25%
    * Random Forest: 0.6306
    * Support vector regression: -0.1532
    * KNN:
* Split 3: 30%
    * Random Forest: 0.6135
    * Support vector regression: -0.1232
    * KNN: 0.2939
        

### Conclusions

* The dataset contains numerical and categorical variables for predicting the housing price. Before modeling, it was necessary to convert the categorical variables to numerical variables.
* Splitting the data into training and testing sets is useful for evaluating the performance of a machine learning algorithm on data that were not used to train the model.
* Scaling the data is good practice in the pre-processing step because it brings all the features onto the same scale, ensuring that no single feature dominates the performance of the algorithm.
* The Python module Scikit-Learn provides default hyperparameter values for all the models, but this project has taught me that tuning a model is a trial-and-error process: the chosen hyperparameters do not guarantee that the model is optimal the first time, which is why this technique is called tuning.
* After calculating measures of performance on the training set and the testing set, and considering the difference between the two, it is clear that there is overfitting, because the model performs very well on the training data but not on the testing data.
* Hyperparameters are another kind of parameter, and finding the values that give the best performance of a model is a process called hyperparameter tuning.


# References
* Brownlee (2020): https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
* Chauhan (2019): https://www.kdnuggets.com/2019/03/beginners-guide-linear-regression-python-scikit-learn.html#:~:text=The%20test_size%20variable%20is%20where,proportion%20of%20the%20test%20set.&text=After%20splitting%20the%20data%20into,along%20with%20our%20training%20data.
* Cosgun (2023): https://medium.com/@hhuseyincosgun/which-data-scaling-technique-should-i-use-a1615292061e#:~:text=RobustScaler%20is%20a%20data%20preprocessing,of%20outliers%20in%20the%20data.
* Koehrsen (2018): https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
* Muller and Guido (2017): Introduction to Machine Learning with Python
* scikit-learn: https://scikit-learn.org/stable/modules/grid_search.html
* scikit-learn: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#plot-all-scaling-standard-scaler-section
* https://www.geeksforgeeks.org/random-forest-regression-in-python/