# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Housing Prices Prediction with Random Forest</b></p>
![](https://img.freepik.com/free-vector/modern-cottage-houses-set_74855-305.jpg?t=st=1658400642~exp=1658401242~hmac=d9fa26ceb8482ae8408d2e9b4e1d8b3315e530babfcee2245ad8b54eab9690cf&w=996)

<b>Hi guys </b>😀

In this notebook, I'm going to show you how to build a random forest model using the housing prices dataset.

<b>Table of contents:</b>
<ul>
<li><a href="#Loading">Loading the dataset</a></li>  
<li><a href="#Understanding">Understanding the dataset</a></li>         
<li><a href="#Data-Preprocessing">Data preprocessing</a></li>
<li><a href="#Missing">Handling missing data</a></li>
<li><a href="#Splitting">Splitting the Dataset</a></li>
<li><a href="#Pipelines">Pipelines for data preprocessing</a></li>
<li><a href="#Model-Building">Model building</a></li>
<li><a href="#Cross-Validation">Cross-validation</a></li>      
<li><a href="#Grid-Search">Grid Search</a></li>        
<li><a href="#Conclusion">Conclusion</a></li>   
</ul>

Happy learning 🐱‍🏍

<a id="Loading"></a>
# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Loading the Dataset</b></p>

The dataset I'm going to load is the housing prices dataset, which comes as separate train and test sets. Let's read both files with the `read_csv` function and then look at the first five rows of the train set with the `head` method.

In [None]:
import pandas as pd
df_train = pd.read_csv("../input/home-data-for-ml-course/train.csv")
df_test = pd.read_csv("../input/home-data-for-ml-course/test.csv")
df_train.head()

<a id="Understanding"></a>
# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Understanding the Dataset</b></p>

Let's take a look at the shapes of the train and test sets with the `shape` attribute.

In [None]:
print("The shape of train set: ", df_train.shape)
print("The shape of test set: ", df_test.shape)

Let's have a look at the column types with the `dtypes` attribute.

In [None]:
df_train.dtypes

You can also use the `info` method to see information such as the index dtype and columns, non-null values and memory usage.

In [None]:
df_train.info()

Let's see the summary statistics of numerical columns with the `describe` method.

In [None]:
df_train.describe().T

<a id="Data-Preprocessing"></a>
# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Data Preprocessing</b></p>

The first column is the `Id`. Let's convert this column into the index.

In [None]:
df_train.set_index("Id", inplace=True)
df_test.set_index("Id", inplace=True)
df_train.head()

<a id="Missing"></a>
## <span style="color:Orange">Handling Missing Data</span>


Let's count the missing values in each column with the `isnull` method.

In [None]:
df_train.isnull().sum()

Since there are many columns in the dataset, the output is truncated and we can't see the counts for all of them. Let's sort the columns by their number of missing values in descending order with the `sort_values` method and look at the first twenty rows.

In [None]:
cols_with_null = df_train.isnull().sum().sort_values(ascending=False)
cols_with_null.head(20)
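
Raw counts are easier to judge as a share of the rows. Here is a minimal sketch of that idea on a made-up frame; `toy` and its values are hypothetical stand-ins for `df_train`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 2001.0, 1998.0],
})

# isnull() returns a boolean frame; the mean of booleans is the fraction of True values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A share close to 1 suggests the column carries little information, which is one way to decide what to drop.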

To get the total number of missing values in the dataset, let me apply the `sum` method one more time.

In [None]:
print("Total number of missing data in the dataset: ", df_train.isnull().sum().sum())

Let's look at the number of missing values in the `SalePrice` target variable.

In [None]:
df_train["SalePrice"].isnull().sum()

Let's remove the six columns with the most missing data from both datasets with the `drop` method.

In [None]:
cols_to_drop = (cols_with_null.head(6).index).tolist()
df_train.drop(cols_to_drop, axis=1, inplace=True)
df_test.drop(cols_to_drop, axis=1, inplace=True)

## <span style="color:Orange">Creating the target and feature variables</span>

The `SalePrice` column is the target variable and the other columns are the features. Let's assign them to the `y` and `X` variables, respectively.

In [None]:
y = df_train.SalePrice
X = df_train.drop(["SalePrice"], axis=1)

<a id="Splitting"></a>
## <span style="color:Orange">Splitting the dataset</span>

Let's split the dataset into train and validation sets with the `train_test_split` function. 80% of the rows go to the train set and the rest to the validation set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X,y,train_size=0.8, random_state=0)
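
To see what the split returns, here is a small sketch on ten made-up rows. The names `X_toy`, `y_toy`, `area` and `price` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows with one feature and one target
X_toy = pd.DataFrame({"area": np.arange(10)})
y_toy = pd.Series(np.arange(10) * 1000, name="price")

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

# 8 rows for training, 2 rows for validation
print(X_tr.shape, X_va.shape)
# Features and target are shuffled together, so their indices stay aligned
print((X_tr.index == y_tr.index).all())
```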

## <span style="color:Orange">Handling the categorical and numerical columns</span>

Data preprocessing is different for categorical and numerical columns, so let's select them separately. For the categorical columns, I'm going to keep only those with fewer than ten unique values, which keeps the one-hot encoded data from growing too wide.

In [None]:
categorical_cols=[cname for cname in X_train.columns 
                  if X_train[cname].nunique()<10 and X_train[cname].dtype == "object"]

In [None]:
numerical_cols=[cname for cname in X_train.columns 
                if X_train[cname].dtype in ["int64", "float64"]]

Let's have a look at the number of categorical and numerical columns.

In [None]:
print("The number of categorical columns: ", len(categorical_cols))
print("The number of numerical columns: ", len(numerical_cols))
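
As an alternative sketch, the same split can be written with the pandas `select_dtypes` method. The frame `toy` and the names `cat_cols` and `num_cols` are hypothetical; the cardinality filter mirrors the one used above.

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
    "MasVnrArea": [196.0, 0.0, 162.0],
})

# Numeric columns of any width (int or float)
num_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric, restricted to low-cardinality columns
cat_cols = [c for c in toy.select_dtypes(exclude="number").columns
            if toy[c].nunique() < 10]

print(cat_cols, num_cols)
```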

We've selected 70 (= 35 + 35) columns. Let's keep only these columns in the train, validation and test sets.

In [None]:
my_cols=categorical_cols+numerical_cols
X_train = X_train[my_cols]
X_val = X_val[my_cols]
X_test = df_test[my_cols]

<a id="Pipelines"></a>
## <span style="color:Orange">Pipelines for data preprocessing</span>

A machine learning pipeline allows us to combine a series of steps involved in training a model. Let's import the necessary libraries to build the pipelines.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

Let's build a pipeline for numerical columns to handle missing data and scale data.

In [None]:
numerical_transformer = Pipeline(steps=[
    ("imputer_num", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

Let's build another pipeline for categorical columns to handle missing data and perform one-hot encoding.

In [None]:
categorical_transformer = Pipeline(steps = [
    ("imputer_cal", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

Let's apply these transformers to categorical and numerical columns.

In [None]:
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols)
])
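
To make the effect of such a preprocessor concrete, here is a minimal sketch on a four-row frame. The frame `toy` and the name `pre` are hypothetical; the steps mirror the two transformers above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "Street": pd.Series(["Pave", "Grvl", np.nan, "Pave"], dtype="object"),
})

pre = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                            ("scaler", StandardScaler())]), ["LotArea"]),
    ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Street"]),
])

out = pre.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense for printing
out = out.toarray() if hasattr(out, "toarray") else out

# One scaled numeric column plus one column per Street category
print(out.shape)
print(out)
```

The missing `LotArea` is filled with the median before scaling, and the missing `Street` is filled with the most frequent category before encoding.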

<a id="Model-Building"></a>
# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Model Building</b></p>

Random forest is an ensemble learning method used for classification and regression. It trains many decision trees on random subsets of the data and combines their predictions. Let's build a simple random forest model.

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
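
The ensemble idea can be checked directly: for regression, the forest's prediction is the average of its trees' predictions. The sketch below uses synthetic data from `make_regression`; the names `X_demo`, `y_demo` and `forest` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Predict the first five rows with every individual tree, then average
tree_preds = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

# Compare the manual average with the forest's own prediction
print(np.allclose(manual_mean, forest.predict(X_demo[:5])))
```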

Let's create a pipeline for data preprocessing and model building steps.

In [None]:
my_pipeline = Pipeline(steps=[ ("preprocessor", preprocessor),("model", rf)])

Let's train the model with the `fit` method.

In [None]:
my_pipeline.fit(X_train, y_train)

## <span style="color:Orange">Model evaluation</span>

Let's make predictions on the validation data with the `predict` method.

In [None]:
val_preds = my_pipeline.predict(X_val)

Let's see the performance of the model on the validation data with the `mean_absolute_error` function.

In [None]:
from sklearn.metrics import mean_absolute_error
print("Validation MAE: ", mean_absolute_error(y_val, val_preds))
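
MAE is the average of the absolute differences between the true and predicted values. Here is a small worked sketch with made-up prices; `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted sale prices
y_true = np.array([200000, 150000, 320000])
y_pred = np.array([210000, 140000, 300000])

# Absolute errors are 10000, 10000 and 20000; their mean is the MAE
manual_mae = np.mean(np.abs(y_true - y_pred))

print(manual_mae)
print(mean_absolute_error(y_true, y_pred))
```

Because the errors are not squared, MAE stays in the same unit as the target, here dollars.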

<a id="Cross-Validation"></a>
# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Cross-Validation</b></p>

Cross-validation is a resampling method that trains and tests the model on different portions of the data in each iteration. Let's compute the cross-validation scores with the `cross_val_score` function and average them with the `mean` method. Scikit-learn scorers follow a higher-is-better convention, so `neg_mean_absolute_error` returns negative values; multiplying them by -1 turns them back into MAE.

In [None]:
from sklearn.model_selection import cross_val_score
scores = -1 * cross_val_score(my_pipeline, X,y, cv = 5, scoring="neg_mean_absolute_error")
print("Mean Cross Validation Score: ", scores.mean())
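
To show what happens inside `cross_val_score`, here is a sketch that runs the five folds by hand with `KFold` and compares the result. It uses synthetic data and a small forest to stay fast; `X_demo`, `y_demo`, `model`, `manual` and `auto` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0)

# Manual loop: fit on four folds, score on the held-out fold
manual = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_preds = model.predict(X_demo[test_idx])
    manual.append(mean_absolute_error(y_demo[test_idx], fold_preds))

# Same procedure in one call; cv=5 means an unshuffled 5-fold split for a regressor
auto = -cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")

print(np.round(manual, 2))
print(np.round(auto, 2))
```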

<a id="Grid-Search"></a>
# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Grid Search</b></p>

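Grid search tries every combination of the parameter values listed in `param_grid` and scores each one with cross-validation. Let's find the best hyperparameters of the random forest model with the `GridSearchCV` class. Since no `scoring` argument is passed, `best_score_` is the mean cross-validated R² rather than the MAE.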

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = { 
    'model__n_estimators': [500, 600, 700],
    'model__max_features': [1.0, 'sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
    'model__max_depth' : [5,6,7],
    'model__criterion' :['squared_error','absolute_error','poisson']}
GridCV = GridSearchCV(my_pipeline, param_grid, n_jobs= -1)
GridCV.fit(X_train,y_train)  
print(GridCV.best_params_)    
print(GridCV.best_score_)
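
If you want the search to optimize MAE, the metric used in the rest of this notebook, you can pass a `scoring` argument. Here is a minimal sketch on synthetic data with a deliberately tiny grid; `pipe`, `small_grid` and `search` are hypothetical names, and the grid values are chosen only to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("model", RandomForestRegressor(random_state=0))])

# The "model__" prefix routes each parameter to the pipeline step named "model"
small_grid = {"model__n_estimators": [10, 20], "model__max_depth": [3, 5]}

search = GridSearchCV(pipe, small_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
# best_score_ is negative MAE, so flip the sign to read it as MAE
print("Best CV MAE:", -search.best_score_)
```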

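Let's make predictions on the test data with the `predict` method. `GridSearchCV` refits the best model on the whole train set by default, so we can call `predict` on it directly.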

In [None]:
preds_test = GridCV.predict(X_test)

Let's convert these predictions into a dataframe.

In [None]:
output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.head()

Let's write this dataframe to a CSV file.

In [None]:
output.to_csv('submission.csv', index=False)

You can now submit this file to the competition!

<a id="Conclusion"></a>
# <p style="background-color:coral;font-family:newtimeroman;font-size:150%;color:white;text-align:center;border-radius:20px 20px;"><b>Conclusion</b></p>

### That's it. In this notebook, I first explored the data and then built a random forest model to predict house prices. I also used grid search to find the best combination of hyperparameters.

### Thanks for reading 😀 If you like this notebook, please upvote it 😊

### Don't forget to follow us on [YouTube](http://youtube.com/tirendazacademy) | [Medium](http://tirendazacademy.medium.com) | [Twitter](http://twitter.com/tirendazacademy) | [GitHub](http://github.com/tirendazacademy) | [Linkedin](https://www.linkedin.com/in/tirendaz-academy) | [Kaggle](https://www.kaggle.com/tirendazacademy)