 # House Price Prediction
 
 # Introduction
 <font color = 'green'>


**1.** [Business Problem](#1)
    
**2.** [About Dataset](#2)
    
**3.** [Data Loading & Checking](#3)
    
**4.** [Exploratory Data Analysis (EDA)](#4)
   
* **4.1** [Analysis of Categorical Variables](#5)
* **4.2** [Analysis of Numerical Variables](#6)
* **4.3** [Target Variable Analysis](#7)
* **4.4** [Outlier Analysis](#8)
* **4.5** [Correlation Analysis](#9)
    
**5.** [Data Preprocessing](#10)
    
* **5.1** [Missing Values ](#11)
* **5.2** [Outlier Suppression](#12)
* **5.3** [Encoding for Base Model](#13)
* **5.4** [Scaling for Base Model](#14)
    
**6.** [Base Model](#15)
    
* **6.1** [Hold-out Method](#16)
* **6.2** [Modeling](#17)
* **6.3** [Model Performance Evaluation](#18)
    
**7.** [Feature Engineering](#19)
    
* **7.1** [Feature Extraction](#20)
* **7.2** [Encoding for Current & New Features](#21)
* **7.3** [Feature Scaling](#22)
    
**8.** [Model](#23)
    
* **8.1** [Hold-out Method](#24)
* **8.2** [Modeling](#25)
* **8.3** [Model Performance Evaluation](#26)
* **8.4** [Cross Validation](#27)

**9.** [Hyperparameter Tuning](#28)
    
* **9.1** [Determining Parameters](#29)
* **9.2** [Best Parameters & Best Scores](#30)   
    
**10.** [Final Model](#31)   
* **10.1** [Modeling](#32)
* **10.2** [Cross Validation](#33)
* **10.3** [Feature Importance](#34)
* **10.4** [Prediction](#35)
* **10.5** [Creating a Submission File](#36)

<a id = "1"></a><br>
## 1. Business Problem

The goal of this machine learning project is to predict the sale prices of different types of houses from a dataset that contains each house's features and its price.

<a id = "2"></a>
## 2. About Dataset

The dataset for this project describes housing units in Ames, Iowa, and contains 79 explanatory variables. It comes from the Kaggle competition *House Prices - Advanced Regression Techniques*, where both the data and the competition page are available. Since it is part of a Kaggle competition, there are two separate CSV files: one for training data and another for testing data. The test file does not include house prices; these are the values you are expected to predict.

* **SalePrice** - the property's sale price in dollars. This is the target variable that you're trying to predict.
* **MSSubClass:** The building class
* **MSZoning:** The general zoning classification
* **LotFrontage:** Linear feet of street connected to property
* **LotArea:** Lot size in square feet
* **Street:** Type of road access
* **Alley:** Type of alley access
* **LotShape:** General shape of property
* **LandContour:** Flatness of the property
* **Utilities:** Type of utilities available
* **LotConfig:** Lot configuration
* **LandSlope:** Slope of property
* **Neighborhood:** Physical locations within Ames city limits
* **Condition1:** Proximity to main road or railroad
* **Condition2:** Proximity to main road or railroad (if a second is present)
* **BldgType:** Type of dwelling
* **HouseStyle:** Style of dwelling
* **OverallQual:** Overall material and finish quality
* **OverallCond:** Overall condition rating
* **YearBuilt:** Original construction date
* **YearRemodAdd:** Remodel date
* **RoofStyle:** Type of roof
* **RoofMatl:** Roof material
* **Exterior1st:** Exterior covering on house
* **Exterior2nd:** Exterior covering on house (if more than one material)
* **MasVnrType:** Masonry veneer type
* **MasVnrArea:** Masonry veneer area in square feet
* **ExterQual:** Exterior material quality
* **ExterCond:** Present condition of the material on the exterior
* **Foundation:** Type of foundation
* **BsmtQual:** Height of the basement
* **BsmtCond:** General condition of the basement
* **BsmtExposure:** Walkout or garden level basement walls
* **BsmtFinType1:** Quality of basement finished area
* **BsmtFinSF1:** Type 1 finished square feet
* **BsmtFinType2:** Quality of second finished area (if present)
* **BsmtFinSF2:** Type 2 finished square feet
* **BsmtUnfSF:** Unfinished square feet of basement area
* **TotalBsmtSF:** Total square feet of basement area
* **Heating:** Type of heating
* **HeatingQC:** Heating quality and condition
* **CentralAir:** Central air conditioning
* **Electrical:** Electrical system
* **1stFlrSF:** First Floor square feet
* **2ndFlrSF:** Second floor square feet
* **LowQualFinSF:** Low quality finished square feet (all floors)
* **GrLivArea:** Above grade (ground) living area square feet
* **BsmtFullBath:** Basement full bathrooms
* **BsmtHalfBath:** Basement half bathrooms
* **FullBath:** Full bathrooms above grade
* **HalfBath:** Half baths above grade
* **BedroomAbvGr:** Number of bedrooms above basement level
* **KitchenAbvGr:** Number of kitchens
* **KitchenQual:** Kitchen quality
* **TotRmsAbvGrd:** Total rooms above grade (does not include bathrooms)
* **Functional:** Home functionality rating
* **Fireplaces:** Number of fireplaces
* **FireplaceQu:** Fireplace quality
* **GarageType:** Garage location
* **GarageYrBlt:** Year garage was built
* **GarageFinish:** Interior finish of the garage
* **GarageCars:** Size of garage in car capacity
* **GarageArea:** Size of garage in square feet
* **GarageQual:** Garage quality
* **GarageCond:** Garage condition
* **PavedDrive:** Paved driveway
* **WoodDeckSF:** Wood deck area in square feet
* **OpenPorchSF:** Open porch area in square feet
* **EnclosedPorch:** Enclosed porch area in square feet
* **3SsnPorch:** Three season porch area in square feet
* **ScreenPorch:** Screen porch area in square feet
* **PoolArea:** Pool area in square feet
* **PoolQC:** Pool quality
* **Fence:** Fence quality
* **MiscFeature:** Miscellaneous feature not covered in other categories
* **MiscVal:** Dollar value of miscellaneous feature
* **MoSold:** Month Sold
* **YrSold:** Year Sold
* **SaleType:** Type of sale
* **SaleCondition:** Condition of sale

<a id = "3"></a><br>
## 3. Data Loading & Checking

In [None]:
!pip install missingno
!pip install pydotplus
!pip install astor
!pip install joblib
!pip install skompiler

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import date
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split,GridSearchCV,cross_validate
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import MinMaxScaler, StandardScaler,LabelEncoder,RobustScaler

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression,Ridge, Lasso, ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

import joblib 
import pydotplus
from skompiler import skompile
import datetime as dt

from sklearn.pipeline import Pipeline

import warnings
warnings.simplefilter(action="ignore")
from sklearn.exceptions import ConvergenceWarning

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option("display.width", 500)

In [None]:
# load the dataset:

train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv') 
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
# ignore_index=True gives the combined frame a unique 0..n-1 index:
df_ = pd.concat([train, test], ignore_index=True)
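
As a small illustration of what this concatenation produces, the sketch below stacks two made-up frames (the values and the `toy_*` names are hypothetical, not taken from the competition files). Rows that come from the test file end up with a missing `SalePrice`, which is what later sections rely on to separate train from test again. The sketch uses `ignore_index=True` so that the combined index has no repeated labels.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (made-up values).
toy_train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [11622, 14267]})

# Stack the two frames; the missing SalePrice column in toy_test becomes NaN.
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# Count the rows that have no target value (the former test rows).
n_missing_target = int(toy_all["SalePrice"].isnull().sum())
print(toy_all)
print("rows without target:", n_missing_target)
```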

In [None]:
# defining a function that returns a working copy of the dataset:

def load(dataframe): 
    df = dataframe.copy()
    return df

In [None]:
df = load(df_) # copy of the dataset
df.head()  # display first 5 rows

<a id = "4"></a>
## 4. Exploratory Data Analysis

In [None]:
# defining function to check all: 

def check_df(dataframe,head = 5):
    print("##################### SHAPE ####################")
    print(dataframe.shape)
    print("#################### COLUMNS ###################")
    print(dataframe.columns)
    print("#################### INDEX ###################")
    print(dataframe.index)
    print("#################### TYPES ##################")
    print(dataframe.dtypes)
    print("#################### NA ANY ###################")
    print(dataframe.isnull().values.any())
    print(f"#################### NA SUM - RATIO ####################")
    print(pd.DataFrame({"na_sum": dataframe.isnull().sum(),
                        "ratio": dataframe.isnull().sum() / dataframe.shape[0]}))
    print("#################### QUANTILES ###############")
    print(dataframe.describe().T)

check_df(df)

In [None]:
# converting dtypes for date columns:

date_year = ['YearBuilt','YearRemodAdd','GarageYrBlt','YrSold']
for col in date_year:
    df[col] = pd.to_datetime(df[col],format='%Y').dt.year

In [None]:
# converting dtypes for date columns:

date_months = ['MoSold']
for col in date_months:
    df[col] = pd.to_datetime(df[col], format='%m').dt.month

In [None]:
#checking:

df.info()

In [None]:
#checking:

df[['YearBuilt','YearRemodAdd','GarageYrBlt','MoSold','YrSold']].head()

In [None]:
# Capturing numeric and categorical variables:

def grab_col_names(dataframe, cat_th = 10, car_th = 20):
    # cat_cols, cat_but_car:
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    # num_cols:
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]
    return cat_cols, num_cols , cat_but_car

cat_cols, num_cols, cat_but_car = grab_col_names(df)
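
To make the two thresholds concrete, here is a simplified sketch of the same grouping rule on a made-up frame. It is not the notebook's function: it uses `pd.api.types.is_numeric_dtype` instead of the `"O"` dtype check, and the `toy` frame and `groups` dictionary are assumptions for illustration. Numeric columns with fewer than `cat_th` distinct values count as categorical, and non-numeric columns with more than `car_th` distinct values count as cardinal.

```python
import pandas as pd

toy = pd.DataFrame({
    "Street": ["Pave", "Grvl"] * 15,                # text, 2 distinct values
    "Neighborhood": [f"N{i}" for i in range(30)],   # text, 30 distinct values
    "Fireplaces": [0, 1, 2] * 10,                   # numeric, 3 distinct values
    "LotArea": list(range(8000, 8030)),             # numeric, 30 distinct values
})

cat_th, car_th = 10, 20
groups = {"categorical": [], "numeric": [], "cardinal": []}

for col in toy.columns:
    is_numeric = pd.api.types.is_numeric_dtype(toy[col])
    n_unique = toy[col].nunique()
    if is_numeric and n_unique < cat_th:
        groups["categorical"].append(col)   # numeric but with few levels
    elif is_numeric:
        groups["numeric"].append(col)
    elif n_unique > car_th:
        groups["cardinal"].append(col)      # text with too many levels
    else:
        groups["categorical"].append(col)

print(groups)
```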

In [None]:
print(f"Categoric columns:{len(cat_cols)}") # categorical columns
print(f"Numeric columns:{len(num_cols)}") # numeric columns
print(f"Cardinal columns:{len(cat_but_car)}") # categorical type but cardinal columns
print(f"Checking: total columns(cat_num_car):{len(cat_cols + num_cols + cat_but_car)} , dataset total columns:{len(df.columns)}")

In [None]:
cat_but_car # categorical type but cardinal columns

> **Adjusting the column types where needed:**

In [None]:
# defining a function to update cat_cols & num_cols by cat_but_car:

def col_types_updating_with_id_columns(dataframe,id_cols,car_but_cat_cols = None,cat_but_num_wrong = None,date_columns = None):
    # None defaults avoid sharing mutable lists between calls:
    car_but_cat_cols = car_but_cat_cols or []
    cat_but_num_wrong = cat_but_num_wrong or []
    date_columns = date_columns or []
    cat_cols, num_cols, cat_but_car = grab_col_names(dataframe)
    # cardinal columns listed in car_but_cat_cols stay categorical, the rest are treated as numeric:
    num_cols = [col for col in num_cols + cat_but_car if (col not in id_cols) and (col not in car_but_cat_cols)]
    cat_cols = [col for col in cat_cols + cat_but_car if (col not in id_cols) and (col not in num_cols)]
    # numeric-typed columns that are actually categorical:
    for col in cat_but_num_wrong:
        if col in num_cols:
            num_cols.remove(col)
        if (col not in cat_cols) and (col not in id_cols):
            cat_cols.append(col)
    # date columns are tracked in their own list:
    num_cols = [col for col in num_cols if col not in date_columns]
    cat_cols = [col for col in cat_cols if col not in date_columns]
    date_cols = [col for col in date_columns if col not in id_cols]
    return num_cols,cat_cols,date_cols

In [None]:
#id lists:

ids = ['Id']

In [None]:
# cardinal columns (in cat_but_car) that should be treated as categorical:

car_but_cat_cols = ['Neighborhood']

In [None]:
# categorical columns that were captured as numeric (in num_cols):

cat_but_num_wrong = ["MSSubClass","OverallQual"] 

In [None]:
date_cols = ["MoSold","YrSold","GarageYrBlt","YearBuilt","YearRemodAdd"]

In [None]:
# applying the function:

num_cols,cat_cols,date_cols = col_types_updating_with_id_columns(df,ids,car_but_cat_cols,cat_but_num_wrong,date_cols)

In [None]:
cat_but_car

In [None]:
date_cols

In [None]:
cat_cols

In [None]:
num_cols

In [None]:
#defining a function to change dtypes:

def update_not_correct_dtype(dataframe,num_cols,cat_cols):
    not_correct_dtype_n = [col for col in num_cols if dataframe[col].dtypes not in ["float64","int64","int32","datetime64[ns]"]]
    not_correct_dtype_c = [col for col in cat_cols if dataframe[col].dtypes not in ["O","category","datetime64[ns]"]]
    if len(not_correct_dtype_n) > 0:
        for col in not_correct_dtype_n:
            dataframe[col] = dataframe[col].astype("float64")
    if len(not_correct_dtype_c) > 0:    
        for col in not_correct_dtype_c:
            dataframe[col] = dataframe[col].astype("O")
    return not_correct_dtype_n,not_correct_dtype_c

In [None]:
# applying the function:
update_not_correct_dtype(df,num_cols,cat_cols)

In [None]:
#checking:

df.info()

<a id = "5"></a>
### 4.1 Analysis of Categorical Variables

In [None]:
# defining a function to check summary of the categorical variables:

def cat_summary(dataframe,col_name,plot = False):
    print(f"#################### {col_name} Counts - Ratio ####################")
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print(f"#################### {col_name} Unique Variable Counts ####################")
    print(f"{col_name} : {dataframe[col_name].nunique()}")
    if plot:
        print(f"#################### {col_name} Counts - Ratio Visualizing ####################")
        sns.countplot(x = dataframe[col_name], data = dataframe)
        plt.show(block = True)

In [None]:
# applying the function:
for col in cat_cols:
    cat_summary(df,col,plot=True)

<a id = "6"></a>
### 4.2 Analysis of Numerical Variables

In [None]:
# defining a function to check summary of the numerical variables:

def num_summary(dataframe,col_name,plot = False, quantiles = [0.05, 0.10, 0.20, 0.50, 0.60, 0.80, 0.90, 0.95, 0.99]):
    if plot:
        dataframe[col_name].hist(bins=20)
        plt.xlabel(col_name)
        plt.title(col_name)
        plt.show(block = True)
    print("#################### QUANTILES ###############")
    print(dataframe[col_name].describe(quantiles).T, end= "\n\n")

In [None]:
# applying the function:
for col in num_cols:
    num_summary(df, col , plot = True)

<a id = "7"></a>
### 4.3 Target Variable Analysis

In [None]:
# Target - Categorical Variables

In [None]:
# defining a function to check summary of the target and categorical variables:

def target_summary_with_cat(dataframe,target,categorical_col):
    for col in categorical_col:
        print(f"################ Target Mean by {col} #################", end ="\n\n")
        print(pd.DataFrame({f"{target}_Mean": dataframe.groupby(col)[target].mean()}), end = "\n\n\n")

In [None]:
# applying the function:

target_summary_with_cat(df, "SalePrice", cat_cols)

In [None]:
# Target - Numerical Variables

In [None]:
# defining a function to check summary of the target and numerical variables:

def analyze_continuous_target(df,target,numeric_cols):
    for col in numeric_cols:
        plt.figure(figsize=(10, 5))
        sns.regplot(x=df[col], y=df[target], line_kws={"color": "red"})
        plt.title(f'{target} vs {col}')
        plt.xlabel(col)
        plt.ylabel(target)
        plt.grid(True)
        plt.show()

In [None]:
# applying the function:

analyze_continuous_target(df,"SalePrice",num_cols)

In [None]:
# visualisation the date columns:

def analyze_date_target(df, year_columns, month_columns, target):
    # Year and month columns are expected to already hold plain integers
    # (e.g. 2008, 6), so no datetime conversion is needed before grouping.

    # Visualisation: 
    if len(year_columns) > 0:
        for col in year_columns:
            # Calculating annual average sales prices:
            yearly_avg = df.groupby(col)[target].mean().reset_index()
            # Creating the chart:
            plt.figure(figsize=(10, 5))
            plt.plot(yearly_avg[col], yearly_avg[target], marker='o', linestyle='-')
            plt.title(f'Average {target} by {col}')
            plt.xlabel(col)
            plt.ylabel(f'Average {target}')
            plt.grid(True)
            plt.show()
            
    if len(month_columns) > 0:
        for col in month_columns:
            # Calculating monthly average sales prices:
            monthly_avg = df.groupby(col)[target].mean().reset_index()
            # Creating the chart:
            plt.figure(figsize=(10, 5))
            plt.plot(monthly_avg[col], monthly_avg[target], marker='o', linestyle='-')
            plt.title(f'Average {target} by {col}')
            plt.xlabel(col)
            plt.ylabel(f'Average {target}')
            plt.grid(True)
            plt.show()

In [None]:
# applying the function:

analyze_date_target(df,['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold'],['MoSold'],"SalePrice")

<a id = "8"></a>
### 4.4 Outlier Analysis

In [None]:
# outlier analysis using graphical techniques:

for col in num_cols:
    sns.boxplot(x= df[col])
    plt.show()

In [None]:
# calculating outlier thresholds:

def outlier_thresholds(dataframe,col_name,q1=0.01,q3=0.99):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5*interquantile_range
    low_limit = quartile1 - 1.5*interquantile_range
    return low_limit, up_limit
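
The arithmetic behind these limits can be checked by hand on a small series. The sketch below uses made-up numbers (the integers 1 to 100 plus one extreme value) and the same 1% / 99% quantiles; the variable names are illustrative only. With 101 values, the 1% quantile is 2 and the 99% quantile is 100, so the spread is 98 and the limits sit 1.5 spreads beyond each quantile.

```python
import pandas as pd

values = pd.Series(list(range(1, 101)) + [1000])  # 1..100 plus one extreme value

q_low = values.quantile(0.01)    # 2.0
q_high = values.quantile(0.99)   # 100.0
spread = q_high - q_low          # 98.0

low_limit = q_low - 1.5 * spread     # 2 - 147 = -145
up_limit = q_high + 1.5 * spread     # 100 + 147 = 247

# Only values outside [low_limit, up_limit] are flagged.
flagged = values[(values < low_limit) | (values > up_limit)]
print(low_limit, up_limit, flagged.tolist())
```

Because the quantiles are so wide, only very extreme observations are flagged; a classic 25% / 75% choice would flag far more rows.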

In [None]:
# checking outlier in the dataframe:

def check_outlier(dataframe, col_name):
    if pd.api.types.is_numeric_dtype(dataframe[col_name]):
        low, up = outlier_thresholds(dataframe, col_name)
        return (dataframe[col_name] > up) | (dataframe[col_name] < low)
    else:
        return pd.Series(False, index=dataframe.index)

In [None]:
#checking results for a single column (check_outlier expects one column name, not a list):
check_outlier(df, num_cols[0]).head()

In [None]:
#defining a function to check columns outliers:

def check_all_columns_outliers(dataframe,num_cols):
    results = {}
    for col in num_cols:
        results[col] = check_outlier(dataframe, col).any() if pd.api.types.is_numeric_dtype(dataframe[col]) else False
    return results

In [None]:
#checking results:

check_all_columns_outliers(df,num_cols)

In [None]:
# listing columns based on outlier information:

def show_column_names_with_outliers_info(dataframe,col_list):
    print("################# Numeric Columns Outlier Thresholds: Low & Up Limit  #####################")
    for col in col_list:
        low, up = outlier_thresholds(dataframe,col)
        print(f"{col} : low: {low}, up: {up}",end ="\n")
    print(end="\n\n")
    no_outliers = []
    have_outliers = []
    for col,value in check_all_columns_outliers(dataframe,col_list).items():
        if value:
            have_outliers.append(col)
        else:
            no_outliers.append(col)
    print("################# Numeric Columns Have Outliers  #####################")
    print(have_outliers)
    print(f"count_columns: {len(have_outliers)}", end="\n\n")
    print("################# Numeric Columns Have NOT Outliers #####################")
    print(no_outliers)
    print(f"count_columns: {len(no_outliers)}", end="\n\n")
    return have_outliers,no_outliers


In [None]:
# applying the function:

have_outliers,no_outliers = show_column_names_with_outliers_info(df,num_cols)

<a id = "9"></a>
### 4.5 Correlation Analysis

In [None]:
# calculating correlation :

corr = df[num_cols].corr()
corr

In [None]:
# correlation graph:

sns.set(rc = {"figure.figsize":(12,12)})
sns.heatmap(corr,cmap = "RdBu")
plt.show()

In [None]:
# high correlation columns 
# list of items to be dropped:

def high_correlated_cols(dataframe,plot= False, corr_th = 0.90):
    corr = dataframe.corr()
    corr_matrix = corr.abs()
    # keep only the upper triangle so each pair of columns is inspected once:
    upper_triangle_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    drop_list = [col for col in upper_triangle_matrix.columns if any(upper_triangle_matrix[col] > corr_th)]
    if plot:
        sns.set(rc = {"figure.figsize":(15,15)})
        sns.heatmap(corr_matrix,cmap = "RdBu")
        plt.show()
    return drop_list

In [None]:
# applying the function:

high_correlated_cols(df[num_cols],plot=False)
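
A toy case shows why the upper triangle is used: each pair is examined once, and only the later column of a highly correlated pair is proposed for dropping. The frame below is synthetic (the column names and the random seed are arbitrary assumptions), with one column that is an exact linear function of another.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 1,             # perfectly correlated with "a"
    "noise": rng.normal(size=200),     # unrelated to "a"
})

corr_abs = toy.corr().abs()
# Mask everything except the entries strictly above the diagonal.
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]
print(to_drop)
```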

<a id = "10"></a>
## 5. Data Preprocessing

<a id = "11"></a>
### 5.1 Missing Values 

In [None]:
# defining a function for missing values:

def missing_values_table(dataframe, na_name=False):
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")
    if na_name:
        return na_columns

In [None]:
# creating a list:

no_cols = ["Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "FireplaceQu",
           "GarageType", "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature"]

In [None]:
# filling the no_cols:

for col in no_cols:
    df[col] = df[col].fillna("No")

In [None]:
# checking:

missing_values_table(df)

In [None]:
def quick_missing_imp(data, num_method="median", cat_length=20, target="SalePrice"):
    variables_with_na = [col for col in data.columns if data[col].isnull().sum() > 0]
    temp_target = data[target]
    print("# BEFORE")
    print(data[variables_with_na].isnull().sum(), "\n\n")
    data = data.apply(lambda x: x.fillna(x.mode()[0]) if (x.dtype == "O" and len(x.unique()) <= cat_length) else x, axis=0)
    if num_method == "mean":
        data = data.apply(lambda x: x.fillna(x.mean()) if x.dtype != "O" else x, axis=0)
    elif num_method == "median":
        data = data.apply(lambda x: x.fillna(x.median()) if x.dtype != "O" else x, axis=0)
    data[target] = temp_target
    print("# AFTER \n Imputation method is 'MODE' for categorical variables!")
    print(" Imputation method is '" + num_method.upper() + "' for numeric variables! \n")
    print(data[variables_with_na].isnull().sum(), "\n\n")
    return data

In [None]:
df = quick_missing_imp(df, num_method="median", cat_length=17)
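
The imputation rule itself is simple: the median for numeric columns and the mode for low-cardinality text columns. The following sketch applies that rule by hand to a made-up frame; the values are illustrative and the `filled` name is an assumption, not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA"],
})

filled = toy.copy()
# Numeric column: replace the gap with the median of the observed values (80).
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].median())
# Text column: replace the gap with the most frequent label ("SBrkr").
filled["Electrical"] = filled["Electrical"].fillna(filled["Electrical"].mode()[0])
print(filled)
```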

<a id = "12"></a>
### 5.2 Outlier Suppression

In [None]:
# outlier suppression:

def replace_with_thresholds(dataframe,variable):
    low, up = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low),variable] = low
    dataframe.loc[(dataframe[variable] > up),variable] = up

In [None]:
df[have_outliers].dtypes

In [None]:
#applying the function:

for col in have_outliers:
    if col not in ["SalePrice"]:
        replace_with_thresholds(df, col)

<a id = "13"></a>
### 5.3 Encoding for Base Model

In [None]:
df_base = df.copy()

In [None]:
# for categorical columns:

In [None]:
# defining a function for rare analysis:

def rare_analyser(dataframe, target, cat_cols):
    for col in cat_cols:
        print(col, ":", len(dataframe[col].value_counts()))
        print(pd.DataFrame({"Count": dataframe[col].value_counts(),
                            "Ratio": dataframe[col].value_counts() / len(dataframe),
                            "Target_Mean": dataframe.groupby(col)[target].mean()}), end="\n\n\n")

In [None]:
# defining a function for rare encoding:

def rare_encoder(dataframe, rare_perc):
    temp_df = dataframe.copy()
    rare_columns = [col for col in temp_df.columns if temp_df[col].dtypes == 'O'
                    and (temp_df[col].value_counts() / len(temp_df) < rare_perc).any(axis=None)]
    for var in rare_columns:
        tmp = temp_df[var].value_counts() / len(temp_df)
        rare_labels = tmp[tmp < rare_perc].index
        temp_df[var] = np.where(temp_df[var].isin(rare_labels), 'Rare', temp_df[var])
    return temp_df

In [None]:
# applying the function:

df_base = rare_encoder(df_base, 0.01)  # the function returns a copy, so the result must be assigned
df_base.head()
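
Rare encoding works on a copy of the frame, so the encoded result only exists in the returned object. The sketch below reproduces the core step on a made-up column (the 5% threshold and the toy labels are assumptions chosen so the effect is visible) and shows that the original frame is left untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"RoofMatl": ["CompShg"] * 98 + ["Metal", "Roll"]})

# Relative frequency of each label; "Metal" and "Roll" are at 1% each.
freq = toy["RoofMatl"].value_counts() / len(toy)
rare_labels = freq[freq < 0.05].index

encoded = toy.copy()
encoded["RoofMatl"] = np.where(encoded["RoofMatl"].isin(rare_labels), "Rare", encoded["RoofMatl"])

counts = encoded["RoofMatl"].value_counts().to_dict()
print(counts)
print("labels in original frame:", toy["RoofMatl"].nunique())
```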

In [None]:
# defining a function for label encoding:

def label_encoder(dataframe,cols):
    labelencoder = LabelEncoder()
    dataframe[cols] = labelencoder.fit_transform(dataframe[cols])
    return dataframe

In [None]:
# creating a list to label encoding:

# binary columns:

binary_cols = [col for col in df_base.columns if (df_base[col].dtypes not in ["float64","int64","int32"]) & (df_base[col].nunique() == 2)]
binary_cols

In [None]:
# applying the function:

for col in binary_cols:
    label_encoder(df_base,col)

In [None]:
# defining a function for one-hot encoding:

def one_hot_encoder(dataframe,cols,drop_first=True):
    dataframe = pd.get_dummies(dataframe,columns = cols, drop_first=drop_first)
    return dataframe

In [None]:
# creating a list to apply one-hot encoding:

# nominal columns:

ohe_cols = [col for col in df_base.columns if (df_base[col].dtype not in ["float64","int64","int32"]) & ((df_base[col].nunique() > 2) | (df_base[col].nunique() == 1))]
ohe_cols

In [None]:
# applying the function:

df_base = one_hot_encoder(df_base,ohe_cols)
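
The effect of `drop_first=True` is easiest to see on a single column. In this made-up example, the full encoding creates one indicator per level, while the reduced encoding drops the first level (alphabetically `IR1`), which then acts as the implicit reference category.

```python
import pandas as pd

toy = pd.DataFrame({"LotShape": ["Reg", "IR1", "IR2", "Reg"]})

full = pd.get_dummies(toy, columns=["LotShape"])
reduced = pd.get_dummies(toy, columns=["LotShape"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear indicator columns, which matters mostly for linear models; tree-based models are generally indifferent to it.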

In [None]:
#checking:

df_base.head()

In [None]:
#checking:

df.head()

<a id = "14"></a>
### 5.4 Scaling for Base Model

In [None]:
# for numerical columns:

In [None]:
# numerical columns not in date columns:

num_cols

In [None]:
date_cols

In [None]:
# standardization of numerical variables:

ss = StandardScaler()
ss_cols = num_cols + date_cols
ss_cols = [col for col in ss_cols if col not in ["SalePrice"]]
df_base[ss_cols] = ss.fit_transform(df_base[ss_cols])
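
Here the scaler is fitted on the combined train and test rows. A common alternative, shown as a sketch and not as what this notebook does, is to fit the scaler on the training rows only and then transform everything, so that test-set statistics do not influence the preprocessing. The toy frame and the `is_train` mask are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0, 3000.0],
    "SalePrice": [150000.0, 200000.0, np.nan, np.nan],  # NaN marks test rows
})

is_train = toy["SalePrice"].notnull()

scaler = StandardScaler()
scaler.fit(toy.loc[is_train, ["GrLivArea"]])            # mean 1250, std 250 from train rows only
toy["GrLivArea_scaled"] = scaler.transform(toy[["GrLivArea"]]).ravel()
print(toy)
```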

In [None]:
#checking:

df_base.head()

In [None]:
#checking:

df.head()

<a id = "15"></a>
## 6. Base Model

<a id = "16"></a>
### 6.1 Hold-out Method 

In [None]:
# filtering train & test data:

train_df_base = df_base[df_base['SalePrice'].notnull()]
test_df_base = df_base[df_base['SalePrice'].isnull()]

In [None]:
y_base = train_df_base['SalePrice']  # dependent variable
X_base = train_df_base.drop(["Id", "SalePrice"], axis=1) # independent variables

In [None]:
# hold-out method:

X_base_train, X_base_test, y_base_train, y_base_test = train_test_split(X_base, y_base, test_size=0.20, random_state=24)

<a id = "17"></a>
### 6.2 Modeling

In [None]:
# model names:

models = [('Linear Regression', LinearRegression()), ('KNN', KNeighborsRegressor()), ('CART', DecisionTreeRegressor()),
          ('Random Forest', RandomForestRegressor()), ('GBM', GradientBoostingRegressor()),
          ("XGBoost", XGBRegressor(objective='reg:squarederror')), ("LightGBM", LGBMRegressor(verbose=-1)),("CatBoost", CatBoostRegressor(verbose=0))]

In [None]:
# fitting models:

for name, regressor in models:
    regressor.fit(X_base_train, y_base_train)

<a id = "18"></a>
### 6.3 Model Performance Evaluation

In [None]:
# evaluating results:

for model_name, regressor in models:
    y_base_pred = regressor.predict(X_base_test)
    mae = mean_absolute_error(y_base_test, y_base_pred)
    mse = mean_squared_error(y_base_test, y_base_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_base_test, y_base_pred)
    print(f"{model_name}:")
    print(f"  MAE: {mae:.4f}")
    print(f"  MSE: {mse:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  R2: {r2:.4f}")
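
The metrics above are computed on raw dollar amounts. The competition's evaluation page describes the leaderboard metric as the RMSE between the logarithm of the predicted price and the logarithm of the observed price, so a log-scale score can be a useful companion. The helper below is a hypothetical addition (the name `rmse_log` and the toy prices are assumptions), sketched on made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of observed and predicted prices."""
    return float(np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred))))

y_true = np.array([100000.0, 200000.0, 400000.0])
y_pred = y_true * 1.10   # every prediction is 10% too high

score = rmse_log(y_true, y_pred)
print(round(score, 4))
```

On the log scale a 10% error costs the same for a cheap house as for an expensive one, whereas plain RMSE is dominated by the most expensive homes.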

<a id = "19"></a>
## 7. Feature Engineering

<a id = "20"></a>
### 7.1 Feature Extraction

In [None]:
# 1Story or 2Story:
    
df.loc[(df["HouseStyle"] == "1Story") | (df["HouseStyle"] == "2Story"),"New_HouseStyle"] = "1or2Story"

In [None]:
df["New_HouseStyle"] = df["New_HouseStyle"].apply(lambda x: "1or2Story" if x == "1or2Story" else "other")

In [None]:
df["New_HouseStyle"].value_counts()

In [None]:
df["New_1st*GrLiv"] = df["1stFlrSF"] * df["GrLivArea"]

In [None]:
df["New_1st*GrLiv"].head()

In [None]:
df["New_Garage*GrLiv"] = df["GarageArea"] * df["GrLivArea"]

In [None]:
df["New_Garage*GrLiv"].head()

In [None]:
df["New_TotalFlrSF"] = df["1stFlrSF"] + df["2ndFlrSF"]

In [None]:
df["New_TotalFlrSF"].head()

In [None]:
df["New_TotalBsmtFin"] = df["BsmtFinSF1"] + df["BsmtFinSF2"]

In [None]:
df["New_TotalBsmtFin"].head()

In [None]:
df["New_TotalSqFeet"] = df["GrLivArea"] + df["TotalBsmtSF"]

In [None]:
df["New_TotalSqFeet"].head()

In [None]:
df["New_Restoration"] = df["YearRemodAdd"] - df["YearBuilt"]

In [None]:
df["New_Restoration"].head()

In [None]:
df["New_HouseAge"] = df["YrSold"] - df["YearBuilt"]

In [None]:
df["New_HouseAge"].head()

In [None]:
df["New_RestorationAge"] = df["YrSold"] - df["YearRemodAdd"]

In [None]:
df["New_RestorationAge"].head()

In [None]:
df["New_GarageAge"] = df["GarageYrBlt"] - df["YearBuilt"]

In [None]:
df["New_GarageAge"].head()

In [None]:
df["New_GarageRestorationAge"] = np.abs(df["GarageYrBlt"] - df["YearRemodAdd"])

In [None]:
df["New_GarageRestorationAge"].head()

In [None]:
df["New_GarageSold"] = df["YrSold"] - df["GarageYrBlt"]

In [None]:
df["New_GarageSold"].head()

In [None]:
df.head()
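
The year-based features above are plain differences between year columns, which only make sense while those columns still hold raw years. A minimal check on made-up rows (the shortened column names `HouseAge` and `RestorationAge` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "YearBuilt": [2000, 1950],
    "YearRemodAdd": [2005, 1950],
    "YrSold": [2008, 2010],
})

# Age of the house at the time of sale.
toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]
# Years between the last remodel and the sale.
toy["RestorationAge"] = toy["YrSold"] - toy["YearRemodAdd"]
print(toy)
```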

<a id = "21"></a>
### 7.2 Encoding for Current & New Features

In [None]:
# for new categorical columns:

In [None]:
# applying the function:

df = rare_encoder(df, 0.01)  # the function returns a copy, so the result must be assigned
df.head()

In [None]:
# creating a list to label encoding:

# binary columns:

binary_cols = [col for col in df.columns if (df[col].dtypes not in ["float64","int64","int32"]) & (df[col].nunique() == 2)]
binary_cols

In [None]:
# applying the function :

for col in binary_cols:
    label_encoder(df,col)

In [None]:
# creating a list to apply one-hot encoding:

# nominal columns:

ohe_cols = [col for col in df.columns if (df[col].dtype not in ["float64","int64","int32"]) & ((df[col].nunique() > 2) | (df[col].nunique() == 1))]
ohe_cols

In [None]:
# applying the function:

df = one_hot_encoder(df,ohe_cols)

In [None]:
#checking:

df.head()

<a id = "22"></a>
### 7.3 Feature Scaling

In [None]:
# for current & new numerical columns:

In [None]:
# standardization of numerical variables:

ss = StandardScaler()
ss_cols = num_cols + date_cols
ss_cols = [col for col in ss_cols if col not in ["SalePrice"]]
df[ss_cols] = ss.fit_transform(df[ss_cols])

In [None]:
#checking:

df.head()

<a id = "23"></a>
## 8. Model

<a id = "24"></a>
### 8.1 Hold-out Method

In [None]:
# filtering train & test data:

train_df = df[df['SalePrice'].notnull()]
test_df = df[df['SalePrice'].isnull()]

In [None]:
y = train_df['SalePrice']  # dependent variable
X = train_df.drop(["Id", "SalePrice"], axis=1) # independent variables

In [None]:
# hold-out method:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=24)

<a id = "25"></a>
## 8.2 Modeling 

In [None]:
# model names:

models = [('Linear Regression', LinearRegression()), ('KNN', KNeighborsRegressor()), ('CART', DecisionTreeRegressor()),
          ('Random Forest', RandomForestRegressor()), ('GBM', GradientBoostingRegressor()),
          ("XGBoost", XGBRegressor(objective='reg:squarederror')), ("LightGBM", LGBMRegressor(verbose=-1)),("CatBoost", CatBoostRegressor(verbose=0))]

In [None]:
# fitting models:

for name, regressor in models:
    regressor.fit(X_train, y_train)

<a id = "26"></a>
## 8.3 Model Performance Evaluation

In [None]:
# evaluating results:

for model_name, regressor in models:
    y_pred = regressor.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name}:")
    print(f"  MAE: {mae:.4f}")
    print(f"  MSE: {mse:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  R2: {r2:.4f}")
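
The printed blocks above can be hard to compare across eight models. One possible alternative is to collect the same metrics in a table sorted by RMSE. The helper name `evaluate_regressors` and the synthetic data are assumptions made for this sketch; the metric functions are the ones already used above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def evaluate_regressors(fitted_models, X_eval, y_eval):
    """Return one row of hold-out metrics per fitted (name, regressor) pair, sorted by RMSE."""
    rows = []
    for name, reg in fitted_models:
        pred = reg.predict(X_eval)
        mse = mean_squared_error(y_eval, pred)
        rows.append({
            "model": name,
            "MAE": mean_absolute_error(y_eval, pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse),
            "R2": r2_score(y_eval, pred),
        })
    return pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)

# synthetic stand-in for X, y
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=24)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=24)

toy_models = [("Linear Regression", LinearRegression()),
              ("CART", DecisionTreeRegressor(random_state=24))]
for _, reg in toy_models:
    reg.fit(X_tr, y_tr)

results = evaluate_regressors(toy_models, X_te, y_te)
print(results)
```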

<a id = "27"></a>
## 8.4 Cross Validation

In [None]:
# evaluating 5-fold cross validation results:

for model_name, regressor in models:
    scoring = {'mae': 'neg_mean_absolute_error','mse': 'neg_mean_squared_error','r2': 'r2'}
    cv_results = cross_validate(regressor,X,y,cv=5,scoring=scoring)
    print(f"{model_name}:",end="\n")
    print("Average MAE: ", -cv_results['test_mae'].mean())
    print("Average MSE: ", -cv_results['test_mse'].mean())
    print("Average R2: ", cv_results['test_r2'].mean(),end="\n\n")

<a id = "28"></a>
# 9. Hyperparameter Tuning

<a id = "29"></a>
## 9.1 Determining Parameters

In [None]:
# default parameters of the models:

for model_name, regressor in models:
    print(f"####################### {model_name} #######################")
    print(f"parameters: {regressor.get_params()}",end="\n\n")

In [None]:
# determined parameters of the models for hyperparameter tuning:

model_params = [
    ('Linear Regression', LinearRegression(), {
        'fit_intercept': [True, False]
    }),
    ('KNN', KNeighborsRegressor(), {
        'n_neighbors': range(3,11),
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    }),
    ('CART', DecisionTreeRegressor(), {
        'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],  # 'mse'/'mae' were renamed in scikit-learn 1.0 and removed in 1.2
        'splitter': ['best', 'random'],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': range(2,10)
    }),
    ('Random Forest', RandomForestRegressor(), {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }),
    ('GBM', GradientBoostingRegressor(), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.8, 1.0],
        'max_depth': [3, 5, 7]
    }),
    ('XGBoost', XGBRegressor(objective='reg:squarederror'), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.8, 1.0],
        'max_depth': [3, 5, 7]
    }),
    ('LightGBM', LGBMRegressor(verbose=-1), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'num_leaves': [31, 50, 100],
        'boosting_type': ['gbdt', 'dart']
    }),
    ('CatBoost', CatBoostRegressor(verbose=0), {
        'iterations': [200, 500],
        'learning_rate': [0.01, 0.1, 0.2],
        'depth': [3, 5, 7]
    })
]
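
Before launching an exhaustive search it can help to know how many fits each grid implies. The sketch below counts them for two of the grids above with `ParameterGrid`; the dictionary name `toy_grids` is an assumption for this example, and five folds are used as in the rest of the notebook.

```python
from sklearn.model_selection import ParameterGrid

# two of the grids from the cell above, copied for illustration
toy_grids = {
    "KNN": {
        "n_neighbors": list(range(3, 11)),
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5, 10],
    },
}

cv_folds = 5

# number of candidate settings times number of folds = number of model fits
fit_counts = {name: len(ParameterGrid(grid)) * cv_folds for name, grid in toy_grids.items()}

for name, n_fits in fit_counts.items():
    print(f"{name}: {n_fits} fits")
```

The count excludes the one extra fit `GridSearchCV` performs when it refits the best setting on the full data.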

<a id = "30"></a>
## 9.2 Best Parameters & Best Scores

In [None]:
# hyperparameter tuning:
# best parameters and best scores on the models:

best_model_name = None
best_model = None
best_score = float('-inf')
best_params = None

for model_name, regressor, params in model_params:
    regressor_grid = GridSearchCV(regressor, params, cv=5, n_jobs = -1, verbose = True).fit(X, y)
    
    print(f"Model: {model_name}")
    print("Best Parameters:", regressor_grid.best_params_)
    print("Best Score:", regressor_grid.best_score_,end="\n\n")
    print("######################",end="\n\n")
    
    if  regressor_grid.best_score_ > best_score:
        best_model_name = model_name
        best_model = regressor
        best_score = regressor_grid.best_score_
        best_params = regressor_grid.best_params_
        
      
print("######################### Best Model and Hyperparameters #########################")
print(f"Model: {best_model_name}")
print("Best Parameters:", best_params)
print("Best Score:", best_score)
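
The loop above keeps `regressor`, which is the unfitted template handed to `GridSearchCV`; the search itself works on clones. The sketch below uses synthetic data and a tiny `max_depth` grid, both assumptions for illustration only, to show how the tuned and already refit model can be read from the search object.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for X, y
X_toy, y_toy = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=24)

base = DecisionTreeRegressor(random_state=24)
grid = GridSearchCV(base, {"max_depth": [2, 4, None]}, cv=5).fit(X_toy, y_toy)

# best_score_ is the mean cross-validated score of the best setting
# (R2 for regressors when no `scoring` argument is given)
print("best params:", grid.best_params_)
print("best score :", grid.best_score_)

# with the default refit=True, best_estimator_ is refit on all of X_toy, y_toy
tuned = grid.best_estimator_
print("template fitted?", hasattr(base, "tree_"))
print("tuned fitted?   ", hasattr(tuned, "tree_"))
```

Keeping `best_estimator_` for each model would make a second grid search over the winning model unnecessary.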

In [None]:
best_model

In [None]:
# defining a function to find parameters for the best model:

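def find_best_model_params(model_params, best_model):
    best_model_params = None  # stays None if no entry matches
    for model_name, regressor, params in model_params:
        if regressor is best_model:  # the same estimator object that was stored during the search
            best_model_params = params
    return best_model_params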

In [None]:
# applying the function:

best_model_params = find_best_model_params(model_params,best_model)

In [None]:
best_model_params

In [None]:
regressor_best_grid = GridSearchCV(best_model, best_model_params, cv=5, n_jobs = -1, verbose = True).fit(X, y)

In [None]:
regressor_best_grid

<a id = "31"></a>
# 10. Final Model

<a id = "32"></a>
## 10.1 Modeling

In [None]:
# final model: the tuned estimator, which GridSearchCV has already refit on all of X, y
# (refit=True by default); this avoids hard-coding GradientBoostingRegressor:

model_final = regressor_best_grid.best_estimator_

<a id = "33"></a>
## 10.2 Cross Validation

In [None]:
# evaluating 5-fold cross validation results:

scoring = {'mae': 'neg_mean_absolute_error','mse': 'neg_mean_squared_error','r2': 'r2'}
cv_results_final = cross_validate(model_final,X,y,cv=5,scoring=scoring)
print(f"Best Model: {best_model_name}",end="\n")
print("Average MAE: ", -cv_results_final['test_mae'].mean())
print("Average MSE: ", -cv_results_final['test_mse'].mean())
print("Average R2: ", cv_results_final['test_r2'].mean(),end="\n\n")

<a id = "34"></a>
## 10.3 Feature Importance

In [None]:
# creating function to visualize:

def plot_importance(model, features, start = 0 ,num=len(X), save=False):
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    print(feature_imp.sort_values("Value",ascending=False)[start:num])
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value",
                                                                     ascending=False)[start:num])
    plt.title('Features')
    plt.tight_layout()
    if save:
        plt.savefig('importances.png')  # save before show(); saving afterwards can write an empty figure
    plt.show()

In [None]:
# applying the function:

plot_importance(model_final, X,start=0, num=45)

<a id = "35"></a>
## 10.4 Prediction

In [None]:
# Make predictions on the test set:

predictions = model_final.predict(test_df.drop(["Id", "SalePrice"], axis=1))

<a id = "36"></a>
## 10.5 Creating a Submission File

In [None]:
# creating a DataFrame with Id and Prediction columns:

submission = pd.DataFrame({
    'Id': test_df["Id"],
    'SalePrice': predictions
})

In [None]:
# saving to CSV file:

submission.to_csv('submission.csv', index=False)
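
Before uploading, a few basic checks on the submission frame can catch obvious mistakes. The helper `check_submission` and the toy Ids and prices below are hypothetical and not part of the notebook; the checks only cover the column layout, Id order, missing values and non-positive prices.

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems  # the remaining checks need both columns
    if submission["Id"].tolist() != list(expected_ids):
        problems.append("Ids do not match the test set")
    if submission["SalePrice"].isnull().any():
        problems.append("missing predictions")
    if (submission["SalePrice"] <= 0).any():
        problems.append("non-positive prices")
    return problems

# toy values for illustration
toy_ids = [1461, 1462, 1463]
good = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, 155000.0, 180000.0]})
bad = pd.DataFrame({"Id": toy_ids, "SalePrice": [120000.0, np.nan, -5.0]})

print(check_submission(good, toy_ids))
print(check_submission(bad, toy_ids))
```

In the notebook, the equivalent call would pass `submission` and `test_df["Id"]`.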

In [None]:
# display first 5 rows:

submission.head()