# House Price Prediction Project

This notebook is an end-to-end machine learning project developed to predict house prices. It includes Exploratory Data Analysis (EDA), Feature Engineering, Data Preprocessing, and prediction steps using various regression models.

## Table of Contents
1. [Requirements](#1-requirements)
2. [Data Overview](#2-data-overview)
3. [Feature Engineering](#3-feature-engineering)
4. [Modeling](#4-modeling)

In [None]:
# 1. REQUIREMENTS

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
import warnings
warnings.filterwarnings("ignore")

warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)


pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Task 1: Apply EDA to the Dataset

1. General Picture
2. Analysis of Categorical Variables
3. Analysis of Numerical Variables
4. Analysis of Target Variable
5. Analysis of Correlation

### Loading and Merging the Dataset
Train and Test datasets are read. To ensure consistency in Feature Engineering operations, these two datasets are combined and processed on a single dataframe.
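
As a minimal sketch of this pattern, the example below uses two tiny made-up frames (`toy_train` and `toy_test` are hypothetical names, not the real CSV files). After concatenation, the test rows can still be recognized because their `SalePrice` is missing.

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv.
toy_train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
toy_test = pd.DataFrame({"Id": [3], "LotArea": [11250]})  # no SalePrice column

# Stack the rows; SalePrice is filled with NaN for the test rows.
toy_df = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# The two parts can be recovered later from the target's missingness.
train_part = toy_df[toy_df["SalePrice"].notnull()]
test_part = toy_df[toy_df["SalePrice"].isnull()]
print(toy_df.shape, len(train_part), len(test_part))
```

One caveat of working on the merged frame: statistics computed on it (medians, quantile thresholds, class frequencies) also see the test rows.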

In [None]:
# Step 1: Read and merge Train and Test datasets. Proceed with the merged data.

# Combining train and test sets.
train = pd.read_csv("Datasets ( Genel )/train.csv")
test = pd.read_csv("Datasets ( Genel )/test.csv")
df = pd.concat([train, test], axis=0, ignore_index=True)

# Let's examine the first and last rows of the dataset for initial observations.
df.head()

In [None]:
df.tail()

### Data Overview
The dimensions, variable types, missing values, and basic statistics of the dataset are examined using the `check_df` function. This step is critical for understanding the dataset.

In [None]:
# 1. General Picture

def check_df(dataframe):
    """
    Prints the basic summary of the dataframe to the screen.
    """
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(3))
    print("##################### Tail #####################")
    print(dataframe.tail(3))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    # numeric_only=True: quantiles are not defined for object (string) columns.
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)


check_df(df)

### Grab Column Names
Variables in the dataset are categorized as categorical, numerical, and "categorical but cardinal" (categorical variables with high cardinality). This separation determines how each variable will be approached in subsequent analysis steps.
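
The threshold logic can be sketched on a toy frame. The frame and the threshold value `cat_th = 4` are assumptions chosen for illustration, and `is_numeric_dtype` is used here in place of the `dtypes == "O"` comparison from the notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame: one text column, one low-cardinality integer, one continuous number.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Fireplaces": [0, 1, 0, 2],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

cat_th = 4  # assumed threshold for this toy frame (the notebook's default is 10)

# Non-numeric columns are categorical by dtype.
cat_cols = [c for c in toy.columns if not is_numeric_dtype(toy[c])]
# Numeric columns with fewer than cat_th distinct values are treated as categorical too.
num_but_cat = [c for c in toy.columns if is_numeric_dtype(toy[c]) and toy[c].nunique() < cat_th]
# Whatever numeric columns remain are genuinely numerical.
num_cols = [c for c in toy.columns if is_numeric_dtype(toy[c]) and c not in num_but_cat]

print(cat_cols + num_but_cat, num_cols)
```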

In [None]:
# CAPTURING NUMERIC AND CATEGORICAL VARIABLES

def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Separates categorical and numerical variables in the dataframe with logical thresholds.
    """
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]

    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]

    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]

    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')

    return cat_cols, cat_but_car, num_cols

cat_cols, cat_but_car, num_cols = grab_col_names(df)

### Categorical Variable Analysis
The classes of each categorical variable and their share of the dataset are examined with the `cat_summary` function, which can also draw a count plot when `plot=True`.

In [None]:
# 2. Analysis of Categorical Variables

def cat_summary(dataframe, col_name, plot=False):
    """
    Reports class counts and ratios for a categorical variable.
    """
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))

    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show(block=True)


for col in cat_cols:
    cat_summary(df, col)

### Numerical Variable Analysis
Distributions of numerical variables are examined using histograms and basic statistics (mean, standard deviation, quartiles). The `num_summary` function performs this analysis.

In [None]:
# 3. Analysis of Numerical Variables

def num_summary(dataframe, numerical_col, plot=False):
    """
    Provides summary statistics and optionally a histogram for a numerical variable.
    """
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=50)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show(block=True)

    print("#####################################")


for col in num_cols:
    num_summary(df, col, True)

### Target Analysis
The effect of categorical variables on the target variable (SalePrice) is examined. The average house price is calculated for each class of a categorical variable to see which classes are associated with higher or lower prices.

In [None]:
# 4. Analysis of Target Variable

def target_summary_with_cat(dataframe, target, categorical_col):
    """
    Reports the mean of the target variable according to the classes of a categorical variable.
    """
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean()}), end="\n\n\n")


for col in cat_cols:
    target_summary_with_cat(df,"SalePrice",col)

### Target Variable Distribution
The distribution of the target variable (SalePrice) is examined. Because the distribution is right-skewed, the effect of a logarithmic transformation (`np.log1p`) is also observed. A log transformation brings the distribution closer to normal, which can improve the performance of some models.
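
The effect on skewness can be checked numerically as well as visually. The sketch below uses synthetic log-normal "prices" (an assumption, not the competition data) and also shows that `np.expm1` inverts `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal); not the competition data.
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)))

skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# expm1 inverts log1p, so values can be mapped back to the price scale.
roundtrip = np.expm1(np.log1p(prices))
```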

In [None]:
# TRANSFORMATION
# Examination of the dependent variable
df["SalePrice"].hist(bins=100)
plt.show(block=True)

# Examination of the logarithm of the dependent variable
np.log1p(df['SalePrice']).hist(bins=50)
plt.show(block=True)

### Correlation Analysis
A correlation matrix is created to examine the relationship between numerical variables and visualized using a heatmap.

In [None]:
# 5. Analysis of Correlation

corr = df[num_cols].corr()
# Displaying correlations
sns.set(rc={'figure.figsize': (12, 12)})
sns.heatmap(corr, cmap="RdBu")
plt.show()

### Detection of High Correlation Variables
Variables with very high correlation with each other (multicollinearity) can cause noise and overfitting in the model. These variables are identified.
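
A small sketch of the upper-triangle technique on made-up columns (the frame `toy` is hypothetical): because only entries above the diagonal are kept, each pair is inspected once and only the later column of a correlated pair is flagged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns: b is an exact multiple of a, c is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr_abs = toy.corr().abs()
# Keep only entries above the diagonal so every pair is inspected once.
mask = np.triu(np.ones(corr_abs.shape), k=1).astype(bool)
upper = corr_abs.where(mask)

# A column is flagged when it correlates strongly with any earlier column.
flagged = [col for col in upper.columns if (upper[col] > 0.70).any()]
print(flagged)
```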

In [None]:
def high_correlated_cols(dataframe, plot=False, corr_th=0.70):
    """
    Detects columns with high correlation and optionally plots a heatmap.
    """
    # numeric_only=True skips object columns, which cannot be correlated.
    corr = dataframe.corr(numeric_only=True)
    cor_matrix = corr.abs()
    # Keep only the upper triangle; plain bool is used instead of np.bool.
    upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
    drop_list = [col for col in upper_triangle_matrix.columns if any(upper_triangle_matrix[col] > corr_th)]
    if plot:
        sns.set(rc={'figure.figsize': (15, 15)})
        sns.heatmap(corr, cmap="RdBu")
        plt.show()
    return drop_list

high_correlated_cols(df, plot=False)
high_correlated_cols(df, plot=True)

## Task 2: Feature Engineering

### Outlier Analysis
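Outliers in the dataset can hurt the model's ability to generalize. Lower and upper thresholds are computed with an IQR-style rule: 1.5 times the distance between a lower and an upper quantile is subtracted from the lower quantile and added to the upper one. Note that the function below defaults to the 10th and 90th percentiles rather than the classic 25th and 75th, so only very extreme values are flagged as outliers.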
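
To make the threshold rule concrete, here is a small sketch on made-up numbers. The helper `thresholds` is a hypothetical stand-in that mirrors the logic of `outlier_thresholds` in the next cell; it compares the 10%/90% quantiles used there with the textbook 25%/75% quartiles.

```python
import pandas as pd

def thresholds(series, low_q, up_q):
    """Return (low_limit, up_limit) as quantile -/+ 1.5 * inter-quantile range."""
    q_low, q_up = series.quantile(low_q), series.quantile(up_q)
    spread = q_up - q_low
    return q_low - 1.5 * spread, q_up + 1.5 * spread

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20])  # toy data

wide = thresholds(values, 0.10, 0.90)     # the notebook's defaults
classic = thresholds(values, 0.25, 0.75)  # textbook IQR fences
print(wide, classic)
```

For this toy series the value 20 lies inside the wide 10%/90% fences but outside the quartile-based fences, which shows how much more permissive the notebook's defaults are.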

In [None]:
# Outlier Analysis

# Suppression of outliers
def outlier_thresholds(dataframe, variable, low_quantile=0.10, up_quantile=0.90):
    """
    Calculates lower and upper threshold values (outlier limits) for a variable.
    """
    quantile_one = dataframe[variable].quantile(low_quantile)
    quantile_three = dataframe[variable].quantile(up_quantile)
    interquantile_range = quantile_three - quantile_one
    up_limit = quantile_three + 1.5 * interquantile_range
    low_limit = quantile_one - 1.5 * interquantile_range
    return low_limit, up_limit

# Outlier check
def check_outlier(dataframe, col_name):
    """
    Checks if there are outliers (outside calculated thresholds) in the specified column.
    """
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False


for col in num_cols:
    if col != "SalePrice":
        print(col, check_outlier(df, col))

### Outlier Suppression (Winsorization)
Detected outliers are suppressed to the calculated lower and upper threshold values. This reduces the impact of outliers without data loss.
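
The same capping can be sketched with pandas' built-in `Series.clip`. The limits below are assumed values for illustration; in the notebook they would come from `outlier_thresholds`.

```python
import pandas as pd

values = pd.Series([-50.0, 3.0, 7.0, 12.0, 400.0])  # toy data
low_limit, up_limit = 0.0, 20.0  # assumed limits for illustration

# clip() replaces values below/above the limits with the limits themselves.
capped = values.clip(lower=low_limit, upper=up_limit)
print(capped.tolist())
```

The number of rows is unchanged; only the two extreme values are moved to the limits.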

In [None]:
# Suppression of outliers
def replace_with_thresholds(dataframe, variable):
    """
    Clips outliers to the calculated lower/upper threshold values for the relevant variable.
    """
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit


for col in num_cols:
    if col != "SalePrice":
        replace_with_thresholds(df,col)

for col in num_cols:
    if col != "SalePrice":
        print(col, check_outlier(df, col))

### Missing Value Analysis
Missing values in the dataset are examined. For each variable that contains missing values, the number of missing entries and their percentage of all rows are reported.

In [None]:
# Missing Value Analysis

def missing_values_table(dataframe, na_name=False):
    """
    Reports columns containing missing values with their counts and ratios.
    """
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]

    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)

    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)

    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])

    print(missing_df, end="\n")

    if na_name:
        return na_columns

missing_values_table(df)

### Missing Value Imputation (Special Cases)
Missing values (NA) in some variables actually indicate that the house does not have that feature (e.g., if there is no pool, PoolQC might be NA). Such missing values are filled with a meaningful label like "No".
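
A minimal sketch of this idea on a made-up frame (the values are hypothetical): the missing `PoolQC` entries belong to houses without a pool, so they become their own category instead of being imputed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"PoolQC": ["Gd", np.nan, np.nan], "PoolArea": [512, 0, 0]})

# Here NaN means "no pool", so it becomes its own category.
# Assigning the result back avoids in-place modification of a column selection.
toy["PoolQC"] = toy["PoolQC"].fillna("No")
print(toy["PoolQC"].value_counts())
```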

In [None]:
df["Alley"].value_counts()

# Missing values in some variables indicate that the house does not have that feature
no_cols = ["Alley","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","FireplaceQu",
           "GarageType","GarageFinish","GarageQual","GarageCond","PoolQC","Fence","MiscFeature"]

# Filling missing values in columns with "No"
for col in no_cols:
    # Assign the result back; in-place fillna on a column selection is deprecated in recent pandas.
    df[col] = df[col].fillna("No")

missing_values_table(df)

### Missing Value Imputation with Median/Mode
For the remaining missing values, categorical variables with at most `cat_length` classes are filled with the mode (most frequent value), and numerical variables are filled with the median. The target (SalePrice) is left untouched. This process is automated with the `quick_missing_imp` function.
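
The two imputation rules can be sketched on a toy frame. The values are made up, and `is_numeric_dtype` is an assumption used here to tell numeric and categorical columns apart.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "MSZoning": ["RL", "RL", np.nan, "RM"],
})

for col in toy.columns:
    if is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].median())   # median of 60, 80, 100
    else:
        toy[col] = toy[col].fillna(toy[col].mode()[0])  # most frequent class
print(toy)
```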

In [None]:
# This function fills missing values: mode for categoricals, median or mean for numerics
def quick_missing_imp(data, num_method="median", cat_length=20, target="SalePrice"):
    """
    Fills missing values quickly.
    """
    variables_with_na = [col for col in data.columns if data[col].isnull().sum() > 0]  # Variables with missing values are listed

    temp_target = data[target]

    print("# BEFORE")
    print(data[variables_with_na].isnull().sum(), "\n\n")  # Number of missing values of variables before application

    # If variable is object and class count is less than or equal to cat_length, fill missing values with mode
    data = data.apply(lambda x: x.fillna(x.mode()[0]) if (x.dtype == "O" and len(x.unique()) <= cat_length) else x, axis=0)

    # If num_method is mean, missing values of non-object variables are filled with mean
    if num_method == "mean":
        data = data.apply(lambda x: x.fillna(x.mean()) if x.dtype != "O" else x, axis=0)
    # If num_method is median, missing values of non-object variables are filled with median
    elif num_method == "median":
        data = data.apply(lambda x: x.fillna(x.median()) if x.dtype != "O" else x, axis=0)

    data[target] = temp_target

    print("# AFTER \n Imputation method is 'MODE' for categorical variables!")
    print(" Imputation method is '" + num_method.upper() + "' for numeric variables! \n")
    print(data[variables_with_na].isnull().sum(), "\n\n")

    return data


df = quick_missing_imp(df, num_method="median", cat_length=17)

### Rare Analysis
Rare classes (those with very low frequency) in categorical variables are analyzed. Such classes often add noise or carry too few observations for the model to learn from.

In [None]:
# Perform rare analysis and apply rare encoder.

# Examination of the distribution of categorical columns
def rare_analyser(dataframe, target, cat_cols):
    """
    Prints the summary of categorical columns with class count, distribution ratio, and target mean.
    """
    for col in cat_cols:
        print(col, ":", len(dataframe[col].value_counts()))
        print(pd.DataFrame({"COUNT": dataframe[col].value_counts(),
                            "RATIO": dataframe[col].value_counts() / len(dataframe),
                            "TARGET_MEAN": dataframe.groupby(col)[target].mean()}), end="\n\n\n")

rare_analyser(df, "SalePrice", cat_cols)

### Rare Encoding
Rare classes falling below the determined threshold value are combined under the "Rare" label. This process reduces the cardinality of categorical variables and ensures the model works more stably.
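
A minimal sketch of the grouping step on a made-up column. The threshold `rare_perc = 0.10` is an assumption chosen to suit the toy data (the notebook uses 0.01).

```python
import numpy as np
import pandas as pd

# Toy column: "Stone" appears in 1 of 20 rows (5%).
roof = pd.Series(["Shingle"] * 12 + ["Metal"] * 7 + ["Stone"], name="RoofMatl")

rare_perc = 0.10  # assumed threshold for the toy data
freq = roof.value_counts() / len(roof)
rare_labels = freq[freq < rare_perc].index

# Classes below the threshold are replaced by the single label "Rare".
encoded = pd.Series(np.where(roof.isin(rare_labels), "Rare", roof), name=roof.name)
print(encoded.value_counts())
```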

In [None]:
# Detection of rare classes
def rare_encoder(dataframe, rare_perc):
    """
    Combines rare classes (frequency below `rare_perc`) under the 'Rare' label.
    """
    temp_df = dataframe.copy()

    rare_columns = [col for col in temp_df.columns if temp_df[col].dtypes == 'O'
                    and (temp_df[col].value_counts() / len(temp_df) < rare_perc).any(axis=None)]

    for var in rare_columns:
        tmp = temp_df[var].value_counts() / len(temp_df)
        rare_labels = tmp[tmp < rare_perc].index
        temp_df[var] = np.where(temp_df[var].isin(rare_labels), 'Rare', temp_df[var])

    return temp_df


df = rare_encoder(df, 0.01)
rare_analyser(df, "SalePrice", cat_cols)

### Feature Extraction
New and meaningful variables are derived using existing variables. This step helps the model capture hidden patterns in the dataset (e.g., total house area, age of the house, etc.).

In [None]:
# Create new variables and add 'NEW' to the beginning of the new variables you create.

df["NEW_1st*GrLiv"] = df["1stFlrSF"] * df["GrLivArea"]

df["NEW_Garage*GrLiv"] = (df["GarageArea"] * df["GrLivArea"])

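# Sum of quality scores. numeric_only=True keeps only the numeric columns in the list
# (OverallQual, OverallCond); the string-graded columns (e.g. ExterQual) are skipped.
df["NEW_TotalQual"] = df[["OverallQual", "OverallCond", "ExterQual", "ExterCond", "BsmtCond", "BsmtFinType1",
                          "BsmtFinType2", "HeatingQC", "KitchenQual", "Functional", "FireplaceQu", "GarageQual", "GarageCond", "Fence"]].sum(axis=1, numeric_only=True)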


# Total Floor
df["NEW_TotalFlrSF"] = df["1stFlrSF"] + df["2ndFlrSF"]

# Total Finished Basement Area
df["NEW_TotalBsmtFin"] = df.BsmtFinSF1 + df.BsmtFinSF2

# Porch Area
df["NEW_PorchArea"] = df.OpenPorchSF + df.EnclosedPorch + df.ScreenPorch + df["3SsnPorch"] + df.WoodDeckSF

# Total House Area
df["NEW_TotalHouseArea"] = df.NEW_TotalFlrSF + df.TotalBsmtSF

df["NEW_TotalSqFeet"] = df.GrLivArea + df.TotalBsmtSF


# Lot Ratio
df["NEW_LotRatio"] = df.GrLivArea / df.LotArea

df["NEW_RatioArea"] = df.NEW_TotalHouseArea / df.LotArea

df["NEW_GarageLotRatio"] = df.GarageArea / df.LotArea

# MasVnrArea
df["NEW_MasVnrRatio"] = df.MasVnrArea / df.NEW_TotalHouseArea

# Dif Area
df["NEW_DifArea"] = (df.LotArea - df["1stFlrSF"] - df.GarageArea - df.NEW_PorchArea - df.WoodDeckSF)


df["NEW_OverallGrade"] = df["OverallQual"] * df["OverallCond"]


df["NEW_Restoration"] = df.YearRemodAdd - df.YearBuilt

df["NEW_HouseAge"] = df.YrSold - df.YearBuilt

df["NEW_RestorationAge"] = df.YrSold - df.YearRemodAdd

df["NEW_GarageAge"] = df.GarageYrBlt - df.YearBuilt

df["NEW_GarageRestorationAge"] = np.abs(df.GarageYrBlt - df.YearRemodAdd)

df["NEW_GarageSold"] = df.YrSold - df.GarageYrBlt


drop_list = ["Street", "Alley", "LandContour", "Utilities", "LandSlope","Heating", "PoolQC", "MiscFeature","Neighborhood"]

# Dropping variables in drop_list
df.drop(drop_list, axis=1, inplace=True)

### Encoding (Label & One-Hot)
Categorical variables are converted into a numerical format that machine learning models can understand. Label Encoding is applied for binary variables, and One-Hot Encoding is applied for multi-class variables.
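
Both encodings can be sketched on a toy frame. The frame is made up, and pandas category codes stand in for scikit-learn's `LabelEncoder` here; both number the classes in sorted order.

```python
import pandas as pd

toy = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y"],              # binary
    "Foundation": ["PConc", "CBlock", "Slab"],  # multi-class
})

# Binary column: map the two classes to 0/1 in sorted order (N -> 0, Y -> 1).
toy["CentralAir"] = toy["CentralAir"].astype("category").cat.codes

# Multi-class column: one indicator per class, dropping the first class
# because it is implied when all other indicators are zero.
encoded = pd.get_dummies(toy, columns=["Foundation"], drop_first=True)
print(encoded.columns.tolist())
```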

In [None]:
# Apply Label Encoding & One-Hot Encoding operations.

cat_cols, cat_but_car, num_cols = grab_col_names(df)

def label_encoder(dataframe, binary_col):
    """
    Converts a binary categorical column to 0/1 values.
    """
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

binary_cols = [col for col in df.columns if df[col].dtypes == "O" and len(df[col].unique()) == 2]

for col in binary_cols:
    label_encoder(df, col)


def one_hot_encoder(dataframe, categorical_cols, drop_first=True):
    """
    Applies One-Hot Encoding for specified categorical columns.
    """
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, cat_cols, drop_first=True)

df.shape

## Task 3: Modeling

### Establishing Base Models
Different regression algorithms (Linear Regression, Ridge, Lasso, GBM, XGBoost, LightGBM, CatBoost, etc.) are trained with default parameters and RMSE (Root Mean Squared Error) scores are compared. This step is done to determine which model is more suitable for the dataset.
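
The RMSE computation used for the comparison can be sketched on synthetic data (the features, coefficients, and noise level below are assumptions). scikit-learn reports mean squared error as a negated score, so the sign is flipped before taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))  # synthetic features
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# scikit-learn maximizes scores, so MSE is returned negated; flip the sign first.
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                          scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)
print(f"RMSE: {rmse_per_fold.mean():.4f}")
```

Because the noise in this sketch has a standard deviation of 0.5, a well-specified linear model should land near that value.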

In [None]:
# TASK 3: Model Building

# Separate Train and Test data. (Values with empty SalePrice variable are test data.)
train_df = df[df['SalePrice'].notnull()]
test_df = df[df['SalePrice'].isnull()]

y = train_df['SalePrice']  # np.log1p(df['SalePrice'])
X = train_df.drop(["Id", "SalePrice"], axis=1)

# Build a model with Train data and evaluate model success.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=17)


models = [('LR', LinearRegression()),
          ("Ridge", Ridge(random_state=12345)),
          ("Lasso", Lasso(random_state=12345)),
          ("ElasticNet", ElasticNet(random_state=12345)),
          ('KNN', KNeighborsRegressor()),
          ('CART', DecisionTreeRegressor(random_state=12345)),
          ('RF', RandomForestRegressor(random_state=12345)),
          ('SVR', SVR()),
          ('GBM', GradientBoostingRegressor(random_state=12345)),
          ("XGBoost", XGBRegressor(objective='reg:squarederror', random_state=12345)),
          ("LightGBM", LGBMRegressor(random_state=12345)),
          ("CatBoost", CatBoostRegressor(verbose=False, random_state=12345))]

for name, regressor in models:
    rmse = np.mean(np.sqrt(-cross_val_score(regressor, X, y, cv=5, scoring="neg_mean_squared_error")))
    print(f"RMSE: {round(rmse, 4)} ({name}) ")

In [None]:
df['SalePrice'].mean()
df['SalePrice'].std()
df["SalePrice"].hist(bins=100)
plt.show(block=True)

### Modeling with Log Transformation
The model is rebuilt by taking the logarithm of the target variable (SalePrice). Log transformation can improve the prediction success of the model by correcting the distribution of the target variable. The results are evaluated by converting them back to the original scale (inverse log).
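
The difference between an error measured on the log scale and one measured after the inverse transform can be sketched with three made-up prices (all numbers below are hypothetical):

```python
import numpy as np

y_true = np.array([100000.0, 200000.0, 300000.0])  # toy prices
# Pretend a model trained on log1p(y) returned these log-scale predictions.
y_pred_log = np.log1p(np.array([110000.0, 190000.0, 330000.0]))

# RMSE in log units versus RMSE after mapping back to currency units.
rmse_log = np.sqrt(np.mean((np.log1p(y_true) - y_pred_log) ** 2))
rmse_price = np.sqrt(np.mean((y_true - np.expm1(y_pred_log)) ** 2))
print(rmse_log, rmse_price)
```

Only the second number is comparable with the RMSE values of models trained on the untransformed target.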

In [None]:
# BONUS : Build a model by performing Log transformation and observe RMSE results.
# Note: Do not forget to take the inverse of the Log.

# Performing Log transformation


train_df = df[df['SalePrice'].notnull()]
test_df = df[df['SalePrice'].isnull()]

# plt.hist(np.log1p(train_df['SalePrice']), bins=100)
y = np.log1p(train_df['SalePrice'])
X = train_df.drop(["Id", "SalePrice"], axis=1)

# Splitting data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=17)


lgbm = LGBMRegressor().fit(X_train, y_train)
y_pred = lgbm.predict(X_test)

y_pred
# Taking the inverse of the LOG transformation performed
new_y = np.expm1(y_pred)
new_y
new_y_test = np.expm1(y_test)
new_y_test

np.sqrt(mean_squared_error(new_y_test, new_y))

# RMSE : 22866.43915128612

### Hyperparameter Optimization
Hyperparameter optimization is performed for the selected best model (e.g., LightGBM). Using the GridSearchCV method, different parameter combinations are tried to find the parameters that maximize the model's performance.
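
To get a feel for the cost of a grid search, the candidate combinations can be enumerated by hand. This sketch only counts them; it does not train anything. The grid is the one used in the following cell.

```python
from itertools import product

lgbm_params = {"learning_rate": [0.01, 0.1],
               "n_estimators": [500, 1500],
               "colsample_bytree": [0.5, 0.7, 1]}

# Every combination of one value per parameter is a candidate.
candidates = [dict(zip(lgbm_params, combo)) for combo in product(*lgbm_params.values())]
n_fits = len(candidates) * 5  # with 5-fold CV each candidate is fitted five times

print(len(candidates), n_fits)
```

With its default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full data.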

In [None]:
# perform hyperparameter optimizations.

lgbm_model = LGBMRegressor(random_state=46)

y = np.expm1(y)  # undo logarithmic transformation
rmse = np.mean(np.sqrt(-cross_val_score(lgbm_model, X, y, cv=5, scoring="neg_mean_squared_error")))


lgbm_params = {"learning_rate": [0.01, 0.1],
               "n_estimators": [500, 1500],
               "colsample_bytree": [0.5, 0.7, 1]
             }

lgbm_gs_best = GridSearchCV(lgbm_model,
                            lgbm_params,
                            cv=5,
                            n_jobs=-1,
                            verbose=0).fit(X, y)  # verbose must be >= 0 in recent scikit-learn


lgbm_gs_best.best_params_
final_model = lgbm_model.set_params(**lgbm_gs_best.best_params_).fit(X, y)

print(f"Initial RMSE: {rmse}")
rmse_new = np.mean(np.sqrt(-cross_val_score(final_model, X, y, cv=5, scoring="neg_mean_squared_error")))
print(f"New RMSE: {rmse_new}")

### Feature Importance
The variables that the model gives more importance to when making predictions are visualized. This analysis is important for interpreting the model's decisions and detecting unnecessary variables.
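
The ranking behind the plot can be sketched without a trained model. The importance values and feature names below are made up, standing in for a fitted model's `feature_importances_` and the columns of `X`.

```python
import pandas as pd

# Hypothetical importances and feature names.
importances = [12, 40, 3, 25]
feature_names = ["LotArea", "GrLivArea", "PoolArea", "OverallQual"]

feature_imp = pd.DataFrame({"Value": importances, "Feature": feature_names})
# Sort descending and keep the top entries, as the plot does with its num argument.
top2 = feature_imp.sort_values(by="Value", ascending=False).head(2)
print(top2)
```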

In [None]:
# Plot the ranking of features using the feature_importance function indicating the importance level of variables.

# feature importance
def plot_importance(model, features, num=None, save=False):
    """
    Visualizes feature importance levels in the model with a bar chart.
    """
    # Default to plotting every feature.
    if num is None:
        num = features.shape[1]

    feature_imp = pd.DataFrame({"Value": model.feature_importances_, "Feature": features.columns})
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title("Features")
    plt.tight_layout()
    # Save before plt.show(); afterwards the current figure is empty.
    if save:
        plt.savefig("importances.png")
    plt.show()


plot_importance(final_model, X)
plot_importance(final_model, X, num=30)

### Prediction and Submission
Predictions are generated for the test dataset using the optimized model. These predictions are saved in a CSV file in a format suitable for uploading to the Kaggle competition (Id, SalePrice).
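
A minimal sketch of the submission format with made-up ids and predictions (`toy_test` and `toy_pred` are hypothetical). It takes the ids from the `Id` column, since the row index of a concatenated frame does not have to match `Id`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the test rows and the model output.
toy_test = pd.DataFrame({"Id": [1461, 1462, 1463]}, index=[1460, 1461, 1462])
toy_pred = np.array([125000.0, 158000.5, 181250.0])

# Use the Id column (not the row index) so the ids match the competition file.
submission = pd.DataFrame({"Id": toy_test["Id"].to_numpy(), "SalePrice": toy_pred})
csv_text = submission.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```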

In [None]:
# Predict the empty SalePrice variables in the test dataframe and
# Create a dataframe suitable for submission to the Kaggle page. (Id, SalePrice)

test_df["SalePrice"]
predictions = final_model.predict(test_df.drop(["Id","SalePrice"], axis=1))
# predictions = np.expm1(predictions)  # to take inverse if it was log
# Use the original Id column; the row index of the merged frame is offset from Id.
dictionary = {"Id": test_df["Id"].astype(int), "SalePrice": predictions}
dfSubmission = pd.DataFrame(dictionary)
dfSubmission.to_csv("housePricePredictions.csv", index=False)