<h1> <center>📈 Prices Prediction: Using XGBoost Regression Tutorial 📈 </center> </h1>

***

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; About the house prices dataset: With 79 explanatory variables that describe (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. The main challenge here is to use all these variables to predict the price of a property with the smallest possible error.
</div>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;background-color:#DAF7A6">
    📌 &nbsp; We begin with an exploratory analysis of the dataset using descriptive statistics such as variance, mean, median, and mode.
</div>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;background-color:#FFA07A">
    📌 &nbsp; The machine learning model used here is XGBoost regression. XGBoost is an open-source library that provides an efficient implementation of the gradient boosting algorithm. Feel free to try other regression models as well.
</div>

<br><br>
<center><img src="https://ak.picdn.net/shutterstock/videos/6670145/thumb/1.jpg"></center>

<h1> <center> 📚 Importing Python libraries 📚 </center> </h1>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; For data visualization: matplotlib, pandas, seaborn, and plotly.
</div>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;background-color:#DAF7A6">
    📌 &nbsp; For data modeling, evaluation and cleaning: xgboost, numpy, sklearn, scipy.
</div>

In [None]:
import random 
import xgboost

import numpy as np 
import pandas as pd 
import plotly.express as px
import seaborn as sns

import matplotlib.pyplot as plt
import plotly.graph_objects as go

from scipy import stats


from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import matplotlib.ticker as ticker
from matplotlib.ticker import FixedFormatter, FixedLocator

pd.set_option('display.max_rows', 100)

In [None]:
path_file_csv = "../input/house-prices-advanced-regression-techniques/train.csv"

prices_train = pd.read_csv(path_file_csv)
cm = sns.light_palette("green", as_cmap=True)
prices_train.head(30).style.background_gradient(cmap=cm)

<h1> <center> 📊 Columns and data types 📊 </center> </h1>


<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #FA8072">
    📌 &nbsp; Here we look at how the data is formatted, how the values are distributed within the dataset, and what kind of data we will need to work with.
</div>

In [None]:
pd.DataFrame(prices_train.columns, columns=["name"])

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;background-color:#D0ECE7 ">
    📌 &nbsp; Here we look at the data types and decide which of them we want to use to train our model.
</div>

In [None]:
pd.DataFrame(prices_train.dtypes, columns=["type"])

In [None]:
prices_train.loc[:, prices_train.columns!='Id'].describe().style.background_gradient(cmap=cm)

In [None]:
prices_train.loc[:, prices_train.columns=='SalePrice'].describe().style.background_gradient(cmap=cm)

<h1> <center> 📊 Histogram analysis 📊 </center> </h1>

In [None]:
def get_random_color():
    r1 = lambda: random.randint(0,255)
    return '#%02X%02X%02X' % (r1(),r1(),r1())


def get_histplot_central_tendency(df: pd.DataFrame, fields: list):
    """Plot a histogram with KDE for each field and mark its mean, median, and mode."""
    for field in fields:
        f, ax1 = plt.subplots(1, 1, figsize=(15, 5))
        sns.histplot(df[field].values, ax=ax1, color=get_random_color(), kde=True)

        mean = df[field].mean()
        median = df[field].median()
        mode = df[field].mode().values[0]  # first mode if several values tie

        ax1.axvline(mean, color='r', linestyle='--', label="Mean")
        ax1.axvline(median, color='g', linestyle='-', label="Median")
        ax1.axvline(mode, color='b', linestyle='-', label="Mode")
        ax1.legend()

        plt.title(f"{field} - Histogram analysis")


def get_scatter(df: pd.DataFrame, fields: list):
    """Scatter-plot each field against SalePrice."""
    ylim = (0, 700000)
    for field in fields:
        df_copy = pd.concat([df['SalePrice'], df[field]], axis=1)
        df_copy.plot.scatter(x=field, y='SalePrice', ylim=ylim, color=get_random_color())
        plt.title(f"{field} - Relationship with SalePrice")

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; We use measures of central tendency (mean, median, and mode) to summarize the numerical values in the sample. The mean is the arithmetic average of the values and is sensitive to outliers. The median is the value that separates the upper half from the lower half of a sample, population, or probability distribution. The mode is the most frequent value in a dataset.
</div>
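
<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; A minimal sketch of the three measures on a toy sample. The values are hypothetical and not taken from the dataset; they only illustrate how a single outlier pulls the mean away from the median and the mode.
</div>

```python
import pandas as pd

# Toy sample with one large outlier (hypothetical values, not from the dataset)
sample = pd.Series([100, 120, 120, 130, 150, 900])

mean = sample.mean()            # arithmetic average, pulled up by the outlier
median = sample.median()        # middle value, robust to the outlier
mode = sample.mode().iloc[0]    # most frequent value (first one if several tie)

print(f"mean={mean:.1f}, median={median}, mode={mode}")
```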

In [None]:
fields = ["LotArea", "TotalBsmtSF", "GrLivArea", "GarageArea", "SalePrice"]
get_histplot_central_tendency(prices_train, fields)

In [None]:
get_scatter(prices_train, fields[0:4])

In [None]:
def get_headmap_price(df: pd.DataFrame):
    # Restrict the correlation to numeric columns; text columns cannot be correlated
    corr = df.corr(numeric_only=True)
    plt.figure(figsize=(35, 35))
    sns.heatmap(corr, annot=True, cmap="YlGnBu", linewidths=0.1, annot_kws={"fontsize":10})
    plt.title("Correlation heatmap - house prices")

<h1> <center> 📊 Correlation heatmap 📊 </center> </h1>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #D2B4DE">
    📌 &nbsp; Correlation is any statistical relationship, whether causal or not, between two random variables or bivariate data. More about correlation here: https://en.wikipedia.org/wiki/Correlation.
</div>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #A2D9CE">
    📌 &nbsp; Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values (that is, the variables tend to show similar behavior), the covariance is positive. More about covariance here: https://en.wikipedia.org/wiki/Covariance
</div>
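
<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; To make the link between the two definitions concrete, here is a small sketch that computes the sample covariance by hand and rescales it into the Pearson correlation. The arrays are hypothetical toy data, not values from the dataset.
</div>

```python
import numpy as np

# Hypothetical toy data: living area (sq ft) and price (thousands of dollars)
area = np.array([800.0, 1000.0, 1200.0, 1500.0, 2000.0])
price = np.array([110.0, 135.0, 160.0, 190.0, 260.0])

# Sample covariance: sum of products of deviations from the mean, divided by n - 1
cov_xy = np.sum((area - area.mean()) * (price - price.mean())) / (len(area) - 1)

# Pearson correlation: covariance rescaled by both sample standard deviations
corr_xy = cov_xy / (area.std(ddof=1) * price.std(ddof=1))

# Cross-check against NumPy's built-in estimators
assert np.isclose(cov_xy, np.cov(area, price)[0, 1])
assert np.isclose(corr_xy, np.corrcoef(area, price)[0, 1])

print(f"covariance={cov_xy:.1f}, correlation={corr_xy:.3f}")
```

The covariance depends on the units of both variables, while the correlation always lies between -1 and 1, which is why the heatmap uses correlation.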

In [None]:
get_headmap_price(prices_train)

In [None]:
pd.DataFrame(prices_train.isnull().sum().sort_values(ascending=False), columns=["count"]).style.background_gradient(cmap=cm)

<h1> <center> 📊 Boxplots: SalePrice x Year 📊 </center> </h1>

In [None]:
def get_boxplot_price(df: pd.DataFrame, fields: list):
    for field in fields:
        data_copy = pd.concat([df['SalePrice'], df[field]], axis=1)
        f, ax = plt.subplots(figsize=(26, 6))
        fig = sns.boxplot(x=field, y="SalePrice", data=data_copy, palette="Set3")
        plt.xticks(rotation=90)
        plt.title(f"Boxplot - {field} x SalePrice")
        plt.show()

In [None]:
get_boxplot_price(prices_train, ["YearRemodAdd", "YearBuilt"])

In [None]:
def get_bar_compare(df: pd.DataFrame, fields: list):
    for field in fields:
        plt.figure(figsize=(15, 6))
        sns.barplot(x = field, y = 'SalePrice', data = df, palette="Set3")
        plt.xlabel(field)
        plt.ylabel('Sale Price')
        plt.show()

In [None]:
get_bar_compare(prices_train, ["MoSold", "MSSubClass", "Street", "MSZoning"])

<h1> <center>📈 Data cleaning and features analysis 📈 </center> </h1>

In [None]:
def cleaning_data_none(prices_train: pd.DataFrame, fields: list):
    # Replace missing values with the string 'None'
    for field in fields:
        prices_train[field] = prices_train[field].fillna('None')

def cleaning_data_int(prices_train: pd.DataFrame, fields: list):
    # Replace missing values with 0
    for field in fields:
        prices_train[field] = prices_train[field].fillna(0)

def cleaning_data_median(prices_train: pd.DataFrame, fields: list):
    # Replace missing values with the column median
    for field in fields:
        prices_train[field] = prices_train[field].fillna(prices_train[field].median())

In [None]:
fields_clean_none = ['PoolQC',
                     'Alley',
                     'FireplaceQu',
                     'MasVnrType',
                     'Electrical',
                     'BsmtFinType2',
                     'BsmtFinType1',
                     'BsmtExposure',
                     'BsmtQual',
                     'BsmtCond',
                     'Fence',
                     'MiscFeature',
                     'GarageCond',
                     'GarageQual',
                     'GarageFinish',
                     'GarageType',
                     'SaleType',
                     'Utilities',
                     'Exterior1st',
                     'Exterior2nd',
                     'KitchenQual',
                     'Functional',
                     'MSZoning']  # categorical column, so it is filled with 'None' rather than 0

fields_clean_int = ['GarageYrBlt', 'BsmtFinSF1', 'BsmtFullBath', 'BsmtHalfBath']

fields_clean_median = ['LotFrontage',
                       'MasVnrArea',
                       'BsmtUnfSF',
                       'TotalBsmtSF',
                       'GarageCars',
                       'GarageArea']


cleaning_data_none(prices_train, fields_clean_none)
cleaning_data_int(prices_train, fields_clean_int)
cleaning_data_median(prices_train, fields_clean_median)

In [None]:
features = prices_train.columns
features = list(features[1:len(features)-1])

In [None]:
len(features)

<h1> <center> 📈 Data encoding and features analysis 📈 </center> </h1>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #D2B4DE">
    <p>📌 &nbsp; The categorical data type is useful in the following cases. (1) A string variable consists of only a few different values: converting it to a categorical variable saves memory. (2) The lexical order of a variable is not the same as its logical order (“one”, “two”, “three”): by converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order. (3) As a signal to other Python libraries that the column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types). </p>
         More about it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
</div>
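
<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; A short sketch of lexical versus logical order. The labels and their order (Po, Fa, TA, Gd, Ex) are assumed here for illustration only; this notebook itself uses unordered categories.
</div>

```python
import pandas as pd

# Hypothetical quality labels; the order Po < Fa < TA < Gd < Ex is assumed for illustration
labels = pd.Series(["Gd", "TA", "Ex", "Po", "TA", "Fa"])

# As plain strings, sorting and min/max use lexical order
lexical_max = labels.max()

# As an ordered categorical, they use the declared logical order
quality = pd.Categorical(labels, categories=["Po", "Fa", "TA", "Gd", "Ex"], ordered=True)
ordered = pd.Series(quality)
logical_max = ordered.max()

# Integer codes follow the position of each label in `categories`
codes = ordered.cat.codes.tolist()

print(lexical_max, logical_max, codes)
```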

In [None]:
df_types = pd.DataFrame(prices_train.dtypes, columns=["types"])
df_types_object = df_types[df_types["types"] == "object"]

for field_obj in df_types_object.index:
    prices_train[field_obj] = prices_train[field_obj].astype('category').cat.codes

prices_train.head(20).style.background_gradient(cmap=cm)
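
<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; One caveat with category codes: they are assigned per DataFrame, so encoding two datasets separately can map the same label to different integers. The sketch below uses hypothetical toy frames (train_df, test_df) to show the effect and one possible way to keep the mapping stable by reusing the categories seen in the training data. It is an illustration, not the approach used in this notebook.
</div>

```python
import pandas as pd

# Hypothetical toy frames standing in for a train and a test set
train_df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test_df = pd.DataFrame({"Street": ["Pave", "Pave"]})

# Encoding each frame on its own: "Pave" gets code 1 in train but code 0 in test
separate_train = train_df["Street"].astype("category").cat.codes.tolist()
separate_test = test_df["Street"].astype("category").cat.codes.tolist()

# Encoding both frames with the categories learned from train keeps the mapping stable
street_dtype = pd.CategoricalDtype(categories=sorted(train_df["Street"].unique()))
shared_train = train_df["Street"].astype(street_dtype).cat.codes.tolist()
shared_test = test_df["Street"].astype(street_dtype).cat.codes.tolist()

print(separate_train, separate_test)
print(shared_train, shared_test)
```

With a shared dtype, labels that never appeared in the training data are encoded as -1.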

<h1> <center> 📈 XGBoost regression training 📈 </center> </h1> 

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #F1948A">
    <p>📌 &nbsp; About regression: It is a statistical method that estimates the relationship between one dependent variable (here, SalePrice) and one or more independent variables (the features). </p>
</div>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #D6EAF8">
    <p>📌 &nbsp; About XGBoost: Its XGBRegressor class can be used directly for regression predictive modeling.</p>
</div>
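
<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; To give an intuition for gradient boosting, here is a minimal sketch of the plain algorithm for squared error: each new shallow tree is fitted to the residuals of the current ensemble and added with a small learning rate. This is not XGBoost's implementation, which adds regularization and other refinements. The toy data and all variable names are assumptions made for illustration.
</div>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy regression problem: a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())   # start from a constant model
initial_mse = np.mean((y_toy - prediction) ** 2)

for _ in range(100):
    residuals = y_toy - prediction                # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # small step toward the residuals

final_mse = np.mean((y_toy - prediction) ** 2)
print(f"training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```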

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #F9E79F ">
    <p>📌 &nbsp; In XGBoost it is possible to check the importance of each feature in the final result. To do this, just use the plot_importance function. </p>
</div>

In [None]:
y = prices_train['SalePrice']
X = prices_train[features]

# learning_rate is the scikit-learn API name for the booster parameter eta
model = xgboost.XGBRegressor(n_estimators=1000, max_depth=10, learning_rate=0.1, subsample=0.7, colsample_bytree=0.8)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
xgboost_model = model.fit(x_train, y_train)
xgboost_model.score(x_test, y_test)

In [None]:
from xgboost import plot_importance
plt.rcParams["figure.figsize"] = (20, 15)
plot_importance(model,max_num_features=100)
plt.show()

<h1> <center> 📈 XGBoost model evaluation 📈 </center> </h1> 

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #F1948A">
    <p>📌 &nbsp; For regression, the R² metric (coefficient of determination) can be used, with the best possible score equal to 1.0. </p>
</div>

<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em; background-color: #D6EAF8">
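    <p>📌 &nbsp; It is also possible to use the MSE metric, also called the mean squared error, which represents the mean squared difference between the estimated values and the actual values. The evaluation code reports its square root (RMSE), which is expressed in the same units as SalePrice. </p>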
</div>
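
<div class="alert alert-block alert-info" style="font-size:14px; color: black;line-height: 1.7em;">
    📌 &nbsp; As a sanity check on the definitions, the sketch below computes MSE, RMSE, and R² by hand and compares them with scikit-learn. The prices in y_true and y_hat are hypothetical values chosen for illustration.
</div>

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([200000.0, 150000.0, 320000.0, 275000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0, 280000.0])

mse = np.mean((y_true - y_hat) ** 2)             # mean squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

# Cross-check against scikit-learn
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))

print(f"MSE={mse:.0f}, RMSE={rmse:.0f}, R2={r2:.3f}")
```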

In [None]:
def get_plot_predict(y, y_pred):
    plt.figure(figsize=(15, 5))

    # kdeplot replaces the deprecated distplot(..., hist=False)
    ax1 = sns.kdeplot(x=y, color="orange", label="real value")
    sns.kdeplot(x=y_pred, color="blue", label="predicted value", ax=ax1)

    plt.title("Real sale price x predicted")
    plt.xlabel("SalePrice")
    plt.gca().legend()
    plt.grid()
    plt.show()

In [None]:
train_predict = xgboost_model.predict(x_train)
test_predict = xgboost_model.predict(x_test)

# RMSE is the square root of the mean squared error
r2_train = r2_score(y_train, train_predict)
rmse_train = np.sqrt(mean_squared_error(y_train, train_predict))

r2_test = r2_score(y_test, test_predict)
rmse_test = np.sqrt(mean_squared_error(y_test, test_predict))

print("Train dataset: ")
print(f"    r2: {round(r2_train, 2)} RMSE: {round(rmse_train, 2)}")
get_plot_predict(y_train, train_predict)
print("Test dataset: ")
print(f"    r2: {round(r2_test, 2)} RMSE: {round(rmse_test, 2)}")
get_plot_predict(y_test, test_predict)

<h1> <center> Submission </center> </h1> 

In [None]:
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
test.head(10).style.background_gradient(cmap=cm)

In [None]:
cleaning_data_none(test, fields_clean_none)
cleaning_data_int(test, fields_clean_int)
cleaning_data_median(test, fields_clean_median)

In [None]:
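df_types = pd.DataFrame(test.dtypes, columns=["types"])
df_types_object = df_types[df_types["types"] == "object"]

# Codes are derived from the test set alone, so they match the training codes
# only when a column contains the same set of labels in both datasets
for field_obj in df_types_object.index:
    test[field_obj] = test[field_obj].astype('category').cat.codes

test.head(20).style.background_gradient(cmap=cm)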

In [None]:
test_id = test['Id']
prediction = xgboost_model.predict(test[features])
print(len(prediction))

In [None]:
output = pd.DataFrame({'Id': test_id, 'SalePrice': prediction})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")