# Problem Statement

- Task 1:- To prepare a complete data analysis report on the given data.


- Task 2:-

  a) To create a robust machine learning algorithm to accurately predict the price of the house given the various factors across the market.      

  b) To determine the relationship between the house features and how the price varies based on this.


- Task 3:- To come up with suggestions for the customer to buy the house according to the area, price and other requirements.


# Importing libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Load Data

In [None]:
data=pd.read_csv('data.csv')

In [None]:
data.head()

# Domain Analysis 

- Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this  dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

  With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

MSSubClass:
    Represents the type of dwelling involved in the sale, such as 1-STORY 1946 & NEWER ALL STYLES, capturing the style 
    and age of the property.

MSZoning: 
    Identifies the general zoning classification of the sale, offering insights into the permissible land use, such
    as residential low density or medium density.

LotFrontage: 
    Indicates the linear feet of street connected to the property, giving a measure of the property's frontage and
    its potential impact on accessibility and aesthetics.

LotArea: 
    Reflects the lot size in square feet, a crucial factor influencing the overall property value.

Street: 
    Specifies the type of road access to the property, distinguishing between paved and gravel roads, affecting
    convenience and property value.

Alley: 
    Describes the type of alley access to the property, providing information on additional access points.

LotShape: 
    Defines the general shape of the property's lot, influencing property aesthetics and potentially affecting land utility.

LandContour: 
    Indicates the flatness of the property, impacting construction feasibility and landscaping possibilities.

Utilities: 
    Specifies the type of utilities available, such as all public utilities or electricity only, affecting convenience
    and livability.

LotConfig: 
    Describes the lot configuration, providing insights into how the property is situated within its surroundings.

LandSlope: 
    Identifies the slope of the property, which can influence drainage, landscaping, and construction considerations.

Neighborhood: 
    Represents physical locations within Ames city limits, capturing the neighborhood's influence on property values
    and desirability.

Condition1 and Condition2: 
    Indicate the proximity to various conditions (e.g., railroad, park), offering insights into potential nuisances
    or amenities.

BldgType: 
    Specifies the type of dwelling, distinguishing between single-family, townhouse inside unit, etc.

HouseStyle: 
    Represents the style of dwelling, such as 1-story or 2-story, contributing to the property's architectural characteristics.

OverallQual and OverallCond: 
    Convey the overall material and finish quality, as well as the overall condition, influencing the property's
    appeal and value.

YearBuilt and YearRemodAdd: 
    Provide the year the house was built and remodeled, helping assess the property's age and recent upgrades.

RoofStyle and RoofMatl: 
    Describe the roof type and material, contributing to the property's aesthetics and durability.

Exterior1st and Exterior2nd: 
    Indicate the exterior covering on the house, influencing curb appeal and maintenance requirements.

MasVnrType and MasVnrArea: 
    Specify the masonry veneer type and area, adding to the property's visual appeal.

ExterQual and ExterCond: 
    Capture the exterior material quality and condition, influencing the property's durability and maintenance needs.

Foundation: 
    Represents the type of foundation, essential for assessing the property's structural integrity.

BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF: 
    Provide insights into basement quality, condition, exposure, and size, crucial for assessing additional living space.

Heating, HeatingQC: 
    Indicate the type of heating and heating quality, impacting comfort and energy efficiency.

CentralAir: 
    Specifies whether the property has central air conditioning, contributing to comfort and property value.

Electrical: 
    Represents the electrical system, a critical component for safety and functionality.

1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea: 
    Provide square footage details for various living areas, influencing the property's size and layout.

BsmtFullBath, BsmtHalfBath, FullBath, HalfBath: 
    Describe bathroom features, contributing to the property's functionality.

BedroomAbvGr, KitchenAbvGr, KitchenQual: 
    Specify the number of bedrooms and kitchens, along with kitchen quality, influencing livability and property value.

TotRmsAbvGrd: 
    Indicates the total rooms above ground, offering insights into the property's spatial layout.

Functional: 
    Represents the home's functionality rating, crucial for assessing usability and appeal.

Fireplaces, FireplaceQu: 
    Convey the number of fireplaces and fireplace quality, contributing to ambiance and property value.

GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond: 
    Provide details on the garage, including type, year built, and size, influencing property utility and value.

PavedDrive: 
    Specifies whether the property has a paved driveway, impacting convenience and aesthetics.

WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch: 
    Capture porch and deck features, enhancing outdoor living and visual appeal.

PoolArea, PoolQC: 
    Indicate pool area and quality, contributing to luxury and property value.

Fence: 
    Describes fence quality, offering privacy and influencing property aesthetics.

MiscFeature, MiscVal: 
    Specify miscellaneous features and their values, potentially adding unique elements to the property.

MoSold, YrSold: 
    Represent the month and year of sale, capturing the temporal aspect of property transactions.

SaleType, SaleCondition: 
    Describe the type and condition of the sale, providing insights into the transaction dynamics.

SalePrice: 
    The target variable, representing the sale price of the house.

# Basic checks 

In [None]:
## Call the dataframe and do basic checks
pd.set_option('display.max_columns', None)

data.head()

In [None]:
data.tail()

In [None]:
data['LotFrontage'].median()

In [None]:
data.info()

In [None]:
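# Only the last expression of a cell is displayed, so the first two results are printed explicitly
print(data.loc[data['MiscFeature'].isnull()].index)   # rows where MiscFeature is missing
print(data['MiscFeature'].isnull().sum())             # number of missing MiscFeature values
data.shape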

In [None]:
data.size

In [None]:
## Count of missing values per feature
pd.set_option('display.max_rows',None)
data.isna().sum()

In [None]:
data.notnull().sum()

In [None]:
data.duplicated().sum()

In [None]:
# show all the unique values of the target variable
data.SalePrice.unique()

In [None]:
# here in target we have no null value
data.SalePrice.isnull().sum()

In [None]:
pd.set_option('display.max_rows',None)
data.SalePrice.value_counts()

In [None]:
data.describe().transpose()

In [None]:
data.describe(include = 'O')

### Fetching only categorical columns from the data

In [None]:
categorical_cols = data.select_dtypes('object').columns
print(categorical_cols)

In [None]:
## creating a sub-dataframe of the categorical variables and the sale price

data_categorical = pd.concat([data[categorical_cols],data['SalePrice']], axis=1)
data_categorical.head(5)

In [None]:
data_categorical.isnull().sum()

In [None]:
# print each categorical column (up to 50 categories) with its unique values
categorical_col = []

for column in data.columns:
    if data[column].dtype == object and len(data[column].unique()) <= 50:
        categorical_col.append(column)
        print(f"{column}: {data[column].unique()}")
        print("=================================")

##### Numerical data statistical measures

In [None]:
data.describe()

##### Insights from numerical data statistical measures:

1)->General Information:

The dataset contains information on housing with 1460 entries.
The target variable is SalePrice.


2)->Summary Statistics:

SalePrice has a mean of approximately $180,921 and a median (50th percentile) of $163,000. The prices vary widely, ranging from $34,900 to $755,000.


3)->Year Information:

The houses in the dataset were generally built between 1872 and 2010, with an average year of construction around 1971.
The average year of remodeling is approximately 1984, with a range from 1950 to 2010.



4)->Lot Characteristics:

    
LotFrontage has missing values (1201 non-null), and the mean lot frontage is approximately 70.
LotArea varies widely, with an average lot area of approximately 10,516 square feet.


5)->Quality and Condition:

OverallQual and OverallCond represent the overall material and finish quality and overall condition of the house, respectively.
OverallQual has a mean of approximately 6, indicating an above-average quality on average.


6)->Living Area:

The average above-ground living area (GrLivArea) is approximately 1,515 square feet.
There is variation in low-quality finished square feet (LowQualFinSF), with an average of 5.84.


7)->Basement Information:

TotalBsmtSF represents the total square feet of the basement area, with an average of approximately 1,057 square feet.
BsmtFullBath and BsmtHalfBath indicate the number of basement full bathrooms and half bathrooms, respectively.


8)->Garage Information:

GarageYrBlt has missing values (1379 non-null) and represents the year the garage was built.
GarageCars and GarageArea represent the capacity and size of the garage, respectively.


9)->Outdoor Features:

WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, and ScreenPorch represent different types of porches and decks.
There is variability in the presence of pools (PoolArea) and miscellaneous features (MiscVal).


10)->Time Information:

Houses were sold between 2006 and 2010, with an average sale year of approximately 2008.
MoSold represents the month of sale.
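
The non-null counts quoted above for LotFrontage (1201) and GarageYrBlt (1379) can be collected into one per-column table of missing values. The following is a minimal sketch on a made-up toy frame standing in for `data`; `missing_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep affected columns only, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with invented values, used only to illustrate the helper
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})
report = missing_summary(toy)
print(report)
```

On the notebook's dataframe the equivalent call would be `missing_summary(data)`.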

###### Categorical data statistical measures

In [None]:
data.describe(include = 'O')

##### Categorical data insights:

A)->Cardinality:

1)->The MSZoning feature has 5 unique values, with "RL" being the most frequent (1151 occurrences).

2)->Street is binary with "Pave" occurring 1454 times.

3)->Alley has two values ("Grvl" and "Pave"), with "Grvl" appearing 50 times.



B)->Lot Characteristics:

1)->LotShape has 4 unique values, with "Reg" being the most frequent (925 occurrences).

2)->LandContour has 4 unique values, with "Lvl" being the most frequent (1311 occurrences).

3)->Utilities is mostly constant, with "AllPub" occurring 1459 times.


C)->Location and Configuration:

1)->LotConfig has 5 unique values, with "Inside" being the most frequent (1052 occurrences).

2)->LandSlope is mostly gentle ("Gtl") and appears 1382 times.


D)->Neighborhood and Conditions:

1)->Neighborhood has 25 unique values, with "NAmes" being the most frequent (225 occurrences).

2)->Condition1 and Condition2 represent proximity to various conditions; most occurrences are "Norm" in both cases.


E)->Building Characteristics:

1)->BldgType mostly consists of single-family homes ("1Fam" - 1220 occurrences).

2)->HouseStyle is predominantly one-story houses ("1Story" - 726 occurrences).

3)->RoofStyle is mostly "Gable" (1141 occurrences), and RoofMatl is primarily "CompShg" (1434 occurrences).


F)->Exterior and Masonry Veneer:

1)->Exterior1st and Exterior2nd represent the exterior covering of the house; "VinylSd" is the most common for both.

2)->MasVnrType has 5 unique values, with "None" being the most frequent (864 occurrences).


G)->Basement Characteristics:

1)->BsmtQual rates the height of the basement and BsmtCond its general condition; both are mostly "TA" (Typical/Average).

2)->BsmtExposure is mostly "No," indicating no walkout or garden-level basement walls.


H)->Heating and Air Conditioning:

1)->Heating is mostly "GasA," and HeatingQC is predominantly "Ex" (Excellent).

2)->CentralAir is mostly "Y," indicating central air conditioning.


I)->Electrical and Kitchen Quality:

1)->Electrical mostly consists of "SBrkr."

2)->KitchenQual is predominantly "TA" (Typical/Average).


J)->Functionality and Fireplaces:

1)->Functional is mostly "Typ" (Typical Functionality).

2)->FireplaceQu represents the quality of fireplaces; "Gd" (Good) is most common.


K)->Garage Characteristics:

1)->GarageType mostly consists of attached garages ("Attchd" - 870 occurrences).

2)->GarageFinish is mostly "Unf" (Unfinished).


L)->Paved Driveway and Pool:

1)->PavedDrive is mostly "Y," indicating a paved driveway.

2)->PoolQC has only 7 non-null values across 3 categories; "Gd" (Good) is the most frequent (3 occurrences).


M)->Fence and Miscellaneous Features:

1)->Fence has 157 occurrences of "MnPrv" (Minimum Privacy).

2)->MiscFeature has 49 occurrences of "Shed."


N)->Sale Type and Condition:

1)->SaleType mostly consists of "WD" (Warranty Deed - Conventional).

2)->SaleCondition is mostly "Normal."
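
The "most frequent value" statements above all come from the `top` and `freq` rows of `describe(include='O')`. As a cross-check, the sketch below builds the same information with `value_counts` and adds each column's missing count, which helps keep non-null counts and category frequencies apart. It runs on a small invented frame; `top_category_table` is a hypothetical helper name.

```python
import pandas as pd


def top_category_table(df: pd.DataFrame) -> pd.DataFrame:
    """Most frequent category, its frequency, and missing count per non-numeric column."""
    rows = []
    for col in df.select_dtypes(exclude="number").columns:
        counts = df[col].value_counts(dropna=True)
        if counts.empty:          # column with no observed category at all
            continue
        rows.append({
            "column": col,
            "n_unique": counts.size,
            "top": counts.index[0],
            "freq": int(counts.iloc[0]),
            "share_of_rows": round(counts.iloc[0] / len(df), 3),
            "n_missing": int(df[col].isna().sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy frame with invented values, used only to illustrate the helper
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave"],
    "PoolQC": [None, None, "Gd", "Ex", "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
table = top_category_table(toy)
print(table)
```

On the notebook's dataframe the equivalent call would be `top_category_table(data)`.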

##### Numerical features from dataset:

In [None]:
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns
numerical_cols

In [None]:
data_numerical = data[numerical_cols]
print(len(numerical_cols))

In [None]:
data_numerical.head(4)

In [None]:
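# numerical data along with its unique values
numerical_column = []

for column in data_numerical:
    # is_integer_dtype matches every integer width (int32, int64, ...), unlike `== int`
    if pd.api.types.is_integer_dtype(data[column]) and len(data[column].unique()) <= 50:
        numerical_column.append(column)
    # unique values are printed for every numerical column, not only the ones collected above
    print(f"{column} :{data[column].unique()}")
    print('****************************')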

In [None]:
# check distinct values in each column
data[numerical_cols].nunique()

### Relation of the year columns with the target variable:

In [None]:
# datetime_cols = [col for col in data if 'Yr'in  col or 'Year'in  col ]
# datetime_cols
datetime_col = data[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold','SalePrice']]
datetime_col.head()

In [None]:
plt.figure(figsize = (10,10), facecolor = 'white')
plotnumber = 1

for column in datetime_col.columns[:-1]:
    plt.subplot(2,2,plotnumber)
    plt.plot(datetime_col[column], datetime_col['SalePrice'], 'o', label = f"{column} vs SalePrice")
    
    plt.title(f"{column} vs SalePrice")
    plt.xlabel(column, fontsize = 10)
    plt.ylabel('SalePrice', fontsize = 10)
    plotnumber += 1
    plt.grid(True)
    
plt.tight_layout()
plt.legend()
plt.show()

In [None]:
datetime_col.corr()

Positive Connection: 
    The moderate positive correlation (roughly 0.5) between 'SalePrice' and 'YearBuilt', 'YearRemodAdd', and 'GarageYrBlt'
    suggests that more recently built or remodeled homes, and homes with newer garages, tend to sell for
    higher prices.

Time and Sale Price: 
    The small negative correlation near 0 between 'SalePrice' and 'YrSold' hints at a subtle trend: when the sale year
    increases, there's a slight tendency for the selling price to decrease, although this relationship is not very pronounced.

Predictive Power: 
    The moderate correlation values imply that 'YearBuilt', 'YearRemodAdd', and 'GarageYrBlt' could serve as
    useful predictors for estimating 'SalePrice'. Changes in these features are associated with noticeable
    changes in the property's selling price. However, keep in mind that correlation doesn't prove causation, and a 
    comprehensive analysis should consider various factors.
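
To read these correlations directly rather than from the full matrix, each feature can be correlated with the target and ranked by absolute strength. The sketch below uses invented numbers in a toy frame; `corr_with_target` is a hypothetical helper name, and Pearson correlation is assumed because it is the pandas default.

```python
import pandas as pd


def corr_with_target(df: pd.DataFrame, target: str = "SalePrice") -> pd.Series:
    """Pearson correlation of every other column with the target, strongest first."""
    features = df.drop(columns=[target])
    corr = features.corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame with invented values, used only to illustrate the helper
toy = pd.DataFrame({
    "YearBuilt": [1950, 1975, 1990, 2005, 2008],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "SalePrice": [110000, 150000, 175000, 240000, 230000],
})
result = corr_with_target(toy)
print(result)
```

On the notebook's data the equivalent call would be `corr_with_target(datetime_col)`.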

In [None]:
numerical_discrete_cols = [col for col in numerical_cols if len(data[col].unique()) < 25 and col not in datetime_col]
numerical_discrete_cols

In [None]:
data_discrete = pd.concat([data[numerical_discrete_cols],data['SalePrice']], axis=1)
data_discrete.head()

In [None]:
numerical_continuous_cols = [col for col in numerical_cols if col not in numerical_discrete_cols and col not in datetime_col and col != 'Id']
numerical_continuous_cols

In [None]:
data_continous = data[numerical_continuous_cols]
data_continous.head(4)
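
The split into discrete and continuous columns above relies on a cutoff of 25 unique values and on leaving out the year columns, `Id`, and the target. The sketch below restates that rule as one function so the cutoff is an explicit parameter. The toy frame is invented, `split_numeric_columns` is a hypothetical helper name, and `nunique()` ignores missing values whereas `len(unique())` counts NaN as a value.

```python
import pandas as pd


def split_numeric_columns(df, year_cols, target="SalePrice", id_col="Id", max_discrete=25):
    """Split numeric columns into discrete (< max_discrete unique values) and continuous."""
    excluded = set(year_cols) | {target, id_col}
    numeric = [c for c in df.select_dtypes(include="number").columns if c not in excluded]
    discrete = [c for c in numeric if df[c].nunique() < max_discrete]
    continuous = [c for c in numeric if c not in discrete]
    return discrete, continuous


# Toy frame with invented values, used only to illustrate the helper
toy = pd.DataFrame({
    "Id": range(1, 31),
    "OverallQual": [i % 10 + 1 for i in range(30)],   # 10 distinct values
    "LotArea": [8000 + 37 * i for i in range(30)],    # 30 distinct values
    "YrSold": [2006 + i % 5 for i in range(30)],
    "SalePrice": [100000 + 2500 * i for i in range(30)],
})
discrete, continuous = split_numeric_columns(toy, year_cols=["YrSold"])
print(discrete, continuous)
```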

# Exploratory Data Analysis:

### Univariate Analysis and Bivariate Analysis -- Sweetviz
- Univariate Analysis: In univariate analysis, we focus on one variable at a time, like a superhero investigating a single clue. We dig into its details, check for any weird stuff (outliers), and figure out what's typical about it (central tendencies). It's like zooming in on one character in a big story to understand their unique tale. 

In [None]:
# !pip install sweetviz

In [None]:
# import warnings
# warnings.filterwarnings("ignore")

# import sweetviz as sv #  library for univariant analysis

# my_report = sv.analyze(data)## pass the original dataframe

# my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

- Skewness:

    Skewness is a measure of the asymmetry or lack of symmetry in a distribution. In simple terms, it indicates whether the data is skewed to the left or right. If the distribution is skewed to the right, it means that the tail on the right side is longer or fatter than the left side, and vice versa for left skewness. In a perfectly symmetrical distribution, the skewness is zero.
    
    
What do positive and negative skewness mean?

- Positive skewness:

    If the data is positively skewed, most data points sit on the left side of the distribution and the right tail is longer. Positively skewed distributions are also called right-skewed distributions. This pattern indicates outliers or extreme values at the higher end of the data range, which pull the distribution in the positive direction. The mean is typically greater than the median, because the larger values on the right pull the mean in that direction. Skewness matters because it can affect the choice of statistical methods and the interpretation of results.
    
    
    
- Negative Skewness:


    If the data is negatively skewed, most data points sit on the right side of the distribution and the left tail is longer. Negatively skewed distributions are also called left-skewed distributions. This pattern indicates outliers or extreme values at the lower end of the data range, which pull the distribution in the negative direction. The mean is typically less than the median, because the smaller values on the left pull the mean in that direction.
     
     
- IQR:


     The Interquartile Range (IQR) is a measure of statistical dispersion, representing the range between the first quartile (25th percentile) and the third quartile (75th percentile) of a dataset. It describes the spread of the middle 50% of the data, is little affected by extreme values, and helps identify potential outliers.
     
     
- Variance:

     Variance is a statistical measure that quantifies the spread or dispersion of a set of data points. It calculates the average squared difference between each data point and the mean of the dataset. A higher variance indicates greater variability among the values, while lower variance suggests that the values are closer to the mean, reflecting a more consistent dataset.
     
     
- Standard Deviation:


    Standard Deviation (std) is a statistical measure that quantifies the amount of variation or dispersion in a set of data points. It is the square root of the variance, providing a more interpretable metric in the same units as the original data. A higher standard deviation indicates greater variability among the values, while a lower standard deviation suggests that the values are closer to the mean, signifying less dispersion in the dataset.
    
    
- Range:

    Range is a simple measure of statistical dispersion, representing the difference between the maximum and minimum values in a dataset. It provides a quick assessment of the spread of data points. A wider range suggests greater variability among values, while a narrower range indicates less variability and a more concentrated dataset.
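
The measures defined above can all be computed with pandas. The sketch below does so for a single made-up series with one large value on the right; `spread_stats` is a hypothetical helper name. Note that pandas reports sample variance (`ddof=1`) and excess kurtosis, so a normal distribution scores 0 rather than 3 on the kurtosis returned here.

```python
import pandas as pd


def spread_stats(s: pd.Series) -> pd.Series:
    """Range, IQR, variance, standard deviation, skewness and kurtosis of one column."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return pd.Series({
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "variance": s.var(),    # sample variance (ddof=1), the pandas default
        "std": s.std(),         # square root of the variance above
        "skewness": s.skew(),   # > 0 means a longer right tail
        "kurtosis": s.kurt(),   # excess kurtosis: 0 for a normal distribution
    })


# Invented values: mostly small numbers plus one extreme value on the right
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 20], name="toy")
stats = spread_stats(toy)
print(stats)
print("mean:", toy.mean(), "median:", toy.median())
```

Because of the single extreme value, the skewness is positive and the mean sits above the median, matching the description of positive skewness.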
    

##### Insights from features using Sweetviz:

#### 1)-> MSSubClass:
- Range:
    The range is 170, which implies a wide variation in the values of this variable.
    The range is the difference between the highest and lowest values.

- Interquartile range (IQR):
    An IQR of 50.0 means the middle 50% of the data spans 50.0 units,
    which is moderate relative to the full range of 170.
    A larger IQR suggests greater variability within the central portion of the dataset.

- Standard Deviation:
    A standard deviation of 42 indicates a relatively high degree of spread
    in the data points around the mean.
    
- Variance:
    The variance (VAR) of 1789 measures the average squared deviation
    of each data point from the mean. In this context, it suggests significant variability in the data.
    
- Kurtosis:
    The kurtosis is below 3, the value for a normal distribution.
    
- Skewness:
    The data is positively skewed, which means it is not normally distributed.
    
#### 2) ->LotFrontage:
- Range: is 292
- IQR: is 21
- STD: is 24.3
- VAR :is 590
- Skewness:LotFrontage is positively skewed  
#### 3) ->LotArea:
- Range: is 214K
- IQR :is 4048
- STD: is 9981
- VAR :99.6 M
- Skewness:
    Here the data is positively skewed
#### 4)->YearBuilt: 
- Range is 138
- IQR : is 46
- STD is 30
- VAR is 912
- Skewness:
    Here the distribution is negatively skewed
#### 5)->YearRemodAdd:
- Range is 60.0
- IQR is 37
- STD is 20.6
- VAR is 426
- Skewness:
    Here the data is slightly negatively skewed
#### 6)->MasVnrArea:
- Range: is 1600
- IQR : is 166
- STD: is 181
- VAR : is 32785
- Skewness:
    Here data point is positively  skewed
#### 7)->BsmtFinSF1:
- Range is 5644
- IQR is 712
- STD is 456
- Skewness:
    Here the data is positively skewed
#### 8)->BsmtFinSF2:
- Range is 1474
- IQR is 0.00
- STD is 161
- VAR is 26024
- Skewness:
    Here the data is positively skewed
#### 9)->BsmtUnfSF:
- Range is 2336
- IQR is 585
- STD is 442
- VAR is 195K
- Skewness:
    Here the data is positively skewed
#### 10)->TotalBsmtSF:
- Range: is 6110
- IQR: is 502
- STD: is 439
- VAR: is 192K
- Skewness:
    Here the data is Positively skewed
#### 11)-> LowQualFinSF:
- Range is 572
- IQR is 0.00
- STD is 48.6
- VAR is 2364
- Skewness: 
    Data is positively skewed
    Here we have peakedness
#### 12)->GrLivArea:
- Range: is 5308
- IQR is 647
- STD is 525
- VAR is approximately 276K (the square of the STD)
- Skewness:
    Here data is positively skewed 
    Here we have peakedness
#### 13)->TotRmsAbvGrd:
- Range is 12.0
- IQR is 2.00
- STD is 1.63
- VAR is 2.64
- Skewness:
    Here the data is Positively skewed
    We have low kurtosis
#### 14)->GarageYrBlt:
- Range is 110
- IQR is 41
- STD is 24.7
- VAR is 610
- Skewness: 
    Here the data is negatively skewed. Here we have low kurtosis
#### 15)->GarageArea:
- Range is 1418
- IQR is 242
- STD is 214
- VAR is 45713
- Skewness:
    Here we have positive skewness
    Here we have low kurtosis
#### 16)->WoodDeckSF:
- Range is 857
- IQR is 168 
- STD is 125
- VAR is 15710
- Skewness:
    Here the data is positively skewed
    Here we have no specific kurtosis
#### 17)->OpenPorchSF:
- Range is 547
- IQR is 68.0
- STD is 66.3
- VAR is 4390
- Skewness:
    Here we have positively skewed data
    We have peakedness in data 
#### 18)->EnclosedPorch:
- Range is 553
- IQR is 0.00
- STD is 61
- VAR is 3736
- Skewness: 
    Here we have Positively skewed data
    We have high peakedness.
#### 19)->3SsnPorch:
- Range is 508
- IQR is 0.00
- STD is 29.3
- VAR is 860
- Skewness:
    We have strong positively skewed data 
    We have extremely high peakedness.
#### 20)->ScreenPorch:
- Range is 480
- IQR is 0.00
- STD is 55.8
- VAR is 3109
- Skewness:
    We have positive skewed data
    Here we have high peakedness.
#### 21)->PoolArea:
- We have no clear insights here 
#### 22)->MoSold:
- Range is 11.0
- IQR is 3.00
- STD is 2.70
- VAR is 7.31
- Skewness:
    Here the data is positively skewed
    We have extremely low kurtosis.


### Histplot for univariate analysis:

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1

# Iterate over the column names; `data_continous[:-1]` sliced off the last row, not a column
for column in data_continous.columns:
    if plotnumber<=15 :     # as there are 15 columns in the data
        ax = plt.subplot(4,4,plotnumber)
        sns.histplot(data_continous[column],  kde=True)
        
        plt.title(f'Histplot of {column}')
        plt.xlabel(f"{column} range",fontsize=10)
        plt.ylabel('Count', fontsize = 8)   # histplot's default statistic is a count, not a density
        plotnumber += 1
    plt.xticks(rotation=90,fontsize=7)
plt.show()

#### Bivariate Analysis for continous features:

In [None]:
plt.figure(figsize = (25, 25), facecolor = 'white')
plotnumber = 1
for column in data_continous.columns:   # column names; `[:-1]` on the frame sliced rows, not columns
    if plotnumber <= 15:
        plt.subplot(4,4,plotnumber)
        sns.lineplot(x = data[column], y= data['SalePrice'], ci = None, label = f"{column} vs SalePrice")
        
        plt.title(f"Line Plot for {column}")
        plt.xlabel(column, fontsize = 10)
        plt.ylabel('SalePrice', fontsize = 10)
        plotnumber += 1
        plt.grid(True)
        
plt.tight_layout()
plt.legend()
plt.show()

#### Scatterplot for bivariate analysis:
Here we use scatterplots to spot outliers and to see how each feature behaves against the target variable.

In [None]:
plt.figure(figsize = (20, 20), facecolor = 'white')
plotnumber = 1
for column in data_continous.columns:   # column names; `[:-1]` on the frame sliced rows, not columns
    if plotnumber <= 15:
        plt.subplot(4,4,plotnumber)
        sns.scatterplot(x = data[column], y= data['SalePrice'],label = f"{column} vs SalePrice")
        
        plt.title(f"scatter Plot for {column}")
        plt.xlabel(column, fontsize = 10)
        plt.ylabel('SalePrice', fontsize = 10)
        plotnumber += 1
        plt.grid(True)
        
plt.tight_layout()
plt.legend()
plt.show()

#### Bivariate analysis of datetime column relation with SalePrice:

In [None]:
plt.figure(figsize = (10,10), facecolor = 'white')
plotnumber = 1

for column in datetime_col.columns[:-1]:
    plt.subplot(2,2,plotnumber)
    sns.lineplot(x = data[column], y = data['SalePrice'],ci = None   ,label = f"{column} vs SalePrice")
    
    plt.title(f"{column} vs SalePrice")
    plt.xlabel(column, fontsize = 10)
    plt.ylabel('SalePrice', fontsize = 10)
    plotnumber += 1
    plt.grid(True)
    
plt.tight_layout()
plt.legend()
plt.show()

#### Univariate analysis of categorical data:

In [None]:
plt.figure(figsize = (25, 30), facecolor = 'white')
plotnumber = 1
for column in data_categorical.columns:
    if plotnumber <= 43:
        plt.subplot(11,4, plotnumber)
        sns.countplot(x = data_categorical[column])
        
        plt.title(f'count of {column}')
        plt.xlabel(column, fontsize = 10)
        plt.ylabel('Occurrence', fontsize = 10)
        plotnumber += 1
        
plt.tight_layout()

### Bivariate Analysis:

In [None]:
plt.figure(figsize = (25, 30), facecolor = 'white')
plotnumber = 1
for column in data_categorical.columns:
    if plotnumber <= 43:
        plt.subplot(11,4, plotnumber)
        sns.barplot(x = data_categorical[column], y = data.SalePrice, ci = None)
        
        plt.title(f'Barplot of {column}')
        plt.xlabel(column, fontsize = 10)
        plt.ylabel('SalePrice', fontsize = 10)
        plotnumber += 1
        
plt.tight_layout()

#### Bivariate analysis of discrete data against SalePrice :

In [None]:
plt.figure(figsize = (25, 30), facecolor = 'white')
plotnumber = 1
for column in data_discrete.columns[:-1]:
    if plotnumber <= 43:
        plt.subplot(11,4, plotnumber)
        sns.barplot(x = data_discrete[column], y = data.SalePrice, ci = None)
        
        plt.title(f'Barplot of {column}')
        plt.xlabel(column, fontsize = 10)
        plt.ylabel('SalePrice', fontsize = 10)
        plotnumber += 1
        
plt.tight_layout()

### Insights:

1)-> MSZoning:
    
1)->'FV' (Floating Village Residential) properties tend to have the highest SalePrice.

2)->'RL' (Residential Low Density) zoning comes second.

3)->Properties with 'RM' (Residential Medium Density) and 'RH' (Residential High Density) zoning sell at similar, lower prices. 'C (all)' zoning, representing commercial properties, has the lowest average SalePrice.



2)->Street:
Properties with 'Pave' (paved) road access tend to have a higher SalePrice than those with 'Grvl' (gravel) road access. Only 6 of the 1460 properties have gravel access, so this comparison rests on very few sales.



3)->Alley:
Properties with a 'Pave' (paved) alley have the higher average SalePrice. Properties with a 'Grvl' (gravel) alley come second, with a lower average sale price than 'Pave'.



4)-> LotShape:
Properties with an 'IR2' (moderately irregular) lot have the highest average SalePrice, followed by 'IR3' (highly irregular), 'IR1' (slightly irregular), and then 'Reg' (regular-shaped) lots. This order suggests an association between lot irregularity and SalePrice.



5)-> LandContour:
'Lvl': Indicates properties with a level (flat) contour.

'Bnk': Represents properties with a banked contour (a quick rise from street grade to the building).

'Low': Signifies properties with a low contour (depression).

'HLS': Represents properties with a hillside contour.

    
Highest SalePrice:

Properties with 'HLS' (hillside) land contour have the highest average SalePrice.

Second-Highest:

'Low' contour properties (depression) come next.

Moderate:

'Lvl' contour properties (level or flat) show a moderate average SalePrice.

Lowest:

'Bnk' contour properties (banked) have the lowest average SalePrice.



6)->Utilities:
AllPub (All Public Utilities):

Properties with 'AllPub' utilities have the higher average SalePrice.

NoSeWa (Electricity and Gas Only, no sewer or water):

Properties with 'NoSeWa' utilities have a lower average SalePrice.

This implies that properties with access to all standard public utilities may be associated with higher sale prices than properties with limited utility services. Since 'AllPub' covers 1459 of the 1460 properties, the 'NoSeWa' figure comes from a single sale and should be read with caution.



7)->LotConfig:
The 'LotConfig' variable describes the lot configuration of properties in the dataset:

'Inside': Inside lots, with street frontage on one side only.

'FR2': Lots with frontage on two sides of the property.

'Corner': Lots at a street corner.

'CulDSac': Lots at the end of a cul-de-sac.

'FR3': Lots with frontage on three sides of the property.

'CulDSac' Configuration:

Properties with a 'CulDSac' lot configuration have the highest average sale price.

'FR3' Configuration:

Properties with an 'FR3' (frontage on three sides) lot configuration have the second-highest average sale price.

Similar Prices:

The 'Inside,' 'FR2' (frontage on two sides), and 'Corner' lot configurations have similar average sale prices.

This suggests that the specific layout and positioning of properties within their neighborhoods may play a role in influencing their sale prices.



8)->LandSlope:
The 'LandSlope' variable describes the slope of the property in dataset:

'Gtl': Represents properties with a gentle slope.

'Mod': Represents properties with a moderate slope.

'Sev': Represents properties with a severe slope.

'Gtl' (gentle slope) and 'Mod' (moderate slope) have the same impact on sale price.

'Sev' (severe slope) has a slightly higher impact on sale price.

This implies that properties with severe slopes may be associated with slightly higher sale prices than properties with gentle or moderate slopes.



9)-> Neighborhood:
Our observation suggests that, in our dataset, the neighborhood labeled 'NoRidge' has the highest impact on house prices.


10)-> Condition1:
Here's a brief description of each value in the 'Condition1' variable:

'Norm': Normal conditions.

'Feedr': Adjacent to a feeder street.

'PosN': Near a positive off-site feature such as a park or greenbelt.

'Artery': Adjacent to an arterial street.

'RRAe': Adjacent to an east-west railroad.

'RRNn': Within 200 feet of a north-south railroad.

'RRAn': Adjacent to a north-south railroad.

'PosA': Adjacent to a positive off-site feature.

'RRNe': Within 200 feet of an east-west railroad.

These values describe different conditions related to the proximity of properties to main roads, railroads or positive off-site features, providing insights into their locations and potential influences on property values.

'PosA' (adjacent to a positive off-site feature),

'PosN' (near a positive off-site feature such as a park),

'RRNn' (within 200 feet of a north-south railroad),

have a high impact on the target variable SalePrice. This suggests that properties with these conditions may be associated with higher sale prices.



11)->Condition2:
Our observation suggests that in the 'Condition2' variable:

'PosA' (adjacent to a positive off-site feature) and

'PosN' (near a positive off-site feature, such as a park)

have a high impact on the target variable SalePrice.



12)->BldgType:
The 'BldgType' variable describes different types of dwellings in the dataset:

'1Fam': Single-family detached homes.

'2fmCon': Two-family conversion (originally built as a one-family dwelling).

'Duplex': Duplexes.

'TwnhsE': Townhouse end units.

'Twnhs': Townhouse inside units.

These values indicate the structural characteristics or types of residential buildings in the dataset.

Our observation suggests that, in our dataset:

'1Fam' (single-family detached homes) and

'TwnhsE' (townhouse end units)

have the highest impact on the target variable, indicating that these types of dwellings may be associated with higher sale prices.
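
The category-level "impact" statements above can be checked numerically by grouping on a categorical column and summarising SalePrice. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset), and `price_by_category` is a hypothetical helper name; on the real data the same call would be applied to `data` before the categories are encoded.

```python
import pandas as pd

def price_by_category(df, column, target="SalePrice"):
    """Return count and median of `target` per category, highest median first."""
    summary = (
        df.groupby(column)[target]
          .agg(count="count", median="median")
          .sort_values("median", ascending=False)
    )
    return summary

# Illustrative toy data only; not values from the notebook's dataset.
toy = pd.DataFrame({
    "BldgType": ["1Fam", "1Fam", "TwnhsE", "Duplex", "Duplex", "1Fam"],
    "SalePrice": [200000, 180000, 190000, 130000, 140000, 220000],
})

summary = price_by_category(toy, "BldgType")
print(summary)
```

Reporting the count next to the median helps flag categories with very few sales, whose ranking is less reliable.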

### Multi-variate Analysis:

In [None]:
continous_feat = data[['LotArea','BsmtFinSF1','TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea','GarageArea','SalePrice']]
sns.set(style = 'ticks')
sns.pairplot(continous_feat, diag_kind = 'kde')

In [None]:
continous_feat.corr()

##### Insights:

High Correlations with SalePrice:
Features with relatively high positive correlations with SalePrice include:

- TotalBsmtSF (0.613581)
- GrLivArea (0.708624)
- GarageArea (0.623431)

These features might have a strong influence on the SalePrice. Consider exploring these relationships further, and they could potentially be important predictors.



Correlation between Features:
TotalBsmtSF and 1stFlrSF show a high correlation of 0.819530. This is not surprising, as the total basement area and the first floor area are likely to be correlated.

GrLivArea and 2ndFlrSF also exhibit a strong correlation of 0.687501. This makes sense since the living area above ground and the second floor area are related.

GarageArea and TotalBsmtSF, as well as GarageArea and 1stFlrSF, show notable correlations. It suggests that the garage area is correlated with both the total basement area and the first floor area.



Low Correlations:
LotArea shows relatively low correlations with the other features. It might not have a strong linear relationship with the other variables.


Potential Multicollinearity:
TotalBsmtSF, 1stFlrSF, and GarageArea all have relatively high correlations with each other. When using these features in a predictive model, multicollinearity might need to be considered.


Negative Correlation:
2ndFlrSF has a weak negative correlation with BsmtFinSF1 (-0.137079). This suggests that houses with more finished basement area tend, slightly, to have less second-floor area.
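
To make the multicollinearity check above repeatable, the strongly correlated feature pairs can be listed directly from the correlation matrix. This is a rough sketch on synthetic data; `correlated_pairs` and the 0.6 threshold are assumptions chosen for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.6):
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips self-pairs
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Synthetic data: 'b' is built from 'a', while 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = correlated_pairs(toy, threshold=0.6)
print(pairs)
```

On the notebook's data the same function could be called on `continous_feat.drop(columns='SalePrice')` to list candidate pairs before modelling.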

#### Data Preprocessing:

In [None]:
# checking null values
data.isnull().sum()

#### Let's check how many columns have null values:

In [None]:
column_with_null = data.columns[data.isnull().any()]
number_of_column_nv = len(column_with_null)
print("Number of columns with null values:", number_of_column_nv)

#### Let's see these 19 columns:

In [None]:
data[column_with_null].head()

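#### Handling null values in continuous data and replacing them with the median value:

In [None]:
col_to_impute = ['LotFrontage','MasVnrArea','GarageYrBlt']

for col in col_to_impute:
    # median of the current column, computed from its non-null values
    median_value = data[col].median()
    # assign the filled column back rather than filling a column selection in place
    data[col] = data[col].fillna(median_value)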

#### Handling null values in categorical data and replacing them with the mode value:

In [None]:
cols_to_impute = ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC',
       'Fence', 'MiscFeature']

for column in cols_to_impute:
    # most frequent category of the current column
    mode_value = data[column].mode().iloc[0]
    # assign the filled column back rather than filling a column selection in place
    data[column] = data[column].fillna(mode_value)

In [None]:
data.isnull().sum()
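
The two imputation steps above (median for continuous columns, mode for categorical ones) can be summarised in a single helper. This is a minimal sketch on a toy frame; `impute_median_mode` is a hypothetical name, and choosing the strategy from the column dtype is an assumption rather than what the notebook does with its explicit column lists.

```python
import numpy as np
import pandas as pd

def impute_median_mode(df):
    """Return a copy with numeric NaNs filled by the median and other NaNs by the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy data with one missing value per column (illustrative only).
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", "Detchd", None, "Attchd"],
})

missing_pct = toy.isnull().mean() * 100   # share of missing values per column
filled = impute_median_mode(toy)
print(missing_pct)
print(filled)
```

Looking at the missing percentage first is useful because a column that is mostly empty gains little real information from being filled with a single value.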

### Encoding:

In [None]:
from sklearn.preprocessing import LabelEncoder
label_Encoder = LabelEncoder()

In [None]:
col_to_encod = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition']


In [None]:
for column in col_to_encod:
    data[column] = label_Encoder.fit_transform(data[column])

In [None]:
data.head(4)
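
Because the loop above reuses one `LabelEncoder`, only the classes of the last encoded column remain available afterwards. If the integer codes need to be mapped back to labels later, one option is to keep a fitted encoder per column. The sketch below assumes a toy frame and a hypothetical `encoders` dictionary; it is one way to do this, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical data (illustrative only).
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LandSlope": ["Gtl", "Mod", "Sev"],
})

encoders = {}
encoded = toy.copy()
for col in ["Street", "LandSlope"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder so codes can be decoded later

# classes_ is sorted, so the position of each label is its integer code.
street_mapping = {str(label): code for code, label in enumerate(encoders["Street"].classes_)}
decoded = [str(label) for label in encoders["Street"].inverse_transform(encoded["Street"])]
print(street_mapping, decoded)
```

Note that label encoding assigns codes in alphabetical order, so the numeric order carries no meaning for nominal categories.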

#### Handling Outliers:

In [None]:
# Create a dictionary to store the results for each column
outlier_info = {}

# Iterate through each column in your dataset
for cols in data_continous.columns:
    # Calculate statistics for the current column
    Q1 = data_continous[cols].quantile(0.25)
    Q3 = data_continous[cols].quantile(0.75)
    IQR = Q3 - Q1

    # Calculate the count of values greater than max_limit and less than min_limit for the current column
    cnt_greater_max = (data_continous[cols] > (Q3 + 1.5 * IQR)).sum()
    cnt_less_min = (data_continous[cols] < (Q1 - 1.5 * IQR)).sum()

    # Calculate the percentage of values greater than max_limit and less than min_limit for the current column
    percent_greater_max = (cnt_greater_max / len(data)) * 100
    percent_less_min = (cnt_less_min / len(data)) * 100

    # Store the results in the dictionary
    outlier_info[cols] = {
        'count_greater_max': cnt_greater_max,
        'percent_greater_max': percent_greater_max,
        'count_less_min': cnt_less_min,
        'percent_less_min': percent_less_min
    }

# Print the results for each column
for column, info in outlier_info.items():
    print(f"Column: {column}")
    print(f"Count Greater Than max_limit: {info['count_greater_max']}")
    print(f"Percent Greater Than max_limit: {info['percent_greater_max']:.2f}%")
    print(f"Count Less Than min_limit: {info['count_less_min']}")
    print(f"Percent Less Than min_limit: {info['percent_less_min']:.2f}%")
    print()

In [None]:
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1

# iterate over the column names (all but the last column), one boxplot per column
for cols in data_continous.columns[:-1]:
    if plotnumber<=15 :
        ax = plt.subplot(4,4,plotnumber)
        sns.boxplot(x=data_continous[cols])
        plt.xlabel(cols,fontsize=10)
        plt.xticks(rotation=90,fontsize=7)

    plotnumber+=1

plt.show()


In [None]:
from scipy import stats

In [None]:
data_continous.columns

In [None]:
data.tail(4)

In [None]:
continous_data = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
       'ScreenPorch']

In [None]:
Q1 = data[continous_data].quantile(0.25)
Q3 = data[continous_data].quantile(0.75)
IQR = Q3 - Q1

min_limit = Q1 - 1.5 * IQR
max_limit = Q3 + 1.5 * IQR

data[continous_data] = np.where (
    (data[continous_data] < min_limit) | (data[continous_data] > max_limit),
    data[continous_data].median(), data[continous_data]
)
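
The IQR rule used above can be followed step by step on a tiny example. The sketch below uses made-up lot areas and a hypothetical helper `replace_outliers_with_median`; it applies the same limits (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) and the same median replacement, one column at a time.

```python
import numpy as np
import pandas as pd

def replace_outliers_with_median(df, columns, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median."""
    out = df.copy()
    for col in columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        is_outlier = (out[col] < low) | (out[col] > high)
        # the median is computed before any value in the column is replaced
        out.loc[is_outlier, col] = out[col].median()
    return out

# Five ordinary lots and one extreme value (illustrative numbers only).
toy = pd.DataFrame({"LotArea": [8000.0, 9000.0, 9500.0, 10000.0, 11000.0, 200000.0]})
cleaned = replace_outliers_with_median(toy, ["LotArea"])
print(cleaned["LotArea"].tolist())
```

Here Q1 = 9125 and Q3 = 10750, so the limits are 6687.5 and 13187.5; only the 200000 value falls outside and is replaced by the median, 9750.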

### Handling outliers in discrete data:

In [None]:
discrete_data = ['MSZoning','Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood',
            'Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType',
            'ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating',
            'HeatingQC','CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish',
            'GarageQual','GarageCond','PavedDrive','PoolQC','Fence','MiscFeature','SaleType','SaleCondition']

In [None]:
plt.figure(figsize=(20, 25), facecolor = 'white')
plotnumber = 1
for column in data_discrete.columns[:-1]:
    plt.subplot(6,3, plotnumber)
    sns.boxplot(x = data[column])
    
    plt.xlabel(column, fontsize = 10)
    plt.ylabel('Counts' , fontsize = 10)
    plotnumber += 1

plt.tight_layout()
plt.show()

In [None]:
Q1 = data[discrete_data].quantile(0.25)
Q3 = data[discrete_data].quantile(0.75)
IQR = Q3 - Q1

min_limit = Q1 - 1.5 * IQR
max_limit = Q3 + 1.5 * IQR

data[discrete_data] = np.where((data[discrete_data]< min_limit) | (data[discrete_data]> max_limit),
                              data[discrete_data].mode().iloc[0], data[discrete_data])

In [None]:
# lets see  whole data

In [None]:
data.drop('Id', axis = 1, inplace = True)

In [None]:
data.head()

In [None]:
for column in data.columns[:-1]:
    sns.boxplot(x =data[column])
    plt.title(f"{column} Boxplot")
    plt.xlabel(column, fontsize = 10)
    plt.ylabel('Limits', fontsize = 10)
    plt.show()

## Feature Engineering:
- Definition:
  Feature engineering involves transforming or creating new features from raw data to enhance machine learning model performance.

-  Simplification:
   Simplify complex relationships in the data, making it easier for models to understand.

- Improved Model Performance:
  Enhance model accuracy and predictive power by providing more relevant and informative input features.


- Addressing Non-Linearity:
  Capture non-linear relationships between features and the target variable, enabling models to learn more complex patterns.

- Handling Missing Data:
  Fill or transform missing values in a meaningful way to prevent information loss during model training.

- Dimensionality Reduction:
  Reduce the number of features, especially in high-dimensional datasets, to avoid overfitting and speed up model training.

- Creating Composite Features:
  Generate new features by combining or interacting existing ones, providing additional information to the model.

- Temporal and Spatial Insights:
  Incorporate time or spatial components to reveal trends, patterns, or seasonality in the data.

- Enhanced Interpretability:
  Make models more interpretable by transforming features into more understandable or representative formats.

- Optimizing Model Resources:
  Save computational resources by excluding irrelevant or redundant features, making models more efficient.
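
As one concrete illustration of the "Addressing Non-Linearity" point, a log transform is a common way to compress a right-skewed variable such as price. This is a generic sketch on made-up numbers; the notebook itself does not apply this transform, so treat it as an optional idea rather than part of the pipeline.

```python
import numpy as np
import pandas as pd

# Made-up prices with one very expensive house (illustrative values only).
prices = pd.Series([100000, 120000, 150000, 200000, 750000], name="SalePrice")

log_prices = np.log1p(prices)  # log(1 + x), also defined at zero

raw_skew = prices.skew()
log_skew = log_prices.skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# np.expm1 inverts log1p, so values on the log scale can be mapped back.
recovered = np.expm1(log_prices)
```

If a model were trained on the log-transformed target, its predictions would need the `np.expm1` step before being reported as prices.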

##### Age of House at Sale time

In [None]:
data['Age_at_sale'] = data['YrSold'] - data['YearBuilt']

In [None]:
# Retain both 'YrSold' and 'YearBuilt' for context and potential further analysis.
data.head(3)

In [None]:
sns.scatterplot(x = data.Age_at_sale, y = data.SalePrice)

- The code below creates a new binary column 'Remodeld' in the DataFrame `data`, where the value is 1 if the house has been remodeled (that is, 'YearRemodAdd' differs from 'YearBuilt'), and 0 otherwise.
- Knowing whether a house has been remodeled gives insight into the property's history and condition, which can affect the sale price through perceived value and desirability.

In [None]:
data['Remodeld'] = (data['YearBuilt'] != data['YearRemodAdd']).astype(int)

In [None]:
sns.scatterplot(x = data.Remodeld, y = data.SalePrice)

##### Total square footage:
- Total square footage adds the first- and second-floor areas into a single measure of above-ground living space, which reflects the property's size and, consequently, its sale price.

In [None]:
data['totalSF'] =data['1stFlrSF'] + data['2ndFlrSF']

In [None]:
data.head(3)

In [None]:
sns.scatterplot(x = data.totalSF, y= data.SalePrice)

##### Total Bathrooms:

In [None]:
data['TotalBath'] = data['FullBath'] + data['HalfBath'] + data['BsmtFullBath'] + data['BsmtHalfBath'] * 0.5

In [None]:
sns.scatterplot(x = data.TotalBath, y = data.SalePrice)

- In the code above, only 'BsmtHalfBath' is multiplied by 0.5, so each basement half bathroom counts as half of a full bathroom. The above-ground 'HalfBath' column is added at full weight; for a consistent count it could be weighted by 0.5 in the same way.
- If 'TotalBath' adequately captures the bathroom-related information for the regression model and there is no specific reason to retain the individual columns, dropping them could simplify the feature set.
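
The bathroom weighting can be worked through on a toy frame. In this sketch both half-bath columns get a weight of 0.5, which is an assumption: the notebook's own code adds 'HalfBath' at full weight and halves only 'BsmtHalfBath'. The `weights` dictionary is hypothetical and only serves the illustration.

```python
import pandas as pd

# Two made-up houses (illustrative values only).
toy = pd.DataFrame({
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# Assumed weights: full baths count 1, half baths count 0.5 (above ground and basement).
weights = {"FullBath": 1.0, "HalfBath": 0.5, "BsmtFullBath": 1.0, "BsmtHalfBath": 0.5}

total_bath_weighted = sum(toy[col] * w for col, w in weights.items())
print(total_bath_weighted.tolist())
```

The first house has 2 + 0.5 + 1 + 0 = 3.5 bathrooms and the second 1 + 0 + 0 + 0.5 = 1.5 under these weights.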

In [None]:
data.head(3)

#### Outdoor living area:

In [None]:
porch_deck_features = ['WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch']
data['Outdoor_LA'] = data[porch_deck_features].sum(axis = 1)


# let's check the correlation of the column created above with SalePrice
cr = data['Outdoor_LA'].corr(data['SalePrice'])
print(cr)

In [None]:
sns.scatterplot(x = data.Outdoor_LA , y= data.SalePrice)

##### Overall quality and condition score:

In [None]:
data['quality_condition_score'] = data['OverallQual'] + data['OverallCond']

In [None]:
# let's check the correlation of quality_condition_score with the target variable
corr = data['quality_condition_score'].corr(data['SalePrice'])
print('Quality_condition_score correlation with target variable ::',corr)

In [None]:
sns.scatterplot(x =data.quality_condition_score, y = data.SalePrice)

In [None]:
cr1 = data['OverallQual'].corr(data['SalePrice'])
print(cr1)

- A correlation coefficient of 0.67 is generally considered a strong positive correlation, indicating a substantial relationship between the two variables

##### Total Basement Area:

In [None]:
bsmt_area_feature = ['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF']
data['Total_bmt_area'] = data[bsmt_area_feature].sum(axis = 1)

In [None]:
sns.scatterplot( x =data.Total_bmt_area, y= data.SalePrice )

### Created_feature:

In [None]:
created_feature = data[['Age_at_sale','totalSF','TotalBath','Outdoor_LA','quality_condition_score','Total_bmt_area','Remodeld']]

In [None]:
created_feature.head()

In [None]:
plt.figure(figsize = (20, 25), facecolor = 'white')
plotnumber = 1
for column in created_feature.columns:
    if plotnumber <= 7:
        plt.subplot(4,2, plotnumber)
        sns.scatterplot(x = created_feature[column], y =data.SalePrice)
        
        plt.title(f'{column} vs SalePrice')
        plt.xlabel(column, fontsize = 10)
        plt.ylabel('SalePrice', fontsize = 10)
        plotnumber += 1
        
plt.tight_layout()
plt.show()

In [None]:
y = data.SalePrice
y

In [None]:
data2  = pd.concat([created_feature , y], axis = 1)

In [None]:
data2.corr()

### Insights:
- Age_at_sale and SalePrice: There is a negative correlation (-0.53), indicating that as the age at sale increases, the sale price tends to decrease.

- TotalSF and SalePrice: There is a positive correlation (0.64), suggesting that as the total square footage increases, the sale price tends to increase.

- TotalBath and SalePrice: Positive correlation (0.62), indicating that houses with more bathrooms tend to have higher sale prices.

- Outdoor_LA and SalePrice: Positive correlation (0.41), suggesting that a larger outdoor living area is associated with higher sale prices.

- Quality_Condition_Score and SalePrice: Strong positive correlation (0.68), indicating that houses with higher quality and condition scores tend to have higher sale prices.

- Total_bmt_area and SalePrice: Positive correlation (0.50), suggesting that a larger basement area is associated with higher sale prices.

- Remodeld and SalePrice: Correlation close to zero (0.02), indicating essentially no linear relationship between the remodeling flag and the sale price.

## Feature selection:

In [None]:
## Checking correlation

plt.figure(figsize=(50, 40))#canvas size
sns.heatmap(data_numerical.corr(), annot=True, cmap="RdYlGn", annot_kws={"size":15})#plotting heat map to check correlation

In [None]:
correlation = data_numerical.corr()[['SalePrice']]
plt.figure(figsize=(10, 10))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title("Correlation with SalePrice - House Price Dataset")
plt.show()

In [None]:
# Correlation feature Selection method

In [None]:
target_corr = data.corr()[['SalePrice']]
plt.figure(figsize= (10, 25))
sns.heatmap(target_corr, annot = True, cmap = plt.cm.Reds)

In [None]:
# show only those columns whose absolute correlation with SalePrice is greater than 0.5
target_corr[abs(target_corr)>0.5].dropna()
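
The correlation filter above can also be written as a small function that returns the qualifying feature names. The following is a sketch on synthetic data; `features_correlated_with_target` is a hypothetical helper, and the toy columns only borrow names from the dataset for readability.

```python
import numpy as np
import pandas as pd

def features_correlated_with_target(df, target="SalePrice", threshold=0.5):
    """Return features whose absolute correlation with `target` exceeds `threshold`."""
    corr = df.corr()[target].drop(target)   # correlation of every feature with the target
    return corr[corr.abs() > threshold]

# Synthetic data: price rises with area, falls with age, and ignores 'Unrelated'.
rng = np.random.default_rng(1)
area = rng.uniform(500, 3000, size=300)
age = rng.uniform(0, 100, size=300)
toy = pd.DataFrame({
    "GrLivArea": area,
    "Age_at_sale": age,
    "Unrelated": rng.normal(size=300),
    "SalePrice": 100 * area - 2500 * age + rng.normal(scale=20000, size=300),
})

selected = features_correlated_with_target(toy)
print(selected)
```

Using the absolute value keeps strongly negative predictors such as age, which a plain `> 0.5` filter would drop.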

In [None]:
X = data.copy()

In [None]:
X.drop('SalePrice', axis =1, inplace = True)

#### 2nd Feature selection Method:

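L1 regularization, also known as Lasso regularization, is a technique in machine learning that penalizes large model weights during training. Its main goal is to encourage the model to use fewer features by driving some feature weights to exactly zero. Note that the code cells in this section pass a RandomForestRegressor to SelectFromModel, so the features are actually selected by tree-based importance (by default, those at or above the mean importance) rather than by L1-penalized coefficients.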

In the context of linear models, the regularization involves adding a penalty term to the original cost function. This penalty term, controlled by a parameter (λ), promotes sparsity by discouraging large weights. During training, the optimization algorithm minimizes the regularized cost function, resulting in some feature weights becoming zero.

Why use L1 regularization:

Feature Selection: L1 regularization naturally selects important features by driving less important ones to have zero weights, making the model more interpretable and efficient.

Multicollinearity Handling: It can handle high correlations between features by selecting one from a group of correlated features and setting others' weights to zero.

Simplifying Models: L1 regularization aids in creating simpler models that generalize better to new data, preventing overfitting and overly complex models.

Benefits:

Automatic Feature Selection: L1 regularization automatically selects relevant features without manual engineering.

Improved Generalization: By promoting sparsity, it helps prevent overfitting and enhances the model's ability to generalize to new, unseen data.

Interpretability: Sparse models are easier to interpret as they focus on a subset of the most relevant features
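
For comparison, an L1-based selection as described above could look like the following. This is a minimal sketch on synthetic data, assuming scikit-learn's `Lasso` wrapped in `SelectFromModel`; the `alpha` value is arbitrary and would need tuning on real data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: only 3 of the 10 features carry signal.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# L1 penalties are scale-sensitive, so the features are standardized first.
X_scaled = StandardScaler().fit_transform(X_toy)

# alpha is the penalty strength (the lambda in the text); 1.0 is an arbitrary choice here.
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X_scaled, y_toy)

kept = selector.get_support()   # boolean mask, True where the weight was not driven to zero
n_kept = int(kept.sum())
print(f"kept {n_kept} of {X_scaled.shape[1]} features")
```

A larger `alpha` zeroes out more weights and keeps fewer features; a smaller one keeps more.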

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
rnd = RandomForestRegressor(n_estimators=10)
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.20, random_state=10)

In [None]:
sel_ = SelectFromModel(RandomForestRegressor(n_estimators=10,random_state=10))

In [None]:
sel_.fit(X_train, y_train)

In [None]:
sel_.get_support()

In [None]:
select_feat = X_train.columns[(sel_.get_support())]

In [None]:
select_feat

In [None]:
len(select_feat)

In [None]:
X = X[['OverallQual', 'YearRemodAdd', 'BsmtFinSF1', 'TotalBsmtSF', '1stFlrSF',
       'GrLivArea','LotArea','2ndFlrSF' ,'FullBath', 'GarageCars', 'GarageArea',
       'totalSF', 'TotalBath', 'Total_bmt_area']]

In [None]:
X.head()

In [None]:
data1 = pd.concat([X,y], axis = 1)

In [None]:
plt.figure(figsize= (20,20))
sns.heatmap(data1.corr(), annot = True,cmap = 'RdYlGn')

### Split Features:

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.25, random_state = 42)

In [None]:
X_train.shape , X_test.shape

In [None]:
y_test.shape

# Model Building


## Decision Tree:

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=14)

In [None]:
from sklearn.tree import DecisionTreeRegressor #importing decision tree from sklearn.tree

dt=DecisionTreeRegressor() # object creation for decision tree

dt.fit(X_train,y_train) # training the model

In [None]:
y_pred_dt = dt.predict(X_test) # prediction

In [None]:
# Checking Accuracy score
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error,mean_absolute_percentage_error
import math

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_dt):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_dt):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_dt):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_dt):.5f}')
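
The same four metrics are printed for every model in this notebook, so it may help to see what they compute. The sketch below writes them out with NumPy using the standard definitions; `regression_report` is a hypothetical helper, and the RMSE entry is an addition that expresses the error in the same units as the price.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MSE, RMSE, MAE, MAPE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(errors)),
        "mape": np.mean(np.abs(errors) / np.abs(y_true)),  # a fraction, not a percentage
        "r2": 1.0 - ss_res / ss_tot,
    }

report = regression_report([100, 200, 300], [110, 190, 300])
print({name: round(float(value), 4) for name, value in report.items()})
```

Calling the helper once per model and collecting the dictionaries in a list would make the models easier to compare side by side.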

In [None]:
y_train_predict = dt.predict(X_train)
print(f'R2 Score: {r2_score(y_train, y_train_predict):.2f}')

### Hyperparameter tuning for Decision Tree

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params = {
    "criterion":('friedman_mse', 'poisson', 'squared_error', 'absolute_error'), #quality of split
    "splitter":("best", "random"), # searches the features for a split
    "max_depth":(list(range(1, 20))), #depth of tree range from 1 to 19
    "min_samples_split":[2,20],    #the minimum number of samples required to split internal node
    "min_samples_leaf":list(range(1, 20)),#minimum number of samples required to be at a leaf node,we are passing list which is range from 1 to 19
}

tree_reg = DecisionTreeRegressor(random_state=3)  # decision tree regressor with a fixed random state
tree_cv = GridSearchCV(tree_reg, params, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1, cv=3)
# arguments passed to GridSearchCV:
# tree_reg --> the model to tune
# params ----> hyperparameter grid (the dictionary created above)
# scoring ---> performance metric used to rank the candidates
# n_jobs ----> number of jobs to run in parallel; -1 means using all processors
# verbose ---> controls the verbosity: the higher, the more messages
#   >1 : the computation time for each fold and parameter candidate is displayed;
#   >2 : the score is also displayed;
#   >3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.
# cv --------> number of folds

tree_cv.fit(X_train,y_train)  # run the grid search on the training data
best_params = tree_cv.best_params_  # best parameter combination found
print(f"Best parameters: {best_params}")
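
Before launching an exhaustive grid search it is worth estimating how many models will be trained, since the count is the product of the number of options per hyperparameter times the number of folds. The sketch below repeats the shape of the grid above in a separate, hypothetical `params_sketch` dictionary so that it runs on its own.

```python
from math import prod

# Same grid shape as the decision-tree search above.
params_sketch = {
    "criterion": ("friedman_mse", "poisson", "squared_error", "absolute_error"),
    "splitter": ("best", "random"),
    "max_depth": list(range(1, 20)),
    "min_samples_split": [2, 20],
    "min_samples_leaf": list(range(1, 20)),
}

# Number of candidates = product of the option counts: 4 * 2 * 19 * 2 * 19.
n_candidates = prod(len(values) for values in params_sketch.values())
n_folds = 3
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With 5776 candidates and 3 folds this grid leads to 17328 fits, which is why a randomized search with a fixed `n_iter` is often preferred for larger grids.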

In [None]:
# with the grid above: 4 * 2 * 19 * 2 * 19 = 5776 candidates, fitted on 3 folds each, totalling 17328 fits
tree_cv.best_params_  # best parameters found by the search

In [None]:
tree_cv.best_score_  # best (negative MSE) score from cross-validation

In [None]:
dt_h=DecisionTreeRegressor(criterion='friedman_mse',max_depth=10,min_samples_leaf= 4,min_samples_split=20,splitter='random')#passing best parameter to decision tree

In [None]:
dt_h.fit(X_train,y_train)  # training the model with the best parameters

In [None]:
y_pred_dth=dt_h.predict(X_test)#predicting
y_pred_dth

In [None]:
# Checking Accuracy score

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_dth):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_dth):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_dth):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_dth):.5f}')

In [None]:
y_train_predict = dt_h.predict(X_train)
print(f'R2 Score: {r2_score(y_train, y_train_predict):.2f}')

## Random Forest- Ensemble Technique

In [None]:
#model creation
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor(n_estimators=100)
rf.fit(X_train,y_train)

In [None]:
y_pred_rf=rf.predict(X_test)
y_pred_rf

In [None]:
print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_rf):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_rf):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_rf):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_rf):.5f}')

In [None]:
#checking cross validation score
from sklearn.model_selection import cross_val_score

r2_scores = cross_val_score(rf, X_train, y_train, cv=3, scoring='r2')
print(r2_scores)
# Print mean and standard deviation of R-squared scores
print("Random Forest Regressor Cross-Validation R-squared Scores:")
print("Mean R-squared:", np.mean(r2_scores))
print("Standard Deviation of R-squared:", np.std(r2_scores))

### Hyper parameter tuning for Randomforest

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Define the hyperparameter grid
param_dist = {
    'n_estimators': [int(x) for x in np.linspace(start=50, stop=200, num=10)],  # Number of trees in the forest
    'max_features': [1.0, 'sqrt', 'log2'],  # Number of features to consider at each split (1.0 = all features; 'auto' is not accepted by recent scikit-learn versions)
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # Minimum number of samples required to be at a leaf node
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Create RandomizedSearchCV
random_search = RandomizedSearchCV(
    rf, param_distributions=param_dist, n_iter=10, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, random_state=42
)

In [None]:
# Fit the model
random_search.fit(X_train, y_train)

In [None]:
# Print the best parameters
print("Best Hyperparameters:", random_search.best_params_)

In [None]:
rf_reg = RandomForestRegressor(n_estimators= 66, min_samples_split= 2, min_samples_leaf= 2, max_features= 'sqrt', max_depth= None, bootstrap= False)#passing best parameter to randomforest

rf_reg.fit(X_train,y_train)#training

y_rf_reg=rf_reg.predict(X_test)#testing
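
The cell above types the winning settings in by hand (including `bootstrap=False`, which is not part of `param_dist`), so they can drift away from what the search actually found. A less error-prone pattern is to read them from the fitted search object. The sketch below runs a deliberately tiny search on synthetic data; `param_dist_sketch` and the toy arrays are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=0)

# Small, hypothetical search space so the example runs quickly.
param_dist_sketch = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist_sketch,
    n_iter=3,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_toy, y_toy)

# Option 1: the search has already refitted the best model on the data passed to fit().
best_model = search.best_estimator_

# Option 2: build a fresh model from the winning settings instead of typing them by hand.
rebuilt = RandomForestRegressor(random_state=0, **search.best_params_)
rebuilt.fit(X_toy, y_toy)

predictions = best_model.predict(X_toy[:5])
print(search.best_params_)
```

In the notebook, the equivalent would be `random_search.best_estimator_` or `RandomForestRegressor(**random_search.best_params_)`.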

In [None]:
# Checking Accuracy score
import math

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_rf_reg):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_rf_reg):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_rf_reg):.5f}')
print(f'R2 Score: {r2_score(y_test, y_rf_reg):.5f}')

# Gradient Boosting and XGBoost Regressors

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gbm=GradientBoostingRegressor() ## object creation
gbm.fit(X_train,y_train) ## fitting the data
y_gbm=gbm.predict(X_test) ## predicting the price

In [None]:
# Checking Accuracy score

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_gbm):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_gbm):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_gbm):.5f}')
print(f'R2 Score: {r2_score(y_test, y_gbm):.5f}')

In [None]:
!pip3 install xgboost

In [None]:
import xgboost

In [None]:
from xgboost import XGBRegressor#importing the model library

In [None]:
xgb_r=XGBRegressor() ## object creation
xgb_r.fit(X_train,y_train)# fitting the data

In [None]:
y_pred_xgb=xgb_r.predict(X_test) # predicting the sale price

In [None]:
# Checking Accuracy score

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_xgb):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_xgb):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_xgb):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_xgb):.5f}')

## Hyper parameter tuning for XGBoost


In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'gamma': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4, 200],
              'learning_rate': [0.01, 0.03, 0.06, 0.1, 0.15, 0.2, 0.25, 0.300000012, 0.4, 0.5, 0.6, 0.7],
              'max_depth': [5,6,7,8,9,10,11,12,13,14],
              'n_estimators': [50,65,80,100,115,130,150],
              'reg_alpha': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200],
              'reg_lambda': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200]}

XGB=XGBRegressor(random_state=42,verbosity=0)
rcv= RandomizedSearchCV(estimator=XGB, scoring='neg_mean_absolute_error',param_distributions=param_grid, n_iter=100, cv=3,
                               verbose=2, random_state=42, n_jobs=-1)

#estimator--->the model to tune (here the XGBRegressor)
#scoring--->performance metric used to rank the candidates
#param_distributions-->hyperparameters (the dictionary created above)
#n_iter--->Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution. default=10
#cv------> number of folds
#verbose--->Controls the verbosity: the higher, the more messages.
#n_jobs---->Number of jobs to run in parallel, -1 means using all processors.

rcv.fit(X_train, y_train)  # run the randomized search on the training data
cv_best_params = rcv.best_params_  # best parameter combination found
print(f"Best parameters: {cv_best_params}")

In [None]:
XGB2=XGBRegressor(reg_lambda= 12.8, reg_alpha= 0.1, n_estimators=150, max_depth=5, learning_rate=0.1, gamma=0.8)
XGB2.fit(X_train, y_train)#training
y_predict_xgb2=XGB2.predict(X_test) # testing

In [None]:
# Checking Accuracy score

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_predict_xgb2):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_predict_xgb2):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_predict_xgb2):.5f}')
print(f'R2 Score: {r2_score(y_test, y_predict_xgb2):.5f}')

### Model  Creation:

### Decision tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
d_tr = DecisionTreeRegressor()
d_tr.fit(X_train, y_train)

In [None]:
y_hat = d_tr.predict(X_test)

In [None]:
d_r2 = r2_score(y_test, y_hat)
print('Decision tree R2_score :', d_r2)

In [None]:
### Training score of Decision Tree:
y_pred_train = d_tr.predict(X_train)

train_r2_score = r2_score(y_train, y_pred_train)
print('Training score :',train_r2_score)

### Hyper parameter tuning 

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [None]:
param = {
    'criterion':('friedman_mse','poisson', 'squared_error', 'absolute_error'),
    'splitter':('best','random'),
    'max_depth':(list(range(1, 20))),
    'min_samples_split':[2,3,5,19,20],
    'min_samples_leaf':(list(range(1, 20)))
}

In [None]:
d_tree = DecisionTreeRegressor(random_state=42)
tree_gv = GridSearchCV(d_tree,param, scoring = 'neg_mean_squared_error',n_jobs = -1 ,cv = 5, verbose = 2)

In [None]:
tree_gv.fit(X_train, y_train)

In [None]:
print("Best parameter :",tree_gv.best_params_)

In [None]:
print("Best estimators :",tree_gv.best_estimator_)

In [None]:
de_tr = DecisionTreeRegressor(criterion= 'friedman_mse', max_depth= 8, min_samples_leaf = 3, min_samples_split= 20, splitter= 'best')

In [None]:
de_tr.fit(X_train, y_train)

In [None]:
y_hyp_t= de_tr.predict(X_test)

In [None]:
hyper_r2_dtree =r2_score(y_test, y_hyp_t)
print("r2_score of Decision tree after hyper parameter tuning :",hyper_r2_dtree)

### RandomForest Regressor:

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rfr = RandomForestRegressor(n_estimators=400)
rfr.fit(X_train, y_train)

In [None]:
y_pred= rfr.predict(X_test)

In [None]:
rn_r2 = r2_score(y_test, y_pred)
print("Random Forest Regressor R2_score :",rn_r2)

In [None]:
### Training Score:
y_rfr_train = rfr.predict(X_train)  # predict with the random forest, not the decision tree

train_r2_score = r2_score(y_train, y_rfr_train)
print('Training score of Random Forest Regressor:',train_r2_score)

### Hyper parameter tuning for randomForestRegressor:


In [None]:
rfr

In [None]:
param_grid = {
    'n_estimators':[int(x) for x in np.linspace(100, 2100, num= 13)],
    'max_features':['sqrt','log2',None],  # 'auto' was removed in scikit-learn 1.3; for regressors it behaved like None
    'max_depth':[None]+[int(x) for x in np.linspace(10, 110, num= 11)],
    'min_samples_split':[2,5,10],
    'min_samples_leaf':[1,2,4],
    'bootstrap' : [True, False]
}

In [None]:
Rf = RandomForestRegressor(n_estimators=400)
rf_cv = RandomizedSearchCV(Rf, param_distributions = param_grid,scoring ='neg_mean_squared_error' ,n_iter = 100, cv =5,n_jobs = -1 ,verbose = 2)

In [None]:
rf_cv.fit(X_train, y_train)

In [None]:
print('Best parameter from random forest regressor :',rf_cv.best_params_)

In [None]:
print('Best estimator of randomforest regressor :',rf_cv.best_estimator_)

In [None]:
rand_reg  = RandomForestRegressor(bootstrap=False, max_features='log2', min_samples_leaf=2,
                      min_samples_split=10, n_estimators=1933)

In [None]:
rand_reg.fit(X_train, y_train)

In [None]:
y_hyp_rf = rand_reg.predict(X_test)

In [None]:
rf_hyp_r2 = r2_score(y_test, y_hyp_rf)
rf_hyp_r2

### XGBRegressor:

In [None]:
from xgboost import XGBRegressor

In [None]:
xgb = XGBRegressor()

In [None]:
xgb.fit(X_train, y_train)

In [None]:
y_hat_xgb = xgb.predict(X_test)

In [None]:
xgb_r2 = r2_score(y_test, y_hat_xgb)

In [None]:
print("XGBRegressor R2_score :",xgb_r2)

In [None]:
### Training score:
y_xgb_train = xgb.predict(X_train)  # predict with the XGBoost model, not the decision tree

train_r2_score = r2_score(y_train, y_xgb_train)
print('Training score  of XGBRegressor:',train_r2_score)
    

## Hyper parameter Tuning of XGBRegressor:

In [None]:
param_grid = {'gamma': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4, 200],
              'learning_rate': [0.01, 0.03, 0.06, 0.1, 0.15, 0.2, 0.25, 0.300000012, 0.4, 0.5, 0.6, 0.7],
              'max_depth': [5,6,7,8,9,10,11,12,13,14],
              'n_estimators': [50,65,80,100,115,130,150],
              'reg_alpha': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200],
              'reg_lambda': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200]}

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gbr = GradientBoostingRegressor(n_estimators=100)

In [None]:
gbr.fit(X_train, y_train)

In [None]:
y_gbr = gbr.predict(X_test)

In [None]:
gbr_r2 = r2_score(y_test, y_gbr)

In [None]:
print("Gradient Boosting Regressor r2_score :",gbr_r2)

# AdaBoostRegressor

In [None]:
from sklearn.ensemble import AdaBoostRegressor

In [None]:
ad = AdaBoostRegressor(n_estimators=400)

In [None]:
ad.fit(X_train, y_train)

In [None]:
y_ad = ad.predict(X_test)

In [None]:
r2_score(y_test, y_ad)

### Hyper parameter tuning for AdaBoostRegressor:

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200, 300],  # Number of weak learners (trees)
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],  # Shrinkage parameter
    'loss': ['linear', 'square', 'exponential']  # Loss function to minimize
}

In [None]:
add = AdaBoostRegressor(n_estimators=100)
ad_reg = GridSearchCV(add, param_grid=param_grid, scoring= 'r2', n_jobs = -1, cv =3, verbose = 2)

In [None]:
ad_reg.fit(X_train, y_train)

In [None]:
print('Best parameter of AdaBoostRegressor :',ad_reg.best_params_)

In [None]:
print('Best estimator of AdaBoostregressor :',ad_reg.best_estimator_)

In [None]:
adda = AdaBoostRegressor(learning_rate=1, loss='exponential')
adda.fit(X_train, y_train)

In [None]:
y_hyp_add = adda.predict(X_test)
y_hyp_add

In [None]:
r2_score(y_test, y_hyp_add)

## Model Creation with Engineered Features

### Decision Tree2

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X2,y,random_state=14)

In [None]:
X_train.shape

In [None]:
from sklearn.tree import DecisionTreeRegressor #importing decision tree from sklearn.tree

dt=DecisionTreeRegressor() # object creation for decision tree

dt.fit(X_train,y_train) # training the model

In [None]:
y_pred_dt = dt.predict(X_test) # prediction

In [None]:
# Evaluate regression metrics on the test set
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error,mean_absolute_percentage_error

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_dt):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_dt):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_dt):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_dt):.5f}')

In [None]:
y_train_predict = dt.predict(X_train)
print(f'R2 Score: {r2_score(y_train, y_train_predict):.2f}')

### Hyper parameter tuning for Decision tree2


In [None]:
params = {
    "criterion":('friedman_mse', 'poisson', 'squared_error', 'absolute_error'), # function that measures the quality of a split
    "splitter":("best", "random"), # strategy used to choose the split at each node
    "max_depth":(list(range(1, 20))), # maximum depth of the tree, from 1 to 19
    "min_samples_split":[2,20],    # minimum number of samples required to split an internal node
    "min_samples_leaf":list(range(1, 20)), # minimum number of samples required at a leaf node, from 1 to 19
}

tree_reg = DecisionTreeRegressor(random_state=3)  # decision tree with random state 3
tree_cv = GridSearchCV(tree_reg, params, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1, cv=3)
# Arguments passed to GridSearchCV:
# tree_reg ---> model to tune
# params ---> hyperparameters (the dictionary created above)
# scoring ---> performance metric used to rank the candidates
# n_jobs ---> number of jobs to run in parallel; -1 means using all processors
# verbose ---> controls the verbosity: the higher, the more messages
#   >1 : the computation time for each fold and parameter candidate is displayed;
#   >2 : the score is also displayed;
#   >3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.
# cv ---> number of folds

tree_cv.fit(X_train,y_train)  # run the grid search on the training data
best_params = tree_cv.best_params_  # best parameter combination found
print(f"Best parameters: {best_params}")  # print the best parameters

In [None]:
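# A previous run logged: fitting 3 folds for each of 4332 candidates, totalling 12996 fits
# (that count matches three criteria; the four-criterion grid above has 5776 candidates, i.e. 17328 fits)
tree_cv.best_params_  # best parameters from the grid search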
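An exhaustive grid search fits one model per parameter combination per fold, so its cost is the product of the list lengths times `cv`. Below is a small sketch of that arithmetic for the grid as written above (a count logged by a run with a different grid would differ). `demo_params` and `demo_cv` are names introduced here for illustration only.

```python
from math import prod

# Same grid as `params` above, reproduced so the snippet runs on its own
demo_params = {
    "criterion": ("friedman_mse", "poisson", "squared_error", "absolute_error"),
    "splitter": ("best", "random"),
    "max_depth": list(range(1, 20)),
    "min_samples_split": [2, 20],
    "min_samples_leaf": list(range(1, 20)),
}
demo_cv = 3  # number of folds passed to GridSearchCV

# One candidate per combination of values: 4 * 2 * 19 * 2 * 19
n_candidates = prod(len(values) for values in demo_params.values())
total_fits = n_candidates * demo_cv
print(n_candidates, total_fits)  # 5776 17328
```

Because the count grows multiplicatively, trimming even one list (or switching to `RandomizedSearchCV` with a fixed `n_iter`) can shorten the search considerably.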

In [None]:
tree_cv.best_score_  # best cross-validated score (negated MSE) from the grid search
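With `scoring="neg_mean_squared_error"`, `best_score_` is the negated mean squared error, so it is negative and in squared price units. A minimal sketch of converting it to an RMSE in the target's own units follows. The helper `neg_mse_to_rmse` and the example score are hypothetical, not values produced by this notebook.

```python
import math


def neg_mse_to_rmse(neg_mse):
    """Convert a scikit-learn 'neg_mean_squared_error' score to RMSE."""
    return math.sqrt(-neg_mse)


# Hypothetical score for illustration; in the notebook this would be tree_cv.best_score_
example_score = -1.6e9
print(f"RMSE: {neg_mse_to_rmse(example_score):,.0f}")
```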

In [None]:
dt_h=DecisionTreeRegressor(criterion='poisson',max_depth=12,min_samples_leaf= 1,min_samples_split=2,splitter='best')#passing best parameter to decision tree

In [None]:
dt_h.fit(X_train,y_train)  # training the model with the best parameters

In [None]:
y_pred_dth=dt_h.predict(X_test)#predicting
y_pred_dth

In [None]:
# Evaluate regression metrics on the test set

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_dth):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_dth):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_dth):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_dth):.5f}')

In [None]:
y_train_predict = dt_h.predict(X_train)
print(f'R2 Score: {r2_score(y_train, y_train_predict):.5f}')

## Random Forest- Ensemble Technique2

In [None]:
#model creation
from sklearn.ensemble import RandomForestRegressor
rf=RandomForestRegressor(n_estimators=100)
rf.fit(X_train,y_train)

In [None]:
y_pred_rf=rf.predict(X_test)
y_pred_rf

In [None]:
print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_rf):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_rf):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_rf):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_rf):.5f}')

In [None]:
#checking cross validation score
from sklearn.model_selection import cross_val_score

r2_scores = cross_val_score(rf, X_train, y_train, cv=3, scoring='r2')
print(r2_scores)
# Print mean and standard deviation of R-squared scores
print("Random Forest Regressor Cross-Validation R-squared Scores:")
print("Mean R-squared:", np.mean(r2_scores))
print("Standard Deviation of R-squared:", np.std(r2_scores))

### Hyper parameter tuning for Randomforest2

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Define the hyperparameter grid
param_dist = {
    'n_estimators': [int(x) for x in np.linspace(start=50, stop=200, num=10)],  # Number of trees in the forest
    'max_features': [1.0, 'sqrt', 'log2'],  # Number of features to consider at each split ('auto' was removed in scikit-learn 1.3; 1.0 is its equivalent for regressors)
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # Minimum number of samples required to be at a leaf node
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Create RandomizedSearchCV
random_search = RandomizedSearchCV(
    rf, param_distributions=param_dist, n_iter=10, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, random_state=42
)

In [None]:
# Fit the model
random_search.fit(X_train, y_train)

In [None]:
# Print the best parameters
print("Best Hyperparameters:", random_search.best_params_)
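Rather than retyping the printed parameters by hand, the fitted search object can supply the tuned model directly. The sketch below shows two options on synthetic stand-in data. `demo_dist`, `demo_search`, and the small grid are assumptions chosen to keep the example fast, not the settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

demo_dist = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
demo_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=demo_dist,
    n_iter=5,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# Option 1: best_estimator_ is already refit on all of the data passed to fit()
best_model = demo_search.best_estimator_

# Option 2: build a fresh model by unpacking the winning parameters
rebuilt = RandomForestRegressor(random_state=0, **demo_search.best_params_)
rebuilt.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(best_model.predict(X_demo[:3]))
```

Either option avoids transcription mistakes, such as passing a parameter value that was never part of the searched grid.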

In [None]:
rf_reg = RandomForestRegressor(n_estimators=50, min_samples_split=10, min_samples_leaf=2,
                               max_features=1.0,  # 'auto' was removed in scikit-learn 1.3; 1.0 is its equivalent for regressors
                               max_depth=30, bootstrap=False)  # passing the best parameters to the random forest

rf_reg.fit(X_train,y_train)#training

y_rf_reg=rf_reg.predict(X_test)#testing

In [None]:
# Evaluate regression metrics on the test set

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_rf_reg):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_rf_reg):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_rf_reg):.5f}')
print(f'R2 Score: {r2_score(y_test, y_rf_reg):.5f}')

# Gradient Boosting and XGB Regressor

In [None]:
gbm=GradientBoostingRegressor() ## object creation
gbm.fit(X_train,y_train) ## fitting the data
y_gbm=gbm.predict(X_test) ## predicting the price

In [None]:
# Evaluate regression metrics on the test set

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_gbm):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_gbm):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_gbm):.5f}')
print(f'R2 Score: {r2_score(y_test, y_gbm):.5f}')

In [None]:
xgb_r=XGBRegressor() ## object creation
xgb_r.fit(X_train,y_train)# fitting the data

In [None]:
y_pred_xgb=xgb_r.predict(X_test) # predicting the house price

In [None]:
# Evaluate regression metrics on the test set

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_pred_xgb):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_pred_xgb):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred_xgb):.5f}')
print(f'R2 Score: {r2_score(y_test, y_pred_xgb):.5f}')

## Hyper parameter tuning for XGBoost2

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'gamma': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4, 200],
              'learning_rate': [0.01, 0.03, 0.06, 0.1, 0.15, 0.2, 0.25, 0.300000012, 0.4, 0.5, 0.6, 0.7],
              'max_depth': [5,6,7,8,9,10,11,12,13,14],
              'n_estimators': [50,65,80,100,115,130,150],
              'reg_alpha': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200],
              'reg_lambda': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200]}

XGB=XGBRegressor(random_state=42,verbosity=0)  # `silent` dropped: it is deprecated in favour of `verbosity`
rcv= RandomizedSearchCV(estimator=XGB, scoring='neg_mean_absolute_error',param_distributions=param_grid, n_iter=100, cv=3,
                               verbose=2, random_state=42, n_jobs=-1)

# estimator ---> the model to tune
# scoring ---> performance metric used to rank the candidates
# param_distributions ---> hyperparameters (the dictionary created above)
# n_iter ---> number of parameter settings that are sampled; trades off runtime against quality of the solution (default=10)
# cv ---> number of folds
# verbose ---> controls the verbosity: the higher, the more messages
# n_jobs ---> number of jobs to run in parallel; -1 means using all processors

rcv.fit(X_train, y_train)  # run the randomized search on the training data
cv_best_params = rcv.best_params_  # best parameter combination found
print(f"Best parameters: {cv_best_params}")  # print the best parameters

In [None]:
XGB2=XGBRegressor(reg_lambda= 0, reg_alpha= 200, n_estimators=100, max_depth=6, learning_rate=0.2, gamma=3.2)
XGB2.fit(X_train, y_train)#training
y_predict_xgb2=XGB2.predict(X_test) # testing

In [None]:
# Evaluate regression metrics on the test set

print(f'Test Mean Squared Error: {mean_squared_error(y_test,y_predict_xgb2):.5f}')
print(f'Test Mean Absolute Error: {mean_absolute_error(y_test, y_predict_xgb2):.5f}')
print(f'Test Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_predict_xgb2):.5f}')
print(f'R2 Score: {r2_score(y_test, y_predict_xgb2):.5f}')

# Model Performance Summary


In [None]:
# Test-set metrics for each model (hard-coded values)
# Named `results_data` so the `data` DataFrame loaded earlier is not overwritten
results_data = {
    'Model': ['DT', 'DT_H','RF','RF_H','XGB', 'XGBoost','XGBOOST_H'],
    'MAPE': [0.151, 0.139, 0.095, 0.092, 0.085, 0.091,0.087],
    'R-squared': [0.781, 0.749, 0.883, 0.878, 0.889, 0.874, 0.882]
}

compiled_results = pd.DataFrame(results_data)
compiled_results.set_index('Model', inplace=True)
print(compiled_results)
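A summary table like this can also be computed from the stored predictions instead of being typed in, which keeps it in sync when a model is rerun. Below is a minimal sketch, assuming a hypothetical helper `summarize` and toy values. In the notebook one would pass `y_test` and a mapping such as `{"DT": y_pred_dt, "RF": y_pred_rf, "XGBoost": y_pred_xgb}`.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error, r2_score


def summarize(predictions, y_true):
    """Build a model-by-metric table from a {name: predictions} mapping."""
    rows = {
        name: {
            "MAPE": mean_absolute_percentage_error(y_true, y_pred),
            "R-squared": r2_score(y_true, y_pred),
        }
        for name, y_pred in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index").rename_axis("Model")


# Toy values for illustration only (not results from this notebook)
toy_true = [100.0, 200.0, 300.0]
toy_preds = {"A": [110.0, 190.0, 300.0], "B": [100.0, 200.0, 300.0]}
summary = summarize(toy_preds, toy_true)
print(summary.round(3))
```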

In [None]:
# Extract the metric columns ('Model' is the index, so it is not included)
metrics = compiled_results.columns

# Number of subplots
num_plots = len(metrics)

# Create a 1x2 grid of subplots, one per metric
fig, axes = plt.subplots(1, 2, figsize=(10, 8))

# Flatten the axes array for iteration
axes = axes.flatten()

# Loop through each metric and create a bar plot on each axis
for i, metric in enumerate(metrics):
    sns.barplot(x=compiled_results.index, y=compiled_results[metric], ax=axes[i])
    axes[i].set_title(f'{metric} Comparison',fontsize=15)
    axes[i].set_ylabel(metric,fontsize=15)
    axes[i].set_xlabel('Model',fontsize=15)
    axes[i].tick_params(axis='x', rotation=45, labelsize=12)
    axes[i].tick_params(axis='y',labelsize=15)

# Adjust layout
plt.tight_layout()
plt.show()