# *Problem Statement*
A House Price Prediction Model in machine learning is a predictive model that estimates the market price of a property (house) based on various input features or attributes of the property. These features typically include both structural and environmental factors that could affect a property's value, such as its size, location, number of bedrooms, number of bathrooms, age of the house, proximity to amenities, neighborhood crime rate, and other socio-economic indicators.

# START <BR>
**ProjectTeamID:- PTID-CDS-NOV-24-2192**<br>
**ProjectID:- PRCP-1020-HousePricePred**<br>


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
import seaborn as sns
from ydata_profiling import ProfileReport
import warnings
import math
warnings.filterwarnings("ignore")


def set_on():
    pd.set_option('display.max_rows', None)  
    pd.set_option('display.max_columns', None)  

def set_off():
    pd.reset_option('display.max_rows')  
    pd.reset_option('display.max_columns')

In [5]:
df_housePrice=pd.read_csv("Data/HousePrice.csv")

df_housePrice.drop(columns="Id",axis=1,inplace=True)
tg='SalePrice'

In [8]:
df_housePrice.shape

(1460, 80)

In [10]:
df_housePrice.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

# Identifying the columns with null values and addressing them appropriately.

In [12]:
df_housePrice.isnull().sum()
null_col=df_housePrice.columns[df_housePrice.isnull().any()]
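
The share of missing values per column, and the number of columns above a given threshold, can also be computed without looping over columns. A minimal sketch on a toy frame; the `toy` DataFrame, its column names and the thresholds are illustrative assumptions, not the house-price data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],            # 0% missing
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = toy.isnull().mean() * 100

# Number of columns whose missing share exceeds each threshold
counts = {t: int((missing_pct > t).sum()) for t in (70, 50, 20)}

print(missing_pct.to_dict())
print(counts)
```

The same `missing_pct` Series can be filtered directly to list the columns to drop.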

In [14]:
def count_nullper(prg):
    high_nullcol=[]
    for i in df_housePrice[null_col]:
        if i=="SalePrice":
            continue
        if(df_housePrice[i].isnull().sum()/len(df_housePrice) * 100)> prg:
            high_nullcol.append(i)
    return len(high_nullcol)

for i in range(70,10,-10):
    print(f'<{count_nullper(i)}> columns have more than {i}% missing values')

<4> columns have more than 70% missing values
<4> columns have more than 60% missing values
<5> columns have more than 50% missing values
<6> columns have more than 40% missing values
<6> columns have more than 30% missing values
<6> columns have more than 20% missing values


## From here we can choose the missing-value percentage above which a column is dropped

### We can afford to drop the <5> columns where more than 50% of the values are missing

In [16]:
list_of_droped=[]
prg=50
for i in df_housePrice[null_col]:
    if(df_housePrice[i].isnull().sum()/len(df_housePrice) * 100)> prg:
        df_housePrice.drop(columns=i,axis=1,inplace=True)
        list_of_droped.append(i)
Markdown(f"# These are the {len(list_of_droped)} Columns that we dropped Now, \n # RIP 🪦💐💐\n{list_of_droped}")

# These are the 5 Columns that we dropped Now, 
 # RIP 🪦💐💐
['Alley', 'MasVnrType', 'PoolQC', 'Fence', 'MiscFeature']

In [18]:
del list_of_droped

In [20]:
df_housePrice.shape

(1460, 75)

## Extract Numerical and Categorical Null columns

In [22]:
null_col=df_housePrice.columns[df_housePrice.isnull().any()]

null_Numerical=df_housePrice[null_col].select_dtypes(include='number')
null_Categorical=df_housePrice[null_col].select_dtypes(include='O')

## What are the unique values and the mode of each categorical column?

In [None]:
for i in null_Categorical.columns:
    modev=df_housePrice[i].mode()[0]
    print(f"The unique values for {i} are:- \n {null_Categorical[i].dropna().unique()} \n: mode= {modev} \n")

## Missing values in categorical columns are addressed by imputing with their respective mode

In [None]:
for i in null_Categorical.columns:
    modev=df_housePrice[i].mode()[0]
    # Assign back instead of a chained inplace fillna, which may not update the DataFrame in recent pandas
    df_housePrice[i] = df_housePrice[i].fillna(modev)
    print(f'Null entries in {i} filled with its mode: "{modev}"')

In [None]:
df_housePrice[null_Categorical.columns].isnull().sum()

In [None]:
del null_Categorical
del null_col
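
Mode imputation can also be written without a loop. A minimal sketch on a toy frame; the values are illustrative assumptions, and only the column names are borrowed from the dataset:

```python
import numpy as np
import pandas as pd

# Toy categorical frame with one gap per column (hypothetical values)
toy = pd.DataFrame({
    "GarageType": ["Attchd", "Attchd", np.nan, "Detchd"],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
})

# mode() returns one row per tied mode; iloc[0] takes the first mode of every column
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name
filled = toy.fillna(modes)

print(filled)
```

Because `fillna` returns a new frame, the result has to be assigned back to the original variable if it should replace it.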

## Handling numerical null values

In [None]:
null_Numerical.describe().T

## The values in the "GarageYrBlt" column are years with a wide spread over time. Therefore, the null values will be populated with the preceding valid entries (forward fill).

In [None]:
# Assign back instead of a chained inplace ffill, which may not update the DataFrame in recent pandas
df_housePrice["GarageYrBlt"] = df_housePrice["GarageYrBlt"].ffill()
null_Numerical.drop(columns="GarageYrBlt",inplace=True)
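
Forward fill copies the last valid value downward. A small sketch with made-up years shows the behaviour, including the edge case of a leading gap:

```python
import numpy as np
import pandas as pd

# Hypothetical year values with gaps
years = pd.Series([2003.0, np.nan, 1998.0, np.nan, np.nan])
filled = years.ffill()
print(filled.tolist())

# A leading NaN has no preceding value, so it stays NaN after ffill
leading = pd.Series([np.nan, 2001.0]).ffill()
print(leading.tolist())
```

If the first row of a column can be null, a follow-up fill (for example with the median) would still be needed.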

In [None]:
for i in null_Numerical.columns:
    print("*"*25+"_"+i+"_"+"*"*25)
    print(f'skewness is  {df_housePrice[i].skew():.2f}')
    print(f'kurtosis is {df_housePrice[i].kurtosis():.2f}')
print("*"*65)


A kurtosis value significantly greater than that of a normal distribution suggests that there are more extreme values (outliers) in the data, leading to a sharper peak and fatter tails. This means that the data may have a higher likelihood of producing values far from the mean. Note that pandas `kurtosis()` reports excess kurtosis, for which a normal distribution scores 0 (equivalent to 3 on the conventional scale).
It can be clearly observed that <br><br> __[" LotFrontage : 17.45", " MasVnrArea : 10.08 " ]__ <br> are both far above this reference.
The distribution of this data is positively skewed. This means that the majority of the data points are concentrated on the left side of the distribution, with a longer tail extending to the right.<br>
## __Therefore, it can be concluded that the median serves as a more suitable metric for imputation.__
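
Why the median is preferred under positive skew can be seen on a tiny example. This is a sketch with made-up numbers, not values from the dataset:

```python
import numpy as np
import pandas as pd

# One extreme value drags the mean far to the right (hypothetical data)
s = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0, np.nan])

mean = s.mean()      # pulled toward the outlier
median = s.median()  # stays with the bulk of the data
skew = s.skew()      # positive for a long right tail

# Median imputation of the missing entry
filled = s.fillna(median)

print(mean, median, skew)
print(filled.tolist())
```

Filling with the mean would place the imputed value at 21.6, far from every typical observation, while the median keeps it at 2.0.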

In [None]:
for i in null_Numerical.columns:
    df_housePrice[i] = df_housePrice[i].fillna(df_housePrice[i].median())

In [None]:
del null_Numerical

In [None]:
# set_on()
df_housePrice.isnull().sum()

In [None]:
# set_off()

In [None]:
df_housePrice.duplicated().sum()

# Next_Day<br>
>> Exploratory Data Analysis<br> 
>> Distribution of Numerical Variables<br>
>> Distribution of Categorical Features<br>
>> Distribution of Numerical Features with Sales Price<br>
>> Correlation between Variables<br>


# Exploratory Data Analysis

In [None]:
df_next=df_housePrice

## Splitting columns into Numerical and Categorical 

In [None]:
num_col=df_next.select_dtypes(include=['number'])
cat_col=df_next.select_dtypes(include='O')

In [None]:
Markdown(f"# Categorical have {len(cat_col.columns)} \n # Numerical have {len(num_col.columns)}")

In [None]:
num_col.describe().T

## Some columns are exhibiting significantly low standard deviation. This may indicate that there are categorical entries represented in numerical form. Therefore, we should address this issue.

In [None]:
LowStd=[]
threshold=20
for i in num_col.columns:
    if threshold>num_col[i].std():
        LowStd.append(i)
for i in LowStd:
    print(f"The unique values for {i} are:- \n {np.sort(num_col[i].unique())} with mode {num_col[i].mode()[0]} \n")
Markdown(f"## We have {len(LowStd)} Columns of Ordinal category in Numerical format" )

# The "YrSold" column is a temporal variable containing year values. Therefore, encoding it as categorical data is not advisable.

In [None]:
LowStd.remove("YrSold")

In [None]:
cat_col=pd.concat([cat_col, num_col[LowStd]],ignore_index=False,axis=1)

num_col.drop(columns=LowStd,inplace=True)
del LowStd
del threshold
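
A low standard deviation is one signal of a category stored as a number. Another heuristic is to count the distinct values per column. The sketch below is an alternative check, not the notebook's method; the toy values and the `max_levels` cutoff are assumptions:

```python
import pandas as pd

# Toy numeric frame (hypothetical values; column names borrowed from the dataset)
toy = pd.DataFrame({
    "OverallQual": [5, 7, 5, 8, 6, 7],                    # few distinct levels
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],   # many distinct values
})

max_levels = 5  # hypothetical cutoff for "looks categorical"

# Columns with at most max_levels distinct values are treated as discrete
discrete_cols = [c for c in toy.columns if toy[c].nunique() <= max_levels]
print(discrete_cols)
```

Either heuristic still needs a manual review, as the `YrSold` case shows.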

In [None]:
Markdown(f"## Categorical have {len(cat_col.columns)} \n ## Numerical have {len(num_col.columns)}")

## Numerical Analysis with Distribution

In [None]:
def plotSize(rowSize,columnSize):
    return (math.ceil(columnSize/rowSize),rowSize)

In [None]:
_ ,cold=num_col.shape
plt.figure(figsize=(10,cold)) 

pltSize=plotSize(3,cold)

for i, column in enumerate(num_col):
    plt.subplot(pltSize[0],pltSize[1],i+1)
    sns.histplot(data=num_col, x=column, kde=True,bins=20,color='green')
    
    # plt.axvline(x=num_col[tg].median(), color="green", linestyle='--', linewidth=2)
    plt.axvline(x=num_col[column].median(), color="red", linestyle='--', linewidth=2)

    desc=f'skewness is {df_next[column].skew():.2f} \n kurtosis is {df_next[column].kurtosis():.2f}'
    plt.title(f'Distribution of {column} \n{desc}')
    sns.despine()
plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()


In [None]:
_ ,cold=num_col.shape
plt.figure(figsize=(10,cold)) 

pltSize=plotSize(3,cold)
for i, column in enumerate(num_col):
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.histplot(x=num_col[column], kde=True, stat="density")  # distplot is deprecated in seaborn
    plt.axvline(x=num_col[column].median(), color="red", linestyle='--', linewidth=2)
    desc=f'skewness is {df_housePrice[column].skew():.2f} \n kurtosis is {df_housePrice[column].kurtosis():.2f}'
    plt.title(f'Distribution of {column} \n{desc}')
    sns.despine()

plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()


In [None]:
# sns.violinplot(data=df_housePrice, x="SalePrice", y=column, palette=['blue','green']) 
_ ,cold=num_col.shape

pltSize=plotSize(3,cold)

plt.figure(figsize=(10,cold*2)) 
for i, column in enumerate(num_col):
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.violinplot(data=df_housePrice, x=column) 
    # sns.histplot(data=num_col, x=column, kde=True,bins=20,color='green')
    desc=f'skewness is {df_housePrice[column].skew():.2f} \n kurtosis is {df_housePrice[column].kurtosis():.2f}'
    plt.title(f'Distribution of {column} \n\n{desc}')
    sns.despine()

plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

In [None]:
_ ,cold=num_col.shape
plt.figure(figsize=(10,cold)) 

pltSize=plotSize(3,cold)
 
for i, column in enumerate(num_col):
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.boxplot(data=df_next,x=column)
    # sns.histplot(data=num_col, x=column, kde=True,bins=20,color='green')
    desc=f'skewness is {df_housePrice[column].skew():.2f} \n kurtosis is {df_housePrice[column].kurtosis():.2f}'
    plt.title(f'Distribution of {column} \n\n{desc}')
    sns.despine()

plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()
Markdown(f"## Outliers can be identified as the points that fall beyond the whiskers of the box.")

## Distribution of Categorical Features

In [None]:
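_ ,cold=cat_col.shape
plt.figure(figsize=(10,cold*2)) 
pltSize=plotSize(2,cold)  # grid sized from the column count instead of a hard-coded 26x2
for i, column in enumerate(cat_col):
    plt.subplot(pltSize[0],pltSize[1], i+1)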
    sns.countplot(data=df_next, x=column)
    plt.title(f'Distribution of {column} \n Mode {df_next[column].mode()[0]}')
    sns.despine()

plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

## Numerical Relationship with Target

In [None]:
_ ,cold=num_col.shape

pltSize=plotSize(3,cold)
num_corr=num_col.corrwith(df_next['SalePrice'], axis=0, method='pearson')
plt.figure(figsize=(10,cold)) 
for i, column in enumerate(num_col):
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.regplot(data=df_next, x=column, y=df_next['SalePrice'])
    plt.title(f"{column}\ncorrelation with Target {num_corr[column]:.3f}")
    sns.despine()

plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(num_col.corr(), fmt=".2f", annot=True, cmap="coolwarm", linewidths=0.5, linecolor="gray")

plt.title("Heatmap", fontsize=8)

# Show the heatmap
plt.show()
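
A heatmap of many columns is hard to read at a glance, so it can help to rank features by the strength of their Pearson correlation with the target. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy data: one feature rises with price, one falls (hypothetical values)
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "Age": [40, 30, 20, 10],
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Pearson correlation of every feature with the target
corr_with_target = toy.drop(columns="SalePrice").corrwith(toy["SalePrice"])

# Rank by absolute strength so strong negative relationships are not hidden
ranked = corr_with_target.abs().sort_values(ascending=False)

print(corr_with_target)
print(ranked)
```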


# Next_Day Feature Engineering

## Encoding and transforming numerical and categorical values

## Address the object-type columns<br>
## The categorical columns contain both ordinal numerical and object datatypes<br>

## Trim each categorical column to its prominent (frequent) unique values
## Low-frequency unique values are relabelled as "Others_" followed by their column name

In [None]:
# Selecting object data type only
obj_col=cat_col.select_dtypes(include="O")

for i in obj_col.columns:
    lowValue_counts = df_next[i].value_counts(normalize=True)
    lowValue_counts = lowValue_counts[lowValue_counts < 0.05].index
    df_next[i]= cat_col[i].apply(lambda x: 'Others_'+i if x in lowValue_counts else x)
    cat_col[i]= df_next[i]  # keep cat_col in sync so the later encoding sees the trimmed categories
    
for i in obj_col.columns:
    print(f'{df_next[i].value_counts(normalize=True) * 100}\n') 
Markdown(f"### Low-occurrence entries are categorised as 'Others_' followed by their respective column names")
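
The trimming step can be illustrated on a single toy column. This is a sketch with made-up categories; the 5% threshold matches the cell above, and `Series.where` is used here in place of `apply`:

```python
import pandas as pd

# 30 rows: A = 20, B = 9, C = 1 (C is about 3.3% of the rows; hypothetical data)
s = pd.Series(["A"] * 20 + ["B"] * 9 + ["C"], name="Zone")

# Relative frequency of every category
freq = s.value_counts(normalize=True)

# Categories below the 5% threshold
rare = freq[freq < 0.05].index

# Keep frequent values, replace rare ones with "Others_<column name>"
trimmed = s.where(~s.isin(rare), "Others_" + s.name)

print(trimmed.value_counts())
```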

## Encoding of Categorical Data <br>
### By grouping the entries of each categorical column against the target (mean, median, sum, variance and standard deviation)<br>
>> The objective is to determine the appropriate encoding method from among the available options.

In [None]:
_ ,cold=obj_col.shape
plt.figure(figsize=(10,cold*2)) 

pltSize=plotSize(2,cold)

for i, column in enumerate(obj_col):
    category_means = df_next.groupby(df_next[column])[tg].mean()
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.barplot(x=category_means.index, y=category_means.values, palette='viridis')
    plt.title(f'{column} vs {tg} mean', fontsize=16)
    plt.xlabel(column, fontsize=14)
    plt.ylabel(tg, fontsize=14)
    sns.despine()
plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

In [None]:

_ ,cold=obj_col.shape
plt.figure(figsize=(10,cold*2)) 

pltSize=plotSize(2,cold)
for i, column in enumerate(obj_col):
    category_median = df_next.groupby(df_next[column])[tg].median()
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.barplot(x=category_median.index, y=category_median.values, palette='viridis')
    plt.title(f'{column} vs {tg} median', fontsize=16)
    plt.xlabel(column, fontsize=14)
    plt.ylabel(tg, fontsize=14)
    sns.despine()
plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

In [None]:

_ ,cold=obj_col.shape
plt.figure(figsize=(10,cold*2)) 

pltSize=plotSize(2,cold)
for i, column in enumerate(obj_col):
    category_sum = df_next.groupby(df_next[column])[tg].sum()
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.barplot(x=category_sum.index, y=category_sum.values, palette='viridis')
    plt.title(f'{column} vs {tg} sum', fontsize=16)
    plt.xlabel(column, fontsize=14)
    plt.ylabel(tg, fontsize=14)
    sns.despine()
plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

In [None]:

_ ,cold=obj_col.shape
plt.figure(figsize=(10,cold*2)) 

pltSize=plotSize(2,cold)
for i, column in enumerate(obj_col):
    category_var = df_next.groupby(df_next[column])[tg].var()
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.barplot(x=category_var.index, y=category_var.values, palette='viridis')
    plt.title(f'{column} vs {tg} Variance', fontsize=16)
    plt.xlabel(column, fontsize=14)
    plt.ylabel(tg, fontsize=14)
    sns.despine()
plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

In [None]:

_ ,cold=obj_col.shape
plt.figure(figsize=(10,cold*2)) 

pltSize=plotSize(2,cold)
for i, column in enumerate(obj_col):
    category_std = df_next.groupby(df_next[column])[tg].std()
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.barplot(x=category_std.index, y=category_std.values, palette='viridis')
    plt.title(f'{column} vs {tg} SD', fontsize=16)
    plt.xlabel(column, fontsize=14)
    plt.ylabel(tg, fontsize=14)
    sns.despine()
plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()

## I have ultimately chosen to implement Standard Deviation Encoding to ensure that the assigned rank remains stable despite fluctuations in sales prices.

#### __Why standard deviation?__
>> It varies very little when new Sales Price entries arrive, hence the encoded rank stays intact<br>
>> It is derived from the variance, so it helps in learning<br>
>> It is standardised by the square root, hence more stable<br>
>> Because of the CLT and the Law of Large Numbers, this encoding is done before the split<br>
>> And through ranking, we can reduce the risk of data leakage<br>

## Q Why not the others:
>>__Sum__: The ranking may change when Sales Price values are updated, which causes misinterpretation of the encoding<br>
>>__Mean__: The ranking may change when Sales Price values are updated, which causes misinterpretation of the encoding; it is also sensitive to outliers<br>
>>__Median__: The ranking may change when Sales Price values are updated<br>
>>__Variance__: Variance is an essential component of the learning process; however,<br>
>> it also inflates the scale of the values because they are squared.<br>
>>Nevertheless, this effect remains consistent regardless of new data entries.<br>
>>Be careful with null values: a category with a single entry has an undefined standard deviation (NaN), which blocks the rank encoding, hence `.fillna(0)`

## Q How to deal with data leakage
>>The ranking mechanism employed here obscures the true standard deviation among categories; consequently, the risk of data leakage is reduced. The string categorical values also become ordinal categorical values.
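
The rank encoding described above can be sketched on a toy frame. To keep test-set prices out of the mapping entirely, this sketch builds the mapping from training rows only, which is an assumption that departs from the before-the-split encoding used in this notebook. The `train`/`test` frames and the fallback value `0` for unseen categories are hypothetical:

```python
import pandas as pd

# Toy training rows (hypothetical prices)
train = pd.DataFrame({
    "Zone": ["A", "A", "B", "B", "C"],
    "SalePrice": [100.0, 110.0, 100.0, 200.0, 150.0],
})
# Toy test rows; "D" never appears in training
test = pd.DataFrame({"Zone": ["B", "C", "D"]})

# Standard deviation of the target per category; a single-entry category gives NaN, set to 0
std_per_cat = train.groupby("Zone")["SalePrice"].std().fillna(0)

# Dense rank turns the deviations into small ordinal integers
rank_map = std_per_cat.rank(method="dense").astype(int).to_dict()

train_encoded = train["Zone"].map(rank_map)
# Unseen categories map to NaN; 0 is a hypothetical fallback rank
test_encoded = test["Zone"].map(rank_map).fillna(0).astype(int)

print(rank_map)
print(test_encoded.tolist())
```

The dictionary plays the same role as `cat_key_dictonary`: it can be stored and reused to encode new inputs.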

In [None]:
'''
Key dictionary used for input control of these category values when developing the web application
'''
cat_key_dictonary={}

for i in obj_col.columns:
    cat_key_dictonary[i]=None
    
    _encoded= df_next.groupby(cat_col[i])[tg].std().fillna(0)  # Standard deviation of SalePrice per category (NaN for single-entry categories becomes 0)
    _encoded = _encoded.rank(method='dense').astype(int)
    
    cat_key_dictonary[i]=dict(zip(_encoded.index.tolist(),_encoded.values.tolist()))
    
    cat_col[i] = cat_col[i].map(_encoded)


In [None]:
cat_key_dictonary 

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(cat_col.corr(), fmt=".2f", cmap="coolwarm", linewidths=0.5, linecolor="gray")
plt.title("Heatmap", fontsize=8)

# Show the heatmap
plt.show()


In [None]:
_ ,cold=cat_col[obj_col.columns].shape

pltSize=plotSize(2,cold)
num_corr=cat_col[obj_col.columns].corrwith(df_next['SalePrice'], axis=0, method='pearson')
plt.figure(figsize=(10,cold)) 
for i, column in enumerate(cat_col[obj_col.columns]):
    plt.subplot(pltSize[0],pltSize[1], i+1)
    sns.regplot(data=cat_col, x=column, y=df_next['SalePrice'])
    plt.title(f"{column}\ncorrelation with Target {num_corr[column]:.3f}")
    sns.despine()

plt.tight_layout()  # Adjust subplots to fit into figure area.
plt.show()
del num_corr
Markdown(f"## The graph demonstrates that the integrity of Categorical information is preserved within a numerical framework, while also maintaining the dimensional aspect. ")             

## Final Categorical Assignment to main Dataset

In [None]:
set_on()
df_housePrice


In [None]:
df_housePrice[cat_col.columns]=cat_col

In [None]:
del cold
del obj_col
del cat_col
del num_col
del df_next

In [None]:
df_housePrice

# Model Selection and Prediction: the standard evaluation score will be R2 

In [None]:
y = df_housePrice['SalePrice']
x = df_housePrice.drop("SalePrice",axis=1) 

In [None]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=6)

# Outliers

__Outlier Detection:__<br>
>> IQR Method: Detects outliers based on values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].<br>
>> Z-score Method: Detects outliers where the z-score is greater than 3 or less than -3.<br>

__Imputation Strategy:__ <br>

>> Mean: Replaces outliers with the mean of non-outlier values.<br>
>> Median: Replaces outliers with the median of non-outlier values (robust against skewed data).<br>
>> Mode: Replaces outliers with the mode (most frequent value).<br>

### With reference to the normal distribution<br>
>>The __Kurtosis__ of a normal distribution is 3 (0 on the excess-kurtosis scale that pandas `kurtosis()` reports)<br>
>>The __Skewness__ of a normal distribution is 0
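
The two detection rules can disagree, especially on small samples. The following sketch applies both to a made-up series and imputes the IQR outlier with the median of the remaining values:

```python
import pandas as pd

# Six ordinary values and one extreme value (hypothetical data)
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (s < lower) | (s > upper)

# Z-score rule with the usual threshold of 3
z = (s - s.mean()) / s.std()
z_outliers = z.abs() > 3

# Replace IQR outliers with the median of the non-outlier values
cleaned = s.copy()
cleaned[iqr_outliers] = s[~iqr_outliers].median()

print(lower, upper, int(iqr_outliers.sum()), int(z_outliers.sum()))
print(cleaned.tolist())
```

Here the IQR rule flags the value 100, while the z-score rule flags nothing: with only seven points, a single extreme value inflates the standard deviation so much that its own z-score stays below 3.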

In [None]:
x_train.describe().T


In [None]:
# New Data frame for outliers
Outlier_numerical_columns = x_train.copy()  # copy so the experiments do not modify x_train directly

In [None]:

""" 
Iterate column wise for IQR and zscore 
winsorize has its own Iterator 
Default Parameters:
method='winsorize'
impute_strategy='median'
"""
# checkPoint makes sure we do not disturb the approximately normal columns and only deal with high-skew and high-kurtosis columns
def checkPoint(df,i='_'):
    # Skip the column (return False) as soon as one of these checks passes
    if df[i].std() < 2:
        return False
    if 0 < df[i].kurtosis() < 3:
        return False
    if 0 < df[i].skew() < 2:
        return False
    return True
# Experiment with outliers and evaluate on the basis of std, skew and kurtosis
def SKEvaluateData(df):
    t=True
    for i in df:
        '''
            The Kurtosis of a normal distribution is 3
            The Skewness of a normal distribution is 0
            And having a standard deviation of more than 2
        
        '''
        if checkPoint(df,i):
            t=False
            print("*"*25+"_"+i+"_"+"*"*25)
            print(f'skewness is  {df[i].skew():.2f}')
            print(f'kurtosis is {df[i].kurtosis():.2f}')
            print(f'S___T__D is {df[i].std():.2f}')
            print("*"*65)
    if t :
        print("Good!🟢 to Go Everything is Normal")
    else:
        print("""There are certain columns in the data set that contain outlier values. \n 
        It is acceptable to encounter outliers on occasion \n don't be Greedy 😏""")
        
def detect_and_impute_outliers(df, column='_', method='winsorize', impute_strategy='median'):
    
    if method =='winsorize':
        from scipy.stats import mstats
        for col in df:
            if checkPoint(df,col):
                df[col] = mstats.winsorize(df[col], limits=[0.1, 0.1])
        return df
        
    if method == 'IQR':
        # IQR Method
        Q1 = df[column].quantile(0.25)  # First quartile
        Q3 = df[column].quantile(0.75)  # Third quartile
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Detect outliers
        outliers = (df[column] < lower_bound) | (df[column] > upper_bound)
    
    elif method == 'zscore':
        # Z-score Method
        mean = df[column].mean()
        std = df[column].std()
        z_scores = (df[column] - mean) / std
        outliers = (z_scores.abs() > 3)
    
    else:
        raise ValueError("Invalid method. Choose 'IQR' ,'zscore' or 'winsorize'.")

    print(f"Detected outliers:\n{df[outliers]}")

    # Impute outliers
    if impute_strategy == 'mean':
        replacement_value = df[~outliers][column].mean()
    elif impute_strategy == 'median':
        replacement_value = df[~outliers][column].median()
    elif impute_strategy == 'mode':
        replacement_value = df[~outliers][column].mode()[0]
    else:
        raise ValueError("Invalid impute strategy. Choose 'mean', 'median', or 'mode'.")

    df.loc[outliers, column] = replacement_value
    return df


### Experiment Time

## Before Outlier Treatment

In [None]:
SKEvaluateData(Outlier_numerical_columns)

### Outlier in Zscore with median imputation

In [None]:
# for i in Outlier_numerical_columns:
#      if checkPoint(Outlier_numerical_columns,i):
#         Outlier_numerical_columns=detect_and_impute_outliers(Outlier_numerical_columns,i, method='zscore', impute_strategy='median')

### Winsorize by 10% on both sides

In [None]:
# Outlier_numerical_columns=detect_and_impute_outliers(Outlier_numerical_columns, method='winsorize')
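
What `limits=[0.1, 0.1]` does can be seen on ten made-up values: the lowest 10% and the highest 10% of the entries are replaced by their nearest remaining neighbours. A minimal sketch:

```python
import numpy as np
from scipy.stats import mstats

# Ten values, so 10% on each side is exactly one value (hypothetical data)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Clip the lowest and highest 10% to the nearest remaining value
wins = mstats.winsorize(values, limits=[0.1, 0.1])

# winsorize returns a masked array; convert to a plain ndarray for display
result = np.asarray(wins)
print(result.tolist())
```

Unlike imputation with the median, winsorizing keeps the ordering of the data and only pulls the tails inward.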

### Outlier in IQR with median imputation

In [None]:
for i in Outlier_numerical_columns:
    if checkPoint(Outlier_numerical_columns,i):
        Outlier_numerical_columns=detect_and_impute_outliers(Outlier_numerical_columns,i, method='IQR', impute_strategy='median')

## After Outlier Treatment
>> Look for the green "Good to Go" message

In [None]:
SKEvaluateData(Outlier_numerical_columns)

In [None]:
x_train[Outlier_numerical_columns.columns]=Outlier_numerical_columns

In [None]:
Profile=ProfileReport(x_train,title=" Independent_Data_without_Outliers_Analysis",explorative=True)

Profile.to_notebook_iframe()


Profile.to_file("Independent_Data_without_Outliers_Analysis.html")


# Normalization and scaling
>> __When to Normalize?__<br>
When features have different scales (e.g., one is in kilometers, another in meters).<br>
When required by the algorithm (e.g., SVM, k-NN, PCA).<br>
>> __Which Method to Choose?__<br>
Use Min-Max Normalization for bounded data like pixel values; the output lies in [0,1]. <br>
Use Z-score Normalization for unbounded data or models sensitive to feature scaling; most values then fall roughly within [-3,3].<br>
Use Robust Scaling if your dataset contains many outliers.<br>
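
Whichever scaler is chosen, a common pattern is to fit it on the training rows and reuse the fitted scaler on the test rows, while keeping the original row index. A minimal sketch with `RobustScaler` on toy frames; the `area` column and the index values are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy train/test split with a shuffled, non-contiguous index (hypothetical values)
train = pd.DataFrame({"area": [1000.0, 1500.0, 2000.0, 2500.0, 9000.0]},
                     index=[7, 3, 11, 0, 5])
test = pd.DataFrame({"area": [2000.0, 3000.0]}, index=[2, 9])

scaler = RobustScaler()

# Fit on the training data only; keep columns and index so rows stay aligned
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)

# Reuse the fitted median and IQR for the test data
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)

print(train_scaled)
print(test_scaled)
```

The training median (2000) and interquartile range (1000) define the transformation, so the outlier 9000 barely influences how the other rows are scaled.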

## Options for Scaling  Min-Max Scaler , StandardScaler And RobustScaler

In [None]:
NS_numerical_columns = x_train

In [None]:
"""Select from the list with the exact spelling
   [ min_maxScaler , StandardScaler , RobustScaler ]
   Returns the scaled pandas DataFrame and the fitted scaler
"""
def NormalizationMethod(df,method='_'):
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
    scalers = {
        'min_maxScaler': MinMaxScaler,
        'StandardScaler': StandardScaler,
        'RobustScaler': RobustScaler,
    }
    if method not in scalers:
        raise ValueError("Invalid method. Choose min_maxScaler , StandardScaler , RobustScaler.")
    scaler = scalers[method]()
    scaled_data = scaler.fit_transform(df)
    # Keep the original index: a fresh 0..n-1 index would misalign with the
    # shuffled x_train index on assignment and introduce NaN values
    return pd.DataFrame(scaled_data, columns=df.columns, index=df.index), scaler


## Please select any one option; that will be sufficient.

In [None]:
NS_numerical_columns,scaler=NormalizationMethod(NS_numerical_columns,method='RobustScaler') 
# select any one from here:- min_maxScaler , StandardScaler , RobustScaler

In [None]:
x_train[NS_numerical_columns.columns]=NS_numerical_columns
# Apply the scaler fitted on the training data to the test data as well
x_test=pd.DataFrame(scaler.transform(x_test),columns=x_test.columns,index=x_test.index)

## Regression models library

In [None]:
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Evaluation and test score 
from sklearn.metrics import mean_squared_error,r2_score 

In [None]:
# model dictionary
models = {
    'LinearRegression':LinearRegression(),
    'RandomForestRegressor':RandomForestRegressor(),
    'XGBRegressor':XGBRegressor(),
    # 'SGDRegressor':SGDRegressor(),
    # 'SVR':SVR(),
    'Ridge':Ridge(),
    'ElasticNet':ElasticNet(),
}

## Caution <br>
>> ElasticNet, Ridge, Linear Regression and others can be affected by input entries that have turned into null values, and this may affect all instances<br> 

In [None]:

model_results_R2 = []
model_results_RMSE = []
model_names = []

"""
ElasticNet, Ridge, Linear Regression and
others can be affected by input entries that have turned into null values, and this may affect all instances
step1>> Go to the main train/test split cell
step2>> Run all cells before the model dictionary cell
step3>> Select one model at a time and run slowly
"""

# training the model with function
for name,model in models.items():
    a = model.fit(x_train,y_train)
    predicted = a.predict(x_test)
    scoreRMSE = np.sqrt(mean_squared_error(y_test, predicted))
    scoreR2 = r2_score(y_test, predicted) 
    model_results_RMSE.append(scoreRMSE)
    model_results_R2.append(scoreR2*100)
    model_names.append(name)
    
# creating the results dataframe once, after all models are evaluated
df_results = pd.DataFrame({'Models':model_names,'RMSE':model_results_RMSE,'R2_Score':model_results_R2})
df_results = df_results.sort_values(by='R2_Score',ascending=False)

print(df_results)
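
The two reported scores follow directly from their definitions. A short worked example with made-up numbers, using NumPy only:

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: one minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

RMSE is expressed in the units of the target (dollars for `SalePrice`), while R2 is unitless, which is why the results table multiplies it by 100 to read as a percentage.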

In [None]:
# Profile=ProfileReport(x_train,title=" Evaluation",explorative=True)

# Profile.to_notebook_iframe()


# Profile.to_file("Model Evaluation Analysis.html")

# End of  Modeling and Evaluation 

# Challenges

>> Dealing with null values<br>
>> Encoding nulls and finding the best way to avoid increasing dimensionality<br>
>> Outlier detection and imputation<br>
>> Selecting the best model and its parameters<br> 

### Advanced Improvements and Future Scope
>> Proper handling of null values<br>
>> Creating bins for year-based columns and ordinal imputation, e.g. {years 1998 to 2000 can be called "old" or assigned 0}<br>
>> More advanced parameter selection when training the model
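
The year-binning idea from the list above could look like the following sketch. The bin edges and the labels are hypothetical choices for illustration, not values derived from the dataset:

```python
import pandas as pd

# Hypothetical construction years
years = pd.Series([1940, 1975, 1999, 2008], name="YearBuilt")

# Hypothetical bin edges; each bin is open on the left and closed on the right
bin_edges = [1800, 1950, 1990, 2010]

# labels=False returns the ordinal bin index
codes = pd.cut(years, bins=bin_edges, labels=False)

# Named labels for readability
labels = pd.cut(years, bins=bin_edges, labels=["old", "mid", "new"])

print(codes.tolist())
print(labels.astype(str).tolist())
```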

## These are the suggested values for a customer with SalePrice $108000.00

MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
403	403	30	RL	60	10200	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	Sawyer	Norm	Norm	1Fam	1Story	5	8	1940	1997	Gable	CompShg	Wd Sdng	Wd Sdng	None	0	TA	TA	PConc	TA	TA	No	Unf	0	Unf	0	672	672	GasA	Ex	Y	SBrkr	672	0	0	672	0	0	1	0	2	1	TA	4	Typ	0	NA	Detchd	1940	Unf	1	240	TA	TA	N	168	0	0	0	0	0	NA	GdPrv	NA	0	8	2008	WD	Normal	108000