<h3>House Prices Regression Analysis</h3>
<h5>--A Kaggle Project--</h5>

<h4>Steps for the Project</h4>
<ul>
    <li>Loading Datasets</li>
    <ul>
        <li>Loading both train and test datasets </li>
    </ul>
</ul>
<ul>
    <li>Summary of Data</li>
    <ul>
        <li>Total Samples</li>
        <li>Total Features</li>
        <li>Total Categorical Features</li>
        <li>Total Numerical Features</li>
        <li>Statistics of Numerical Features</li>
        <li>Value Count of Categorical Features</li>
        <li>Unique Values DataFrame</li>
        <li>Null Values DataFrame</li>
    </ul>
</ul>
<ul>
    <li>Preprocessing</li>
    <ul>
        <li>Drop Duplicates</li>
        <li>Drop Columns with more than 80% null values</li>
        <li>Drop uninformative columns</li>
        <li>Drop Columns with single unique values</li>
        <li>Impute Null Values</li>
        <li>Create New Features</li>
        <li>Outlier Analysis and Removal</li>
        <li>Drop Columns with single unique values again after outlier analysis</li>
    </ul>
</ul>
<ul>
    <li>Data Visualization</li>
    <ul>
        <li>Scatterplot of numerical features</li>
        <li>Distribution of numerical features</li>
        <li>BarCharts of categorical features</li>
        <li>Box plots to check the outliers</li>
    </ul>
</ul>
<ul>
    <li>Feature Transformation</li>
    <ul>
        <li>Changing the distribution of numerical features to Gaussian (Normal)</li>
    </ul>
    <li>Encoding</li>
    <ul>
        <li>Some of the categorical features are nominal and some are ordinal, so they need to be encoded separately.</li>
        <li>For ordinal features, we will use label encoding.</li>
        <li>For nominal features, we will use dummy (one-hot) encoding.</li>
    </ul>
    <li>Model Training & Evaluation</li>
    <ul>
    <li>Perform Scaling
        <ul>
            <li>MinMax Scaling</li>
            <li>Standardization (StandardScaler: zero mean, unit variance)</li>
        </ul>
    </li>
    </ul>
    <li>Fitting Different Regression Models</li>
    <ul>
        <li>Linear Regression</li>
        <li>Polynomial Regression (with interaction features)</li>
        <li>Ridge Regression</li>
        <li>Lasso Regression</li>
        <li>SGD Regression</li>
        <li>ElasticNet Regression</li>
        <li>Bayesian Ridge</li>
        <li>Huber Regression (robust to outliers)</li>
        <li>RANSAC Regression (robust to outliers)</li>
        <li>XGB Regressor</li>
        <li>Ensemble Regressor</li>
        <li>Random Forest</li>
        <li>Gradient Boosting</li>
        <li>AdaBoost</li>
        <li>Bagging Regressor</li>
        <li>ExtraTreesRegressor</li>
    </ul>
    <li>Feature Selection</li>
    <ul>
        <li>Selecting strong numerical features using Pearson’s Correlation Coefficient</li>
        <li>Selecting strong categorical features using ANOVA</li>
    </ul>
    <li>Dimensionality Reduction</li>
    <ul>
        <li>Using PCA to perform dimensionality reduction.</li>
        <li>The data must be scaled before applying PCA.</li>
    </ul>
    <li>Model Training & Evaluation With Strong Features Only
        <ul>
            <li>Using the same models as listed above.</li>
        </ul>
    </li>
    <li>Conclusion</li>
    <ul>
        <li>Which performed best: the model trained on all the features, or the one trained on the strong features only?</li>
    </ul>
    <li>Hyperparameter Tuning</li>
    <ul>
        <li>Tuning the parameters of the best model</li>
    </ul>
    <li>Feature Engineering Analysis</li>
    <ul>
        <li>Comparison of the scores obtained after the different feature engineering steps.</li>
    </ul>
    <li>Storytelling from the Result Analysis</li>
    <ul>
        <li>Simple interpretation of the results in plain, non-technical language.</li>
    </ul>
</ul>
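
The encoding step in the outline above treats ordinal and nominal features differently. A minimal sketch of that idea on a toy frame follows. The column names are taken from the dataset, but the sample values and the `quality_order` mapping are assumptions made for illustration, not the notebook's actual encoding.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, values are illustrative only.
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],  # ordinal: quality grades
    "MSZoning": ["RL", "RM", "RL", "RH"],   # nominal: no natural order
})

# Assumed ordering of the quality grades, from worst to best.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

encoded = toy.copy()
# Label-encode the ordinal column with an explicit, order-preserving mapping.
encoded["ExterQual"] = encoded["ExterQual"].map(quality_order)
# Dummy-encode the nominal column; drop_first removes one redundant column.
encoded = pd.get_dummies(encoded, columns=["MSZoning"], drop_first=True)

print(encoded)
```

An explicit mapping keeps the encoded integers in the same order as the grades, which an automatic alphabetical label encoder would not guarantee.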

In [1]:
# Loading data
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_object_dtype
train_df = pd.read_csv("home-data-for-ml-course/train.csv")
test_df = pd.read_csv("home-data-for-ml-course/test.csv")


In [2]:
# View train dataset
train_df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
# View test dataset
test_df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


<h4>Summary Of Data</h4>
<ul>
<li>Total Samples</li>
<li>Total Features</li>
<li>Total Categorical Features</li>
<li>Total Numerical Features</li>
<li>Stats of Numerical Features</li>
<li>Value Count of Categorical Features</li>
<li>Unique Values DataFrame</li>
<li>Null Values DataFrame</li>
</ul>

In [4]:
# Identify numerical & categorical features
def get_cat_num_features(df):
    """
    Identifies and separates numerical and categorical features from a DataFrame.

    Parameters:
    df (pd.DataFrame): Input DataFrame.

    Returns:
    tuple: A tuple containing two lists:
        - num_features: List of numerical feature column names.
        - cat_features: List of categorical feature column names.
    """
    num_features = []
    cat_features = []

    for col in df.columns:
        if is_numeric_dtype(df[col]):
            num_features.append(col)
        if is_object_dtype(df[col]):
            cat_features.append(col)
    return num_features, cat_features
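
As a cross-check on the helper above, a similar split can be sketched with pandas' built-in `DataFrame.select_dtypes`. The toy frame below is an assumption made for illustration. Note that `exclude="number"` treats every non-numeric column as categorical, which is slightly broader than the object-only check used in `get_cat_num_features`.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, values are illustrative only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, None],
    "MSZoning": ["RL", "RL", "RM"],
})

# Numeric columns (integers and floats).
num_features = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here.
cat_features = toy.select_dtypes(exclude="number").columns.tolist()

print(num_features)
print(cat_features)
```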

In [12]:
#  Creates a DataFrame containing unique values and their counts for each feature.
def get_unique_df(features):
    """
    Creates a DataFrame containing unique values and their counts for each feature.

    Parameters:
    features (pd.DataFrame): Input DataFrame containing features.

    Returns:
    pd.DataFrame: DataFrame with columns:
        - 'Feature': Feature name.
        - 'Unique': Array of unique values in the feature.
        - 'Count': Number of unique values.
    """
    unique_list = []  # List to store individual entries for each feature
    
    for col in features.columns:
        values = features[col].unique()  # Unique values (NaN counts as one value)
        unique_list.append({"Feature": col, "Unique": values, "Count": len(values)})

    # Create a DataFrame from the list of dictionaries
    unique_df = pd.DataFrame(unique_list)
    
    return unique_df
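
One detail worth keeping in mind when reading the `Count` column: `Series.unique()` includes `NaN` as a value, so columns with missing data report one more unique value than `Series.nunique()` does by default. A small sketch on an assumed toy column:

```python
import numpy as np
import pandas as pd

# Toy column with missing entries; values are illustrative only.
toy = pd.DataFrame({"Alley": ["Grvl", np.nan, "Pave", np.nan]})

with_nan = len(toy["Alley"].unique())                  # NaN counted once
without_nan = toy["Alley"].nunique()                   # NaN ignored by default
matches_unique = toy["Alley"].nunique(dropna=False)    # same as len(unique())

print(with_nan, without_nan, matches_unique)
```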


In [15]:
# Identifies columns with missing values and summarizes the missing data.
from pandas.api.types import is_numeric_dtype

def get_null_df(features: pd.DataFrame) -> pd.DataFrame:
    """
    Identifies columns with missing values and summarizes the missing data.

    Parameters:
    features (pd.DataFrame): Input DataFrame containing features.

    Returns:
    pd.DataFrame: DataFrame with columns:
        - 'Column': Name of the column.
        - 'Type': Data type of the column (Numerical or Categorical).
        - 'Total NaN': Total number of missing values.
        - '%': Percentage of missing values relative to the total rows.
    """
    # Initialize a list to collect row data
    null_data = []

    # Identify columns with missing values
    col_null = features.columns[features.isna().any()].to_list()
    total_rows = len(features)

    for col in col_null:
        col_type = "Numerical" if is_numeric_dtype(features[col]) else "Categorical"
        null_count = features[col].isna().sum()
        
        # Append a dictionary for each column's information to the list
        null_data.append({
            "Column": col,
            "Type": col_type,
            "Total NaN": null_count,
            "%": (null_count / total_rows) * 100
        })

    # Create a DataFrame from the collected data
    return pd.DataFrame(null_data)
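
The percentage computed above feeds the preprocessing step that drops columns with more than 80% null values. A minimal sketch of that step, assuming a toy frame and a hypothetical `threshold` variable (neither is taken from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows; values are illustrative only.
toy = pd.DataFrame({
    "PoolQC": [np.nan] * 9 + ["Gd"],                         # 90% missing
    "LotFrontage": [65.0, np.nan, 80.0, np.nan] + [70.0] * 6,  # 20% missing
    "LotArea": list(range(8000, 8010)),                      # no missing values
})

threshold = 80  # assumed cutoff, in percent

# Share of missing values per column, as a percentage.
null_pct = toy.isna().mean() * 100
# Columns whose missing share exceeds the cutoff.
to_drop = null_pct[null_pct > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```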


In [16]:
# Summary Function
def summary(data):
    """
    Provides a comprehensive summary of the dataset, including:
    - Total samples and features.
    - Numerical and categorical features.
    - Descriptive statistics.
    - Value counts of categorical features.
    - DataFrames summarizing unique values and missing values.

    Parameters:
    data (pd.DataFrame): Input DataFrame with the target column 'SalePrice'.

    Returns:
    dict: A dictionary containing:
        - 'features': DataFrame of features excluding the target column.
        - 'target': Target column (SalePrice).
        - 'stats': Descriptive statistics for numerical features.
        - 'unique_df': DataFrame summarizing unique values of each feature.
        - 'col_null_df': DataFrame summarizing columns with missing values.
    """
    print("Samples --> ", len(data))
    print()
    
    target = data['SalePrice']
    features = data.drop(['SalePrice'], axis=1)
    
    print("Features --> ", len(features.columns))
    print("\n", features.columns)
    
    num_features, cat_features = get_cat_num_features(features)
    
    print()
    print("\nNumerical Features --> ", len(num_features))
    print()
    print(num_features)
    print()
    print("Categorical Features -->", len(cat_features))
    print()
    print(cat_features)
    print()
    print("*************************************************")
    
    stats = features.describe().T
    print()
    print("Value counts of each categorical feature\n")
    for col in cat_features:
        print(col)
        print(features[col].value_counts())
        print()
        
    unique_df = get_unique_df(features)
    col_null_df = get_null_df(features)
    
    return {'features': features, 
            'target': target, 
            'stats': stats, 
            'unique_df': unique_df, 
            'col_null_df': col_null_df}

In [17]:
summary(train_df)

Samples -->  1460

Features -->  80

 Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQ

{'features':         Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
 0        1          60       RL         65.0     8450   Pave   NaN      Reg   
 1        2          20       RL         80.0     9600   Pave   NaN      Reg   
 2        3          60       RL         68.0    11250   Pave   NaN      IR1   
 3        4          70       RL         60.0     9550   Pave   NaN      IR1   
 4        5          60       RL         84.0    14260   Pave   NaN      IR1   
 ...    ...         ...      ...          ...      ...    ...   ...      ...   
 1455  1456          60       RL         62.0     7917   Pave   NaN      Reg   
 1456  1457          20       RL         85.0    13175   Pave   NaN      Reg   
 1457  1458          70       RL         66.0     9042   Pave   NaN      Reg   
 1458  1459          20       RL         68.0     9717   Pave   NaN      Reg   
 1459  1460          20       RL         75.0     9937   Pave   NaN      Reg   
 
      LandContour Utilitie