### **Linear regression on HousePricing**

* Linear Regression is one of the simplest and most widely used algorithms for regression tasks in machine learning. The goal of linear regression is to model the relationship between one or more independent variables (features) and a dependent variable (target) by fitting a linear equation to the observed data.
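As a minimal sketch of what "fitting a linear equation" means, the snippet below recovers a slope and intercept from synthetic data using ordinary least squares. The data, the true coefficients (3 and 2), and all variable names are illustrative assumptions and are not part of the housing dataset.

```python
import numpy as np

# Synthetic data (illustrative only): y = 3*x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=200)

# Design matrix: one column for the feature, one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find w that minimises ||A @ w - y||^2
w, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = w
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

`LinearRegression` in scikit-learn solves the same least-squares problem, with one coefficient per feature column.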

**Step 1: Import Necessary Libraries**

In [32]:
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

**Step 2: Load the Dataset**

In [33]:
df = pd.read_csv("E:\\Machine Learning\\Datasets\\House_Price_train.csv")

In [34]:
df.head()

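| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |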


In [35]:
df.shape

(1460, 81)

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

**Step 3: Preprocess the Dataset**

In [37]:
df.drop(columns = ['Alley'], inplace = True)

In [38]:
columns_with_nan = df.columns[df.isna().any()].tolist()

print("Columns with NaN values:", columns_with_nan)

Columns with NaN values: ['LotFrontage', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']


In [39]:
columns_to_check = ['LotFrontage', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 
                    'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 
                    'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 
                    'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 
                    'Fence', 'MiscFeature']

# Calculate and print the percentage of missing values for each specified column
for column in columns_to_check:
    missing_percentage = df[column].isnull().sum() / len(df) * 100
    print(f"Percentage of missing values in '{column}': {missing_percentage:.2f}%")


Percentage of missing values in 'LotFrontage': 17.74%
Percentage of missing values in 'MasVnrType': 59.73%
Percentage of missing values in 'MasVnrArea': 0.55%
Percentage of missing values in 'BsmtQual': 2.53%
Percentage of missing values in 'BsmtCond': 2.53%
Percentage of missing values in 'BsmtExposure': 2.60%
Percentage of missing values in 'BsmtFinType1': 2.53%
Percentage of missing values in 'BsmtFinType2': 2.60%
Percentage of missing values in 'Electrical': 0.07%
Percentage of missing values in 'FireplaceQu': 47.26%
Percentage of missing values in 'GarageType': 5.55%
Percentage of missing values in 'GarageYrBlt': 5.55%
Percentage of missing values in 'GarageFinish': 5.55%
Percentage of missing values in 'GarageQual': 5.55%
Percentage of missing values in 'GarageCond': 5.55%
Percentage of missing values in 'PoolQC': 99.52%
Percentage of missing values in 'Fence': 80.75%
Percentage of missing values in 'MiscFeature': 96.30%
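The per-column loop above can also be written as a single vectorised expression. The sketch below uses a small made-up frame (`toy`) rather than the housing data, and the 40% cutoff is a hypothetical threshold chosen only for illustration, not the rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0],
})

# isna().mean() is the fraction of missing values per column
missing_pct = toy.isna().mean().mul(100)

# Keep only columns that have missing values, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct.round(2))

# Hypothetical cutoff: flag columns that are more than 40% empty
threshold = 40.0
to_drop = missing_pct[missing_pct > threshold].index.tolist()
print(to_drop)
```

Deriving the drop list from a threshold keeps the decision reproducible if the dataset changes.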


In [40]:
df.drop(columns = ['MiscFeature', 'Fence','PoolQC','FireplaceQu','MasVnrType','LotFrontage'], inplace = True)

In [41]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotArea', 'Street', 'LotShape',
       'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood',
       'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF',
       '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
       'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch'

In [42]:
df_num = df.select_dtypes(include = [np.number])
df_cat = df.select_dtypes(include = ['object'])

In [43]:
print(df_num.columns)
print(df_cat.columns)

Index(['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')
Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical'

In [44]:
from sklearn.impute import SimpleImputer
imputer1 = SimpleImputer(strategy='median')
df_num_imputed = pd.DataFrame(imputer1.fit_transform(df_num))
df_num_imputed.columns = df_num.columns

print(df_num_imputed.columns)

Index(['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')


In [45]:
imputer2 = SimpleImputer(strategy = 'most_frequent')
df_cat_imputed = pd.DataFrame(imputer2.fit_transform(df_cat))
df_cat_imputed.columns = df_cat.columns
print(df_cat_imputed.columns)

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
       'SaleType', 'SaleCondition'],
      dtype='object')
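`SimpleImputer.fit_transform` returns a plain NumPy array, so column names and the row index have to be restored by hand. The sketch below, on a made-up frame with a deliberately non-default index, passes both to the `DataFrame` constructor. This matters whenever the frame has been filtered before imputation; in this notebook the index is still the default `RangeIndex`, so the result is the same either way.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with a non-default index (values are made up)
toy = pd.DataFrame(
    {"GarageYrBlt": [2003.0, np.nan, 2001.0], "MasVnrArea": [196.0, 0.0, np.nan]},
    index=[10, 20, 30],
)

imputer = SimpleImputer(strategy="median")

# Restore both the column names and the original index on the imputed array
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(toy_imputed)
```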


**Step 4: Correlation between the columns**

In [46]:
df_num_corr = df_num_imputed.corr()

In [47]:
df_num_columns = []
df_num_columns.extend(df_num_corr[(df_num_corr['SalePrice']>0.3)].index.values)
df_num_columns.extend(df_num_corr[(df_num_corr['SalePrice']<-0.3)].index.values)

In [48]:
df_num_columns

['OverallQual',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'GrLivArea',
 'FullBath',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'SalePrice']
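The two `extend` calls select columns whose correlation with the target is above 0.3 or below -0.3. A compact alternative is to filter on the absolute value, as sketched below on synthetic data. The column names and the data are assumptions for illustration. The sketch also drops the target from the result, whereas the list above still contains `SalePrice`.

```python
import numpy as np
import pandas as pd

# Synthetic frame: two informative columns, one pure-noise column, one target
rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_pos": strong,
    "strong_neg": -strong + rng.normal(scale=0.1, size=n),
    "noise": rng.normal(size=n),
    "target": 2 * strong + rng.normal(scale=0.1, size=n),
})

# Correlation of every column with the target, excluding the target itself
corr_with_target = toy.corr()["target"].drop("target")

# A single filter on |r| covers both the positive and the negative case
selected = corr_with_target[corr_with_target.abs() > 0.3].index.tolist()
print(selected)
```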

In [49]:
from scipy.stats import f_oneway

In [50]:
df_cat_imputed['SP'] = df_num_imputed['SalePrice']

In [51]:
influence_list = []
noninfluence_list = []
for influence1 in df_cat_imputed.columns:
    # Skip the helper column that holds the target
    if influence1 == 'SP':
        continue
    # One group of sale prices per category level of this column
    groups = [df_cat_imputed['SP'][df_cat_imputed[influence1] == category] for category in df_cat_imputed[influence1].unique()]
    # One-way ANOVA: do the mean sale prices differ across the levels?
    f_stat, p_value = f_oneway(*groups)
    print(f"column : {influence1}, F-statistic: {f_stat}, P-value: {p_value}")
    # Keep the column if the difference is significant at the 5% level
    if p_value < 0.05:
        influence_list.append(influence1)
    else:
        noninfluence_list.append(influence1)

column : MSZoning, F-statistic: 43.84028167245718, P-value: 8.817633866272648e-35
column : Street, F-statistic: 2.4592895583691994, P-value: 0.11704860406782483
column : LotShape, F-statistic: 40.13285166226295, P-value: 6.447523852011766e-25
column : LandContour, F-statistic: 12.850188333283924, P-value: 2.7422167521379096e-08
column : Utilities, F-statistic: 0.29880407484898486, P-value: 0.584716773968938
column : LotConfig, F-statistic: 7.809954123467792, P-value: 3.163167473604189e-06
column : LandSlope, F-statistic: 1.9588170374149438, P-value: 0.1413963584114019
column : Neighborhood, F-statistic: 71.78486512058272, P-value: 1.558600282771154e-225
column : Condition1, F-statistic: 6.118017137125925, P-value: 8.904549416138854e-08
column : Condition2, F-statistic: 2.0738986215227877, P-value: 0.043425658360948464
column : BldgType, F-statistic: 13.011077169620851, P-value: 2.0567364604967015e-10
column : HouseStyle, F-statistic: 19.595000995981223, P-value: 3.376776535121222e-25
c

column : Exterior2nd, F-statistic: 17.500839571369834, P-value: 4.8421856706985465e-43
column : ExterQual, F-statistic: 443.33483141504627, P-value: 1.4395510967787893e-204
column : ExterCond, F-statistic: 8.798714214177485, P-value: 5.106680608671862e-07
column : Foundation, F-statistic: 100.25385058740888, P-value: 5.791895002232234e-91
column : BsmtQual, F-statistic: 413.94564835837843, P-value: 2.078120077126687e-194
column : BsmtCond, F-statistic: 13.791801383643444, P-value: 7.166577741866081e-09
column : BsmtExposure, F-statistic: 76.57793447709939, P-value: 5.39423843153302e-46
column : BsmtFinType1, F-statistic: 70.51842537038553, P-value: 3.598398455129613e-66
column : BsmtFinType2, F-statistic: 2.3900074267234896, P-value: 0.03599298669495113
column : Heating, F-statistic: 4.259818559406287, P-value: 0.000753472106445497
column : HeatingQC, F-statistic: 88.39446198869796, P-value: 2.6670620921043572e-67
column : CentralAir, F-statistic: 98.30534356615253, P-value: 1.80950615

In [52]:
influence_list

['MSZoning',
 'LotShape',
 'LandContour',
 'LotConfig',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'SaleType',
 'SaleCondition']

In [53]:
noninfluence_list

['Street', 'Utilities', 'LandSlope']
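The ANOVA screening above can be expressed with `groupby`, which avoids building a boolean mask per category. The following is a small sketch on made-up data. The helper name `anova_p_value` and the toy values are hypothetical.

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy data: "Quality" clearly separates prices, "Street" does not
toy = pd.DataFrame({
    "Quality": ["Ex", "Ex", "Ex", "Gd", "Gd", "Gd", "TA", "TA", "TA"],
    "Street":  ["A", "B", "A", "B", "A", "B", "A", "B", "A"],
    "SP":      [300, 310, 305, 200, 210, 205, 100, 110, 105],
})

def anova_p_value(frame, column, target="SP"):
    """P-value of a one-way ANOVA of `target` across the levels of `column`."""
    groups = [g[target].to_numpy() for _, g in frame.groupby(column)]
    return f_oneway(*groups).pvalue

p_quality = anova_p_value(toy, "Quality")
p_street = anova_p_value(toy, "Street")
print(f"Quality: {p_quality:.3g}, Street: {p_street:.3g}")
```

With dozens of columns tested at the 5% level, a few may pass by chance, so the cutoff is best read as a rough screen and not a guarantee of relevance.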

In [54]:
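# Note: this selects from df_cat (the un-imputed frame), so any remaining NaN values reach the encoder as-is.
# Using df_cat_imputed[influence_list] would encode the imputed values instead, which may change the column count shown in Step 5.
df_cat1 = df_cat[influence_list]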
df_cat1.columns

Index(['MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood',
       'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
       'KitchenQual', 'Functional', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition'],
      dtype='object')

**Step 5: Encoding the categorical columns**

In [55]:
from sklearn.preprocessing import OneHotEncoder

In [56]:
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoded_array = encoder.fit_transform(df_cat1)
encoded_df = pd.DataFrame(encoded_array, columns = encoder.get_feature_names_out(df_cat1.columns))
encoded_df.shape

(1460, 233)

**Step 6: Removing Outliers from the Data Columns**

In [57]:
def remove_outliers_iqr(df,columns):
    for col in columns :
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)

        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

In [58]:
import pandas as pd

def remove_outliers_iqr(df, columns):
    for col in columns:
        # Ensure the column is numeric
        if pd.api.types.is_numeric_dtype(df[col]):
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1

            # Define the lower and upper bounds for outliers
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            # Remove rows where the values are outliers
            df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

    return df

columns_to_check = ['OverallQual',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'GrLivArea',
 'FullBath',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF']
df_no_outliers = remove_outliers_iqr(df_num_imputed, columns_to_check)

print("DataFrame after removing outliers:")
print(df_no_outliers.shape)


DataFrame after removing outliers:
(1150, 37)
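The function above filters one column at a time, so the quartiles of later columns are computed on rows that survived the earlier filters. One possible variant, sketched below on made-up data, computes all bounds on the full frame first and applies them in a single pass. The helper name `iqr_mask` and the toy values are hypothetical, and this variant will not necessarily reproduce the row count shown above.

```python
import pandas as pd

def iqr_mask(frame, columns, k=1.5):
    """Boolean mask of rows inside [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns
    inside = frame[columns].ge(q1 - k * iqr) & frame[columns].le(q3 + k * iqr)
    return inside.all(axis=1)

# Toy frame: the last GrLivArea value is an obvious outlier
toy = pd.DataFrame({
    "GrLivArea": [1000, 1100, 1200, 1300, 1400, 9000],
    "GarageArea": [400, 420, 440, 460, 480, 500],
})

mask = iqr_mask(toy, ["GrLivArea", "GarageArea"])
toy_no_outliers = toy[mask]
print(toy_no_outliers.shape)
```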


**Step 7: Scale the numerical columns**

In [59]:
from sklearn.preprocessing import StandardScaler
columns_to_scale = ['OverallQual','YearBuilt','YearRemodAdd','MasVnrArea','BsmtFinSF1',
                     'TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea','FullBath', 'TotRmsAbvGrd',
                     'Fireplaces','GarageYrBlt','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','SalePrice']
standard_scaler = StandardScaler()
df_standard_scaled = df_no_outliers.copy()
df_standard_scaled[columns_to_scale] = standard_scaler.fit_transform(df_no_outliers[columns_to_scale])


In [60]:
final_df = pd.concat([df_standard_scaled, encoded_df], axis = 1)
final_df.head()

| | Id | MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 60.0 | 8450.0 | 0.846419 | 5.0 | 1.098515 | 0.907024 | 1.235608 | 0.736393 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 2.0 | 20.0 | 9600.0 | 0.033235 | 8.0 | 0.178964 | -0.390276 | -0.627782 | 1.425393 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 3.0 | 60.0 | 11250.0 | 0.846419 | 5.0 | 1.0304 | 0.858976 | 0.912367 | 0.179114 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 4.0 | 70.0 | 9550.0 | 0.846419 | 5.0 | -1.898538 | -0.678565 | -0.627782 | -0.50482 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5.0 | 60.0 | 14260.0 | 1.659604 | 5.0 | 0.996342 | 0.76288 | 2.6997 | 0.607206 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
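`pd.concat(..., axis=1)` aligns rows by index label. Because the outlier-filtered frame has fewer rows than the encoded frame, the default outer join keeps every label and fills the filtered-out rows with NaN in the numeric columns. The sketch below reproduces this on a tiny made-up example and shows `join="inner"` as one way to keep only the surviving rows. All frame names and values are illustrative.

```python
import pandas as pd

# Numeric frame with one row filtered out (row label 2 is removed)
numeric = pd.DataFrame({"GrLivArea": [1.0, 2.0, 3.0, 4.0]})
numeric_filtered = numeric[numeric["GrLivArea"] != 3.0]

# Encoded frame that still has all four rows
dummies = pd.DataFrame({"CentralAir_Y": [1.0, 0.0, 1.0, 1.0]})

# Default (outer) join: row 2 comes back with NaN in GrLivArea
outer = pd.concat([numeric_filtered, dummies], axis=1)

# Inner join: only rows present in both frames are kept
inner = pd.concat([numeric_filtered, dummies], axis=1, join="inner")

print(outer)
print(inner)
```

Calling `dropna()` on the outer result removes the same rows, provided no other NaN values remain in the frame.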


**Step 8: Split the data into train and test sets and fit the model**

In [63]:
# Drop rows with any missing values
df_clean = final_df.dropna()

# Separate features and target again
X = df_clean.drop(columns=['SalePrice'])
y = df_clean['SalePrice']

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
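In the cells above, the scaler is fitted on all rows before the split, so statistics from the test rows influence the training features. A common way to avoid this is to put the scaler and the model in a single pipeline and fit it on the training split only. The sketch below shows that pattern on synthetic data. The arrays `X_demo` and `y_demo` and the coefficients are assumptions for illustration, not the housing features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=1500, scale=400, size=(300, 3))
y_demo = X_demo @ np.array([100.0, 50.0, -20.0]) + rng.normal(scale=1000, size=300)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses its statistics on X_te
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# score() returns R^2 on the held-out split
score = pipe.score(X_te, y_te)
print(f"R-squared on held-out data: {score:.3f}")
```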

**Step 9: Evaluate the Linear Regression model**

In [64]:
from sklearn.metrics import r2_score

# y_test holds the actual values; y_pred holds the model's predictions on the test set
y_true = y_test
y_pred = model.predict(X_test)

# Compute the R² score
r2 = r2_score(y_true, y_pred)

# Print the R² score
print(f"R-squared: {r2}")


R-squared: 0.878744716642077
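For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared error. The sketch below computes both by hand on a few made-up numbers. The arrays are illustrative and unrelated to the model above.

```python
import numpy as np

# Made-up actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.0])

# Residual and total sums of squares
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

# R^2 = 1 - SS_res / SS_tot
r2_manual = 1.0 - ss_res / ss_tot

# RMSE is expressed in the same units as the target
rmse = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

print(f"R-squared: {r2_manual:.3f}, RMSE: {rmse:.3f}")
```

Because `SalePrice` was standardised in Step 7, an RMSE computed on this notebook's predictions would be in standard-deviation units and not in dollars.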
