## Ordinary Least Squares (OLS)
In this project, we will conduct an Ordinary Least Squares (OLS) regression analysis with the primary goal of maximizing the R² value. Our approach will involve several key steps:

**Correlation Analysis:** We will begin by calculating the correlation of each numeric feature with the target variable (`SalePrice`) as a first indication of how much it could contribute to the R² value. This analysis will help us identify which features have the strongest linear relationships with the target.

**Dummy Variable Conversion:** After understanding the numeric features, we will convert categorical features into dummy variables. This step is essential for including these features in our regression model.

**Incremental Feature Addition:** We will then proceed to add features to the model one by one, carefully monitoring the changes in the R² value. This iterative process will allow us to observe how each additional feature affects the model's explanatory power.
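
As a side note on the correlation step, the link between a feature's correlation and R² is exact only when that feature is the sole predictor: the R² of a one-predictor least-squares fit equals the squared Pearson correlation. A feature whose correlation with the target is 0.8 would therefore give an R² of 0.64 on its own. The sketch below checks this on synthetic data (the variables `x` and `y` are illustrative and not taken from the housing dataset).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)              # synthetic predictor (hypothetical data)
y = 3.0 * x + rng.normal(size=200)    # synthetic target with added noise

# Pearson correlation between predictor and target
r = np.corrcoef(x, y)[0, 1]

# R² of the one-predictor least-squares fit y ~ a + b*x
b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)
r_squared = 1.0 - residuals.var() / y.var()

print(f"r^2 = {r**2:.6f}, R^2 = {r_squared:.6f}")
```

With several correlated predictors in one model, the individual squared correlations no longer add up to the model's R², so the correlations are best read as a screening aid.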

In [1]:
import numpy as np  # numeric helpers for SSE, RMSE, and MAE
import pandas as pd

import statsmodels.api as sm  # used for VIF
import statsmodels.formula.api as smf  # formula interface for OLS
from statsmodels.stats.outliers_influence import variance_inflation_factor  # VIF

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error  # RMSE and MAE
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler  # feature scaling

import matplotlib.pyplot as plt  # visualization
import seaborn as sns  # visualization

In [2]:
EDA = pd.read_csv(r'C:\Users\norah\Desktop\Najm AI\cleaned_Data_Ames_Housing_Data.csv', sep='\t')
print(EDA)

      Order        PID  MS_SubClass MS_Zoning  Lot_Frontage  Lot_Area Street  \
0         1  526301100           20        RL           141     31770   Pave   
1         2  526350040           20        RH            80     11622   Pave   
2         3  526351010           20        RL            81     14267   Pave   
3         4  526353030           20        RL            93     11160   Pave   
4         5  527105010           60        RL            74     13830   Pave   
...     ...        ...          ...       ...           ...       ...    ...   
2924   2926  923275080           80        RL            37      7937   Pave   
2925   2927  923276100           20        RL            69      8885   Pave   
2926   2928  923400125           85        RL            62     10441   Pave   
2927   2929  924100070           20        RL            77     10010   Pave   
2928   2930  924151050           60        RL            74      9627   Pave   

     Lot_Shape Land_Contour Utilities  

In [3]:
EDA.head()

Unnamed: 0,Order,PID,MS_SubClass,MS_Zoning,Lot_Frontage,Lot_Area,Street,Lot_Shape,Land_Contour,Utilities,...,Garage_Cars,Garage_Area,Garage_Qual,Garage_Cond,Paved_Drive,Mo_Sold,Yr_Sold,Sale_Type,Sale_Condition,SalePrice
0,1,526301100,20,RL,141,31770,Pave,IR1,Lvl,AllPub,...,2,528,TA,TA,P,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80,11622,Pave,Reg,Lvl,AllPub,...,1,730,TA,TA,Y,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81,14267,Pave,IR1,Lvl,AllPub,...,1,312,TA,TA,Y,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93,11160,Pave,Reg,Lvl,AllPub,...,2,522,TA,TA,Y,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74,13830,Pave,IR1,Lvl,AllPub,...,2,482,TA,TA,Y,3,2010,WD,Normal,189900


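**First, we examine the correlations among the numeric features.
To do so, we collect the integer-typed (`int64`) columns into their own DataFrame, leaving out the identifier columns `Order` and `PID`.**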

In [4]:
# Select only integer features
integer_data = EDA.select_dtypes(include=['int64'])

# Remove the specific columns
integer_data = integer_data.drop(columns=['Order', 'PID'])

# Display the new DataFrame with only integer features
print("\nDataFrame with Integer Features (excluding 'Order' and 'PID'):")
print(integer_data)

# Display the first few rows
display(integer_data.head())


DataFrame with Integer Features (excluding 'Order' and 'PID'):
      MS_SubClass  Lot_Frontage  Lot_Area  Overall_Qual  Overall_Cond  \
0              20           141     31770             6             5   
1              20            80     11622             5             6   
2              20            81     14267             6             6   
3              20            93     11160             7             5   
4              60            74     13830             5             5   
...           ...           ...       ...           ...           ...   
2924           80            37      7937             6             6   
2925           20            69      8885             5             5   
2926           85            62     10441             5             5   
2927           20            77     10010             5             5   
2928           60            74      9627             7             5   

      Year_Built  Year_Remod_Add  BsmtFin_SF_1  Bsmt_Unf_SF

Unnamed: 0,MS_SubClass,Lot_Frontage,Lot_Area,Overall_Qual,Overall_Cond,Year_Built,Year_Remod_Add,BsmtFin_SF_1,Bsmt_Unf_SF,Total_Bsmt_SF,...,Bedroom_AbvGr,Kitchen_AbvGr,TotRms_AbvGrd,Fireplaces,Garage_Yr_Blt,Garage_Cars,Garage_Area,Mo_Sold,Yr_Sold,SalePrice
0,20,141,31770,6,5,1960,1960,639,441,1080,...,3,1,7,2,1960,2,528,5,2010,215000
1,20,80,11622,5,6,1961,1961,468,270,882,...,2,1,5,0,1961,1,730,6,2010,105000
2,20,81,14267,6,6,1958,1958,923,406,1329,...,3,1,6,0,1958,1,312,6,2010,172000
3,20,93,11160,7,5,1968,1968,1065,1045,2110,...,3,1,8,2,1968,2,522,4,2010,244000
4,60,74,13830,5,5,1997,1998,791,137,928,...,3,1,6,1,1997,2,482,3,2010,189900


In [5]:
# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Scale every column (including the target, SalePrice) to the [0, 1] range
scaled_values = scaler.fit_transform(integer_data)

# Convert to the 0-100 range; astype(int) truncates the fractional part
scaled_values_int = (scaled_values * 100).astype(int)

# Create a new DataFrame with the scaled integer values
NumCorr = pd.DataFrame(scaled_values_int, columns=integer_data.columns)

print(NumCorr)


      MS_SubClass  Lot_Frontage  Lot_Area  Overall_Qual  Overall_Cond  \
0               0            41        14            55            50   
1               0            20         4            44            62   
2               0            20         6            55            62   
3               0            24         4            66            50   
4              23            18         5            44            50   
...           ...           ...       ...           ...           ...   
2924           35             5         3            55            62   
2925            0            16         3            44            50   
2926           38            14         4            44            50   
2927            0            19         4            44            50   
2928           23            18         3            66            50   

      Year_Built  Year_Remod_Add  BsmtFin_SF_1  Bsmt_Unf_SF  Total_Bsmt_SF  \
0             63              16            1
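
One detail of the scaling cell is worth a small illustration: `astype(int)` truncates instead of rounding, which is why an `Overall_Qual` of 6 shows up as 55 and not 56. The sketch below assumes `Overall_Qual` spans 1 to 10 in the data (an assumption that is consistent with the printed values but is not shown directly in the output) and compares truncation with rounding via `np.rint`.

```python
import numpy as np

# Overall_Qual values 5, 6, 7 after min-max scaling, assuming a 1-10 range
scaled = np.array([4 / 9, 5 / 9, 6 / 9])

truncated = (scaled * 100).astype(int)        # what the notebook does
rounded = np.rint(scaled * 100).astype(int)   # alternative: round to nearest

print(truncated.tolist())
print(rounded.tolist())
```

Truncation discards up to one unit of each value on the 0-100 scale, so the rounded or unrounded values may be preferable if the scaled data are used for fitting.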

In [6]:
integer_data.corr()

Unnamed: 0,MS_SubClass,Lot_Frontage,Lot_Area,Overall_Qual,Overall_Cond,Year_Built,Year_Remod_Add,BsmtFin_SF_1,Bsmt_Unf_SF,Total_Bsmt_SF,...,Bedroom_AbvGr,Kitchen_AbvGr,TotRms_AbvGrd,Fireplaces,Garage_Yr_Blt,Garage_Cars,Garage_Area,Mo_Sold,Yr_Sold,SalePrice
MS_SubClass,1.0,-0.3917,-0.20481,0.038994,-0.06699,0.036337,0.04293,-0.060075,-0.130421,-0.219445,...,-0.019523,0.257672,0.03145,-0.050246,0.106621,-0.046167,-0.103511,0.000106,-0.01786,-0.085508
Lot_Frontage,-0.3917,1.0,0.365407,0.199761,-0.068,0.115898,0.086672,0.199706,0.108131,0.329714,...,0.218293,0.005238,0.325061,0.229983,0.045844,0.290531,0.338073,0.010627,-0.007037,0.341489
Lot_Area,-0.20481,0.365407,1.0,0.096959,-0.034535,0.023109,0.021394,0.191555,0.023658,0.253589,...,0.136412,-0.020339,0.216413,0.25687,-0.026421,0.17935,0.212685,0.00371,-0.023057,0.266404
Overall_Qual,0.038994,0.199761,0.096959,1.0,-0.094219,0.596898,0.569252,0.284118,0.270058,0.547294,...,0.062803,-0.159911,0.380205,0.392743,0.46782,0.599211,0.563218,0.030705,-0.020647,0.799138
Overall_Cond,-0.06699,-0.068,-0.034535,-0.094219,1.0,-0.368552,0.048442,-0.050935,-0.136819,-0.173344,...,-0.005684,-0.08632,-0.089193,-0.031307,-0.303384,-0.181146,-0.153389,-0.006937,0.031146,-0.101191
Year_Built,0.036337,0.115898,0.023109,0.596898,-0.368552,1.0,0.61198,0.27987,0.128998,0.407526,...,-0.055405,-0.137929,0.111533,0.170453,0.715571,0.536902,0.479667,0.014347,-0.013153,0.558283
Year_Remod_Add,0.04293,0.086672,0.021394,0.569252,0.048442,0.61198,1.0,0.15179,0.164805,0.297481,...,-0.02213,-0.142587,0.196829,0.132884,0.582144,0.425138,0.376177,0.0176,0.032756,0.532652
BsmtFin_SF_1,-0.060075,0.199706,0.191555,0.284118,-0.050935,0.27987,0.15179,1.0,-0.477875,0.536547,...,-0.118959,-0.086738,0.047631,0.295882,0.156938,0.255501,0.309888,-0.001155,0.022397,0.432914
Bsmt_Unf_SF,-0.130421,0.108131,0.023658,0.270058,-0.136819,0.128998,0.164805,-0.477875,1.0,0.411726,...,0.188146,0.065579,0.251131,0.001389,0.160537,0.179379,0.164185,0.021569,-0.036384,0.182855
Total_Bsmt_SF,-0.219445,0.329714,0.253589,0.547294,-0.173344,0.407526,0.297481,0.536547,0.411726,1.0,...,0.051941,-0.038819,0.28075,0.333086,0.292193,0.437541,0.485453,0.016678,-0.010405,0.63228


In [7]:

# Calculate the correlation matrix
correlation_matrix2 = integer_data.corr()

# Extract the correlation of all features with 'SalePrice'
saleprice_corr2 = correlation_matrix2['SalePrice']

# Keep only features whose correlation with 'SalePrice' is at least 0.1 in
# absolute value; anything strictly between -0.1 and 0.1 is dropped
filtered_corr2 = saleprice_corr2[
    (saleprice_corr2 >= 0.1) | (saleprice_corr2 <= -0.1)
]

# Display the filtered correlations for features 1 to 50
print("\nFeatures with Correlation not starting with 0.0 or -0.0 with 'SalePrice' (Index 1 to 50):")
print(filtered_corr2.iloc[0:50])


Features with Correlation not starting with 0.0 or -0.0 with 'SalePrice' (Index 1 to 50):
Lot_Frontage      0.341489
Lot_Area          0.266404
Overall_Qual      0.799138
Overall_Cond     -0.101191
Year_Built        0.558283
Year_Remod_Add    0.532652
BsmtFin_SF_1      0.432914
Bsmt_Unf_SF       0.182855
Total_Bsmt_SF     0.632280
First_Flr_SF      0.621604
Gr_Liv_Area       0.706628
Full_Bath         0.545407
Half_Bath         0.284834
Bedroom_AbvGr     0.143530
Kitchen_AbvGr    -0.119938
TotRms_AbvGrd     0.495140
Fireplaces        0.474356
Garage_Yr_Blt     0.441951
Garage_Cars       0.647665
Garage_Area       0.640228
SalePrice         1.000000
Name: SalePrice, dtype: float64


**We keep the features whose correlation with `SalePrice` is at least 0.1 in absolute value and drop those closer to 0. In particular, `MS_SubClass`, `Yr_Sold`, and `Mo_Sold` fall below this threshold and do not appear in the filtered list above, so we drop them.**
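
The same selection can be written more compactly with an absolute-value threshold. The following is a minimal sketch, not the notebook's own cell: the Series `corr_with_target` is a hand-built stand-in holding a few of the correlations printed above, and the 0.1 cutoff matches the one used in the notebook.

```python
import pandas as pd

# Hand-built stand-in for correlation_matrix2['SalePrice'] (a few values from the output)
corr_with_target = pd.Series({
    "Overall_Qual": 0.799138,
    "Overall_Cond": -0.101191,
    "Kitchen_AbvGr": -0.119938,
    "MS_SubClass": -0.085508,
    "SalePrice": 1.0,
})

threshold = 0.1
strength = corr_with_target.drop("SalePrice").abs()  # the target itself is not a feature

kept = strength[strength >= threshold].sort_values(ascending=False)
dropped = strength[strength < threshold]

print("kept:", kept.index.tolist())
print("dropped:", dropped.index.tolist())
```

Sorting by absolute value puts the strongest linear relationships first regardless of sign.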

Scale 'Lot_Area' feature 

In [8]:
# Remove the specific columns
NumCorr = NumCorr.drop(columns=['MS_SubClass', 'Yr_Sold', 'Mo_Sold'])

In [9]:
display(NumCorr.head())

Unnamed: 0,Lot_Frontage,Lot_Area,Overall_Qual,Overall_Cond,Year_Built,Year_Remod_Add,BsmtFin_SF_1,Bsmt_Unf_SF,Total_Bsmt_SF,First_Flr_SF,...,Full_Bath,Half_Bath,Bedroom_AbvGr,Kitchen_AbvGr,TotRms_AbvGrd,Fireplaces,Garage_Yr_Blt,Garage_Cars,Garage_Area,SalePrice
0,41,14,55,50,63,16,11,18,17,27,...,25,0,37,33,38,50,20,40,35,27
1,20,4,44,62,64,18,8,11,14,11,...,25,0,25,33,23,0,21,20,49,12
2,20,6,55,62,62,13,16,17,21,20,...,25,50,37,33,30,0,20,20,20,21
3,24,4,66,50,69,29,18,44,34,37,...,50,50,37,33,46,50,23,40,35,31
4,18,5,44,50,90,79,14,5,15,12,...,50,50,37,33,30,25,32,40,32,23


In [10]:
NumCorr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2929 entries, 0 to 2928
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Lot_Frontage    2929 non-null   int32
 1   Lot_Area        2929 non-null   int32
 2   Overall_Qual    2929 non-null   int32
 3   Overall_Cond    2929 non-null   int32
 4   Year_Built      2929 non-null   int32
 5   Year_Remod_Add  2929 non-null   int32
 6   BsmtFin_SF_1    2929 non-null   int32
 7   Bsmt_Unf_SF     2929 non-null   int32
 8   Total_Bsmt_SF   2929 non-null   int32
 9   First_Flr_SF    2929 non-null   int32
 10  Gr_Liv_Area     2929 non-null   int32
 11  Full_Bath       2929 non-null   int32
 12  Half_Bath       2929 non-null   int32
 13  Bedroom_AbvGr   2929 non-null   int32
 14  Kitchen_AbvGr   2929 non-null   int32
 15  TotRms_AbvGrd   2929 non-null   int32
 16  Fireplaces      2929 non-null   int32
 17  Garage_Yr_Blt   2929 non-null   int32
 18  Garage_Cars     2929 non-nul

## OLS Regression for the numeric features:

In [11]:
# Predictor names: every numeric column of the scaled DataFrame except the target
OLS_integer = NumCorr.select_dtypes(include=['number']).columns.difference(['SalePrice'])

# Generate the formula
features = '+'.join(OLS_integer)
formula = f'SalePrice ~ {features}'

# Fit the model
lm = smf.ols(formula, data=NumCorr)
fit = lm.fit()

# Check the results
print(fit.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.820
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     663.1
Date:                Thu, 10 Oct 2024   Prob (F-statistic):               0.00
Time:                        10:12:17   Log-Likelihood:                -8602.0
No. Observations:                2929   AIC:                         1.725e+04
Df Residuals:                    2908   BIC:                         1.737e+04
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept        -10.3536      0.898    -11.


## Model Summary

- **Dependent Variable**: SalePrice
- **R-squared**: 0.820
- **Adjusted R-squared**: 0.819


### Key Metrics

- **R-squared** indicates that approximately 82.0% of the variance in `SalePrice` can be explained by the model.
- The **Adjusted R-squared** value is slightly lower at 81.9%, because it penalizes the number of predictors in the model (20 here).
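
The relationship between the two metrics can be checked by hand with the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper `adjusted_r_squared` below is our own illustrative function, and the inputs are the rounded values from the summary table, so the result is only accurate to about three decimals.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the summary: R² = 0.820 (rounded), 2929 observations, Df Model = 20
adj = adjusted_r_squared(0.820, n_obs=2929, n_predictors=20)
print(round(adj, 3))  # 0.819
```

With almost 3,000 observations and only 20 predictors the penalty is small, which is why the two values differ only in the third decimal.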

## Next Steps

To improve the model's fit, we will next incorporate the categorical features by converting them into dummy variables. This may raise the model's R² value and gives a more complete picture of the factors that affect property sale prices.


**Next**, we will convert the non-integer features into integer format by creating dummy variables. This process allows us to represent categorical data numerically, enabling us to include these features in our analysis.
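
To make the encoding concrete, here is a toy sketch of what `pd.get_dummies` with `drop_first=True` produces. The three-row DataFrame `toy` is made up for illustration. Passing `dtype=int` is our choice to get 0/1 integers on any pandas version, since newer releases default to booleans.

```python
import pandas as pd

# Made-up mini dataset with one categorical column
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "SalePrice": [215000, 105000, 172000],
})

# One column per category, minus the first (alphabetically) as the reference level
dummies = pd.get_dummies(toy[["Street"]], drop_first=True, dtype=int)
print(dummies)
```

Here `Grvl` becomes the reference level, so a single `Street_Pave` column is enough: a 0 in it means `Grvl`. Dropping one level per feature avoids perfect collinearity with the intercept in the regression.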

In [12]:
# Select only non-integer features
non_integer_data = EDA.select_dtypes(exclude=['int64']).copy()

# 'SalePrice' is an int64 column, so the exclusion above removed it; add it back
if 'SalePrice' in EDA.columns:
    non_integer_data = non_integer_data.merge(EDA[['SalePrice']], left_index=True, right_index=True)

# Display the non-integer features including 'SalePrice'
print("\nNon-Integer Features (including 'SalePrice'):")
print(non_integer_data.head())

# Convert non-integer features to dummy variables excluding 'SalePrice'
dummy_data = pd.get_dummies(non_integer_data.drop(columns=['SalePrice']), drop_first=True)

# Combine the dummy variables with 'SalePrice' using merge
dummy_data_with_saleprice = dummy_data.merge(non_integer_data[['SalePrice']], left_index=True, right_index=True)

# Display the resulting dummy DataFrame with 'SalePrice'
print("\nDummy Variables DataFrame (with 'SalePrice'):")
display(dummy_data_with_saleprice.head())


Non-Integer Features (including 'SalePrice'):
  MS_Zoning Street Lot_Shape Land_Contour Utilities Lot_Config Land_Slope  \
0        RL   Pave       IR1          Lvl    AllPub     Corner        Gtl   
1        RH   Pave       Reg          Lvl    AllPub     Inside        Gtl   
2        RL   Pave       IR1          Lvl    AllPub     Corner        Gtl   
3        RL   Pave       Reg          Lvl    AllPub     Corner        Gtl   
4        RL   Pave       IR1          Lvl    AllPub     Inside        Gtl   

  Neighborhood Condition_1 Condition_2  ... Kitchen_Qual Functional  \
0        NAmes        Norm        Norm  ...           TA        Typ   
1        NAmes       Feedr        Norm  ...           TA        Typ   
2        NAmes        Norm        Norm  ...           Gd        Typ   
3        NAmes        Norm        Norm  ...           Ex        Typ   
4      Gilbert        Norm        Norm  ...           TA        Typ   

  Garage_Type Garage_Finish Garage_Qual Garage_Cond Paved_Drive

Unnamed: 0,MS_Zoning_C,MS_Zoning_FV,MS_Zoning_I,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Street_Pave,Lot_Shape_IR2,Lot_Shape_IR3,Lot_Shape_Reg,...,Sale_Type_New,Sale_Type_Oth,Sale_Type_VWD,Sale_Type_WD,Sale_Condition_AdjLand,Sale_Condition_Alloca,Sale_Condition_Family,Sale_Condition_Normal,Sale_Condition_Partial,SalePrice
0,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,215000
1,0,0,0,1,0,0,1,0,0,1,...,0,0,0,1,0,0,0,1,0,105000
2,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,172000
3,0,0,0,0,1,0,1,0,0,1,...,0,0,0,1,0,0,0,1,0,244000
4,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,189900


**Some feature names contain spaces. We replace these spaces with underscores (`_`) so that the names can be used directly in a regression formula.**

In [13]:
# Replace spaces in the column names with underscores
# (this also covers names such as 'Wd Sdng', which becomes 'Wd_Sdng')
dummy_data_with_saleprice.columns = dummy_data_with_saleprice.columns.str.replace(' ', '_', regex=False)

# Display the updated column names
print(dummy_data_with_saleprice.columns)

Index(['MS_Zoning_C', 'MS_Zoning_FV', 'MS_Zoning_I', 'MS_Zoning_RH',
       'MS_Zoning_RL', 'MS_Zoning_RM', 'Street_Pave', 'Lot_Shape_IR2',
       'Lot_Shape_IR3', 'Lot_Shape_Reg',
       ...
       'Sale_Type_New', 'Sale_Type_Oth', 'Sale_Type_VWD', 'Sale_Type_WD_',
       'Sale_Condition_AdjLand', 'Sale_Condition_Alloca',
       'Sale_Condition_Family', 'Sale_Condition_Normal',
       'Sale_Condition_Partial', 'SalePrice'],
      dtype='object', length=207)
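
The trailing underscore in `Sale_Type_WD_` suggests that the original category value ended with a space. The sketch below reproduces that effect on a small hand-written index and shows one possible variant that strips surrounding whitespace first. The example names are illustrative assumptions, not a dump of the real columns.

```python
import pandas as pd

# Illustrative column names; the trailing space in 'Sale_Type_WD ' is an assumption
cols = pd.Index(["Exterior_First_Wd Sdng", "Sale_Type_WD ", "Street_Pave"])

as_in_notebook = cols.str.replace(" ", "_", regex=False)
stripped_first = cols.str.strip().str.replace(" ", "_", regex=False)

print(as_in_notebook.tolist())
print(stripped_first.tolist())
```

If the stripped variant were adopted, later cells that refer to `Sale_Type_WD_` would need to use `Sale_Type_WD` instead.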


In [14]:
display(dummy_data_with_saleprice.head())

Unnamed: 0,MS_Zoning_C,MS_Zoning_FV,MS_Zoning_I,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Street_Pave,Lot_Shape_IR2,Lot_Shape_IR3,Lot_Shape_Reg,...,Sale_Type_New,Sale_Type_Oth,Sale_Type_VWD,Sale_Type_WD_,Sale_Condition_AdjLand,Sale_Condition_Alloca,Sale_Condition_Family,Sale_Condition_Normal,Sale_Condition_Partial,SalePrice
0,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,215000
1,0,0,0,1,0,0,1,0,0,1,...,0,0,0,1,0,0,0,1,0,105000
2,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,172000
3,0,0,0,0,1,0,1,0,0,1,...,0,0,0,1,0,0,0,1,0,244000
4,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,189900


**Then, we will analyze the correlation among the dummy variables.**

In [15]:
dummy_data_with_saleprice.corr()

Unnamed: 0,MS_Zoning_C,MS_Zoning_FV,MS_Zoning_I,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Street_Pave,Lot_Shape_IR2,Lot_Shape_IR3,Lot_Shape_Reg,...,Sale_Type_New,Sale_Type_Oth,Sale_Type_VWD,Sale_Type_WD_,Sale_Condition_AdjLand,Sale_Condition_Alloca,Sale_Condition_Family,Sale_Condition_Normal,Sale_Condition_Partial,SalePrice
MS_Zoning_C,1.000000,-0.020710,-0.002425,-0.008950,-0.172711,-0.040100,-0.284560,-0.015144,-0.006876,0.054977,...,-0.027656,-0.004541,-0.001715,-0.017919,-0.005951,0.073905,-0.011720,-0.073998,-0.028033,-0.117380
MS_Zoning_FV,-0.020710,1.000000,-0.005835,-0.021530,-0.415483,-0.096468,0.014316,0.003973,-0.016542,0.025940,...,0.179836,-0.010925,-0.004125,-0.133563,-0.014316,-0.020288,-0.028194,-0.094892,0.176167,0.106639
MS_Zoning_I,-0.002425,-0.005835,1.000000,-0.002521,-0.048658,-0.011297,-0.202937,-0.004266,-0.001937,-0.007312,...,-0.007792,-0.001279,-0.000483,-0.028056,-0.001677,-0.002376,-0.003302,0.012088,-0.007898,-0.032900
MS_Zoning_RH,-0.008950,-0.021530,-0.002521,1.000000,-0.179548,-0.041688,0.006187,0.051671,-0.007149,0.013825,...,-0.028751,-0.004721,-0.001783,0.017009,-0.006187,-0.008767,-0.012184,-0.021038,-0.029142,-0.053638
MS_Zoning_RL,-0.172711,-0.415483,-0.048658,-0.179548,1.000000,-0.804499,0.068108,0.036169,0.006483,-0.246002,...,0.010555,0.009523,0.009928,0.019176,-0.004005,-0.005676,0.001992,0.039624,0.008495,0.243619
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Sale_Condition_Alloca,0.073905,-0.020288,-0.002376,-0.008767,-0.005676,0.002315,-0.172040,-0.014835,-0.006736,0.029632,...,-0.027093,-0.004449,-0.001680,0.035781,-0.005830,1.000000,-0.011481,-0.196556,-0.027462,-0.021609
Sale_Condition_Family,-0.011720,-0.028194,-0.003302,-0.012184,0.001992,0.020811,0.008102,-0.003344,0.065149,0.015993,...,-0.037651,0.050055,-0.002334,-0.014727,-0.008102,-0.011481,1.000000,-0.273156,-0.038164,-0.036919
Sale_Condition_Normal,-0.073998,-0.094892,0.012088,-0.021038,0.039624,0.034979,0.012430,0.013466,-0.014363,0.002790,...,-0.644580,-0.087489,0.008546,0.619905,-0.138700,-0.196556,-0.273156,1.000000,-0.653350,-0.142511
Sale_Condition_Partial,-0.028033,0.176167,-0.007898,-0.029142,0.008495,-0.096715,0.019378,0.012744,0.011070,-0.075546,...,0.986577,-0.014788,-0.005583,-0.753013,-0.019378,-0.027462,-0.038164,-0.653350,1.000000,0.350093


**We will save the dummy-encoded data in a file named `dummy_Ames_Housing_Data.csv`.**

In [16]:
dummy_data_with_saleprice.to_csv(r'C:\Users\norah\Desktop\Najm AI\dummy_Ames_Housing_Data.csv', index=False)

**Now, we will calculate the correlation of each dummy variable with `SalePrice`.**

In [17]:

# Calculate the correlation matrix
correlation_matrix = dummy_data_with_saleprice.corr()

# Extract the correlation of all features with 'SalePrice'
saleprice_corr = correlation_matrix['SalePrice']

# Keep only features whose correlation with 'SalePrice' is at least 0.1 in
# absolute value; anything strictly between -0.1 and 0.1 is dropped
filtered_corr = saleprice_corr[
    (saleprice_corr >= 0.1) | (saleprice_corr <= -0.1)
]

# Display positions 50-99 of the filtered correlations (the printed label says 1 to 50)
print("\nFeatures with Correlation not starting with 0.0 or -0.0 with 'SalePrice' (Index 1 to 50):")
print(filtered_corr.iloc[50:100])


Features with Correlation not starting with 0.0 or -0.0 with 'SalePrice' (Index 1 to 50):
Bsmt_Exposure_Gd          0.355628
Bsmt_Exposure_No         -0.341893
BsmtFin_Type_1_BLQ       -0.122427
BsmtFin_Type_1_GLQ        0.391136
BsmtFin_Type_1_Rec       -0.157358
BsmtFin_Type_1_Unf       -0.107376
Heating_QC_Fa            -0.130511
Heating_QC_Gd            -0.132243
Heating_QC_TA            -0.338079
Central_Air_Y             0.264700
Electrical_FuseF         -0.123943
Electrical_SBrkr          0.238944
Kitchen_Qual_Fa          -0.146794
Kitchen_Qual_Gd           0.304230
Kitchen_Qual_TA          -0.526527
Functional_Typ            0.119496
Garage_Type_Attchd        0.248278
Garage_Type_BuiltIn       0.223390
Garage_Type_Detchd       -0.364625
Garage_Finish_RFn         0.168963
Garage_Finish_Unf        -0.519123
Garage_Qual_Fa           -0.165893
Garage_Qual_TA            0.125312
Garage_Cond_Fa           -0.147840
Garage_Cond_TA            0.151068
Paved_Drive_Y             0.273535

**Again, we keep only the features whose correlation with `SalePrice` is at least 0.1 in absolute value and exclude those closer to 0.**

In [18]:
# Get the features from the filtered correlation
selected_features = filtered_corr.index.tolist()

# Create a new DataFrame with the selected features
Dummy_data = dummy_data_with_saleprice[selected_features]

# Display the shape of the new DataFrame
Dummy_data_shape = Dummy_data.shape
print(f"Shape of Dummy DataFrame: {Dummy_data_shape}")

# Display positions 50-99 of the filtered correlations
print("\nFiltered Correlations with 'SalePrice':")
print(filtered_corr.iloc[50:100])  # Adjust the range as needed

Shape of Dummy DataFrame: (2929, 81)

Filtered Correlations with 'SalePrice':
Bsmt_Exposure_Gd          0.355628
Bsmt_Exposure_No         -0.341893
BsmtFin_Type_1_BLQ       -0.122427
BsmtFin_Type_1_GLQ        0.391136
BsmtFin_Type_1_Rec       -0.157358
BsmtFin_Type_1_Unf       -0.107376
Heating_QC_Fa            -0.130511
Heating_QC_Gd            -0.132243
Heating_QC_TA            -0.338079
Central_Air_Y             0.264700
Electrical_FuseF         -0.123943
Electrical_SBrkr          0.238944
Kitchen_Qual_Fa          -0.146794
Kitchen_Qual_Gd           0.304230
Kitchen_Qual_TA          -0.526527
Functional_Typ            0.119496
Garage_Type_Attchd        0.248278
Garage_Type_BuiltIn       0.223390
Garage_Type_Detchd       -0.364625
Garage_Finish_RFn         0.168963
Garage_Finish_Unf        -0.519123
Garage_Qual_Fa           -0.165893
Garage_Qual_TA            0.125312
Garage_Cond_Fa           -0.147840
Garage_Cond_TA            0.151068
Paved_Drive_Y             0.273535
Sale_Type_Ne

In [19]:
display(Dummy_data.head())

Unnamed: 0,MS_Zoning_C,MS_Zoning_FV,MS_Zoning_RL,MS_Zoning_RM,Lot_Shape_Reg,Land_Contour_HLS,Lot_Config_CulDSac,Neighborhood_BrkSide,Neighborhood_Edwards,Neighborhood_IDOTRR,...,Garage_Qual_Fa,Garage_Qual_TA,Garage_Cond_Fa,Garage_Cond_TA,Paved_Drive_Y,Sale_Type_New,Sale_Type_WD_,Sale_Condition_Normal,Sale_Condition_Partial,SalePrice
0,0,0,1,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,1,0,215000
1,0,0,0,0,1,0,0,0,0,0,...,0,1,0,1,1,0,1,1,0,105000
2,0,0,1,0,0,0,0,0,0,0,...,0,1,0,1,1,0,1,1,0,172000
3,0,0,1,0,1,0,0,0,0,0,...,0,1,0,1,1,0,1,1,0,244000
4,0,0,1,0,0,0,0,0,0,0,...,0,1,0,1,1,0,1,1,0,189900


In [20]:
Dummy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2929 entries, 0 to 2928
Data columns (total 81 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   MS_Zoning_C             2929 non-null   uint8
 1   MS_Zoning_FV            2929 non-null   uint8
 2   MS_Zoning_RL            2929 non-null   uint8
 3   MS_Zoning_RM            2929 non-null   uint8
 4   Lot_Shape_Reg           2929 non-null   uint8
 5   Land_Contour_HLS        2929 non-null   uint8
 6   Lot_Config_CulDSac      2929 non-null   uint8
 7   Neighborhood_BrkSide    2929 non-null   uint8
 8   Neighborhood_Edwards    2929 non-null   uint8
 9   Neighborhood_IDOTRR     2929 non-null   uint8
 10  Neighborhood_MeadowV    2929 non-null   uint8
 11  Neighborhood_NAmes      2929 non-null   uint8
 12  Neighborhood_NoRidge    2929 non-null   uint8
 13  Neighborhood_NridgHt    2929 non-null   uint8
 14  Neighborhood_OldTown    2929 non-null   uint8
 15  Neighborhood_Sawyer  

**After selecting the dummy features whose correlation with `SalePrice` is at least 0.1 in absolute value, we will perform OLS regression by combining the integer features with the dummy data. We will add the dummy features one by one and observe whether the R² and adjusted R² values increase or decrease. We will retain only those features that contribute to an increase in these metrics.**
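
The cell that follows appends all candidate features in a single pass and refits once. For reference, the one-by-one procedure described above could be sketched as below. This is an illustration under stated assumptions, not the notebook's implementation: the data are synthetic, the feature names are hypothetical, and the fit uses `np.linalg.lstsq` in place of `smf.ols` to keep the example self-contained.

```python
import numpy as np

def fit_r2(X, y):
    """Return (R², adjusted R²) for an OLS fit of y on X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

# Synthetic data: one numeric base feature and two 0/1 dummy candidates
rng = np.random.default_rng(42)
n = 500
features = {
    "base": rng.normal(size=n),
    "useful_dummy": rng.integers(0, 2, size=n).astype(float),
    "noise_dummy": rng.integers(0, 2, size=n).astype(float),
}
y = 2.0 * features["base"] + 1.5 * features["useful_dummy"] + rng.normal(size=n)

selected = ["base"]
_, best_adj = fit_r2(np.column_stack([features[f] for f in selected]), y)

for candidate in ["useful_dummy", "noise_dummy"]:
    trial = selected + [candidate]
    r2, adj = fit_r2(np.column_stack([features[f] for f in trial]), y)
    print(f"{candidate}: R2={r2:.4f}, adjusted R2={adj:.4f}")
    if adj > best_adj:  # keep the candidate only if adjusted R² improves
        selected, best_adj = trial, adj

print("selected:", selected)
```

The acceptance test uses adjusted R² because plain R² can never decrease when a predictor is added to an OLS model, so it cannot be used on its own to reject a feature.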

In [21]:

# Generate the formula

Numformula = 'SalePrice ~ MS_SubClass+Lot_Frontage+Lot_Area+Overall_Qual+Overall_Cond+Year_Built+Year_Remod_Add+BsmtFin_SF_1+Bsmt_Unf_SF+Total_Bsmt_SF+First_Flr_SF+Gr_Liv_Area+Full_Bath+Half_Bath+Bedroom_AbvGr+Kitchen_AbvGr+TotRms_AbvGrd+Fireplaces+Garage_Yr_Blt+Garage_Cars+Garage_Area+Mo_Sold+Yr_Sold'


# Fit the initial model with base features from integer_data
lm = smf.ols(Numformula, data=integer_data)
fit = lm.fit()

# Display initial summary
print(fit.summary())

# New features to add from Dummy_data
new_features = ['Land_Contour_HLS','Neighborhood_NoRidge','Neighborhood_NridgHt','Neighborhood_Somerst','Neighborhood_StoneBr',
               'Condition_1_Feedr','Condition_2_PosA','Bldg_Type_2fmCon','Bldg_Type_Duplex','Roof_Style_Gable',
               'Roof_Matl_WdShngl','Exterior_First_CemntBd','Bsmt_Qual_Gd','Bsmt_Qual_TA','Bsmt_Exposure_Gd','Bsmt_Exposure_No',
               'BsmtFin_Type_1_GLQ','Kitchen_Qual_TA','Functional_Typ','Garage_Finish_RFn','Sale_Type_New','Sale_Type_WD_']  # From Dummy_data

# Append every new feature to the formula and copy its column into integer_data
for feature in new_features:
    Numformula += f' + {feature}'

    # Copy the feature from Dummy_data into integer_data (if it is not already there)
    if feature not in integer_data.columns:
        integer_data[feature] = Dummy_data[feature]

# After adding all features, refit the model with the final formula
lm = smf.ols(Numformula, data=integer_data)
fit = lm.fit()  # estimate the model coefficients

# Print the final summary after all features have been added
print("\nUpdated model with all added features:")
print(fit.summary())



# SalePrice is already in integer_data; this block re-assigns it from Dummy_data
# (when the row counts match) and moves it to the last column
if 'SalePrice' in Dummy_data.columns:
    SalePrice = Dummy_data['SalePrice']

    if len(SalePrice) == len(integer_data):
        integer_data['SalePrice'] = SalePrice.values  # positional assignment, ignores the index

        # Move SalePrice to the last column
        integer_data = integer_data[[col for col in integer_data.columns if col != 'SalePrice'] + ['SalePrice']]


                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.825
Model:                            OLS   Adj. R-squared:                  0.824
Method:                 Least Squares   F-statistic:                     595.5
Date:                Thu, 10 Oct 2024   Prob (F-statistic):               0.00
Time:                        10:12:49   Log-Likelihood:                -34666.
No. Observations:                2929   AIC:                         6.938e+04
Df Residuals:                    2905   BIC:                         6.952e+04
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       5.934e+05   9.65e+05      0.

In [22]:
integer_data.to_csv(r'C:\Users\norah\Desktop\Najm AI\Selected_Features.csv', index=False)

# Report on OLS Regression Analysis for SalePrice with the Dummy Data

## Model Development

### Initial Model Specification

The initial regression formula was constructed using the following integer features:

```plaintext
SalePrice ~ MS_SubClass + Lot_Frontage + Lot_Area + Overall_Qual + Overall_Cond + Year_Built + Year_Remod_Add + BsmtFin_SF_1 + Bsmt_Unf_SF + Total_Bsmt_SF + First_Flr_SF + Gr_Liv_Area + Full_Bath + Half_Bath + Bedroom_AbvGr + Kitchen_AbvGr + TotRms_AbvGrd + Fireplaces + Garage_Yr_Blt + Garage_Cars + Garage_Area + Mo_Sold + Yr_Sold
```

We fitted the initial model with this formula on the unscaled integer features (`integer_data`).

### Initial Model Results

The summary of the initial model is as follows:

- **R-squared**: 0.825
- **Adjusted R-squared**: 0.824

### Addition of Dummy Variables

To enhance the model, we added the following 22 features from the dummy dataset:

- `Land_Contour_HLS`
- `Neighborhood_NoRidge`
- `Neighborhood_NridgHt`
- `Neighborhood_Somerst`
- `Neighborhood_StoneBr`
- `Condition_1_Feedr`
- `Condition_2_PosA`
- `Bldg_Type_2fmCon`
- `Bldg_Type_Duplex`
- `Roof_Style_Gable`
- `Roof_Matl_WdShngl`
- `Exterior_First_CemntBd`
- `Bsmt_Qual_Gd`
- `Bsmt_Qual_TA`
- `Bsmt_Exposure_Gd`
- `Bsmt_Exposure_No`
- `BsmtFin_Type_1_GLQ`
- `Kitchen_Qual_TA`
- `Functional_Typ`
- `Garage_Finish_RFn`
- `Sale_Type_New`
- `Sale_Type_WD_`


### Final Model Results

The final model, after incorporating all features, yielded the following results:

- **R-squared**: 0.879
- **Adjusted R-squared**: 0.877

## Conclusion

The OLS regression analysis identified a set of predictors of `SalePrice` for which the final model achieves an R-squared value of 0.879 (adjusted R-squared 0.877). This means that about 88% of the variance in property prices in this dataset is explained by the selected features. Adding the dummy variables raised R-squared from 0.825 to 0.879, which highlights the importance of categorical factors in predicting housing prices.