# Linear Regression Model Example
### This repository showcases a Python implementation of a linear regression model using a real estate dataset from OpenML, featuring 79 explanatory variables that capture various aspects of residential homes in Ames, Iowa.

## Summary:


* We use OpenML to import the data.
* We hand-pick 20 features of the homes that we consider important.
* We did not check for multicollinearity (two or more features being highly correlated with each other), which is a weakness of this linear regression: correlated predictors make individual coefficients and their p-values unstable.
* For the categorical features, we use `OneHotEncoder` to convert the categorical values of each column into dummy variables. Keep in mind that this method can become very expensive if your data contains many unique categorical values.
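
The multicollinearity check that was skipped could look roughly like the sketch below. It computes a variance inflation factor (VIF) per feature by regressing each column on the others, using only NumPy. The function name `variance_inflation_factors` and the synthetic columns are assumptions made for illustration; they are not part of the notebook, and the data is not the real Ames data.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of a 2-D numeric array X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic stand-ins for housing features (hypothetical, not the Ames data)
rng = np.random.default_rng(0)
living_area = rng.normal(1500, 400, size=500)
first_floor = 0.9 * living_area + rng.normal(0, 50, size=500)  # nearly a copy of living_area
lot_area = rng.normal(10000, 3000, size=500)  # unrelated to the other two

names = ["living_area", "first_floor", "lot_area"]
vifs = variance_inflation_factors(np.column_stack([living_area, first_floor, lot_area]))
for name, vif in zip(names, vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others. In this toy data the two area columns should land well above that range, while the independent column should stay close to 1.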





In [7]:
pip install openml



In [8]:
import openml
import pandas as pd

# Load the dataset by OpenML ID (Ames dataset is ID 42165)
ames = openml.datasets.get_dataset(42165)
df_ames, *_ = ames.get_data()
print(df_ames.head())


   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave  None      Reg   
1   2          20       RL         80.0     9600   Pave  None      Reg   
2   3          60       RL         68.0    11250   Pave  None      IR1   
3   4          70       RL         60.0     9550   Pave  None      IR1   
4   5          60       RL         84.0    14260   Pave  None      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0   None  None        None       0      2   
1         Lvl    AllPub  ...        0   None  None        None       0      5   
2         Lvl    AllPub  ...        0   None  None        None       0      9   
3         Lvl    AllPub  ...        0   None  None        None       0      2   
4         Lvl    AllPub  ...        0   None  None        None       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

In [9]:
df_ames.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [11]:
import statsmodels.api as sm
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load the Ames Housing dataset
ames = fetch_openml(name="house_prices", as_frame=True)
df_ames = ames.data.copy()  # Copy so the fetched data is not modified in place
df_ames['SalePrice'] = ames.target  # Adding target to the DataFrame

# Selecting features and target
features = [
    'GrLivArea', 'LotArea', '1stFlrSF', '2ndFlrSF', 'TotalBsmtSF', 'GarageArea',
    'YearBuilt', 'YearRemodAdd', 'OverallQual', 'OverallCond', 'ExterQual',
    'KitchenQual', 'TotRmsAbvGrd', 'BedroomAbvGr', 'FullBath', 'HalfBath',
    'Fireplaces', 'GarageCars', 'GarageYrBlt', 'OpenPorchSF'
]
X = df_ames[features]
y = df_ames['SalePrice']

# Define categorical features
categorical_features = ['ExterQual', 'KitchenQual']
numeric_features = [col for col in features if col not in categorical_features]

# Preprocess numeric and categorical features
numeric_transformer = SimpleImputer(strategy='median')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first'))  # Drop the first level to avoid perfect collinearity with the intercept (dummy-variable trap)
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Preprocess and prepare the data
X_processed = preprocessor.fit_transform(X)

# Retrieve feature names for numeric and one-hot encoded columns
ohe_columns = preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical_features)
all_feature_names = list(numeric_features) + list(ohe_columns)

# Convert X_processed to a DataFrame with feature names, keeping the original
# index so the rows stay aligned with y
X_processed_df = pd.DataFrame(X_processed, columns=all_feature_names, index=X.index)

# Add a constant for the intercept term in the Statsmodels linear model
X_processed_df = sm.add_constant(X_processed_df, has_constant='add')  # This adds the 'const' column

# Fit the Statsmodels OLS regression model on the entire dataset
model = sm.OLS(y, X_processed_df).fit()

# Print the summary of the model with feature names
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.815
Model:                            OLS   Adj. R-squared:                  0.812
Method:                 Least Squares   F-statistic:                     263.1
Date:                Sun, 27 Oct 2024   Prob (F-statistic):               0.00
Time:                        12:56:48   Log-Likelihood:                -17313.
No. Observations:                1460   AIC:                         3.468e+04
Df Residuals:                    1435   BIC:                         3.481e+04
Df Model:                          24                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const          -9.279e+05   1.49e+05     -6.
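
The summary reports `Df Model: 24`, which matches the 18 numeric features plus six dummy columns from the two categorical features. To gauge how wide the design matrix would get before one-hot encoding more columns, a quick count of distinct levels can help. The sketch below uses a small made-up frame; the column names mirror the Ames data, but the rows are hypothetical.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror the Ames data
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "TA", "Fa", "Gd"],
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "NoRidge", "Mitchel", "Somerst"],
})

# Number of distinct levels per categorical column
levels = df.nunique()

# With drop-first encoding, each column contributes (levels - 1) dummy columns
dummy_counts = levels - 1
print(dummy_counts)
print("Total dummy columns:", int(dummy_counts.sum()))

# Cross-check the count against an actual drop-first encoding
encoded = pd.get_dummies(df, drop_first=True)
print("Encoded shape:", encoded.shape)
```

A column with many distinct values, such as a neighborhood or an identifier, adds one dummy per extra level, so running this count first shows which columns would dominate the width of the encoded matrix.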

## Final thoughts:


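* With a regression model we can observe which features of the home play an important role in determining the price, by reading the sign, size, and significance of each coefficient.
* Only coefficients with a p-value below 0.05 are statistically significant at the 5% level; features with larger p-values should not be interpreted as having a demonstrated effect on price.
* An R-squared of about 0.8 (0.815 in the summary above) means that roughly 80% of the variation in the outcome (here, house prices) is explained by the predictors in the model. In other words, the model captures the main patterns in the data, though about 20% of the variation remains unexplained or due to factors outside the model. Because the model was fit on the entire dataset, this is an in-sample figure.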
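
To make the R-squared point concrete, the sketch below computes it directly from its definition, 1 - SS_res / SS_tot, and compares an in-sample value with one measured on held-out rows. Everything here is an assumption for illustration: the data is synthetic, the helper `r_squared` is not part of the notebook, and the fit uses plain NumPy least squares rather than the Statsmodels model above.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic housing-like data (hypothetical, not the Ames data)
rng = np.random.default_rng(42)
n = 400
area = rng.uniform(800, 3000, size=n)
quality = rng.integers(1, 11, size=n).astype(float)
price = 50_000 + 80 * area + 15_000 * quality + rng.normal(0, 30_000, size=n)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), area, quality])

# Fit on the first 300 rows, evaluate on the remaining 100
train = np.arange(n) < 300
test = ~train
coef, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)

r2_train = r_squared(price[train], X[train] @ coef)
r2_test = r_squared(price[test], X[test] @ coef)
print(f"in-sample R^2: {r2_train:.3f}")
print(f"held-out R^2:  {r2_test:.3f}")
```

The in-sample value can only stay the same or increase as more predictors are added, so a held-out value like `r2_test` is usually the more honest indicator of how well the model would price unseen homes.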

