# **Modeling and Evaluation (Regression) Notebook**

## Objectives
- Fit and evaluate a regression model to predict the sale price of a house in Ames, Iowa

## Inputs
- outputs/data_collected/house_pricing_data.csv
- Instructions on which variables to use for data cleaning and feature engineering, found in notebooks 01 to 03.

## Outputs
- Train set
- Test set
- Data cleaning and Feature Engineering pipeline
- Modeling pipeline


---

## Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd().

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/ci-c5-housing-market-prices/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/ci-c5-housing-market-prices'

---

## Step 1: Load data

In [18]:
import numpy as np
import pandas as pd
# No columns are dropped at load time. SalePrice is the regression target
# and is separated from the features when the train and test sets are built.
df = pd.read_csv("outputs/data_collected/house_pricing_data.csv")

print(df.shape)
df.head(3)

(1460, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500


In [19]:
# Count missing values per column
missing_count_per_column = df.isnull().sum()

# Filter and sort columns with missing values
missing_columns = missing_count_per_column[missing_count_per_column > 0].sort_values(ascending=False)

print("\nColumns with Missing Values and Their Counts:")
print(missing_columns)


Columns with Missing Values and Their Counts:
EnclosedPorch    1324
WoodDeckSF       1305
LotFrontage       259
GarageFinish      235
BsmtFinType1      145
BedroomAbvGr       99
2ndFlrSF           86
GarageYrBlt        81
BsmtExposure       38
MasVnrArea          8
dtype: int64
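
The raw counts are easier to weigh against the 1460 rows when expressed as a share of the dataset. The following is a minimal sketch of that calculation. It assumes a small made-up DataFrame (`toy`, with illustrative values only) in place of the housing data, so it can run on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data; values are illustrative only
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "GarageArea": [548, 460, 608, 642],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean().mul(100)

# Keep only columns with missing values, highest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct.round(1))
```

Applied to `df`, the same calculation would put EnclosedPorch at roughly 91% missing (1324 of 1460 rows), which helps explain why such a column is a candidate for dropping rather than imputing.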


---

## Step 2: ML Pipeline with all relevant data

ML pipeline for Data Cleaning and Feature Engineering

In [24]:
# Pipeline and preprocessing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Feature engineering
from feature_engine.selection import SmartCorrelatedSelection  # drops correlated features
from feature_engine.encoding import OrdinalEncoder  # encodes categorical variables
from feature_engine.transformation import LogTransformer  # log transformation
from feature_engine.imputation import MeanMedianImputer  # imputes missing values

def drop_unwanted_columns(X):
    return X.drop(columns=[
        'LotFrontage', 'GarageFinish', '2ndFlrSF', 'GarageYrBlt',
        'EnclosedPorch', 'WoodDeckSF', 'BsmtFinType1', 'LotArea',
        'BsmtUnfSF', 'BedroomAbvGr', 'BsmtExposure', 'OverallCond'
    ])


def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        # Drop unwanted columns
        ("DropUnwantedFeatures", FunctionTransformer(drop_unwanted_columns, validate=False)),

        # Impute MasVnrArea using mean
        ("ImputeMasVnrArea", MeanMedianImputer(imputation_method='mean', variables=['MasVnrArea'])),

        # Encoding categorical variables using OrdinalEncoder
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['KitchenQual'])),

        # Feature selection based on correlation using SmartCorrelatedSelection
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(
            variables=['TotalBsmtSF', '1stFlrSF', 'KitchenQual', 'YearRemodAdd', 'GarageArea'],
            method="spearman", threshold=0.6, selection_method="variance")),

        # Apply log10 transformation to selected numeric features
        ("LogTransformation", LogTransformer(variables=['GrLivArea', 'SalePrice'], base='10'))

    ])

    return pipeline_base


# Instantiate the pipeline to display its steps (nothing is fitted yet)
PipelineDataCleaningAndFeatureEngineering()

In [25]:
# 1. Get the pipeline
pipeline = PipelineDataCleaningAndFeatureEngineering()

# 2. Fit the pipeline and transform the DataFrame
df_transformed = pipeline.fit_transform(df)

# 3. View the result
print(df_transformed.head())

   BsmtFinSF1  GarageArea  GrLivArea  KitchenQual  MasVnrArea  OpenPorchSF  \
0         706         548   3.232996            0       196.0           61   
1         978         460   3.101059            1         0.0            0   
2         486         608   3.251881            0       162.0           42   
3         216         642   3.234770            0         0.0           35   
4         655         836   3.342028            0       350.0           84   

   OverallQual  TotalBsmtSF  YearBuilt  YearRemodAdd  SalePrice  
0            7          856       2003          2003   5.319106  
1            6         1262       1976          1976   5.258877  
2            7          920       2001          2002   5.349278  
3            7          756       1915          1970   5.146128  
4            8         1145       2000          2000   5.397940  




ML pipeline for modelling: import libraries (in progress)

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor  # We'll use a Random Forest model here, but you can try others.
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Feature scaling
from sklearn.preprocessing import StandardScaler

# Feature selection
from sklearn.feature_selection import SelectFromModel

# Other regression algorithms to try
# from sklearn.linear_model import LinearRegression
# from sklearn.tree import DecisionTreeRegressor
# from sklearn.ensemble import GradientBoostingRegressor
# from sklearn.ensemble import ExtraTreesRegressor
# from sklearn.ensemble import AdaBoostRegressor
# from xgboost import XGBRegressor


def PipelineClassf(model):
    """Build the modelling pipeline: scaling, model-based feature selection, then the regressor."""
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base
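
The modelling pipeline is defined above but not yet used. The sketch below shows one way it could be fitted and inspected. It repeats the function so the block runs on its own and, as an assumption, uses synthetic data from `make_regression` in place of the housing features. The names `X_demo`, `y_demo` and `demo_pipeline` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def PipelineClassf(model):
    # Same steps as the notebook's function, repeated so the sketch is self-contained
    return Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])


# Synthetic regression data standing in for the housing features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

demo_pipeline = PipelineClassf(RandomForestRegressor(n_estimators=50, random_state=0))
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask of the features kept by SelectFromModel
selected = demo_pipeline.named_steps["feat_selection"].get_support()
print(f"Features kept: {selected.sum()} of {len(selected)}")
print(f"R2 on the training data: {demo_pipeline.score(X_demo, y_demo):.3f}")
```

With the default threshold, SelectFromModel keeps the features whose importance is at least the mean importance, so the final model is trained on a subset of the columns.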

## Split Train and Test Set

X_train and y_train will be used to train the model.

X_test and y_test will be used to evaluate the model.

In [None]:
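# Define features and target variable, using the DataFrame produced by the
# data cleaning and feature engineering pipeline above
X = df_transformed.drop(columns=['SalePrice'])  # All columns except 'SalePrice' are features
y = df_transformed['SalePrice']  # 'SalePrice' is the target variable (log10-transformed by the pipeline)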

# Split the data into training and test sets (80% train, 20% test is common)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
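
As a quick check on what an 80/20 split produces for a dataset of this size, the sketch below splits placeholder arrays with 1460 rows. The arrays `X_demo` and `y_demo` are assumptions that stand in for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the housing data (1460)
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Fixing random_state makes the split reproducible across runs
X_tr_again, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(X_te, X_te_again)
```

The split leaves 1168 rows for training and 292 for testing.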

## Scale the data

Models that are sensitive to the scale of the data, such as linear models or k-nearest neighbors, need a scaler. Tree-based models such as RandomForest split on thresholds within individual features, so scaling is not strictly necessary for them, but keeping the step makes it easy to swap in a scale-sensitive model later.

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the training data, and transform the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
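
To make the fit/transform distinction concrete, here is a minimal sketch on a toy feature matrix. The values and the `*_demo` names are made up for illustration. The scaler learns the mean and standard deviation from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values): floor area and build year
X_train_demo = np.array([[850.0, 1976.0], [1260.0, 2001.0],
                         [920.0, 2003.0], [1500.0, 1915.0]])
X_test_demo = np.array([[1000.0, 1990.0]])

demo_scaler = StandardScaler()
X_train_demo_scaled = demo_scaler.fit_transform(X_train_demo)  # learns mean and std from the train set
X_test_demo_scaled = demo_scaler.transform(X_test_demo)        # reuses the train statistics

train_means = X_train_demo_scaled.mean(axis=0)  # approximately 0 for each column
train_stds = X_train_demo_scaled.std(axis=0)    # approximately 1 for each column

# The test row is scaled with the train mean and std, not its own
manual = (X_test_demo - demo_scaler.mean_) / demo_scaler.scale_
print(np.allclose(manual, X_test_demo_scaled))
```

Fitting the scaler on the training data alone keeps information from the test set out of the preprocessing step.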

## Choose and train a Model

In [None]:
# Initialize the RandomForestRegressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on the training data
model.fit(X_train_scaled, y_train)

---

## Step 3: Make Predictions and Evaluate the Model

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model using different metrics
mae = mean_absolute_error(y_test, y_pred)  # Mean Absolute Error
mse = mean_squared_error(y_test, y_pred)    # Mean Squared Error
rmse = np.sqrt(mse)  # Root Mean Squared Error (the squared=False argument was removed in recent scikit-learn releases)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
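
If the target is taken from the transformed DataFrame, where SalePrice has been log10-transformed, the metrics above are expressed in log10 units rather than dollars. The sketch below shows one way to convert predictions back before computing a dollar-based error. The prices are made-up values, and `r2_score` is added here as an extra metric that the notebook does not otherwise use.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative prices in dollars (made-up values), expressed on the log10 scale
y_test_log = np.log10(np.array([200_000.0, 150_000.0, 300_000.0]))
y_pred_log = np.log10(np.array([210_000.0, 140_000.0, 300_000.0]))

# Undo the log10 transformation before computing dollar-based errors
y_test_dollars = 10 ** y_test_log
y_pred_dollars = 10 ** y_pred_log

mae_dollars = mean_absolute_error(y_test_dollars, y_pred_dollars)
r2_log = r2_score(y_test_log, y_pred_log)  # R2 on the scale the model was trained on

print(f"MAE in dollars: {mae_dollars:,.0f}")
print(f"R2 on log10 scale: {r2_log:.3f}")
```

Reporting the error in dollars makes the result easier to compare with actual sale prices.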


---

## Step 4: Push files to Repo

The following files will be generated:
- Train Set
- Test Set
- Data cleaning and Feature Engineering pipeline
- Modeling pipeline

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_churn/{version}'

# Create the output folder; do nothing if it already exists
os.makedirs(file_path, exist_ok=True)

## Train Set

In [None]:
print(X_train.shape)
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

## Test Set

In [None]:
print(X_test.shape)
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

## Data cleaning and Feature Engineering pipeline
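
This section has no code yet. As a minimal sketch of how a fitted pipeline could be saved with joblib and loaded again, the block below uses a small stand-in pipeline, a temporary folder in place of `file_path`, and a hypothetical file name (`clean_pipeline_demo.pkl`). All three are assumptions for illustration. The same calls would apply to the modeling pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline (assumption); the notebook's own pipeline would be saved the same way
demo_pipeline = Pipeline([("scaler", StandardScaler())])
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
demo_pipeline.fit(X_demo)

# Temporary folder used here in place of the notebook's file_path
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_file = os.path.join(tmp_dir, "clean_pipeline_demo.pkl")  # hypothetical file name
    joblib.dump(demo_pipeline, pipeline_file)
    reloaded_pipeline = joblib.load(pipeline_file)

# Check that the reloaded pipeline reproduces the original transformation
same_output = np.allclose(demo_pipeline.transform(X_demo),
                          reloaded_pipeline.transform(X_demo))
print(same_output)
```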

## Modeling pipeline

---