# MLOps exercises

## Exercise 1

In this exercise, do the following:
1. <font color='lightgreen'>Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 in the `MLOps.ipynb` notebook.</font>
2. <font color='lightgreen'>Create a function that takes as input a new ames dataset and a model. The function should pre-process the new data and evaluate the model on that new data using mean absolute error.</font>
3. <font color='lightgreen'>Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.</font>
4. <font color='lightgreen'>Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?</font>
5. <font color='lightgreen'>Do you see a data drift in "NewAmesData2.csv"? If so, for which variables?</font>
6. <font color='lightgreen'>Do you see a data drift in "NewAmesData4.csv"? If so, for which variables?</font>
7. Create a function that retrains a model on the new data as well as the old training data.
8. Retrain the `model_final` on the new data "NewAmesData1.csv" as well as the old training data, using the function from 7. Then test the new model on the old test set.
9. Split the "NewAmesData2.csv" dataset into a train and test set. Train the best model from the `MLOps.ipynb` notebook on the training part and test it on the test part. Did you get a better model? Now combine your new training data with the original training data and retrain the model on that. Did that give you a better model?

In [1]:
import os
import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import mlflow
import mlflow.sklearn

In [2]:
AmesHousingRaw = pd.read_csv("AmesHousing.csv")
NewAmesData1 = pd.read_csv("NewAmesData1.csv")
NewAmesData2 = pd.read_csv("NewAmesData2.csv")
NewAmesData4 = pd.read_csv("NewAmesData4.csv")

#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', 80)
display(AmesHousingRaw.head(1))
print("AmesHousing:", AmesHousingRaw.shape)
#print(f"AmesHousingRaw variable types:", AmesHousingRaw.dtypes)
display(NewAmesData1.head(1))
print("NewAmesData1:", NewAmesData1.shape)
display(NewAmesData2.head(1))
print("NewAmesData2:", NewAmesData2.shape)
display(NewAmesData4.head(1))
print("NewAmesData4:", NewAmesData4.shape)

AmesHousingRaw = AmesHousingRaw[AmesHousingRaw["Lot Area"] <= 75000]
num_features = ["Lot Area", "Overall Cond", "Year Built", "Gr Liv Area", "Mo Sold", "Yr Sold", "TotRms AbvGrd"]
cat_features = ["Bldg Type", "Neighborhood"]
target_col = "SalePrice"

cols_needed = num_features + cat_features + [target_col]
AmesHousing = AmesHousingRaw[cols_needed].copy()
display(AmesHousing.head(1))
print("AmesHousing:", AmesHousing.shape)

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000


AmesHousing: (2930, 82)


Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,Bldg Type,Neighborhood,SalePrice
0,10738,6,1954,1457,5,8,2009,1Fam,NAmes,132661


NewAmesData1: (749, 10)


Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,Bldg Type,Neighborhood,SalePrice
0,10738,6,1954,1457,5,8,2009,1Fam,NAmes,-9790.3818


NewAmesData2: (749, 10)


Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,TotRms AbvGrd,Mo Sold,Yr Sold,Bldg Type,Neighborhood,SalePrice
0,25646.750457,3,2001,2104,9,4,2021,1Fam,CollgCr,241432


NewAmesData4: (1474, 10)


Unnamed: 0,Lot Area,Overall Cond,Year Built,Gr Liv Area,Mo Sold,Yr Sold,TotRms AbvGrd,Bldg Type,Neighborhood,SalePrice
0,31770,5,1960,1656,5,2010,7,1Fam,NAmes,215000


AmesHousing: (2926, 10)


<h4>The best model from MLOps.ipynb. In MLOps.ipynb this model had MAE: 19210</h4>

In [3]:
# The same pre-processing is done by the function in Task 1, but we want this cell before the tasks
ames = AmesHousing.copy() 
ames_wd = ames.join(pd.get_dummies(ames["Bldg Type"], drop_first=True, dtype = "int", prefix="BType"))
ames_wd = ames_wd.join(pd.get_dummies(ames_wd["Neighborhood"], drop_first=True, dtype = "int", prefix="Nbh"))
ames_wd = ames_wd.drop(columns = ["Bldg Type", "Neighborhood"])

X_ames = ames_wd.drop(columns=["SalePrice"])
y_ames = ames_wd.SalePrice

FEATURE_ORDER = list(X_ames.columns)  # Incoming data can arrive with a different column order, so save the training order here.

X_train, X_test, y_train, y_test = train_test_split(X_ames, y_ames, test_size=0.2, random_state=1742)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=4217)


model_rf_500 = RandomForestRegressor(n_estimators=500)
model_rf_500.fit(X_train, y_train)
y_pred_rf_500 = model_rf_500.predict(X_val)
mae = mean_absolute_error(y_val, y_pred_rf_500)
print(f"MAE: {mae:.2f}")

MAE: 19276.20


The same model on the same data returned an MAE of 19276 here, slightly different from the 19210 in the original MLOps.ipynb, but close enough. The random forest is not seeded, so the difference is within run-to-run variation.
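
The run-to-run variation can be removed by fixing the forest's seed. The following is a minimal sketch on synthetic stand-in data (the data, the `holdout_mae` helper and the seeds are illustrative assumptions, not part of the notebook); it only relies on the `random_state` parameter of `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (NOT the Ames data), just to keep the sketch self-contained
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)
X_fit, y_fit = X[:150], y[:150]
X_holdout, y_holdout = X[150:], y[150:]

def holdout_mae(seed):
    """Fit a small forest with the given random_state and return its holdout MAE."""
    forest = RandomForestRegressor(n_estimators=20, random_state=seed)
    forest.fit(X_fit, y_fit)
    return mean_absolute_error(y_holdout, forest.predict(X_holdout))

mae_seed_0_first = holdout_mae(0)
mae_seed_0_again = holdout_mae(0)  # same seed: identical forest, identical MAE
mae_seed_1 = holdout_mae(1)        # different seed: a different forest

print(f"seed 0: {mae_seed_0_first:.4f}, seed 0 again: {mae_seed_0_again:.4f}, seed 1: {mae_seed_1:.4f}")
```

With a fixed `random_state`, repeated runs give the same MAE, which makes later comparisons against new datasets easier to interpret.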

<hr>
<h3>Task 1 Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 in the `MLOps.ipynb` notebook.</h3>

In [4]:
def preprocess_data(df):
    """One-hot encode the categorical columns the same way as the original Ames data."""
    # Drop extreme lot sizes, matching the filter applied to the original data
    df = df[df["Lot Area"] <= 75000]
    df = df.join(pd.get_dummies(df["Bldg Type"], drop_first=True, dtype="int", prefix="BType"))
    df = df.join(pd.get_dummies(df["Neighborhood"], drop_first=True, dtype="int", prefix="Nbh"))
    df = df.drop(columns=["Bldg Type", "Neighborhood"])
    return df

In [5]:
# Run function
AmesHousing = preprocess_data(AmesHousing)

<hr>
<h3>Task 2 Create a function that takes as input a new ames dataset and a model. The function should pre-process the new data and evaluate the model on that new data using mean absolute error.</h3>

In [6]:
def preprocess_data_and_model_with_evaluation(df, model):
    df = preprocess_data(df)  # use the function from task 1
    X = df.drop(columns=["SalePrice"])
    y = df["SalePrice"]

    # Enforce the training column set and order; dummy columns missing from the new data become 0
    X = X.reindex(columns=FEATURE_ORDER, fill_value=0)

    y_pred = model.predict(X)

    mae = mean_absolute_error(y, y_pred)

    return f"{mae:.2f}"
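
One caveat with combining `drop_first=True` and `reindex`: if the new data happens to lack the category that was dropped during training, `get_dummies` drops a different category, and `reindex` then fills that column with 0. The sketch below shows this on made-up data (the tiny frames and the variable names are illustrative assumptions, not part of the notebook) and one possible workaround: encode the new data without `drop_first` and let `reindex` select the training columns.

```python
import pandas as pd

# Made-up data: the new batch contains no "1Fam" rows
train = pd.DataFrame({"Bldg Type": ["1Fam", "Duplex", "Twnhs", "1Fam"]})
new = pd.DataFrame({"Bldg Type": ["Duplex", "Twnhs", "Duplex"]})

# Training-time encoding: "1Fam" is the first category and is dropped
train_cols = list(
    pd.get_dummies(train["Bldg Type"], drop_first=True, dtype="int", prefix="BType").columns
)

# Same call on the new data: now "Duplex" is the first category and is dropped,
# so reindex re-creates BType_Duplex filled with 0 and Duplex rows look like "1Fam"
naive = pd.get_dummies(
    new["Bldg Type"], drop_first=True, dtype="int", prefix="BType"
).reindex(columns=train_cols, fill_value=0)

# Workaround: encode every category, then keep exactly the training columns
safe = pd.get_dummies(
    new["Bldg Type"], dtype="int", prefix="BType"
).reindex(columns=train_cols, fill_value=0)

print(train_cols)
print(naive)
print(safe)
```

For the datasets used here this may never trigger, since it only matters when the first category of a column is absent from the new data.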


<h3>Task 3 Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.</h3>

In [7]:
preprocess_data_and_model_with_evaluation(NewAmesData1, model_rf_500)

'19190.21'

<h5>The model returned an MAE of 19190 for NewAmesData1.csv, compared to 19276 on the AmesHousing.csv validation split. The two values are very close, within about 0.5% of each other.</h5>

<hr>
<h3>Task 4 Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?</h3>

In [8]:
preprocess_data_and_model_with_evaluation(NewAmesData2, model_rf_500)

'122807.52'

<ul>
<li>NewAmesData2 MAE: 122808</li>
<li>NewAmesData1 MAE: 19190</li>
<li>AmesHousing MAE: 19276</li>
</ul>
<p>Testing the model trained on <strong>AmesHousing</strong> against <strong>NewAmesData2</strong>, we observe a huge jump from about 19k MAE to over 122k MAE. This suggests that <strong>NewAmesData2</strong> has a large amount of <strong>data drift</strong> or <strong>concept drift</strong> compared to the training data: either the feature distributions (or the presence of outliers) have changed, or the relationship between the features and the target (SalePrice) has shifted.</p>
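
The difference between the two kinds of drift can be illustrated with a toy example. This is a minimal sketch on synthetic data, assuming a stand-in `model` that has learned `y = 2x` exactly (all names and numbers are illustrative and unrelated to the Ames data).

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    # Stand-in for a trained model that has learned y = 2x exactly
    return 2.0 * x

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

x_train = rng.normal(10.0, 1.0, 1000)

# Data drift: the inputs move, the relationship y = 2x is unchanged
x_data_drift = rng.normal(15.0, 1.0, 1000)
y_data_drift = 2.0 * x_data_drift

# Concept drift: the inputs look the same, but the relationship is now y = 5x
x_concept_drift = rng.normal(10.0, 1.0, 1000)
y_concept_drift = 5.0 * x_concept_drift

for name, x_new, y_new in [
    ("data drift", x_data_drift, y_data_drift),
    ("concept drift", x_concept_drift, y_concept_drift),
]:
    feature_shift = abs(x_new.mean() - x_train.mean())
    print(f"{name}: feature mean shift = {feature_shift:.2f}, MAE = {mae(y_new, model(x_new)):.2f}")
```

In this idealized case only concept drift raises the MAE. A real model such as a random forest can also lose accuracy under pure data drift, because it cannot extrapolate beyond the range it was trained on, so a high MAE alone does not tell the two apart; the feature distributions have to be inspected as well.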

<hr>
<h3>Task 5 Do you see a data drift in "NewAmesData2.csv"? If so, for which variables?</h3>

In [17]:
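# Note: NewAmesData2 is aligned here without one-hot encoding, so the BType_/Nbh_ columns are filled with 0
# and their "shifts" below only reflect the original data, not a real change.
NewAmesData2_aligned = NewAmesData2.reindex(columns=ames_wd.columns, fill_value=0)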

mean_diff = (AmesHousing.select_dtypes("number").mean() - NewAmesData2_aligned.select_dtypes("number").mean()).abs().sort_values(ascending=False)
std_diff  = (AmesHousing.select_dtypes("number").std()  - NewAmesData2_aligned.select_dtypes("number").std()).abs().sort_values(ascending=False)

print("Top mean shifts:")
print(mean_diff.head(10))
print("\n")
print("Top std shifts:")
print(std_diff.head(10))


Top mean shifts:
SalePrice        64902.822331
Lot Area           139.006877
Gr Liv Area         17.815246
TotRms AbvGrd        0.572090
Overall Cond         0.504139
Year Built           0.458420
Nbh_NAmes            0.151401
Mo Sold              0.110115
Nbh_CollgCr          0.091251
Nbh_OldTown          0.081681
dtype: float64


Top std shifts:
SalePrice       82028.466107
Lot Area          116.526571
Gr Liv Area        13.967960
Year Built          1.762717
Nbh_NAmes           0.358501
Nbh_CollgCr         0.288015
Nbh_OldTown         0.273926
BType_TwnhsE        0.270767
Nbh_Edwards         0.248852
Nbh_Somerst         0.241561
dtype: float64


<h4>Results and interpretation: </h4>
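<h5>NewAmesData2 shows its largest shift in the target: the mean SalePrice differs by about 64,900 and its standard deviation by about 82,000. The numeric features (Lot Area, Gr Liv Area, Year Built) also differ, but by much smaller absolute amounts (about 139, 18 and 0.5 in the mean), and since these differences are not scaled they cannot be compared directly across variables. Combined with the massive increase in MAE, this indicates severe drift, driven mainly by SalePrice.</h5>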
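
Because raw differences in means are dominated by columns with large values, it can help to divide each difference by the reference standard deviation. The helper below, `standardized_mean_shift`, is a hypothetical name for a minimal sketch of that idea, shown on made-up numbers rather than the Ames data.

```python
import pandas as pd

def standardized_mean_shift(reference, new):
    """Absolute difference in means, in units of the reference standard deviation."""
    ref_num = reference.select_dtypes("number")
    shared = [col for col in ref_num.columns if col in new.columns]
    diff = new[shared].mean() - ref_num[shared].mean()
    # Constant reference columns have std 0; mark them as NaN instead of dividing by zero
    scale = ref_num[shared].std().replace(0, float("nan"))
    return (diff / scale).abs().sort_values(ascending=False)

# Made-up example: Lot Area moves by 100 (raw), Overall Cond by 2 (raw)
reference = pd.DataFrame({"Lot Area": [8000, 10000, 12000], "Overall Cond": [4, 5, 6]})
new = pd.DataFrame({"Lot Area": [8100, 10100, 12100], "Overall Cond": [6, 7, 8]})

shift = standardized_mean_shift(reference, new)
print(shift)
```

In raw units Lot Area looks like the bigger change (100 versus 2), but relative to its spread it moves by 0.05 standard deviations while Overall Cond moves by 2, so the ranking flips.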

<hr>
<h3>Task 6 Do you see a data drift in "NewAmesData4.csv"? If so, for which variables?</h3>

In [9]:
preprocess_data_and_model_with_evaluation(NewAmesData4, model_rf_500)

'28505.67'

There is a drift: the MAE rises from about 19k (original data) to about 28.5k (NewAmesData4).

In [18]:
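# Note: NewAmesData4 is aligned here without one-hot encoding, so the BType_/Nbh_ columns are filled with 0
# and their "shifts" below only reflect the original data, not a real change.
NewAmesData4_aligned = NewAmesData4.reindex(columns=ames_wd.columns, fill_value=0)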

mean_diff = (AmesHousing.select_dtypes("number").mean() - NewAmesData4_aligned.select_dtypes("number").mean()).abs().sort_values(ascending=False)
std_diff  = (AmesHousing.select_dtypes("number").std()  - NewAmesData4_aligned.select_dtypes("number").std()).abs().sort_values(ascending=False)

print("Top mean shifts:")
print(mean_diff.head(10))
print("\n")
print("Top std shifts:")
print(std_diff.head(10))


Top mean shifts:
Lot Area         16427.249514
SalePrice         4150.522919
Gr Liv Area         11.201248
Overall Cond         1.534391
Yr Sold              0.614787
TotRms AbvGrd        0.483968
Mo Sold              0.394079
Nbh_NAmes            0.151401
Year Built           0.139429
Nbh_CollgCr          0.091251
dtype: float64


Top std shifts:
SalePrice       6005.874448
Lot Area         703.166604
Gr Liv Area        8.439674
Yr Sold            5.852759
Year Built         1.774187
Nbh_NAmes          0.358501
Nbh_CollgCr        0.288015
Nbh_OldTown        0.273926
BType_TwnhsE       0.270767
Nbh_Edwards        0.248852
dtype: float64


<h4>Results and interpretation: </h4>
<h5>Going from 19k to 28.5k MAE is a more moderate drift and performance drop. NewAmesData4 shows moderate distribution changes, mostly in Lot Area (mean shift of about 16,400), followed by SalePrice (about 4,150) and Yr Sold (standard deviation shift of about 5.9). While these shifts increase the MAE, the model still performs reasonably. This suggests only partial drift.</h5>
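
Comparing means and standard deviations can miss changes in the shape of a distribution. One commonly used alternative is the population stability index (PSI), which compares how the values fall into bins derived from the reference data. The function below is a minimal sketch on synthetic data; the name `population_stability_index`, the number of bins and the `eps` floor are assumptions made for illustration.

```python
import numpy as np

def population_stability_index(reference, new, bins=10, eps=1e-6):
    """PSI between two 1-D samples, using quantile bins taken from the reference."""
    reference = np.asarray(reference, dtype=float)
    new = np.asarray(new, dtype=float)
    # Interior bin edges from reference quantiles; the outer bins are open-ended
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference, side="right"), minlength=bins)
    new_counts = np.bincount(np.searchsorted(inner_edges, new, side="right"), minlength=bins)
    # Floor the fractions so that empty bins do not produce log(0)
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    new_frac = np.clip(new_counts / len(new), eps, None)
    return float(np.sum((new_frac - ref_frac) * np.log(new_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, 5000)   # mean moved by one standard deviation

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI same: {psi_same:.4f}, PSI shifted: {psi_shifted:.4f}")
```

A frequently cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a substantial shift, though these thresholds are conventions rather than hard limits. Applied per column, it could be used to rank variables such as Lot Area or Yr Sold by how much they drifted.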