# MLOps exercises

## Exercise 1

In this exercise, do the following:
1. Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 of the `MLOps.ipynb` notebook.
2. Create a function that takes a new Ames dataset and a model as input. The function should preprocess the new data and evaluate the model on it using mean absolute error.
3. Test the function from step 2 on the "NewAmesData1.csv" dataset with the best model from the `MLOps.ipynb` notebook.
4. Test the function from step 2 on the "NewAmesData2.csv" dataset with the best model from the `MLOps.ipynb` notebook. Do you see any drift in the model's performance?
5. Do you see a data drift in "NewAmesData2.csv"? If so, for which variables?
6. Do you see a data drift in "NewAmesData4.csv"? If so, for which variables?
7. Create a function that retrains a model on the new data as well as the old training data.
8. Retrain the `model_final` on the new data "NewAmesData1.csv" as well as the old training data, using the function from step 7. Then test the new model on the old test set.
9. Split the "NewAmesData2.csv" dataset into a train and a test set. Train the best model from the `MLOps.ipynb` notebook on the training part and test it on the test part. Did you get a better model? Now combine your new training data with the original training data and retrain the model on the combined data. Did that give you a better model?

In [41]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestRegressor

ames = pd.read_csv("AmesHousing.csv")
ames1 = pd.read_csv("NewAmesData1.csv")
ames2 = pd.read_csv("NewAmesData2.csv")
ames4 = pd.read_csv("NewAmesData4.csv")


# Keep only the variables used in the MLOps.ipynb notebook.
ames = ames[["Lot Area", "Overall Cond", "Year Built", "Gr Liv Area", "TotRms AbvGrd", "Mo Sold", "Yr Sold", "Bldg Type", "Neighborhood", "SalePrice"]]
# Remove lot-area outliers from the original data.
ames = ames[ames["Lot Area"] <= 75000]
# One-hot encode the categorical variables, dropping the first level of each.
ames_wd = ames.join(pd.get_dummies(ames["Bldg Type"], drop_first=True, dtype="int", prefix="BType"))
ames_wd = ames_wd.join(pd.get_dummies(ames_wd["Neighborhood"], drop_first=True, dtype="int", prefix="Nbh"))
ames_wd = ames_wd.drop(columns=["Bldg Type", "Neighborhood"])

X_ames = ames_wd.drop(columns=["SalePrice"])
y_ames = ames_wd.SalePrice
X_train, X_test, y_train, y_test = train_test_split(X_ames, y_ames, test_size=0.2, random_state=1742)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=4217)

model_rf_500 = RandomForestRegressor(n_estimators=500)
model_rf_500.fit(X_train, y_train)


In [42]:
def preproces_data(data):
    """Preprocess a raw Ames dataset like the training data; return (X, y).

    Note: the `Lot Area <= 75000` outlier filter used on the training data
    is not applied here.
    """
    columns = ["Lot Area", "Overall Cond", "Year Built", "Gr Liv Area", "TotRms AbvGrd", "Mo Sold", "Yr Sold", "Bldg Type", "Neighborhood", "SalePrice"]
    data = data[columns]
    # One-hot encode the categorical variables, dropping the first level of each.
    data = data.join(pd.get_dummies(data["Bldg Type"], drop_first=True, dtype="int", prefix="BType"))
    data = data.join(pd.get_dummies(data["Neighborhood"], drop_first=True, dtype="int", prefix="Nbh"))
    data = data.drop(columns=["Bldg Type", "Neighborhood"])

    X_vars = data.drop(columns=["SalePrice"])
    Y_vars = data["SalePrice"]
    return X_vars, Y_vars
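
One thing to watch with `pd.get_dummies` on a new batch: the dummy columns depend on which categories happen to occur in that batch, so they may not match the columns the model was trained on. A minimal sketch of one way to guard against this is shown below. The helper name `align_to_training_columns` and the toy data are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def align_to_training_columns(X_new, training_columns):
    """Reorder X_new to match training_columns.

    Dummy columns missing from the new batch are added and filled with 0;
    columns the model never saw during training are dropped.
    """
    return X_new.reindex(columns=training_columns, fill_value=0)

# Toy example (hypothetical data): the new batch lacks the "Nbh_Veenker"
# dummy and contains a dummy that was not present during training.
training_columns = ["Lot Area", "Nbh_NAmes", "Nbh_Veenker"]
X_new = pd.DataFrame({
    "Lot Area": [8000, 9500],
    "Nbh_NAmes": [1, 0],
    "Nbh_Unknown": [0, 1],
})

X_aligned = align_to_training_columns(X_new, training_columns)
print(X_aligned)
```

In the notebook this would presumably be called with `X_train.columns` as the reference, right before `model.predict`.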

In [43]:
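def new_data_MAE(model, x_vars=None, y_vars=None, data=None):
    """Return the mean absolute error of `model` on new data.

    Pass either a raw dataset via `data` (it is preprocessed first) or
    already preprocessed features and targets via `x_vars` and `y_vars`.
    """
    if data is not None:
        x_vars, y_vars = preproces_data(data)
    elif x_vars is None or y_vars is None:
        # Without any input there is nothing to evaluate.
        raise ValueError("Provide either `data` or both `x_vars` and `y_vars`.")

    prediction = model.predict(x_vars)
    return mean_absolute_error(y_vars, prediction)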

In [44]:
print(new_data_MAE(model=model_rf_500, data=ames1))
print(new_data_MAE(data=ames2, model=model_rf_500))
print(new_data_MAE(data=ames4, model=model_rf_500))


19203.84563609257
122636.531734376
28779.143977159663
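
The MAE on "NewAmesData2.csv" is more than six times the MAE on "NewAmesData1.csv", which hints at drift. To see which variables have shifted (steps 5 and 6), one simple option is to compare the mean of each numeric variable in the new data with the original data, measured in standard deviations of the original data. The sketch below does this on synthetic data; the function name `drift_report`, the threshold of 0.5, and the generated values are assumptions for illustration, not results from the Ames files.

```python
import numpy as np
import pandas as pd

def drift_report(reference, new, threshold=0.5):
    """Compare numeric columns of two DataFrames by standardized mean shift.

    A variable is flagged when the mean of the new data lies more than
    `threshold` reference standard deviations from the reference mean.
    """
    rows = []
    for col in reference.select_dtypes(include="number").columns:
        if col not in new.columns:
            continue
        ref_mean = reference[col].mean()
        ref_std = reference[col].std()
        new_mean = new[col].mean()
        shift = (new_mean - ref_mean) / ref_std if ref_std > 0 else np.nan
        rows.append({
            "variable": col,
            "ref_mean": ref_mean,
            "new_mean": new_mean,
            "std_shift": shift,
            "drift": abs(shift) > threshold,
        })
    return pd.DataFrame(rows).set_index("variable")

# Synthetic stand-ins for the original and the new data (hypothetical values):
# only "Gr Liv Area" is shifted in the new batch.
rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "Gr Liv Area": rng.normal(1500, 500, 1000),
    "Year Built": rng.normal(1970, 30, 1000),
})
new = pd.DataFrame({
    "Gr Liv Area": rng.normal(2500, 500, 200),
    "Year Built": rng.normal(1970, 30, 200),
})

report = drift_report(reference, new)
print(report)
```

With the notebook's data, the call would presumably look like `drift_report(ames, ames2)` or `drift_report(ames, ames4)`. This check only covers numeric variables; for "Bldg Type" and "Neighborhood", comparing `value_counts(normalize=True)` between the datasets is one possible complement.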


In [45]:
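def retrain_model(new_data, old_X_train, old_y_train, model):
    """Refit `model` on the old training data combined with `new_data`.

    Note: `fit` refits the model in place, so the returned model is the
    same object as the one passed in.
    """
    X_new, y_new = preproces_data(new_data)

    # Stack the old training data and the preprocessed new data.
    X_combined = pd.concat([old_X_train, X_new])
    y_combined = pd.concat([old_y_train, y_new])

    model.fit(X_combined, y_combined)

    return model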
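
Because `fit` refits the estimator in place, the original model is overwritten by the retraining. If the old model should be kept for comparison, one option is to fit a copy created with `sklearn.base.clone`. The following is a minimal sketch on toy data; the helper name `retrain_copy` and the use of `LinearRegression` are assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def retrain_copy(model, X_old, y_old, X_new, y_new):
    """Fit a fresh clone of `model` on old + new data.

    `clone` copies the hyperparameters but not the fitted state, so the
    model passed in keeps its original fit.
    """
    X_combined = pd.concat([X_old, X_new], ignore_index=True)
    y_combined = pd.concat([y_old, y_new], ignore_index=True)
    return clone(model).fit(X_combined, y_combined)

# Toy data (hypothetical): the new batch follows a steeper relationship.
X_old = pd.DataFrame({"x": [0.0, 1.0, 2.0]})
y_old = pd.Series([0.0, 1.0, 2.0])
X_new = pd.DataFrame({"x": [3.0, 4.0]})
y_new = pd.Series([9.0, 12.0])

original = LinearRegression().fit(X_old, y_old)
retrained = retrain_copy(original, X_old, y_old, X_new, y_new)

# The two models are separate objects with different fitted slopes.
print(original.coef_, retrained.coef_)
```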

In [46]:
# retrain_model refits in place, so model_final and model_rf_500 are the same object.
model_final = retrain_model(ames1, X_train, y_train, model_rf_500)

MAE = new_data_MAE(model_final, x_vars=X_test, y_vars=y_test)
print(f"Mean Absolute Error on the old test set: {MAE}")

Mean Absolute Error on the old test set: 19974.379707920252
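
Step 9 can be approached along the following lines: hold out part of the new data as a test set, then compare a model trained on the old data only, on the new training part only, and on both combined. The sketch below uses synthetic data in which the price relationship changes between the old and the new batch; the function name `compare_retraining`, the variable names, and the data itself are assumptions for illustration, so the printed numbers say nothing about the Ames files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def compare_retraining(X_old_train, y_old_train, X_new, y_new, seed=0):
    """Return the MAE on a held-out part of the new data for three training sets."""
    X_new_train, X_new_test, y_new_train, y_new_test = train_test_split(
        X_new, y_new, test_size=0.2, random_state=seed
    )
    training_sets = {
        "old_only": (X_old_train, y_old_train),
        "new_only": (X_new_train, y_new_train),
        "combined": (
            pd.concat([X_old_train, X_new_train]),
            pd.concat([y_old_train, y_new_train]),
        ),
    }
    results = {}
    for name, (X, y) in training_sets.items():
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X, y)
        results[name] = mean_absolute_error(y_new_test, model.predict(X_new_test))
    return results

# Synthetic data (hypothetical): price per unit of living area doubles
# in the new batch, mimicking a drift in the target relationship.
rng = np.random.default_rng(0)
X_old = pd.DataFrame({"Gr Liv Area": rng.uniform(1000, 3000, 400)})
y_old = 100 * X_old["Gr Liv Area"] + rng.normal(0, 5000, 400)
X_new = pd.DataFrame({"Gr Liv Area": rng.uniform(1000, 3000, 200)})
y_new = 200 * X_new["Gr Liv Area"] + rng.normal(0, 5000, 200)

results = compare_retraining(X_old, y_old, X_new, y_new)
for name, mae in results.items():
    print(f"{name}: {mae:.0f}")
```

In the notebook, `X_new` and `y_new` would presumably come from `preproces_data(ames2)`, and `X_old_train`, `y_old_train` would be `X_train`, `y_train`.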
