# MLOps exercises

In [6]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split 
import mlflow.sklearn
from sklearn.metrics import mean_absolute_error

## Exercise 1

In this exercise, do the following:
1. Create a function that preprocesses new Ames data in the same way as the original Ames data was preprocessed in step 5 of the `MLOps.ipynb` notebook.
2. Create a function that takes a new Ames dataset and a model as input. The function should preprocess the new data and evaluate the model on it using mean absolute error.
3. Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.
4. Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?
5. Do you see a data drift in "NewAmesData2.csv"? If so, for which variables?
6. Do you see a data drift in "NewAmesData4.csv"? If so, for which variables?
7. Create a function that retrains a model on the new data together with the old training data.
8. Retrain `model_final` on the new data "NewAmesData1.csv" together with the old training data, using the function from 7. Then test the new model on the old test set.
9. Split the "NewAmesData2.csv" dataset into a train and test set. Train the best model from the `MLOps.ipynb` notebook on the training part and test it on the test part. Did you get a better model? Now combine your new training data with the original training data and retrain the model on that. Did that give you a better model?

In [2]:
def extract(dataset_path):
    # Read the raw CSV into a DataFrame.
    df = pd.read_csv(dataset_path)
    return df

def transform(df):
    # Keep the selected feature columns and the target (SalePrice).
    df = df[["Lot Area", "Overall Cond", "Year Built", "Gr Liv Area", "TotRms AbvGrd", "Mo Sold", "Yr Sold", "Bldg Type", "Neighborhood", "SalePrice"]]
    # Drop rows with a Lot Area above 75000.
    df = df[df["Lot Area"] <= 75000]

    # One-hot encode the categorical columns, then drop the originals.
    df = df.join(pd.get_dummies(df["Bldg Type"], drop_first=True, dtype="int", prefix="BType"))
    df = df.join(pd.get_dummies(df["Neighborhood"], drop_first=True, dtype="int", prefix="Nbh"))
    df = df.drop(columns=["Bldg Type", "Neighborhood"])
    return df

def ETL(dataset_path):
    # Extract and transform in one call.
    df = extract(dataset_path)
    df_transformed = transform(df)
    return df_transformed


In [None]:
# dataset_path = "NewAmesData1.csv"

# df = ETL(dataset_path)
# df.head()


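| Unnamed: 0 | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | Mo Sold | Yr Sold | SalePrice | BType_2fmCon | BType_Duplex | ... | Nbh_NoRidge | Nbh_NridgHt | Nbh_OldTown | Nbh_SWISU | Nbh_Sawyer | Nbh_SawyerW | Nbh_Somerst | Nbh_StoneBr | Nbh_Timber | Nbh_Veenker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10738 | 6 | 1954 | 1457 | 5 | 8 | 2009 | 132661 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 13835 | 4 | 1989 | 1792 | 6 | 3 | 2006 | 372144 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 9667 | 5 | 1997 | 1103 | 6 | 5 | 2010 | 150786 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 10699 | 6 | 2000 | 1671 | 6 | 4 | 2009 | 211081 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 14033 | 4 | 1992 | 1357 | 5 | 10 | 2009 | 142802 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |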


2. Create a function that takes a new Ames dataset and a model as input. The function should preprocess the new data and evaluate the model on it using mean absolute error.

In [4]:
model_rf_500_ = mlflow.sklearn.load_model("mlflow_rf_model")

Notes: no loops here; those belong in `mlflow_task`. For this exercise, use the trained model from `MLOps.ipynb` and do not train inside `model_evaluator`. `model_evaluator` should not do any preprocessing itself; it should just pass the data to `ETL`. Load the model saved by `MLOps.ipynb`.

Use MLflow.

pip freeze > requirements.txt

In [8]:
def model_evaluator(dataset_path, model_rf_500_):
    # Preprocess the new data with the same ETL pipeline as the original data.
    df = ETL(dataset_path)
    X_new = df.drop(columns=["SalePrice"])  # Adjust if the target column differs
    y_true = df["SalePrice"]
    # The model is already trained: only predict here, no train_test_split and no fitting.
    y_pred = model_rf_500_.predict(X_new)
    return mean_absolute_error(y_true, y_pred)
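
One thing to watch with `model_evaluator`: `pd.get_dummies` builds its columns from whatever categories appear in the new file, so a new dataset can end up with different dummy columns than the model was trained on. A minimal sketch of one way to guard against that is below. The helper name `align_features` and the toy column names are assumptions for illustration, not part of the original notebook. For a scikit-learn model fitted on a DataFrame, the training column names are usually available as `model.feature_names_in_`.

```python
import pandas as pd

def align_features(X_new, feature_names):
    """Return X_new with exactly the columns in feature_names, in that order.

    Columns missing from X_new are added and filled with 0;
    columns not in feature_names are dropped.
    (Hypothetical helper, not from the original notebook.)
    """
    return X_new.reindex(columns=list(feature_names), fill_value=0)

# Toy example: the new data has an unseen dummy column and lacks one
# that was present at training time.
X_new = pd.DataFrame({"Lot Area": [10738], "Nbh_Sawyer": [1], "Nbh_Extra": [1]})
train_cols = ["Lot Area", "Nbh_OldTown", "Nbh_Sawyer"]  # assumed training columns

X_aligned = align_features(X_new, train_cols)
print(X_aligned)
```

Filling with 0 is only sensible for one-hot columns. If a numeric feature such as "Lot Area" were missing from the new data, that should probably raise an error instead of being silently zero-filled.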

3. Test the function from 2. on the "NewAmesData1.csv" dataset and the best model from the `MLOps.ipynb` notebook.

In [10]:
mae = model_evaluator("NewAmesData1.csv", model_rf_500_)
mae

19268.3361078475

4. Test the function from 2. on the "NewAmesData2.csv" dataset and the best model from the `MLOps.ipynb` notebook. Do you see any drift?

In [11]:
mae = model_evaluator("NewAmesData2.csv", model_rf_500_)
mae

122715.83893998136
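
The MAE rises from about 19,268 on "NewAmesData1.csv" to about 122,716 on "NewAmesData2.csv", which suggests that something in the second dataset has changed. To find out which variables shifted (steps 5 and 6), one simple approach is to compare each numeric column's mean in the new data against the original data, measured in standard deviations of the original data. The sketch below is only one possible check. The function name `drift_report`, the 0.5 threshold, and the toy numbers are all assumptions for illustration, not results from the Ames files.

```python
import pandas as pd

def drift_report(df_ref, df_new, threshold=0.5):
    """Flag numeric columns whose mean moved by more than `threshold`
    reference standard deviations. The threshold is an arbitrary choice.
    (Hypothetical helper, not from the original notebook.)
    """
    cols = df_ref.select_dtypes("number").columns.intersection(df_new.columns)
    ref_mean = df_ref[cols].mean()
    new_mean = df_new[cols].mean()
    # Avoid dividing by zero for constant columns.
    ref_std = df_ref[cols].std().replace(0, float("nan"))
    report = pd.DataFrame({
        "ref_mean": ref_mean,
        "new_mean": new_mean,
        "std_shift": (new_mean - ref_mean) / ref_std,
    })
    report["drifted"] = report["std_shift"].abs() > threshold
    # Largest shifts first.
    return report.sort_values("std_shift", key=lambda s: s.abs(), ascending=False)

# Toy data (made up): "Year Built" shifts strongly, "Lot Area" barely moves.
ref = pd.DataFrame({"Lot Area": [9000, 10000, 11000, 12000],
                    "Year Built": [1950, 1970, 1990, 2010]})
new = pd.DataFrame({"Lot Area": [9500, 10500, 11500, 11000],
                    "Year Built": [2015, 2018, 2020, 2022]})

report = drift_report(ref, new)
print(report)
```

In the notebook, `df_ref` would be the preprocessed original training data and `df_new` the output of `ETL` on the new file. A mean comparison misses changes in spread or shape, so plotting histograms of the flagged columns is a useful follow-up.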