In [33]:
!pip install -r requirements.txt



# Project notebook

The following notebook is an excerpt and re-written example from a _real_ production model.

The overall purpose of the ML algorithm is to identify users on the website that are new possible customers. This is done by collecting behaviour data from the users as input, and the target is whether they converted/turned into customers -- essentially a classification problem. 

This notebook focuses only on the data processing part. As you know, an ML pipeline has multiple steps, and they are not always as neatly separated as they are here. For the exam project they will not be, and that is part of the challenge for you. Production code should also not live in Python notebooks since, as you may well see, notebooks are difficult to work with and collaborate on in an automated way.

There is a lot of "fluff" in such a notebook. This ranges from comments and markdown cells to commented out code and random print statements. That is not necessary in a properly managed project where you can use git to check the version history and such. 

What is important for you is to identify the entry points into the code and segment them into easily understandable chunks. Additionally, you might want to follow some basic code standards, such as:

- Import libraries only at the beginning of files
- Define functions at the top of scripts or, if they are used in multiple places, move them into a `util.py` script or similar
- Remove unused/commented out code
- Follow the [PEP 8](https://peps.python.org/pep-0008/) style guide (and others)
  
Another thing to note is that comments can be misleading. Even if a markdown cell or inline comment says the code does _X_, don't be surprised if it actually does _Y_. Additional text can be a blessing, but it can also be a curse. Remember, though, that your task is to make sure the code runs as before after refactoring the notebook into other files, not to update or improve the model or flow to reflect what the comments might say.

***

# DATA PROCESSING

In this section, we will perform Exploratory Data Analysis (EDA) to better understand the dataset before proceeding with more advanced analysis. EDA helps us get a sense of the data’s structure, identify patterns, and spot any potential issues like missing values or outliers. By doing so, we can gain a clearer understanding of the data's key characteristics.

We will start with summary statistics to review basic measures like mean, median, and variance, providing an initial overview of the data distribution. Then, we’ll create visualizations such as histograms, box plots, and scatter plots to explore relationships between variables, check for any skewness, and highlight outliers.

The purpose of this EDA is to ensure that the dataset is clean and well-structured for further analysis. This step also helps us identify any necessary data transformations and informs decisions on which features might be most relevant for modeling in later stages.

# Create artifact directory
We want to create a directory in the current directory for storing all the artifacts. Users can load the artifacts later for data cleaning pipelines and inference.

In [6]:
# dbutils.widgets.text("Training data max date", "2024-01-31")
# dbutils.widgets.text("Training data min date", "2024-01-01")
# max_date = dbutils.widgets.get("Training data max date")
# min_date = dbutils.widgets.get("Training data min date")

# testing
max_date = "2024-01-31"
min_date = "2024-01-01"

In [None]:
#import os
#import shutil
#from pprint import pprint

# shutil.rmtree("./artifacts",ignore_errors=True)
#os.makedirs("artifacts",exist_ok=True)
#print("Created artifacts directory")

# Pandas dataframe print options

In [3]:
import pandas as pd
#import warnings

#warnings.filterwarnings('ignore')
#pd.set_option('display.float_format',lambda x: "%.3f" % x)

# Helper functions

* **describe_numeric_col**: Calculates various descriptive stats for a numeric column in a dataframe.
* **impute_missing_values**: Imputes the mean/median for numeric columns or the mode for other types.

In [1]:
def describe_numeric_col(x):
    """
    Parameters:
        x (pd.Series): Pandas col to describe.
    Output:
        y (pd.Series): Pandas series with descriptive stats. 
    """
    return pd.Series(
        [x.count(), x.isnull().count(), x.mean(), x.min(), x.max()],
        index=["Count", "Missing", "Mean", "Min", "Max"]
    )

def impute_missing_values(x, method="mean"):
    """
    Parameters:
        x (pd.Series): Pandas col to describe.
        method (str): Values: "mean", "median"
    """
    if (x.dtype == "float64") | (x.dtype == "int64"):
        x = x.fillna(x.mean()) if method=="mean" else x.fillna(x.median())
    else:
        x = x.fillna(x.mode()[0])
    return x
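
A note on the `Missing` statistic in `describe_numeric_col`: `x.isnull().count()` counts every row of the boolean mask, so it reports the length of the column rather than the number of missing values. The following is a minimal sketch on a made-up series (the values are assumptions for illustration only) that contrasts it with `x.isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries out of five (made-up values).
col = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# .count() counts all non-null entries of the boolean mask, i.e. every row.
n_rows_via_count = col.isnull().count()

# .sum() adds up the True values, i.e. only the missing entries.
n_missing_via_sum = col.isnull().sum()

print(n_rows_via_count, n_missing_via_sum)
```

This is why the `Missing` column in the summaries further down equals the `Count` column.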

# Read data

We read the latest data from our data lake source. Here we load it locally after having pulled it from DVC.

In [None]:
!dvc pull

In [4]:
print("Loading training data")

data = pd.read_csv("./artifacts/raw_data.csv")

print("Total rows:", data.count())
display(data.head(5))


Loading training data
Total rows: lead_id                              12345
lead_indicator                       11753
date_part                            12345
is_active                            12345
marketing_consent                    12345
first_booking                        12345
existing_customer                    12345
last_seen                            12345
source                               12345
domain                               12345
country                              12345
visited_learn_more_before_booking    12345
visited_faq                          12345
purchases                            12345
time_spent                           12345
customer_group                       12345
onboarding                           12345
customer_code                        12294
n_visits                             12345
dtype: int64


Unnamed: 0,lead_id,lead_indicator,date_part,is_active,marketing_consent,first_booking,existing_customer,last_seen,source,domain,country,visited_learn_more_before_booking,visited_faq,purchases,time_spent,customer_group,onboarding,customer_code,n_visits
0,0,0.0,2024-01-24,0,True,2024-01-25,False,2024-01-13,organic,.dk,US,8,10,5,115.064399,2,True,,9
1,1,,2024-01-1,1,True,2024-01-24,False,2024-01-6,signup,.dk,US,0,3,5,119.907446,2,False,AGMVEYWACO,1
2,2,1.0,2024-01-27,0,True,2024-01-30,True,2024-01-30,signup,.dk,DK,7,10,3,100.47367,5,False,STWPXVNOWA,16
3,3,1.0,2024-01-28,0,False,2024-01-10,False,2024-01-11,li,.cn,US,10,10,4,98.571691,4,True,HZOKZERLZD,16
4,4,0.0,2024-01-5,0,True,2024-01-30,True,2024-01-25,fb,.com,US,5,1,5,101.996242,9,False,CVLIHCAPZN,2


In [7]:
import pandas as pd
import datetime
import json

if not max_date:
    max_date = pd.to_datetime(datetime.datetime.now().date()).date()
else:
    max_date = pd.to_datetime(max_date).date()

min_date = pd.to_datetime(min_date).date()

# Time limit data
data["date_part"] = pd.to_datetime(data["date_part"]).dt.date
data = data[(data["date_part"] >= min_date) & (data["date_part"] <= max_date)]

min_date = data["date_part"].min()
max_date = data["date_part"].max()
date_limits = {"min_date": str(min_date), "max_date": str(max_date)}
with open("./artifacts/date_limits.json", "w") as f:
    json.dump(date_limits, f)

# Feature selection

Not all columns are relevant for modelling

In [8]:
data = data.drop(
    [
        "is_active", "marketing_consent", "first_booking", "existing_customer", "last_seen"
    ],
    axis=1
)

In [9]:
#Removing columns that will be added back after the EDA
data = data.drop(
    ["domain", "country", "visited_learn_more_before_booking", "visited_faq"],
    axis=1
)

# Data cleaning
* Remove rows with empty target variable
* Remove rows with other invalid column data

In [10]:
import numpy as np

data["lead_indicator"].replace("", np.nan, inplace=True)
data["lead_id"].replace("", np.nan, inplace=True)
data["customer_code"].replace("", np.nan, inplace=True)

data = data.dropna(axis=0, subset=["lead_indicator"])
data = data.dropna(axis=0, subset=["lead_id"])

data = data[data.source == "signup"]
result=data.lead_indicator.value_counts(normalize = True)

print("Target value counter")
for val, n in zip(result.index, result):
    print(val, ": ", n)
data

Target value counter
0.0 :  0.5318352059925093
1.0 :  0.4681647940074906


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data["lead_indicator"].replace("", np.nan, inplace=True)

Unnamed: 0,lead_id,lead_indicator,date_part,source,purchases,time_spent,customer_group,onboarding,customer_code,n_visits
2,2,1.0,2024-01-27,signup,3,100.473670,5,False,STWPXVNOWA,16
10,10,0.0,2024-01-11,signup,8,105.898925,4,True,JWLXPVANUM,4
13,13,0.0,2024-01-22,signup,6,86.717225,6,True,WNKCTHUHYI,5
17,17,1.0,2024-01-07,signup,5,106.951999,8,False,INIQMTGGOO,8
19,19,0.0,2024-01-20,signup,3,94.667990,9,True,SJGDAANOYJ,20
...,...,...,...,...,...,...,...,...,...,...
12328,12328,1.0,2024-01-01,signup,2,98.999168,5,True,AYAVVZAMGV,14
12330,12330,1.0,2024-01-14,signup,6,101.389931,4,True,YFKCVZSAJF,18
12334,12334,0.0,2024-01-21,signup,1,93.840461,5,True,ROVGLTVJNL,1
12337,12337,1.0,2024-01-03,signup,0,110.320204,6,True,BGQNNVRTDH,18
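
The pandas warning above recommends assigning the result back to the column instead of calling `replace(..., inplace=True)` on a column selection. A minimal sketch of that pattern, assuming a hypothetical three-row frame (only the column name is taken from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
df = pd.DataFrame({"customer_code": ["AAA", "", "BBB"]})

# Assign the result back to the column rather than mutating a selection in place.
df["customer_code"] = df["customer_code"].replace("", np.nan)

n_missing = int(df["customer_code"].isna().sum())
print(n_missing)
```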


# Create categorical data columns

In [11]:
vars = [
    "lead_id", "lead_indicator", "customer_group", "onboarding", "source", "customer_code"
]

for col in vars:
    data[col] = data[col].astype("object")
    print(f"Changed {col} to object type")

Changed lead_id to object type
Changed lead_indicator to object type
Changed customer_group to object type
Changed onboarding to object type
Changed source to object type
Changed customer_code to object type


# Separate categorical and continuous columns

In [12]:
# pprint is used below; its import is commented out in the setup cell
from pprint import pprint

cont_vars = data.loc[:, ((data.dtypes=="float64")|(data.dtypes=="int64"))]
cat_vars = data.loc[:, (data.dtypes=="object")]

print("\nContinuous columns: \n")
pprint(list(cont_vars.columns), indent=4)
print("\n Categorical columns: \n")
pprint(list(cat_vars.columns), indent=4)

# Outliers

Outliers are data points that differ significantly from the majority of observations in a dataset and can distort statistical analysis or model performance. A common way to identify them is the Z-score, which measures how many standard deviations a data point lies from the mean; points more than 2 (or sometimes 3) standard deviations away are typically considered outliers. Note that the cell below does not drop such rows: it clips each continuous column to the mean plus or minus 2 standard deviations, so extreme values are capped at the boundary and the row count is unchanged.

In [13]:
cont_vars = cont_vars.apply(lambda x: x.clip(lower = (x.mean()-2*x.std()),
                                             upper = (x.mean()+2*x.std())))
outlier_summary = cont_vars.apply(describe_numeric_col).T
outlier_summary.to_csv('./artifacts/outlier_summary.csv')
outlier_summary

Unnamed: 0,Count,Missing,Mean,Min,Max
purchases,2937.0,2937.0,5.000105,0.578163,9.494701
time_spent,2937.0,2937.0,100.01523,80.082404,119.902405
n_visits,2937.0,2937.0,8.884043,1.0,22.334292
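
To make the difference between capping and removing outliers concrete, here is a small sketch on made-up values (the series and the variable names are assumptions, not notebook data). `clip` keeps every row and caps the extreme value, whereas a boolean filter drops it:

```python
import pandas as pd

# Made-up values with one extreme point.
s = pd.Series([10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 50.0])

lower = s.mean() - 2 * s.std()
upper = s.mean() + 2 * s.std()

# Same approach as the notebook cell: cap values at the band limits.
clipped = s.clip(lower=lower, upper=upper)

# Alternative (not what the notebook does): drop rows outside the band.
filtered = s[(s >= lower) & (s <= upper)]

print(len(clipped), len(filtered))
```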


# Impute data

In real-world datasets, missing data is a common occurrence due to various factors such as human error, incomplete data collection processes, or system failures. These gaps in the data can hinder analysis and lead to biased results if not properly addressed. Since many analytical and machine learning algorithms require complete data, handling missing values is an essential step in the data preprocessing phase.

In the next code block, we will handle missing data by performing imputation. For numerical columns, we will replace missing values with the mean or median of the entire column, which provides a reasonable estimate based on the existing data. For categorical columns (object type), we will use the mode, or most frequent value, to fill in missing entries. This approach helps us maintain a complete dataset while ensuring that the imputed values align with the general distribution of each column.

In [14]:
cat_missing_impute = cat_vars.mode(numeric_only=False, dropna=True)
cat_missing_impute.to_csv("./artifacts/cat_missing_impute.csv")
cat_missing_impute

Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code
0,2,0.0,2024-01-06,signup,5,True,AAAEJOLSFX
1,10,,,,,,AACEDUWABX
2,13,,,,,,AAJURHRBIM
3,17,,,,,,AAMFOUYNWS
4,19,,,,,,AAXKODEIPG
...,...,...,...,...,...,...,...
2932,12328,,,,,,
2933,12330,,,,,,
2934,12334,,,,,,
2935,12337,,,,,,
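
The many rows in `cat_missing_impute` follow from how `DataFrame.mode` handles ties: a column whose values are all unique (such as `lead_id`) returns every value as a mode, and the other columns are padded with `NaN`. A minimal sketch on a hypothetical frame (column names borrowed from the notebook, values made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "lead_id": [1, 2, 3, 4],                      # all unique: every value ties as a mode
    "source": ["signup", "signup", "fb", "li"],   # one clear mode
})

modes = toy.mode(numeric_only=False, dropna=True)

# Only the first row holds a usable mode for every column.
first_row = modes.iloc[0]
print(modes)
```

Only the first row is meaningful as an imputation value for the columns that have a single mode.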


In [15]:
# Continuous variables missing values
cont_vars = cont_vars.apply(impute_missing_values)
cont_vars.apply(describe_numeric_col).T

Unnamed: 0,Count,Missing,Mean,Min,Max
purchases,2937.0,2937.0,5.000105,0.578163,9.494701
time_spent,2937.0,2937.0,100.01523,80.082404,119.902405
n_visits,2937.0,2937.0,8.884043,1.0,22.334292


In [16]:
cat_vars.loc[cat_vars['customer_code'].isna(),'customer_code'] = 'None'
cat_vars = cat_vars.apply(impute_missing_values)
cat_vars.apply(lambda x: pd.Series([x.count(), x.isnull().sum()], index = ['Count', 'Missing'])).T
cat_vars

  x = x.fillna(x.mode()[0])


Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code
2,2,1.0,2024-01-27,signup,5,False,STWPXVNOWA
10,10,0.0,2024-01-11,signup,4,True,JWLXPVANUM
13,13,0.0,2024-01-22,signup,6,True,WNKCTHUHYI
17,17,1.0,2024-01-07,signup,8,False,INIQMTGGOO
19,19,0.0,2024-01-20,signup,9,True,SJGDAANOYJ
...,...,...,...,...,...,...,...
12328,12328,1.0,2024-01-01,signup,5,True,AYAVVZAMGV
12330,12330,1.0,2024-01-14,signup,4,True,YFKCVZSAJF
12334,12334,0.0,2024-01-21,signup,5,True,ROVGLTVJNL
12337,12337,1.0,2024-01-03,signup,6,True,BGQNNVRTDH


# Data standardisation

Standardization, or scaling, becomes necessary when continuous independent variables are measured on different scales, as this can lead to unequal contributions to the analysis. The objective is to rescale these variables so they have comparable ranges and/or variances, ensuring a more balanced influence in the model.

In [17]:
from sklearn.preprocessing import MinMaxScaler
import joblib

scaler_path = "./artifacts/scaler.pkl"

scaler = MinMaxScaler()
scaler.fit(cont_vars)

joblib.dump(value=scaler, filename=scaler_path)
print("Saved scaler in artifacts")

cont_vars = pd.DataFrame(scaler.transform(cont_vars), columns=cont_vars.columns)
cont_vars

Saved scaler in artifacts


Unnamed: 0,purchases,time_spent,n_visits
0,0.271612,0.512086,0.703093
1,0.832368,0.648330,0.140619
2,0.608065,0.166620,0.187492
3,0.495914,0.674776,0.328110
4,0.271612,0.366288,0.890585
...,...,...,...
2932,0.159461,0.475057,0.609348
2933,0.608065,0.535096,0.796839
2934,0.047310,0.345506,0.000000
2935,0.000000,0.759362,0.796839
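
Note that `MinMaxScaler` rescales each column to the range [0, 1] via `(x - min) / (max - min)`; it does not standardise to zero mean and unit variance. The sketch below checks this against a manual computation, assuming a made-up frame that reuses two column names from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values; only the column names come from the notebook.
toy = pd.DataFrame({
    "purchases": [0.0, 5.0, 10.0],
    "time_spent": [80.0, 100.0, 120.0],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

# Manual min-max scaling, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

same = bool(np.allclose(scaled, manual.to_numpy()))
print(same)
```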


# Combine data

In [18]:
cont_vars = cont_vars.reset_index(drop=True)
cat_vars = cat_vars.reset_index(drop=True)
data = pd.concat([cat_vars, cont_vars], axis=1)
print(f"Data cleansed and combined.\nRows: {len(data)}")
data

Data cleansed and combined.
Rows: 2937


Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code,purchases,time_spent,n_visits
0,2,1.0,2024-01-27,signup,5,False,STWPXVNOWA,0.271612,0.512086,0.703093
1,10,0.0,2024-01-11,signup,4,True,JWLXPVANUM,0.832368,0.648330,0.140619
2,13,0.0,2024-01-22,signup,6,True,WNKCTHUHYI,0.608065,0.166620,0.187492
3,17,1.0,2024-01-07,signup,8,False,INIQMTGGOO,0.495914,0.674776,0.328110
4,19,0.0,2024-01-20,signup,9,True,SJGDAANOYJ,0.271612,0.366288,0.890585
...,...,...,...,...,...,...,...,...,...,...
2932,12328,1.0,2024-01-01,signup,5,True,AYAVVZAMGV,0.159461,0.475057,0.609348
2933,12330,1.0,2024-01-14,signup,4,True,YFKCVZSAJF,0.608065,0.535096,0.796839
2934,12334,0.0,2024-01-21,signup,5,True,ROVGLTVJNL,0.047310,0.345506,0.000000
2935,12337,1.0,2024-01-03,signup,6,True,BGQNNVRTDH,0.000000,0.759362,0.796839


# Data drift artifact

In [19]:
import json

data_columns = list(data.columns)
with open('./artifacts/columns_drift.json', 'w+') as f:
    json.dump(data_columns, f)
    
data.to_csv('./artifacts/training_data.csv', index=False)

# Binning object columns

In [20]:
data.columns

Index(['lead_id', 'lead_indicator', 'date_part', 'source', 'customer_group',
       'onboarding', 'customer_code', 'purchases', 'time_spent', 'n_visits'],
      dtype='object')

In [21]:
data['bin_source'] = data['source']
values_list = ['li', 'organic','signup','fb']
data.loc[~data['source'].isin(values_list),'bin_source'] = 'Others'
mapping = {'li' : 'socials', 
           'fb' : 'socials', 
           'organic': 'group1', 
           'signup': 'group1'
           }

data['bin_source'] = data['source'].map(mapping)
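
In the cell above, the final `.map(mapping)` assignment overwrites `bin_source`, so the earlier `'Others'` assignment has no effect on the result, and any source missing from `mapping` would end up as `NaN`. A small sketch of that behaviour, assuming a made-up `"newsletter"` value; the `fillna` line is a hypothetical alternative, not what the notebook does:

```python
import pandas as pd

# "newsletter" is a made-up value that is absent from the mapping.
source = pd.Series(["li", "fb", "organic", "signup", "newsletter"])
mapping = {"li": "socials", "fb": "socials", "organic": "group1", "signup": "group1"}

# Values without a key in the dict become NaN.
mapped = source.map(mapping)

# Hypothetical way to keep an explicit "Others" bin.
with_fallback = mapped.fillna("Others")

print(with_fallback.tolist())
```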

# Save gold medallion dataset

In [None]:
#spark.sql(f"drop table if exists train_gold")


In [22]:
# data_gold = spark.createDataFrame(data)
# data_gold.write.saveAsTable('train_gold')
# dbutils.notebook.exit(('training_golden_data',most_recent_date))

data.to_csv('./artifacts/train_data_gold.csv', index=False)

# MODEL TRAINING

Model training fits an ML algorithm to a training dataset, which pairs the input features with the known output (target) for each sample.

In [23]:
import datetime

# Constants used:
current_date = datetime.datetime.now().strftime("%Y_%B_%d")
data_gold_path = "./artifacts/train_data_gold.csv"
data_version = "00000"
experiment_name = current_date

# Create paths

Maybe the artifacts path has not been created during data cleaning

In [24]:
import os
import shutil

#os.makedirs("artifacts", exist_ok=True)
os.makedirs("mlruns", exist_ok=True)
os.makedirs("mlruns/.trash", exist_ok=True)

In [25]:
import mlflow

mlflow.set_experiment(experiment_name)

2025/12/01 08:49:26 INFO mlflow.tracking.fluent: Experiment with name '2025_December_01' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///Users/yasminebenmessaoud/Library/CloudStorage/OneDrive-ITU/5.%20Semester/Data%20Science%20in%20Production%20-%20MLOps%20and%20Software%20Engineering/itu-sdse-project/notebooks/mlruns/627272652637156136', creation_time=1764575366223, experiment_id='627272652637156136', last_update_time=1764575366223, lifecycle_stage='active', name='2025_December_01', tags={}>

# Helper functions

* **create_dummy_cols**: Creates one-hot encoded columns in the data.

In [26]:
def create_dummy_cols(df, col):
    df_dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
    new_df = pd.concat([df, df_dummies], axis=1)
    new_df = new_df.drop(col, axis=1)
    return new_df
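
A short sketch of how this helper behaves with `drop_first=True`, using a made-up frame (the values are assumptions). The first level of each column is dropped, so a column with a single level, such as `source` after filtering to `"signup"`, contributes no dummy columns at all:

```python
import pandas as pd

def create_dummy_cols(df, col):
    # Same logic as the notebook helper, repeated so the sketch is self-contained.
    df_dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
    new_df = pd.concat([df, df_dummies], axis=1)
    return new_df.drop(col, axis=1)

toy = pd.DataFrame({
    "customer_group": [1, 2, 3],
    "source": ["signup", "signup", "signup"],  # only one level
})

out = create_dummy_cols(toy, "customer_group")
out = create_dummy_cols(out, "source")

cols = list(out.columns)
print(cols)
```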

# Load training data
We use the training data we cleaned earlier

In [27]:
data = pd.read_csv(data_gold_path)
print(f"Training data length: {len(data)}")
data.head(5)

Training data length: 2937


Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code,purchases,time_spent,n_visits,bin_source
0,2,1.0,2024-01-27,signup,5,False,STWPXVNOWA,0.271612,0.512086,0.703093,group1
1,10,0.0,2024-01-11,signup,4,True,JWLXPVANUM,0.832368,0.64833,0.140619,group1
2,13,0.0,2024-01-22,signup,6,True,WNKCTHUHYI,0.608065,0.16662,0.187492,group1
3,17,1.0,2024-01-07,signup,8,False,INIQMTGGOO,0.495914,0.674776,0.32811,group1
4,19,0.0,2024-01-20,signup,9,True,SJGDAANOYJ,0.271612,0.366288,0.890585,group1


# Data type split

In [28]:
data = data.drop(["lead_id", "customer_code", "date_part"], axis=1)

cat_cols = ["customer_group", "onboarding", "bin_source", "source"]
cat_vars = data[cat_cols]

other_vars = data.drop(cat_cols, axis=1)

# Dummy variable for categorical vars

1. Create one-hot encoded cols for cat vars
2. Change to floats

In [29]:
import pandas as pd

for col in cat_vars:
    cat_vars[col] = cat_vars[col].astype("category")
    cat_vars = create_dummy_cols(cat_vars, col)

data = pd.concat([other_vars, cat_vars], axis=1)

for col in data:
    data[col] = data[col].astype("float64")
    print(f"Changed column {col} to float")

Changed column lead_indicator to float
Changed column purchases to float
Changed column time_spent to float
Changed column n_visits to float
Changed column customer_group_2 to float
Changed column customer_group_3 to float
Changed column customer_group_4 to float
Changed column customer_group_5 to float
Changed column customer_group_6 to float
Changed column customer_group_7 to float
Changed column customer_group_8 to float
Changed column customer_group_9 to float
Changed column onboarding_True to float


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cat_vars[col] = cat_vars[col].astype("category")


# Splitting data

In [30]:
y = data["lead_indicator"]
X = data.drop(["lead_indicator"], axis=1)

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.15, stratify=y
)
y_train

335     1.0
499     0.0
2850    1.0
2142    0.0
1019    0.0
       ... 
1378    0.0
2771    0.0
28      0.0
1410    1.0
2488    1.0
Name: lead_indicator, Length: 2496, dtype: float64
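
The `stratify=y` argument keeps the class proportions the same in the train and test splits. A minimal sketch with made-up labels (70 zeros and 30 ones; sizes chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels and a dummy feature column.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.2, stratify=y
)

# Share of positive labels in each split.
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```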

# Model training

In this stage the ML algorithm learns from the training split. Below, hyperparameters are tuned with a cross-validated randomized search, and the best candidate is then evaluated on the held-out test split.

In [34]:
!brew install libomp

In [37]:
!pip install xgboost --force-reinstall

Collecting xgboost
  Downloading xgboost-3.1.2-py3-none-macosx_12_0_arm64.whl.metadata (2.1 kB)
Collecting numpy (from xgboost)
  Using cached numpy-2.3.5-cp311-cp311-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting scipy (from xgboost)
  Using cached scipy-1.16.3-cp311-cp311-macosx_14_0_arm64.whl.metadata (62 kB)
Downloading xgboost-3.1.2-py3-none-macosx_12_0_arm64.whl (2.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 12.2 MB/s 0:00:00
Using cached numpy-2.3.5-cp311-cp311-macosx_14_0_arm64.whl (5.4 MB)
Using cached scipy-1.16.3-cp311-cp311-macosx_14_0_arm64.whl (20.9 MB)
Installing collected packages: numpy, scipy, xgboost
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.5
    Uninstalling numpy-2.3.5:
      Successfully uninstalled numpy-2.3.5
  Attempting uninstall: scipy
    Found existing installation: scipy 1.16.3

In [38]:
!pip install xgboost==1.7.6


Collecting xgboost==1.7.6
  Downloading xgboost-1.7.6-py3-none-macosx_12_0_arm64.whl.metadata (1.9 kB)
Downloading xgboost-1.7.6-py3-none-macosx_12_0_arm64.whl (1.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 12.4 MB/s 0:00:00
Installing collected packages: xgboost
  Attempting uninstall: xgboost
    Found existing installation: xgboost 3.1.2
    Uninstalling xgboost-3.1.2:
      Successfully uninstalled xgboost-3.1.2
Successfully installed xgboost-1.7.6


# XGBoost

In [39]:
from xgboost import XGBRFClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from scipy.stats import randint

model = XGBRFClassifier(random_state=42)
params = {
    "learning_rate": uniform(1e-2, 3e-1),
    "min_split_loss": uniform(0, 10),
    "max_depth": randint(3, 10),
    "subsample": uniform(0, 1),
    "objective": ["reg:squarederror", "binary:logistic", "reg:logistic"],
    "eval_metric": ["aucpr", "error"]
}

model_grid = RandomizedSearchCV(model, param_distributions=params, n_jobs=-1, verbose=3, n_iter=10, cv=10)

model_grid.fit(X_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
[CV 4/10] END eval_metric=aucpr, learning_rate=0.05299176412941538, max_depth=7, min_split_loss=3.3081182713192603, objective=reg:squarederror, subsample=0.10405445120797607;, score=0.532 total time=   0.0s
[CV 3/10] END eval_metric=aucpr, learning_rate=0.05299176412941538, max_depth=7, min_split_loss=3.3081182713192603, objective=reg:squarederror, subsample=0.10405445120797607;, score=0.532 total time=   0.1s
[CV 1/10] END eval_metric=aucpr, learning_rate=0.05299176412941538, max_depth=7, min_split_loss=3.3081182713192603, objective=reg:squarederror, subsample=0.10405445120797607;, score=0.532 total time=   0.1s
[CV 2/10] END eval_metric=aucpr, learning_rate=0.05299176412941538, max_depth=7, min_split_loss=3.3081182713192603, objective=reg:squarederror, subsample=0.10405445120797607;, score=0.532 total time=   0.1s
[CV 5/10] END eval_metric=aucpr, learning_rate=0.05299176412941538, max_depth=7, min_split_loss=3.30811827131
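
When reading the search space above, keep in mind that `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, not from `[low, high]`, and that `randint(low, high)` excludes the upper bound. A small sketch that samples from the same two distributions (the sample size and seed are arbitrary choices):

```python
from scipy.stats import randint, uniform

# loc=0.01, scale=0.3, so the learning rate falls in [0.01, 0.31].
learning_rate = uniform(1e-2, 3e-1)

# Integers from 3 up to and including 9; the upper bound 10 is exclusive.
max_depth = randint(3, 10)

lr_samples = learning_rate.rvs(size=1000, random_state=0)
depth_samples = max_depth.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(depth_samples.min(), depth_samples.max())
```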

# Model test accuracy

In [41]:
from sklearn.metrics import accuracy_score

best_model_xgboost_params = model_grid.best_params_
print("Best xgboost params")
print(best_model_xgboost_params)

y_pred_train = model_grid.predict(X_train)
y_pred_test = model_grid.predict(X_test)
print("Accuracy train", accuracy_score(y_pred_train, y_train ))
print("Accuracy test", accuracy_score(y_pred_test, y_test))


Best xgboost params
{'eval_metric': 'error', 'learning_rate': np.float64(0.24517995936765524), 'max_depth': 6, 'min_split_loss': np.float64(2.5253349437857056), 'objective': 'binary:logistic', 'subsample': np.float64(0.6575111505357728)}
Accuracy train 0.827323717948718
Accuracy test 0.7913832199546486


# XGBoost performance overview
* Confusion matrix
* Classification report

In [42]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

conf_matrix = confusion_matrix(y_test, y_pred_test)
print("Test actual/predicted\n")
print(pd.crosstab(y_test, y_pred_test, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_test, y_pred_test),'\n')

conf_matrix = confusion_matrix(y_train, y_pred_train)
print("Train actual/predicted\n")
print(pd.crosstab(y_train, y_pred_train, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_train, y_pred_train),'\n')

Test actual/predicted

Predicted    0    1  All
Actual                  
0.0        207   28  235
1.0         64  142  206
All        271  170  441 

Classification report

              precision    recall  f1-score   support

         0.0       0.76      0.88      0.82       235
         1.0       0.84      0.69      0.76       206

    accuracy                           0.79       441
   macro avg       0.80      0.79      0.79       441
weighted avg       0.80      0.79      0.79       441
 

Train actual/predicted

Predicted     0     1   All
Actual                     
0.0        1140   187  1327
1.0         244   925  1169
All        1384  1112  2496 

Classification report

              precision    recall  f1-score   support

         0.0       0.82      0.86      0.84      1327
         1.0       0.83      0.79      0.81      1169

    accuracy                           0.83      2496
   macro avg       0.83      0.83      0.83      2496
weighted avg       0.83      0.83    
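
As a sanity check, the figures in the test-set classification report can be recomputed by hand from the test crosstab above. The sketch below copies the four counts from that table and treats class `1.0` as the positive class:

```python
# Counts copied from the test-set crosstab (rows: actual, columns: predicted).
tn, fp = 207, 28
fn, tp = 64, 142

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of the predicted positives, how many were right
recall_pos = tp / (tp + fn)      # of the actual positives, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 4), round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
```

The rounded values agree with the reported test accuracy and with the `1.0` row of the test classification report.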

# Save best XGBoost model

In [43]:
xgboost_model = model_grid.best_estimator_
xgboost_model_path = "./artifacts/lead_model_xgboost.json"
xgboost_model.save_model(xgboost_model_path)

model_results = {
    xgboost_model_path: classification_report(y_train, y_pred_train, output_dict=True)
}

# SKLearn logistic regression

In [45]:
import mlflow.pyfunc

from sklearn.linear_model import LogisticRegression
import os
from sklearn.metrics import cohen_kappa_score, f1_score
import matplotlib.pyplot as plt
import joblib

class lr_wrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model
    
    def predict(self, context, model_input):
        return self.model.predict_proba(model_input)[:, 1]


mlflow.sklearn.autolog(log_input_examples=True, log_models=False)
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id

with mlflow.start_run(experiment_id=experiment_id) as run:
    model = LogisticRegression()
    lr_model_path = "./artifacts/lead_model_lr.pkl"

    params = {
              'solver': ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
              'penalty':  ["none", "l1", "l2", "elasticnet"],
              'C' : [100, 10, 1.0, 0.1, 0.01]
    }
    model_grid = RandomizedSearchCV(model, param_distributions= params, verbose=3, n_iter=10, cv=3)
    model_grid.fit(X_train, y_train)

    best_model = model_grid.best_estimator_

    y_pred_train = model_grid.predict(X_train)
    y_pred_test = model_grid.predict(X_test)


    # log artifacts
    mlflow.log_metric('f1_score', f1_score(y_test, y_pred_test))
    mlflow.log_artifacts("artifacts", artifact_path="model")
    mlflow.log_param("data_version", "00000")
    
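    # store the fitted best estimator for model interpretability
    # (the bare `model` instance is only cloned by the search and never fitted)
    joblib.dump(value=best_model, filename=lr_model_path)
        
    # Custom python model for predicting probability 
    mlflow.pyfunc.log_model('model', python_model=lr_wrapper(best_model))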


model_classification_report = classification_report(y_test, y_pred_test, output_dict=True)

best_model_lr_params = model_grid.best_params_

print("Best lr params")
print(best_model_lr_params)

print("Accuracy train:", accuracy_score(y_pred_train, y_train ))
print("Accuracy test:", accuracy_score(y_pred_test, y_test))

conf_matrix = confusion_matrix(y_test, y_pred_test)
print("Test actual/predicted\n")
print(pd.crosstab(y_test, y_pred_test, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_test, y_pred_test),'\n')

conf_matrix = confusion_matrix(y_train, y_pred_train)
print("Train actual/predicted\n")
print(pd.crosstab(y_train, y_pred_train, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_train, y_pred_train),'\n')

model_results[lr_model_path] = model_classification_report
print(model_classification_report["weighted avg"]["f1-score"])


Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END .......C=1.0, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 2/3] END .......C=1.0, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 3/3] END .......C=1.0, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 1/3] END ........C=10, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 2/3] END ........C=10, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 3/3] END ........C=10, penalty=l1, solver=sag;, score=nan total time=   0.0s
[CV 1/3] END C=10, penalty=none, solver=newton-cg;, score=nan total time=   0.0s
[CV 2/3] END C=10, penalty=none, solver=newton-cg;, score=nan total time=   0.0s
[CV 3/3] END C=10, penalty=none, solver=newton-cg;, score=nan total time=   0.0s
[CV 1/3] END C=0.01, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 2/3] END C=0.01, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 3/3] END C=0.01, penalty=l1, solver=newton-c

21 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/yasminebenmessaoud/anaconda3/envs/sdse_project/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/yasminebenmessaoud/anaconda3/envs/sdse_project/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py", line 593, in safe_patch_function
    patch_function(call_original, *args, **kwargs)
  File "/Users/yasminebenmessaoud/anaconda3/envs/sdse_project/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py", line 259, in patch_with_m

Best lr params
{'solver': 'saga', 'penalty': 'l1', 'C': 10}
Accuracy train: 0.8145032051282052
Accuracy test: 0.7845804988662132
Test actual/predicted

Predicted  0.0  1.0  All
Actual                  
0.0        206   29  235
1.0         66  140  206
All        272  169  441 

Classification report

              precision    recall  f1-score   support

         0.0       0.76      0.88      0.81       235
         1.0       0.83      0.68      0.75       206

    accuracy                           0.78       441
   macro avg       0.79      0.78      0.78       441
weighted avg       0.79      0.78      0.78       441
 

Train actual/predicted

Predicted   0.0   1.0   All
Actual                     
0.0        1134   193  1327
1.0         270   899  1169
All        1404  1092  2496 

Classification report

              precision    recall  f1-score   support

         0.0       0.81      0.85      0.83      1327
         1.0       0.82      0.77      0.80      1169

    accuracy    
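
The failed fits reported in the log above are consistent with the search sampling solver/penalty pairs that `LogisticRegression` rejects, such as `l1` with `sag` or `newton-cg`. Recent scikit-learn releases also expect `None` rather than the string `"none"`. The following is a minimal sketch, assuming the solver/penalty compatibility table in the scikit-learn documentation, of how a search space limited to valid pairs could be built. The names `SUPPORTED_PENALTIES` and `build_param_distributions` are hypothetical, and using it would change the search space, so it is not a drop-in replacement for the cell above.

```python
# Penalties each solver accepts, per the scikit-learn LogisticRegression docs.
# "elasticnet" (saga only) is left out because it also needs an l1_ratio value.
SUPPORTED_PENALTIES = {
    "newton-cg": ["l2", None],
    "lbfgs": ["l2", None],
    "liblinear": ["l1", "l2"],
    "sag": ["l2", None],
    "saga": ["l1", "l2", None],
}


def build_param_distributions(solvers, penalties, c_values):
    """Return one search space per solver, keeping only the valid penalties."""
    spaces = []
    for solver in solvers:
        valid = [p for p in penalties if p in SUPPORTED_PENALTIES.get(solver, [])]
        if valid:
            spaces.append({"solver": [solver], "penalty": valid, "C": list(c_values)})
    return spaces


param_distributions = build_param_distributions(
    solvers=["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    penalties=[None, "l1", "l2"],
    c_values=[100, 10, 1.0, 0.1, 0.01],
)

# A list of dicts like this can be passed as
# RandomizedSearchCV(model, param_distributions=param_distributions, ...)
for space in param_distributions:
    print(space)
```

Keeping one dictionary per solver means every sampled candidate is a combination the estimator accepts, so no search iterations are spent on fits that score `nan`.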

# Save columns and model results

In [47]:
column_list_path = './artifacts/columns_list.json'
with open(column_list_path, 'w+') as columns_file:
    columns = {'column_names': list(X_train.columns)}
    print(columns)
    json.dump(columns, columns_file)

print('Saved column list to ', column_list_path)

model_results_path = "./artifacts/model_results.json"
with open(model_results_path, 'w+') as results_file:
    json.dump(model_results, results_file)

{'column_names': ['purchases', 'time_spent', 'n_visits', 'customer_group_2', 'customer_group_3', 'customer_group_4', 'customer_group_5', 'customer_group_6', 'customer_group_7', 'customer_group_8', 'customer_group_9', 'onboarding_True']}
Saved column list to  ./artifacts/columns_list.json
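
The notebook does not say how the saved column list is consumed. One plausible use, stated here as an assumption, is to put incoming records into the training column order at prediction time. The sketch below uses a temporary file and a shortened column list. `align_to_columns` and `fill_value` are hypothetical names.

```python
import json
import os
import tempfile


def align_to_columns(record, column_names, fill_value=0):
    """Order a record's values like the training columns, filling missing ones."""
    return [record.get(name, fill_value) for name in column_names]


# Round-trip a shortened column list through JSON, mirroring the cell above.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "columns_list.json")
    with open(path, "w") as columns_file:
        json.dump({"column_names": ["purchases", "time_spent", "n_visits"]}, columns_file)
    with open(path) as columns_file:
        column_names = json.load(columns_file)["column_names"]

# Hypothetical incoming record with one column missing and a different key order.
row = align_to_columns({"n_visits": 3, "purchases": 1}, column_names)
print(row)
```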


# MODEL SELECTION

Model selection involves choosing the most suitable statistical model from a set of candidates. In straightforward cases, this process uses an existing dataset. When candidate models offer comparable predictive or explanatory power, the simplest model is generally the preferred choice.
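
The "simplest comparable model" rule can be sketched as follows. This is only an illustration under stated assumptions: the candidate names, scores, complexity ranking, and the `tolerance` parameter are all hypothetical, and the cells below do not apply this rule (they simply take the highest F1 score).

```python
def pick_simplest_comparable(candidates, tolerance):
    """Return the name of the least complex candidate within `tolerance` of the best score."""
    best_score = max(c["f1"] for c in candidates)
    comparable = [c for c in candidates if best_score - c["f1"] <= tolerance]
    return min(comparable, key=lambda c: c["complexity"])["name"]


# Hypothetical candidates; complexity is an assumed ranking (lower means simpler).
candidates = [
    {"name": "logistic_regression", "f1": 0.78, "complexity": 1},
    {"name": "gradient_boosting", "f1": 0.80, "complexity": 2},
]

# With a tight tolerance the scores are not comparable, so the best score wins.
strict = pick_simplest_comparable(candidates, tolerance=0.01)
# With a looser tolerance the scores count as comparable, so the simpler model wins.
lenient = pick_simplest_comparable(candidates, tolerance=0.05)
print(strict, lenient)
```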

In [48]:
# Constants used:
current_date = datetime.datetime.now().strftime("%Y_%B_%d")
artifact_path = "model"
model_name = "lead_model"
experiment_name = current_date

# Helper functions

In [49]:
import time
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus

def wait_until_ready(model_name, model_version):
    """Poll the registry up to 10 times, one second apart, until the version is READY."""
    client = MlflowClient()
    for _ in range(10):
        model_version_details = client.get_model_version(
            name=model_name,
            version=model_version,
        )
        status = ModelVersionStatus.from_string(model_version_details.status)
        print(f"Model status: {ModelVersionStatus.to_string(status)}")
        if status == ModelVersionStatus.READY:
            break
        time.sleep(1)


# Getting experiment model results

In [50]:
experiment_ids = [mlflow.get_experiment_by_name(experiment_name).experiment_id]
experiment_ids

['627272652637156136']

In [51]:
experiment_best = mlflow.search_runs(
    experiment_ids=experiment_ids,
    order_by=["metrics.f1_score DESC"],
    max_results=1
).iloc[0]
experiment_best

run_id                                               87267f3f78084ede99e16e5ff5f2c703
experiment_id                                                      627272652637156136
status                                                                       FINISHED
artifact_uri                        file:///Users/yasminebenmessaoud/Library/Cloud...
start_time                                           2025-12-01 08:00:22.535000+00:00
end_time                                             2025-12-01 08:00:25.127000+00:00
metrics.training_log_loss                                                    0.418664
metrics.training_score                                                       0.814503
metrics.training_recall_score                                                0.814503
metrics.best_cv_score                                                        0.812099
metrics.f1_score                                                             0.746667
metrics.f1_score_X_test                               

In [52]:
import json

with open("./artifacts/model_results.json", "r") as f:
    model_results = json.load(f)
results_df = pd.DataFrame({model: val["weighted avg"] for model, val in model_results.items()}).T
results_df

                                     precision    recall  f1-score  support
./artifacts/lead_model_xgboost.json   0.827509  0.827324  0.826982   2496.0
./artifacts/lead_model_lr.pkl         0.790542  0.78458   0.781814    441.0


In [53]:
best_model = results_df.sort_values("f1-score", ascending=False).iloc[0].name
print(f"Best model: {best_model}")

Best model: ./artifacts/lead_model_xgboost.json


# Get production model

In [54]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
prod_model = [
    model for model in client.search_model_versions(f"name='{model_name}'")
    if dict(model)['current_stage'] == 'Production'
]
prod_model_exists = len(prod_model) > 0

if prod_model_exists:
    prod_model_version = dict(prod_model[0])['version']
    prod_model_run_id = dict(prod_model[0])['run_id']
    
    print('Production model name: ', model_name)
    print('Production model version:', prod_model_version)
    print('Production model run id:', prod_model_run_id)
    
else:
    print('No model in production')


No model in production


# Compare prod and best trained model

In [55]:
train_model_score = experiment_best["metrics.f1_score"]
model_details = {}
model_status = {}
run_id = None

if prod_model_exists:
    # mlflow.get_run returns a Run object; its metrics live in run.data.metrics
    prod_run = mlflow.get_run(prod_model_run_id)
    prod_model_score = prod_run.data.metrics["f1_score"]

    model_status["current"] = train_model_score
    model_status["prod"] = prod_model_score

    if train_model_score > prod_model_score:
        print("Registering new model")
        run_id = experiment_best["run_id"]
else:
    print("No model in production")
    run_id = experiment_best["run_id"]

print(f"Registered model: {run_id}")

No model in production
Registered model: 87267f3f78084ede99e16e5ff5f2c703
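
The comparison above boils down to a small decision rule, which is easier to test once it is separated from the MLflow calls. The sketch below is one way to express it as a pure function. The name `choose_run_to_register` and the example scores are hypothetical.

```python
def choose_run_to_register(candidate_run_id, candidate_score, prod_score=None):
    """Return the run id to register, or None when the production model should stay."""
    if prod_score is None:
        # No production model yet: register the best trained run.
        return candidate_run_id
    if candidate_score > prod_score:
        # The new model must strictly beat production to replace it.
        return candidate_run_id
    return None


# Hypothetical scores, for illustration only.
print(choose_run_to_register("run-a", 0.75))
print(choose_run_to_register("run-a", 0.75, prod_score=0.70))
print(choose_run_to_register("run-a", 0.75, prod_score=0.80))
```

As in the cell above, a tie leaves `run_id` as `None`, so nothing new is registered.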


# Register best model

In [None]:
if run_id is not None:
    print(f'Best model found: {run_id}')

    model_uri = "runs:/{run_id}/{artifact_path}".format(
        run_id=run_id,
        artifact_path=artifact_path
    )
    model_details = mlflow.register_model(model_uri=model_uri, name=model_name)
    wait_until_ready(model_details.name, model_details.version)
    model_details = dict(model_details)
    print(model_details)

# DEPLOY

A model version is assigned exactly one stage at a time. MLflow provides predefined stages for common use cases: None, Staging, Production, and Archived. With the necessary permissions, you can transition a model version between stages or request a transition to a different stage.
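
To make the stage mechanics concrete, here is a toy in-memory sketch of what a transition with `archive_existing_versions=True` (the flag used in the staging cell of this notebook) is expected to do: versions already in the target stage move to Archived. This is not the MLflow API. `transition_stage` and the version numbers are hypothetical.

```python
def transition_stage(stages, version, stage, archive_existing_versions=False):
    """Return a new {version: stage} mapping after moving `version` to `stage`."""
    updated = dict(stages)
    if archive_existing_versions:
        for other, current in stages.items():
            # Versions already sitting in the target stage get archived.
            if other != version and current == stage:
                updated[other] = "Archived"
    updated[version] = stage
    return updated


# Hypothetical registry state: version 1 is in Staging, version 2 has no stage yet.
stages = {1: "Staging", 2: "None"}
stages = transition_stage(stages, version=2, stage="Staging", archive_existing_versions=True)
print(stages)
```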

In [None]:
model_version = 1

# Transition to staging

In [None]:
from mlflow.tracking import MlflowClient

client = MlflowClient()


def wait_for_deployment(model_name, model_version, stage='Staging'):
    """Poll every 2 seconds until the model version reaches `stage` (no timeout)."""
    status = False
    while not status:
        model_version_details = dict(
            client.get_model_version(name=model_name, version=model_version)
        )
        if model_version_details['current_stage'] == stage:
            print(f'Transition completed to {stage}')
            status = True
        else:
            time.sleep(2)
    return status

model_version_details = dict(client.get_model_version(name=model_name, version=model_version))
model_status = True
if model_version_details['current_stage'] != 'Staging':
    client.transition_model_version_stage(
        name=model_name,
        version=model_version,
        stage="Staging",
        archive_existing_versions=True
    )
    model_status = wait_for_deployment(model_name, model_version, 'Staging')
else:
    print('Model already in staging')
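
`wait_for_deployment` loops until the stage matches, so it never returns if the transition does not happen. A bounded variant could look like the sketch below. `poll_until` and its parameters are hypothetical, and the `sleep` and `clock` arguments exist only so the helper can be tested without real waiting.

```python
import time


def poll_until(check, timeout_s=60.0, interval_s=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns True or `timeout_s` elapses; report success."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in check that succeeds on the third call. With MLflow, the check could be
# lambda: dict(client.get_model_version(name=..., version=...))['current_stage'] == 'Staging'
attempts = {"count": 0}


def ready_on_third_call():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(poll_until(ready_on_third_call, timeout_s=5, interval_s=0))
```

Returning `False` on timeout lets the caller decide whether to retry, raise, or log, instead of hanging the pipeline.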