### Levering MLflow for Experiment Tracking and Model Management

## Objective
The objective of this task is to introduce you to MLflow for experiment tracking, model management, and reproducibility in machine learning projects for the Sentiment Analysis Project.



#### Incorporating MLflow into your machine learning project to showcase experiment monitoring, model administration, and reproducibility involves the following process:

- Integrate MLflow into your existing machine learning projects.
- Train machine learning models while logging relevant information with MLflow.
- Demonstrate how to log parameters, metrics, and artifacts using MLflow tracking APIs.
- Customizing MLflow UI with run names.
- Demonstrate metric plots.
- Demonstrate hyperparameter plots.
- Demonstrate how to register models and manage by tagging them.
- (BONUS) Build a Prefect Workflow and Auto Schedule it. Show the Prefect Dashboard with relevant outputs.


#### Load Data from the Dataset folder

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv(r"C:\Users\arsha\Downloads\reviews_data_dump\reviews_badminton\data.csv")
data.head()

Unnamed: 0,Reviewer Name,Review Title,Place of Review,Up Votes,Down Votes,Month,Review text,Ratings
0,Kamal Suresh,Nice product,"Certified Buyer, Chirakkal",889.0,64.0,Feb 2021,"Nice product, good quality, but price is now r...",4
1,Flipkart Customer,Don't waste your money,"Certified Buyer, Hyderabad",109.0,6.0,Feb 2021,They didn't supplied Yonex Mavis 350. Outside ...,1
2,A. S. Raja Srinivasan,Did not meet expectations,"Certified Buyer, Dharmapuri",42.0,3.0,Apr 2021,Worst product. Damaged shuttlecocks packed in ...,1
3,Suresh Narayanasamy,Fair,"Certified Buyer, Chennai",25.0,1.0,,"Quite O. K. , but nowadays the quality of the...",3
4,ASHIK P A,Over priced,,147.0,24.0,Apr 2016,Over pricedJust â?¹620 ..from retailer.I didn'...,1


In [2]:
# print properties of attributes in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8518 entries, 0 to 8517
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Reviewer Name    8508 non-null   object 
 1   Review Title     8508 non-null   object 
 2   Place of Review  8468 non-null   object 
 3   Up Votes         8508 non-null   float64
 4   Down Votes       8508 non-null   float64
 5   Month            8053 non-null   object 
 6   Review text      8510 non-null   object 
 7   Ratings          8518 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 532.5+ KB


In [3]:
# check the number of null values per column
data.isnull().sum()

Reviewer Name       10
Review Title        10
Place of Review     50
Up Votes            10
Down Votes          10
Month              465
Review text          8
Ratings              0
dtype: int64

In [4]:
data = data.dropna()

In [5]:
data.isnull().sum()

Reviewer Name      0
Review Title       0
Place of Review    0
Up Votes           0
Down Votes         0
Month              0
Review text        0
Ratings            0
dtype: int64

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8013 entries, 0 to 8507
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Reviewer Name    8013 non-null   object 
 1   Review Title     8013 non-null   object 
 2   Place of Review  8013 non-null   object 
 3   Up Votes         8013 non-null   float64
 4   Down Votes       8013 non-null   float64
 5   Month            8013 non-null   object 
 6   Review text      8013 non-null   object 
 7   Ratings          8013 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 563.4+ KB


In [7]:
# adding a sentiment column to classify reviews as Positive or Negative
# Positive = 1
# Negative = 0

# Method 1: Using numpy's where function
data['sentiment'] = np.where(data['Ratings'] == 5.0, 1,
                              np.where(data['Ratings'] == 4.0, 1, 0))

In [8]:
# Method 2: Using pandas' map function
# Create a dictionary mapping star ratings to sentiments
rating_sentiment_map = {5.0: 1, 4.0: 1, 1.0: 0, 2.0: 0, 3.0: 0}

# Map star ratings to sentiments using the dictionary
data['sentiment'] = data['Ratings'].map(rating_sentiment_map)

In [9]:
data.head()

Unnamed: 0,Reviewer Name,Review Title,Place of Review,Up Votes,Down Votes,Month,Review text,Ratings,sentiment
0,Kamal Suresh,Nice product,"Certified Buyer, Chirakkal",889.0,64.0,Feb 2021,"Nice product, good quality, but price is now r...",4,1
1,Flipkart Customer,Don't waste your money,"Certified Buyer, Hyderabad",109.0,6.0,Feb 2021,They didn't supplied Yonex Mavis 350. Outside ...,1,0
2,A. S. Raja Srinivasan,Did not meet expectations,"Certified Buyer, Dharmapuri",42.0,3.0,Apr 2021,Worst product. Damaged shuttlecocks packed in ...,1,0
5,Baji Sankar,Mind-blowing purchase,"Certified Buyer, Hyderabad",173.0,45.0,Oct 2018,Good quality product. Delivered on time.READ MORE,5,1
6,Flipkart Customer,Must buy!,"Certified Buyer, Doom Dooma",403.0,121.0,Jan 2020,BEST PURCHASE It is a good quality and is more...,5,1


#### Identify input and output

In [10]:
X = data["Review text"]
y = data["sentiment"]

#### Split data into training and testing sets

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

#### Data Cleaning and preprocessing on train and test data

In [12]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Preprocessing functions
def clean_text(text):
    text = re.sub(r"[^a-zA-Z]", " ", text)
    text = re.sub(r'\W+', ' ', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    stop_words = set(stopwords.words('english'))
    words = text.split()
    cleaned_words = [word for word in words if word not in stop_words]
    return ' '.join(cleaned_words)

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(lemmatized_words)

In [13]:
# Apply text cleaning to the X_train data
X_train = X_train.apply(clean_text)
X_train = X_train.apply(lemmatize_text)
X_train.shape

(6009,)

In [14]:
# Apply text cleaning to the X_test data
X_test = X_test.apply(clean_text)
X_test = X_test.apply(lemmatize_text)
X_test.shape

(2004,)

#### Running the environment

In [15]:
pip install mlflow

Note: you may need to restart the kernel to use updated packages.


In [16]:
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import time
import joblib
import os

In [19]:
 mlflow.set_tracking_uri("sqlite:///mlflow_1.db")


In [17]:
import warnings
warnings.filterwarnings("ignore")

# mlflow.set_tracking_uri("sqlite:///mlflow_1.db")

mlflow.set_experiment("Sentiment Analysis of Flipkart Product Reviews")

<Experiment: artifact_location='file:///C:/Users/arsha/Downloads/MLOPs/mlruns/998641720780543424', creation_time=1711616665462, experiment_id='998641720780543424', last_update_time=1711616665462, lifecycle_stage='active', name='Sentiment Analysis of Flipkart Product Reviews', tags={}>

#### Auto Logging All Experiment Runs using MLFlow


In [18]:
# Define pipelines for various classifiers
pipelines = {
    'knn': Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('classifier', KNeighborsClassifier())
    ]),
    'svc': Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('classifier', SVC())
    ]),
    'logistic_regression': Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('classifier', LogisticRegression())
    ]),
    'random_forest': Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('classifier', RandomForestClassifier())
    ]),
    'decision_tree': Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('classifier', DecisionTreeClassifier())
    ])
}

# Define parameter grid for each algorithm
param_grids = {
    'knn': [
        {
            'tfidf__max_features': [1000, 2000, 3000],
            'classifier__n_neighbors': [3, 5, 7],
            'classifier__p': [1, 2, 3]
        }
    ],
    'svc': [
        {
            'tfidf__max_features': [1000, 2000, 3000],
            'classifier__kernel': ['rbf'],
            'classifier__C': [0.1, 1, 10]
        },
        {
            'tfidf__max_features': [1000, 2000, 3000],
            'classifier__kernel': ['linear'],
            'classifier__C': [0.1, 1, 10]
        }
    ],
    'logistic_regression': [
        {
            'tfidf__max_features': [1000, 2000, 3000],
            'classifier__C': [0.1, 1, 10],
            'classifier__penalty': ['l1', 'l2']
        }
    ],
    'random_forest': [
        {
            'tfidf__max_features': [1000, 2000, 3000],
            'classifier__n_estimators': [50, 100, 200]
        }
    ],
    'decision_tree': [
        {
            'tfidf__max_features': [1000, 2000, 3000],
            'classifier__max_depth': [None, 5, 10]
        }
    ],

}


In [19]:
# Perform GridSearchCV for each algorithm

best_models = {}

# Run the Pipeline
for algo in pipelines.keys():
    print("*"*10, algo, "*"*10)
    grid_search = GridSearchCV(estimator=pipelines[algo], 
                               param_grid=param_grids[algo], 
                               cv=5, 
                               scoring='accuracy', 
                               return_train_score=True,
                               verbose=1
                              )
    
    mlflow.sklearn.autolog(max_tuning_runs=None)
    
    with mlflow.start_run() as run:
        %time grid_search.fit(X_train, y_train)
        
    # print('Score on Train Data: ', grid_search.best_score_)
    print('Score on Test Data: ', grid_search.score(X_test, y_test))

********** knn **********


The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()

All git commands will error until this is rectified.

This initial message can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
    - quiet|q|silence|s|silent|none|n|0: for no message or exception
    - error|e|exception|raise|r|2: for a raised exception

Example:
    export GIT_PYTHON_REFRESH=quiet



Fitting 5 folds for each of 27 candidates, totalling 135 fits
CPU times: total: 5min 17s
Wall time: 2min 7s
Score on Test Data:  0.8637724550898204
********** svc **********




Fitting 5 folds for each of 18 candidates, totalling 90 fits
CPU times: total: 1min 38s
Wall time: 3min 17s
Score on Test Data:  0.8822355289421158
********** logistic_regression **********




Fitting 5 folds for each of 18 candidates, totalling 90 fits
CPU times: total: 9.59 s
Wall time: 34 s
Score on Test Data:  0.8812375249500998
********** random_forest **********




Fitting 5 folds for each of 9 candidates, totalling 45 fits
CPU times: total: 5min 18s
Wall time: 9min 34s
Score on Test Data:  0.8822355289421158
********** decision_tree **********




Fitting 5 folds for each of 9 candidates, totalling 45 fits
CPU times: total: 12.1 s
Wall time: 45.2 s
Score on Test Data:  0.8662674650698603


In [29]:
# Stop the auto logger
mlflow.sklearn.autolog(disable=True)

In [23]:
import os
import time
import joblib
import mlflow
from sklearn.model_selection import GridSearchCV

dev = "MOHD ARSHAD"
best_models = {}

# Create the directory if it doesn't exist
directory = 'C:/Users/arsha/Downloads/MLOPs/model/Best Models'
if not os.path.exists(directory):
    os.makedirs(directory)

for algo in pipelines.keys():
    print("*"*10, algo, "*"*10)
    grid_search = GridSearchCV(estimator=pipelines[algo], 
                               param_grid=param_grids[algo], 
                               cv=5, 
                               scoring='accuracy', 
                               return_train_score=True,
                               verbose=1
                              )

    # Fit
    start_fit_time = time.time()
    grid_search.fit(X_train, y_train)
    end_fit_time = time.time()

    # Predict
    start_predict_time = time.time()
    y_pred = grid_search.predict(X_test)
    end_predict_time = time.time()

    # Saving the best model
    model_path = f'C:/Users/arsha/Downloads/MLOPs/model/Best Models/{algo}.pkl'
    joblib.dump(grid_search.best_estimator_, model_path)
    model_size = os.path.getsize(model_path)

    # Print Log
    print('Train Score: ', grid_search.best_score_)
    print('Test Score: ', grid_search.score(X_test, y_test))
    print("Fit Time: ", end_fit_time - start_fit_time)
    print("Predict Time: ", end_predict_time - start_predict_time)
    print("Model Size: ", model_size)
    
    print()

    # Start the experiment run
    with mlflow.start_run() as run:
        # Log tags with mlflow.set_tag()
        mlflow.set_tag("developer", dev)

        # Log Parameters with mlflow.log_param()
        mlflow.log_param("algorithm", algo)
        mlflow.log_param("hyperparameter_grid", param_grids[algo])
        mlflow.log_param("best_hyperparameter", grid_search.best_params_)

        # Log Metrics with mlflow.log_metric()
        mlflow.log_metric("train_score", grid_search.best_score_)
        mlflow.log_metric("test_score", grid_search.score(X_test, y_test))
        mlflow.log_metric("fit_time", end_fit_time - start_fit_time)
        mlflow.log_metric("predict_time", end_predict_time - start_predict_time)
        mlflow.log_metric("model_size", model_size)


2024/03/28 19:05:20 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'a16fd7e1dea645b0a642d4a4208e7f65', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


********** knn **********
Fitting 5 folds for each of 27 candidates, totalling 135 fits




Train Score:  0.8597113331790895
Test Score:  0.8637724550898204
Fit Time:  127.97832441329956
Predict Time:  0.3654301166534424
Model Size:  425339



2024/03/28 19:07:29 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '8d29c68b1986490b9101a706d47cdaef', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


********** svc **********
Fitting 5 folds for each of 18 candidates, totalling 90 fits




Train Score:  0.878681935879834
Test Score:  0.8822355289421158
Fit Time:  205.7623302936554
Predict Time:  0.2769961357116699
Model Size:  287437



2024/03/28 19:10:56 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '9fc31f526ef849849724b6eca12948c8', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


********** logistic_regression **********
Fitting 5 folds for each of 18 candidates, totalling 90 fits




Train Score:  0.8740233111342324
Test Score:  0.8812375249500998
Fit Time:  36.02333188056946
Predict Time:  0.0432741641998291
Model Size:  120881



2024/03/28 19:11:33 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '344245e970eb49aaafd3f6db495e6f38', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


********** random_forest **********
Fitting 5 folds for each of 9 candidates, totalling 45 fits




Train Score:  0.8758531783691073
Test Score:  0.8817365269461078
Fit Time:  689.3061170578003
Predict Time:  0.4957869052886963
Model Size:  29732050



2024/03/28 19:23:04 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '6489e03ff7e8485c9868799aa67c99bf', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


********** decision_tree **********
Fitting 5 folds for each of 9 candidates, totalling 45 fits




Train Score:  0.8577129984580237
Test Score:  0.8682634730538922
Fit Time:  52.3689227104187
Predict Time:  0.049196720123291016
Model Size:  71136

