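## What is MLflow and what are its components

MLflow is an open source platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow offers four components:

* MLflow Tracking: record and query experiments (code, data, configuration, and results)
* MLflow Projects: package data science code in a reusable, reproducible format
* MLflow Models: package models so they can be deployed to different serving environments
* MLflow Model Registry: store, annotate, and manage models in a central repository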


_First things first_
### Create a Conda environment

Run the commands below in a terminal. Make sure conda is installed first, or use the Anaconda Prompt that comes with the Anaconda installation.

1. Create the environment:
`conda create -n envname python=3.9 ipykernel`
This creates a conda environment named `envname` and installs Python 3.9 and `ipykernel` inside it.

2. Activate the environment:
`conda activate envname`

3. Add the newly created environment to the notebook as a kernel:
`python -m ipykernel install --user --name=envname`

4. Install Jupyter Notebook inside the environment:
`pip install notebook`

5. Now install all the dependencies required to run this notebook:

* `pip install pandas`
* `pip install numpy`
* `pip install scikit-learn`
* `pip install imbalanced-learn` (the package that provides the `imblearn` module)
* `pip install matplotlib`
* `pip install mlflow`
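
Before opening the notebook, it can help to confirm that the packages above are visible from the active environment. The sketch below uses only the standard library. The helper name `installed_versions` is made up for illustration, and the entries in `required` are PyPI distribution names (for example `scikit-learn` and `imbalanced-learn`, which are imported as `sklearn` and `imblearn`).

```python
from importlib import metadata

# PyPI distribution names for the packages installed in the steps above
required = ["notebook", "pandas", "numpy", "scikit-learn",
            "imbalanced-learn", "matplotlib", "mlflow"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(required)
for name, version in report.items():
    print(f"{name:20s} {version or 'NOT INSTALLED'}")
```

Any line that shows `NOT INSTALLED` points to a `pip install` step that did not run inside this environment.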

Now open the notebook with the command below (from the Anaconda Prompt, inside the conda environment):

`jupyter notebook`


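#### Make sure Python is being used from your newly created environment

The path printed below should point inside the `envname` environment. If it points somewhere else, switch the notebook to the `envname` kernel.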

In [1]:
import sys
print(sys.executable)

c:\Users\Acleda\miniconda3\python.exe


In [2]:
!python --version

Python 3.10.9
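
A slightly fuller version of the same check, as a sketch: it gathers the interpreter path, the environment prefix, and the active conda environment name in one place. `describe_interpreter` is a hypothetical helper. `CONDA_DEFAULT_ENV` is the variable conda sets when an environment is activated, so it can be missing if the kernel was not started from an activated shell.

```python
import os
import sys

def describe_interpreter():
    """Collect the facts that identify which Python the kernel is running."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # None when the kernel was not started from an activated conda shell
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "version": ".".join(str(part) for part in sys.version_info[:3]),
    }

info = describe_interpreter()
for key, value in info.items():
    print(f"{key:12s} {value}")
```

For an environment created as `envname`, the executable would normally sit under a directory such as `envs/envname`, and the version should match the one requested in `conda create`.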


### Create functions for all the steps involved in the complete model training lifecycle

In [3]:
import pandas as pd
import numpy as np


In [6]:
def load_data(path):
    data = pd.read_csv(path)
    return data


In [7]:
data = load_data('https://raw.githubusercontent.com/TripathiAshutosh/dataset/main/banking.csv')

In [19]:
def data_cleaning(data):
    print("na values available in data \n")
    print(data.isna().sum())
    data = data.dropna()
    print("after droping na values \n")
    print(data.isna().sum())
    return data

data_cleaning(data)

na values available in data 

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64
after droping na values 

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64


| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | blue-collar | married | Basic | unknown | yes | no | cellular | aug | thu | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.963 | 5228.1 | 0 |
| 1 | 53 | technician | married | unknown | no | no | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -0.1 | 93.200 | -42.0 | 4.021 | 5195.8 | 0 |
| 2 | 28 | management | single | university.degree | no | yes | no | cellular | jun | thu | ... | 3 | 6 | 2 | success | -1.7 | 94.055 | -39.8 | 0.729 | 4991.6 | 1 |
| 3 | 39 | services | married | high.school | no | no | no | cellular | apr | fri | ... | 2 | 999 | 0 | nonexistent | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | 0 |
| 4 | 55 | retired | married | Basic | no | yes | no | cellular | aug | fri | ... | 1 | 3 | 1 | success | -2.9 | 92.201 | -31.4 | 0.869 | 5076.2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41183 | 59 | retired | married | high.school | unknown | no | yes | telephone | jun | thu | ... | 1 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.866 | 5228.1 | 0 |
| 41184 | 31 | housemaid | married | Basic | unknown | no | no | telephone | may | thu | ... | 2 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.860 | 5191.0 | 0 |
| 41185 | 42 | admin. | single | university.degree | unknown | yes | yes | telephone | may | wed | ... | 3 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 0 |
| 41186 | 48 | technician | married | professional.course | no | no | yes | telephone | oct | tue | ... | 2 | 999 | 0 | nonexistent | -3.4 | 92.431 | -26.9 | 0.742 | 5017.5 | 0 |
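
The banking dataset has no missing values, so `dropna` leaves it unchanged. To see the cleaning step actually do something, here is a small sketch on a made-up frame (the column values are invented for illustration and are not taken from `banking.csv`).

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows
toy = pd.DataFrame({
    "age": [44, np.nan, 28, 39],
    "job": ["technician", "services", None, "retired"],
    "y": [0, 0, 1, 0],
})

missing_per_column = toy.isna().sum()   # number of missing values in each column
cleaned = toy.dropna()                  # drop every row that has at least one missing value

print(missing_per_column)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that `dropna` returns a new dataframe, which is why `data_cleaning` reassigns and returns `data` rather than relying on an in-place change.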


In [21]:
def preprocessing(data):
    """One-hot encode the categorical columns and return a model-ready dataframe."""
    data = data.copy() # work on a copy so the caller's dataframe is not modified

    # merge the three basic.* education levels into a single 'Basic' category
    data['education'] = np.where(data['education'].isin(['basic.9y', 'basic.6y', 'basic.4y']), 'Basic', data['education'])

    cat_vars = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'] # categorical columns to encode
    for var in cat_vars: # for each categorical column
        dummies = pd.get_dummies(data[var], prefix=var) # one indicator column per category, named <var>_<category>
        data = data.join(dummies) # append the indicator columns to the dataframe

    to_keep = [col for col in data.columns if col not in cat_vars] # every column except the original categorical ones
    final_data = data[to_keep].copy() # keep the numeric columns and the new indicator columns

    # regex=False treats '.' as a literal dot and avoids the pandas FutureWarning about the regex default
    final_data.columns = final_data.columns.str.replace('.', '_', regex=False) # replace the . in the column names with _
    final_data.columns = final_data.columns.str.replace(' ', '_', regex=False) # replace the space in the column names with _
    return final_data # return the encoded dataframe

preprocessing(data)


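| | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | ... | month_oct | month_sep | day_of_week_fri | day_of_week_mon | day_of_week_thu | day_of_week_tue | day_of_week_wed | poutcome_failure | poutcome_nonexistent | poutcome_success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | 210 | 1 | 999 | 0 | 1.4 | 93.444 | -36.1 | 4.963 | 5228.1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 53 | 138 | 1 | 999 | 0 | -0.1 | 93.200 | -42.0 | 4.021 | 5195.8 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 28 | 339 | 3 | 6 | 2 | -1.7 | 94.055 | -39.8 | 0.729 | 4991.6 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 39 | 185 | 2 | 999 | 0 | -1.8 | 93.075 | -47.1 | 1.405 | 5099.1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 55 | 137 | 1 | 3 | 1 | -2.9 | 92.201 | -31.4 | 0.869 | 5076.2 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41183 | 59 | 222 | 1 | 999 | 0 | 1.4 | 94.465 | -41.8 | 4.866 | 5228.1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 41184 | 31 | 196 | 2 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.860 | 5191.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 41185 | 42 | 62 | 3 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 41186 | 48 | 200 | 2 | 999 | 0 | -3.4 | 92.431 | -26.9 | 0.742 | 5017.5 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |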
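
What `preprocessing` does to a single categorical column can be seen on a three-row example. This is a sketch on invented rows, and it uses the `columns=` form of `pd.get_dummies` as a shorthand for the loop in the function, which is an assumption about equivalent behaviour for this simple case.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [44, 53, 28],
    "education": ["basic.9y", "university.degree", "basic.4y"],
    "y": [0, 0, 1],
})

# Merge the basic.* levels into one category, as preprocessing() does
toy["education"] = toy["education"].replace(
    {"basic.9y": "Basic", "basic.6y": "Basic", "basic.4y": "Basic"}
)

# One indicator column per category, prefixed with the source column name;
# the original 'education' column is dropped automatically
encoded = pd.get_dummies(toy, columns=["education"], prefix="education")

# regex=False makes '.' a literal dot rather than "any character"
encoded.columns = encoded.columns.str.replace(".", "_", regex=False)

print(encoded.columns.tolist())
# ['age', 'y', 'education_Basic', 'education_university_degree']
```

The renaming matters because a category such as `university.degree` would otherwise produce a column name with a dot in it.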


In [23]:
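def train_test_split(final_data):
    # alias the scikit-learn helper so it is not confused with this wrapper of the same name
    from sklearn.model_selection import train_test_split as sk_train_test_split
    X = final_data.loc[:, final_data.columns != 'y'] # all feature columns
    y = final_data.loc[:, final_data.columns == 'y'] # target column, kept as a one-column dataframe
    
    # hold out 30% of the rows; stratify keeps the class ratio the same in both splits
    X_train, X_test, y_train, y_test = sk_train_test_split(X, y, test_size=0.3, stratify=y, random_state=47)
    return X_train, X_test, y_train, y_test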

# train_test_split(final_data)
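
The effect of `stratify` is easiest to see on an imbalanced toy target. The sketch below uses synthetic arrays (not the banking data) with 10% positives, and the `toy_` names are chosen only to avoid clashing with the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split as sk_train_test_split

# 100 samples, 10 positives and 90 negatives
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([1] * 10 + [0] * 90)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = sk_train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=47
)

print("train positive rate:", toy_y_train.mean())
print("test positive rate: ", toy_y_test.mean())
```

Both splits keep the 10% positive rate. Without `stratify`, a random split of a rare class can leave the test set with noticeably more or fewer positives than the training set.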

In [38]:
def over_sampling_target_class(X_train, y_train):
    # Import the SMOTE class from imblearn (installed as the imbalanced-learn package)
    from imblearn.over_sampling import SMOTE
    
    # Create a SMOTE object with a random state of 0
    smote = SMOTE(random_state=0)

    # Get the names of the columns in the X_train dataframe
    columns = X_train.columns
    
    # Use the fit_resample method of the SMOTE object to over-sample the minority class
    os_data_X, os_data_y = smote.fit_resample(X_train, y_train)

    # Convert the over-sampled X data to a pandas DataFrame with column names
    os_data_X = pd.DataFrame(data=os_data_X, columns=columns)

    # Convert the over-sampled y data to a pandas DataFrame with a column named 'y'
    os_data_y = pd.DataFrame(data=os_data_y, columns=['y'])
    
    # Print some information about the over-sampled data
    print("length of oversampled data is ", len(os_data_X))
    print("Number of no subscription in oversampled data", len(os_data_y[os_data_y['y']==0]))
    print("Number of subscription", len(os_data_y[os_data_y['y']==1]))
    print("Proportion of no subscription data in oversampled data is ", len(os_data_y[os_data_y['y']==0])/len(os_data_X))
    print("Proportion of subscription data in oversampled data is ", len(os_data_y[os_data_y['y']==1])/len(os_data_X))
    
    # Replace the original X_train and y_train with the over-sampled X and y data
    X_train = os_data_X
    y_train = os_data_y['y']

    # Return the over-sampled X_train and y_train
    return X_train, y_train
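
To give a feel for what SMOTE does, here is a simplified NumPy sketch of its core idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the `imblearn` implementation, and the function name `smote_like_samples` and its parameters are made up.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between all minority points
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))          # pick a minority point at random
        nn = rng.choice(neighbours[base])           # pick one of its k neighbours
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[nn] - minority[base])
    return synthetic

minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points.shape)
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies instead of being exact copies.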


In [39]:
def training_basic_classifier(X_train,y_train):
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=101)
    model.fit(X_train, y_train)
    
    return model

In [40]:
def predict_on_test_data(model,X_test):
    y_pred = model.predict(X_test)
    return y_pred

In [41]:
def predict_prob_on_test_data(model,X_test):
    y_pred = model.predict_proba(X_test)
    return y_pred
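
Two properties of these helpers are worth checking on a small example. `training_basic_classifier` does not set `random_state`, so repeated runs can give slightly different models, and `predict` returns the class with the highest value in `predict_proba`. The sketch below uses synthetic data from `make_classification`, with `_demo` names chosen so it does not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the banking features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Two forests trained with the same seed on the same data
model_a = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)
model_b = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

proba = model_a.predict_proba(X_demo)   # shape (n_samples, n_classes), columns follow model_a.classes_
labels = model_a.predict(X_demo)        # predicted class per row

same_seed_agrees = np.array_equal(labels, model_b.predict(X_demo))
argmax_matches = np.array_equal(labels, model_a.classes_[np.argmax(proba, axis=1)])
print(same_seed_agrees, argmax_matches)
```

Passing a fixed `random_state` to `RandomForestClassifier` is one way to make the logged metrics repeatable from run to run.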

In [42]:
def get_metrics(y_true, y_pred, y_pred_prob):
    from sklearn.metrics import accuracy_score,precision_score,recall_score,log_loss
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    entropy = log_loss(y_true, y_pred_prob)
    return {'accuracy': round(acc, 2), 'precision': round(prec, 2), 'recall': round(recall, 2), 'entropy': round(entropy, 2)}
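
The four numbers returned by `get_metrics` can be reproduced by hand on a tiny example, which makes it clearer what each one measures. The labels and probabilities below are made up, and the code uses plain Python rather than scikit-learn so every step is visible.

```python
import math

y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
p_pos = [0.1, 0.2, 0.6, 0.9, 0.4]   # predicted probability of class 1

# Confusion-matrix counts
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real positives
recall = tp / (tp + fn)              # share of real positives that were found

# Log loss: average negative log-probability assigned to the true class
log_loss = -sum(
    math.log(p if t == 1 else 1 - p) for t, p in zip(y_true, p_pos)
) / len(y_true)

print(accuracy, precision, recall, round(log_loss, 2))
# 0.6 0.5 0.5 0.45
```

The value stored under `'entropy'` in `get_metrics` is this log loss, so lower is better, unlike the other three metrics.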

In [43]:
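def create_roc_auc_plot(clf, X_data, y_data):
    import matplotlib.pyplot as plt
    from sklearn.metrics import RocCurveDisplay
    # plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator (1.0+) replaces it
    RocCurveDisplay.from_estimator(clf, X_data, y_data)
    plt.savefig('roc_auc_curve.png')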

In [44]:
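def create_confusion_matrix_plot(clf, X_test, y_test):
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay
    # plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator (1.0+) replaces it
    ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
    plt.savefig('confusion_matrix.png')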

In [45]:
def hyper_parameter_tuning(X_train, y_train):
    # define random parameters grid
    n_estimators = [5,21,51,101] # number of trees in the random forest
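    max_features = ['auto', 'sqrt'] # number of features in consideration at every split; 'auto' means 'sqrt' for classifiers and was removed in scikit-learn 1.3, so drop it on newer versions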
    max_depth = [int(x) for x in np.linspace(10, 120, num = 12)] # maximum number of levels allowed in each decision tree
    min_samples_split = [2, 6, 10] # minimum sample number to split a node
    min_samples_leaf = [1, 3, 4] # minimum sample number that can be stored in a leaf node
    bootstrap = [True, False] # method used to sample data points

    random_grid = {'n_estimators': n_estimators,
                    'max_features': max_features,
                    'max_depth': max_depth,
                    'min_samples_split': min_samples_split,
                    'min_samples_leaf': min_samples_leaf,
                    'bootstrap': bootstrap
                }
    
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier()
    model_tuning = RandomizedSearchCV(estimator = classifier, param_distributions = random_grid,
                    n_iter = 100, cv = 5, verbose=2, random_state=35, n_jobs = -1)
    model_tuning.fit(X_train, y_train)

    print ('Random grid: ', random_grid, '\n')
    # print the best parameters
    print ('Best Parameters: ', model_tuning.best_params_, ' \n')

    best_params = model_tuning.best_params_
    
    n_estimators = best_params['n_estimators']
    min_samples_split = best_params['min_samples_split']
    min_samples_leaf = best_params['min_samples_leaf']
    max_features = best_params['max_features']
    max_depth = best_params['max_depth']
    bootstrap = best_params['bootstrap']
    
    model_tuned = RandomForestClassifier(n_estimators = n_estimators, min_samples_split = min_samples_split,
                                        min_samples_leaf= min_samples_leaf, max_features = max_features,
                                        max_depth= max_depth, bootstrap=bootstrap) 
    model_tuned.fit( X_train, y_train)
    return model_tuned,best_params
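
A smaller version of the same search, as a sketch on synthetic data: it samples a few combinations from a reduced grid and then reads both the best parameters and the best model from the search object. The grid values and the `_demo` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": [5, 21],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,          # try 4 of the 12 possible combinations
    cv=3,              # 3-fold cross-validation for each candidate
    random_state=35,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
# With the default refit=True, the best candidate is already retrained on all the data
tuned_model = search.best_estimator_
```

Since `best_estimator_` is already fitted, rebuilding a `RandomForestClassifier` from `best_params_`, as `hyper_parameter_tuning` does, is optional. Either route gives a model trained with the selected parameters.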

In [46]:
data = load_data('https://raw.githubusercontent.com/TripathiAshutosh/dataset/main/banking.csv')

In [33]:
cleaned_data = data_cleaning(data)

na values available in data 

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64
after droping na values 

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64


In [56]:
final_data = preprocessing(cleaned_data)



In [57]:
X_train, X_test, y_train, y_test = train_test_split(final_data)

In [58]:
X_train, y_train = over_sampling_target_class(X_train, y_train)

length of oversampled data is  51166
Number of no subscription in oversampled data 25583
Number of subscription 25583
Proportion of no subscription data in oversampled data is  0.5
Proportion of subscription data in oversampled data is  0.5


In [59]:
model = training_basic_classifier(X_train,y_train)

In [60]:
y_pred = predict_on_test_data(model,X_test)

In [61]:
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [62]:
y_pred_prob = predict_prob_on_test_data(model,X_test) #model.predict_proba(X_test)

In [63]:
y_pred_prob

array([[1.        , 0.        ],
       [0.99009901, 0.00990099],
       [0.94059406, 0.05940594],
       ...,
       [1.        , 0.        ],
       [0.77227723, 0.22772277],
       [1.        , 0.        ]])

In [64]:
run_metrics = get_metrics(y_test, y_pred, y_pred_prob)

In [65]:
print(run_metrics)

{'accuracy': 0.91, 'precision': 0.63, 'recall': 0.52, 'entropy': 0.2}


In [67]:
# create_roc_auc_plot(model, X_test, y_test)

In [69]:
# create_confusion_matrix_plot(model, X_test, y_test)

### MLflow work starts here

In [83]:
experiment_name = 'basic_classifier' # basic classifier
run_name = "term_deposit" # term deposit
run_metrics = get_metrics(y_test, y_pred, y_pred_prob)
print(run_metrics)

{'accuracy': 0.91, 'precision': 0.62, 'recall': 0.56, 'entropy': 0.19}


In [86]:
create_experiment(experiment_name, run_name, run_metrics,model,'confusion_matrix.png', 'roc_auc_curve.png')



Run - Random_Search_CV_Tuned_Model is logged to Experiment - optimized model


### Function to create an experiment in MLflow and log parameters, metrics, and artifact files such as images

Run this cell before any call to `create_experiment`.

In [75]:
def create_experiment(experiment_name, run_name, run_metrics, model, confusion_matrix_path=None,
                      roc_auc_plot_path=None, run_params=None):
    import mlflow
    import mlflow.sklearn  # make the scikit-learn model flavor available explicitly
    # mlflow.set_tracking_uri("http://localhost:5000")  # uncomment to log to a tracking server, e.g. one that uses a database such as SQLite as its backend store
    mlflow.set_experiment(experiment_name)

    # pass run_name so the run is listed under that name instead of an auto-generated one
    with mlflow.start_run(run_name=run_name):

        # log the hyperparameters, if any were supplied
        if run_params is not None:
            for param in run_params:
                mlflow.log_param(param, run_params[param])

        # log each evaluation metric
        for metric in run_metrics:
            mlflow.log_metric(metric, run_metrics[metric])

        # store the fitted model as a run artifact
        mlflow.sklearn.log_model(model, "model")

        # attach the saved plots, if their paths were supplied
        if confusion_matrix_path is not None:
            mlflow.log_artifact(confusion_matrix_path, 'confusion_matrix')

        if roc_auc_plot_path is not None:
            mlflow.log_artifact(roc_auc_plot_path, "roc_auc_plot")

        mlflow.set_tag("tag1", "Random Forest")
        mlflow.set_tags({"tag2": "Randomized Search CV", "tag3": "Production"})

    print('Run - %s is logged to Experiment - %s' % (run_name, experiment_name))

### Create another experiment after tuning hyperparameters and log the best set of parameters, i.e. the one for which the model performs best

In [85]:
import mlflow
experiment_name = "optimized model 2"
run_name="Tuned_Model"
model_tuned,best_params = hyper_parameter_tuning(X_train, y_train)
run_params = best_params

y_pred = predict_on_test_data(model_tuned,X_test) # will return the predicted class
y_pred_prob = predict_prob_on_test_data(model_tuned,X_test) # model.predict_proba(X_test)
run_metrics = get_metrics(y_test, y_pred, y_pred_prob)

Fitting 5 folds for each of 100 candidates, totalling 500 fits




Random grid:  {'n_estimators': [5, 21, 51, 101], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120], 'min_samples_split': [2, 6, 10], 'min_samples_leaf': [1, 3, 4], 'bootstrap': [True, False]} 

Best Parameters:  {'n_estimators': 21, 'min_samples_split': 2, 'min_samples_leaf': 3, 'max_features': 'auto', 'max_depth': 90, 'bootstrap': True}  





In [78]:
run_metrics

{'accuracy': 0.91, 'precision': 0.62, 'recall': 0.56, 'entropy': 0.19}

In [79]:
for param in run_params:
    print(param, run_params[param])

n_estimators 21
min_samples_split 10
min_samples_leaf 4
max_features auto
max_depth 90
bootstrap False


In [82]:
create_experiment(experiment_name,run_name,run_metrics,model_tuned,'confusion_matrix.png', 'roc_auc_curve.png',run_params)



Run - Random_Search_CV_Tuned_Model is logged to Experiment - optimized model
