# Notebook for end-to-end development of the customer churn model selected in the EDA.
Please run the cells in order. Some steps need time for resources to be spun up in Azure; these are indicated in the markdown cells. Please wait for the green notification for each such cell before continuing to the next one. Steps such as creating the custom environment, running the pipeline, and creating the endpoint each take around 10 minutes.

In [None]:
# Check that the Azure ML SDK (azure-ai-ml) has been installed properly.
%pip show azure-ai-ml

In [None]:
# Enter details of your AML workspace
subscription_id = "YOUR SUBSCRIPTION ID HERE"
resource_group = "rg-churn-pred-proj"
workspace = "churn-pred-proj"

In [None]:
#Connecting to the MLClient
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# get a handle to the workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace)

# Creating a data asset
This is the data we downloaded from the Hugging Face website and will use for model development.
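
The data asset registered in this section uses the current UTC time as its version string. Here is a minimal sketch of that scheme using only the standard library. `utc_version` is a hypothetical helper name for illustration and is not part of the Azure ML SDK.

```python
import time


def utc_version(epoch_seconds=None):
    """Format a UTC timestamp as YYYY.MM.DD.HHMMSS for use as an asset version."""
    moment = time.gmtime(epoch_seconds)  # None means "now"
    return time.strftime("%Y.%m.%d.%H%M%S", moment)


# A fixed timestamp makes the format easy to inspect.
print(utc_version(0))
# The current time, as it would be used for a real registration.
print(utc_version())
```

Because every field is zero-padded, later registrations always produce version strings that sort after earlier ones.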

## List Data Stores

In [None]:
stores = ml_client.datastores.list()
for store in stores:
    print(store.name)

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import time

my_path = "./data/churn.csv"
# web_path = "https://huggingface.co/datasets/scikit-learn/churn-prediction/blob/main/dataset.csv"
# set the version number of the data asset to the current UTC time
v1 = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

churn_data = Data(
    name="telco-churn",
    version=v1,
    description="Churning customers of a telecommunication company",
    path=my_path,
    type=AssetTypes.URI_FILE,
    tags={"source_type": "web", "source": "Hugging Face"},
)

# create data asset
ml_client.data.create_or_update(churn_data)

print(f"Data asset created. Name: {churn_data.name}, version: {churn_data.version}")

## Access Data

In [None]:
%pip install -U azureml-fsspec

Check to see if we have the data or not.

In [None]:
import pandas as pd

# get a handle of the data asset and print the URI
data_asset = ml_client.data.get(name="telco-churn", version=v1)
print(f"Data asset URI: {data_asset.path}")

# read into pandas - note that you will see 2 headers in your data frame - that is ok, for now

df = pd.read_csv(data_asset.path)
df.head()

## Compute Cluster for Pipeline

Once you run the cell, wait and check under Compute that the cluster has been created successfully before proceeding. You also get a green notification if you have the default settings on.
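
Several steps in this notebook ask you to wait until a resource is ready. If you prefer to wait in code rather than watching the portal, a small polling helper is one option. The sketch below uses only the standard library; `wait_for` and `fake_check` are hypothetical names, and the check function is a stand-in for whatever readiness test you choose.

```python
import time


def wait_for(check, timeout_s=900, interval_s=15, sleep=time.sleep):
    """Call check() until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Stand-in readiness check that succeeds on the third call.
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_for(fake_check, timeout_s=5, interval_s=0))
```

In this notebook the check could, for example, fetch the cluster with `ml_client.compute.get("cpu-cluster")` and compare its provisioning state with the value that signals success. The exact attribute and value depend on your SDK version, so treat that part as an assumption to verify.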

In [None]:
from azure.ai.ml.entities import AmlCompute

cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="cpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds a node keeps running after the job terminates
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )
    
    # Now, we pass the object to MLClient's begin_create_or_update method
    # and block until provisioning finishes
    cpu_cluster = ml_client.begin_create_or_update(cpu_cluster).result()

    print(
        f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
    )



## Pipeline job environment creation
After running the cell, refresh the file explorer to confirm that the directory has been created.

In [None]:
import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

Create the conda YAML file in the dependencies directory.


In [None]:
%%writefile {dependencies_dir}/conda.yaml
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.23.5
  - pip=22.3.1
  - scikit-learn=1.2.2
  - pandas=1.5.3
  - matplotlib=3.7.1
  - imbalanced-learn=0.10.1
  - pip:
      - mlflow==2.2.2
      - azureml-mlflow==1.51.0
      - xgboost==1.7.5
name: churn-env3

## Create Custom Environment
This can take around 5-10 minutes. Check the environment under the environment tab (custom environments). It changes from running to successful if everything is correct. 


In [None]:
from azure.ai.ml.entities import Environment

custom_env_name = "Churn-Proj-scikit-learn"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Customer Churn pipeline",
    tags={"scikit-learn": "1.2.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    version="0.1.1",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)

Creating the environment takes around 5-10 minutes depending on the packages. <mark>DO NOT PROCEED BEFORE THE ENVIRONMENT IS MARKED AS SUCCESSFUL. </mark>

# Building Pipeline

We require two components. Each component needs a Python script (the cells below write it to the appropriate file) and a YAML file.
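
The YAML file of a component defines a command line, and the script reads its inputs and outputs from that command line with `argparse`. Here is a minimal sketch of the script side, using the same argument names as the data prep component. The paths are made-up placeholders; in a real run Azure ML fills them in.

```python
import argparse


def parse_component_args(argv):
    """Parse the two arguments that the data prep component receives."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True)
    parser.add_argument("--output_data", type=str, required=True)
    return parser.parse_args(argv)


# Hypothetical paths standing in for the locations injected at run time.
args = parse_component_args(
    ["--input_data", "/tmp/in/churn.csv", "--output_data", "/tmp/out"]
)
print(args.input_data, args.output_data)
```

Passing `argv` explicitly keeps the parser easy to try in a notebook cell, where `sys.argv` belongs to the kernel rather than to your script.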

## Component 1: Data Prep
This involves imputing missing values, dropping some columns, encoding, and scaling.

In [None]:
import os

# create a folder for the script files
script_folder = 'src'
os.makedirs(script_folder, exist_ok=True)
print(script_folder, 'folder created')

prep-data.py script
This is the preparation component of the pipeline: data cleaning, encoding, and scaling.
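
To make the cleaning and scaling steps concrete, here is a small sketch on a made-up three-row frame (the values are not from the real dataset). It converts a string column to numeric, fills the blank entry with the column mean, and applies min-max scaling by hand. This is the same formula `MinMaxScaler` applies with its default range.

```python
import pandas as pd

# Toy data: one blank TotalCharges entry, as found in the raw file.
toy = pd.DataFrame(
    {"TotalCharges": ["10.0", " ", "30.0"], "tenure": [1, 12, 24]}
)

# Blank strings cannot be parsed, so they become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
# Replace the missing value with the mean of the parsed values.
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
tenure_min, tenure_max = toy["tenure"].min(), toy["tenure"].max()
toy["tenure"] = (toy["tenure"] - tenure_min) / (tenure_max - tenure_min)

print(toy)
```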

In [None]:
%%writefile $script_folder/prep-data.py

# import libraries
import argparse
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler


def main(args):
    # read data
    print('Reading data ...')
    df = get_data(args.input_data)

    print('Cleaning data ...')
    cleaned_data = clean_data(df)

    print('Encoding data ...')
    encoded_data = encode_data(cleaned_data)

    print('Normalizing data ...')
    normalized_data = normalize_data(encoded_data)

    # write the prepared data to the output folder (to_csv returns None when given a path)
    normalized_data.to_csv((Path(args.output_data) / "churn_prepped.csv"), index = False)


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--input_data", dest='input_data',
                        type=str)
    parser.add_argument("--output_data", dest='output_data',
                        type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args

# function that reads the data
def get_data(path):
    df = pd.read_csv(path)

    # Count the rows and print the result
    row_count = (len(df))
    print('Preparing {} rows of data'.format(row_count))

    return df


# function that removes useless values and imputes missing ones
def clean_data(df):
    # Column TotalCharges is a string; convert it to numeric (blank strings become NaN)
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    # replace the missing values with the mean of this column
    df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())

    # Drop useless columns (high cardinality and low correlation columns - based on the ANOVA test done in EDA)
    df.drop(['customerID', 'gender','PhoneService', 'MultipleLines',
            'InternetService','StreamingTV', 'StreamingMovies'], axis = 1, inplace=True)
    
    return df   

# Function that encodes the data
def encode_data(df):
    cat_cols = ['Partner','Dependents','OnlineSecurity','OnlineBackup',
                'DeviceProtection','TechSupport','PaperlessBilling',
                'Contract', 'PaymentMethod'] # categorical columns

    # Encode categorical columns
    ord_enc = OrdinalEncoder()
    df[cat_cols] = ord_enc.fit_transform(df[cat_cols])
    # Mapping the target (Churn column)
    lb = LabelEncoder()
    df['Churn'] = lb.fit_transform(df['Churn'])

    return df

# function that normalizes the data
def normalize_data(df):
    # Define Scaler
    mms = MinMaxScaler() # Normalisation using min max scaler
    df['tenure'] = mms.fit_transform(df[['tenure']])
    df['MonthlyCharges'] = mms.fit_transform(df[['MonthlyCharges']])
    df['TotalCharges'] = mms.fit_transform(df[['TotalCharges']])

    return df


# run script
if __name__ == "__main__":
    # add space in logs
    print("\n\n")
    print("*" * 60)

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")

Preparing YAML file for prep-data

In [None]:
%%writefile prep-data.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prep_data
display_name: Prepare training data imbalanced sampling
version: 1
type: command
inputs:
  input_data: 
    type: uri_file
outputs:
  output_data:
    type: uri_folder
code: ./src
environment: azureml:Churn-Proj-scikit-learn:0.1.1
command: >-
  python prep-data.py 
  --input_data ${{inputs.input_data}}
  --output_data ${{outputs.output_data}}

## Component 2: Train and Evaluate Model
We use mlflow to keep track of the training. We create the train-model.py script.
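
The training script logs two metrics, accuracy and AUC. As a quick reminder of what they measure, the sketch below recomputes both in plain Python on four made-up labels and scores. The script itself relies on scikit-learn's `roc_auc_score`; the helper functions here are hypothetical and for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted churn probabilities (not real model output).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("Accuracy:", accuracy(y_true, y_pred))
print("AUC:", auc(y_true, scores))
```

Accuracy depends on the 0.5 threshold, while AUC only depends on how the scores rank the customers, which is why the two can tell different stories on imbalanced data.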

In [None]:
%%writefile $script_folder/train-model.py

import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, ColSpec

import glob
import argparse

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold

import logging

from xgboost import XGBClassifier

def main(args):

    logging.getLogger("mlflow").setLevel(logging.DEBUG)
    # read data
    df = get_data(args.training_data)
    # split data
    X_train, X_test, y_train, y_test = split_data(df)
    mlflow.start_run()
    # train model
    model = train(args.learning_rate, args.max_depth , X_train, y_train)
    # evaluate model
    evaluate(model, X_test, y_test)
    mlflow.end_run()


# function that reads the data
def get_data(data_path):

    all_files = glob.glob(data_path + "/*.csv")
    df = pd.concat((pd.read_csv(f) for f in all_files), sort=False)
    return df

# function that splits the data, then balances the training set with SMOTE
def split_data(df):

    print("Splitting data...")
    features = df.iloc[:,:13].values
    target = df.iloc[:,13].values

    # split first so that synthetic samples never leak into the test set
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.20, random_state=0, stratify=target)

    # oversample the minority class in the training data only (1:1 class ratio)
    oversample = SMOTE(sampling_strategy=1.0)
    X_train, y_train = oversample.fit_resample(X_train, y_train)

    return X_train, X_test, y_train, y_test


# Function that trains the model; learning rate and max depth are input args
def train(learning_rate, max_depth, X_train, y_train):

    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("max_depth", max_depth)
    model = XGBClassifier(learning_rate = learning_rate, max_depth = int(max_depth), n_estimators = 1000)
    model.fit(X_train,y_train)
    return model

# Function that evaluates the model
def evaluate(model,X_test,y_test):

    # calculate accuracy
    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)
    print('Accuracy:', acc)
    # calculate AUC
    y_scores = model.predict_proba(X_test)
    auc = roc_auc_score(y_test,y_scores[:,1])
    print('AUC: ' + str(auc))
    metrics = {
        "Accuracy": acc,
        "AUC": auc,       
        }
    mlflow.log_metrics(metrics)
    # Confusion Matrix
    cm = ConfusionMatrixDisplay.from_predictions(y_test,y_hat, normalize = 'true', cmap = 'Blues')
    #plt.savefig("Confusion-Matrix.png") 
    mlflow.log_figure(cm.figure_, "Confusion-Matrix.png")
    # plot ROC curve from the predicted probabilities of the positive class
    roc = RocCurveDisplay.from_predictions(y_test,y_scores[:,1])
    # Plot the diagonal 50% line
    plt.plot([0, 1], [0, 1], 'k--')
    # Plot the FPR and TPR achieved by our model
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    #plt.savefig("ROC-Curve.png") 
    mlflow.log_figure(roc.figure_, "ROC-Curve.png")   
    # Classification Report
    print(classification_report(y_test,y_hat))
    mlflow.log_text(classification_report(y_test,y_hat), "clf_report.txt")
    input_schema = Schema(
        [
            ColSpec("integer", "SeniorCitizen"),
            ColSpec("double", "Partner"),
            ColSpec("double", "Dependents"),
            ColSpec("double", "tenure"),
            ColSpec("double", "OnlineSecurity"),
            ColSpec("double", "OnlineBackup"),
            ColSpec("double", "DeviceProtection"),
            ColSpec("double", "TechSupport"),
            ColSpec("double", "Contract"),
            ColSpec("double", "PaperlessBilling"),
            ColSpec("double", "PaymentMethod"),
            ColSpec("double", "MonthlyCharges"),
            ColSpec("double", "TotalCharges"),
        ]
    )
    output_schema = Schema([ColSpec("integer")])
    signature = ModelSignature(inputs=input_schema, outputs=output_schema)
    # Save Model
    mlflow.xgboost.log_model(model, args.model_output, signature=signature)

def parse_args():

    # setup arg parser
    parser = argparse.ArgumentParser()
    # add arguments
    parser.add_argument("--training_data", dest='training_data',
                        type=str)
    parser.add_argument("--learning_rate", dest='learning_rate',
                        type=float, default=0.01)
    parser.add_argument("--max_depth", dest='max_depth',
                        type=int, default=3)
    parser.add_argument("--model_output", dest='model_output',
                        type=str)

    # parse args
    args = parser.parse_args()
    # return args
    return args

# run script
if __name__ == "__main__":
    # add space in logs
    print("\n\n")
    print("*" * 60)
    # parse args
    args = parse_args()
    # run main function
    main(args)
    # add space in logs
    print("*" * 60)
    print("\n\n")


train-model.yml file
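
The `command` entry in a component YAML file contains `${{inputs.*}}` and `${{outputs.*}}` placeholders that are replaced with real values when the job runs. The sketch below imitates that substitution with a regular expression so you can see the command line the training script ends up receiving. It is an illustration only, not how Azure ML implements it; `render` and the paths are hypothetical.

```python
import re

COMMAND_TEMPLATE = (
    "python train-model.py "
    "--training_data ${{inputs.training_data}} "
    "--learning_rate ${{inputs.learning_rate}} "
    "--max_depth ${{inputs.max_depth}} "
    "--model_output ${{outputs.model_output}}"
)


def render(template, inputs, outputs):
    """Replace ${{inputs.x}} and ${{outputs.y}} placeholders with values."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        group, key = match.group(1), match.group(2)
        return str(values[group][key])

    return re.sub(r"\$\{\{(inputs|outputs)\.(\w+)\}\}", substitute, template)


# Made-up paths; the hyperparameters are the defaults from the YAML file.
rendered = render(
    COMMAND_TEMPLATE,
    inputs={"training_data": "/tmp/prepped", "learning_rate": 0.01, "max_depth": 3},
    outputs={"model_output": "/tmp/model"},
)
print(rendered)
```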

In [None]:
%%writefile train-model.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_model
display_name: Train an XGBoost classifier model
version: 1
type: command
inputs:
  training_data: 
    type: uri_folder
  learning_rate:
    type: number
    default: 0.01
  max_depth:
    type: integer
    default: 3
outputs:
  model_output:
    type: mlflow_model
code: ./src
environment: azureml:Churn-Proj-scikit-learn:0.1.1
command: >-
  python train-model.py 
  --training_data ${{inputs.training_data}} 
  --learning_rate ${{inputs.learning_rate}}
  --max_depth ${{inputs.max_depth}}
  --model_output ${{outputs.model_output}} 

## Load Components

In [None]:
from azure.ai.ml import load_component
parent_dir = ""

prep_data = load_component(source=parent_dir + "./prep-data.yml")
train_XGBoost = load_component(source=parent_dir + "./train-model.yml")

## Build Pipeline

In [None]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

@pipeline()
def customer_churn_classification(pipeline_job_input):
    clean_data = prep_data(input_data=pipeline_job_input)
    train_model = train_XGBoost(training_data=clean_data.outputs.output_data)

    return {
        "pipeline_job_transformed_data": clean_data.outputs.output_data,
        "pipeline_job_trained_model": train_model.outputs.model_output,
    }

pipeline_job = customer_churn_classification(Input(type=AssetTypes.URI_FILE, path= data_asset.path))

In [None]:

print(pipeline_job)

Update pipeline job to include compute and data store information.

In [None]:
# change the output mode
pipeline_job.outputs.pipeline_job_transformed_data.mode = "upload"
pipeline_job.outputs.pipeline_job_trained_model.mode = "upload"
# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"
# set pipeline level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"

# print the pipeline job again to review the changes
print(pipeline_job)

Submit the pipeline

In [None]:
# submit job to workspace
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_churn"
)
pipeline_job

This takes at least 15-20 minutes to run through. <mark>DO NOT PROCEED BEFORE THE PIPELINE JOB IS MARKED AS COMPLETED. </mark>

Once the run is completed, you can see the logged metrics, the ROC plot, the confusion matrix, and the classification report.

### Optional:

If you want to register the components created here in the workspace for other users, please run the cell below. Otherwise, disregard it.

In [None]:
prep_data = ml_client.components.create_or_update(prep_data)
train_XGBoost = ml_client.components.create_or_update(train_XGBoost)

# Register Model

Navigate to the model in the pipeline output, copy the name, artifact path, and run ID, and use them to register the model. The run ID and path change for each experiment, so you have to get your own.

In [None]:
import mlflow

run_id = 'dfcecd6b-47ad-4f85-a90a-e2c995e834dd'
artifact_path = '/mnt/azureml/cr/j/ffc462fc02b74ca4bc5d97c910cdba2b/cap/data-capability/wd/model_output'
model_name = 'model'

mlflow.register_model(f"runs:/{run_id}/{artifact_path}", model_name)


In [None]:
model_example = ml_client.models.get(name="model", version="1")
print(model_example)

# Define and create an endpoint

In [None]:
from azure.ai.ml.entities import ManagedOnlineEndpoint
import datetime

online_endpoint_name = "endpoint-" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for MLflow customer churn model",
    auth_mode="key",
)

Create the endpoint

In [None]:
ml_client.begin_create_or_update(endpoint).result()

This step takes some time. Please wait until the notification pops up, or until you can see that the endpoint has been created successfully under the Endpoints tab.

## Configuring the deployment

We already have our MLflow model to deploy. Because it is an MLflow model, we do not need to supply an environment; other model types require the model and the environment to be registered separately. Be careful with the instance type: I do not have access to all of them because of my tier. The current selection will not work for very large models, but it is good enough for demonstration purposes.

In [None]:
from azure.ai.ml.entities import Model, ManagedOnlineDeployment
from azure.ai.ml.constants import AssetTypes

# Get the registered MLflow model. You need to provide the name and version; both are available under the Models tab.
model = ml_client.models.get(name='model', version=2)

# Configure a blue deployment
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_E2s_v3",
    instance_count=1,
)


# Deploying the model to the online endpoint

In [None]:
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

The deployment of the model may take 10-15 minutes. <mark>PLEASE WAIT</mark> for deployment to complete before continuing. 

To direct 100% of traffic to blue deployment:

In [None]:

# blue deployment takes 100% of the traffic
endpoint.traffic = {"blue": 100}
ml_client.begin_create_or_update(endpoint).result()

Please wait until this update finishes.
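
The test cell that follows reads `sample_data.json`, which this notebook does not create. Here is a hedged sketch of one way to build it, assuming the endpoint accepts the `input_data` payload with `columns`, `index`, and `data` keys that Azure ML uses for MLflow deployments. The column names follow the model signature from the training script. The feature values are made up and stand for one customer after encoding and scaling.

```python
import json

# Column order must match the model signature logged during training.
FEATURES = [
    "SeniorCitizen", "Partner", "Dependents", "tenure",
    "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
    "Contract", "PaperlessBilling", "PaymentMethod",
    "MonthlyCharges", "TotalCharges",
]

# Hypothetical customer: ordinal-encoded categories and min-max scaled numbers.
row = [0, 1.0, 0.0, 0.5, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.35, 0.2]

payload = {"input_data": {"columns": FEATURES, "index": [0], "data": [row]}}

# Write the request file next to the notebook.
with open("sample_data.json", "w") as f:
    json.dump(payload, f, indent=2)

print(json.dumps(payload)[:80])
```

The endpoint only sees preprocessed values, so a raw customer record has to go through the same cleaning, encoding, and scaling as the training data before it is sent.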

In [None]:
# test the blue deployment with some sample data
response = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="blue",
    request_file="sample_data.json",
)

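import json

# the endpoint returns a JSON list with one prediction per input row
prediction = json.loads(response)[0]

if prediction == 1:
    print("Customer Will Churn")
else:
    print("Customer Will NOT Churn")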

Wrap up by deleting the endpoint so that it does not incur further charges.

In [None]:
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)