## Introduction ##

Customer loss can significantly impact a business’s bottom line. By detecting at-risk customers early, companies can proactively engage them with retention strategies. In this workshop, we'll explore how to use Snowflake’s native [machine learning](https://docs.snowflake.com/de/developer-guide/snowpark-ml/reference/1.5.3/modeling) capabilities to automate the identification of dissatisfied customers, commonly referred to as churn prediction.

**Internal:** [AWS example](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn_outputs.html#Data)

### Configuring the environment 

I have downloaded the data and uploaded it into Snowflake using the **COPY** command.

In [None]:
# Import python packages
import streamlit as st
import pandas as pd

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()

#Snowflake libraries 
from snowflake import snowpark
from snowflake.ml import dataset
from snowflake.snowpark.functions import col,when,lit
from snowflake.snowpark.types import *

## Snowflake ml libraries
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.preprocessing import MinMaxScaler , OneHotEncoder

# snowpark ML metrics
from snowflake.ml.modeling.metrics import accuracy_score,f1_score,precision_score,roc_auc_score,roc_curve,recall_score


# python libraries (pandas is already imported above)
import numpy as np
import matplotlib.pyplot as plt
import time
import json
from IPython.display import display

## set the database and schema
session.use_database('ml_models')
session.use_schema('ml_models.ds')


In [None]:
# Reference the CHURN table as a Snowpark DataFrame (evaluated lazily)
churn = session.table("CHURN")

churn.show(5)

## EDA

Let’s explore the dataset further and uncover additional insights.

In [None]:
# get the numerical and categorical features
#get the schema
schema = churn.schema

numerical_types = (IntegerType, FloatType, DecimalType, LongType, ShortType, DoubleType)
numerical_columns =[f.name for f in schema if isinstance(f.datatype, numerical_types)]


categorical_types  = (StringType, VariantType, BooleanType)
categorical_columns = [f.name  for f in schema if isinstance(f.datatype, categorical_types)]

print("Numerical Columns:", numerical_columns)
print("Categorical Columns:", categorical_columns)

In [None]:
pd.set_option("display.max_columns", 500)
df = churn.describe()
df


We can see immediately that:

- State appears to be quite evenly distributed.
- Phone takes on too many unique values to be of any practical use. It’s possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it.
- Most of the numeric features are surprisingly nicely distributed, with many showing bell-like Gaussianity. VMail Message is a notable exception, and Area Code shows up as a numeric feature that we should convert to non-numeric.
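
One way to make the "too many unique values" judgment concrete is to compare a column's distinct count with its row count. The sketch below does this in plain Python on a few made-up rows. The sample values and the `unique_ratio` helper are illustrative assumptions, not part of the workshop data or of any Snowflake API.

```python
# Illustrative rows only; these values are invented for the example.
rows = [
    {"STATE": "KS", "PHONE": "382-4657"},
    {"STATE": "OH", "PHONE": "371-7191"},
    {"STATE": "KS", "PHONE": "358-1921"},
    {"STATE": "OH", "PHONE": "375-9999"},
]


def unique_ratio(rows, column):
    """Return the number of distinct values divided by the row count."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)


for column in ("STATE", "PHONE"):
    print(column, unique_ratio(rows, column))
```

A ratio close to 1.0 means nearly every row carries its own value, which is what makes an identifier-like column such as Phone unhelpful as a feature.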

In [None]:
# Drop the PHONE column from the Snowpark DataFrame
churn = churn.drop("PHONE")

#convert to a string column
churn = churn.with_column("AREA_CODE", col("AREA_CODE").cast(StringType()))


In [None]:
import matplotlib.pyplot as plt
df = churn.to_pandas()

# Histograms of numeric features by CHURN class
for column in df.select_dtypes(include=["number"]).columns:
    hist = df[[column, "CHURN"]].hist(by="CHURN", bins=30, edgecolor='black', figsize=(4, 3))
    plt.suptitle(f"{column} by CHURN", y=1)  # Add title
    plt.tight_layout()
    plt.show()


In [None]:
# Recompute the numeric columns, since the schema changed in the previous step
numerical_columns = [f.name for f in churn.schema if isinstance(f.datatype, numerical_types)]

# Initialize an empty DataFrame to store the correlation matrix
corr_matrix = pd.DataFrame(index=numerical_columns, columns=numerical_columns, dtype=float)

# For each pair of numerical columns, calculate the correlation
for col1 in numerical_columns:
    for col2 in numerical_columns:
        corr_matrix.loc[col1, col2] = churn.stat.corr(col1, col2)

print("\nCorrelation Matrix calculated with df.stat.corr():")
print(corr_matrix)
    


In [None]:
import seaborn as sns
plt.figure(figsize=(10, 8))

sns.heatmap(
    corr_matrix,
    annot = True,
    cmap ='coolwarm',
    fmt = ".2f",
    linewidths=0.5,  # line width must be non-negative
    cbar_kws={'label': 'Correlation Coefficient'}
)



We see several features that essentially have 100% correlation with one another. Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Eve Charge from the pair with Eve Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins. Before dropping them, we rename the international plan column and encode the two plan flags as 0/1:

In [None]:
# Rename the column so it no longer contains an apostrophe or a space
churn = churn.with_column_renamed("Int'l Plan", "INTL_PLAN")

# Encode the two plan flags as 0/1 integers
churn = (
    churn
    .with_column("INTL_PLAN", when(col("INTL_PLAN") == True, 1).otherwise(0))
    .with_column("VMAIL_PLAN", when(col("VMAIL_PLAN") == True, 1).otherwise(0))
)


In [None]:
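# Drop one column from each highly correlated pair
# (the names must match the column names in the table exactly)
churn = churn.drop("Day Charge", "Eve Charge", "Night Charge", "Intl Charge")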
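
As a quick sanity check on why such pairs are redundant, the sketch below computes a Pearson correlation in plain Python. It assumes, hypothetically, that a charge is simply a flat per-minute rate times the minutes used; the rate and the sample minutes are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up minutes and a hypothetical flat per-minute rate (assumption).
day_mins = [265.1, 161.6, 243.4, 299.4, 166.7]
rate_per_min = 0.17
day_charge = [rate_per_min * m for m in day_mins]

r = pearson(day_mins, day_charge)
print(round(r, 6))
```

When one column is a positive multiple of another, the coefficient equals 1 up to floating-point error, so the second column carries no extra information for the model.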

Next, let’s convert our categorical features into numeric features.

In [None]:


cat_cols =['STATE','INTL_PLAN', 'VMAIL_PLAN','AREA_CODE']
ohe = OneHotEncoder(input_cols=cat_cols,
                   output_cols=cat_cols,
                   drop_input_cols=True,
                   drop="first",
                   handle_unknown="ignore")
# Fit the encoder and transform the data
df = ohe.fit(churn).transform(churn)

# Encode the target: 1 where CHURN holds the string "True.", 0 otherwise
df = df.with_column(
    "CHURN",
    when(col("CHURN") == "True.", 1).otherwise(0)
)
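
To illustrate what one-hot encoding with `drop="first"` does conceptually, here is a minimal plain-Python sketch. It is not how `snowflake.ml` implements the encoder, and the real output column names will differ; the `one_hot_drop_first` helper and the sample area codes are assumptions made for this example.

```python
def one_hot_drop_first(values):
    """One-hot encode category labels, dropping the first category.

    Categories are sorted; the first one becomes the implicit baseline
    (all zeros), so k categories produce k - 1 indicator columns.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the first category
    encoded = [[1 if value == cat else 0 for cat in kept] for value in values]
    return kept, encoded


# Made-up area codes for illustration.
area_codes = ["415", "408", "510", "415"]
columns, encoded = one_hot_drop_first(area_codes)
print(columns)
for row in encoded:
    print(row)
```

Dropping one category per feature avoids a set of indicator columns that always sum to 1, which would otherwise be another source of perfectly correlated inputs.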

In [None]:
#df

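## Train Test Split
Let’s split the data into training, validation, and test sets. First, we drop any dataset versions left over from a previous run so that they can be recreated below.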

In [None]:
ALTER DATASET CHURN_TRAIN_DF DROP VERSION 'snf';
ALTER DATASET CHURN_TEST_DF DROP VERSION 'snf';
ALTER DATASET CHURN_VALIDATION_DF DROP VERSION 'snf';


## Dataset
After splitting the data into training, validation, and test sets, I will store them as Snowflake Datasets (versioned, materialized snapshots of the data).
This ensures I can reuse the splits in future runs without repeating the preprocessing steps.


In [None]:

train_df, validation_data, test_df = df.random_split(weights=[0.70, 0.20, 0.10], seed=62)

# Keep the splits in Snowflake for future use by
# materializing each DataFrame into a versioned Dataset
ds1 = dataset.create_from_dataframe(
    session,
    "churn_train_df",
    "snf",
    input_dataframe=train_df)
ds2 = dataset.create_from_dataframe(
    session,
    "churn_test_df",
    "snf",
    input_dataframe=test_df)  # was train_df, which stored the wrong split
ds3 = dataset.create_from_dataframe(
    session,
    "churn_validation_df",
    "snf",
    input_dataframe=validation_data)  # was train_df, which stored the wrong split
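
The sketch below illustrates one common way a weighted, seeded random split can work: each row is assigned independently to a split with probability proportional to its weight. It is a plain-Python illustration under that assumption, not a description of Snowpark's internals, and the local `random_split` helper is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to weights."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds in (0, 1], e.g. roughly [0.7, 0.9, 1.0]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # Pick the first split whose cumulative bound exceeds the draw;
        # fall back to the last split to guard against rounding.
        index = next((i for i, b in enumerate(bounds) if draw < b), len(bounds) - 1)
        splits[index].append(row)
    return splits


rows = list(range(1000))
train, validation, test = random_split(rows, [0.70, 0.20, 0.10], seed=62)
print(len(train), len(validation), len(test))
```

With this kind of per-row assignment the split sizes are only approximately 70/20/10, while a fixed seed makes the assignment reproducible across runs.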

In [None]:
# Load the stored version of each Snowflake Dataset
ds_train = dataset.load_dataset(session, "churn_train_df", "snf")
# Get a Snowpark DataFrame
df_train = ds_train.read.to_snowpark_dataframe()

ds_validation = dataset.load_dataset(session, "churn_validation_df", "snf")
df_validation = ds_validation.read.to_snowpark_dataframe()


ds_test = dataset.load_dataset(session, "churn_test_df", "snf")
df_test = ds_test.read.to_snowpark_dataframe()





In [None]:

# The Snowflake ML libraries are sensitive to data types,
# so cast every feature column to double in all three splits
input_cols = [c for c in df.columns if c != "CHURN"]

for c in input_cols:
    df_train = df_train.with_column(c, col(c).cast("double"))
    df_test = df_test.with_column(c, col(c).cast("double"))
    df_validation = df_validation.with_column(c, col(c).cast("double"))


In [None]:
#df_train.columns
# Filter out the target column to get the feature columns
input_cols = [col_name for col_name in df_train.columns if col_name != "CHURN"]
OUTPUT_COLUMNS="PREDICTED_CHURN"
label_col="CHURN"

In [None]:
model1 = XGBClassifier(
    objective="binary:logistic",
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    use_label_encoder=False,
    eval_metric="logloss",
    input_cols=input_cols  ,
    label_cols=label_col,
    output_cols=OUTPUT_COLUMNS
)

#fit
model1.fit(df_train)
predict_df_train = model1.predict(df_train)


In [None]:
predict_on_test_data = model1.predict(df_test)




test_accuracy = accuracy_score(df=predict_on_test_data, 
                                   y_true_col_names=["CHURN"],
                                   y_pred_col_names=["PREDICTED_CHURN"]
                              )



# Evaluate
print("Test Accuracy:", test_accuracy)
#print("\nClassification Report:\n", classification_report(predict_on_test_data["CHURN"], predict_on_test_data["PREDICTED_CHURN"]))


In [None]:
from snowflake.ml.modeling.metrics import confusion_matrix

# Score the validation set with the fitted model (model1, defined above)
result = model1.predict(df_validation)


metrics = {
"accuracy":accuracy_score(df=result, 
                          y_true_col_names="CHURN", 
                          y_pred_col_names="PREDICTED_CHURN"),

"precision":precision_score(df=result,
                            y_true_col_names="CHURN", 
                            y_pred_col_names="PREDICTED_CHURN"),


"recall": recall_score(df=result, 
                       y_true_col_names="CHURN",
                       y_pred_col_names="PREDICTED_CHURN"),



"f1_score":f1_score(df=result,
                   y_true_col_names="CHURN",
                   y_pred_col_names="PREDICTED_CHURN"),
"confusion_matrix":confusion_matrix(df=result, 
                                    y_true_col_name="CHURN",
                                    y_pred_col_name="PREDICTED_CHURN").tolist()
}

print(f"Scores for the XGBoost model:\n{metrics}")
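
To make the reported metrics easier to interpret, the following plain-Python sketch derives them from confusion-matrix counts on a tiny made-up set of labels. The `binary_metrics` helper and the sample labels are assumptions for illustration; the matrix layout `[[TN, FP], [FN, TP]]` follows the common scikit-learn convention and is assumed rather than confirmed for the Snowflake function.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Made-up labels: 8 non-churners and 2 churners.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

In this toy example accuracy is 0.8 even though only half of the churners are caught (recall 0.5), which is why precision, recall, and F1 are worth reading alongside accuracy for a churn model.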