## Introduction ##

Customer loss can significantly impact a business’s bottom line. By detecting at-risk customers early, companies can proactively engage them with retention strategies. In this workshop, we'll explore how to use machine learning to automate the identification of dissatisfied customers, commonly referred to as churn prediction.

**Internal:** [AWS example](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn_outputs.html#Data)

### Configuring the environment 

I have downloaded the data and uploaded it into Snowflake using the **COPY** command.

In [None]:
# Import Python packages
import json
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import streamlit as st
from IPython.display import display

# Snowflake libraries
from snowflake import snowpark
from snowflake.ml import dataset
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import *

# We can also use Snowpark for our analyses!
session = get_active_session()

# Set the database and schema
session.use_database('ml_models')
session.use_schema('ml_models.ds')


In [None]:
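# Load the CHURN table and convert it to a pandas DataFrame,
# since the analysis below relies on the pandas API
churn = session.table("CHURN").to_pandas()

churn.head(5)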

## EDA

Let’s explore the dataset further and uncover additional insights.

In [None]:
# get the numerical and categorical features
numerical_columns = churn.select_dtypes(include=['number']).columns.tolist()
categorical_columns = churn.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()

print("Numerical Columns:", numerical_columns)
print("Categorical Columns:", categorical_columns)

In [None]:
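pd.set_option("display.max_columns", 500)

# Summary statistics; display() is needed because this is not the last expression in the cell
df = churn.describe()
display(df)

# Histograms of every numeric feature
hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))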

We can see immediately that:

- State appears to be quite evenly distributed.
- Phone takes on too many unique values to be of any practical use. It’s possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it.
- Most of the numeric features are surprisingly nicely distributed, with many showing a bell-shaped, roughly Gaussian distribution. VMail Message is a notable exception, and Area Code shows up as a numeric feature that we should convert to non-numeric.
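One way to back up the observation about Phone is to compare the number of distinct values in each column with the number of rows. The sketch below uses a small hypothetical frame in place of the real table, and the 0.9 cutoff is an arbitrary assumption rather than a rule from the original analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for the churn table
toy = pd.DataFrame({
    "STATE": ["KS", "OH", "KS", "NJ"],
    "PHONE": ["382-4657", "371-7191", "358-1921", "375-9999"],
    "CHURN": ["False.", "False.", "True.", "False."],
})

# Fraction of distinct values per column; a ratio near 1.0 means the
# column behaves like an identifier rather than a usable feature
unique_ratio = toy.nunique() / len(toy)

# Columns whose values are (almost) all unique; 0.9 is an assumed cutoff
id_like = unique_ratio[unique_ratio > 0.9].index.tolist()
print(id_like)
```

Running the same two lines on `churn` would flag columns that are candidates for dropping.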

In [None]:
churn = churn.drop("PHONE", axis=1)
churn["AREA_CODE"] = churn["AREA_CODE"].astype(object)


In [None]:
import matplotlib.pyplot as plt

# Histograms of numeric features by CHURN class
for column in churn.select_dtypes(include=["number"]).columns:
    hist = churn[[column, "CHURN"]].hist(by="CHURN", bins=30, edgecolor='black', figsize=(4, 3))
    plt.suptitle(f"{column} by CHURN", y=1)  # Add title
    plt.tight_layout()
    plt.show()


In [None]:
df_corr = churn.select_dtypes(include=['number']).corr()
df_corr

In [None]:
# Scatter matrix only on numeric columns
pd.plotting.scatter_matrix(churn.select_dtypes(include=['number']), figsize=(12, 12), diagonal='hist', alpha=0.5)
plt.suptitle("Scatter Matrix of Numeric Features", y=1)
plt.show()

We see several features that essentially have 100% correlation with one another. Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Eve Charge from the pair with Eve Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins:
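Rather than reading the pairs off the scatter matrix, they can also be listed programmatically. The helper below is a minimal sketch: `highly_correlated_pairs`, its 0.95 threshold, and the toy column names are all hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) fail the comparison and are skipped
    return [(a, b, round(float(v), 3)) for (a, b), v in pairs.items() if v >= threshold]

# Hypothetical toy data: the charge is an exact multiple of the minutes
toy = pd.DataFrame({
    "DAY_MINS": [100.0, 200.0, 150.0, 300.0],
    "DAY_CHARGE": [17.0, 34.0, 25.5, 51.0],
    "CUSTSERV_CALLS": [1, 4, 0, 2],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

Calling `highly_correlated_pairs(churn)` on the real data should surface the minutes/charge pairs discussed above, assuming the charge is a fixed rate applied to the minutes.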

In [None]:
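# Column names are assumed to follow the upper-case, underscore style of
# AREA_CODE; inspect churn.columns if this drop raises a KeyError
churn = churn.drop(["DAY_CHARGE", "EVE_CHARGE", "NIGHT_CHARGE", "INTL_CHARGE"], axis=1)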

In [None]:

# Make a copy to avoid modifying the original
df = churn.copy()

# Step 1: Convert bool columns to string so they are treated as categorical
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(str)

# Step 2: One-hot encode object and bool columns, dropping first level
categorical_cols = df.select_dtypes(include='object').columns.union(bool_cols)
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Step 3: Check result

df_encoded.head()
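To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical frame (the `INTL_PLAN` column name is an assumption). It also shows where a target column named `CHURN_True.` would come from if the raw `CHURN` values are the strings `False.` and `True.`.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column
toy = pd.DataFrame({
    "INTL_PLAN": ["no", "yes", "no"],
    "CHURN": ["False.", "True.", "False."],
    "DAY_MINS": [265.1, 161.6, 243.4],
})

# drop_first=True drops the first level of each category (in sorted order),
# so a two-level column becomes a single indicator column
encoded = pd.get_dummies(toy, columns=["INTL_PLAN", "CHURN"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level avoids a redundant indicator column for each categorical feature.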

With the categorical features converted into numeric features, let’s split the data into training, validation, and test sets.

In [None]:
-- Drop any existing 'v1' versions so the datasets can be re-created when the notebook is re-run
ALTER DATASET CHURN_TRAIN_DF DROP VERSION 'v1';
ALTER DATASET CHURN_TEST_DF DROP VERSION 'v1';
ALTER DATASET CHURN_VALIDATION_DF DROP VERSION 'v1';


In [None]:
# Shuffle the rows, then cut at 70% and 90%:
# 70% training, 20% validation, 10% test
train_data, validation_data, test_data = np.split(
    df_encoded.sample(frac=1, random_state=1729),
    [int(0.7 * len(df_encoded)), int(0.9 * len(df_encoded))],
)


# Keep the datasets in Snowflake for future use
from snowflake.ml import dataset

train_df = session.create_dataframe(train_data)
validation_df = session.create_dataframe(validation_data)
test_df = session.create_dataframe(test_data)

# Materialize each DataFrame's contents into its own Dataset
ds1 = dataset.create_from_dataframe(
    session,
    "churn_train_df",
    "v1",
    input_dataframe=train_df)
ds2 = dataset.create_from_dataframe(
    session,
    "churn_test_df",
    "v1",
    input_dataframe=test_df)
ds3 = dataset.create_from_dataframe(
    session,
    "churn_validation_df",
    "v1",
    input_dataframe=validation_df)
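The 70/20/10 split can be sanity-checked on a small frame. The sketch below reproduces the same shuffle-then-cut logic with `iloc` slicing on hypothetical toy data and confirms that the three parts are disjoint and together cover every row.

```python
import pandas as pd

# Hypothetical toy frame with 10 rows
toy = pd.DataFrame({"x": range(10)})

# Shuffle once, then cut at the 70% and 90% marks
shuffled = toy.sample(frac=1, random_state=1729)
n = len(shuffled)
train_end, val_end = int(0.7 * n), int(0.9 * n)

train = shuffled.iloc[:train_end]
validation = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

sizes = (len(train), len(validation), len(test))
# Every original row index should appear in exactly one part
all_index = list(train.index) + list(validation.index) + list(test.index)
print(sizes, sorted(all_index) == list(toy.index))
```

A similar check on the real split would catch a mix-up such as writing the same DataFrame to more than one Dataset.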

In [None]:
# Load each Dataset version from Snowflake and read it back
# through a Snowpark DataFrame into a pandas DataFrame
ds_train = dataset.load_dataset(session, "churn_train_df", "v1")
df_train = ds_train.read.to_snowpark_dataframe().to_pandas()

ds_validation = dataset.load_dataset(session, "churn_validation_df", "v1")
df_validation = ds_validation.read.to_snowpark_dataframe().to_pandas()

ds_test = dataset.load_dataset(session, "churn_test_df", "v1")
df_test = ds_test.read.to_snowpark_dataframe().to_pandas()

In [None]:
df_train.columns

In [None]:
import xgboost as xgb  # pre-installed in the Snowflake Container Runtime notebook
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# 'CHURN_True.' is the one-hot encoded target column
target_col = 'CHURN_True.'

# Full feature matrix and target (all rows, before splitting)
X = df_encoded.drop(columns=[target_col])
y = df_encoded[target_col]

# Separate features and target for each split
X_train = df_train.drop(columns=[target_col])
y_train = df_train[target_col]

X_val = df_validation.drop(columns=[target_col])
y_val = df_validation[target_col]

X_test = df_test.drop(columns=[target_col])
y_test = df_test[target_col]




In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

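# Instantiate the classifier; the original hyperparameters are not shown,
# so library defaults are assumed here
model = XGBClassifier()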
model.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)


# Predict on test
y_pred = model.predict(X_test)

# Evaluate
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
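For reading the classification report, it helps to see how accuracy, precision, and recall for the churn class are derived from the same predictions. The sketch below computes them by hand on hypothetical labels (1 = churned, 0 = stayed); the numbers are made up for illustration and are not results from the model above.

```python
import numpy as np

# Hypothetical labels: 7 customers stayed, 3 churned
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# Confusion-matrix counts for the churn (positive) class
tp = int(((y_true_toy == 1) & (y_pred_toy == 1)).sum())  # churners caught
fp = int(((y_true_toy == 0) & (y_pred_toy == 1)).sum())  # false alarms
fn = int(((y_true_toy == 1) & (y_pred_toy == 0)).sum())  # churners missed

accuracy = float((y_true_toy == y_pred_toy).mean())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the accuracy looks reasonable while most churners are missed, which is why the per-class rows of the report are worth checking alongside the accuracy score.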
