# 📈 Telco Churn Model

In this Quickstart guide, we will play the role of a data scientist at a telecom company that wants to identify users who are at high risk of churning. To do so, we need a model that learns to recognize such users. We will use a Snowflake Notebook together with Snowpark to build a Random Forest classifier for this task.


### Prerequisites

- Familiarity with basic Python and SQL
- Familiarity with training ML models
- Familiarity with data science notebooks
- A Snowflake account: go to the [Snowflake](https://signup.snowflake.com/) sign-up page and register for a free account. After registration, you will receive an email containing a link that takes you to Snowflake, where you can sign in.

### What You'll Learn

- How to import/load data with Snowflake Notebook
- How to train a Random Forest model with Snowpark ML
- How to visualize the predicted results from the classification model
- How to build an interactive web app and make predictions on new users


First, add the `imbalanced-learn` and `snowflake-ml-python` packages from the package picker on the top right. We will be using these packages later in the notebook.

## Importing Data
To bring our churn dataset into a Snowsight notebook, we will load some Parquet data from AWS S3.

In [None]:
CREATE OR REPLACE STAGE TELCO_CHURN_EXTERNAL_STAGE_DEMO
    URL = 's3://sfquickstarts/notebook_demos/churn/';

In [None]:
CREATE FILE FORMAT IF NOT EXISTS MY_PARQUET_FORMAT TYPE = PARQUET COMPRESSION = SNAPPY;

In [None]:
LS @TELCO_CHURN_EXTERNAL_STAGE_DEMO;

In [None]:
CREATE TABLE if not exists TELCO_CHURN_RAW_DATA_DEMO USING TEMPLATE ( 
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*)) 
    FROM 
        TABLE( INFER_SCHEMA( 
        LOCATION => '@TELCO_CHURN_EXTERNAL_STAGE_DEMO', 
        FILE_FORMAT => 'MY_PARQUET_FORMAT',
        FILES => 'telco_churn.parquet'
        ) 
    ) 
);

In [None]:
COPY INTO TELCO_CHURN_RAW_DATA_DEMO
FROM @TELCO_CHURN_EXTERNAL_STAGE_DEMO
FILES = ('telco_churn.parquet')
FILE_FORMAT = (
    TYPE=PARQUET,
    REPLACE_INVALID_CHARACTERS=TRUE,
    BINARY_AS_TEXT=FALSE
)
MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE
ON_ERROR=ABORT_STATEMENT;

In [None]:
SELECT * FROM TELCO_CHURN_RAW_DATA_DEMO;

# Working with Data

Now that we have our data loaded in, we can start working with the data using our familiar data science libraries in Python.

In [None]:
import pandas as pd
import numpy as np
import streamlit as st
import altair as alt
from imblearn.over_sampling import SMOTE 

import warnings
warnings.filterwarnings("ignore")

In [None]:
from snowflake.snowpark.context import get_active_session
session = get_active_session()

In [None]:
telco_churn_snow_df = cell9.to_df()
telco_churn_snow_df

## Exploratory Data Analysis (EDA)

Machine learning models thrive on clean and well-organized data. To ensure our models perform at their best, we'll investigate our dataset to address any missing values and visualize the distributions of each column.

### Basic Summary Statistics

In [None]:
telco_churn_snow_df.describe()

### Checking nulls with Pandas

In [None]:
telco_churn_pdf = telco_churn_snow_df.to_pandas()
telco_churn_pdf.isnull().sum()

As can be seen, there are no null values in any of the columns.

### Visualizing Feature Distributions

In [None]:
columns = telco_churn_pdf.columns
num_columns_for_display = 3
display_columns = st.columns(num_columns_for_display)
for index, col in enumerate(columns):
    source = pd.DataFrame(telco_churn_pdf[col])
    # Histogram of the column, binned along a quantitative x-axis
    chrt = alt.Chart(source).mark_bar().encode(
        alt.X(f"{col}:Q", bin=True),
        y='count()',
    )
    # Cycle through the display columns so the charts form a three-wide grid
    with display_columns[index % num_columns_for_display]:
        st.altair_chart(chrt)

### Understanding Churn Rate - Imbalanced dataset

In [None]:
telco_churn_snow_df.group_by('"Churn"').count()

If you want to understand a model, you need to know its weaknesses. When the target variable has one class that is much more frequent than the other, your data is imbalanced. This causes issues when evaluating models since both classes don't get equal attention.

In contrast, a model trained on balanced data sees an equal number of observations per class. By eliminating the imbalance, we also eliminate the model's ability to achieve high metric scores simply by favoring the majority class. This means that when we evaluate our model, the metrics better reflect how well it makes valuable predictions.
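To make the evaluation problem concrete, here is a minimal sketch using made-up labels (a hypothetical 90/10 split, not the real dataset). A "model" that always predicts the majority class scores high on accuracy while catching no churners at all.

```python
# Hypothetical label distribution: 1 = churned, 0 = stayed (not the real dataset).
actual = [0] * 90 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall on the churn class: of the customers who churned, how many were caught?
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(a == 1 for a in actual)

print(f"accuracy={accuracy:.2f}, churn recall={recall:.2f}")
```

The accuracy looks respectable, yet the recall shows the model is useless for the task we actually care about.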

#### Comparing big data processing with pandas vs. Snowpark DataFrames

For the groupby aggregation query above, we used Snowpark DataFrames to perform the operation. Snowpark's DataFrame API lets you query and process data at scale in Snowflake. With Snowpark, you no longer have to convert your DataFrames to pandas in memory: the data is processed inside Snowflake, on its elastic and serverless engine, without being moved to the system where your application code runs.

Below we compare the query performance of the groupby aggregation with Snowpark vs. pandas.


In [None]:
import time
start = time.time()
# Snowpark DataFrames are evaluated lazily, so call collect() to actually run the query
telco_churn_snow_df.group_by('"Churn"').count().collect()
end = time.time()
st.markdown(f"Total Time with Snowpark: {end-start}")

In [None]:
start = time.time()
telco_churn_snow_pdf = telco_churn_snow_df.to_pandas()
end_mid = time.time()
telco_churn_snow_pdf.groupby("Churn").count()
end = time.time()
st.markdown(f"Total Time with Pandas: {end-start}")

We can see that Snowpark runs much faster. The pandas version must first pull the entire table out of Snowflake and convert it to a pandas DataFrame, and the bulk of its time is spent on that I/O rather than on the aggregation itself.

In [None]:
st.markdown(f"I/O time to convert to Pandas dataframe: {end_mid-start}")
st.markdown(f"Processing time with Pandas dataframe: {end-end_mid}")
st.markdown(f"I/O accounts for {(end_mid-start)/(end-start)*100:.2f}% of total time")
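The same start/stop pattern can be wrapped in a small reusable helper. The sketch below is an illustration only: `timed` is a hypothetical helper, and the two timed blocks are plain-Python stand-ins for the data transfer and the aggregation, not Snowpark or pandas calls. It uses `time.perf_counter`, which is generally better suited than `time.time` for measuring short intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the elapsed seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}

# Stand-in for the transfer step (to_pandas() in the notebook)
with timed("transfer", timings):
    rows = [{"Churn": i % 7 == 0} for i in range(100_000)]

# Stand-in for the aggregation step (groupby().count() in the notebook)
with timed("aggregate", timings):
    churned = sum(row["Churn"] for row in rows)

total = timings["transfer"] + timings["aggregate"]
io_share = timings["transfer"] / total * 100
print(f"transfer share of total time: {io_share:.2f}%")
```

Keeping the timings in a dictionary makes it easy to report each phase and its share of the total without juggling several `start`/`end` variables.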

# Feature Engineering

To prepare our data for our model, we'll need to handle the imbalanced data problem by upsampling our dataset. 

For this, we'll be using the `SMOTE` algorithm from the `imblearn` package.
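The core idea behind SMOTE is to create synthetic minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbours. The following is a simplified sketch of that idea in plain Python, not the `imblearn` implementation; the function name `smote_sketch`, the parameter `k`, and the sample rows are assumptions made for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=111):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class rows: (customer service calls, monthly charge)
minority = [(4.0, 60.0), (5.0, 72.0), (6.0, 65.0), (3.0, 58.0)]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Because every synthetic point lies on a line segment between two real minority rows, the new rows stay inside the region the minority class already occupies instead of being exact duplicates.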

In [None]:
# Extract the training features
features_names = [col for col in telco_churn_pdf.columns if col not in ['Churn']]
features = telco_churn_pdf[features_names]

# extract the target
target = telco_churn_pdf['Churn']
st.markdown("## Let's balance the dataset.")
# upsample the minority class in the dataset
upsampler = SMOTE(random_state = 111)
features, target = upsampler.fit_resample(features, target)
st.dataframe(features.head())

st.markdown("## Upsampled data.")
upsampled_data = pd.concat([features, target], axis=1)
upsampled_data.reset_index(inplace=True)
upsampled_data.rename(columns={'index': 'INDEX'}, inplace=True)
st.dataframe(upsampled_data.head())

In [None]:
upsampled_data = session.create_dataframe(upsampled_data)
# Get the list of column names from the dataset
feature_names_input = [c for c in upsampled_data.columns if c != '"Churn"' and c != "INDEX"]

In [None]:
upsampled_data[feature_names_input]

Once that's taken care of, we'll use scikit-learn to preprocess our data into a format that the model expects. This means scaling our features and splitting our data into training and testing datasets.

We can perform StandardScaler preprocessing via sklearn to process in-memory or Snowpark ML preprocessing for pushdown compute. 
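As a reminder of what standard scaling does, each value is shifted by the column mean and divided by the column standard deviation. A minimal sketch in plain Python follows; the `standardize` helper and the sample values are hypothetical. It uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses.

```python
import statistics

def standardize(values):
    """Scale values to zero mean and unit variance using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

monthly_charge = [37.0, 52.0, 89.0, 41.0, 66.0]  # hypothetical values
scaled = standardize(monthly_charge)
print([round(v, 3) for v in scaled])
```

In practice the mean and standard deviation are learned once from the training data (the `fit` step) and reused for any new rows (the `transform` step), so that new inputs are scaled consistently with what the model saw during training.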

## scikit-learn Preprocessing with pandas DataFrames

In [None]:
import sklearn.preprocessing as pp_original
# Initialize a StandardScaler object (the sklearn scaler takes no column names; it scales every column it is given)
scaler = pp_original.StandardScaler()
features_pdf = upsampled_data[feature_names_input].to_pandas()

# Fit the scaler to the dataset
scaler.fit(features_pdf)

# Transform the dataset using the fitted scaler
scaled_features = scaler.transform(features_pdf)
scaled_features = pd.DataFrame(scaled_features, columns = features_names)
scaled_features

## Snowpark ML preprocessing with Snowpark DataFrames

Note the similarity between the APIs used for sklearn and Snowpark ML.

In [None]:
import snowflake.ml.modeling.preprocessing as pp

# Initialize a StandardScaler object with input and output column names
scaler = pp.StandardScaler(
    input_cols=feature_names_input,
    output_cols=feature_names_input
)

# Fit the scaler to the dataset
scaler.fit(upsampled_data)

# Transform the dataset using the fitted scaler
scaled_features = scaler.transform(upsampled_data)
scaled_features

## Let's perform an 80/20 train/test split

In [None]:
# Split the scaled_features dataset into training and testing sets with an 80/20 ratio
training, testing = scaled_features.random_split(weights=[0.8, 0.2], seed=111)
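A weighted random split generally assigns each row to a partition by a random draw, so the resulting sizes are close to, but not exactly, 80/20, and fixing the seed makes the split repeatable. The sketch below illustrates that behaviour in plain Python; `split_rows` is a hypothetical helper, not the Snowpark implementation.

```python
import random

def split_rows(rows, weights, seed):
    """Assign each row to train or test by a seeded random draw against the weights."""
    rng = random.Random(seed)
    train_share = weights[0] / sum(weights)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_share else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for 1,000 table rows
train, test = split_rows(rows, weights=[0.8, 0.2], seed=111)
print(len(train), len(test))
```

Every row lands in exactly one partition, which is the property that keeps test rows out of the training data.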

# Model Training - Random Forest Classifier 

The mystery model of the day is a [random forest classifier](https://towardsdatascience.com/understanding-random-forest-58381e0602d2). I'll spare you the details on how it works, but in short, it creates an ensemble of smaller models that all make predictions on the same data. Whichever prediction has the most votes is the final prediction that the model goes with.
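The voting step can be sketched in a few lines. The example below uses made-up outputs from five hypothetical trees for a single customer. It shows both a hard majority vote, as described above, and an average of per-tree probabilities, which is how scikit-learn's random forest combines its trees.

```python
from collections import Counter

def hard_vote(tree_predictions):
    """Return the class label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def soft_vote(tree_probabilities):
    """Average each tree's churn probability and threshold at 0.5."""
    mean_probability = sum(tree_probabilities) / len(tree_probabilities)
    return 1 if mean_probability >= 0.5 else 0

# Hypothetical outputs from five trees for a single customer (1 = churn)
tree_predictions = [1, 0, 1, 1, 0]
tree_probabilities = [0.9, 0.2, 0.6, 0.7, 0.4]

hard_result = hard_vote(tree_predictions)
soft_result = soft_vote(tree_probabilities)
print(hard_result, soft_result)
```

The two schemes usually agree, but averaging probabilities lets confident trees count for more than trees that are close to undecided.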

In [None]:
from snowflake.ml.modeling.ensemble import RandomForestClassifier

# Define the target variable (label) column name
label = ['"Churn"']

# Define the output column name for the predicted label
output_label = ['"predicted_churn"']

# Initialize a RandomForestClassifier object with input, label, and output column names
model = RandomForestClassifier(
    input_cols=feature_names_input,
    label_cols=label,
    output_cols=output_label,
)

In [None]:
# Train the RandomForestClassifier model using the training set
_ = model.fit(training)

In [None]:
# Predict the target variable (churn) for the testing set using the trained model
results = model.predict(testing)

In [None]:
testing

# Model Evaluation

Model evaluation is all about checking how well our machine learning model is doing by comparing its predictions to the actual outcomes. 

In [None]:
# return only the predicted churn values
predictions = results.to_pandas().sort_values("INDEX")[output_label].astype(int).to_numpy().flatten()
actual = testing.to_pandas().sort_values("INDEX")[['Churn']].to_numpy().flatten()
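With the predicted and actual labels lined up, the usual next step is to count the four confusion-matrix outcomes and derive metrics from them. The following is a minimal sketch with made-up labels; the `evaluate` helper and the sample lists are hypothetical, and in the notebook the `actual` and `predictions` arrays could be passed in instead.

```python
def evaluate(actual, predicted, positive=1):
    """Return confusion-matrix counts and derived metrics for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        # Precision: of the customers flagged as churners, how many really churned?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: of the customers who churned, how many did the model flag?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

sample_actual = [1, 0, 1, 1, 0, 0, 1, 0]     # hypothetical ground truth
sample_predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model output
metrics = evaluate(sample_actual, sample_predicted)
print(metrics)
```

For churn, recall is often the metric to watch, since a missed churner usually costs more than a retention offer sent to a customer who would have stayed anyway.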

## Feature Importance

Feature importance is all about figuring out which input variables are the real MVPs when it comes to making predictions with our machine learning model. We'll find out which features are the most important by looking at how much they contribute to the model's overall performance.

In [None]:
rf = model.to_sklearn()
importances = pd.DataFrame(
    list(zip(features.columns, rf.feature_importances_)),
    columns=["feature", "importance"],
)

bar_chart = alt.Chart(importances).mark_bar().encode(
    x="importance:Q",
    y=alt.Y("feature:N", sort="-x")
)
st.altair_chart(bar_chart, use_container_width=True)

## Predicting churn for a new user
Using our trained random forest model, we can make predictions that tell us whether a new customer will churn or not.

In [None]:
account_weeks = "10"
data_usage = "1.7"
mins_per_month = "82"
daytime_calls = "67"
customer_service_calls = "4"
monthly_charge = "37"
roam_mins = "0"
overage_fee = "9.5"
renewed_contract = "true"
has_data_plan = "true"
user_vector = np.array([
    account_weeks,
    # The flags are strings, so compare explicitly; any non-empty string is truthy
    1 if renewed_contract == "true" else 0,
    1 if has_data_plan == "true" else 0,
    data_usage,
    customer_service_calls,
    mins_per_month,
    daytime_calls,
    monthly_charge,
    overage_fee,
    roam_mins,
]).reshape(1,-1)

user_dataframe = pd.DataFrame(user_vector, columns=[f'"{_}"' for _ in features.columns])


#### Input dataframe for new user

In [None]:
user_dataframe

In [None]:
user_vector = scaler.transform(user_dataframe)

In [None]:
model.predict(user_vector)[['"predicted_churn"']].values

In [None]:
st.markdown("#### Scaled dataframe for new user")
st.dataframe(user_vector)
st.markdown("#### Prediction")
predicted_value = model.predict(user_vector)[['"predicted_churn"']].values.astype(int).flatten()
user_probability = model.predict_proba(user_vector)
probability_of_prediction = max(user_probability[user_probability.columns[-2:]].values[0]) * 100
prediction = 'churn' if predicted_value == 1 else 'not churn'
st.markdown(prediction)

In [None]:
col1, col2 = st.columns(2)

with col1: 
    account_weeks = st.slider("AccountWeeks", int(features["AccountWeeks"].min()) , int(features["AccountWeeks"].max()))
    data_usage = st.slider("DataUsage", int(features["DataUsage"].min()) , int(features["DataUsage"].max()))
    mins_per_month = st.slider("DayMins", int(features["DayMins"].min()) , int(features["DayMins"].max()))
    daytime_calls = st.slider("DayCalls", int(features["DayCalls"].min()) , int(features["DayCalls"].max()))
    renewed_contract =  st.selectbox("Renewed Contract?",('true','false'))
    
with col2: 
    monthly_charge = st.slider("MonthlyCharge", int(features["MonthlyCharge"].min()) , int(features["MonthlyCharge"].max()))
    roam_mins = st.slider("RoamMins", int(features["RoamMins"].min()) , int(features["RoamMins"].max()))
    customer_service_calls = st.slider("CustServCalls", int(features["CustServCalls"].min()) , int(features["CustServCalls"].max()))
    overage_fee = st.slider("OverageFee", int(features["OverageFee"].min()) , int(features["OverageFee"].max()))
    has_data_plan = st.selectbox("Has Data Plan?",('true','false'))

user_vector = np.array([
    account_weeks,
    # The selectbox returns the string 'true' or 'false', so compare explicitly
    1 if renewed_contract == "true" else 0,
    1 if has_data_plan == "true" else 0,
    data_usage,
    customer_service_calls,
    mins_per_month,
    daytime_calls,
    monthly_charge,
    overage_fee,
    roam_mins,
]).reshape(1,-1)

user_dataframe = pd.DataFrame(user_vector, columns=[f'"{_}"' for _ in features.columns])
user_vector = scaler.transform(user_dataframe)
with col1: 
    st.markdown("#### Input dataframe for new user")
    st.dataframe(user_dataframe)
with col2:
    st.markdown("#### Scaled dataframe for new user")
    st.dataframe(user_vector)

st.markdown("#### Prediction")
predicted_value = model.predict(user_vector)[['"predicted_churn"']].values.astype(int).flatten()
user_probability = model.predict_proba(user_vector)
probability_of_prediction = max(user_probability[user_probability.columns[-2:]].values[0]) * 100
prediction = 'churn' if predicted_value == 1 else 'not churn'
st.markdown(prediction)

## Exporting Model with Timestamp

In [None]:
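import pickle
import datetime

# Use a timestamp without spaces or colons so the filename is valid on any filesystem
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
filename = f'telco-eda-model-{timestamp}.pkl'

# The context manager closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(model, f)
print(f"Saved to {filename}")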
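A saved pickle is only useful if it can be loaded back. The sketch below shows a save-and-restore round trip in a temporary directory. The `artifact` dictionary is a placeholder standing in for the trained model, and the temporary directory is an assumption made so the example cleans up after itself.

```python
import datetime
import pickle
import tempfile
from pathlib import Path

# Placeholder object standing in for the trained model
artifact = {"model": "placeholder", "features": ["AccountWeeks", "DayMins"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = Path(tmp_dir) / f"telco-eda-model-{timestamp}.pkl"

    # Save the object to disk
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # Load it back, as a later session would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.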

Congratulations on making it to the end of this Quickstart, where we explored churn modeling using Snowflake Notebooks! We learned how to import/load data into Snowflake, train a Random Forest model, visualize predictions, build an interactive data app, and make predictions for new users.
