![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

# Telco customer churn data

In this notebook, we explore how to predict customer churn, which telecommunication companies need to anticipate in order to retain customers effectively.

## Installing `giskard` and `lightgbm`

In [None]:
!pip install giskard lightgbm

## 1. Data Reading

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
import lightgbm as lbt

In [None]:
# import telecom dataset into a pandas data frame

dataset_url="https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv"

df_telco=pd.read_csv(dataset_url)

# check unique values of each column
#for column in df_telco.columns:
#    print('Column: {} - Unique Values: {}'.format(column, df_telco[column].unique()))

# summary of the data frame
#df_telco.info()

# transform the column TotalCharges into a numeric data type
df_telco['TotalCharges'] = pd.to_numeric(df_telco['TotalCharges'], errors='coerce')

# drop observations with null values
df_telco.dropna(inplace=True)

# drop the customerID column from the dataset
df_telco.drop(columns='customerID', inplace=True)

# remove (automatic) from payment method names
df_telco['PaymentMethod'] = df_telco['PaymentMethod'].str.replace(' (automatic)', '', regex=False)
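
The `errors='coerce'` step above is what makes the later `dropna` meaningful. Here is a minimal sketch on a hypothetical toy frame (the values are made up for illustration, not taken from the Telco file) showing that an unparseable entry becomes `NaN` and is then dropped:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: one entry is a blank string.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors='coerce' turns unparseable entries into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# dropna then removes the rows that could not be parsed.
toy_clean = toy.dropna()
print(n_missing, len(toy_clean))
```

Without `errors='coerce'`, `pd.to_numeric` would raise on the blank entry instead of marking it as missing.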

## 2. Initialising feature names

In [None]:
# Declare the type of each column in the dataset (e.g. category, numeric, text)
column_types = {'gender': "category",
                'SeniorCitizen': "numeric", 
                'Partner': "category", 
                'Dependents': "category", 
                'tenure': "numeric",
                'PhoneService': "category", 
                'MultipleLines': "category", 
                'InternetService': "category", 
                'OnlineSecurity': "category",
                'OnlineBackup': "category", 
                'DeviceProtection': "category", 
                'TechSupport': "category", 
                'StreamingTV': "category",
                'StreamingMovies': "category", 
                'Contract': "category", 
                'PaperlessBilling': "category", 
                'PaymentMethod': "category",
                'MonthlyCharges': "numeric", 
                'TotalCharges': "numeric", 
                'Churn': "category"}

# feature_types is used to declare the features the model is trained on
feature_types = {i:column_types[i] for i in column_types if i!='Churn'}

## 3. Feature Engineering (manual, without sklearn transformers)

In this notebook, we write the feature transformations by hand (except one-hot encoding, which uses sklearn's `OneHotEncoder`) and later wrap them inside a prediction function, instead of expressing them with sklearn's pre-defined transformers.

**Important note: all the transformers need to be fitted outside the `prediction_function` passed to Giskard, so that the same transformations are applied throughout the code.**
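
To see why this matters, here is a minimal sketch using a hypothetical toy column (the values and the helper `min_max` are made up for illustration). Refitting the min and max on a subset changes the scaled values of the very same rows:

```python
import pandas as pd

# Hypothetical toy column standing in for a numeric feature such as tenure.
full = pd.Series([0.0, 10.0, 20.0, 40.0])
subset = full.iloc[1:3]  # a slice of the data: values 10 and 20


def min_max(values, lo, hi):
    """Scale values to [0, 1] given a precomputed min and max."""
    return (values - lo) / (hi - lo)


# Parameters fitted once on the full data, then reused for the subset.
lo, hi = full.min(), full.max()
scaled_reused = min_max(subset, lo, hi)

# Parameters refitted on the subset alone: same rows, different values.
scaled_refit = min_max(subset, subset.min(), subset.max())

print(scaled_reused.tolist())  # [0.25, 0.5]
print(scaled_refit.tolist())   # [0.0, 1.0]
```

A model trained on features scaled with the full-data parameters would receive inconsistent inputs if the scaling were refitted at prediction time.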

In [None]:
df_telco_transformed = df_telco.copy()

# label encoding (binary variables)
label_encoding_columns = ['gender', 'Partner', 'Dependents', 'PaperlessBilling', 'PhoneService', 'Churn']

# encode categorical binary features using label encoding
for column in label_encoding_columns:
    if column == 'gender':
        df_telco_transformed[column] = df_telco_transformed[column].map({'Female': 1, 'Male': 0})
    else: 
        df_telco_transformed[column] = df_telco_transformed[column].map({'Yes': 1, 'No': 0}) 
        
# one-hot encoding (categorical variables with more than two levels)
one_hot_encoding_columns = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
                            'TechSupport', 'StreamingTV',  'StreamingMovies', 'Contract', 'PaymentMethod']

# Use OneHotEncoder; optionally pass drop='first' to eliminate redundant features
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed `sparse`
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
array_hot_encoded =  one_hot_encoder.fit_transform(df_telco_transformed[one_hot_encoding_columns])
data_hot_encoded = pd.DataFrame(array_hot_encoded, index=df_telco_transformed.index,columns=one_hot_encoder.get_feature_names_out())
df_telco_transformed=df_telco_transformed.drop(columns=one_hot_encoding_columns)
df_telco_transformed = pd.concat([df_telco_transformed,data_hot_encoded], axis=1)

# encode categorical variables with more than two levels using one-hot encoding
# df_telco_transformed = pd.get_dummies(df_telco_transformed, columns = one_hot_encoding_columns)

# min-max normalization (numeric variables)
min_max_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']

# minimum value of the column
min_column={} 
# maximum value of the column
max_column={}

# scale numerical variables using min-max scaling
for column in min_max_columns:
    # minimum value of the column
    min_column[column] = df_telco_transformed[column].min()
    # maximum value of the column
    max_column[column] = df_telco_transformed[column].max()
    # min-max scaling
    df_telco_transformed[column] = (df_telco_transformed[column] - min_column[column]) / (max_column[column] - min_column[column])


## 4. Data splitting

In [None]:
#------ raw data
# select independent variables
X_raw = df_telco.drop(columns='Churn')

# select dependent variables
Y_raw = df_telco.loc[:, 'Churn']

# split the data in training and testing sets
X_raw_train, X_raw_test, Y_raw_train, Y_raw_test = train_test_split(X_raw, Y_raw, test_size=0.25, random_state=40, shuffle=True)
# Prepare data to upload on Giskard
train_raw_data = pd.concat([X_raw_train, Y_raw_train], axis=1)
test_raw_data = pd.concat([X_raw_test, Y_raw_test ], axis=1)

#------ transformed data
# select independent variables
X = df_telco_transformed.drop(columns='Churn')

# select dependent variables
Y = df_telco_transformed.loc[:, 'Churn']

# split the data in training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=40, shuffle=True)
# Prepare data to upload on Giskard
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)
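
Both splits above use the same `test_size` and `random_state` on frames with the same rows, so the raw and transformed test sets should contain the same observations. A minimal sketch on hypothetical toy frames (made up for illustration) that checks this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two toy frames with the same rows and index but different values.
raw = pd.DataFrame({"tenure": range(10, 30)})  # 20 rows
transformed = raw / raw.max()

raw_train, raw_test = train_test_split(
    raw, test_size=0.25, random_state=40, shuffle=True
)
tr_train, tr_test = train_test_split(
    transformed, test_size=0.25, random_state=40, shuffle=True
)

# The shuffle depends only on the number of rows and the seed.
same_rows = raw_test.index.equals(tr_test.index)
print(same_rows)
```

This is what lets us train on the transformed data while uploading the matching raw rows to Giskard.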

## 5. Models Evaluation

In [None]:
seed=123
models = {}
models['dummy_classifier']= {"model": DummyClassifier(random_state=seed, strategy='most_frequent'), "accuracy":0} 
models['k_nearest_neighbors']= {"model": KNeighborsClassifier(), "accuracy":0} 
models['logistic_regression']= {"model": LogisticRegression(random_state=seed), "accuracy":0} 
models['random_forest']= {"model": RandomForestClassifier(random_state=seed), "accuracy":0} 
models['gradient_boosting']= {"model": GradientBoostingClassifier(random_state=seed), "accuracy":0} 
models['LGBM']= {"model": lbt.LGBMClassifier(random_state=seed), "accuracy":0} 
    

# test the accuracy of each model using default hyperparameters
for name in models.keys():
    # fit the model with the training data
    models[name]['model'].fit(X_train, Y_train)
    # make predictions with the testing data
    predictions = models[name]['model'].predict(X_test)
    # calculate accuracy
    accuracy = accuracy_score(Y_test, predictions)
    # store the accuracy alongside the model
    models[name]['accuracy'] = accuracy
    # print classifier accuracy
    print('Classifier: {}, Accuracy: {}'.format(name, accuracy))

## 6. Let's build our `wrapped_prediction_function` function 

We pick `LGBM` here, but feel free to use any of the models above.

**Important note: notice that we defined `min_column[column]` and `max_column[column]` outside `wrapped_prediction_function`. This matters: you should not `fit` any transformer inside `wrapped_prediction_function`, even a hand-written one, because the function may receive only a subset of the dataset, whereas the transformers must be fitted on the full dataset.**

In [None]:
def wrapped_prediction_function(test_dataset):
    df_telco_transformed=test_dataset.copy()
    # label encoding (binary variables)
    label_encoding_columns = ['gender', 'Partner', 'Dependents', 'PaperlessBilling', 'PhoneService']

    # encode categorical binary features using label encoding
    for column in label_encoding_columns:
        if column == 'gender':
            df_telco_transformed[column] = df_telco_transformed[column].map({'Female': 1, 'Male': 0})
        else: 
            df_telco_transformed[column] = df_telco_transformed[column].map({'Yes': 1, 'No': 0}) 

    # one-hot encoding (categorical variables with more than two levels)
    array_hot_encoded =  one_hot_encoder.transform(df_telco_transformed[one_hot_encoding_columns])
    data_hot_encoded = pd.DataFrame(array_hot_encoded, index=df_telco_transformed.index,columns=one_hot_encoder.get_feature_names_out())
    df_telco_transformed=df_telco_transformed.drop(columns=one_hot_encoding_columns)
    df_telco_transformed = pd.concat([df_telco_transformed,data_hot_encoded], axis=1)

    # min-max normalization (numeric variables)
    min_max_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']

    # scale numerical variables using the min and max computed outside this function
    for column in min_max_columns:
        # min-max scaling
        df_telco_transformed[column] = (df_telco_transformed[column] - min_column[column]) / (max_column[column] - min_column[column])

        
    # choose model
    model = models['LGBM']['model']
    
    # make predictions with the testing data
    predictions = model.predict_proba(df_telco_transformed)

    return predictions

# Upload the model in Giskard 🚀🚀🚀

## Initiate a project

In [None]:
from giskard.client.giskard_client import GiskardClient

url = "http://localhost:19000" #if Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
#url = "http://app.giskard.ai" # If you want to upload on giskard URL
token = "YOUR GENERATED TOKEN"
client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want. But "project_key" should be unique and in lower case
churn_analysis_wo_tfs = client.create_project("churn_analysis_without_transformers", "Telco Kaggle Churn Analysis", "Project to predict if a customer quits")

# If you've already created a project with the key "churn_analysis", use
#churn_analysis = client.get_project("churn_analysis")


In [None]:
# Declare the type of each column in the dataset (e.g. category, numeric, text)
# Redeclared here for the upload; note that SeniorCitizen, a 0/1 flag, is declared as a category
column_types = {'gender': "category",
                'SeniorCitizen': "category", 
                'Partner': "category", 
                'Dependents': "category", 
                'tenure': "numeric",
                'PhoneService': "category", 
                'MultipleLines': "category", 
                'InternetService': "category", 
                'OnlineSecurity': "category",
                'OnlineBackup': "category", 
                'DeviceProtection': "category", 
                'TechSupport': "category", 
                'StreamingTV': "category",
                'StreamingMovies': "category", 
                'Contract': "category", 
                'PaperlessBilling': "category", 
                'PaymentMethod': "category",
                'MonthlyCharges': "numeric", 
                'TotalCharges': "numeric", 
                'Churn': "category"}

# feature_types is used to declare the features the model is trained on
feature_types = {i:column_types[i] for i in column_types if i!='Churn'}

## Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))

In [None]:
churn_analysis_wo_tfs.upload_model_and_df(
    prediction_function=wrapped_prediction_function, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=test_raw_data, # the dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
    target='Churn', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    classification_labels=["No", "Yes"],  # List of the classification labels, in the same order as the probability columns returned by prediction_function (here classes 0 and 1, i.e. "No" then "Yes")
    model_name='LGBM', # Name of the model
    dataset_name='test_data' # Name of the dataset
)
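
The order of `classification_labels` has to match the columns returned by `predict_proba`. Here is a minimal sketch, on hypothetical toy data with a stand-in `LogisticRegression` model, of how that order can be read from the fitted model's `classes_` attribute (the `label_of` mapping mirrors the `{'Yes': 1, 'No': 0}` encoding used in this notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: target encoded as in this notebook (0 = "No", 1 = "Yes").
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Column i of predict_proba corresponds to clf.classes_[i].
label_of = {0: "No", 1: "Yes"}
ordered_labels = [label_of[int(c)] for c in clf.classes_]
print(ordered_labels)  # ['No', 'Yes']
```

The same check can be run on the chosen model, for example by inspecting `models['LGBM']['model'].classes_` after fitting.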

## Upload more datasets

In [None]:
churn_analysis_wo_tfs.upload_df(
    df=train_raw_data, # The dataset you want to upload
    column_types=column_types, # All the column types of df
    target="Churn", # Do not pass this parameter if dataset doesn't contain target column
    name="train_data" # Name of the dataset
)