![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

# Telco customer churn data

This notebook is based on:
- a [Medium article](https://towardsdatascience.com/end-to-end-machine-learning-project-telco-customer-churn-90744a8df97d)
- a [Kaggle dataset](https://www.kaggle.com/datasets/blastchar/telco-customer-churn)
- the [Giskard credit scoring example](https://github.com/Giskard-AI/examples/blob/main/Credit%20scoring%20classification%20model.ipynb) (mainly the cells that import models into Giskard for inspection)

We will explore how to predict customer churn, a critical task for telecommunication companies that want to retain customers effectively.

We will follow the same steps as the notebook presented in the [Medium article](https://towardsdatascience.com/end-to-end-machine-learning-project-telco-customer-churn-90744a8df97d), adding the functions needed to inspect the model in Giskard where required.

## 1. Data Reading

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

In [2]:
# import telecom dataset into a pandas data frame
df_telco = pd.read_csv('datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv')

# check unique values of each column
#for column in df_telco.columns:
#    print('Column: {} - Unique Values: {}'.format(column, df_telco[column].unique()))

# summary of the data frame
#df_telco.info()

# transform the column TotalCharges into a numeric data type
df_telco['TotalCharges'] = pd.to_numeric(df_telco['TotalCharges'], errors='coerce')

# drop observations with null values
df_telco.dropna(inplace=True)

# drop the customerID column from the dataset
df_telco.drop(columns='customerID', inplace=True)

# remove (automatic) from payment method names
df_telco['PaymentMethod'] = df_telco['PaymentMethod'].str.replace(' (automatic)', '', regex=False)
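
The cell above converts `TotalCharges` with `errors='coerce'` and then drops null rows. As a minimal sketch of why these two steps go together (the toy values below are made up and are not taken from the dataset), entries that cannot be parsed as numbers become `NaN`, and `dropna` then removes them:

```python
import pandas as pd

# Toy column mimicking a numeric field that was read as text,
# with one whitespace-only entry (illustrative values only).
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Rows holding NaN can then be dropped, as the notebook does with dropna().
cleaned = converted.dropna()

print(converted.isna().sum())  # number of entries that could not be parsed
print(cleaned.tolist())
```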

## 2. Feature Engineering (manual, without sklearn transformers)

The next cell is taken as is from the [Medium article](https://towardsdatascience.com/end-to-end-machine-learning-project-telco-customer-churn-90744a8df97d), where the data transformations are written manually. In this notebook, we wrap these transformations inside a `predict` function instead of rewriting them with sklearn's predefined transformers.

In [3]:
df_telco_transformed = df_telco.copy()

# label encoding (binary variables)
label_encoding_columns = ['gender', 'Partner', 'Dependents', 'PaperlessBilling', 'PhoneService', 'Churn']

# encode categorical binary features using label encoding
for column in label_encoding_columns:
    if column == 'gender':
        df_telco_transformed[column] = df_telco_transformed[column].map({'Female': 1, 'Male': 0})
    else: 
        df_telco_transformed[column] = df_telco_transformed[column].map({'Yes': 1, 'No': 0}) 
        
# one-hot encoding (categorical variables with more than two levels)
one_hot_encoding_columns = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
                            'TechSupport', 'StreamingTV',  'StreamingMovies', 'Contract', 'PaymentMethod']

# encode categorical variables with more than two levels using one-hot encoding
df_telco_transformed = pd.get_dummies(df_telco_transformed, columns = one_hot_encoding_columns)

# min-max normalization (numeric variables)
min_max_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']

# minimum value of the column
min_column={} 
# maximum value of the column
max_column={}

# scale numerical variables using min max scaler
for column in min_max_columns:
    # minimum value of the column
    min_column[column] = df_telco_transformed[column].min()
    # maximum value of the column
    max_column[column] = df_telco_transformed[column].max()
    # min max scaler
    df_telco_transformed[column] = (df_telco_transformed[column] - min_column[column]) / (max_column[column] - min_column[column])


## 3. Data splitting

In [4]:
# select independent variables
X = df_telco_transformed.drop(columns='Churn')

# select dependent variables
y = df_telco_transformed.loc[:, 'Churn']


# split the data in training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40, shuffle=True)

## 4. Assessing multiple models

In [5]:
def create_models(seed=2):
    '''
    Create a list of machine learning models.
            Parameters:
                    seed (integer): random seed of the models
            Returns:
                    models (list): list containing the models
    '''

    models = []
    models.append(('dummy_classifier', DummyClassifier(random_state=seed, strategy='most_frequent')))
    models.append(('k_nearest_neighbors', KNeighborsClassifier()))
    #models.append(('logistic_regression', LogisticRegression(random_state=seed)))
    models.append(('support_vector_machines', SVC(random_state=seed)))
    models.append(('random_forest', RandomForestClassifier(random_state=seed)))
    models.append(('gradient_boosting', GradientBoostingClassifier(random_state=seed)))
    
    return models

# create a list with all the algorithms we are going to assess
models = create_models()



# test the accuracy of each model using default hyperparameters
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # fit the model with the training data
    model.fit(X_train, y_train)
    # make predictions with the testing data
    predictions = model.predict(X_test)
    # calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    # append the model name and the accuracy to the lists
    results.append(accuracy)
    names.append(name)
    # print classifier accuracy
    print('Classifier: {}, Accuracy: {}'.format(name, accuracy))

Classifier: dummy_classifier, Accuracy: 0.745164960182025
Classifier: k_nearest_neighbors, Accuracy: 0.7531285551763367
Classifier: support_vector_machines, Accuracy: 0.7878270762229806
Classifier: random_forest, Accuracy: 0.7713310580204779
Classifier: gradient_boosting, Accuracy: 0.7963594994311718
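
The `dummy_classifier` score is a useful reference point: with `strategy='most_frequent'` it always predicts the majority class seen during training, so its accuracy is simply the share of that class in the evaluated data. The sketch below illustrates this on made-up toy labels (not the Telco data); the other accuracies in the list are best read relative to this baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: 3 of the 4 samples belong to the majority class 0.
y_toy = np.array([0, 0, 0, 1])
X_toy = np.zeros((4, 1))  # the features are ignored by DummyClassifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))

# Share of the majority class, computed directly from the labels.
majority_share = np.bincount(y_toy).max() / len(y_toy)

print(baseline_accuracy, majority_share)
```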


## 5. Let's build our `predict` function 

Here we pick `random_forest`, but feel free to use any of the models above.

**Important note: `min_column[column]` and `max_column[column]` are defined outside the `predict` function. This matters because you should never `fit` a transformer inside `predict`, even a hand-written one: `predict` may receive only a subset of the dataset, whereas the transformer statistics must come from the full data they were originally fitted on.**
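
To make the note concrete, here is a minimal sketch on toy values (the numbers and the helper `scale_tenure` are hypothetical, introduced only for illustration). Scaling a single row with stored statistics works, while recomputing the minimum and maximum on that row alone yields `0 / 0`:

```python
import pandas as pd

# "Fit" step: statistics are computed once, on the full data.
train = pd.DataFrame({"tenure": [1, 12, 72]})
tenure_min = train["tenure"].min()
tenure_max = train["tenure"].max()

def scale_tenure(df):
    """Hypothetical helper: min-max scale using the stored statistics."""
    return (df["tenure"] - tenure_min) / (tenure_max - tenure_min)

# "Predict" step: the input may be a single row.
single_row = pd.DataFrame({"tenure": [12]})
scaled = scale_tenure(single_row)

# Re-fitting on the single row gives min == max, hence 0 / 0 -> NaN.
row_min = single_row["tenure"].min()
row_max = single_row["tenure"].max()
naive = (single_row["tenure"] - row_min) / (row_max - row_min)

print(scaled.iloc[0], naive.iloc[0])
```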

In [6]:
def predict(test_dataset):
    df_telco_transformed = test_dataset.copy()

    # label encoding (binary variables)
    label_encoding_columns = ['gender', 'Partner', 'Dependents', 'PaperlessBilling', 'PhoneService']

    # encode categorical binary features using label encoding
    for column in label_encoding_columns:
        if column == 'gender':
            df_telco_transformed[column] = df_telco_transformed[column].map({'Female': 1, 'Male': 0})
        else:
            df_telco_transformed[column] = df_telco_transformed[column].map({'Yes': 1, 'No': 0})

    # one-hot encoding (categorical variables with more than two levels)
    one_hot_encoding_columns = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                                'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']

    # encode categorical variables with more than two levels using one-hot encoding
    df_telco_transformed = pd.get_dummies(df_telco_transformed, columns=one_hot_encoding_columns)

    # min-max normalization (numeric variables), using the statistics computed in section 2
    min_max_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']
    for column in min_max_columns:
        df_telco_transformed[column] = (df_telco_transformed[column] - min_column[column]) / (max_column[column] - min_column[column])

    # choose the model by name (random forest) rather than by its position in the list
    model = dict(models)['random_forest']

    # align the columns with those seen during training: a subset of rows may lack
    # some categories, in which case get_dummies creates fewer columns
    # (feature_names_in_ requires scikit-learn >= 1.0 and a model fitted on a DataFrame)
    df_telco_transformed = df_telco_transformed.reindex(columns=model.feature_names_in_, fill_value=0)

    # return class probabilities, one column per class in the order of model.classes_
    predictions = model.predict_proba(df_telco_transformed)

    return predictions
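
A caveat worth keeping in mind when one-hot encoding inside `predict`: `pd.get_dummies` only creates columns for the categories present in the data it receives, so a small batch or a single row can produce fewer columns than the model was trained on. The sketch below, on toy data with a hypothetical `train_columns` variable, shows the mismatch and one possible remedy with `DataFrame.reindex`:

```python
import pandas as pd

# Columns produced when all categories are present (e.g. at training time).
train = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})
train_columns = pd.get_dummies(train, columns=["Contract"]).columns

# A single-row batch only contains one category...
batch = pd.DataFrame({"Contract": ["One year"]})
encoded = pd.get_dummies(batch, columns=["Contract"])

# ...so the missing dummy columns are added back, filled with 0,
# and the column order is forced to match the training layout.
aligned = encoded.reindex(columns=train_columns, fill_value=0)

print(list(encoded.columns))
print(list(aligned.columns))
```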

In [7]:
# select independent variables from the raw (untransformed) data
X = df_telco.drop(columns='Churn')

# select the dependent variable
y = df_telco.loc[:, 'Churn']

# split the raw data with the same parameters as in section 3, so that the
# test rows are the same ones the models were evaluated on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40, shuffle=True)

# prepare the data to upload to Giskard
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)
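
Splitting the raw data again only makes sense if it selects the same rows as the earlier split of the transformed data. As a sketch on toy frames (illustrative values, not the Telco data), `train_test_split` with identical `test_size`, `random_state` and number of rows picks the same row positions regardless of the column contents:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy "raw" frame and a "transformed" copy with the same rows and index.
raw = pd.DataFrame({"tenure": range(20), "Churn": ["Yes", "No"] * 10})
transformed = raw.assign(Churn=raw["Churn"].map({"Yes": 1, "No": 0}))

# Same parameters for both splits.
_, raw_test = train_test_split(raw, test_size=0.25, random_state=40, shuffle=True)
_, transformed_test = train_test_split(transformed, test_size=0.25, random_state=40, shuffle=True)

# The two test sets contain the same rows, in the same order.
same_rows = raw_test.index.equals(transformed_test.index)
print(same_rows)
```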

# Upload the model to Giskard 🚀🚀🚀

### Install the Giskard library



In [8]:
!pip install giskard==1.7.0a2
!giskard worker start -h 194.163.172.30 -d

### Initiate a project

In [10]:
from giskard.client.giskard_client import GiskardClient

url = "http://localhost:9000" #if Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
#url = "http://app.giskard.ai" # If you want to upload on giskard URL
token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsInRva2VuX3R5cGUiOiJBUEkiLCJhdXRoIjoiUk9MRV9BRE1JTiIsImV4cCI6MTY3Mjg0NjAxMX0.oiSJHiLotoyQeFrxv7cf0uGGTpZGIXEWP9kwpTwCxTk"
client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want, but "project_key" must be unique and in lower case
churn_analysis_wo_tfs = client.create_project("churn_analysis_without_transformers", "Telco Kaggle Churn Analysis", "Project to predict whether a customer will churn")

# If you've already created a project with the key "churn_analysis_without_transformers", use
#churn_analysis_wo_tfs = client.get_project("churn_analysis_without_transformers")



In [None]:
# Declare the type of each column in the dataset (for example: category, numeric, text)
column_types = {'gender': "category",
                'SeniorCitizen': "numeric", 
                'Partner': "category", 
                'Dependents': "category", 
                'tenure': "numeric",
                'PhoneService': "category", 
                'MultipleLines': "category", 
                'InternetService': "category", 
                'OnlineSecurity': "category",
                'OnlineBackup': "category", 
                'DeviceProtection': "category", 
                'TechSupport': "category", 
                'StreamingTV': "category",
                'StreamingMovies': "category", 
                'Contract': "category", 
                'PaperlessBilling': "category", 
                'PaymentMethod': "category",
                'MonthlyCharges': "numeric", 
                'TotalCharges': "numeric", 
                'Churn': "category"}

# feature_types is used to declare the features the model is trained on
feature_types = {i:column_types[i] for i in column_types if i!='Churn'}
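
Because `column_types` is written by hand, it can drift from the actual DataFrame. The following is a small sanity check one might run before uploading; `check_column_types` is a hypothetical helper written for this sketch and is not part of Giskard, and the toy frame is illustrative only:

```python
import pandas as pd

def check_column_types(df, column_types):
    """Hypothetical helper: compare a column_types mapping with a DataFrame."""
    # Columns present in the data but not declared in the mapping.
    undeclared = sorted(set(df.columns) - set(column_types))
    # Columns declared in the mapping but absent from the data.
    absent = sorted(set(column_types) - set(df.columns))
    # Columns declared "numeric" whose dtype is not numeric.
    non_numeric = sorted(
        col for col, kind in column_types.items()
        if kind == "numeric" and col in df.columns
        and not pd.api.types.is_numeric_dtype(df[col])
    )
    return undeclared, absent, non_numeric

toy = pd.DataFrame({"tenure": [1, 34],
                    "TotalCharges": ["29.85", "1889.5"],  # still text
                    "Churn": ["No", "Yes"]})
toy_types = {"tenure": "numeric", "TotalCharges": "numeric", "gender": "category"}

undeclared, absent, non_numeric = check_column_types(toy, toy_types)
print(undeclared, absent, non_numeric)
```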

In [None]:
predict(test_data[list(feature_types.keys())])

### Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))

In [None]:
churn_analysis_wo_tfs.upload_model_and_df(
    prediction_function=predict, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=test_data, # the dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
    target='Churn', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    classification_labels=["No", "Yes"], # Labels in the same order as the probability columns returned by predict: the model was trained with No -> 0 and Yes -> 1, and predict_proba follows model.classes_ ([0, 1])
    model_name='random_forest', # Name of the model
    dataset_name='test_data' # Name of the dataset
)
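
The order of `classification_labels` deserves a check, assuming Giskard expects the labels in the same order as the probability columns returned by `predict`. With scikit-learn, the columns of `predict_proba` follow `model.classes_`, which is sorted. The sketch below uses a toy model and a hypothetical `code_to_label` mapping that mirrors this notebook's encoding (`No` -> 0, `Yes` -> 1) to derive the label order:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data with the same target encoding as this notebook: No -> 0, Yes -> 1.
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([0, 0, 1, 1])

clf = RandomForestClassifier(n_estimators=10, random_state=2).fit(X_toy, y_toy)

# Hypothetical mapping from the encoded target back to the original labels.
code_to_label = {0: "No", 1: "Yes"}

# predict_proba returns one column per entry of clf.classes_, in that order.
labels_in_proba_order = [code_to_label[code] for code in clf.classes_]
proba = clf.predict_proba(X_toy)

print(labels_in_proba_order, proba.shape)
```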