# Data Visualization Notebook

## Objectives

*   Answer business requirement 1: 
    * As a customer, I am interested in understanding the patterns in my customer base so that I can better manage churn levels.


## Inputs

* outputs/datasets/collection/TelcoCustomerChurn.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App


## Additional Comments | Insights | Conclusions




---

# Install Packages

In [None]:
! pip install pandas-profiling==2.11.0
! pip install plotly==4.14.0
! pip install feature-engine==1.0.2

# Restart the runtime (this ends the current Colab session process)
# Restarting is good practice after installing packages in a Colab session
import os
os.kill(os.getpid(), 9)

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: when you select an option (GPU, TPU or None), you switch among kernels/sessions

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is a non-empty string, you are using the GPU in this session
  * Typically the output will be /device:GPU:0


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Password: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [None]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# The logic: create a temporary file in the session and push it to the repo, then delete the file and push again
# If both pushes work, it is a sign that the session is connected to the repo.
import uuid
file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
with open(f"{file_name}.txt", "w") as file: file.write("text")
print("=== Testing Session Connectivity to the Repo === \n")
! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
print("\n\n")
os.remove(f"{file_name}.txt")
! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# clear your credentials (username, e-mail and password); RepoName is kept for later cells
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates there was a **failure in the authentication**, please re-enter your credentials.
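
The cell above blanks the credential variables by assigning empty strings, which leaves the keys in `os.environ`. A minimal sketch of removing them entirely, assuming the same variable names and using placeholder values (all values below are hypothetical):

```python
import os

# Hypothetical placeholder values, standing in for what getpass collected.
os.environ["UserName"] = "example-user"
os.environ["UserEmail"] = "user@example.com"
os.environ["UserPwd"] = "not-a-real-secret"
os.environ["RepoName"] = "example-repo"

# Remove the sensitive keys entirely; RepoName is kept because later cells still read it.
for key in ("UserName", "UserEmail", "UserPwd"):
    os.environ.pop(key, None)

# Show which of the four keys are still present in the environment.
print([key for key in ("UserName", "UserEmail", "UserPwd", "RepoName") if key in os.environ])
```

Separately, the remote URL added above embeds the password, and Git keeps that URL in the cloned repo's `.git/config` until the remote or the clone is removed.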

---

### **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "added-cleaned-data"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main


---

### **Delete** Cloned Repo from current Session

In [None]:
%cd /content
!rm -rf {os.environ['RepoName']}

print(f"\n * Please refresh session folder to validate that {os.environ['RepoName']} folder was removed from this session.")
print(f"\n\n* Current session directory is:  {os.getcwd()}")

---

# Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/TelcoCustomerChurn.csv").drop(['customerID'], axis=1)
df.info()

# Pandas profiling

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation Study

For this analysis, we transform the categorical variables with the One Hot Encoding technique, so we can see which levels are most associated with Churn

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)
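
To illustrate what the encoding step produces, here is a small sketch on toy data. It swaps in `pandas.get_dummies` for `feature_engine`'s `OneHotEncoder`, and the rows are hypothetical rather than taken from the Telco dataset.

```python
import pandas as pd

# Toy data with hypothetical values; the real notebook uses the Telco dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "tenure": [1, 30, 5, 60],
    "Churn": [1, 0, 1, 0],
})

# pandas.get_dummies is swapped in here for feature_engine's OneHotEncoder.
# Only non-numeric columns are expanded; numeric columns pass through unchanged.
categorical_cols = toy.select_dtypes(exclude="number").columns.to_list()
toy_ohe = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

print(toy_ohe.columns.to_list())
```

Each level of `Contract` becomes its own 0/1 column, which is what allows a correlation to be computed per level instead of per variable.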

We use `.corr()` with the `spearman` and `pearson` methods, and investigate the top 10 correlations
* This command returns a pandas Series. After sorting, the first item is the correlation between Churn and Churn, which is 1, so we exclude it with `[1:]`
* We sort by absolute value by setting `key=abs`

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['Churn'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman
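
The `[1:]` slice relies on Churn landing in the first position after sorting. A sketch of an alternative that drops the target by label instead, shown on hypothetical correlation values (the numbers below are made up for illustration):

```python
import pandas as pd

# Hypothetical correlation values, standing in for df_ohe.corr()['Churn'].
corr = pd.Series({
    "Churn": 1.0,
    "Contract_Month-to-month": 0.41,
    "tenure": -0.37,
    "gender_Male": -0.01,
})

# Drop the target by label, then rank by absolute value
# (the `key` argument of sort_values requires pandas >= 1.1).
top = corr.drop("Churn").sort_values(key=abs, ascending=False).head(2)
print(top)
```

Dropping by label does not depend on sort order, so it still works if another column happens to correlate perfectly with the target.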

We do the same for `pearson`

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['Churn'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
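
As a reminder of why both methods are computed: Pearson measures linear association, while Spearman measures monotonic association through ranks. A minimal sketch on hypothetical data where the relationship is monotonic but not linear:

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Same DataFrame.corr() call as in the notebook, once per method.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Here the ranks of `x` and `y` agree perfectly, so Spearman reports a stronger association than Pearson does.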

For both methods, we notice weak or moderate levels of correlation between Churn and a given variable. 
* Ideally, we would look for strong correlation levels. However, this is not always possible

We will consider the top 5 correlation levels in `df_ohe` and will study the associated variables in `df`

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

Therefore, we will study the following variables in `df`. We will investigate if:
* A churned customer typically has a month-to-month contract
* A churned customer typically has fiber optic
* A churned customer typically doesn't have tech support
* A churned customer typically doesn't have online security
* A churned customer typically has low tenure levels
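
The step from encoded column names back to variables in `df` can be sketched as below. The helper `source_variable` is hypothetical, the encoded names are an assumed example of what the top-n cell might return, and the sketch assumes encoded columns are named `<variable>_<level>`.

```python
def source_variable(encoded_name, original_columns):
    """Return the original column an encoded column was derived from."""
    # Numeric columns such as `tenure` are not renamed by the encoder.
    if encoded_name in original_columns:
        return encoded_name
    # Encoded columns are assumed to be named "<variable>_<level>".
    matches = [col for col in original_columns if encoded_name.startswith(col + "_")]
    # Prefer the longest match in case one column name is a prefix of another.
    return max(matches, key=len) if matches else None


original_columns = ["Contract", "InternetService", "OnlineSecurity", "TechSupport", "tenure"]

# Hypothetical output of the top-n cell above.
top_encoded = [
    "Contract_Month-to-month",
    "InternetService_Fiber optic",
    "OnlineSecurity_No",
    "TechSupport_No",
    "tenure",
]

vars_found = sorted({source_variable(name, original_columns) for name in top_encoded})
print(vars_found)
```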

In [None]:
vars_to_study = ['Contract', 'InternetService', 'OnlineSecurity', 'TechSupport', 'tenure']
vars_to_study

# EDA on selected variables

We create a separate DataFrame with `vars_to_study` and `Churn`

In [None]:
df_eda = df.filter(vars_to_study + ['Churn'])
df_eda

We plot the distribution (numerical and categorical) colored by Churn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


def plot_categorical(df, col, target_var):
  # Count plot of each level, ordered by frequency and split by the target
  plt.figure(figsize=(12, 5))
  sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
  plt.xticks(rotation=90)
  plt.title(f"{col}", fontsize=20, y=1.05)
  plt.show()


def plot_numerical(df, col, target_var):
  # Histogram with a KDE overlay, split by the target
  plt.figure(figsize=(8, 5))
  sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
  plt.title(f"{col}", fontsize=20, y=1.05)
  plt.show()



target_var = 'Churn'
for col in df_eda.drop([target_var], axis=1).columns.to_list():
  if df_eda[col].dtype == 'object':
    plot_categorical(df_eda, col, target_var)
    print("\n\n")
  else:
    plot_numerical(df_eda, col, target_var)
    print("\n\n")



---

Create a separate DataFrame and transform `tenure` (numerical) into bins (categorical) to visualize it in a `parallel_categories()` plot


In [None]:
from feature_engine.discretisation import ArbitraryDiscretiser
import numpy as np
tenure_map = [-np.inf, 6, 12, 18, 24, np.inf]
disc = ArbitraryDiscretiser(binning_dict={'tenure': tenure_map},
                            return_object=False,
                            return_boundaries=False)
df_parallel = disc.fit_transform(df_eda)
df_parallel.head()

Create a map to replace the `tenure` variable with more informative levels

In [None]:
n_classes = len(tenure_map) - 1
classes_ranges = disc.binner_dict_['tenure'][1:-1]

LabelsMap = {}
for n in range(0,n_classes):
  if n == 0:
    LabelsMap[n] = f"<{classes_ranges[0]}"
  elif n == n_classes-1:
    LabelsMap[n] = f"+{classes_ranges[-1]}"
  else:
    LabelsMap[n] = f"{classes_ranges[n-1]} to {classes_ranges[n]}"

LabelsMap

Replace using `.replace()`

In [None]:
df_parallel['tenure'] = df_parallel['tenure'].replace(LabelsMap)
df_parallel.head()
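
For comparison, the discretise-then-relabel steps above can be approximated in one call with `pandas.cut`. This is a sketch that swaps in `pd.cut` for `ArbitraryDiscretiser`; the tenure values are hypothetical, and whether edge handling matches the discretiser exactly is not verified here.

```python
import numpy as np
import pandas as pd

# Hypothetical tenure values in months.
tenure = pd.Series([1, 6, 7, 24, 72], name="tenure")

bins = [-np.inf, 6, 12, 18, 24, np.inf]
labels = ["<6", "6 to 12", "12 to 18", "18 to 24", "+24"]

# pd.cut uses right-closed intervals by default,
# so a tenure of exactly 6 falls in the first bin.
tenure_binned = pd.cut(tenure, bins=bins, labels=labels)
print(tenure_binned.to_list())
```

Note that with right-closed intervals the label `<6` also covers a tenure of exactly 6 months.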

Create a multi-dimensional categorical data plot

In [None]:
import plotly.express as px
fig = px.parallel_categories(df_parallel, color="Churn")
fig.show()

---

# Conclusions and Next steps

The correlation results and the interpretation of the plots above converge. 
* The insights above will be used as a reference for additional investigations, such as: why are churn levels high among fiber optic customers?
* For the present project, they answer business requirement 1.

Find below how the insights can be used when approaching a prospect that might churn:
* If a prospect looks likely to churn and is not showing openness to our offers, we will offer free tech support and online security for 18 months.
* We will offer a 15% discount for a year when the prospect switches from a month-to-month plan to a yearly plan. 
* We will give a 5% discount when the prospect switches to an automated payment method.