# Project description

Interconnect's telecommunications operator would like to forecast its customer cancellation rate. If it is discovered that a user or user plans to leave, promotional codes and special plans options will be offered. The Interconnect Marketing team has compiled some of its customer's personal data, including information about their plans and contracts.

### Interconnect Services

Interconnect mainly provides two types of services:
1. Communication by landline. The phone can be connected to several lines simultaneously.
2. Internet. The network can be configured through a telephone line (DSL, *digital subscriber *) or through an optical fiber cable.

Some other services offered by the company include:

-Internet security: Antivirus software (*Dispositive protection*) and a malicious website blocker (*safety clan*).
-A technical support line (*Support Technician*).
-File storage in the cloud and backup data (*backuponline*).
-TV streaming (*streamingtv*) and films directory (*streaming of*)

The clientele can choose between a monthly payment or sign a contract of 1 or 2 years. You can use several payment methods and receive an electronic invoice after a transaction.

### Data Description

The data consists of files obtained from different sources:

-`contract.csv` -Contract information;
-`Personal.CSV` -customer personal data;
-`Internet.CSV` -Information on Internet services;
-`Phone.CSV` -Information about telephone services.

In each file, the Customerid` (Customer ID) column contains a unique code assigned to each client. The contract information is valid from February 1, 2020.

### Criteria to consider

Objective characteristic: the `'enddate'` column is equal to`' no'`.

Main metric: AUC-ROC.
Additional metric: ExactD.

Evaluación's Cristarios:

-AUC-ROC <0.75-0 SP
-0.75 ≤ AUC-ROC <0.81-4 SP
-0.81 ≤ AUC-ROC <0.85-4.5 SP
-0.85 ≤ AUC-ROC <0.87-5 SP
-0.87 ≤ AUC-ROC <0.88-5.5 SP
-AUC-ROC ≥ 0.88-6 SP

## Work plan for the interconnect customer cancellation prediction project

### 1. Data initial collection and exploration (EDA)

***Data load **: Load the CSV files (`contract. and pandas).
***Initial inspection **:*
    *Visualize the first and last rows of each dataframe (`df.head ()`, `df.tail ()`).
*Obtain information on the number of rows and columns, and data types (`df.info ()`).
    *Identify missing values ​​(`df.isnull (). Sum ()`) and decide how to treat them (imputation or elimination).
***Data Union **: Unite the dataraframes in one using `customerid` as key (` pd.merge () `).
***Exploratory analysis **:*
    ***OBJECTIVE VARIABLE **: Create "Churn" based on `enddate` (0: inactive user, 1: active user).
***"Churn" distribution **: Analyze the balance of classes.
    ***Bivariate analysis **: Analyze the relationship between each variable and "churn".
    ***Visualizations **: Histograms, bar graphs, dispersion diagrams, etc.

### 2. Data preprocessing

***Characteristics Engineering **:*
    *Create new variables (duration of the contract in months, number of services, etc.).
    *Consider interactions between variables (Internet type and contract type).
***Data transformation **:*
    *; Learning
***DISBALANCE MANAGEMENT **:*
*Given the imbalance identified in the target variable, overmone techniques (such as Smote or Randomoversampler) will be applied to generate new minority class samples ("yes") and balance the training set.

### 3. Modeling and evaluation
***Selection of models **: Consider several classification models that are suitable for the cancellation prediction problem, such as logistics regression, decision trees, Random Forest, Gradient Boosting (XGBoost, Lightgbm), neural networks, etc.
***Data Division **: Divide the data set into three parts:*
    *  Training.
    *Validation.
    *  Proof.
***Training and adjustment **: Train the models selected in the training set and adjust their hyperparameters using techniques such as cross validation and grid search (Gridsearchcv).
***Evaluation metrics **:*
    ***AUC-ROC **: It will be used as the main metric to evaluate the performance of the models, since it is a robust metric for classification problems with classes.
***Accuracy (ACCURACY) **: The accuracy of the training set will be evaluated versus the test to detect on adjustment.
    ***F1-SCORE **: This metric is useful for evaluating the balance between precision and recall, especially in problems with classes.
***Model selection **: Select the model that obtains the best performance in the validation set, considering both the AUC-ROC and the accuracy and the F1-Score.

## Import libreries

In [2]:
import math
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
import seaborn as sns
import torch
import transformers
import logging
import re
from tqdm.auto import tqdm
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import stopwords as nltk_stopwords
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
import random
import nltk
from sklearn.metrics import roc_curve, auc
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import roc_auc_score
import statistics
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from tensorflow import keras
from tensorflow.keras.models import Sequential
from scikeras.wrappers import KerasRegressor
from tensorflow.keras.layers import Dense, Dropout, Input, ELU, LeakyReLU
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
import warnings
from sklearn.exceptions import ConvergenceWarning  # Importar ConvergenceWarning

2025-02-11 08:42:50.730525: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load Data

In [3]:
# Read the CSV data into a Pandas DataFrame

contract = pd.read_csv('final_provider/contract.csv')
personal = pd.read_csv('final_provider/personal.csv')
internet = pd.read_csv('final_provider/internet.csv')
phone = pd.read_csv('final_provider/phone.csv')

## Analyze and correct Data

### For Contract

In [4]:
#Show the first 5 rows of the DataFrame
contract.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65


In [5]:
#Analyze general information for Contrat DataFrame
contract.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB


We will need to perform some corrections to the data, like changing the data type of the date columns and the total charges. No null values shown.

In [6]:
#Correct data types
contract['BeginDate'] = pd.to_datetime(contract['BeginDate'],format='%Y-%m-%d')

# Try converting EndDate to datetime format (handle potential errors)
try:
    contract['EndDate'] = pd.to_datetime(contract['EndDate'], format='%Y-%m-%d %H:%M:%S')
except:
    pass  # Do nothing if conversion fails (likely due to no End Dates)

#Convert TotalCharges to numeric

contract['TotalCharges'] = pd.to_numeric(contract['TotalCharges'], errors='coerce')

# Display information about the DataFrame 'contract'
contract.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   customerID        7043 non-null   object        
 1   BeginDate         7043 non-null   datetime64[ns]
 2   EndDate           7043 non-null   object        
 3   Type              7043 non-null   object        
 4   PaperlessBilling  7043 non-null   object        
 5   PaymentMethod     7043 non-null   object        
 6   MonthlyCharges    7043 non-null   float64       
 7   TotalCharges      7032 non-null   float64       
dtypes: datetime64[ns](1), float64(2), object(5)
memory usage: 440.3+ KB


In [7]:
#generate descriptive statistics for the numerical columns in the DataFrame 
contract.describe()

Unnamed: 0,BeginDate,MonthlyCharges,TotalCharges
count,7043,7043.0,7032.0
mean,2017-04-30 13:01:50.918642688,64.761692,2283.300441
min,2013-10-01 00:00:00,18.25,18.8
25%,2015-06-01 00:00:00,35.5,401.45
50%,2017-09-01 00:00:00,70.35,1397.475
75%,2019-04-01 00:00:00,89.85,3794.7375
max,2020-02-01 00:00:00,118.75,8684.8
std,,30.090047,2266.771362


In [8]:
def assign_charges(row):
    start_year = row['BeginDate'].year
    #If enddate is 'no', assign the position until the current year
    if row['EndDate'] != 'No':
        end_date = pd.to_datetime(row['EndDate'], errors='coerce')  #Convise Datetime
        end_year = end_date.year if pd.notna(end_date) else pd.Timestamp.today().year
    else:
        end_year = pd.Timestamp.today().year 
    
    total_charges = row['TotalCharges']
    years = range(start_year, end_year + 1)
    
    #Calculate the annual position depending on the type
    if row['Type'] == 'month-month':
        annual_charge = total_charges * 12
    else:
        annual_charge = total_charges
    
    #Create a dataframe over the years as index and total charts as a column
    df = pd.DataFrame(annual_charge, index=years, columns=['TotalCharges'])
    
    
    return df

#Apply the function to each row and concatenate the results
contract_expanded = contract.apply(assign_charges, axis=1)

#Concatenar Los Dataframes resulting
contract_expanded = pd.concat(contract_expanded.to_list())

#Reset the index and rename the columns
contract_expanded = contract_expanded.reset_index()
contract_expanded.columns = ['year', 'TotalCharges']

    
#Group for 'Year' and add the values ​​of 'Total Charges'
contract_expanded_grouped = contract_expanded.groupby('year')['TotalCharges'].sum().reset_index()

contract_expanded_grouped

Unnamed: 0,year,TotalCharges
0,2013,57107.05
1,2014,7058583.95
2,2015,10390349.7
3,2016,12586916.5
4,2017,14296341.25
5,2018,15460304.1
6,2019,16047335.9
7,2020,13877563.15
8,2021,13193241.8
9,2022,13193241.8


In [None]:
#Create the bar chart
plt.figure(figsize=(10, 6))
contract_expanded_grouped.plot(kind='bar',x='year', y='TotalCharges')
plt.gca().yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{int(x):,}'))
plt.xlabel('Year')
plt.ylabel('Total Charges')
plt.title('Total Charges per year')
plt.show()

In [None]:
#Create a histogram
plt.figure(figsize=(10, 6))
contract_expanded_grouped['TotalCharges'].plot(kind='hist')
plt.title('Histogram for Total Charges')

### For personal

In [11]:
#Show the first 5 rows of the DataFrame
personal.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


In [12]:
#Analyze general information for Contrat DataFrame
personal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB


There is no need to modiy the data

### For internet

In [13]:
#Show the first 5 rows of the DataFrame
internet.head()

Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Fiber optic,No,No,No,No,No,No


In [14]:
#Analyze general information for Internet DataFrame
internet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB


In [15]:
#generate descriptive statistics for the numerical columns in the DataFrame 
internet.describe()

Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
count,5517,5517,5517,5517,5517,5517,5517,5517
unique,5517,2,2,2,2,2,2,2
top,7590-VHVEG,Fiber optic,No,No,No,No,No,No
freq,1,3096,3498,3088,3095,3473,2810,2785


There is no need to modiy the data

### For phone

In [16]:
#Show the first 5 rows of the DataFrame
phone.head()

Unnamed: 0,customerID,MultipleLines
0,5575-GNVDE,No
1,3668-QPYBK,No
2,9237-HQITU,No
3,9305-CDSKC,Yes
4,1452-KIOVK,Yes


In [17]:
#Analyze general information for Phone DataFrame
phone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB


In [18]:
#generate descriptive statistics for the numerical columns in the DataFrame 
internet.describe()

Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
count,5517,5517,5517,5517,5517,5517,5517,5517
unique,5517,2,2,2,2,2,2,2
top,7590-VHVEG,Fiber optic,No,No,No,No,No,No
freq,1,3096,3498,3088,3095,3473,2810,2785


There is no need to modiy the data

## Unify Datasets

In [19]:
#Unify the 3 Datasets with 'merge'
df = contract.merge(personal, on= 'customerID')
df = df.merge(internet, on = 'customerID')
df = df.merge(phone,on = 'customerID')
df.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines
0,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
1,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
2,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No
3,9305-CDSKC,2019-03-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,99.65,820.5,Female,0,No,No,Fiber optic,No,No,Yes,No,Yes,Yes,Yes
4,1452-KIOVK,2018-04-01,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,Male,0,No,Yes,Fiber optic,No,Yes,No,No,Yes,No,Yes


In [20]:
#Check new DataFrame info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4835 entries, 0 to 4834
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   customerID        4835 non-null   object        
 1   BeginDate         4835 non-null   datetime64[ns]
 2   EndDate           4835 non-null   object        
 3   Type              4835 non-null   object        
 4   PaperlessBilling  4835 non-null   object        
 5   PaymentMethod     4835 non-null   object        
 6   MonthlyCharges    4835 non-null   float64       
 7   TotalCharges      4832 non-null   float64       
 8   gender            4835 non-null   object        
 9   SeniorCitizen     4835 non-null   int64         
 10  Partner           4835 non-null   object        
 11  Dependents        4835 non-null   object        
 12  InternetService   4835 non-null   object        
 13  OnlineSecurity    4835 non-null   object        
 14  OnlineBackup      4835 n

In [21]:
#Check for Nulls
df.isnull().sum()

customerID          0
BeginDate           0
EndDate             0
Type                0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        3
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
MultipleLines       0
dtype: int64

In [22]:
#Replace Nulls in TotalCharges field with mean
imputer = SimpleImputer(strategy='mean')
df['TotalCharges'] = imputer.fit_transform(df[['TotalCharges']])

In [23]:
def Contract_duration(row):
    """
    Calculate the anual charges and define if the user is active or not.

    Args:
        row: A pandas Series representing a single row of the DataFrame.

    Returns:
        A pandas Series representing a single row of the DataFrame
    """

    start_date = pd.to_datetime(row['BeginDate'])
    start = start_date.to_julian_date() 

    if row['EndDate'] != 'No':
        row['Active'] = 'No'
    else:
        row['Active'] = 'Yes'
    
    # Calculate annual charge based on contract type
    if row['Type'] == 'Month-to-month':
        row['annual_charge'] = row['TotalCharges'] * 12
    elif row['Type'] == 'Two years':
        row['annual_charge'] = row['TotalCharges'] / 2
    else: 
        row['annual_charge'] = row['TotalCharges']  # Assuming 'One year' for other cases

    return row

In [24]:
#Replace the 'Month-to-month' Type to 'Monthly' to shorten the word and have better visibility
df.loc[df['Type'].isin(['Month-to-month']), 'Type'] = 'Monthly' 

In [25]:
#Apply the function to calculate the Anual charge and determin if the user is active or not.
df = df.apply(Contract_duration, axis=1)
#Show the first rows
df.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,...,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines,Active,annual_charge
0,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,...,DSL,Yes,No,Yes,No,No,No,No,Yes,1889.5
1,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Monthly,Yes,Mailed check,53.85,108.15,Male,0,...,DSL,Yes,Yes,No,No,No,No,No,No,108.15
2,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Monthly,Yes,Electronic check,70.7,151.65,Female,0,...,Fiber optic,No,No,No,No,No,No,No,No,151.65
3,9305-CDSKC,2019-03-01,2019-11-01 00:00:00,Monthly,Yes,Electronic check,99.65,820.5,Female,0,...,Fiber optic,No,No,Yes,No,Yes,Yes,Yes,No,820.5
4,1452-KIOVK,2018-04-01,No,Monthly,Yes,Credit card (automatic),89.1,1949.4,Male,0,...,Fiber optic,No,Yes,No,No,Yes,No,Yes,Yes,1949.4


In [26]:
#Make a copy of the df in a new variable
df_unified = df

### Overview of the relation between data

In [28]:
#Create a dictionary for payment type
Payment_dict = {'Mailed check':1, 'Electronic check':2, 'Credit card (automatic)':3,
       'Bank transfer (automatic)':4}
df['PaymentMethod'] = df['PaymentMethod'].map(Payment_dict)

In [None]:
# Create a pairplot using seaborn
g = sns.pairplot(df, hue= 'Active', diag_kind="hist", x_vars=['PaymentMethod', 'gender', 'SeniorCitizen','Type'], y_vars=['annual_charge','gender','Partner', 'Dependents', 'PaperlessBilling']) 

# Set the figure size of the pairplot
g.fig.set_size_inches(12, 12)

In [None]:
# Create a pairplot using seaborn
h = sns.pairplot(df, hue= 'Active', diag_kind="hist", x_vars=['PaymentMethod', 'gender', 'SeniorCitizen','Type','annual_charge'], y_vars=['gender','InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',]) 

# Set the figure size of the pairplot
h.fig.set_size_inches(12, 12)

In [None]:
# Create a pairplot using seaborn
i = sns.pairplot(df, hue= 'Active', diag_kind="hist", x_vars=['PaymentMethod', 'gender', 'SeniorCitizen','Type','annual_charge'], y_vars=['TechSupport',
       'StreamingTV', 'StreamingMovies', 'MultipleLines',]) 

# Set the figure size of the pairplot
i.fig.set_size_inches(12, 12)

In [None]:
#Count observations by class
balance_de_clases = df['Active'].value_counts()

#Create the bar chart
plt.figure(figsize=(8, 6))
sns.countplot(x='Active', data=df) 
plt.title('Distribución de la variable objetivo: Activo/No Activo')
plt.xlabel('Cliente Activo')
plt.ylabel('Cantidad de clientes')

#Add data labels to each bar
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom')

plt.show()

We can conclude that there is an imbalance in the 50%classes, so it will be necessary to apply balance methods to training data

## Prepare data

In [33]:
activos = df[df['Active']==1]
inactivo = df[df['Active']==0]

In [35]:
# Aplicar sobremuestro para balancear las clases 
sobremuestreo = df.sample(n=290,replace=True,random_state=0)
sobremuestreo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 290 entries, 2732 to 945
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   customerID        290 non-null    object        
 1   BeginDate         290 non-null    datetime64[ns]
 2   EndDate           290 non-null    object        
 3   Type              290 non-null    object        
 4   PaperlessBilling  290 non-null    object        
 5   PaymentMethod     290 non-null    int64         
 6   MonthlyCharges    290 non-null    float64       
 7   TotalCharges      290 non-null    float64       
 8   gender            290 non-null    object        
 9   SeniorCitizen     290 non-null    int64         
 10  Partner           290 non-null    object        
 11  Dependents        290 non-null    object        
 12  InternetService   290 non-null    object        
 13  OnlineSecurity    290 non-null    object        
 14  OnlineBackup      290 non-nu

In [36]:
# use Dummies to convert the data into a format that is accepted by the models
Type_oh = pd.get_dummies(df['Type']) 
gender_oh = pd.get_dummies(df['gender']) 
InternetService_oh = pd.get_dummies(df['InternetService'])

In [None]:
# Add the Dumies to the DataFrame
df = pd.concat([df,Type_oh,gender_oh,InternetService_oh],axis=1)

In [38]:
#Replace all the Yes with No
df = df.replace({'Yes': 1, 'No': 0})

  df = df.replace({'Yes': 1, 'No': 0})


In [39]:
# Get rid of all the columns that are not accepted by the models or are irrelevant 
df = df.drop(['Type','gender','InternetService','EndDate','BeginDate','MonthlyCharges','TotalCharges'],axis=1)

In [40]:
#Set the index of the Dataframe since the customerID column is not accepted by some models ans still we dont want to lose the information
df = df.set_index('customerID')
df.head()

Unnamed: 0_level_0,PaperlessBilling,PaymentMethod,SeniorCitizen,Partner,Dependents,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,...,MultipleLines,Active,annual_charge,Monthly,One year,Two year,Female,Male,DSL,Fiber optic
customerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5575-GNVDE,0,1,0,0,0,1,0,1,0,0,...,0,1,1889.5,False,True,False,False,True,True,False
3668-QPYBK,1,1,0,0,0,1,1,0,0,0,...,0,0,108.15,True,False,False,False,True,True,False
9237-HQITU,1,2,0,0,0,0,0,0,0,0,...,0,0,151.65,True,False,False,True,False,False,True
9305-CDSKC,1,2,0,0,0,0,0,1,0,1,...,1,0,820.5,True,False,False,True,False,False,True
1452-KIOVK,1,3,0,0,1,0,1,0,0,1,...,1,1,1949.4,True,False,False,False,True,False,True


## Class balancing

In [41]:
#Define the features and target
X = df.drop(columns=['Active'], axis=1)
y = df['Active']

In [42]:
#Balance clases using RandomOverSampler
ros = RandomOverSampler(random_state=0)  #Random_state for reproducibility
X_resampled, y_resampled = ros.fit_resample(X, y) #Create a new data set where the minority class has the same number of samples as the majority class.
#It does it randomly duplicating existing instances of the minority class.



In [43]:
#Check the differences (After,Before)
print('Distribución de clases antes del sobremuestreo:', Counter(y))
print('Distribución de clases después del sobremuestreo:', Counter(y_resampled))

Distribución de clases antes del sobremuestreo: Counter({1: 3249, 0: 1586})
Distribución de clases después del sobremuestreo: Counter({1: 3249, 0: 3249})


In [44]:
def split_data(features_x,target_y):
    """
    Splits the input features and target into training, validation, and test sets.

    Args:
        features_x: The feature matrix (NumPy array or Pandas DataFrame).
        target_y: The target vector (NumPy array or Pandas Series).

    Returns:
        A tuple containing the features and target for the temporary, validation, and test sets, respectively.
    """

    features = features_x
    target = target_y

    # Split data into training and temporary sets
    features_t, features_temp, target_t, target_temp = train_test_split(
        features, target, test_size=0.3, random_state=12345, stratify=target # Stratify for class balance
    )

    # Split temporary set into validation and test sets
    features_v, features_test, target_v, target_test = train_test_split(
        features_temp, target_temp, test_size=0.5, random_state=12345, stratify=target_temp # Stratify here as well
    )

    return features_temp, features_v, features_test, target_temp, target_v, target_test

In [45]:
def Train_Predict(data, choosen_model, param_grid, cv):
    """
  Adjust a model using GridSearchcv to find the best hyperparameters.

    ARGS:
        Data: Dataframe with the data.
        Choosen_Model: the chosen classifier (for example, RandomforestclaSifier, DecisiSionclassifier, etc.).
        Param_Grid: Dictionary of the parameters to be tested with Gridsearchcv (specific for each model).
        CV: number of folds for cross validation.

    Returns:
    A tuple with the best parameters, the best model, and other metrics.
    """
    
    #Returns partitions.
    features_t_resampled, features_valid, features_te, target_t_resampled, target_valid, target_te = split_data(X_resampled,y_resampled)

    #Create the Gridsearchcv with the model and the parameters provided
    grid = GridSearchCV(estimator=choosen_model, param_grid=param_grid, cv=cv)

    #Adjust the model with training data
    grid.fit(features_t_resampled, target_t_resampled)

    #Get the best parameters and the best model
    best_params = grid.best_params_
    best_model = grid.best_estimator_
    best_score = grid.best_score_

    #Predicts with the best model
    train_pred = best_model.predict(features_t_resampled)
    predicted_valid = best_model.predict(features_valid)
    test_pred = best_model.predict(features_te)

    #Calculate metrics
    train_accuracy = accuracy_score(target_t_resampled, train_pred)
    valid_accuracy = accuracy_score(target_valid, predicted_valid)
    test_accuracy = accuracy_score(target_te, test_pred)
    fpr, tpr, thresholds = roc_curve(target_valid, predicted_valid)
    roc_auc = auc(fpr, tpr)
    f1 = f1_score(target_valid, predicted_valid)
    conf_matrix = confusion_matrix(target_valid, predicted_valid)

    return predicted_valid,best_params,best_model, roc_auc, fpr, tpr,f1,conf_matrix,train_accuracy,test_accuracy

In [46]:
# Function to plot the ROC curve
def graph(fpr, tpr, roc_auc):
    """
    Generates and displays the Receiver Operating Characteristic (ROC) curve.

    Args:
        fpr (array-like): False Positive Rate values.
        tpr (array-like): True Positive Rate values.
        roc_auc (float): Area Under the ROC curve.

    Returns:
        matplotlib.pyplot.show(): Displays the plot.
    """
    # Plot the ROC curve itself
    plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)  # Plots the actual curve using FPR and TPR.  The label includes the calculated AUC.
    
    # Plot the "random classifier" line (diagonal line)
    plt.plot([0, 1], [0, 1], color='navy', linestyle='--')  # Plots the dashed diagonal line representing a random classifier (AUC = 0.5).

    # Set plot limits for better visualization
    plt.xlim([0.0, 1.0])  # Sets the x-axis limits (FPR) from 0 to 1.
    plt.ylim([0.0, 1.05])  # Sets the y-axis limits (TPR) from 0 to 1.05 (slightly above 1 for better label placement).

    # Add axis labels and title
    plt.xlabel('False Positive Rate')  # Labels the x-axis.
    plt.ylabel('True Positive Rate')  # Labels the y-axis.
    plt.title('ROC_AUC')  # Sets the title of the plot.

    # Display the legend (showing the AUC value)
    plt.legend(loc="lower right")  # Places the legend in the lower right corner of the plot.

    return plt.show()  # Displays the generated plot.

In [47]:
def graph_confusion(conf_matrix):
    """
    Generates and displays a confusion matrix heatmap.

    Args:
        conf_matrix (numpy.ndarray): The confusion matrix as a NumPy array.

    Returns:
        matplotlib.pyplot.show: Displays the generated plot.
    """

    # Create a Pandas DataFrame for the confusion matrix with custom labels
    df_cm = pd.DataFrame(conf_matrix, index=['No', 'Si'], columns=['No', 'Si'])

    # Define a diverging color palette for the heatmap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Create the confusion matrix heatmap
    plt.figure(figsize=(8, 6))  # Adjust figure size for better visualization
    sns.heatmap(df_cm, annot=True, fmt='d', cmap=cmap)  # Annotate cells with values, format as integers
    plt.xlabel('Prediction')  # Label the x-axis
    plt.ylabel('Actual')  # Label the y-axis
    plt.title('Confusion Matrix')  # Set the title of the plot
    return plt.show()  # Display the plot

In [48]:


# Suprimir los warnings de tipo FutureWarning y ConvergenceWarning
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=ConvergenceWarning)

## DecisionTreeClassifier Model

In [None]:
# Establish the parameters for the Decision Tree model's hyperparameter tuning.
param_grid_dt = {
    'criterion': ['gini', 'entropy'],  # Different criteria for splitting nodes.
    'max_depth': [None, 5, 10, 15],  # Maximum depth of the tree. None means no limit.
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node.
    'min_samples_leaf': [1, 2, 5],  # Minimum number of samples required to be at a leaf node.
    'max_features': ['auto', 'sqrt', 'log2', None]}  # Number of features to consider when looking for the best split.
                                                    # 'auto' uses all features, 'sqrt' uses the square root of the number of features,
                                                    # 'log2' uses the base-2 logarithm of the number of features, None is the same as 'auto'.
                                                    
# Apply the Train_Predict function to train, predict, and evaluate the Decision Tree model.
predicted_valid_dt,best_params_dt,best_model_dt, roc_auc_dt, fpr_dt, tpr_dt,f1_dt,conf_matrix_dt,train_accuracy_dt,test_accuracy_dt = Train_Predict(df, DecisionTreeClassifier(), param_grid_dt, 3)

In [50]:
print("Best Params:", best_params_dt)
print("ROC_AUC metric:", roc_auc_dt)
print("F1 Score:", f1_dt)
print("Train Accuracy:",train_accuracy_dt)
print("Test Accuracy:",test_accuracy_dt)

Best Params: {'criterion': 'gini', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 2, 'min_samples_split': 10}
ROC_AUC metric: 0.7631576732756589
F1 Score: 0.743047830923248
Train Accuracy: 0.7707692307692308
Test Accuracy: 0.7784615384615384


In [None]:
graph(fpr_dt,tpr_dt,roc_auc_dt)

In [None]:
graph_confusion(conf_matrix_dt)

## LogisticRegression Model

In [None]:
# Define the hyperparameter grid for Logistic Regression.
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],  # Different values for the regularization strength.
                                    # Smaller values indicate stronger regularization. Which means that the model will penalize the big coefficients more.
    'solver': ['liblinear', 'lbfgs', 'saga'],  # Optimization algorithms to find the best model parameters.
                                               # 'liblinear' is suitable for smaller datasets.
                                               # 'lbfgs' and 'saga' can handle larger datasets and different penalties.
                                               # 'saga' also supports elasticnet.
    'max_iter': [100, 200, 500],  # Different values for the maximum number of iterations the solver runs.
                                     # The solver stops when it converges or reaches max_iter.
    'penalty': ['l2', 'l1', 'elasticnet'],  # Types of regularization to prevent overfitting.
                                           # 'l2' (Ridge) Penalizes the sum of the squares of the coefficients.
                                           # 'l1' (Lasso) Penalizes the sum of the absolute values ​​of the coefficients.
                                           # 'elasticnet' combines L1 and L2 penalties.
    'multi_class': ['ovr', 'multinomial']  # Handles multi-class classification problems.
                                           # 'ovr' (One-vs-Rest) trains a classifier for each class vs. the rest.
                                           # 'multinomial' trains a single classifier for all classes.
}

# Apply the Train_Predict function to train, predict, and evaluate the Logistic Regression model.
predicted_valid_lr, best_params_lr,best_model_lr, roc_auc_lr, fpr_lr, tpr_lr, f1_lr, conf_matrix_lr, train_accuracy_lr, test_accuracy_lr = Train_Predict(
    df, LogisticRegression(), param_grid_lr, 3)

In [54]:
print("Best Params:", best_params_lr)
print("ROC_AUC metric:", roc_auc_lr)
print("F1 Score:", f1_lr)
print("Train Accuracy:",train_accuracy_lr)
print("Test Accuracy:",test_accuracy_lr)

Best Params: {'C': 0.1, 'max_iter': 100, 'multi_class': 'ovr', 'penalty': 'l1', 'solver': 'liblinear'}
ROC_AUC metric: 0.7364173427138385
F1 Score: 0.7347781217750258
Train Accuracy: 0.7430769230769231
Test Accuracy: 0.7497435897435898


In [None]:
graph(fpr_lr,tpr_lr,roc_auc_lr)

In [None]:
graph_confusion(conf_matrix_lr)

## RandomForestClassifier Model

In [57]:
# Define the hyperparameter grid for RandomForestClassifier.
param_grid_rf = {
    'n_estimators': [100, 200, 500],  #Number of trees in the forest
    'max_depth': [None, 5, 10, 15],  #Maximum depth of trees
    'min_samples_split': [2, 5, 10],  #Minimum number of samples to divide a node
    'min_samples_leaf': [1, 2, 5],  #Minimum number of samples in a leafe node
}

#Use the train_predict with randomforestclassifier
predicted_valid_rf, best_params_rf,best_model_rf, roc_auc_rf, fpr_rf, tpr_rf, f1_rf, conf_matrix_rf, train_accuracy_rf, test_accuracy_rf = Train_Predict(
    df, RandomForestClassifier(), param_grid_rf, 3  # 3 pliegues en la validación cruzada
)

In [58]:
print("Best Params:", best_params_rf)
print("ROC_AUC metric:", roc_auc_rf)
print("F1 Score:", f1_rf)
print("Train Accuracy:",train_accuracy_rf)
print("Test Accuracy:",test_accuracy_rf)

Best Params: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 500}
ROC_AUC metric: 0.8821111185915778
F1 Score: 0.8748639825897715
Train Accuracy: 0.8907692307692308
Test Accuracy: 0.8994871794871795


In [None]:
graph(fpr_rf,tpr_rf,roc_auc_rf)

In [None]:
graph_confusion(conf_matrix_rf)

## XGBoost adjusted

In [61]:
# Define the hyperparameter grid for XGBoost.
param_grid_xgb = {
    'n_estimators': [100, 200, 500],  #Number of trees
    'max_depth': [3, 5, 7],  # Maximum depth of trees
    'Learning_Rate': [0.01, 0.1, 0.3], # learning rate
    'subsample': [0.8, 0.9, 1.0], # proportion of samples used to train each tree
    'Colsample_bytree': [0.8, 0.9, 1.0], # fraction of features used for each tree
}

#Use the Train_Predict Function with XGBOOST
predicted_valid_xgb, best_params_xgb,best_model_xgb, roc_auc_xgb, fpr_xgb, tpr_xgb, f1_xgb, conf_matrix_xgb, train_accuracy_xgb, test_accuracy_xgb = Train_Predict(
    df, xgb.XGBClassifier(), param_grid_xgb, 3)  # 3 pliegues en la validación cruzada

In [62]:
print("Best Params:", best_params_xgb)
print("ROC_AUC metric:", roc_auc_xgb)
print("F1 Score:", f1_xgb)
print("Train Accuracy:",train_accuracy_xgb)
print("Test Accuracy:",test_accuracy_xgb)

Best Params: {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 7, 'n_estimators': 500, 'subsample': 0.8}
ROC_AUC metric: 0.8964532770054195
F1 Score: 0.8919786096256684
Train Accuracy: 0.9
Test Accuracy: 0.9035897435897436


In [None]:
graph(fpr_xgb,tpr_xgb,roc_auc_xgb)


In [None]:
graph_confusion(conf_matrix_xgb)

## Modelo de redes neuronales MPL

### Modelo con 2 capas totalmente conectadas

In [65]:
features_train8, features_valid8, features_te8, target_train8, target_valid8, target_te8  = split_data(X_resampled,y_resampled)

In [66]:
def create_model_connected(input_dim):
    # Define the input layer:
    inputs = Input(shape=(input_dim,))  # Input layer with shape based on input_dim

    # Connect the dense layer to the input layer.
    x = Dense(64, activation='relu')(inputs)  # Fully connected layer with 64 neurons, ReLU activation

    # Dropout layer to prevent overfitting. Randomly disables neurons during training.
    x = Dropout(0.3)(x)  # Dropout layer with a 30% dropout rate

    # Another dense (fully connected) layer.
    x = Dense(32, activation='relu')(x)  # Fully connected layer with 32 neurons, ReLU activation

    # Another dropout layer.
    x = Dropout(0.2)(x)  # Dropout layer with a 20% dropout rate

    # Output layer for binary classification.
    outputs = Dense(1, activation='sigmoid')(x)  # Output layer with 1 neuron, Sigmoid activation

    # Define the model to use.
    model = keras.Model(inputs=inputs, outputs=outputs)  # Create the Keras model

    # Compile the model.
    model.compile(loss='binary_crossentropy', optimizer='adam')  # Loss function: binary_crossentropy
                                                                 # Optimizer: adam (an algorithm to adjust weights)

    return model  # Return the created model

In [67]:
model8= create_model_connected(features_train8.shape[1])

#Early Stopping:
# Monitor: The metric you use to monitor and decide when to stop
# Patience: Defines the number of training iterations (times) are allowed without improvement in metric
# Restore_Best_weights: Indicates whether the pesos of the model should be restored to the values ​​they had when the best performance in the metric (just before starting to be overlapping).
# MODE: Specifies the address of the improvement that is sought in metric.
# 'Max': Training stops when the metric stops increasing.
# 'min': Training stops when the metric stops decreasing. 
# 'Auto': Try to infer the direction of improvement based on the name of the metric.
early_stopping = EarlyStopping(monitor='val_auc', patience=5,restore_best_weights=True,mode='auto')

#We train the model
model8.fit(features_train8,target_train8)

 #Predict
train_pred8 = (model8.predict(features_train8) >= 0.5).astype(int) 
predicted_valid8 = (model8.predict(features_valid8) >= 0.5).astype(int) 
test_pred8 = (model8.predict(features_te8) >= 0.5).astype(int) 
    
     
train_accuracy8 = accuracy_score(target_train8, train_pred8)
valid_accuracy8 = accuracy_score(target_valid8, predicted_valid8)
test_accuracy8 = accuracy_score(target_te8, test_pred8) 


#We calculate roc_auc
fpr8, tpr8, thresholds8 = roc_curve(target_valid8, predicted_valid8)
roc_auc8 = auc(fpr8, tpr8)
f18 = f1_score(target_valid8, predicted_valid8)
conf_matrix8 = confusion_matrix(target_valid8, predicted_valid8)
print("ROC_AUC metric:", roc_auc8)
print("F1 Score:", f18)
print("Train Accuracy:",train_accuracy8)
print("Test Accuracy:",test_accuracy8)

[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 87.2730
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m31/31[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step 
[1m31/31[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step 
[1m31/31[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step 
ROC_AUC metric: 0.5
F1 Score: 0.6671223513328777
Train Accuracy: 0.5
Test Accuracy: 0.49948717948717947


In [None]:
graph(fpr8,tpr8,roc_auc8)

In [None]:
graph_confusion(conf_matrix8)

## KNN Model

In [None]:
# Define the hyperparameter grid for KNN
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 10],  #Number of neighbors to consider
    'weights': ['uniform', 'distance'],  #Weighting type: uniform or distance
    'metric': ['euclidean', 'manhattan'],  #Metrics for the distance between points
}

#Use the Train_Predict Function with KneightSclassifier
predicted_valid_knn, best_params_knn,best_model_knn, roc_auc_knn, fpr_knn, tpr_knn, f1_knn, conf_matrix_knn, train_accuracy_knn, test_accuracy_knn = Train_Predict(
    df, KNeighborsClassifier(), param_grid_knn, 3  # 3 pliegues en la validación cruzada
)

In [71]:
print("Best Params:", best_params_knn)
print("ROC_AUC metric:", roc_auc_knn)
print("F1 Score:", f1_knn)
print("Train Accuracy:",train_accuracy_knn)
print("Test Accuracy:",test_accuracy_knn)

Best Params: {'metric': 'manhattan', 'n_neighbors': 10, 'weights': 'distance'}
ROC_AUC metric: 1.0
F1 Score: 1.0
Train Accuracy: 1.0
Test Accuracy: 1.0


In [None]:
graph(fpr_knn,tpr_knn,roc_auc_knn)

In [None]:
graph_confusion(conf_matrix_knn)

## CatBoost Model

In [74]:
#Parameters that we want to adjust for CatCoostclassifier
param_grid_catboost = {
    'iterations': [500, 1000],  #Number of iterations
    'learning_rate': [0.01, 0.05, 0.1],  #Learning rate
    'depth': [6, 8, 10],  #Tree depth
    'l2_leaf_reg': [1, 3, 5]  #L2 regularization
}

#Use the Train_Predict function with catboostclassifier
from catboost import CatBoostClassifier
predicted_valid_catboost, best_params_catboost,best_model_catboost, roc_auc_catboost, fpr_catboost, tpr_catboost, f1_catboost, conf_matrix_catboost, train_accuracy_catboost, test_accuracy_catboost = Train_Predict(
    df, CatBoostClassifier(silent=True), param_grid_catboost, 3  #3 folds in cross validation
)


In [75]:
print("Best Params:", best_params_catboost)
print("ROC_AUC metric:", roc_auc_catboost)
print("F1 Score:", f1_catboost)
print("Train Accuracy:",train_accuracy_catboost)
print("Test Accuracy:",test_accuracy_catboost)

Best Params: {'depth': 8, 'iterations': 500, 'l2_leaf_reg': 1, 'learning_rate': 0.01}
ROC_AUC metric: 0.9015930588750126
F1 Score: 0.8961038961038961
Train Accuracy: 0.9061538461538462
Test Accuracy: 0.9107692307692308


In [None]:
graph(fpr_catboost,tpr_catboost,roc_auc_catboost)

In [None]:
graph_confusion(conf_matrix_catboost)

## lightgbm Model

In [None]:
#Parameters that we want to adjust for Lightgbm Model
param_grid_lgbm = {
    'num_leaves': [31, 50, 100],  # Number of leaves in each tree
    'Max_Depth': [5, 10, 15], # maximum depth of each tree
    'Learning_Rate': [0.01, 0.05, 0.1], # learning rate
    'n_estimators': [100, 500, 1000], # number of trees
    'Boosting_type': ['GBDT', 'dart'], # Boosting type
    'subsample': [0.7, 0.8, 1.0] # sampling fraction for each tree
}

#Use the Train_Predict Function with LGBMCLASSIFIER
predicted_valid_lgbm, best_params_lgbm, best_model_lgbm, roc_auc_lgbm, fpr_lgbm, tpr_lgbm, f1_lgbm, conf_matrix_lgbm, train_accuracy_lgbm, test_accuracy_lgbm = Train_Predict(
    df, LGBMClassifier(), param_grid_lgbm, 3  # 3 pliegues en la validación cruzada
)

In [79]:
print("Best Params:", best_params_lgbm)
print("ROC_AUC metric:", roc_auc_lgbm)
print("F1 Score:", f1_lgbm)
print("Train Accuracy:",train_accuracy_lgbm)
print("Test Accuracy:",test_accuracy_lgbm)

Best Params: {'boosting_type': 'gbdt', 'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 100, 'num_leaves': 31, 'subsample': 0.7}
ROC_AUC metric: 0.911822129464436
F1 Score: 0.9094736842105263
Train Accuracy: 0.9112820512820513
Test Accuracy: 0.9107692307692308


In [None]:
graph(fpr_lgbm,tpr_lgbm,roc_auc_lgbm)

In [None]:
graph_confusion(conf_matrix_lgbm)

##  Results

In [83]:
#Create a list with all the results
resultados = [
    (best_model_dt.__class__.__name__, roc_auc_dt,f1_dt,conf_matrix_dt),  # Extrae el nombre del modelo
    (best_model_lr.__class__.__name__, roc_auc_lr,f1_lr,conf_matrix_lr),
    (best_model_rf.__class__.__name__, roc_auc_rf,f1_rf,conf_matrix_rf),
    ("XGBoost", roc_auc_xgb,f1_xgb,conf_matrix_xgb), #Estos ya son strings, no hay problema
    ("KNeighborsClassifier", roc_auc_knn,f1_knn,conf_matrix_knn),
    ("Modelo de redes neuronales MPL con 2 capas", roc_auc8,f18,conf_matrix8),
    (best_model_catboost.__class__.__name__, roc_auc_catboost,f1_catboost,conf_matrix_catboost),
    ("LGBMClassifier", roc_auc_lgbm,f1_lgbm,conf_matrix_lgbm)
]
resultados_ordenados = sorted(resultados, key=lambda x: x[1], reverse=True)

In [84]:
#Create a DataFrame with the results
df_resultados = pd.DataFrame(resultados_ordenados, columns=['Modelo', 'ROC_AUC', 'F1-Score', 'Matriz de Confusión'])
df_resultados

Unnamed: 0,Modelo,ROC_AUC,F1-Score,Matriz de Confusión
0,KNeighborsClassifier,1.0,1.0,"[[487, 0], [0, 488]]"
1,LGBMClassifier,0.911822,0.909474,"[[457, 30], [56, 432]]"
2,CatBoostClassifier,0.901593,0.896104,"[[465, 22], [74, 414]]"
3,XGBoost,0.896453,0.891979,"[[457, 30], [71, 417]]"
4,RandomForestClassifier,0.882111,0.874864,"[[458, 29], [86, 402]]"
5,DecisionTreeClassifier,0.763158,0.743048,"[[410, 77], [154, 334]]"
6,LogisticRegression,0.736417,0.734778,"[[362, 125], [132, 356]]"
7,Modelo de redes neuronales MPL con 2 capas,0.5,0.667122,"[[0, 487], [0, 488]]"


We can observe that the KneightSclassifier model is adjusted since it shows Roc_auc Metric values: 1.0, F1 Score: 1.0, Train Accuracy: 1.0 and Test Accuracy: 1.0. So we will select the following model that is that of LGBMClassifier to make the predictions.

In [92]:
#Create a dataframe with just the Active users
just_actives = df[df['Active']==1]

In [93]:
#Make predictions with the active users only
features = just_actives.drop(columns=['Active'],axis=1)
just_actives['predicted'] = best_model_lgbm.predict(features)
just_actives


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  just_actives['predicted'] = best_model_lgbm.predict(features)


Unnamed: 0_level_0,PaperlessBilling,PaymentMethod,SeniorCitizen,Partner,Dependents,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,...,Active,annual_charge,Monthly,One year,Two year,Female,Male,DSL,Fiber optic,predicted
customerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5575-GNVDE,0,1,0,0,0,1,0,1,0,0,...,1,1889.50,False,True,False,False,True,True,False,1
1452-KIOVK,1,3,0,0,1,0,1,0,0,1,...,1,1949.40,True,False,False,False,True,False,True,0
6388-TABGU,0,4,0,0,1,1,1,0,0,0,...,1,3487.95,False,True,False,False,True,True,False,1
9763-GRSKD,1,1,0,1,1,1,0,0,0,0,...,1,587.45,True,False,False,False,True,True,False,1
8091-TTVAX,0,3,0,1,0,0,0,1,0,1,...,1,5681.10,False,True,False,False,True,False,True,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9767-FFLEM,1,3,0,0,0,0,0,0,0,0,...,1,2625.25,True,False,False,False,True,False,True,1
8456-QDAVC,1,4,0,0,0,0,0,0,0,1,...,1,1495.10,True,False,False,False,True,False,True,0
6840-RESVB,1,1,0,1,1,1,0,1,1,1,...,1,1990.50,False,True,False,False,True,True,False,1
2234-XADUH,1,3,0,1,1,0,1,1,0,1,...,1,7362.90,False,True,False,True,False,False,True,1


In [94]:
#Change the format
for index, row in just_actives.iterrows():
    if row['predicted'] == 1:
        just_actives.loc[index, 'predicted'] = 'Yes'
    elif row['predicted'] == 0:
        just_actives.loc[index, 'predicted'] = 'No'


In [95]:
#DataFrame with the predictions
df_unified = df_unified.reset_index()
final_df =just_actives.reset_index()
final_df = final_df[['customerID','Active','predicted']]
final_df = final_df.merge(df_unified, on= 'customerID')

In [96]:
final_df.columns

Index(['customerID', 'Active_x', 'predicted', 'level_0', 'index', 'BeginDate',
       'EndDate', 'Type', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'gender', 'SeniorCitizen', 'Partner',
       'Dependents', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'MultipleLines', 'Active_y', 'annual_charge'],
      dtype='object')

In [97]:
final = final_df[['customerID','gender','Active_y','predicted']]
final

Unnamed: 0,customerID,gender,Active_y,predicted
0,5575-GNVDE,Male,Yes,Yes
1,1452-KIOVK,Male,Yes,No
2,6388-TABGU,Male,Yes,Yes
3,9763-GRSKD,Male,Yes,Yes
4,8091-TTVAX,Male,Yes,Yes
...,...,...,...,...
3244,9767-FFLEM,Male,Yes,Yes
3245,8456-QDAVC,Male,Yes,No
3246,6840-RESVB,Male,Yes,Yes
3247,2234-XADUH,Female,Yes,Yes


In [98]:
counts = final.groupby('gender')['predicted'].count().reset_index()

In [99]:
counts = final.groupby('predicted').count().reset_index()

In [100]:
counts

Unnamed: 0,predicted,customerID,gender,Active_y
0,No,736,736,736
1,Yes,2513,2513,2513


In [None]:
#Create the bar chart
plt.figure(figsize=(8, 6))
ax = plt.bar(counts['predicted'], counts['customerID'])

#Add data labels to each bar
for p in ax.patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom')

plt.title('Predicciones por cliente activo')
plt.xlabel('Abandono del servicio')
plt.ylabel('Cantidad de clientes')
plt.show()

## Conclusions of the Interconnect customer cancellation prediction project
The main objective of this project was to develop a Machine Learning model capable of predicting the probability that an Interconnect client will cancel its services. This would allow company marketing team to offer promotional codes and special plans to customers at risk of cancellation, in order to retain them and reduce the cancellation rate.

### Results
After an exhaustive exploratory analysis of the data and a rigorous preprocessing process, various classification models were evaluated to identify the best adapt to the data and comply with the established evaluation criteria. The evaluated models were:

KneightClassifier
LGBMClassifier
Catboostclassifier
XGBOOST
Randomforestclassifier
DecisiTreeclassifier
Logisticregresion
MPL neural networks with 2 layers

### Predictions
Using the LGBMClassifier model, it was predicted that 736 users will not abandon the service and 2513 will.

### Conclusions

-A Machine Learning model has been developed capable of predicting the probability that an Interconnect client will cancel its services with a good level of precision (AUC-ROC of 0.911822).
-The LGBMClassifier model was selected as the most suitable for predictions due to its balance between performance and generalization.
The predictions made suggest that a significant number of customers (2513) could leave the service in the near future.

### Recommendations

-It is essential that Interconnect's marketing equipment uses this model to identify customers at risk of cancellation and offer them promotional codes and special plans in a timely manner.
-It is recommended to carry out continuous monitoring of the model performance and update it periodically with new data to guarantee its precision and efficiency over time.
-It would be interesting to explore the possibility of including other variables in the model, such as demographic data or information on the use of services, to further improve their predictive capacity.