**Telecom Customer Churn**

---

Challenge Link: https://www.kaggle.com/c/dsrp-kaggle-semillero-01

Author: **Keven Fernandez Carrillo** 
with the support of the **Data Science Research Perú** community.

Version: 1.0

GitHub: 
- https://github.com/KevenRFC
- https://github.com/DataScienceResearchPeru

**Note:** This R notebook translates the steps of the original Python notebook for this challenge. As in the Python version, all categorical variables are converted to numeric ones before the model is trained, as a way to explore additional data preprocessing methods in this language.

# 1) IMPORT & INSTALL PACKAGES

In [None]:
# general data manipulation
library('dplyr')      # data manipulation (case_when, if_else)
library('data.table') # data manipulation (copy)
library('caret')      # dummyVars, confusionMatrix, varImp
library('caTools')
library('MLmetrics')  # Accuracy

# 2) DATA UNDERSTANDING

## 2.1) Load Data

In [None]:
# Variables selected for this BASELINE:
features_iniciales = c('ID',
 'Sexo',
 'AdultoMayor',
 'MesesCliente',
 'ServicioTelefonico',
 'LineasMultiples',
 'ProteccionDispositivo',
 'SoporteTecnico',
 'FacturacionElectronica',
 'MontoCargadoMes')

In [None]:
# Load the train and test sets from the input directory

path = "../input/"

df_train <- read.csv(paste(path,'churn_data_train.csv',sep = ""), stringsAsFactors = F, na.strings = NA)[c(features_iniciales,c("Churn"))]
df_test  <- read.csv(paste(path,'churn_data_test.csv',sep = ""), stringsAsFactors = F, na.strings = NA)[features_iniciales]

## 2.2) Data Exploration

### 2.2.1) Basic Statistics

In [None]:
dim(df_train)
dim(df_test)

In [None]:
head(df_train,5)

In [None]:
head(df_test,5)

In [None]:
str(df_train)

In [None]:
str(df_test)

In [None]:
# Defining features types
ID <- 'ID'
TARGET <- 'Churn'
df_train[TARGET] = as.factor(as.character(df_train[[TARGET]]))

In [None]:
head(df_train)

In [None]:
# Target distribution
# (pandas equivalent: df_train[TARGET].value_counts(dropna=False))
table(df_train$Churn)

In [None]:
table(df_train$Churn)/length(df_train$Churn)*100
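
The two cells above report the class balance as counts and as percentages. As a minimal sketch on a made-up target vector (the values below are hypothetical, not the challenge data), the same figures can be obtained with `prop.table()`, which also keeps missing values visible when `useNA = "ifany"` is set:

```r
# Toy target vector standing in for df_train$Churn (hypothetical values)
churn <- factor(c("0", "0", "0", "1", "0", "1", "0", "0"))

# Absolute frequencies; a separate NA bucket is shown only if NAs exist
counts <- table(churn, useNA = "ifany")

# Relative frequencies expressed as percentages
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```

A strongly unbalanced target would be a reason to look beyond plain accuracy when comparing models.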

In [None]:
# Generate basic statistics for each variable with summary():
### numeric columns: Min., 1st Qu. (Q1), Median (Q2), Mean, 3rd Qu. (Q3), Max.,
###   plus the number of NA's when any are present.
### character columns: Length, Class and Mode only.
### factor columns (e.g. Churn): number of observations per level.

summary(df_train)

In [None]:
summary(df_test)

### 2.2.2) EDA

#### 2.2.2.a) Evaluate missings

In [None]:
print("Train - Missings")
sapply(df_train, function(x) sum(is.na(x)|x==""))

In [None]:
print("Test - Missings")
sapply(df_test, function(x) sum(is.na(x)|x==""))
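
The missing-value check above treats both `NA` and empty strings as missing. The sketch below wraps that idea in a small helper and adds the share of missing rows per column. The function name `count_missing` and the toy data frame are assumptions made for illustration only:

```r
# Hypothetical helper: counts NA values and, for text columns, empty strings
count_missing <- function(df) {
  vapply(
    df,
    function(x) sum(is.na(x) | (is.character(x) & x == "")),
    integer(1)
  )
}

# Toy data with one missing numeric value and two missing text values
toy <- data.frame(
  MesesCliente = c(12, NA, 40, 5),
  SoporteTecnico = c("Si", "", "No", NA),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(toy)
missing_pct <- round(100 * missing_counts / nrow(toy), 1)

print(missing_counts)
print(missing_pct)
```

Looking at percentages rather than raw counts makes it easier to decide whether a column should be imputed or dropped.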

#### 2.2.2.b) Identify outliers

In [None]:
print("Code here")

#### 2.2.2.c) Additional analysis

In [None]:
print("Code here")

# 3) DATA PREPARATION

In [None]:
# Copy dataset and then apply transformation to copied dataset
ds_train <- copy(df_train)

In [None]:
ds_test <- copy(df_test)

## 3.1) Data Cleaning

### 3.1.1) Impute missings

In [None]:
# AdultoMayor (imputed with the MODE)
ds_train[is.na(ds_train$AdultoMayor), 'AdultoMayor'] <- 0
ds_test[is.na(ds_test$AdultoMayor), 'AdultoMayor'] <- 0

# MesesCliente (imputed with the MEAN)
ds_train[is.na(ds_train$MesesCliente), 'MesesCliente'] <- 32
ds_test[is.na(ds_test$MesesCliente), 'MesesCliente'] <- 32

# ProteccionDispositivo (imputed with the MODE)
ds_train[ds_train$ProteccionDispositivo=="", 'ProteccionDispositivo'] <- 'No'
ds_test[ds_test$ProteccionDispositivo=="", 'ProteccionDispositivo'] <- 'No'

# SoporteTecnico (imputed with the MODE)
ds_train[ds_train$SoporteTecnico=="", 'SoporteTecnico'] <- 'No'
ds_test[ds_test$SoporteTecnico=="", 'SoporteTecnico'] <- 'No'

# FacturacionElectronica (imputed with the MODE)
ds_train[ds_train$FacturacionElectronica=="", 'FacturacionElectronica'] <- 'Si'
ds_test[ds_test$FacturacionElectronica=="", 'FacturacionElectronica'] <- 'Si'

# MontoCargadoMes (imputed with the MEAN)
ds_train[is.na(ds_train$MontoCargadoMes), 'MontoCargadoMes'] <- 68.7
ds_test[is.na(ds_test$MontoCargadoMes), 'MontoCargadoMes'] <- 68.7
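
The imputation values above are written as constants. As a sketch of one possible alternative, the fill values could be computed from the training set and then reused on the test set, so they stay in sync with the data. The helper names `stat_mode` and `impute`, and the toy data frames, are assumptions for illustration:

```r
# Hypothetical helper: most frequent value, ignoring NA and empty strings
stat_mode <- function(x) {
  x <- x[!is.na(x) & x != ""]
  names(which.max(table(x)))
}

# Toy train/test frames (made-up values)
train <- data.frame(
  AdultoMayor = c(0, 0, 1, NA, 0),
  MesesCliente = c(10, 20, NA, 30, 60),
  SoporteTecnico = c("No", "", "Si", "No", "No"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  AdultoMayor = c(NA, 1),
  MesesCliente = c(NA, 5),
  SoporteTecnico = c("", "Si"),
  stringsAsFactors = FALSE
)

# Fill values are learned from the training set only
mode_adulto  <- as.numeric(stat_mode(train$AdultoMayor))
mean_meses   <- mean(train$MesesCliente, na.rm = TRUE)
mode_soporte <- stat_mode(train$SoporteTecnico)

# Apply the same fill values to any data frame with these columns
impute <- function(df) {
  df$AdultoMayor[is.na(df$AdultoMayor)] <- mode_adulto
  df$MesesCliente[is.na(df$MesesCliente)] <- mean_meses
  is_blank <- is.na(df$SoporteTecnico) | df$SoporteTecnico == ""
  df$SoporteTecnico[is_blank] <- mode_soporte
  df
}

train_imp <- impute(train)
test_imp <- impute(test)
print(test_imp)
```

Computing the statistics on the training set alone avoids leaking information from the test set into the preprocessing step.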

In [None]:
print("Train - Missings")
sapply(ds_train, function(x) sum(is.na(x)|x==""))
print("Test - Missings")
sapply(ds_test, function(x) sum(is.na(x)|x==""))

### 3.1.2) Treat outliers

In [None]:
print("Code here")

## 3.2) Data Transformation

In [None]:
head(ds_train)

In [None]:
# Sexo 
ds_train$Sexo = case_when(
       ds_train$Sexo == 'Masculino' ~ 1,
       ds_train$Sexo == 'Femenino' ~ 0
)

ds_test$Sexo = case_when(
       ds_test$Sexo == 'Masculino' ~ 1,
       ds_test$Sexo == 'Femenino' ~ 0
)

In [None]:
# ServicioTelefonico 
ds_train$ServicioTelefonico = case_when(
       ds_train$ServicioTelefonico == 'Si' ~ 1,
       ds_train$ServicioTelefonico == 'No' ~ 0
)

ds_test$ServicioTelefonico = case_when(
       ds_test$ServicioTelefonico == 'Si' ~ 1,
       ds_test$ServicioTelefonico == 'No' ~ 0
)

In [None]:
# LineasMultiples 
ds_train$LineasMultiples = case_when(
       ds_train$LineasMultiples == 'Si' ~ 2,
       ds_train$LineasMultiples == 'No' ~ 1,
       ds_train$LineasMultiples == 'Sin servicio telefonico' ~ 0
)

ds_test$LineasMultiples = case_when(
       ds_test$LineasMultiples == 'Si' ~ 2,
       ds_test$LineasMultiples == 'No' ~ 1,
       ds_test$LineasMultiples == 'Sin servicio telefonico' ~ 0
)

In [None]:
# FacturacionElectronica 
ds_train$FacturacionElectronica = case_when(
       ds_train$FacturacionElectronica == 'Si' ~ 1,
       ds_train$FacturacionElectronica == 'No' ~ 0
)

ds_test$FacturacionElectronica = case_when(
       ds_test$FacturacionElectronica == 'Si' ~ 1,
       ds_test$FacturacionElectronica == 'No' ~ 0
)

In [None]:
# Create dummy features
ds_train[ds_train$ProteccionDispositivo == 'Sin servicio de internet', 'ProteccionDispositivo'] <- 'SinServInter'
ds_train[ds_train$SoporteTecnico == 'Sin servicio de internet', 'SoporteTecnico'] <- 'SinServInter'

ds_test[ds_test$ProteccionDispositivo == 'Sin servicio de internet', 'ProteccionDispositivo'] <- 'SinServInter'
ds_test[ds_test$SoporteTecnico == 'Sin servicio de internet', 'SoporteTecnico'] <- 'SinServInter'

# Identify the categories of the variables to encode (learned from the training set)
dmy <- dummyVars(~ SoporteTecnico + ProteccionDispositivo, data=ds_train, fullRank=F, sep = "." )
# Convert to dummy variables
train_dummies <- data.frame(predict(dmy, newdata=ds_train))
ds_train = cbind(ds_train,train_dummies)

test_dummies <- data.frame(predict(dmy, newdata=ds_test))
ds_test = cbind(ds_test,test_dummies)

# Drop the original columns that were converted to dummies
ds_train <- ds_train[, !colnames(ds_train)%in%c('SoporteTecnico', 'ProteccionDispositivo')]
ds_test <- ds_test[, !colnames(ds_test)%in%c('SoporteTecnico', 'ProteccionDispositivo')]
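
To make the dummy encoding above more concrete, here is a minimal base R sketch of the same idea using `model.matrix()` instead of caret's `dummyVars()`. This is a swapped-in technique shown on a made-up column, and the dotted column names are chosen only to resemble the `sep = "."` setting:

```r
# Toy column with the three categories used in the notebook (made-up rows)
toy <- data.frame(
  SoporteTecnico = c("Si", "No", "SinServInter", "No"),
  stringsAsFactors = FALSE
)

# Fix the level set up front so train and test would yield the same columns
toy$SoporteTecnico <- factor(
  toy$SoporteTecnico,
  levels = c("No", "Si", "SinServInter")
)

# "- 1" removes the intercept, so every level gets its own 0/1 column
dummies <- model.matrix(~ SoporteTecnico - 1, data = toy)

# Insert a dot between the variable name and the level name
colnames(dummies) <- sub("^SoporteTecnico", "SoporteTecnico.", colnames(dummies))

print(dummies)
```

With a single factor and no intercept, each row has exactly one indicator set to 1. When several factors share one formula, `model.matrix()` drops a level from the later ones, which is one reason a dedicated encoder such as `dummyVars()` is convenient.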

In [None]:
head(ds_train)

In [None]:
head(ds_test)

## 3.3) Feature Engineering

In [None]:
# New Feature 1
tmp_byAdultoMayor_medianMontoMes <- aggregate(MontoCargadoMes ~ AdultoMayor, ds_train, median)
tmp_byAdultoMayor_medianMontoMes

In [None]:
# Flag: 1 when MontoCargadoMes is at or above the cutoff of the customer's AdultoMayor group.
# Note: despite 'bySexo' in the name, the grouping variable is AdultoMayor.
# The cutoffs 68.7 and 85 are hard-coded; they appear to come from the per-group medians computed above.
ds_train$flg_bySexo_mayorMedianMontoMes <- case_when(
       ds_train$AdultoMayor == 0 & ds_train$MontoCargadoMes>=68.7 ~ 1,
       ds_train$AdultoMayor == 0 & ds_train$MontoCargadoMes<68.7 ~ 0,
       ds_train$AdultoMayor == 1 & ds_train$MontoCargadoMes>=85 ~ 1,
       ds_train$AdultoMayor == 1 & ds_train$MontoCargadoMes<85 ~ 0
)

ds_test$flg_bySexo_mayorMedianMontoMes <- case_when(
       ds_test$AdultoMayor == 0 & ds_test$MontoCargadoMes>=68.7 ~ 1,
       ds_test$AdultoMayor == 0 & ds_test$MontoCargadoMes<68.7 ~ 0,
       ds_test$AdultoMayor == 1 & ds_test$MontoCargadoMes>=85 ~ 1,
       ds_test$AdultoMayor == 1 & ds_test$MontoCargadoMes<85 ~ 0
)
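
The flag above compares each customer's monthly charge with a cutoff for their `AdultoMayor` group. As a sketch under stated assumptions, the cutoffs could be looked up from the aggregated medians rather than typed in by hand. The column name `flg_mayorMedianMontoMes`, the helper `add_flag`, and the toy values are hypothetical:

```r
# Toy data (made-up values)
train <- data.frame(
  AdultoMayor = c(0, 0, 0, 1, 1, 1),
  MontoCargadoMes = c(20, 60, 100, 70, 80, 90)
)
test <- data.frame(
  AdultoMayor = c(0, 1, 1),
  MontoCargadoMes = c(60, 79, 95)
)

# Per-group medians, computed on the training set only
medians <- aggregate(MontoCargadoMes ~ AdultoMayor, data = train, FUN = median)

# Hypothetical helper: look up each row's group median and build the 0/1 flag
add_flag <- function(df, medians) {
  group_median <- medians$MontoCargadoMes[match(df$AdultoMayor, medians$AdultoMayor)]
  df$flg_mayorMedianMontoMes <- as.numeric(df$MontoCargadoMes >= group_median)
  df
}

train <- add_flag(train, medians)
test <- add_flag(test, medians)

print(medians)
print(test)
```

Because the lookup uses `match()`, the same training medians are applied to the test set, and the code keeps working if the medians change after a different imputation.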

In [None]:
head(ds_train)

In [None]:
head(ds_test)

In [None]:
# New Feature 2,3,4, ...
### Here

## 3.4) Feature Selection

In [None]:
features_to_model <- colnames(ds_train)
features_to_model <- features_to_model[!features_to_model%in%c(TARGET,ID)]
print(features_to_model)

***Select Final Features:***

In [None]:
# Variable selection.
### One option: fit a tree-based model, compute the variable importances, and keep the most important features.
features_to_model <- features_to_model # e.g. c('var1', 'var2', 'varn')

In [None]:
length(features_to_model)

In [None]:
# Features & Target
X <- ds_train[features_to_model]
y <- ds_train[TARGET]

X_summit <- ds_test[features_to_model]

In [None]:
print("Dim - train: ")
dim(X)
print("Dim - subm: ")
dim(X_summit)

## 3.5) Train & Test Split

In [None]:
set.seed(9)
index_subdata <- sample(2, nrow(X), replace=TRUE, prob=c(0.70, 0.30))
X_train <- X[index_subdata == 1, ]
X_test <- X[index_subdata == 2, ]

y_train <- as.factor(y[index_subdata == 1, ])
y_test <- as.factor(y[index_subdata == 2, ])
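
The split above draws a label (1 or 2) independently for every row, so the resulting sizes are only approximately 70/30. As a sketch of an alternative, sampling row indices without replacement gives an exact split. The toy `X` and `y` below are made up for illustration:

```r
set.seed(9)

# Toy features and target (made-up values)
n <- 10
X <- data.frame(x1 = seq_len(n), x2 = seq_len(n) * 2)
y <- factor(rep(c("0", "1"), length.out = n))

# Draw exactly 70% of the row indices for training
train_idx <- sample(n, size = round(0.70 * n))

X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

print(dim(X_train))
print(dim(X_test))
```

Either approach is reasonable for a baseline. A stratified split (for instance with caret's `createDataPartition`) would additionally preserve the class proportions in both parts.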

In [None]:
print("Dim - train: ")
dim(X_train)
print("Dim - test: ")
dim(X_test)

# 4) Modeling & Evaluation - Simple

## 4.1.A. LogisticRegression

### 4.1.1 Training

In [None]:
# Fit the model:
logistic <- glm(y_train ~ ., data = X_train, family='binomial')

model <- logistic 

In [None]:
summary(model)

### 4.1.2 Model Evaluation

In [None]:
# Generate the predicted probabilities
y_pred_proba_train <- predict(model, X_train, type="response")
y_pred_proba_test <- predict(model, X_test, type="response")

# Generate the class predictions:
y_pred_train <- ifelse(y_pred_proba_train>=0.5, '1', '0')
y_pred_test <- ifelse(y_pred_proba_test>=0.5, '1', '0')
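
For a `glm` fitted with `family = 'binomial'`, `type = "response"` returns probabilities, while the default `type = "link"` returns log-odds. The sketch below illustrates the relationship on R's built-in `mtcars` data, which is used here only as a stand-in for the churn data:

```r
# Logistic regression on a built-in dataset (stand-in for the churn data)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Log-odds (linear predictor) and probabilities for the same rows
link  <- predict(fit, newdata = mtcars, type = "link")
probs <- predict(fit, newdata = mtcars, type = "response")

# The inverse logit of the log-odds reproduces the probabilities
same <- isTRUE(all.equal(plogis(link), probs))

# Class labels from the default 0.5 cutoff
pred_class <- ifelse(probs >= 0.5, "1", "0")

print(same)
print(table(pred_class))
```

Forgetting `type = "response"` is a common source of confusion, since log-odds compared against a 0.5 cutoff give different labels than probabilities do.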

In [None]:
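# On Train
# Fixing the factor levels avoids an error when only one class is predicted;
# positive = '1' reports sensitivity and related metrics for the churn class.
confusionMatrix(factor(y_pred_train, levels = levels(y_train)), y_train, positive = '1')

In [None]:
# On Test
confusionMatrix(factor(y_pred_test, levels = levels(y_test)), y_test, positive = '1')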
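
To show what the confusion matrix reports, the sketch below rebuilds the main figures by hand with base R's `table()` on made-up labels, treating "1" (churn) as the positive class:

```r
# Made-up actual and predicted labels
actual    <- factor(c("1", "0", "1", "1", "0", "0", "1", "0"), levels = c("0", "1"))
predicted <- factor(c("1", "0", "0", "1", "0", "1", "1", "0"), levels = c("0", "1"))

# Rows are predictions, columns are the true labels
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["1", "1"]  # churners correctly flagged
tn <- cm["0", "0"]  # non-churners correctly cleared
fp <- cm["1", "0"]  # non-churners wrongly flagged
fn <- cm["0", "1"]  # churners missed

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Accuracy alone can hide a model that rarely flags churners, so it is worth reading sensitivity and specificity alongside it.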

By default, a cutoff of 0.50 decides whether the final prediction is 1 or 0. Next, we look for the cutoff that optimizes the problem's evaluation metric.

### **Find best threshold:**

In [None]:
list_accuracy_test <- c()
for (threshold in 0:99) {
    pred_0_1 <- if_else(y_pred_proba_test>=threshold/100,1,0)
    accu <- Accuracy(pred_0_1, y_test)
    list_accuracy_test <- c(list_accuracy_test,accu)
}

In [None]:
xs = c(0:99)/100
ys = list_accuracy_test
plot(xs, ys, type="l",col="blue")

In [None]:
best_position = which.max(list_accuracy_test)
best_scoring = list_accuracy_test[best_position]
best_threshold = ((c(0:99)/100)[best_position])
print(paste("The best threshold is:", best_threshold))
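
The threshold search above can also be written without growing a vector inside a loop. The following is a minimal sketch on made-up probabilities and labels. The helper name `accuracy_at` is an assumption, and accuracy is computed by hand rather than with `MLmetrics::Accuracy`:

```r
# Made-up predicted probabilities and true labels
probs  <- c(0.10, 0.35, 0.40, 0.80, 0.65, 0.20, 0.55, 0.90)
actual <- c(0, 0, 1, 1, 1, 0, 0, 1)

# Candidate cutoffs 0.00, 0.01, ..., 0.99, as in the notebook
thresholds <- (0:99) / 100

# Hypothetical helper: share of correct predictions at cutoff t
accuracy_at <- function(t) mean(as.numeric(probs >= t) == actual)

accuracies <- vapply(thresholds, accuracy_at, numeric(1))

# which.max() returns the first cutoff that reaches the maximum
best_threshold <- thresholds[which.max(accuracies)]
best_accuracy  <- max(accuracies)

print(best_threshold)
print(best_accuracy)
```

When several cutoffs tie, `which.max()` keeps the lowest one. Choosing the cutoff on the same data used to report the final accuracy tends to give an optimistic estimate, so a separate validation split is a safer place to tune it.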

In [None]:
accuracy_train = Accuracy(y_train, if_else(y_pred_proba_train>=best_threshold,1,0))
accuracy_test = Accuracy(y_test, if_else(y_pred_proba_test>=best_threshold,1,0))

print(paste("Accuracy - Train:", accuracy_train))
print(paste("Accuracy - Test:", accuracy_test))

## 4.1.B. Decision Tree

### 4.1.1 Training

In [None]:
library('rpart')

In [None]:
# Fit the model:
model_tree <- rpart(y_train ~ ., data = X_train)

model <- model_tree 

In [None]:
summary(model)

### 4.1.2 Model Evaluation

In [None]:
# Generate the predicted probabilities
y_pred_proba_train <- predict(model, X_train, type = 'prob')[,2]
y_pred_proba_test <- predict(model, X_test, type = 'prob')[,2]

# Generate the class predictions:
y_pred_train <- ifelse(y_pred_proba_train>=0.5, '1', '0')
y_pred_test <- ifelse(y_pred_proba_test>=0.5, '1', '0')

In [None]:
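# On Train
# Fixing the factor levels avoids an error when only one class is predicted;
# positive = '1' reports sensitivity and related metrics for the churn class.
confusionMatrix(factor(y_pred_train, levels = levels(y_train)), y_train, positive = '1')

In [None]:
# On Test
confusionMatrix(factor(y_pred_test, levels = levels(y_test)), y_test, positive = '1')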

By default, a cutoff of 0.50 decides whether the final prediction is 1 or 0. Next, we look for the cutoff that optimizes the problem's evaluation metric.

### **Find best threshold:**

In [None]:
list_accuracy_test <- c()
for (threshold in 0:99) {
    pred_0_1 <- if_else(y_pred_proba_test>=threshold/100,1,0)
    accu <- Accuracy(pred_0_1, y_test)
    list_accuracy_test <- c(list_accuracy_test,accu)
}

In [None]:
xs = c(0:99)/100
ys = list_accuracy_test
plot(xs, ys, type="l",col="blue")

In [None]:
best_position = which.max(list_accuracy_test)
best_scoring = list_accuracy_test[best_position]
best_threshold = ((c(0:99)/100)[best_position])
print(paste("The best threshold is:", best_threshold))

In [None]:
accuracy_train = Accuracy(y_train, if_else(y_pred_proba_train>=best_threshold,1,0))
accuracy_test = Accuracy(y_test, if_else(y_pred_proba_test>=best_threshold,1,0))

print(paste("Accuracy - Train:", accuracy_train))
print(paste("Accuracy - Test:", accuracy_test))

### Feature Importances

In [None]:
importancia=as.data.frame(varImp(model))
importancia['feature'] = rownames(importancia)
# Sort features by decreasing importance
importancia <- importancia[order(-importancia$Overall), c('feature','Overall')]
rownames(importancia) <- NULL
importancia

## 4.1.C. Random Forest

### 4.1.1 Training

In [None]:
library('randomForest')

In [None]:
# Fit the model:
set.seed(9)

model_rf <- randomForest(y_train ~ . , data = X_train, importance=T, 
                         nodesize = 10, mtry = 10, ntree=100, maxnodes = 50)
model = model_rf

In [None]:
print(model)  # OOB error estimate and confusion matrix; summary() would only list the object's components

### 4.1.2 Model Evaluation

In [None]:
# Generate the predicted probabilities
y_pred_proba_train <- predict(model, X_train, type="prob")[,2]
y_pred_proba_test <- predict(model, X_test, type="prob")[,2]

# Generate the class predictions:
y_pred_train <- ifelse(y_pred_proba_train>=0.5, '1', '0')
y_pred_test <- ifelse(y_pred_proba_test>=0.5, '1', '0')

In [None]:
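# On Train
# Fixing the factor levels avoids an error when only one class is predicted;
# positive = '1' reports sensitivity and related metrics for the churn class.
confusionMatrix(factor(y_pred_train, levels = levels(y_train)), y_train, positive = '1')

In [None]:
# On Test
confusionMatrix(factor(y_pred_test, levels = levels(y_test)), y_test, positive = '1')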

By default, a cutoff of 0.50 decides whether the final prediction is 1 or 0. Next, we look for the cutoff that optimizes the problem's evaluation metric.

### **Find best threshold:**

In [None]:
list_accuracy_test <- c()
for (threshold in 0:99) {
    pred_0_1 <- if_else(y_pred_proba_test>=threshold/100,1,0)
    accu <- Accuracy(pred_0_1, y_test)
    list_accuracy_test <- c(list_accuracy_test,accu)
}

In [None]:
xs = c(0:99)/100
ys = list_accuracy_test
plot(xs, ys, type="l",col="blue")

In [None]:
best_position = which.max(list_accuracy_test)
best_scoring = list_accuracy_test[best_position]
best_threshold = ((c(0:99)/100)[best_position])
print(paste("The best threshold is:", best_threshold))

In [None]:
accuracy_train = Accuracy(y_train, if_else(y_pred_proba_train>=best_threshold,1,0))
accuracy_test = Accuracy(y_test, if_else(y_pred_proba_test>=best_threshold,1,0))

print(paste("Accuracy - Train:", accuracy_train))
print(paste("Accuracy - Test:", accuracy_test))

### Feature Importances

In [None]:
varImpPlot(model, type = 2)

In [None]:
importancia= data.frame(importance(model))
importancia['feature'] = rownames(importancia)
# Sort features by decreasing Gini importance
importancia <- importancia[order(-importancia$MeanDecreaseGini), c('feature','MeanDecreaseGini')]
rownames(importancia) <- NULL
importancia

**FINAL MODEL**

1. As can be seen, of the three algorithms trained, the Random Forest model is the winner, with its accuracy optimized at the cutoff point (threshold: 0.38).

# Predictions on Submission DS

In [None]:
pred_prob_subm <- predict(model, X_summit, type="prob")[,2]
# The default 0.5 cutoff is applied here; use best_threshold instead to apply the tuned cutoff.
pred_subm <- ifelse(pred_prob_subm>=0.5, '1', '0')

In [None]:
Y_summit_pred = data.frame(df_test[ID])
Y_summit_pred[TARGET] <- pred_subm #pred_prob_subm
head(Y_summit_pred)

Write the submission file:

In [None]:
write.csv(Y_summit_pred, file="krfc_submission_01_baseline_R.csv", row.names = F)