# **Exploration**

## **Libraries**

In [1]:
import pandas as pd
import numpy as np

In [2]:
# None removes the column limit so every column is displayed
pd.options.display.max_columns = None
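If the wider display is only needed for a single cell, the option can be scoped instead of changed globally. The following is a minimal sketch using `pd.option_context`; the frame `df_wide` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical wide frame: one row, thirty columns
df_wide = pd.DataFrame([range(30)], columns=[f'col_{i}' for i in range(30)])

default_max = pd.get_option('display.max_columns')

# Inside the context every column is shown; the previous setting is restored on exit
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')
    print(df_wide)

after = pd.get_option('display.max_columns')
```

This keeps the rest of the notebook on the default truncated view.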

In [3]:
import plotly.express as px
import plotly.graph_objects as go

## **Data**

In [4]:
df_telco = pd.read_csv('../Data/Telco-Customer-Churn.csv')

In [5]:
df_telco.sample()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
3976,5261-QSHQM,Female,0,No,No,4,No,No phone service,DSL,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,24.45,86.6,Yes


In [6]:
# Identify the unique values of each feature
df_eda_uniques = pd.DataFrame()

for column in df_telco.columns:
    df_column_uniques = pd.DataFrame({
        'Columna': [column],
        'Unique_Values': [df_telco[column].unique()]
    })

    df_eda_uniques = pd.concat([df_eda_uniques, df_column_uniques])

In [7]:
df_eda_uniques

Unnamed: 0,Columna,Unique_Values
0,customerID,"[7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOC..."
0,gender,"[Female, Male]"
0,SeniorCitizen,"[0, 1]"
0,Partner,"[Yes, No]"
0,Dependents,"[No, Yes]"
0,tenure,"[1, 34, 2, 45, 8, 22, 10, 28, 62, 13, 16, 58, ..."
0,PhoneService,"[No, Yes]"
0,MultipleLines,"[No phone service, No, Yes]"
0,InternetService,"[DSL, Fiber optic, No]"
0,OnlineSecurity,"[No, Yes, No internet service]"
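A summary like the one above can also be built in a single pass, which avoids concatenating inside the loop and yields a clean 0..n index. This is only a sketch of one alternative: `df_demo` and its values are invented for illustration, and the `N_Unique` column is an addition that the original table does not have.

```python
import pandas as pd

# Hypothetical toy data standing in for df_telco
df_demo = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female'],
    'SeniorCitizen': [0, 1, 0],
    'Contract': ['Month-to-month', 'One year', 'Two year'],
})

# One row per column: its name, the number of distinct values, and the values themselves
df_demo_uniques = pd.DataFrame({
    'Columna': df_demo.columns,
    'N_Unique': [df_demo[column].nunique() for column in df_demo.columns],
    'Unique_Values': [df_demo[column].unique().tolist() for column in df_demo.columns],
})

print(df_demo_uniques)
```

The count makes it easier to tell binary flags from identifiers or continuous fields at a glance.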


### **Feature Cleaning and Exploration**

In [8]:
# Inspect the dtypes and null counts of our dataset
df_telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


The **customerID** field is only an identifier, so it is irrelevant to our study.

In [9]:
df_telco.drop('customerID', axis=1, inplace=True)

The **TotalCharges** field is stored as an *object*; we must convert it to a numeric type before we can work with it.

In [10]:
df_telco['TotalCharges'] = pd.to_numeric(df_telco['TotalCharges'], errors='coerce')

In [11]:
# Inspect the dtypes again
df_telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 


The field now has a numeric type, but the conversion reveals null values: with `errors='coerce'`, any entry that cannot be parsed as a number becomes `NaN`.
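To see where such nulls come from, the raw strings can be compared against the converted values. The sketch below assumes the unparseable entries are blank strings; the notebook does not show the raw values, so `raw_charges` is a made-up sample.

```python
import pandas as pd

# Hypothetical sample: one entry is a blank string, as can happen in a CSV export
raw_charges = pd.Series(['29.85', ' ', '108.15'])

# errors='coerce' replaces anything that cannot be parsed with NaN
numeric_charges = pd.to_numeric(raw_charges, errors='coerce')

# The rows that became NaN are the ones whose raw text was not a valid number
unparseable = raw_charges[numeric_charges.isna()]
n_missing = int(numeric_charges.isna().sum())

print(unparseable.tolist())
print(n_missing)
```

Running the same comparison on the original column before overwriting it would show which raw entries were lost.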

In [12]:
df_telco[df_telco['TotalCharges'].isna()]

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,No,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,Male,0,No,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,Yes,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,Yes,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,Male,0,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,Female,0,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,Male,0,Yes,Yes,0,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,Yes,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


All of these rows correspond to users with **tenure** = 0, so we can assume they are not yet active users in our customer base. We will remove them from the analysis.

In [13]:
df_telco.dropna(inplace=True)
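The assumption that every missing value belongs to a customer with **tenure** = 0 can be checked explicitly before dropping. A minimal sketch on invented data (`df_demo` is hypothetical), which also restricts the drop to the one column of interest:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: missing TotalCharges only where tenure is 0
df_demo = pd.DataFrame({
    'tenure': [0, 12, 0, 5],
    'TotalCharges': [np.nan, 350.0, np.nan, 99.5],
})

missing = df_demo['TotalCharges'].isna()

# Fail loudly if any missing value belongs to a customer with tenure > 0
assert (df_demo.loc[missing, 'tenure'] == 0).all()

# Drop only rows where TotalCharges is missing; NaNs in other columns are left alone
df_clean = df_demo.dropna(subset=['TotalCharges'])
print(len(df_clean))
```

Using `subset` is a design choice: a bare `dropna()` would also discard rows with nulls in any other column, which happens to be harmless here only because no other column has them.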

Next, we tidy up the **PaymentMethod** field by removing the "(automatic)" suffix.

In [14]:
df_telco['PaymentMethod'].unique()

array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
       'Credit card (automatic)'], dtype=object)

In [15]:
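# Match the literal text (not a regex) and include the leading space so no trailing blank remains
df_telco['PaymentMethod'] = df_telco['PaymentMethod'].str.replace(' (automatic)', '', regex=False)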
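Parentheses are regex metacharacters, and the default of the `regex` argument of `Series.str.replace` has changed across pandas versions, so it is safer to state it explicitly. The sketch below, on a made-up `methods` series, shows a literal replacement that also removes the space before the suffix.

```python
import pandas as pd

# Hypothetical sample of payment methods
methods = pd.Series([
    'Electronic check',
    'Bank transfer (automatic)',
    'Credit card (automatic)',
])

# regex=False treats the pattern as plain text; the leading space avoids trailing whitespace
cleaned = methods.str.replace(' (automatic)', '', regex=False)

print(cleaned.tolist())
# ['Electronic check', 'Bank transfer', 'Credit card']
```

Without the leading space in the pattern, the cleaned labels would end in a blank, which later shows up as a distinct category name in plots and group-bys.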

## **EDA**

First, we look at how our target variable is distributed.

In [19]:
churn_resume = df_telco.value_counts('Churn', normalize=True).reset_index()
churn_resume

Unnamed: 0,Churn,proportion
0,No,0.734215
1,Yes,0.265785
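The column naming of a normalized `value_counts` result has differed between pandas versions, so naming it explicitly can make downstream code more robust. A minimal sketch on invented data (the `churn` series and the `percent` column are illustrative):

```python
import pandas as pd

# Hypothetical target column: three customers stayed, one churned
churn = pd.Series(['No', 'No', 'No', 'Yes'], name='Churn')

# normalize=True returns relative frequencies that sum to 1
shares = churn.value_counts(normalize=True)

# Name the value column and the index explicitly before flattening to a frame
summary = shares.rename('proportion').rename_axis('Churn').reset_index()

# Percentage version of the share, rounded for display
summary['percent'] = (summary['proportion'] * 100).round(2)

print(summary)
```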


In [53]:
fig = px.bar(
    churn_resume,
    x='Churn',
    y='proportion',
    color='Churn',
    title='Churn Distribution',
    labels={
        'proportion': 'Proportion'
    },
    # Label each bar with its own value formatted as a percentage;
    # color='Churn' creates one trace per class, so a shared text array would mislabel bars
    text_auto='.2%'
)

fig.update_traces(
    textposition='outside'
)

fig.show()