📚 **Import libraries**

In [None]:
# base libraries for data science
import sys
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# display floats with only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)

💾 **Load data**

In [3]:
BASE_DIR = Path("/home/lof/Projects/Telco-Customer-Churn")
DATA_DIR = BASE_DIR / "data" / "interim"
churn_df = pd.read_parquet(
    DATA_DIR / "churn_type_fixed.parquet", engine="pyarrow")

📊 **DataFrame description**

In [None]:
# general information
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24742 entries, 0 to 24741
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   MonthlyCharges    24527 non-null  float64 
 1   StreamingMovies   24415 non-null  category
 2   Partner           24742 non-null  bool    
 3   PhoneService      24742 non-null  bool    
 4   InternetService   24445 non-null  category
 5   StreamingTV       24375 non-null  category
 6   OnlineSecurity    24393 non-null  category
 7   MultipleLines     24493 non-null  category
 8   Dependents        24742 non-null  bool    
 9   DeviceProtection  24383 non-null  category
 10  SeniorCitizen     24742 non-null  bool    
 11  TotalCharges      24560 non-null  float64 
 12  TechSupport       24379 non-null  category
 13  gender            24684 non-null  category
 14  PaperlessBilling  24742 non-null  bool    
 15  tenure            24562 non-null  float64 
 16  Churn             2474

In [None]:
# dataframe shape
churn_df.shape

(24742, 20)

In [6]:
# a few sample records
churn_df.sample(5)

Unnamed: 0,MonthlyCharges,StreamingMovies,Partner,PhoneService,InternetService,StreamingTV,OnlineSecurity,MultipleLines,Dependents,DeviceProtection,SeniorCitizen,TotalCharges,TechSupport,gender,PaperlessBilling,tenure,Churn,OnlineBackup,PaymentMethod,Contract
17600,,,True,True,,,,Yes,True,,False,,,Male,True,71.0,True,,,
2719,78.1,Yes,True,True,Fiber optic,No,No,No,True,No,False,1122.4,No,Female,True,14.0,True,No,Electronic check,Month-to-month
4094,69.15,No,True,True,Fiber optic,No,No,No,True,No,False,69.15,No,Male,True,1.0,True,No,Electronic check,Month-to-month
12046,84.2,Yes,True,True,DSL,Yes,Yes,Yes,True,Yes,False,5986.55,Yes,Female,True,72.0,True,No,Bank transfer (automatic),Two year
5486,92.4,Yes,True,True,DSL,Yes,Yes,Yes,True,Yes,False,6786.1,Yes,Female,True,72.0,True,Yes,Electronic check,Two year


In [12]:
# count and percentage of nulls per column, shown next to each dtype
null_counts = churn_df.isnull().sum()
nulls_perc = (null_counts / len(churn_df) * 100).round(2)

pd.DataFrame({
    "Null count": null_counts,
    "Null %": nulls_perc,
    "Dtype": churn_df.dtypes,
})

Unnamed: 0,Null count,Null %,Dtype
MonthlyCharges,215,0.87,float64
StreamingMovies,327,1.32,category
Partner,0,0.0,bool
PhoneService,0,0.0,bool
InternetService,297,1.2,category
StreamingTV,367,1.48,category
OnlineSecurity,349,1.41,category
MultipleLines,249,1.01,category
Dependents,0,0.0,bool
DeviceProtection,359,1.45,category


No column has a share of null values large enough to justify dropping it: every column listed above is under 1.5% null.
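
One way to make that judgment explicit is to compare each column's null share against a cutoff. The sketch below is only an illustration, not part of the original analysis: the `columns_over_null_threshold` helper, the 30% cutoff, and the toy frame standing in for `churn_df` are all assumptions.

```python
import numpy as np
import pandas as pd


def columns_over_null_threshold(df: pd.DataFrame, threshold_pct: float = 30.0) -> list:
    """Return the columns whose share of nulls exceeds threshold_pct (hypothetical helper)."""
    # isnull().mean() gives the fraction of nulls per column
    null_pct = df.isnull().mean() * 100
    return null_pct[null_pct > threshold_pct].index.tolist()


# toy data standing in for churn_df (assumption, not the real dataset)
toy_df = pd.DataFrame({
    "MonthlyCharges": [29.85, np.nan, 53.85, 42.30],  # 25% null
    "tenure": [1.0, 34.0, 2.0, 45.0],                 # 0% null
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],    # 75% null
})

drop_candidates = columns_over_null_threshold(toy_df, threshold_pct=30.0)
print(drop_candidates)
```

With a cutoff in that range, none of the null percentages reported above would qualify a column for removal.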

Target variable: **'Churn'**

📈 **Univariate analysis**

In [9]:
# description of numeric columns
churn_df.describe()

Unnamed: 0,MonthlyCharges,TotalCharges,tenure
count,24527.0,24560.0,24562.0
mean,427764379.66,442082218.18,32.33
std,47347084831.8,49122084419.77,24.49
min,18.25,-9876543456.0,0.0
25%,35.35,399.9,9.0
50%,70.35,1406.33,29.0
75%,89.8,3801.7,55.0
max,5243355243554.0,5443567897654.0,72.0
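
The summary above contains values that are hard to read as real charges: a minimum `TotalCharges` of -9876543456.0 and maxima in the trillions, which push the mean and standard deviation far beyond the quartiles. A minimal sketch of one common way to flag such values, the 1.5 × IQR rule, follows. The `iqr_outlier_mask` helper and the toy series are assumptions for illustration; the notebook itself does not state how these values are handled.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # comparisons against NaN are False, so nulls are never flagged
    return (series < lower) | (series > upper)


# toy values standing in for churn_df["MonthlyCharges"] (assumption)
toy_charges = pd.Series(
    [18.25, 35.35, 70.35, 89.80, 5243355243554.0, None],
    name="MonthlyCharges",
)

outlier_mask = iqr_outlier_mask(toy_charges)
print(toy_charges[outlier_mask])
```

Because quartiles are robust to a handful of extreme entries, the fences stay close to the bulk of the data even when the mean does not.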


In [10]:
# description of categorical columns
churn_df.describe(include="category")

Unnamed: 0,StreamingMovies,InternetService,StreamingTV,OnlineSecurity,MultipleLines,DeviceProtection,TechSupport,gender,OnlineBackup,PaymentMethod,Contract
count,24415,24445,24375,24393,24493,24383,24379,24684,24374,24502,24504
unique,4,3,4,3,4,4,3,2,3,4,3
top,No,Fiber optic,No,No,No,No,No,Male,No,Electronic check,Month-to-month
freq,9669,10693,9704,12021,11856,10703,11917,12610,10714,8263,13505
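
To see which labels sit behind the `unique` counts above, the per-column frequencies can be listed with nulls included. This is a sketch on toy data: the frame below and its labels are assumptions, and only `value_counts` and `select_dtypes` are standard pandas calls.

```python
import pandas as pd

# toy data standing in for the categorical columns of churn_df (assumption)
toy_df = pd.DataFrame({
    "Contract": pd.Categorical(["Month-to-month", "Two year", "Month-to-month", None]),
    "gender": pd.Categorical(["Male", "Female", "Male", "Male"]),
})

# frequency of every label per categorical column, counting nulls as well
category_counts = {
    column: toy_df[column].value_counts(dropna=False)
    for column in toy_df.select_dtypes(include="category").columns
}

for column, counts in category_counts.items():
    print(counts, end="\n\n")
```

Running the same loop over `churn_df` would show whether a column with an unexpected number of categories carries a misspelled or inconsistent label.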
