# 🔧 2. Transformation (T - Transform)

### Getting to know the dataset

Before applying any transformation, it is essential to understand the structure of the dataset.  
This step lets us identify which variables are relevant to the customer churn analysis (`Churn`) and how they are organized.


✅ We will explore the columns and their data types  
✅ We will consult the data dictionary  
✅ We will identify the key variables for the analysis

---


## 1. **Getting to know the dataset**

## 1.1 Basic exploration of the DataFrame




In [32]:
# Import libraries
import pandas as pd
import numpy as np


# Load the raw JSON file generated in 01_data_extraction.ipynb
json_path = "/content/drive/MyDrive/challenge-TelecomX-ETL/data/raw/telecomx_raw.json"
df = pd.read_json(json_path)

# Preview
df.head()


Unnamed: 0,customerID,Churn,customer,phone,internet,account
0,0002-ORFBO,No,"{'gender': 'Female', 'SeniorCitizen': 0, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'One year', 'PaperlessBilling': '..."
1,0003-MKNFE,No,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'Yes'}","{'InternetService': 'DSL', 'OnlineSecurity': '...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
2,0004-TLHLJ,Yes,"{'gender': 'Male', 'SeniorCitizen': 0, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
3,0011-IGKFF,Yes,"{'gender': 'Male', 'SeniorCitizen': 1, 'Partne...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."
4,0013-EXCHZ,Yes,"{'gender': 'Female', 'SeniorCitizen': 1, 'Part...","{'PhoneService': 'Yes', 'MultipleLines': 'No'}","{'InternetService': 'Fiber optic', 'OnlineSecu...","{'Contract': 'Month-to-month', 'PaperlessBilli..."


### 1.1.1 Nested columns

The `customer`, `phone`, `internet` and `account` columns contain dictionaries with multiple attributes.

In [10]:
# Check that the nested columns hold dict objects
print(type(df.loc[0, 'account']))  # Expected: <class 'dict'>

<class 'dict'>


## 1.2 Consult the data dictionary and understand the variables
This will help us:
- Understand the meaning of each column.
- Identify which ones are most relevant to the churn analysis (`Churn`).


### 1.2.1 Expand each nested column into its own DataFrame

In [13]:
# Expand the nested columns
customer_df = pd.json_normalize(df['customer'])
phone_df = pd.json_normalize(df['phone'])
internet_df = pd.json_normalize(df['internet'])
account_df = pd.json_normalize(df['account'])

### 1.2.2 Combine everything into a single flat DataFrame

In [18]:
# Combine the expanded columns with the top-level columns
df_flat = pd.concat([
    df[['customerID', 'Churn']],  # top-level columns
    customer_df,
    phone_df,
    internet_df,
    account_df
], axis=1)

# Preview
df_flat.head()


Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.9,280.85
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,98.0,1237.85
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,83.9,267.4
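
The expand-and-combine steps above can be sketched on a tiny, made-up sample. The two records below are hypothetical and only imitate the nested layout; the assumption that `account` holds a nested `Charges` dictionary is inferred from the dotted column names (`Charges.Monthly`, `Charges.Total`) in the flattened output.

```python
import pandas as pd

# Hypothetical toy records mimicking the nested layout of the raw file
records = [
    {"customerID": "A-1", "Churn": "No",
     "customer": {"gender": "Female", "tenure": 9},
     "account": {"Contract": "One year",
                 "Charges": {"Monthly": 65.6, "Total": "593.3"}}},
    {"customerID": "B-2", "Churn": "Yes",
     "customer": {"gender": "Male", "tenure": 4},
     "account": {"Contract": "Month-to-month",
                 "Charges": {"Monthly": 73.9, "Total": "280.85"}}},
]
toy = pd.DataFrame(records)

# json_normalize expands each dict into columns; nested keys are joined with "."
parts = [pd.json_normalize(toy[col].tolist()) for col in ["customer", "account"]]

# Every part has a fresh RangeIndex, so the axis=1 concat aligns row by row
toy_flat = pd.concat([toy[["customerID", "Churn"]]] + parts, axis=1)
print(toy_flat.columns.tolist())
```

Because `pd.concat(..., axis=1)` aligns on the index, this pattern relies on the source frame also having a default `RangeIndex`. If rows had been filtered or reordered beforehand, resetting the index first would be the safer choice.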


### 1.2.3 Check the data types and structure of the new DataFrame

In [19]:
# General information about the flat DataFrame
df_flat.info()

# Descriptive statistics
df_flat.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7267 entries, 0 to 7266
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7267 non-null   object 
 1   Churn             7267 non-null   object 
 2   gender            7267 non-null   object 
 3   SeniorCitizen     7267 non-null   int64  
 4   Partner           7267 non-null   object 
 5   Dependents        7267 non-null   object 
 6   tenure            7267 non-null   int64  
 7   PhoneService      7267 non-null   object 
 8   MultipleLines     7267 non-null   object 
 9   InternetService   7267 non-null   object 
 10  OnlineSecurity    7267 non-null   object 
 11  OnlineBackup      7267 non-null   object 
 12  DeviceProtection  7267 non-null   object 
 13  TechSupport       7267 non-null   object 
 14  StreamingTV       7267 non-null   object 
 15  StreamingMovies   7267 non-null   object 
 16  Contract          7267 non-null   object 


Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
count,7267,7267,7267,7267.0,7267,7267,7267.0,7267,7267,7267,...,7267,7267,7267,7267,7267,7267,7267,7267,7267.0,7267.0
unique,7267,3,2,,2,2,,2,3,3,...,3,3,3,3,3,3,2,4,,6531.0
top,9995-HOTOH,No,Male,,No,No,,Yes,No,Fiber optic,...,No,No,No,No,No,Month-to-month,Yes,Electronic check,,20.2
freq,1,5174,3675,,3749,5086,,6560,3495,3198,...,3182,3195,3582,2896,2870,4005,4311,2445,,11.0
mean,,,,0.162653,,,32.346498,,,,...,,,,,,,,,64.720098,
std,,,,0.369074,,,24.571773,,,,...,,,,,,,,,30.129572,
min,,,,0.0,,,0.0,,,,...,,,,,,,,,18.25,
25%,,,,0.0,,,9.0,,,,...,,,,,,,,,35.425,
50%,,,,0.0,,,29.0,,,,...,,,,,,,,,70.3,
75%,,,,0.0,,,55.0,,,,...,,,,,,,,,89.875,


In [20]:
telecom_df = df_flat.copy()
telecom_df

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.60,593.3
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.90,542.4
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.90,280.85
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,Yes,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,98.00,1237.85
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,No,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,83.90,267.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7262,9987-LUTYD,No,Female,0,No,No,13,Yes,No,DSL,...,No,No,Yes,No,No,One year,No,Mailed check,55.15,742.9
7263,9992-RRAMN,Yes,Male,0,Yes,No,22,Yes,Yes,Fiber optic,...,No,No,No,No,Yes,Month-to-month,Yes,Electronic check,85.10,1873.7
7264,9992-UJOEL,No,Male,0,No,No,2,Yes,No,DSL,...,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,50.30,92.75
7265,9993-LHIEB,No,Male,0,Yes,Yes,67,Yes,No,DSL,...,No,Yes,Yes,No,Yes,Two year,No,Mailed check,67.85,4627.65


## 1.3 Identify the columns relevant to the churn analysis

In [21]:
# Distribution of the target variable
telecom_df['Churn'].value_counts(normalize=True)

# This shows the balance between customers
# who left (Yes) and those who stayed (No).

Unnamed: 0_level_0,proportion
Churn,Unnamed: 1_level_1
No,0.711986
Yes,0.25719
,0.030824


In [24]:
display(telecom_df.head(3))
telecom_df.info()
telecom_df.dtypes


Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,Yes,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.9,280.85


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7267 entries, 0 to 7266
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7267 non-null   object 
 1   Churn             7267 non-null   object 
 2   gender            7267 non-null   object 
 3   SeniorCitizen     7267 non-null   int64  
 4   Partner           7267 non-null   object 
 5   Dependents        7267 non-null   object 
 6   tenure            7267 non-null   int64  
 7   PhoneService      7267 non-null   object 
 8   MultipleLines     7267 non-null   object 
 9   InternetService   7267 non-null   object 
 10  OnlineSecurity    7267 non-null   object 
 11  OnlineBackup      7267 non-null   object 
 12  DeviceProtection  7267 non-null   object 
 13  TechSupport       7267 non-null   object 
 14  StreamingTV       7267 non-null   object 
 15  StreamingMovies   7267 non-null   object 
 16  Contract          7267 non-null   object 


Unnamed: 0,0
customerID,object
Churn,object
gender,object
SeniorCitizen,int64
Partner,object
Dependents,object
tenure,int64
PhoneService,object
MultipleLines,object
InternetService,object


Distribution of the target variable (`Churn`)

> In marketing and business, `Churn` (or churn rate) refers to the loss of customers or users over a specific period.


Before the analysis, it is important to understand the balance between customers who left and those who stayed.


In [25]:
telecom_df['Churn'].value_counts(dropna=False)  # counts (not displayed: a cell shows only its last expression)
telecom_df['Churn'].value_counts(normalize=True)  # proportion


Unnamed: 0_level_0,proportion
Churn,Unnamed: 1_level_1
No,0.711986
Yes,0.25719
,0.030824
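
The unnamed third row in the output above is the empty-string label, which `value_counts` treats as a category of its own. A minimal sketch with made-up labels shows how the shares change once those blanks are marked as missing. The series below is hypothetical, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: two of the ten entries are empty strings, not NaN
churn = pd.Series(["No"] * 6 + ["Yes"] * 2 + [""] * 2, name="Churn")

# Empty strings count as a category of their own
raw_share = churn.value_counts(normalize=True)

# After marking them as missing, value_counts drops them by default,
# so the shares are computed over labelled customers only
labelled_share = churn.replace("", np.nan).value_counts(normalize=True)

print(raw_share)
print(labelled_share)
```

Applied to the figures above, restricting to labelled customers would move the `Yes` share from about 25.7% to 0.25719 / (0.711986 + 0.25719) ≈ 26.5%.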


### 1.3.1 Preliminary identification of relevant columns
Criteria:

- Demographic variables (`gender`, `SeniorCitizen`, `Partner`, `Dependents`)
- Service variables (`InternetService`, `OnlineSecurity`, `TechSupport`, etc.)
- Contractual variables (`Contract`, `PaymentMethod`, `PaperlessBilling`)
- Financial variables (`Charges.Monthly`, `Charges.Total`)
- Tenure variable (`tenure`)


In [26]:
# List of candidate explanatory columns for Churn
candidate_cols = [
    'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
    'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
    'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
    'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
    'Charges.Monthly', 'Charges.Total'
]
print(f"Candidate columns ({len(candidate_cols)}): {candidate_cols}")

Candidate columns (19): ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Charges.Monthly', 'Charges.Total']


### 1.3.2 Look for initial signals of relevance
For numeric columns, we look at the correlation with binary `Churn`.

For categorical columns, we compare churn rates across categories.


In [27]:
# Convert Charges.Total to numeric so it can be evaluated
telecom_df['Charges.Total'] = pd.to_numeric(telecom_df['Charges.Total'],
                                            errors='coerce')

# Create a binary version of Churn (empty labels become NaN)
churn_bin = telecom_df['Churn'].map({'Yes': 1, 'No': 0})

# Correlation of the numeric columns with binary Churn
num_cols = telecom_df.select_dtypes(include=['int64', 'float64']).columns
telecom_df[num_cols].corrwith(churn_bin).sort_values(ascending=False)

Unnamed: 0,0
Charges.Monthly,0.193356
SeniorCitizen,0.150889
Charges.Total,-0.199484
tenure,-0.352229
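
Pearson correlation against a 0/1 variable is the point-biserial correlation, so the sign shows whether churners tend to have higher or lower values of each feature. A minimal sketch on hypothetical data (the six rows below are invented for illustration) makes the mechanics visible:

```python
import pandas as pd

# Hypothetical toy data: short-tenure, high-charge customers churn more often
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 24, 36, 60],
    "Charges.Monthly": [80.0, 75.0, 90.0, 40.0, 55.0, 30.0],
    "Churn": ["Yes", "Yes", "No", "No", "No", None],
})

# Unmapped or missing labels become NaN and are left out of the correlation
churn_bin = toy["Churn"].map({"Yes": 1, "No": 0})

corr = toy[["tenure", "Charges.Monthly"]].corrwith(churn_bin)
print(corr.sort_values())
```

In this toy sample the correlation is negative for `tenure` and positive for `Charges.Monthly`, the same signs as in the output above. Correlation only captures linear association, so it is a first signal rather than evidence of a causal effect.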


In [28]:
# Churn rate by category for selected columns
for col in ['Contract', 'PaymentMethod', 'InternetService', 'OnlineSecurity',
            'TechSupport']:
    print(f"\n{col}:\n", telecom_df.groupby(col)['Churn'].value_counts(normalize=True).unstack())


Contract:
 Churn                           No       Yes
Contract                                    
Month-to-month  0.032459  0.554307  0.413233
One year        0.030283  0.860434  0.109282
Two year        0.027539  0.944923  0.027539

PaymentMethod:
 Churn                                      No       Yes
PaymentMethod                                          
Bank transfer (automatic)  0.028320  0.809314  0.162366
Credit card (automatic)    0.029337  0.822704  0.147959
Electronic check           0.032720  0.529243  0.438037
Mailed check               0.031832  0.783183  0.184985

InternetService:
 Churn                            No       Yes
InternetService                              
DSL              0.026929  0.788585  0.184486
Fiber optic      0.031895  0.562539  0.405566
No               0.034788  0.893738  0.071474

OnlineSecurity:
 Churn                                No       Yes
OnlineSecurity                                   
No                   0.030488  0.564579  0.
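
In the tables above, the first unlabeled numeric column corresponds to customers whose `Churn` value is an empty string. One possible alternative, sketched here on hypothetical data, is to mark those blanks as missing and use `pd.crosstab` with `normalize="index"`, which yields one row per category with shares that sum to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one unlabelled customer (empty Churn)
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "", "No", "No", "No", "Yes"],
})

# Treat empty labels as missing so they do not form an unnamed column
labels = toy["Churn"].replace("", np.nan)

# normalize="index" makes each row sum to 1: the churn rate per category
rates = pd.crosstab(toy["Contract"], labels, normalize="index")
print(rates)
```

With this layout, the `Yes` column can be read directly as the churn rate of each category among labelled customers.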

## 2. Checking for inconsistencies in the data

In this section we look for problems that could affect the analysis:
- **Missing values** (`NaN` or `null`)
- **Duplicate rows** or repeated `customerID` values

These are the first checks before evaluating formats and categories.


In [29]:
# Missing values per column
null_counts = telecom_df.isna().sum().sort_values(ascending=False)
print("Missing values per column:\n", null_counts)

# Fully duplicated rows
dup_rows = telecom_df.duplicated().sum()
print(f"\nTotal duplicated rows: {dup_rows}")

# Duplicates by 'customerID' (if the column exists)
if 'customerID' in telecom_df.columns:
    dup_ids = telecom_df['customerID'].duplicated().sum()
    print(f"Duplicated IDs: {dup_ids}")

Missing values per column:
 Charges.Total       11
Churn                0
gender               0
SeniorCitizen        0
customerID           0
Partner              0
Dependents           0
PhoneService         0
tenure               0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
MultipleLines        0
DeviceProtection     0
TechSupport          0
StreamingMovies      0
StreamingTV          0
Contract             0
PaperlessBilling     0
PaymentMethod        0
Charges.Monthly      0
dtype: int64

Total duplicated rows: 0
Duplicated IDs: 0
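
Note that `Churn` reports 0 missing values here even though about 3% of its entries are empty strings, because `isna()` only flags `NaN`/`None`. A complementary check for blank strings could look like the sketch below. The toy frame is hypothetical and the approach is one option among several.

```python
import pandas as pd

# Hypothetical frame: "Churn" has blanks that are strings, not NaN
toy = pd.DataFrame({
    "Churn": ["Yes", "", "No", "  "],
    "Contract": ["One year", "Two year", "One year", "Two year"],
})

# isna() does not flag empty or whitespace-only strings
nan_counts = toy.isna().sum()

# Count entries per column that are empty once whitespace is stripped
blank_counts = toy.apply(lambda s: s.astype(str).str.strip().eq("").sum())

print(nan_counts)
print(blank_counts)
```

Running both counts side by side makes "hidden" missing values visible before any imputation decision is taken.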


### 2.2 Format errors and inconsistencies in categories

Next we check:

- Whether the numeric columns contain non-numeric values.
- The categories of object-type variables, to detect inconsistencies (spaces, capitalization, variants).

In [30]:
# 1) Check the numeric columns for possible non-numeric values
num_cols = ['Charges.Monthly', 'Charges.Total']
for col in num_cols:
    # Show the unique values that do not look like plain non-negative decimals
    non_numeric = telecom_df[~telecom_df[col].apply(lambda x: str(x).replace('.', '', 1).isdigit())][col].unique()
    print(f"{col} - Non-numeric values:", non_numeric)

# 2) Review the unique values of the categorical columns
cat_cols = telecom_df.select_dtypes(include='object').columns.drop('customerID')
for col in cat_cols:
    uniques = telecom_df[col].unique()
    print(f"\n{col} ({len(uniques)} unique values): {uniques}")

Charges.Monthly - Non-numeric values: []
Charges.Total - Non-numeric values: [nan]

Churn (3 unique values): ['No' 'Yes' '']

gender (2 unique values): ['Female' 'Male']

Partner (2 unique values): ['Yes' 'No']

Dependents (2 unique values): ['Yes' 'No']

PhoneService (2 unique values): ['Yes' 'No']

MultipleLines (3 unique values): ['No' 'Yes' 'No phone service']

InternetService (3 unique values): ['DSL' 'Fiber optic' 'No']

OnlineSecurity (3 unique values): ['No' 'Yes' 'No internet service']

OnlineBackup (3 unique values): ['Yes' 'No' 'No internet service']

DeviceProtection (3 unique values): ['No' 'Yes' 'No internet service']

TechSupport (3 unique values): ['Yes' 'No' 'No internet service']

StreamingTV (3 unique values): ['Yes' 'No' 'No internet service']

StreamingMovies (3 unique values): ['No' 'Yes' 'No internet service']

Contract (3 unique values): ['One year' 'Month-to-month' 'Two year']

PaperlessBilling (2 unique values): ['Yes' 'No']

#### Key findings
- **Numeric:**
  - `Charges.Monthly` is clean (no strings).
  - `Charges.Total` has 11 nulls (`NaN`).
- **Categorical columns with “extra” or empty values:**
  - `Churn` has an empty value `''` in addition to `Yes` and `No`.
- **Categories with “no service” levels:**
  - Example: `MultipleLines` with `"No phone service"`.
  - Example: internet add-ons such as `OnlineSecurity`, `OnlineBackup`, etc., with `"No internet service"`.
- **Clean categories that may still need standardizing:**
  - Capitalization and possible extra spaces in `PaymentMethod`, `Contract`, etc.
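
The `isdigit`-based check used above treats valid numbers such as negatives or scientific notation as non-numeric. A possible alternative, sketched here on hypothetical raw strings, is to let `pd.to_numeric(errors="coerce")` do the parsing and then list the entries that were present but failed to parse.

```python
import pandas as pd

# Hypothetical raw values as they might arrive from a text source
raw = pd.Series(["593.3", " ", "1e3", "-5.0", "abc"], name="Charges.Total")

# errors="coerce" turns anything unparseable into NaN
parsed = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the source but failed to parse
bad_values = raw[parsed.isna() & raw.notna()].tolist()
print(bad_values)
```

Here only the blank string and `"abc"` are reported, while `"1e3"` and `"-5.0"` are accepted as numbers.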


## 3. Handling inconsistencies

We apply transformations so that the data are consistent and ready for analysis:

- Replace empty values in `Churn` with `NaN`
- Handle nulls in `Charges.Total`
- Normalize the “no service” categories
- Standardize text by stripping extra whitespace

### 3.1 Clean empty values in Churn


In [35]:
telecom_df['Churn'] = telecom_df['Churn'].replace('', np.nan)
telecom_df['Churn'].head(15)

Unnamed: 0,Churn
0,No
1,No
2,Yes
3,Yes
4,Yes
5,No
6,No
7,No
8,No
9,No


### 3.2 Handle nulls in Charges.Total: convert to numeric and optionally impute
- We use `pd.to_numeric()`.
- It converts the values of the `Charges.Total` column of `telecom_df` to `float`.

In [39]:
# Convert to numeric; values that cannot be parsed become NaN
telecom_df['Charges.Total'] = pd.to_numeric(telecom_df['Charges.Total'], errors='coerce')

# Impute 0 where tenure == 0
mask_tenure0 = telecom_df['tenure'] == 0
telecom_df.loc[mask_tenure0, 'Charges.Total'] = 0

# Impute the median in the remaining cases
median_total = telecom_df['Charges.Total'].median(skipna=True)
telecom_df['Charges.Total'] = telecom_df['Charges.Total'].fillna(median_total)


In [41]:
telecom_df['Charges.Total'].sample(15)

Unnamed: 0,Charges.Total
1685,5817.7
7,5377.8
2361,76.4
4380,2088.8
6809,950.2
3895,2929.75
7241,7517.7
5807,4039.5
3161,2135.5
4838,92.75


In [59]:
telecom_df.head()

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,Dependents_bin,PhoneService_bin,MultipleLines_bin,OnlineSecurity_bin,OnlineBackup_bin,DeviceProtection_bin,TechSupport_bin,StreamingTV_bin,StreamingMovies_bin,PaperlessBilling_bin
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,1,1,0,0,1,0,1,1,0,1
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,0,1,1,0,0,0,0,0,1,0
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,0,1,0,0,0,1,0,0,0,1
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,0,1,0,0,1,1,0,1,1,1
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,0,1,0,0,0,0,1,1,0,1


### 3.3 Homogenize columns

Map values such as `"No phone service"` and `"No internet service"` to `"No"`.


In [43]:
# Columns that carry a "no service" variant
service_no_variants = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup',
                       'DeviceProtection', 'TechSupport', 'StreamingTV',
                       'StreamingMovies']

# Replace the variants with 'No' in each column
for col in service_no_variants:
    telecom_df[col] = telecom_df[col].replace({
        'No phone service': 'No',
        'No internet service': 'No'
    })

In [49]:
telecom_df.sample(5)

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,Charges.Monthly,Charges.Total
3720,5131-PONJI,No,Male,0,Yes,Yes,49,Yes,No,Fiber optic,...,Yes,Yes,No,Yes,No,Month-to-month,Yes,Credit card (automatic),90.4,4494.65
4586,6285-FTQBF,No,Male,0,Yes,Yes,72,Yes,Yes,No,...,No,No,No,No,No,Two year,No,Credit card (automatic),25.55,1867.7
3305,4610-WUVVT,,Male,1,Yes,Yes,46,Yes,Yes,Fiber optic,...,Yes,Yes,No,Yes,No,Month-to-month,Yes,Electronic check,100.7,4541.2
5224,7130-CTCUS,No,Male,1,Yes,No,16,Yes,No,DSL,...,Yes,No,No,No,No,Month-to-month,Yes,Bank transfer (automatic),54.55,825.1
4723,6475-VHUIZ,No,Female,0,Yes,No,23,Yes,No,DSL,...,No,Yes,Yes,No,No,Month-to-month,No,Electronic check,54.25,1221.55


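### 3.4 Standardize text

Strip leading and trailing whitespace from the text columns. Title case is not applied, so labels such as `Month-to-month` keep their original capitalization.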


In [50]:
# Select the object-type (categorical) columns
cat_cols_all = telecom_df.select_dtypes(include='object').columns

# Strip leading and trailing whitespace from every text column
for col in cat_cols_all:
    telecom_df[col] = telecom_df[col].str.strip()
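
A short sketch, using hypothetical values with stray whitespace, illustrates why stripping alone is a conservative choice: `str.title()` would also change the labels themselves.

```python
import pandas as pd

# Hypothetical values with stray whitespace
contract = pd.Series([" Month-to-month", "One year ", "Two year"])

stripped = contract.str.strip()
titled = stripped.str.title()

print(stripped.tolist())  # whitespace removed, labels otherwise unchanged
print(titled.tolist())    # title case also capitalizes the letters after hyphens
```

Since later steps match on exact labels such as `'Yes'`, `'No'` and `'Month-to-month'`, keeping the original capitalization avoids having to update those lookups.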

### 3.5 Verify changes

In [51]:
for col in ['Churn'] + service_no_variants:
    print(f"{col}: {telecom_df[col].unique()}")

Churn: ['No' 'Yes' nan]
MultipleLines: ['No' 'Yes']
OnlineSecurity: ['No' 'Yes']
OnlineBackup: ['Yes' 'No']
DeviceProtection: ['No' 'Yes']
TechSupport: ['Yes' 'No']
StreamingTV: ['Yes' 'No']
StreamingMovies: ['No' 'Yes']


## 4. The Cuentas_Diarias column


We use the monthly charge (`Charges.Monthly`) to compute an approximate daily value, assuming a 30-day month:

$$\mathrm{Cuentas\_Diarias} = \frac{\mathrm{Charges.Monthly}}{30}$$

This gives a finer-grained view of each customer's spending.


In [52]:
# Safe conversion of the monthly column
telecom_df['Charges.Monthly'] = pd.to_numeric(telecom_df['Charges.Monthly'], errors='coerce')

# Create the new column
telecom_df['Cuentas_Diarias'] = telecom_df['Charges.Monthly'] / 30

# Preview
telecom_df[['customerID', 'Charges.Monthly', 'Cuentas_Diarias']].head()


Unnamed: 0,customerID,Charges.Monthly,Cuentas_Diarias
0,0002-ORFBO,65.6,2.186667
1,0003-MKNFE,59.9,1.996667
2,0004-TLHLJ,73.9,2.463333
3,0011-IGKFF,98.0,3.266667
4,0013-EXCHZ,83.9,2.796667
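
As a quick arithmetic check, the first three monthly charges from the preview can be divided by hand. The constant name `DAYS_PER_MONTH` is introduced here only for illustration.

```python
import pandas as pd

DAYS_PER_MONTH = 30  # fixed-length month assumed by the formula

# Monthly charges of the first three customers in the preview
monthly = pd.Series([65.6, 59.9, 73.9], name="Charges.Monthly")
daily = (monthly / DAYS_PER_MONTH).round(6)

print(daily.tolist())
```

The rounded results (2.186667, 1.996667 and 2.463333) match the `Cuentas_Diarias` values displayed above.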


## 5. Standardizing and transforming the data

In this phase we prepare the variables for analysis and modeling:

- **Binary columns** (`Yes`/`No`) → `1` / `0`
- **Rename columns** with clear, consistent names
- (Optional) Translate to Spanish or use friendlier names


### 5.1 Convert `Yes`/`No` to `1`/`0` in binary columns


In [60]:
# Columns whose values are "Yes" or "No"
binary_cols = [
    'Churn','Partner','Dependents','PhoneService',
    'MultipleLines','OnlineSecurity','OnlineBackup',
    'DeviceProtection','TechSupport','StreamingTV',
    'StreamingMovies','PaperlessBilling'
]

# For each listed column that exists in telecom_df, create a companion
# column with the suffix _bin that maps Yes -> 1 and No -> 0.
# Any other value (for example a missing Churn) becomes NaN.
for col in [c for c in binary_cols if c in telecom_df.columns]:
    telecom_df[col + "_bin"] = telecom_df[col].map({'Yes': 1, 'No': 0})

telecom_df.head()

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,Dependents_bin,PhoneService_bin,MultipleLines_bin,OnlineSecurity_bin,OnlineBackup_bin,DeviceProtection_bin,TechSupport_bin,StreamingTV_bin,StreamingMovies_bin,PaperlessBilling_bin
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,1,1,0,0,1,0,1,1,0,1
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,0,1,1,0,0,0,0,0,1,0
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,0,1,0,0,0,1,0,0,0,1
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,0,1,0,0,1,1,0,1,1,1
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,0,1,0,0,0,0,1,1,0,1
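
One side effect of `map` is worth seeing in isolation: a column with missing labels ends up as `float64`, which is why `Churn_bin` is displayed as `0.0`/`1.0` while the other `_bin` columns show integers. The sketch below uses a hypothetical series and shows pandas' nullable `Int64` dtype as one possible way to keep integer flags; the notebook itself does not do this.

```python
import pandas as pd

# Hypothetical column with one missing label
churn = pd.Series(["No", "Yes", None, "No"], name="Churn")

# map() leaves anything outside the dictionary as NaN, which forces float64
as_float = churn.map({"Yes": 1, "No": 0})

# The nullable integer dtype keeps 0/1 as integers and marks the gap as <NA>
as_int = as_float.astype("Int64")

print(as_float.dtype, as_int.dtype)
print(as_int.tolist())
```

Whether to keep the float representation or switch to `Int64` depends on what the downstream tools accept, since some libraries expect plain NumPy dtypes.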


### 5.2 Rename columns


In [57]:
# Rename the dotted column names to CamelCase
telecom_df.rename(columns={
    'Charges.Monthly': 'MonthlyCharges',
    'Charges.Total': 'TotalCharges'
}, inplace=True)



### 5.3 Verify changes

In [55]:
telecom_df.info()
telecom_df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7267 entries, 0 to 7266
Data columns (total 34 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   customerID            7267 non-null   object 
 1   Churn                 7043 non-null   object 
 2   gender                7267 non-null   object 
 3   SeniorCitizen         7267 non-null   int64  
 4   Partner               7267 non-null   object 
 5   Dependents            7267 non-null   object 
 6   tenure                7267 non-null   int64  
 7   PhoneService          7267 non-null   object 
 8   MultipleLines         7267 non-null   object 
 9   InternetService       7267 non-null   object 
 10  OnlineSecurity        7267 non-null   object 
 11  OnlineBackup          7267 non-null   object 
 12  DeviceProtection      7267 non-null   object 
 13  TechSupport           7267 non-null   object 
 14  StreamingTV           7267 non-null   object 
 15  StreamingMovies      

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,Dependents_bin,PhoneService_bin,MultipleLines_bin,OnlineSecurity_bin,OnlineBackup_bin,DeviceProtection_bin,TechSupport_bin,StreamingTV_bin,StreamingMovies_bin,PaperlessBilling_bin
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,1,1,0,0,1,0,1,1,0,1
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,0,1,1,0,0,0,0,0,1,0
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,0,1,0,0,0,1,0,0,0,1
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,0,1,0,0,1,1,0,1,1,1
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,0,1,0,0,0,0,1,1,0,1


## 6. Datasets

We generate three separate DataFrames:

1. **`telecom_df_full`** → 34 columns (cleaned originals + derived `_bin` + `Cuentas_Diarias`)
2. **`telecom_df_clean`** → 22 columns (cleaned originals + `Cuentas_Diarias`, without `_bin`)
3. **`telecom_df_bin`** → 13 columns (the 12 derived binary columns + `Cuentas_Diarias`)


In [58]:
import os

# 1. FULL dataset (34 columns: originals + binary + Cuentas_Diarias)
telecom_df_full = telecom_df.copy()

# 2. CLEAN dataset (22 columns: cleaned originals + Cuentas_Diarias, no _bin)
cols_clean = [col for col in telecom_df.columns if not col.endswith('_bin')]

telecom_df_clean = telecom_df[cols_clean].copy()

# 3. BIN dataset (13 columns: 12 derived binary columns + Cuentas_Diarias)
bin_cols = [col for col in telecom_df.columns if col.endswith('_bin')] + ['Cuentas_Diarias']
telecom_df_bin = telecom_df[bin_cols].copy()

# Save the three versions
output_dir = "/content/drive/MyDrive/challenge-TelecomX-ETL/data/processed"
os.makedirs(output_dir, exist_ok=True)

telecom_df_full.to_csv(f"{output_dir}/telecom_df_full.csv", index=False)
telecom_df_clean.to_csv(f"{output_dir}/telecom_df_clean.csv", index=False)
telecom_df_bin.to_csv(f"{output_dir}/telecom_df_bin.csv", index=False)

print("✅ DataFrames saved to:", output_dir)
print(f"telecom_df_full → {telecom_df_full.shape}")
print(f"telecom_df_clean → {telecom_df_clean.shape}")
print(f"telecom_df_bin → {telecom_df_bin.shape}")

✅ DataFrames saved to: /content/drive/MyDrive/challenge-TelecomX-ETL/data/processed
telecom_df_full → (7267, 34)
telecom_df_clean → (7267, 22)
telecom_df_bin → (7267, 13)
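
The suffix-based split and a save/reload round trip can be tried on a miniature frame before touching the real files. Everything below is hypothetical: the toy frame, the file name `toy_bin.csv`, and the temporary folder that stands in for `data/processed`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature of the final frame
toy = pd.DataFrame({
    "customerID": ["A-1", "B-2"],
    "Churn": ["No", "Yes"],
    "Churn_bin": [0, 1],
    "Cuentas_Diarias": [2.19, 2.46],
})

# Same selection logic as above: split the columns by the _bin suffix
bin_cols = [c for c in toy.columns if c.endswith("_bin")] + ["Cuentas_Diarias"]
clean_cols = [c for c in toy.columns if not c.endswith("_bin")]

# Write to a temporary folder and read back to confirm the shape survives
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_bin.csv"
    toy[bin_cols].to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(clean_cols, reloaded.shape)
```

Comparing the reloaded shape with the in-memory one is a cheap way to confirm that `index=False` kept an extra index column out of the CSV.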


In [61]:
telecom_df_full.head()

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,Dependents_bin,PhoneService_bin,MultipleLines_bin,OnlineSecurity_bin,OnlineBackup_bin,DeviceProtection_bin,TechSupport_bin,StreamingTV_bin,StreamingMovies_bin,PaperlessBilling_bin
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,1,1,0,0,1,0,1,1,0,1
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,0,1,1,0,0,0,0,0,1,0
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,0,1,0,0,0,1,0,0,0,1
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,0,1,0,0,1,1,0,1,1,1
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,0,1,0,0,0,0,1,1,0,1


In [62]:
telecom_df_bin.head()

Unnamed: 0,Churn_bin,Partner_bin,Dependents_bin,PhoneService_bin,MultipleLines_bin,OnlineSecurity_bin,OnlineBackup_bin,DeviceProtection_bin,TechSupport_bin,StreamingTV_bin,StreamingMovies_bin,PaperlessBilling_bin,Cuentas_Diarias
0,0.0,1,1,1,0,0,1,0,1,1,0,1,2.186667
1,0.0,0,0,1,1,0,0,0,0,0,1,0,1.996667
2,1.0,0,0,1,0,0,0,1,0,0,0,1,2.463333
3,1.0,1,0,1,0,0,1,1,0,1,1,1,3.266667
4,1.0,1,0,1,0,0,0,0,1,1,0,1,2.796667


In [63]:
telecom_df_clean.head()

Unnamed: 0,customerID,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Cuentas_Diarias
0,0002-ORFBO,No,Female,0,Yes,Yes,9,Yes,No,DSL,...,No,Yes,Yes,No,One year,Yes,Mailed check,65.6,593.3,2.186667
1,0003-MKNFE,No,Male,0,No,No,9,Yes,Yes,DSL,...,No,No,No,Yes,Month-to-month,No,Mailed check,59.9,542.4,1.996667
2,0004-TLHLJ,Yes,Male,0,No,No,4,Yes,No,Fiber optic,...,Yes,No,No,No,Month-to-month,Yes,Electronic check,73.9,280.85,2.463333
3,0011-IGKFF,Yes,Male,1,Yes,No,13,Yes,No,Fiber optic,...,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,98.0,1237.85,3.266667
4,0013-EXCHZ,Yes,Female,1,Yes,No,3,Yes,No,Fiber optic,...,No,Yes,Yes,No,Month-to-month,Yes,Mailed check,83.9,267.4,2.796667


---

### **Conclusions**

- The processed data were saved to the `data/processed` folder, covering **7267 customers** and the **21 original columns**.  
- The variables most relevant to the customer churn analysis (`Churn`) were identified and documented, covering demographic, service, contractual and financial attributes.  
- Inconsistencies were detected, such as null values in `Charges.Total` and empty entries in `Churn`, as well as values like `"No phone service"` and `"No internet service"`, which were unified as `"No"`.  
- The numeric columns were converted to numeric types (`int64`, `float64`) and the categorical columns were standardized in their text format.  
- The **`Cuentas_Diarias`** column was created to provide an approximate daily billing metric per customer.  
- Binary variables (`_bin`) were generated from all Yes/No columns to ease analysis and modeling.  
- Three versions of the dataset were created and saved:
  1. **`telecom_df_full`** (34 columns: originals, binary variables and `Cuentas_Diarias`).
  2. **`telecom_df_clean`** (22 columns: cleaned originals + `Cuentas_Diarias`).
  3. **`telecom_df_bin`** (13 columns: binary variables + `Cuentas_Diarias`).

---


➡️ **Next step**: Exploratory data analysis (EDA) in `03_data_analysis.ipynb`, covering visualizations and initial `Churn` patterns.

---

