#PREPROCESSING.
In this notebook we develop the block of preprocessing steps applied to the raw df.
The result is then fed into the model evaluation stage.

##00-LIBRARIES.

In [None]:
import requests
from io import StringIO
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Download the df from Google Drive.
file_id = "16ypxCIBr9wSGVEaXqWdZUfz9w4xzccwo"
download_link = f"https://drive.google.com/uc?id={file_id}"
response = requests.get(download_link)
csv_data = StringIO(response.text)
# Convert the downloaded text into a df.
df = pd.read_csv(csv_data, encoding='utf-8')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2666 entries, 0 to 2665
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   2666 non-null   object 
 1   Account length          2666 non-null   int64  
 2   Area code               2666 non-null   int64  
 3   International plan      2666 non-null   object 
 4   Voice mail plan         2666 non-null   object 
 5   Number vmail messages   2666 non-null   int64  
 6   Total day minutes       2666 non-null   float64
 7   Total day calls         2666 non-null   int64  
 8   Total day charge        2666 non-null   float64
 9   Total eve minutes       2666 non-null   float64
 10  Total eve calls         2666 non-null   int64  
 11  Total eve charge        2666 non-null   float64
 12  Total night minutes     2666 non-null   float64
 13  Total night calls       2666 non-null   int64  
 14  Total night charge      2666 non-null   

In [None]:
# Create a working copy
df1 = df.copy()

##01-PREPROCESSING.
This block covers the following preprocessing tasks:
- Data type conversion.
- Removal of the "State" variable.
- Outlier removal.



In [None]:
# Function that converts 'International plan' to int.
def mapear_international_plan(df):
    df_copy = df.copy()  # Work on a copy of the DataFrame
    df_copy['International plan'] = df_copy['International plan'].map({'Yes': 1, 'No': 0})
    return df_copy

# Function that converts 'Voice mail plan' to int.
def mapear_voice_mail_plan(df):
    df_copy = df.copy()  # Work on a copy of the DataFrame
    df_copy['Voice mail plan'] = df_copy['Voice mail plan'].map({'Yes': 1, 'No': 0})
    return df_copy

# Drop the categorical variable "State".
df1 = df1.drop('State', axis=1)


# Function that removes outliers using the 1.5 * IQR rule
def eliminar_outliers(df):
    columns_to_check = ['Number vmail messages', 'Total day minutes',
                             'Total day charge', 'Total eve minutes',
                             'Total eve charge', 'Total night minutes',
                             'Total night charge', 'Total intl minutes',
                             'Total intl calls']

    df_copy = df.copy()

    for column in columns_to_check:
        q1 = df_copy[column].quantile(0.25)
        q3 = df_copy[column].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # Drop the rows that are outliers in the current column only
        df_copy = df_copy.drop(df_copy[(df_copy[column] < lower_bound) | (df_copy[column] > upper_bound)].index)

    return df_copy


# Normalization (NOT IMPLEMENTED, BECAUSE WE NEED TO INTERPRET THE FEATURES LATER WITH SHAP)

# Function that converts the target variable to int
def mapear_target(df):
    df_copy = df.copy()  # Work on a copy of the DataFrame
    df_copy['Churn'] = df_copy['Churn'].astype(int)  # Convert boolean values to integers (0 for False, 1 for True)
    return df_copy
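
The IQR rule used in eliminar_outliers can be illustrated on a single column. The snippet below is only a minimal sketch on a hypothetical toy DataFrame (the column name "x" and its values are assumptions, not taken from the dataset); it computes the same fences and keeps the rows that fall inside them.

```python
import pandas as pd

# Hypothetical toy data: one obvious outlier (100) in a single column.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 100]})

# Same fences as in eliminar_outliers: 1.5 * IQR beyond the quartiles.
q1 = toy["x"].quantile(0.25)
q3 = toy["x"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the rows whose value lies inside the fences (inclusive).
filtered = toy[toy["x"].between(lower_bound, upper_bound)]
print(filtered)
```

Note that eliminar_outliers reassigns df_copy inside the loop, so the quartiles of each later column are computed on rows that already survived the earlier columns. The order of columns_to_check can therefore affect which rows are dropped.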

In [None]:
# Now apply the transformations to the DataFrame df1
df1 = mapear_international_plan(df1)
df1 = mapear_voice_mail_plan(df1)
df1 = eliminar_outliers(df1)
df1 = mapear_target(df1)
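
A point worth checking after the mapping steps: Series.map with a dictionary returns NaN for any value that is not a key, so an unexpected spelling such as 'yes' would silently become a missing value. The snippet below is a minimal sketch on a hypothetical Series (the values are assumptions for illustration) showing how such cases could be counted.

```python
import pandas as pd

# Hypothetical values; the third one has unexpected casing.
plans = pd.Series(["Yes", "No", "yes"])

# Same dictionary as in the mapping functions.
mapped = plans.map({"Yes": 1, "No": 0})

# Values missing from the dictionary come back as NaN.
unmapped = int(mapped.isna().sum())
print(unmapped)
```

In this notebook the df1.info() output reports int64 for both plan columns, which suggests that no such unmapped values occurred here.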

In [None]:
df1.head()

Unnamed: 0,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0


In [None]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2501 entries, 0 to 2665
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Account length          2501 non-null   int64  
 1   Area code               2501 non-null   int64  
 2   International plan      2501 non-null   int64  
 3   Voice mail plan         2501 non-null   int64  
 4   Number vmail messages   2501 non-null   int64  
 5   Total day minutes       2501 non-null   float64
 6   Total day calls         2501 non-null   int64  
 7   Total day charge        2501 non-null   float64
 8   Total eve minutes       2501 non-null   float64
 9   Total eve calls         2501 non-null   int64  
 10  Total eve charge        2501 non-null   float64
 11  Total night minutes     2501 non-null   float64
 12  Total night calls       2501 non-null   int64  
 13  Total night charge      2501 non-null   float64
 14  Total intl minutes      2501 non-null   

In [None]:
# Export to CSV.
df1.to_csv("Prepro01.csv", index=False)

# CONCLUSION
We can see that the original dataset had already gone through prior cleaning (e.g., the handling of null values), that is, a data wrangling process.
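
One way to confirm an observation like this is to count missing values per column. The snippet below is a minimal sketch on a hypothetical toy DataFrame (the columns "a" and "b" are assumptions); the same calls could be applied to df.

```python
import pandas as pd

# Hypothetical toy frame with a single missing value in column "b".
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, None, 1.5]})

# Missing values per column, and the overall total.
nulls_per_column = toy.isna().sum()
total_nulls = int(nulls_per_column.sum())
print(nulls_per_column)
print(total_nulls)
```

A total of zero on the real dataset would be consistent with the non-null counts shown by df.info().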