Before starting the Exploratory Data Analysis (EDA) for our project, we outline the workflow we followed:

1. Initial EDA. We began by exploring the raw dataset exactly as received, to understand its distributions and the presence of nulls, duplicates, outliers, correlations, and so on.

2. Duplicate removal. This comes second, so that we do not work with redundant data.

3. Outlier and null analysis. With duplicates removed, we visualized the data with histograms, boxplots, and bar charts to get an overview of the distributions and outliers, and to plan how to handle the nulls.

4. Null handling. Imputation, removal, or creation of a dedicated category, depending on the case.

5. Normalization / standardization of variables. Applied to a dataset that is already free of inconsistencies.
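
As a rough illustration of steps 2, 4 and 5, the sketch below runs them on a tiny invented DataFrame (step 3 is visual and is left out). The toy values, the median imputation, the "Unknown" category and the z-score scaling are all assumptions chosen for the example, not necessarily the choices made later in the project.

```python
import pandas as pd

# Toy data, invented for illustration only (not the project dataset)
raw = pd.DataFrame({
    "age": [30, 30, 45, 52, 38],
    "department": ["Sales", "Sales", None, "Research & Development", None],
    "dailyrate": [500.0, 500.0, None, 1500.0, 1000.0],
})

# Step 2: drop exact duplicate rows
dedup = raw.drop_duplicates().reset_index(drop=True)

# Step 4: handle nulls (median for a numeric column, explicit category for a text one)
clean = dedup.copy()
clean["dailyrate"] = clean["dailyrate"].fillna(clean["dailyrate"].median())
clean["department"] = clean["department"].fillna("Unknown")

# Step 5: standardize numeric columns to zero mean and unit variance (z-score)
num_cols = ["age", "dailyrate"]
scaled = clean.copy()
scaled[num_cols] = (clean[num_cols] - clean[num_cols].mean()) / clean[num_cols].std(ddof=0)

print(scaled)
```

Each step works on a copy, so the intermediate DataFrames stay available for comparison.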



### EXPLORATORY DATA ANALYSIS

This exploratory data analysis (EDA) is a first pass, carried out before treating null values, dropping redundant columns, or correcting inconsistent values.

The results shown here should therefore be interpreted with caution: they do not yet reflect the full picture and may differ significantly from the final results once the dataset has been properly cleaned and prepared.

It is essential to keep in mind the conditions under which these data were obtained: this phase only aims at an initial view to detect patterns, anomalies, and potential data-quality problems, which will have to be resolved in later stages of the analysis.

In [2]:
# Import the libraries we need

# Data handling
import pandas as pd
import numpy as np
from IPython.display import display
import warnings


In [3]:
# Show all columns
pd.set_option('display.max_columns', None)

In [4]:
# Show all rows
pd.set_option('display.max_rows', None)

In [5]:
# Load the CSV, using its first column as the index

df = pd.read_csv("ABC_data.csv", index_col=0)


In [6]:
df.head()

Unnamed: 0,age,attrition,businesstravel,dailyrate,department,distancefromhome,education,educationfield,employeecount,employeenumber,environmentsatisfaction,gender,hourlyrate,jobinvolvement,joblevel,jobrole,jobsatisfaction,maritalstatus,monthlyincome,monthlyrate,numcompaniesworked,over18,overtime,percentsalaryhike,performancerating,relationshipsatisfaction,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,worklifebalance,yearsatcompany,yearsincurrentrole,yearssincelastpromotion,yearswithcurrmanager,sameasmonthlyincome,datebirth,salary,roledepartament,numberchildren,remotework
0,51,No,,2015.722222,,6,3,,1,1,1,0,,3,5,resEArch DIREcToR,3,,"16280,83$","42330,17$",7,Y,No,13,30,3,Full Time,0,,5,30.0,20,,15,15,"16280,83$",1972,"195370,00$",,,Yes
1,52,No,,2063.388889,,1,4,Life Sciences,1,2,3,0,,2,5,ManAGeR,3,,,"43331,17$",0,,,14,30,1,,1,340.0,5,30.0,33,,11,9,,1971,"199990,00$",,,1
2,42,No,travel_rarely,1984.253968,Research & Development,4,2,Technical Degree,1,3,3,0,,3,5,ManaGER,4,Married,,"41669,33$",1,,No,11,30,4,,0,220.0,3,,22,,11,15,,1981,"192320,00$",ManaGER - Research & Development,,1
3,47,No,travel_rarely,1771.404762,,2,4,Medical,1,4,1,1,,3,4,ReseArCH DIrECtOr,3,Married,"14307,50$","37199,50$",3,Y,,19,30,2,Full Time,2,,2,,20,,5,6,"14307,50$",1976,"171690,00$",,,False
4,46,No,,1582.771346,,3,3,Technical Degree,1,5,1,1,,4,4,sAleS EXECUtIve,1,Divorced,"12783,92$","33238,20$",2,Y,No,12,30,4,,1,,5,30.0,19,,2,8,"12783,92$",1977,,,,0


In [7]:
df.tail()

Unnamed: 0,age,attrition,businesstravel,dailyrate,department,distancefromhome,education,educationfield,employeecount,employeenumber,environmentsatisfaction,gender,hourlyrate,jobinvolvement,joblevel,jobrole,jobsatisfaction,maritalstatus,monthlyincome,monthlyrate,numcompaniesworked,over18,overtime,percentsalaryhike,performancerating,relationshipsatisfaction,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,worklifebalance,yearsatcompany,yearsincurrentrole,yearssincelastpromotion,yearswithcurrmanager,sameasmonthlyincome,datebirth,salary,roledepartament,numberchildren,remotework
1673,43,No,,488.944444,,-26,3,Medical,1,824,2,1,,4,1,rESEaRcH SciEnTiST,3,Single,"3949,17$","10267,83$",4,,,12,30.0,4,,0,,2,30,3,,1,2,"3949,17$",1980,,,,Yes
1674,47,No,,1973.984127,,26,4,,1,1087,4,1,,3,5,mANager,3,Married,"15943,72$","41453,67$",3,Y,No,11,30.0,3,Full Time,1,270.0,2,30,5,,1,0,"15943,72$",1976,"191324,62$",,,False
1675,29,No,travel_rarely,290.03551,,15,3,,1,528,3,0,,3,1,reSearch sCienTiSt,4,,,"6090,75$",1,,No,19,30.0,1,Part Time,0,60.0,1,30,6,,1,5,,1994,"28111,13$",,,False
1676,47,No,travel_rarely,1032.487286,,4,3,Life Sciences,1,76,3,1,,2,3,maNufACTURING DIREctOr,2,Divorced,"8339,32$","21682,23$",8,,Yes,12,,3,Part Time,1,,4,30,22,,14,10,"8339,32$",1976,"100071,84$",,,Yes
1677,32,No,,556.256661,,2,2,Life Sciences,1,401,4,1,69.532083,3,2,resEArch scIENTisT,3,Single,,"11681,39$",4,Y,Yes,14,30.0,4,Part Time,0,100.0,2,30,8,,0,7,,1991,"53914,11$",,,0


The function defined below is a "cleaning" function that detects columns of type object and converts each one to the most suitable dtype. It first tries to convert the column to numeric; if that fails, it tries a datetime format; otherwise it converts the column to a categorical variable. A conversion is accepted only if it succeeds for the whole column. This tidies the dataset's structure, which eases both the exploratory analysis and later modeling and avoids inconsistencies in how data types are handled.

In [8]:
def limpiar_df(df):
    """
    Automatically converts object columns to numeric, datetime, or categorical.
    Modifies the received DataFrame in place and returns it.
    """
    # Silence UserWarnings (e.g. from date parsing) only inside this block,
    # instead of resetting every warning filter globally afterwards
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)

        for col in df.select_dtypes(include='object').columns:
            # Count non-null values so that existing nulls do not block a conversion
            n_validos = df[col].notna().sum()

            # Try converting to numeric
            num = pd.to_numeric(df[col], errors='coerce')
            if num.notna().sum() == n_validos:
                df[col] = num
                continue

            # Try converting to datetime
            fechas = pd.to_datetime(df[col], errors='coerce', dayfirst=False)
            if fechas.notna().sum() == n_validos:
                df[col] = fechas
                continue

            # Neither numeric nor datetime: convert to category
            df[col] = df[col].astype('category')

    return df
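
Columns such as monthlyincome hold strings like "16280,83$", which `pd.to_numeric` cannot parse, so the function above treats them as categories. The sketch below shows one possible way to turn such strings into floats. `parse_money` is a hypothetical helper, not part of the notebook, and it assumes the comma is a decimal separator and "$" is the only extra symbol.

```python
import pandas as pd

def parse_money(serie):
    """Hypothetical helper: turn strings like '16280,83$' into floats."""
    limpio = (
        serie.str.replace("$", "", regex=False)   # drop the currency symbol
             .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
             .str.strip()
    )
    # Anything that still fails to parse (including missing values) becomes NaN
    return pd.to_numeric(limpio, errors="coerce")

# Two sample values taken from the first rows of the dataset, plus a missing one
ejemplo = pd.Series(["16280,83$", None, "14307,50$"])
convertido = parse_money(ejemplo)
print(convertido)
```

Running a conversion like this before the dtype detection would let these columns be recognized as numeric rather than categorical.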


Next comes a complete eda function that shows the first and last rows of the table, its dimensions and data types, and generates descriptive statistics for both numeric and categorical variables. It also identifies null values, duplicated rows, and unique values per column. In addition, it shows the distribution of the categorical variables and builds an overall summary of all columns, highlighting those with missing data.

This gives a complete view of the dataset's initial state, which makes it easier to spot quality problems and to prepare for later analysis or modeling stages.

In [9]:
def eda(df):
    """
    Runs an exploratory data analysis on a DataFrame and displays:
    - First rows
    - Last rows
    - Dimensions
    - Data types
    - Numeric and categorical statistics
    - Null values
    - Duplicated rows
    - Unique values
    - Distribution of categorical variables
    - Overall summary
    - Summary of columns with nulls
    Returns the overall summary and the nulls-only summary as DataFrames.
    """
    print("ANÁLISIS EXPLORATORIO DE DATOS\n")

    print("PRIMERAS 10 FILAS:")
    display(df.head(10))

    print("\nÚLTIMAS 10 FILAS:")
    display(df.tail(10))

    print("\nDIMENSIONES:")
    print(df.shape)

    print("\n INFORMACIÓN GENERAL:")
    df.info()

    print("\n TIPOS DE DATOS POR COLUMNA:")
    print(df.dtypes)

    # Numeric columns
    num_cols = df.select_dtypes(include='number').columns
    if len(num_cols) > 0:
        print("\n ESTADÍSTICAS NUMÉRICAS:")
        display(df[num_cols].describe().T)

    # Categorical columns (object dtype, or category dtype after limpiar_df)
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(cat_cols) > 0:
        print("\n ESTADÍSTICAS CATEGÓRICAS:")
        display(df[cat_cols].describe())

    # Null values
    print("\n VALORES NULOS:")
    nulos_df = pd.DataFrame({
        "Conteo": df.isnull().sum(),
        "Porcentaje": (df.isnull().sum()/len(df)*100).round(2)
    }).sort_values(by="Porcentaje", ascending=False)
    display(nulos_df)

    # Duplicated rows
    print("\n FILAS DUPLICADAS:", df.duplicated().sum())

    # Unique values
    print("\n VALORES ÚNICOS POR COLUMNA:")
    display(df.nunique().sort_values(ascending=False))

    # Distribution of categorical variables (top 5 values per column)
    for col in cat_cols:
        print(f"\n COLUMNA: {col}")
        display(df[col].value_counts().head())

    # Overall summary
    resumen = pd.DataFrame({
        "Columna": df.columns,
        "Dtype": df.dtypes.astype(str),
        "Valores únicos": df.nunique().values,
        "% Nulos": (df.isnull().sum()/len(df)*100).round(2).values
    })
    print("\nRESUMEN GENERAL DE COLUMNAS:")
    display(resumen)

    # Summary of ONLY the columns that have nulls
    resumen_nulos = resumen[resumen["% Nulos"] > 0].sort_values(by="% Nulos", ascending=False)
    print("\nRESUMEN DE COLUMNAS CON VALORES NULOS:")
    display(resumen_nulos)

    return resumen, resumen_nulos

In [10]:
resumen, resumen_nulos = eda(df)

ANÁLISIS EXPLORATORIO DE DATOS

PRIMERAS 10 FILAS:


Unnamed: 0,age,attrition,businesstravel,dailyrate,department,distancefromhome,education,educationfield,employeecount,employeenumber,environmentsatisfaction,gender,hourlyrate,jobinvolvement,joblevel,jobrole,jobsatisfaction,maritalstatus,monthlyincome,monthlyrate,numcompaniesworked,over18,overtime,percentsalaryhike,performancerating,relationshipsatisfaction,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,worklifebalance,yearsatcompany,yearsincurrentrole,yearssincelastpromotion,yearswithcurrmanager,sameasmonthlyincome,datebirth,salary,roledepartament,numberchildren,remotework
0,51,No,,2015.722222,,6,3,,1,1,1,0,,3,5,resEArch DIREcToR,3,,"16280,83$","42330,17$",7,Y,No,13,30,3,Full Time,0,,5,30.0,20,,15,15,"16280,83$",1972,"195370,00$",,,Yes
1,52,No,,2063.388889,,1,4,Life Sciences,1,2,3,0,,2,5,ManAGeR,3,,,"43331,17$",0,,,14,30,1,,1,340.0,5,30.0,33,,11,9,,1971,"199990,00$",,,1
2,42,No,travel_rarely,1984.253968,Research & Development,4,2,Technical Degree,1,3,3,0,,3,5,ManaGER,4,Married,,"41669,33$",1,,No,11,30,4,,0,220.0,3,,22,,11,15,,1981,"192320,00$",ManaGER - Research & Development,,1
3,47,No,travel_rarely,1771.404762,,2,4,Medical,1,4,1,1,,3,4,ReseArCH DIrECtOr,3,Married,"14307,50$","37199,50$",3,Y,,19,30,2,Full Time,2,,2,,20,,5,6,"14307,50$",1976,"171690,00$",,,False
4,46,No,,1582.771346,,3,3,Technical Degree,1,5,1,1,,4,4,sAleS EXECUtIve,1,Divorced,"12783,92$","33238,20$",2,Y,No,12,30,4,,1,,5,30.0,19,,2,8,"12783,92$",1977,,,,0
5,48,No,,1771.920635,Research & Development,22,3,Medical,1,6,4,1,,3,4,MANAger,4,,"14311,67$","37210,33$",3,,No,11,30,2,,1,,3,30.0,22,,4,7,"14311,67$",1975,,MANAger - Research & Development,,Yes
6,59,No,,1032.487286,,25,3,Life Sciences,1,7,1,1,,3,3,Sales ExeCutIVe,1,,"8339,32$","21682,23$",7,Y,,11,30,4,Part Time,0,280.0,3,20.0,21,,7,9,"8339,32$",1964,"100071,84$",,,True
7,42,No,travel_rarely,556.256661,,1,1,,1,8,2,0,69.532083,3,2,Sales eXEcUTiVe,3,Married,,"11681,39$",1,,No,25,40,3,Part Time,0,200.0,3,30.0,20,,11,6,,1981,"53914,11$",,,0
8,41,No,,1712.18254,,2,5,,1,9,2,1,,3,4,mANAGEr,1,Married,"13829,17$","35955,83$",7,,No,16,30,2,Full Time,1,220.0,2,30.0,18,,11,8,"13829,17$",1982,"165950,00$",,,True
9,41,No,travel_frequently,1973.984127,,9,3,,1,10,1,0,,3,5,reSEaRCH DIrectoR,3,,"15943,72$","41453,67$",2,,No,17,30,2,,1,210.0,2,40.0,18,,0,11,"15943,72$",1982,,,,0



ÚLTIMAS 10 FILAS:


Unnamed: 0,age,attrition,businesstravel,dailyrate,department,distancefromhome,education,educationfield,employeecount,employeenumber,environmentsatisfaction,gender,hourlyrate,jobinvolvement,joblevel,jobrole,jobsatisfaction,maritalstatus,monthlyincome,monthlyrate,numcompaniesworked,over18,overtime,percentsalaryhike,performancerating,relationshipsatisfaction,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,worklifebalance,yearsatcompany,yearsincurrentrole,yearssincelastpromotion,yearswithcurrmanager,sameasmonthlyincome,datebirth,salary,roledepartament,numberchildren,remotework
1668,33,No,,556.256661,,25,3,,1,755,4,0,,2,2,MANuFaCtURIng DIRectoR,2,,,"11681,39$",1,Y,No,13,30.0,4,Part Time,0,50.0,2,30,5,,0,2,,1990,"53914,11$",,,True
1669,33,No,travel_rarely,356.15873,,13,1,Life Sciences,1,977,2,1,44.519841,3,1,rEseaRcH scIEnTiST,4,Single,"2876,67$","7479,33$",3,Y,,18,30.0,1,Full Time,0,50.0,4,30,3,,0,2,"2876,67$",1990,"34520,00$",,,0
1670,48,No,,233.071429,,20,4,Medical,1,1342,4,0,,3,1,rEsEARCh ScieNTist,3,Married,"1882,50$","4894,50$",4,,,17,30.0,1,Part Time,2,130.0,2,20,0,,0,0,"1882,50$",1975,"22590,00$",,,0
1671,31,No,,556.256661,,12,3,Medical,1,1608,4,1,,3,2,HEaltHCarE REPreSENtAtIve,4,,,"11681,39$",0,,,11,30.0,3,Part Time,2,100.0,2,10,9,,8,5,,1992,"53914,11$",,,True
1672,39,No,travel_rarely,1032.487286,,2,5,,1,369,1,0,129.060911,4,3,saLEs exEcUTIVe,3,Single,,"21682,23$",0,,No,18,30.0,4,Part Time,0,90.0,3,30,8,,0,7,,1984,"100071,84$",,,0
1673,43,No,,488.944444,,-26,3,Medical,1,824,2,1,,4,1,rESEaRcH SciEnTiST,3,Single,"3949,17$","10267,83$",4,,,12,30.0,4,,0,,2,30,3,,1,2,"3949,17$",1980,,,,Yes
1674,47,No,,1973.984127,,26,4,,1,1087,4,1,,3,5,mANager,3,Married,"15943,72$","41453,67$",3,Y,No,11,30.0,3,Full Time,1,270.0,2,30,5,,1,0,"15943,72$",1976,"191324,62$",,,False
1675,29,No,travel_rarely,290.03551,,15,3,,1,528,3,0,,3,1,reSearch sCienTiSt,4,,,"6090,75$",1,,No,19,30.0,1,Part Time,0,60.0,1,30,6,,1,5,,1994,"28111,13$",,,False
1676,47,No,travel_rarely,1032.487286,,4,3,Life Sciences,1,76,3,1,,2,3,maNufACTURING DIREctOr,2,Divorced,"8339,32$","21682,23$",8,,Yes,12,,3,Part Time,1,,4,30,22,,14,10,"8339,32$",1976,"100071,84$",,,Yes
1677,32,No,,556.256661,,2,2,Life Sciences,1,401,4,1,69.532083,3,2,resEArch scIENTisT,3,Single,,"11681,39$",4,Y,Yes,14,30.0,4,Part Time,0,100.0,2,30,8,,0,7,,1991,"53914,11$",,,0



DIMENSIONES:
(1678, 41)

 INFORMACIÓN GENERAL:
<class 'pandas.core.frame.DataFrame'>
Index: 1678 entries, 0 to 1677
Data columns (total 41 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       1678 non-null   object 
 1   attrition                 1678 non-null   object 
 2   businesstravel            877 non-null    object 
 3   dailyrate                 1678 non-null   float64
 4   department                312 non-null    object 
 5   distancefromhome          1678 non-null   int64  
 6   education                 1678 non-null   int64  
 7   educationfield            904 non-null    object 
 8   employeecount             1678 non-null   int64  
 9   employeenumber            1678 non-null   int64  
 10  environmentsatisfaction   1678 non-null   int64  
 11  gender                    1678 non-null   int64  
 12  hourlyrate                411 non-null    float64
 13  jobinvolvement      

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
dailyrate,1678.0,668.079714,470.787298,104.103175,290.03551,556.256661,971.956349,2063.388889
distancefromhome,1678.0,4.504172,14.652066,-49.0,2.0,5.0,11.0,29.0
education,1678.0,2.932658,1.02427,1.0,2.0,3.0,4.0,5.0
employeecount,1678.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
employeenumber,1678.0,809.859952,467.084867,1.0,403.25,813.5,1215.75,1614.0
environmentsatisfaction,1678.0,4.264005,6.912695,1.0,2.0,3.0,4.0,49.0
gender,1678.0,0.398689,0.489774,0.0,0.0,0.0,1.0,1.0
hourlyrate,411.0,83.140768,57.272101,13.012897,36.254439,69.532083,116.987103,255.963294
jobinvolvement,1678.0,2.740763,0.710359,1.0,2.0,3.0,3.0,4.0
joblevel,1678.0,2.064362,1.099425,1.0,1.0,2.0,3.0,5.0



 ESTADÍSTICAS CATEGÓRICAS:


Unnamed: 0,age,attrition,businesstravel,department,educationfield,jobrole,maritalstatus,monthlyincome,monthlyrate,over18,overtime,performancerating,standardhours,totalworkingyears,worklifebalance,yearsincurrentrole,sameasmonthlyincome,salary,roledepartament,remotework
count,1678,1678,877,312,904,1678,1003,1189,1678,740,982,1478,1327,1129,1564,35,1189,1393,312,1678
unique,54,2,3,3,6,1579,5,493,673,1,2,2,2,40,4,10,493,583,301,5
top,35,No,travel_rarely,Research & Development,Life Sciences,mANager,Married,"2342,59$","11681,39$",Y,No,30,Part Time,100,30,20,"2342,59$","53914,11$",Sales exECutIVE - Sales,1
freq,88,1406,616,203,367,5,419,228,326,740,714,1257,927,151,946,12,228,270,2,375



 VALORES NULOS:


Unnamed: 0,Conteo,Porcentaje
numberchildren,1678,100.0
yearsincurrentrole,1643,97.91
roledepartament,1366,81.41
department,1366,81.41
hourlyrate,1267,75.51
over18,938,55.9
businesstravel,801,47.74
educationfield,774,46.13
overtime,696,41.48
maritalstatus,675,40.23



 FILAS DUPLICADAS: 64

 VALORES ÚNICOS POR COLUMNA:


employeenumber              1614
jobrole                     1579
monthlyrate                  673
dailyrate                    673
salary                       583
monthlyincome                493
sameasmonthlyincome          493
roledepartament              301
hourlyrate                   194
distancefromhome              69
age                           54
datebirth                     43
totalworkingyears             40
environmentsatisfaction       38
yearsatcompany                37
yearswithcurrmanager          18
yearssincelastpromotion       16
percentsalaryhike             15
numcompaniesworked            10
yearsincurrentrole            10
trainingtimeslastyear          7
educationfield                 6
education                      5
remotework                     5
maritalstatus                  5
joblevel                       5
worklifebalance                4
jobsatisfaction                4
relationshipsatisfaction       4
stockoptionlevel               4
jobinvolve


 COLUMNA: age


age
35    88
31    88
34    86
29    82
36    79
Name: count, dtype: int64


 COLUMNA: attrition


attrition
No     1406
Yes     272
Name: count, dtype: int64


 COLUMNA: businesstravel


businesstravel
travel_rarely        616
travel_frequently    168
non-travel            93
Name: count, dtype: int64


 COLUMNA: department


department
Research & Development     203
Sales                       93
Human Resources             16
Name: count, dtype: int64


 COLUMNA: educationfield


educationfield
Life Sciences       367
Medical             286
Marketing           106
Technical Degree     70
Other                63
Name: count, dtype: int64


 COLUMNA: jobrole


jobrole
mANager     5
mAnageR     3
ManagEr     3
mAnaGeR     3
ManageR     3
Name: count, dtype: int64


 COLUMNA: maritalstatus


maritalstatus
Married     419
Single      343
Divorced    194
Marreid      36
divorced     11
Name: count, dtype: int64


 COLUMNA: monthlyincome


monthlyincome
2342,59$     228
4492,84$     227
8339,32$     105
12783,92$     42
15943,72$     25
Name: count, dtype: int64


 COLUMNA: monthlyrate


monthlyrate
11681,39$    326
6090,75$     308
21682,23$    150
33238,20$     55
41453,67$     38
Name: count, dtype: int64


 COLUMNA: over18


over18
Y    740
Name: count, dtype: int64


 COLUMNA: overtime


overtime
No     714
Yes    268
Name: count, dtype: int64


 COLUMNA: performancerating


performancerating
3,0    1257
4,0     221
Name: count, dtype: int64


 COLUMNA: standardhours


standardhours
Part Time    927
Full Time    400
Name: count, dtype: int64


 COLUMNA: totalworkingyears


totalworkingyears
10,0    151
6,0      88
8,0      86
9,0      71
5,0      68
Name: count, dtype: int64


 COLUMNA: worklifebalance


worklifebalance
3,0    946
2,0    374
4,0    162
1,0     82
Name: count, dtype: int64


 COLUMNA: yearsincurrentrole


yearsincurrentrole
2,0    12
7,0     5
0,0     4
4,0     3
1,0     3
Name: count, dtype: int64


 COLUMNA: sameasmonthlyincome


sameasmonthlyincome
2342,59$     228
4492,84$     227
8339,32$     105
12783,92$     42
15943,72$     25
Name: count, dtype: int64


 COLUMNA: salary


salary
53914,11$     270
28111,13$     255
100071,84$    122
153407,07$     45
191324,62$     28
Name: count, dtype: int64


 COLUMNA: roledepartament


roledepartament
Sales exECutIVE  -  Sales                                2
humAN resoURCEs  -  Human Resources                      2
labORAtoRy tEcHNICIAN  -  Research & Development         2
hEalthCaRe reprEseNTaTiVe  -  Research & Development     2
LaBoratory TECHnICIAn  -  Research & Development         2
Name: count, dtype: int64


 COLUMNA: remotework


remotework
1        375
True     362
False    318
0        318
Yes      305
Name: count, dtype: int64


RESUMEN GENERAL DE COLUMNAS:


Unnamed: 0,Columna,Dtype,Valores únicos,% Nulos
age,age,object,54,0.0
attrition,attrition,object,2,0.0
businesstravel,businesstravel,object,3,47.74
dailyrate,dailyrate,float64,673,0.0
department,department,object,3,81.41
distancefromhome,distancefromhome,int64,69,0.0
education,education,int64,5,0.0
educationfield,educationfield,object,6,46.13
employeecount,employeecount,int64,1,0.0
employeenumber,employeenumber,int64,1614,0.0



RESUMEN DE COLUMNAS CON VALORES NULOS:


Unnamed: 0,Columna,Dtype,Valores únicos,% Nulos
numberchildren,numberchildren,float64,0,100.0
yearsincurrentrole,yearsincurrentrole,object,10,97.91
department,department,object,3,81.41
roledepartament,roledepartament,object,301,81.41
hourlyrate,hourlyrate,float64,194,75.51
over18,over18,object,1,55.9
businesstravel,businesstravel,object,3,47.74
educationfield,educationfield,object,6,46.13
overtime,overtime,object,2,41.48
maritalstatus,maritalstatus,object,5,40.23


We save the DataFrame used for the EDA to a CSV file for the later steps that precede visualization: outlier treatment, data cleaning and normalization, and null handling.



In [11]:

df.to_csv("datos.csv", encoding="utf-8")
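
The EDA output shows several label inconsistencies that the later cleaning step will need to address: random casing in jobrole, variants such as "Marreid" and "divorced" in maritalstatus, and mixed encodings (1, 0, True, False, Yes) in remotework. The sketch below shows one way such labels could be unified on a toy DataFrame. The mappings are assumptions made for illustration (including the "no" key, which does not appear in the output above) and are not the project's actual cleaning rules.

```python
import pandas as pd

# Toy values modeled on the kinds of labels seen in the EDA output
toy = pd.DataFrame({
    "jobrole": ["resEArch DIREcToR", "ManAGeR", "sAleS EXECUtIve"],
    "maritalstatus": ["Marreid", "divorced", None],
    "remotework": ["Yes", "1", "False"],
})

# Unify casing so that variants like 'ManAGeR' and 'mANager' collapse into one label
toy["jobrole"] = toy["jobrole"].str.strip().str.title()

# Fix casing first, then the observed typo (assumed to mean 'Married')
toy["maritalstatus"] = (
    toy["maritalstatus"].str.strip().str.capitalize().replace("Marreid", "Married")
)

# Map the mixed encodings to booleans; values outside the map become missing
mapa_remoto = {"yes": True, "true": True, "1": True,
               "no": False, "false": False, "0": False}
toy["remotework"] = toy["remotework"].astype(str).str.lower().map(mapa_remoto)

print(toy)
```

Missing values in maritalstatus are left untouched here, since null handling is a separate step in the workflow.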