# **Employee Attrition**
A medical products company wants to start by predicting whether an employee will leave the company (Attrition) at a given point in time. Because this is a binary classification problem, they will most likely use a machine learning model. The company has collected 30 variables for 400 of its employees, but it is not sure whether that dataset is suitable for this purpose, so it has hired you as a data scientist to build an adequate dataset for the task.

In [35]:
# Import libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
import numpy as np

In [6]:
# Read the file and load the data into a pandas DataFrame

EmpleadosAttrition = pd.read_csv('/empleados RETO.csv')

In [8]:
# Show information about the EmpleadosAttrition DataFrame
EmpleadosAttrition.info()
EmpleadosAttrition.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       400 non-null    int64 
 1   BusinessTravel            396 non-null    object
 2   Department                400 non-null    object
 3   DistanceFromHome          400 non-null    object
 4   Education                 400 non-null    int64 
 5   EducationField            400 non-null    object
 6   EmployeeCount             400 non-null    int64 
 7   EmployeeNumber            400 non-null    int64 
 8   EnvironmentSatisfaction   400 non-null    int64 
 9   Gender                    400 non-null    object
 10  JobInvolvement            400 non-null    int64 
 11  JobLevel                  400 non-null    int64 
 12  JobRole                   400 non-null    object
 13  JobSatisfaction           400 non-null    int64 
 14  MaritalStatus             

Unnamed: 0,Age,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,...,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsInCurrentRole,YearsSinceLastPromotion,Attrition
0,50,Travel_Rarely,Research & Development,1 km,2,Medical,1,997,4,Male,...,22,4,3,80,32,1,2,4,1,No
1,36,Travel_Rarely,Research & Development,6 km,2,Medical,1,178,2,Male,...,20,4,4,80,7,0,3,2,0,No
2,21,Travel_Rarely,Sales,7 km,1,Marketing,1,1780,2,Male,...,13,3,2,80,1,3,3,0,1,Yes
3,52,Travel_Rarely,Research & Development,7 km,4,Life Sciences,1,1118,2,Male,...,19,3,4,80,18,4,3,6,4,No
4,33,Travel_Rarely,Research & Development,15 km,1,Medical,1,582,2,Male,...,12,3,4,80,15,2,4,6,7,Yes


In [9]:
# Drop the columns that very likely have no relationship with the output.

# EmployeeCount: number of employees, always 1
# EmployeeNumber: employee ID, unique for each employee
# Over18: of legal age, always "Y"
# StandardHours: working hours, always 80

EmpleadosAttrition = EmpleadosAttrition.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis = 1)

# Verify

EmpleadosAttrition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       400 non-null    int64 
 1   BusinessTravel            396 non-null    object
 2   Department                400 non-null    object
 3   DistanceFromHome          400 non-null    object
 4   Education                 400 non-null    int64 
 5   EducationField            400 non-null    object
 6   EnvironmentSatisfaction   400 non-null    int64 
 7   Gender                    400 non-null    object
 8   JobInvolvement            400 non-null    int64 
 9   JobLevel                  400 non-null    int64 
 10  JobRole                   400 non-null    object
 11  JobSatisfaction           400 non-null    int64 
 12  MaritalStatus             395 non-null    object
 13  MonthlyIncome             400 non-null    int64 
 14  NumCompaniesWorked        
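
The four dropped columns were identified by inspection. A minimal sketch of how constant and ID-like columns could be flagged programmatically, using a hypothetical toy frame (the values are made up; only the column names mirror the notebook). The all-unique test is only a heuristic, since a continuous feature can also have a different value in every row.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the real file.
df = pd.DataFrame({
    "Age": [50, 36, 36, 52],
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [997, 178, 1780, 1118],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
})

# Number of distinct values per column (missing values count as a value).
n_unique = df.nunique(dropna=False)

# Columns with a single value carry no information for a classifier.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns where every row differs are candidate identifiers (heuristic only).
id_like_cols = n_unique[n_unique == len(df)].index.tolist()

print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
```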

In [10]:
# Hiring year (integer) from HiringDate; unparseable dates become <NA>
EmpleadosAttrition["Year"] = pd.to_datetime(EmpleadosAttrition["HiringDate"], errors="coerce").dt.year.astype("Int64")

# Years at the company as of 2018, computed from Year
EmpleadosAttrition["YearsAtCompany"] = (2018 - EmpleadosAttrition["Year"]).astype("Int64")

# Verify
EmpleadosAttrition[["Year", "YearsAtCompany"]].head()

Unnamed: 0,Year,YearsAtCompany
0,2013,5
1,2015,3
2,2017,1
3,2010,8
4,2011,7
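
The effect of `errors="coerce"` together with the nullable `Int64` dtype can be seen on a small example. This is a sketch with hypothetical date strings; the real `HiringDate` format is not shown in the notebook, so the `%m/%d/%Y` format is an assumption.

```python
import pandas as pd

# Hypothetical hiring dates, including one invalid entry.
hiring = pd.Series(["06/06/2013", "12/25/2015", "not a date", "02/14/2010"])

# Invalid strings become NaT instead of raising an error.
dates = pd.to_datetime(hiring, format="%m/%d/%Y", errors="coerce")

# The nullable Int64 dtype keeps whole-number years and represents NaT as <NA>.
year = dates.dt.year.astype("Int64")
years_at_company = 2018 - year

print(years_at_company)
```

A plain `int` cast would fail on the missing entry, which is why the nullable integer dtype is used here.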


In [11]:
# Rename the original variable to keep the text version
EmpleadosAttrition = EmpleadosAttrition.rename(columns={"DistanceFromHome": "DistanceFromHome_km"})

# Create a new integer variable by extracting the digits (drops the " km" suffix, works for any number of digits)
EmpleadosAttrition["DistanceFromHome"] = EmpleadosAttrition["DistanceFromHome_km"].str.extract(r"(\d+)", expand=False).astype(int)

# Verify
EmpleadosAttrition[["DistanceFromHome_km", "DistanceFromHome"]].head()

Unnamed: 0,DistanceFromHome_km,DistanceFromHome
0,1 km,1
1,6 km,6
2,7 km,7
3,7 km,7
4,15 km,15
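
One way to pull the number out of strings such as `15 km` is a regular expression. The sketch below, with hypothetical values, compares it with fixed-width slicing, which silently truncates a three-digit distance.

```python
import pandas as pd

# Hypothetical distances; "100 km" is included to expose the slicing problem.
distance_text = pd.Series(["1 km", "6 km", "15 km", "100 km"])

# Fixed-width slicing keeps only the first two characters.
first_two = distance_text.str[:2].str.strip().astype(int)

# Extracting the leading run of digits works for any length.
digits = distance_text.str.extract(r"(\d+)", expand=False).astype(int)

print(pd.DataFrame({"text": distance_text, "slice": first_two, "regex": digits}))
```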


In [12]:
# Drop the Year, HiringDate and DistanceFromHome_km columns because they are no longer useful.
EmpleadosAttrition = EmpleadosAttrition.drop(['Year', 'HiringDate', 'DistanceFromHome_km'], axis = 1)

# Verify

EmpleadosAttrition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       400 non-null    int64 
 1   BusinessTravel            396 non-null    object
 2   Department                400 non-null    object
 3   Education                 400 non-null    int64 
 4   EducationField            400 non-null    object
 5   EnvironmentSatisfaction   400 non-null    int64 
 6   Gender                    400 non-null    object
 7   JobInvolvement            400 non-null    int64 
 8   JobLevel                  400 non-null    int64 
 9   JobRole                   400 non-null    object
 10  JobSatisfaction           400 non-null    int64 
 11  MaritalStatus             395 non-null    object
 12  MonthlyIncome             400 non-null    int64 
 13  NumCompaniesWorked        400 non-null    int64 
 14  OverTime                  

In [13]:
# Build a new frame with the average MonthlyIncome per department and store it in SueldoPromedioDepto (the average goes in the SueldoPromedio column)
SueldoPromedioDepto = (EmpleadosAttrition.groupby("Department", as_index=False)["MonthlyIncome"].mean().rename(columns={"MonthlyIncome": "SueldoPromedio"}))

# Verify
SueldoPromedioDepto

Unnamed: 0,Department,SueldoPromedio
0,Human Resources,6239.888889
1,Research & Development,6804.149813
2,Sales,7188.25
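
The same per-department average can also be written with named aggregation, which sets the output column name in one step. A minimal sketch on hypothetical incomes (the numbers are made up):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Human Resources", "Research & Development"],
    "MonthlyIncome": [4000, 6000, 5000, 7000],
})

# Named aggregation: output column SueldoPromedio = mean of MonthlyIncome.
avg = toy.groupby("Department", as_index=False).agg(
    SueldoPromedio=("MonthlyIncome", "mean")
)

print(avg)
```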


In [18]:
# Min-max normalization of MonthlyIncome to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
EmpleadosAttrition[["MonthlyIncome"]] = scaler.fit_transform(EmpleadosAttrition[["MonthlyIncome"]])

# Verify
EmpleadosAttrition[["MonthlyIncome"]]

Unnamed: 0,MonthlyIncome
0,0.864269
1,0.207340
2,0.088062
3,0.497574
4,0.664470
...,...
395,0.075248
396,0.187197
397,0.589327
398,0.121124
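
Min-max normalization maps each value x to (x - min) / (max - min), so the smallest income becomes 0 and the largest becomes 1. A short sketch on hypothetical incomes that checks `MinMaxScaler` against the formula computed by hand:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical incomes, not taken from the real dataset.
income = pd.DataFrame({"MonthlyIncome": [1000.0, 2500.0, 4000.0, 10000.0]})

# Scaler output as a flat array.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(income)[:, 0]

# Same transformation written out explicitly.
col = income["MonthlyIncome"]
manual = ((col - col.min()) / (col.max() - col.min())).to_numpy()

print(scaled)
print(np.allclose(scaled, manual))
```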


In [29]:
# Convert all categorical variables to numeric codes.
# Note: cat.codes numbers the categories in sorted order and encodes missing values as -1,
# so the nulls in BusinessTravel and MaritalStatus become -1 rather than NaN.
categorical_cols = ["BusinessTravel", "Department", "EducationField", "Gender", "JobRole", "MaritalStatus", "Attrition", "OverTime"]
for col in categorical_cols:
    EmpleadosAttrition[col] = EmpleadosAttrition[col].astype("category").cat.codes

# Verify
EmpleadosAttrition[categorical_cols]

Unnamed: 0,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Attrition,OverTime
0,3,1,3,1,5,1,0,0
1,3,1,3,1,4,1,0,0
2,3,2,2,1,8,3,1,0
3,3,1,1,1,0,3,0,0
4,3,1,3,1,3,2,1,1
...,...,...,...,...,...,...,...,...
395,3,1,3,1,2,2,1,1
396,3,2,1,0,7,2,1,1
397,2,1,4,1,5,1,0,1
398,3,1,3,0,2,2,0,0
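
Category codes are assigned in sorted category order, and a missing value receives the code -1 instead of staying null. The sketch below uses hypothetical values to show the mapping and one possible way to turn -1 back into NaN so that a later imputation step can see it (that last step is a suggestion, not something the notebook does).

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
travel = pd.Series(["Travel_Rarely", np.nan, "Non-Travel", "Travel_Frequently"])

cat = travel.astype("category")

# Code -> label mapping, useful for interpreting the encoded column later.
mapping = dict(enumerate(cat.cat.categories))

# Integer codes; the missing entry is encoded as -1.
codes = cat.cat.codes

# Optional: restore missing values as NaN (result is a float column).
codes_or_nan = codes.where(codes >= 0)

print(mapping)
print(codes.tolist())
```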


In [30]:
# Compute the linear (Pearson) correlation of each variable with Attrition.
for col in EmpleadosAttrition.columns.drop("Attrition"):
    print(f"{col}:", EmpleadosAttrition["Attrition"].corr(EmpleadosAttrition[col]))

Age: -0.21212111259206842
BusinessTravel: 0.08289944709936396
Department: 0.05423584846525848
Education: -0.05553079310984968
EducationField: 0.051184319893178454
EnvironmentSatisfaction: -0.12432694416927641
Gender: -0.02883870932201118
JobInvolvement: -0.16678478967808572
JobLevel: -0.21426604317577613
JobRole: 0.07868359123827778
JobSatisfaction: -0.16495679045325196
MaritalStatus: 0.18040360207821057
MonthlyIncome: -0.19493601628571225
NumCompaniesWorked: -0.009081775325922558
OverTime: 0.3247769883361847
PercentSalaryHike: -0.06087967966387507
PerformanceRating: -0.00647103495364531
RelationshipSatisfaction: -0.03094453065814358
TotalWorkingYears: -0.21332903741456125
TrainingTimesLastYear: -0.07088383377930298
WorkLifeBalance: -0.021722932349419282
YearsInCurrentRole: -0.2039180268902555
YearsSinceLastPromotion: -0.06900034357702403
YearsAtCompany: -0.1762870458375972
DistanceFromHome: 0.05273215994064873
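
Several of the strongest relationships above are negative (for example Age and JobLevel), so a threshold on correlation is usually applied to the absolute value. A minimal sketch, on a hypothetical toy frame, of computing all correlations at once with `corrwith` and filtering by magnitude:

```python
import pandas as pd

# Hypothetical toy data; "Noise" is built to be uncorrelated with Attrition.
toy = pd.DataFrame({
    "Attrition": [0, 0, 1, 1, 0, 1],
    "OverTime":  [0, 0, 1, 1, 0, 0],
    "Age":       [50, 45, 22, 25, 41, 30],
    "Noise":     [1, 2, 2, 1, 2, 2],
})

# Pearson correlation of every predictor with the target.
corr = toy.drop(columns="Attrition").corrwith(toy["Attrition"])

# Keep predictors whose correlation magnitude reaches the threshold.
selected = corr[corr.abs() >= 0.1].index.tolist()

print(corr)
print(selected)
```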


In [31]:
# Select only the variables whose correlation with Attrition has an absolute value of at least 0.1 (Attrition itself is kept as the target)
EmpleadosAttritionFinal =  EmpleadosAttrition[["Age", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "Attrition", "OverTime", "TotalWorkingYears", "YearsInCurrentRole", "YearsAtCompany"]]

EmpleadosAttritionFinal

Unnamed: 0,Age,EnvironmentSatisfaction,JobInvolvement,JobLevel,JobSatisfaction,MaritalStatus,MonthlyIncome,Attrition,OverTime,TotalWorkingYears,YearsInCurrentRole,YearsAtCompany
0,50,4,3,4,4,1,0.864269,0,0,32,4,5
1,36,2,3,2,2,1,0.207340,0,0,7,2,3
2,21,2,3,1,2,3,0.088062,1,0,1,0,1
3,52,2,3,3,2,3,0.497574,0,0,18,6,8
4,33,2,3,3,3,2,0.664470,1,1,15,6,7
...,...,...,...,...,...,...,...,...,...,...,...,...
395,33,3,3,1,4,2,0.075248,1,1,8,4,5
396,31,2,1,2,3,2,0.187197,1,1,4,2,2
397,37,2,3,3,4,1,0.589327,0,1,10,8,10
398,38,4,3,1,3,2,0.121124,0,0,7,0,0


In [37]:
# Create a new variable called EmpleadosAttritionPCA made of the principal components of the EmpleadosAttritionFinal frame.

# X as a float array, so pd.NA becomes np.nan
X = EmpleadosAttritionFinal.drop(columns=["Attrition"]).to_numpy(dtype=float)
y = EmpleadosAttritionFinal["Attrition"].to_numpy()

# Impute nulls with the median
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# Scale and apply PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

pca = PCA(n_components=2)
pca.fit(X_scaled)

print(pca.components_)
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)

EmpleadosAttritionPCA = pca.transform(X_scaled)
print("Shape:", EmpleadosAttritionPCA.shape)
print(EmpleadosAttritionPCA[:5, :])

[[ 0.30537203 -0.02827751 -0.0137137   0.46488144  0.01549716 -0.0640889
   0.45738843 -0.02787712  0.46033045  0.33640561  0.38761291]
 [-0.42194462  0.06996263 -0.10448744 -0.18905516  0.25185805 -0.32073532
  -0.20417047  0.03176739 -0.17761529  0.55833125  0.46704297]]
[3.71492324 1.26243579]
[0.33687599 0.11447997]
Shape: (400, 2)
[[ 3.23057527 -1.27710493]
 [-1.02007606 -0.20010152]
 [-3.00303278 -0.37966809]
 [ 1.62964487 -1.49253808]
 [ 1.05559825  0.00524667]]
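
The explained variance ratios printed above show that two components cover only about 45% of the variance (0.337 + 0.114). Accumulating the ratios of a full PCA gives the smallest number of components that reaches a target such as 80%, which is what a fractional `n_components` does in scikit-learn. The sketch below uses hypothetical random data, not the employee data, to compare both routes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 6 features, the last three are noisy copies of the first three.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

X_scaled = StandardScaler().fit_transform(X)

# Full PCA, then cumulative share of variance per number of components.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 80%.
n_needed = int(np.searchsorted(cumulative, 0.80, side="right") + 1)

# Same selection delegated to scikit-learn via a fractional n_components.
pca80 = PCA(n_components=0.80).fit(X_scaled)

print("Cumulative variance:", cumulative)
print("Components needed:", n_needed, "| chosen by PCA:", pca80.n_components_)
```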


In [39]:
# Add the minimum number of principal components that explain at least 80% of the variance as columns of the frame (here, EmpleadosAttritionFinal)

# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
EmpleadosAttritionFinal = EmpleadosAttritionFinal.copy()

# Impute nulls in the predictor columns with the median and make sure every column is numeric
for col in EmpleadosAttritionFinal.columns:
    if col != "Attrition" and EmpleadosAttritionFinal[col].isna().any():
        EmpleadosAttritionFinal[col] = EmpleadosAttritionFinal[col].fillna(EmpleadosAttritionFinal[col].median())

EmpleadosAttritionFinal = EmpleadosAttritionFinal.apply(pd.to_numeric, errors="coerce")

# Separate X and scale it
X = EmpleadosAttritionFinal.drop(columns=["Attrition"]).values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA targeting a cumulative explained variance >= 80%
pca80 = PCA(n_components=0.80)
X_pca80 = pca80.fit_transform(X_scaled)

# Add columns C0..C{n-1} to the DataFrame
n_comp = pca80.n_components_
pc_names = [f"C{i}" for i in range(n_comp)]
pca_df = pd.DataFrame(X_pca80, columns=pc_names, index=EmpleadosAttritionFinal.index)

for name in pc_names:
    EmpleadosAttritionFinal[name] = pca_df[name]

# Verify
print("Components added:", pc_names)
print("Cumulative explained variance:", np.sum(pca80.explained_variance_ratio_))
print("Updated shape:", EmpleadosAttritionFinal.shape)

Components added: ['C0', 'C1', 'C2', 'C3', 'C4', 'C5']
Cumulative explained variance: 0.820008227531978
Updated shape: (400, 18)


In [41]:
EmpleadosAttritionFinal.to_csv("EmpleadosAttritionFinal.csv", index=False)