# Final Project: Data Mining

Predicting employee resignations

In [66]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#-------
from sklearn.preprocessing import MinMaxScaler

## Data ingestion

The data is loaded after the preprocessing done on the Dataiku platform.

In [22]:
data=pd.read_csv(r'data/predictive_data_joined_prepared.csv', sep=',')
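# Assign the result back: calling replace(..., inplace=True) on a selected column is a chained assignment
data['Retired'] = data['Retired'].replace({1: 'Retired', 0: 'Not Retired'})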

In [23]:
data.dtypes.sort_index()

Age                          int64
BusinessTravel              object
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EnvironmentSatisfaction    float64
Gender                      object
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction            float64
MaritalStatus               object
MonthlyIncome                int64
NumCompaniesWorked           int64
PercentSalaryHike            int64
PerformanceRating            int64
Retired                     object
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance            float64
YearsAtCompany               int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object

In [25]:
for i in data.columns:
    if data[i].dtype==object:
        data[i]=data[i].astype('category')

In [46]:
data.drop(columns=['StandardHours'],inplace=True)

## Factor selection

Identify the key factors behind staff resignations.

In [47]:
df_corr=data.copy()

In [48]:
dropf=['Gender','Retired']
notdropf=['BusinessTravel','Department','EducationField',
          'JobRole','MaritalStatus']

In [49]:
df_corr=pd.get_dummies(df_corr, columns=notdropf, drop_first=False,dtype=int)
df_corr=pd.get_dummies(df_corr, columns=dropf, drop_first=True,dtype=int)

In [61]:
df_corr['StockOptionLevel']

0       0
1       1
2       3
3       3
4       2
       ..
4405    1
4406    0
4407    0
4408    1
4409    0
Name: StockOptionLevel, Length: 4410, dtype: int64

In [59]:
df_corr.corr().iloc[:,-1:].abs().sort_values(by='Retired_Retired')

| Variable | Retired_Retired |
|---|---|
| StockOptionLevel | 0.002172 |
| JobRole_Laboratory Technician | 0.003482 |
| DistanceFromHome | 0.004398 |
| EducationField_Marketing | 0.004832 |
| Gender_Male | 0.007092 |
| EducationField_Medical | 0.007377 |
| JobLevel | 0.009296 |
| JobRole_Sales Representative | 0.011686 |
| EducationField_Life Sciences | 0.012745 |
| JobRole_Human Resources | 0.012799 |


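According to our correlation analysis, _the variables most strongly associated with whether an employee leaves the company_ are the following. All of these correlations are weak (below 0.17 in absolute value), and the last row is only the target's correlation with itself: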

In [60]:
df_corr.corr().iloc[:,-1:].abs().sort_values(by='Retired_Retired').tail(6)

| Variable | Retired_Retired |
|---|---|
| YearsAtCompany | 0.128184 |
| Age | 0.141897 |
| YearsWithCurrManager | 0.144643 |
| TotalWorkingYears | 0.158626 |
| MaritalStatus_Single | 0.162858 |
| Retired_Retired | 1.0 |
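The ranking above can also be computed without the target appearing in its own list. The following is a minimal sketch on a small made-up frame (the `toy`, `signal`, and `noise` names are hypothetical, not project data); it uses `DataFrame.corrwith` to correlate each feature with the target column directly instead of slicing the last column of the full correlation matrix.

```python
import pandas as pd

# Hypothetical toy data standing in for df_corr (not the project's dataset)
toy = pd.DataFrame({
    "Retired_Retired": [0, 0, 0, 0, 1, 1, 1, 1],
    "signal":          [1, 2, 1, 2, 3, 4, 3, 4],  # moves with the target
    "noise":           [1, 2, 2, 1, 1, 2, 2, 1],  # same mean in both groups
})

# Correlate every feature with the target, leaving the target itself out
features = toy.drop(columns="Retired_Retired")
abs_corr = (
    features.corrwith(toy["Retired_Retired"])
    .abs()
    .sort_values(ascending=False)
)
print(abs_corr)
```

Selecting the target by name also avoids depending on `Retired_Retired` being the last column after `get_dummies`.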


Variables whose absolute correlation with the target variable is `< 0.01` were considered irrelevant.

In [63]:
# Irrelevant variables: absolute correlation with the target < 0.01
irrelevantes = df_corr.corr().iloc[:, -1:].abs()
irrelevantes = irrelevantes[irrelevantes['Retired_Retired'] < 0.01]
irrelevantes

| Variable | Retired_Retired |
|---|---|
| DistanceFromHome | 0.004398 |
| JobLevel | 0.009296 |
| StockOptionLevel | 0.002172 |
| EducationField_Marketing | 0.004832 |
| EducationField_Medical | 0.007377 |
| JobRole_Laboratory Technician | 0.003482 |
| Gender_Male | 0.007092 |


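Therefore, the variables considered irrelevant for the classification exercises are the ones listed below. The dummy levels `EducationField_Marketing`, `EducationField_Medical`, and `JobRole_Laboratory Technician` also fall below the threshold, but other levels of `EducationField` and `JobRole` exceed it, so those two variables are kept: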

- `DistanceFromHome`

- `JobLevel`

- `StockOptionLevel`

- `Gender`
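One plausible way to get from the per-dummy table to this per-variable list is to treat a categorical variable as irrelevant only when every one of its dummy columns falls below the threshold. The sketch below assumes that rule (the notebook does not state it explicitly). The correlation values are a subset copied from the outputs above, and `source_variable` is a hypothetical helper.

```python
import pandas as pd

# Absolute correlations copied from the outputs above (subset, for illustration)
abs_corr = pd.Series({
    "DistanceFromHome": 0.004398,
    "JobLevel": 0.009296,
    "StockOptionLevel": 0.002172,
    "Gender_Male": 0.007092,
    "EducationField_Marketing": 0.004832,
    "EducationField_Medical": 0.007377,
    "EducationField_Life Sciences": 0.012745,
    "Age": 0.141897,
})

# Original columns that were one-hot encoded in this subset
categorical = ["EducationField", "Gender"]

def source_variable(column):
    """Map a dummy column such as 'Gender_Male' back to 'Gender'."""
    for name in categorical:
        if column.startswith(name + "_"):
            return name
    return column

# Assumed rule: a variable is irrelevant only if its strongest dummy is below 0.01
max_by_variable = abs_corr.groupby(source_variable).max()
irrelevant = sorted(max_by_variable[max_by_variable < 0.01].index)
print(irrelevant)
```

With these values, `EducationField` is kept because its `Life Sciences` level reaches 0.012745, while `Gender` is dropped because its only dummy stays below the threshold.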


## Mining

### **Predictive (Classification)**


1. **Data preparation:** numeric variables must be normalized, and categorical variables are converted to dummies
2. **Model training:** KMeans, elbow/knee method
3. **Model evaluation:** inertia, silhouette
4. **Profiling:** description of the centroids

#### Preparation

In [69]:
dfc=data.copy()

In [70]:
dropf=['Gender','Retired']
notdropf=['BusinessTravel','Department','EducationField',
          'JobRole','MaritalStatus']

In [71]:
dfc=pd.get_dummies(dfc, columns=notdropf, drop_first=False,dtype=int)
dfc=pd.get_dummies(dfc, columns=dropf, drop_first=True,dtype=int)

In [72]:

# 1. Scale every non-categorical column to the [0, 1] range
variables_numericas = [i for i in dfc.columns if dfc[i].dtype != "category"]
min_max_scaler = MinMaxScaler()
dfc[variables_numericas] = min_max_scaler.fit_transform(dfc[variables_numericas])
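The cell above fits the scaler on the full dataset. If the models are later evaluated on a held-out split, an alternative is to fit `MinMaxScaler` on the training rows only and reuse it on the test rows, so that test-set minima and maxima do not influence the preprocessing. This is a minimal sketch of that variant on made-up data; the toy frame, the split size, and the `random_state` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for dfc (not the project's dataset)
toy = pd.DataFrame({
    "Age": [22, 35, 47, 51, 29, 60, 33, 41],
    "MonthlyIncome": [2000, 5200, 8100, 9900, 3100, 12000, 4500, 7000],
    "Retired_Retired": [1, 0, 0, 0, 1, 0, 1, 0],
})
X = toy.drop(columns="Retired_Retired")
y = toy["Retired_Retired"]

# Split first, then learn the scaling parameters from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = MinMaxScaler().fit(X_train)

X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)
print(X_train_scaled.describe().loc[["min", "max"]])
```

Test rows scaled this way can fall slightly outside [0, 1] when they contain values more extreme than any seen in training.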

In [77]:
print(
    f'After creating the dummies, the number of columns went from {data.columns.size} to {dfc.columns.size}'
)

After creating the dummies, the number of columns went from 25 to 44


#### Training

Three different classification models will be trained:

- Neural Networks
- Logistic Regression
- KNN
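As a rough sketch of how the three models listed above could be trained and compared with scikit-learn, the example below uses synthetic data from `make_classification` in place of the prepared dataset. The split, the hyperparameters, and the use of accuracy as the metric are all illustrative assumptions; the notebook does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared data (not the project's dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale with parameters learned from the training split only
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Hyperparameters are illustrative assumptions, not tuned values
models = {
    "Neural network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=0
    ),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the held-out split

for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Keeping the models in a dictionary makes it easy to apply the same split and the same metric to each one.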

In [None]:
# Logistic regression