# TCC Project: ***Impact of People Analytics on organizational decisions for talent identification***

**Objective:** This notebook presents the final course project (Trabalho de Conclusão de Curso, TCC) for the Data Science and Analytics program at USP-Esalq. It follows the CRISP-DM methodology, from the Data Understanding phase through the final Deployment stage.

**Dataset link:** https://www.kaggle.com/datasets/bhrt97/hr-analytics-classification

*For the project documentation, including insights, instructions, and results, see the README.md file in this repository.*

## Import Libraries

In [2]:
%load_ext autotime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 10 s (started: 2025-06-16 22:54:27 -03:00)


In [12]:
# Display settings for tables and cells

pd.set_option('display.max_colwidth', None) # do not truncate column values
pd.set_option('display.max_rows', None) # do not truncate the number of rows displayed
pd.set_option('display.max_columns', None) # do not truncate the number of columns displayed
pd.set_option('display.float_format', '{:.2f}'.format) # show floats with 2 decimal places

time: 0 ns (started: 2025-06-16 23:04:55 -03:00)
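
The `pd.set_option` calls above apply to the whole session. If a setting is only needed for one display, `pd.option_context` can scope it to a block. The sketch below is illustrative only: the `toy` frame and its values are made up and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate option scoping
toy = pd.DataFrame({"score": [63.3912, 13.3716]})

# Remember the session value so we can confirm it is restored afterwards
rows_before = pd.get_option("display.max_rows")

# Options set here are active only inside the with-block
with pd.option_context("display.max_rows", None,
                       "display.float_format", "{:.2f}".format):
    rendered_inside = toy.to_string()

rendered_outside = toy.to_string()
rows_after = pd.get_option("display.max_rows")

print(rendered_inside)   # floats rendered with 2 decimal places
print(rendered_outside)  # floats rendered with the session setting
```

This keeps wide or long displays limited to the cells that need them, which may help when a frame with tens of thousands of rows is printed by accident.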


## Data Input (loading the dataset)

In [3]:
# load only the training set for exploratory analysis
df = pd.read_csv("dataset/train_hr_class.csv")
df.head()

,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


time: 171 ms (started: 2025-06-16 22:56:35 -03:00)


In [6]:
# Rename the columns to make the table easier to handle and understand
df = df.rename(columns={
    "employee_id": "matricula",
    "department": "departamento",
    "region": "regiao",
    "education": "escolaridade",
    "gender": "genero",
    "recruitment_channel": "canal_recrutamento",
    "no_of_trainings": "qtd_treinamentos",
    "age": "idade",
    "previous_year_rating": "avaliacao_anterior",
    "length_of_service": "tempo_empresa",
    "KPIs_met >80%": "kpis_atingidos",
    "awards_won?": "premios",
    "avg_training_score": "media_treinamento",
    "is_promoted": "promovido"
})

df.head()

,matricula,departamento,regiao,escolaridade,genero,canal_recrutamento,qtd_treinamentos,idade,avaliacao_anterior,tempo_empresa,kpis_atingidos,premios,media_treinamento,promovido
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


time: 31 ms (started: 2025-06-16 22:59:50 -03:00)
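
One caveat with `DataFrame.rename`: by default, keys in the mapping that do not match any column are silently ignored, so a typo leaves the original name in place. The sketch below shows one way to catch this. The `toy` frame and the misspelled key `departmnt` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the real dataset
toy = pd.DataFrame({"employee_id": [1], "department": ["Sales"]})

# "departmnt" is misspelled on purpose to show the silent skip
mapping = {"employee_id": "matricula", "departmnt": "departamento"}

# Keys that do not exist in the frame
missing = [col for col in mapping if col not in toy.columns]
print(missing)

# Default behaviour: the unmatched key is ignored, "department" is kept
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))

# errors="raise" turns the silent skip into a KeyError
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Passing `errors="raise"` to the rename call is a small change that may be worth making whenever the source column names contain unusual characters, such as `KPIs_met >80%` or `awards_won?`.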


## Exploratory Data Analysis

In [7]:
# Dataset structure
print(f"Rows: {df.shape[0]} / Columns: {df.shape[1]}")

Rows: 54808 / Columns: 14
time: 0 ns (started: 2025-06-16 23:01:02 -03:00)


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 54808 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   matricula           54808 non-null  int64  
 1   departamento        54808 non-null  object 
 2   regiao              54808 non-null  object 
 3   escolaridade        52399 non-null  object 
 4   genero              54808 non-null  object 
 5   canal_recrutamento  54808 non-null  object 
 6   qtd_treinamentos    54808 non-null  int64  
 7   idade               54808 non-null  int64  
 8   avaliacao_anterior  50684 non-null  float64
 9   tempo_empresa       54808 non-null  int64  
 10  kpis_atingidos      54808 non-null  int64  
 11  premios             54808 non-null  int64  
 12  media_treinamento   54808 non-null  int64  
 13  promovido           54808 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 6.3+ MB
time: 78 ms (started: 2025-06-16 23:03:01 -03:00)
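
The `df.info()` output above shows that only `escolaridade` and `avaliacao_anterior` have fewer non-null values than the 54808 rows. The short calculation below turns those counts into missing-value totals and shares. The numbers are copied from the output above; the variable names are illustrative.

```python
# Non-null counts copied from the df.info() output above
total_rows = 54808
non_null = {"escolaridade": 52399, "avaliacao_anterior": 50684}

missing_counts = {}
for column, count in non_null.items():
    missing = total_rows - count      # rows with no value in this column
    share = missing / total_rows      # fraction of the dataset affected
    missing_counts[column] = missing
    print(f"{column}: {missing} missing ({share:.2%})")
```

On the frame itself, `df.isna().sum()` and `df.isna().mean()` should give the same counts and shares for every column at once.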


In [13]:
# Descriptive statistics
df.describe().T

,count,mean,std,min,25%,50%,75%,max
matricula,54808.0,39195.83,22586.58,1.0,19669.75,39225.5,58730.5,78298.0
qtd_treinamentos,54808.0,1.25,0.61,1.0,1.0,1.0,1.0,10.0
idade,54808.0,34.8,7.66,20.0,29.0,33.0,39.0,60.0
avaliacao_anterior,50684.0,3.33,1.26,1.0,3.0,3.0,4.0,5.0
tempo_empresa,54808.0,5.87,4.27,1.0,3.0,5.0,7.0,37.0
kpis_atingidos,54808.0,0.35,0.48,0.0,0.0,0.0,1.0,1.0
premios,54808.0,0.02,0.15,0.0,0.0,0.0,0.0,1.0
media_treinamento,54808.0,63.39,13.37,39.0,51.0,60.0,76.0,99.0
promovido,54808.0,0.09,0.28,0.0,0.0,0.0,0.0,1.0


time: 62 ms (started: 2025-06-16 23:05:03 -03:00)
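
For a 0/1 column, the mean equals the share of ones. The `promovido` mean of 0.09 in the table above therefore suggests that roughly 9% of the rows are promotions, which points to an imbalanced target. The sketch below illustrates the relationship on a hypothetical ten-row series; the values are made up and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical 0/1 target with one positive in ten rows (not the real data)
promovido = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="promovido")

# Share of each class
shares = promovido.value_counts(normalize=True)

# For a binary column, the mean is the share of ones
positive_rate = promovido.mean()

print(shares)
print(positive_rate)
```

Running `df["promovido"].value_counts(normalize=True)` on the real frame would give the class shares without the two-decimal rounding applied in the `describe()` table.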
