## Downloading and Saving the Dataset

In [6]:
import kagglehub
import os

os.chdir("/content/churn-analysis/dataset")
# Download latest version
path = kagglehub.dataset_download("pavansubhasht/ibm-hr-analytics-attrition-dataset")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/versions/1


In [8]:
import shutil

# Default directory used by kagglehub (the path printed by the download cell)
default_path = "/root/.cache/kagglehub/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/versions/1"

# Custom directory where the files should be moved
custom_dir = "/content/churn-analysis/dataset"

# Move the files
shutil.move(default_path, custom_dir)

print(f"Files moved to: {custom_dir}")

Files moved to: /content/churn-analysis/dataset
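
One caveat about the move above: when the destination passed to `shutil.move` is an existing directory, the source directory is placed inside it rather than merged with it, so the CSV may end up one level deeper than expected. The following is a minimal sketch of moving each entry individually instead, assuming the goal is for the destination to contain the dataset files directly. The helper name `move_contents` is hypothetical and not part of the original notebook.

```python
import os
import shutil


def move_contents(src_dir, dst_dir):
    """Move every entry of src_dir directly into dst_dir and return the new paths."""
    # Create the destination if it does not exist yet.
    os.makedirs(dst_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(src_dir)):
        target = os.path.join(dst_dir, name)
        # Moving to an explicit target path avoids nesting src_dir inside dst_dir.
        shutil.move(os.path.join(src_dir, name), target)
        moved.append(target)
    return moved
```

In the notebook this would be called as `move_contents(default_path, custom_dir)`, after which the files should sit directly under `custom_dir`.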


## Checking the Structure and State of the Data

In [19]:
import pandas as pd

dataset_path = "/content/churn-analysis/dataset"

df = pd.read_csv(f"{dataset_path}/WA_Fn-UseC_-HR-Employee-Attrition.csv")

print(df.head())

   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

   DistanceFromHome  Education EducationField  EmployeeCount  EmployeeNumber  \
0                 1          2  Life Sciences              1               1   
1                 8          1  Life Sciences              1               2   
2                 2          2          Other              1               4   
3                 3          4  Life Sciences              1               5   
4                 2          1        Medical              1               7   

   ...  RelationshipSatisfaction StandardHours  StockOptionLevel  \
0  ...

In [12]:
# Summarize the data to identify null values
# (df.info() prints the summary itself and returns None, so no print() is needed)
df.info()

# Count null values in each column
null_counts = df.isnull().sum()
print("Null values per column:\n", null_counts)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [15]:
# Select all columns with dtype object
object_columns = df.select_dtypes(include=['object']).columns

# Iterate over each column and print its unique values
for col in object_columns:
    print(f"Column: {col}")
    print(df[col].unique())
    print("-" * 40)  # Separator line between columns

Column: Attrition
['Yes' 'No']
----------------------------------------
Column: BusinessTravel
['Travel_Rarely' 'Travel_Frequently' 'Non-Travel']
----------------------------------------
Column: Department
['Sales' 'Research & Development' 'Human Resources']
----------------------------------------
Column: EducationField
['Life Sciences' 'Other' 'Medical' 'Marketing' 'Technical Degree'
 'Human Resources']
----------------------------------------
Column: Gender
['Female' 'Male']
----------------------------------------
Column: JobRole
['Sales Executive' 'Research Scientist' 'Laboratory Technician'
 'Manufacturing Director' 'Healthcare Representative' 'Manager'
 'Sales Representative' 'Research Director' 'Human Resources']
----------------------------------------
Column: MaritalStatus
['Single' 'Married' 'Divorced']
----------------------------------------
Column: Over18
['Y']
----------------------------------------
Column: OverTime
['Yes' 'No']
----------------------------------------


The listings above show that the non-integer columns contain no null values either: every entry is one of the expected categories.
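
The visual check above can be complemented with a compact table. This is a minimal sketch, assuming that "missing" may also mean a blank or whitespace-only string. The function name `summarize_object_columns` and its output column names are hypothetical, chosen for illustration only.

```python
import pandas as pd


def summarize_object_columns(df):
    """Return null, blank-string, and distinct-value counts per object column."""
    rows = []
    for col in df.select_dtypes(include=["object"]).columns:
        series = df[col]
        # Count entries that are empty or whitespace-only once nulls are dropped.
        blanks = series.dropna().astype(str).str.strip().eq("").sum()
        rows.append({
            "column": col,
            "nulls": int(series.isnull().sum()),
            "blank_strings": int(blanks),
            "n_unique": int(series.nunique()),  # nunique() ignores nulls
        })
    summary = pd.DataFrame(rows, columns=["column", "nulls", "blank_strings", "n_unique"])
    return summary.set_index("column")
```

Calling `summarize_object_columns(df)` on the notebook's DataFrame would give one row per categorical column. A column whose `n_unique` is 1, as `Over18` appears to be in the output above, carries no information for distinguishing employees.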


In [18]:
# Select all columns with dtype int64
int_columns = df.select_dtypes(include=['int64']).columns

# Print summary statistics and check for negative values
for col in int_columns:
    print(f"Column: {col}")
    stats = df[col].describe()  # Summary statistics for the column
    print(stats)

    # Check whether the minimum value is below 0
    if stats['min'] < 0:
        print(f"Warning: column '{col}' contains negative values!")
    else:
        print("No negative values.")
    print("-" * 40)  # Separator line between columns


Column: Age
count    1470.000000
mean       36.923810
std         9.135373
min        18.000000
25%        30.000000
50%        36.000000
75%        43.000000
max        60.000000
Name: Age, dtype: float64
No negative values.
----------------------------------------
Column: DailyRate
count    1470.000000
mean      802.485714
std       403.509100
min       102.000000
25%       465.000000
50%       802.000000
75%      1157.000000
max      1499.000000
Name: DailyRate, dtype: float64
No negative values.
----------------------------------------
Column: DistanceFromHome
count    1470.000000
mean        9.192517
std         8.106864
min         1.000000
25%         2.000000
50%         7.000000
75%        14.000000
max        29.000000
Name: DistanceFromHome, dtype: float64
No negative values.
----------------------------------------
Column: Education
count    1470.000000
mean        2.912925
std         1.024165
min         1.000000
25%         2.000000
50%         3.000000
75%     