# Data Profiling

This notebook will:
- Use YData Profiling to comprehensively profile the raw data
    - Visualise correlations between features
    - Flag alerts found within the data
- Save an HTML profile report to `../reports/hr_attrition_profiling.html`

In [None]:
# ---
# HR Attrition Dataset - Data Profiling
# ---

import pandas as pd
from ydata_profiling import ProfileReport

# Load data
df = pd.read_csv("../data/raw/hr_attrition_dataset.csv")

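# Quick peek
# Note: only df.info() produces output below, because it prints directly.
# The bare df.head() and df.shape expressions are evaluated but not displayed,
# since a notebook cell only echoes its last expression.
df.head()
df.shape
df.info()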

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 37 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   employee_id                 25000 non-null  int64  
 1   snapshot_date               25000 non-null  object 
 2   hire_date                   25000 non-null  object 
 3   region                      25000 non-null  object 
 4   department                  25000 non-null  object 
 5   role                        25000 non-null  object 
 6   level                       25000 non-null  object 
 7   is_manager                  25000 non-null  int64  
 8   age                         25000 non-null  int64  
 9   gender                      25000 non-null  object 
 10  remote_status               25000 non-null  object 
 11  commute_km                  25000 non-null  float64
 12  tenure_years                25000 non-null  float64
 13  base_salary                 250
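
The listing above shows `snapshot_date` and `hire_date` stored as `object` (string) columns. A minimal sketch of converting them to datetimes before any date arithmetic follows. The toy values and the ISO `YYYY-MM-DD` format are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: the column names come from the df.info() listing,
# but the values and their date format are invented for this sketch.
toy = pd.DataFrame({
    "snapshot_date": ["2024-06-30", "2024-06-30"],
    "hire_date": ["2019-03-01", "2022-11-15"],
})

date_columns = ["snapshot_date", "hire_date"]
for col in date_columns:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Count values that failed to parse, so silent coercion does not go unnoticed
n_unparsed = int(toy[date_columns].isna().sum().sum())

# With real datetimes, differences become straightforward
days_employed = (toy["snapshot_date"] - toy["hire_date"]).dt.days
```

On the real data, the same loop could be applied to `df` once the actual date format has been checked.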

In [6]:
profile = ProfileReport(df, title="HR Attrition Dataset Profiling Report", explorative=True)
profile.to_file("../reports/hr_attrition_profiling.html")

Summarize dataset: 100%|██████████| 447/447 [00:33<00:00, 13.39it/s, Completed]                                                   
Generate report structure: 100%|██████████| 1/1 [00:06<00:00,  6.93s/it]
Render HTML: 100%|██████████| 1/1 [00:05<00:00,  5.82s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 67.73it/s]


## Correlation Heatmap

<img src="../reports/Correlations.png" width="50%">
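
To cross-check individual cells of the heatmap, the correlations can also be computed directly in pandas. The sketch below uses Pearson correlation on a small invented frame. The measure used in the profiling report may differ, and the values here are hypothetical (only the column names are borrowed from the dataset).

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "age": [25, 32, 41, 50, 58],
    "tenure_years": [1.0, 4.0, 9.0, 15.0, 20.0],
    "commute_km": [30.0, 5.0, 18.0, 2.0, 25.0],
    "region": ["N", "S", "N", "E", "W"],
})

# Keep numeric columns only; string columns such as region cannot be correlated this way
numeric = toy.select_dtypes(include="number")
corr = numeric.corr(method="pearson")

# Rank each unordered column pair by absolute correlation, strongest first
pairs = sorted(
    ((a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, value in pairs:
    print(f"{a} vs {b}: {value:.2f}")
```

Replacing `toy` with `df` gives the same ranking for the real dataset, which makes it easy to confirm the strongest relationships visible in the heatmap.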

## Alerts

<img src="../reports/Alerts.png" width="50%">

## Double-check quality

In [None]:
# Note: only the last expression in this cell (the duplicate count) is displayed.
# Check missing values
df.isnull().sum().sort_values(ascending=False)

# Check unique values per column (helps detect categorical vs numeric)
df.nunique().sort_values()

# Look for duplicates
df.duplicated().sum()

0
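
Because a cell only displays its last expression, the three checks above can be combined into a single table so that all results are visible at once. The helper below is a minimal sketch. `quality_summary` is a hypothetical name, and the toy frame is invented for illustration.

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with dtype, missing counts and unique counts."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean() * 100,
        "n_unique": frame.nunique(),  # NaN values are not counted as a category
    })
    # Columns with the most missing values come first
    return summary.sort_values("n_missing", ascending=False)


# Hypothetical toy data: one missing region and one fully duplicated row
toy = pd.DataFrame({
    "employee_id": [1, 2, 3, 3],
    "region": ["N", None, "S", "S"],
    "age": [30, 41, 29, 29],
})

report = quality_summary(toy)
n_duplicates = int(toy.duplicated().sum())

print(report)
print(f"Duplicate rows: {n_duplicates}")
```

Calling `quality_summary(df)` on the real dataset would show the missing, unique and dtype information side by side, with the duplicate count printed separately.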