## The Goal of Collecting This Dataset

The IBM HR Analytics Employee Attrition dataset was collected to analyze employee attrition (whether employees leave the company) based on various factors such as age, salary, job role, and work-life balance. The primary goal is to predict whether an employee is likely to leave the company and identify the key factors influencing this decision.


## Source of the Dataset

The dataset was obtained from Kaggle and is available at the following link:

[IBM HR Analytics Employee Attrition Dataset on Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)


## General Information about the Dataset

- **Number of Attributes (Columns)**: 35 attributes
- **Number of Objects (Rows)**: 1470 records
- **Class Name/Label**: Attrition (Yes/No)

   
## Types of Attributes

| Numerical Attributes       | Categorical Attributes  |
|----------------------------|-------------------------|
| Age                        | Attrition               |
| DailyRate                  | BusinessTravel          |
| DistanceFromHome           | Department              |
| Education                  | EducationField          |
| EmployeeCount              | Gender                  |
| EmployeeNumber             | JobRole                 |
| EnvironmentSatisfaction    | MaritalStatus           |
| HourlyRate                 | Over18                  |
| JobInvolvement             | OverTime                |
| JobLevel                   |                         |
| JobSatisfaction            |                         |
| MonthlyIncome              |                         |
| MonthlyRate                |                         |
| NumCompaniesWorked         |                         |
| PercentSalaryHike          |                         |
| PerformanceRating          |                         |
| RelationshipSatisfaction   |                         |
| StandardHours              |                         |
| StockOptionLevel           |                         |
| TotalWorkingYears          |                         |
| TrainingTimesLastYear      |                         |
| WorkLifeBalance            |                         |
| YearsAtCompany             |                         |
| YearsInCurrentRole         |                         |
| YearsSinceLastPromotion    |                         |
| YearsWithCurrManager       |                         |
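A split like the one in the table can be approximated from the column dtypes. The snippet below is a minimal sketch, assuming a small toy frame (`toy`, a hypothetical name) that reuses four column names and the first three rows of values from the dataset rather than the full CSV. Note that a dtype-based split treats integer-coded ratings such as `Education` as numerical, even though they may be better understood as ordinal categories.

```python
import pandas as pd

# Toy frame (hypothetical): four of the dataset's columns, first three rows only.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

# Columns stored as numbers vs. everything else (strings here)
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Running the same two `select_dtypes` calls on the full `df` should give lists comparable to the two table columns.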



In [6]:

import pandas as pd

# Load the dataset
df = pd.read_csv('Dataset/Dataset_HR_Employee-Attrition.csv')

# Display the first 5 rows of the dataset
print(df.head())

# Display the number of rows and columns
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}")

# Display information about the columns and their data types
# (df.info() prints its report directly and returns None, so no print() is needed)
df.info()





   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

   DistanceFromHome  Education EducationField  EmployeeCount  EmployeeNumber  \
0                 1          2  Life Sciences              1               1   
1                 8          1  Life Sciences              1               2   
2                 2          2          Other              1               4   
3                 3          4  Life Sciences              1               5   
4                 2          1        Medical              1               7   

   ...  RelationshipSatisfaction StandardHours  StockOptionLevel  \
0  ...

### Phase 2: Data Summarization and Preprocessing


1. Taking a Sample of the Dataset


We take a random sample of 20 employees from the dataset using the pandas `sample()` method. A small subset is easier to inspect by eye before working with the full dataset.

In [4]:
import numpy as np
import pandas as pd

# Load the dataset
df = pd.read_csv('Dataset/Dataset_HR_Employee-Attrition.csv')

# Set the NumPy global seed for reproducibility
# (df.sample() draws from NumPy's global random state when random_state is not given)
np.random.seed(30)

# Take a random sample of 20 rows from the dataset
sample = df.sample(n=20)
print(sample)

      Age Attrition     BusinessTravel  DailyRate              Department  \
461    35        No      Travel_Rarely        195                   Sales   
640    24        No         Non-Travel       1269  Research & Development   
509    33        No  Travel_Frequently       1296  Research & Development   
788    28        No      Travel_Rarely        857  Research & Development   
950    31        No         Non-Travel        587                   Sales   
1127   23        No      Travel_Rarely        977  Research & Development   
706    40       Yes         Non-Travel       1479                   Sales   
645    29       Yes      Travel_Rarely        341                   Sales   
414    24       Yes      Travel_Rarely       1448                   Sales   
547    42       Yes  Travel_Frequently        933  Research & Development   
1173   36        No      Travel_Rarely        711  Research & Development   
880    32        No  Travel_Frequently        116  Research & Development   
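As an alternative to seeding NumPy globally, the seed can be passed directly to `sample()` through its `random_state` parameter. The following is a minimal sketch on a hypothetical 100-row stand-in frame (`toy`), not the real dataset. It only illustrates that two calls with the same seed return the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 100 rows with a single ID column.
toy = pd.DataFrame({"EmployeeNumber": range(1, 101)})

# Passing random_state ties the seed to this call instead of NumPy's global state.
sample_a = toy.sample(n=20, random_state=30)
sample_b = toy.sample(n=20, random_state=30)

print(len(sample_a))               # 20 rows, drawn without replacement
print(sample_a.equals(sample_b))   # True: same seed, same rows
```

Keeping the seed local to the call means that other code using NumPy's random functions in between cannot change which rows are drawn.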

2. Checking for Missing Values

Next, we check whether any column contains missing (null) values. Missing values can bias the analysis and would need to be handled (for example, by imputation or removal) before modeling.

In [5]:
# Check for missing values in the dataset
missing_values = df.isnull().sum()
print("Missing Values in each column:")
print(missing_values)

# Check if there are any missing values in total
print("Total Missing Values:", missing_values.sum())


Missing Values in each column:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCu

The dataset contains no null values: every column is complete, so we can proceed with the analysis without any missing-data handling.
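Because the real dataset has no gaps, the check never shows what a positive result looks like. The sketch below assumes an invented toy frame (`toy`) with two deliberately missing entries and reports the count and percentage of missing values per column. The frame and its values are illustrative only and are not taken from the dataset's results.

```python
import numpy as np
import pandas as pd

# Invented toy data with two deliberately missing entries, only to show the check firing.
toy = pd.DataFrame({
    "Age": [41, np.nan, 37, 33],
    "Department": ["Sales", "Research & Development", None, "Sales"],
    "DailyRate": [1102, 279, 1373, 1392],
})

# Count of missing entries per column, and the same as a percentage of rows
missing_count = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)
summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})

# Keep only the columns that actually have gaps
print(summary[summary["missing"] > 0])
print("Any missing values:", bool(toy.isnull().any().any()))
```

Applied to the full `df`, the filtered summary would be empty and the final line would print `False`, which matches the result reported above.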