<b>Exploratory Data Analysis</b>

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Read the dataset
avdata = pd.read_csv("attrition_availabledata_17.csv")
compt = pd.read_csv("attrition_competition_17.csv")

In [None]:
print(avdata.shape)
print(compt.shape)

In [None]:
# Define a list for column names
colnames_av = list(avdata)
colnames_co = list(compt)

print(colnames_av)

# Print the unique values for each column
for col in colnames_av:
    print(col, "          ", set(avdata[col]))


In [None]:
print(len(set(avdata["NumCompaniesWorked"])))
print(len(set(avdata["PercentSalaryHike"])))
print(len(set(avdata["TrainingTimesLastYear"])))

The *availabledata* set has 31 columns and 2940 rows, whereas the *competition* set has 30 columns and 1470 rows. The two sets share the same columns, except that *availabledata* has one extra column, *Attrition*, which is the target.
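
The column comparison can also be checked programmatically rather than by eye. The following is a minimal sketch that uses two small hypothetical frames (`toy_av` and `toy_co`, not the real CSV files) to show the idea; with the real data, `avdata` and `compt` would take their place.

```python
import pandas as pd

# Hypothetical stand-ins for avdata and compt (the real frames come from the CSV files)
toy_av = pd.DataFrame({
    "Age": [30, 41],
    "Gender": ["Male", "Female"],
    "Attrition": ["No", "Yes"],
})
toy_co = pd.DataFrame({"Age": [25], "Gender": ["Female"]})

# Columns that appear in one frame but not in the other
only_in_av = sorted(set(toy_av.columns) - set(toy_co.columns))
only_in_co = sorted(set(toy_co.columns) - set(toy_av.columns))

print("Only in available data:", only_in_av)
print("Only in competition data:", only_in_co)
```

If the two real datasets differ only by the target, the first list should contain just `Attrition` and the second should be empty.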

When the data is further examined, we can see that the variables are of the following types:

| Variable Name                 | Variable Type | Cardinality, if not numerical |
|--------------------------|----------|----------|
| hrs                      | Numerical     |      |
| absences                 | Numerical     |      |
| JobInvolvement           | Categorical     | 4     |
| PerformanceRating        | Categorical     | 2     |
| EnvironmentSatisfaction  | Categorical     | 4     |
| JobSatisfaction          | Categorical     | 4     |
| WorkLifeBalance          | Categorical     | 4     |
| Age                      | Numerical     |      |
| BusinessTravel           | Categorical     | 3     |
| Department               | Categorical     | 3     |
| DistanceFromHome         | Numerical     |      |
| Education                | Categorical     | 5     |
| EducationField           | Categorical     | 6     |
| EmployeeCount            | Constant Column     |  1    |
| EmployeeID               | ID Column     |      |
| Gender                   | Binary     |    2  |
| JobLevel                 | Categorical     | 5     |
| JobRole                  | Categorical     | 9     |
| MaritalStatus            | Categorical     | 3     |
| MonthlyIncome            | Numerical     |      |
| NumCompaniesWorked       | Categorical     |  10    |
| Over18                   | Constant Column     |    1  |
| PercentSalaryHike        | Categorical     |   15   |
| StandardHours            | Constant Column     |     1 |
| StockOptionLevel         | Categorical     | 4     |
| TotalWorkingYears        | Numerical     |      |
| TrainingTimesLastYear    | Categorical     | 7     |
| YearsAtCompany           | Numerical     |      |
| YearsSinceLastPromotion  | Numerical     |      |
| YearsWithCurrManager     | Numerical     |      |
| Attrition                | Binary     |  2    |




We can see that some categorical variables have high cardinality, namely *JobRole* (9), *NumCompaniesWorked* (10), and *PercentSalaryHike* (15). The numerical variable *absences* also takes only 24 distinct values.
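
The cardinalities and constant columns listed in the table can be derived in one step with `DataFrame.nunique()`. The sketch below runs on a small hypothetical frame (`toy`) that only borrows a few column names from the dataset; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for avdata (values are illustrative only)
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "JobLevel": [1, 2, 2, 3],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Number of distinct values per column
cardinality = toy.nunique()

# Columns with a single distinct value carry no information for a model
constant_cols = cardinality[cardinality == 1].index.tolist()

print(cardinality.sort_values(ascending=False))
print("Constant columns:", constant_cols)
```

Applied to `avdata`, the same two lines would flag columns such as *EmployeeCount*, *Over18*, and *StandardHours* as candidates for removal.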

Now let us see if there are any missing values.

In [None]:
missing_av = avdata.isnull().sum()
print(missing_av)

missing_compt = compt.isnull().sum()
print(missing_compt)

There are **no missing values** in either of the datasets.
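
With around 30 columns, the per-column printout is long, so a more compact check may be easier to read. The sketch below is one possible way to do it, shown on a hypothetical frame (`toy`) that deliberately contains missing values so that the output is not empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values inserted on purpose
toy = pd.DataFrame({
    "Age": [30, np.nan, 41],
    "Gender": ["Male", "Female", None],
    "JobLevel": [1, 2, 3],
})

# Missing values per column, keeping only the columns that have any
missing_per_col = toy.isnull().sum()
cols_with_missing = missing_per_col[missing_per_col > 0]

# Single number summarising the whole frame
total_missing = int(missing_per_col.sum())

print(cols_with_missing)
print("Total missing values:", total_missing)
```

For a dataset with no missing values, `cols_with_missing` is empty and `total_missing` is 0.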


The target column is a binary variable, making this problem a **classification** problem. Let us check if it is balanced or not.

In [None]:
# Count the values for attrition
attrition_counts = avdata['Attrition'].value_counts()
print(attrition_counts)

# Calculate class proportions
class_proportions = attrition_counts / len(avdata)
print(class_proportions)

We can see that only about 16% of the rows have *Attrition* equal to *Yes*, so the target classes are **imbalanced**.
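
One practical consequence of the imbalance is that plain accuracy becomes a misleading metric: a model that always predicts the majority class already scores high. The sketch below illustrates this on a hypothetical target series (`toy_target`) constructed to match the 16% figure quoted above; it is not the real column.

```python
import pandas as pd

# Hypothetical target with the imbalance described in the text (16% "Yes")
toy_target = pd.Series(["Yes"] * 16 + ["No"] * 84, name="Attrition")

# Class proportions, as computed for the real column above
proportions = toy_target.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any classifier trained on this data should be compared against this baseline rather than against 50%.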

In [None]:
numerical_vars = [
    "hrs",
    "absences",
    "Age",
    "DistanceFromHome",
    "MonthlyIncome",
    "TotalWorkingYears",
    "YearsAtCompany",
    "YearsSinceLastPromotion",
    "YearsWithCurrManager"
]

binary_vars = [
    "Gender",
    "Attrition"
]

categorical_vars = [
    "JobInvolvement",
    "PerformanceRating",
    "EnvironmentSatisfaction",
    "JobSatisfaction",
    "WorkLifeBalance",
    "BusinessTravel",
    "Department",
    "Education",
    "EducationField",
    "JobLevel",
    "JobRole",
    "MaritalStatus",
    "NumCompaniesWorked",
    "PercentSalaryHike",
    "StockOptionLevel",
    "TrainingTimesLastYear"
]

# Loop through categorical variables to calculate proportions
for col in categorical_vars:
    print(f"Proportions for {col}:")
    proportions = avdata[col].value_counts(normalize=True)  # Calculate proportions
    print(proportions)
    print("-" * 50)  # Separator for readability


In [None]:
import matplotlib.pyplot as plt

# Loop through categorical variables to create pie charts
for col in categorical_vars:
    # Get the value counts and their proportions
    proportions = avdata[col].value_counts(normalize=True)
    
    # Plotting the pie chart
    plt.figure(figsize=(6, 6))
    proportions.plot.pie(autopct='%1.1f%%', startangle=90, cmap='Set3')
    plt.title(f"Proportions for {col}")
    plt.ylabel('')  # Hide the y-axis label
    plt.show()
