# Project 2
## Introduction
This project analyzes employee attrition in an HR dataset using exploratory data analysis (EDA) and a decision tree classifier.

The primary objectives are:
- To explore the data using EDA techniques.
- To preprocess the data for machine learning.
- To build a classification model to predict attrition.
- To visualize and interpret results.

## Dataset
The dataset contains information about employees, including demographics, job roles, and factors potentially linked to attrition.

- File path: `WA_Fn-UseC_-HR-Employee-Attrition.csv`
- Shape: printed at runtime to verify the dimensions of the data.


## Imports & Data Preparation
For this project I am using the following libraries:
- pandas
- Matplotlib
- Seaborn
- scikit-learn (including its `KMeans` class)
- NumPy
- mlxtend

In [19]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

file_path = 'WA_Fn-UseC_-HR-Employee-Attrition.csv'
employee_data = pd.read_csv(file_path)




### First Checks
We start by setting pandas to display up to 1000 rows and 1000 columns so that printed output is not truncated.

Checking the shape of the dataset, we can see that it has 1470 rows and 35 columns.

In [None]:
# Raise the display limits so that printed DataFrames are not truncated
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

print("Data shape: {}".format(employee_data.shape))

We then check the summary information of the dataset.
As we can see, there are no null values in any column, and every column in the data frame has dtype int64 or object.

In [None]:
# Column datatypes and missing values, to check whether null values need to be removed
employee_data.info()
employee_data.head()

## EDA

Let's start by describing the dataset.

In [None]:
# .T transposes the summary so that each dataset column becomes a row
employee_data.describe().T

Here we look for single-value columns to remove.
Let's start by plotting a histogram of every numeric column and then a pie chart for every non-numeric column.


In [None]:
# Histograms to check for single-value columns to remove
employee_data.hist(figsize=(20,20))
plt.show()

# Pie chart distribution for non-numeric columns
for col in employee_data.select_dtypes(include='object').columns:
    counts = employee_data[col].value_counts()
    plt.figure(figsize=(8, 6), facecolor='white')
    plt.pie(counts, labels=counts.index, autopct='%1.1f%%', startangle=90)
    plt.title(f"{col} Distribution")  # f-string for string formatting
    plt.show()


As we can see, we can remove EmployeeCount, StandardHours and Over18, since they are single-value columns.


In [None]:
employee_data.drop(['EmployeeCount', 'StandardHours', 'Over18'], axis=1, inplace=True)
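
The visual check above can also be done programmatically. The following is a minimal sketch, assuming a small made-up frame in place of `employee_data`; the helper name `constant_columns` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]


# Toy frame standing in for employee_data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "Attrition": ["Yes", "No", "Yes"],
})

single_valued = constant_columns(toy)
print(single_valued)
```

On the real data, the returned list could be passed directly to `drop(columns=...)` instead of typing the column names by hand.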

Next, we check how attrition is distributed across the different job roles.

In [None]:
plt.figure(figsize=(30, 7))
sns.countplot(data=employee_data,x='JobRole', hue='Attrition', palette='viridis')
plt.title('Count of Attrition in Different Job Roles', fontsize=16)
plt.xlabel('Job Role', fontsize=13)
plt.ylabel('Count', fontsize=13)
plt.show()


In [None]:
# Features and target
X = employee_data.drop(columns=['Attrition'])
y = employee_data['Attrition']

# Encoding categorical variables
categorical_cols = X.select_dtypes(include=[object]).columns.tolist()

X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
X = X.astype(int)
# Encode Attrition as integer labels
y = LabelEncoder().fit_transform(y)

# Correlation matrix of the encoded features
correlation_matrix = X.corr()
plt.figure(figsize=(30, 20))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True, square=True, 
            cbar_kws={"shrink": .75}, linewidths=.5)
plt.title('Correlation Matrix of Features', fontsize=16)
plt.tight_layout()
plt.show()

Since JobLevel and MonthlyIncome are highly correlated, I am going to remove MonthlyIncome and work with the five-level JobLevel range instead.
There is also a high correlation among YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion and YearsWithCurrManager, so we will remove YearsInCurrentRole, YearsSinceLastPromotion and YearsWithCurrManager.
Another high correlation involves TotalWorkingYears, which is dropped as well.

In [None]:
X = X.drop(columns=['MonthlyIncome','TotalWorkingYears','YearsInCurrentRole','YearsWithCurrManager'])
X.info()
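
Reading correlated pairs off a large heatmap is error-prone, so the pairs can also be listed directly. The following is a rough sketch, assuming synthetic data in place of the encoded features; the helper `high_correlation_pairs` and the 0.7 threshold are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """List (column_a, column_b, |r|) for column pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Visit each unordered pair once (upper triangle, diagonal excluded)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if r > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Synthetic stand-in: income is built from job level plus noise, distance is independent
rng = np.random.default_rng(0)
level = rng.integers(1, 6, size=200)
toy = pd.DataFrame({
    "JobLevel": level,
    "MonthlyIncome": level * 3000 + rng.normal(0, 500, size=200),
    "DistanceFromHome": rng.integers(1, 30, size=200),
})

pairs = high_correlation_pairs(toy, threshold=0.7)
for a, b, r in pairs:
    print(f"{a} ~ {b}: {r:.2f}")
```

Applied to the real `X`, a list like this makes it easier to decide which member of each pair to drop.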

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
dt_classifier = DecisionTreeClassifier(random_state=1234)
dt_classifier = dt_classifier.fit(X_train, y_train) 
plt.figure(figsize=(40,20))
tree.plot_tree(dt_classifier,
               feature_names=list(X.columns),  # some scikit-learn versions require a list here
               class_names=list(map(str, np.unique(y))),  # Convert class names to strings
               filled=True, rounded=True)
plt.show()


In [None]:
y_pred = dt_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Classification Report
print(classification_report(y_test,y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=np.unique(y), yticklabels=np.unique(y))
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

## Cross Validation

In [None]:
# Prepare the data: reuse the encoded features X and the encoded target y built above
employee_data_features = X
employee_data_target = y
employee_data_feature_names = X.columns.tolist()
employee_data_target_names = sorted(set(employee_data_target))

print('Features:', employee_data_feature_names, '   Classes:', employee_data_target_names)

# Instantiate the model
cv_classifier = DecisionTreeClassifier(random_state=27)

In [None]:
# Evaluate the model using cross validation
acc_score = cross_val_score(cv_classifier, employee_data_features, employee_data_target, cv=10)
print("CV Mean Accuracy: %0.3f (+/- %0.3f)" % (acc_score.mean(), acc_score.std()) )

f1_score = cross_val_score(cv_classifier, employee_data_features, employee_data_target, cv=10, scoring='f1_macro')
print("CV Mean F1: %0.3f (+/- %0.3f)" % (np.mean(f1_score), np.std(f1_score)) )

# Build the model with the complete data
#final_classifier = cv_classifier.fit(employee_data_features, employee_data_target)
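
The two `cross_val_score` calls above fit the model once per metric. As a possible alternative, scikit-learn's `cross_validate` can compute several metrics in a single pass. The sketch below assumes a synthetic, imbalanced dataset from `make_classification` in place of the employee data; the sample size and class weights are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and target (not the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=27
)

clf = DecisionTreeClassifier(random_state=27)

# One cross-validation run that records both metrics for each of the 10 folds
scores = cross_validate(clf, X_demo, y_demo, cv=10, scoring=["accuracy", "f1_macro"])

for metric in ("accuracy", "f1_macro"):
    values = scores[f"test_{metric}"]
    print("CV Mean %s: %0.3f (+/- %0.3f)" % (metric, values.mean(), values.std()))
```

With an imbalanced target, accuracy and macro F1 can differ noticeably, which is one reason to report both.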

