## Part 1: Preprocessing

In [93]:
!pip install tensorflow # For use in Google Colab



In [94]:
# Import our dependencies
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [95]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Unnamed: 0,0
Age,43
Attrition,2
BusinessTravel,3
Department,3
DistanceFromHome,29
Education,5
EducationField,6
EnvironmentSatisfaction,4
HourlyRate,71
JobInvolvement,4


In [96]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]
y_df.head()



Unnamed: 0,Attrition,Department
0,Yes,Sales
1,No,Research & Development
2,Yes,Research & Development
3,No,Research & Development
4,No,Research & Development


In [97]:
# Create a list of at least 10 column names to use as X data
feature_columns = [
    'Age',                      # Different age groups may have varying career goals and stability
    'DistanceFromHome',         # Longer commute might contribute to job dissatisfaction and higher attrition rates
    'EnvironmentSatisfaction',  # Satisfaction with the work environment directly affects an employee's happiness
    'JobSatisfaction',          # Overall job satisfaction is a strong predictor of attrition
    'OverTime',                 # Excessive overtime can lead to burnout
    'TotalWorkingYears',        # Employees with more experience might be more stable or, alternatively, might seek better opportunities
    'WorkLifeBalance',          # Poor work-life balance can lead to dissatisfaction and attrition
    'YearsAtCompany',           # Tenure at the company may influence loyalty and the likelihood of staying
    'YearsInCurrentRole',       # Time spent in the same role without advancement might lead to frustration
    'YearsSinceLastPromotion',  # Lack of recent promotions can decrease motivation and increase attrition risk
    'NumCompaniesWorked',       # Employees who have worked at many companies might be more prone to switch jobs
    'TrainingTimesLastYear'     # Access to training might affect job satisfaction and retention
]


# Create X_df using your selected columns
X_df = attrition_df[feature_columns]



# Show the data types for X_df
print(X_df.dtypes)
#print(X_df.head())



Age                         int64
DistanceFromHome            int64
EnvironmentSatisfaction     int64
JobSatisfaction             int64
OverTime                   object
TotalWorkingYears           int64
WorkLifeBalance             int64
YearsAtCompany              int64
YearsInCurrentRole          int64
YearsSinceLastPromotion     int64
NumCompaniesWorked          int64
TrainingTimesLastYear       int64
dtype: object


In [98]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)
X_train.head()


Unnamed: 0,Age,DistanceFromHome,EnvironmentSatisfaction,JobSatisfaction,OverTime,TotalWorkingYears,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,NumCompaniesWorked,TrainingTimesLastYear
1097,24,21,3,1,No,2,3,1,1,0,0,3
727,18,5,2,4,No,0,3,0,0,0,1,2
254,29,20,4,4,No,10,3,3,2,0,2,2
1175,39,12,4,2,No,7,3,5,4,1,4,3
1341,31,20,2,3,No,10,3,10,8,0,1,2


In [99]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary

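# Encode the 'OverTime' column in both X_train and X_test.
# Take explicit copies first so the column assignment does not trigger
# pandas' SettingWithCopyWarning on the split DataFrames.
X_train = X_train.copy()
X_test = X_test.copy()
X_train['OverTime'] = X_train['OverTime'].map({'Yes': 1, 'No': 0}).astype(np.int64)
X_test['OverTime'] = X_test['OverTime'].map({'Yes': 1, 'No': 0}).astype(np.int64)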

#print(X_train.dtypes)
#print(X_test.dtypes)


In [100]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both the training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
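
The cell above fits the scaler on the training data only and reuses those statistics for the test data. A minimal sketch of what that implies, using synthetic arrays as a stand-in for `X_train` and `X_test` (the `*_demo` names and values are hypothetical, not the attrition data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / X_test (made-up values, illustration only)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=35.0, scale=9.0, size=(200, 3))
X_test_demo = rng.normal(loc=35.0, scale=9.0, size=(50, 3))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from train only
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses the training mean/std

# Training columns end up centered with unit variance by construction
print(X_train_demo_scaled.mean(axis=0).round(6))
print(X_train_demo_scaled.std(axis=0).round(6))

# Test columns are only approximately standardized, because they were
# scaled with the training statistics rather than their own
print(X_test_demo_scaled.mean(axis=0).round(3))
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step.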



In [101]:
# Create a OneHotEncoder
encoder = OneHotEncoder()  # keep every category: the softmax output needs one column per department

# Fit the encoder to the training data
encoder.fit(y_train[['Department']])


# Create two new variables by applying the encoder
# to the training and testing data
y_train_encoded = encoder.transform(y_train[['Department']]).toarray()
y_test_encoded = encoder.transform(y_test[['Department']]).toarray()

#print(y_train_encoded[:5])
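
The one-hot columns follow the encoder's `categories_` order, which matters later when mapping softmax outputs back to department names. A small sketch under stated assumptions: the labels and probabilities below are made up, and the `demo_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical department labels (illustration only, not the real column)
demo_departments = pd.DataFrame(
    {'Department': ['Sales', 'Research & Development', 'Human Resources', 'Sales']}
)

demo_encoder = OneHotEncoder()
demo_encoded = demo_encoder.fit_transform(demo_departments[['Department']]).toarray()

# Column order follows the sorted category names
print(demo_encoder.categories_[0])

# Made-up softmax outputs for two employees
demo_probs = np.array([[0.1, 0.7, 0.2],
                       [0.2, 0.2, 0.6]])

# argmax picks the most likely column; categories_ maps it back to a name
predicted_index = demo_probs.argmax(axis=1)
predicted_names = demo_encoder.categories_[0][predicted_index]
print(predicted_names)
```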


In [102]:
# Encode the Attrition column as 0/1 in both the training and testing targets.
# A binary Yes/No label needs only a simple mapping, not a separate encoder.
y_train['Attrition'] = y_train['Attrition'].map({'Yes': 1, 'No': 0}).astype(np.int64)
y_test['Attrition'] = y_test['Attrition'].map({'Yes': 1, 'No': 0}).astype(np.int64)

#print(y_train['Attrition'].head())


## Create, Compile, and Train the Model

In [103]:
# Find the number of columns in the X training data
input_shape = X_train_scaled.shape[1]
print(f"Number of input features: {input_shape}")


# Create the input layer
input_layer = layers.Input(shape=(input_shape,))


# Create at least two shared layers
shared_layer_1 = layers.Dense(units=16, activation='relu')(input_layer)
shared_layer_2 = layers.Dense(units=8, activation='relu')(shared_layer_1)


Number of input features: 12


In [104]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden_layer = layers.Dense(units=4, activation='relu')(shared_layer_2)


# Create the output layer
department_output = layers.Dense(units=3, activation='softmax', name='department_output')(department_hidden_layer)



In [105]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden_layer = layers.Dense(units=4, activation='relu')(shared_layer_2)


# Create the output layer
attrition_output = layers.Dense(units=1, activation='sigmoid', name='attrition_output')(attrition_hidden_layer)
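
The two output layers use different activations: softmax for the three-way department head and sigmoid for the single attrition unit. A rough NumPy sketch of what each function does to raw scores; the input values are arbitrary, and the helper functions are written here for illustration rather than taken from Keras:

```python
import numpy as np

def sigmoid(z):
    """Squash each score independently into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    shifted = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Arbitrary raw scores for illustration
attrition_scores = np.array([0.0, 2.0, -2.0])       # one score per employee
department_scores = np.array([[2.0, 1.0, 0.1]])     # three scores for one employee

attrition_prob = sigmoid(attrition_scores)
department_prob = softmax(department_scores)

print(attrition_prob)                 # each value is its own probability of "Yes"
print(department_prob)                # one probability per department
print(department_prob.sum(axis=-1))   # each row sums to 1
```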



In [106]:
# Create the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output])


# Compile the model
model.compile(optimizer='adam',
              loss={'department_output': 'categorical_crossentropy', 'attrition_output': 'binary_crossentropy'},
              metrics={'department_output': 'accuracy', 'attrition_output': 'accuracy'})


# Summarize the model
model.summary()



In [107]:
# Train the model
history = model.fit(
    X_train_scaled,
    {'department_output': y_train_encoded, 'attrition_output': y_train['Attrition']},
    epochs=20,
    batch_size=32,
    validation_data=(X_test_scaled, {'department_output': y_test_encoded, 'attrition_output': y_test['Attrition']})
)


Epoch 1/20
37/37 ━━━━━━━━━━━━━━━━━━━━ 5s 27ms/step - attrition_output_accuracy: 0.7686 - department_output_accuracy: 0.5924 - loss: 1.7055 - val_attrition_output_accuracy: 0.8673 - val_department_output_accuracy: 0.6667 - val_loss: 1.5770
Epoch 2/20
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - attrition_output_accuracy: 0.8423 - department_output_accuracy: 0.6429 - loss: 1.5399 - val_attrition_output_accuracy: 0.8673 - val_department_output_accuracy: 0.6667 - val_loss: 1.3914
Epoch 3/20
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - attrition_output_accuracy: 0.8352 - department_output_accuracy: 0.6307 - loss: 1.4022 - val_attrition_output_accuracy: 0.8673 - val_department_output_accuracy: 0.6667 - val_loss: 1.2745
Epoch 4/20
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - attrition_output_accuracy: 0.8428 - department_output_accuracy: 0.6741 - loss: 1.2759 - val_at

In [108]:
# Evaluate the model with the testing data.
# return_dict=True keys each value by metric name, so the results do not
# depend on the order in which Keras returns them (here attrition accuracy
# is reported before department accuracy).
results = model.evaluate(
    X_test_scaled,
    {'department_output': y_test_encoded, 'attrition_output': y_test['Attrition']},
    return_dict=True
)


10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - attrition_output_accuracy: 0.8447 - department_output_accuracy: 0.6493 - loss: 1.1742


In [110]:
# Print the overall loss and the accuracy for both department and attrition
print(f"Test Loss: {results['loss']}")
print(f"Department Accuracy: {results['department_output_accuracy']}")
print(f"Attrition Accuracy: {results['attrition_output_accuracy']}")

Test Loss: 1.1366230249404907
Department Accuracy: 0.6666666865348816
Attrition Accuracy: 0.8673469424247742
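
Accuracy is the only metric reported above, and the training log shows `val_attrition_output_accuracy` sitting at 0.8673 for every epoch that is visible. A sketch of why that number alone can be hard to interpret, using made-up labels and a hypothetical model that always predicts the majority class (none of these arrays come from the notebook):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical, deliberately imbalanced labels: 1 = attrition "Yes", 0 = "No"
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A degenerate "model" that always predicts the majority class
y_pred_majority = np.zeros_like(y_true_demo)

accuracy = (y_true_demo == y_pred_majority).mean()
precision = precision_score(y_true_demo, y_pred_majority, zero_division=0)
recall = recall_score(y_true_demo, y_pred_majority, zero_division=0)
f1 = f1_score(y_true_demo, y_pred_majority, zero_division=0)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_majority)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(cm)
```

With the notebook's model, the predicted labels would presumably come from `model.predict(X_test_scaled)`, taking the element that corresponds to `attrition_output` and thresholding it; the 0.5 cut-off commonly used for that is an assumption, not something the notebook specifies.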


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?

1. Why accuracy may not be the best metric:
  * Class imbalance in Attrition: If one class (e.g., "No") is much more frequent than the other (e.g., "Yes"), accuracy can give a misleading sense of performance, because a model that always predicts the majority class still scores high.
  * Multi-class classification (Department): Accuracy is a more reasonable choice for the department prediction if the classes are fairly balanced across departments. Even then, a single accuracy figure does not show how well the model distinguishes between specific classes.
  * Better metrics include precision, recall, and F1-score. A confusion matrix also shows which classes the model confuses with each other.

2. Sigmoid for Attrition; softmax for Department.
  * For the Attrition output, the sigmoid activation function was used because it suits binary classification, such as predicting whether an employee will leave the company (Attrition = Yes or No). It maps the output to a single probability between 0 and 1 and pairs with the binary cross-entropy loss used for that branch.
  * For the Department output, the softmax activation function was used because it suits multi-class classification, such as predicting the department (3 classes, e.g., Sales, R&D, HR). It produces one probability per class, summing to 1, and pairs with the categorical cross-entropy loss used in the department branch.

3. The model could be improved by modifying the architecture, tuning hyperparameters, and enhancing data preprocessing.
  * Handle class imbalance (for Attrition prediction) by assigning a larger weight to the minority class when training the model.
  * Feature engineering may help if the raw columns do not capture important information about employee behavior.
  * Hyperparameter tuning, for example a grid search over a range of hyperparameters, can find a better combination; adding dropout layers can reduce overfitting.
  * A deeper or wider architecture than the current small feed-forward network (e.g., more hidden layers) may capture more complex patterns.
  * Use cross-validation, such as K-fold cross-validation, to get a performance estimate that does not depend on a single train/test split.
  * Lastly, neural networks might not be the optimal choice for structured tabular data. Random forests, gradient boosting, and other ensemble models might do better with this data.
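
The class-weighting idea from the first bullet of answer 3 could be sketched as follows. This is an illustration on made-up labels, assuming scikit-learn's `balanced` heuristic is an acceptable starting point; the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced attrition labels: eight "No" (0) and two "Yes" (1)
y_demo = np.array([0] * 8 + [1] * 2)

classes = np.unique(y_demo)
# 'balanced' uses n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
class_weight_map = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight_map)

# Expand to one weight per row so each class contributes equally in total
sample_weight_demo = np.array([class_weight_map[label] for label in y_demo])
print(sample_weight_demo)
```

Per-row weights like these are the kind of array that Keras's `fit` accepts through its `sample_weight` argument. How to apply them to only the attrition head of a two-output model depends on the Keras version, so that wiring is left out here.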