## Part 1: Preprocessing

In [2]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

# Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [3]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64

In [4]:
# Create y_df with the Attrition and Department columns

y_df = attrition_df[['Attrition', 'Department']]
y_df.head()

Unnamed: 0,Attrition,Department
0,Yes,Sales
1,No,Research & Development
2,Yes,Research & Development
3,No,Research & Development
4,No,Research & Development


In [5]:
# Create a list of at least 10 column names to use as X data

column_names = ["Education", "Age",
                "DistanceFromHome",
                "JobSatisfaction",
                "OverTime",
                "StockOptionLevel",
                "WorkLifeBalance",
                "YearsAtCompany",
                "YearsSinceLastPromotion",
                "NumCompaniesWorked"]


# Create X_df using your selected columns
X_df = attrition_df[column_names].copy()
X_df.head()

# Show the data types for X_df

X_df.dtypes

Education                   int64
Age                         int64
DistanceFromHome            int64
JobSatisfaction             int64
OverTime                   object
StockOptionLevel            int64
WorkLifeBalance             int64
YearsAtCompany              int64
YearsSinceLastPromotion     int64
NumCompaniesWorked          int64
dtype: object

In [6]:
# Split the data into training and testing sets
# (train_test_split was already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, random_state=42)

In [7]:
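# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
print(X_train.value_counts('OverTime'))

# Work on copies so the assignments below do not raise a SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# OverTime is the only non-numeric column: map Yes to 1 and No to 0
X_train['OverTime'] = X_train['OverTime'].map({'Yes': 1, 'No': 0})
X_test['OverTime'] = X_test['OverTime'].map({'Yes': 1, 'No': 0})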

# Display the value counts of the 'OverTime' column in the training and testing sets
print("Value counts of 'OverTime' column in training set:")
print(X_train.value_counts('OverTime'))

print("\nValue counts of 'OverTime' column in testing set:")
print(X_test.value_counts('OverTime'))

OverTime
No     780
Yes    322
Name: count, dtype: int64
Value counts of 'OverTime' column in training set:
OverTime
0    780
1    322
Name: count, dtype: int64

Value counts of 'OverTime' column in testing set:
OverTime
0    274
1     94
Name: count, dtype: int64


In [8]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data only
scaler.fit(X_train)

# Scale the training and testing data with the fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
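
As a rough illustration of what the scaling step computes, the sketch below standardizes a toy column using the mean and standard deviation of the training values only, then reuses those statistics for the test values. The numbers are made up, and `standardize` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import statistics


def standardize(train_values, test_values):
    """Scale both lists with the mean and population std of the training list."""
    mean = statistics.fmean(train_values)
    # Population standard deviation (ddof=0), which is what StandardScaler uses
    std = statistics.pstdev(train_values)
    scale = std if std > 0 else 1.0  # constant columns are left unscaled
    train_scaled = [(value - mean) / scale for value in train_values]
    test_scaled = [(value - mean) / scale for value in test_values]
    return train_scaled, test_scaled


# Toy ages, made up for illustration
train_scaled, test_scaled = standardize([30, 40, 50], [40, 60])
print(train_scaled)
print(test_scaled)
```

Taking the statistics from the training split alone keeps information about the test rows out of the preprocessing step.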

In [9]:
# Create a OneHotEncoder for the Department column

from sklearn.preprocessing import OneHotEncoder
department_encoder = OneHotEncoder(sparse_output=False)
# Fit the encoder to the training data

department_encoder.fit(y_train[['Department']])
# Transform the 'Department' column in the training and testing data
y_train_department_encoded = department_encoder.transform(y_train[['Department']])
y_test_department_encoded = department_encoder.transform(y_test[['Department']])

# Convert the encoded columns to DataFrames
y_train_department_encoded_df = pd.DataFrame(y_train_department_encoded, columns=department_encoder.get_feature_names_out(['Department']))
y_test_department_encoded_df = pd.DataFrame(y_test_department_encoded, columns=department_encoder.get_feature_names_out(['Department']))

# Print the resulting arrays
print("\nEncoded 'Department' column in training set:")
print(y_train_department_encoded_df[:5])

print("\nEncoded 'Department' column in testing set:")
print(y_test_department_encoded_df[:5])




Encoded 'Department' column in training set:
   Department_Human Resources  Department_Research & Development  \
0                         0.0                                1.0   
1                         0.0                                0.0   
2                         0.0                                0.0   
3                         0.0                                0.0   
4                         0.0                                0.0   

   Department_Sales  
0               0.0  
1               1.0  
2               1.0  
3               1.0  
4               1.0  

Encoded 'Department' column in testing set:
   Department_Human Resources  Department_Research & Development  \
0                         0.0                                0.0   
1                         0.0                                1.0   
2                         1.0                                0.0   
3                         0.0                                1.0   
4                         0.

In [10]:
# Create a OneHotEncoder for the Attrition column


attrition_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the 'Attrition' column in the training data
attrition_encoder.fit(y_train[['Attrition']])

# Transform the 'Attrition' column in the training and testing data
y_train_attrition_encoded = attrition_encoder.transform(y_train[['Attrition']])
y_test_attrition_encoded = attrition_encoder.transform(y_test[['Attrition']])

# Convert the encoded columns to DataFrames
y_train_attrition_encoded_df = pd.DataFrame(y_train_attrition_encoded, columns=attrition_encoder.get_feature_names_out(['Attrition']))
y_test_attrition_encoded_df = pd.DataFrame(y_test_attrition_encoded, columns=attrition_encoder.get_feature_names_out(['Attrition']))

# Print the resulting arrays for verification
print("\nEncoded 'Attrition' column in training set:")
print(y_train_attrition_encoded_df[:5])

print("\nEncoded 'Attrition' column in testing set:")
print(y_test_attrition_encoded_df[:5])




Encoded 'Attrition' column in training set:
   Attrition_No  Attrition_Yes
0           1.0            0.0
1           1.0            0.0
2           1.0            0.0
3           1.0            0.0
4           1.0            0.0

Encoded 'Attrition' column in testing set:
   Attrition_No  Attrition_Yes
0           1.0            0.0
1           1.0            0.0
2           0.0            1.0
3           1.0            0.0
4           1.0            0.0
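
The two encoding cells above follow the same pattern: learn the categories from the training labels, then turn each label into a row with a single 1.0. A minimal pure-Python sketch of that idea is shown below. `one_hot` is a hypothetical helper for illustration, and the labels are toy values, not rows from the dataset.

```python
def one_hot(train_labels, labels):
    """Encode labels against the sorted categories seen in the training labels."""
    categories = sorted(set(train_labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0.0] * len(categories)
        row[index[label]] = 1.0  # a label not seen in training raises KeyError
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(
    ["Sales", "Research & Development", "Human Resources", "Sales"],
    ["Research & Development", "Sales"],
)
print(categories)
print(encoded)
```

The categories come out in sorted order, which matches the column order in the encoded DataFrames printed above.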


## Create, Compile, and Train the Model

In [11]:
# Find the number of columns in the X training data

X_train_scaled.shape[1]
# Create the input layer

input_layer = layers.Input(shape=(X_train_scaled.shape[1],))
# Create at least two shared layers
shared_layer1 = layers.Dense(64, activation='relu')(input_layer)
shared_layer2 = layers.Dense(128, activation='relu')(shared_layer1)

In [12]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer
department_output = layers.Dense(3, activation='softmax', name='department_output')(department_hidden)


In [13]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer
attrition_output = layers.Dense(2, activation='sigmoid', name='attrition_output')(attrition_hidden)
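
One detail worth noting about this branch: a sigmoid applied to two units scores each unit independently, so the two outputs need not sum to 1, whereas softmax always returns a probability distribution. The sketch below compares the two on made-up pre-activation values. It is only an illustration of the activation functions, not a change to the model above.

```python
import math


def sigmoid(logits):
    """Apply the logistic function to each value independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Normalize the values into a probability distribution."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0]  # made-up values for the two attrition units
sigmoid_probs = sigmoid(logits)
softmax_probs = softmax(logits)
print(sum(sigmoid_probs))  # greater than 1: the units are scored independently
print(sum(softmax_probs))  # 1 up to floating-point rounding
```

Two common alternatives for a Yes/No target are a single sigmoid unit with binary cross-entropy, or two softmax units with categorical cross-entropy.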


In [14]:
# Create the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output])

# Compile the model
model.compile(optimizer='adam',
              loss={'department_output': 'categorical_crossentropy',
                    'attrition_output': 'binary_crossentropy'},
              metrics={'department_output': 'accuracy',
                       'attrition_output': 'accuracy'})

# Summarize the model
model.summary()

In [19]:
# Train the model
history = model.fit(X_train_scaled,
                    {'department_output': y_train_department_encoded_df, 'attrition_output': y_train_attrition_encoded},
                    epochs=50, batch_size=32, validation_split=0.2)


Epoch 1/50
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - attrition_output_accuracy: 0.8130 - department_output_accuracy: 0.6660 - loss: 1.4477 - val_attrition_output_accuracy: 0.7873 - val_department_output_accuracy: 0.6063 - val_loss: 1.3721
Epoch 2/50
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - attrition_output_accuracy: 0.8105 - department_output_accuracy: 0.6629 - loss: 1.2663 - val_attrition_output_accuracy: 0.7873 - val_department_output_accuracy: 0.6063 - val_loss: 1.3363
Epoch 3/50
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - attrition_output_accuracy: 0.8323 - department_output_accuracy: 0.6417 - loss: 1.1680 - val_attrition_output_accuracy: 0.7873 - val_department_output_accuracy: 0.6063 - val_loss: 1.3068
Epoch 4/50
28/28 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - attrition_output_accuracy: 0.8528 - department_output_accuracy: 0.6740 - loss: 1.1164 - val_attrit

In [20]:
# Evaluate the model with the testing data
model.evaluate(X_test_scaled,
               {'department_output': y_test_department_encoded_df, 'attrition_output': y_test_attrition_encoded})

12/12 ━━━━━━━━━━━━━━━━━━━━ 0s 457us/step - attrition_output_accuracy: 0.8175 - department_output_accuracy: 0.5577 - loss: 2.2020


[2.0547635555267334, 0.83152174949646, 0.5461956262588501]

In [21]:
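# Print the accuracy for both department and attrition
# Note: history.history holds training metrics, so these are the final-epoch
# training accuracies, not the test accuracies returned by model.evaluate above
print("Department Accuracy:", history.history['department_output_accuracy'][-1])
print("Attrition Accuracy:", history.history['attrition_output_accuracy'][-1])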

Department Accuracy: 0.9511918425559998
Attrition Accuracy: 0.9659478068351746
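
The figures printed here come from the training history. To report test accuracy instead, one option is to label the list returned by `model.evaluate` earlier. The sketch below does that with the values copied from that output. The metric order is an assumption inferred from the progress bar (total loss, then attrition accuracy, then department accuracy); calling `model.evaluate(..., return_dict=True)` would return the names directly and remove the guesswork.

```python
# Values copied from the model.evaluate output above
test_results = [2.0547635555267334, 0.83152174949646, 0.5461956262588501]

# Assumed order, inferred from the progress bar rather than confirmed
metric_names = ["loss", "attrition_output_accuracy", "department_output_accuracy"]

test_metrics = dict(zip(metric_names, test_results))
print(f"Department test accuracy: {test_metrics['department_output_accuracy']:.3f}")
print(f"Attrition test accuracy: {test_metrics['attrition_output_accuracy']:.3f}")
```

Under that assumption, the test accuracies (about 0.546 and 0.832) are well below the training figures printed above.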


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. No. Judged by training accuracy alone (about 0.95 for Department and 0.97 for Attrition in the final epoch), the model appears to perform very well. However, the confusion matrix shows that its actual performance is much weaker, and the test-set accuracies reported above (about 0.55 for Department and 0.83 for Attrition) point the same way.

Relying on accuracy alone can be misleading, especially when the classes are imbalanced. A confusion matrix gives a more detailed view of performance: low precision, recall, and F1-scores for specific classes reveal problems that a single accuracy figure hides.

One likely problem is overfitting, where the model fits noise and details of the training data that do not carry over to new, unseen data. The model then scores highly on the training data but performs poorly on the test data, as the confusion matrix shows.
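
To make the point about accuracy concrete, here is a small sketch with hypothetical labels (not taken from the dataset): 8 employees who stay (0), 2 who leave (1), and a model that predicts "stay" for everyone. `binary_report` is a helper written for this example.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 0 = stays, 1 = leaves; the model always predicts 0
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
report = binary_report(y_true, y_pred)
print(report)
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

The toy model reaches 80% accuracy while identifying none of the employees who leave, which is the kind of failure that accuracy alone hides.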

2. The Department output layer uses softmax, because the three departments are mutually exclusive classes and softmax returns a probability distribution over them. The Attrition output layer uses sigmoid, a common choice for a binary (Yes/No) target trained with binary cross-entropy.

ReLU was used for the shared and hidden layers rather than the output layers. It is cheap to compute, which speeds up training, and the sparse activations it produces can help the network focus on the most relevant features.

3. Model Tuning: Experiment with different architectures, such as adding layers or changing the number of neurons in the existing layers. Tune hyperparameters such as the learning rate, batch size, and number of epochs.

Class Balance: Check the class distribution of both outputs. If the classes are imbalanced, particularly in the Department column, consider strategies such as oversampling or undersampling the training data.

Regularization: Use techniques such as dropout or L2 regularization to reduce overfitting and improve generalization.
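
As one possible reading of the oversampling suggestion above, the sketch below duplicates randomly chosen rows of each minority class until every class matches the largest one. `random_oversample` is a hypothetical helper, and the rows and labels are toy values. In practice it would be applied to the training split only, so that duplicated rows never leak into the test set.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy training data: 8 "No" rows and 2 "Yes" rows
rows = [[age] for age in range(30, 40)]
labels = ["No"] * 8 + ["Yes"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Undersampling works the other way, dropping majority-class rows instead, at the cost of discarding data.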