<a href="https://colab.research.google.com/github/GusvdW/neural-networks-challenge-2/blob/main/neural_networks_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1: Preprocessing

In [1]:
# Import our dependencies
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from tensorflow.keras import layers, models
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.metrics import BinaryAccuracy, CategoricalAccuracy
from tensorflow.keras.models import Model

# Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [3]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Age                        43
Attrition                   2
BusinessTravel              3
Department                  3
DistanceFromHome           29
Education                   5
EducationField              6
EnvironmentSatisfaction     4
HourlyRate                 71
JobInvolvement              4


In [4]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]
print(y_df.head())

  Attrition              Department
0       Yes                   Sales
1        No  Research & Development
2       Yes  Research & Development
3        No  Research & Development
4        No  Research & Development


In [5]:
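# Create a list of at least 10 column names to use as X data
# Note: 'Attrition' and 'Department' are the two targets. They are listed here
# only so they can be split off later, which leaves 8 actual input features.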
feature_columns = [
    'Age',
    'HourlyRate',
    'Attrition',
    'Education',
    'Department',
    'JobInvolvement',
    'JobLevel',
    'JobSatisfaction',
    'PercentSalaryHike',
    'NumCompaniesWorked'
]

# Create X_df using your selected columns
X_df = attrition_df[feature_columns]
# Show the data types for X_df
print(X_df.dtypes)


Age                    int64
HourlyRate             int64
Attrition             object
Education              int64
Department            object
JobInvolvement         int64
JobLevel               int64
JobSatisfaction        int64
PercentSalaryHike      int64
NumCompaniesWorked     int64
dtype: object


In [6]:
# Split the data into training and testing sets
y_attrition = X_df['Attrition'].map({'Yes': 1, 'No': 0})  # Binary target for attrition
y_department = X_df['Department']  # Multi-class target for department

X = X_df.drop(columns=['Attrition', 'Department'])
X_train, X_test, y_train_attrition, y_test_attrition, y_train_dept, y_test_dept = train_test_split(
    X, y_attrition, y_department, test_size=0.2, random_state=42
)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train_attrition shape: {y_train_attrition.shape}")
print(f"y_test_attrition shape: {y_test_attrition.shape}")
print(f"y_train_dept shape: {y_train_dept.shape}")
print(f"y_test_dept shape: {y_test_dept.shape}")

X_train shape: (1176, 8)
X_test shape: (294, 8)
y_train_attrition shape: (1176,)
y_test_attrition shape: (294,)
y_train_dept shape: (1176,)
y_test_dept shape: (294,)


In [7]:
# Check unique values in the Department column for the testing set
unique_departments_test = y_test_dept.unique()

print("Unique values in the Department column of the testing set:")
print(unique_departments_test)

Unique values in the Department column of the testing set:
['Sales' 'Research & Development' 'Human Resources']


In [8]:
# Verify the mapping
print(y_train_attrition.unique())

[0 1]


In [9]:
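# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
X = attrition_df[feature_columns].drop(columns=['Attrition'])

# Label-encode Department as integers. It is also a target, so it is dropped
# from the model inputs in the OneHotEncoder cell further down.
X['Department'] = X['Department'].map({'Research & Development': 0, 'Sales': 1, 'Human Resources': 2})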


y_attrition = attrition_df['Attrition'].map({'Yes': 1, 'No': 0})
y_department = X['Department']


X_train, X_test, y_train_attrition, y_test_attrition, y_train_dept, y_test_dept = train_test_split(
    X, y_attrition, y_department, test_size=0.2, random_state=42)

num_columns = ['Age', 'HourlyRate', 'Education', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'PercentSalaryHike', 'NumCompaniesWorked', 'Department']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_columns)
    ])

X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)


# Verify the shapes and head of your data
print("X_train_transformed shape:", X_train_transformed.shape)
print("X_test_transformed shape:", X_test_transformed.shape)



X_train_transformed shape: (1176, 9)
X_test_transformed shape: (294, 9)


In [10]:
# Check the target labels
print("Unique values in Attrition (y_attrition):")
print(y_attrition.unique())

print("Unique values in Department (y_department):")
print(y_department.unique())


Unique values in Attrition (y_attrition):
[1 0]
Unique values in Department (y_department):
[1 0 2]


In [11]:
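# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data only, so that no statistics
# from the test set leak into preprocessing
scaler.fit(X_train)

# Scale the training and testing data with the training-set statistics
# Note: X_train still includes the integer-coded Department column here, and
# the model below is fit on X_train_final rather than on these scaled arrays.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)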
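
As a rough illustration of what the scaling step does, the sketch below standardizes a single column by hand using only the Python standard library. The helper names (`fit_standardizer`, `standardize`) and the sample ages are made up for this example. It assumes the usual z-score definition with the population standard deviation, which is what `StandardScaler` applies per column.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Learn the mean and population standard deviation from training values only."""
    mean = fmean(train_column)
    std = pstdev(train_column)
    if std == 0:
        # Guard against constant columns so the division below stays defined.
        std = 1.0
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


# Illustrative values only (hypothetical, not the real split).
train_ages = [41, 49, 37, 33, 27]
test_ages = [30, 45]

mean, std = fit_standardizer(train_ages)
train_scaled = standardize(train_ages, mean, std)
# The test data reuses the training statistics; it is never used for fitting.
test_scaled = standardize(test_ages, mean, std)

print(train_scaled)
print(test_scaled)
```

After this transformation the training column has mean 0 and standard deviation 1, while the test column is only approximately centered because it is scaled with the training statistics.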



In [12]:
# Create a OneHotEncoder for the Department column
# Fit the encoder to the training data
# Create two new variables by applying the encoder
# to the training and testing data
# sparse_output replaces the sparse argument (deprecated in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)

y_train_dept_encoded = encoder.fit_transform(y_train_dept.values.reshape(-1, 1))
y_test_dept_encoded = encoder.transform(y_test_dept.values.reshape(-1, 1))

encoded_features_train = encoder.fit_transform(X_train[['Department']])
encoded_features_test = encoder.transform(X_test[['Department']])


encoded_features_train_df = pd.DataFrame(encoded_features_train, columns=encoder.get_feature_names_out(['Department']))
encoded_features_test_df = pd.DataFrame(encoded_features_test, columns=encoder.get_feature_names_out(['Department']))


X_train_final = X_train.drop(columns=['Department']).reset_index(drop=True)
X_test_final = X_test.drop(columns=['Department']).reset_index(drop=True)



print("Final training columns:")
print(X_train_final.columns)

print(f"Shape of y_train_dept_encoded: {y_train_dept_encoded.shape}")
print(f"Shape of y_test_dept_encoded: {y_test_dept_encoded.shape}")

Final training columns:
Index(['Age', 'HourlyRate', 'Education', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'PercentSalaryHike', 'NumCompaniesWorked'],
      dtype='object')
Shape of y_train_dept_encoded: (1176, 3)
Shape of y_test_dept_encoded: (294, 3)




In [13]:
# Create a OneHotEncoder for the Attrition column
# Fit the encoder to the training data
# Create two new variables by applying the encoder
# to the training and testing data
# sparse_output replaces the sparse argument (deprecated in scikit-learn 1.2)
encoder_attrition = OneHotEncoder(sparse_output=False, drop='first')

encoded_attrition_train = encoder_attrition.fit_transform(y_train_attrition.values.reshape(-1, 1))

encoded_attrition_test = encoder_attrition.transform(y_test_attrition.values.reshape(-1, 1))

encoded_attrition_train_df = pd.DataFrame(encoded_attrition_train, columns=encoder_attrition.get_feature_names_out(['Attrition']))
encoded_attrition_test_df = pd.DataFrame(encoded_attrition_test, columns=encoder_attrition.get_feature_names_out(['Attrition']))

print("Encoded Attrition (training):")
print(encoded_attrition_train_df.head())

print("Encoded Attrition (testing):")
print(encoded_attrition_test_df.head())

Encoded Attrition (training):
   Attrition_1
0          0.0
1          0.0
2          0.0
3          0.0
4          0.0
Encoded Attrition (testing):
   Attrition_1
0          0.0
1          0.0
2          1.0
3          0.0
4          0.0




## Create, Compile, and Train the Model

In [14]:
# Find the number of columns in the X training data
num_columns = X_train_final.shape[1]
print(f"Number of columns (features) in X_train: {num_columns}")

# Create the input layer
input_layer = layers.Input(shape=(num_columns,))

# Create at least two shared layers
shared_layer_1 = layers.Dense(64, activation='relu')(input_layer)
shared_layer_2 = layers.Dense(32, activation='relu')(shared_layer_1)

Number of columns (features) in X_train: 8


In [15]:
# Create a branch for Department
# with a hidden layer and an output layer
# Create the hidden layer
# Create the output layer
department_hidden_layer = layers.Dense(16, activation='relu')(shared_layer_2)
department_output = layers.Dense(3, activation='softmax', name='department_output')(department_hidden_layer)


In [16]:
# Create a branch for Attrition
# with a hidden layer and an output layer
# Create the hidden layer
# Create the output layer
attrition_hidden_layer = layers.Dense(16, activation='relu')(shared_layer_2)
attrition_output = layers.Dense(1, activation='sigmoid', name='attrition_output')(attrition_hidden_layer)
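
The two output activations chosen above behave differently, and a small standard-library sketch may make that concrete. The functions and the input numbers below are illustrative only; they are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Map a single raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Map a list of raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores, one for the attrition unit and three for the departments.
attrition_prob = sigmoid(-1.2)
department_probs = softmax([2.0, 0.5, -1.0])

print(f"P(attrition) = {attrition_prob:.3f}")
print("P(department) =", [round(p, 3) for p in department_probs])
```

Sigmoid yields one independent probability, which suits the single yes/no attrition unit. Softmax yields one probability per class with the constraint that they sum to 1, which suits the three mutually exclusive departments.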


In [17]:
# Create the model
model = models.Model(inputs=input_layer, outputs=[attrition_output, department_output])
# Compile the model
model.compile(optimizer='adam',
              loss={'attrition_output': 'binary_crossentropy',
                    'department_output': 'categorical_crossentropy'},
              metrics={'attrition_output': BinaryAccuracy(name='attrition_accuracy'),
                       'department_output': CategoricalAccuracy(name='department_accuracy')})
# Summarize the model
model.summary()

In [21]:
# Train the model
history = model.fit(
    X_train_final,
    {'attrition_output': y_train_attrition,
     'department_output': y_train_dept_encoded},
    epochs=50,
    batch_size=32,
    validation_split=0.2
)

Epoch 1/50
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - attrition_output_attrition_accuracy: 0.8498 - department_output_department_accuracy: 0.6275 - loss: 1.1399 - val_attrition_output_attrition_accuracy: 0.7881 - val_department_output_department_accuracy: 0.6271 - val_loss: 1.3307
Epoch 2/50
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - attrition_output_attrition_accuracy: 0.8293 - department_output_department_accuracy: 0.6440 - loss: 1.1795 - val_attrition_output_attrition_accuracy: 0.7966 - val_department_output_department_accuracy: 0.6314 - val_loss: 1.3302
Epoch 3/50
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - attrition_output_attrition_accuracy: 0.8438 - department_output_department_accuracy: 0.6272 - loss: 1.1615 - val_attrition_output_attrition_accuracy: 0.7966 - val_department_output_department_accuracy: 0.6271 - val_loss: 1.3108
Epoch 4/50
30/30 ━━━━━━━━━━━━━━━━━━━━ ...

In [26]:
# Evaluate the model with the testing data
results = model.evaluate(
    X_test_final,
    {'attrition_output': y_test_attrition,
     'department_output': y_test_dept_encoded},
    batch_size=32
)
attrition_pred = model.predict(X_test_final)[0]
department_pred = model.predict(X_test_final)[1]

attrition_pred_classes = (attrition_pred > 0.5).astype(int)

department_pred_classes = np.argmax(department_pred, axis=1)

attrition_accuracy = BinaryAccuracy()
attrition_accuracy.update_state(y_test_attrition, attrition_pred_classes)
attrition_acc_result = attrition_accuracy.result().numpy()

department_accuracy = CategoricalAccuracy()
department_accuracy.update_state(y_test_dept_encoded, department_pred)
department_acc_result = department_accuracy.result().numpy()

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - attrition_output_attrition_accuracy: 0.8483 - department_output_department_accuracy: 0.6320 - loss: 1.1873
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step


In [27]:
# Print the accuracy for both department and attrition
# Print the full evaluation results

print(f"Manual Attrition Accuracy: {attrition_acc_result}")
print(f"Manual Department Accuracy: {department_acc_result}")

print(f"Full Results: {results}")


Manual Attrition Accuracy: 0.8673469424247742
Manual Department Accuracy: 0.6564626097679138
Full Results: [1.1475584506988525, 0.8673469424247742, 0.6564626097679138]
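
In the output above, the second and third entries of the results list match the two manually computed accuracies, and the first entry is the loss. The sketch below shows, under simplifying assumptions, what those two accuracy computations amount to: a 0.5 threshold for the attrition probabilities and an argmax comparison for the department probabilities. The helper functions and the tiny input lists are hypothetical and use only the standard library.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the 0/1 label."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def categorical_accuracy(y_true_one_hot, y_prob):
    """Share of samples whose most probable class matches the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_prob))
    return hits / len(y_true_one_hot)


# Made-up labels and predicted probabilities for illustration.
attrition_true = [0, 0, 1, 0]
attrition_prob = [0.1, 0.7, 0.8, 0.3]

department_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
department_prob = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]

attrition_acc = binary_accuracy(attrition_true, attrition_prob)
department_acc = categorical_accuracy(department_true, department_prob)

print(f"Attrition accuracy: {attrition_acc}")
print(f"Department accuracy: {department_acc}")
```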


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. Not on its own. For the attrition output (binary classification), if most employees do not leave (say 90% stay and 10% leave), a model can reach 90% accuracy simply by predicting "no attrition" for everyone, so accuracy says little about how well the model detects the employees who actually leave. The same applies to the department output (multi-class classification): if one department dominates the dataset, accuracy can reflect a bias toward that majority class. Accuracy is intuitive and widely used, but when the classes are imbalanced, per-class metrics such as precision, recall, and F1 score are more informative.

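2. The attrition output uses a sigmoid activation. For binary classification, we want to know how likely it is that an event will occur (here, an employee leaving the company), and sigmoid produces a value between 0 and 1 that can be read directly as that probability, which also shows how confident the prediction is. The department output uses a softmax activation because the three departments are mutually exclusive classes: softmax returns one probability per department, and the three probabilities sum to 1.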

3. A few ways this model might be improved:

   - Feature engineering: if the current feature set does not capture enough information about the employees, add more relevant features.
   - Balance the dataset: if the attrition data is imbalanced (far more employees who stay than leave), the model may have difficulty detecting the minority class. This can be addressed by oversampling the minority class (for example with SMOTE, the Synthetic Minority Over-sampling Technique) or by undersampling the majority class.
   - Class weights: assign higher weights to the minority class during training to make the model more sensitive to it. Keras allows you to pass class weights in the fit() function.
   - Cross-validation: use K-fold cross-validation to get a more reliable estimate of model performance across different data partitions.
   - Data augmentation: for some structured data, it may be possible to create synthetic samples that represent realistic variations of existing rows. This is more common for image and text data, but techniques like SMOTE can be applied to numerical data when the classes are imbalanced.
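
To make the imbalance argument from answers 1 and 3 concrete, here is a minimal standard-library sketch. It uses the hypothetical 90/10 split mentioned in answer 1, not the real dataset, and the helper names are invented for the example. The weighting rule `n_samples / (n_classes * class_count)` is the common "balanced" heuristic. Whether `class_weight` is accepted for a multi-output model like this one may depend on the Keras version, so treat that part as an assumption to verify.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in positives) / len(positives)


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical data: 90 employees stay (0), 10 leave (1).
y_true = [0] * 90 + [1] * 10
# A "model" that always predicts "no attrition".
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
weights = balanced_class_weights(y_true)

print(f"Accuracy: {baseline_accuracy}")   # high despite a useless model
print(f"Recall for attrition: {baseline_recall}")
print(f"Class weights: {weights}")
```

The always-"no" baseline reaches 0.9 accuracy but 0.0 recall on the attrition class, which is the gap that accuracy alone hides. The resulting weights give the minority class ten times the weight of the majority class in this example.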