## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

# Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

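| | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | HourlyRate | JobInvolvement | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 41 | Yes | Travel_Rarely | Sales | 1 | 2 | Life Sciences | 2 | 94 | 3 | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | Research & Development | 8 | 1 | Life Sciences | 3 | 61 | 2 | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | Research & Development | 2 | 2 | Other | 4 | 92 | 2 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | Research & Development | 3 | 4 | Life Sciences | 4 | 56 | 3 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | Research & Development | 2 | 1 | Medical | 1 | 40 | 3 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |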


In [2]:
# Determine the number of unique values in each column
attrition_df.nunique()

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64

In [3]:
# Select the two target columns: 'Attrition' (whether an employee left)
# and 'Department' (the employee's department)
y_df = attrition_df[['Attrition', 'Department']]

In [4]:
# Display a sample of the target data
y_df.head()

| | Attrition | Department |
| --- | --- | --- |
| 0 | Yes | Sales |
| 1 | No | Research & Development |
| 2 | Yes | Research & Development |
| 3 | No | Research & Development |
| 4 | No | Research & Development |


In [5]:
# Define a list of numerical and categorical columns to be used as features (X variables)
selected_columns = ['Education', 'Age', 'DistanceFromHome', 'JobSatisfaction', 'OverTime',
                    'StockOptionLevel', 'WorkLifeBalance', 'YearsAtCompany',
                    'YearsSinceLastPromotion', 'NumCompaniesWorked']
# Create a DataFrame containing only the selected features
X_df = attrition_df[selected_columns]

# Show the data types for X_df
print(X_df.dtypes)


Education                   int64
Age                         int64
DistanceFromHome            int64
JobSatisfaction             int64
OverTime                   object
StockOptionLevel            int64
WorkLifeBalance             int64
YearsAtCompany              int64
YearsSinceLastPromotion     int64
NumCompaniesWorked          int64
dtype: object


In [6]:
# Display a sample of the X data
X_df.head()

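| | Education | Age | DistanceFromHome | JobSatisfaction | OverTime | StockOptionLevel | WorkLifeBalance | YearsAtCompany | YearsSinceLastPromotion | NumCompaniesWorked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 41 | 1 | 4 | Yes | 0 | 1 | 6 | 0 | 8 |
| 1 | 1 | 49 | 8 | 2 | No | 1 | 3 | 10 | 1 | 1 |
| 2 | 2 | 37 | 2 | 3 | Yes | 0 | 3 | 0 | 0 | 6 |
| 3 | 4 | 33 | 3 | 3 | Yes | 0 | 3 | 8 | 3 | 1 |
| 4 | 1 | 27 | 2 | 2 | No | 1 | 3 | 2 | 2 | 9 |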


In [7]:
# Re-import train_test_split (already imported in the first cell);
# the actual split happens in a later cell, after 'OverTime' is encoded
from sklearn.model_selection import train_test_split

In [8]:
# Convert categorical column 'OverTime' to numeric (Yes=1, No=0).
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
X_df = X_df.assign(OverTime=X_df['OverTime'].map({'Yes': 1, 'No': 0}))

# Fill any missing values with the column median
X_df = X_df.fillna(X_df.median())


In [9]:
# Display the X data after conversion
X_df.head()

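| | Education | Age | DistanceFromHome | JobSatisfaction | OverTime | StockOptionLevel | WorkLifeBalance | YearsAtCompany | YearsSinceLastPromotion | NumCompaniesWorked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 41 | 1 | 4 | 1 | 0 | 1 | 6 | 0 | 8 |
| 1 | 1 | 49 | 8 | 2 | 0 | 1 | 3 | 10 | 1 | 1 |
| 2 | 2 | 37 | 2 | 3 | 1 | 0 | 3 | 0 | 0 | 6 |
| 3 | 4 | 33 | 3 | 3 | 1 | 0 | 3 | 8 | 3 | 1 |
| 4 | 1 | 27 | 2 | 2 | 0 | 1 | 3 | 2 | 2 | 9 |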


In [10]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

In [11]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data only
scaler.fit(X_train)

# Scale the training and testing data with the fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [12]:
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder for the 'Department' column
dept_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the 'Department' column of the training data
dept_encoder.fit(y_train[['Department']])  # Fit only on the training data to avoid data leakage

# Transform the training and testing data with the fitted encoder
dept_train = dept_encoder.transform(y_train[['Department']])
dept_test = dept_encoder.transform(y_test[['Department']])


In [13]:
# Display the first 5 rows of the encoded training data
print("Encoded Department (Training):")
print(dept_train[:5])
# Display the first 5 rows of the encoded testing data
print("Encoded Department (Testing):")
print(dept_test[:5])

Encoded Department (Training):
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]]
Encoded Department (Testing):
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]


In [14]:
# Initialize OneHotEncoder for the 'Attrition' column
attrition_encoder = OneHotEncoder(sparse_output=False)  # Use sparse_output=False for dense arrays

# Fit the encoder to the 'Attrition' column of the training data
attrition_encoder.fit(y_train[['Attrition']])  # Fit only on the training data to avoid data leakage

# Transform the 'Attrition' column for both training and testing data
attrition_train = attrition_encoder.transform(y_train[['Attrition']])  # Transform training data
attrition_test = attrition_encoder.transform(y_test[['Attrition']])    # Transform testing data


In [15]:
# Display the first 5 rows of the encoded training data
print("Encoded Attrition (Training):")
print(attrition_train[:5])
# Display the first 5 rows of the encoded testing data
print("Encoded Attrition (Testing):")
print(attrition_test[:5])

Encoded Attrition (Training):
[[1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]
Encoded Attrition (Testing):
[[1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]


## Part 2: Create, Compile, and Train the Model

In [16]:
# Find the number of columns in the X training data
input_dim = X_train_scaled.shape[1]
print(f"Number of columns in X_train_scaled: {input_dim}")

# Create the input layer
from tensorflow.keras import Input

input_layer = Input(shape=(input_dim,), name="input_layer")

# Create two shared layers
shared_layer_1 = layers.Dense(64, activation="relu", name="shared_layer_1")(input_layer)
shared_layer_2 = layers.Dense(128, activation="relu", name="shared_layer_2")(shared_layer_1)

Number of columns in X_train_scaled: 10


In [17]:
# Create a branch for Department
# Create the hidden layer
department_branch = layers.Dense(32, activation="relu", name="department_hidden_layer")(shared_layer_2)
# Create the output layer (one unit per encoded department, i.e. 3)
department_output = layers.Dense(dept_train.shape[1], activation="softmax", name="department_output")(department_branch)

In [18]:
# Create a branch for Attrition
# Create the hidden layer
attrition_branch = layers.Dense(32, activation="relu", name="attrition_hidden_layer")(shared_layer_2)
# Create the output layer
attrition_output = layers.Dense(attrition_train.shape[1], activation="softmax", name="attrition_output")(attrition_branch)

In [19]:
# Define the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output], name="multi_output_model")

# Compile the model
model.compile(
    optimizer="adam",
    loss={
        "department_output": "categorical_crossentropy",  # Loss for Department (three classes)
        "attrition_output": "binary_crossentropy",  # Loss for Attrition (binary classification)
    },
    metrics={
        "department_output": "accuracy",  # Metric for Department
        "attrition_output": "accuracy",   # Metric for Attrition
    },
)
# Print the model summary
model.summary()

In [20]:
history = model.fit(
    X_train_scaled,  # Input features
    {
        "department_output": dept_train,  # Target for Department
        "attrition_output": attrition_train,   # Target for Attrition
    },
    validation_data=(
        X_test_scaled,  # Validation input features
        {
            "department_output": dept_test,  # Validation target for Department
            "attrition_output": attrition_test,   # Validation target for Attrition
        },
    ),
    epochs=100,
    batch_size=32,
    verbose=1,
)

Epoch 1/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - attrition_output_accuracy: 0.7669 - attrition_output_loss: 0.6003 - department_output_accuracy: 0.5772 - department_output_loss: 0.9282 - loss: 1.5285 - val_attrition_output_accuracy: 0.8673 - val_attrition_output_loss: 0.3794 - val_department_output_accuracy: 0.6667 - val_department_output_loss: 0.7833 - val_loss: 1.1979
Epoch 2/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - attrition_output_accuracy: 0.8442 - attrition_output_loss: 0.4281 - department_output_accuracy: 0.6789 - department_output_loss: 0.7435 - loss: 1.1716 - val_attrition_output_accuracy: 0.8673 - val_attrition_output_loss: 0.3714 - val_department_output_accuracy: 0.6633 - val_department_output_loss: 0.7914 - val_loss: 1.1946
Epoch 3/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - attrition_output_accuracy: 0.8194 - attrition_output_loss: 0.4175 - department_output_accu

In [21]:
# Evaluate the model
results = model.evaluate(
    X_test_scaled,
    {
        "department_output": dept_test,  # Test target for Department
        "attrition_output": attrition_test,   # Test target for Attrition
    },
    verbose=0,
)

In [22]:
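# Print evaluation results.
# Note: these positional indices assume a particular ordering of the values
# returned by model.evaluate(). The ordering can differ between Keras versions
# (the training log above lists the attrition metrics first), so verify it,
# for example with return_dict=True, before trusting these labels.
print(f"Test Loss: {results[0]}")
print(f"Department Accuracy: {results[3]}")
print(f"Attrition Accuracy: {results[4]}")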

Test Loss: 4.515039443969727
Department Accuracy: 0.8061224222183228
Attrition Accuracy: 0.5612244606018066


# Summary

1. **Is accuracy the best metric to use on this data? Why or why not?**

   Accuracy is probably not the best metric here, because the targets look imbalanced: most of the encoded Attrition rows printed above are `[1. 0.]`, and one department dominates the encoded Department rows. On imbalanced data, accuracy can be misleading. A model that predicts the majority class for almost every case still scores high accuracy while performing poorly on the minority class. Metrics such as precision, recall, F1-score, or the area under the ROC curve (AUC-ROC) are more informative under class imbalance.
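
To make the point about imbalance concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (85 "stay" and 15 "leave" cases, not taken from the attrition data), and `binary_metrics` is a hypothetical helper rather than a library function. It scores a model that always predicts the majority class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up imbalanced labels: 0 = stays (majority), 1 = leaves (minority)
y_true = [0] * 85 + [1] * 15
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the always-majority predictor reaches 85% accuracy, while its recall and F1-score for the minority class are zero, which is exactly the failure that accuracy alone hides.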

2. **What activation functions did you choose for your output layers, and why?**

   Both output layers in this model use softmax:
   - **Department**: Softmax over three units, because this is a multi-class problem where the model must choose one of three departments. Softmax provides a probability distribution across the classes.
   - **Attrition**: Softmax over two units, because the Attrition target was one-hot encoded into two columns. A single **sigmoid** unit trained on a 0/1 label is the more common choice for binary classification such as attrition (yes/no); it outputs one probability between 0 and 1 for the positive class.
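
The relationship between the two activations can be checked numerically. The sketch below uses made-up logits and two hypothetical helper functions (`softmax`, `sigmoid`) written in plain Python. It shows that a two-unit softmax assigns the same probability to the positive class as a single sigmoid applied to the difference of the two logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic sigmoid of a single logit."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up logits for the two attrition classes: [No, Yes]
logits = [0.3, 1.7]

p_no, p_yes = softmax(logits)
# A single sigmoid on the logit difference gives the same "Yes" probability
p_yes_sigmoid = sigmoid(logits[1] - logits[0])

print(f"softmax: P(No)={p_no:.4f}, P(Yes)={p_yes:.4f}")
print(f"sigmoid: P(Yes)={p_yes_sigmoid:.4f}")
```

Under this view, the choice between a two-unit softmax and a one-unit sigmoid for a binary target is mostly about how the labels are encoded (two one-hot columns versus a single 0/1 column).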

3. **Can you name a few ways that this model might be improved?**

   A few ways to improve the model:
   - **Resampling**: If the classes are imbalanced, oversampling the minority class, undersampling the majority class, or techniques like SMOTE can help the model learn the minority class.
   - **Hyperparameter Tuning**: Tuning hyperparameters (such as learning rate, batch size, layer sizes, or optimizer type like Adam vs. SGD) can improve convergence and generalization.
   - **Regularization**: Dropout, L2 regularization, or early stopping can reduce overfitting. The test loss of about 4.5 after 100 epochs, compared with a validation loss of about 1.2 in the first epochs, suggests the model overfits when trained this long.
   - **Feature Engineering**: Only 10 columns were used as features. Adding more of the available columns, or transforming existing ones, may help the model capture better patterns in the data.
   - **Ensemble Methods**: Combining predictions from multiple models (e.g., random forest, XGBoost) can boost overall performance and reduce model variance.
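
As one illustration of handling class imbalance without resampling, the sketch below computes inverse-frequency class weights with the common heuristic `n_samples / (n_classes * class_count)`. The label counts are invented for illustration, and `balanced_class_weights` is a hypothetical helper. How such weights are passed to training depends on the framework and is not shown here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses n_samples / (n_classes * class_count), so rare classes
    receive larger weights and common classes smaller ones.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Made-up imbalanced labels, not taken from the attrition data
labels = ["No"] * 84 + ["Yes"] * 16

weights = balanced_class_weights(labels)
print(weights)
```

With these invented counts the minority class receives a weight of 3.125 and the majority class a weight below 1, so errors on the rare class would count for more in a weighted loss.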
