<a href="https://colab.research.google.com/github/IvannaPrice/neural-network-challenge-2/blob/main/attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1: Preprocessing

In [37]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [38]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Unnamed: 0,0
Age,43
Attrition,2
BusinessTravel,3
Department,3
DistanceFromHome,29
Education,5
EducationField,6
EnvironmentSatisfaction,4
HourlyRate,71
JobInvolvement,4


In [39]:
attrition_df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [40]:
# Create y_df with only the 'Attrition' and 'Department' columns
y_df = attrition_df[['Attrition', 'Department']]
y_df.head()

Unnamed: 0,Attrition,Department
0,Yes,Sales
1,No,Research & Development
2,Yes,Research & Development
3,No,Research & Development
4,No,Research & Development
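Before modelling, it can help to check how evenly each target's classes are represented, since that affects how much an accuracy score means. The sketch below uses a small made-up frame (`toy_y` is a hypothetical stand-in for `y_df`, not the real data). It prints the class shares and derives "balanced" class weights with the common formula `n_samples / (n_classes * class_count)`.

```python
import pandas as pd

# Toy stand-in for y_df; these values are invented for illustration only.
toy_y = pd.DataFrame({
    "Attrition": ["No"] * 8 + ["Yes"] * 2,
    "Department": ["Research & Development"] * 6 + ["Sales"] * 3 + ["Human Resources"],
})

# Share of each class within each target column.
attrition_share = toy_y["Attrition"].value_counts(normalize=True)
department_share = toy_y["Department"].value_counts(normalize=True)
print(attrition_share)
print(department_share)

# "Balanced" class weights: n_samples / (n_classes * class_count).
counts = toy_y["Attrition"].value_counts()
class_weights = (len(toy_y) / (counts.size * counts)).to_dict()
print(class_weights)
```

On the real data, the same two `value_counts` calls could be run on `y_df` directly.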


In [41]:
# Create a list of at least 10 column names to use as X data
# Updated column name to match the actual column name in the dataframe
x_columns = ['Age', 'Education', 'JobLevel', 'HourlyRate', 'TotalWorkingYears',
             'YearsAtCompany', 'YearsInCurrentRole', 'YearsWithCurrManager',
             'JobSatisfaction', 'WorkLifeBalance']


# Create X_df using your selected columns
X_df = attrition_df[x_columns]
X_df.head()

# Show the data types for X_df
X_df.dtypes

Unnamed: 0,0
Age,int64
Education,int64
JobLevel,int64
HourlyRate,int64
TotalWorkingYears,int64
YearsAtCompany,int64
YearsInCurrentRole,int64
YearsWithCurrManager,int64
JobSatisfaction,int64
WorkLifeBalance,int64


In [42]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)


In [43]:
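# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
# All selected columns are already int64, so get_dummies leaves them unchanged here
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
# Align the test columns with the training columns in case encoding ever produces different sets
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
X_df.head()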

Unnamed: 0,Age,Education,JobLevel,HourlyRate,TotalWorkingYears,YearsAtCompany,YearsInCurrentRole,YearsWithCurrManager,JobSatisfaction,WorkLifeBalance
0,41,2,2,94,8,6,4,5,4,1
1,49,1,2,61,10,10,7,7,2,3
2,37,2,1,92,7,0,0,0,3,3
3,33,4,1,56,8,8,7,0,3,3
4,27,1,1,40,6,2,2,2,2,3


In [44]:
from sklearn.preprocessing import StandardScaler
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data
X_train_scaled = scaler.fit_transform(X_train)

# Scale the training and testing data
X_test_scaled = scaler.transform(X_test)
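
The cell above fits the scaler on the training split only and reuses those statistics for the test split. As a minimal sketch of what that means numerically, the example below uses made-up arrays (`toy_train` and `toy_test` are hypothetical stand-ins for `X_train` and `X_test`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers standing in for X_train / X_test.
toy_train = np.array([[20.0, 1.0], [30.0, 2.0], [40.0, 3.0]])
toy_test = np.array([[50.0, 2.0]])

toy_scaler = StandardScaler()
# fit_transform learns the per-column mean and std from the training rows only.
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses those training statistics; nothing is learned from the test rows.
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_)                # per-column training means
print(toy_train_scaled.mean(axis=0))   # approximately zero after scaling
print(toy_test_scaled)                 # test rows expressed in training units
```

The scaled test values are not centred on zero themselves, which is expected: they are measured against the training distribution.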


In [45]:
from sklearn.preprocessing import OneHotEncoder
# Create a OneHotEncoder for the Department column
# The 'sparse' parameter was renamed to 'sparse_output' in scikit-learn 1.2.
# sparse_output defaults to True, so set it to False to get a dense array.
department_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the training data
# Create two new variables by applying the encoder
# to the training and testing data
y_train_department_encoded = department_encoder.fit_transform(y_train[['Department']])
y_test_department_encoded = department_encoder.transform(y_test[['Department']])

In [46]:
from sklearn.preprocessing import OneHotEncoder
# Create a OneHotEncoder for the Attrition column
# The 'sparse' parameter was renamed to 'sparse_output' in scikit-learn 1.2.
# sparse_output defaults to True, so set it to False to get a dense array.
attrition_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit the encoder to the training data, then create two new variables
# by applying the encoder to the training and testing data
y_train_attrition_encoded = attrition_encoder.fit_transform(y_train[['Attrition']])
y_test_attrition_encoded = attrition_encoder.transform(y_test[['Attrition']])

## Create, Compile, and Train the Model

In [47]:
from tensorflow.keras.layers import Input, Dense

# Find the number of columns in the X training data
input_shape = X_train_scaled.shape[1]
print(input_shape)

# Create the input layer
input_layer = Input(shape=(input_shape,))

# Create at least two shared layers
shared_layer1 = Dense(64, activation='relu')(input_layer)
shared_layer2 = Dense(32, activation='relu')(shared_layer1)

10


In [48]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden_layer = Dense(16, activation='relu')(shared_layer2)

# Create the output layer
department_output = Dense(y_train_department_encoded.shape[1], activation='softmax', name='department_output')(department_hidden_layer)
print(department_output)

<KerasTensor shape=(None, 3), dtype=float32, sparse=False, name=keras_tensor_22>


In [49]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden_layer = Dense(16, activation='relu')(shared_layer2)

# Create the output layer
attrition_output = Dense(y_train_attrition_encoded.shape[1], activation='softmax', name='attrition_output')(attrition_hidden_layer)
print(attrition_output)

<KerasTensor shape=(None, 2), dtype=float32, sparse=False, name=keras_tensor_24>
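
The attrition branch above ends in a two-unit softmax. As a side note, a two-class softmax gives the same positive-class probability as a sigmoid applied to the difference of the two logits. The sketch below checks this with NumPy on made-up logits; `softmax` and `sigmoid` are small helper functions written for this illustration, not Keras calls.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits for three samples and two classes (No, Yes).
logits = np.array([[0.3, -1.2], [2.0, 0.5], [-0.7, 0.7]])

p_softmax = softmax(logits)[:, 1]                 # P(Yes) from the 2-unit softmax
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])  # P(Yes) from a sigmoid on the logit gap
print(np.allclose(p_softmax, p_sigmoid))
```

So a single sigmoid unit with binary cross-entropy would be an equivalent alternative for this branch; the two-unit softmax mainly keeps both outputs in the same format.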


In [50]:
from tensorflow.keras.models import Model
# Create the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output])

# Compile the model
model.compile(optimizer='adam',
              loss={'department_output': 'categorical_crossentropy', 'attrition_output': 'categorical_crossentropy'},
              metrics={'department_output': 'accuracy', 'attrition_output': 'accuracy'})

# Summarize the model
model.summary()

In [51]:
# Train the model
history = model.fit(
    X_train_scaled,
    {'department_output': y_train_department_encoded, 'attrition_output': y_train_attrition_encoded},
    validation_data=(
        X_test_scaled,
        {'department_output': y_test_department_encoded, 'attrition_output': y_test_attrition_encoded}
    ),
    epochs=50,
    batch_size=32
)



Epoch 1/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - attrition_output_accuracy: 0.7247 - department_output_accuracy: 0.3523 - loss: 1.6803 - val_attrition_output_accuracy: 0.8673 - val_department_output_accuracy: 0.6701 - val_loss: 1.2724
Epoch 2/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - attrition_output_accuracy: 0.8389 - department_output_accuracy: 0.6673 - loss: 1.2473 - val_attrition_output_accuracy: 0.8673 - val_department_output_accuracy: 0.6667 - val_loss: 1.1706
Epoch 3/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - attrition_output_accuracy: 0.8307 - department_output_accuracy: 0.6537 - loss: 1.2017 - val_attrition_output_accuracy: 0.8673 - val_department_output_accuracy: 0.6667 - val_loss: 1.1560
Epoch 4/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - attrition_output_accuracy: 0.8378 - department_output_accuracy: 0.6588 - loss: 1.1641 - val_attri

In [52]:
# Evaluate the model with testing data
evaluation = model.evaluate(
    X_test_scaled,
    {'department_output': y_test_department_encoded, 'attrition_output': y_test_attrition_encoded}
)

# Print accuracy for each branch
# Print the entire evaluation result to understand the structure
print("Full evaluation output:", evaluation)
# The ~0.86 value lines up with the attrition validation accuracy in the training log
# and the ~0.61 value with the department one, so index -2 is attrition and -1 is department
print(f"Attrition Test Accuracy: {evaluation[-2]}")
print(f"Department Test Accuracy: {evaluation[-1]}")


10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - attrition_output_accuracy: 0.8452 - department_output_accuracy: 0.5817 - loss: 1.3198
Full evaluation output: [1.257799744606018, 0.8639456033706665, 0.6088435649871826]
Attrition Test Accuracy: 0.8639456033706665
Department Test Accuracy: 0.6088435649871826


In [53]:
# Print the accuracy for both attrition and department
# (index -2 is the attrition accuracy, index -1 the department accuracy)
print(f"Attrition Test Accuracy: {evaluation[-2] * 100:.2f}%")
print(f"Department Test Accuracy: {evaluation[-1] * 100:.2f}%")

Attrition Test Accuracy: 86.39%
Department Test Accuracy: 60.88%
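
Positional indexing into the list returned by `model.evaluate` depends on the order in which Keras reports the metrics, so it is easy to attach a number to the wrong branch. One way to cross-check is to recompute each branch's accuracy from its predictions. The sketch below shows the idea on made-up arrays; `one_hot_accuracy`, `toy_true`, and `toy_pred` are hypothetical names introduced for this illustration.

```python
import numpy as np

def one_hot_accuracy(y_true_onehot, y_pred_probs):
    """Fraction of rows where the predicted class matches the one-hot label."""
    true_idx = np.argmax(y_true_onehot, axis=1)
    pred_idx = np.argmax(y_pred_probs, axis=1)
    return float(np.mean(true_idx == pred_idx))

# Toy stand-ins for one-hot labels and one output head's predicted probabilities.
toy_true = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])
toy_pred = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],   # wrong: predicts class 0, label is class 1
    [0.2, 0.2, 0.6],
])

acc = one_hot_accuracy(toy_true, toy_pred)
print(f"Accuracy: {acc:.2f}")
```

With the model in this notebook, `model.predict(X_test_scaled)` should return the predictions in the order the outputs were passed to `Model` (department first, then attrition), so each array could be compared against `y_test_department_encoded` and `y_test_attrition_encoded` respectively.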


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. Accuracy is a reasonable baseline metric for classification, but it is probably not the best choice here because the classes appear to be imbalanced. In the training log, the validation accuracy for Attrition is 0.8673 in each of the first three epochs, which is consistent with a model that mostly predicts the majority class. When one class (for Attrition or for certain Department categories) is much more frequent than the others, accuracy can give a misleading impression of performance by favoring that class.

   Alternative metrics such as precision, recall, and F1 score may give a better picture of the model's performance, particularly when correctly predicting one class matters more (e.g., detecting attrition).

2. Both output layers (Department and Attrition) use softmax. This choice is appropriate because:

   - Softmax is the usual choice for multi-class classification, where each output represents the probability of a specific class. It ensures that the probabilities across all classes sum to 1, so the outputs can be read as probabilities.
   - For Department, softmax predicts one of the three department categories.
   - For Attrition, the task is binary, so a single sigmoid unit would also work. Softmax was chosen so that the model outputs a probability for each class (Attrition/No Attrition) in the same format as the multi-class output.

3. Here are a few potential ways to improve this model:

   - Feature Engineering: Review and add relevant features that may improve predictive power. For example, combining features or creating new ones based on domain knowledge could help the model capture patterns.
   - Hyperparameter Tuning: Experiment with different hyperparameters, such as learning rate, batch size, and number of epochs. Grid search or random search can help find good values.
   - Model Complexity: Add more hidden layers or units, especially in the Department branch, which has the lower test accuracy.
   - Regularization: Add dropout layers or apply L2 regularization to prevent overfitting, especially if the model performs well on training data but poorly on validation/testing data.
   - Class Imbalance Techniques: If the data is imbalanced (e.g., in the Attrition classes), consider class weighting, oversampling, or undersampling to help the model perform better on minority classes.
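
To make the point about accuracy concrete, here is a minimal sketch on made-up labels (not the notebook's data). It assumes an 80/20 class split and a hypothetical model that always predicts the majority class, then compares accuracy with precision, recall, and F1 for the minority class using scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 0 = "No" (majority), 1 = "Yes" (minority).
y_true = [0] * 8 + [1] * 2
# A hypothetical model that always predicts the majority class.
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when a ratio is undefined.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks respectable even though no minority-class case is ever detected, which is the situation the other metrics are meant to expose.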