
## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Unnamed: 0,0
Age,43
Attrition,2
BusinessTravel,3
Department,3
DistanceFromHome,29
Education,5
EducationField,6
EnvironmentSatisfaction,4
HourlyRate,71
JobInvolvement,4


In [3]:
# Create y_df with the Attrition and Department columns

y_df = attrition_df[['Attrition', 'Department']]

In [179]:
# Create a list of at least 10 column names to use as X data
X_columns = [
    'Age',
    'BusinessTravel',
    'DistanceFromHome',
    'Education',
    'EnvironmentSatisfaction',
    'JobInvolvement',
    'JobLevel',
    'JobSatisfaction',
    'NumCompaniesWorked',
    'OverTime'
]


# Create X_df using your selected columns
X_df = attrition_df[X_columns]

# Show the data types for X_df

print(X_df.dtypes)

Age                         int64
BusinessTravel             object
DistanceFromHome            int64
Education                   int64
EnvironmentSatisfaction     int64
JobInvolvement              int64
JobLevel                    int64
JobSatisfaction             int64
NumCompaniesWorked          int64
OverTime                   object
dtype: object


In [16]:
# Assuming X_df contains your features and y_df contains your target variable
X = X_df
y = y_df['Attrition']

In [33]:
# Convert your X data to numeric data types however you see fit.
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning,
# and encode before splitting so X_train and X_test are already numeric.
X_df = X_df.copy()
X_df['BusinessTravel'] = pd.Categorical(X_df['BusinessTravel']).codes
X_df['OverTime'] = pd.Categorical(X_df['OverTime']).codes

In [34]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

print(X_df.dtypes)

Age                        int64
BusinessTravel              int8
DistanceFromHome           int64
Education                  int64
EnvironmentSatisfaction    int64
JobInvolvement             int64
JobLevel                   int64
JobSatisfaction            int64
NumCompaniesWorked         int64
OverTime                    int8
dtype: object
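As a small illustration of the integer encoding used for `BusinessTravel` and `OverTime`, the sketch below shows how `pd.Categorical(...).codes` numbers the categories in sorted order. The sample values are made up for the example and are not rows read from the dataset.

```python
import pandas as pd

# Hypothetical sample values, not rows from attrition.csv
travel = pd.Series(['Travel_Rarely', 'Non-Travel', 'Travel_Frequently', 'Travel_Rarely'])

cat = pd.Categorical(travel)

# Categories are sorted; each code is an index into that sorted list
mapping = dict(enumerate(cat.categories))
codes = cat.codes.tolist()

print(mapping)
print(codes)
```

Integer codes imply an ordering between categories, so one-hot encoding the inputs may be worth trying as an alternative.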


In [35]:
# Create a StandardScaler

scaler = StandardScaler()
# Fit the StandardScaler to the training data
scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
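The scaling step can be sketched by hand to show why the scaler is fit on the training data only: the mean and standard deviation come from the training rows and are then reused for the test rows. The arrays below are hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical feature matrices (rows = employees, columns = features)
X_train_demo = np.array([[20.0, 1.0], [30.0, 3.0], [40.0, 5.0]])
X_test_demo = np.array([[50.0, 3.0]])

# Statistics are computed from the training rows only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# The same training statistics are applied to both sets
X_train_demo_scaled = (X_train_demo - mean) / std
X_test_demo_scaled = (X_test_demo - mean) / std

print(X_train_demo_scaled.mean(axis=0))
print(X_test_demo_scaled)
```

Reusing the training statistics keeps information about the test set from leaking into preprocessing.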


In [36]:
print(y_df.columns)

Index(['Attrition', 'Department'], dtype='object')


In [37]:
print(y_train.columns)

Index(['Attrition', 'Department'], dtype='object')


In [46]:


# Create a OneHotEncoder for the Department column
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit the encoder to the training data
onehot_encoder.fit(y_train[['Department']])

# Create two new variables by applying the encoder to the training and testing data.
# These names match the ones passed to model.fit and model.evaluate later on.
y_department_train_encoded = onehot_encoder.transform(y_train[['Department']])
y_department_test_encoded = onehot_encoder.transform(y_test[['Department']])

# Get the feature names for the encoded columns
department_columns = onehot_encoder.get_feature_names_out(['Department'])

# Create DataFrames with the encoded values
y_train_encoded_df = pd.DataFrame(y_department_train_encoded, columns=department_columns, index=y_train.index)
y_test_encoded_df = pd.DataFrame(y_department_test_encoded, columns=department_columns, index=y_test.index)

# Concatenate the encoded department columns with the original y_train and y_test
y_train_final = pd.concat([y_train.drop('Department', axis=1), y_train_encoded_df], axis=1)
y_test_final = pd.concat([y_test.drop('Department', axis=1), y_test_encoded_df], axis=1)

print(y_department_train_encoded[:5])


[[0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]]
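To read one-hot rows like the ones printed above, each row can be mapped back to a label by taking the position of the 1. The sketch below assumes the columns follow alphabetical category order, which is how `OneHotEncoder` orders categories by default; the name `'Human Resources'` is an assumption, since the data preview only shows the other two departments.

```python
import numpy as np

# Assumed column order (alphabetical). 'Human Resources' is an assumption;
# only 'Research & Development' and 'Sales' appear in the data preview.
categories = np.array(['Human Resources', 'Research & Development', 'Sales'])

# Two one-hot rows in the same layout as the printed output
one_hot = np.array([[0., 1., 0.],
                    [0., 0., 1.]])

# argmax finds the column holding the 1 in each row
labels = categories[one_hot.argmax(axis=1)]
print(labels.tolist())
```

On the fitted encoder itself, the actual category order is available from `onehot_encoder.categories_`.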


In [44]:

# Create a OneHotEncoder for the Attrition column
attrition_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit the encoder to the training data
attrition_encoder.fit(y_train[['Attrition']])

# Create two new variables by applying the encoder to the training and testing data
y_attrition_train_onehot = attrition_encoder.transform(y_train[['Attrition']])
y_attrition_test_onehot = attrition_encoder.transform(y_test[['Attrition']])

# Get the feature names for the encoded columns
attrition_columns = attrition_encoder.get_feature_names_out(['Attrition'])

# Create DataFrames with the encoded values
y_train_encoded_df = pd.DataFrame(y_attrition_train_onehot, columns=attrition_columns, index=y_train.index)
y_test_encoded_df = pd.DataFrame(y_attrition_test_onehot, columns=attrition_columns, index=y_test.index)

# Concatenate the encoded Attrition columns with the original y_train and y_test
y_train_final = pd.concat([y_train.drop('Attrition', axis=1), y_train_encoded_df], axis=1)
y_test_final = pd.concat([y_test.drop('Attrition', axis=1), y_test_encoded_df], axis=1)

# The attrition branch ends in a single sigmoid unit, so keep only the 'Yes'
# column (categories sort as ['No', 'Yes']) as an (n, 1) binary target.
y_attrition_train_encoded = y_attrition_train_onehot[:, [1]]
y_attrition_test_encoded = y_attrition_test_onehot[:, [1]]

print("y_train_encoded:")
print(y_attrition_train_onehot)


y_train_encoded:
[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [0. 1.]
 [1. 0.]
 [1. 0.]]


## Create, Compile, and Train the Model

In [165]:
# Find the number of columns in the X training data
num_features = X_train_scaled.shape[1]

# Create the input layer
input_layer = layers.Input(shape=(num_features,), name='input_features')

# Create at least two shared layers
shared_layer1 = layers.Dense(64, activation='relu')(input_layer)
shared_layer2 = layers.Dense(128, activation='relu')(shared_layer1)



In [169]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer

department_output = layers.Dense(3, activation='softmax', name='department_output')(department_hidden)


In [170]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer

attrition_output = layers.Dense(1, activation='sigmoid', name='attrition_output')(attrition_hidden)

In [171]:
# Create the model
model = Model(inputs = input_layer, outputs = [attrition_output, department_output])

# Compile the model

model.compile(optimizer='adam', loss= {'attrition_output': 'binary_crossentropy', 'department_output': 'categorical_crossentropy'}, metrics = {'attrition_output': 'accuracy', 'department_output': 'accuracy'})

# Summarize the model
model.summary()
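With two outputs and no `loss_weights` passed to `compile`, the single `loss` value Keras reports is the sum of the per-output losses. The sketch below reproduces that sum by hand with NumPy; the predictions are hypothetical, and the two helper functions are simplified stand-ins for the Keras losses (no clipping, mean over the batch).

```python
import numpy as np

def binary_crossentropy(y_true, p):
    # Mean binary cross-entropy over the batch
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def categorical_crossentropy(y_true, p):
    # Mean categorical cross-entropy over the batch
    return float(np.mean(-np.sum(y_true * np.log(p), axis=1)))

# Hypothetical targets and predictions for two employees
attrition_true = np.array([[1.0], [0.0]])
attrition_pred = np.array([[0.8], [0.1]])
department_true = np.array([[0., 1., 0.], [0., 0., 1.]])
department_pred = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])

attrition_loss = binary_crossentropy(attrition_true, attrition_pred)
department_loss = categorical_crossentropy(department_true, department_pred)

# With equal weights, the total is the plain sum of both branch losses
total_loss = attrition_loss + department_loss
print(attrition_loss, department_loss, total_loss)
```

Because the two terms are summed, the branch with the larger loss tends to dominate training; the `loss_weights` argument of `compile` is one way to rebalance them.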



In [156]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

In [172]:
try:
    history = model.fit(
        X_train_scaled,
        {'attrition_output': y_attrition_train_encoded,
         'department_output': y_department_train_encoded},
        epochs=100,
        batch_size=32,
        validation_split=0.2,
        verbose=1
    )
except Exception as e:
    print(f"An error occurred: {e}")
    print("Model output names:", model.output_names)
    print("y_attrition_train_encoded shape:", y_attrition_train_encoded.shape)
    print("y_department_train_encoded shape:", y_department_train_encoded.shape)

Epoch 1/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step - attrition_output_accuracy: 0.7814 - department_output_accuracy: 0.5390 - loss: 1.5075 - val_attrition_output_accuracy: 0.7966 - val_department_output_accuracy: 0.6314 - val_loss: 1.2945
Epoch 2/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - attrition_output_accuracy: 0.8295 - department_output_accuracy: 0.6461 - loss: 1.2041 - val_attrition_output_accuracy: 0.7966 - val_department_output_accuracy: 0.6314 - val_loss: 1.2317
Epoch 3/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - attrition_output_accuracy: 0.8566 - department_output_accuracy: 0.6697 - loss: 1.1085 - val_attrition_output_accuracy: 0.8178 - val_department_output_accuracy: 0.6271 - val_loss: 1.2141
Epoch 4/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - attrition_output_accuracy: 0.8478 - department_output_accuracy: 0.6691 - loss: 1.0774 - val_a

In [176]:
# Evaluate the model with the testing data
evaluation_results = model.evaluate(
    X_test_scaled,
    {'attrition_output': y_attrition_test_encoded, 'department_output': y_department_test_encoded},
    verbose=1  # Set to 1 to see the progress bar
)

# Print the results in the desired format
print("\nEvaluation Results:")
for name, value in zip(model.metrics_names, evaluation_results):
    print(f"{name}: {value:.4f}")

# Print the list of results
print("\nList of results:")
print(evaluation_results)

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - attrition_output_accuracy: 0.8408 - department_output_accuracy: 0.5700 - loss: 4.1501

Evaluation Results:
loss: 3.7438
compile_metrics: 0.8605

List of results:
[3.743776559829712, 0.8605442047119141, 0.5680271983146667]


In [178]:
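# Print the accuracy for both department and attrition.
# evaluation_results is ordered [total loss, attrition accuracy, department accuracy],
# following the order of the model outputs.
attrition_accuracy = evaluation_results[1]
department_accuracy = evaluation_results[2]
print(f"Attrition Accuracy: {attrition_accuracy}")
print(f"Department Accuracy: {department_accuracy}")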

Attrition Accuracy: 0.8605442047119141
Department Accuracy: 0.5680271983146667


## Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?



1. Accuracy is probably not the best metric for this data, especially for the Attrition prediction. The Attrition classes are likely imbalanced, with fewer employees leaving than staying. On imbalanced data, accuracy can be misleading because a model can score highly by always predicting the majority class. Precision, recall, or F1 on the "Yes" class would show more directly whether the model identifies the employees who leave.
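A minimal sketch of the point above, using hypothetical labels (84 employees who stay, 16 who leave; these counts are illustrative, not measured from the dataset). A predictor that always answers "stays" reaches high accuracy while finding none of the leavers.

```python
import numpy as np

# Hypothetical labels: 1 = employee left, 0 = employee stayed
y_true = np.array([0] * 84 + [1] * 16)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_true == y_pred))

# Recall on the positive class: share of leavers that were found
true_positives = int(np.sum((y_true == 1) & (y_pred == 1)))
recall = true_positives / int(np.sum(y_true == 1))

# Balanced accuracy: mean of the recall on each class
specificity = float(np.mean(y_pred[y_true == 0] == 0))
balanced_accuracy = (recall + specificity) / 2

print(accuracy, recall, balanced_accuracy)
```

Here accuracy is 0.84 while recall is 0 and balanced accuracy is 0.5, which is why the latter two are more informative on imbalanced targets.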






2. The Attrition output uses a sigmoid activation, which suits binary classification (employee leaves or stays) because it outputs a single probability between 0 and 1. The Department output uses a softmax activation, which suits multi-class classification (predicting the department) because it outputs one probability per class and the probabilities sum to 1.
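The two activations can be written out in a few lines of NumPy to show the properties described above. This is an illustrative sketch with made-up input values, not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability, then normalise
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

# Hypothetical pre-activation values (logits)
attrition_prob = sigmoid(0.0)
department_probs = softmax(np.array([2.0, 1.0, 0.1]))

print(attrition_prob)
print(department_probs, department_probs.sum())
```

A logit of 0 gives a sigmoid output of 0.5, and the three softmax outputs always sum to 1 with the largest logit receiving the largest probability.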





3. The model could be improved through hyperparameter tuning (layer sizes, learning rate, number of epochs) and regularization such as dropout to limit overfitting. Feature engineering and addressing the class imbalance in Attrition could also improve performance. Recurrent architectures such as LSTM or GRU would only be relevant if sequential data were available; the features used here are tabular.
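One common way to address class imbalance is to weight each class inversely to its frequency. The sketch below computes "balanced" weights with the formula `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The label counts are hypothetical, and how the weights are passed to a multi-output Keras model depends on the Keras version, so that step is left out.

```python
import numpy as np

# Hypothetical labels: 1 = employee left, 0 = employee stayed
y = np.array([0] * 84 + [1] * 16)

classes, counts = np.unique(y, return_counts=True)

# Balanced weighting: n_samples / (n_classes * class_count)
weights = len(y) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

The minority class receives the larger weight, so errors on employees who leave would count more during training.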