## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()


Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Determine the number of unique values in each column.
display(f"attrition_df.nunique's are: ")
print(attrition_df.nunique())
# I'm going to add some other dataframe measures
# info() prints its report directly and returns None, so call it on its own
print("dataframe info...")
attrition_df.info()
print("")
print("--------------------------------------")
print(f"attrition value counts... {attrition_df['Attrition'].value_counts()}")
print("")
print(f"department value counts... {attrition_df['Department'].value_counts()}")
print("")
print(attrition_df.columns)

"attrition_df.nunique's are: "

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 27 columns):
 #   Column                    Non-Null Cou

In [3]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]
print(f"y_df.head is \n{y_df.head()}")
print(f"y_df[Department] value counts are: \n{y_df['Department'].value_counts()}")

y_df.head is 
  Attrition              Department
0       Yes                   Sales
1        No  Research & Development
2       Yes  Research & Development
3        No  Research & Development
4        No  Research & Development
y_df[Department] value counts are: 
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64


In [4]:

# this is NOT part of main flow of the homework, but I'm going to use it for column selection for X_df
# The following was used to export data for "Pipey" analysis to pick the top features based on several other classifiers.
# Pipey is a Streamlit app that we created in AI project 2.  The output is attached for reference in the temp_data folder. The file is called "Pipey Analysis to Pick the most Important Features.pdf"

df_for_pipey = attrition_df.copy()
df_for_pipey = df_for_pipey.drop(columns=['Department'])
df_for_pipey["Attrition"] = df_for_pipey["Attrition"].map({"Yes": 1, "No": 0})

print("df_for_pipey head")
print(df_for_pipey.head())

df_for_pipey.to_csv('temp_data/df_for_pipey.csv', index=False)

df_for_pipey head
   Age  Attrition     BusinessTravel  DistanceFromHome  Education  \
0   41          1      Travel_Rarely                 1          2   
1   49          0  Travel_Frequently                 8          1   
2   37          1      Travel_Rarely                 2          2   
3   33          0  Travel_Frequently                 3          4   
4   27          0      Travel_Rarely                 2          1   

  EducationField  EnvironmentSatisfaction  HourlyRate  JobInvolvement  \
0  Life Sciences                        2          94               3   
1  Life Sciences                        3          61               2   
2          Other                        4          92               2   
3  Life Sciences                        4          56               3   
4        Medical                        1          40               3   

   JobLevel  ... PerformanceRating  RelationshipSatisfaction StockOptionLevel  \
0         2  ...                 3             

In [5]:
# Create a list of at least 10 column names to use as X data
X_columns = ['Age', 'EnvironmentSatisfaction', 'JobLevel', 'JobSatisfaction',
       'NumCompaniesWorked', 'OverTime', 'StockOptionLevel',
       'TotalWorkingYears', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager']

# these were selected based on feature importance from using "Pipey", which was our AI project #2. All encoding and scaling was conducted and several models were considered. 
# Pipey Analysis is the PDF file in the Neural_Network_Challege\temp_data folder. I selected 12 columns.

# Create X_df using your selected columns
X_df = attrition_df[X_columns]

# Show the data types for X_df

X_df.dtypes

Age                         int64
EnvironmentSatisfaction     int64
JobLevel                    int64
JobSatisfaction             int64
NumCompaniesWorked          int64
OverTime                   object
StockOptionLevel            int64
TotalWorkingYears           int64
YearsAtCompany              int64
YearsInCurrentRole          int64
YearsSinceLastPromotion     int64
YearsWithCurrManager        int64
dtype: object

In [6]:
# Split the data into training and testing sets
# (train_test_split was already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, random_state=42)

In [7]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
print("value counts for overtime")
display(X_train['OverTime'].value_counts())
# Work on explicit copies so the column assignments below cannot trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()
X_train["OverTime"] = X_train["OverTime"].map({"Yes": 1, "No": 0})
X_test["OverTime"] = X_test["OverTime"].map({"Yes": 1, "No": 0})

X_train.head()


value counts for overtime


OverTime
No     780
Yes    322
Name: count, dtype: int64

Unnamed: 0,Age,EnvironmentSatisfaction,JobLevel,JobSatisfaction,NumCompaniesWorked,OverTime,StockOptionLevel,TotalWorkingYears,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1343,29,4,1,1,3,0,0,11,3,2,1,2
1121,36,2,2,3,6,0,0,15,1,0,0,0
1048,34,4,2,1,3,0,0,15,13,9,3,12
1393,27,4,2,4,1,0,0,7,7,7,0,7
527,32,4,2,4,1,0,0,10,10,7,0,8


In [8]:
# Create a StandardScaler
scaler = StandardScaler()
# Fit the StandardScaler to the training data
scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
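
As a rough illustration of what the cell above does, the following sketch reproduces standard scaling by hand with NumPy on a small made-up array (the numbers are hypothetical, not taken from the attrition data). The mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# Hypothetical toy data: 3 training rows and 1 test row, 2 features each
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are learned from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.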

In [9]:
# Create a OneHotEncoder for the Department column
OHE = OneHotEncoder()

# Fit the encoder to the training data only
OHE.fit(y_train[['Department']])

# Create two new variables by applying the encoder
# to the training and testing data
department_encoded_train = OHE.transform(y_train[['Department']])
department_encoded_train_df = pd.DataFrame(department_encoded_train.toarray(), columns=OHE.get_feature_names_out())

# Use transform (not fit_transform) so the test data reuses the categories learned from training
department_encoded_test = OHE.transform(y_test[['Department']])
department_encoded_test_df = pd.DataFrame(department_encoded_test.toarray(), columns=OHE.get_feature_names_out())

display('for department train...')
display(department_encoded_train_df.head())
print(" -------------------------------------- ")
display('for department test...')
display(department_encoded_test_df.head())

'for department train...'

Unnamed: 0,Department_Human Resources,Department_Research & Development,Department_Sales
0,0.0,1.0,0.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0


 -------------------------------------- 


'for department test...'

Unnamed: 0,Department_Human Resources,Department_Research & Development,Department_Sales
0,0.0,0.0,1.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0
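
The encoder should learn its categories from the training data and then be reused on the test data. The sketch below, on a small hypothetical frame (not the attrition data), shows what can go wrong otherwise: if a category happens to be absent from the test split, an encoder refit on that split produces fewer columns than the model was trained on.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy splits; the test split has no "Human Resources" rows
train = pd.DataFrame({"Department": ["Sales", "Research & Development",
                                     "Human Resources", "Sales"]})
test = pd.DataFrame({"Department": ["Sales", "Research & Development"]})

# Fit once on the training data, then reuse the fitted encoder
encoder = OneHotEncoder()
encoder.fit(train)
test_reused = encoder.transform(test).toarray()

# Refitting on the test data learns only the categories present there
test_refit = OneHotEncoder().fit_transform(test).toarray()

print(test_reused.shape)  # (2, 3)
print(test_refit.shape)   # (2, 2)
```

In the notebook's data all three departments appear in both splits, so the column layout happens to match either way, but reusing the fitted encoder is the safer habit.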


In [10]:
# Create a OneHotEncoder for the Attrition column
# (a separate encoder, so the fitted Department encoder is not overwritten)
attrition_OHE = OneHotEncoder()

# Fit the encoder to the training data
attrition_encoded_train = attrition_OHE.fit_transform(y_train[['Attrition']])
attrition_encoded_train_df = pd.DataFrame(attrition_encoded_train.toarray(), columns=attrition_OHE.get_feature_names_out())

# Create two new variables by applying the encoder
# to the training and testing data
attrition_encoded_test = attrition_OHE.transform(y_test[['Attrition']])
attrition_encoded_test_df = pd.DataFrame(attrition_encoded_test.toarray(), columns=attrition_OHE.get_feature_names_out())

display('for Attrition train...')
display(attrition_encoded_train_df.head())
print(" -------------------------------------- ")
display('for Attrition test...')
display(attrition_encoded_test_df.head())

'for Attrition train...'

Unnamed: 0,Attrition_No,Attrition_Yes
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0


 -------------------------------------- 


'for Attrition test...'

Unnamed: 0,Attrition_No,Attrition_Yes
0,1.0,0.0
1,1.0,0.0
2,0.0,1.0
3,1.0,0.0
4,1.0,0.0


## Create, Compile, and Train the Model

In [11]:
# Find the number of columns in the X training data
X_train_num_columns = X_train.shape[1]
# Create the input layer
input_layer = layers.Input(shape=(X_train_num_columns,))
# Create at least two shared layers
second_layer = layers.Dense(64, activation='relu')(input_layer)
third_layer = layers.Dense(64, activation='relu')(second_layer)

In [12]:
# Create a branch for Department
# with a hidden layer and an output layer
# Create the hidden layer
forth_layer = layers.Dense(32, activation='relu')(third_layer)
# Create the output layer
department_output = layers.Dense(3, activation='softmax', name='department_output')(forth_layer)


In [13]:
# Create a branch for Attrition
# with a hidden layer and an output layer
# Create the hidden layer
forth_prime_layer = layers.Dense(32, activation='relu')(third_layer)

# Create the output layer
attrition_output = layers.Dense(2, activation='softmax', name='attrition_output')(forth_prime_layer)
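
A two-unit softmax output like the one above yields two probabilities that sum to one. The NumPy sketch below, using made-up logits, illustrates that property and also shows that the second softmax probability equals a sigmoid applied to the difference of the two logits, so for a binary target the two formulations carry the same information. The helper names `softmax` and `sigmoid` are defined here for illustration only.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for three samples and two classes (No, Yes)
logits = np.array([[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]])
probs = softmax(logits)

print(probs.sum(axis=1))                    # each row sums to 1
print(probs[:, 1])                          # P(Yes) from softmax
print(sigmoid(logits[:, 1] - logits[:, 0])) # same values via sigmoid
```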


In [14]:
# Create the model
model = Model(inputs=input_layer, outputs=[department_output, attrition_output])

# Compile the model
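model.compile(
    optimizer='adam',
    loss={'department_output': 'categorical_crossentropy',
          'attrition_output': 'categorical_crossentropy'},
    # One metric entry per named output; some newer Keras versions reject a single shared list for multi-output models
    metrics={'department_output': 'accuracy', 'attrition_output': 'accuracy'},
)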

# Summarize the model
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 12)]                 0         []                            
                                                                                                  
 dense (Dense)               (None, 64)                   832       ['input_1[0][0]']             
                                                                                                  
 dense_1 (Dense)             (None, 64)                   4160      ['dense[0][0]']               
                                                                                                  
 dense_2 (Dense)             (None, 32)                   2080      ['dense_1[0][0]']             
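
The parameter counts in the summary rows above can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

print(dense_params(12, 64))  # 832  (dense: 12 input features -> 64 units)
print(dense_params(64, 64))  # 4160 (dense_1: 64 -> 64 units)
print(dense_params(64, 32))  # 2080 (dense_2: 64 -> 32 units)
```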
                                                                                              

In [15]:
# Train the model
model.fit(X_train_scaled, {'department_output': department_encoded_train_df, 'attrition_output': attrition_encoded_train_df}, epochs=100, shuffle=True, verbose=2, batch_size=10)


Epoch 1/100
111/111 - 1s - loss: 1.3292 - department_output_loss: 0.8690 - attrition_output_loss: 0.4602 - department_output_accuracy: 0.6025 - attrition_output_accuracy: 0.8140 - 942ms/epoch - 8ms/step
Epoch 2/100
111/111 - 0s - loss: 1.1402 - department_output_loss: 0.7609 - attrition_output_loss: 0.3792 - department_output_accuracy: 0.6561 - attrition_output_accuracy: 0.8457 - 86ms/epoch - 778us/step
Epoch 3/100
111/111 - 0s - loss: 1.0933 - department_output_loss: 0.7381 - attrition_output_loss: 0.3552 - department_output_accuracy: 0.6606 - attrition_output_accuracy: 0.8603 - 88ms/epoch - 789us/step
Epoch 4/100
111/111 - 0s - loss: 1.0670 - department_output_loss: 0.7218 - attrition_output_loss: 0.3452 - department_output_accuracy: 0.6642 - attrition_output_accuracy: 0.8557 - 110ms/epoch - 993us/step
Epoch 5/100
111/111 - 0s - loss: 1.0446 - department_output_loss: 0.7092 - attrition_output_loss: 0.3354 - department_output_accuracy: 0.6633 - attrition_output_accuracy: 0.8639 - 94ms

<keras.src.callbacks.History at 0x1caf98268f0>

In [17]:
# Evaluate the model with the testing data
loss_accuracy_array = model.evaluate(X_test_scaled, {'department_output': department_encoded_test_df, 'attrition_output': attrition_encoded_test_df}, verbose=2)
loss_accuracy_array

12/12 - 0s - loss: 4.3793 - department_output_loss: 2.7891 - attrition_output_loss: 1.5902 - department_output_accuracy: 0.5679 - attrition_output_accuracy: 0.7853 - 29ms/epoch - 2ms/step


[4.379306793212891,
 2.789116859436035,
 1.5901895761489868,
 0.5679348111152649,
 0.7853260636329651]

In [19]:
# Print the accuracy for both department and attrition
print(f"Departmental Accuracy is: {round(loss_accuracy_array[3]*100,2)}")
print(f"Attrition Accuracy is: {round(loss_accuracy_array[4]*100,2)}")

Departmental Accuracy is: 56.79
Attrition Accuracy is: 78.53
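
Indexing the evaluation result by position (`[3]`, `[4]`) depends on remembering the metric order. One possible alternative, sketched below, pairs the values with names before reading them. The name list is an assumption copied from the order in the `evaluate` log line above, and the values are the ones printed by the notebook. Keras can also return a dictionary directly via `model.evaluate(..., return_dict=True)`.

```python
# Metric names, assumed to follow the order shown in the evaluate log line
metric_names = [
    "loss",
    "department_output_loss",
    "attrition_output_loss",
    "department_output_accuracy",
    "attrition_output_accuracy",
]

# Values printed by the notebook's evaluation cell
values = [
    4.379306793212891,
    2.789116859436035,
    1.5901895761489868,
    0.5679348111152649,
    0.7853260636329651,
]

# Look metrics up by name instead of by position
results = dict(zip(metric_names, values))

print(f"Departmental Accuracy is: {round(results['department_output_accuracy'] * 100, 2)}")
print(f"Attrition Accuracy is: {round(results['attrition_output_accuracy'] * 100, 2)}")
```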


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. Accuracy may not be the best metric, since the value counts for Attrition show that the data is imbalanced: there were 1233 'No' values and only 237 'Yes' values. Balanced accuracy might be a better metric, since it weights the accuracy on the positive (Yes) class equally with the accuracy on the negative class. The same applies to Department, where the Human Resources count (63) is small compared with both Research & Development (961) and Sales (446).
2. For both Attrition and Department, I selected softmax over sigmoid. For Department, the output has three mutually exclusive classes, so softmax is the clear choice: the probabilities for the three departments should sum to one, since we are picking exactly one of them. I chose softmax for Attrition as well. Since the instructions indicated that OneHotEncoding was to be used and that two output columns were to be generated, softmax seemed more correct. Also, we are choosing between two mutually exclusive categories, so this is a multiclass problem rather than a multilabel problem.
3. A few possibilities come to mind for potential improvements:
   A) Lower the number of neurons in the hidden layers. There is evidence of overfitting, since the training accuracy for Attrition is 99% but the test accuracy for Attrition was only 78.5%. Likewise, the training accuracy for Department is 97.5%, whereas the test accuracy for Department is 56.8%. These may be clear signs of overfitting.
   B) Decrease the number of layers in the model to help with the overfitting just described. The model seems fairly complex given the data. My analysis to pick the columns looked at six different classifiers (Pipey Analysis...). Models such as Logistic Regression and SVC are much simpler, and yet they performed as well as or better than this neural net.
   C) Experiment with different columns. While I used "feature importance" from other models to select the 12 columns, the instructions only said "select at least 10" and did not ask for any analysis to determine the most important features to include.
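
To make the point in answer 1 concrete, here is a minimal sketch comparing plain accuracy with balanced accuracy for a hypothetical baseline that always predicts 'No', using the class counts quoted above (1233 'No', 237 'Yes'). The helper `balanced_accuracy` is written by hand for illustration; scikit-learn's `balanced_accuracy_score` computes the same quantity.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Class counts quoted in answer 1: 1233 'No' (0) and 237 'Yes' (1)
y_true = np.array([0] * 1233 + [1] * 237)

# Hypothetical baseline that always predicts 'No'
y_pred = np.zeros_like(y_true)

plain_accuracy = float(np.mean(y_true == y_pred))
balanced = balanced_accuracy(y_true, y_pred)

print(round(plain_accuracy, 4))  # 0.8388
print(balanced)                  # 0.5
```

A model that never identifies a single departing employee still scores about 84% on plain accuracy, while balanced accuracy drops to chance level.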