# AWS Final Project

This project investigates how a company can retain its most valuable employees by analyzing what causes attrition at the company. Understanding the factors that drive people to leave makes it possible to introduce the necessary changes.

The dataset was taken from: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset




**Issue:**

We need to identify the key reasons why employees choose to leave the company they work for.

**Goal:**

Offer solutions for companies to promptly recognise these factors and implement requisite adjustments to address them.





In [1]:
# Install necessary libraries
!pip install --upgrade scikit-learn

!pip install --upgrade sagemaker

!pip install --upgrade s3fs

Collecting botocore<1.35.0,>=1.34.69 (from boto3<2.0,>=1.33.3->sagemaker)
  Using cached botocore-1.34.75-py3-none-any.whl.metadata (5.7 kB)
Using cached botocore-1.34.75-py3-none-any.whl (12.1 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.51
    Uninstalling botocore-1.34.51:
      Successfully uninstalled botocore-1.34.51
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.12.1 requires botocore<1.34.52,>=1.34.41, but you have botocore 1.34.75 which is incompatible.
awscli 1.32.69 requires botocore==1.34.69, but you have botocore 1.34.75 which is incompatible.
Successfully installed botocore-1.34.75
Collecting botocore<1.34.52,>=1.34.41 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Using cached botocore-1.34.51-py3-none-any.whl.metadata (5.7 kB)
Using cached b

## Preprocessing 

An S3 bucket named "awshrdataset" was created specifically for this final project, and the dataset to be processed was uploaded to it.

Below is the code used in the SageMaker notebook instance to access the bucket and read the dataset.

In [2]:
# Import necessary libraries
import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
import io
import s3fs
from sagemaker.amazon.amazon_estimator import get_image_uri

# Load the dataset from S3
bucket_name = 'awshrdataset'  
file_key = 'HR-Employee-Attrition.csv'  

# Get the execution role for the notebook instance (this provides necessary permissions)
role = get_execution_role()

# Create an S3 client
s3_client = boto3.client('s3')

# Get the object from S3
obj = s3_client.get_object(Bucket=bucket_name, Key=file_key)

# Read the object's body (a streaming, file-like object) into a pandas DataFrame
df = pd.read_csv(obj['Body'])


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Check if the data is loaded correctly:

In [3]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


Let's do a sanity check for NaN values:

In [4]:
# Count the NaN values across all columns
df.isnull().sum().sum()

0

Some columns are not needed for modelling, so we drop them to keep the model simpler:

In [5]:
# DROP UNNECESSARY COLUMNS
df = df.drop(columns=['EmployeeCount', 'EmployeeNumber', 'Over18', 'JobRole','StandardHours'])
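
In the preview above, EmployeeCount and StandardHours show the same value in every displayed row. A minimal sketch of how such columns could be detected before dropping them is shown below. The toy frame reuses values from the preview and is an assumption for illustration; whether these columns are constant across the full dataset is something the check itself would have to confirm.

```python
import pandas as pd

# Toy frame built from values visible in the preview (illustrative only).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "DistanceFromHome": [1, 8, 2],
})

# A column with a single distinct value cannot help separate the classes.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

Running the same list comprehension on the full DataFrame would give a data-driven list of candidate columns to drop.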

SageMaker's built-in XGBoost algorithm expects CSV training data to have the target variable in the first column and no header row. We therefore encode Attrition as 0 and 1 with a label encoder (LabelEncoder assigns codes in sorted order, so No becomes 0 and Yes becomes 1) and move the column to the front of the DataFrame.

The code below performs both steps:

In [6]:
# First column is the target, rest are features
from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object
le = LabelEncoder()

df['Attrition'] = le.fit_transform(df['Attrition'])
df = pd.concat([df['Attrition'], df.drop('Attrition', axis=1)], axis=1)

df.head()

Unnamed: 0,Attrition,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,1,41,Travel_Rarely,1102,Sales,1,2,Life Sciences,2,Female,...,3,1,0,8,0,1,6,4,0,5
1,0,49,Travel_Frequently,279,Research & Development,8,1,Life Sciences,3,Male,...,4,4,1,10,3,3,10,7,1,7
2,1,37,Travel_Rarely,1373,Research & Development,2,2,Other,4,Male,...,3,2,0,7,3,3,0,0,0,0
3,0,33,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,4,Female,...,3,3,0,8,3,3,8,7,3,0
4,0,27,Travel_Rarely,591,Research & Development,2,1,Medical,1,Male,...,3,4,1,6,3,3,2,2,2,2
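
At this point columns such as BusinessTravel, Department, EducationField and Gender still hold text values, while XGBoost works on numeric features. The notebook trains on the frame as it is; the sketch below shows one common option, one-hot encoding with `pd.get_dummies`, that could be applied before the split. It is not part of the original pipeline, and the toy frame only reuses a few values from the preview above.

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset (values from the preview).
toy = pd.DataFrame({
    "Attrition": [1, 0, 1, 0],
    "Age": [41, 49, 37, 33],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Travel_Rarely", "Travel_Frequently"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Every non-numeric column is replaced by 0/1 indicator columns.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

# The target stays in the first column and all columns are now numeric.
print(encoded.columns.tolist())
```

Passing `dtype=int` keeps the indicators as 0/1 rather than True/False, so the CSV written later contains only numbers.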


## Train, validation and test split for ML

Divide the DataFrame into train, validation and test splits. We fix the random state so that the outcome is reproducible. The training set accounts for 80% of the dataset, while the validation and test sets account for 10% each, so the whole dataset is used.

In [7]:
# TRAIN, VALIDATION, TEST SPLIT
from sklearn.model_selection import train_test_split
train, testval = train_test_split(df, train_size=0.8, random_state=1200)
val, test = train_test_split(testval, train_size=0.5, random_state=1200)

train.shape, val.shape, test.shape

((1176, 30), (147, 30), (147, 30))
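
The reported shapes follow directly from the two-step split: 80% of the rows go to training, and the remaining 20% is halved. The snippet below is only a sketch of that arithmetic (it does not reproduce scikit-learn's internals); the variable names are illustrative.

```python
# Total row count, recovered from the three reported split sizes.
total_rows = 1176 + 147 + 147

# First split: 80% of the rows for training.
train_rows = total_rows * 80 // 100
remaining_rows = total_rows - train_rows

# Second split: the remainder is shared equally by validation and test.
val_rows = remaining_rows // 2
test_rows = remaining_rows - val_rows

print(train_rows, val_rows, test_rows)
```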

To interface with SageMaker, we upload our data to S3. This is achieved by using .to_csv to convert the DataFrame to a CSV string held in a StringIO object, which is then directly uploaded to S3.

In [8]:
s3 = boto3.resource('s3')

def upload_to_s3(df, bucket, filename):
    """Upload a DataFrame to S3 as a CSV file without header or index."""
    buffer = io.StringIO()
    df.to_csv(buffer, header=False, index=False)
    s3_object = s3.Object(bucket, filename)
    s3_object.put(Body=buffer.getvalue())

Upload the sets to the bucket:

In [9]:
upload_to_s3(train, 'awshrdataset', 'train.csv')
upload_to_s3(val, 'awshrdataset', 'val.csv')
upload_to_s3(test, 'awshrdataset', 'test.csv')
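
To make the uploaded format concrete, the sketch below builds the same kind of header-less CSV string in memory, using the standard `csv` module as a stand-in for `DataFrame.to_csv`. The helper name `rows_to_csv_body` is hypothetical, and the rows contain only three illustrative columns (Attrition, Age, DailyRate) taken from the preview.

```python
import csv
import io


def rows_to_csv_body(rows):
    """Serialize rows to a header-less CSV string, target value first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Each row: (Attrition, Age, DailyRate); no header row and no index column.
rows = [
    (1, 41, 1102),
    (0, 49, 279),
]
body = rows_to_csv_body(rows)
print(body)
```

The string returned here plays the role of `placeholder.getvalue()` in the upload function: it is what ends up as the object body in S3.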

## Setting up the model

Now, let's configure our model setup.

We employ the Estimator class from the sagemaker.estimator module. This class establishes the runtime environment for training jobs of a model.

We define the following parameters:

1. Container Name: SageMaker runs training jobs in containers. Here, we reference a prebuilt container image that includes everything needed to run XGBoost.
2. Role Name: Similar to Lambda functions, training jobs require a role with appropriate permissions. We created this role earlier, when setting up the notebook instance.
3. Number of Training Instances: In this case, we use one instance, although larger tasks may require scaling out to multiple instances.
4. Instance Type: We opt for an instance type from the SageMaker paid tier.
5. Hyperparameters: We set the objective to binary:logistic because predicting the Attrition column is a binary classification task, and we train for 20 boosting rounds (num_round).
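
With the binary:logistic objective, XGBoost passes the raw score of the boosted trees through the logistic function, so each prediction is a probability between 0 and 1. A minimal sketch of that mapping in plain Python follows; the function names and the 0.5 decision threshold are assumptions made for illustration.

```python
import math


def logistic(margin):
    """Map a raw score (margin) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-margin))


def predict_label(margin, threshold=0.5):
    """Return 1 (attrition) if the probability exceeds the threshold, else 0."""
    return 1 if logistic(margin) > threshold else 0


for margin in (-2.0, 0.0, 2.0):
    print(margin, round(logistic(margin), 3), predict_label(margin))
```

A positive margin yields a probability above 0.5 and a negative margin one below it, which is why thresholding at 0.5 is the usual default.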



In [10]:
role = sagemaker.get_execution_role()
region_name = boto3.Session().region_name
container = sagemaker.image_uris.retrieve('xgboost', region_name, version='0.90-1')
output_location = 's3://awshrdataset/SageMakerOutput/'

hyperparams = {
    'num_round': '20',
    'objective': 'binary:logistic' 
}

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=output_location,
    hyperparameters=hyperparams,
    sagemaker_session=sagemaker.Session()
)

In [11]:
from sagemaker.inputs import TrainingInput

train_channel = TrainingInput(
    's3://awshrdataset/train.csv',
    content_type='text/csv'
)
val_channel = TrainingInput(
    's3://awshrdataset/val.csv',
    content_type='text/csv'
)
channels_for_training = {
    'train': train_channel,
    'validation': val_channel
}

estimator.fit(inputs=channels_for_training, logs=False)

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-04-01-21-58-01-694



2024-04-01 21:58:02 Starting - Starting the training job.....
2024-04-01 21:58:37 Starting - Preparing the instances for training..........
2024-04-01 21:59:29 Downloading - Downloading input data....
2024-04-01 21:59:54 Downloading - Downloading the training image......
2024-04-01 22:00:29 Training - Training image download completed. Training in progress....
2024-04-01 22:00:51 Uploading - Uploading generated training model..
2024-04-01 22:01:07 Completed - Training job completed


As we can see above, the training job completed successfully. It is also listed under Training Jobs in the SageMaker console.

In [12]:
# Let's see the name of the training job
estimator.latest_training_job.job_name

'sagemaker-xgboost-2024-04-01-21-58-01-694'

In [13]:
# Get the metrics
metrics = sagemaker.analytics.TrainingJobAnalytics(
    estimator.latest_training_job.job_name
)

# Display the metrics
metrics = metrics.dataframe()
print(metrics)



   timestamp       metric_name     value
0        0.0       train:error  0.012755
1        0.0  validation:error  0.156463


Because it is a classification task, we only have the train and validation errors.

**Low Training Error (1.28%):** The model misclassifies very few of the training examples it has already seen.

**Higher Validation Error (15.65%):** The model is markedly less accurate on new, unseen data. A gap this large between training and validation error is a typical sign of overfitting.

**Important to Monitor:** We need to watch how both errors change over the boosting rounds.

**Potential Adjustments Needed:** If the gap persists or widens, we might have to tweak the hyperparameters or the training approach.
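
The error metric reported above is the share of misclassified examples, where a predicted probability above 0.5 counts as the positive class. The sketch below computes it in plain Python and compares it with the error of always predicting the majority class, which helps put the number into context. The labels and probabilities are made up for illustration, and both function names are hypothetical.

```python
def classification_error(labels, probabilities, threshold=0.5):
    """Share of examples whose thresholded prediction differs from the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    wrong = sum(1 for y, y_hat in zip(labels, predictions) if y != y_hat)
    return wrong / len(labels)


def majority_baseline_error(labels):
    """Error obtained by always predicting the most frequent class."""
    positive_rate = sum(labels) / len(labels)
    return min(positive_rate, 1 - positive_rate)


# Made-up example: 8 employees, 2 of whom left (label 1).
labels = [1, 0, 0, 0, 1, 0, 0, 0]
probabilities = [0.9, 0.2, 0.6, 0.1, 0.7, 0.3, 0.05, 0.2]

error = classification_error(labels, probabilities)
baseline = majority_baseline_error(labels)
print(error, baseline)
```

On an imbalanced target, a model is only useful if its validation error is clearly below this baseline.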

Now we deploy the model to an endpoint so that it is ready to serve predictions:

In [14]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', serializer=sagemaker.serializers.CSVSerializer())

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-04-01-22-01-10-619
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2024-04-01-22-01-10-619
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2024-04-01-22-01-10-619


------!
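
Because the predictor is created with a CSV serializer, each request is sent as a comma-separated row of feature values, in the same column order as the training data and without the target. The sketch below shows how such a payload could be built and how a reply could be turned into a class label. It makes no call to the endpoint: the helper names are hypothetical, the feature values are illustrative, and the response is assumed to arrive as bytes holding a single probability.

```python
def to_csv_payload(features):
    """Join feature values into one CSV row (no target, no header)."""
    return ",".join(str(value) for value in features)


def parse_probability(response_body):
    """Decode a response assumed to contain a single probability as text."""
    return float(response_body.decode("utf-8").strip())


# Illustrative feature values only; a real request needs every feature column.
payload = to_csv_payload([41, 1102, 1, 2])

# Assumed shape of the reply for one record.
hypothetical_response = b"0.73"
probability = parse_probability(hypothetical_response)
label = 1 if probability > 0.5 else 0
print(payload, probability, label)
```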