# Problem Introduction

# What is the Problem
The problem is to develop a predictive model that estimates the likelihood of task completion among system administrators. By accurately predicting when tasks will be completed and, more importantly, when they are likely not to be completed, the HR department can make informed decisions to optimize resource allocation, improve team assignments, and ensure tasks are executed more efficiently.

# Why is it important
Predicting task completion is crucial for improving organizational productivity and resource management. Knowing in advance which tasks are likely to be completed and which are at risk of non-completion helps HR and management:

- Identify bottlenecks in workflow.
- Optimize workforce allocation and team structures.
- Minimize wasted resources and increase productivity.
- Improve employee satisfaction by aligning workload with individual or team capabilities.
- Provide better performance management and training support based on predictive insights.

# Who are the Key Stakeholders?
- HR Department: The primary beneficiary, as they will use the insights to optimize team assignments, manage workloads, and address issues affecting task completion.
- System Administrators: Their work patterns and outcomes are being analyzed, so they are directly impacted by the recommendations for improving task efficiency and resource allocation.
- Team Managers/Supervisors: They need insights to better manage their teams, redistribute tasks, and identify areas where employees may need additional support or training.
- Senior Leadership: They are invested in overall organizational productivity and efficiency, and any initiative that improves resource allocation can have a positive impact on the bottom line.
- Data Scientists/Analysts: Responsible for creating, evaluating, and maintaining the predictive model to ensure it provides actionable insights and continues to perform well over time.

In [1]:
#importing the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
import numpy as np

In [3]:
path = "C:\\Users\\Gahan\\Downloads\\SystemAdministrators.csv"

# Data Collection
We load the data and then display the first five rows.

In [4]:
#loading data
df = pd.read_csv(path)

In [5]:
df.head(5)

Unnamed: 0,task_completed,employee_experience,training_level4,training_level6,training_level8
0,1,10.9,1,0,0
1,1,9.9,1,0,0
2,1,10.4,0,1,0
3,1,13.7,0,1,0
4,1,9.4,0,0,1


# Data Exploration and Understanding
We display the column types and the statistical summary of the data.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   task_completed       75 non-null     int64  
 1   employee_experience  75 non-null     float64
 2   training_level4      75 non-null     int64  
 3   training_level6      75 non-null     int64  
 4   training_level8      75 non-null     int64  
dtypes: float64(1), int64(4)
memory usage: 3.1 KB


In [7]:
df.describe()

Unnamed: 0,task_completed,employee_experience,training_level4,training_level6,training_level8
count,75.0,75.0,75.0,75.0,75.0
mean,0.2,6.8,0.76,0.173333,0.066667
std,0.402694,2.273645,0.429959,0.381084,0.251124
min,0.0,2.7,0.0,0.0,0.0
25%,0.0,5.2,1.0,0.0,0.0
50%,0.0,6.3,1.0,0.0,0.0
75%,0.0,7.85,1.0,0.0,0.0
max,1.0,13.7,1.0,1.0,1.0
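The mean of `task_completed` in the summary is 0.2, so only 15 of the 75 rows are positive. The sketch below, which rebuilds a label column with that same balance (a stand-in, not the real file), shows how to check the class shares and the accuracy a model would get by always predicting the majority class. It is meant only as a reference point for reading the accuracy figures later on.

```python
import pandas as pd

# Stand-in labels with the balance reported by df.describe():
# 75 rows, mean 0.2 -> 15 completed (1) and 60 not completed (0).
y_all = pd.Series([1] * 15 + [0] * 60, name="task_completed")

# Share of each class in the data
class_share = y_all.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_share.max()

print(class_share)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

With the real data, the same two lines can be run on `df['task_completed']` directly.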


# Data Cleaning or Preprocessing
This is not required since the data is already preprocessed.

# Feature Selection
Since it is a small dataset with few features, we don't need to perform this step.

In [9]:
# checking for missing or null values
df.isnull().sum()

task_completed         0
employee_experience    0
training_level4        0
training_level6        0
training_level8        0
dtype: int64

# Modelling
# Data splitting
Here we separate the target variable from the features and then perform the train-test split in order to train the model.

In [10]:
X = df.drop('task_completed', axis=1)
y = df['task_completed']


In [11]:
X.head(5)

Unnamed: 0,employee_experience,training_level4,training_level6,training_level8
0,10.9,1,0,0
1,9.9,1,0,0
2,10.4,0,1,0
3,13.7,0,1,0
4,9.4,0,0,1


In [12]:
y.head(5)

0    1
1    1
2    1
3    1
4    1
Name: task_completed, dtype: int64

Here we split the dataset into training and test sets: 80 percent of the data is used for training and 20 percent for testing.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
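With only 75 rows and 20% positives, a purely random split can leave the test set with a different class balance than the full data (the SVM table further down shows 5 positives among the 15 test rows, about 33%). One possible refinement, sketched here on synthetic stand-in data rather than the real file, is to pass `stratify` to `train_test_split` so both partitions keep the overall positive rate. The names `demo`, `X_demo`, and `y_demo` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 75 rows, 15 positives (20%), experience in the observed range
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "employee_experience": rng.uniform(2.7, 13.7, size=75).round(1),
    "task_completed": [1] * 15 + [0] * 60,
})
X_demo = demo.drop("task_completed", axis=1)
y_demo = demo["task_completed"]

# stratify=y_demo keeps the 20% positive rate in both the train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Using a stratified split would change which rows land in the test set, so the tables in this notebook would need to be regenerated.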

# Classification Modelling
Here we train two different classification models, check their performance measures, and then declare the winning model based on the probability threshold.

# SVM Model

Here we build the list of C values as powers of 10, from 10^-4 to 10^4.

In [14]:
C_values = [10**i for i in range(-4, 5)]

Training the SVM model with each C value and finding the best C value.

In [20]:
# Training and evaluating SVM models for each C value
# (accuracy is measured on the test set; ties keep the smallest C)
best_C = None
best_accuracy = 0
svm_results = []

for C in C_values:
    model = SVC(C=C, kernel='linear', probability=True, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    svm_results.append((C, accuracy))
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C
print("best C value for svm is : ", best_C)

best C value for svm is :  1
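The loop above picks C by accuracy on the test set, which means the test set also influences model selection. A common alternative, sketched here under the assumption that cross-validation on the training data is acceptable for this project, is `GridSearchCV` with stratified folds. The data below is synthetic and the names `X_syn`, `y_syn`, and `search` are illustrative, so the selected C is not comparable to the result above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for a training set: 48 negatives, 12 positives,
# one feature where positives tend to have higher values.
rng = np.random.default_rng(0)
X_syn = np.concatenate([
    rng.normal(6.0, 1.5, size=(48, 1)),
    rng.normal(10.0, 1.5, size=(12, 1)),
])
y_syn = np.array([0] * 48 + [1] * 12)

# Same grid of C values as in the notebook: 10^-4 ... 10^4
param_grid = {"C": [10**i for i in range(-4, 5)]}

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=cv, scoring="accuracy")
search.fit(X_syn, y_syn)

print("Best C by cross-validation:", search.best_params_["C"])
print("Mean CV accuracy:", search.best_score_)
```

With this approach the test set is touched only once, for the final evaluation of the chosen model.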


In [19]:
# Training the final SVM model using the best C value
final_svm_model = SVC(C=best_C, kernel='linear', probability=True, random_state=42)
final_svm_model.fit(X_train, y_train)
y_proba = final_svm_model.predict_proba(X_test)[:, 1]

Creating the performance measure table over a range of possible threshold values.

In [21]:
# Defining the thresholds and evaluating the SVM model performance
thresholds = np.arange(0.0, 1.1, 0.1)
performance_measures = []

for threshold in thresholds:
    y_pred_threshold = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_threshold, average='binary', zero_division=0)
    accuracy = accuracy_score(y_test, y_pred_threshold)
    performance_measures.append((threshold, tn, tp, fn, fp, precision, recall, f1, accuracy))

performance_df = pd.DataFrame(performance_measures, columns=['Threshold', 'TN', 'TP', 'FN', 'FP', 'Precision', 'Recall', 'F1', 'Accuracy'])


In [24]:
print("Performance Measure Table for svm model is :\n ", performance_df)

Performance Measure Table for svm model is :
      Threshold  TN  TP  FN  FP  Precision  Recall        F1  Accuracy
0         0.0   0   5   0  10   0.333333     1.0  0.500000  0.333333
1         0.1   7   5   0   3   0.625000     1.0  0.769231  0.800000
2         0.2   9   5   0   1   0.833333     1.0  0.909091  0.933333
3         0.3  10   4   1   0   1.000000     0.8  0.888889  0.933333
4         0.4  10   3   2   0   1.000000     0.6  0.750000  0.866667
5         0.5  10   3   2   0   1.000000     0.6  0.750000  0.866667
6         0.6  10   2   3   0   1.000000     0.4  0.571429  0.800000
7         0.7  10   0   5   0   0.000000     0.0  0.000000  0.666667
8         0.8  10   0   5   0   0.000000     0.0  0.000000  0.666667
9         0.9  10   0   5   0   0.000000     0.0  0.000000  0.666667
10        1.0  10   0   5   0   0.000000     0.0  0.000000  0.666667
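The same threshold sweep is repeated for the logistic model, so it may be convenient to wrap it in a helper. The sketch below is one way to do that; `threshold_table` is a hypothetical name, and the five-row input is toy data. Two small safeguards are added as assumptions about what is wanted: `labels=[0, 1]` so the confusion matrix is always 2x2, and rounding the thresholds, because `np.arange(0.0, 1.1, 0.1)` produces values such as 0.30000000000000004 that would exclude a probability of exactly 0.3.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)


def threshold_table(y_true, y_score, thresholds):
    """Build one row of performance measures per probability threshold."""
    rows = []
    for t in thresholds:
        # Predict positive when the score reaches the threshold
        y_hat = (np.asarray(y_score) >= t).astype(int)
        # labels=[0, 1] forces a 2x2 matrix even if one class is never predicted
        tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_hat, average="binary", zero_division=0
        )
        rows.append({
            "Threshold": t, "TN": tn, "TP": tp, "FN": fn, "FP": fp,
            "Precision": precision, "Recall": recall, "F1": f1,
            "Accuracy": accuracy_score(y_true, y_hat),
        })
    return pd.DataFrame(rows)


# Toy labels and scores, for illustration only
y_true_demo = [0, 0, 0, 1, 1]
y_score_demo = [0.05, 0.20, 0.45, 0.40, 0.90]
demo_thresholds = np.round(np.arange(0.0, 1.1, 0.1), 1)

table = threshold_table(y_true_demo, y_score_demo, demo_thresholds)
print(table)
```

In the notebook, the helper could be called once with `y_proba` and once with the logistic model's probabilities.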


# Logistic Model
Training a logistic model.

In [25]:
# Training a Logistic Regression model
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)
y_proba_logistic = logistic_model.predict_proba(X_test)[:, 1]

Creating the performance measure table over a range of possible threshold values.

In [26]:
# Evaluating the Logistic Regression model performance over thresholds
logistic_performance_measures = []

for threshold in thresholds:
    y_pred_threshold_logistic = (y_proba_logistic >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold_logistic).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_threshold_logistic, average='binary', zero_division=0)
    accuracy = accuracy_score(y_test, y_pred_threshold_logistic)
    logistic_performance_measures.append((threshold, tn, tp, fn, fp, precision, recall, f1, accuracy))

logistic_performance_df = pd.DataFrame(logistic_performance_measures, columns=['Threshold', 'TN', 'TP', 'FN', 'FP', 'Precision', 'Recall', 'F1', 'Accuracy'])

In [28]:
# Printing the results
print("Best C for SVM:", best_C)
print("\nSVM Performance Table:\n", performance_df)
print("\nLogistic Regression Performance Table:\n", logistic_performance_df)

Best C for SVM: 1

SVM Performance Table:
     Threshold  TN  TP  FN  FP  Precision  Recall        F1  Accuracy
0         0.0   0   5   0  10   0.333333     1.0  0.500000  0.333333
1         0.1   7   5   0   3   0.625000     1.0  0.769231  0.800000
2         0.2   9   5   0   1   0.833333     1.0  0.909091  0.933333
3         0.3  10   4   1   0   1.000000     0.8  0.888889  0.933333
4         0.4  10   3   2   0   1.000000     0.6  0.750000  0.866667
5         0.5  10   3   2   0   1.000000     0.6  0.750000  0.866667
6         0.6  10   2   3   0   1.000000     0.4  0.571429  0.800000
7         0.7  10   0   5   0   0.000000     0.0  0.000000  0.666667
8         0.8  10   0   5   0   0.000000     0.0  0.000000  0.666667
9         0.9  10   0   5   0   0.000000     0.0  0.000000  0.666667
10        1.0  10   0   5   0   0.000000     0.0  0.000000  0.666667

Logistic Regression Performance Table:
     Threshold  TN  TP  FN  FP  Precision  Recall        F1  Accuracy
0         0.0   0  

# Picking a winning Model
At Threshold 0.3, Logistic Regression performs perfectly with Precision = 1.0, Recall = 1.0, F1 = 1.0, Accuracy = 1.0, while the SVM's recall drops to 0.8.

Therefore, based on the above results, Logistic Regression is the better choice, since it consistently achieves high performance metrics at lower thresholds and demonstrates perfect performance at a threshold of 0.3.

# Evaluation at a Low Probability Threshold 
I am choosing a low threshold of 0.3.

# How many False Positives? 
0 false positives. 
   
# What do these numbers represent?  
It means there are no instances where the model predicted a positive outcome (task completed) for a case that was actually negative. This minimizes unnecessary actions, reducing wasted resources and operational costs.

# What are the potential costs to the business if we were to make these mistakes in practice?
No such costs are incurred at this threshold. If false positives did occur, resources would be spent on cases that are not genuinely positive, increasing operational inefficiency.


# How many False Negatives? 
0 False Negatives.

# What do these numbers represent? 
It means there are no cases where a true positive was missed. This ensures that all relevant cases are identified and handled correctly, preventing missed opportunities.

# What are the potential costs to the business if we were to make these mistakes in practice?
No such costs are incurred at this threshold, which keeps reliability high. If false negatives did occur, genuinely positive cases would be missed.

# Which prediction mistakes do you consider to be more costly?
- False Positives: Lead to wasted resources and potentially unnecessary actions.
- False Negatives: Represent missed opportunities, potentially leading to lost revenue, lower efficiency, or other missed critical tasks.

From the above, we can conclude that false negatives may be more costly if the missed cases are critical to business success, such as failing to complete necessary tasks or losing important opportunities.
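If one error type is considered more costly than the other, that judgment can be made explicit by weighting the FP and FN counts and comparing the total cost per threshold. The sketch below does this with the FP and FN counts copied from the SVM performance table above. The cost weights `COST_FP` and `COST_FN` are hypothetical placeholders, not figures from the business, and would need to be replaced with real estimates.

```python
import pandas as pd

# FP and FN counts per threshold, copied from the SVM performance table
svm_counts = pd.DataFrame({
    "Threshold": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "FP": [10, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "FN": [0, 0, 0, 1, 2, 2, 3, 5, 5, 5, 5],
})

# Hypothetical relative costs: a false negative is assumed to be
# three times as costly as a false positive.
COST_FP = 1.0
COST_FN = 3.0

# Total weighted cost of the mistakes made at each threshold
svm_counts["Cost"] = COST_FP * svm_counts["FP"] + COST_FN * svm_counts["FN"]

# Threshold with the lowest total cost under these assumed weights
best_row = svm_counts.loc[svm_counts["Cost"].idxmin()]
print(svm_counts)
print("Lowest-cost threshold:", best_row["Threshold"])
```

The same calculation can be applied to the logistic regression table; changing the weights shows how sensitive the preferred threshold is to the assumed costs.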

# Evaluation at Two Other Probability Thresholds
# At threshold 0.1

# How many False Positives? 
2 false positives

# What do these numbers represent?  
There are two instances where the model incorrectly predicted a positive outcome.

# What are the potential costs to the business if we were to make these mistakes in practice?
The two misclassified positives could lead to minor wasted resources or attention directed at non-relevant cases.

# How many False Negatives?
0 false negatives.

# What do these numbers represent?
There are no missed opportunities.

# What are the potential costs to the business if we were to make these mistakes in practice?
With no false negatives, all relevant tasks are identified. If false negatives did occur, they could lead to lost revenue, lower efficiency, or missed critical tasks.

# At threshold 0.4

# How many False Positives? 
0 false positives

# What do these numbers represent?  
There are no instances where the model incorrectly predicted a positive outcome.

# What are the potential costs to the business if we were to make these mistakes in practice?
Efficiency is maintained at this threshold. If false positives did occur, they could lead to minor wasted resources or attention directed at non-relevant cases.

# How many False Negatives?
1 false negative.

# What do these numbers represent?
There is one missed opportunity.

# What are the potential costs to the business if we were to make these mistakes in practice?
The missed case is a lost opportunity, which could lead to lost revenue, lower efficiency, or missed critical tasks.

# Based on your careful consideration of probability threshold options and the corresponding speculated risks/costs - which probability threshold do you recommend going forward with?
Based on the evaluation, a threshold of 0.3 for Logistic Regression is recommended, because it achieves perfect performance on the test set (no false positives or false negatives), ensuring maximum efficiency and accuracy for the business context without missed opportunities or wasted resources.