# **Model - Stress Prediction**

## Stage 1: Retrieving and loading the dataset

As a first step, we load the dataset and get an initial overview of it using the DataFrame's shape and columns attributes and its info() method.

In [1]:
import pandas as pd
import numpy as np

file_path = 'AI_Developer_Performance.csv'
data = pd.read_csv(file_path)
print(data.head())

   Hours_Coding  Lines_of_Code  Bugs_Found  Bugs_Fixed  AI_Usage_Hours  \
0             7            416           9           7               6   
1             4            269          16          13               5   
2            11            439           3           0               2   
3             8            472          15           9               4   
4             5            265          19          16               5   

   Sleep_Hours  Cognitive_Load  Task_Success_Rate  Coffee_Intake  \
0          5.9              92                 34              7   
1          5.1              85                 36              2   
2          6.2              38                 79              2   
3          4.2              26                 94              5   
4          8.1              82                 33              6   

   Stress_Level  Task_Duration_Hours  Commits  Errors  
0            99                 10.5       20       3  
1           100                  9

In [2]:
print(f"Shape of the data (rows, columns): {data.shape}")
print(f"Columns in data: {list(data.columns)}\n")

data.info()


Shape of the data (rows, columns): (1000, 13)
Columns in data: ['Hours_Coding', 'Lines_of_Code', 'Bugs_Found', 'Bugs_Fixed', 'AI_Usage_Hours', 'Sleep_Hours', 'Cognitive_Load', 'Task_Success_Rate', 'Coffee_Intake', 'Stress_Level', 'Task_Duration_Hours', 'Commits', 'Errors']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Hours_Coding         1000 non-null   int64  
 1   Lines_of_Code        1000 non-null   int64  
 2   Bugs_Found           1000 non-null   int64  
 3   Bugs_Fixed           1000 non-null   int64  
 4   AI_Usage_Hours       1000 non-null   int64  
 5   Sleep_Hours          1000 non-null   float64
 6   Cognitive_Load       1000 non-null   int64  
 7   Task_Success_Rate    1000 non-null   int64  
 8   Coffee_Intake        1000 non-null   int64  
 9   Stress_Level         1000 non-null   int64  
 10  Task_Duration_

#### Initial Data Quality Assessment

- All variables are numerical, which makes the dataset suitable for
  regression modeling without categorical encoding.
- No missing values were observed at this stage.
- The dataset is well-structured and ready for further preprocessing steps.

This confirms that the data is accessible, complete, and appropriate for
stress level prediction using machine learning techniques.

## Stage 2: Data Preparation
In this section, we prepare the dataset for stress level prediction.
This includes:
1. Data Cleansing: checking for duplicates, missing values, and invalid values.
2. Data Transformation: feature engineering and scaling.

### 2.1 Data Cleansing

In [None]:
duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    # Reassign to data so the de-duplicated frame is the one used downstream
    data = data.drop_duplicates().reset_index(drop=True)
    print("Duplicates removed.")
else:
    print("No duplicates found.")

In [None]:
missing_values = data.isnull().sum()
print(f'Number of missing values in each column:\n{missing_values}')

No missing values were found in the dataset, indicating completeness
and suitability for modeling.


In [None]:
invalid_coding_hours = data[(data['Hours_Coding'] <= 0) | (data['Hours_Coding'] >= 24)].shape[0]
invalid_lines = data[data['Lines_of_Code'] < 0].shape[0]
invalid_bugs = data[data['Bugs_Found'] < 0].shape[0]
invalid_bugs_fixed = data[data['Bugs_Fixed'] < 0].shape[0]
invalid_sleep_hours = data[(data['Sleep_Hours'] <= 0) | (data['Sleep_Hours'] >= 24)].shape[0]
invalid_AI_hours = data[(data['AI_Usage_Hours'] < 0) | (data['AI_Usage_Hours'] > 24)].shape[0]
invalid_coffee = data[data['Coffee_Intake'] < 0].shape[0]
invalid_task_hours = data[(data['Task_Duration_Hours'] <= 0)].shape[0]

print(f"Invalid Coding Hours: {invalid_coding_hours}")
print(f"Invalid Lines of Code: {invalid_lines}")
print(f"Invalid Bugs Found: {invalid_bugs}")
print(f"Invalid Bugs Fixed: {invalid_bugs_fixed}")
print(f"Invalid Sleep Hours: {invalid_sleep_hours}")
print(f"Invalid AI Usage Hours: {invalid_AI_hours}")
print(f"Invalid Coffee Intake: {invalid_coffee}")
print(f"Invalid Task Duration Hours: {invalid_task_hours}")


All values are within reasonable ranges.
This confirms the dataset is clean and ready for feature engineering.
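
The range checks above can also be driven from a table of bounds, which keeps the rules in one place. The following is a minimal sketch under stated assumptions: the helper name `count_out_of_range`, the `VALID_RANGES` table, and the inclusive bounds are illustrative choices, and the small frame is made up rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical bounds table: column -> (lower, upper), both inclusive.
# None means the column has no bound on that side.
VALID_RANGES = {
    "Hours_Coding": (0, 24),
    "Sleep_Hours": (0, 24),
    "Bugs_Found": (0, None),
}

def count_out_of_range(df, ranges):
    """Return a Series counting, per column, the rows outside the bounds."""
    counts = {}
    for column, (lower, upper) in ranges.items():
        mask = pd.Series(False, index=df.index)
        if lower is not None:
            mask |= df[column] < lower
        if upper is not None:
            mask |= df[column] > upper
        counts[column] = int(mask.sum())
    return pd.Series(counts, name="invalid_rows")

# Tiny made-up frame, not the real dataset.
sample = pd.DataFrame({
    "Hours_Coding": [7, 25, 4],
    "Sleep_Hours": [5.9, 6.2, -1.0],
    "Bugs_Found": [9, 3, 16],
})
invalid_counts = count_out_of_range(sample, VALID_RANGES)
print(invalid_counts)
```

Adding a new rule then only requires a new entry in the bounds table.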


### 2.2 Feature Engineering

We create new features that may improve model performance.


In [None]:
data['Bug_Fix_Ratio'] = np.where(data['Bugs_Found'] > 0, data['Bugs_Fixed'] / data['Bugs_Found'], 0)
data['Bug_Fix_Ratio'] = data['Bug_Fix_Ratio'].round(2)
print(data[['Bugs_Fixed','Bugs_Found','Bug_Fix_Ratio']].head())

The Bug_Fix_Ratio measures how efficiently a developer fixes the bugs 
they encounter. A higher value indicates better debugging efficiency.
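
As an illustration of the ratio defined above, here is an equivalent way to compute it that masks zero denominators before dividing. This is only a sketch on made-up rows; the frame name `bugs` is hypothetical, and the fallback value of 0 follows the convention used in the cell above.

```python
import numpy as np
import pandas as pd

# Made-up rows, including one where no bugs were found.
bugs = pd.DataFrame({"Bugs_Found": [10, 0, 4], "Bugs_Fixed": [7, 0, 4]})

# Replace zero denominators with NaN so the division is never 0/0 or x/0,
# then fall back to 0 for those rows.
denominator = bugs["Bugs_Found"].replace(0, np.nan)
bugs["Bug_Fix_Ratio"] = (bugs["Bugs_Fixed"] / denominator).fillna(0).round(2)
print(bugs)
```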


### 2.3 Selecting Features and Target 


In [None]:
features = ['Hours_Coding', 'Lines_of_Code', 'Bugs_Found', 'Bugs_Fixed','AI_Usage_Hours', 'Sleep_Hours', 
            'Cognitive_Load','Coffee_Intake', 'Task_Duration_Hours', 'Commits','Errors', 'Bug_Fix_Ratio']

target_stress = 'Stress_Level'

X = data[features]
Y = data[target_stress]

print(X.head())

### 2.4 Feature Scaling

Since the input features have different scales, we standardize them 
using StandardScaler (zero mean, unit variance). For ordinary least squares 
this does not change the predictions, but it makes the coefficients directly 
comparable across features and helps algorithms that are sensitive to scale.


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=features)
print(X_scaled.head())

Y_DF = pd.DataFrame(Y)

y_scaler = StandardScaler()
Y_scaled = y_scaler.fit_transform(Y_DF)

Y_scaled = pd.DataFrame(Y_scaled, columns=[target_stress])
print(Y_scaled.head())
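
To make the standardization step concrete, the sketch below computes z-scores by hand for one made-up feature. It uses the population standard deviation (ddof=0), which is what StandardScaler uses by default; the values and variable names are illustrative only.

```python
import numpy as np

# Made-up values for a single feature, e.g. hours of sleep.
values = np.array([4.0, 5.0, 6.0, 7.0, 8.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)
z_scores = (values - mean) / std

print(z_scores)
print(z_scores.mean(), z_scores.std())
```

After the transformation the feature has mean 0 and standard deviation 1, up to floating-point error.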



### Summary of Data Preparation

- Duplicate rows and missing values were checked; none required removal.
- Feature engineering created Bug_Fix_Ratio to capture debugging efficiency.
- Input features and target variable were separated for modeling.
- Features were standardized so that their regression coefficients are on a comparable scale.

The dataset is now clean, transformed, and ready for exploratory data analysis.


## Stage 3: Exploratory Data Analysis (EDA)

### 3.1 Descriptive Statistics

We begin by summarizing the central tendency, dispersion, and range 
of all features and the target variable Stress_Level. 
Descriptive statistics help identify potential outliers and understand 
overall variability in developer behavior.


In [None]:
desc_stats = data[features + [target_stress]].describe().T
desc_stats['variance'] = data[features + [target_stress]].var()
desc_stats['skewness'] = data[features + [target_stress]].skew()

desc_stats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

combined_scaled = pd.concat([X_scaled, Y_scaled], axis=1)

plt.figure(figsize=(10,6))
sns.boxplot(data=combined_scaled, orient='h')
plt.title('Boxplot of Features and Stress Level')
plt.show()



The boxplot shows the standardized distributions of all features and of Stress_Level on a common scale. Stress_Level has a moderate spread with its median near the center, so most observations cluster around average stress, with some higher and lower cases. Outliers represent relative extremes, not errors.

Scaling X and Y puts variables with very different original ranges on one axis, which makes the boxes directly comparable.


The descriptive statistics provide insights such as:
- Mean and median values for working hours, sleep, cognitive load, and stress levels.
- Minimum and maximum values to spot extreme cases.
- Standard deviation to assess variability among developers.
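
One common way to turn these statistics into an explicit outlier check is the 1.5 × IQR rule, the same convention box plots typically use for their whiskers. The helper below is a hypothetical sketch run on made-up values, not a step from the original analysis.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obvious extreme.
coffee = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 30], name="Coffee_Intake")
outliers = coffee[iqr_outlier_mask(coffee)]
print(outliers)
```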

### 3.2 Distribution Analysis

We visualize the distribution of key variables to understand patterns 
and skewness. This helps in detecting imbalances or unusual trends.


In [None]:
plt.figure(figsize=(7,4))
sns.histplot(data['Stress_Level'], bins=15, kde=True, color='Tomato')
plt.title('Distribution of Stress Level')
plt.xlabel('Stress Level')
plt.ylabel('Frequency')
plt.show()

print('\n')

plt.figure(figsize=(7,4))
sns.histplot(data['Sleep_Hours'], bins=15, kde=True, color='Skyblue')
plt.title('Distribution of Sleep Hours')
plt.xlabel('Sleep Hours')
plt.ylabel('Frequency')
plt.show()

print('\n')

plt.figure(figsize=(7,4))
sns.histplot(data['Hours_Coding'], bins=15, kde=True, color='green')
plt.title('Distribution of Hours Coding')
plt.xlabel('Hours Coding')
plt.ylabel('Frequency')
plt.show()

- Stress Level: Mostly uniform, with a spike at the high end showing a group under extreme stress.
- Sleep Hours: Uniformly spread between 4–9 hours, with 6 hours slightly more common.
- Coding Hours: Multimodal, indicating different patterns or groups of developers.

### 3.3 Relationship Analysis

We analyze relationships between stress levels and key features 
to identify potential predictors.


In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(x='Sleep_Hours',y='Stress_Level',data=data,color='purple')
plt.title('Sleep Hours vs Stress Level')
plt.xlabel('Sleep Hours')
plt.ylabel('Stress Level')
plt.show()

The scatter plot shows a wide dispersion of stress levels across all sleep durations (4–9 hours). There is no clear linear or nonlinear trend between sleep hours and stress level. High and low stress values appear at nearly all sleep durations, indicating a weak or negligible relationship between the two variables.

In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(x='Cognitive_Load', y='Stress_Level', data=data, color='orange')
plt.title('Stress_Level vs Cognitive_Load')
plt.xlabel('Cognitive Load')
plt.ylabel('Stress Level')
plt.show()

The scatter plot shows a strong positive relationship between cognitive load and stress level. As cognitive load increases, stress levels also rise in a nearly linear pattern, indicating that higher mental demands are associated with higher stress.

In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(x='AI_Usage_Hours', y='Stress_Level', data=data, color='green')
plt.title('Stress_Level vs AI_Usage_Hours')
plt.xlabel('AI Usage Hours')
plt.ylabel('Stress Level')
plt.show()

The scatter plot shows stress levels (≈30–100) across AI usage hours (0–6). Stress values are widely spread at every usage level, indicating no clear trend or strong correlation between AI usage hours and stress.

### 3.4 Correlation Analysis

We compute pairwise correlations to identify the strongest predictors 
of stress and detect multicollinearity among features.


In [None]:
corr_matrix = data[features + [target_stress]].corr().round(2)
display(corr_matrix)

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()

The heatmap shows that stress level is strongly correlated with cognitive load: higher mental workload goes with higher stress. Most other factors, such as AI usage, sleep, coding time, commits, and errors, show weak or no correlation with stress.

Strong relationships also appear between hours coding and lines of code, and between bugs found and bugs fixed. Overall, stress is mainly driven by cognitive demands.
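
Reading the strongest predictors off a heatmap can be error-prone, so it may help to rank features by their absolute correlation with the target. The sketch below does this on synthetic data built so that stress depends mostly on cognitive load; the frame `demo` and its values are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: stress is cognitive load plus noise, the rest is unrelated.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Cognitive_Load": rng.uniform(20, 100, n),
    "Sleep_Hours": rng.uniform(4, 9, n),
    "Coffee_Intake": rng.integers(0, 8, n),
})
demo["Stress_Level"] = demo["Cognitive_Load"] + rng.normal(0, 5, n)

# Correlation of every feature with the target, strongest first.
target_corr = (
    demo.corr()["Stress_Level"]
    .drop("Stress_Level")
    .sort_values(key=abs, ascending=False)
)
print(target_corr.round(2))
```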

**EDA Summary**

- Stress_Level varies moderately, suitable for regression.
- Cognitive_Load relates strongly to stress; Sleep_Hours and AI_Usage_Hours show little or no relationship.
- Workload features show expected correlations.
- No major anomalies affect modeling.

These insights inform feature selection and model choice.

## Stage 4: Stress Level Prediction Model

### 4.1 Model Selection

The target variable Stress_Level is continuous, so Linear Regression 
is selected as the modeling technique. The near-linear relationship with 
Cognitive_Load seen in the EDA supports this choice.
Linear Regression models the relationship between stress levels and multiple 
independent variables by fitting a linear equation to the observed data.


### 4.2 Train-test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)
print("Training samples: ", X_train.shape[0])
print("Testing samples: ", X_test.shape[0])
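
In the cells above, the scaler is fitted on all rows before the split, so test-set statistics influence the preprocessing. A common alternative is to split the raw features first and let a scikit-learn pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic stand-in data; the names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * [1.0, 50.0, 5.0] + [6.0, 300.0, 10.0]
y_demo = 0.8 * X_demo[:, 0] + 0.02 * X_demo[:, 1] + rng.normal(0, 0.5, 200)

# Split the raw (unscaled) features first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales using training statistics only, then fits the regression.
demo_pipeline = make_pipeline(StandardScaler(), LinearRegression())
demo_pipeline.fit(X_tr, y_tr)

print("Test R^2:", round(demo_pipeline.score(X_te, y_te), 3))
```

Calling predict or score on the pipeline applies the training-set scaling to new rows automatically.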

### 4.3 Model Training

The Linear Regression model is trained using the training dataset 
to learn the relationship between input features and stress levels.


In [None]:
from sklearn.linear_model import LinearRegression

LR_model = LinearRegression()

LR_model.fit(X_train, Y_train)


### 4.4 Model interpretation 

In [None]:
coefficients = pd.DataFrame({'Feature': features,'Coefficient': LR_model.coef_}).sort_values(by='Coefficient', ascending=False)

print("Intercept:", LR_model.intercept_)
display(coefficients)

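The table shows how each feature relates to Stress_Level in the fitted linear regression model. Because the features were standardized, each coefficient is the change in predicted stress for a one-standard-deviation increase in that feature, with the other features held fixed:

- Positive coefficients raise the predicted stress; negative coefficients lower it.
- Cognitive_Load (21.32) has by far the largest effect.
- Bugs_Fixed (1.44) and Task_Duration_Hours (0.15) are associated with slightly higher stress.
- Bug_Fix_Ratio (-0.76) and Bugs_Found (-0.85) are associated with slightly lower stress.
- Sleep_Hours and Coffee_Intake have smaller negative coefficients.

The intercept is the predicted stress when every standardized feature is zero, that is, when each feature is at its mean.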
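
Coefficients fitted on standardized features can be converted back to the original units by dividing each one by the standard deviation the scaler used. The sketch below demonstrates the conversion on synthetic data and checks that both forms give the same predictions; `demo_scaler`, `demo_model`, and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(1)
X_raw = np.column_stack([rng.uniform(20, 100, 300), rng.uniform(4, 9, 300)])
y_raw = 0.9 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(0, 3, 300)

demo_scaler = StandardScaler().fit(X_raw)
demo_model = LinearRegression().fit(demo_scaler.transform(X_raw), y_raw)

# Per-standard-deviation coefficient divided by the standard deviation
# gives the change in the target per one original unit of the feature.
coef_per_unit = demo_model.coef_ / demo_scaler.scale_
intercept_raw = demo_model.intercept_ - np.sum(
    demo_model.coef_ * demo_scaler.mean_ / demo_scaler.scale_
)

print("Per-std coefficients: ", demo_model.coef_.round(2))
print("Per-unit coefficients:", coef_per_unit.round(2))
```

The per-unit form answers questions such as "how much does predicted stress change per extra hour of sleep", while the per-standard-deviation form is better for comparing features with each other.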

### 4.5 Model Prediction

In [None]:
Y_predicted = LR_model.predict(X_test)

### 4.6 Actual vs Predicted Stress_Level


In [None]:
results_table = pd.concat([
    X_test.reset_index(drop=True),
    pd.DataFrame(Y_test).reset_index(drop=True),
    pd.DataFrame(Y_predicted, columns=['Predicted_Stress']).reset_index(drop=True)
], axis=1)

results_table['Residuals'] = results_table['Stress_Level'] - results_table['Predicted_Stress']

results_table

The table shows the test set results: features, actual Stress_Level, predicted stress, and residuals (actual minus predicted). A positive residual means the model underestimated the stress level; a negative residual means it overestimated. This helps assess model performance.

In [None]:
plt.scatter(Y_test, Y_predicted, color='blue', label='Predictions')
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], 
         color='red', linestyle='--', label='Perfect Fit')
plt.xlabel("Actual Stress_Level")
plt.ylabel("Predicted Stress_Level")
plt.title("Actual vs Predicted Stress_Level")
plt.legend()
plt.grid(True)
plt.show()


The scatter plot compares predicted and actual stress. The red dashed line marks perfect predictions. Most points lie close to the line, with small deviations scattered on both sides and no visible systematic bias, indicating good model performance.

### 4.7 Model Evaluation
The Linear Regression model is evaluated on the test set using MAE, RMSE, and R².
Together these metrics summarize the typical error size and the share of variance explained.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(Y_test, Y_predicted)
rmse = np.sqrt(mean_squared_error(Y_test, Y_predicted))
r2 = r2_score(Y_test, Y_predicted)

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

- MAE = 4.75: Predictions differ from actual stress by about 4.75 units on average.
- RMSE = 5.56: Only slightly above the MAE, which suggests that large errors are rare.
- R² = 0.93: The model explains 93% of the variance in stress on the test set.

These evaluation metrics confirm that the Linear Regression model provides reliable stress predictions and offers interpretable insights into how workload and lifestyle factors influence developer stress.
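
For reference, the three metrics can be computed directly from their definitions. The sketch below uses made-up actual and predicted values, not the notebook's results, and the variable names are illustrative.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
actual = np.array([60.0, 75.0, 90.0, 40.0, 55.0])
predicted = np.array([58.0, 80.0, 85.0, 45.0, 55.0])

errors = actual - predicted

# MAE: mean of absolute errors.
mae_demo = np.mean(np.abs(errors))
# RMSE: square root of the mean squared error; penalizes large errors more.
rmse_demo = np.sqrt(np.mean(errors ** 2))
# R^2: one minus residual sum of squares over total sum of squares.
r2_demo = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"MAE: {mae_demo:.2f}, RMSE: {rmse_demo:.2f}, R^2: {r2_demo:.2f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and a large gap between the two points to a few large misses.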


### 4.8 Prediction Example 

In [None]:
import pandas as pd

# A single hypothetical developer; Bug_Fix_Ratio is Bugs_Fixed / Bugs_Found.
new_data = pd.DataFrame({
    'Hours_Coding': [6],
    'Lines_of_Code': [350],
    'Bugs_Found': [10],
    'Bugs_Fixed': [7],
    'AI_Usage_Hours': [3],
    'Sleep_Hours': [6.5],
    'Cognitive_Load': [57],
    'Coffee_Intake': [3],
    'Task_Duration_Hours': [8],
    'Commits': [17],
    'Errors': [5],
    'Bug_Fix_Ratio': [0.7]
})

# Put the columns in the same order the model was trained on.
new_data = new_data[features]

# Reuse the scaler fitted in section 2.4 rather than fitting a new one,
# so the new row is scaled exactly like the training data.
new_data_scaled = pd.DataFrame(scaler.transform(new_data), columns=features)

predicted_stress = LR_model.predict(new_data_scaled)
print(f"Predicted Stress Level: {predicted_stress[0]:.2f}")


## Stage 5: Results and discussion

**Results and discussions are summarized in the report.**
