## The goal of collecting this dataset:
The goal of collecting the Student Stress Factors dataset is to conduct a comprehensive analysis of the factors contributing to student stress, with a focus on classifying students into different stress levels and clustering them based on common stress-related characteristics. By examining variables such as academic workload, personal life, social pressures, and mental health, this dataset aims to identify patterns and relationships that can classify students’ stress levels. Additionally, clustering techniques will be used to group students with similar stress profiles, which can provide insights for developing targeted strategies to reduce stress, improve well-being, and enhance academic performance.

## The source of the dataset:
https://www.kaggle.com/datasets/rxnach/student-stress-factors-a-comprehensive-analysis

In [48]:
import pandas as pd
df = pd.read_csv('Dataset/StressLevelDataset(in).csv')

## General information about the dataset:
Number of attributes: 21

Number of objects: 1100

Attribute types: All columns are integer types (int64)

Class label: stress_level

In [49]:
import pandas as pd
df = pd.read_csv('Dataset/StressLevelDataset(in).csv')
num_objects = len(df)
attributes_info = pd.DataFrame({
    'Attribute Name': df.columns,
    'Data Type': df.dtypes.values
})
print("Number of attributes:" ,len(df.columns))
print()
print("Attributes and their types:")
print(attributes_info)
print()
print("Number of objects: ",num_objects)

#### Check the Current Distribution of the Class Label:


In [50]:
# Check the current distribution of the class label

# Use value_counts to get the count of each unique value in the 'stress_level' column
# Set normalize=True to get the relative frequencies as percentages, multiplied by 100
class_distribution = df['stress_level'].value_counts(normalize=True) * 100
print("Class label distribution in the full dataset:") # Print a message to describe the output
print(class_distribution) # Display the distribution of the class labels in percentages

### Graphs:


#### Bar chart (Stress Level):

In [51]:
import matplotlib.pyplot as plt
import seaborn as sns

# Display the class distribution (percentage of each class label)
print(class_distribution)
plt.figure(figsize=(6, 4)) # Set up the figure size for the plot
sns.countplot(x='stress_level', data=df, color='lightblue') # Use Seaborn's countplot to plot the distribution of the 'stress_level' column
plt.title('Stress Level distribution') # Set the title of the plot
plt.show() # Display the plot

This bar chart indicates that the dataset has an equal distribution of data across all stress levels, where 0 represents low stress, 1 stands for medium stress, and 2 indicates high stress. This balance is crucial for ensuring the integrity of analysis and predictions, as it prevents class imbalance that can distort results or cause predictive models to become biased. In an imbalanced dataset, models may disproportionately favor the majority class (e.g., "low stress"), leading to inaccurate and unfair predictions for underrepresented categories (e.g., "high stress"). The balanced representation in this dataset ensures more reliable and fair predictions, allowing for better understanding and intervention across all stress levels, thereby promoting effective strategies for managing student stress.
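
Class balance can also be checked numerically rather than only by eye. The following is a minimal sketch on made-up labels; `labels` and `balance_ratio` are illustrative names that do not appear in the notebook, and in practice `df['stress_level']` would take the place of the toy Series.

```python
import pandas as pd

# Hypothetical labels standing in for df['stress_level'] (0 = low, 1 = medium, 2 = high).
labels = pd.Series([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])

# Number of rows per class, ordered by class label.
counts = labels.value_counts().sort_index()

# Ratio of the rarest class to the most common one; 1.0 means perfectly balanced.
balance_ratio = counts.min() / counts.max()

print(counts.to_dict())
print(f"Balance ratio: {balance_ratio:.2f}")
```

A ratio close to 1 supports the reading of the bar chart, while a ratio well below 1 would suggest considering resampling or class weights.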

#### Pie chart (Social Support):

In [52]:
data2 = df['social_support'].value_counts(normalize=True) * 100 # Calculate the percentage distribution of the 'social_support' column
custom_colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99'] 
data2.plot.pie(autopct='%1.1f%%', figsize=(6, 6), startangle=90, colors=custom_colors) # Plot the percentage distribution as a pie chart
plt.title('Percentage Distribution of Social Support') # Set the title of the pie chart
plt.ylabel('') # Remove the y-axis label for a cleaner pie chart presentation
plt.show() # Display the pie chart

The pie chart reveals that the majority of students (over 70%) experience either low or high levels of social support, with the largest percentage (41.6%) feeling well-supported. This suggests that many students have access to strong social networks, which can play a critical role in their academic success and mental health. However, around 8% of students report having no support, which is concerning. A lack of social support can significantly impact a student’s ability to manage stress, maintain motivation, and succeed academically. Addressing this issue through targeted interventions, such as peer mentoring, counseling services, or group activities, could help those with little to no support build stronger connections and improve their overall well-being. Understanding these different levels of social support is key to developing strategies that support students’ academic performance and mental health.

#### Box plot (Bullying):

In [65]:
# Create a box plot for the 'bullying' column in the dataset
sns.boxplot(data=df['bullying'], color='lightblue')
plt.title('Bullying Distribution') # Set the title for the plot
plt.show() # Display the plot

The box plot shows the distribution of the "bullying" variable, with the interquartile range (IQR) captured by the box, and the 25th and 75th percentiles at the edges. The median bullying level is around 3, as indicated by the line inside the box. The whiskers extend to the minimum and maximum values, showing that most data points lie between 1 and 5, with no significant outliers. This distribution suggests that bullying incidents are relatively common in the population, with most students experiencing moderate levels of bullying. The lack of outliers indicates that extreme cases are rare, and interventions may need to focus on addressing the more typical experiences of bullying rather than isolated severe incidents. Understanding this spread is crucial for tailoring prevention and support programs.
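
The quantities a box plot draws can be computed directly. This sketch uses hypothetical ratings (assumed here to run from 0 to 5) in place of `df['bullying']`, and the names `lower_fence` and `upper_fence` are illustrative.

```python
import pandas as pd

# Hypothetical ratings standing in for df['bullying'] (assumed to range from 0 to 5).
bullying = pd.Series([0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5])

# The box spans Q1 to Q3 and the line inside it marks the median.
q1, median, q3 = bullying.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the most extreme observations within 1.5 * IQR of the box;
# anything beyond these fences would be drawn as an individual outlier point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bullying[(bullying < lower_fence) | (bullying > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```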

#### Histogram (Stress level, Academic performance):

In [54]:
plt.figure(figsize=(10, 6)) # Set the figure size for the histogram
plt.hist(df['stress_level'], 
         bins=15,                 # Number of bins for the histogram
         color='coral',          # Color for the Stress Level histogram
         edgecolor='black',      # Color of the bin edges
         alpha=0.6,              # Transparency level of the histogram
         label='Stress Level',    # Label for the legend
         histtype='stepfilled',   # Style of the histogram
         linewidth=1.5)          # Width of the bin edges

plt.hist(df['academic_performance'], 
         bins=15,                 # Number of bins for the histogram
         color='darkblue',       # Color for the Academic Performance histogram
         edgecolor='white',      # Color of the bin edges
         alpha=0.6,              # Transparency level of the histogram
         label='Academic Performance', # Label for the legend
         histtype='stepfilled',   # Style of the histogram
         linewidth=1.5)          # Width of the bin edges

plt.title('Comparison of Stress Level and Academic Performance Distribution', fontsize=14, fontweight='bold')  # Title of the plot
plt.xlabel('Value', fontsize=12)        # X-axis label
plt.ylabel('Frequency', fontsize=12)    # Y-axis label
plt.grid(True, linestyle='--', alpha=0.7)  # Style of the gridlines
plt.legend(frameon=True, fancybox=True, shadow=True, loc='upper right', fontsize=11)  # Legend properties

# Display the histogram
plt.show()


The histogram overlays the distributions of stress level and academic performance on a shared value axis. Each histogram shows only how often each value of one variable occurs, so the plot compares the two distributions but cannot by itself show whether students with higher academic achievement tend to exhibit elevated stress levels. Testing that relationship requires pairing the two variables for each student, for example by comparing mean academic performance across stress levels or by computing a correlation.
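
A relationship between two variables is easier to check with a grouped summary or a rank correlation than with overlaid histograms. The sketch below uses made-up values, so its direction says nothing about the real dataset; in practice the two columns of `df` would replace the `toy` frame.

```python
import pandas as pd

# Made-up values for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    'stress_level':         [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'academic_performance': [5, 4, 4, 3, 3, 2, 1, 2, 1],
})

# Mean academic performance within each stress level.
group_means = toy.groupby('stress_level')['academic_performance'].mean()

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
# It suits ordinal scores; the sign gives the direction of the relationship.
rho = toy['stress_level'].rank().corr(toy['academic_performance'].rank())

print(group_means)
print(f"Spearman rho: {rho:.2f}")
```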

#### Scatter plot (Anxiety Level, Self-esteem):

In [64]:
plt.figure(figsize=(10, 6))  # Set the figure size for the scatter plot

# Normalize the anxiety level for color mapping
norm = plt.Normalize(df['anxiety_level'].min(), df['anxiety_level'].max())

# Create a scatter plot with colors based on anxiety levels
scatter = plt.scatter(df['anxiety_level'], df['self_esteem'], 
                      c=df['anxiety_level'],  # Color based on anxiety levels
                      cmap='viridis',         # Colormap to use
                      norm=norm,              # Normalize values for colormap
                      alpha=0.7,             # Set transparency of points
                      edgecolor='black')      # Outline color for points

plt.title('Anxiety Level vs. Self esteem')  # Set the title for the scatter plot to describe the data being represented
plt.xlabel('Anxiety Level')  # Label the x-axis to indicate that it represents anxiety levels
plt.ylabel('Self esteem')  # Label the y-axis to indicate that it represents self-esteem
cbar = plt.colorbar(scatter) # Add a colorbar to indicate the mapping of colors to anxiety levels
cbar.set_label('Anxiety Level')  # Label for the color bar

plt.show() 



This plot illustrates a negative relationship between anxiety level and self-esteem: as anxiety levels increase, self-esteem tends to decrease. This finding aligns with expectations in psychological studies, where heightened anxiety is often associated with lower self-worth and confidence. The observed pattern indicates a potential link between these variables in the data, highlighting the importance of addressing anxiety to promote healthier self-esteem. A scatter plot shows association, not causation, so further analysis of the correlation and of possible causal pathways may be warranted to understand these dynamics and to develop effective interventions.
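
The strength of a linear trend like this one can be summarized with a correlation coefficient and a fitted slope. The values below are invented for illustration, and `r`, `slope`, and `intercept` are names chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration; the real columns are
# df['anxiety_level'] and df['self_esteem'].
toy = pd.DataFrame({
    'anxiety_level': [2, 5, 8, 11, 14, 17, 20],
    'self_esteem':   [28, 25, 24, 18, 15, 10, 6],
})

# Pearson correlation coefficient: -1 is a perfect negative linear relationship.
r = toy['anxiety_level'].corr(toy['self_esteem'])

# Least-squares line: the slope is the expected change in self-esteem
# for each additional point of anxiety.
slope, intercept = np.polyfit(toy['anxiety_level'], toy['self_esteem'], deg=1)

print(f"Pearson r: {r:.2f}, slope: {slope:.2f}")
```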

### Statistical summaries:

In [9]:
# Calculate statistical summaries (mean, variance, etc.)
stat_summary = df.describe().T  # Transpose the summary for better readability
stat_summary['variance'] = df.var() # Calculate variance for each column and add it to the statistical summary
print(stat_summary[['mean', 'std', 'variance']]) # Display the statistical summary including mean, standard deviation, and variance

# Data Preprocessing:

## Handling Duplicates

Duplicate rows in a dataset can introduce redundancy, skewing the analysis and leading to inaccurate results. In this step, we counted duplicate rows with the duplicated() function and removed them with drop_duplicates(). This ensures that each data point is unique, preserving the quality and integrity of the dataset.

In [10]:
# Handling Duplicates
num_duplicates = df.duplicated().sum()
print("Number of duplicate rows:", num_duplicates)

# Remove duplicates and save the cleaned dataset
data = df.drop_duplicates()

# Save after handling duplicates
data.to_csv('Cleaned_Dataset.csv', index=False)

Result:

Upon checking the dataset, the result showed 0 duplicate rows, meaning that no redundant data points were found. The dataset is clean, and no further action regarding duplicates was necessary.
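
To see what the two pandas calls do when duplicates are present, here is a small sketch on a toy frame with one repeated row. The values are hypothetical.

```python
import pandas as pd

# Toy frame with one repeated row (hypothetical values).
toy = pd.DataFrame({
    'anxiety_level': [14, 15, 14, 12],
    'stress_level':  [1, 2, 1, 0],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged.
num_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence and returns a new DataFrame.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("Number of duplicate rows:", num_duplicates)
print(deduplicated)
```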

## Missing values:

In [11]:
# Check for missing values
missing_values = df.isnull().sum() # This creates a Series containing the count of missing values per column
print("Missing values per column:") # Print a message indicating that missing values will be displayed
print(missing_values) # Print missing values per column

## Handling Outliers:

Outliers are extreme values that can impact the accuracy of data analysis. To address this, we used the Interquartile Range (IQR) method, which identifies outliers by looking for values significantly above or below the normal range in numeric columns. Instead of removing these outliers, we cap their values to minimize their influence while preserving the overall dataset structure.

In [12]:
data = pd.read_csv('Cleaned_Dataset.csv')
import numpy as np

# Outlier handling using IQR method
outlier_threshold = 1.5

def count_outliers(column_data):
    q1 = np.percentile(column_data, 25)
    q3 = np.percentile(column_data, 75)
    iqr = q3 - q1
    upper_bound = q3 + outlier_threshold * iqr
    lower_bound = q1 - outlier_threshold * iqr
    outliers = (column_data > upper_bound) | (column_data < lower_bound)
    return sum(outliers)

# Select numeric columns
numeric_columns = data.select_dtypes(include=[np.number]).columns

# Detect outliers in each numeric column
outlier_counts = {}
total_rows_with_outliers = 0

for column in numeric_columns:
    outliers = count_outliers(data[column])
    outlier_counts[column] = outliers
    total_rows_with_outliers += outliers

# Print outlier summary
print("Outlier Counts:")
for column, count in outlier_counts.items():
    print(f"{column}: {count} rows with outliers")

print(f"Total Rows with Outliers: {total_rows_with_outliers}")

Result:

The result shows that most variables have no outliers, indicating well-distributed data. However, noise_level (173 rows), study_load (165 rows), and living_conditions (62 rows) contain a significant number of outliers, suggesting unusual variations in these columns. These outliers may require further investigation.
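
The IQR rule and the capping strategy described above can be combined in one small helper. This is a sketch under the same 1.5 multiplier; `cap_outliers_iqr` is a hypothetical function name and the sample values are invented.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, threshold: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - threshold*IQR, Q3 + threshold*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - threshold * iqr, upper=q3 + threshold * iqr)

# Hypothetical column with one extreme value.
noise = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20])
capped = cap_outliers_iqr(noise)

print(capped.tolist())
```

Capping keeps every row, so the dataset size is unchanged; only the extreme value is pulled back to the nearest bound.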

##### Capping the outliers:

In [13]:
import numpy as np
import pandas as pd

data = pd.read_csv('Cleaned_Dataset.csv')

# Multiplier applied to the IQR to set the outlier bounds
outlier_threshold = 1.5

# Select numeric columns
numeric_columns = data.select_dtypes(include=[np.number]).columns

for column in numeric_columns:
    # Compute the IQR bounds from the non-missing values
    non_na_data = data[column].dropna()
    q1 = np.percentile(non_na_data, 25)
    q3 = np.percentile(non_na_data, 75)
    iqr = q3 - q1
    upper_bound = q3 + outlier_threshold * iqr
    lower_bound = q1 - outlier_threshold * iqr

    # Cap values outside the bounds instead of removing the rows
    data[column] = np.clip(data[column], lower_bound, upper_bound)

# Save the dataset with capped outliers
data.to_csv('Cleaned_Dataset.csv', index=False)


Checking the results by counting outliers after handling them.

In [14]:
data = pd.read_csv('Cleaned_Dataset.csv')
import numpy as np

# Outlier handling using IQR method
outlier_threshold = 1.5

def count_outliers(column_data):
    q1 = np.percentile(column_data, 25)
    q3 = np.percentile(column_data, 75)
    iqr = q3 - q1
    upper_bound = q3 + outlier_threshold * iqr
    lower_bound = q1 - outlier_threshold * iqr
    outliers = (column_data > upper_bound) | (column_data < lower_bound)
    return sum(outliers)

# Select numeric columns
numeric_columns = data.select_dtypes(include=[np.number]).columns

# Detect outliers in each numeric column
outlier_counts = {}
total_rows_with_outliers = 0

for column in numeric_columns:
    outliers = count_outliers(data[column])
    outlier_counts[column] = outliers
    total_rows_with_outliers += outliers

# Print outlier summary
print("Outlier Counts:")
for column, count in outlier_counts.items():
    print(f"{column}: {count} rows with outliers")

print(f"Total Rows with Outliers: {total_rows_with_outliers}")


# Data Transformation


Encoding was not used because it is only necessary for categorical or textual data. Since the dataset contained numerical data, there was no need for encoding.


## Normalization
Normalization was applied because the features are numeric and measured on different scales. We used decimal scaling, which divides each column by 10^j, where j is the number of digits in the column's largest absolute value, so that every scaled value has an absolute value below 1. This ensures that features with larger magnitudes do not disproportionately influence machine learning models.

In [15]:
data1 = pd.read_csv('Cleaned_Dataset.csv')
data1 = pd.DataFrame(data1)
# Columns to normalize
columns_to_normalize = [
    'anxiety_level', 'self_esteem', 'depression', 'blood_pressure', 
    'sleep_quality', 'breathing_problem', 'noise_level', 'living_conditions', 
    'study_load', 'future_career_concerns', 'social_support', 'peer_pressure', 
    'extracurricular_activities', 'bullying', 'stress_level'
]

# Apply Decimal scaling normalization
for column in columns_to_normalize:
    max_abs_value = data1[column].abs().max()
    data1[column] = data1[column] / (10 ** len(str(int(max_abs_value))))

# Output the normalized data
print(data1.head())

# Save the normalized dataset
data1.to_csv('Cleaned_Dataset.csv', index=False)

Result:

As seen in the table, the values for variables such as anxiety_level, self_esteem, blood_pressure, and others have been normalized. For example, blood_pressure now ranges between 0 and 1, ensuring consistent scaling across all features. This allows for more balanced analysis and model training without certain features dominating due to larger magnitudes.
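
The notebook cell uses decimal scaling; min-max scaling is a common alternative that stretches each column to fill the full 0 to 1 range. The sketch below compares the two on invented scores. The function names are hypothetical and min-max scaling is shown only for contrast, not as something the notebook applies.

```python
import pandas as pd

def decimal_scale(series: pd.Series) -> pd.Series:
    """Divide by 10**j, where j is the digit count of the largest absolute value."""
    digits = len(str(int(series.abs().max())))
    return series / (10 ** digits)

def min_max_scale(series: pd.Series) -> pd.Series:
    """Map the smallest value to 0 and the largest to 1."""
    return (series - series.min()) / (series.max() - series.min())

# Hypothetical scores.
scores = pd.Series([0, 7, 14, 21])

print(decimal_scale(scores).tolist())
print(min_max_scale(scores).tolist())
```

Decimal scaling only moves the decimal point, so ratios between values are preserved, but a column may occupy just part of the interval. Min-max scaling always uses the whole interval.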

# Aggregation

Aggregation summarizes data by grouping it on a specific category. In this step, we grouped the dataset by stress_level and calculated the mean of anxiety_level, depression, and self_esteem. For bullying, which is an ordinal score rather than a continuous measurement, we summed the values to show the total bullying score within each stress level group.

In [16]:
data1 = pd.read_csv('Cleaned_Dataset.csv')
# Step 5: Aggregation based on stress_level
aggregated_df = data1.groupby('stress_level').agg({
    'anxiety_level': 'mean',  
    'depression': 'mean',    
    'self_esteem': 'mean',    
    'bullying': 'sum'  # Example of a sum: total bullying score per group
})



# Output aggregated data
print("Aggregated data:")
print(aggregated_df)

Result:

The aggregated data shows the mean values of anxiety_level, depression, and self_esteem for each stress_level group, as well as the sum of the bullying scores. For instance, individuals with a stress_level of 2 have a higher mean anxiety_level (16.40) and depression (19.83) compared to those with a stress_level of 0. This helps identify patterns and correlations between stress levels and various psychological and behavioral metrics.
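
When several statistics are computed per group, pandas named aggregation keeps the output columns self-describing. This sketch uses invented rows, and the output column names are illustrative. Including the group size alongside a sum helps when groups differ in size, because a larger group produces a larger total even at the same average score.

```python
import pandas as pd

# Hypothetical rows; values are illustrative only.
toy = pd.DataFrame({
    'stress_level':  [0, 0, 1, 1, 2, 2],
    'anxiety_level': [4, 6, 10, 12, 16, 18],
    'bullying':      [1, 1, 2, 3, 4, 5],
})

# Named aggregation: each output column states which statistic it holds.
summary = toy.groupby('stress_level').agg(
    mean_anxiety=('anxiety_level', 'mean'),
    mean_bullying=('bullying', 'mean'),
    total_bullying=('bullying', 'sum'),
    n_students=('bullying', 'size'),
)

print(summary)
```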

# Discretization

Discretization is a process where continuous data is divided into discrete categories or bins. In this step, we transformed the anxiety_level variable into three categories: Low, Medium, and High. This simplifies the data and makes it easier to analyze trends across different levels of anxiety.

In [17]:
# Load the dataset
data1 = pd.read_csv('Cleaned_Dataset.csv')


# Discretization of anxiety_level into categories
data1['anxiety_level'] = pd.cut(data1['anxiety_level'], bins=3, labels=['Low', 'Medium', 'High'])

# Save the discretized dataset
data1.to_csv('Cleaned_Dataset.csv', index=False)

# Display the first few rows
print("Data after discretization:")
print(data1[['anxiety_level']].head())

Result:

The anxiety_level column has been discretized into the bins Low, Medium, and High. For example, an anxiety_level of 14 is classified as Medium, while a value of 16 is classified as High. This categorization helps in interpreting the data more intuitively and allows for easier comparisons across different groups.
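
With `bins=3`, `pd.cut` splits the observed range into three equal-width intervals, and passing `retbins=True` returns the cut points so they can be inspected. The scores below are hypothetical and assume a 0 to 21 range.

```python
import pandas as pd

# Hypothetical anxiety scores spanning 0 to 21.
anxiety = pd.Series([0, 3, 7, 10, 14, 16, 21])

# bins=3 creates three equal-width, right-closed intervals over the observed range;
# retbins=True also returns the bin edges.
binned, edges = pd.cut(
    anxiety, bins=3, labels=['Low', 'Medium', 'High'], retbins=True
)

print(pd.DataFrame({'anxiety_level': anxiety, 'category': binned}))
print("Bin edges:", edges)
```

Because the intervals are closed on the right, a score that sits exactly on an edge falls into the lower category.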

# Feature Selection
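The cell below applies SelectKBest with the ANOVA F-test (f_classif) to keep the five features that score highest against anxiety_level, which the code uses as the target. The number of available features is 20, and the selected features are blood pressure, sleep quality, future career concerns, bullying, and stress level.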
 

In [18]:
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd

# Load the dataset
df = pd.read_csv('Cleaned_Dataset.csv')

# Separate features from the target variable
# errors='ignore' skips 'anxiety_level_binned' if that column is not present in the file
X = df.drop(columns=['anxiety_level', 'anxiety_level_binned'], errors='ignore')
y = df['anxiety_level']

# Select only numeric columns for feature selection
X_numeric = X.select_dtypes(include='number')

# Check the number of features
n_features = X_numeric.shape[1]
print('Number of features available:', n_features)

# Specify the number of features to choose
num_features_to_select = min(5, n_features)  # Choose the least between 5 and the actual number of features
selector = SelectKBest(score_func=f_classif, k=num_features_to_select)

# Apply feature selection
X_selected = selector.fit_transform(X_numeric, y)

# Get selected feature indicators
selected_indices = selector.get_support(indices=True)

# Get selected feature names
selected_features = X_numeric.columns[selected_indices]

print('Selected Features:', selected_features)
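
The score that `f_classif` assigns to each feature is based on the one-way ANOVA F statistic: variation between class means divided by variation within classes. The following NumPy sketch computes that ratio by hand on invented data. `anova_f` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import numpy as np

def anova_f(feature: np.ndarray, labels: np.ndarray) -> float:
    """One-way ANOVA F statistic of a single feature across the classes in labels."""
    classes = np.unique(labels)
    groups = [feature[labels == c] for c in classes]
    grand_mean = feature.mean()
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(classes) - 1
    df_within = len(feature) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical data: one feature that separates the classes and one that does not.
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
informative = np.array([1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.0, 5.2, 4.9])
uninformative = np.array([2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0])

scores = {
    'informative': anova_f(informative, labels),
    'uninformative': anova_f(uninformative, labels),
}
print(scores)
```

A feature whose class means are far apart relative to its spread within each class gets a high score, and a selector that keeps the top k features would retain it first.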

## Loading Data

In [163]:
#Load data
df=pd.read_csv('Cleaned_Dataset.csv')
print(df)