# Correlation and Feature Selection

In this notebook, we will analyze the correlation matrix of the cleaned sleep dataset and select features based on their correlation with the target variable. This will help us understand which features are most relevant for predicting stress levels.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the cleaned dataset
data_path = '../data/processed/sleep_cleaned.csv'
df = pd.read_csv(data_path)

df.head()

## Correlation Matrix

We will compute the pairwise Pearson correlation matrix (the pandas default) to identify linear relationships among the features and between each feature and the target variable.

In [2]:
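# Restrict to numeric columns; pandas >= 2.0 raises an error if df.corr()
# encounters non-numeric columns (e.g. strings)
correlation_matrix = df.corr(numeric_only=True)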
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Matrix')
plt.show()
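
As a minimal sketch of what the correlation step does, the toy frame below mixes numeric and text columns. The column names `sleep_duration` and `occupation` and all values are made up for illustration and are not taken from the sleep dataset. With `numeric_only=True`, the text column is left out of the resulting matrix.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative assumptions
toy = pd.DataFrame({
    "sleep_duration": [8.0, 7.0, 6.0, 5.0, 4.0],
    "stress_level": [2, 4, 6, 8, 10],
    "occupation": ["nurse", "engineer", "teacher", "nurse", "engineer"],
})

# Pearson correlation over numeric columns only; 'occupation' is skipped
toy_corr = toy.corr(numeric_only=True)
print(toy_corr)
```

In this toy example the two numeric columns are related by an exact decreasing linear rule, so their coefficient sits at the negative end of the [-1, 1] range.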

## Feature Selection

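Based on the correlation matrix, we will select the features whose absolute correlation with the target variable (stress level) exceeds a chosen threshold, here 0.3. Taking the absolute value keeps strongly negative correlations as well as strongly positive ones. Note that this is a cutoff on correlation strength, not a test of statistical significance.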

In [3]:
target_variable = 'stress_level'
correlation_threshold = 0.3

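# Correlation of every numeric feature with the target, excluding the target itself
target_correlations = correlation_matrix[target_variable].drop(target_variable)

# Keep features whose absolute correlation exceeds the threshold
relevant_features = target_correlations[target_correlations.abs() > correlation_threshold].index.tolist()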

print('Relevant features based on correlation:', relevant_features)
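
The thresholding step can also be wrapped in a small helper, sketched below on toy data. The function name `select_by_correlation` and the columns `sleep_duration`, `heart_rate`, and `daily_steps` are hypothetical and chosen only for illustration; the real dataset's columns may differ.

```python
import pandas as pd

def select_by_correlation(frame, target, threshold):
    """Return features whose |Pearson r| with `target` exceeds `threshold`,
    ordered from strongest to weakest absolute correlation."""
    corr_with_target = frame.corr(numeric_only=True)[target].drop(target)
    selected = corr_with_target[corr_with_target.abs() > threshold]
    return selected.sort_values(key=lambda s: s.abs(), ascending=False)

# Hypothetical toy data; names and values are illustrative assumptions
toy = pd.DataFrame({
    "stress_level":   [1, 2, 3, 4, 5],
    "sleep_duration": [9, 8, 7, 6, 5],       # falls as stress rises
    "heart_rate":     [60, 65, 63, 70, 72],  # tends to rise with stress
    "daily_steps":    [5, 3, 1, 3, 5],       # no linear relationship
})

selected = select_by_correlation(toy, "stress_level", 0.3)
print(selected)
```

Returning the coefficients rather than only the names keeps the sign of each relationship visible, which a plain list of feature names would hide.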

## Conclusion

In this notebook, we analyzed the correlation matrix and selected the features whose absolute correlation with the target variable exceeds the chosen threshold. These features are a starting point for building a predictive model of stress levels from sleep patterns and lifestyle factors.