# 📁 02 - Exploratory Data Analysis (EDA)

### 🎯 Objective
This notebook explores the cleaned and merged dataset to understand patterns, trends, and potential relationships between variables. The focus is on the target variable (e.g., `stress_score`) and its interactions with other features like nutrition, sleep, heart rate, and physical activity.

---

### 🛠️ Key Steps

1. **Trend & Seasonality**  
   - Line plot of the target over time  
   - Time decomposition (trend, seasonal, residual)

2. **Distributions & Outliers**  
   - Histogram and boxplots of the target variable 
   - Comparison between weekday and weekend behaviors

3. **Relationships**  
   - Scatter plots of target vs selected variables  

4. **Correlation Analysis**  
   - Heatmap of selected features  
   - Identify multicollinearity or predictive potential

5. **Autocorrelation & Lag Structure**  
   - ACF and PACF plots  
   - Manual lag feature creation for temporal impact analysis

---

### 📦 Output
- **Charts and plots** supporting interpretation of features  
- **Feature candidates** to be carried forward into modeling  

---

> 📝 Note: Insights from this notebook are used to guide feature engineering and model selection in the next stage of the pipeline.


In [None]:
import pandas as pd
import numpy as np
np.set_printoptions(suppress=True)

In [None]:
# Because of the float64 conversion in the previous notebook, NaN values may be treated as zeros.
# For this reason, some calls show them as NaN and others do not.
data = pd.read_csv('../data/processed/data.csv')

In [None]:
data.dtypes

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

In [None]:
data.describe(include='all')
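The loading comment above suggests that some missing values may have been stored as zeros. A per-column count of NaNs and exact zeros can help tell the two apart. The sketch below runs on a small made-up frame; the helper name `missing_and_zero_counts` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_and_zero_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN and exact-zero entries for every numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_nan": numeric.isna().sum(),
        "n_zero": (numeric == 0).sum(),
        "n_rows": len(numeric),
    })

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "stress_score": [55.0, np.nan, 61.0, 0.0],
    "protein": [0.0, 0.0, 80.5, 92.1],
})
summary = missing_and_zero_counts(toy)
print(summary)
```

A column where zeros are physically implausible (a stress score of exactly 0, for instance) but frequent is a candidate for zeros that were originally missing values.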

---

#### Feature Engineering (for plotting purposes only)

In [None]:
# First, convert the date column to datetime:
data['date'] = pd.to_datetime(data['date'])

# Then extract these calendar features from it:
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
data['is_weekend'] = data['day_of_week'].isin([5, 6]).astype(int)


---

#### Data Visualization
##### Because of the high dimensionality, I only plot the variables and relationships that I personally find interesting.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# First, I'm interested in seeing the target variable over time:
plt.figure(figsize=(18, 9))
sns.lineplot(x='date', y='stress_score', data=data)
plt.title('Stress Score over Time')
plt.xlabel('Date')
plt.ylabel('Stress Score')
plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose with a weekly period; index by date so the shared x-axis shows dates
decomp = seasonal_decompose(data.set_index('date')['stress_score'], model='additive', period=7)

# Plot with customization
fig, axs = plt.subplots(4, 1, figsize=(12, 8), sharex=True)

axs[0].plot(decomp.observed, label='Observed', color='black')
axs[0].legend(loc='upper left')
axs[0].set_ylabel('Score')

axs[1].plot(decomp.trend, label='Trend', color='blue')
axs[1].legend(loc='upper left')
axs[1].set_ylabel('Trend')

axs[2].plot(decomp.seasonal, label='Seasonality', color='green')
axs[2].legend(loc='upper left')
axs[2].set_ylabel('Seasonal')

axs[3].plot(decomp.resid, label='Residuals', color='red')
axs[3].legend(loc='upper left')
axs[3].set_ylabel('Residuals')
axs[3].set_xlabel('Date')

# Super Title
plt.suptitle('Seasonal Decomposition of Stress Score', fontsize=14)
# Automatically adjusts subplot parameters
plt.tight_layout()
plt.show()
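To make the decomposition less of a black box, here is a minimal pandas-only sketch of the classical additive approach for an odd period such as 7: a centered moving average for the trend, per-position averages of the detrended series for the seasonal part, and whatever is left as the residual. The function name `additive_decompose` and the synthetic data are assumptions for illustration, and `statsmodels` may handle the series edges differently.

```python
import numpy as np
import pandas as pd

def additive_decompose(series: pd.Series, period: int = 7) -> pd.DataFrame:
    """Classical additive decomposition for an odd period (illustrative sketch)."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    # Trend: centered moving average over one full period.
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: mean detrended value at each position in the cycle,
    # shifted so the seasonal effects average to zero over one period.
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means -= seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)
    resid = series - trend - seasonal
    return pd.DataFrame({"observed": series, "trend": trend,
                         "seasonal": seasonal, "resid": resid})

# Synthetic daily series: linear trend plus a fixed weekly pattern (made-up values).
pattern = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0])
t = np.arange(70)
dates = pd.date_range("2024-01-01", periods=len(t), freq="D")
toy = pd.Series(50 + 0.1 * t + np.tile(pattern, 10), index=dates)

parts = additive_decompose(toy, period=7)
print(parts["seasonal"].iloc[:7].round(3).tolist())
```

Because the seasonal component is an average over a cycle length chosen in advance, a decomposition with `period=7` always returns some weekly pattern. Its size relative to the residual is what indicates whether the seasonality is meaningful.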


In [None]:
# How are the target's values distributed?
plt.figure(figsize=(12,8))
sns.histplot(data['stress_score'])
plt.title("Histogram of Stress Score")
plt.xlabel('Stress Score')
plt.ylabel('Number of occurrences')
plt.show()

- Apart from a few outliers, the stress score distribution appears approximately normal.
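A visual impression of normality can be backed by two summary numbers: sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below is a rough check, not a formal test; `shape_summary` is a hypothetical helper and the scores are simulated rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

def shape_summary(series: pd.Series) -> pd.Series:
    """Return sample size, skewness, and excess kurtosis of a series."""
    clean = series.dropna()
    return pd.Series({
        "n": len(clean),
        "skew": clean.skew(),             # 0 for a symmetric distribution
        "excess_kurtosis": clean.kurt(),  # 0 for a normal distribution
    })

# Simulated scores standing in for data['stress_score'] (assumption).
rng = np.random.default_rng(42)
toy_scores = pd.Series(rng.normal(loc=60, scale=8, size=5000), name="stress_score")

shape = shape_summary(toy_scores)
print(shape.round(3))
```

On the real column, a strongly positive or negative skew, or a large excess kurtosis driven by outliers, would argue against treating the target as normal.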

In [None]:
plt.figure(figsize=(6,14))
# boxenplot draws a letter-value plot; pass the series as y for a vertical layout
sns.boxenplot(y=data['stress_score'])
plt.title('Boxen Plot of Stress Score')
plt.ylabel('Stress Score')
plt.show()
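To put a number on the outliers seen in the plot, one common convention is the 1.5 × IQR rule used by standard box plots. This is a sketch of that rule only; it is not necessarily the criterion seaborn's `boxenplot` applies, and the helper name `iqr_outliers` and the sample values are made up.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    clean = series.dropna()
    q1, q3 = clean.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]

# Made-up scores with two obvious extremes.
toy_scores = pd.Series([50, 52, 49, 51, 53, 48, 50, 95, 51, 12], name="stress_score")
outliers = iqr_outliers(toy_scores)
print(sorted(outliers.tolist()))  # [12, 95]
```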

In [None]:
# Does the day of the week affect the stress score?
plt.figure(figsize=(12,8))
sns.boxenplot(data=data, x='day_of_week', y='stress_score')
plt.title('Stress Score per Day of the week')
plt.xlabel('Day of the week')
# Pass the original tick positions first, then the labels that rename them:
plt.xticks([0,1,2,3,4,5,6], ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.ylabel('Stress Score')
plt.show()
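The `is_weekend` flag created earlier can also be used for a direct weekday versus weekend comparison. The following sketch builds a simulated frame with the same column names (`date`, `stress_score`, `is_weekend`) and summarizes the target per group; the numbers are random and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Simulated 8 weeks of daily scores, starting on a Monday (assumption).
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=56, freq="D")
toy = pd.DataFrame({
    "date": dates,
    "stress_score": rng.normal(60, 8, size=len(dates)),
})
toy["is_weekend"] = toy["date"].dt.dayofweek.isin([5, 6]).astype(int)

# Per-group summary statistics of the target.
summary = toy.groupby("is_weekend")["stress_score"].agg(["count", "mean", "median", "std"])
summary.index = summary.index.map({0: "weekday", 1: "weekend"})
print(summary.round(2))
```

If the group means differ by much less than the within-group standard deviation, the weekday/weekend split is unlikely to be a useful feature on its own.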

In [None]:
# What about the stress score vs. sleep mental recovery and sleep score?
fig, axes = plt.subplots(1,2, figsize=(12,8)) # 1 row, 2 columns, so axes is indexed with a single integer:

axes[0].scatter(x='stress_score', y='sleep_mental_recovery', data=data)
axes[0].set_title('Stress and Sleep Mental Recovery')
axes[0].set_xlabel('Stress Score')
axes[0].set_ylabel('Sleep Mental Recovery')

axes[1].scatter(x='stress_score', y='sleep_sleep_score', data=data)
axes[1].set_title('Stress and Sleep Score')
axes[1].set_xlabel('Stress Score')
axes[1].set_ylabel('Sleep Score')

plt.show()

- The points are scattered without any visible structure, so there is no apparent relationship between stress score and either sleep metric.
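A scatter plot that looks like a shapeless cloud can be double-checked with correlation coefficients. Pearson captures linear association, while Spearman captures any monotonic one. The sketch below compares both on simulated data; `monotone_feature` is a hypothetical column added only to show a case where the two coefficients differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
toy = pd.DataFrame({"stress_score": rng.normal(60, 8, size=200)})
# Independent of the target by construction (simulated).
toy["sleep_mental_recovery"] = rng.normal(70, 10, size=200)
# Hypothetical feature with a monotonic but nonlinear link to the target.
toy["monotone_feature"] = np.exp(toy["stress_score"] / 10)

corrs = pd.DataFrame({
    "pearson": toy.corr(method="pearson")["stress_score"],
    "spearman": toy.corr(method="spearman")["stress_score"],
}).drop(index="stress_score")
print(corrs.round(3))
```

Values near 0 for both coefficients support the "no relationship" reading, although neither coefficient detects non-monotonic patterns such as a U-shape.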

In [None]:
# What about the main macronutrients and total calories against the stress score?

# Create a figure and a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 12)) # 2 rows, 2 columns

axes[0,0].scatter(x='ingested_calories', y='stress_score', data=data)
axes[0,0].set_title('Ingested Calories vs Stress Score')

axes[0,1].scatter(x='carbohydrate', y='stress_score', data=data)
axes[0,1].set_title('Carbs vs Stress Score')

axes[1,0].scatter(x='protein', y='stress_score', data=data)
axes[1,0].set_title('Protein vs Stress Score')

axes[1,1].scatter(x='total_fat', y='stress_score', data=data)
axes[1,1].set_title('Fats vs Stress Score')
plt.show()

In [None]:
# Daily activity against stress score:
fig, axes = plt.subplots(2,2, figsize=(12,12))

axes[0,0].scatter(x='total_excercise_calories', y='stress_score', data=data)
axes[0,0].set_title('Total Exercise Calories vs Stress Score')
axes[0,0].set_xlabel('Total exercise calories')
axes[0,0].set_ylabel('Stress Score')

axes[0,1].scatter(x='burned_tef_calories', y='stress_score', data=data)
axes[0,1].set_title('Burned TEF Calories vs Stress Score')
axes[0,1].set_xlabel('Burned TEF calories')
axes[0,1].set_ylabel('Stress Score')

axes[1,0].scatter(x='burned_active_time', y='stress_score', data=data)
axes[1,0].set_title('Burned Active Time vs Stress Score')
axes[1,0].set_xlabel('Burned Active Time')
axes[1,0].set_ylabel('Stress Score')

axes[1,1].scatter(x='burned_rest_calories', y='stress_score', data=data)
axes[1,1].set_title('Burned Rest Calories vs Stress Score')
axes[1,1].set_xlabel('Burned Rest Calories')
axes[1,1].set_ylabel('Stress Score')

plt.show()

In [None]:
# Because of the high dimensionality, I select a subset of features based on personal judgment:
interest_cols = ['total_floors_climbed', 'potassium', 'total_fat', 'protein',
                 'sugar', 'ingested_calories', 'carbohydrate', 'body_fat', 'step_count',
                 'total_excercise_calories', 'burned_tef_calories', 
                 'burned_active_time','burned_rest_calories', 
                 'sleep_mental_recovery','sleep_physical_recovery',
                'sleep_movement_awakening', 'stress_score', 'heart_rate']

# Check the pairwise correlations:
plt.figure(figsize=(16,10))
sns.heatmap(data[interest_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Selected Feature Correlation')
plt.show()


##### Some insights:
- For this subset of features, most pairwise correlations are weak, so there is no obvious multicollinearity at first glance. 
- If the chosen model performs feature selection on its own, this may matter less; if features must be dropped manually, this kind of analysis helps decide which ones. 
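Reading a large annotated heatmap by eye is error-prone, so it can help to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below does this with pandas and NumPy only; the helper `correlated_pairs`, the 0.8 threshold, and the simulated values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Simulated features: calories are built from carbohydrate, steps are independent.
rng = np.random.default_rng(11)
carbs = rng.normal(250, 40, size=300)
toy = pd.DataFrame({
    "carbohydrate": carbs,
    "ingested_calories": 4 * carbs + rng.normal(0, 20, size=300),
    "step_count": rng.normal(8000, 2000, size=300),
})
pairs = correlated_pairs(toy, threshold=0.8)
print(pairs.round(3))
```

Any pair returned here is a candidate for dropping one of its two members before fitting a model that is sensitive to collinear inputs.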

----

#### Time Series Analysis
##### The real power of time series EDA is that you are not just looking at static relationships (as in classic tabular EDA) but at temporal effects, which can reveal patterns you’d completely miss otherwise. For example, what we eat today could affect how we sleep tonight. In the same way, today's stress may affect how we feel when we wake up tomorrow.

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Autocorrelation Plot (ACF)
fig, ax = plt.subplots(figsize=(10, 4))
plot_acf(data['stress_score'].dropna(), lags=30, ax=ax, alpha=0.05)
ax.set_title('Autocorrelation of Stress Score')
ax.set_xlabel('Lag (days)')
ax.set_ylabel('Correlation')
plt.grid(True)
plt.tight_layout()
plt.show()

# Partial Autocorrelation Plot (PACF)
fig, ax = plt.subplots(figsize=(10, 4))
plot_pacf(data['stress_score'].dropna(), lags=30, ax=ax, alpha=0.05, method='ywm')  # 'ywm' is stable
ax.set_title('Partial Autocorrelation of Stress Score')
ax.set_xlabel('Lag (days)')
ax.set_ylabel('Partial Correlation')
plt.grid(True)
plt.tight_layout()
plt.show()
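The ACF plot can be complemented with a table of the lags that fall outside an approximate 95% band of ±1.96/√N. The sketch below uses `Series.autocorr`, which is a plain Pearson correlation between the series and its shifted copy, so the values can differ slightly from the `statsmodels` estimate, and the fixed band is a simplification of the interval that `plot_acf` draws. The helper name and the simulated weekly series are assumptions.

```python
import numpy as np
import pandas as pd

def significant_lags(series: pd.Series, max_lag: int = 30):
    """Return autocorrelations outside the approximate 95% white-noise band."""
    clean = series.dropna()
    bound = 1.96 / np.sqrt(len(clean))
    acf = pd.Series({lag: clean.autocorr(lag=lag) for lag in range(1, max_lag + 1)})
    return acf[acf.abs() > bound], bound

# Simulated daily series: a fixed weekly pattern plus Gaussian noise.
rng = np.random.default_rng(1)
weekly = np.tile([3.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0], 20)
toy = pd.Series(weekly + rng.normal(scale=1.0, size=weekly.size))

sig, bound = significant_lags(toy, max_lag=21)
print(f"band: +/-{bound:.3f}")
print(sig.round(2))
```

Note that dropping NaNs from a daily series closes the gaps, so a "lag" then counts observations rather than calendar days. With many missing days the lag axis should be read with that in mind.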


In [None]:
from statsmodels.tsa.stattools import adfuller

result = adfuller(data['stress_score'].dropna())
result
# p-value: 0.022 => p < 0.05 => reject the unit-root null hypothesis, so the series is treated as stationary


In [None]:
# We will 'shift' the values one day:
data['stress_score_lagged'] = data['stress_score'].shift(1)

# Note: .shift() returns a new Series. dropna(inplace=True) returns None, so its result should never be assigned back.

In [None]:
# Likewise for 'sleep mental recovery':
data['sleep_mental_recovery_lagged'] = data['sleep_mental_recovery'].shift(1)

# For the ingested calories:
data['ingested_calories_lagged'] = data['ingested_calories'].shift(1)

#And for the burned calories too:
data['burned_active_time_lagged'] = data['burned_active_time'].shift(1)
data['burned_rest_calories_lagged'] = data['burned_rest_calories'].shift(1)
data['burned_tef_calories_lagged'] = data['burned_tef_calories'].shift(1)
data['total_steps_burned_calories_lagged'] = data['total_steps_burned_calories'].shift(1)

# Finally, drop the NaN rows introduced by the shifted features:
data.dropna(subset=['stress_score_lagged', 
                    'sleep_mental_recovery_lagged',
                    'ingested_calories_lagged',
                    'burned_active_time_lagged',
                    'burned_rest_calories_lagged',
                    'burned_tef_calories_lagged',
                    'total_steps_burned_calories_lagged'], inplace=True)
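The repeated `.shift(1)` assignments can be folded into a small helper that builds lag columns for any list of features and lags. This is a sketch, not the notebook's method: `add_lags` is a hypothetical name, the `_lag1` suffix differs from the `_lagged` suffix used above, and the frame is made up. It also assumes the rows are sorted by date with one row per day, since `shift` moves by rows, not by calendar days.

```python
import pandas as pd

def add_lags(df: pd.DataFrame, columns, lags=(1,), suffix="_lag") -> pd.DataFrame:
    """Return a copy of df with shifted copies of the given columns."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}{suffix}{lag}"] = out[col].shift(lag)
    return out

# Made-up daily values, already sorted by date.
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "stress_score": [55.0, 60.0, 58.0, 62.0, 57.0],
    "ingested_calories": [2100.0, 2300.0, 1900.0, 2500.0, 2200.0],
})
lagged = add_lags(toy, ["stress_score", "ingested_calories"], lags=(1, 2))

# Drop the rows whose lagged values are undefined.
lag_cols = [c for c in lagged.columns if "_lag" in c]
lagged_clean = lagged.dropna(subset=lag_cols)
print(lagged_clean)
```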

In [None]:
fig, axes = plt.subplots(2,2, figsize=(12,12))

# How does yesterday's stress relate to today's sleep mental recovery?
axes[0,0].scatter(x='sleep_mental_recovery', y='stress_score_lagged', data=data)
axes[0,0].set_title('Sleep Mental Recovery vs Stress Score Lagged')
axes[0,0].set_xlabel('Sleep Mental Recovery')
axes[0,0].set_ylabel('Stress Score Lagged')

# How does yesterday's sleep mental recovery relate to today's stress?
axes[0,1].scatter(x='sleep_mental_recovery_lagged', y='stress_score', data=data)
axes[0,1].set_title('Sleep Mental Recovery Lagged vs Stress Score')
axes[0,1].set_xlabel('Sleep Mental Recovery Lagged')
axes[0,1].set_ylabel('Stress Score')

# How does yesterday's stress relate to the calories ingested today?
axes[1,0].scatter(x='ingested_calories', y='stress_score_lagged', data=data)
axes[1,0].set_title('Ingested Calories vs Stress Score Lagged')
axes[1,0].set_xlabel('Ingested Calories')
axes[1,0].set_ylabel('Stress Score Lagged')

# How do the calories ingested yesterday relate to today's stress?
axes[1,1].scatter(x='ingested_calories_lagged', y='stress_score', data=data)
axes[1,1].set_title('Ingested Calories Lagged vs Stress Score')
axes[1,1].set_xlabel('Ingested Calories Lagged')
axes[1,1].set_ylabel('Stress Score')

plt.tight_layout()
plt.show()
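The lagged scatter plots look at a single one-day lag. A compact way to scan several lags at once is to correlate the target with each shifted copy of a feature. The sketch below does this on simulated data in which the target depends on the feature two days earlier; `lagged_correlations` is a hypothetical helper and the relationship is built in purely for demonstration.

```python
import numpy as np
import pandas as pd

def lagged_correlations(df: pd.DataFrame, feature: str, target: str,
                        max_lag: int = 7) -> pd.Series:
    """Pearson correlation between target[t] and feature[t - lag] for each lag."""
    return pd.Series(
        {lag: df[target].corr(df[feature].shift(lag)) for lag in range(max_lag + 1)},
        name=f"corr({target}, {feature} lagged)",
    )

# Simulated data: the target follows the feature with a two-day delay.
rng = np.random.default_rng(7)
n = 200
feature = pd.Series(rng.normal(size=n))
toy = pd.DataFrame({
    "ingested_calories": feature,
    "stress_score": feature.shift(2) + 0.1 * rng.normal(size=n),
})

lag_corr = lagged_correlations(toy, "ingested_calories", "stress_score")
best_lag = lag_corr.abs().idxmax()
print(lag_corr.round(3))
print("strongest lag:", best_lag)
```

On real data, a flat profile across all lags would be consistent with the conclusion that lagging the feature adds no information about the target.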

## 📌 Insights Summary

- **Target Variable — Stress Score:**  
  - The series shows clear **trend**, **seasonality**, and **cyclic behavior**.  
  - Residuals appear to have **zero mean** and **constant variance**.  
  - The distribution seems approximately **normal**.  
  - No clear differences were found between **weekdays and weekends**.

- **Interactions with Other Features:**  
  - **Sleep metrics** (e.g., mental recovery) and **food intake** do **not** show strong direct relationships with the stress score.  
  - Initial scatterplots suggest only weak or no visible patterns.

- **Lagged Features:**  
  - Introducing 1-day lagged versions of some features **did not reveal new patterns** or stronger relationships with the target variable.

- **Multicollinearity:**  
  - Most selected features show **low pairwise correlations**, suggesting no immediate multicollinearity concerns.

- **Time Structure Validated:**  
  - **ACF/PACF** plots and **seasonal decomposition** point to a **weekly seasonality**, supporting the use of time-aware modeling strategies.

---

➡️ These findings informed the decision to proceed with **time series–oriented models** (like XGBoost, RNNs, or LSTMs) using a **reduced feature set** for better generalization and interpretability.
