##### Problem Analysis Workshop 4 - 11th November 2025

Cemil Caglar Yapici - 9081058

### 1. Convert Factor Variables to Numeric:

First, we convert categorical text features (factors) into numeric codes. This is useful for binary or ordinal categories:

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

Note: We have many CSV datasets, but some of them (the ones from our previous workshops) do not include a date column, so I downloaded new ones to test here:

In [None]:
# Load the dataset
df = pd.read_csv('Datas/detailed_meals_macros_.csv')

# Print dataset dimensions and first few rows
print("Dataset shape:", df.shape)
df.head(3)


In [None]:
# Convert binary categorical 'Gender' to numeric (Male=1, Female=0)
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})

# Convert ordinal categorical 'Activity Level' to numeric codes
activity_map = {'Sedentary': 0, 'Lightly Active': 1, 'Moderately Active': 2, 'Very Active': 3}
df['Activity Level'] = df['Activity Level'].map(activity_map)

# Verify the conversion on the first few records
df[['Gender', 'Activity Level']].head(5)
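
One caveat with `.map()`: any label that is not a key of the dictionary silently becomes `NaN`. The sketch below shows one possible way to spot such labels and to normalize spelling before mapping. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one label with different capitalization, one missing value
toy = pd.DataFrame({"Activity Level": ["Sedentary", "Very Active", "very active", None]})

activity_map = {"Sedentary": 0, "Lightly Active": 1, "Moderately Active": 2, "Very Active": 3}

# Labels that are not keys of the dictionary become NaN
codes = toy["Activity Level"].map(activity_map)

# Values that were present in the data but had no entry in the mapping
unmapped = toy.loc[codes.isna() & toy["Activity Level"].notna(), "Activity Level"].unique()
print("Unmapped labels:", list(unmapped))

# Normalizing whitespace and capitalization first recovers the label
cleaned = toy["Activity Level"].str.strip().str.title().map(activity_map)
print(cleaned.tolist())
```

Running a check like this before overwriting the original column makes it easier to tell genuine missing values from labels the mapping did not anticipate.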


### 2. Convert Calendar Dates to Julian Format:

If the dataset contains calendar date fields, we should convert them to a numeric format (Julian date or ordinal) for modeling.
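
The terms "Julian date", "ordinal" and "day of year" are easy to confuse, so here is a small sketch that computes all three for the same dates. It assumes the standard offset of 1721424.5 days between Python's `date.toordinal()` and the astronomical Julian Date at midnight. The helper name `date_encodings` is hypothetical.

```python
import pandas as pd

# Offset between the proleptic Gregorian ordinal (date.toordinal())
# and the astronomical Julian Date at 00:00 of the same day.
ORDINAL_TO_JULIAN_OFFSET = 1721424.5

def date_encodings(dates: pd.Series) -> pd.DataFrame:
    """Return ordinal, Julian Date and day-of-year for a Series of date strings."""
    dt = pd.to_datetime(dates, errors="coerce")
    # Unparseable dates become NaT and are encoded as NaN
    ordinal = dt.map(lambda ts: ts.toordinal() if pd.notna(ts) else float("nan"))
    return pd.DataFrame({
        "ordinal": ordinal,                                   # days since 0001-01-01 (day 1)
        "julian_date": ordinal + ORDINAL_TO_JULIAN_OFFSET,    # astronomical Julian Date
        "dayofyear": dt.dt.dayofyear,                         # 1..366 within the year
    })

sample = pd.Series(["2000-01-01", "2025-11-11"])
enc = date_encodings(sample)
print(enc)

# Cross-check against pandas' own Julian Date conversion
print(pd.Timestamp("2000-01-01").to_julian_date())
```

For modeling, any of the three works as a numeric feature. The ordinal and the Julian Date differ only by a constant, while the day of year captures seasonality and resets every January.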


In [None]:
# =========================
# Workshop 4 — Calendar → Julian features (ccy)
# =========================

# 0) Setup
import re
import pandas as pd
import numpy as np

# 1) Helper: parse all date-like columns and add Julian features
def add_julian_features_ccy(df: pd.DataFrame, cols=None, tz=None, prefix=""):
    """
    - Tries to parse date-like columns to pandas datetime (UTC or given tz).
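    - Adds:
        <prefix><col>__datetime   (parsed pandas datetime64[ns, tz])
        <prefix><col>__ordinal    (proleptic Gregorian ordinal from date.toordinal(), day 1 = 0001-01-01;
                                   this is not the astronomical Julian Day Number)
        <prefix><col>__dayofyear  (day of year, 1..366, often loosely called the 'Julian day')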
    - Returns modified DataFrame and a list of processed columns.
    """
    df = df.copy()

    # Discover candidate columns if not provided
    if cols is None:
        candidates = []
        for c in df.columns:
            if re.search(r"(date|time|timestamp)", str(c), flags=re.I):
                candidates.append(c)
        cols = candidates

    processed = []
    for c in cols:
        if c not in df.columns:
            continue
        # Parse to datetime. infer_datetime_format is deprecated since pandas 2.0
        # (format inference is now the default), so it is omitted here.
        dt = pd.to_datetime(
            df[c],
            errors="coerce",
            utc=(tz is None),
        )
        # Localize/convert timezone if requested
        if tz is not None:
            # If naive, localize; if aware, convert
            if dt.dt.tz is None:
                dt = dt.dt.tz_localize(tz)
            else:
                dt = dt.dt.tz_convert(tz)

        # Save normalized datetime col
        base = f"{prefix}{c}"
        df[f"{base}__datetime"] = dt

        # Ordinal = proleptic Gregorian day count from date.toordinal() (not the
        # astronomical Julian Day Number). Drop the timezone first so the ordinal
        # uses the same wall-clock date as the day-of-year feature below.
        wall_clock = dt.dt.tz_localize(None) if dt.dt.tz is not None else dt
        df[f"{base}__ordinal"] = wall_clock.dt.date.map(
            lambda d: d.toordinal() if pd.notnull(d) else np.nan
        )

        # Day-of-year (Julian day in year)
        df[f"{base}__dayofyear"] = df[f"{base}__datetime"].dt.dayofyear

        processed.append(c)

    return df, processed

# 2) Load your new datasets
path_food = "Datas/daily_food_nutrition_dataset.csv"     # uploaded file
path_act  = "Datas/dailyActivity_merged1.csv"            # uploaded file

df_food_ccy = pd.read_csv(path_food)
df_act_ccy  = pd.read_csv(path_act)

print("Shapes -> food:", df_food_ccy.shape, "activity:", df_act_ccy.shape)

# 3) Apply the converter
#    (Let the helper auto-discover columns: e.g., 'Date', 'ActivityDate', 'Timestamp', etc.)
df_food_ccy, food_dates = add_julian_features_ccy(df_food_ccy, cols=None, tz=None, prefix="food_")
df_act_ccy, act_dates   = add_julian_features_ccy(df_act_ccy,  cols=None, tz=None, prefix="act_")

print("Detected date-like columns in food dataset:", food_dates)
print("Detected date-like columns in activity dataset:", act_dates)

# 4) Quick sanity check: show new columns created for each dataset
def preview_new_date_cols_ccy(df, prefix, n=5):
    new_cols = [c for c in df.columns if c.startswith(prefix) and ("__datetime" in c or "__ordinal" in c or "__dayofyear" in c)]
    return df[new_cols].head(n)

print("\nFOOD — new date features:")
display(preview_new_date_cols_ccy(df_food_ccy, "food_"))

print("\nACTIVITY — new date features:")
display(preview_new_date_cols_ccy(df_act_ccy, "act_"))

# 5) (Optional) If you want plain date (yyyy-mm-dd) for merging/EDA:
def add_plain_date_ccy(df, datetime_col, out_col="PlainDate"):
    df[out_col] = pd.to_datetime(df[datetime_col], errors="coerce").dt.date
    return df



### 3. Convert Categorical Variables to Dummy Variables:

For nominal categorical variables with multiple classes (more than two categories), we use dummy encoding (one-hot encoding). This creates new binary indicator columns for each category value:

We take Dietary Preference (e.g., Omnivore, Vegetarian, Vegan) and create dummy columns for each category.

In [None]:
# One-hot encode the 'Dietary Preference' nominal category
df = pd.get_dummies(df, columns=['Dietary Preference'])

# Inspect the new dummy columns alongside original categorical conversions
cols_to_show = ['Gender', 'Activity Level'] + [col for col in df.columns if col.startswith('Dietary Preference')]
df[cols_to_show].head(5)
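
As a small illustration of the options involved, the sketch below one-hot encodes a made-up column (the frame `toy` is hypothetical). Since pandas 2.0, `get_dummies` returns boolean columns by default, so `dtype=int` is passed to get 0/1 values. `drop_first=True` drops one reference category, which is mainly useful for linear models; whether to use it for clustering is a judgment call.

```python
import pandas as pd

# Hypothetical toy data with three nominal categories
toy = pd.DataFrame({"Dietary Preference": ["Omnivore", "Vegan", "Vegetarian", "Vegan"]})

# Full one-hot encoding: one 0/1 column per category
full = pd.get_dummies(toy, columns=["Dietary Preference"], dtype=int)
print(full)

# Reference encoding: the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["Dietary Preference"], dtype=int, drop_first=True)
print(reduced)
```

In the full encoding every row has exactly one indicator set to 1, so the columns always sum to 1. That redundancy is what `drop_first=True` removes.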


### 4. Apply Box-Cox Transformation:

The Box-Cox transformation is a power transformation that can help normalize a skewed distribution of a numeric variable. It finds an optimal exponent (λ) to transform the data closer to a normal distribution.

In [None]:
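# Apply Box-Cox transformation to the 'Weight' column (values must be strictly positive)
weight_data = df['Weight'].dropna()  # Box-Cox cannot handle missing values
transformed_weight, best_lambda = stats.boxcox(weight_data)

# Save the transformed values into a new column, aligned on the original index,
# so rows with a missing Weight stay NaN instead of causing a length mismatch
df.loc[weight_data.index, 'Weight_BoxCox'] = transformed_weight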

print(f"Optimal λ for Box-Cox on Weight: {best_lambda:.4f}")
df[['Weight', 'Weight_BoxCox']].head(5)
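
To make the role of λ concrete, here is a minimal sketch of the Box-Cox formula itself, checked against SciPy. The helper `boxcox_manual` and the sample values are illustrative assumptions, not part of the workshop data.

```python
import numpy as np
from scipy import stats

def boxcox_manual(x, lam):
    """Box-Cox for strictly positive x: (x**lam - 1) / lam, or log(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

# Made-up, right-skewed sample (not from the dataset)
sample = np.array([48.0, 55.0, 61.0, 70.0, 84.0, 120.0])

# With a fixed lambda, SciPy returns only the transformed array
matches = np.allclose(boxcox_manual(sample, 0.5), stats.boxcox(sample, lmbda=0.5))
print("Manual formula matches SciPy:", matches)

# Without lmbda, SciPy also estimates lambda by maximum likelihood
transformed, fitted_lambda = stats.boxcox(sample)
print(f"Fitted lambda: {fitted_lambda:.4f}")
```

A λ of 1 leaves the shape of the data unchanged (it only shifts it), λ = 0.5 behaves like a square root, and λ = 0 is the natural logarithm.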


### 5. Apply Tukey’s Ladder of Powers Transformation:

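Tukey’s Ladder of Powers is another approach to stabilizing variance and normalizing data: it raises the variable to a power chosen from a small ladder of exponents (for example -2, -1, -1/2, 0 meaning log, 1/2, 1, 2). It is similar in spirit to Box-Cox, and like Box-Cox the classic ladder assumes positive values. Because the code below fills missing 'Snacks Calories' values with 0, it uses the Yeo-Johnson transformation instead, a related power transform that also handles zero and negative values.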

In [None]:
# Apply a Yeo-Johnson power transform (used here in place of Tukey's ladder) to 'Snacks Calories'
snacks_data = df['Snacks Calories'].fillna(0)  # fill missing values with 0 for this demo
transformed_snacks, lambda_snacks = stats.yeojohnson(snacks_data)

# Save transformed values
df['SnacksCalories_Tukey'] = transformed_snacks

print(f"Optimal λ for Tukey (Yeo-Johnson) on Snacks Calories: {lambda_snacks:.4f}")
df[['Snacks Calories', 'SnacksCalories_Tukey']].head(5)
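
For comparison, a classic Tukey ladder search can be sketched as follows. This is only one possible formulation: the ladder of exponents, the choice of minimum absolute skewness as the selection criterion, and the helper names are all assumptions, and the sample is simulated rather than taken from the dataset.

```python
import numpy as np
from scipy import stats

# A commonly used ladder of exponents; 0 stands for the log transform
LADDER = (-2, -1, -0.5, 0, 0.5, 1, 2)

def tukey_power(x, lam):
    """Apply one rung of the ladder to strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Tukey's ladder requires strictly positive values")
    if lam == 0:
        return np.log(x)
    # Negative exponents reverse the ordering, so negate to preserve it
    return x**lam if lam > 0 else -(x**lam)

def best_ladder_power(x, ladder=LADDER):
    """Pick the exponent whose transformed data has the smallest absolute skewness."""
    return min(ladder, key=lambda lam: abs(stats.skew(tukey_power(x, lam))))

# Simulated right-skewed sample
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=500)

best = best_ladder_power(skewed)
print("Chosen exponent:", best)
print("Skewness before:", stats.skew(skewed))
print("Skewness after: ", stats.skew(tukey_power(skewed, best)))
```

Unlike Box-Cox or Yeo-Johnson, which estimate a continuous λ, this search only considers a handful of interpretable exponents, which is why the result is often easier to explain.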


### Summary of Transformations


This notebook applied several preprocessing steps to the K-Means project dataset. We converted the text-based categorical features (Gender, Activity Level) to numeric codes and added dummy indicator columns for the nominal category (Dietary Preference). Using two newly downloaded datasets that include date columns, we converted calendar dates into numeric ordinal and day-of-year features. With Box-Cox and Tukey's Ladder of Powers (implemented here via Yeo-Johnson), we transformed selected numeric features (Weight, Snacks Calories) to reduce skewness and bring their distributions closer to normal. These steps make the data more amenable to the clustering and statistical procedures that follow.