# Dietary Analysis and Quality of Diets

In this notebook we explore the relationship between life quality, economic factors, and diet quality across countries.

## Research Questions

- $RQ_1$ How does diet impact life expectancy?
- $RQ_2$ How do economic factors impact the quality of diets in different countries?
- $RQ_3$ What are the main dietary patterns across regions and economic levels?
    
### Motivations

- $M_1$ Understand if higher economic prosperity and superior living conditions lead to better diet standards.
- $M_2$ Inform policy and health strategies by linking socio-economic development with nutrition.

### Hypotheses

- $H_1$ Diet has a measurable impact on life expectancy.
- $H_{2\_1}$ Countries with lower GDP and life expectancy will exhibit diets with higher amounts of cereals and vegetables.
- $H_{2\_2}$ Countries with higher GDP and life expectancy will exhibit diets with higher amounts of meats and sugars.
- $H_3$ There are regional dietary patterns that are consistent across countries with similar economic levels.

## Import Required Libraries
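Import the libraries used throughout the notebook: polars for data handling, scikit-learn for the regression analysis, and matplotlib, seaborn, plotly, and ipywidgets for static and interactive visualization.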

In [None]:
import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import ipywidgets as widgets
from ipywidgets import interact
from sklearn.linear_model import Ridge  # used by the regression pipeline below
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Set the style for seaborn
sns.set_theme(style="whitegrid")

## Data Overview and Preprocessing

Load the cleaned dataset (which includes socio-economic and dietary data) and review its overall structure. In previous notebooks, similar steps were taken to process the data.

We are going to normalize some data and stratify the countries by GDP per capita.

In [None]:
# Load the processed dataset
data: pl.DataFrame = pl.read_csv("../data/cleaned/data.csv", separator=",")
data = data.with_columns((pl.col("GDP") / pl.col("Population")).alias("GDP per capita"))
data.head()

In [None]:
food_categories = [
    "Daily calorie supply per person from other commodities",
    "Daily calorie supply per person from sugar",
    "Daily calorie supply per person from oils and fats",
    "Daily calorie supply per person from meat",
    "Daily calorie supply per person from fruits and vegetables",
    "Daily calorie supply per person from starchy roots",
    "Daily calorie supply per person from pulses",
    "Daily calorie supply per person from cereals and grains",
    "Daily calorie supply per person from alcoholic beverages",
]

---

# $RQ_1$ How does diet impact life expectancy?

## Plan

1. **Stratify the Countries by GDP per Capita Based on Quantiles**:
   - We will divide the countries into different groups based on their GDP per capita using quantiles. This will help us isolate the effect of economic status on life expectancy and food quality.

2. **Analyze the Impact of Food Categories on Life Expectancy Using Regression Techniques**:
   - For each group of countries, we will analyze the impact of different food categories on life expectancy. We will use regression techniques to determine the relationship between diet and life expectancy.

In [None]:
# Visualize the distribution of GDP per capita
plt.figure(figsize=(10, 6))
sns.histplot(data["GDP per capita"], bins=30, kde=True)
plt.title("Distribution of GDP per Capita")
plt.xlabel("GDP per Capita")
plt.ylabel("Frequency")
plt.show()

# Visualize the distribution of life expectancy
plt.figure(figsize=(10, 6))
sns.histplot(data["Life expectancy"], bins=30, kde=True)
plt.title("Distribution of Life expectancy")
plt.xlabel("Life expectancy")
plt.ylabel("Frequency")
plt.show()

From these visualizations, we can see that GDP per capita is not normally distributed. We will therefore use quantiles, rather than fixed-width intervals, to stratify the countries into groups of similar size.
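As a minimal illustration of why quantile cut points suit a skewed variable, the sketch below bins synthetic values with the same strict `<` chain used in the next cell. The values are hypothetical log-normal draws, not the notebook's dataset, and `statistics.quantiles` may interpolate differently from polars, so the exact cut points are only illustrative.

```python
import random
import statistics

random.seed(0)

# Hypothetical right-skewed "GDP per capita" values (log-normal), not real data.
gdp_per_capita = [random.lognormvariate(9, 1) for _ in range(200)]

# With n=4, statistics.quantiles returns the three quartile cut points.
q_low, q_mid, q_high = statistics.quantiles(gdp_per_capita, n=4)


def gdp_group(value: float) -> str:
    """Assign a group label using strict '<' against each cut point."""
    if value < q_low:
        return "Very Low"
    if value < q_mid:
        return "Low"
    if value < q_high:
        return "Mid"
    return "High"


# Count how many synthetic countries fall in each group.
group_sizes: dict[str, int] = {}
for value in gdp_per_capita:
    label = gdp_group(value)
    group_sizes[label] = group_sizes.get(label, 0) + 1

mean_gdp = statistics.mean(gdp_per_capita)
median_gdp = statistics.median(gdp_per_capita)

print(group_sizes)
print(f"mean = {mean_gdp:.0f}, median = {median_gdp:.0f}")
```

Each group ends up with the same number of observations even though the mean sits well above the median, which fixed-width bins on a skewed variable would not give.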

In [None]:
num_quantiles = 4

quantile_low = data.select(pl.quantile("GDP per capita", 0.25)).item()
quantile_mid = data.select(pl.quantile("GDP per capita", 0.5)).item()
quantile_high = data.select(pl.quantile("GDP per capita", 0.75)).item()

print(f"Low quantile: {quantile_low}")
print(f"Mid quantile: {quantile_mid}")
print(f"High quantile: {quantile_high}")

data = data.with_columns(
    pl.when(pl.col("GDP per capita") < quantile_low)
    .then(pl.lit("Very Low"))
    .when(pl.col("GDP per capita") < quantile_mid)
    .then(pl.lit("Low"))
    .when(pl.col("GDP per capita") < quantile_high)
    .then(pl.lit("Mid"))
    .otherwise(pl.lit("High"))
    .alias("GDP per capita quantile")
)

data.head()

In [None]:
def update_choropleth(col):
    fig = px.choropleth(
        data.to_pandas(),
        locations="Country",
        locationmode="country names",
        color=col,
        hover_name="Country",
        color_continuous_scale="Viridis",
        projection="natural earth",
        title=f"{col} per country",
    )
    fig.update_geos(showcoastlines=True, coastlinecolor="Black")
    fig.update_layout(margin={"r": 0, "t": 50, "l": 0, "b": 0})
    fig.show()


# Define the columns you want to allow for selection.
columns_options = data.columns

interact(
    update_choropleth,
    col=widgets.Dropdown(options=columns_options, description="Select Column:"),
)

### Check relationship between GDP per capita, life expectancy, and diet quality

In [None]:
correlation_df = data.select(
    [
        "GDP per capita",
        "Life expectancy",
        "Daily total caloric ingestion",
        *food_categories,
    ]
)

correlation_matrix = correlation_df.to_pandas().corr()

plt.figure(figsize=(30, 25))

sns.heatmap(
    data=correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5
)

# Set the title
plt.title(label="Correlation Matrix of the Dataset")

# Show the plot
plt.show()

The data starts to suggest that diet may not be directly related to life expectancy, as we had expected, but that the relationship is driven by the economic status of the country. We are going to analyze this further in two steps:
- First, we compare the correlations within each of the stratified GDP groups.
- Second, since simple correlations might miss some intricacies in how these variables relate, we use regression analysis to check whether diet can predict both life expectancy and GDP per capita. The approach is explained further in the notebook.
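The first step rests on the idea that a shared driver can create a strong pooled correlation that disappears within strata. The sketch below shows this on synthetic data in which, by construction, calories and life expectancy both depend only on a hypothetical wealth level. The numbers and the `pearson` helper are assumptions for illustration and say nothing about the real dataset.

```python
import random

random.seed(1)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


groups = ["Very Low", "Low", "Mid", "High"]
rows = []
for level, name in enumerate(groups):
    for _ in range(1000):
        # Both variables depend on the wealth level, never on each other.
        calories = 2000 + 300 * level + random.gauss(0, 100)
        life_expectancy = 60 + 5 * level + random.gauss(0, 2)
        rows.append((name, calories, life_expectancy))

# Correlation over all synthetic countries at once.
pooled_r = pearson([r[1] for r in rows], [r[2] for r in rows])

# Correlation inside each wealth group.
within_r = {}
for name in groups:
    subset = [r for r in rows if r[0] == name]
    within_r[name] = pearson([r[1] for r in subset], [r[2] for r in subset])

print(f"pooled: {pooled_r:.2f}")
for name, r in within_r.items():
    print(f"{name}: {r:.2f}")
```

If the real data behaves similarly, a drop in correlation after stratifying points to GDP as the common driver. It does not by itself rule out a direct effect of diet.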

In [None]:
very_low_gdp = data.filter(pl.col("GDP per capita quantile") == "Very Low")
low_gdp = data.filter(pl.col("GDP per capita quantile") == "Low")
mid_gdp = data.filter(pl.col("GDP per capita quantile") == "Mid")
high_gdp = data.filter(pl.col("GDP per capita quantile") == "High")

# Need 4 subplots
fig, axs = plt.subplots(2, 2, figsize=(30, 25))

for axes, quantile in zip(
    [[0, 0], [0, 1], [1, 0], [1, 1]], [very_low_gdp, low_gdp, mid_gdp, high_gdp]
):
    correlation_df = quantile.select(
        [
            "Life expectancy",
            "GDP per capita",
            "Daily total caloric ingestion",
            *food_categories,
        ]
    )

    correlation_matrix = correlation_df.to_pandas().corr(method="spearman")

    sns.heatmap(
        data=correlation_matrix,
        annot=True,
        fmt=".2f",
        cmap="coolwarm",
        linewidths=0.5,
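        ax=axs[tuple(axes)],
    )
    # Title this subplot with its GDP group; plt.title only targets the current axes.
    axs[tuple(axes)].set_title(
        f"{quantile.select('GDP per capita quantile').unique().item()}"
    )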


plt.show()

Within groups of similar GDP per capita, diet no longer shows a strong association with life expectancy: the correlation of 0.74 between calorie intake and life expectancy seen in the pooled data is much lower within each group. We can check visually whether this reading is right by plotting the data.
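One caveat when comparing the two sets of heatmaps: the pooled matrix is computed with the default of `.corr()` (Pearson), while the stratified matrices pass `method="spearman"`. The sketch below, on a synthetic monotone but non-linear relationship, shows that the two coefficients can differ on identical data. The helpers `pearson`, `ranks`, and `spearman` are illustrative and assume no tied values.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values: list[float]) -> list[int]:
    """Ranks from 1 to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# A strictly increasing but strongly non-linear relationship.
x = [i * 0.1 for i in range(51)]
y = [math.exp(v) for v in x]

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)

print(f"Pearson:  {pearson_r:.2f}")
print(f"Spearman: {spearman_r:.2f}")
```

Spearman reaches 1 for any strictly monotone relationship, while Pearson stays below 1 here. For a like-for-like comparison, the pooled and stratified matrices would need to use the same method.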

First we need to check whether we have enough variation in the data within same groups.

In [None]:
def plot_violin(food_category):
    plt.figure(figsize=(12, 8))
    sns.violinplot(x="GDP per capita quantile", y=food_category, data=data.to_pandas())
    plt.title(f"Violin Plot of {food_category} by GDP per Capita Quantile")
    plt.xlabel("GDP per Capita Quantile")
    plt.ylabel(food_category)
    plt.show()


# Create an interactive widget to select the food category
interact(
    plot_violin,
    food_category=widgets.Dropdown(
        options=[*food_categories, "Life expectancy"], description="Food Category:"
    ),
)

All groups do seem to show enough variation, so plain correlation may not be the best way to analyze this data, as there might be some intricacies in how these variables relate. We will use regression analysis to see whether diet can predict life expectancy within each GDP group, which would indicate a meaningful impact.

In [None]:
import warnings

param_grid = {
    "polynomialfeatures__degree": [2, 3],
    "ridge__alpha": [0.1, 1, 10],
    "ridge__solver": ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"],
}
poly_regression_results = {}

for quantile in ["Very Low", "Low", "Mid", "High"]:
    warnings.filterwarnings("ignore")

    quantile_data = data.filter(pl.col("GDP per capita quantile") == pl.lit(quantile))
    X = quantile_data[food_categories].to_pandas()
    y = quantile_data["Life expectancy"].to_pandas()

    pipeline = make_pipeline(PolynomialFeatures(), Ridge())

    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    poly_regression_results[f"Quantile {quantile}"] = {
        "MSE": mse,
        "R2": r2,
        "Best Degree": grid_search.best_params_["polynomialfeatures__degree"],
        "y_test": y_test,
        "y_pred": y_pred,
    }

for quantile, metrics in poly_regression_results.items():
    print(
        f"\nResults for {quantile}: MSE = {metrics['MSE']:.2f}, R2 = {metrics['R2']:.2f}, Best Degree = {metrics['Best Degree']}"
    )

plt.figure(figsize=(12, 8))

colors = {"Very Low": "blue", "Low": "green", "Mid": "orange", "High": "red"}

for quantile in ["Very Low", "Low", "Mid", "High"]:
    y_test = poly_regression_results[f"Quantile {quantile}"]["y_test"]
    y_pred = poly_regression_results[f"Quantile {quantile}"]["y_pred"]
    plt.scatter(
        y_test,
        y_pred,
        color=colors[quantile],
        label=f"{quantile} GDP per Capita Quantile",
    )

plt.title("Polynomial Regression for GDP per Capita Quantiles")
plt.xlabel("Actual Life Expectancy")
plt.ylabel("Predicted Life Expectancy")
plt.xlim(40, 100)
plt.ylim(40, 100)
plt.legend()
plt.show()

## Conclusion

We could not obtain a useful model for any of the groups, even with polynomial ridge regression and a grid search over hyperparameters. Negative R² values mean that a constant prediction equal to the mean life expectancy fits the test data better than the model. It therefore seems that the overall composition of the diet does not have a significant impact on the life expectancy of a population. This is not to say that diet has no impact on life expectancy, but rather that society-wide or cultural diets, such as Mediterranean or Asian, do not seem to correlate directly with life expectancy. So what can we extract from this?
- We are not able to prove or disprove that there are superior diets that lead to longer life expectancy based solely on per country data.
- There is a significant correlation between GDP per capita and life expectancy. This is a by-product of the analysis rather than the main focus of this notebook.
- Calorie intake increases overall with GDP per capita, which is expected as richer countries tend to have more access to food. Further details on this relationship are explored in the next $RQ$.

Unfortunately, as we could not find any particular relationship between diet and life expectancy, we are not able to provide further insights on which balance of food is the best for a longer life expectancy.
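To make the reading of negative R² values concrete, the sketch below computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand on a tiny hypothetical test set. The life expectancy values and predictions are invented for illustration, and the `r2` helper is a plain reimplementation of the standard formula, not the notebook's code.

```python
def r2(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical life expectancies for five test countries.
y_true = [70.0, 72.0, 74.0, 76.0, 78.0]

# Baseline: always predict the mean of the test values.
mean_prediction = [74.0] * len(y_true)

# A model whose predictions trend in the wrong direction.
poor_prediction = [78.0, 76.0, 74.0, 72.0, 70.0]

print(r2(y_true, y_true))           # perfect predictions
print(r2(y_true, mean_prediction))  # constant mean baseline
print(r2(y_true, poor_prediction))  # worse than the baseline
```

The constant baseline scores exactly 0, so any model scoring below 0 on held-out data predicts worse than ignoring the diet features altogether.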

---

# $RQ_2$ How do economic factors impact the quality of diets in different countries?

---

# $RQ_3$ What are the main dietary patterns across regions and economic levels?