## Phase I Project Proposal
### How Has Nutritional Value in Processed Food Changed Over Time?

#### Name: Steven Yu, DS 3000

### Introduction

How have the nutritional qualities of packaged foods shifted over the last two decades? OpenFoodFacts, a global open database of food products, provides a dataset for studying this question. It contains information on ingredients, macronutrients, additives, labels, and more for millions of products across different countries. By examining trends over time, we can investigate whether sugar, salt, and fat levels are increasing or decreasing, how the use of additives has changed, and whether nutrition trends differ between countries or regions (e.g., Europe vs. the United States). This analysis can shed light on how the food industry is adapting to consumer preferences.

### Data Collection

To reduce complexity, I will focus on breakfast cereals in the United States and the United Kingdom, collecting the data from the OpenFoodFacts API. For each product, I will collect the year of entry (created/last modified), nutritional values (sugar, salt, fat, protein, etc.), and labels.

The code below queries the OpenFoodFacts API for breakfast-cereal products in the US and the UK (up to `MAX_PAGES` pages of `PAGE_SIZE` products per country), adds each product and its relevant fields to a pandas DataFrame, and saves the result to a CSV file.

In [None]:
import requests
import pandas as pd
import time

COUNTRY_SLUGS = ["united-states","united-kingdom"]
CATEGORY_SLUG = "breakfast-cereals"
PAGE_SIZE = 250
MAX_PAGES = 10
SLEEP_BETWEEN_CALLS = 0.2

BASE_URL = "https://world.openfoodfacts.org/api/v2/search"

FIELDS = [
    "code","product_name","brands","countries_tags","categories_tags",
    "labels_tags","additives_tags","nutriscore_grade","nova_group",
    "created_t","last_modified_t","categories","countries","nutriments"
]

HEADERS = {"User-Agent": "DS3000-OpenFoodFacts-Project (rough-draft)"}


def fetch_page(page, country_slug):
    params = {
        "categories_tags_en": CATEGORY_SLUG,
        "countries_tags_en": country_slug,
        "page": page,
        "page_size": PAGE_SIZE,
        "fields": ",".join(FIELDS),
        "sort_by": "last_modified_t"
    }
    r = requests.get(BASE_URL, params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()

def extract_float(nutriments, key):
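    '''
    Helper function to read one value from the nutriments dictionary and convert it to a float.

    Args:
        nutriments (dict): The "nutriments" dictionary of a single product
        key (str): Name of the nutriment to read, e.g. "sugars_100g"

    Returns:
        float or None: The value as a float, or None if nutriments is not a
            dictionary, the key is missing, or the value cannot be converted
    '''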
    if not isinstance(nutriments, dict):
        return None
    val = nutriments.get(key)
    try:
        return float(val)
    except (TypeError, ValueError):
        return None

def collect_products(max_pages=MAX_PAGES, country_slugs=COUNTRY_SLUGS):
    rows = []
    for country_slug in country_slugs:
        total = None
        for p in range(1, max_pages + 1):
            data = fetch_page(p, country_slug)
            if total is None:
                total = data.get("count")
                print(f"Total matching products for {country_slug} (reported by API): {total}")
            for prod in data.get("products", []) or []:
                nutr = prod.get("nutriments", {}) or {}
                rows.append({
                    "code": prod.get("code"),
                    "product_name": prod.get("product_name"),
                    "categories": prod.get("categories"),
                    "nutriscore_grade": prod.get("nutriscore_grade"),
                    "nova_group": prod.get("nova_group"),
                    "created_t": prod.get("created_t"),
                    "last_modified_t": prod.get("last_modified_t"),
                    "country": country_slug,
                    # Nutrition per 100g from "nutriments"
                    "energy_100g": extract_float(nutr, "energy_100g"),
                    "sugars_100g": extract_float(nutr, "sugars_100g"),
                    "fat_100g": extract_float(nutr, "fat_100g"),
                    "saturated_fat_100g": extract_float(nutr, "saturated-fat_100g"),
                    "salt_100g": extract_float(nutr, "salt_100g"),
                })
            time.sleep(SLEEP_BETWEEN_CALLS)
    return pd.DataFrame(rows)

food_df = collect_products()
food_df.to_csv("food_data.csv", index=False)
food_df.head()

Total matching products for united-states (reported by API): 3234
Total matching products for united-kingdom (reported by API): 1686


Unnamed: 0,code,product_name,categories,nutriscore_grade,nova_group,created_t,last_modified_t,country,energy_100g,sugars_100g,fat_100g,saturated_fat_100g,salt_100g
0,30000010204,OLD FASHIONED,"Plant-based foods and beverages, Plant-based f...",a,1.0,1499190028,1759671315,united-states,,,,,
1,810589031735,Classic Cinnamon Superfood Instant Oatmeal,"Plant-based foods and beverages, Plant-based f...",unknown,3.0,1657146785,1759667695,united-states,1560.0,37.9,8.14,,1.69
2,7501058614940,Fitness fruits,"Plant-based foods and beverages, Plant-based f...",c,,1661198073,1759633702,united-states,1414.0,24.0,3.2,1.8,0.775
3,30000561959,Instant Oatmeal - Flavor Variety,"Plant-based foods and beverages,Plant-based fo...",d,4.0,1561570726,1759628143,united-states,1560.0,27.9,4.65,4.65,1.16
4,11110141187,CASHEW COCONUT GRANOLA,"Plant-based foods and beverages,Plant-based fo...",d,,1723466997,1759622260,united-states,1880.0,18.3,18.3,5.0,0.917


### Data Usage and Remaining Issues

To further clean the food data above, I will first convert energy_100g from kilojoules to kilocalories. I will handle missing nutriments by dropping rows that lack any key nutrient value (energy, sugars, fat, or salt). I will convert the Unix timestamps in "created_t" and "last_modified_t" to dates. Finally, the categories column can be removed, since the categories are mostly the same across all breakfast cereals.
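The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the final pipeline: `clean_food_data`, `KEY_NUTRIENTS`, and the derived columns `energy_kcal_100g` and `created_year` are hypothetical names, and the sketch assumes energy_100g is reported in kilojoules and that the timestamps are Unix seconds. The two sample rows reuse values from the preview above.

```python
import pandas as pd

KJ_PER_KCAL = 4.184  # 1 kcal = 4.184 kJ

# Assumption: these are the "key" nutrients a row must have to be kept.
KEY_NUTRIENTS = ["energy_100g", "sugars_100g", "fat_100g", "salt_100g"]


def clean_food_data(df):
    """Hypothetical helper: apply the cleaning steps to a raw product frame."""
    # Drop rows that lack any key nutrient value.
    out = df.dropna(subset=KEY_NUTRIENTS).copy()
    # Convert energy from kilojoules to kilocalories (assumes kJ input).
    out["energy_kcal_100g"] = out["energy_100g"] / KJ_PER_KCAL
    # Convert Unix timestamps (seconds) to timezone-aware datetimes.
    for col in ["created_t", "last_modified_t"]:
        out[col] = pd.to_datetime(out[col], unit="s", utc=True)
    # Year the product entry was created, for grouping by year later.
    out["created_year"] = out["created_t"].dt.year
    # The kJ column and the near-constant categories column are no longer needed.
    out = out.drop(columns=["energy_100g", "categories"], errors="ignore")
    return out.reset_index(drop=True)


# Two rows with values copied from the preview above (first row has no nutrients).
sample = pd.DataFrame({
    "code": ["30000010204", "810589031735"],
    "categories": ["Plant-based foods and beverages", "Plant-based foods and beverages"],
    "created_t": [1499190028, 1657146785],
    "last_modified_t": [1759671315, 1759667695],
    "energy_100g": [None, 1560.0],
    "sugars_100g": [None, 37.9],
    "fat_100g": [None, 8.14],
    "salt_100g": [None, 1.69],
})
cleaned = clean_food_data(sample)
print(cleaned)
```

Dropping incomplete rows before the conversions keeps the later steps simple, at the cost of discarding products that have only some nutrients recorded.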

I will use this data to address two key questions of interest:
- Are sugars_100g, fat_100g, and salt_100g in US and UK breakfast cereals trending upward or downward over the last decade?
- Is there a significant difference in nutritional value trends between US and UK breakfast cereals?

To answer these questions, I will make time-series trend plots by year, nutrient, and country, then implement two models: a linear regression in NumPy that models sugars_100g as a function of year and product characteristics, and a scikit-learn classifier that predicts the country (US or UK) from a product's nutritional values.
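As a rough illustration of the NumPy regression, the sketch below fits sugars_100g on year, a country indicator, and their interaction using ordinary least squares. The function name `fit_sugar_trend` and the choice of features are assumptions made for this example (the country indicator stands in for "product characteristics"), and the data are synthetic and noise-free, not OpenFoodFacts results.

```python
import numpy as np


def fit_sugar_trend(years, sugars, is_uk):
    """Hypothetical helper: least-squares fit of
    sugars ~ b0 + b1*year_c + b2*is_uk + b3*(year_c * is_uk),
    where year_c is the year centered on its mean."""
    years = np.asarray(years, dtype=float)
    sugars = np.asarray(sugars, dtype=float)
    is_uk = np.asarray(is_uk, dtype=float)
    # Center the year so the intercept is the US mean at the average year.
    year_c = years - years.mean()
    # Design matrix: intercept, year, country indicator, year-by-country interaction.
    X = np.column_stack([np.ones_like(year_c), year_c, is_uk, year_c * is_uk])
    coef, *_ = np.linalg.lstsq(X, sugars, rcond=None)
    return coef


# Synthetic example: US sugar falls 0.5 g/year, UK starts 5 g lower and falls 1.0 g/year.
yrs = np.arange(2015, 2025)
mid = yrs.mean()
years = np.concatenate([yrs, yrs])
is_uk = np.concatenate([np.zeros(len(yrs)), np.ones(len(yrs))])
sugars = np.concatenate([30.0 - 0.5 * (yrs - mid), 25.0 - 1.0 * (yrs - mid)])

coef = fit_sugar_trend(years, sugars, is_uk)
print(dict(zip(["intercept", "year", "uk", "year_x_uk"], np.round(coef, 3))))
```

In this setup the `year` coefficient is the US trend and the `year_x_uk` coefficient is how much the UK trend differs from it, which maps onto the second question of interest.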