<a href="https://colab.research.google.com/github/AlexanderNeuwirth/CS5265_Project1/blob/main/CS5265_Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 Definition
Created for CS5265 at Vanderbilt University

## Background

How much can be inferred about a product's nutritional attributes from what you can easily see on the product listing?

In other words, can you predict nutritional factors from brand, serving size, key words in the product title, and common ingredients?

This type of problem is of particular interest in a healthcare setting, where understanding a product in the context of its nutritional components is necessary to make food purchasing decisions that are safe for specific medical needs. I often use similar data in my day-to-day work on healthcare procurement systems. It's interesting and potentially useful to explore possible relationships between different products, ingredients, and brands, and their corresponding nutritional features.

Due to the wide availability of government datasets and recipe sites with easily scraped data, a fair amount of work has been done in this area.

Random forest has been successfully applied to [predict processed vs unprocessed food](https://www.medrxiv.org/content/10.1101/2021.05.22.21257615v2.full), primarily from ingredient lists.

In Norway, large amounts of missing data in government databases led at least one group to attempt nutrition extraction [purely from natural language product descriptions](https://static1.squarespace.com/static/606f36b890215d7048ddaac0/t/62ed22f1ad65d913278ca3cb/1659708147387/PREDICTING+A+FOOD+PRODUCT%E2%80%99S+MISSING+NUTRITIONAL+VALUES+USING+MACHINE+LEARNING.pdf). Their method suffers in cases with poor quality descriptions, and has no mechanism to weight by token importance, but outperforms simple imputation methods.

Even nutrition prediction from raw images has seen [some success](https://arxiv.org/pdf/2011.01082.pdf?from=article_link), with confirmation by at least one [subsequent review experiment](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8787663/) that prepared new samples with known nutrition details.

No published method appears to predict nutrient attributes using simple feature engineering on the product description and ingredient list. That is the gap I seek to investigate in this work.

## Project Description

This project will attempt binary classification of whether products meet a "low fat" dietary classification, using only higher-level features such as brand, serving size, key words in the product title, and common ingredients.

This project uses the [USDA Branded Food Products Dataset](https://data.nal.usda.gov/dataset/usda-branded-food-products-database). This dataset contains detailed ingredient listings, serving amounts, and nutrient quantities for about 250,000 branded and private-label food products sold in the United States. All entries are laboratory results with thorough human review and are held to high data quality standards by law, as the precise values inform many clinical decisions and much regulatory policy enforcement.

Some raw data fields include:
* **Product Title**: A text description
* **Serving Size**: Continuous-valued numerical usage quantity (normalized to either grams or milliliters)
* **Nutritionals**: Continuous-valued numerical nutrient quantities (by weight)
* **Ingredients**: A comma-separated list of ingredients (as strings)

Some easily derived processed data fields could include:
* **Total Calories**: by summing simple benchmark multipliers of carbs, proteins, and fats
* **Low Fat**: by thresholding fat total against the conventional threshold of 3 grams per serving
* **Binary "Contains Ingredient" Features**: by filtering on the presence of common ingredient strings in the string ingredient column
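The derived fields above could be computed along these lines. This is a minimal sketch rather than the project's actual pipeline: the field names (`carbs_g`, `protein_g`, `fat_g`, `ingredients`) are hypothetical, the 4/4/9 kcal-per-gram multipliers are an assumed stand-in for the "simple benchmark multipliers", and treating exactly 3 grams as low fat is also an assumption.

```python
from typing import Dict, List

# Assumed multipliers (kcal per gram): the common 4-4-9 rule of thumb.
KCAL_PER_GRAM = {"carbs_g": 4.0, "protein_g": 4.0, "fat_g": 9.0}

# Threshold from the text; including the boundary (<=) is an assumption.
LOW_FAT_MAX_G = 3.0


def total_calories(row: Dict) -> float:
    """Estimate calories by summing multiplier * grams for each macronutrient."""
    return sum(row[name] * kcal for name, kcal in KCAL_PER_GRAM.items())


def is_low_fat(row: Dict) -> bool:
    """Flag a product as low fat when fat per serving is at or below the threshold."""
    return row["fat_g"] <= LOW_FAT_MAX_G


def contains_flags(ingredients: str, terms: List[str]) -> Dict[str, int]:
    """Build binary features via case-insensitive substring matching.

    Note: substring matching is crude ("butter" also matches "peanut butter").
    """
    lowered = ingredients.lower()
    return {
        "contains_" + term.lower().replace(" ", "_"): int(term.lower() in lowered)
        for term in terms
    }


# Hypothetical product record (values are per serving).
product = {
    "carbs_g": 20.0,
    "protein_g": 5.0,
    "fat_g": 2.0,
    "ingredients": "WHOLE GRAIN OATS, SUGAR, CORN SYRUP, SALT",
}

print(total_calories(product))  # 20*4 + 5*4 + 2*9 = 118.0
print(is_low_fat(product))      # True
print(contains_flags(product["ingredients"], ["Butter", "Corn Syrup", "Sugar"]))
```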

If successful in this task and similar binary classification tasks of interest, it may be appropriate to attempt regression on more complex target variables (like calorie count, grams of fat, or grams of fiber).

Example Input Features:
* Serving size (continuous)
* Brand (categorical)
* Key terms extracted from description (e.g. "Healthy", "Diet", "Organic") (binary)
* Presence of specific ingredient terms, extracted from ingredient list (e.g. "Butter", "Corn Syrup", "Sugar") (binary)

Example Output Features:
* Low-Fat (binary classification)
* *(Aspirational)* Calorie count (integer regression)

## Performance Metric
For the initial goal of binary low-fat classification, an appropriate performance metric is simple percent accuracy.
If accuracy is low, precision and recall will help reveal any bias in the errors. If the classes are heavily imbalanced (e.g. there are relatively few low-fat products), a metric that reflects both precision and recall, such as F1 score, may be more appropriate.

Mathematical formulations of all of the above metrics are provided below, where T and F stand for "True" and "False", and P and N stand for "Positive" and "Negative."

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$
$$Precision = \frac{TP}{TP+FP}$$
$$Recall = \frac{TP}{TP+FN}$$
$$F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}$$
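As a rough illustration of these formulas, the sketch below computes all four metrics from label lists using plain Python. The helper name `classification_metrics` and the toy labels are made up for this example, and returning 0.0 when a denominator is zero is a convention chosen here, not something the formulas define.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against empty denominators (assumed convention: report 0.0).
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy imbalanced example: only 2 of 10 products are low fat (label 1),
# and the model finds just one of them.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.9 even though half of the low-fat products are missed (recall 0.5), which is the kind of gap that F1 (about 0.67 here) makes visible.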


For the stretch goal of calorie count regression, an appropriate performance metric may be mean absolute error (MAE), given below, where $x_i$ is the actual value, $y_i$ is the predicted value, and $D$ is the number of data points being evaluated.

$$MAE = \frac{1}{D}\sum_{i=1}^{D}|x_i-y_i|$$

This would be easily interpretable as the average number of calories by which the model's predictions differ from the actual values (as opposed to mean squared error, which is expressed in squared units and would be less readily interpretable).
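To make the interpretability point concrete, here is a small sketch comparing the two metrics on made-up calorie values. Both helpers are local illustrations written for this example (not library functions), and each averages over the number of data points.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference; same units as the target (calories)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average squared difference; units are calories squared."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)


# Hypothetical calorie counts for four products.
actual = [100, 250, 80, 400]
predicted = [110, 230, 80, 350]

mae = mean_absolute_error(actual, predicted)  # (10 + 20 + 0 + 50) / 4 = 20.0
mse = mean_squared_error(actual, predicted)   # (100 + 400 + 0 + 2500) / 4 = 750.0
print(mae, mse)
```

Here the MAE of 20.0 reads directly as "off by 20 calories on average", while the MSE of 750.0 is dominated by the single 50-calorie miss and has no direct calorie interpretation.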