<a href="https://colab.research.google.com/github/AlexanderNeuwirth/CS5265_Project1/blob/main/CS5265_Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 Definition
Created for CS5265 at Vanderbilt University

## Background

How much can be inferred about a product's nutritional attributes from what you can easily see on the product listing?

In other words, can you predict nutritional factors from brand, serving size, key words in the product title, and common ingredients?

This type of problem is of particular interest in a healthcare setting, where understanding a product in the context of its nutritional components is necessary to make food purchasing decisions that are safe for specific medical needs. I often use similar data in my day-to-day work on healthcare procurement systems. It's interesting and potentially useful to explore possible relationships between different products, ingredients, and brands, and their corresponding nutritional features.

Due to the wide availability of government datasets and recipe sites with easily scraped data, a fair amount of work has been done in this area.

Random forest has been successfully applied to [predict processed vs unprocessed food](https://www.medrxiv.org/content/10.1101/2021.05.22.21257615v2.full), primarily from ingredient lists.

In Norway, large amounts of missing data in government databases led at least one group to attempt nutrition extraction [purely from natural language product descriptions](https://static1.squarespace.com/static/606f36b890215d7048ddaac0/t/62ed22f1ad65d913278ca3cb/1659708147387/PREDICTING+A+FOOD+PRODUCT%E2%80%99S+MISSING+NUTRITIONAL+VALUES+USING+MACHINE+LEARNING.pdf). Their method suffers in cases with poor quality descriptions, and has no mechanism to weight by token importance, but outperforms simple imputation methods.

Even nutrition prediction from raw images has seen [some success](https://arxiv.org/pdf/2011.01082.pdf?from=article_link), with confirmation by at least one [subsequent review experiment](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8787663/) that prepared new samples with known nutrition details.

No published method seems to approach this problem using simple feature engineering on the description and ingredient list, seeking to predict nutrient attributes. That is the gap I seek to investigate in this work.

## Project Description

This project will attempt to perform binary classification on whether products meet a "low fat" dietary classification using only those higher-level features like brand, serving size, key words in the product title, and common ingredients.

This project uses the [USDA Branded Food Products Dataset](https://data.nal.usda.gov/dataset/usda-branded-food-products-database). This data contains detailed ingredient listings, serving amounts, and nutrient quantities for about 250,000 branded and private-label food products sold in the United States. All entries are laboratory results with thorough human review and are held to high data quality standards by law, as the precise values inform many clinical decisions and regulatory policy enforcement.

Some raw data fields include:
* **Product Title**: A text description
* **Serving Size**: Continuous-valued numerical usage quantity (normalized to either grams or milliliters)
* **Nutritionals**: Continuous-valued numerical nutrient quantities (by weight)
* **Ingredients**: A comma-separated list of ingredients (as strings)

Some easily derived processed data fields could include:
* **Total Calories**: by weighting carbs, proteins, and fats with simple benchmark calorie-per-gram multipliers and summing the results
* **Low Fat**: by comparing total fat against the conventional threshold of 3 grams per serving
* **Binary "Contains Ingredient" Features**: by filtering on the presence of common ingredient strings in the string ingredient column
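The derived fields above could be sketched roughly as follows. This is a minimal illustration on a toy table, not the project's actual code: the column names (`fat_g`, `carbs_g`, `protein_g`, `ingredients`) and all values are hypothetical, the 4/4/9 kcal-per-gram multipliers are the common general factors, and fat is assumed to already be expressed per serving.

```python
import pandas as pd

# Toy rows; column names and values are hypothetical, not taken from the USDA files.
toy = pd.DataFrame({
    "fat_g": [0.5, 12.0, 3.0],
    "carbs_g": [20.0, 5.0, 30.0],
    "protein_g": [2.0, 8.0, 4.0],
    "ingredients": [
        "WATER, SUGAR, SALT",
        "CREAM, BUTTER, SALT",
        "WHEAT FLOUR, CORN SYRUP",
    ],
})

# Total Calories: general 4/4/9 kcal-per-gram multipliers for carbs/protein/fat.
toy["calories_est"] = 4 * toy["carbs_g"] + 4 * toy["protein_g"] + 9 * toy["fat_g"]

# Low Fat: at most 3 g of fat per serving (assumes fat_g is already per serving;
# treating exactly 3 g as low fat is an assumption of this sketch).
toy["low_fat"] = toy["fat_g"] <= 3.0

# Binary "contains ingredient" flags via plain substring matching.
for term in ["BUTTER", "SUGAR", "CORN SYRUP"]:
    col = "has_" + term.lower().replace(" ", "_")
    toy[col] = toy["ingredients"].str.contains(term, case=False, regex=False)

print(toy[["calories_est", "low_fat", "has_butter", "has_sugar", "has_corn_syrup"]])
```

Plain substring matching is the simplest choice, but it would also flag, for example, "BROWN SUGAR" as containing "SUGAR"; whether that is desirable depends on the feature being built.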

If successful in this task and similar binary classification tasks of interest, it may be appropriate to attempt regression on more complex target variables (like calorie count, grams of fat, or grams of fiber).

Example Input Features:
* Serving size (continuous)
* Brand (categorical)
* Key terms extracted from description (e.g. "Healthy", "Diet", "Organic") (binary)
* Presence of specific ingredient terms, extracted from ingredient list (e.g. "Butter", "Corn Syrup", "Sugar") (binary)

Example Output Features:
* Low-Fat (binary classification)
* *(Aspirational)* Calorie count (integer regression)

## Performance Metric
For the initial goal of binary low-fat classification, an appropriate performance metric is simple percent accuracy.
If accuracy is low, precision and recall will be useful metrics to determine any bias in the error. If classes are heavily unbalanced (e.g. there are relatively few low-fat products) it may be appropriate to use an accuracy metric that reflects both precision and recall, like F1 score.

Mathematical formulations of all of the above metrics are provided below, where T and F stand for "True" and "False", and P and N stand for "Positive" and "Negative."

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$
$$Precision = \frac{TP}{TP+FP}$$
$$Recall = \frac{TP}{TP+FN}$$
$$F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}$$
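As a small worked example of these formulas, the sketch below computes all four metrics from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical; they were chosen to mimic an unbalanced test set with few low-fat products.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 products, only 10 of which are truly low fat.
m = classification_metrics(tp=2, tn=88, fp=2, fn=8)
print(m)
```

With these counts accuracy is 0.9 while recall is only 0.2 and F1 is roughly 0.29, which illustrates why accuracy alone can look deceptively good on unbalanced classes.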


For the stretch goal of calorie count regression, an appropriate performance metric may be mean absolute error (MAE). (Given below where x is the actual value, y is the predicted value, and D is the number of data points being evaluated.)

$$MAE = \frac{1}{D}\sum_{i=1}^{D}|x_i-y_i|$$

This would be easily interpretable as how far off the model's predictions are from the actual number of calories (as opposed to mean squared error, which would be less readily interpretable.)
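To make the interpretability point concrete, here is a minimal sketch comparing mean absolute error with mean squared error on a handful of made-up calorie values. The helper names and the numbers are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same units as the target (calories)."""
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference, in squared units (calories squared)."""
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted calorie counts for four products.
actual = [100, 250, 400, 90]
predicted = [110, 240, 300, 90]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(f"MAE: {mae} calories, MSE: {mse} calories^2")
```

Here the MAE of 30 reads directly as "off by 30 calories on average", whereas the MSE of 2550 is dominated by the single 100-calorie miss and is expressed in squared units.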

## Basic EDA
Key questions:

1. Are full macronutrient details available for each product in the data? If not, what % of coverage do we have?
2. Are units standardized for both servings and ingredients? If not, how difficult will standardization be?
3. How much diversity of manufacturer/brand types exists in the data, in terms of count, frequent terms, and general recognizability of names in a random sample?
4. What range of serving sizes exists in the data? What are the extreme outliers?
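As one possible way to approach question 4, the sketch below summarizes a toy series of serving sizes and flags extreme values with the 1.5 × IQR rule. The IQR rule is a technique chosen for this sketch rather than something this project prescribes, and the values and the `serving_size_g` name are hypothetical.

```python
import pandas as pd

# Hypothetical serving sizes in grams; not taken from Serving_size.csv.
servings = pd.Series(
    [28.0, 30.0, 15.0, 240.0, 85.0, 2.0, 30.0, 1360.0], name="serving_size_g"
)

# Min, max, and quartiles give a first view of the range.
print(servings.describe())

# Flag values outside 1.5 * IQR from the quartiles as candidate outliers.
q1, q3 = servings.quantile(0.25), servings.quantile(0.75)
iqr = q3 - q1
outliers = servings[(servings < q1 - 1.5 * iqr) | (servings > q3 + 1.5 * iqr)]
print(outliers)
```

On skewed data like serving sizes, a rule of this kind mostly flags the large end of the distribution, so the flagged rows would still need a manual look before being treated as errors.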

### Google Drive Connection

In [11]:
from google.colab import drive
drive.mount('/content/drive')
!ls drive/MyDrive/food_nutrition_data/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
BFPD_Doc.pdf			 Nutrients.csv	Serving_size.csv
Derivation_Code_Description.csv  Products.csv


### Data Loading

In [124]:
import pandas as pd

products_df = pd.read_csv("drive/MyDrive/food_nutrition_data/Products.csv")
nutrients_df = pd.read_csv("drive/MyDrive/food_nutrition_data/Nutrients.csv")
servings_df = pd.read_csv("drive/MyDrive/food_nutrition_data/Serving_size.csv")



### Nutrient Joining

In [125]:
# Fraction of products whose NDB number appears in the nutrient table
coverage = products_df.NDB_Number.isin(nutrients_df.NDB_No).mean()
print(f"{round(coverage*100, 2)}% of products have some nutrient info")

99.48% of products have some nutrient info


In [126]:
# Verify that each nutrient code corresponds to a unique name
print(nutrients_df.groupby("Nutrient_Code").Nutrient_name.nunique().value_counts())

1    95
Name: Nutrient_name, dtype: int64


In [127]:
# Find nutrients that appear at least ~10% of the time
occurence_counts = nutrients_df.Nutrient_name.value_counts()
occurence_counts[(occurence_counts > nutrients_df.NDB_No.nunique() * 0.1)]

Carbohydrate, by difference           237635
Total lipid (fat)                     237559
Protein                               237432
Sodium, Na                            236887
Energy                                228500
Sugars, total                         223394
Fatty acids, total saturated          205694
Cholesterol                           202966
Fiber, total dietary                  198171
Iron, Fe                              196981
Calcium, Ca                           196065
Fatty acids, total trans              194801
Vitamin C, total ascorbic acid        178809
Vitamin A, IU                         174501
Potassium, K                           53337
Fatty acids, total monounsaturated     33822
Fatty acids, total polyunsaturated     33799
Vitamin D                              27668
Name: Nutrient_name, dtype: int64

In [128]:
# Drop nutrients that occur for fewer than ~10% of products. These seem to be manually entered, and many apply to only a handful of products
nutrients_to_drop = occurence_counts[(occurence_counts < nutrients_df.NDB_No.nunique() * 0.1)].index
nutrients_df = nutrients_df[~nutrients_df.Nutrient_name.isin(nutrients_to_drop)]

In [129]:
# Verify that we don't really need to normalize output UOMs since they are consistent for a given nutrient (no mixing mg and g)
print(f"Maximum unique units of measure per nutrient: {nutrients_df.groupby('Nutrient_name').Output_uom.nunique().max()}")

Maximum unique units of measure per nutrient: 1


In [130]:
# Roll up nutrients df of product-nutrient pairs into nutrient feature columns
nutrients_pivot = nutrients_df.pivot_table(values='Output_value', index='NDB_No', columns='Nutrient_name', aggfunc='first')
nutrients_pivot.head()

| NDB_No | Calcium, Ca | Carbohydrate, by difference | Cholesterol | Energy | Fatty acids, total monounsaturated | Fatty acids, total polyunsaturated | Fatty acids, total saturated | Fatty acids, total trans | Fiber, total dietary | Iron, Fe | Potassium, K | Protein | Sodium, Na | Sugars, total | Total lipid (fat) | Vitamin A, IU | Vitamin C, total ascorbic acid | Vitamin D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45001524 | 50.0 | 35.0 | 25.0 | 200.0 | NaN | NaN | 3.75 | 0.0 | 0.0 | 0.0 | NaN | 2.5 | 75.0 | 30.0 | 6.25 | 0.0 | 3.0 | NaN |
| 45001528 | 0.0 | 43.24 | 0.0 | 162.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 703.0 | 37.84 | 0.0 | 270.0 | 9.7 | NaN |
| 45001529 | 0.0 | 41.18 | 0.0 | 176.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 676.0 | 35.29 | 0.0 | 0.0 | 0.0 | NaN |
| 45001530 | 0.0 | 34.29 | 0.0 | 143.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 971.0 | 28.57 | 0.0 | 0.0 | 0.0 | NaN |
| 45001531 | 0.0 | 45.95 | 0.0 | 189.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 757.0 | 43.24 | 0.0 | 0.0 | 0.0 | NaN |
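The long-to-wide roll-up can be illustrated on a toy table. The sketch below reuses the `NDB_No`, `Nutrient_name`, and `Output_value` column names from the nutrient data, but the rows are made up, including a deliberately duplicated product-nutrient pair to show what `aggfunc='first'` does.

```python
import pandas as pd

# Made-up long-format rows: one row per (product, nutrient) pair.
# Product 2 has two "Protein" rows to show how duplicates are resolved.
long_df = pd.DataFrame({
    "NDB_No": [1, 1, 2, 2, 2],
    "Nutrient_name": [
        "Protein", "Total lipid (fat)", "Protein", "Protein", "Total lipid (fat)",
    ],
    "Output_value": [2.5, 6.25, 0.0, 0.1, 0.0],
})

# One row per product, one column per nutrient; 'first' keeps the first
# value seen for a duplicated pair instead of averaging (the default).
wide = long_df.pivot_table(
    values="Output_value", index="NDB_No", columns="Nutrient_name", aggfunc="first"
)
print(wide)
```

With the default `aggfunc` (the mean), the duplicated protein entry for product 2 would silently become 0.05, so choosing `'first'` is a decision to keep one reported value rather than blend them.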


In [131]:
# Merge nutrient columns to original product data
df = pd.merge(products_df, nutrients_pivot, left_on="NDB_Number", right_on="NDB_No", how="left")
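A left merge like this keeps every product row, and products without nutrient data end up with missing values in the nutrient columns. One way to check that explicitly is the `indicator` and `validate` options of `pd.merge`, sketched here on toy frames. The `title` column and all values are hypothetical, and the index is reset so the join key on the right is an ordinary column.

```python
import pandas as pd

# Toy stand-ins for the product table and the pivoted nutrient table.
products = pd.DataFrame({"NDB_Number": [1, 2, 3], "title": ["A", "B", "C"]})
nutrients_wide = pd.DataFrame(
    {"Protein": [2.5, 0.0]}, index=pd.Index([1, 2], name="NDB_No")
).reset_index()

merged = pd.merge(
    products,
    nutrients_wide,
    left_on="NDB_Number",
    right_on="NDB_No",
    how="left",
    indicator=True,          # adds a "_merge" column describing each row's match
    validate="one_to_one",   # raises if either key column has duplicates
)

# Rows marked "left_only" are products with no nutrient data at all.
unmatched = (merged["_merge"] == "left_only").sum()
print(f"{unmatched} of {len(merged)} products have no nutrient rows")
```

Because `validate` raises on duplicate keys, it also guards against the merge silently multiplying product rows.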

In [133]:
df["Total lipid (fat)"].isna().sum()

1530

In [139]:
# Report the share of products missing each macronutrient
print(f"{round(df['Total lipid (fat)'].isna().sum()*100/len(df), 2)}% of products are missing total fat")
print(f"{round(df['Protein'].isna().sum()*100/len(df), 2)}% of products are missing total protein")
print(f"{round(df['Carbohydrate, by difference'].isna().sum()*100/len(df), 2)}% of products are missing total carbs")

0.64% of products are missing total fat
0.69% of products are missing total protein
0.61% of products are missing total carbs


## Feature Engineering
**Based on your findings from Basic EDA, briefly describe your plan for feature engineering (e.g., what transformations do you plan to do on any of the features, do you plan to drop any features, etc.). If you have multiple complex features or features that may require trial and error, feel free to create one issue for each one of those features.**

## Train-Test Split
**Based on the metadata (such as size and target class distribution) of your dataset, briefly outline your train/test percent split. Include the percentage for your golden holdout set if you plan to leave one out.**

## Initial Pipeline
**Briefly describe the types of transformers you may need (such as an imputer, a column transformer, etc.).**

## Model Fitting and Evaluation
**list 1-3 assumptions you have about feature importance or how you anticipate your model’s performance will be**
