# Food Nutrition Analysis



## Table of Contents

1. [Abstract](#Abstract)  
    1.1 [Background](#Background)  
	1.2 [About](#About)
2. [Procedures](#Procedures)  
	2.1 [OCR / Text-scan](#OCRandTextScan)  
	&nbsp; 2.1.1 [Identify Image Texts](#IdentifyImageTexts)  
	2.2 [Web Crawler for Data Collection](#WebCrawler)  
    &nbsp; 2.2.1 [EWG Food Nutrition and Rating Information](#EWG)  
    &nbsp; 2.2.2 [Costco Food Nutrition Facts Image Data](#Costco)  
	2.3 [Machine Learning Models](#MLModels)  
	&nbsp; 2.3.1 Features  
	&nbsp; 2.3.2 [Linear Regression](#LinearRegression)  
	&nbsp; 2.3.3 [Decision Tree Regressor](#DecisionTreeRegressor)  
	2.4 [Language Model](#LanguageModel)  
	2.5 [Recommendation System Algorithm](#RecommendationSystemAlgorithm)  
	2.6 [User-centered research](#UserCenteredResearch)  
	2.7 [Streamlit Dashboard](#Streamlit)
4. [Conclusion](#Conclusion)



<a name="Abstract"></a>
## 1. Abstract 
<a name="Background"></a>
### 1.1 Background
Smart food labeling provides a comprehensive breakdown of the nutritional information associated with the food product. The AI system scrutinizes the extracted text to ascertain the quantities of macronutrients (such as carbohydrates, proteins, and fats) and the total caloric content. This information is presented in a user-friendly format, allowing consumers to make informed decisions regarding their dietary choices.
<a name="About"></a>
### 1.2 About
A core aspect of smart food labeling is the meticulous analysis of the ingredients employed in food products. The AI system scrutinizes the extracted ingredients list, cross-referencing it against an extensive database of recognized ingredients. This analysis aids in identifying specific components that may be present in the product. This level of transparency empowers consumers to quickly ascertain whether a product aligns with their dietary needs or restrictions.
See the docstrings in the code for a detailed description of what each function does. The README.md file explains how to run the project and how to set up the conda virtual environment.

### 1.3 Install Dependencies


In [124]:
!pip install pandas numpy pillow requests easyocr lxml scikit-learn openai streamlit



### 1.4 Import Modules

In [136]:
import json
import os
import re
import time
import uuid
from math import ceil
from urllib.parse import urljoin, urlparse, urlsplit

import easyocr
import numpy as np
import pandas as pd
import requests
import streamlit
from lxml import etree
from openai import OpenAI
from PIL import Image
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


os.environ['KMP_DUPLICATE_LIB_OK']='True'

# Set to True to run the web crawling cells. They are skipped by default to save time and storage.
# Running them is optional; skipping them does not affect any other cell.
web_crawling = False

# Set to True to plot the user-centered research charts. They are skipped by default.
plot_user_center_chart = False

<a name="Procedures"></a>
## 2. Procedures

<a name="OCRandTextScan"></a>
### 2.1 OCR / Text-scan
OCR stands for Optical Character Recognition. It's a technology that enables the conversion of different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. OCR software works by analyzing the patterns in the document images and recognizing the characters to convert them into machine-encoded text. 
In this project, we use the EasyOCR library to read the text on a food nutrition label. The nutritional data obtained this way is the basis of our analysis.
<a name="IdentifyImageTexts"></a>
#### 2.1.1 Identify Image Texts
With EasyOCR, we read the text blocks within the image of the food nutrition table. The result is a list of strings. To demonstrate, we have included an image named “foodlabel.png” in the INPUTS folder, which is also the image we used for the demo during the presentation.
During development, we had to tune a few parameters (such as `text_threshold`) so that the library grouped pieces of text together in a way that was easy to parse.
After the list of strings is parsed by `extract_info_from_text`, the result looks something like this:

{
    "Category" : "meals",
    "Calories" : 240,
    "Total Fat" : 4, 
    "Saturated Fat" : 1.5,
    "Trans Fat" : 0,
    "Total Carbohydrate" : 46,
    "Added Sugar" : 2,
    "Protein" : 11
}
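
As a rough illustration of how a list of OCR strings can become a dictionary like the one above, the sketch below parses a few label lines with a regular expression. The sample strings and the helper name `parse_label_lines` are hypothetical and not part of the project code; real EasyOCR output varies from image to image, so this is only one possible approach.

```python
import re

# Hypothetical OCR output; real EasyOCR results depend on the image.
sample_lines = ["Calories", "240", "Total Fat 4g", "Saturated Fat 1.5g", "Protein", "11g"]

# A number followed by an optional unit, e.g. "4g", "1.5g", "240".
AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*(?:mg|g)?$")


def parse_label_lines(lines, nutrients):
    """Map each nutrient name to the amount found on the same or the next line."""
    result = {}
    for i, line in enumerate(lines):
        for nutrient in nutrients:
            if not line.startswith(nutrient):
                continue
            # The amount either trails the name ("Total Fat 4g") or sits in the next text block.
            rest = line[len(nutrient):].strip()
            candidate = rest if rest else (lines[i + 1] if i + 1 < len(lines) else "")
            match = AMOUNT.match(candidate.strip())
            if match:
                result[nutrient] = float(match.group(1))
    return result


parsed = parse_label_lines(sample_lines, ["Calories", "Total Fat", "Saturated Fat", "Protein"])
print(parsed)
```

Anchoring the pattern at the end of the string keeps a stray percentage or footnote marker from being read as an amount; such lines are simply skipped.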

In [5]:
def read_text(file: str) -> list:
    """
    Reads text from an image file using the easyOCR library.
    ------------------
    Parameters: 
    file (str): The path to the image file from which text needs to be extracted.
    ------------------
    Returns: 
    list: A list of strings where each string represents a detected line of text in the image.
    """
    reader = easyocr.Reader(['en'])
    text = reader.readtext(file, detail = 0, text_threshold=0.7)
    # print(text)
    return text

In [6]:
def extract_info_from_text(text: list, category: str) -> dict:
    """
    Extracts nutrition values from the text output of the read_text function.
    Each label processed this way later becomes one row (one food item) in the
    final DataFrame; see convert_info_to_df, which turns the values into the
    calorie-ratio features (added sugar, fat, protein, carbs, saturated fat,
    and trans fat, each relative to total calories).
    ------------------
    Parameters: 
    text (list): Text blocks detected by read_text.
    category (str): Category of the food item.
    ------------------
    Returns: 
    dict: Maps 'Category' and each nutrient found on the label to its value.
    """
    nutrition_map = {}
    nutrition_map['Category'] = category
    for i in range(len(text)):
        # Amounts such as '4g' are the last space-separated token; [:-1] drops the unit letter.
        last_token = text[i].split(' ')[-1]
        if text[i] == 'Calories':
            # The calorie count is detected as its own text block right after the label.
            nutrition_map['Calories'] = int(text[i+1])
            continue
        if 'Total Fat' in text[i]:
            nutrition_map['Total Fat'] = int(last_token[:-1])
            continue
        if 'Saturated Fat' in text[i]:
            nutrition_map['Saturated Fat'] = float(last_token[:-1])
            continue
        if 'Trans Fat' in text[i]:
            # OCR may read the digit 0 as the letter O.
            tmp = last_token[:-1]
            tmp = 0 if tmp == 'O' else int(tmp)
            nutrition_map['Trans Fat'] = tmp
            continue
        if 'Total Carbohydrate' in text[i]:
            nutrition_map['Total Carbohydrate'] = int(last_token[:-1])
            continue
        if 'Added Sugar' in text[i]:
            # Here the amount is the third-to-last token of the text block.
            nutrition_map['Added Sugar'] = int(text[i].split(' ')[-3][:-1])
            continue
        if 'Protein' in text[i]:
            # The protein amount is detected as its own text block right after the label.
            nutrition_map['Protein'] = int(text[i+1][:-1])
            continue
    return nutrition_map



In [7]:
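def convert_info_to_df(d: dict)-> pd.DataFrame:
    """
    Converts the nutrition dict from extract_info_from_text into a single-row DataFrame
    of calorie ratios, indexed by a random UUID. Grams are converted to calories with
    4 kcal/g for sugar, protein, and carbohydrates and 9 kcal/g for fats.
    """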
    nutrition_map = {}
    nutrition_map['category'] = d['Category']
    # Calories from Added Sugar vs Total Calories
    nutrition_map['suga_to_total'] = round(d['Added Sugar'] * 4 / d['Calories'], 4)
    # Calories from Fat vs Total Calories
    nutrition_map['fat_to_total'] = round(d['Total Fat'] * 9 / d['Calories'], 4)
    # Calories from Protein vs Total Calories
    nutrition_map['pro_to_total'] = round(d['Protein'] * 4 / d['Calories'], 4)
    # Calories from Carbohydrates vs Total Calories
    nutrition_map['carb_to_total'] = round(d['Total Carbohydrate'] * 4 / d['Calories'], 4)
    # Calories from Saturated Fat vs Total Calories
    nutrition_map['satu_to_total'] = round(d['Saturated Fat'] * 9 / d['Calories'], 4)
    # Calories from Trans Fat vs Total Calories
    nutrition_map['tran_to_total'] = round(d['Trans Fat'] * 9 / d['Calories'], 4)
    res = pd.DataFrame(nutrition_map, index=[str(uuid.uuid4())])
    print(res)
    return res



In [8]:
text = read_text('../INPUTS/foodlabel.png')
food_info_df = convert_info_to_df(extract_info_from_text(text, 'food1'))

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


                                     category  suga_to_total  fat_to_total  \
e7e4696c-e7e4-438e-8e88-399977476706    food1         0.0333          0.15   

                                      pro_to_total  carb_to_total  \
e7e4696c-e7e4-438e-8e88-399977476706        0.1833         0.7667   

                                      satu_to_total  tran_to_total  
e7e4696c-e7e4-438e-8e88-399977476706         0.0563            0.0  


<a name="WebCrawler"></a>
### 2.2 Web Crawler for Data Collection
<a name="EWG"></a>
#### 2.2.1 EWG Food Nutrition and Rating Information
##### 2.2.1.1 Overview
In this section, we focus on the web crawler developed to collect food nutrition information and ratings from the Environmental Working Group (EWG) website. The crawler gathers the sample data used to train the machine learning models later in the project.
##### 2.2.1.2 Implementation
The crawler navigates through the EWG food scores webpage, specifically targeting food products, their details, and associated nutritional information. The following steps outline the process:
1. Page Navigation:
The crawler systematically explores the EWG food scores pages, scraping information for a predetermined number of products per page.
2. Data Extraction:
For each product, the crawler extracts details such as the product name, link to the product page, size information, and nutritional content.
3. Additional Information:
The crawler also captures specific details like the caloric content and the product's overall score, providing a comprehensive dataset for subsequent analysis.
##### 2.2.1.3 Challenges and Solutions
1. Dynamic Content: The EWG website employs dynamic loading of content, requiring careful handling to ensure complete data retrieval.
2. Parsing Complexity: The information on the website is nested and requires careful parsing, which was successfully achieved using XPath expressions.
3. Rate Limiting: The crawler includes a mechanism to handle HTTP 429 status codes, indicative of exceeding the allowed number of requests within a specified time frame. When such a status code is encountered, the program gracefully pauses execution for a predetermined duration before resuming the crawling process.
4. Robots.txt Compliance: Both crawlers follow the rules defined in the `robots.txt` files of the respective websites, which keeps the crawling activity within the bounds set by the website administrators.
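
The rate-limiting behavior described in item 3 can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `fetch_with_retry`, its parameters, and the 30-second default pause are assumptions. The fetch function is passed in as an argument, so the notebook's `get_response` could be used in its place, and the demo uses a fake response object so it runs without network access.

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, wait_seconds=30, sleep=time.sleep):
    """Call fetch(url); on HTTP 429, pause and try again up to max_retries times."""
    response = fetch(url)
    for _ in range(max_retries):
        if response.status_code != 429:
            break
        # Prefer the server's Retry-After hint (in seconds) when it is a plain number.
        hint = response.headers.get("Retry-After", "")
        sleep(int(hint) if hint.isdigit() else wait_seconds)
        response = fetch(url)
    return response


class FakeResponse:
    """Stand-in for a requests.Response, used only for this demo."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


# Two rate-limited replies followed by a successful one.
replies = iter([FakeResponse(429, {"Retry-After": "2"}), FakeResponse(429), FakeResponse(200)])
pauses = []
final = fetch_with_retry(lambda url: next(replies), "https://example.com", sleep=pauses.append)
print(final.status_code, pauses)
```

In the notebook, a call such as `fetch_with_retry(get_response, page_url)` would wrap the existing request helper without changing it.
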
##### 2.2.1.4 Output
The gathered data is stored in a structured format, with each entry containing essential details about a food product. The resulting dataset serves as a valuable resource for training machine learning models in subsequent stages of the project.


##### Data Crawling

In [56]:
def get_response(url):
    """
    Makes a GET request to the specified URL with a user-defined header to mimic a web browser request.
    
    Parameters:
    - url (str): Target URL from which to fetch the content.
    
    Returns:
    - Response object from requests library.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.57'}
    resp = requests.get(url, headers=headers, timeout=15)
    return resp

def get_food_label(page_number):
    """
    Fetches food product labels from the EWG website for a specific page number.
    
    Parameters:
    - page_number (int): The specific page number to scrape.
    
    Returns:
    - List of anchor elements (<a> tags) containing product information.
    """
    page_url = f'https://www.ewg.org/foodscores/products/?category_group=&page={page_number}&per_page=48&type=products'
    resp = get_response(page_url)
    a_labels = etree.HTML(resp.text).xpath('//div[@class="ind_result_text fleft"]/a')
    return a_labels

In [66]:
def get_food_info(a):
    """
    Extracts detailed food product information from its detail page.
    
    Parameters:
    - a (element): The anchor element pointing to the product's detail page.
    
    Returns:
    - Dictionary containing nutritional facts and other details of the product.
    """
    try:
        nutrition_fact = {}
        detail_url = urljoin('https://www.ewg.org/foodscores/products/', a.get('href'))
        resp_text = get_response(detail_url).text
        resp = etree.HTML(resp_text)
        name = resp.xpath('//h1[@class="truncate_title_specific_product_page"]')[0].text
        clr = resp.xpath('//div[@id="nut_calories_value"]')[0].text
        nutrition_fact['name'] = name
        nutrition_fact['link'] = detail_url
        size_opt = resp.xpath('//option[@selected]')
        nutrition_fact['size'] = size_opt[0].text if size_opt else None
        nutrition_fact['Calories'] = clr
        pattern = r'score_(\d+)_(\d+)'
        matches = re.findall(pattern, resp_text)
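        score = '.'.join(matches[0]) if matches else None
        # Store the score so that it ends up in the 'score' column of the dataset
        nutrition_fact['score'] = score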
        for i in resp.xpath('//span[@class="normal_title" or @class="med_title"]'):
            name_n_value = i.xpath('string(./..)').strip()
            element_name = i.xpath('string(.)').strip()
            element_kv = name_n_value.split(element_name)
            element_value_detail = element_kv[-1].strip() if len(element_kv) > 1 else ''
            nutrition_fact[element_name] = element_value_detail
        return nutrition_fact
    except Exception as e:
        # print(e)
        return None


In [100]:
if web_crawling:
    # Main driver code: Iterates through EWG food score pages and aggregates data into a CSV file.
    ewg_data_list = []
    # UID counter for unique identification of entries
    ct = 20000000
    start_url = "https://www.ewg.org/foodscores/products/?category_group=&page=1&per_page=48&type=products"
    # Determine the total number of pages to crawl
    pages = int(etree.HTML(get_response(start_url).text).xpath('//a[@aria-label]')[-1].text)
    
    pages = 50 if pages > 50 else pages
    
    for page_number in range(1, pages+1):
        food_labels = get_food_label(page_number)
        for a in food_labels:
            nutrition_fact = get_food_info(a)
            if nutrition_fact:
                nutrition_fact['uid'] = ct
                ct += 1
                ewg_data_list.append(nutrition_fact)
            else:
                break
    # Convert aggregated data into a pandas DataFrame
    ewg_data_df = pd.DataFrame(ewg_data_list)


##### Data Cleaning

In [95]:
def parse_and_convert(value):
    """
    Parses a nutritional value such as '1.5g' or '<5mg' from string to float, scaling by unit where necessary.
    
    Parameters:
    - value (str): The string containing the nutritional value and unit.
    
    Returns:
    - The converted value as float, or the string "Invalid Data" if the value
      cannot be parsed or its unit is not recognized.
    """
    cleaned = str(value).replace(' ', '')
    # Plain numbers without a unit, e.g. '0', '<1' or '1.5'
    plain = re.fullmatch(r'<?(\d+(\.\d+)?)', cleaned)
    if plain:
        return float(plain.group(1))
    match = re.match(r'<?(\d+(\.\d+)?)(\D+)', cleaned)
    if match:
        number, _, unit = match.groups()
        # Keys contain no spaces because spaces are stripped from the value above
        unit_mapping = {
            'g': 1,
            'mg': 0.001,
            'q': 1,
            '[g,g]': 1,
            'g*': 1,
            'g,': 1,
            '%': 1,
        }
        if unit.lower() in unit_mapping:
            return float(number) * unit_mapping[unit.lower()]
    # No number found, or a number with an unrecognized unit
    return "Invalid Data"
        
        

def rename(df):
    """
    Renames columns of the DataFrame to standardize names for further processing and ensures all expected columns exist.
    
    Parameters:
    - df (DataFrame): The DataFrame with original column names.
    
    Returns:
    - DataFrame: The DataFrame with columns renamed and standardized.
    """
    # Define a mapping of existing column names in df to their new standardized names
    rename_mapping = {
        'Total Carbs': 'Total Carbohydrate',
        'Added Sugar Ingredients:': 'Added Sugar',
        # Add any additional column renaming rules here
    }
    
    # Rename the columns based on the mapping
    # 'errors=ignore' ensures that non-existing columns in the mapping do not cause an error
    df.rename(columns=rename_mapping, inplace=True, errors='ignore')
    
    # List of all expected column names after renaming
    expected_columns = [
        'uid', 'name', 'score', 'Calories', 'Total Fat', 'Total Carbohydrate',
        'Protein', 'Saturated Fat', 'Trans Fat', 'Cholesterol', 'Sodium',
        'Added Sugar', 'Dietary Fiber', 'Sugars'
    ]
    
    # Add any missing expected columns, filled with 0
    for column in expected_columns:
        if column not in df.columns:
            df[column] = 0  # Adds the column with 0 as the default value for all rows
    
    # Select and reorder the columns to match expected_columns; copy to avoid chained-assignment warnings later
    df = df[expected_columns].copy()
    
    return df


def del_invalid(df):
    """
    Removes rows with invalid data from the DataFrame.
    
    Parameters:
    - df (DataFrame): The DataFrame to clean.
    
    Returns:
    - DataFrame with invalid rows removed.
    """
    all_columns = df.columns
    for column in all_columns:
        df = df[~(df[column] == 'Invalid Data')]
    return df


def clean_and_convert_types(df, strategy='fill', fill_value=0):
    """
    Cleans specified columns in a DataFrame by replacing non-numeric values with NaN,
    converts them to float, and handles missing values according to the specified strategy.
    
    Parameters:
    - df (DataFrame): The DataFrame to clean and convert.
    - strategy (str): Strategy to handle missing values ('fill' to fill with a value, 'drop' to drop rows with NaN).
    - fill_value (float or dict): The value to fill NaNs with if strategy is 'fill'. Can be a single value or a dict specifying per-column fill values.
    
    Returns:
    - DataFrame: The cleaned and type-converted DataFrame.
    """
    columns_to_convert = [
        'score', 'Calories', 'Total Fat', 'Total Carbohydrate', 'Protein', 
        'Saturated Fat', 'Trans Fat', 'Cholesterol', 'Sodium', 'Dietary Fiber', 'Sugars'
    ]
    
    # Replace known non-numeric values with NaN for specified columns
    df[columns_to_convert] = df[columns_to_convert].replace(['n/a', 'None', '--'], np.nan)
    
    # Convert columns to numeric, coercing errors to NaN
    for column in columns_to_convert:
        df[column] = pd.to_numeric(df[column], errors='coerce')
    
    # Handle missing values based on the specified strategy
    if strategy == 'fill':
        df.fillna(fill_value, inplace=True)
    elif strategy == 'drop':
        df.dropna(subset=columns_to_convert, inplace=True)
    
    return df

In [101]:
if web_crawling:
    ewg_data_df = rename(ewg_data_df)
    ewg_data_df.fillna('0', inplace=True)
    # Nutrient columns whose values still carry units such as 'g' or 'mg'
    nutrient_columns = ['Total Fat', 'Total Carbohydrate', 'Protein', 'Saturated Fat', 'Trans Fat',
                        'Cholesterol', 'Sodium', 'Dietary Fiber', 'Sugars']
    ewg_data_df[nutrient_columns] = ewg_data_df[nutrient_columns].applymap(parse_and_convert)
    
    ewg_data_df = del_invalid(ewg_data_df)
    
    ewg_data_df = clean_and_convert_types(ewg_data_df)
    print(ewg_data_df)

<a name="Costco"></a>
#### 2.2.2 Costco Food Nutrition Facts Image Data
##### 2.2.2.1 Overview
This section focuses on the second web crawler, designed to collect nutrition facts information from images on the Costco website. The goal is to utilize OCR for extracting relevant details from food labels and enriching the dataset.
##### 2.2.2.2 Implementation
1. Category Selection:
The crawler systematically navigates through specified categories on the Costco website to target relevant food products.
2. Image Retrieval:
For each product, the crawler identifies the product name, link to the product page, profile ID, and image ID.
3. Confirmation of Relevant Image:
The crawler then confirms which image on the product page shows the nutritional content. This required careful analysis of the webpage structure and content to ensure that the image associated with nutritional information is the one selected.
4. Image Download:
The crawler utilizes the identified IDs to download images containing nutritional information, focusing on the textual content related to nutritional facts.
##### 2.2.2.3 Challenges and Solutions
1. Image Retrieval: Extracting the correct image details from the Costco website required careful analysis of the web page structure.
2. Image Download: Handling different image formats and ensuring successful downloads were key considerations addressed during implementation.
3. Rate Limiting: To prevent potential issues, a time delay of 1 second was introduced between requests to the server.
4. Robots.txt Compliance: As with the EWG crawler, the Costco crawler respects the guidelines set forth in the `robots.txt` file, ensuring ethical and legal use of web scraping techniques.
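
One way to check a URL against `robots.txt` rules before requesting it is Python's standard `urllib.robotparser` module. The sketch below is an illustration under assumptions, not the project's actual code: the rules are hypothetical and inlined so that the example runs offline, and `is_allowed` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs without network access.
rules = """
User-agent: *
Disallow: /checkout/
Allow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)


def is_allowed(url, user_agent="*"):
    """Return True if the parsed robots.txt rules permit fetching url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/snacks.html"))
print(is_allowed("https://example.com/checkout/cart"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` loads the real file instead of the inlined rules.
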
##### 2.2.2.4 Output
The collected data is stored in a CSV file containing unique identifiers, product names, and links. This dataset serves as a valuable resource for further analysis and integration into the overall project framework. Images associated with nutritional content are stored locally using the format {uid}.jpg in the same directory as the scripts, ensuring a systematic and easily accessible storage approach.


In [None]:
def get_response(url):
    """
    Fetches the content of a given URL using a specified User-Agent in the headers.

    Parameters:
    - url (str): The URL from which to fetch content.

    Returns:
    - Response object: The response from the requests.get() call.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/118.0.0.0 Safari/537.36 Edg/118.0.2088.57'}
    resp = requests.get(url, headers=headers, timeout=15)
    return resp


def get_pages(url):
    """
    Determines the total number of pages for a website section based on the total number of products.

    Parameters:
    - url (str): The URL of the website section to analyze.

    Returns:
    - page_total (int): The total number of pages calculated from the total products.
    """
    resp = get_response(url)
    page = etree.HTML(resp.text).xpath('//span[@automation-id="totalProductsOutputText"]')[0].text
    page_total = ceil(int(page.split('of')[-1].strip()) / 24)
    return page_total


def get_detail_labels(page_url):
    """
    Extracts the detail labels (links to product detail pages) from a given page URL.

    Parameters:
    - page_url (str): The URL of the page from which to extract product details links.

    Returns:
    - detail_labels (list): A list of elements pointing to product detail pages.
    """
    resp = get_response(page_url)
    detail_labels = etree.HTML(resp.text).xpath(
        '//div[@automation-id="productList"]//a[contains(@automation-id,"productDescriptionLink")]')
    return detail_labels


def parse_detail_page(label):
    """
    Parses the product detail page for specific product information and image IDs.

    Parameters:
    - label (element): An element containing the product detail link and name.

    Returns:
    - Tuple containing product name, detail URL, profile ID, and image ID. If no match is found, returns None for all.
    """
    time.sleep(1) # Delays the request to avoid overwhelming the server
    pd_name = label.text.strip()
    detail_url = label.get('href')
    detail_page = etree.HTML(get_response(detail_url).text)
    img_src = detail_page.xpath('//meta[@property="og:image"]')[0].get('content')
    pattern = r"profileId=(\d+)&imageId=(\d+-?\d*)"
    match = re.search(pattern, img_src)
    if match:
        profile_id = match.group(1)
        image_id = match.group(2)
        return pd_name, detail_url, profile_id, image_id
    else:
        return None, None, None, None


def download_images(profile_id, image_id):
    """
    Downloads the nutrition facts image for a product, based on the profile ID and image ID
    obtained from the product's detail page. The image is saved as {uid}.jpg, where uid is
    the global counter maintained by the driver loop in the next cell.

    Parameters:
    - profile_id (str): The profile ID for the image.
    - image_id (str): The specific image ID to download.

    Returns:
    - bool: True if the image was successfully downloaded, False otherwise.
    """
    img_base = f'https://richmedia.ca-richimage.com/ViewerDelivery/productXmlService?profileid={profile_id}&itemid={image_id}&viewerid=1068&callback='
    try:
        # The response body is wrapped in parentheses, which are stripped before parsing the JSON
        img_page = get_response(img_base).text.strip('()')
        img_json = json.loads(img_page)
        imgs = img_json['product']['views']
        for img in imgs:
            # Views whose name contains 'nf' are treated as the nutrition facts image
            if 'nf' in img.get('@name', ''):
                nf_url = img['swatches'][0]['images'][0]['@path']
                img_file = get_response(nf_url).content
                with open(f'{uid}.jpg', 'wb') as f:
                    f.write(img_file)
                return True
        return False
    except Exception:
        # Any network, parsing, or file error counts as a failed download
        return False


In [102]:
if web_crawling:
    costco = 'https://www.costco.com/grocery-household.html'
    start_list = [
        '/snacks.html',
        "/coffee-sweeteners.html",
        "/candy.html",
        "/pantry.html",
        "/breakfast.html",
        "/beverages.html",
        "/emergency-kits-supplies.html",
        "/organic-groceries.html",
        "/cheese.html",
        "/deli.html",
        "/cakes-cookies.html",
    ]
    
    uid = 1000000
    
    with open('costco_food_img.csv', 'w') as f:
        f.write('uid\tname\tlink\n')
    
    for start_url in start_list:
        page_total = get_pages(urljoin(costco, start_url))
        for i in range(0, page_total):
            page_url = f'https://www.costco.com{start_url}?currentPage={i + 1}&pageSize=24'
            detail_labels = get_detail_labels(page_url)
            for label in detail_labels:
                pd_name, detail_url, profile_id, image_id = parse_detail_page(label)
                uid += 1
                if profile_id and image_id:
                    down_img = download_images(profile_id, image_id)
                    if down_img:
                        with open('costco_food_img.csv', 'a') as f:
                            f.write(f'{uid}\t{pd_name}\t{detail_url}\n')
                else:
                    continue


<a name="MLModels"></a>
### 2.3 Machine Learning Models
#### 2.3.1 Features
First, we chose the features for our models. Each feature is the share of a product's total calories that comes from one nutrient: added sugar, fat, protein, carbohydrates, saturated fat, and trans fat. We use ratios rather than absolute calorie counts so that solid foods and drinks can be compared directly: most beverages are less calorie-dense, yet they can still have a poor calorie ratio and low nutritional value.
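
To illustrate why ratios make solid foods and drinks comparable, the sketch below computes three of the ratio features for two made-up products. The product values and the helper name `calorie_ratios` are hypothetical; the kcal-per-gram factors are the same ones used in `convert_info_to_df`.

```python
# Energy per gram (kcal), the same factors used in convert_info_to_df.
KCAL_PER_GRAM = {"Added Sugar": 4, "Total Fat": 9, "Protein": 4}


def calorie_ratios(item):
    """Share of total calories contributed by each nutrient."""
    return {
        name: round(item[name] * kcal / item["Calories"], 4)
        for name, kcal in KCAL_PER_GRAM.items()
    }


# Hypothetical products: a sweetened drink and a snack bar with four times the calories.
drink = {"Calories": 100, "Added Sugar": 24, "Total Fat": 0, "Protein": 0}
bar = {"Calories": 400, "Added Sugar": 10, "Total Fat": 20, "Protein": 10}

print(calorie_ratios(drink))
print(calorie_ratios(bar))
```

Although the drink has far fewer calories than the bar, almost all of them come from added sugar, which is exactly the pattern that absolute calorie counts would hide.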

<a name="LinearRegression"></a>
#### 2.3.2 Linear Regression
Linear regression is a parametric algorithm, meaning it makes certain assumptions about the underlying data distribution. It assumes a linear relationship between the input features and the output variable. The model is represented by the equation of a straight line (in simple linear regression) or a hyperplane (in multiple linear regression). The objective is to find the coefficients of the linear equation that minimize the sum of squared differences between the predicted and actual values.
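
The least-squares objective can be illustrated for the single-feature case with a few lines of plain Python. This is a teaching sketch with made-up data points, not the model used in this project (which relies on scikit-learn's `LinearRegression` with six features); the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the intercept makes the line pass through the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between predicted and actual values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: added-sugar calorie share vs. score.
xs = [0.0, 0.1, 0.2, 0.3, 0.4]
ys = [3.0, 3.6, 3.9, 4.7, 5.0]
slope, intercept = fit_line(xs, ys)
best = sum_squared_error(xs, ys, slope, intercept)
# Moving the slope away from the fitted value increases the error.
worse = sum_squared_error(xs, ys, slope + 0.5, intercept)
print(round(slope, 3), round(intercept, 3), best < worse)
```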

In [108]:
class LinearRegressionModel:
    def __init__(self, data):
        self.data = data

    def split_data(self, test_size=0.2, random_state=None):
        """
        Splits the data into training and testing sets.
        """
        X = self.data[['suga_to_total', 'fat_to_total', 'pro_to_total', 'carb_to_total', 'satu_to_total', 'tran_to_total']]
        y = self.data['Score']
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    def train_model(self):
        """
        Trains the Linear Regression model using the training data.
        """
        self.model = LinearRegression()
        self.model.fit(self.X_train, self.y_train)

    def evaluate_model(self):
        """
        Evaluates the model's performance on the test set.
        """
        y_pred = self.model.predict(self.X_test)
        mse = mean_squared_error(self.y_test, y_pred)
        r2 = r2_score(self.y_test, y_pred)
        return mse, r2

    def get_coefficients(self):
        """
        Retrieves the model's coefficients and intercept.
        """
        coefficients = dict(zip(['suga_to_total', 'fat_to_total', 'pro_to_total', 'carb_to_total', 'satu_to_total', 'tran_to_total'], self.model.coef_))
        coefficients['Intercept'] = self.model.intercept_
        return coefficients

    def predict_new_data(self, new_data):
        """
        Predicts the score for new data points based on the trained model.
        """
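        feature_names = ['suga_to_total', 'fat_to_total', 'pro_to_total', 'carb_to_total', 'satu_to_total', 'tran_to_total']
        features = [round(new_data[feature] * conversion_factor / new_data['Calories'], 4) 
                    for feature, conversion_factor in zip(['Added Sugar', 'Total Fat', 'Protein', 'Total Carbohydrate', 'Saturated Fat', 'Trans Fat'], 
                                                          [4, 9, 4, 4, 9, 9])]
        # Pass a DataFrame with the training column names so scikit-learn does not warn about missing feature names
        predicted_score = self.model.predict(pd.DataFrame([features], columns=feature_names))[0]
        return predicted_score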


In [109]:
def convert_info_to_df_databse_lr(df: pd.DataFrame)-> pd.DataFrame:
    """
    Prepares a DataFrame for Linear Regression Model training by calculating nutrient ratios.
    """
    columns = ['category', 'suga_to_total', 'fat_to_total', 'pro_to_total', 'carb_to_total', 'satu_to_total', 'tran_to_total', 'Score']
    nutrition = pd.DataFrame(columns=columns)
    nutrition['Name'] = df['Name']
    nutrition['Score'] = df['Score']
    nutrition['category'] = df['Category']
    # Calories from Added Sugar vs Total Calories
    nutrition['suga_to_total'] = round(df['Added Sugar'] * 4 / df['Calories'], 4)
    # Calories from Fat vs Total Calories
    nutrition['fat_to_total'] = round(df['Total Fat'] * 9 / df['Calories'], 4)
    # Calories from Protein vs Total Calories
    nutrition['pro_to_total'] = round(df['Protein'] * 4 / df['Calories'], 4)
    # Calories from Carbohydrates vs Total Calories
    nutrition['carb_to_total'] = round(df['Total Carbohydrate'] * 4 / df['Calories'], 4)
    # Calories from Saturated Fat vs Total Calories
    nutrition['satu_to_total'] = round(df['Saturated Fat'] * 9 / df['Calories'], 4)
    # Calories from Trans Fat vs Total Calories
    nutrition['tran_to_total'] = round(df['Trans Fat'] * 9 / df['Calories'], 4)
    return nutrition

In [110]:
df = pd.read_csv('../INPUTS/data_for_model_training.csv')

# Preprocess data
df_prepared = convert_info_to_df_databse_lr(df)
df_prepared = df_prepared.dropna()  # Ensure there are no NaN values

# Initialize and train model
model = LinearRegressionModel(df_prepared)
model.split_data(test_size=0.2, random_state=42)
model.train_model()

# Evaluate model
mse, r2 = model.evaluate_model()
coefficients = model.get_coefficients()

# Display results
print(f"Mean Squared Error: {mse}")
print(f"R-squared (R2) Score: {r2}")
print("Model Coefficients:")
for feature, coef in coefficients.items():
    print(f"{feature}: {coef}")

# Predictions can be made with model.predict_new_data(new_data) where new_data is a dictionary of features


Mean Squared Error: 3.3831290276348995
R-squared (R2) Score: 0.10617462942274791
Model Coefficients:
suga_to_total: 2.4908297693439345
fat_to_total: 1.1938017876213083
pro_to_total: -3.230754564358422
carb_to_total: -0.9041395636852857
satu_to_total: 2.3359050299067583
tran_to_total: -6.145781487898252
Intercept: 5.164429146560749
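
To make the printed coefficients concrete, the sketch below applies them by hand to the sample label used in the Language Model section (240 calories, 2 g added sugar, 4 g fat, 11 g protein, 46 g carbohydrate, 1.5 g saturated fat, 0 g trans fat). It assumes the fitted model is a plain linear combination of the six calorie ratios. `predict_score` is a hypothetical helper written for this illustration, not the notebook's `predict_new_data`, and the coefficients are rounded to four decimals.

```python
# Coefficients and intercept copied from the output above, rounded to 4 decimals.
COEFFICIENTS = {
    "suga_to_total": 2.4908,
    "fat_to_total": 1.1938,
    "pro_to_total": -3.2308,
    "carb_to_total": -0.9041,
    "satu_to_total": 2.3359,
    "tran_to_total": -6.1458,
}
INTERCEPT = 5.1644

# Calories per gram used when building each ratio (4 for sugar, protein, carbs; 9 for fats).
KCAL_PER_GRAM = {
    "suga_to_total": 4,
    "fat_to_total": 9,
    "pro_to_total": 4,
    "carb_to_total": 4,
    "satu_to_total": 9,
    "tran_to_total": 9,
}


def predict_score(grams, calories):
    """Hypothetical helper: intercept plus the sum of coefficient * calorie ratio."""
    score = INTERCEPT
    for feature, coef in COEFFICIENTS.items():
        ratio = grams[feature] * KCAL_PER_GRAM[feature] / calories
        score += coef * ratio
    return score


# Gram values read off the sample label (assumed to be per serving).
label = {
    "suga_to_total": 2,
    "fat_to_total": 4,
    "pro_to_total": 11,
    "carb_to_total": 46,
    "satu_to_total": 1.5,
    "tran_to_total": 0,
}
print(round(predict_score(label, calories=240), 2))  # about 4.27
```

The negative protein and carbohydrate coefficients pull the score down (healthier), while added sugar and saturated fat push it up.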


<a name="DecisionTreeRegressor"></a>
#### 2.3.3 Decision Tree Regressor
Decision trees are non-parametric models that make no strong assumptions about the underlying data distribution. A tree is a hierarchical structure in which each internal node tests a feature, each branch represents an outcome of that test, and each leaf holds a predicted value. A decision tree regressor is built by recursively splitting the data on the feature and threshold that most reduce the variance (equivalently, the mean squared error) of the target. Each leaf is then assigned a constant prediction, the mean of the training targets that fall into it.
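
As a rough illustration of the variance-reduction criterion, the sketch below searches for the best single split on one feature. It is a simplified stand-in for what a tree learner does at each node, not scikit-learn's implementation; `best_split` and the toy data are made up for this example.

```python
from statistics import pvariance


def best_split(xs, ys):
    """Return (threshold, weighted_variance) for the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    # Start from "no split": the variance of the whole node.
    best = (None, pvariance(ys))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Size-weighted average of the two child variances.
        weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        if weighted < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, weighted)
    return best


# Toy, made-up data: sugar ratio versus score.
sugar_ratio = [0.02, 0.05, 0.10, 0.40, 0.55, 0.60]
score = [2.0, 3.0, 2.5, 7.0, 8.0, 7.5]
threshold, impurity = best_split(sugar_ratio, score)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses into the two resulting subsets.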

In [112]:
class DecisionTreeRegressorModel:
    def __init__(self, data):
        """
        Initializes the DecisionTreeRegressorModel with the provided data.
        
        Parameters:
        - data (DataFrame): The dataset containing features and target variable for model training.
        """
        self.data = data

    def split_data(self, test_size=0.2, random_state=None):
        """
        Splits the dataset into training and testing sets.
        
        Parameters:
        - test_size (float): The proportion of the dataset to include in the test split.
        - random_state (int, optional): Controls the shuffling applied to the data before applying the split.
        """
        X = self.data[['suga_to_total', 'fat_to_total', 'pro_to_total', 'carb_to_total', 'satu_to_total', 'tran_to_total']]
        y = self.data['Score']
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    def train_model(self):
        """
        Trains the Decision Tree Regressor model using the training dataset.
        """
        self.model = DecisionTreeRegressor(random_state=42)
        self.model.fit(self.X_train, self.y_train)

    def evaluate_model(self):
        """
        Evaluates the model's performance on the test dataset.
        
        Returns:
        - mse (float): Mean Squared Error of the model predictions.
        - r2 (float): R-squared score of the model.
        """
        y_pred = self.model.predict(self.X_test)
        mse = mean_squared_error(self.y_test, y_pred)
        r2 = r2_score(self.y_test, y_pred)
        return mse, r2

    def get_feature_importance(self):
        """
        Retrieves the feature importance determined by the model.
        
        Returns:
        - Array of feature importance scores.
        """
        return self.model.feature_importances_

In [113]:
def convert_info_to_df_database_dt(df: pd.DataFrame)-> pd.DataFrame:
    """
    Prepares the DataFrame for the Decision Tree Regressor Model by calculating nutrient ratios.
    
    Parameters:
    - df (DataFrame): The original DataFrame with nutritional information.
    
    Returns:
    - DataFrame: Prepared DataFrame with calculated features for model training.
    """
    columns = ['Name', 'suga_to_total', 'fat_to_total', 'pro_to_total', 'carb_to_total', 'satu_to_total', 'tran_to_total', 'Score', 'category']  # Replace these with your column names
    nutrition = pd.DataFrame(columns=columns)
    nutrition['Name'] = df['Name']
    nutrition['Score'] = df['Score']
    nutrition['category'] = df['Category']
    # Calories from Added Sugar vs Total Calories
    nutrition['suga_to_total'] = round(df['Added Sugar'] * 4 / df['Calories'], 4)
    # Calories from Fat vs Total Calories
    nutrition['fat_to_total'] = round(df['Total Fat'] * 9 / df['Calories'], 4)
    # Calories from Protein vs Total Calories
    nutrition['pro_to_total'] = round(df['Protein'] * 4 / df['Calories'], 4)
    # Calories from Carbohydrates vs Total Calories
    nutrition['carb_to_total'] = round(df['Total Carbohydrate'] * 4 / df['Calories'], 4)
    # Calories from Saturated Fat vs Total Calories
    nutrition['satu_to_total'] = round(df['Saturated Fat'] * 9 / df['Calories'], 4)
    # Calories from Trans Fat vs Total Calories
    nutrition['tran_to_total'] = round(df['Trans Fat'] * 9 / df['Calories'], 4)
    return nutrition

In [115]:
# Load the dataset
df = pd.read_csv('../INPUTS/data_for_model_training.csv')
# Prepare the dataset
df_prepared = convert_info_to_df_database_dt(df)
df_prepared = df_prepared.dropna()  # Ensure there are no NaN values

# Initialize, train, and evaluate the model
model = DecisionTreeRegressorModel(df_prepared)
model.split_data(test_size=0.2, random_state=42)
model.train_model()
mse, r2 = model.evaluate_model()
feature_importance = model.get_feature_importance()

# Display evaluation results and feature importance
print(f"Mean Squared Error: {mse}")
print(f"R-squared (R2) Score: {r2}")
print("Feature Importance:")
for feature, importance in zip(df_prepared.columns[1:7], feature_importance):  # Adjust column slicing as needed
    print(f"{feature}: {importance}")


Mean Squared Error: 4.332
R-squared (R2) Score: -0.14451783355350067
Feature Importance:
suga_to_total: 0.2880020266050098
fat_to_total: 0.1492217527815166
pro_to_total: 0.2706288488089343
carb_to_total: 0.13437789144090984
satu_to_total: 0.15776948036362962
tran_to_total: 0.0


#### 2.3.4 Implementation
We compared the two models on mean squared error and R-squared. Linear regression achieved the lower MSE (3.38 versus 4.33) and the higher R-squared (0.106 versus -0.145), so we use it to estimate the score for the nutrition label.
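
One way to read these numbers: since R-squared is defined as 1 - MSE / (MSE of always predicting the test-set mean), the baseline error can be recovered from each reported pair. The sketch below does that with the values printed above; `baseline_mse` is a hypothetical helper introduced for this illustration.

```python
def baseline_mse(mse, r2):
    """MSE of always predicting the test-set mean, recovered from R2 = 1 - mse / baseline."""
    return mse / (1.0 - r2)


# Metrics copied from the two evaluation outputs above.
linear_baseline = baseline_mse(3.3831290276348995, 0.10617462942274791)
tree_baseline = baseline_mse(4.332, -0.14451783355350067)

# Both models were evaluated on the same test split, so the baselines should agree.
print(round(linear_baseline, 3), round(tree_baseline, 3))
```

Both values come out at about 3.785, which suggests that the linear model improves on the mean baseline by roughly 11% while the decision tree does worse than the baseline (hence its negative R-squared).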


<a name="LanguageModel"></a>
### 2.4 Language Model
For the language model, we call the OpenAI API, pass it the ingredient list extracted from the label, and ask it for a score from 1 to 10, where 1 means the healthiest and 10 means the least healthy.
We then average the language model's score and the ML model's score to obtain the final score.



In [191]:
class LanguageModel:
    def __init__(self):
        # Prompt the user for the OpenAI API key or you can change the input by your OpenAI API key
        self.OPEN_AI_ACCESS_KEY = input("Please enter your OpenAI API key: ")

    def language_model_ingredients(self, file_name: str) -> str:
        """
        Evaluates the healthiness of food based on its ingredient list using OpenAI's language model.
        
        Parameters:
        - file_name (str): The file name containing the ingredient list.
        
        Returns:
        - str: The language model's healthiness rating of the food.
        """
        # read_text is the OCR helper; it must be defined or imported before this cell runs
        ingredients = read_text(file_name)
        prompt = f"Rate the healthiness of the food on a scale of 1 to 10 based on its nutritional content provided (1 is most healthy and 10 is most unhealthy), give me only the numeric answer: {ingredients}."
        print(prompt)
        
        
        client = OpenAI(
            api_key=self.OPEN_AI_ACCESS_KEY,
        )
        
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    # The request is sent as a user message so the model answers it
                    "role": "user",
                    "content": prompt,
                }
            ],
            # Adjust to the model you prefer
            model="gpt-3.5-turbo",
            max_tokens=10,
        )

        answer = chat_completion.choices[0].message.content
        return answer


In [192]:
model = LanguageModel()
file_name_ingredients = "../INPUTS/foodlabel.png"
model.language_model_ingredients(file_name_ingredients)

Please enter your OpenAI API key:  [redacted]


Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


Rate the healthiness of the food on a scale of 1 to 10 based on its nutritional content provided (1 is most healthy and 10 is most unhealthy), give me only the numeric answer: ['Nutrition Facts]', 'servings per container', 'Serving size', '1/2 cup (208g)', 'Amount per serving', 'Calories', '240', '% Daily Value*', 'Total Fat 4g', '5%', 'Saturated Fat 1.5g', '8%', 'Trans Fat Og', 'Cholesterol Smg', '2%', 'Sodium 430mg', '19%', 'Total Carbohydrate 46g', '17%', 'Dietary Fiber 7g', '25%', 'Total Sugars 4g', 'Includes 2g Added Sugars', '4%', 'Protein', '11g', 'Vitamin D 2mcg', '10%', 'Calcium 260mg', '20%', 'Iron 6mg', '35%', 'Potassium 240mg', 'The % Daily Value (DV) tells you how much a nutrient in', 'serving of food contributes to', 'daily diet: 2,000 calories', 'day is used for general nutrition advice.'].


'8'
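
The language model returns its rating as a string such as `'8'`, so it has to be converted to a number before it can be averaged with the ML model's score. The sketch below shows one way this could be done. `parse_llm_score` and `final_score` are hypothetical helpers, and both the clamping to the 1 to 10 range and the fallback to the ML score alone are assumptions rather than the project's documented behavior.

```python
import re


def parse_llm_score(answer, low=1.0, high=10.0):
    """Extract the first number from the model's reply and clamp it to [low, high]."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    if match is None:
        return None  # the reply contained no number
    return min(max(float(match.group()), low), high)


def final_score(ml_score, llm_answer):
    """Average the ML score with the language model's score."""
    llm_score = parse_llm_score(llm_answer)
    if llm_score is None:
        return ml_score  # assumption: fall back to the ML score alone
    return (ml_score + llm_score) / 2


print(final_score(4.0, "8"))
```

Parsing defensively matters because the model is only asked, not forced, to reply with a bare number.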

<a name="RecommendationSystemAlgorithm" ></a>
### 2.5 Recommendation System Algorithm
To build the recommendation system, we look at the final overall score of a food. If the score is below a lower boundary, we tell the user the food is a great choice. If it is above an upper boundary, we flag the food as unhealthy, advise against consuming it, and recommend a few alternatives. If it falls between the two boundaries, we tell the user the food is acceptable as it is, but suggest a few alternatives to consider.
To avoid recommending something unrelated to the user's needs, such as a beverage when the user wants a whole meal, we separate foods into categories. When the user checks the nutritional value of a brownie, for example, we can confidently recommend a protein cupcake, which has a much better rating and is similar to what the user wants.
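
The thresholding and category filter described above can be sketched as follows. The boundary values, the `recommend` function, and the catalog entries are all hypothetical, since the text does not state the actual boundaries or data layout. Lower scores are treated as healthier, matching the 1 to 10 scale used for the language model.

```python
LOWER_BOUND = 4.0  # hypothetical: below this the food is considered great
UPPER_BOUND = 7.0  # hypothetical: above this the food is considered unhealthy


def recommend(food, catalog, max_options=3):
    """Return (verdict, alternatives) for a food given a catalog of scored foods."""
    # Only consider healthier foods from the same category as the scanned item.
    candidates = [
        item for item in catalog
        if item["category"] == food["category"] and item["score"] < food["score"]
    ]
    alternatives = sorted(candidates, key=lambda item: item["score"])[:max_options]
    if food["score"] < LOWER_BOUND:
        return "great", []
    if food["score"] > UPPER_BOUND:
        return "unhealthy", alternatives
    return "okay", alternatives


# Made-up catalog entries for illustration.
catalog = [
    {"name": "protein cupcake", "category": "dessert", "score": 3.5},
    {"name": "sparkling water", "category": "beverage", "score": 1.0},
    {"name": "glazed donut", "category": "dessert", "score": 9.0},
]
brownie = {"name": "brownie", "category": "dessert", "score": 8.0}
verdict, options = recommend(brownie, catalog)
print(verdict, [item["name"] for item in options])
```

Filtering by category before ranking is what keeps the beverage out of the brownie's recommendations even though it has the best score in the catalog.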



<a name="UserCenteredResearch"></a>
### 2.6 User-Centered Research
We conducted a series of evaluation surveys, which we shared with students at Georgia Tech and Northeastern University: http://peersurvey.cc.gatech.edu/s/c93c4b2bf03a491c967ae81ae6b62ffb
The survey walks users through the entire process of using our interface and studies their background, satisfaction, and needs. It gives us insight into what we did well and what we should improve in the future. We then performed data analysis on the responses.

#### Gender
<img src="../images/gender.png" alt="Gender" height=300 />

#### Ages
<img src="../images/ages.png" alt="Ages" height=300/>

#### Fitness Goal
<img src="../images/image1.png" alt="Fitness Goal" height=300 />

#### Fitness Goal Among Gender
<img src="../images/image2.png" alt="Fitness Goal Among Gender" height=300 />

#### Fitness Goal Among Ages
<img src="../images/image3.png" alt="Fitness Goal Among Ages" height=300 />

#### Fitness Goal Among Fitness Levels
<img src="../images/image4.png" alt="Fitness Goal Among Fitness Levels" height=300 />

#### Improvement in Confidence in Making Wise Food Choices (%) Among Fitness Levels
<img src="../images/image5.png" alt="Confidence Level" height=300 />




<a name="Streamlit"></a>
### 2.7 Streamlit Dashboard
We used Streamlit to build a dashboard that walks the user through the process. Run the app as described below to try it yourself.


---
#### Running the Streamlit App

To interact with the Streamlit application associated with this notebook, please follow these steps:

1. **Open a Terminal**: Open a new terminal window on your computer. This can usually be done through your operating system's main menu or search function.

2. **Activate Your Environment** (if applicable): If you're using a virtual environment for your project (recommended), activate it by running the appropriate command for your virtual environment. For example, if you're using `conda`, you might use:
   ```bash
   conda activate your_env_name
   ```
   Or, if you're using `venv`, you might use:
   ```bash
   source your_env_name/bin/activate
   ```
   Replace `your_env_name` with the name of your virtual environment.

3. **Navigate to Your Project Directory**: Use the `cd` command to navigate to the directory containing your Streamlit script. For example:
   ```bash
   cd path/to/your/project/streamlit
   ```
   Replace `path/to/your/project/streamlit` with the actual path to your project directory.

4. **Run the Streamlit App**: Execute the Streamlit app by running:
   ```bash
   streamlit run streamlit.py
   ```
   Replace `streamlit.py` with the name of your Streamlit script file.

5. **View the App**: After running the above command, Streamlit will start the server and open your default web browser to display the app. If the browser does not open automatically, Streamlit will provide a URL in the terminal that you can copy and paste into your browser to view the app.

#### Additional Notes

- **Dependencies**: Ensure all dependencies required by your Streamlit app are installed in your environment. This may include libraries like `pandas`, `numpy`, `PIL`, and any others your app uses.
- **Troubleshooting**: If you encounter any issues running the app, check that you're in the correct directory, your virtual environment is activated, and all necessary Python packages are installed.
- **Feedback and Iteration**: Encourage users to provide feedback on the app. If you're using this in an educational or collaborative setting, this can be a great way to iterate on and improve the app.

---


<a name="Conclusion"></a>
## 3. Conclusion
Next steps:

- We plan to improve the OCR step so that text can be extracted more accurately from real-life images.
- We plan to refine our model by using a larger dataset, adding more features, and accounting for each user's physical and dietary needs, both to lower the model's MSE and to improve the user experience.
