# COGS 108 - Project Proposal

## Authors

- Jared Wang: TBD at Checkpoint
- Dylan Dsouza: TBD at Checkpoint
- Christian Kumagai: TBD at Checkpoint
- Kyle Zhao: TBD at Checkpoint

## Research Question

Do dessert recipes that use honey as the primary sweetener differ in caloric content, sugar content, and total fat (normalized to FDA daily values) compared to desserts that use sugar as the primary sweetener?

To control for recipe type, we will restrict the analysis to recipes carrying the “desserts” tag and classify each one as honey-based or sugar-based from its ingredient list, keeping only recipes that contain a single primary sweetener. Our nutritional outcomes are total calories, sugar (% daily value), and total fat (% daily value), all extracted from the `nutrition` column. We intend to compare mean nutritional values between honey and sugar desserts using independent-samples t-tests. We also intend to run secondary analyses using ANOVAs; we have not yet settled the details and will likely do so once our EDA is complete.
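The filtering and classification step can be sketched as follows. This is a minimal illustration, not our final wrangling code: the helper names `parse_list` and `classify_recipe` are hypothetical, and it assumes that `tags` and `ingredients` are stored as stringified Python lists and that matching the whole words "honey" and "sugar" is an adequate first pass at identifying the primary sweetener.

```python
import ast
import re


def parse_list(cell):
    """Turn a stringified list such as "['honey', 'oats']" into a Python list."""
    return ast.literal_eval(cell) if isinstance(cell, str) else list(cell)


def classify_recipe(tags, ingredients):
    """Return 'honey' or 'sugar' for a dessert with exactly one of the two
    sweeteners, and None for everything else (non-desserts, both, neither)."""
    if "desserts" not in tags:
        return None
    # Match whole words so that "honeydew melon" does not count as honey,
    # while "brown sugar" still counts as sugar.
    words = set()
    for name in ingredients:
        words.update(re.findall(r"[a-z]+", name.lower()))
    has_honey = "honey" in words
    has_sugar = "sugar" in words
    if has_honey == has_sugar:
        return None
    return "honey" if has_honey else "sugar"
```

With pandas, a function like this could be applied row by row after parsing both columns. Edge cases such as "sugar-free" products or other sweeteners (maple syrup, agave) would still need a decision during EDA.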

## Background and Prior Work


Instructions: REPLACE the contents of this cell with your work

- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects who have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be primary research publications, but they need not be. Blogs, GitHub repositories, company websites, reputable news or magazine articles etc., are all viable references if they are relevant to your project. It may be very helpful to look for review papers in research publications; these surveys of a field can provide you with important domain expertise and make excellent citations.

Do not just give us a couple of random citations. These should be directly relevant. Depending on your needs, these could be:
- fundamental things that are very important in the broad field (think textbook chapters or famous case studies)
- very similar projects to what you want to do
- primary research or review articles about techniques you will use or an obstacle you will face 

Generally if two possible citations exist, choose the one which seems more central to the field (has more citations or appears in a reputable source).

Be aware that AI will make up citations and generally not necessarily pick out the most important ones.

You are expected to have three to six paragraphs of background information here and a *minimum* of three relevant citations. Use those citations in a way that makes it clear which information comes from which reference. Don't just claim a bunch of stuff in the text without citation and then dump a bibliography at the end.

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts and methods. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="#..."> </a>` is a tag that allows you to produce a named reference for a given location. Markdown has the construction `[text with hyperlink](#named reference)` that will produce a clickable link that transports you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


## Hypothesis


We hypothesize that desserts using honey as the primary sweetener will have higher overall caloric content and slightly higher total fat than sugar-sweetened desserts when normalized to FDA daily values, while sugar (% daily value) may be comparable. Although honey is often perceived as a healthier alternative, it contains more calories by volume than sugar (about 64 calories per tablespoon of honey, compared with about 48 calories per tablespoon of sugar), and recipes claiming to be 'healthy' may not reduce quantities enough to offset this difference. Additionally, honey-based desserts often include complementary fat sources (e.g., butter, oils, nuts), which may contribute to comparable or higher fat content.

**Calories**

* Null Hypothesis (H₀): μ_honey ≤ μ_sugar

* Alternate Hypothesis (H₁): μ_honey > μ_sugar

**Total fat (% DV)**

* Null Hypothesis (H₀): μ_honey ≤ μ_sugar

* Alternate Hypothesis (H₁): μ_honey > μ_sugar

**Sugar (% DV)**

* Null Hypothesis (H₀): μ_honey = μ_sugar

* Alternate Hypothesis (H₁): μ_honey ≠ μ_sugar
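As a rough sketch of how each comparison could be computed, the block below calculates a Welch (unequal-variance) t statistic and its approximate degrees of freedom using only the standard library. Welch's variant is our assumption here, one common form of the independent-samples t-test; the function name `welch_t` and the toy numbers in the usage note are made up for illustration and are not results.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)
    for the difference mean(sample_a) - mean(sample_b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df
```

For example, `welch_t([2, 4, 6, 8], [1, 2, 3, 4])` compares two toy groups. In practice we would more likely call `scipy.stats.ttest_ind` with `equal_var=False`, using `alternative="greater"` for the one-sided calorie and fat hypotheses and the default two-sided test for sugar, which also returns the p-value.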

## Data

For our data, we aim to find a dataset of foods with columns such as ingredients, which we can filter down to dishes/recipes whose only primary sweetener is honey or sugar, respectively. We aim for a large sample, likely 100,000 or more rows, which we can filter down to 10,000 or more results. We would prefer the data to be stored in CSV files, which we can then work with using pandas.


The data we have found is a collection of recipes hosted on Kaggle, originally uploaded by users to Food.com. Since the dataset is publicly available, we do not need to take further action to obtain it. The variables we plan to use are `tags`, `nutrition`, and `ingredients`.

The data we plan to use can be accessed through the following URL. We are specifically using the Raw Recipes dataset:
https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions/data?select=RAW_recipes.csv
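Extracting our three outcomes from the `nutrition` column might look like the sketch below. It assumes each cell is a stringified list of seven numbers in the order calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates, with everything except calories expressed as percent daily value; we will verify this order against the dataset description before relying on it. The names `NUTRITION_FIELDS`, `parse_nutrition`, and `extract_outcomes` are hypothetical.

```python
import ast

# Assumed column order; to be verified against the dataset description.
NUTRITION_FIELDS = (
    "calories",
    "total_fat_pdv",
    "sugar_pdv",
    "sodium_pdv",
    "protein_pdv",
    "saturated_fat_pdv",
    "carbohydrates_pdv",
)


def parse_nutrition(cell):
    """Parse a cell such as "[51.5, 0.0, 13.0, ...]" into a field -> value dict."""
    values = ast.literal_eval(cell)
    if len(values) != len(NUTRITION_FIELDS):
        raise ValueError(f"expected {len(NUTRITION_FIELDS)} values, got {len(values)}")
    return dict(zip(NUTRITION_FIELDS, (float(v) for v in values)))


def extract_outcomes(cell):
    """Keep only the three outcomes used in our hypotheses."""
    parsed = parse_nutrition(cell)
    return {key: parsed[key] for key in ("calories", "sugar_pdv", "total_fat_pdv")}
```

Raising an error on an unexpected length, rather than silently padding, makes malformed rows visible during wrangling so they can be counted and handled explicitly.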


## Ethics 

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

    > The dataset reflects recipes uploaded by users on Food.com (formerly GeniusKitchen) and cleaned by researchers at UC San Diego. Although there is a vast range of dietary preferences stored as tags, we acknowledge that the data may skew towards certain cultures and demographics, inhibiting its ability to generalize to other use cases.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

    > We will only use recipe-level data (ingredients, tags, nutrition). We will not analyze usernames, reviews, or any identifiable user metadata. The analysis focuses on aggregate nutritional properties rather than individuals. Although the data does include user IDs, we will drop these during the wrangling process.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

    > The dataset is publicly available on Kaggle and does not contain sensitive personal data. It will be stored locally and we intend to maintain any change history on GitHub, specifically for coursework.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

    > The dataset is publicly accessible and local copies will likely be retained only for the duration of the course project. We might upload aggregated data but this will be clearly marked and identifiable as different from the raw data.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

    > We intend to rely on users' own interpretation of features like 'healthy' and 'desserts', as expressed through the tags they assign to their uploaded recipes, which we acknowledge might be a biased perspective.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

    > We will examine class balance (honey vs sugar desserts), ingredient labeling consistency (checking for membership instead of absolute comparison), and missing nutrition values. Filtering decisions (e.g., defining a “primary sweetener”) may introduce bias, so we will document them clearly and keep them reproducible in downstream analysis.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

    > All summary statistics and visualizations will be reported with clear notes on assumptions, uncertainty, and limitations. Our intention is to represent the data faithfully in the most intuitive form we can.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

    > We will not analyze or display any user information which may be used to personally identify such individuals. We aim to keep all results at the recipe level.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

    > As we are completing all relevant work through GitHub and the corresponding Jupyter Notebooks, we believe that our results and discoveries will be easily accessible to anyone seeking to review our work in the future. We value transparency, which we hope will be visible through GitHub and its commit history.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

    > Although we selected calories, sugar and fat as our operational definition for healthiness, we understand that there are many other factors that can be taken into account in regards to physical health and diet. 

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?


    > When communicating our results, we aim to be clear that these statistics are based solely on the specific recipes we have analyzed. In doing so, we aim to avoid the perception that we are giving advice regarding personal diets.

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

    > We would like to avoid instances in which our conclusions are interpreted as health recommendations. As such, we aim to present our results from a statistical standpoint, rather than as conclusions that should serve as a guideline for others.



## Team Expectations 

* We will communicate virtually over Discord. We expect replies within six hours, providing ample time for us to update our project based on the comments and suggestions made by others. We expect to meet up prior to submissions, with standard check-ins being done through messages.
* We decided that we would seek effective communication, not being afraid to be blunt when needed, as long as it is constructive and helpful towards our goals. We will still be mindful of being polite and respectful to the other members while doing so.
* Our team's decisions will be made largely by majority vote, with consideration given to the scale of each decision. We do not believe it is necessary to notify others before making commits, since we have the git history to fall back on; however, if commits lead to conflicting ideas, these will be discussed in the group chat and decided by majority. If a teammate is unresponsive, we will continue to move forward without them and make adjustments upon their return as needed.
* We do not plan to assign a leader among members; instead, we will divide tasks and complete them in smaller groups or individually as we see fit. Tasks will be assigned by preference, with workload taken into account in the decision process.
* See the Project Timeline Proposal below.
* If someone is struggling to keep up, we expect them to communicate this at least one day before a deadline. The remaining members will then discuss how to redistribute the workload. If this continues to happen, a meeting may be set up to discuss how to fix the issue.
* We can include these expectations in the Discord server as a pinned message, as well as in this GitHub repo.

## Project Timeline Proposal

| Week  | Deliverable | Work Expectations |
|---|---|---|
| Week 5  | Project Proposal  | Select dataset to work with - Formulate guiding question and hypothesis - Complete checklist for ethics |
| Week 6  | - | Finalize dataset of interest - Look for adjacent datasets - Include qualitative metadata descriptions |
| Week 7  | Project Checkpoint 1  | Clean dataset - Wrangle data - Subset into aggregations of interest |
| Week 8  | - | Conduct rudimentary EDA - Ideate insightful visualizations - Narrow scope of investigation |
| Week 9  | Project Checkpoint 2 | Finalize graphs and charts - Include statistical tests - Finalize EDA and inference |
| Week 10  | - | Record missing project components - Improve for cohesion - Consider research extensions |
| Finals Week  | Final Project | Add final touches - Incorporate feedback and suggestions - Turn in submission |