# COGS 108 - Project Proposal

# Names

- Raymond Williams
- Jefferson Chen
- Olivia Cai
- Patrick Maloney
- Stanley Pan

# Research Question

How do different regional diets, as measured by vitamin, calorie, and macronutrient intake, relate to obesity rates and longevity as overall health indicators in the 21st century? (cross-sectional)

## Background and Prior Work

People in different geographical locations have access to different resources, climates, and agricultural commodities. As a result, regional diets differ, and so do their health impacts on the populations that eat them. According to the World Population Review,<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) the world’s obesity rate has grown dramatically over the past 50 years, from 13% to 39%. The United States has a high obesity rate compared to other high-income nations and ranks 12th worldwide. However, it is difficult to pinpoint which parts of the American diet and lifestyle contribute to this higher rate. Sonia Caprio and Stephen R. Daniels, both MDs, have studied the relationship between obesity and culture, ethnicity, and race.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Their research shows that the childhood obesity rate varies by race: Hispanic and Black children have a much higher obesity rate than White children. Because no obvious biological difference between races would cause such a gap, they raised the concern that food habits in the USA may contribute to the high obesity rate. So far, however, no data supports their hypothesis. In a global context, different regional diets are associated with different health benefits and drawbacks. This proposed study therefore seeks to examine the relationship between regional diets and health metrics such as weight and longevity.

Previous Work:
Recent research has identified “blue zones,” regions where people are known to live longer. Past studies have compared BMI with life expectancy, diet with life expectancy, and supplements with bodily function (for example, in athletes).
Lately, evidence has leaned toward the health benefits of plant-based diets over diets that rely on animal products. How food and living conditions affect health is not yet fully understood, but research continues to improve our knowledge.


Citations:

1. <a name="cite_note-1"></a> [^](#cite_ref-1) "Obesity Rates by Country 2024." *World Population Review*, 2024, https://worldpopulationreview.com/country-rankings/obesity-rates-by-country.
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Hossain, Parvez, Bisher Kawar, and Meguid El Nahas. "Obesity and Diabetes in the Developing World — A Growing Challenge." New England Journal of Medicine, vol. 356, no. 3, 2007, pp. 213-215. PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2571048/.

3. Henriksen, Lena, et al. "Estimating the Global Prevalence of Type 2 Diabetes Using a Bayesian Meta-Regression." PLOS Medicine, vol. 18, no. 6, June 2021, https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003889.

4. Link, Rachael. "What Are Blue Zones?." Healthline, 8 June 2020, https://www.healthline.com/nutrition/blue-zones#TOC_TITLE_HDR_3.

5. Ruiz, Jonathan, et al. "Public Health Strategies for Preventing and Controlling Overweight and Obesity in School and Worksite Settings." Morbidity and Mortality Weekly Report, vol. 54, no. RR10, Oct. 2005, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727853/.

6. O’Hearn, Michael, et al. "Trends in Diagnoses of HIV Infection in the United States, 2002-2011." Journal of the American Medical Association, vol. 312, no. 4, July 2014, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8879056/.


# Hypothesis


We believe that diets in regions with more nutritional variety and lower caloric intake lead to better health outcomes, measured as longevity (life expectancy and mortality rates) and obesity rates (based on BMI). We expect this because such diets contain fewer processed foods, which are more typical in larger countries that must feed a larger population (in turn lowering food quality). We must also consider confounding variables that may affect our metrics, such as socioeconomic status, access to healthcare, and environmental factors.
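
The concern about confounding variables can be illustrated with a small sketch. Everything here is an assumption for illustration: the data is synthetic, the column names `calories`, `gdp`, and `obesity` are hypothetical, and plain NumPy least squares stands in for whatever model the project ends up using. The point is only that a diet coefficient can change once a confounder is included.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up data purely for illustration (not real country values).
rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(0, 1, n)                    # hypothetical confounder (standardized)
calories = 0.6 * gdp + rng.normal(0, 1, n)   # diet variable correlated with the confounder
obesity = 0.5 * calories + 0.3 * gdp + rng.normal(0, 0.5, n)

df = pd.DataFrame({"calories": calories, "gdp": gdp, "obesity": obesity})


def ols_coefficients(data, outcome, predictors):
    """Fit ordinary least squares with an intercept and return coefficients by name."""
    columns = [np.ones(len(data))] + [data[p].to_numpy() for p in predictors]
    X = np.column_stack(columns)
    y = data[outcome].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(beta, index=["intercept"] + list(predictors))


# Model 1 ignores the confounder; model 2 adjusts for it.
naive = ols_coefficients(df, "obesity", ["calories"])
adjusted = ols_coefficients(df, "obesity", ["calories", "gdp"])
print("calories coefficient, unadjusted:", naive["calories"])
print("calories coefficient, adjusted for gdp:", adjusted["calories"])
```

In this synthetic setup the unadjusted coefficient absorbs part of the confounder's effect, so it overstates the diet effect relative to the adjusted one.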

# Data

1. The ideal data should include a large variety of countries that fairly represent their regions, or generalized regions that can represent their countries (e.g. Southeast Asia, Northwest Europe, Central America, etc.). For each region or country, we would have macronutrient and micronutrient intake and the main source(s) of those nutrients, which together describe the typical diet in that region. The data would also include population-level nutrition and health measures such as average caloric intake, obesity rate, weight/BMI, and life expectancy. As a bonus, it could include socioeconomic status and perhaps the percentage of processed foods in the region's typical diet.

2. Data sources:
    - https://globalnutritionreport.org/resources/nutrition-profiles/asia/
    - https://data.worldbank.org/indicator/SP.DYN.LE00.IN
    - https://www.prosperity.com/rankings?pinned=&rankOrScore=1&filter=
    - https://databank.worldbank.org/indicator/SH.STA.WASH.P5?Id=7f18f0c5&Report_Name=Health&populartype=series
    - https://ncdrisc.org/data-downloads-adiposity.html
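
Combining these sources will require a shared key. The sketch below assumes, without having checked the actual downloads, that the life expectancy table arrives in a wide layout (one column per year) and that both tables carry a three-letter country code. The frames, values, and the `life_expectancy` column name are made up for illustration. Note that joining on the code would mean keeping the `ISO` column, which the cleaning step in this notebook drops.

```python
import pandas as pd

# Made-up stand-in for a wide export: one row per country, one column per year.
life_wide = pd.DataFrame({
    "Country Name": ["Afghanistan", "Zimbabwe"],
    "Country Code": ["AFG", "ZWE"],
    "2015": [62.0, 59.0],   # placeholder values, not real figures
    "2016": [63.0, 60.0],
})

# Made-up stand-in for the BMI table (long format: one row per country and year).
bmi_long = pd.DataFrame({
    "ISO": ["AFG", "AFG", "ZWE", "ZWE"],
    "Year": [2015, 2016, 2015, 2016],
    "Mean BMI": [22.0, 22.1, 25.3, 25.4],   # placeholder values
})

# Reshape wide -> long so both tables are keyed by (country code, year).
life_long = life_wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="life_expectancy",
)
life_long["Year"] = life_long["Year"].astype(int)   # year labels start out as strings
life_long = life_long.rename(columns={"Country Code": "ISO"})

# Inner join keeps only country-years present in both tables.
merged = bmi_long.merge(life_long, on=["ISO", "Year"], how="inner")
print(merged[["ISO", "Year", "Mean BMI", "life_expectancy"]])
```

Joining on a country code rather than the country name avoids mismatches from spelling differences between sources.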


In [1]:
# Imports 
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
sns.set()
sns.set_context('talk')

import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", 104)
import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
# Note: the statsmodels import may print out a 'FutureWarning'. That's fine.

In [2]:
bmi = pd.read_csv("./bmi_country.csv")

In [3]:
bmi.tail()  # quick check that the file loaded correctly

Unnamed: 0,Country/Region/World,ISO,Sex,Year,Mean BMI,Mean BMI lower 95% uncertainty interval,Mean BMI upper 95% uncertainty interval,Prevalence of BMI>=30 kg/m² (obesity),Prevalence of BMI>=30 kg/m² lower 95% uncertainty interval,Prevalence of BMI>=30 kg/m² upper 95% uncertainty interval,Prevalence of BMI>=35 kg/m² (severe obesity),Prevalence of BMI>=35 kg/m² lower 95% uncertainty interval,Prevalence of BMI>=35 kg/m² upper 95% uncertainty interval,Prevalence of BMI<18.5 kg/m² (underweight),Prevalence of BMI<18.5 kg/m² lower 95% uncertainty interval,Prevalence of BMI<18.5 kg/m² upper 95% uncertainty interval,Prevalence of BMI 18.5 kg/m² to <20 kg/m²,Prevalence of BMI 18.5 kg/m² to <20 kg/m² lower 95% uncertainty interval,Prevalence of BMI 18.5 kg/m² to <20 kg/m² upper 95% uncertainty interval,Prevalence of BMI 20 kg/m² to <25 kg/m²,Prevalence of BMI 20 kg/m² to <25 kg/m² lower 95% uncertainty interval,Prevalence of BMI 20 kg/m² to <25 kg/m² upper 95% uncertainty interval,Prevalence of BMI 25 kg/m² to <30 kg/m²,Prevalence of BMI 25 kg/m² to <30 kg/m² lower 95% uncertainty interval,Prevalence of BMI 25 kg/m² to <30 kg/m² upper 95% uncertainty interval,Prevalence of BMI 30 kg/m² to <35 kg/m²,Prevalence of BMI 30 kg/m² to <35 kg/m² lower 95% uncertainty interval,Prevalence of BMI 30 kg/m² to <35 kg/m² upper 95% uncertainty interval,Prevalence of BMI 35 kg/m² to <40 kg/m²,Prevalence of BMI 35 kg/m² to <40 kg/m² lower 95% uncertainty interval,Prevalence of BMI 35 kg/m² to <40 kg/m² upper 95% uncertainty interval,Prevalence of BMI >=40 kg/m²(morbid obesity),Prevalence of BMI >=40 kg/m² lower 95% uncertainty interval,Prevalence of BMI >=40 kg/m² upper 95% uncertainty interval
16795,Zimbabwe,ZWE,Women,2012,25.28484,24.743206,25.821593,0.244698,0.190822,0.305243,0.084217,0.051268,0.123961,0.049484,0.025195,0.085525,0.072888,0.046924,0.105304,0.355026,0.298633,0.409318,0.277904,0.227081,0.331028,0.160482,0.111942,0.217798,0.055707,0.028674,0.090946,0.028509,0.012218,0.053371
16796,Zimbabwe,ZWE,Women,2013,25.319948,24.741024,25.898562,0.249242,0.193386,0.311415,0.086849,0.052546,0.128145,0.0487,0.024258,0.085014,0.071592,0.045271,0.104518,0.351572,0.292811,0.407308,0.278894,0.226437,0.33369,0.162394,0.112366,0.222045,0.057094,0.028954,0.093907,0.029754,0.012482,0.056003
16797,Zimbabwe,ZWE,Women,2014,25.356222,24.722863,25.987488,0.253836,0.195761,0.318625,0.089553,0.053311,0.133574,0.047933,0.023198,0.084839,0.070315,0.043753,0.103858,0.348091,0.287157,0.406232,0.279826,0.225365,0.337415,0.164283,0.112489,0.22576,0.0585,0.02939,0.097146,0.031054,0.012538,0.059172
16798,Zimbabwe,ZWE,Women,2015,25.393409,24.719608,26.084431,0.258485,0.197569,0.325454,0.092343,0.054001,0.138658,0.047186,0.022235,0.085674,0.069068,0.041999,0.103307,0.344566,0.280957,0.405227,0.280695,0.223895,0.341319,0.166141,0.112342,0.230022,0.059935,0.029393,0.100432,0.032408,0.012767,0.063016
16799,Zimbabwe,ZWE,Women,2016,25.431487,24.701907,26.192666,0.26324,0.199351,0.333036,0.095246,0.054872,0.144509,0.04645,0.021426,0.086171,0.067832,0.040147,0.103454,0.34097,0.275426,0.404775,0.281509,0.221546,0.344713,0.167994,0.111852,0.234472,0.061414,0.029525,0.103569,0.033832,0.012837,0.066981


In [4]:
bmi.shape

(16800, 34)

In [5]:
# CLEANING UP TABLE
bmi = bmi[bmi['Year'] >= 2000].copy()  # keep only rows from 2000 onward; .copy() avoids modifying a view
bmi = bmi.drop(columns='ISO', errors='ignore')  # drop the ISO column, which we do not use
bmi.head()

Unnamed: 0,Country/Region/World,Sex,Year,Mean BMI,Mean BMI lower 95% uncertainty interval,Mean BMI upper 95% uncertainty interval,Prevalence of BMI>=30 kg/m² (obesity),Prevalence of BMI>=30 kg/m² lower 95% uncertainty interval,Prevalence of BMI>=30 kg/m² upper 95% uncertainty interval,Prevalence of BMI>=35 kg/m² (severe obesity),Prevalence of BMI>=35 kg/m² lower 95% uncertainty interval,Prevalence of BMI>=35 kg/m² upper 95% uncertainty interval,Prevalence of BMI<18.5 kg/m² (underweight),Prevalence of BMI<18.5 kg/m² lower 95% uncertainty interval,Prevalence of BMI<18.5 kg/m² upper 95% uncertainty interval,Prevalence of BMI 18.5 kg/m² to <20 kg/m²,Prevalence of BMI 18.5 kg/m² to <20 kg/m² lower 95% uncertainty interval,Prevalence of BMI 18.5 kg/m² to <20 kg/m² upper 95% uncertainty interval,Prevalence of BMI 20 kg/m² to <25 kg/m²,Prevalence of BMI 20 kg/m² to <25 kg/m² lower 95% uncertainty interval,Prevalence of BMI 20 kg/m² to <25 kg/m² upper 95% uncertainty interval,Prevalence of BMI 25 kg/m² to <30 kg/m²,Prevalence of BMI 25 kg/m² to <30 kg/m² lower 95% uncertainty interval,Prevalence of BMI 25 kg/m² to <30 kg/m² upper 95% uncertainty interval,Prevalence of BMI 30 kg/m² to <35 kg/m²,Prevalence of BMI 30 kg/m² to <35 kg/m² lower 95% uncertainty interval,Prevalence of BMI 30 kg/m² to <35 kg/m² upper 95% uncertainty interval,Prevalence of BMI 35 kg/m² to <40 kg/m²,Prevalence of BMI 35 kg/m² to <40 kg/m² lower 95% uncertainty interval,Prevalence of BMI 35 kg/m² to <40 kg/m² upper 95% uncertainty interval,Prevalence of BMI >=40 kg/m²(morbid obesity),Prevalence of BMI >=40 kg/m² lower 95% uncertainty interval,Prevalence of BMI >=40 kg/m² upper 95% uncertainty interval
25,Afghanistan,Men,2000,21.369506,19.153423,23.559675,0.012297,0.004712,0.026176,0.000582,0.000112,0.001755,0.218857,0.116488,0.329003,0.181228,0.115501,0.257605,0.474904,0.392218,0.562829,0.112715,0.074823,0.161792,0.011715,0.004119,0.025631,0.00044,5.3e-05,0.001507,0.000142,8e-06,0.000623
26,Afghanistan,Men,2001,21.456846,19.226323,23.641168,0.013129,0.005102,0.02775,0.000649,0.000128,0.001936,0.215744,0.114774,0.325905,0.178977,0.113907,0.254904,0.47619,0.394457,0.562832,0.115959,0.077215,0.165228,0.012479,0.00448,0.027253,0.000489,6e-05,0.001679,0.00016,1e-05,0.000694
27,Afghanistan,Men,2002,21.543463,19.301239,23.727827,0.014011,0.005522,0.029456,0.000724,0.000145,0.002104,0.212595,0.112654,0.322228,0.176702,0.112743,0.251709,0.477428,0.396318,0.563393,0.119263,0.080114,0.168613,0.013287,0.004841,0.028726,0.000544,7e-05,0.001836,0.000181,1.1e-05,0.000769
28,Afghanistan,Men,2003,21.628227,19.394877,23.830242,0.014947,0.006002,0.031181,0.000807,0.000167,0.002341,0.209419,0.110636,0.318674,0.174411,0.110997,0.248713,0.478601,0.397541,0.564361,0.122622,0.082665,0.173029,0.01414,0.005196,0.030535,0.000604,8e-05,0.002018,0.000204,1.3e-05,0.00086
29,Afghanistan,Men,2004,21.710447,19.476779,23.912341,0.015937,0.006489,0.032795,0.0009,0.000187,0.002584,0.206212,0.108702,0.31584,0.172098,0.109884,0.246124,0.479721,0.399063,0.564695,0.126031,0.085312,0.177013,0.015038,0.005581,0.032024,0.00067,9.2e-05,0.002221,0.00023,1.5e-05,0.000959


In [6]:
bmi.shape

(6800, 33)
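
The two shapes are consistent with a complete country-by-sex-by-year panel: 6800 = 200 × 2 × 17 (the years 2000 through 2016), and 16800 = 200 × 2 × 42. That reading is an inference from the row counts, not something the notebook verifies. A minimal sketch of such a check, run on a made-up miniature table (the helper name `is_complete_panel` is hypothetical):

```python
from itertools import product

import pandas as pd

# Made-up miniature panel: 3 countries x 2 sexes x 4 years (illustration only).
countries = ["A", "B", "C"]
sexes = ["Men", "Women"]
years = [1998, 1999, 2000, 2001]
panel = pd.DataFrame(
    list(product(countries, sexes, years)),
    columns=["Country/Region/World", "Sex", "Year"],
)

# Same filter as the cleaning cell: keep 2000 onward.
recent = panel[panel["Year"] >= 2000]


def is_complete_panel(df):
    """Return True if each (country, sex, year) combination appears exactly once."""
    keys = ["Country/Region/World", "Sex", "Year"]
    no_duplicates = not df.duplicated(subset=keys).any()
    expected_rows = (
        df["Country/Region/World"].nunique()
        * df["Sex"].nunique()
        * df["Year"].nunique()
    )
    return bool(no_duplicates and len(df) == expected_rows)


print(len(panel), len(recent), is_complete_panel(recent))
```

If the same check fails on the real table, some country-years are missing or duplicated and should be investigated before any averaging.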

Next up: standardizing the table, keeping sex and the relevant BMI information.

In [7]:
def standardize_gender(str_in):
    """Map the dataset's 'Men'/'Women' labels to 'M'/'F'; anything else becomes NaN."""
    # Missing or non-string values cannot be standardized
    if not isinstance(str_in, str):
        return np.nan
    str_in = str_in.lower().strip()
    # Check 'women' first, because 'men' is a substring of 'women'
    if 'women' in str_in:
        return 'F'
    elif 'men' in str_in:
        return 'M'
    return np.nan

In [8]:
bmi['Sex'] = bmi['Sex'].apply(standardize_gender)
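
As an alternative to applying a function row by row, the same standardization can be sketched with vectorized string methods and a lookup table. This is not the notebook's method, just an equivalent under the assumption that the only valid labels are "Men" and "Women"; the sample labels are made up.

```python
import pandas as pd

# Made-up labels, including messy casing, whitespace, a missing value, and an unknown label.
sex = pd.Series(["Men", "Women", " women ", "MEN", None, "other"])

# Normalize the text, then look each label up in a fixed mapping.
# Labels that are missing or not in the mapping become NaN.
sex_codes = sex.str.strip().str.lower().map({"men": "M", "women": "F"})
print(sex_codes.tolist())
```

Exact matching against a mapping is stricter than a substring test, so unexpected labels surface as missing values instead of being silently coerced.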

In [9]:
# Drop the uncertainty-interval columns; we can add them back later if EDA needs them
interval_cols = [col for col in bmi.columns if 'uncertainty interval' in col]
bmi = bmi.drop(columns=interval_cols)
bmi

Unnamed: 0,Country/Region/World,Sex,Year,Mean BMI,Prevalence of BMI>=30 kg/m² (obesity),Prevalence of BMI>=35 kg/m² (severe obesity),Prevalence of BMI<18.5 kg/m² (underweight),Prevalence of BMI 18.5 kg/m² to <20 kg/m²,Prevalence of BMI 20 kg/m² to <25 kg/m²,Prevalence of BMI 25 kg/m² to <30 kg/m²,Prevalence of BMI 30 kg/m² to <35 kg/m²,Prevalence of BMI 35 kg/m² to <40 kg/m²,Prevalence of BMI >=40 kg/m²(morbid obesity)
25,Afghanistan,M,2000,21.369506,0.012297,0.000582,0.218857,0.181228,0.474904,0.112715,0.011715,0.000440,0.000142
26,Afghanistan,M,2001,21.456846,0.013129,0.000649,0.215744,0.178977,0.476190,0.115959,0.012479,0.000489,0.000160
27,Afghanistan,M,2002,21.543463,0.014011,0.000724,0.212595,0.176702,0.477428,0.119263,0.013287,0.000544,0.000181
28,Afghanistan,M,2003,21.628227,0.014947,0.000807,0.209419,0.174411,0.478601,0.122622,0.014140,0.000604,0.000204
29,Afghanistan,M,2004,21.710447,0.015937,0.000900,0.206212,0.172098,0.479721,0.126031,0.015038,0.000670,0.000230
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16795,Zimbabwe,F,2012,25.284840,0.244698,0.084217,0.049484,0.072888,0.355026,0.277904,0.160482,0.055707,0.028509
16796,Zimbabwe,F,2013,25.319948,0.249242,0.086849,0.048700,0.071592,0.351572,0.278894,0.162394,0.057094,0.029754
16797,Zimbabwe,F,2014,25.356222,0.253836,0.089553,0.047933,0.070315,0.348091,0.279826,0.164283,0.058500,0.031054
16798,Zimbabwe,F,2015,25.393409,0.258485,0.092343,0.047186,0.069068,0.344566,0.280695,0.166141,0.059935,0.032408
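
The remaining column names are long and contain symbols, which makes them awkward to type in later analysis. One option is a rename step. The short names below are hypothetical suggestions, and the one-row frame is made up; only the original column names are taken from the table above.

```python
import pandas as pd

# Made-up one-row frame that reuses three of the real column names.
demo = pd.DataFrame({
    "Country/Region/World": ["Afghanistan"],
    "Mean BMI": [21.4],
    "Prevalence of BMI>=30 kg/m² (obesity)": [0.012],
})

# Hypothetical short names; the team may prefer different ones.
short_names = {
    "Country/Region/World": "country",
    "Mean BMI": "mean_bmi",
    "Prevalence of BMI>=30 kg/m² (obesity)": "obesity_rate",
}

# rename() leaves any column not listed in the mapping unchanged.
demo = demo.rename(columns=short_names)
print(list(demo.columns))
```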


In [None]:
bmi.to_csv('./datasets/bmi_country_clean.csv', index=False)



# Ethics & Privacy

The data used in this project is already publicly available through resources such as Kaggle. Representation may be an issue: if a dataset does not include every country, we cannot give a fully accurate answer about which diet is healthiest, although results can still be generalized by region. We do not expect the source databases themselves to be biased, but inferring socioeconomic status for less wealthy countries may create a preconception that their diets are inferior to those of wealthier countries. To address these issues, we will collect a diverse set of data that is as representative of countries and regions as possible, while focusing on the nutritional information of each region’s diet.
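
The representation concern can be made concrete by listing which countries appear in one source but not another. The sketch below uses made-up country lists and a hypothetical `country` column; it relies on the `indicator=True` option of `DataFrame.merge`, which labels each row by where it was found.

```python
import pandas as pd

# Made-up country lists standing in for two different sources.
bmi_countries = pd.DataFrame({"country": ["Afghanistan", "Japan", "Zimbabwe"]})
diet_countries = pd.DataFrame({"country": ["Japan", "Zimbabwe", "Peru"]})

# An outer merge with indicator=True adds a '_merge' column with the values
# 'both', 'left_only', or 'right_only'.
coverage = bmi_countries.merge(
    diet_countries, on="country", how="outer", indicator=True
)

missing_from_diet = coverage.loc[coverage["_merge"] == "left_only", "country"].tolist()
missing_from_bmi = coverage.loc[coverage["_merge"] == "right_only", "country"].tolist()
print("In BMI data only:", missing_from_diet)
print("In diet data only:", missing_from_bmi)
```

Reporting these lists alongside the results would show readers which regions the conclusions do and do not cover.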

# Team Expectations 


1. Expect to meet at least once a week. If a member cannot make it, or it is a busy week, try to hold 2 meetings so that everyone has a chance to work, or split into random groups.

2. Communicate ahead of time if you are going to miss an event.

3. Work to each other’s strengths and weaknesses (coding, reports).

4. Resolve any conflicts together as a group, and maintain some form of conformity throughout the project.


# Project Timeline Proposal




| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/7  |  3 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/11  |  1 PM | Find suitable datasets, do more background research, and finish the general layout of the proposal. | Determine which dataset(s) to use and the ethics behind them. Finalize the project proposal. |
| 2/14  | 6 PM  | Describe the dataset(s) and briefly review their contents. | Discuss wrangling and possible analytical approaches; assign group members to lead each specific part |
| 2/21  | 3 PM  | Fully summarize each dataset and start cleaning the datasets. | Continue converting the datasets into a usable format. Summarize the reasons and strategies used to clean the datasets. |
| 2/25  | 1 PM  | Each member finishes cleaning their datasets. | Complete the dataset summaries and do a final check for anything missed during cleaning. |
| 2/28  | 3 PM  | Import and wrangle data. Brainstorm potential ways to analyze the datasets. | Review/edit wrangling/EDA; discuss analysis plan |
| 3/6  | 3 PM  | Each member finishes most of the analysis for their assigned parts and writes a brief summary of what they did. | Gather the analysis results from each person and discuss whether any additional analysis should be done. Make sure the procedures are correct, and start writing summaries. |
| 3/10  | 1 PM  | Complete analysis of the datasets. | Double-check the analysis. Finalize the summaries. |
| 3/13  | 3 PM  | Brainstorm how to put everything together | Discuss how to put everything together. Start combining the parts and writing up summaries. |
| 3/20  | 3 PM  | Finalize everything | Finalize everything. |