# AHEI
While looking for a target variable, we found a strong candidate: the Alternative Healthy Eating Index (AHEI), which scores how healthy a person's diet is based on its food components (carbohydrates, protein, ...).

## Process
Our team looked for an already-calculated index for measuring diets, but most were private or took into account other factors, such as underweight or hypertension, that could bias our model. To work around this, we used the `Global Dietary Database` (GDD) to get the diet composition of each country.

### Formula Problem
Even though the GDD gives us an understanding of the food, we still ran into the problem of calculating the index, because some categories did not relate easily to our task.

In [11]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse, r2_score as r2
import numpy as np

In [2]:
labels = pd.read_csv("data/raw/GDDLabels.csv", sep=";")
labels = labels.set_index("File").to_dict()["Group"]
labels

{'v01': 'Fruits',
 'v02': 'Non-starchy vegetables',
 'v03': 'Potatoes',
 'v04': 'Other starchy vegetables',
 'v05': 'Beans and legumes',
 'v06': 'Nuts and seeds',
 'v07': 'Refined grains',
 'v08': 'Whole grains',
 'v09': 'Total processed meats',
 'v10': 'Unprocessed red meats',
 'v11': 'Total seafoods',
 'v12': 'Eggs',
 'v13': 'Cheese',
 'v14': 'Yoghurt (including fermented milk)',
 'v15': 'Sugar-sweetened beverages',
 'v16': 'Fruit juices',
 'v17': 'Coffee',
 'v18': 'Tea',
 'v22': 'Total carbohydrates',
 'v23': 'Total protein',
 'v27': 'Saturated fat',
 'v28': 'Monounsaturated fatty acids',
 'v29': 'Total omega-6 fat',
 'v30': 'Seafood omega-3 fat',
 'v31': 'Plant omega-3 fat',
 'v33': 'Dietary cholesterol',
 'v34': 'Dietary fiber',
 'v35': 'Added sugars',
 'v36': 'Calcium',
 'v37': 'Dietary sodium',
 'v38': 'Iodine',
 'v39': 'Iron',
 'v40': 'Magnesium',
 'v41': 'Potassium',
 'v42': 'Selenium',
 'v43': 'Vitamin A w/ supplements',
 'v45': 'Vitamin B1',
 'v46': 'Vitamin B2',
 'v47': 'Vi

In [3]:
ids = ["iso3", "age", "female", "urban", "edu", "year"]

In [4]:
df = pd.DataFrame()
for file, label in labels.items():
    p = pd.read_csv(f"data/raw/NutritionPerCountry/{file}_cnty.csv")
    p = p.query("year == 2018")
    if len(df) == 0:
        # copy the id columns so later assignments do not write into a slice of p
        df = p[ids].copy()
    # the assignment aligns on the row index, so every file must share the same row layout
    df[label] = p["median"]
    print(file)
df



v01
v02
v03
v04
v05
v06
v07
v08
v09
v10
v11
v12
v13
v14
v15
v16
v17
v18
v22
v23
v27
v28
v29
v30
v31
v33
v34
v35
v36
v37
v38
v39
v40
v41
v42
v43
v45
v46
v47
v48
v49
v50
v51
v52
v53
v54
v57


Unnamed: 0,iso3,age,female,urban,edu,year,Fruits,Non-starchy vegetables,Potatoes,Other starchy vegetables,...,Vitamin B2,Vitamin B3,Vitamin B6,Vitamin B9 (Folate),Vitamin B12,Vitamin C,Vitamin D,Vitamin E,Zinc,Total Milk
4968,AFG,0.5,0,0,1,2018,8.467415,3.419693,3.859324,2.081005,...,0.538343,3.395698,0.344610,76.739720,,13.618115,0.418857,3.420298,3.194506,29.735829
4969,AFG,0.5,0,0,2,2018,12.226313,3.537751,3.893707,2.404693,...,0.566551,3.416773,0.346497,79.361680,,14.684184,0.458032,3.841461,3.223998,45.792801
4970,AFG,0.5,0,0,3,2018,16.246001,3.996102,3.433671,3.003255,...,0.594985,3.526458,0.354356,81.129293,,15.635696,0.454475,3.575676,3.190349,59.877288
4971,AFG,0.5,0,0,999,2018,9.376535,3.522067,4.030040,2.399389,...,0.549927,3.445685,0.346067,77.972977,,13.980427,0.437349,3.520146,3.239943,33.499286
4972,AFG,0.5,0,1,1,2018,11.554088,3.527541,3.855074,2.500543,...,0.445774,3.523774,0.292269,58.046698,,13.361231,0.476264,3.563376,2.640269,37.124894
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1072255,ZWE,999.0,999,1,999,2018,96.531649,138.124881,57.401845,35.223207,...,1.089616,16.305069,1.499906,142.819868,,84.212003,1.600584,9.357011,6.002634,86.422827
1072256,ZWE,999.0,999,999,1,2018,64.691141,129.133868,43.457593,27.481140,...,1.098965,15.493286,1.507762,151.845288,,96.411845,2.549639,10.064455,5.561608,41.783699
1072257,ZWE,999.0,999,999,2,2018,84.075382,139.050109,45.981225,32.567803,...,0.923139,15.928116,1.494387,154.753091,,94.571968,2.522924,9.386910,5.628539,64.816385
1072258,ZWE,999.0,999,999,3,2018,114.820538,165.455230,45.810081,43.623002,...,0.962904,16.193286,1.520831,159.100704,,99.694917,2.367599,9.070019,5.744827,107.152291
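
The loop above matches values across files through the row index, which works only while every per-food file lists its rows in the same order. A minimal sketch of a key-based alternative follows. The `frames` dict and all of its values are invented stand-ins for the GDD files, not real data.

```python
import pandas as pd

ids = ["iso3", "age", "female", "urban", "edu", "year"]

# Hypothetical stand-ins for two per-food files; the values are invented.
frames = {
    "Fruits": pd.DataFrame({
        "iso3": ["AFG", "ZWE"], "age": [999.0, 999.0], "female": [0, 0],
        "urban": [999, 999], "edu": [999, 999], "year": [2018, 2018],
        "median": [64.3, 96.5],
    }),
    "Potatoes": pd.DataFrame({
        # Rows deliberately listed in a different order than in "Fruits".
        "iso3": ["ZWE", "AFG"], "age": [999.0, 999.0], "female": [0, 0],
        "urban": [999, 999], "edu": [999, 999], "year": [2018, 2018],
        "median": [57.4, 21.4],
    }),
}

# Index each file by the stratum keys, then concatenate column-wise so that
# values are matched on the keys rather than on row position.
wide = pd.concat(
    {label: p.set_index(ids)["median"] for label, p in frames.items()},
    axis=1,
).reset_index()
print(wide)
```

With this layout a reordered or filtered file cannot silently shift values onto the wrong country.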


In [5]:
ahei = pd.read_csv("data/raw/AHEI_2018.csv")
ahei = ahei.rename(columns={
    "educ": "edu",
    "median_x": "ahei",
}).drop(columns=["lowerci_95", "upperci_95"])
ahei = ahei.dropna()
ahei

Unnamed: 0,iso3,age,female,urban,edu,year,ahei
0,AFG,999.0,0,999,999,2018,42.180344
1,AFG,999.0,1,999,999,2018,43.432438
2,AGO,999.0,0,999,999,2018,50.085574
3,AGO,999.0,1,999,999,2018,51.294205
4,ALB,999.0,0,999,999,2018,27.830695
...,...,...,...,...,...,...,...
104335,WSM,999.0,999,999,999,2018,53.812846
104336,YEM,999.0,999,999,999,2018,26.489124
104337,ZAF,999.0,999,999,999,2018,34.307336
104338,ZMB,999.0,999,999,999,2018,44.357181


In [6]:
# each ahei row scores one unique combination of the id columns, so duplicated
# keys are unlikely and the merge should pair every score with a single intake row
target = ahei.merge(df, on=["iso3", "age", "female", "urban", "edu", "year"])
target

Unnamed: 0,iso3,age,female,urban,edu,year,ahei,Fruits,Non-starchy vegetables,Potatoes,...,Vitamin B2,Vitamin B3,Vitamin B6,Vitamin B9 (Folate),Vitamin B12,Vitamin C,Vitamin D,Vitamin E,Zinc,Total Milk
0,AFG,999.0,0,999,999,2018,42.180344,64.310693,96.959429,21.388945,...,0.969057,13.600030,1.132735,189.094325,,39.819256,1.608323,6.490088,7.843413,67.882308
1,AFG,999.0,1,999,999,2018,43.432438,67.045325,98.105945,20.509140,...,0.942085,13.313891,1.072211,190.353375,,44.583268,1.569744,6.301344,7.395684,71.035382
2,AGO,999.0,0,999,999,2018,50.085574,118.034637,303.751559,356.682550,...,1.300084,13.607065,1.670666,263.978902,,171.654620,1.719658,8.018062,7.390759,31.886074
3,AGO,999.0,1,999,999,2018,51.294205,120.993093,313.006360,336.141230,...,1.287358,13.273713,1.673857,267.181379,,187.499629,1.709039,7.896378,7.140587,33.545413
4,ALB,999.0,0,999,999,2018,27.830695,125.235001,120.039640,195.516010,...,1.687270,30.579335,2.355194,277.807283,,124.303422,5.350403,15.343191,8.327896,111.278294
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104047,WSM,999.0,999,999,999,2018,53.812846,140.298982,336.380953,13.158776,...,1.343826,14.991218,1.211559,224.222861,,69.688241,4.504330,20.973773,8.271951,135.140556
104048,YEM,999.0,999,999,999,2018,26.489124,39.336626,58.530138,80.552301,...,1.145410,21.005360,1.756198,171.501288,,82.130158,1.221586,11.922963,7.296379,101.881131
104049,ZAF,999.0,999,999,999,2018,34.307336,32.667852,131.801080,54.860450,...,1.542240,17.290615,1.246154,194.294094,,37.887315,2.943259,8.515702,9.207601,94.349931
104050,ZMB,999.0,999,999,999,2018,44.357181,51.185607,195.758520,23.045660,...,1.313372,13.387016,1.419700,233.872893,,88.267127,1.805338,11.208324,8.040751,31.137654
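
The merged table has fewer rows than the AHEI table (the last index drops from 104338 to 104050), so some strata found no match. One way to check both the uniqueness assumption and the unmatched rows is sketched below on two invented miniature tables. The `validate` and `indicator` arguments are standard `DataFrame.merge` parameters, while the shortened key list and all values are assumptions made for illustration.

```python
import pandas as pd

keys = ["iso3", "female"]  # shortened key list, for illustration only

# Hypothetical miniature versions of the two tables; values are invented.
scores = pd.DataFrame({"iso3": ["AFG", "AFG", "ALB"], "female": [0, 1, 0],
                       "ahei": [42.2, 43.4, 27.8]})
intake = pd.DataFrame({"iso3": ["AFG", "AFG", "AGO"], "female": [0, 1, 0],
                       "Fruits": [64.3, 67.0, 118.0]})

# validate= raises pandas.errors.MergeError if a key combination repeats;
# indicator= adds a "_merge" column recording where each row came from.
check = scores.merge(intake, on=keys, how="outer",
                     validate="one_to_one", indicator=True)
print(check["_merge"].value_counts())

# Keep only strata present in both tables, as the default inner merge does.
matched = check[check["_merge"] == "both"].drop(columns="_merge")
print(matched)
```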


In [12]:
# assumption: a NaN may mean the food is not consumed at all (red meat, for example)
ref_target = target.fillna(0)
for col in target.columns:
    if col not in ["iso3", "year"]:
        # fill gaps with the median of the zero-filled column
        target[col] = target[col].fillna(ref_target[col].median())
# safety net for anything that is still missing
target = target.fillna(0)
target

Unnamed: 0,iso3,age,female,urban,edu,year,ahei,Fruits,Non-starchy vegetables,Potatoes,...,Vitamin B2,Vitamin B3,Vitamin B6,Vitamin B9 (Folate),Vitamin B12,Vitamin C,Vitamin D,Vitamin E,Zinc,Total Milk
0,AFG,999.0,0,999,999,2018,42.180344,64.310693,96.959429,21.388945,...,0.969057,13.600030,1.132735,189.094325,0.0,39.819256,1.608323,6.490088,7.843413,67.882308
1,AFG,999.0,1,999,999,2018,43.432438,67.045325,98.105945,20.509140,...,0.942085,13.313891,1.072211,190.353375,0.0,44.583268,1.569744,6.301344,7.395684,71.035382
2,AGO,999.0,0,999,999,2018,50.085574,118.034637,303.751559,356.682550,...,1.300084,13.607065,1.670666,263.978902,0.0,171.654620,1.719658,8.018062,7.390759,31.886074
3,AGO,999.0,1,999,999,2018,51.294205,120.993093,313.006360,336.141230,...,1.287358,13.273713,1.673857,267.181379,0.0,187.499629,1.709039,7.896378,7.140587,33.545413
4,ALB,999.0,0,999,999,2018,27.830695,125.235001,120.039640,195.516010,...,1.687270,30.579335,2.355194,277.807283,0.0,124.303422,5.350403,15.343191,8.327896,111.278294
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104047,WSM,999.0,999,999,999,2018,53.812846,140.298982,336.380953,13.158776,...,1.343826,14.991218,1.211559,224.222861,0.0,69.688241,4.504330,20.973773,8.271951,135.140556
104048,YEM,999.0,999,999,999,2018,26.489124,39.336626,58.530138,80.552301,...,1.145410,21.005360,1.756198,171.501288,0.0,82.130158,1.221586,11.922963,7.296379,101.881131
104049,ZAF,999.0,999,999,999,2018,34.307336,32.667852,131.801080,54.860450,...,1.542240,17.290615,1.246154,194.294094,0.0,37.887315,2.943259,8.515702,9.207601,94.349931
104050,ZMB,999.0,999,999,999,2018,44.357181,51.185607,195.758520,23.045660,...,1.313372,13.387016,1.419700,233.872893,0.0,88.267127,1.805338,11.208324,8.040751,31.137654
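
The imputation cell takes the median after replacing NaN with 0, which pulls the fill value toward zero compared with a median that skips missing entries. It also explains why a column with no values at all, as Vitamin B12 appears to be in the rows shown, ends up as 0.0. A small sketch on invented numbers shows the difference:

```python
import numpy as np
import pandas as pd

# Invented intake values with two missing entries.
s = pd.Series([10.0, 20.0, 30.0, np.nan, np.nan])

median_skipping_nan = s.median()               # NaN ignored: median of 10, 20, 30
median_after_zero_fill = s.fillna(0).median()  # median of 0, 0, 10, 20, 30

print(median_skipping_nan, median_after_zero_fill)  # 20.0 10.0

# A column that is entirely NaN has no median to fall back on, so a final
# fillna(0) is still needed for it.
empty = pd.Series([np.nan, np.nan])
filled_empty = empty.fillna(empty.median()).fillna(0).tolist()
print(filled_empty)  # [0.0, 0.0]
```

Which of the two medians is appropriate depends on whether a missing value really means zero consumption, which the data alone does not confirm.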


In [13]:
m = LinearRegression()

In [14]:
x = target[[col for col in target.columns if col not in ["ahei"] + ids]].to_numpy()
y = target["ahei"]
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

In [15]:
m.fit(xtrain, ytrain)

LinearRegression()

In [16]:
print(r2(ytrain, m.predict(xtrain)))
print(r2(ytest, m.predict(xtest)))

0.7820664959668818
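
`train_test_split` without a `random_state` draws a different split on every run, and a purely random split can place strata of the same country on both sides. The sketch below shows one possible alternative that holds out whole groups with `GroupShuffleSplit`. It runs on synthetic data, and using `iso3` as the grouping key is an assumption, not something the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 hypothetical countries with 10 strata each.
groups = np.repeat(np.arange(20), 10)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Hold out whole countries; random_state fixes the split across runs.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

model = LinearRegression().fit(X[train_idx], y[train_idx])
test_r2 = r2_score(y[test_idx], model.predict(X[test_idx]))
print(test_r2)
```

On the real table, `groups` would be the `iso3` column of `target`, so the test score reflects countries the model has never seen.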
0.7807859116466866


In [17]:
from sklearn.tree import DecisionTreeRegressor

In [18]:
t = DecisionTreeRegressor()
t.fit(xtrain, ytrain)

DecisionTreeRegressor()

In [19]:
# the tree reproduces the training set exactly (R² = 1.0) and still scores about 0.97 on the test split
print(r2(ytrain, t.predict(xtrain)))
print(r2(ytest, t.predict(xtest)))

1.0
0.9736318529417899
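
A training R² of exactly 1.0 is what an unconstrained decision tree produces by construction, so the test score is the number that matters. A hedged sketch of how cross-validation and depth limits could be used to probe this is shown below. The data are synthetic and the `max_depth` and `min_samples_leaf` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and the AHEI target.
X = rng.uniform(size=(300, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# An unconstrained tree keeps splitting until it reproduces the training data.
full = DecisionTreeRegressor(random_state=0).fit(X, y)
train_r2 = full.score(X, y)
print(train_r2)

# Five-fold cross-validation gives a less optimistic estimate, and
# max_depth / min_samples_leaf limit how closely the tree can memorise.
cv_means = {}
for depth in (None, 3, 6):
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=5, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring="r2")
    cv_means[depth] = scores.mean()
    print(depth, cv_means[depth])
```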


In [20]:
import pickle

In [21]:
with open("models/ahei_tree.pkl", "wb") as f:
    pickle.dump(t, f)
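
A saved model is only useful if it can be loaded back and gives the same predictions. The round trip can be sketched as follows, using a throwaway tree trained on synthetic data and a temporary file instead of `models/ahei_tree.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))  # synthetic features, for illustration only
y = X.sum(axis=1)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Write to a temporary directory so the sketch does not touch models/.
path = os.path.join(tempfile.mkdtemp(), "tree.pkl")
with open(path, "wb") as f:
    pickle.dump(tree, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(tree.predict(X), restored.predict(X))
print(same)
```

Because the tree in this notebook was fitted on a NumPy array, the pickle stores no column names. Anyone loading it has to pass the features in the same column order used for training, and should only unpickle files from a trusted source.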