# Day 3 - Banana Index

Today I'm using this dataset from The Economist (https://github.com/TheEconomist/banana-index-data/tree/master), which ranks foods by their CO2-equivalent emissions relative to bananas (the banana index).
I'm focusing on plotting and data wrangling to improve those skills.

In [16]:
# Libraries
import pandas as pd
import altair as alt

In [9]:
df_banana = pd.read_csv("bananaindex.csv")

In [10]:
df_banana

Unnamed: 0,entity,year,emissions_kg,emissions_1000kcal,emissions_100g_protein,emissions_100g_fat,land_use_kg,land_use_1000kcal,Land use per 100 grams of protein,Land use per 100 grams of fat,Bananas index (kg),Bananas index (1000 kcalories),Bananas index (100g protein),Chart?,type,Banana values,Unnamed: 16
0,Ale,2022,0.488690,0.317338,0.878525,2.424209,0.811485,0.601152,1.577687,3.065766,0.559558,0.362340,0.113771,True,1,Per KG,0.873350
1,Almond butter,2022,0.387011,0.067265,0.207599,0.079103,7.683045,1.296870,3.608433,1.495297,0.443134,0.076804,0.026885,True,1,Per 1000 kcalories,0.875803
2,Almond milk,2022,0.655888,2.222230,13.595512,4.057470,1.370106,2.675063,12.687839,4.600530,0.751002,2.537364,1.760651,True,1,Per 100g protein,7.721869
3,Almonds,2022,0.602368,0.105029,0.328335,0.119361,8.230927,1.423376,4.261040,1.610136,0.689721,0.119923,0.042520,True,1,,
4,Apple juice,2022,0.458378,0.955184,29.152212,19.754980,0.660629,1.382839,43.232158,26.246743,0.524851,1.090638,3.775280,True,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,Tortilla wraps,2022,0.948584,0.393648,1.260451,1.658348,2.256113,0.979229,2.979443,4.106024,1.086144,0.449471,0.163231,True,1,,
156,Tuna,2022,13.075355,9.969608,4.972586,105.113632,5.521840,4.194251,2.167108,44.058029,14.971502,11.383397,0.643961,True,2,,
157,Walnuts,2022,2.416308,0.409580,1.725508,0.492456,11.875852,1.924057,7.828816,2.092320,2.766713,0.467663,0.223457,True,1,,
158,Watermelon,2022,0.969403,2.464087,16.335799,22.110017,1.009878,2.616771,17.232334,22.874311,1.109983,2.813519,2.115524,True,1,,
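
A quick sanity check on what the index columns mean. The `Banana values` / `Unnamed: 16` pair in rows 0 to 2 looks like a small lookup table of banana reference emissions (per kg, per 1000 kcal, per 100g protein). If that reading is right, each banana index should simply be the food's emissions divided by the banana's. This is only a sketch under that assumption: the rows are typed in by hand from the output above, and the variable names are my own.

```python
import pandas as pd

# Two rows copied by hand from the output above.
sample = pd.DataFrame({
    "entity": ["Ale", "Almond milk"],
    "emissions_kg": [0.488690, 0.655888],
    "emissions_1000kcal": [0.317338, 2.222230],
    "emissions_100g_protein": [0.878525, 13.595512],
    "Bananas index (kg)": [0.559558, 0.751002],
    "Bananas index (1000 kcalories)": [0.362340, 2.537364],
    "Bananas index (100g protein)": [0.113771, 1.760651],
})

# Assumption: these are banana reference emissions, read from the
# "Banana values" / "Unnamed: 16" columns of rows 0-2.
banana_reference = {
    "emissions_kg": 0.873350,
    "emissions_1000kcal": 0.875803,
    "emissions_100g_protein": 7.721869,
}

# Which published index column belongs to which emissions column.
index_columns = {
    "emissions_kg": "Bananas index (kg)",
    "emissions_1000kcal": "Bananas index (1000 kcalories)",
    "emissions_100g_protein": "Bananas index (100g protein)",
}

# Largest gap between the recomputed and the published index, per column.
max_abs_diff = {
    index_col: float(
        (sample[col] / banana_reference[col] - sample[index_col]).abs().max()
    )
    for col, index_col in index_columns.items()
}
print(max_abs_diff)
```

If the gaps are all close to zero, a banana index of 2 can be read as "twice the emissions of bananas" for that unit.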


Let's home in on the protein-related data in this dataset.

In [13]:
df_protein = df_banana[["entity", "year", "emissions_100g_protein", "Land use per 100 grams of protein", "Bananas index (100g protein)"]]

In [14]:
df_protein

Unnamed: 0,entity,year,emissions_100g_protein,Land use per 100 grams of protein,Bananas index (100g protein)
0,Ale,2022,0.878525,1.577687,0.113771
1,Almond butter,2022,0.207599,3.608433,0.026885
2,Almond milk,2022,13.595512,12.687839,1.760651
3,Almonds,2022,0.328335,4.261040,0.042520
4,Apple juice,2022,29.152212,43.232158,3.775280
...,...,...,...,...,...
155,Tortilla wraps,2022,1.260451,2.979443,0.163231
156,Tuna,2022,4.972586,2.167108,0.643961
157,Walnuts,2022,1.725508,7.828816,0.223457
158,Watermelon,2022,16.335799,17.232334,2.115524
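
Since the column names mix snake_case with long descriptive labels, one optional wrangling step would be to rename them so they are easier to type. A minimal sketch on three rows copied by hand from the output above; the short names and the `df_protein_tidy` variable are hypothetical, and the rest of this notebook keeps using the original names.

```python
import pandas as pd

# Three rows copied by hand from the output above.
df_protein_sample = pd.DataFrame({
    "entity": ["Ale", "Almond butter", "Almond milk"],
    "year": [2022, 2022, 2022],
    "emissions_100g_protein": [0.878525, 0.207599, 13.595512],
    "Land use per 100 grams of protein": [1.577687, 3.608433, 12.687839],
    "Bananas index (100g protein)": [0.113771, 0.026885, 1.760651],
})

# Hypothetical short names, chosen to match the existing snake_case columns.
short_names = {
    "Land use per 100 grams of protein": "land_use_100g_protein",
    "Bananas index (100g protein)": "banana_index_100g_protein",
}

# rename() returns a new DataFrame; df_protein_sample keeps its original names.
df_protein_tidy = df_protein_sample.rename(columns=short_names)
print(df_protein_tidy.columns.tolist())
```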


Let's do some descriptive data analysis with a few plots.

In [28]:
alt.Chart(df_protein).mark_bar().encode(
    x="entity", y="emissions_100g_protein")

That's not very pretty. Let's turn the chart on its side for readability and sort the values by emissions.

In [43]:
# Plot each food's emissions per 100g of protein, sorted from highest to lowest
alt.Chart(df_protein).mark_bar().encode(
    y=alt.Y("entity").sort("-x").title("Type of food"),
    x=alt.X("emissions_100g_protein").title("Emissions per 100g of protein")
)

It seems that foods that are high in sugar or fat, or that have a high water content (grapes), are emissions-heavy relative to their protein content. Many of these foods aren't really eaten for their protein, so it would be fairer to compare only foods with a similar protein percentage, say around 5%. However, that data is missing from this analysis.
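
One possible workaround is to estimate the protein share from columns that are already in the full table. This rests on an assumption I have not verified against the dataset's documentation: that `emissions_100g_protein` was computed as `emissions_kg` divided by the protein content (in units of 100g protein per kg of food). Under that assumption the ratio of the two columns, times 10, gives a rough protein percentage by weight. The sketch uses four rows copied by hand from the first output above; `protein_pct_estimate` is a hypothetical column name, and the "at least 5%" cut-off is just one reading of the 5% idea.

```python
import pandas as pd

# Four rows copied by hand from the full table shown earlier.
sample = pd.DataFrame({
    "entity": ["Almonds", "Tuna", "Walnuts", "Watermelon"],
    "emissions_kg": [0.602368, 13.075355, 2.416308, 0.969403],
    "emissions_100g_protein": [0.328335, 4.972586, 1.725508, 16.335799],
})

# Assumption: emissions_100g_protein = emissions_kg / (100g units of protein per kg).
# Then emissions_kg / emissions_100g_protein is "100g of protein per kg of food",
# and multiplying by 10 turns that into percent by weight.
sample["protein_pct_estimate"] = (
    10 * sample["emissions_kg"] / sample["emissions_100g_protein"]
)

# Keep only foods with an estimated protein share of at least 5%.
protein_rich = sample[sample["protein_pct_estimate"] >= 5]
print(protein_rich[["entity", "protein_pct_estimate"]].round(1))
```

The estimates are only as good as the assumption behind them, so they are worth spot-checking against a nutrition table before relying on them.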

In [52]:
# How does land use relate to emissions, both per 100g of protein?
alt.Chart(df_protein).mark_point().encode(
    y=alt.Y("Land use per 100 grams of protein").title("Land use per 100 grams of protein"),
    x=alt.X("emissions_100g_protein").title("Emissions per 100g of protein"),
    color="entity"
)

Here we can see that most foods are bunched together, although there are some outliers with heavy land use or emissions per 100g of protein.

Let's filter out the blob and look at the outliers.

In [48]:
# Keep only foods above 20 on both measures (emissions and land use per 100g of protein)
df_outliers = df_protein[
    (df_protein["emissions_100g_protein"] > 20)
    & (df_protein["Land use per 100 grams of protein"] > 20)
]
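
The two filters above are applied one after the other, so a food has to exceed both thresholds to be kept. To also catch foods that are extreme on just one measure, the two masks can be combined with `|` instead of `&`. A small sketch of the difference on rows copied by hand from the table; the helper `split_outliers` is hypothetical, and the threshold of 5 is used only because it separates the two cases on this tiny sample (the notebook itself uses 20).

```python
import pandas as pd

# Five rows copied by hand from the df_protein output above.
sample = pd.DataFrame({
    "entity": ["Almond milk", "Apple juice", "Tuna", "Walnuts", "Watermelon"],
    "emissions_100g_protein": [13.595512, 29.152212, 4.972586, 1.725508, 16.335799],
    "Land use per 100 grams of protein": [12.687839, 43.232158, 2.167108, 7.828816, 17.232334],
})


def split_outliers(df, threshold):
    """Return (rows above threshold on both measures, rows above it on at least one)."""
    high_emissions = df["emissions_100g_protein"] > threshold
    high_land_use = df["Land use per 100 grams of protein"] > threshold
    both = df[high_emissions & high_land_use]    # same result as chaining two filters
    either = df[high_emissions | high_land_use]  # extreme on at least one measure
    return both, either


# Illustration-only threshold; the notebook uses 20.
both, either = split_outliers(sample, threshold=5)
print(both["entity"].tolist())
print(either["entity"].tolist())
```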

In [51]:
# Land use in relation to emissions, for the outliers only
alt.Chart(df_outliers).mark_point().encode(
    y=alt.Y("Land use per 100 grams of protein").title("Land use per 100 grams of protein"),
    x=alt.X("emissions_100g_protein").title("Emissions per 100g of protein"),
    color="entity"
)