# Intro

The purpose of this EDA is to find out whether a diet based on animal products is efficient from a societal perspective.

To do so, I'll run a cost-benefit analysis of animal-based foods and compare them with plant-based alternatives.

For that purpose, I'll use the following data:
- Resources for food production (cost)
- Nutritional values of the food & recommended daily intake (benefit)

By measuring the relation between resources and nutritional values, we can determine which foods are more efficient.
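As a rough illustration of that idea, the sketch below turns one resource figure into an efficiency score. The land-use values are copied from the land-use table shown later in this notebook; the score itself (`protein_g_per_m2`) is a hypothetical metric chosen for the example, not necessarily the one used in the final analysis.

```python
import pandas as pd

# Land use per 100 g of protein (m²), taken from the land-use table further down
foods = pd.DataFrame({
    "Entity": ["Apples", "Bananas", "Beef (beef herd)", "Beef (dairy herd)"],
    "land_m2_per_100g_protein": [21.0, 21.444444, 163.595787, 21.904762],
})

# Hypothetical efficiency score: grams of protein obtained per m² of land
foods["protein_g_per_m2"] = 100 / foods["land_m2_per_100g_protein"]

# Higher score = less land needed for the same amount of protein
ranking = foods.sort_values("protein_g_per_m2", ascending=False).reset_index(drop=True)
print(ranking)
```

The same ratio could be computed for water or emissions, and for any nutrient instead of protein.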

In [1]:
# Let's import the necessary libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import re
from varname import nameof

import requests
from bs4 import BeautifulSoup
import html
import lxml

import sys, os

# Helpers
abspath = os.path.abspath
dirname = os.path.dirname
sep = os.sep
file_ = os.getcwd()

ml_folder = dirname(file_)
sys.path.append(ml_folder)

from src.utils import mining_data_tb as md
from src.utils import visualization_tb as vi
from src.utils import folder_tb as fo
from src.utils import models as mo

import warnings
warnings.filterwarnings('ignore')

# Data exploration and cleaning

As mentioned before, we'll use three datasets:
- Resources: composed of several CSV files that we will merge
- Nutritional values: 60+ nutrients (columns) for 7,000+ different foods (rows)
- Daily intakes: a CSV with the URLs of the recommended daily intake by gender and age. We will scrape those pages to pull the data

## Resources

In [5]:
#### Data paths
environment_path = fo.path_to_folder(2, "data" + sep + "environment")
resources_path = fo.path_to_folder(2, "data" + sep + "environment" + sep + "resources_use")

#### Load data
general = pd.read_csv(environment_path + "food_production.csv")
general.head()

Unnamed: 0,Food product,Land use change,Animal Feed,Farm,Processing,Transport,Packging,Retail,Total_emissions,Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal),...,Freshwater withdrawals per 100g protein (liters per 100g protein),Freshwater withdrawals per kilogram (liters per kilogram),Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal),Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein),Land use per 1000kcal (m² per 1000kcal),Land use per kilogram (m² per kilogram),Land use per 100g protein (m² per 100g protein),Scarcity-weighted water use per kilogram (liters per kilogram),Scarcity-weighted water use per 100g protein (liters per 100g protein),Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories)
0,Wheat & Rye (Bread),0.1,0.0,0.8,0.2,0.1,0.1,0.1,1.4,,...,,,,,,,,,,
1,Maize (Meal),0.3,0.0,0.5,0.1,0.1,0.1,0.0,1.1,,...,,,,,,,,,,
2,Barley (Beer),0.0,0.0,0.2,0.1,0.0,0.5,0.3,1.1,,...,,,,,,,,,,
3,Oatmeal,0.0,0.0,1.4,0.0,0.1,0.1,0.0,1.6,4.281357,...,371.076923,482.4,0.945482,1.907692,2.897446,7.6,5.846154,18786.2,14450.92308,7162.104461
4,Rice,0.0,0.0,3.6,0.1,0.1,0.1,0.1,4.0,9.514379,...,3166.760563,2248.4,1.207271,6.267606,0.759631,2.8,3.943662,49576.3,69825.77465,13449.89148


In [6]:
general.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 23 columns):
 #   Column                                                                   Non-Null Count  Dtype  
---  ------                                                                   --------------  -----  
 0   Food product                                                             43 non-null     object 
 1   Land use change                                                          43 non-null     float64
 2   Animal Feed                                                              43 non-null     float64
 3   Farm                                                                     43 non-null     float64
 4   Processing                                                               43 non-null     float64
 5   Transport                                                                43 non-null     float64
 6   Packging                                                                 43 

We can already see several things:
- There are quite a few NaN values.
- Many columns refer to emissions: everything from "Land use change" to "Retail". As the column "Total_emissions" is the sum of those, I'll keep just that one and drop the rest.
- I'm also interested in the water and land use data, and I've found other datasets that can fill some of the missing values.
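Before dropping the per-stage emission columns, it may be worth checking that "Total_emissions" really is their sum. The sketch below does that on the five rows shown in `general.head()` above; the tolerance of 0.05 is an assumption to allow for rounding in the source data.

```python
import numpy as np
import pandas as pd

stage_cols = ["Land use change", "Animal Feed", "Farm", "Processing",
              "Transport", "Packging", "Retail"]

# The five rows displayed by general.head()
sample = pd.DataFrame(
    [
        ["Wheat & Rye (Bread)", 0.1, 0.0, 0.8, 0.2, 0.1, 0.1, 0.1, 1.4],
        ["Maize (Meal)",        0.3, 0.0, 0.5, 0.1, 0.1, 0.1, 0.0, 1.1],
        ["Barley (Beer)",       0.0, 0.0, 0.2, 0.1, 0.0, 0.5, 0.3, 1.1],
        ["Oatmeal",             0.0, 0.0, 1.4, 0.0, 0.1, 0.1, 0.0, 1.6],
        ["Rice",                0.0, 0.0, 3.6, 0.1, 0.1, 0.1, 0.1, 4.0],
    ],
    columns=["Food product"] + stage_cols + ["Total_emissions"],
)

# Compare the row-wise sum of the stages with the reported total
stage_sum = sample[stage_cols].sum(axis=1)
matches = np.isclose(stage_sum, sample["Total_emissions"], atol=0.05)
print(matches.all())

# If the check holds, the stage columns are redundant and can be dropped
slim = sample.drop(columns=stage_cols)
```

On the full dataframe, the same check would run on `general` instead of `sample`.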

This is what we are going to do next:
1. Try to fill the missing values using data from other dataframes
2. Keep only the columns we will actually use
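The two steps above can be sketched on a small sample as follows. The values come from the tables in this notebook, but the name mapping (`name_map`) is a hypothetical example: it assumes that "Barley" in the land-use files corresponds to "Barley (Beer)" in the general dataframe, which would need to be confirmed for the real data.

```python
import numpy as np
import pandas as pd

col = "Land use per kilogram (m² per kilogram)"

# A few rows of the general dataframe, with a missing land-use value
general_sample = pd.DataFrame({
    "Food product": ["Barley (Beer)", "Oatmeal", "Rice"],
    "Total_emissions": [1.1, 1.6, 4.0],
    col: [np.nan, 7.6, 2.8],
})

# A few rows of the separate land-use data
land_use_sample = pd.DataFrame({
    "Entity": ["Barley", "Apples"],
    "Land use per kg": [1.11, 0.63],
})

# Hypothetical mapping to align food names between the two sources
name_map = {"Barley": "Barley (Beer)"}
lookup = (
    land_use_sample
    .assign(Entity=land_use_sample["Entity"].replace(name_map))
    .set_index("Entity")["Land use per kg"]
)

# Step 1: fill only the missing values, leaving existing ones untouched
general_sample[col] = general_sample[col].fillna(
    general_sample["Food product"].map(lookup)
)

# Step 2: keep only the columns needed for the analysis
filled = general_sample[["Food product", "Total_emissions", col]]
print(filled)
```

Using `fillna` with a mapped series means values already present in the general dataframe always take precedence over the secondary source.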

In [3]:
# Note: land_use and water_use are built in the cells below (In [7] and In [8]); run those first
print(general.shape)
print(land_use.shape)
print(water_use.shape)

(43, 23)
(44, 4)
(43, 4)


In [7]:
# Let's start with the land use data
# The units are square meters (m²): https://ourworldindata.org/environmental-impacts-of-food?country=#water-use


# Load data and drop the unnecessary columns
land_use_kcal = pd.read_csv(resources_path + "land-use-kcal-poore.csv").drop(["Code", "Year"], axis = 1)
land_use_kg = pd.read_csv(resources_path + "land-use-per-kg-poore.csv").drop(["Code", "Year"], axis = 1)
land_use_protein = pd.read_csv(resources_path + "land-use-protein-poore.csv").drop(["Code", "Year"], axis = 1)

# Merge all the data in one dataframe
land_use = pd.merge(land_use_kcal, land_use_kg, how = "outer", on = "Entity")
land_use = pd.merge(land_use, land_use_protein, how = "outer", on = "Entity")
land_use.columns = ["Entity", "Land use per 1000kcal", "Land use per kg", "Land use per 100g protein"]

land_use.head()

Unnamed: 0,Entity,Land use per 1000kcal,Land use per kg,Land use per 100g protein
0,Apples,1.3125,0.63,21.0
1,Bananas,3.216667,1.93,21.444444
2,Barley,0.222,1.11,
3,Beef (beef herd),119.490842,326.21,163.595787
4,Beef (dairy herd),15.838828,43.24,21.904762
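The NaN for Barley in the protein column suggests that not every food has a value in all three files. One way to see which entities fail to match in an outer merge is the `indicator` parameter of `pd.merge`. The sketch below uses a few values from the table above; treating Barley as absent from the protein file is an assumption made for the example (the file could equally contain the row with an empty value).

```python
import pandas as pd

kcal = pd.DataFrame({
    "Entity": ["Apples", "Bananas", "Barley"],
    "Land use per 1000kcal": [1.3125, 3.216667, 0.222],
})
protein = pd.DataFrame({
    "Entity": ["Apples", "Bananas"],
    "Land use per 100g protein": [21.0, 21.444444],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = pd.merge(kcal, protein, how="outer", on="Entity", indicator=True)

# Rows that appear in only one of the two inputs
unmatched = merged.loc[merged["_merge"] != "both", ["Entity", "_merge"]]
print(unmatched)
```

Entities flagged as `left_only` or `right_only` are the ones that will carry NaN values after the merge.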


In [8]:
# Let's continue with the water use data
# The units are liters (l): https://ourworldindata.org/environmental-impacts-of-food?country=#water-use


# Load data and drop the unnecessary columns
water_use_kcal = pd.read_csv(resources_path + "freshwater-withdrawals-per-kcal.csv").drop(["Code", "Year"], axis = 1)
water_use_kg = pd.read_csv(resources_path + "freshwater-withdrawals-per-kg.csv").drop(["Code", "Year"], axis = 1)
water_use_protein = pd.read_csv(resources_path + "freshwater-withdrawals-per-protein.csv").drop(["Code", "Year"], axis = 1)

# Merge all the data in one dataframe
water_use = pd.merge(water_use_kcal, water_use_kg, how = "outer", on = "Entity")
water_use = pd.merge(water_use, water_use_protein, how = "outer", on = "Entity")
water_use.columns = ["Entity", "Freswater withdrawls per 1000kcal", "Freswater withdrawls per kg", "Freswater withdrawls per 100g protein"]
water_use.head()

Unnamed: 0,Entity,Freswater withdrawls per 1000kcal,Freswater withdrawls per kg,Freswater withdrawls per 100g protein
0,Apples,375.208333,180.1,6003.333333
1,Bananas,190.833333,114.5,1272.222222
2,Barley,3.42,17.1,
3,Beef (beef herd),531.575092,1451.2,727.78335
4,Beef (dairy herd),994.249084,2714.3,1375.025329
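The land-use and water-use cells follow the same load, drop, and merge pattern, so they could share a small helper. The function below (`load_and_merge`) is a hypothetical refactoring, not part of `src.utils`. The in-memory CSVs stand in for the real files: their value-column headers and the `Year` entries are placeholders, while the numbers are copied from the table above.

```python
import io
from functools import reduce

import pandas as pd


def load_and_merge(sources, column_names):
    """Read each CSV, drop the unused columns, and outer-merge everything on "Entity"."""
    frames = [pd.read_csv(src).drop(["Code", "Year"], axis=1) for src in sources]
    merged = reduce(
        lambda left, right: pd.merge(left, right, how="outer", on="Entity"),
        frames,
    )
    merged.columns = ["Entity"] + list(column_names)
    return merged


# In-memory stand-ins for two of the files (header names and Year are placeholders)
per_kcal_csv = io.StringIO(
    "Entity,Code,Year,per_kcal\nApples,,2010,375.208333\nBarley,,2010,3.42\n"
)
per_kg_csv = io.StringIO(
    "Entity,Code,Year,per_kg\nApples,,2010,180.1\nBarley,,2010,17.1\n"
)

# Column names spelled as in the notebook
water_demo = load_and_merge(
    [per_kcal_csv, per_kg_csv],
    ["Freswater withdrawls per 1000kcal", "Freswater withdrawls per kg"],
)
print(water_demo)
```

With the real data, `sources` would be the file paths built from `resources_path`, and the same helper would serve both the land-use and the water-use files.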
