# What's in an Avocado Toast: A Supply Chain Analysis

![](avocado_wallpaper.jpeg)

You find yourself in London, crafting a delectable avocado toast, a dish that has risen dramatically in popularity on breakfast menus since the 2010s. This straightforward recipe requires just five ingredients: a ripe avocado, half a lemon, a generous pinch of salt flakes, two slices of sourdough bread, and a good drizzle of extra virgin olive oil. Most of these ingredients are now staples in grocery stores, and as you will find with this project, that is no small feat!

In this project, you'll conduct a supply chain analysis of three ingredients used in avocado toast using the Open Food Facts database. This database contains extensive, openly sourced information on various foods, including their origins. Through this analysis, you will gain an in-depth understanding of the complex supply chain involved in producing a single dish.

Three pairs of files are provided in the data folder:
- A CSV file for each ingredient, such as `avocado.csv`, with data about each food item and countries of origin.
- A TXT file for each ingredient, such as `relevant_avocado_categories.txt`, containing only the category tags of interest for that food.

Here are some other key points about these files:
- Some of the rows of data in each of the three CSV files do not contain relevant data for your investigation. In each dataset, you will need to filter out rows with irrelevant data, based on values in the `categories_tags` column. Examples of categories are fruits, vegetables, and fruit-based oils. Filter the DataFrame to include only rows where `categories_tags` contains one of the tags in the relevant categories for that ingredient.
- Each row of data usually has multiple category tags in the `categories_tags` column.
- Each CSV file has a column called `origins_tags`, which contains strings for the country of origin of each item.
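The filtering rule described above can be sketched on a toy table. This is a minimal illustration only: the product names and tags below are made up, and `relevant_tags` is a hypothetical stand-in for the contents of a relevant-categories TXT file.

```python
import pandas as pd

# Toy data standing in for an ingredient CSV; the values are made up.
toy_df = pd.DataFrame({
    "product_name_en": ["Hass avocado", "Guacamole dip", "Avocado oil", "Unknown item"],
    "categories_tags": [
        "en:fruits,en:avocados",
        "en:dips,en:guacamoles",
        "en:fats,en:avocado-oils",
        None,
    ],
})

# Hypothetical set of relevant tags (the real ones come from the TXT file).
relevant_tags = {"en:avocados", "en:tropical-fruits"}

# Drop rows without tags, split each string into a list, and keep rows
# where at least one tag is an exact member of the relevant set.
tagged = toy_df.dropna(subset=["categories_tags"]).copy()
tagged["categories_tags"] = tagged["categories_tags"].str.split(",")
is_relevant = tagged["categories_tags"].apply(
    lambda tags: any(tag in relevant_tags for tag in tags)
)
filtered = tagged[is_relevant]
print(filtered["product_name_en"].tolist())
```

Only the first row survives here, because it is the only one carrying a tag from the relevant set.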

After completing this project, you'll be armed with a list of ingredients and their countries of origin and be well-positioned to launch into other analyses that explore how long, on average, these ingredients spend at sea.

[Open Food Facts database](https://world.openfoodfacts.org/)

In [1]:
import pandas as pd
from pathlib import Path


In [2]:
#dataframe columns to work with
#'code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins', 'origins_tags'

In [3]:
avocado_df_complete = pd.read_csv("data/avocado.csv", sep = '\t')
avocado_df_complete.shape
#avocado_df_complete.info()

(1785, 184)

In [4]:
avocado_df = avocado_df_complete[['code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins', 'origins_tags']]
#avocado_df

Note: reading `data/olive_oil.csv` raises a `DtypeWarning`:

C:\Users\terez\AppData\Local\Temp\ipykernel_27876\3181878656.py:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  olive_oil_df = pd.read_csv("data/olive_oil.csv", sep = '\t',

The `read_csv` docstring explains the `low_memory` parameter mentioned in the warning:

low_memory : bool, default True
    Internally process the file in chunks, resulting in lower memory use
    while parsing, but possibly mixed type inference.  To ensure no mixed
    types either set ``False``, or specify the type with the ``dtype`` parameter.
    Note that the entire file is read into a single :class:`~pandas.DataFrame`
    regardless, use the ``chunksize`` or ``iterator`` parameter to return the data in
    chunks. (Only valid with C parser).
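One way to act on the warning is to declare the type of the affected column up front. The sketch below uses a small in-memory sample rather than the real file, and it assumes that the mixed-type column (column 0) is the product `code`, which the notebook does not confirm.

```python
import io
import pandas as pd

# Made-up tab-separated sample; "code" holds values that look numeric
# but are really identifiers (note the leading zeros).
sample = "code\tproduct_name_en\n0001234\tOlive oil\n5000159\tExtra virgin olive oil\n"

# Without a dtype, pandas infers integers and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(sample), sep="\t")

# Declaring the column as a string keeps the codes intact and removes
# the need for type inference on that column.
typed = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"code": str})

print(inferred["code"].tolist())
print(typed["code"].tolist())
```

Passing `low_memory=False` instead, as the warning itself suggests, also avoids chunk-wise type inference but leaves the inferred types up to pandas.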

In [5]:
#relevant categories tags
rel_avocado_cat = ["en:avocadoes","en:avocados","en:fresh-foods", "en:fresh-vegetables", "en:fruchte","en:fruits", "en:raw-green-avocados","en:tropical-fruits",
                   "en:tropische-fruchte", "en:vegetables-based-foods", "fr:hass-avocados"]
#rel_avocado_cat

## Dropping NaN values and splitting categories_tags content into lists

In [6]:
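#.copy() makes the result an independent DataFrame, so the column assignment in the next cell does not trigger a SettingWithCopyWarning
avocado_df = avocado_df.dropna(subset = 'categories_tags').copy()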
#avocado_df

In [7]:
avocado_df['categories_tags'] = avocado_df['categories_tags'].str.split(',')
#avocado_df

## Selecting relevant rows

In [8]:
#True for rows where at least one category tag is in the relevant list
avocado_sel = avocado_df['categories_tags'].apply(lambda x: any(i in rel_avocado_cat for i in x))
#avocado_sel

In [9]:
avocado_df_selected = avocado_df[avocado_sel]
#avocado_df_selected.shape
#avocado_df_selected['countries'].unique()

In [10]:
uk_avocados = avocado_df_selected[avocado_df_selected['countries'] == 'United Kingdom']
#uk_avocados.shape
#the project only uses rows where countries is exactly "United Kingdom"; rows that list the UK as one of several countries are excluded
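To make the effect of the strict comparison concrete, here is a small sketch on made-up rows. It assumes that multi-country values in `countries` are comma-separated; the looser filter is shown only for contrast and is not what this notebook uses.

```python
import pandas as pd

# Made-up rows; only the shape of the `countries` values matters here.
toy = pd.DataFrame({
    "product_name_en": ["A", "B", "C"],
    "countries": ["United Kingdom", "France,United Kingdom", "Spain"],
})

# Strict equality, as in the cell above: only rows listing the UK alone.
uk_only = toy[toy["countries"] == "United Kingdom"]

# Looser alternative: any row whose comma-separated list includes the UK.
uk_any = toy[
    toy["countries"].str.split(",").apply(
        lambda names: "United Kingdom" in [name.strip() for name in names]
    )
]

print(uk_only["product_name_en"].tolist())
print(uk_any["product_name_en"].tolist())
```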

In [11]:
avocado_origins = uk_avocados.value_counts(subset = "origins_tags")
#avocado_origins

In [12]:
#avocado_origin = (uk_avocados['origins_tags'].value_counts().index[0])
avocado_origin = uk_avocados.value_counts(subset = "origins_tags").index[0]
#avocado_origin

In [13]:
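#top_avocado_origin = avocado_origin.strip('en:')
# strip('en:') removes any of the characters 'e', 'n' and ':' from both ends, so it can cut into other country names; slice off the 3-character prefix instead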
top_avocado_origin = avocado_origin[3:]
top_avocado_origin

'peru'
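The difference between stripping characters and removing a prefix is easy to see on a few sample tags. The tags below are illustrative and not taken from the dataset; `str.removeprefix` is shown as an alternative to slicing and requires Python 3.9 or later.

```python
tags = ["en:peru", "en:spain", "en:netherlands"]

# str.strip removes any of the characters 'e', 'n', ':' from both ends,
# so it can eat into the country name itself.
stripped = [tag.strip("en:") for tag in tags]

# Slicing assumes the prefix is always exactly three characters long.
sliced = [tag[3:] for tag in tags]

# str.removeprefix removes the exact prefix, and only if it is present.
unprefixed = [tag.removeprefix("en:") for tag in tags]

print(stripped)    # ['peru', 'spai', 'therlands']
print(sliced)      # ['peru', 'spain', 'netherlands']
print(unprefixed)  # ['peru', 'spain', 'netherlands']
```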

## Steps performed by the function
- Create the list of relevant categories (the relevant categories for the other ingredients contain many values, so it is more efficient to import the text file as a list).
- Build the DataFrame:
  - read the file and process it (`dropna`, turn `categories_tags` into a column of lists)
  - select the relevant rows (first by category, then by UK)
  - find the most frequent origin country and clean its name (including the hyphen)
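The first step, importing the text file as a list, can be sketched as follows. The file written here is a temporary stand-in with made-up tags, and the one-tag-per-line layout is an assumption about the real category files.

```python
import tempfile
from pathlib import Path

# Write a small stand-in for a relevant-categories file (made-up tags,
# assumed layout: one tag per line).
tmp_dir = Path(tempfile.mkdtemp())
categories_file = tmp_dir / "relevant_demo_categories.txt"
categories_file.write_text(
    "en:olive-oils\nen:extra-virgin-olive-oils\n", encoding="utf-8"
)

raw_text = categories_file.read_text(encoding="utf-8")
tag_list = raw_text.splitlines()

# Against the raw string, `in` is a substring test...
print("en:olive" in raw_text)   # True
# ...against the list, `in` requires an exact match.
print("en:olive" in tag_list)   # False
```

Splitting into a list matters because a membership test against the unsplit text would also accept partial tags.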


In [14]:
def data_prep(filename, sep, categories_file, encoding):
    # read the relevant category tags into a list (assumes one tag per line in the text file)
    relevant_list = Path(categories_file).read_text(encoding = encoding).splitlines()
    # load only the columns of interest; low_memory = False avoids the mixed-type DtypeWarning
    ingredient_df = pd.read_csv(filename, sep = sep, low_memory = False,
                           usecols = ['code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 
                         'countries', 'countries_tags', 'origins', 'origins_tags'])
    # drop rows without category tags and split the tags into lists
    ingredient_df = ingredient_df.dropna(subset = 'categories_tags').copy()
    ingredient_df['categories_tags'] = ingredient_df['categories_tags'].str.split(',')
    # keep rows with at least one relevant category tag
    ingredient_sel = ingredient_df['categories_tags'].apply(lambda x: any(i in relevant_list for i in x))
    ingredient_df_selected = ingredient_df[ingredient_sel]
    # keep rows whose countries value is exactly "United Kingdom"
    uk_ingredient = ingredient_df_selected[ingredient_df_selected['countries'] == 'United Kingdom']
    # most frequent origin tag
    ingredient_origin = uk_ingredient.value_counts(subset = "origins_tags").index[0]
    # drop the 3-character "en:" prefix and replace hyphens with spaces
    top_ingredient_origin = ingredient_origin[3:].replace("-"," ")
    return top_ingredient_origin

In [15]:
top_olive_oil_origin = data_prep('data/olive_oil.csv','\t', 'data/relevant_olive_oil_categories.txt', 'utf-8')
top_olive_oil_origin

'greece'

In [16]:
top_sourdough_origin = data_prep('data/sourdough.csv','\t', 'data/relevant_sourdough_categories.txt', 'utf-8')
top_sourdough_origin

'united kingdom'