# What's in an Avocado Toast: A Supply Chain Analysis

You're in London, making an avocado toast, a quick-to-make dish that has soared in popularity on breakfast menus since the 2010s. A simple smashed avocado toast can be made with five ingredients: one ripe avocado, half a lemon, a big pinch of salt flakes, two slices of sourdough bread and a good drizzle of extra virgin olive oil. It's no small feat that most of these ingredients are readily available in grocery stores. 

In this project, you'll conduct a supply chain analysis of three of these ingredients used in an avocado toast, utilizing the Open Food Facts database. This database contains extensive, openly-sourced information on various foods, including their origins. Through this analysis, you will gain an in-depth understanding of the complex supply chain involved in producing a single dish.

Three pairs of files are provided in the data folder:
- A CSV file for each ingredient, such as `avocado.csv`, with data about each food item and countries of origin
- A TXT file for each ingredient, such as `relevant_avocado_categories.txt`, containing only the category tags of interest for that food.

Here are some other key points about these files:
- Some rows in each of the three CSV files are not relevant to your investigation. In each dataset, you will need to filter out irrelevant rows based on the values in the `categories_tags` column. Examples of categories are fruits, vegetables, and fruit-based oils. Filter the DataFrame to include only rows where `categories_tags` contains at least one of the tags in the relevant categories for that ingredient.
- Each row of data usually has multiple categories tags in the `categories_tags` column.
- There is a column in each CSV file called `origins_tags` with strings for country of origin of that item.
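
The filtering rule described above can be sketched on a tiny, made-up DataFrame. The product names, tags, and the `relevant_tags` list below are hypothetical stand-ins for the real CSV and TXT contents, not values taken from the project data.

```python
import pandas as pd

# Hypothetical sample rows; real Open Food Facts rows have many more columns
sample = pd.DataFrame({
    "product_name_en": ["Ripe avocado", "Avocado oil", "Mystery item"],
    "categories_tags": [
        "en:fruits,en:avocados",
        "en:fats,en:avocado-oils",
        None,
    ],
})

# Hypothetical list of relevant tags, standing in for the TXT file contents
relevant_tags = ["en:avocados"]

# Drop rows without tags, then split each comma-separated string into a list
sample = sample.dropna(subset=["categories_tags"]).copy()
sample["categories_tags"] = sample["categories_tags"].str.split(",")

# Keep rows whose tag list shares at least one tag with relevant_tags
mask = sample["categories_tags"].apply(
    lambda tags: any(tag in relevant_tags for tag in tags)
)
filtered = sample[mask]
print(filtered["product_name_en"].tolist())  # ['Ripe avocado']
```

Only the first row survives: the second row has tags but none of them is relevant, and the third row has no tags at all.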

After completing this project, you'll be armed with a list of ingredients and their countries of origin, and be well-positioned to launch into other analyses that explore how long, on average, these ingredients spend at sea.

![](avocado_wallpaper.jpeg)

## Avocado Data

In [1]:
import pandas as pd

In [2]:
# Read in the CSV data for avocado (tab-delimited)
keep_columns = ['code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']

avc = pd.read_csv('data/avocado.csv', delimiter='\t')
avc = avc[keep_columns]

In [3]:
# Read TXT file for relevant category tags of avocado
with open('data/relevant_avocado_categories.txt', "r") as file:    # "r" is read mode
    relevant_avocado_categories = file.read().splitlines()
# The with block closes the file automatically; confirm that it is closed
assert file.closed
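
Since the same read-the-tags step recurs for every ingredient, it could be wrapped in a small helper. The sketch below is only an illustration: `load_category_tags` is a hypothetical name, and the demonstration writes a temporary file with made-up tags rather than reading the project's data folder.

```python
import tempfile
from pathlib import Path

def load_category_tags(path):
    """Read one tag per line from a text file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as file:
        tags = file.read().splitlines()
    # Leaving the with block closes the file, so no explicit close() is needed
    assert file.closed
    return tags

# Demonstration with a temporary file holding hypothetical tags
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "relevant_demo_categories.txt"
    demo_path.write_text("en:avocados\nen:fresh-avocados\n", encoding="utf-8")
    demo_tags = load_category_tags(demo_path)

print(demo_tags)  # ['en:avocados', 'en:fresh-avocados']
```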

In [4]:
# Turning categories_tags (comma separated tags) into a column of lists
avc['categories_tags'] = avc['categories_tags'].str.split(',')

In [5]:
# Dropping rows with null in categories_tags
avc = avc.dropna(subset=['categories_tags'])

In [6]:
# Filtering avc based on categories_tags
avc = avc[avc['categories_tags'].apply(
    lambda x: any(i in x for i in relevant_avocado_categories))]

```python
# x stands for the list of category tags in a single row
x = avc['categories_tags'].iloc[0]

# [i for i in x if i in relevant_avocado_categories] is equivalent to the code:
filtered_list = []
for i in x:
    if i in relevant_avocado_categories:
        filtered_list.append(i)

# any([i for i in x if i in relevant_avocado_categories]) is equivalent to the code:
found_any = False
for i in x:
    if i in relevant_avocado_categories:
        found_any = True
        break
result = found_any
```


## Where do most UK avocados come from?

In [7]:
# Filtering avc to rows whose country is United Kingdom
avocado_origin_uk = avc[avc['countries'] == 'United Kingdom']

In [8]:
# Counting and ordering by the unique values in the country of origin column
avocado_origin_uk_count = avocado_origin_uk['origins_tags'].value_counts()

# Get the country with the highest count
top_avocado_origin = avocado_origin_uk_count.index[0]

In [9]:
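# Remove the "en:" prefix before the country name (str.removeprefix needs Python 3.9+;
# lstrip("en:") would strip any leading 'e', 'n' or ':' characters, not the prefix)
# Replace hyphens in the country name with spaces
top_avocado_origin = top_avocado_origin.removeprefix("en:").replace('-', ' ')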

In [10]:
# Print the result
print(top_avocado_origin)

peru
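
A note on removing the `en:` prefix from origin tags: `str.lstrip` treats its argument as a set of characters to strip, not as a literal prefix, so it can eat into country names that begin with `e` or `n`. The sketch below contrasts the two behaviours; `strip_tag_prefix` is a hypothetical helper, and `en:netherlands` is an illustrative tag rather than a value taken from the avocado results.

```python
def strip_tag_prefix(tag, prefix="en:"):
    """Remove a literal prefix from a tag, if present (hypothetical helper)."""
    return tag[len(prefix):] if tag.startswith(prefix) else tag

# lstrip removes every leading character found in the set {'e', 'n', ':'}
print("en:netherlands".lstrip("en:"))      # therlands

# Slicing off the literal prefix leaves the country name intact
print(strip_tag_prefix("en:netherlands"))  # netherlands
print(strip_tag_prefix("en:peru"))         # peru
```

For `en:peru` both approaches give the same answer, which is why the problem is easy to miss.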


## Create a user-defined function to call for each ingredient

Applying the Don't Repeat Yourself (DRY) principle, the analysis code for the avocado data has been refactored into a single function that works for any of the ingredient files. Note that the function returns only the first entry of `value_counts()`, so if two origins tie for the highest count, only one of them is reported. Ties were not a concern in the avocado analysis, but they may matter for other ingredients.

In [11]:
# Create a function called read_and_filter_data()

def read_and_filter_data(filename, string_list):
    """
    Reads data from a CSV file specified by the filename and filters it based on specified string_list (relevant categories).

    Parameters:
    - filename (str): A string specifying the path to the CSV file containing ingredient data.
    - string_list (list of str): A list of strings indicating the criteria to filter the data.

    Returns:
    - str: The most frequent `origins_tags` value among products sold in the United Kingdom, with the 'en:' prefix removed and hyphens replaced by spaces.
    """
    
    # Read the file
    df = pd.read_csv(filename, delimiter='\t', low_memory=False)
    
    # Subset to just the relevant columns
    keep_columns = ['code', 'lc', 'product_name_en', 'quantity', 'serving_size', 'packaging_tags', 'brands', 'brands_tags', 'categories_tags', 'labels_tags', 'countries', 'countries_tags', 'origins','origins_tags']
    df = df[keep_columns]
    
    # Turning categories_tags (comma separated tags) into a column of lists
    df['categories_tags'] = df['categories_tags'].str.split(',')
    
    # Dropping rows with null in categories_tags
    df = df.dropna(subset=['categories_tags'])
    
    # Filtering df based on categories_tags
    df = df[df['categories_tags'].apply(lambda x: any([i in x for i in string_list]))]
    
    # Filtering df by country United Kingdom
    df_origin_uk = df[df['countries'] == 'United Kingdom']
    
    # Counting and ordering by the unique values in the country of origin column
    df_origin_uk_count = df_origin_uk['origins_tags'].value_counts()
    
    # Get the country with the highest count
    df_top_origin_uk = df_origin_uk_count.index[0]
    
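    # Clean up the country string: remove the literal "en:" prefix (Python 3.9+),
    # since lstrip("en:") would strip leading characters rather than the prefix
    df_top_origin_uk = df_top_origin_uk.removeprefix("en:").replace('-', ' ')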
    
    return df_top_origin_uk
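
Taking `.index[0]` of a `value_counts()` result yields a single origin, and it raises an `IndexError` when no rows remain after filtering. If ties matter, one possible approach is to return every origin that shares the maximum count. This is a minimal sketch under that assumption: `top_origins_with_ties` is a hypothetical helper, and the example tags are invented for illustration.

```python
import pandas as pd

def top_origins_with_ties(origins):
    """Return every origin sharing the highest count (hypothetical helper)."""
    counts = origins.value_counts()
    if counts.empty:
        # No rows left after filtering: report no origin instead of raising
        return []
    # Keep all index labels whose count equals the maximum, in sorted order
    return sorted(counts[counts == counts.max()].index)

# Hypothetical origins_tags values, not taken from the project data
example = pd.Series(["en:spain", "en:italy", "en:spain", "en:italy", "en:greece"])
print(top_origins_with_ties(example))  # ['en:italy', 'en:spain']
```

Returning a list changes the function's contract, so callers that expect a single string would need to decide how to present multiple winners.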

## Read relevant categories data file and call function for each ingredient

For the remaining two ingredients, olive oil and sourdough, read the relevant category tags from the TXT file and pass them, together with the CSV filename, to the function created above. The function returns the top country of origin for each ingredient.

## Olive Oil Data

In [12]:
# Read TXT file for relevant category tags of olive oil
with open('data/relevant_olive_oil_categories.txt', "r", encoding="utf-8") as file:
    relevant_olive_oil_categories = file.read().splitlines()

# Call the function and assign the result to top_olive_oil_origin
top_olive_oil_origin = read_and_filter_data('data/olive_oil.csv', relevant_olive_oil_categories)

# Print the result
print(top_olive_oil_origin)

greece


## Sourdough Data

In [13]:
# Read TXT file for relevant category tags of sourdough
with open('data/relevant_sourdough_categories.txt', "r", encoding="utf-8") as file:
    relevant_sourdough_categories = file.read().splitlines()

# Call the function and assign the result to top_sourdough_origin
top_sourdough_origin = read_and_filter_data('data/sourdough.csv', relevant_sourdough_categories)

# Print the result
print(top_sourdough_origin)

united kingdom
