# SBS Maltese Food Recipes
---

In [4]:
# Libraries needed to get the html from a site and to parse the html
from bs4 import BeautifulSoup as bs
import requests
import csv
from parse_ingredients import parse_ingredient
import pyfood as pyf
import pandas as pd

To start, we need the main URL: the SBS page that lists the recipes we will be scraping. The requests package downloads the page, and Beautiful Soup parses the returned HTML so that it can be searched.

In [5]:
# Saving the url to a variable, getting the url and parsing the html
mainURL = "https://www.sbs.com.au/food/cuisine/maltese"
req = requests.get(mainURL)
soup = bs(req.text, "html.parser")

The base link is the part common to all recipe links. Each scraped recipe path will be added onto it.

In [6]:
# Base link that hrefs will be added onto
baseLink = 'https://www.sbs.com.au'

We use the Beautiful Soup package to search the HTML of the website for every instance of the class 'link-underlay'. The call returns a list of the tags that carry that class. We determined that all the a tags on the site have this class.

In [7]:
# The tag that stores the href to the recipes
hrefs = soup.findAll(class_='link-underlay')

In [8]:
# Simply printing out the contents to see what the hrefs of the tags are

def printHrefs(hrefsIn):
    for i in hrefsIn:
        print(i.get('href'), "\n")
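
As a small offline illustration of the class lookup, the sketch below runs the same kind of search on a hand-written HTML snippet. The snippet and its hrefs are made up for the example and are not taken from the SBS page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
sample_html = """
<div>
  <a class="link-underlay" href="/food/recipes/example-dish">Example dish</a>
  <a class="link-underlay" href="/food/article/example-story">Example story</a>
  <a class="other" href="/food/recipes/ignored">Ignored</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all(class_=...) returns every tag carrying that CSS class
sample_tags = sample_soup.find_all(class_="link-underlay")
sample_hrefs = [tag.get("href") for tag in sample_tags]
print(sample_hrefs)
```

Only the two tags with the 'link-underlay' class are returned; the third anchor is skipped even though its href looks like a recipe.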

We now need to turn these hrefs into links that can be requested. To do this, we prepend the base link to each scraped href, which gives us a full, addressable link.

In [9]:
# Builds the full links to all the recipes that are shown on the mainURL

def funcRecipeLinks(hrefsIn, baseLinkIn, recipeLinks):
    for i in hrefsIn:
        href = i.get('href')
        # Skip tags with no href, or one too short to test
        if href is None or len(href) <= 6:
            continue
        # Keep only hrefs whose 7th character is 'r' (the recipe-related links)
        if href[6] == 'r':
            recipeLinks.append(baseLinkIn + href)
    return recipeLinks

recipeLinks = []
recipeLinks = funcRecipeLinks(hrefs, baseLink, recipeLinks)
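
An alternative way to build the full links, sketched here under the assumption that recipe hrefs start with '/food/recipes' (the exact prefix is not confirmed above), is to join paths with `urllib.parse.urljoin` and filter by prefix rather than by a single character position. The function name `build_recipe_links` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

def build_recipe_links(hrefs, base_link, prefix="/food/recipes"):
    """Return absolute URLs for the hrefs that start with `prefix`."""
    links = []
    for href in hrefs:
        # Ignore missing hrefs and anything outside the assumed recipe path
        if href and href.startswith(prefix):
            links.append(urljoin(base_link, href))
    return links

example_links = build_recipe_links(
    ["/food/recipes/example-dish", "/food/article/example-story", None],
    "https://www.sbs.com.au",
)
print(example_links)
```

Filtering on a prefix makes the intent readable and avoids an IndexError on very short hrefs.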

This site has two pages of recipes, so below we go through the same process for the second page. In the end, the variable 'recipeLinks' stores all the recipe-related links found on the SBS website.

In [10]:
mainURL = "https://www.sbs.com.au/food/cuisine/maltese?sort_by=created&page=1"
req = requests.get(mainURL)
soup = bs(req.text, "html.parser")

baseLink = 'https://www.sbs.com.au'
hrefs = soup.findAll(class_='link-underlay')

recipeLinks = funcRecipeLinks(hrefs, baseLink, recipeLinks)
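
If the listing grows beyond two pages, repeating the cell by hand becomes tedious. The sketch below only builds the listing URLs and removes duplicate links. It assumes, from the URL above, that the first page is the bare URL and that page k+1 is reached with `page=k`. The helper names are hypothetical.

```python
from urllib.parse import urlencode

def listing_urls(base_url, n_pages):
    """Build the listing URL for each of the n_pages result pages."""
    urls = [base_url]  # the first page is the bare URL
    for page in range(1, n_pages):
        query = urlencode({"sort_by": "created", "page": page})
        urls.append(f"{base_url}?{query}")
    return urls

def unique_in_order(links):
    """Drop repeated links while keeping first-seen order."""
    return list(dict.fromkeys(links))

pages = listing_urls("https://www.sbs.com.au/food/cuisine/maltese", 2)
print(pages)
```

Each URL in `pages` could then be fetched and passed through the same link-building step, with `unique_in_order` applied to the combined result.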


## Testing different ingredient extraction methods
---
### Method 1: parse_ingredients library

The parse_ingredients library extracts the ingredient name, quantity, unit and comment from a recipe ingredient string. In testing, we found that the library can be inconsistent at removing qualifiers such as size and colour, and at reducing plurals to the singular.

In [None]:
def parseIngredientsFunc(i, title):
    data = []
    for j in title:
        data.append(str(j.get_text(strip=False))) # Dish Name
    
    parseResult = parse_ingredient(str(i.get_text(strip=False)))
            
    data.append(parseResult.original_string) #Original String
    data.append(parseResult.name) #Ingredient
    data.append(parseResult.quantity) #Quantity
    data.append(parseResult.unit) #Unit
    data.append(parseResult.comment) #Comment
    
    return data

### Method 2: pyfood library

Unlike the parse_ingredients library, the pyfood library extracts only the ingredient from the input string. It is also prone to occasional mismatches, for example converting 'frozen peas' to 'green peppers', although these mistakes appear to be few and far between. On the other hand, it handles qualifiers better: it ignores indicators of size and colour, and it removes plurals.

In [None]:
def pyfoodFunc(title, i):
    data = []

    for j in title:
        data.append(str(j.get_text(strip=False))) # Dish Name
        
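    print(str(i.get_text(strip=False)))
    # Note: relies on the global `shelf` created in the scraping cell
    results = shelf.process_ingredients([i.get_text(strip=False)])
    try:
        temp = results['ingredients'][0]['foodname'] # vegetarian, vegan, nutrition, seasonality
    except Exception:
        # Fall back to the 'HS' entry if the food name lookup fails
        temp = results['HS'][0]
    data.append(str(i.get_text(strip=False))) # Original String
    data.append(temp)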

    return data

### Method 3: parse_ingredients and pyfood libraries

Given the trade-offs above, we decided to use both libraries together for the best result. First, the parse_ingredients library extracts the quantity and the ingredient name. The ingredient name is then passed through pyfood to obtain a stripped-down version of the ingredient. Running parse_ingredients first also appears to reduce the errors made by pyfood.

In [11]:
def parseIngredientsAndPyfoodFunc(i, title, shelf):
    data = []
    for j in title:
        # Dish Name, with the last 34 characters of the heading text dropped
        data.append(str(j.get_text(strip=False))[0:-34])

    parseResult = parse_ingredient(str(i.get_text(strip=False)))

    results = shelf.process_ingredients([parseResult.name])
    try:
        temp = results['ingredients'][0]['foodname']
    except Exception:
        # Fall back to the 'HS' entry if the food name lookup fails
        temp = results['HS'][0]

    temp = temp.replace(" ", "_")

    data.append(temp)

    return data
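
The fixed slice `[0:-34]` removes the last 34 characters whatever they are, so it would cut into the dish name if the heading ever changed length. A more defensive variant, sketched here with a made-up suffix because the actual trailing text is not shown above, strips the suffix only when it is present. The function name and the sample strings are hypothetical.

```python
def strip_known_suffix(title, suffix):
    """Remove `suffix` from the end of `title` if present, then trim whitespace."""
    if suffix and title.endswith(suffix):
        title = title[: -len(suffix)]
    return title.strip()

# Hypothetical heading text and suffix, for illustration only
HYPOTHETICAL_SUFFIX = " | Example Site"
with_suffix = strip_known_suffix("Example dish | Example Site", HYPOTHETICAL_SUFFIX)
without_suffix = strip_known_suffix("Example dish", HYPOTHETICAL_SUFFIX)
print(with_suffix, "/", without_suffix)
```

Both calls return the same dish name, which is the behaviour the slice cannot guarantee.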

In [12]:
ingredientList = []
shelf = pyf.Shelf(region='Italy', month_id=0)

for recipe in recipeLinks:
    req = requests.get(recipe) # Accesses the next recipe
    soup = bs(req.text, "html.parser")
    title = soup.findAll('h1') # finds the name of the recipe
    
    ingredientsDiv = soup.findAll('div', class_='field-name-field-ingredients') # finds all the divs containing the ingredients
    
    # For each div go through and extract the ingredients
    for ul in ingredientsDiv:
        ingredients = ul.findAll("li")
        
        for i in ingredients:
            # data = parseIngredientsFunc(i, title)
            # data = pyfoodFunc(title, i)
            data = parseIngredientsAndPyfoodFunc(i, title, shelf)
            
            ingredientList.append(data)
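
To check the selector logic without hitting the website, the parsing step can be separated from the download. The sketch below applies the same div and `li` lookup to a hand-written snippet. The HTML and the `extract_ingredient_lines` name are made up for the example, and unlike the cell above it trims whitespace with `strip=True`.

```python
from bs4 import BeautifulSoup

def extract_ingredient_lines(html):
    """Return the text of every <li> inside the ingredient divs."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for div in soup.find_all("div", class_="field-name-field-ingredients"):
        for li in div.find_all("li"):
            lines.append(li.get_text(strip=True))
    return lines

# Hypothetical page fragment standing in for a downloaded recipe page
sample_page = """
<h1>Example dish</h1>
<div class="field-name-field-ingredients">
  <ul><li>2 cups flour</li><li>1 tsp salt</li></ul>
</div>
<ul><li>not an ingredient</li></ul>
"""
ingredient_lines = extract_ingredient_lines(sample_page)
print(ingredient_lines)
```

List items outside the ingredient div are ignored, which is the property the scraping loop depends on.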



## Finding each ingredient's type and index

In [None]:
ing_info = pd.read_csv("../CSV/Compound CSVs/ingr_info.tsv", sep="\t")

category = [None] * len(ingredientList)
index = [None] * len(ingredientList)

for ing in range(0, len(ingredientList)):
    for ing2 in range(0, len(ing_info["ingredient name"])):
        if(ingredientList[ing][1] == ing_info["ingredient name"][ing2]):
            category[ing] = ing_info["category"][ing2]
            index[ing] = ing_info["# id"][ing2]
            break # Stop searching once the ingredient has been matched
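
The nested loop compares every scraped ingredient against every row of the table. A dictionary keyed by ingredient name gives the same kind of lookup in a single pass. The sketch below uses small made-up lists in place of the `ingr_info.tsv` columns and the scraped rows, so the values are purely illustrative.

```python
# Hypothetical stand-ins for the "ingredient name", "category" and "# id" columns
names = ["tomato", "olive_oil", "garlic"]
categories = ["vegetable", "plant derivative", "vegetable"]
ids = [10, 11, 12]

# Map each ingredient name to its (category, id); a repeated name keeps the last row
lookup = {name: (cat, idx) for name, cat, idx in zip(names, categories, ids)}

# Hypothetical scraped rows in the [recipe, ingredient] layout used above
sample_rows = [["Example dish", "garlic"], ["Example dish", "saffron"]]

# Unmatched ingredients get (None, None), mirroring the None-filled lists
matched = [lookup.get(row[1], (None, None)) for row in sample_rows]
print(matched)
```

With the real data, the three lists would be replaced by the corresponding columns of `ing_info`, which can be zipped in the same way.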

In [None]:
df = pd.DataFrame(columns = ["Recipe", "Ingredient"], data = ingredientList)

df.insert(2, "Ingredient Index", index, True)
df.insert(2, "Ingredient Category", category, True)

## Saving the data to a CSV

We decided to save the scraped data to a series of CSVs. This makes the data easier and faster to access, as you do not need to wait for the long scraping process to run again.

In [None]:
def saveToCSV(header, rows, filePath):
    with open(filePath, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)

        # write the header
        writer.writerow(header)

        # write multiple rows
        writer.writerows(rows)
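
Reading the cached file back is the other half of this step. The sketch below writes a made-up row to a temporary file using the same `csv.writer` pattern and loads it again with `csv.reader`. The rows and the temporary path are placeholders, not the project's data.

```python
import csv
import os
import tempfile

demo_header = ["Recipe Name", "Links"]
demo_rows = [["Example dish", "https://example.com/example-dish"]]

# Write to a throwaway file so the example does not touch the real CSV folder
tmp_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(tmp_path, "w", encoding="UTF8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(demo_header)
    writer.writerows(demo_rows)

# Read it back: the first row is the header, the rest are data rows
with open(tmp_path, encoding="UTF8", newline="") as f:
    loaded = list(csv.reader(f))

print(loaded)
```

Note that `csv.reader` returns every field as a string, so numeric columns need converting after loading.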

The following CSV stores all the ingredients for each dish found on the SBS website.

In [None]:
df.to_csv("../CSV/recipeList.csv", index = False)

## Scraping a list of recipes and their links

The following for loop goes over all the links previously scraped, gets the title from each page (this is the recipe name), and collects the recipe name and link as a row. The rows are then saved to a CSV, which gives us a list of recipes and their web pages.

In [None]:
RecipeList = []

for recipe in recipeLinks:
    req = requests.get(recipe) # Accesses the next recipe
    soup = bs(req.text, "html.parser")
    title = soup.findAll('h1') # finds the name of the recipe

    row = []

    for j in title:
        row.append(str(j.get_text(strip=False)))

    row.append(recipe)

    RecipeList.append(row)

In [None]:
header = ['Recipe Name', 'Links']

saveToCSV(header, RecipeList, '../CSV/recipeLinks.csv')