# Let's create a dataset of Finnish Recipes

### Objective:

In this notebook, we aim to scrape Finnish recipes from the website finland.fi. For each recipe, we will extract the following information:
- Title: The name of the recipe.
- Category: The type of dish (e.g., salads, breads, cheeses).
- Ingredients: A list of ingredients required for the recipe.
- Cooking Instructions: Step-by-step instructions for preparing the dish.

By the end of this notebook, we will have a consolidated dataset of Finnish recipes structured in a table format.

### Steps

1. Analyze the Website Structure: Inspect the HTML elements to identify where the recipe details (title, category, ingredients, and instructions) are located.

2. Parse the HTML with BeautifulSoup: Use BeautifulSoup to process and navigate the webpage’s HTML.

3. Extract Relevant Data: Extract titles, categories, ingredients, and instructions by locating and parsing the appropriate HTML tags.

4. Organize the Data: Store the extracted information in a pandas DataFrame for better visualization and analysis.

5. Save the Dataset: Export the DataFrame to a CSV file for future use.
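
The steps above can be sketched end to end on a tiny inline page, without any network access. This is only an illustration: `SAMPLE_HTML`, `demo_soup`, and `demo_df` are hypothetical names, and the layout (`<h2>` for the category, `<h3>` for the title, `<ul>` for ingredients, `<p>` for instructions) is an assumption that mirrors what the rest of the notebook finds on the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature page that mimics the assumed layout
SAMPLE_HTML = """
<div>
  <h2>Salads</h2>
  <h3><b>Demo salad </b>(<span lang="fi">Demosalaatti</span>)</h3>
  <ul><li>1 onion</li><li>salt</li></ul>
  <p>Chop the onion and season with salt.</p>
</div>
"""

# Step 2: parse the HTML
demo_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 3: extract one record per <h3> heading
rows = []
for h3 in demo_soup.find_all("h3"):
    items = h3.find_next("ul").find_all("li")
    rows.append({
        "title": h3.b.get_text(strip=True),
        "category": h3.find_previous("h2").get_text(strip=True),
        "ingredients": ", ".join(li.get_text(strip=True) for li in items),
        "instructions": h3.find_next("p").get_text(strip=True),
    })

# Step 4: organize the records in a DataFrame
demo_df = pd.DataFrame(rows)

# Step 5: export as CSV (returned as a string here instead of written to disk)
csv_text = demo_df.to_csv(index=False)
print(csv_text)
```

Returning the CSV as a string keeps the sketch free of side effects; passing a file name to `to_csv` writes the same content to disk.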

In [100]:
# Lines that start with "#" are comments
# First, we install the libraries used in this notebook.
# BeautifulSoup is published on PyPI as "beautifulsoup4" and imported as "bs4".
# pandas is published as "pandas"; "pd" is only its usual import alias, and the unrelated PyPI package named "pd" is not needed.
# Add the "!" at the beginning to run the line as a shell command from within the notebook.
!pip install beautifulsoup4 pandas requests


In [101]:
# Import the necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [102]:
# Define the URL of the website we want to scrape
url = "https://finland.fi/life-society/finnish-recipes/"

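# Use the requests library to fetch the content of the webpage
# This sends an HTTP GET request to the specified URL and stores the response
# The timeout keeps the cell from hanging if the server does not answer
response = requests.get(url, timeout=30)
# Stop with an error if the server returned a 4xx or 5xx status code
response.raise_for_status()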

# Parse the content of the webpage using BeautifulSoup
# The response.content contains the raw HTML of the page
# We pass it to BeautifulSoup, specifying the HTML parser to process it
soup = BeautifulSoup(response.content, 'html.parser')

In [103]:
# Print the entire parsed HTML content of the webpage
print(soup)

# The 'soup' object contains the parsed HTML of the webpage.
# print(soup) displays that HTML as a single string, without adding any indentation.
# This is similar to what you see if you right-click the webpage in a browser and select "View Page Source".
# From this output, you can identify the specific elements you want to extract (e.g., titles, links).
# For large webpages the output is very long. soup.prettify() returns an indented (and even longer) version that is easier to read; to keep the output short, search for specific elements instead (e.g., soup.find('title')).

<!DOCTYPE html>

<!--[if (lte IE 8)&!(IEMobile)]><html lang="en-US" class="no-js lt-ie9"><![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en-US"><!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<script>
    if(navigator.platform.indexOf('Win') > -1){document.getElementsByTagName('html')[0].className+=' win'}if(navigator.userAgent.indexOf("MSIE") != -1){document.getElementsByTagName('html')[0].className+=' ie'}
    if(document.documentElement.lang=='ru-RU' || document.documentElement.lang=='zh-CN'){document.getElementsByTagName('html')[0].className+=' win'}
  </script>
<link href="https://finland.fi/wp-content/themes/thisisfinland/favicon.ico" rel="shortcut icon">
<!-- cookiebot script start -->
<script data-blockingmode="auto" data-cbid="c101ed35-c9d9-4ec7-bb95-4887f06fa66b" data-culture="en" id="Cookiebot" src="https://consent.cookiebot.com/uc.js" type="
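
Rather than printing the whole page, a few targeted lookups are often enough to understand its structure. The sketch below runs on a small stand-in document; `html` and `page` are hypothetical names, and on the real page the same calls would be made on the `soup` object.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><h2>Salads</h2><h3>A</h3><h3>B</h3><h3>C</h3></body></html>"
)
page = BeautifulSoup(html, "html.parser")

# Text of the <title> element
page_title = page.title.string

# At most two <h3> matches instead of all of them
first_two = page.find_all("h3", limit=2)

# How many headings of each level the page contains
heading_counts = {name: len(page.find_all(name)) for name in ("h2", "h3")}

# Only the first 200 characters of the indented HTML
preview = page.prettify()[:200]

print(page_title)
print(first_two)
print(heading_counts)
print(preview)
```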

In [111]:
# From a manual inspection of the webpage, we find the recipe titles inside <h3> tags
# Narrow 'soup' down to the element that contains the first <h2>,
# which restricts the search to that part of the page
soup = soup.find('h2').find_parent()
recipe_titles = soup.find_all('h3')
recipe_titles

[<h3><b>‘Rosolli’ salad </b>(<span lang="fi">Rosolli</span>)</h3>,
 <h3><b>Finnish mushroom salad </b>(<span lang="fi">Suomalainen sienisalaatti</span>)</h3>,
 <h3><b>Karelian pasties </b>(<span lang="fi">Karjalanpiirakat</span>)</h3>,
 <h3><b>Egg cheese </b>(<span lang="fi">Munajuusto</span>)</h3>,
 <h3><b>Karelian Hot Pot </b>(<span lang="fi">Karjalanpaisti</span>)</h3>,
 <h3><b>Fish soup </b>(<span lang="fi">Kalakeitto</span>)</h3>,
 <h3><b>Fish soup à la Kainuu </b>(<span lang="fi">Kainuulainen kalakeitto</span>)</h3>,
 <h3><b>Cabbage Rolls </b>(<span lang="fi">Kaalikääryleet</span>)</h3>,
 <h3><b>Meatballs </b>(<span lang="fi">Lihapullat</span>)</h3>,
 <h3><b>Mushroom-omelette roll </b>(<span lang="fi">Sienimunakas-kääryle</span>)</h3>,
 <h3><b>Cheese-and-herbs stuffed salmon </b>(<span lang="fi">Yrttijuustolla täytetty lohi</span>)</h3>,
 <h3><b>Runeberg cakes </b>(<span lang="fi">Runebergin tortut</span>)</h3>,
 <h3><b>Poor knights </b>(<span lang="fi">Köyhät ritarit</span>)</h3

In [112]:
# Extract the text of each HTML element, without the tags
for title in recipe_titles:
    print(title.text.strip())

‘Rosolli’ salad (Rosolli)
Finnish mushroom salad (Suomalainen sienisalaatti)
Karelian pasties (Karjalanpiirakat)
Egg cheese (Munajuusto)
Karelian Hot Pot (Karjalanpaisti)
Fish soup (Kalakeitto)
Fish soup à la Kainuu (Kainuulainen kalakeitto)
Cabbage Rolls (Kaalikääryleet)
Meatballs (Lihapullat)
Mushroom-omelette roll (Sienimunakas-kääryle)
Cheese-and-herbs stuffed salmon (Yrttijuustolla täytetty lohi)
Runeberg cakes (Runebergin tortut)
Poor knights (Köyhät ritarit)
Lingonberry Delight (Marjakiisseli)
Oven porridge (Uunipuuro)
Pancakes (Ohukaiset)
Sweet Buns (Pikkupullat)
Tiger cake (Tiikerikakku)
Aunt Hanna’s biscuits (Hanna-tädin piparkakut)
May Day Cookies (Tippaleivät)
Mead (Sima)
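
Because each heading wraps the English name in `<b>` and the Finnish name in `<span lang="fi">`, the two names can also be read directly from those tags instead of splitting the text at the parenthesis. The following is a minimal sketch under that assumption; `split_title` is a hypothetical helper, and the heading is copied from the output above.

```python
from bs4 import BeautifulSoup

# One heading copied from the list of <h3> elements shown earlier
h3_html = '<h3><b>Karelian pasties </b>(<span lang="fi">Karjalanpiirakat</span>)</h3>'
h3 = BeautifulSoup(h3_html, "html.parser").h3


def split_title(heading):
    """Return (english, finnish); fall back to the full text if a tag is missing."""
    en_tag = heading.find("b")
    fi_tag = heading.find("span", attrs={"lang": "fi"})
    english = en_tag.get_text(strip=True) if en_tag else heading.get_text(strip=True)
    finnish = fi_tag.get_text(strip=True) if fi_tag else ""
    return english, finnish


print(split_title(h3))
```

Reading the tags avoids trailing spaces in the English name and does not depend on the title containing exactly one parenthesis.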


In [121]:
recipes = []
for recipe in soup.find_all('h3'):
    # Extract the title
    recipe_title = recipe.text.strip()

    # Split "English name (Finnish name)" at the first parenthesis
    # partition() does not raise an error if a title has no parenthesis or more than one
    recipe_en, _, recipe_fi = recipe_title.partition('(')
    recipe_fi = recipe_fi.replace(')', '')

    # Extract the category (e.g., salads, breads, cheeses)
    # Each recipe belongs to the closest <h2> heading that precedes it
    category_tag = recipe.find_previous('h2')
    category = category_tag.text.strip() if category_tag else "Uncategorized"

    # Extract the ingredients from the first <ul> after the title
    # (if a recipe had no list of its own, this would pick up the next recipe's list)
    ingredients = recipe.find_next('ul')
    ingredient_items = ingredients.find_all('li', recursive=False) if ingredients else []
    ingredients_list = ', '.join(li.text.strip() for li in ingredient_items if li.text.strip())

    # Extract the cooking instructions: every <p> between this title and the next <h3>
    instructions = []
    for tag in recipe.find_all_next():
        if tag.name == 'h3':
            break
        if tag.name == 'p':
            instructions.append(tag)
    instructions_text = ' '.join(ins.text.strip() for ins in instructions)

    # Append the extracted data to the recipes list
    recipes.append({
        'title': recipe_en,
        'title_fi': recipe_fi,
        'category': category,
        'ingredients': ingredients_list,
        'instructions': instructions_text
    })

recipes

[{'title': '‘Rosolli’ salad ',
  'title_fi': 'Rosolli',
  'category': 'Salads',
  'ingredients': '4 boiled potatoes, 4 boiled carrots, 4 boiled beetroot or pickled beetroot, 1 gherkin, 1 small onion, salt, white pepper',
  'instructions': 'Dressing: (water the beetroot was cooked in) Cook the vegetables in their skin well beforehand until just tender. Peel the vegetables and onion, and cut them into small, equal-sized cubes. Mix them together and season with a little salt and white pepper. Whip the cream lightly, season with sugar and vinegar and add a few drops of beetroot liquid for colour. Serve the dressing separately. Garnish the salad with hard-boiled eggs, the yolks and whites chopped separately and laid in stripes on the top.'},
 {'title': 'Finnish mushroom salad ',
  'title_fi': 'Suomalainen sienisalaatti',
  'category': 'Salads',
  'ingredients': '3-4 dl of salted mushrooms, 1 onion',
  'instructions': 'Dressing: Soak the salted mushrooms until the salt level is right. Press 
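
A lookup such as `find_next('ul')` searches the rest of the document, so a recipe without its own list would silently receive the list of the following recipe. One way to keep each lookup inside its recipe is to walk the siblings of the heading until the next heading appears. The sketch below assumes that headings and their content are siblings under the same parent, which may not hold everywhere on the real page; `SAMPLE` and `section_tags` are hypothetical names.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical page: the first recipe has no <ul>, the second one does
SAMPLE = """
<div>
<h3>First</h3>
<p>Mix everything.</p>
<h3>Second</h3>
<ul><li>1 egg</li></ul>
<p>Boil the egg.</p>
</div>
"""


def section_tags(heading, stop=("h2", "h3")):
    """Collect the sibling tags after `heading`, up to the next heading."""
    tags = []
    for node in heading.next_siblings:
        if not isinstance(node, Tag):
            continue  # skip the whitespace strings between tags
        if node.name in stop:
            break
        tags.append(node)
    return tags


doc = BeautifulSoup(SAMPLE, "html.parser")
first, second = doc.find_all("h3")

# The bounded walk stays inside each recipe
print([tag.name for tag in section_tags(first)])
print([tag.name for tag in section_tags(second)])

# The unbounded lookup crosses into the second recipe
print(first.find_next("ul"))
```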

In [115]:
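# Create a DataFrame for better visualization
recipes_df = pd.DataFrame(recipes)

# Remove the trailing space left in 'title' when the name was split at the parenthesis
recipes_df['title'] = recipes_df['title'].str.strip()

# Replace new lines in the 'instructions' column with spaces
recipes_df['instructions'] = recipes_df['instructions'].str.replace('\n', ' ').str.strip()

# Display the first five rows of the DataFrame
print("\nExtracted Recipes:")
recipes_df.head()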


Extracted Recipes:


| | title | title_fi | category | ingredients | instructions |
|---|---|---|---|---|---|
| 0 | ‘Rosolli’ salad | Rosolli | Salads | 4 boiled potatoes, 4 boiled carrots, 4 boiled ... | Dressing:. (water the beetroot was cooked in).... |
| 1 | Finnish mushroom salad | Suomalainen sienisalaatti | Salads | 3-4 dl of salted mushrooms, 1 onion | Dressing:. Soak the salted mushrooms until the... |
| 2 | Karelian pasties | Karjalanpiirakat | Breads and cheeses | 1 decilitre water, ½ – 1 tsp salt, 2½ decilitr... | Ingredients:. Rice filling:. |
| 3 | Egg cheese | Munajuusto | Breads and cheeses | 3 l milk, 1 l sour milk, 4 eggs, 1 tsp salt | Bring the milk almost to boiling point. Combin... |
| 4 | Karelian Hot Pot | Karjalanpaisti | Main courses | 300 g chuck steak, 300 g pork shoulder, 300 g ... | (for four – five persons). Cut the meat into c... |
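
Before saving, a few quick checks help catch rows where the extraction went wrong, such as empty fields or duplicated titles. The sketch below runs on a small hypothetical table (`demo_df`); the same calls would apply to `recipes_df`.

```python
import pandas as pd

# Hypothetical records with the same columns as the scraped data
demo_df = pd.DataFrame([
    {"title": "Demo salad", "category": "Salads",
     "ingredients": "1 onion", "instructions": "Chop."},
    {"title": "Demo bread", "category": "Breads and cheeses",
     "ingredients": "", "instructions": "Bake."},
    {"title": "Demo slaw", "category": "Salads",
     "ingredients": "1 cabbage", "instructions": "Shred."},
])

# Number of recipes in each category
per_category = demo_df["category"].value_counts()

# Number of empty strings in each column
empty_cells = (demo_df == "").sum()

# Number of titles that appear more than once
duplicate_titles = int(demo_df["title"].duplicated().sum())

print(per_category)
print(empty_cells)
print(duplicate_titles)
```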


In [120]:
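# Save to a CSV file for further use
# index=False leaves out the row numbers, which would otherwise come back as an extra unnamed column when the file is read again
# The default comma separator matches the .csv extension; pandas quotes fields that contain commas
recipes_df.to_csv("finnish_recipes.csv", index=False)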
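
The effect of the `index` and `sep` arguments is easiest to see in a round trip. This sketch writes a one-row hypothetical table (`demo_df`) to an in-memory string instead of a file and reads it back with `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical one-row table; the ingredients field contains a comma on purpose
demo_df = pd.DataFrame({"title": ["Demo salad"], "ingredients": ["1 onion, salt"]})

# Saving the index with a tab separator: the reader must be given the same
# separator, and the index returns as a column named "Unnamed: 0"
with_index = demo_df.to_csv(index=True, sep="\t")
reloaded = pd.read_csv(io.StringIO(with_index), sep="\t")
print(list(reloaded.columns))  # ['Unnamed: 0', 'title', 'ingredients']

# Saving without the index and with the default comma: the columns round-trip
# unchanged, and the comma inside the field survives because pandas quotes it
without_index = demo_df.to_csv(index=False)
clean = pd.read_csv(io.StringIO(without_index))
print(list(clean.columns))  # ['title', 'ingredients']
print(clean.loc[0, "ingredients"])  # 1 onion, salt
```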