### Using Beautiful Soup to scrape
- Recipe name
- Ingredients list
- Cooking time
- Instructions
- Nutrition info
- Tags (e.g., vegan, keto, veg/non-veg)

- The website we're currently using is https://www.bbcgoodfood.com/recipes/collection/healthy-indian-recipes
It also provides a nutrient breakdown for each recipe


#### Import dependencies

In [None]:
# pip install pandas
# pip install requests
# pip install beautifulsoup4
# pip install lxml

In [1]:
import pandas as pd
import requests 
from bs4 import BeautifulSoup

#### Fetching the page

In [36]:
webpage = requests.get("https://www.bbcgoodfood.com/recipes/collection/healthy-indian-recipes").text

In [37]:
# parse the fetched HTML
soup = BeautifulSoup(webpage, 'lxml')  # lxml is used as the HTML parser
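
A slightly more defensive version of the fetch-and-parse step could look like the sketch below. The helper names `fetch_html` and `make_soup`, the optional `session` argument and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook. `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, session=None, timeout=10):
    """Return the HTML text of `url`, raising on HTTP errors.

    `session` and the 10 s timeout are illustrative choices, not notebook code.
    """
    client = session or requests  # a Session and the module both expose .get()
    response = client.get(url, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text


def make_soup(html):
    # "html.parser" ships with Python; the notebook itself uses "lxml".
    return BeautifulSoup(html, "html.parser")
```

With these helpers, the two cells above would reduce to `soup = make_soup(fetch_html(url))`.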

In [None]:
soup.find_all('h2')[0].text  # the first <h2> holds a recipe name

'Kitchari'

In [40]:
names = []
for i in soup.find_all('h2'):
    names.append(i.text)

In [None]:
len(names)  # total recipes available on the first page

24

- To get the links to all the recipes
- They are stored in `a` tags with the class "link d-block"

In [51]:
soup.find_all('a',class_= "link d-block")

<a class="link d-block" data-component="Link" href="https://www.bbcgoodfood.com/recipes/kitchari"><h2 class="heading-4" style="color:inherit">Kitchari</h2></a>

In [57]:
links = []
for i in soup.find_all('a',class_= "link d-block"):
    links.append(i.get("href"))
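
Since each `a` tag wraps the recipe's `h2`, the name and the link can also be collected together in one pass. The sketch below runs on a single card copied from the output above; the variable names and the use of a CSS selector are illustrative choices, not the notebook's own approach.

```python
from bs4 import BeautifulSoup

# One card copied from the output above, used as offline sample input.
sample_html = (
    '<a class="link d-block" data-component="Link" '
    'href="https://www.bbcgoodfood.com/recipes/kitchari">'
    '<h2 class="heading-4" style="color:inherit">Kitchari</h2></a>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "a.link.d-block" matches <a> tags that carry both CSS classes.
cards = [
    (a.get_text(strip=True), a.get("href"))
    for a in sample_soup.select("a.link.d-block")
]
print(cards)
```

Keeping each name next to its link avoids relying on two separate lists staying in the same order.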

In [None]:
# now, from every link, we will get our information; let's start from the first link
# store: name, serving size, cooking time, tags, ingredients, nutrition, instructions

### For the first webpage

In [62]:
webpage = requests.get("https://www.bbcgoodfood.com/recipes/kitchari").text

In [63]:
soup = BeautifulSoup(webpage, 'lxml')

In [67]:
soup.find_all('h1')[0].text

'Kitchari'

In [69]:
# serving size
soup.find_all('div', class_="recipe-cook-and-prep-details__item")[0].text

'Serves 4'

In [226]:
# cooking time: inspect all the cook-and-prep detail items first
soup.find_all('div', class_="recipe-cook-and-prep-details__item")

[<div class="recipe-cook-and-prep-details__item"><strong>Serves 4</strong></div>,
 <div class="recipe-cook-and-prep-details__item"><strong>Easy</strong></div>,
 <div class="recipe-cook-and-prep-details__item">Prep:<!-- --> <strong><span><time datetime="PT0H10M">10 mins</time></span></strong></div>,
 <div class="recipe-cook-and-prep-details__item">Cook:<!-- --> <strong><span><time datetime="PT1H0M">1 hr</time></span></strong></div>]

In [233]:
# Find all detail items
cook_prep_items = soup.find_all('div', class_="recipe-cook-and-prep-details__item")

# Initialize variables
serving_size = None
cook_time = None

# Loop through items and extract based on content
for item in cook_prep_items:
    text = item.get_text(strip=True).lower()
    print(text)
    if 'serves' in text:
        serving_size = text
    elif 'cook' in text and item.find('time'):
        cook_time = item.find('time').text.strip()

serves 4
easy
prep:10 mins
cook:1 hr


In [232]:
print(cook_time)

1 hr
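
The `time` tags in the output above also carry a machine-readable `datetime` attribute (`PT0H10M`, `PT1H0M`). A minimal sketch of turning those values into minutes is shown below; the function name `duration_to_minutes` and the regular expression are assumptions for illustration, and only the hours-and-minutes form seen above is handled.

```python
import re
from bs4 import BeautifulSoup

# Two detail items copied from the output above.
sample_html = """
<div class="recipe-cook-and-prep-details__item">Prep:<!-- --> <strong><span><time datetime="PT0H10M">10 mins</time></span></strong></div>
<div class="recipe-cook-and-prep-details__item">Cook:<!-- --> <strong><span><time datetime="PT1H0M">1 hr</time></span></strong></div>
"""

DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?")


def duration_to_minutes(value):
    """Convert strings like 'PT1H0M' to minutes; return None if not matched."""
    match = DURATION.fullmatch(value or "")
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


times = {}
for item in BeautifulSoup(sample_html, "html.parser").find_all("time"):
    # The enclosing <div> text starts with "Prep:" or "Cook:".
    label = item.find_parent("div").get_text(strip=True).split(":")[0].lower()
    times[label] = duration_to_minutes(item.get("datetime"))
print(times)
```

Numeric minutes are easier to sort and filter on than strings such as `'1 hr'` and `'35 mins'`.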


In [79]:
# tags 
soup_tags = soup.find_all('div', class_ = "post-header--masthead__tags-item")

In [80]:
tags = [tag.text for tag in soup_tags]

In [147]:
tags

['Gluten-free', 'Healthy', 'Low calorie', 'Low fat', 'Vegetarian']
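
The tag list can be reduced to a single veg/vegan label, as mentioned in the goals at the top. The sketch below is one possible mapping; the function name `diet_label` and the label values are hypothetical, and a missing tag is reported as `'unlabelled'` because the absence of a "Vegetarian" tag does not prove a dish is non-veg.

```python
def diet_label(tags):
    """Return 'vegan', 'veg' or 'unlabelled' from a recipe's tag list."""
    normalized = {tag.strip().lower() for tag in tags}
    if "vegan" in normalized:
        return "vegan"
    if "vegetarian" in normalized:
        return "veg"
    # No vegetarian/vegan tag: this alone does not prove the dish contains meat.
    return "unlabelled"


print(diet_label(['Gluten-free', 'Healthy', 'Low calorie', 'Low fat', 'Vegetarian']))
```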

In [85]:
# ingredients
soup.find_all("li", class_ = "ingredients-list__item list-item")[0].text

'1 tbsp ghee'

In [97]:
ul = soup.find('ul',class_ = 'ingredients-list')
print(ul.prettify())

<ul class="ingredients-list list">
 <li class="ingredients-list__item list-item">
  1 tbsp
  <a class="link link--styled" data-component="Link" href="/glossary/ghee-glossary">
   ghee
  </a>
 </li>
 <li class="ingredients-list__item list-item list-item--separator-top">
  1
  <a class="link link--styled" data-component="Link" href="/glossary/cauliflower-glossary">
   small cauliflower
  </a>
  <div class="ingredients-list__item-note">
   stalks and florets finely chopped
  </div>
 </li>
 <li class="ingredients-list__item list-item list-item--separator-top">
  2
  <a class="link link--styled" data-component="Link" href="/glossary/carrots-glossary">
   carrots
  </a>
  <div class="ingredients-list__item-note">
   finely chopped
  </div>
 </li>
 <li class="ingredients-list__item list-item list-item--separator-top">
  15g
  <a class="link link--styled" data-component="Link" href="/glossary/ginger-glossary">
   piece of ginger
  </a>
  <div class="ingredients-list__item-note">
   peeled and 

In [94]:
ul.find_all('li')

[<li class="ingredients-list__item list-item">1 tbsp <a class="link link--styled" data-component="Link" href="/glossary/ghee-glossary">ghee</a></li>,
 <li class="ingredients-list__item list-item list-item--separator-top">1 <a class="link link--styled" data-component="Link" href="/glossary/cauliflower-glossary">small cauliflower</a><div class="ingredients-list__item-note"> stalks and florets finely chopped</div></li>,
 <li class="ingredients-list__item list-item list-item--separator-top">2 <a class="link link--styled" data-component="Link" href="/glossary/carrots-glossary">carrots</a><div class="ingredients-list__item-note"> finely chopped</div></li>,
 <li class="ingredients-list__item list-item list-item--separator-top">15g <a class="link link--styled" data-component="Link" href="/glossary/ginger-glossary">piece of ginger</a><div class="ingredients-list__item-note"> peeled and grated</div></li>,
 <li class="ingredients-list__item list-item list-item--separator-top">1 tsp <a class="link

In [98]:
ingredients = [li.text for li in ul.find_all('li')]

In [99]:
ingredients

['1 tbsp ghee',
 '1 small cauliflower stalks and florets finely chopped',
 '2 carrots finely chopped',
 '15g piece of ginger peeled and grated',
 '1 tsp ground cumin',
 '½ tsp black mustard seeds',
 '½ tsp fennel seeds',
 '½ tsp ground coriander',
 '½ tsp ground turmeric',
 '150g moong dal rinsed and drained (available in specialist shops and large supermarkets)',
 '100g basmati rice rinsed and drained',
 'small handful of coriander finely chopped',
 '1 lime cut into wedges']
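
In the list above, the preparation note (for example "finely chopped") is merged into the ingredient text. If the two should be kept apart, the note `div` can be read separately. The sketch below works on two items trimmed from the output above; `split_ingredient` is a hypothetical helper, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Two list items trimmed from the output above.
sample_html = """
<ul class="ingredients-list list">
<li class="ingredients-list__item list-item">1 tbsp <a class="link link--styled" href="/glossary/ghee-glossary">ghee</a></li>
<li class="ingredients-list__item list-item list-item--separator-top">1 <a class="link link--styled" href="/glossary/cauliflower-glossary">small cauliflower</a><div class="ingredients-list__item-note"> stalks and florets finely chopped</div></li>
</ul>
"""


def split_ingredient(li):
    """Return {'item': ..., 'note': ...} for one <li>; note is None if absent."""
    note_tag = li.find("div", class_="ingredients-list__item-note")
    note = note_tag.get_text(strip=True) if note_tag is not None else None
    if note_tag is not None:
        note_tag.extract()  # removes the note from the tree (mutates the soup)
    item = " ".join(li.get_text().split())  # collapse runs of whitespace
    return {"item": item, "note": note}


ul = BeautifulSoup(sample_html, "html.parser").find("ul", class_="ingredients-list")
parsed = [split_ingredient(li) for li in ul.find_all("li")]
print(parsed)
```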

In [100]:
ul = soup.find('ul', class_ = "nutrition-list")

In [104]:
ul.find_all('li')

[<li class="nutrition-list__item"><span class="fw-600 mr-1">kcal</span>271<div class="nutrition-list__additional-text">low<div class="nutrition-list__additional-text-icon"><i class="icon" style="width:12px;min-width:12px;height:12px;min-height:12px;animation-duration:1000ms;transform:rotate(180deg)"><svg aria-hidden="true" class="icon__svg" focusable="false" style="color:rgba(255, 255, 255, 1);fill:rgba(255, 255, 255, 1)"><use xlink:href="/static/icons/base/sprite-maps/arrows-71cd4ec91a6536f2abcc71183b8f0de8.svg#arrow-light"></use></svg></i></div></div></li>,
 <li class="nutrition-list__item"><span class="fw-600 mr-1">fat</span>6<!-- -->g<div class="nutrition-list__additional-text">low<div class="nutrition-list__additional-text-icon"><i class="icon" style="width:12px;min-width:12px;height:12px;min-height:12px;animation-duration:1000ms;transform:rotate(180deg)"><svg aria-hidden="true" class="icon__svg" focusable="false" style="color:rgba(255, 255, 255, 1);fill:rgba(255, 255, 255, 1)"><u

In [116]:
nutrition = {}

for li in ul.find_all('li'):
    label = li.find('span').text
    quantity =  li.contents[1]
    nutrition[label]=f"{quantity} g"

In [117]:
nutrition  # note: " g" is appended to every value, including kcal

{'kcal': '271 g',
 'fat': '6 g',
 'saturates': '3 g',
 'carbs': '40 g',
 'sugars': '4 g',
 'fibre': '4 g',
 'protein': '13 g',
 'salt': '0.1 g'}
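
Because `" g"` is appended unconditionally, `kcal` ends up as `'271 g'` in the output above. The page itself only places a `g` after the gram-based values, so the unit can be read from the markup instead. The sketch below uses two list items trimmed from the output above; `parse_nutrition` is a hypothetical helper and the regular expression is an assumption about the value format.

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString

# Two items trimmed from the output above (icon markup removed).
sample_html = """
<ul class="nutrition-list">
<li class="nutrition-list__item"><span class="fw-600 mr-1">kcal</span>271<div class="nutrition-list__additional-text">low</div></li>
<li class="nutrition-list__item"><span class="fw-600 mr-1">fat</span>6<!-- -->g<div class="nutrition-list__additional-text">low</div></li>
</ul>
"""


def parse_nutrition(ul):
    """Map each label to (value, unit), taking the unit from the page itself."""
    result = {}
    for li in ul.find_all("li"):
        label = li.find("span").get_text(strip=True)
        # Join the text nodes directly under <li>, skipping HTML comments.
        raw = "".join(
            str(node) for node in li.contents
            if isinstance(node, NavigableString) and not isinstance(node, Comment)
        )
        match = re.match(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)", raw)
        if match:
            result[label] = (float(match.group(1)), match.group(2))
    return result


nutrition_demo = parse_nutrition(
    BeautifulSoup(sample_html, "html.parser").find("ul", class_="nutrition-list")
)
print(nutrition_demo)
```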

In [123]:
# instructions
uls = soup.find('ul', class_ = "method-steps__list")

In [139]:
instructions=[]
for li in uls.find_all('li', class_="method-steps__list-item"):
    step_no = li.find("h3", class_="method-steps__item-heading heading-5").text  # the step number is stored in the h3 tag
    step_content = li.find('p').text
    instructions.append(step_no)
    instructions.append(step_content)

In [140]:
instructions

['step 1',
 'Melt the ghee in a large flameproof casserole or saucepan over a medium heat. Stir in all the cauliflower and carrots, and season lightly. Fry gently for 10 mins until the vegetables have softened and taken on a bit of colour.',
 'step 2',
 'Tip in all the spices and fry for a further 2 mins until fragrant. Pour in the moong dal and rice, and stir to coat in the spices. Season with salt and pour in 1.25 litres water. Bring to a simmer and cook for 45 mins, stirring occasionally until the beans and rice are fully tender and have broken down. The texture should be porridge-like. Season to taste and sprinkle over the coriander. Serve with the lime wedges on the side for squeezing over.']
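
The list above alternates step labels and step text. If each label should stay attached to its text, the flat list can be regrouped into pairs. This is a small sketch on shortened sample data; `pair_steps` is a hypothetical helper.

```python
def pair_steps(flat):
    """Turn [label, text, label, text, ...] into [(label, text), ...]."""
    if len(flat) % 2:
        raise ValueError("expected an even number of entries")
    return list(zip(flat[::2], flat[1::2]))


# Shortened sample in the same shape as the `instructions` list above.
sample_steps = [
    "step 1",
    "Melt the ghee in a large flameproof casserole or saucepan over a medium heat.",
    "step 2",
    "Tip in all the spices and fry for a further 2 mins until fragrant.",
]
print(pair_steps(sample_steps))
```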

### Now, let's create functions to get the same data from all the pages

In [237]:
def get_data(link):
    # requesting webpage
    webpage = requests.get(link).text
    soup = BeautifulSoup(webpage, 'lxml')
    name = soup.find('h1').text

    # serving size and cooking time
    cook_prep_items = soup.find_all('div', class_="recipe-cook-and-prep-details__item")

    # Initialize variables
    serving_size = None
    cook_time = None

    # Loop through items and extract based on content
    for item in cook_prep_items:
        text = item.get_text(strip=True).lower()
        if 'serves' in text:
            serving_size = text
        elif 'cook' in text and item.find('time'):
            cook_time = item.find('time').text.strip()

    #tags
    soup_tags = soup.find_all('div', class_ = "post-header--masthead__tags-item")
    tags = [tag.text for tag in soup_tags]

    #ingredients
    ul_ingredients = soup.find('ul',class_ = 'ingredients-list')
    ingredients = [li.text for li in ul_ingredients.find_all('li')]

    #nutrition
    ul_nutr = soup.find('ul', class_ = "nutrition-list")
    nutrition = {}
    for li in ul_nutr.find_all('li'):
        label = li.find('span').text
        quantity =  li.contents[1]
        nutrition[label]=f"{quantity} g"

    # instructions
    uls = soup.find('ul', class_ = "method-steps__list")
    instructions=[]
    for li in uls.find_all('li', class_="method-steps__list-item"):
        step_no = li.find("h3", class_="method-steps__item-heading heading-5").text  # the step number is stored in the h3 tag
        step_content = li.find('p').text
        instructions.append(step_no)
        instructions.append(step_content)

    return {"name":name, "tags":tags, "ingredients":ingredients,"serving_size":serving_size,"cook_time":cook_time, "nutrition":nutrition, "instructions":instructions}
    

In [254]:
recipes_pg1 = [get_data(href) for href in links[:-3]]
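
`soup.find` returns `None` when nothing matches, so `get_data` would raise an `AttributeError` on a page that lacks, say, an ingredients or nutrition list. A minimal sketch of `None`-safe lookups is shown below; `text_or_none` and `list_items` are hypothetical helpers, not part of the notebook.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, **kwargs):
    """Return the stripped text of the first match, or None when nothing matches."""
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None


def list_items(parent, name, **kwargs):
    """Return the text of each <li> under the first match, or [] if it is absent."""
    container = parent.find(name, **kwargs)
    if container is None:
        return []
    return [li.get_text(" ", strip=True) for li in container.find_all("li")]


# Tiny stand-in page: it has a title and ingredients but no nutrition list.
sample = BeautifulSoup(
    "<h1>Kitchari</h1>"
    "<ul class='ingredients-list'><li>1 tbsp <a>ghee</a></li></ul>",
    "html.parser",
)
print(text_or_none(sample, "h1"))
print(list_items(sample, "ul", class_="ingredients-list"))
print(list_items(sample, "ul", class_="nutrition-list"))
```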

In [255]:
# function for getting the links to all recipes listed on a collection page
def get_recipe_links(main_page_addr):
    webpage = requests.get(main_page_addr).text
    soup = BeautifulSoup(webpage, 'lxml')
    links = [i.get("href") for i in soup.find_all('a',class_= "link d-block")]
    return links

### Next page

In [246]:
links = get_recipe_links("https://www.bbcgoodfood.com/recipes/collection/healthy-indian-recipes?page=2")

In [256]:
recipes_pg2 = [get_data(href) for href in links[:-3]]

### Final Page

In [258]:
links = get_recipe_links("https://www.bbcgoodfood.com/recipes/collection/healthy-indian-recipes?page=3")

In [260]:
recipes_pg3 = [get_data(href) for href in links[:-3]]

In [261]:
recipes = recipes_pg1 + recipes_pg2 + recipes_pg3
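
The three per-page cells above repeat the same steps, so they could be folded into one loop. The sketch below assumes that page 1 is the bare collection URL and later pages use `?page=N`, as in the cells above; `scrape_collection`, `drop_last` and `delay` are hypothetical names, and `drop_last=3` only mirrors the notebook's `links[:-3]` slice.

```python
import time


def scrape_collection(base_url, pages, get_links, get_recipe, drop_last=3, delay=1.0):
    """Collect recipes from pages 1..pages of a collection.

    `get_links` and `get_recipe` are passed in so the loop stays testable
    without network access.
    """
    recipes = []
    for page in range(1, pages + 1):
        url = base_url if page == 1 else f"{base_url}?page={page}"
        links = get_links(url)
        if drop_last:
            links = links[:-drop_last]  # mirrors the notebook's links[:-3]
        for href in links:
            recipes.append(get_recipe(href))
            time.sleep(delay)  # pause between requests to go easy on the site
    return recipes
```

In this notebook the call would be `scrape_collection("https://www.bbcgoodfood.com/recipes/collection/healthy-indian-recipes", 3, get_recipe_links, get_data)`.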

In [262]:
len(recipes)

52

In [263]:
df = pd.DataFrame(recipes)

In [265]:
df.head()

Unnamed: 0,name,tags,ingredients,serving_size,cook_time,nutrition,instructions
0,Chicken madras,"[Dairy-free, Egg-free, Gluten-free, Healthy, L...","[1 onion peeled and quartered, 2 garlic cloves...",serves 3 - 4,35 mins,"{'kcal': '373 g', 'fat': '17 g', 'saturates': ...","[step 1, Blitz 1 quartered onion, 2 garlic clo..."
1,Pani puris,"[Healthy, Vegan, Vegetarian]","[150g chakki atta (chapatti flour), 30g fine s...",serves 4 - 6,40 mins,"{'kcal': '385 g', 'fat': '20 g', 'saturates': ...","[step 1, Make the pani water. Place the corian..."
2,Easy veggie biryani,"[Healthy, Vegetarian]","[250g basmati rice, 400g special mixed frozen ...",serves 4,,"{'kcal': '305 g', 'fat': '6 g', 'saturates': '...","[step 1, Boil the kettle. Get out a large micr..."
3,"Onion & butternut bhajis with rotis, mango rai...","[Freezable (Freeze cooked bhajis only), Healthy]","[10 rotis sprinkled with water, wrapped in bak...",,25 mins,"{'kcal': '328 g', 'fat': '11 g', 'saturates': ...","[step 1, For the bhajis, mix the korma paste w..."
4,Prawn jalfrezi,"[Freezable, Gluten-free, Healthy]","[2 tsp rapeseed oil, 2 medium onions chopped, ...",serves 2,22 mins,"{'kcal': '335 g', 'fat': '7 g', 'saturates': '...","[step 1, Heat the oil in a non-stick pan and f..."


In [266]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          52 non-null     object
 1   tags          52 non-null     object
 2   ingredients   52 non-null     object
 3   serving_size  50 non-null     object
 4   cook_time     45 non-null     object
 5   nutrition     52 non-null     object
 6   instructions  52 non-null     object
dtypes: object(7)
memory usage: 3.0+ KB
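
Every column is of dtype `object`, and `nutrition` holds a dict per row, which makes numeric filtering awkward. One way to flatten it into numeric columns is sketched below on a one-row stand-in frame; the names `demo`, `nutr` and `flat` are illustrative, and the sketch assumes values in the `'271 g'` format shown earlier.

```python
import pandas as pd

# One-row stand-in for `df`, with values in the notebook's "<number> g" format.
demo = pd.DataFrame({
    "name": ["Kitchari"],
    "nutrition": [{"kcal": "271 g", "fat": "6 g", "salt": "0.1 g"}],
})

# Expand each dict into its own columns, then strip the unit and convert to numbers.
nutr = pd.json_normalize(demo["nutrition"].tolist())
nutr = nutr.apply(lambda col: pd.to_numeric(col.str.replace(" g", "", regex=False)))

flat = pd.concat([demo.drop(columns="nutrition"), nutr], axis=1)
print(flat)
```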


In [267]:
df.to_csv("recipies 1.csv")
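
Two things are worth knowing when this CSV is read back: the row index is written as an unnamed first column unless `index=False` is passed, and list or dict cells come back as plain strings. The sketch below shows one way to handle both, using an in-memory buffer instead of a file; the frame `df_demo` is a stand-in for `df`.

```python
import ast
import io

import pandas as pd

# Small stand-in for `df` with one list-valued column.
df_demo = pd.DataFrame({"name": ["Kitchari"], "tags": [["Healthy", "Vegetarian"]]})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)  # index=False avoids an extra index column
buffer.seek(0)

# literal_eval turns the stored "['Healthy', 'Vegetarian']" text back into a list.
restored = pd.read_csv(buffer, converters={"tags": ast.literal_eval})
print(restored.loc[0, "tags"])
```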