### Scraping more healthy Indian recipes

Fields to collect for each recipe:

- Recipe name
- Ingredients list
- Make time
- Instructions
- Nutrient info 
- Tags (e.g., vegan, keto, veg/non veg)

Website: https://www.tarladalal.com/
- a nutrient breakdown is available too

#### Import dependencies

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [3]:
webpage = requests.get("https://www.tarladalal.com/category/Healthy-Indian-Recipes/").text

In [4]:
soup = BeautifulSoup(webpage, 'lxml')

In [5]:
# Get all category links from the main page;
# they are stored in div tags of class text-center
soup.find_all('div',class_="text-center")[:5]

[<div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Healthy-Low-Calorie-Weight-Loss/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Insoluble-Fiber-Diet/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Low-Cholesterol-/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Soluble-Fibre-Diet/">View All</a>
 </div>]

In [6]:
# get the tag links from the href attribute of each anchor tag
tags_links=[]
for div in soup.find_all('div',class_="text-center")[:-1]:
    tags_links.append(div.find('a').get("href"))

In [7]:
tags_links[:5]

['/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/',
 '/recipes/category/Healthy-Low-Calorie-Weight-Loss/',
 '/recipes/category/Insoluble-Fiber-Diet/',
 '/recipes/category/Low-Cholesterol-/',
 '/recipes/category/Soluble-Fibre-Diet/']

In [8]:
tags_links[-5:]

['/recipes/category/Lactose-Free-Dairy-Free-Cake-/',
 '/recipes/category/Chronic-Kidney-Disease/',
 '/recipes/category/indian-recipes-for-relief-from-pregnancy-constipation/',
 '/recipes/category/selenium1/',
 '/recipes/category/healthy-indian-soups-under-100-calories/']

- These are paths relative to the site root and already start with "/", so let's prefix them with "https://www.tarladalal.com" (no trailing slash).

In [9]:
cleaned_tags_links = ["https://www.tarladalal.com"+link for link in tags_links ]

In [10]:
cleaned_tags_links[:5]

['https://www.tarladalal.com/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/',
 'https://www.tarladalal.com/recipes/category/Healthy-Low-Calorie-Weight-Loss/',
 'https://www.tarladalal.com/recipes/category/Insoluble-Fiber-Diet/',
 'https://www.tarladalal.com/recipes/category/Low-Cholesterol-/',
 'https://www.tarladalal.com/recipes/category/Soluble-Fibre-Diet/']

In [11]:
cleaned_tags_links[-5:]

['https://www.tarladalal.com/recipes/category/Lactose-Free-Dairy-Free-Cake-/',
 'https://www.tarladalal.com/recipes/category/Chronic-Kidney-Disease/',
 'https://www.tarladalal.com/recipes/category/indian-recipes-for-relief-from-pregnancy-constipation/',
 'https://www.tarladalal.com/recipes/category/selenium1/',
 'https://www.tarladalal.com/recipes/category/healthy-indian-soups-under-100-calories/']
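
The prefixing step can also be written with the standard library's `urljoin`, which resolves a relative path against a base URL and leaves links that are already absolute untouched. This is a minimal sketch, not what the notebook runs; `BASE_URL` and `to_absolute` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

# Assumed base; the notebook concatenates "https://www.tarladalal.com" directly.
BASE_URL = "https://www.tarladalal.com/"

def to_absolute(links, base=BASE_URL):
    """Resolve each href against the base URL."""
    return [urljoin(base, link) for link in links]

sample = [
    "/recipes/category/Low-Cholesterol-/",                      # root-relative path
    "https://www.tarladalal.com/recipes/category/selenium1/",   # already absolute
]
absolute = to_absolute(sample)
print(absolute)
```

Compared with string concatenation, this avoids a doubled or missing slash if the site ever mixes relative and absolute hrefs.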

In [12]:
# total filters or tags
len(cleaned_tags_links)

437

- So we now have 437 different filters (tags).
- Let's go through each link and extract the recipes under that category.

In [13]:
# get recipe links from a category page
def get_recipie_links(category_link):
    website = requests.get(category_link).text
    soup = BeautifulSoup(website, "lxml")
    tags_links=[]

    # collect the href of the first anchor inside each img-block div
    for div in soup.find_all('div',class_="img-block"):
        tags_links.append(div.find('a').get("href"))

    # pair each absolute recipe URL with the category slug taken from the category URL
    category = category_link.rstrip('/').split('/')[-1]
    recipies = [("https://www.tarladalal.com"+link, category) for link in tags_links ]

    return recipies

- Now we can use this function to extract the recipe links under each category.
- Next, we scrape each recipe page to get the required data.

In [14]:
recipies_cat = []
for category in cleaned_tags_links:
    recipies_cat.extend(get_recipie_links(category))
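
The loop above sends one request per category, back to back, and a single failed page stops the whole run. The sketch below shows one way to pause between requests and record failures instead of crashing. It is an assumption-laden illustration: `crawl`, `fetch`, and `delay_seconds` are hypothetical names, and the one-second default is an arbitrary choice, not a limit published by the site.

```python
import time

def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    fetch must return a list; any exception is recorded, not raised.
    """
    results, failures = [], []
    for i, url in enumerate(urls):
        if i:  # pause between requests, not before the first one
            time.sleep(delay_seconds)
        try:
            results.extend(fetch(url))
        except Exception as exc:
            failures.append((url, repr(exc)))
    return results, failures

# Stand-in for get_recipie_links so the sketch runs without network access.
def fake_fetch(url):
    if url == "bad":
        raise ValueError("boom")
    return [(url + "/r1", "cat")]

results, failures = crawl(["a", "bad", "b"], fake_fetch, delay_seconds=0)
print(results)
print(failures)
```

In the notebook this would be called as `crawl(cleaned_tags_links, get_recipie_links)`, and the URLs in `failures` could be retried later.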

In [15]:
recipies=[]
for rec, cat in recipies_cat:
    recipies.append(rec)

In [16]:
from collections import defaultdict

recipe_tags = defaultdict(list)

for link, tag in recipies_cat:
    recipe_tags[link].append(tag)


In [17]:
len(set(recipies))

2129

- So we have a list of 2129 unique recipes.

#### Scraping through the first recipe page


In [18]:
recipies = list(set(recipies))

In [19]:
recipies[0]

'https://www.tarladalal.com/oats-methi-muthia-39094r'
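
Note that `list(set(...))` does not preserve order, and for strings the order can change between Python sessions, so `recipies[0]` may be a different recipe on another run. If a stable order matters, `dict.fromkeys` removes duplicates while keeping first-seen order. A minimal sketch, with `unique_in_order` as a hypothetical helper name:

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))

links = [
    "https://www.tarladalal.com/oats-methi-muthia-39094r",
    "https://www.tarladalal.com/another-recipe",        # made-up URL for illustration
    "https://www.tarladalal.com/oats-methi-muthia-39094r",
]
deduped = unique_in_order(links)
print(deduped)
```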

- Features we need: name, serving size, time to make, tags, ingredients, nutrition, instructions

In [20]:
website = requests.get(recipies[0]).text
soup = BeautifulSoup(website,'lxml')

##### Getting the name


In [21]:
soup.find("h4", class_ = "rec-heading").text

'oats methi muthia recipe | steamed Gujarati savoury snack | healthy oats fenugreek dumplings |'

In [22]:
soup.find("h4", class_ = "rec-heading").text.split("|")

['oats methi muthia recipe ',
 ' steamed Gujarati savoury snack ',
 ' healthy oats fenugreek dumplings ',
 '']

In [23]:
# just taking the first name:
soup.find("h4", class_ = "rec-heading").text.split("|")[0].strip()

'oats methi muthia recipe'

In [24]:
def find_name(rec_soup):
    # the page most likely has a single h4 of class rec-heading, so .find() is enough
    return rec_soup.find("h4", class_ = "rec-heading").text.split("|")[0].strip()

In [25]:
find_name(soup)

'oats methi muthia recipe'
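
The heading also carries alternative names after each `|`, which `find_name` discards. If those are worth keeping, the split can return both parts and drop the empty piece left by the trailing `|`. This is a sketch on the raw heading string only; `split_title` is a hypothetical helper, not part of the notebook.

```python
def split_title(heading):
    """Return (primary name, list of alternative names) from a '|'-separated heading."""
    parts = [p.strip() for p in heading.split("|")]
    parts = [p for p in parts if p]  # drop empty pieces, e.g. after a trailing '|'
    if not parts:
        return None, []
    return parts[0], parts[1:]

heading = ("oats methi muthia recipe | steamed Gujarati savoury snack | "
           "healthy oats fenugreek dumplings |")
name, alternatives = split_title(heading)
print(name)
print(alternatives)
```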

##### Getting the Make time

In [26]:
soup.find_all('p', class_="mb-0 font-size-13")

[<p class="mb-0 font-size-13"><strong>5 Mins</strong></p>,
 <p class="mb-0 font-size-13"><strong>13 Mins</strong></p>,
 <p class="mb-0 font-size-13"><strong>18 Mins</strong></p>]

In [27]:
soup.find_all('p', class_="mb-0 font-size-13")[2].text

'18 Mins'

In [28]:
def get_make_time(res_soup):
    # third time value on the page; for this recipe 5 + 13 = 18, so it looks like the total
    return res_soup.find_all('p', class_="mb-0 font-size-13")[2].text

In [29]:
get_make_time(soup)

'18 Mins'
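
For later analysis it may help to store the make time as a number rather than a string such as `'18 Mins'`. The sketch below converts such strings to minutes with a regular expression. Only the `'<n> Mins'` form appears in the output above; the hour spellings handled here are an assumption about what other pages might contain, and `parse_minutes` is a hypothetical name.

```python
import re

# Assumed spellings; only "Mins" is confirmed by the page shown above.
_HOURS = re.compile(r"(\d+)\s*(?:h|hr|hrs|hour|hours)\b", re.IGNORECASE)
_MINUTES = re.compile(r"(\d+)\s*(?:m|min|mins|minute|minutes)\b", re.IGNORECASE)

def parse_minutes(text):
    """Convert '18 Mins' or '1 hr 5 mins' to total minutes; None if nothing matches."""
    hours = _HOURS.search(text)
    minutes = _MINUTES.search(text)
    if not hours and not minutes:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

print(parse_minutes("18 Mins"))
print(parse_minutes("1 hr 5 mins"))
```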

##### Getting serving size

In [30]:
soup.find_all('p',class_="mb-0 font-size-13 font-size-13")

[<p class="mb-0 font-size-13 font-size-13"><strong>3 servings</strong></p>]

In [31]:
# using indexing for now; switch to .find() if this causes a problem later
soup.find_all('p',class_="mb-0 font-size-13 font-size-13")[0].text

'3 servings'

In [32]:
def get_serving_size(res_soup):
    # use the res_soup argument rather than the global soup
    return res_soup.find_all('p',class_="mb-0 font-size-13 font-size-13")[0].text

In [33]:
get_serving_size(soup)

'3 servings'

##### Getting the tags

In [34]:
soup.find('ul', class_ = 'tags-list')

<ul class="tags-list">
<li><a href="/recipes-for-equipment-indian-steamer-recipes-315">Indian Steamer Recipes</a></li>
<li><a href="/recipes-for-breakfast--Indian-veg-breakfast-recipes-151">Indian Breakfast Recipes</a></li>
<li><a href="/recipes-for-Jain-Breakfast-159">Jain Breakfast</a></li>
<li><a href="/recipes-for-Easy-Indian-Veg-Recipes-180-Simple-Vegetarian-Indian-Recipes-964">Easy Indian Veg</a></li>
<li><a href="/recipes-for-Indian-Diabetic-Recipes-370">Indian Diabetic recipes</a></li>
<li><a href="/recipes-for-Diabetic-Breakfast-Recipes-454">Diabetic Indian Breakfast</a></li>
<li><a href="/recipes-for-Diabetic-Starters-Snacks-Recipes-458">Diabetic Indian Snacks, Starters</a></li>
</ul>

In [35]:
li_soup = soup.find('ul', class_ = 'tags-list').find_all('li')
li_soup

[<li><a href="/recipes-for-equipment-indian-steamer-recipes-315">Indian Steamer Recipes</a></li>,
 <li><a href="/recipes-for-breakfast--Indian-veg-breakfast-recipes-151">Indian Breakfast Recipes</a></li>,
 <li><a href="/recipes-for-Jain-Breakfast-159">Jain Breakfast</a></li>,
 <li><a href="/recipes-for-Easy-Indian-Veg-Recipes-180-Simple-Vegetarian-Indian-Recipes-964">Easy Indian Veg</a></li>,
 <li><a href="/recipes-for-Indian-Diabetic-Recipes-370">Indian Diabetic recipes</a></li>,
 <li><a href="/recipes-for-Diabetic-Breakfast-Recipes-454">Diabetic Indian Breakfast</a></li>,
 <li><a href="/recipes-for-Diabetic-Starters-Snacks-Recipes-458">Diabetic Indian Snacks, Starters</a></li>]

In [36]:
tags = [li.text for li in li_soup]
tags


['Indian Steamer Recipes',
 'Indian Breakfast Recipes',
 'Jain Breakfast',
 'Easy Indian Veg',
 'Indian Diabetic recipes',
 'Diabetic Indian Breakfast',
 'Diabetic Indian Snacks, Starters']

In [37]:
# also add the categories this recipe was listed under as tags

tags.extend(recipe_tags[recipies[0]])
tags

['Indian Steamer Recipes',
 'Indian Breakfast Recipes',
 'Jain Breakfast',
 'Easy Indian Veg',
 'Indian Diabetic recipes',
 'Diabetic Indian Breakfast',
 'Diabetic Indian Snacks, Starters',
 'Diabetic-Breakfast-Recipes',
 'Diabetic-Starters-Snacks-Recipes']

In [38]:
def get_tags(res_soup):
    li_soup = res_soup.find('ul', class_ = 'tags-list').find_all('li')
    tags = [li.text for li in li_soup]
    # recipy is not a parameter: it must be the recipe URL bound by the loop in the main function
    tags.extend(recipe_tags[recipy])
    return tags
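
The merged list above mixes two styles: page tags such as `'Jain Breakfast'` and category slugs such as `'Diabetic-Breakfast-Recipes'`. One possible clean-up, sketched below, turns slugs into spaced labels and removes case-insensitive duplicates. `slug_to_label` and `merge_tags` are hypothetical helpers, and treating hyphens as word separators is an assumption about the slugs.

```python
def slug_to_label(slug):
    """'Low-Cholesterol-' -> 'Low Cholesterol' (hyphens become single spaces)."""
    return " ".join(slug.replace("-", " ").split())

def merge_tags(page_tags, category_slugs):
    """Combine page tags and category slugs, keeping first-seen order without duplicates."""
    merged, seen = [], set()
    for tag in list(page_tags) + [slug_to_label(s) for s in category_slugs]:
        key = tag.lower()
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged

merged = merge_tags(
    ["Jain Breakfast", "Easy Indian Veg"],
    ["Diabetic-Breakfast-Recipes", "Low-Cholesterol-", "jain-breakfast"],
)
print(merged)
```

Taking the category slugs as an argument, instead of reading a global inside the function, also makes the helper easy to test on its own.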

##### Getting the ingredients

In [39]:
ingredient_div = soup.find('div', class_="ingredients")

In [40]:

for p in ingredient_div.find_all('p'):
    text = p.get_text(strip=True,separator=" ")
    print(text)

3/4 cup coarsely powdered quick cooking rolled oats
2 cups finely chopped fenugreek leaves (methi)
2 tbsp semolina (rava / sooji)
3 tbsp low fat curds (dahi)
1 1/2 tsp chilli powder
2 tsp coriander-cumin seeds (dhania-jeera) powder
1/4 tsp turmeric powder (haldi)
1 tsp green chilli paste
a pinch asafoetida (hing)
salt to taste
1 tsp oil
1/2 tsp mustard seeds ( rai / sarson)
1/2 tsp sesame seeds (til)
1 1/2 tbsp biryani masala


In [41]:
def get_ingredients(res_soup):
    # use the res_soup argument rather than the global soup
    ingredient_div = res_soup.find('div', class_="ingredients")
    ingredients=[]
    for p in ingredient_div.find_all('p'):
        text = p.get_text(strip=True,separator=" ")
        ingredients.append(text)
    return ingredients

In [42]:
get_ingredients(soup)

['3/4 cup coarsely powdered quick cooking rolled oats',
 '2 cups finely chopped fenugreek leaves (methi)',
 '2 tbsp semolina (rava / sooji)',
 '3 tbsp low fat curds (dahi)',
 '1 1/2 tsp chilli powder',
 '2 tsp coriander-cumin seeds (dhania-jeera) powder',
 '1/4 tsp turmeric powder (haldi)',
 '1 tsp green chilli paste',
 'a pinch asafoetida (hing)',
 'salt to taste',
 '1 tsp oil',
 '1/2 tsp mustard seeds ( rai / sarson)',
 '1/2 tsp sesame seeds (til)',
 '1 1/2 tbsp biryani masala']
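
Each ingredient line follows a loose "quantity unit item" pattern, with exceptions such as `'salt to taste'`. If structured fields are needed later, a small parser along these lines could split them. This is a sketch under stated assumptions: `UNITS` only covers the units visible in the output above, and `parse_ingredient` is a hypothetical name.

```python
import re
from fractions import Fraction

# Assumed unit vocabulary, taken from the sample output only.
UNITS = {"cup", "cups", "tbsp", "tsp"}

# Leading quantity: "2", "3/4", or a mixed number such as "1 1/2".
_QUANTITY = re.compile(r"^(\d+(?:\s+\d+/\d+)?|\d+/\d+)\s+(.*)$")

def parse_ingredient(line):
    """Split an ingredient line into quantity (float), unit and item."""
    line = line.strip()
    match = _QUANTITY.match(line)
    if not match:  # e.g. "salt to taste" or "a pinch asafoetida (hing)"
        return {"quantity": None, "unit": None, "item": line}
    quantity = float(sum(Fraction(part) for part in match.group(1).split()))
    rest = match.group(2)
    first, _, remainder = rest.partition(" ")
    if first.lower() in UNITS:
        return {"quantity": quantity, "unit": first.lower(), "item": remainder}
    return {"quantity": quantity, "unit": None, "item": rest}

print(parse_ingredient("1 1/2 tsp chilli powder"))
print(parse_ingredient("salt to taste"))
```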

##### Getting the nutrients

In [43]:
soup.find("table", id="rcpnutrients")

<table id="rcpnutrients"><tr><td style="padding:0px 2px;">Energy</td><td style="padding:0px 4px;"><span itemprop="calories">133 cal</span></td></tr><tr><td style="padding:0px 2px;">Protein</td><td style="padding:0px 4px;"><span itemprop="proteinContent">5.4 g</span></td></tr><tr><td style="padding:0px 2px;">Carbohydrates</td><td style="padding:0px 4px;"><span itemprop="carbohydrateContent">20.6 g</span></td></tr><tr><td style="padding:0px 2px;">Fiber</td><td style="padding:0px 4px;"><span itemprop="fiberContent">3.7 g</span></td></tr><tr><td style="padding:0px 2px;">Fat</td><td style="padding:0px 4px;"><span itemprop="fatContent">3.3 g</span></td></tr><tr><td style="padding:0px 2px;">Cholesterol</td><td style="padding:0px 4px;"><span itemprop="cholesterolContent">0 mg</span></td></tr><tr><td style="padding:0px 2px;">Sodium</td><td style="padding:0px 4px;"><span itemprop="sodiumContent">23.5 mg</span></td></tr></table>

In [44]:
soup.find("table", id="rcpnutrients").find_all('tr')

[<tr><td style="padding:0px 2px;">Energy</td><td style="padding:0px 4px;"><span itemprop="calories">133 cal</span></td></tr>,
 <tr><td style="padding:0px 2px;">Protein</td><td style="padding:0px 4px;"><span itemprop="proteinContent">5.4 g</span></td></tr>,
 <tr><td style="padding:0px 2px;">Carbohydrates</td><td style="padding:0px 4px;"><span itemprop="carbohydrateContent">20.6 g</span></td></tr>,
 <tr><td style="padding:0px 2px;">Fiber</td><td style="padding:0px 4px;"><span itemprop="fiberContent">3.7 g</span></td></tr>,
 <tr><td style="padding:0px 2px;">Fat</td><td style="padding:0px 4px;"><span itemprop="fatContent">3.3 g</span></td></tr>,
 <tr><td style="padding:0px 2px;">Cholesterol</td><td style="padding:0px 4px;"><span itemprop="cholesterolContent">0 mg</span></td></tr>,
 <tr><td style="padding:0px 2px;">Sodium</td><td style="padding:0px 4px;"><span itemprop="sodiumContent">23.5 mg</span></td></tr>]

In [52]:
for row in soup.find("table", id="rcpnutrients").find_all('tr'):
    print(row.find_all("td")[0].text, row.find_all("td")[1].text)

Energy 133 cal
Protein 5.4 g
Carbohydrates 20.6 g
Fiber 3.7 g
Fat 3.3 g
Cholesterol 0 mg
Sodium 23.5 mg


In [53]:
nutrients = {}
for row in soup.find("table", id="rcpnutrients").find_all('tr'):
    key, value = row.find_all("td")[0].text, row.find_all("td")[1].text
    nutrients[key] = value
nutrients

{'Energy': '133 cal',
 'Protein': '5.4 g',
 'Carbohydrates': '20.6 g',
 'Fiber': '3.7 g',
 'Fat': '3.3 g',
 'Cholesterol': '0 mg',
 'Sodium': '23.5 mg'}

In [54]:
def get_nutrients(res_soup):
    nutrients = {}
    for row in res_soup.find("table", id="rcpnutrients").find_all('tr'):
        key, value = row.find_all("td")[0].text, row.find_all("td")[1].text
        nutrients[key] = value
    return nutrients

In [55]:
get_nutrients(soup)

{'Energy': '133 cal',
 'Protein': '5.4 g',
 'Carbohydrates': '20.6 g',
 'Fiber': '3.7 g',
 'Fat': '3.3 g',
 'Cholesterol': '0 mg',
 'Sodium': '23.5 mg'}
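
The values in this dictionary are strings such as `'133 cal'`, which cannot be compared or summed directly. A possible follow-up step, sketched here, splits each value into a number and a unit. `to_numeric` is a hypothetical helper, and the "number then unit" pattern is assumed from the table above; anything that does not match is stored as `None`.

```python
import re

# A number, optional decimals, then an alphabetic unit, e.g. "23.5 mg".
_VALUE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*")

def to_numeric(nutrients):
    """Map each nutrient to a (value, unit) tuple, or None if the text does not parse."""
    parsed = {}
    for name, raw in nutrients.items():
        match = _VALUE.fullmatch(raw)
        parsed[name] = (float(match.group(1)), match.group(2)) if match else None
    return parsed

sample = {"Energy": "133 cal", "Cholesterol": "0 mg", "Sodium": "23.5 mg"}
numeric = to_numeric(sample)
print(numeric)
```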

##### Getting the instructions

In [63]:
soup.find("div", class_="rsepc").text.strip()   # the steps come back as one run-on string with no separators

'For oats methi muthiaTo make oats methi muthia, combine the oats, fenugreek leaves, semolina, curds, chilli powder, coriander-cumin seeds powder, turmeric powder, green chilli paste, asafoetida and salt in a bowl, mix well and knead into a soft dough using little water.Divide the dough into 2 equal portions and shape each portion into a cylindrical roll of approximately 150 mm. (6") in length and 25 mm. (1") in diameter.Arrange the rolls on a sieve and steam in a steamer on a high flame for 10 minutes. Remove and keep aside to cool slightly for 10 minutes.Cut into 12 mm. (½”) slices and keep aside.For the tempering, heat the oil in a small non-stick pan and add the mustard seeds.When the seeds crackle, add the sesame seeds and cook on a medium flame for 30 seconds.Pour the tempering over the muthia pieces and toss it lightly.Serve the oats methi muthia hot with green chutney.'

In [61]:
def get_instructions(res_soup):
    return res_soup.find("div", class_="rsepc").text.strip()

In [62]:
get_instructions(soup)

'For oats methi muthiaTo make oats methi muthia, combine the oats, fenugreek leaves, semolina, curds, chilli powder, coriander-cumin seeds powder, turmeric powder, green chilli paste, asafoetida and salt in a bowl, mix well and knead into a soft dough using little water.Divide the dough into 2 equal portions and shape each portion into a cylindrical roll of approximately 150 mm. (6") in length and 25 mm. (1") in diameter.Arrange the rolls on a sieve and steam in a steamer on a high flame for 10 minutes. Remove and keep aside to cool slightly for 10 minutes.Cut into 12 mm. (½”) slices and keep aside.For the tempering, heat the oil in a small non-stick pan and add the mustard seeds.When the seeds crackle, add the sesame seeds and cook on a medium flame for 30 seconds.Pour the tempering over the muthia pieces and toss it lightly.Serve the oats methi muthia hot with green chutney.'
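
In the returned string the steps are fused: each new step starts immediately after the previous full stop, with no space. One heuristic for recovering them, sketched below, splits wherever a full stop is directly followed by a capital letter. This is an assumption based on the single output above, not a general rule, and `split_steps` is a hypothetical name; it needs Python 3.7 or later, where `re.split` accepts empty matches. An alternative worth trying is passing a `separator` to `get_text`, as the ingredients cell does, though whether that separates the steps depends on the page's markup.

```python
import re

def split_steps(text):
    """Split fused instructions where '.' is immediately followed by a capital letter."""
    pieces = re.split(r"(?<=\.)(?=[A-Z])", text)
    return [piece.strip() for piece in pieces if piece.strip()]

fused = ('Mix well and knead into a soft dough.'
         'Shape each portion into a roll of 25 mm. (1") in diameter.'
         'Steam for 10 minutes. Remove and keep aside.'
         'Serve hot.')
steps = split_steps(fused)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

Sentences inside one step that are separated by a normal ". " stay together, and abbreviations such as "mm. (" are left alone, because neither has a capital letter directly after the full stop.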