### Scraping more healthy Indian recipes

- Recipe name
- Ingredients list
- Cooking time
- Instructions
- Nutrient info 
- Tags (e.g., vegan, keto, veg/non veg)

Website: https://www.tarladalal.com/
- A nutrient breakdown is available too

#### Import dependencies

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [4]:
webpage = requests.get("https://www.tarladalal.com/category/Healthy-Indian-Recipes/").text

In [5]:
soup = BeautifulSoup(webpage, 'lxml')

In [6]:
# Get all category links from the main page;
# they are stored in div tags of class text-center
soup.find_all('div',class_="text-center")[:5]

[<div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Healthy-Low-Calorie-Weight-Loss/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Insoluble-Fiber-Diet/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Low-Cholesterol-/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Soluble-Fibre-Diet/">View All</a>
 </div>]

In [7]:
# Get the category links from the href attribute of each anchor tag
tags_links=[]
for div in soup.find_all('div',class_="text-center")[:-1]:
    tags_links.append(div.find('a').get("href"))

In [8]:
tags_links[:5]

['/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/',
 '/recipes/category/Healthy-Low-Calorie-Weight-Loss/',
 '/recipes/category/Insoluble-Fiber-Diet/',
 '/recipes/category/Low-Cholesterol-/',
 '/recipes/category/Soluble-Fibre-Diet/']

In [9]:
tags_links[-5:]

['/recipes/category/Lactose-Free-Dairy-Free-Cake-/',
 '/recipes/category/Chronic-Kidney-Disease/',
 '/recipes/category/indian-recipes-for-relief-from-pregnancy-constipation/',
 '/recipes/category/selenium1/',
 '/recipes/category/healthy-indian-soups-under-100-calories/']

- These are paths relative to the site root, so let's prefix them with "https://www.tarladalal.com" (no trailing slash, since each path already starts with "/").

In [10]:
cleaned_tags_links = ["https://www.tarladalal.com"+link for link in tags_links ]

In [11]:
cleaned_tags_links[:5]

['https://www.tarladalal.com/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/',
 'https://www.tarladalal.com/recipes/category/Healthy-Low-Calorie-Weight-Loss/',
 'https://www.tarladalal.com/recipes/category/Insoluble-Fiber-Diet/',
 'https://www.tarladalal.com/recipes/category/Low-Cholesterol-/',
 'https://www.tarladalal.com/recipes/category/Soluble-Fibre-Diet/']

In [12]:
cleaned_tags_links[-5:]

['https://www.tarladalal.com/recipes/category/Lactose-Free-Dairy-Free-Cake-/',
 'https://www.tarladalal.com/recipes/category/Chronic-Kidney-Disease/',
 'https://www.tarladalal.com/recipes/category/indian-recipes-for-relief-from-pregnancy-constipation/',
 'https://www.tarladalal.com/recipes/category/selenium1/',
 'https://www.tarladalal.com/recipes/category/healthy-indian-soups-under-100-calories/']
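The string concatenation above works because every scraped href starts with "/". As a hedged alternative, `urljoin` from the standard library also copes with hrefs that are already absolute. `BASE_URL` and `to_absolute` are names introduced here for illustration only; they are not part of the notebook.

```python
from urllib.parse import urljoin

# Assumed base address, taken from the site URL used in this notebook
BASE_URL = "https://www.tarladalal.com/"

def to_absolute(link):
    """Resolve a relative href against the site root."""
    return urljoin(BASE_URL, link)

# A relative path gets the site prefix
print(to_absolute("/recipes/category/selenium1/"))
# An already absolute URL is returned unchanged
print(to_absolute("https://www.tarladalal.com/recipes/category/selenium1/"))
```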

In [13]:
# total filters or tags
len(cleaned_tags_links)

437

- We now have 437 category links (filters or tags).
- Let's go through each link and extract the recipes under that category.

In [14]:
# Get (recipe link, category) pairs from a category page
def get_recipie_links(category_link):
    website = requests.get(category_link).text
    soup = BeautifulSoup(website, "lxml")
    tags_links=[]

    # Collect the relative recipe links; each one sits in a div of class img-block
    for div in soup.find_all('div',class_="img-block"):
        tags_links.append(div.find('a').get("href"))
    
    # Use the last path segment of the category URL as the category name,
    # and turn each relative link into an absolute URL
    category = category_link.rstrip('/').split('/')[-1] 
    recipies = [("https://www.tarladalal.com"+link, category) for link in tags_links ]

    return recipies 

- Now we can use this function to extract the recipe links under each category.
- Next, we scrape each recipe page to get the required data.
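Looping over 437 category pages and then every recipe page means a lot of requests, so a failed or slow request should not stop the whole run. The following is a minimal sketch of a fetch helper with a timeout, error handling, and a pause between requests. `fetch_html`, `delay_seconds`, and `timeout` are hypothetical names and values chosen for illustration; the notebook itself calls `requests.get` directly.

```python
import time
import requests

def fetch_html(url, session, delay_seconds=1.0, timeout=10):
    """Return the page HTML, or None if the request fails."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        return None
    finally:
        # Pause after every attempt so the site is not hit in a tight loop
        time.sleep(delay_seconds)
    return response.text
```

With `session = requests.Session()`, the helper can replace `requests.get(category_link).text`; callers then need to skip pages for which it returns `None`.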

In [28]:
recipies_cat = []
for category in cleaned_tags_links:
    recipies_cat.extend(get_recipie_links(category))

In [None]:
recipies=[]
for rec, cat in recipies_cat:
    recipies.append(rec)

In [46]:
from collections import defaultdict

recipe_tags = defaultdict(list)

for link, tag in recipies_cat:
    recipe_tags[link].append(tag)


In [43]:
len(set(recipies))

2129

- So we have 2129 unique recipes.

#### Scraping through the first recipe page


In [52]:
recipies = list(set(recipies))

In [53]:
recipies[0]

'https://www.tarladalal.com/sweet-lime-and-pepper-salad-3601r'
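Note that `list(set(...))` does not guarantee any particular order, so `recipies[0]` can be a different recipe on another run. A sketch of one order-preserving alternative, assuming the same `(link, category)` pairs as `recipies_cat`: the keys of `recipe_tags` are already unique and keep their insertion order. The pairs below are hypothetical stand-ins, not scraped data.

```python
from collections import defaultdict

# Hypothetical (link, category) pairs standing in for `recipies_cat`
recipies_cat = [
    ("https://www.tarladalal.com/recipe-a-1r", "Insoluble-Fiber-Diet"),
    ("https://www.tarladalal.com/recipe-b-2r", "Insoluble-Fiber-Diet"),
    ("https://www.tarladalal.com/recipe-a-1r", "selenium1"),
]

recipe_tags = defaultdict(list)
for link, tag in recipies_cat:
    recipe_tags[link].append(tag)

# Dict keys keep insertion order, so this list is duplicate-free and stable
unique_recipies = list(recipe_tags)
print(unique_recipies)
```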

- Features we need: name, serving size, time to make, tags, ingredients, nutrition, instructions

In [54]:
website = requests.get(recipies[0]).text
soup = BeautifulSoup(website,'lxml')

##### Getting the name


In [55]:
soup.find("h4", class_ = "rec-heading").text

'Sweet Lime and Pepper Salad'

In [56]:
soup.find("h4", class_ = "rec-heading").text.split("|")

['Sweet Lime and Pepper Salad']

In [57]:
# just taking the first name:
soup.find("h4", class_ = "rec-heading").text.split("|")[0].strip()

'Sweet Lime and Pepper Salad'

In [58]:
def find_name(rec_soup):
    # The page most likely has a single h4 of class rec-heading, so .find() is enough
    return rec_soup.find("h4", class_ = "rec-heading").text.split("|")[0].strip()

In [59]:
find_name(soup)

'Sweet Lime and Pepper Salad'
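The sample page has no "|" in its heading, so the effect of the split is not visible above. The small sketch below shows what the same expression does with a heading that lists several names. The multi-name heading is a made-up example, and `first_name` is a hypothetical helper that mirrors the string handling in `find_name`.

```python
def first_name(heading_text):
    """Keep the text before the first '|' and trim surrounding whitespace."""
    return heading_text.split("|")[0].strip()

# Hypothetical heading with alternative names separated by '|'
sample = "Sweet Lime and Pepper Salad | Mosambi Salad | Healthy Salad"
print(first_name(sample))

# A heading without '|' is returned as-is (apart from trimming)
print(first_name("Sweet Lime and Pepper Salad"))
```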

##### Getting the Make time

In [60]:
soup.find_all('p', class_="mb-0 font-size-13")

[<p class="mb-0 font-size-13"><strong>10 Mins</strong></p>,
 <p class="mb-0 font-size-13"><strong>None Mins</strong></p>,
 <p class="mb-0 font-size-13"><strong>10 Mins</strong></p>]

In [61]:
soup.find_all('p', class_="mb-0 font-size-13")[2].text

'10 Mins'

In [62]:
def get_make_time(res_soup):
    return res_soup.find_all('p', class_="mb-0 font-size-13")[2].text

In [63]:
get_make_time(soup)

'10 Mins'
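The time fields come back as strings, and one of them above reads 'None Mins'. If a numeric column is wanted later, a sketch like the following could convert them. It assumes times are always written as a number followed by "Min" or "Mins"; `parse_minutes` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_minutes(text):
    """Return the number of minutes as an int, or None if no number is present."""
    match = re.search(r"(\d+)\s*Mins?", text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

print(parse_minutes("10 Mins"))
print(parse_minutes("None Mins"))
```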

##### Getting serving size

In [64]:
soup.find_all('p',class_="mb-0 font-size-13 font-size-13")

[<p class="mb-0 font-size-13 font-size-13"><strong>4 servings</strong></p>]

In [65]:
# Using indexing here; switch to .find() if this causes problems later
soup.find_all('p',class_="mb-0 font-size-13 font-size-13")[0].text

'4 servings'

In [66]:
def get_serving_size(res_soup):
    # Use the soup passed in, not the global `soup`
    return res_soup.find_all('p',class_="mb-0 font-size-13 font-size-13")[0].text

In [67]:
get_serving_size(soup)

'4 servings'
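Indexing with `[0]` raises an `IndexError` on a page that has no serving-size paragraph. As a hedged sketch of the `.find()` variant mentioned in the comment above, the helper below returns `None` instead. `get_serving_size_safe` is a hypothetical name, and the HTML snippet is copied from the output shown earlier; `html.parser` is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

def get_serving_size_safe(rec_soup):
    """Return the serving-size text, or None if the paragraph is missing."""
    p = rec_soup.find("p", class_="mb-0 font-size-13 font-size-13")
    return p.get_text(strip=True) if p is not None else None

html = '<p class="mb-0 font-size-13 font-size-13"><strong>4 servings</strong></p>'
print(get_serving_size_safe(BeautifulSoup(html, "html.parser")))

# A page without the paragraph yields None rather than an exception
print(get_serving_size_safe(BeautifulSoup("<p>no match</p>", "html.parser")))
```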

##### Getting the tags

In [68]:
soup.find('ul', class_ = 'tags-list')

<ul class="tags-list">
<li><a href="/recipes-for-cooking-basics-no-cooking-veg-indian-282">No Cooking Veg Indian</a></li>
<li><a href="/recipes-for-salads-Indian-salad-recipes-167">Indian Salads</a></li>
<li><a href="/recipes-for-Light-Salads-168">Light Salads</a></li>
<li><a href="/recipes-for-Low-Calorie-Salads-Indian-Veg-Low-Cal-Salads-174">Low Calorie Indian Salad</a></li>
<li><a href="/recipes-for-Forever-Young-Diet--Anti-Aging-Indian-Diet-376">Forever Young Diet, Anti Aging Indian Diet</a></li>
<li><a href="/recipes-for-Beautiful-Skin-Good-Skin-512">Recipes for glowing skin</a></li>
<li><a href="/recipes-for-immunity-boosting-Indian-513">Recipes for Increasing immunity</a></li>
</ul>

In [69]:
li_soup = soup.find('ul', class_ = 'tags-list').find_all('li')
li_soup

[<li><a href="/recipes-for-cooking-basics-no-cooking-veg-indian-282">No Cooking Veg Indian</a></li>,
 <li><a href="/recipes-for-salads-Indian-salad-recipes-167">Indian Salads</a></li>,
 <li><a href="/recipes-for-Light-Salads-168">Light Salads</a></li>,
 <li><a href="/recipes-for-Low-Calorie-Salads-Indian-Veg-Low-Cal-Salads-174">Low Calorie Indian Salad</a></li>,
 <li><a href="/recipes-for-Forever-Young-Diet--Anti-Aging-Indian-Diet-376">Forever Young Diet, Anti Aging Indian Diet</a></li>,
 <li><a href="/recipes-for-Beautiful-Skin-Good-Skin-512">Recipes for glowing skin</a></li>,
 <li><a href="/recipes-for-immunity-boosting-Indian-513">Recipes for Increasing immunity</a></li>]

In [70]:
tags = [li.text for li in li_soup]
tags


['No Cooking Veg Indian',
 'Indian Salads',
 'Light Salads',
 'Low Calorie Indian Salad',
 'Forever Young Diet, Anti Aging Indian Diet',
 'Recipes for glowing skin',
 'Recipes for Increasing immunity']

In [71]:
# also need to add the category as tag

tags.extend(recipe_tags[recipies[0]])
tags

['No Cooking Veg Indian',
 'Indian Salads',
 'Light Salads',
 'Low Calorie Indian Salad',
 'Forever Young Diet, Anti Aging Indian Diet',
 'Recipes for glowing skin',
 'Recipes for Increasing immunity',
 'Salads-to-control-Acidity',
 'Healthy-Fruit-Based-Salads']

In [72]:
def get_tags(res_soup):
    li_soup = res_soup.find('ul', class_ = 'tags-list').find_all('li')
    tags = [li.text for li in li_soup]
    tags.extend(recipe_tags[recipy])  # recipy is the recipe URL; it must be set by the loop in the main function
    return tags
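`get_tags` depends on a global `recipy` that only exists once the main loop runs. One possible variant, sketched here under the assumption that the caller has both the recipe URL and the `recipe_tags` mapping at hand, passes them in explicitly. `get_tags_for` and its parameters are hypothetical names; the HTML and category value are taken from the outputs above.

```python
from bs4 import BeautifulSoup

def get_tags_for(rec_soup, recipe_link, recipe_tags):
    """Combine the on-page tags with the categories recorded for this recipe."""
    tag_list = rec_soup.find("ul", class_="tags-list")
    tags = [li.get_text(strip=True) for li in tag_list.find_all("li")] if tag_list else []
    # .get() avoids inserting new keys when recipe_tags is a defaultdict
    tags.extend(recipe_tags.get(recipe_link, []))
    return tags

html = """
<ul class="tags-list">
<li><a href="/recipes-for-Light-Salads-168">Light Salads</a></li>
</ul>
"""
link = "https://www.tarladalal.com/sweet-lime-and-pepper-salad-3601r"
categories = {link: ["Healthy-Fruit-Based-Salads"]}
print(get_tags_for(BeautifulSoup(html, "html.parser"), link, categories))
```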

##### Getting the ingredients

In [92]:
ingredient_div = soup.find('div', class_="ingredients")

In [None]:

for p in ingredient_div.find_all('p'):
    text = p.get_text(strip=True,separator=" ")
    print(text)

1 cup sweet lime (mosambi) segments
1 cup yellow and green capsicum cubes
2 cups lettuce , torn into pieces
1 cup cucumber cubes
1/2 tsp mustard (rai / sarson) powder
1/2 tsp freshly ground black pepper (kalimirch)
1 tbsp lemon juice
salt to taste


In [95]:
def get_ingredients(res_soup):
    # Use the soup passed in, not the global `soup`
    ingredient_div = res_soup.find('div', class_="ingredients")
    ingredients=[]
    for p in ingredient_div.find_all('p'):
        text = p.get_text(strip=True,separator=" ")
        ingredients.append(text)
    return ingredients

In [96]:
get_ingredients(soup)

['1 cup sweet lime (mosambi) segments',
 '1 cup yellow and green capsicum cubes',
 '2 cups lettuce , torn into pieces',
 '1 cup cucumber cubes',
 '1/2 tsp mustard (rai / sarson) powder',
 '1/2 tsp freshly ground black pepper (kalimirch)',
 '1 tbsp lemon juice',
 'salt to taste']
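Joining the text nodes with a space leaves a stray space before punctuation, as in '2 cups lettuce , torn into pieces'. If cleaner strings are wanted, a small post-processing step along these lines could be applied to each ingredient. `tidy_ingredient` is a hypothetical helper, and the rule it applies (drop whitespace before `,`, `;`, `:`) is an assumption about what counts as clean here.

```python
import re

def tidy_ingredient(text):
    """Collapse repeated whitespace and remove spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([,;:])", r"\1", text)

print(tidy_ingredient("2 cups lettuce , torn into pieces"))
print(tidy_ingredient("1/2 tsp mustard (rai / sarson) powder"))
```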

##### Getting the instructions