### Scraping more healthy Indian recipes

Fields to collect for each recipe:

- Recipe name
- Ingredients list
- Cooking time
- Instructions
- Nutrient info
- Tags (e.g., vegan, keto, veg/non-veg)

Website: https://www.tarladalal.com/
- A nutrient breakdown is also available.
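
One possible way to hold the fields listed above is a small dataclass. This is only a sketch: the class name `Recipe` and its field names are assumptions made for illustration, not something the notebook defines.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Recipe:
    """Hypothetical container for one scraped recipe."""
    name: str
    ingredients: List[str] = field(default_factory=list)
    cooking_time: str = ""
    instructions: List[str] = field(default_factory=list)
    nutrients: Dict[str, str] = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)


# Example record with made-up, partial values
example = Recipe(name="paneer masala recipe", cooking_time="22 Mins",
                 tags=["Indian Dinner"])

# asdict() turns the record into a plain dict, one row per recipe
row = asdict(example)
print(row["name"], row["tags"])
```

A list of such dicts can later be passed to `pd.DataFrame(...)` to build the final table.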

#### Import dependencies

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [13]:
webpage = requests.get("https://www.tarladalal.com/category/Healthy-Indian-Recipes/").text

In [14]:
soup = BeautifulSoup(webpage, 'lxml')

In [102]:
# Getting all category links from the main page
# they are stored in div tags of class "text-center"
soup.find_all('div',class_="text-center")[:5]

[<div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Healthy-Low-Calorie-Weight-Loss/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Insoluble-Fiber-Diet/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Low-Cholesterol-/">View All</a>
 </div>,
 <div class="text-center">
 <a class="btn btn-main primary-bg" href="/recipes/category/Soluble-Fibre-Diet/">View All</a>
 </div>]

In [99]:
# getting the tag links from the href attribute of each anchor tag
tags_links=[]
for div in soup.find_all('div',class_="text-center")[:-1]:
    tags_links.append(div.find('a').get("href"))

In [103]:
tags_links[:5]

['/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/',
 '/recipes/category/Healthy-Low-Calorie-Weight-Loss/',
 '/recipes/category/Insoluble-Fiber-Diet/',
 '/recipes/category/Low-Cholesterol-/',
 '/recipes/category/Soluble-Fibre-Diet/']

In [104]:
tags_links[-5:]

['/recipes/category/Lactose-Free-Dairy-Free-Cake-/',
 '/recipes/category/Chronic-Kidney-Disease/',
 '/recipes/category/indian-recipes-for-relief-from-pregnancy-constipation/',
 '/recipes/category/healthy-indian-soups-under-100-calories/',
 '/recipes/category/selenium1/']

- The links are relative to the site root, so let's prefix them with "https://www.tarladalal.com" (no trailing slash, since each link already starts with "/").

In [105]:
cleaned_tags_links = ["https://www.tarladalal.com"+link for link in tags_links ]

In [106]:
cleaned_tags_links[:5]

['https://www.tarladalal.com/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/',
 'https://www.tarladalal.com/recipes/category/Healthy-Low-Calorie-Weight-Loss/',
 'https://www.tarladalal.com/recipes/category/Insoluble-Fiber-Diet/',
 'https://www.tarladalal.com/recipes/category/Low-Cholesterol-/',
 'https://www.tarladalal.com/recipes/category/Soluble-Fibre-Diet/']

In [107]:
cleaned_tags_links[-5:]

['https://www.tarladalal.com/recipes/category/Lactose-Free-Dairy-Free-Cake-/',
 'https://www.tarladalal.com/recipes/category/Chronic-Kidney-Disease/',
 'https://www.tarladalal.com/recipes/category/indian-recipes-for-relief-from-pregnancy-constipation/',
 'https://www.tarladalal.com/recipes/category/healthy-indian-soups-under-100-calories/',
 'https://www.tarladalal.com/recipes/category/selenium1/']
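
As an alternative to string concatenation, the standard library's `urllib.parse.urljoin` resolves a relative link against a base URL and takes care of the slash handling. A minimal sketch, where `BASE_URL` and `relative_links` are names chosen for illustration:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tarladalal.com/"

# Two of the relative links collected above
relative_links = [
    "/recipes/category/Low-Cholesterol-/",
    "/recipes/category/selenium1/",
]

# urljoin avoids a doubled or missing "/" between base and path
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links)
```

`urljoin` also leaves links that are already absolute unchanged, which helps if the site mixes both forms.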

In [108]:
# total filters or tags
len(cleaned_tags_links)

437

- We now have 437 different filters or tags.
- Let's go through each link and extract the recipes under that category.

In [None]:
# get the recipe links listed on a category page
def get_recipie_links(category_link):
    website = requests.get(category_link).text
    soup = BeautifulSoup(website, "lxml")
    recipe_links = []

    # the links sit in anchor tags inside div tags of class "img-block"
    for div in soup.find_all('div', class_="img-block"):
        recipe_links.append(div.find('a').get("href"))

    # the hrefs are relative, so prefix them with the site address
    recipies = ["https://www.tarladalal.com" + link for link in recipe_links]

    return recipies

- Now we can use this function to extract the recipe links under each category.
- Next, we scrape each recipe page to get the required data.

In [119]:
recipies = []
for category in cleaned_tags_links:
    recipies.extend(get_recipie_links(category))
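
A loop like this sends hundreds of requests in a row, so it may be worth adding a timeout, basic error handling, and a short pause between requests. The helper below is a sketch under assumptions: `fetch_html`, the one-second delay, and the 30-second timeout are illustrative choices, not part of the original notebook.

```python
import time

import requests


def fetch_html(url, session, delay_seconds=1.0, timeout=30):
    """Return the page HTML, or None if the request fails."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        html = response.text
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        html = None
    # pause after every request so the site is not hit in a tight loop
    time.sleep(delay_seconds)
    return html


# Usage (not executed here, since it needs network access):
# session = requests.Session()
# html = fetch_html("https://www.tarladalal.com/", session)
```

Reusing one `requests.Session` keeps the connection open across requests, which tends to be faster than calling `requests.get` each time.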

In [121]:
len(recipies)

13884

In [122]:
recipies[-5:]

['https://www.tarladalal.com/rabri-or-how-to-make-rabdi-1099r',
 'https://www.tarladalal.com/paneer-tikka-2754r',
 'https://www.tarladalal.com/green-chutney-how-to-make-green-chutney-recipe-22266r',
 'https://www.tarladalal.com/rice-noodles-with-vegetables-in-thai-red-curry-sauce-465r',
 'https://www.tarladalal.com/dal-dhokli--gujarat-recipe-578r']

- Hence, we now have 13884 recipe links.
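
Since the same recipe can be listed under more than one category, this list may contain duplicate links. A small sketch of removing them while keeping the original order; `unique_in_order` and `sample_links` are hypothetical names used only here:

```python
def unique_in_order(links):
    """Drop repeated links, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(links))


sample_links = [
    "https://www.tarladalal.com/paneer-masala-2404r",
    "https://www.tarladalal.com/paneer-tikka-2754r",
    "https://www.tarladalal.com/paneer-masala-2404r",
]
print(unique_in_order(sample_links))
```

Applying this before scraping the recipe pages would avoid fetching the same page more than once.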

#### Scraping through the first recipe page


In [123]:
recipies[0]

'https://www.tarladalal.com/paneer-masala-2404r'

- Features we need: name, serving size, time to make, tags, ingredients, nutrition, instructions

In [125]:
website = requests.get(recipies[0]).text
soup = BeautifulSoup(website,'lxml')

##### Getting the name


In [131]:
soup.find("h4", class_ = "rec-heading").text

'paneer masala recipe | Punjabi paneer masala | dhaba style paneer masala'

In [132]:
soup.find("h4", class_ = "rec-heading").text.split("|")

['paneer masala recipe ',
 ' Punjabi paneer masala ',
 ' dhaba style paneer masala']

In [135]:
# just taking the first name:
soup.find("h4", class_ = "rec-heading").text.split("|")[0].strip()

'paneer masala recipe'

In [149]:
def find_name(rec_soup):
    # the page most probably has only one such h4 tag, so .find() is enough
    return rec_soup.find("h4", class_ = "rec-heading").text.split("|")[0].strip()

In [151]:
find_name(soup)

'paneer masala recipe'
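
When this runs over thousands of pages, some may lack the expected heading, in which case `.find()` returns `None` and `.text` raises an `AttributeError`. A defensive variant could look like the sketch below; `find_name_safe` is a hypothetical name, and the inline HTML is a made-up sample rather than the real page.

```python
from bs4 import BeautifulSoup


def find_name_safe(rec_soup):
    """Return the first recipe name, or None if the heading is missing."""
    heading = rec_soup.find("h4", class_="rec-heading")
    if heading is None:
        return None
    return heading.get_text().split("|")[0].strip()


# Tiny made-up documents to exercise both branches
with_heading = BeautifulSoup(
    '<h4 class="rec-heading">paneer masala recipe | Punjabi paneer masala</h4>',
    "html.parser",
)
without_heading = BeautifulSoup("<p>no heading here</p>", "html.parser")

print(find_name_safe(with_heading))
print(find_name_safe(without_heading))
```

The same pattern of checking for `None` before reading `.text` applies to the other extractors as well.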

##### Getting the make time

In [156]:
soup.find_all('p', class_="mb-0 font-size-13")

[<p class="mb-0 font-size-13"><strong>10 Mins</strong></p>,
 <p class="mb-0 font-size-13"><strong>12 Mins</strong></p>,
 <p class="mb-0 font-size-13"><strong>22 Mins</strong></p>]

In [157]:
soup.find_all('p', class_="mb-0 font-size-13")[2].text

'22 Mins'

In [160]:
def get_make_time(res_soup):
    return res_soup.find_all('p', class_="mb-0 font-size-13")[2].text

In [161]:
get_make_time(soup)

'22 Mins'
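
For later analysis a number is often easier to work with than a string such as '22 Mins'. The sketch below pulls out the first integer in the text. `parse_leading_int` is a hypothetical helper, and it assumes the value is a single number; strings that mix units (for example hours and minutes) would need extra handling.

```python
import re


def parse_leading_int(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_leading_int("22 Mins"))     # 22
print(parse_leading_int("4 servings"))  # 4
print(parse_leading_int("n/a"))         # None
```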

##### Getting serving size

In [168]:
soup.find_all('p',class_="mb-0 font-size-13 font-size-13")

[<p class="mb-0 font-size-13 font-size-13"><strong>4 servings</strong></p>]

In [170]:
# using indexing; switch to .find() if this causes a problem later
soup.find_all('p',class_="mb-0 font-size-13 font-size-13")[0].text

'4 servings'

In [171]:
def get_serving_size(res_soup):
    # use the soup passed in, not the global `soup`
    return res_soup.find_all('p',class_="mb-0 font-size-13 font-size-13")[0].text

In [172]:
get_serving_size(soup)

'4 servings'

##### Getting the tags

In [175]:
soup.find('ul', class_ = 'tags-list')

<ul class="tags-list">
<li><a href="/recipes-for-Indian-dinner-939">Indian Dinner</a></li>
<li><a href="/recipes-for-North-Indian-Dinner-948">North Indian Dinner</a></li>
<li><a href="/recipes-for-Indian-Lunch-926">Indian Lunch</a></li>
<li><a href="/recipes-for-Must-have-Sabzis-for-Lunch-Indian-Veg-929">Lunch Sabzi</a></li>
<li><a href="/recipes-for-Subzis-with-Gravies-211">Sabzis with Gravies</a></li>
<li><a href="/recipes-for-Sabzi-Curries-Collection--207">Sabzis, Curries</a></li>
<li><a href="/recipes-for-main-sabzis-curries-traditional-indian-sabzis-214">Traditional Indian Sabzis</a></li>
</ul>

In [178]:
li_soup = soup.find('ul', class_ = 'tags-list').find_all('li')
li_soup

[<li><a href="/recipes-for-Indian-dinner-939">Indian Dinner</a></li>,
 <li><a href="/recipes-for-North-Indian-Dinner-948">North Indian Dinner</a></li>,
 <li><a href="/recipes-for-Indian-Lunch-926">Indian Lunch</a></li>,
 <li><a href="/recipes-for-Must-have-Sabzis-for-Lunch-Indian-Veg-929">Lunch Sabzi</a></li>,
 <li><a href="/recipes-for-Subzis-with-Gravies-211">Sabzis with Gravies</a></li>,
 <li><a href="/recipes-for-Sabzi-Curries-Collection--207">Sabzis, Curries</a></li>,
 <li><a href="/recipes-for-main-sabzis-curries-traditional-indian-sabzis-214">Traditional Indian Sabzis</a></li>]

In [179]:
tags = [li.text for li in li_soup]
tags


['Indian Dinner',
 'North Indian Dinner',
 'Indian Lunch',
 'Lunch Sabzi',
 'Sabzis with Gravies',
 'Sabzis, Curries',
 'Traditional Indian Sabzis']
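
Besides the on-page tags, the category a recipe was found under can serve as a tag. One way to derive a readable tag from a category URL is sketched below; `category_tag` is a hypothetical helper and assumes the category name is the last path segment with words separated by hyphens.

```python
from urllib.parse import urlparse


def category_tag(category_url):
    """Turn a category URL into a readable tag string."""
    # last non-empty path segment, e.g. "Low-Cholesterol-"
    slug = urlparse(category_url).path.strip("/").split("/")[-1]
    # hyphens become spaces; split/join drops stray or doubled spaces
    return " ".join(slug.replace("-", " ").split())


print(category_tag("https://www.tarladalal.com/recipes/category/Low-Cholesterol-/"))
print(category_tag("https://www.tarladalal.com/recipes/category/Healthy-Low-Calorie-Weight-Loss/"))
```

The resulting string can then be appended to the `tags` list of every recipe collected from that category.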

In [180]:
# also need to add the category as a tag

cleaned_tags_links

['https://www.tarladalal.com/recipes/category/Vitamin-B12-Cobalamin-Rich-Foods/',
 'https://www.tarladalal.com/recipes/category/Healthy-Low-Calorie-Weight-Loss/',
 'https://www.tarladalal.com/recipes/category/Insoluble-Fiber-Diet/',
 'https://www.tarladalal.com/recipes/category/Low-Cholesterol-/',
 'https://www.tarladalal.com/recipes/category/Soluble-Fibre-Diet/',
 'https://www.tarladalal.com/recipes/category/Healthy-Breakfast/',
 'https://www.tarladalal.com/recipes/category/Indian-Diabetic-Recipes/',
 'https://www.tarladalal.com/recipes/category/Healthy-Pregnancy-/',
 'https://www.tarladalal.com/recipes/category/Healthy-Zero-Oil/',
 'https://www.tarladalal.com/recipes/category/Iron-Rich-/',
 'https://www.tarladalal.com/recipes/category/Acidity-Heartburn-Acid-Reflux-and-Gerd/',
 'https://www.tarladalal.com/recipes/category/Healthy-Sabzis/',
 'https://www.tarladalal.com/recipes/category/Healthy-Indian-Snacks/',
 'https://www.tarladalal.com/recipes/category/Healthy-Heart/',
 'https://www