## README
This notebook is the skeleton of the scraper we developed, and every cell has to be run manually. For the version that is ready for cloud deployment, see the scraper.py file.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium.webdriver.common.action_chains import ActionChains
import json
import time
import sys
sys.path.append("../../src")
import scraping as f # Custom functions file containing all of the functions needed for this notebook

## Initialization Parameters
This section sets the parameters used for scraping. The link points to the recipe category we decided to scrape.

In [None]:
driver = webdriver.Chrome()
driver.set_page_load_timeout(500)
link = "https://tasty.co/tag/dinner" 

## Getting URLs
In this step, we scrape all recipe links in the Dinner category, which is the page stored in the variable link above.

In [None]:
f.get_urls(driver,link)

## Creating Link Queue
In this step, we create a link queue so that, if the scraper breaks, we can continue from the most recently scraped link.

In [None]:
f.create_link_queue('../../data/interim/scraping/links.txt')
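
As a rough illustration of the resumable-queue idea, the sketch below builds a link-to-status mapping from a text file of URLs and writes it out as JSON. It is not the code in the custom functions file: the names `build_link_queue` and `pending_links` are hypothetical, and the one-URL-per-line layout of the links file is an assumption. The JSON shape (each link mapped to a boolean) mirrors how has_scraped.json is read in the Scraping section.

```python
import json
from pathlib import Path


def build_link_queue(links_path, queue_path):
    """Create a {link: has_scraped} JSON file from a text file of URLs.

    Hypothetical helper; assumes one URL per line in links_path.
    """
    lines = Path(links_path).read_text(encoding="utf-8").splitlines()
    # dict.fromkeys drops duplicate URLs while keeping first-seen order.
    queue = dict.fromkeys((line.strip() for line in lines if line.strip()), False)
    Path(queue_path).write_text(json.dumps(queue, indent=2), encoding="utf-8")
    return queue


def pending_links(queue):
    """Return the links that have not been scraped yet."""
    return [link for link, has_scraped in queue.items() if not has_scraped]
```

After a crash, reloading the JSON file and calling `pending_links` on it would give the links still left to process.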

## Getting Images
Here we scrape the recipe images.

In [None]:
f.get_images(driver)

## Scraping
In this step, we visit every scraped recipe link and collect the following:

- Ingredients
- Nutritional info
- Preparation steps
- Last edited date
- Rating
- Tags
- Number of comments
- Comments

NB: This scraping code no longer works. The website's layout has changed somewhat, so the comments are no longer scraped.

In [None]:
with open('../../data/interim/scraping/has_scraped.json') as json_file:
    links = json.load(json_file)

d = dict()
for link, has_scraped in links.items():
    if has_scraped:  # Skip links that were already processed
        continue
    print(f'Processing recipe: {link}')
    driver.get(link)
    if f.page_exists(driver):
        print("page doesn't exist")
        time.sleep(5)
        continue

    try:
        cookie = driver.find_element(By.ID,'onetrust-accept-btn-handler')
        cookie.click()
    except NoSuchElementException:
        pass

    print('\tGetting ingredients...')
    ingredients = [i.text for i in driver.find_elements(By.CLASS_NAME,'ingredient')]
    print('\tGetting nutrition...')
    nutrition = f.grab_nutrition(driver)
    print('\tGetting preparation...')
    preparation = f.grab_preparation(driver)
    print('\tGetting date...')
    date = f.grab_date(driver)
    print('\tGetting rating...')
    rating = f.grab_recipe_rating(driver)
    print('\tGetting tags...')
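    try:
        tags = f.get_tags(driver)
    except NoSuchElementException:
        # Fall back to an empty list so tags is always defined
        # and never carries over from the previous recipe
        tags = []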
    print('\tGetting number of comments...')
    try: 
        number_of_comments = driver.find_element(By.CLASS_NAME,'tips-count-heading').text
    except NoSuchElementException:
        number_of_comments = "0 TIPS"
    print('\tGetting comments...')
    if number_of_comments != "0 TIPS":
        comments = f.grab_comments(driver)
    else:
        comments = []
    
    d = {
        'ingredients': ingredients,
        'nutrition': nutrition,
        'preparation': preparation,
        'date': date,
        'rating': rating,
        'tags': tags,
        'number_of_comments': number_of_comments,
        'comments': comments,
    }
    
    f.save_and_remove_from_queue(link, d)
    time.sleep(5)
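
The body of `f.save_and_remove_from_queue` is not shown in this notebook. As a hedged sketch of what such a step could look like, the hypothetical function below appends one recipe record to a JSON Lines file and then flips the link's flag in the queue file, so that a restart skips it. The name `save_and_mark_scraped`, its extra path parameters, and the output format are all assumptions, not the project's actual implementation.

```python
import json
from pathlib import Path


def save_and_mark_scraped(link, record, data_path, queue_path):
    """Append one scraped record, then mark its link as done in the queue.

    Hypothetical stand-in for f.save_and_remove_from_queue; the JSON Lines
    output format is an assumption.
    """
    # Write the record first, so a crash never marks an unsaved link as done.
    with open(data_path, "a", encoding="utf-8") as out:
        out.write(json.dumps({"link": link, **record}) + "\n")

    queue_file = Path(queue_path)
    queue = json.loads(queue_file.read_text(encoding="utf-8"))
    queue[link] = True
    queue_file.write_text(json.dumps(queue, indent=2), encoding="utf-8")
```

Saving before updating the queue means the worst case after a crash is a duplicated record, which is easier to clean up than a recipe that was marked as done but never stored.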


## Ending the Scraping

In [None]:
driver.quit()  # quit() closes every window and ends the WebDriver session; close() only closes the current window