## README
This notebook is the skeleton of the scraper we developed, and all cells have to be run manually. For the version that is ready for cloud deployment, see the scraper.py file.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium.webdriver.common.action_chains import ActionChains
import json
import time
import sys
sys.path.append("../../src")
import scraping as f # Custom functions file containing all of the functions needed for this notebook

## Initialization Parameters
This section sets the parameters used for scraping. The link points to the recipe category we decided to scrape.

In [2]:
driver = webdriver.Chrome()
driver.set_page_load_timeout(500)
link = "https://tasty.co/tag/dinner" 

## Getting URLs
In this step, we scrape all the recipe links in the Dinner category, which is the page stored in the variable `link` above.

In [None]:
f.get_urls(driver,link)

## Creating Link Queue
In this step, we create a link queue so that, if the scraper breaks, we can resume from the last scraped link.

In [None]:
f.create_link_queue('../../data/interim/scraping/links.txt')
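The implementation of `f.create_link_queue` lives in the custom `scraping` module and is not shown in this notebook. As a rough illustration of the resume idea only, the sketch below builds a `{link: False}` JSON map from a text file of links and flips a flag once a link is done. The function names `build_queue` and `mark_scraped`, and the one-link-per-line file layout, are assumptions made for this example and are not taken from the project code.

```python
import json
from pathlib import Path


def build_queue(links_path, queue_path):
    """Create or refresh a JSON map of link -> has_scraped flag.

    Assumes links_path holds one URL per line (hypothetical layout).
    Flags from an existing queue file are kept so a rerun does not reset progress.
    """
    lines = Path(links_path).read_text().splitlines()
    links = [line.strip() for line in lines if line.strip()]

    queue_file = Path(queue_path)
    previous = json.loads(queue_file.read_text()) if queue_file.exists() else {}

    queue = {link: previous.get(link, False) for link in links}
    queue_file.write_text(json.dumps(queue, indent=2))
    return queue


def mark_scraped(queue_path, link):
    """Record that a link has been scraped so later runs skip it."""
    queue_file = Path(queue_path)
    queue = json.loads(queue_file.read_text())
    queue[link] = True
    queue_file.write_text(json.dumps(queue, indent=2))
```

Keeping the flags in a file rather than in memory is what makes the resume possible: a crashed run leaves the file behind, and the next run only visits links whose flag is still `False`.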

## Getting Images
Here we scrape the recipe images.

In [None]:
f.get_images(driver)

## Scraping
In this step, we visit every recipe link scraped earlier and collect the following:

- Ingredients
- Nutritional info
- Preparation steps
- Last edited date
- Rating
- Tags
- Number of comments
- Comments

NB: This scraping code no longer works as written. The website's layout has changed somewhat, so the comments are no longer scraped.

In [5]:
with open('../../data/interim/scraping/has_scraped.json') as json_file:
    links = json.load(json_file)

d = dict()
for link, has_scraped in links.items():
    if has_scraped:  # already scraped in an earlier run, skip it
        continue
    print(f'Processing recipe: {link}')
    driver.get(link)
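    # NOTE: despite its name, f.page_exists appears to be truthy when the page
    # is missing (the run below gets past this check for a live recipe).
    if f.page_exists(driver):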
        print("page doesn't exist")
        time.sleep(5)
        continue

    try:
        cookie = driver.find_element(By.ID,'onetrust-accept-btn-handler')
        cookie.click()
    except NoSuchElementException:
        pass

    print('\tGetting ingredients...')
    ingredients = [i.text for i in driver.find_elements(By.CLASS_NAME,'ingredient')]
    print('\tGetting nutrition...')
    nutrition = f.grab_nutrition(driver)
    print('\tGetting preparation...')
    preparation = f.grab_preparation(driver)
    print('\tGetting date...')
    date = f.grab_date(driver)
    print('\tGetting rating...')
    rating = f.grab_recipe_rating(driver)
    print('\tGetting tags...')
    try:
        tags = f.get_tags(driver)
    except NoSuchElementException:
        tags = []  # fall back to no tags so `tags` is always defined for this recipe
    print('\tGetting number of comments...')
    try: 
        number_of_comments = driver.find_element(By.CLASS_NAME,'tips-count-heading').text
    except NoSuchElementException:
        number_of_comments = "0 TIPS"
    print('\tGetting comments...')
    if number_of_comments != "0 TIPS":
        comments = f.grab_comments(driver)
    else:
        comments = []
    
    d = {
        'ingredients': ingredients,
        'nutrition': nutrition,
        'preparation': preparation,
        'date': date,
        'rating': rating,
        'tags': tags,
        'number_of_comments': number_of_comments,
        'comments': comments,
    }
    
    f.save_and_remove_from_queue(link, d)
    time.sleep(5)


Processing recipe: https://tasty.co/recipe/chicken-spanakopita-pie
	Getting ingredients...
	Getting nutrition...
	Getting preparation...
	Getting date...
	Getting rating...
	Getting tags...
	Getting number of comments...
	Getting comments...
{'ingredients': ['4 boneless, skinless chicken breasts', 'salt, to taste', 'pepper, to taste', '4 cloves garlic, minced', '1 tablespoon greek oregano', '2 tablespoons olive oil', '½ cup ricotta cheese (125 g)', '2 cups frozen spinach (80 g), thawed, chopped', '1 cup feta cheese (115 g), crumbled', '½ cup green onion (75 g), chopped', '2 eggs, beaten', '2 tablespoons fresh dill, chopped', '1 package phyllo dough', '1 cup butter (230 g), melted', '1 cup red onion (150 g), diced', '½ cup kalamata olive (130 g), chopped', '1 cup tomato (200 g), diced'], 'nutrition': ['Calories 575', 'Fat 42g', 'Carbs 12g', 'Fiber 2g', 'Sugar 4g', 'Protein 35g'], 'preparation': ['Season the chicken breasts with the salt, pepper, garlic, greek oregano, and olive oil.', '
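The record above stores nutrition as strings such as `'Calories 575'` and the comment count as a heading such as `"0 TIPS"`. A possible post-processing step, sketched here under assumptions, turns both into values that are easier to analyse. The helper names `parse_tip_count` and `parse_nutrition` are hypothetical, and the heading format for non-zero counts (for example `"12 TIPS"`) is assumed by analogy with the `"0 TIPS"` fallback used in the loop.

```python
import re


def parse_tip_count(heading):
    """Extract the leading integer from a heading like '12 TIPS'.

    Returns 0 when the heading contains no digits.
    """
    match = re.search(r"\d[\d,]*", heading)
    return int(match.group().replace(",", "")) if match else 0


def parse_nutrition(items):
    """Split entries like 'Fat 42g' into a {'Fat': '42g'} mapping.

    The value is kept as a string because units differ per entry.
    Entries without a space are skipped.
    """
    parsed = {}
    for item in items:
        name, _, value = item.rpartition(" ")
        if name:
            parsed[name] = value
    return parsed


nutrition = ['Calories 575', 'Fat 42g', 'Carbs 12g', 'Fiber 2g', 'Sugar 4g', 'Protein 35g']
print(parse_nutrition(nutrition))
print(parse_tip_count("0 TIPS"))
```

Splitting on the last space rather than the first is a deliberate choice here, so that a multi-word label would stay intact while the trailing amount becomes the value.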

## Ending the Scraping

In [6]:
driver.close()

WebDriverException: Message: disconnected: not connected to DevTools
  (failed to check if window was closed: disconnected: not connected to DevTools)
  (Session info: chrome=120.0.6099.71)
Stacktrace:
	GetHandleVerifier [0x00007FF72E3B4D02+56194]
	(No symbol) [0x00007FF72E3204B2]
	(No symbol) [0x00007FF72E1C76AA]
	(No symbol) [0x00007FF72E1AE1E9]
	(No symbol) [0x00007FF72E1AF7CE]
	(No symbol) [0x00007FF72E1C7CC3]
	(No symbol) [0x00007FF72E1A0580]
	(No symbol) [0x00007FF72E242D41]
	(No symbol) [0x00007FF72E235E40]
	(No symbol) [0x00007FF72E204A45]
	(No symbol) [0x00007FF72E205AD4]
	GetHandleVerifier [0x00007FF72E72D5BB+3695675]
	GetHandleVerifier [0x00007FF72E786197+4059159]
	GetHandleVerifier [0x00007FF72E77DF63+4025827]
	GetHandleVerifier [0x00007FF72E44F029+687785]
	(No symbol) [0x00007FF72E32B508]
	(No symbol) [0x00007FF72E327564]
	(No symbol) [0x00007FF72E3276E9]
	(No symbol) [0x00007FF72E318094]
	BaseThreadInitThunk [0x00007FFBDB217344+20]
	RtlUserThreadStart [0x00007FFBDD0626B1+33]
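
The `WebDriverException` above reports that the browser was already disconnected when `driver.close()` was called. In Selenium, `close()` closes only the current window, while `quit()` ends the whole session. The sketch below shows one way to shut down without a traceback in that situation. The helper name `shutdown` is hypothetical, and the import fallback exists only so the snippet can be run without Selenium installed.

```python
try:
    from selenium.common.exceptions import WebDriverException
except ImportError:
    # Assumption for this sketch only: fall back to a generic exception
    # so the example still runs where Selenium is not installed.
    WebDriverException = Exception


def shutdown(driver):
    """Try to end the browser session.

    Returns True on a clean shutdown and False if the session
    was already gone (for example, the window was closed by hand).
    """
    try:
        driver.quit()
        return True
    except WebDriverException as exc:
        print(f"Browser session was already closed: {exc}")
        return False
```

In the notebook this would be called as `shutdown(driver)` in place of `driver.close()`.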
