<h2>Caveat</h2>
Web sites often change the format of their pages, so these examples may not always work. If they don't, rework them after examining the html content of the page. Most browsers will let you see the html source (look for a "page source" option), though you might have to turn on the developer mode in your browser preferences. For example, on Chrome you need to click the "developer mode" check box under Extensions in the Preferences/Options menu.

<h1>Scraping web pages</h1>
<li>Most data that resides on the web is in HTML 
<li>HTML: HyperText Markup Language
<li>An html web page is a structured document
<li>We can exploit this structure to extract data from the page

<li>Learn html and css at <a href="https://www.khanacademy.org/computing/computer-programming/html-css">this site</a>

<b>Web scraping</b>: Automating the process of extracting information from web pages<br>
<li>for data collection and analysis
<li>for incorporating in a web app 

<h2>Python libraries for web scraping</h2>
<li><b>requests</b> for handling the request-response cycle
<li><b>beautifulsoup4</b> for extracting data from an html string
<li><b>selenium</b> for driving a real browser and extracting data from the pages it loads, particularly when a page contains JavaScript or when a button needs to be clicked

<h2>Beautiful Soup</h2>
<li>html and xml parser
<li>makes use of formatted html tags and css properties to extract data
<li>https://www.crummy.com/software/BeautifulSoup/bs4/doc/

<h2>Web scraping using beautifulsoup4</h2>

<h3>Import necessary modules</h3>

In [2]:
import requests
from bs4 import BeautifulSoup

<h3>The http request response cycle</h3>

In [3]:
url = "http://www.epicurious.com/search/Tofu Chili"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Success


In [4]:
keywords = input("Please enter the things you want to see in a recipe ")
url = "http://www.epicurious.com/search/" + keywords
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Please enter the things you want to see in a recipe 
Success
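Keywords typed by a user can contain spaces or characters such as & and ? that have a special meaning in a url. The sketch below shows one way to percent-encode them before building the search url. The helper name build_search_url and the constant BASE_SEARCH_URL are made up for this illustration; only the url prefix comes from the cells above.

```python
from urllib.parse import quote

# Same search prefix as in the cells above
BASE_SEARCH_URL = "http://www.epicurious.com/search/"

def build_search_url(keywords):
    """Return the search url with the keywords percent-encoded."""
    # quote() turns a space into %20 and escapes other unsafe characters
    return BASE_SEARCH_URL + quote(keywords.strip())

print(build_search_url("Tofu Chili"))
```

The resulting string can be passed to requests.get in place of the plain concatenation.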


<h3>Set up the BeautifulSoup object</h3>

In [5]:
results_page = BeautifulSoup(response.content,'lxml')
print(type(results_page))
#print(results_page.prettify())

<class 'bs4.BeautifulSoup'>
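The second argument to BeautifulSoup names the parser. The cell above uses 'lxml', which needs the separate lxml package to be installed. As a minimal sketch, the same kind of object can be built with Python's built-in 'html.parser'. The html string here is invented for the illustration so that the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page so the example does not depend on a live site
html = "<html><body><h1>Recipes</h1><p class='dek'>Quick dinners</p></body></html>"

# 'html.parser' ships with Python; 'lxml' needs the separate lxml package
soup = BeautifulSoup(html, "html.parser")

print(type(soup))
print(soup.h1.get_text())
print(soup.find("p").get_text())
```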


<h3>BS4 functions</h3>

<h4>find_all finds all instances of a specified tag</h4>
<h4>returns a result_set (a list)</h4>

In [6]:
all_a_tags = results_page.find_all('a')
print(type(all_a_tags))

<class 'bs4.element.ResultSet'>


In [7]:
all_a_tags

[<a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a>,
 <a data-reactid="71" href="/holidays-events/christmas-ham-and-holiday-pork-roast-recipes-gallery">Our Best Christmas Hams and Pork Roasts</a>,
 <a class="photo-link" data-reactid="73" href="/holidays-events/christmas-ham-and-holiday-pork-roast-recipes-gallery"><div class="photo-wrap" data-reactid="74"><div class="component-lazy pending" data-component="Lazy" data-reactid="75"></div></div></a>,
 <a class="view-complete-item" data-reactid="76" href="/holidays-events/christmas-ham-and-holiday-pork-roast-recipes-gallery" itemprop="url" title="Our Best Christmas Hams and Pork Roasts"><!-- react-text: 77 -->View “<!-- /react-text --><!-- react-text: 78 -->Our Best Christmas Hams and Pork Roasts<!-- /react-text --><!-- react-text: 79 -->”<!-- /react-text --></a>,
 <a data-reactid="84" href="/expert-advice/gifts-for-francophiles-article">13 Food &amp; Cooking Gifts for Francophiles</a>,
 <a class="photo-link" data

<h4>find finds the first instance of a specified tag</h4>
<h4>returns a bs4 element</h4>


In [8]:
div_tag = results_page.find('div')
print(div_tag)

<div class="header-wrapper" data-reactid="2"><div class="header" data-reactid="3" role="banner"><h2 data-reactid="4" itemtype="https://schema.org/Organization"><a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a></h2><div class="search-form-container" data-reactid="6"><form action="/search/" autocomplete="off" data-reactid="7" method="get" role="search"><fieldset data-reactid="8"><button class="submit" data-reactid="9" type="submit">search</button><input autocomplete="off" data-reactid="10" maxlength="120" name="terms" placeholder="Find a Recipe" type="text" value=""/><button class="filter mobile" data-reactid="11">filters</button><button class="filter tablet" data-reactid="12">filter results</button></fieldset></form><div class="ingredient-filters" data-reactid="13"><h3 data-reactid="14">Include/Exclude Ingredients</h3><form class="include-ingredients" data-reactid="15"><fieldset data-reactid="16"><input aria-label="Include ingredients" data-reactid="17" place

In [9]:
type(div_tag)

bs4.element.Tag

<h4>bs4 functions can be recursively applied on elements</h4>

In [10]:
div_tag.find('a')

<a data-reactid="5" href="/" itemprop="url" title="Epicurious">Epicurious</a>
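Because the live page may have changed, here is a small offline sketch of the same three ideas: find_all on the whole document, find for the first match, and find_all applied again to the element that find returned. The html string and the id values are made up for the illustration.

```python
from bs4 import BeautifulSoup

# Made-up document with two div tags and three a tags
html = """
<div id="nav"><a href="/">Home</a></div>
<div id="main">
  <a href="/recipes/one">One</a>
  <a href="/recipes/two">Two</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")          # every a tag in the document
main_div = soup.find("div", id="main")  # first div whose id is "main"
main_links = main_div.find_all("a")     # only the a tags inside that div

print(len(all_links), len(main_links))
print([a.get_text() for a in main_links])
```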

<h4>Both find and find_all can be qualified by css selectors (attribute values such as class)</h4>
<li>using selector=value
<li>using a dictionary

<h4>Using selector=value</h4>

In [12]:
#When using this method and looking for 'class' use 'class_' (because class is a reserved word in python)
#Note that we get a list back because find_all returns a list
results_page.find_all('article',class_="recipe-content-card")

[]

<h4>Using selectors as key value pairs in a dictionary</h4>

In [13]:
#Since we're using a string as the key, the fact that class is a reserved word is not a problem
#We get an element back because find returns an element
results_page.find('article',{'class':'recipe-content-card'})
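The two ways of qualifying a search should select the same tags. The sketch below checks this on a small, made-up html string (the article contents are invented; only the class names echo the ones used above).

```python
from bs4 import BeautifulSoup

# Made-up results page with one gallery card and one recipe card
html = """
<article class="gallery-content-card"><h4>Gallery</h4></article>
<article class="recipe-content-card"><h4>Beef Chili</h4></article>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ because class is a reserved word
by_keyword = soup.find_all("article", class_="recipe-content-card")
# Dictionary form: the key is an ordinary string
by_dict = soup.find_all("article", {"class": "recipe-content-card"})

keyword_texts = [tag.get_text() for tag in by_keyword]
dict_texts = [tag.get_text() for tag in by_dict]
print(keyword_texts)
print(keyword_texts == dict_texts)
```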

<h4>get_text() returns the marked up text (the content) enclosed in a tag.</h4>
<li>returns a string

In [14]:
results_page.find('article',{'class':'recipe-content-card'}).get_text()

AttributeError: 'NoneType' object has no attribute 'get_text'
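The error above occurs because find returns None when no tag matches, and None has no get_text method. A minimal sketch of a guard follows. The helper name first_text and the sample html are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

def first_text(soup, name, attrs=None, default=""):
    """Return the text of the first matching tag, or default if none matches."""
    tag = soup.find(name, attrs or {})
    # find returns None when nothing matches, so check before calling get_text
    return tag.get_text(strip=True) if tag is not None else default

# Made-up page containing one article with class "card"
soup = BeautifulSoup("<article class='card'> Beef Chili </article>", "html.parser")
print(first_text(soup, "article", {"class": "card"}))
print(repr(first_text(soup, "article", {"class": "recipe-content-card"})))
```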

<h4>get returns the value of a tag attribute</h4>
<li>returns a string

In [15]:
recipe_tag = results_page.find('article',{'class':'recipe-content-card'})
recipe_link = recipe_tag.find('a')
print("a tag:",recipe_link)
link_url = recipe_link.get('href')
print("link url:",link_url)
print(type(link_url))

AttributeError: 'NoneType' object has no attribute 'find'
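Here the search for the article tag came back empty, so recipe_tag was None and the call to find failed. The sketch below shows get on a small, made-up html string, with a check for None at each step and urljoin to turn the relative href into a full url. The helper name first_link and the sample html are assumptions for the illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://www.epicurious.com"

def first_link(tag, base_url=BASE_URL):
    """Return the absolute url of the first a tag inside tag, or None."""
    if tag is None:
        return None
    link = tag.find("a")
    # get returns None when the attribute is missing
    href = link.get("href") if link is not None else None
    return urljoin(base_url, href) if href else None

# Made-up recipe card
html = '<article class="recipe-content-card"><a href="/recipes/food/views/beef-chili">Beef Chili</a></article>'
soup = BeautifulSoup(html, "html.parser")
print(first_link(soup.find("article", class_="recipe-content-card")))
print(first_link(soup.find("article", class_="missing")))
```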

<h2>Summary of bs4 functions</h2>

 ![image.png](attachment:image.png)

<h1>A function that returns a list containing recipe names, recipe descriptions (if any) and recipe urls</h1>


<li>We want to create a list of recipes and links to the recipes
<li>We need to figure out how to ‘programmatically’ extract each recipe name and recipe link

<li>Search for the tag with a unique attribute value that identifies recipes and recipe links
<li>We’ll look at the a (anchor) tags because clickable links are in a tags

In [16]:
for tag in results_page.find_all('article'):
    print(tag)

<article class="gallery-content-card" data-has-quickview="false" data-index="0" data-reactid="67" itemscope="" itemtype="https://schema.org/CollectionPage"><header class="summary" data-reactid="68"><strong class="tag" data-reactid="69">gallery</strong><h4 class="hed" data-reactid="70" data-truncate="3" itemprop="name"><a data-reactid="71" href="/holidays-events/christmas-ham-and-holiday-pork-roast-recipes-gallery">Our Best Christmas Hams and Pork Roasts</a></h4><p class="dek" data-reactid="72" data-truncate="1">Because nothing says "Happy Holidays" better than a big, beautiful roast that you can slice into sandwich fixings for days on end.</p></header><a class="photo-link" data-reactid="73" href="/holidays-events/christmas-ham-and-holiday-pork-roast-recipes-gallery"><div class="photo-wrap" data-reactid="74"><div class="component-lazy pending" data-component="Lazy" data-reactid="75"></div></div></a><a class="view-complete-item" data-reactid="76" href="/holidays-events/christmas-ham-and-

In [17]:
def get_recipes(keywords):
    #Returns a list of (name, link, description) tuples, or None on failure
    recipe_list = list()
    import requests
    from bs4 import BeautifulSoup
    url = "http://www.epicurious.com/search/" + keywords
    response = requests.get(url)
    if not response.status_code == 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        recipes = results_page.find_all('article',class_="recipe-content-card")
        for recipe in recipes:
            recipe_link = "http://www.epicurious.com" + recipe.find('a').get('href')
            recipe_name = recipe.find('a').get_text()
            try:
                recipe_description = recipe.find('p',class_='dek').get_text()
            except AttributeError:
                #find returns None when the recipe has no description
                recipe_description = ''
            recipe_list.append((recipe_name,recipe_link,recipe_description))
        return recipe_list
    except Exception:
        #Any parsing problem (for example, a changed page layout) gives None
        return None
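Since the site's layout can change, it helps to be able to try the parsing logic on a saved or hand-written html string. The sketch below separates the parsing step from the network request. The function name parse_recipes and the sample html are assumptions made for this illustration; the tag and class names are the ones used in get_recipes.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_recipes(html, base_url="http://www.epicurious.com"):
    """Return (name, link, description) tuples found in a results page."""
    soup = BeautifulSoup(html, "html.parser")
    recipes = []
    for card in soup.find_all("article", class_="recipe-content-card"):
        link = card.find("a")
        if link is None or not link.get("href"):
            continue  # skip cards without a usable link
        dek = card.find("p", class_="dek")
        description = dek.get_text(strip=True) if dek is not None else ""
        recipes.append((link.get_text(strip=True),
                        urljoin(base_url, link.get("href")),
                        description))
    return recipes

# Hand-written sample: the second card has no description
sample = """
<article class="recipe-content-card">
  <h4><a href="/recipes/food/views/beef-chili">Beef Chili</a></h4>
  <p class="dek">Skip that dusty bottle of chili powder.</p>
</article>
<article class="recipe-content-card">
  <h4><a href="/recipes/food/views/plain-tofu">Plain Tofu</a></h4>
</article>
"""
for recipe in parse_recipes(sample):
    print(recipe)
```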

In [18]:
get_recipes("beef")

[('Sunday Stash Braised Beef',
  'http://www.epicurious.com/recipes/food/views/sunday-stash-braised-beef',
  'This simple braise is a weeknight savior. Make a big batch and stash it in the fridge or freezer.'),
 ('Cold Beef Tenderloin with Tomatoes and Cucumbers',
  'http://www.epicurious.com/recipes/food/views/cold-beef-tenderloin-with-tomatoes-and-cucumbers',
  'Beef tenderloin is precious enough to baby on a two-zone grill: Sear it over high heat, then transfer it to the cooler side and turn it often to hit a perfect medium-rare.'),
 ('Instant Pot Beef and Sweet Potato Chili',
  'http://www.epicurious.com/recipes/food/views/instant-pot-beef-and-sweet-potato-chili',
  'Sweet potatoes almost melt as they cook under pressure in the Instant Pot, lending a silky texture and sweet flavor to this harissa-spiced chili.'),
 ('Beef Chili',
  'http://www.epicurious.com/recipes/food/views/beef-chili',
  'Skip that dusty bottle of chili powder. Instead, soak and purée whole dried chiles to stir 

In [19]:
get_recipes('Nothing')

[('Quick-Pickled Charred Vegetables',
  'http://www.epicurious.com/recipes/food/views/quick-pickled-charred-grilled-vegetables',
  "This technique is nothing short of amazing—even if you're finicky about your pickles."),
 ('Sweet Potato and Sage Pancakes',
  'http://www.epicurious.com/recipes/food/views/sweet-potato-and-sage-pancakes',
  'These wheat-free pancakes are sweetened with nothing but homemade, sugar-free applesauce.'),
 ('Jalapeño Poppers with Smoked Gouda',
  'http://www.epicurious.com/recipes/food/views/jalapeno-poppers-with-smoked-gouda',
  'The moderate heat of jalapeños is a perfect counterbalance to this rich filling, a combination of cream cheese and smoked Gouda. The results are nothing like the breaded, deep-fried apps you get in sports bars.'),
 ('Air Fryer BBQ Pork Ribs',
  'http://www.epicurious.com/recipes/food/views/air-fryer-memphis-style-bbq-pork-ribs',
  'In the air fryer, you can have tender, pull-apart ribs in a fraction of the traditional time. The spice 

<h2>Let's write a function that</h2>
<h3>given a recipe link</h3>
<h3>returns a dictionary containing the ingredients and preparation instructions</h3>

In [27]:
recipe_link = "http://www.epicurious.com" + '/recipes/food/views/spicy-lemongrass-tofu-233844'

In [28]:
def get_recipe_info(recipe_link):
    recipe_dict = dict()
    import requests
    from bs4 import BeautifulSoup
    try:
        response = requests.get(recipe_link)
        if not response.status_code == 200:
            return recipe_dict
        result_page = BeautifulSoup(response.content,'lxml')
        ingredient_list = list()
        prep_steps_list = list()
        for ingredient in result_page.find_all('li',class_='ingredient'):
            ingredient_list.append(ingredient.get_text())
        for prep_step in result_page.find_all('li',class_='preparation-step'):
            prep_steps_list.append(prep_step.get_text().strip())
        recipe_dict['ingredients'] = ingredient_list
        recipe_dict['preparation'] = prep_steps_list
        return recipe_dict
    except Exception:
        #On a network or parsing error, return the empty dictionary
        return recipe_dict

In [29]:
get_recipe_info(recipe_link)

{'ingredients': ['2 lemongrass stalks, outer layers peeled, bottom white part thinly sliced and finely chopped (about 1/4 cup)',
  '1 1/2 tablespoons soy sauce',
  '2 teaspoons chopped Thai bird chilies or another fresh chili',
  '1/2 teaspoon dried chili flakes',
  '1 teaspoon ground turmeric',
  '2 teaspoons sugar',
  '1/2 teaspoon salt',
  '12 ounces tofu, drained, patted dry and cut into 3/4-inch cubes',
  '4 tablespoons vegetable oil',
  '1/2 yellow onion, cut into 1/8-inch slices',
  '2 shallots, thinly sliced',
  '1 teaspoon minced garlic',
  '4 tablespoons chopped roasted peanuts',
  '10 la lot, or pepper leaves, shredded, or 2/3 cup loosely packed Asian basil leaves'],
 'preparation': ['1. Combine the lemongrass, soy sauce, chilies, chili flakes, turmeric, sugar and salt in a bowl. Add the tofu cubes and turn to coat them evenly. Marinate for 30 minutes.',
  '2. Heat half of the oil in a 12-inch nonstick skillet over moderately high heat. Add the onion, shallot and garlic and 

<h2>Construct a list of dictionaries for all recipes</h2>

In [30]:
def get_all_recipes(keywords):
    results = list()
    #get_recipes returns None on failure, so fall back to an empty list
    all_recipes = get_recipes(keywords) or list()
    for recipe in all_recipes:
        recipe_dict = get_recipe_info(recipe[1])
        recipe_dict['name'] = recipe[0]
        recipe_dict['description'] = recipe[2]
        results.append(recipe_dict)
    return results
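The combining step can be tried without any network access by passing in the function that fetches the details. The sketch below does that with a stand-in fetcher. The names combine_recipes and fake_fetch, and the sample recipe, are hypothetical and exist only for this illustration.

```python
def combine_recipes(summaries, fetch_info):
    """Merge (name, link, description) tuples with per-recipe detail dicts."""
    results = []
    # 'or []' covers the case where the summaries could not be fetched (None)
    for name, link, description in summaries or []:
        record = dict(fetch_info(link))  # copy so the fetched dict is not mutated
        record["name"] = name
        record["description"] = description
        results.append(record)
    return results

# Stand-in for get_recipe_info so the example runs offline
def fake_fetch(link):
    return {"ingredients": ["tofu"], "preparation": ["Cook."]}

# Hypothetical summary tuple in the (name, link, description) layout
summaries = [("Plain Tofu", "http://www.epicurious.com/recipes/food/views/plain-tofu", "")]
print(combine_recipes(summaries, fake_fetch))
print(combine_recipes(None, fake_fetch))
```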

In [31]:
get_all_recipes("Tofu")

[{'ingredients': ['5 scallions',
   '4 garlic cloves, finely grated',
   '1 (2") piece ginger, peeled, finely grated',
   '1 Tbsp. virgin coconut oil or vegetable oil',
   '2 Tbsp. Thai red curry paste',
   '1 (14-oz.) package firm tofu, drained, broken into 1" pieces',
   '1 cup unsweetened coconut milk',
   'Kosher salt',
   '1 Tbsp. fresh lime juice',
   '1 Fresno chile, thinly sliced (optional)',
   '1 bunch collard greens, leaves halved lengthwise, ribs and stems removed, covered, chilled',
   '1/2 cup cilantro leaves with tender stems',
   '1/2 cup Dang Original coconut chips or toasted unsweetened coconut flakes',
   'Lime wedges (for serving)'],
  'preparation': ['Remove dark green tops from scallions and thinly slice on a diagonal. Place in a small bowl, cover with a damp paper towel, and chill until ready to serve. Thinly slice remaining white and pale green parts crosswise and place in another small bowl; add garlic and ginger. (Have scallion mixture, curry paste, tofu, and 

<h1>Logging in to a web server</h1>

<li>Figure out the login url 
<li>https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
<li>Look for the login form in the html source
<li>form_tag = page_soup.find('form')
<li>Look for ALL the inputs in the login form (some may be tricky!)
<li>input_tags = form_tag.find_all('input')
<li>Create a Python dict object with key,value pairs for each input
<li>Use requests.session to create an open session object
<li>Send the login request (POST)
<li>Send follow-up requests, keeping the session object open
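The steps for finding the form and collecting its inputs can be sketched on a small, hand-written form. The function name form_inputs and the sample form (including the token value abc123) are made up for this illustration; a real login page will have more inputs, so always check the actual html source.

```python
from bs4 import BeautifulSoup

def form_inputs(html):
    """Return a dict of name -> value for every named input in the first form."""
    soup = BeautifulSoup(html, "html.parser")
    form_tag = soup.find("form")
    if form_tag is None:
        return {}
    payload = {}
    for input_tag in form_tag.find_all("input"):
        name = input_tag.get("name")
        if name:  # inputs without a name are not submitted with the form
            payload[name] = input_tag.get("value", "")
    return payload

# Hand-written form; the hidden input stands in for a login token
sample_form = """
<form method="post">
  <input name="wpName" type="text"/>
  <input name="wpPassword" type="password"/>
  <input name="wpLoginToken" type="hidden" value="abc123"/>
  <input type="submit" value="Log in"/>
</form>
"""
print(form_inputs(sample_form))
```

Hidden inputs such as the token are the "tricky" ones: they do not show on the page but the server expects them in the POST data.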

<h2>Get username and password</h2>
<li>Best to store in a file for reuse
<li>You will need to set up your own login and password and place them in a file called WikiCredentials.txt (the code below reads it from /Users/ya/Desktop, so change the path to match your machine)
<li>Line one of the file should contain your username
<li>Line two your password

In [2]:
with open('/Users/ya/Desktop/WikiCredentials.txt') as f:
    contents = f.read().split('\n')
    username = contents[0]
    password = contents[1]


FileNotFoundError: [Errno 2] No such file or directory: '/Users/ya/Desktop/WikiCredentials.txt'
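The FileNotFoundError above simply means the credentials file does not exist at that path. A minimal sketch of a reader that takes the path as an argument and checks that both lines are present is shown below. The function name read_credentials is an assumption, and the demonstration writes a throwaway file with placeholder values into a temporary folder rather than using real credentials.

```python
import tempfile
from pathlib import Path

def read_credentials(path):
    """Return (username, password) from the first two lines of a text file."""
    lines = Path(path).read_text().splitlines()
    if len(lines) < 2:
        raise ValueError("expected the username on line 1 and the password on line 2")
    # strip() drops stray spaces or carriage returns around each value
    return lines[0].strip(), lines[1].strip()

# Demonstration with a temporary file and placeholder values
with tempfile.TemporaryDirectory() as folder:
    demo_file = Path(folder) / "WikiCredentials.txt"
    demo_file.write_text("example_user\nexample_password\n")
    demo_credentials = read_credentials(demo_file)

print(demo_credentials)
```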

<h3>Construct an object that contains the data to be sent to the login page</h3>

In [3]:

payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '', #We need to read this from the page
    }

NameError: name 'username' is not defined

<h3>Get the value of the login token</h3>

In [4]:
def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input',{'name':"wpLoginToken"}).get('value')
    return token


<h3>Set up a session, log in, and get data</h3>

In [5]:
import requests
from bs4 import BeautifulSoup
with requests.session() as s:
    response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
    payload['wpLoginToken'] = get_login_token(response)
    #Send the login request
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    
    #Get another page and check if we’re still logged in
    #response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
    #data = BeautifulSoup(response.content,'lxml')
    #print(data.find('div',class_='mw-changeslist').get_text())

NameError: name 'payload' is not defined

In [None]:
print(response_post.content)