In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time

In [None]:
from bs4 import BeautifulSoup
import requests

To fetch a webpage's content, we can simply use the ``get()`` function within the requests library:

In [None]:
url = "https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect"
response = requests.get(url) # you can use any URL that you wish

The response variable has many highly useful attributes, such as:
- status_code
- text
- content

Let's try each of them!

### response.status_code

In [None]:
response.status_code

You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka client/user/you). [Again, you can click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes.
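
If you would like to look up what a status code means without leaving Python, the standard library's `http.HTTPStatus` enum knows the standard codes. The sketch below is only an illustration; the helper name `describe_status` is our own and is not part of `requests`:

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '200 OK' (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{status.value} {status.phrase}"


for code in (200, 404, 500):
    print(describe_status(code))
```

In practice you could call `describe_status(response.status_code)` right after a request to get a readable summary.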

### response.text


In [None]:
response.text

Holy moly! That looks awful. If we use our browser to visit the URL, then right-click the page and click 'View Page Source', we see that it is identical to this chunk of glorious text.

### response.content

In [None]:
response.content

What?! This seems identical to the ``.text`` field. However, the careful eye would notice that the very 1st characters differ; that is, ``.content`` has a *b'* character at the beginning, which in Python syntax denotes that the data type is bytes, whereas the ``.text`` field did not have it and is a regular String.
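
The bytes-versus-string distinction can be reproduced without any web request. The short sketch below uses a made-up string (not the NPR page) to show that bytes print with a leading `b'` and that decoding them gives back a regular string:

```python
# A made-up piece of text containing non-ASCII characters
original = "Caf\u00e9 \u25be"

raw = original.encode("utf-8")   # bytes, analogous to response.content
text = raw.decode("utf-8")       # str, analogous to response.text

print(type(raw).__name__, raw)           # the repr of bytes starts with b'
print(type(text).__name__, ascii(text))  # ascii() keeps the printout console-safe
```

Decoding with the wrong encoding is a common source of garbled characters, which is why it helps to know which of the two types you are holding.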

Ok, so that's great, but how do we make sense of this text? We could manually parse it, but that's tedious and difficult. As mentioned, BeautifulSoup is specifically designed to parse this exact content (any webpage content).

## BEAUTIFUL SOUP
![title](images/soup_for_you.jpg) (property of NBC)


The [documentation for BeautifulSoup is found here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

A BeautifulSoup object can be initialized with the ``.content`` from request and a flag denoting the type of parser that we should use. For example, we could specify ``html.parser``, ``lxml``, etc. ([documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers)). Since we are interested in standard webpages that use HTML, let's specify the html.parser:

In [None]:
soup = BeautifulSoup(response.content, "html.parser")
soup

Alright! That looks a little better; there's some whitespace formatting, adding some structure to our content! HTML code is structured by `<tags>`. Every tag has an opening and closing portion, denoted by ``< >`` and ``</ >``, respectively. If we want just the text (not the tags), we can use:

In [None]:
soup.get_text()

There's some tricky Javascript still nesting within it, but it definitely cleaned up a bit. On other websites, you may find even clearer text extraction.

As detailed in the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), the easiest way to navigate through the tags is to simply name the tag you're interested in. For example:

In [None]:
soup.head # fetches the head tag, which encompasses the title tag

Usually head tags are small and only contain the most important contents; however, here, there's some Javascript code. The ``title`` tag resides within the head tag.

In [None]:
soup.title # we can specifically call for the title tag

This result includes the tag itself. To get just the text within the tags, we can use the ``.string`` property.

In [None]:
soup.title.string

We can navigate to the parent tag (the tag that encompasses the current tag) via the ``.parent`` attribute:

In [None]:
soup.title.parent.name
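
To try these navigation properties without fetching a live page, here is a minimal sketch on a tiny, made-up HTML string. The markup and the variable name `demo` are our own illustration and have nothing to do with the NPR page:

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used only for illustration
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello, <a href='/about'>world</a>!</p></body></html>"
)
demo = BeautifulSoup(html, "html.parser")

print(demo.title)              # the whole tag, including <title> and </title>
print(demo.title.string)       # just the text inside the tag
print(demo.title.parent.name)  # name of the tag that encloses <title>
print(demo.get_text())         # all text on the page, with the tags removed
```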

# 3. Parse the page with Beautiful Soup
In HTML code, paragraphs are often denoted with a ``<p>`` tag.

In [None]:
soup.p

This returns the first paragraph, and we can access properties of the given tag with the same syntax we use for dictionaries and dataframes:

In [None]:
soup.p['class']

In addition to 'paragraph' (aka p) tags, link tags are also very common and are denoted by ``<a>`` tags.

In [None]:
soup.a

It is called the a tag because links are also called 'anchors'. Nearly every page has multiple paragraphs and anchors, so how do we access the subsequent tags? There are two common functions, `.find()` and `.find_all()`.

In [None]:
soup.find('title')

In [None]:
soup.find_all('title')

Here, the results were seemingly the same, since there is only one title to a webpage. However, you'll notice that ``.find_all()`` returned a list, not a single item. Sure, there was only one item in the list, but it returned a list. As the name implies, find_all() returns all items that match the passed-in tag.
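
The difference between the two functions is easiest to see on a small example, including what happens when nothing matches. This is a sketch on made-up HTML; the variable names are ours:

```python
from bs4 import BeautifulSoup

# Hand-written HTML with two paragraphs and no table
html = """
<html><head><title>Demo</title></head>
<body>
  <p>First <a href="/one">one</a></p>
  <p>Second <a href="https://example.com/two">two</a></p>
</body></html>
"""
demo = BeautifulSoup(html, "html.parser")

first_p = demo.find("p")              # a single tag: the first match
all_p = demo.find_all("p")            # a list-like collection of every match
missing = demo.find("table")          # no match: find() gives None
missing_all = demo.find_all("table")  # no match: find_all() gives an empty collection

print(first_p.get_text())
print(len(all_p), missing, len(missing_all))
```

Because `.find()` can return `None`, chaining further calls onto it fails when the tag is absent, so it is worth checking the result before continuing.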

In [None]:
soup.find_all('a')

Look at all of those links! Amazing. It might be hard to read but the **href** portion of an *a* tag denotes the URL, and we can capture it via the ``.get()`` function.

In [None]:
for link in soup.find_all('a'): # we could optionally pass the href=True flag .find_all('a', href=True)
    print(link.get('href'))

Many of those links are relative to the current URL (e.g., /section/news/).
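
Relative links can be turned into absolute ones with `urljoin` from the standard library. The sketch below uses a made-up base URL and made-up links rather than the ones scraped above:

```python
from urllib.parse import urljoin

# Hypothetical page address and a few hrefs of the kinds found on real pages
base_url = "https://www.example.com/articles/2018/story.html"
hrefs = ["/section/news/", "related.html", "https://other.example.org/page"]

absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Links that are already absolute pass through unchanged, so the same call can be applied to every `href` on a page.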

In [None]:
paragraphs = soup.find_all('p')
paragraphs

If we want just the paragraph text:

In [None]:
for pa in paragraphs:
    print(pa.get_text())

Since there are multiple tags and various attributes, it is useful to check the data type of BeautifulSoup objects:

In [None]:
type(soup.find('p'))

Since the ``.find()`` function returns a BeautifulSoup element, we can tack on multiple calls that continue to return elements:

In [None]:
soup.find('p')

In [None]:
soup.find('p').find('a')

In [None]:
soup.find('p').find('a').attrs['href'] # .attrs is a dictionary of the tag's attributes

In [None]:
soup.find('p').find('a').text

That doesn't look pretty, but it makes sense because if you look at what ``.find('a')`` returned, there is plenty of whitespace. We can remove that with Python's built-in ``.strip()`` function.

In [None]:
soup.find('p').find('a').text.strip()

**NOTE:** above, we read a link's attribute through the ``.attrs`` property. ``.attrs`` is a dictionary that maps each attribute name of a tag to its value, so ``.attrs['href']`` looks up the _key_ ``href`` and returns its _value_ (the URL), whatever that value happens to be. The ``.find()`` and ``.find_all()`` functions accept a similarly named ``attrs`` parameter, which also takes a dictionary, but there it acts as a filter: only tags whose attributes match the given keys and values are returned. So, if you inspect your HTML code and notice select regions from which you'd like to extract text, you can specify their attributes as part of the search, too!

For example, in the full ``response.text``, we see the following line:

``<header class="npr-header" id="globalheader" aria-label="NPR header">``

Let's say that we know that the information we care about is within tags that match this template (i.e., **class** is an attribute, and its value is **'npr-header'**).

In [None]:
soup.find('header', attrs={'class':'npr-header'})

This matched it! We could then continue further processing by tacking on other commands:

In [None]:
soup.find('header', attrs={'class':'npr-header'}).find_all("li") # li stands for list items

This returns all of our list items, and since it's within a particular header section of the page, it appears they are links to menu items for navigating the webpage. If we wanted to grab just the links within these:

In [None]:
menu_links = set()
for list_item in soup.find('header', attrs={'class':'npr-header'}).find_all("li"):
    for link in list_item.find_all('a', href=True):
        menu_links.add(link)
menu_links # a unique set of all the seemingly important links in the header
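
Here is the same pattern on a small, made-up header, so you can run it without a live request. Unlike the cell above, this sketch stores the `href` strings rather than the tag objects, which makes duplicates collapse as expected; the markup and names are illustrative assumptions:

```python
from bs4 import BeautifulSoup

# Hand-written HTML loosely imitating a site header and footer
html = """
<header class="npr-header" id="globalheader">
  <ul>
    <li><a href="/news">News</a></li>
    <li><a href="/music">Music</a></li>
    <li><a href="/news">News (again)</a></li>
    <li><a>No link here</a></li>
  </ul>
</header>
<footer class="npr-footer"><ul><li><a href="/contact">Contact</a></li></ul></footer>
"""
demo = BeautifulSoup(html, "html.parser")

# attrs filters on attribute values: only the header with this class matches
header = demo.find("header", attrs={"class": "npr-header"})

menu_hrefs = set()
for list_item in header.find_all("li"):
    for link in list_item.find_all("a", href=True):  # skip <a> tags without href
        menu_hrefs.add(link["href"])

print(sorted(menu_hrefs))
```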

## TAKEAWAY LESSON
The above tutorial isn't meant to be a study guide to memorize; its point is to show you the most important functionality that exists within BeautifulSoup, and to illustrate how one can access different pieces of content. No two web scraping tasks are identical, so it's useful to play around with code and try different things, while using the above as examples of how you may navigate between different tags and properties of a page. Don't worry; we are always here to help when you get stuck!

# String formatting
As we parse webpages, we may often want to further adjust and format the text to a certain way.

For example, say we wanted to scrape a political website that lists all US Senators' names and office phone numbers. We may want to store information for each senator in a dictionary. All senators' information may be stored in a list. Thus, we'd have a list of dictionaries. Below, we will initialize such a list of dictionaries (it has only 3 senators, for illustrative purposes, but imagine it contains many more).

In [None]:
# this is a bit clumsy of an initialization, but we spell it out this way for clarity purposes
# NOTE: imagine the dictionary were constructed in a more organic manner
senator1 = {"name":"Lamar Alexander", "number":"555-229-2812"}
senator2 = {"name":"Tammy Baldwin", "number":"555-922-8393"}
senator3 = {"name":"John Barrasso", "number":"555-827-2281"}
senators = [senator1, senator2, senator3]
print(senators)

In the real world, we may not want the final form of our information to be in a Python dictionary; rather, we may need to send an email to people in our mailing list, urging them to call their senators. If we have a templated format in mind, we can do the following:

In [None]:
email_template = """Please call {name} at {number}"""
for senator in senators:
    print(email_template.format(**senator))

**Please [visit here](https://docs.python.org/3/library/stdtypes.html#str.format)** for further documentation.

Alternatively, one can also format text via ``f-strings``. [See documentation here](https://docs.python.org/3/reference/lexical_analysis.html#f-strings). For example, using the above data structure and goal, one could yield identical results via:

In [None]:
for senator in senators:
    print(f"Please call {senator['name']} at {senator['number']}")

Additionally, sometimes we wish to search large strings of text. If we wish to find all occurrences within a given string, a very mechanical, procedural way of doing it would be to use the ``.find()`` function in Python and to repeatedly update the starting index from which we are looking.
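
The mechanical approach described above can be sketched as follows. The function name `find_all_occurrences` is our own; it simply calls Python's built-in `str.find()` in a loop, moving the starting index forward after each hit:

```python
def find_all_occurrences(text: str, target: str) -> list:
    """Return the starting index of every occurrence of target in text."""
    positions = []
    start = 0
    while True:
        idx = text.find(target, start)  # returns -1 when there are no more matches
        if idx == -1:
            break
        positions.append(idx)
        start = idx + 1  # step one character forward so overlapping matches are found
    return positions


print(find_all_occurrences("abcabcabc", "abc"))
print(find_all_occurrences("the cat sat on the mat near the hat", "the"))
```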

## Regular Expressions
A far more suitable and powerful approach is to use Regular Expressions, a pattern-matching mechanism used throughout Computer Science and programming (it's not specific to Python). A tutorial on Regular Expressions (aka regex) is beyond the scope of this lab, but below are several great resources that we recommend, if you are interested in them (they could be very useful for a homework problem):
- https://docs.python.org/3.3/library/re.html
- https://regexone.com
- https://docs.python.org/3/howto/regex.html
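
As a small taste, here is a sketch that pulls every phone number of the form used in the senator example out of a longer string. The sentence being searched is made up, and the pattern only covers the `ddd-ddd-dddd` layout:

```python
import re

# Three digits, a dash, three digits, a dash, four digits
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

text = "Call Lamar Alexander at 555-229-2812 or Tammy Baldwin at 555-922-8393."
numbers = phone_pattern.findall(text)  # every non-overlapping match, left to right
print(numbers)
```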

# Additional Python/Homework Comment
In Homework #1, we ask you to complete functions that have signatures with a syntax you may not have seen before:

``def create_star_table(starlist: list) -> list:``

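To be clear, this syntax merely states that the input parameter is expected to be a list and that the function is expected to return a list. The function is otherwise no different from any other; Python itself does not check these annotations when the code runs, so they serve as documentation of the intended behavior (and can be checked by external tools).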

It is **typing** our function (these annotations are called type hints). Please [see this documentation if you have more questions.](https://docs.python.org/3/library/typing.html)
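
A quick way to get a feel for this syntax is to annotate a toy function and inspect it. The function `double_all` below is a hypothetical example (it is not the homework function); note that Python stores the annotations but does not enforce them when the function is called:

```python
def double_all(values: list) -> list:
    """Return a new list with every element doubled."""
    return [v * 2 for v in values]


print(double_all.__annotations__)  # the hints are stored on the function object

result_list = double_all([1, 2, 3])
result_str = double_all("ab")      # not a list, yet Python raises no error
print(result_list, result_str)
```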

# Walkthrough Example (of Web Scraping)
We're going to look at the structure of Goodreads' best books list (**NOTE: Goodreads is described a little more within the other Lab2_More_Pandas.ipynb notebook**). We'll use the Developer Tools in Chrome; Safari and Firefox have similar tools available. To get this page we use the `requests` module. But first we should check whether the company's policy allows scraping. Check the [robots.txt](https://www.goodreads.com/robots.txt) to find which pages/elements are not accessible. Please read and verify.
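
Besides reading robots.txt by eye, you can check rules programmatically with the standard library's `urllib.robotparser`. The sketch below parses a made-up rule set so that it runs offline; these are NOT Goodreads' actual rules, and the URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse lines directly instead of downloading a file

allowed = parser.can_fetch("*", "https://www.example.com/list/show/1")
blocked = parser.can_fetch("*", "https://www.example.com/private/data")
print(allowed, blocked)
```

For a real site you would instead call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()`, and then ask `can_fetch` about the pages you intend to request.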

![](images/goodreads1.png)

In [None]:
url="https://www.npr.org/2018/11/05/664395755/what-if-the-polls-are-wrong-again-4-scenarios-for-what-might-happen-in-the-elect"
response = requests.get(url)
# response.status_code
# response.content

# Beautiful Soup (library) time!
soup = BeautifulSoup(response.content, "html.parser")
# print(soup)
# print(soup.prettify())
soup.find("title")

    # Q1: how do we get the title's text?
# soup.find("title").string
    # Q2: how do we get the webpage's entire content?
# soup.get_text()

In [None]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
url = URLSTART+BESTBOOKS+'1'
print(url)
page = requests.get(url)

We can see properties of the page. Most relevant are `status_code` and `text`. The former tells us whether the web page was found and, if so, whether the request succeeded. (See lecture notes.)

In [None]:
page.status_code # 200 is good

200

In [None]:
page.text[:5000]

'<!DOCTYPE html>\n<html class="desktop withSiteHeaderTopFullImage\n">\n<head>\n  <title>Best Books Ever (94932 books)</title>\n\n<meta content=\'93,439 books based on 229859 votes: The Hunger Games by Suzanne Collins, Harry Potter and the Order of the Phoenix by J.K. Rowling, Pride and Prejudice b...\' name=\'description\'>\n<meta content=\'telephone=no\' name=\'format-detection\'>\n<link href=\'https://www.goodreads.com/list/show/1.Best_Books_Ever\' rel=\'canonical\'>\n\n\n\n    <script type="text/javascript"> var ue_t0=window.ue_t0||+new Date();\n </script>\n  <script type="text/javascript">\n    var ue_mid = "A1PQBFHBHS6YH1";\n    var ue_sn = "www.goodreads.com";\n    var ue_furl = "fls-na.amazon.com";\n    var ue_sid = "727-5034609-6932727";\n    var ue_id = "GQKRR3AQ2FANQDXA7AAG";\n\n    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:a

Let us write a loop to fetch 2 pages of "best-books" from Goodreads. Notice the use of a format string (`'%02d' % i`). This is an example of old-style Python format strings.

In [None]:
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
for i in range(1,3):
    bookpage=str(i)
    stuff=requests.get(URLSTART+BESTBOOKS+bookpage)
    # filetowrite="files/page"+ '%02d' % i + ".html"
    filetowrite="page"+ '%02d' % i + ".html"

    print("FTW", filetowrite)
    # Write as UTF-8 explicitly. With the platform default encoding (e.g. 'charmap' on Windows),
    # this write failed with: UnicodeEncodeError: 'charmap' codec can't encode character '\u25be'
    with open(filetowrite, "w", encoding="utf-8") as fd:
        fd.write(stuff.text)
    time.sleep(2)

FTW page01.html
FTW page02.html
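
A `UnicodeEncodeError` from the 'charmap' codec typically means that the text contains a character that the platform's default encoding cannot represent. The sketch below reproduces the situation offline with a made-up string and the Windows code page `cp1252` (an assumption about which encoding was in play), and shows that passing `encoding="utf-8"` to `open()` makes the write and read round-trip cleanly:

```python
import os
import tempfile

# A made-up snippet containing '\u25be', the character from the error message
text = "Want to Read \u25be"

try:
    text.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False  # cp1252 has no slot for this character

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "page01.html")
    with open(path, "w", encoding="utf-8") as fd:  # explicit encoding on write
        fd.write(text)
    with open(path, encoding="utf-8") as fd:       # and the same encoding on read
        round_trip = fd.read()

print(fits_cp1252, round_trip == text)
```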

## 2. Parse the page, extract book urls

Notice how we do file input-output, and use beautiful soup in the code below. The `with` construct ensures that the file being read is closed, something we do explicitly for the file being written. We look for the elements with class `bookTitle`, extract the urls, and write them into a file.

In [None]:
bookdict={}
for i in range(1,3):
    books=[]
    stri = '%02d' % i
    # filetoread="files/page"+ stri + '.html'
    filetoread="page"+ stri + '.html'
    print("FTW", filetoread)
    with open(filetoread, encoding="utf-8") as fdr:  # pages may contain non-ASCII characters
        data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
    for e in soup.select('.bookTitle'):
        books.append(e['href'])
    print(books[:10])
    bookdict[stri]=books
    print(bookdict)
    fd=open("list"+stri+".txt","w")
    fd.write("\n".join(books))
    fd.close()
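
The class-based selection used above can be tried on a few hand-written rows. The HTML below is made up and only imitates the idea of links carrying a `bookTitle` class; it is not copied from Goodreads:

```python
from bs4 import BeautifulSoup

# Hand-written rows: two book links and one author link
html = """
<table>
  <tr><td><a class="bookTitle" href="/book/show/1.First_Book"><span>First Book</span></a></td></tr>
  <tr><td><a class="authorName" href="/author/show/9.Someone">Someone</a></td></tr>
  <tr><td><a class="bookTitle" href="/book/show/2.Second_Book"><span>Second Book</span></a></td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; ".bookTitle" matches any tag with that class
book_urls = [e["href"] for e in demo.select(".bookTitle")]
print(book_urls)
```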

Here is George Orwell's 1984

In [None]:
bookdict['02'][0]

Let's go look at the first URLs on both pages.

![](images/goodreads2.png)

## 3. Parse a book page, extract book properties

Ok, so now let's dive in, get one of these files, and parse it.

In [None]:
furl=URLSTART+bookdict['02'][0]
furl

![](images/goodreads3.png)

In [None]:
fstuff=requests.get(furl)
print(fstuff.status_code)

In [None]:
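# d = BeautifulSoup(fstuff.text, 'html.parser')
# pass the raw bytes so that from_encoding takes effect (it is ignored for already-decoded text);
# this takes care of non-ASCII (e.g., Arabic) strings
d = BeautifulSoup(fstuff.content, 'html.parser', from_encoding="utf-8")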

In [None]:
d.select("meta[property='og:title']")[0]['content']

Let's get everything we want...

In [None]:
d

In [None]:
#d=BeautifulSoup(fstuff.text, 'html.parser', from_encoding="utf-8")
print(
"title", d.select_one("meta[property='og:title']")['content'],"\n",
# "isbn", d.select("meta[property='books:isbn']")[0]['content'],"\n",
"type", d.select("meta[property='og:type']")[0]['content'],"\n",
# "author", d.select("meta[property='books:author']")[0]['content'],"\n",
#"average rating", d.select_one("span.average").text,"\n",
# "ratingCount", d.select("meta[itemprop='ratingCount']")[0]["content"],"\n"
#"reviewCount", d.select_one("span.count")["title"]
)

Ok, now that we know what to do, let's wrap our fetching into a proper script. So that we don't overwhelm their servers, we will only fetch 5 from each page, but you get the idea...

We'll segue a bit to explore new-style format strings. See https://pyformat.info for more info.
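
For comparison, here is a sketch that builds the same zero-padded file name three ways: with the old-style `%` operator, with new-style `str.format`, and with an f-string. The variable names and the file name are just for illustration:

```python
page_number = 3

old_style = "page%02d.html" % page_number           # old-style: % operator
new_style = "page{:0>2}.html".format(page_number)   # new-style: fill '0', right-align, width 2
f_string = f"page{page_number:02d}.html"            # f-string with a format spec

print(old_style, new_style, f_string)
```

All three produce the same string, so the choice between them is mostly a matter of readability.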

In [None]:
"list{:0>2}.txt".format(3)

In [None]:
a = "4"
b = 4
class Four:
    def __str__(self):
        return "Fourteen"
c=Four()

In [None]:
"The hazy cat jumped over the {} and {} and {}".format(a, b, c)

## 4. Set up a pipeline for fetching and parsing






Ok, let's get back to the fetching...

In [None]:
fetched=[]
for i in range(1,3):
    # with open("files/list{:0>2}.txt".format(i)) as fd:
    with open("list{:0>2}.txt".format(i)) as fd:

        counter=0
        for bookurl_line in fd:
            if counter > 4:
                break
            bookurl=bookurl_line.strip()
            stuff=requests.get(URLSTART+bookurl)
            filetowrite=bookurl.split('/')[-1]
            # filetowrite="files/"+str(i)+"_"+filetowrite+".html"
            filetowrite=str(i)+"_"+filetowrite+".html"

            print("FTW", filetowrite)
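            # use a separate name so we don't shadow `fd`, the list file we are iterating over
            with open(filetowrite, "w", encoding="utf-8") as fout:
                fout.write(stuff.text)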
            fetched.append(filetowrite)
            time.sleep(2)
            counter=counter+1
            
print(fetched)

Ok we are off to parse each one of the html pages we fetched. 

---

We have provided the skeleton of the code and the code to parse the year, since it is a bit more complex...see the difference in the screenshots above. 

In [None]:
import re
yearre = r'\d{4}'
def get_year(d):
    if d.select_one("nobr.greyText"):
        return d.select_one("nobr.greyText").text.strip().split()[-1][:-1]
    else:
        thetext=d.select("div#details div.row")[1].text.strip()
        rowmatch=re.findall(yearre, thetext)
        if len(rowmatch) > 0:
            rowtext=rowmatch[0].strip()
        else:
            rowtext="NA"
        return rowtext

<div class="exercise"><b>Exercise</b></div>

Your job is to fill in the code to get the genres.

In [None]:
def get_genres(d):
    # your code here
    genres=d.select("div.elementList div.left a")
    glist=[]
    for g in genres:
        glist.append(g['href'])
    return glist

In [None]:

listofdicts=[]
for filetoread in fetched:
    print(filetoread)
    td={}
    with open(filetoread, encoding="utf-8") as fd:  # read with an explicit encoding
        datext = fd.read()
    d=BeautifulSoup(datext, 'html.parser')
    td['title']=d.select_one("meta[property='og:title']")['content']
    # td['isbn']=d.select_one("meta[property='books:isbn']")['content']
    td['booktype']=d.select_one("meta[property='og:type']")['content']
    # td['author']=d.select_one("meta[property='books:author']")['content']
    #td['rating']=d.select_one("span.average").text
    # td['year'] = get_year(d)
    td['file']=filetoread
    glist = get_genres(d)
    td['genres']="|".join(glist)
    listofdicts.append(td)

In [None]:
listofdicts[0]

Finally, let's write all this stuff into a csv file which we will use to do analysis.

In [None]:
df = pd.DataFrame.from_records(listofdicts)
df

In [None]:
# df.to_csv("files/meta_utf8_EK.csv", index=False, header=True)
df.to_csv("meta_utf8_EK.csv", index=False, header=True)