# Web Crawling with Python

* Adapted from Chapters 1-3 of *Web Scraping with Python: Collecting Data from the Modern Web* by Ryan Mitchell, O'Reilly, 2015.
* Based on material provided by Prof. Louridas.

---

> John Pavlopoulos, Assistant Professor (elected) <br />
> Department of Informatics, Athens University of Economics and Business <br />
> annis@aueb.gr <br />

Teaching Assistant: Konstantina Liagkou, konstantinalia4@gmail.com

# HTTP and Python

* Python's standard library includes the `urllib.request` HTTP module (called `urllib2` in Python 2).

* However, this module has often been criticised because even simple tasks take a lot of code.

* We'll be using the requests library: http://docs.python-requests.org/en/latest/.

* **To install it, run `pip3 install requests`.**

In [1]:
#!pip3 install requests
import requests
import pprint

r = requests.get("https://en.wikipedia.org/wiki/HTML")
print(r.content[:500])

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-d'


# HTTP Basics

* The Hypertext Transfer Protocol (HTTP) is the basis of the web.

* HTTP is a stateless protocol built on *requests* and *responses*: the server handles each request independently of the previous ones.

* The different kinds of requests correspond to different *verbs*.


# HTTP `GET` and `POST`


* The two most important verbs in HTTP are `GET` and `POST`.

* The `GET` method asks for the specified resource given *in* the URI.

* The `POST` method is used to *submit* the entities included in the request *to* the specified URI.

# Example of HTTP `GET`

* When we want to take, say, the resource `index.html` from the host `www.example.com`, we issue the following request:

```
GET /index.html HTTP/1.1
Host: www.example.com
```
* And then we get, for example, the following response:

```
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 138
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
<head>
  <title>An Example Page</title>
</head>
<body>
  Hello World, this is a very simple HTML document.
</body>
</html>
```
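To make the structure of a response concrete, the sketch below splits a raw HTTP/1.1 response into its status line, headers, and body using only string operations. The `raw_response` text is a shortened, made-up variant of the response above; in practice a library such as requests does this parsing for us.

```python
# A shortened, illustrative response; real responses use CRLF line endings.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)

# The first blank line separates the head (status line + headers) from the body.
head, _, body = raw_response.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")

# The status line has three parts: protocol version, status code, reason phrase.
version, status_code, reason = status_line.split(" ", 2)

# Each header line has the form "Name: value".
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, status_code, reason)
print(headers["Content-Type"])
print(body)
```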

# Query String

* We can also give a *query string* to our request, when we want to ask for something specific that does not fit into the hierarchical structure of the URL:

```
http://example.com/over/there?name=ferret
```

* The query string is what follows the question mark, and it is composed of `key=value` pairs, like for instance:

```
field1=value1&field2=value2&field3=value3...
```
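As a small illustration, the standard library's `urllib.parse` module can build and decode query strings. The field names and values below are placeholders, not part of any real site.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build a query string from key/value pairs (the field names are illustrative).
params = {"field1": "value1", "field2": "value 2"}
query = urlencode(params)
print(query)  # special characters such as spaces are escaped

url = "http://example.com/over/there?" + query

# Split the URL and decode the query string back into a dictionary of lists.
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query))
```

With requests, the same effect is obtained by passing a dictionary through the `params` argument, e.g. `requests.get(url, params=params)`.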

# HTTP `POST` Requests


* A drawback of `GET` requests is that the query string is part of the URL, so the parameters are visible and may end up in the browser history and in server logs.

* With `POST` requests, the parameters are passed inside the body of the request.

# Example of HTTP `POST`

* A `POST` request would look like:

```
POST http://transtats.bts.gov/DownLoad_Table.asp?Table_ID=236&Has_Group=3&Is_Zipped=0 HTTP/1.1
Remote Address:204.68.194.70:80

...
RawDataTable:T_ONTIME
sqlstr: SELECT FL_DATE,UNIQUE_CARRIER,AIRLINE_ID,CARRIER,TAIL_NUM,FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID FROM  T_ONTIME WHERE Month =1 AND YEAR=2016
varlist:FL_DATE,UNIQUE_CARRIER,AIRLINE_ID,CARRIER,TAIL_NUM,FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID
...
```
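The sketch below shows where the parameters of a `POST` request end up, using only the standard library and without sending anything over the network. The URL and the form fields are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and URL, for illustration only.
form_fields = {"name": "ferret", "colour": "brown"}
body = urlencode(form_fields).encode("ascii")

# Supplying `data` makes urllib treat the request as a POST.
request = Request("http://example.com/over/there", data=body)

print(request.get_method())
print(request.full_url)  # the URL carries no query string
print(request.data)      # the parameters travel in the request body
```

With requests, a comparable request would be sent with `requests.post(url, data=form_fields)`.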


# Beautiful Soup

* The BeautifulSoup library was named after a Lewis Carroll poem of the same name in *Alice’s Adventures in Wonderland*.

* In the story, this poem is sung by a character called the Mock Turtle (itself a pun on the popular Victorian dish Mock Turtle Soup made not of turtle but of cow).

* Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

* **To install BeautifulSoup run: `pip3 install beautifulsoup4`.**



# BeautifulSoup Objects

* `BeautifulSoup` objects: the result of parsing an HTML page.

* `Tag` objects: a `Tag` object corresponds to an XML or HTML tag in the original document. Tags are returned by methods such as `find()` or `find_all()`.

* `NavigableString` objects: the text inside tags (rather than the tags themselves).

* `Comment` objects: represent markup comments; a `Comment` object is actually a special type of `NavigableString`.

* HTML elements also contain attributes. These are available as key-value pairs if we use the tag as a dictionary, e.g. `element[attribute]`. For multi-valued attributes, the value is a list.
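The object types listed above can be inspected on a tiny hand-written document. This is only a sketch: the HTML string is made up, and it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A tiny hand-written document, used instead of a live page.
html = '<p id="intro" class="lead big">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
print(type(p).__name__)  # the element itself is a Tag
print(p["id"])           # single-valued attribute: a string
print(p["class"])        # multi-valued attribute: a list

# The children of <p> are a text node and a comment node.
text, comment = p.contents
print(type(text).__name__, repr(str(text)))
print(type(comment).__name__, repr(str(comment)))

# Comment is a subclass of NavigableString.
print(isinstance(comment, NavigableString))
```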

In [2]:
#!pip install beautifulsoup4 requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/HTML")

html = r.content
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1)

<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">HTML</span></h1>


In [4]:
r = requests.get("https://www.beeradvocate.com/beer/")
html = r.content
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html class="Public NoJs uix_javascriptNeedsInit LoggedOut Sidebar Responsive pageIsLtr not_hasTabLinks not_hasSearch is-sidebarOpen hasRightSidebar is-setWidth navStyle_0 pageStyle_0 hasFlexbox" dir="LTR" id="XenForo" lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <base href="https://www.beeradvocate.com/community/"/>
  <script>
   var _b = document.getElementsByTagName('base')[0], _bH = "https://www.beeradvocate.com/community/";
			if (_b && _b.href != _bH) _b.href = _bH;
  </script>
  <title>
   Beer Reviews | BeerAdvocate
  </title>
  <noscript>
   <style>
    .JsOnly, .jsOnly { display: none !important; }
   </style>
  </noscript>
  <link href="css.php?css=xenforo,form,public&amp;style=6&amp;dir=LTR&amp;d=1696564682" rel="stylesheet"/>
  <link href="css.php?css=login_bar,moderator_b

In [5]:
soup.h1

<h1>What Beer Are You Reviewing Now?</h1>

* The HTML content of the page was transformed into a BeautifulSoup object, with the following structure:
```
   * html→ <html><head>...</head><body>...</body></html>
      * head → <head>...<title>Beer Reviews: Most Recent | BeerAdvocate</title>...</head>
        * title → <title>Beer Reviews: Most Recent | BeerAdvocate</title>
      * body → <body><div><div>...<h1>Beer Reviews: Most Recent</h1>...</div></div></body>
        * <div class="js-uix_panels uix_panels">
          * <div class="mainPanelWrapper">
            * <div id="uix_wrapper">
              * <div id="headerMover">
                * <div id="content" class="bridge_page">
                  * <div class="pageWidth">
                    * <div class="pageContent">
                      * <div class="uix_contentFix">
                        * <div class="mainContainer">
                          * <div class="mainContent">
                            * <div class="titleBar">
                              * <h1>Beer Reviews: Most Recent</h1>
```

* Note that the `<h1>` tag that we extracted from the page was nested many layers deep into our BeautifulSoup object structure (`html` → `body` → `div` → `...` → `h1`).

* However, when we actually fetched it from the object, we called the `h1` tag directly.

```
soup.h1
```
* In fact, any of the following expressions would produce the same output:

```
soup.html.body.h1
soup.body.h1
soup.html.h1
```
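The shortcut can be checked on a small inline document. The HTML below is made up for illustration; the point is that attribute access behaves like `find()` and returns the first match, however deeply it is nested.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body><div><div><h1>Nested heading</h1></div></div>
<h1>Second heading</h1></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute access is shorthand for find(): it returns the first match
# anywhere below the starting point.
print(soup.h1)
print(soup.h1 == soup.html.body.h1 == soup.find("h1"))

# Only the first match is returned; find_all() returns every <h1>.
print(len(soup.find_all("h1")))
```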

# Handling Failures

* When getting data from the web we must be ready for errors.

* These errors may be the result of malformed URLs, dropped connections, servers that are down, etc.

* In Python, errors are signalled by exceptions, which we can catch with `try`/`except`.

In [6]:
try:
    r = requests.get("https://www.abeeradvocate.com/beer") # added typo
except requests.exceptions.ConnectionError as ce:
    print(ce)
else:
    print("Page retrieval OK")

HTTPSConnectionPool(host='www.abeeradvocate.com', port=443): Max retries exceeded with url: /beer (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fdee8df6e90>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))


In [7]:
try:
    r = requests.get("www.beeradvocate.com/beer") # missing URL scheme
except requests.exceptions.MissingSchema as ms:
    print(ms)
else:
    print("Page retrieval OK")

Invalid URL 'www.beeradvocate.com/beer': No scheme supplied. Perhaps you meant http://www.beeradvocate.com/beer?


In [8]:
try:
    r = requests.get("https://www.wikipedia.org", timeout=10)
    r.raise_for_status() # raise HTTPError for 4xx/5xx status codes
except requests.exceptions.MissingSchema as ms: # missing URL scheme
    print(ms)
except requests.exceptions.ConnectionError as ce: # connection error
    print(ce)
except requests.exceptions.HTTPError as herror: # error status in the HTTP response
    print(herror)
except requests.exceptions.Timeout as toerr: # the server took too long to respond
    print(toerr)
else:
    print("Page retrieval OK")

Page retrieval OK


In [9]:
try:
    r = requests.get("https://www.wikipedia.org")
except requests.exceptions.RequestException as rex:
    print(rex)
else:
    print("Page retrieval OK")

Page retrieval OK
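* These pieces can be combined into a small helper. The sketch below is one possible arrangement, not the only one: `fetch` is a hypothetical name, and the 10-second timeout is an arbitrary choice. Note that `requests.get()` does not raise an exception for responses such as 404 by itself; `raise_for_status()` does that.

```python
import requests

def fetch(url, timeout=10):
    """Return the page body as bytes, or None if anything goes wrong."""
    try:
        # timeout is in seconds; without it a stalled server can block forever
        response = requests.get(url, timeout=timeout)
        # turn 4xx and 5xx status codes into requests.exceptions.HTTPError
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.content

# A URL without a scheme is rejected before any network traffic happens.
print(fetch("www.example.com/beer"))
```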


# Beautiful Soup and Beers

* We now want to get all the beers listed in the latest [Beer Advocate](https://www.beeradvocate.com/) reviews.

* If we inspect the elements, we'll see that the name of each reviewed beer is inside an `<h6>` element.

* So we'll get all of these elements.

In [11]:
r = requests.get("https://www.beeradvocate.com/beer/")
html = r.content
soup = BeautifulSoup(html, 'html.parser')

review_titles_list = soup.find_all("h6")
for review_title in review_titles_list[:3]:
    print(review_title.get_text())

Rauchbier Hell
Mönchshof Museums-Bier
Stickee Pig


* But now we decide that we would also like to have the Beer Advocate score along with each beer.

* Again, after inspecting the HTML structure, we see that the score is included in siblings of the title that have the form: 
```html
<span class="BAscore_norm">3.6</span><span class="rAvg_norm">/5</span>
```

In [12]:
review_titles_list = soup.find_all("h6")
for review_title in review_titles_list:
    print(review_title.get_text(), end=" ")
    # the score is in a later sibling <span> of the <h6> title
    ba_score_norm = review_title.find_next_sibling('span', {'class': 'BAscore_norm'})
    ba_avg_norm = ba_score_norm.next_sibling # the "/5" span right after the score
    print(ba_score_norm.get_text() + ba_avg_norm.get_text())

Rauchbier Hell 4/5
Mönchshof Museums-Bier 3.95/5
Stickee Pig 3.84/5
Patsy Wildberry Sour 3.4/5
Full Hop Alchemist v.31 4.09/5
Maple & Whiskey Barrel-Aged Stout 4.3/5
9th Anniversary 3/5
Aecht Schlenkerla Helles Lagerbier 4.27/5
Dark Adaptation 4.24/5
Jaybird 4.04/5
Taco Truck Lager 3.78/5
Kōkua 4.08/5
Saison Du Coteau Frontenac Gris & Blanc 4.31/5
Flipside Red IPA 3.88/5
Chocolate Peanut Butter Cookie Stout 3.36/5
All Night IPA 3.93/5
Delicious Double IPA 4.34/5
Hop Hunter 4/5
Barrel Aged Imperial Amber Lager 3.94/5
Barrel-Aged Narwhal 4.31/5
Mashing Pumpkin 3.89/5
3 Beagles Brown 4.27/5
Stone Mason Stout 4.25/5
No Salvation 4.46/5
Clouds & Cream: Strawberry Rhubarb 3.93/5
Rye'd Open Spaces 3.6/5
10 Year - Red Label 4.56/5
What Happened? 4.21/5
Sapphire Squeeze 1/5
The Beer Formerly Known As (TBFKA) La Tache 4.29/5
Grid City Pilsner 3.99/5
Pumking - Caramel 4.49/5
Avalanche India Pale Ale 3.53/5
Ettaler Curator Dunkler Doppelbock (US Import Version) 4.3/5
Nooner Pilsner 3.93/5
Lancaste

# `find()` and `find_all()` with BeautifulSoup

* `find_all(name, attrs, recursive, string, limit, keywords)`: find all occurrences and return a list.

* `find(name, attrs, recursive, string, keywords)`: find first occurrence.



* Pass in a value for `name` and you’ll tell Beautiful Soup to only consider tags with certain names.

* For example, the following will return a list of all the heading tags in a document.

```
soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
```

* The `attrs` argument takes a Python dictionary of attributes and matches tags whose attributes have the given values. If the dictionary has several keys, a tag must match all of them; if a value is a list, a tag may match any item in the list.

* For example, the following call returns both the `BAscore_norm` and the `rAvg_norm` span tags in the HTML document:

```
soup.find_all("span", {"class": ["BAscore_norm", "rAvg_norm"]})
```
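To see exactly how `attrs` filters are combined, here is a sketch on a small inline document. The HTML is made up, reusing the class names from the page; only the matching behaviour matters.

```python
from bs4 import BeautifulSoup

html = """
<span class="BAscore_norm">3.6</span><span class="rAvg_norm">/5</span>
<span class="BAscore_norm" id="top">4.2</span><span class="muted">text</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of values for one attribute matches tags having ANY of the values.
either = soup.find_all("span", {"class": ["BAscore_norm", "rAvg_norm"]})
print([tag.get_text() for tag in either])

# Several keys in the dictionary must ALL match on the same tag.
both = soup.find_all("span", {"class": "BAscore_norm", "id": "top"})
print([tag.get_text() for tag in both])

# A list of names works the same way for the `name` argument.
print(len(soup.find_all(["span", "h1"])))
```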

In [13]:
results = soup.find_all("span", {"class": ["BAscore_norm", "rAvg_norm"]})
for i, result in enumerate(results):
    print(result.get_text(), end="") # the score and the "/5" part are consecutive spans
    if i % 2: # start a new line after every second span
        print()

4/5
3.95/5
3.84/5
3.4/5
4.09/5
4.3/5
3/5
4.27/5
4.24/5
4.04/5
3.78/5
4.08/5
4.31/5
3.88/5
3.36/5
3.93/5
4.34/5
4/5
3.94/5
4.31/5
3.89/5
4.27/5
4.25/5
4.46/5
3.93/5
3.6/5
4.56/5
4.21/5
1/5
4.29/5
3.99/5
4.49/5
3.53/5
4.3/5
3.93/5
3.43/5
3.43/5
4/5
3.52/5
3.76/5
4.26/5
3.56/5
4.9/5
4.57/5
4.29/5
3.68/5
3.12/5
4.73/5
3.56/5
4.35/5


* The `recursive` argument is a boolean.

* It indicates how deep into the document tree the search goes.

* If `recursive` is set to `True`, `find_all()` looks into children, and children’s children, and so on, for tags that match your parameters.

* If it is `False`, it looks only at the direct children of the tag on which it is called (the top-level tags, if it is called on the whole document).

* By default, `find_all()` works recursively (`recursive` is set to `True`).

* With `string` you can search for strings instead of tags.

* For instance, if we want to find the strings that contain “ale”, we can call `find_all()` as follows.

* As you can see, BeautifulSoup supports regular expressions.

In [14]:
import re

name_list = soup.find_all(string=re.compile("ale"))
print(len(name_list))

7


* But perhaps we are interested in "Ale" as well, so let's make our search case-insensitive:

In [15]:
name_list = soup.find_all(string=re.compile("ale", re.IGNORECASE))
print(len(name_list))

21
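* A search with `string` returns the matching text nodes rather than the tags around them. The sketch below uses a made-up list of names to show this, and how a word boundary (`\b`) avoids accidental hits inside longer words.

```python
import re
from bs4 import BeautifulSoup

html = "<ul><li>Pale Ale</li><li>Stout</li><li>Red ale</li><li>Kale salad</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# string= matches text nodes, so the results are NavigableString objects.
matches = soup.find_all(string=re.compile("ale", re.IGNORECASE))
print([str(m) for m in matches])

# .parent gives the tag that contains each matching string.
print([m.parent.name for m in matches])

# Word boundaries avoid accidental hits such as "Kale".
whole_word = soup.find_all(string=re.compile(r"\bale\b", re.IGNORECASE))
print([str(m) for m in whole_word])
```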


* Or suppose that we want to get all the beer images.

* Regular expressions come to the rescue.

In [16]:
r = requests.get("https://www.beeradvocate.com/beer/")
html = r.content
soup = BeautifulSoup(html, 'html.parser')

images = soup.find_all("img", {"src": re.compile(r"beers/.*\.jpg")}) # raw string, so "\." reaches the regex engine intact

for image in images:
    print(image["src"])

https://cdn.beeradvocate.com/im/beers/498033.jpg
https://cdn.beeradvocate.com/im/beers/677871.jpg
https://cdn.beeradvocate.com/im/beers/29145.jpg
https://cdn.beeradvocate.com/im/beers/240370.jpg
https://cdn.beeradvocate.com/im/beers/98495.jpg
https://cdn.beeradvocate.com/im/beers/626469.jpg
https://cdn.beeradvocate.com/im/beers/148052.jpg
https://cdn.beeradvocate.com/im/beers/668401.jpg
https://cdn.beeradvocate.com/im/beers/88880.jpg
https://cdn.beeradvocate.com/im/beers/21085.jpg
https://cdn.beeradvocate.com/im/beers/662477.jpg
https://cdn.beeradvocate.com/im/beers/605156.jpg
https://cdn.beeradvocate.com/im/beers/83771.jpg
https://cdn.beeradvocate.com/im/beers/533802.jpg
https://cdn.beeradvocate.com/im/beers/666388.jpg
https://cdn.beeradvocate.com/im/beers/19908.jpg
https://cdn.beeradvocate.com/im/beers/27792.jpg
https://cdn.beeradvocate.com/im/beers/144200.jpg
https://cdn.beeradvocate.com/im/beers/7866.jpg
https://cdn.beeradvocate.com/im/beers/99000.jpg
https://cdn.beeradvocate.com/i

* The `limit` argument is only used in the `find_all()` method.

* You might set it if you’re only interested in retrieving the first *x* items from the page. Be aware, however, that this gives you the first items on the page in the order that they *occur*, not necessarily the first ones that you want.

* `find()` is equivalent to the same `find_all()` call with a `limit` of 1, except that it returns a single element, not a list.

* When `find_all()` does not find anything, it returns an empty list; when `find()` does not find anything, it returns `None`.

* The `keywords` argument allows you to select tags that have a particular attribute, or set of attributes.
* Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called `id`, Beautiful Soup will filter against each tag’s `id` attribute.
* For example `class` (note that we use `class_` because `class` is a reserved word in Python):

In [17]:
all_comments = soup.find_all(class_="user-comment")
print(all_comments[0])

<div ba-user="638668" class="user-comment" id="rating_fullview_container"><div id="rating_fullview_user"><div style="padding:0px;"><a class="username" href="/community/members/stevoj.638668/"><img alt="Photo of stevoj" border="0" height="48" src="https://cdn.beeradvocate.com/data/avatars/m/638/638668.jpg?1450201660" width="48"/></a></div></div><div id="rating_fullview_content_2"><a href="/beer/profile/16378/498033/"><img alt="Rauchbier Hell" border="0" height="33%" src="https://cdn.beeradvocate.com/im/beers/498033.jpg" style="float:right; margin:0px 0px 10px 10px; max-height:100px;"/></a><h6><a href="/beer/profile/16378/498033/">Rauchbier Hell</a></h6><br/><a href="/beer/styles/7/">Rauchbier</a> | 5% ABV<br/><a href="/beer/profile/16378/">Heater Allen Brewing</a><span class="muted"> in McMinnville, Oregon</span><br/><br/><span class="muted">Reviewed by <b><a class="username" href="/community/members/stevoj.638668/">stevoj</a></b> from Idaho</span><br/><br/><span class="BAscore_norm">4<
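* The `recursive`, `limit`, and keyword arguments can be tried out on a small inline document. The HTML below is made up for the purpose; it only borrows the `user-comment` class name from the page.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
  <p class="user-comment">first</p>
  <div><p class="user-comment">nested</p></div>
  <p id="last">third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="outer")  # keyword argument: filter on the id attribute

# recursive=False only looks at the direct children of `outer`.
print(len(outer.find_all("p")))                   # all descendants
print(len(outer.find_all("p", recursive=False)))  # direct children only

# limit stops the search after the first N matches, in document order.
print([p.get_text() for p in outer.find_all("p", limit=2)])

# class_ filters on the class attribute; find() returns None when nothing matches.
print(len(soup.find_all(class_="user-comment")))
print(soup.find("table"))
```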

# Navigating Trees

* HTML documents are in effect trees.

* It is also possible to navigate the tree structure of an HTML document, going from parents to children and vice versa, or moving across siblings.

* We can also add *pagination* to our crawling.

* The recent reviews are split into pages containing N reviews each.

* We can get all the reviews by requesting successive pages with the `start` parameter in the query string.

* Let's move away from beers to a more intellectual endeavour.

* A common way to lay out material in a page is through an HTML table.

* In an HTML table, data comes in rows, demarcated by `<tr>` tags.

* Inside rows, cells are demarcated by `<td>` tags.

* We will explore HTML tables using [goodreads](https://www.goodreads.com).

* In particular, we'll get the [best books of the 20th century list](https://www.goodreads.com/list/show/6.Best_Books_of_the_20th_Century).

In [18]:
html = requests.get("https://www.goodreads.com/list/show/6.Best_Books_of_the_20th_Century").content

soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html class="desktop withSiteHeaderTopFullImage">
 <head>
  <title>
   Best Books of the 20th Century (7718 books)
  </title>
  <meta content="7,684 books based on 50794 votes: To Kill a Mockingbird by Harper Lee, 1984 by George Orwell, The Great Gatsby by F. Scott Fitzgerald, Harry Potter and t..." name="description"/>
  <meta content="telephone=no" name="format-detection"/>
  <link href="https://www.goodreads.com/list/show/6.Best_Books_of_the_20th_Century" rel="canonical"/>
  <script type="text/javascript">
   var ue_t0=window.ue_t0||+new Date();
  </script>
  <script type="text/javascript">
   var ue_mid = "A1PQBFHBHS6YH1";
    var ue_sn = "www.goodreads.com";
    var ue_furl = "fls-na.amazon.com";
    var ue_sid = "837-5386471-4359128";
    var ue_id = "E2V1HRTCKZTKBMH40SRJ";

    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push(

* Let's say that we want to navigate the table containing the **books**, identified by the `itemtype="http://schema.org/Book"` attribute. 

* In particular, we want to get the *rows* of the table, which are contained in `<tr>` tags.

* Note that this time we pass the attribute filter as a dictionary. Both `find()` and `find_all()` accept attribute filters either as keyword arguments or as a dictionary.

* We can put as many attributes in the dictionary as we want. The dictionary form is the one to use when an attribute name cannot be used as a keyword argument (e.g., `name`, which clashes with the first parameter of `find()`, or `data-id`, which is not a valid Python identifier).

In [19]:
for child in soup.find("tr", {"itemtype":"http://schema.org/Book"}).children:
    print(child)



<td class="number" valign="top">1</td>


<td valign="top" width="5%">
<div class="u-anchorTarget" id="2657"></div>
<div class="js-tooltipTrigger tooltipTrigger" data-resource-id="2657" data-resource-type="Book">
<a href="/book/show/2657.To_Kill_a_Mockingbird" title="To Kill a Mockingbird">
<img alt="To Kill a Mockingbird" class="bookCover" itemprop="image" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1553383690i/2657._SY75_.jpg"/>
</a> </div>
</td>


<td valign="top" width="100%">
<a class="bookTitle" href="/book/show/2657.To_Kill_a_Mockingbird" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">To Kill a Mockingbird</span>
</a> <br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/1825.Harper_Lee" itemprop="url"><span itemprop="name">Harper Lee</span></a>
</div>
</span>
<br/>
<div>
<s

* This is too much information, so we'll only get the rank of the book, the name, and the author.

* The rank is in the first cell of each row; the title and the author are in the third cell (the second one holds the cover image).

* The name of the book is given in the link with `class="bookTitle"` and the author of the book is given in the link with `class="authorName"`.

In [20]:
for i, row in enumerate(soup.find_all("tr", {"itemtype": "http://schema.org/Book"})):
    rank = row.find('td')
    title = row.find("a", {"class": "bookTitle"})
    author = row.find("a", {"class": "authorName"})
    print(rank.get_text().strip(), title.get_text().strip(), "by", author.get_text())
    # stop after the first 22 rows (i runs from 0 to 21)
    if i > 20:
        break

1 To Kill a Mockingbird by Harper Lee
2 1984 by George Orwell
3 The Great Gatsby by F. Scott Fitzgerald
4 Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling
5 Animal Farm by George Orwell
6 The Hobbit (The Lord of the Rings, #0) by J.R.R. Tolkien
7 The Little Prince by Antoine de Saint-Exupéry
8 Fahrenheit 451 by Ray Bradbury
9 The Catcher in the Rye by J.D. Salinger
10 The Lion, the Witch and the Wardrobe (Chronicles of Narnia, #1) by C.S. Lewis
11 The Grapes of Wrath by John Steinbeck
12 One Hundred Years of Solitude by Gabriel García Márquez
13 Brave New World by Aldous Huxley
14 Gone with the Wind by Margaret Mitchell
15 Of Mice and Men by John Steinbeck
16 Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) by J.K. Rowling
17 The Giver (The Giver, #1) by Lois Lowry
18 Lord of the Flies by William Golding
19 Slaughterhouse-Five by Kurt Vonnegut Jr.
20 Lolita by Vladimir Nabokov
21 East of Eden by John Steinbeck
22 The Handmaid’s Tale (The Handmaid's Ta
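* Moving between parents, children, and siblings can be tried on a small inline table. The markup below is a simplified, hand-written imitation of the list page, not the real Goodreads HTML.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<table>
<tr><td class="number">1</td><td><a class="bookTitle">To Kill a Mockingbird</a></td></tr>
<tr><td class="number">2</td><td><a class="bookTitle">1984</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
first_row = soup.find("tr")

# Moving down: .children may also yield whitespace strings, so keep only tags.
cells = [child for child in first_row.children if isinstance(child, Tag)]
print([cell.get_text() for cell in cells])

# Moving sideways: find_next_sibling() skips the text nodes between tags.
second_row = first_row.find_next_sibling("tr")
print(second_row.find("a", class_="bookTitle").get_text())

# Moving up: .parent is the enclosing tag.
print(first_row.parent.name)
```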

# Ethical scraping

* Ethical scraping should follow any rules set by the site administrator, for example by keeping the request frequency low enough that it does not harm the website's operation.
* These rules are usually published in a `robots.txt` file, which is used to manage and control the behavior of web crawlers and bots. This file lives at the root of the site and follows the [Robots Exclusion Standard](https://en.wikipedia.org/wiki/Robots.txt#About_the_standard). For the site www.example.com, the `robots.txt` file lives at www.example.com/robots.txt.
* Let's look at a `robots.txt` file from [Google Search Central](https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt#create_rules) that sets two rules:

```
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```

* The user agent named Googlebot is not allowed to crawl any URL that starts with https://example.com/nogooglebot/.
* All other user agents are allowed to crawl the entire site. This could have been omitted and the result would be the same; the default behavior is that user agents are allowed to crawl the entire site.
* The site's sitemap file is located at https://www.example.com/sitemap.xml.
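* These rules can also be checked programmatically. The sketch below feeds the example file to the standard library's `urllib.robotparser`, without any network access; `MyCrawler` is a made-up user agent name.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse the text directly, no download

# Googlebot is blocked under /nogooglebot/ but allowed elsewhere.
print(parser.can_fetch("Googlebot", "https://www.example.com/nogooglebot/page.html"))
print(parser.can_fetch("Googlebot", "https://www.example.com/index.html"))

# Any other user agent falls under the "*" rule.
print(parser.can_fetch("MyCrawler", "https://www.example.com/nogooglebot/page.html"))

# Sitemap entries are collected separately (Python 3.8 or later).
print(parser.site_maps())
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the file instead.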

<h1 style="color:salmon">Class work</h1>

* Find and read the `robots.txt` file of youtube.com
* Scrape and show the quotes from http://quotes.toscrape.com

# Dealing with JavaScript

* Web sites may render their contents dynamically, using JavaScript.

* When this happens, the browser initially issues an HTTP GET request, which returns a page with missing HTML content.

* Then the browser issues a series of further HTTP GET requests using JavaScript (AJAX), and the web page is filled in.

* We can see that this is happening by examining the network activity (through the browser developer tools) and looking for a series of HTTP GET calls.

* In that case, issuing a simple HTTP GET request will not get us the content.

* We'll need to use a real browser that will execute the contained JavaScript.

* We can control a browser programmatically with [Selenium](https://www.seleniumhq.org/).

* In particular, we will use a [WebDriver](https://www.seleniumhq.org/projects/webdriver/) and control it through Python.

* We need [Selenium Python](https://selenium-python.readthedocs.io/). Older versions of Selenium also need a separately installed browser driver, such as `chromedriver` for Chrome or `geckodriver` for [Firefox](https://github.com/mozilla/geckodriver/releases).

In [82]:
import time
# !pip install selenium
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [94]:
video_url = "https://www.youtube.com/watch?v=cUuPL22XKfE"
with Chrome() as driver: # older Selenium versions may need the path to chromedriver
    driver.get(video_url)
    wait = WebDriverWait(driver,15)
    
    # Scroll down to load comments (adjust the number of scrolls as needed)
    for _ in range(5):  # Scroll down 5 times; adjust as needed
        driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
        time.sleep(2)  # Wait for the page to load
    
    # Find and extract the comments
    # comments = driver.find_elements(By.CSS_SELECTOR, "#comment #content-text")
    comments = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#comment #content-text")))
    for comment in comments:
        print(comment.text)

Δηλαδή τι θέλει να πει ο ποιητής ; για να ακουσω γνωμες ; να σβησουμε  τα φωτα μας για να μην στραβωθουμε;
Η μοναδική αντεθνική ενέργεια του Κεμάλ Ατατούρκ ήταν η αλλαγή του αλφαβήτου.
Αν δηλαδή βάλουμε το λατινικό αλφάβητο και γράψουμε "ta paidia gelane" τότε θα καταλάβουν οι ξένοι στην Ευρώπη; Αυτά γιατί πολύ μάγκα τον κάνατε τον Λιαντίνη. Όσο για τον σολωμό, τα ίδια ήταν και αυτός. έχε χάρη που έγραψε τους πρώτους στίχους ωραίους, διότι όλο το ποιήμα του ήταν και αρκετά ανθελληνικό. Τέλοπαντων, εμείς δεν πρόκειται να χαλάσουμε τη γραφή του Ομήρου επειδή κάποιοι δεν τους αρέσει. Εμείς δε χαλάμε τον ελληνικό πολιτισμό. Θα αφήσουμε την ελληνική γραφή και θα ξεχωρίσουμε από τους όμοιους ευρωπαίους, που από αυστρία μέχρι την πορτογαλία είναι ίδιοι. βαρβαρικά δε θα γράψουμε.
To grafonakatalavoun I evropei
"Ελεύθερος"..!!!
Προσωπικά δεν συμφωνώ


* Alternatively, we can open and close the browser explicitly:

In [96]:
driver = Chrome() # Chrome was imported directly from selenium.webdriver

# Open the YouTube video
driver.get(video_url)

# Scroll down to load comments (adjust the number of scrolls as needed)
for _ in range(5):  # Scroll down 5 times; adjust as needed
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
    time.sleep(2)  # Wait for the page to load

# Wait for comments to load
wait = WebDriverWait(driver, 10)
comments = wait.until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "#content-text")
    )
)

# Extract and print the comments
for comment in comments:
    print(comment.text)

# Close the WebDriver
driver.quit()

Δηλαδή τι θέλει να πει ο ποιητής ; για να ακουσω γνωμες ; να σβησουμε  τα φωτα μας για να μην στραβωθουμε;
Η μοναδική αντεθνική ενέργεια του Κεμάλ Ατατούρκ ήταν η αλλαγή του αλφαβήτου.
Αν δηλαδή βάλουμε το λατινικό αλφάβητο και γράψουμε "ta paidia gelane" τότε θα καταλάβουν οι ξένοι στην Ευρώπη; Αυτά γιατί πολύ μάγκα τον κάνατε τον Λιαντίνη. Όσο για τον σολωμό, τα ίδια ήταν και αυτός. έχε χάρη που έγραψε τους πρώτους στίχους ωραίους, διότι όλο το ποιήμα του ήταν και αρκετά ανθελληνικό. Τέλοπαντων, εμείς δεν πρόκειται να χαλάσουμε τη γραφή του Ομήρου επειδή κάποιοι δεν τους αρέσει. Εμείς δε χαλάμε τον ελληνικό πολιτισμό. Θα αφήσουμε την ελληνική γραφή και θα ξεχωρίσουμε από τους όμοιους ευρωπαίους, που από αυστρία μέχρι την πορτογαλία είναι ίδιοι. βαρβαρικά δε θα γράψουμε.
To grafonakatalavoun I evropei
"Ελεύθερος"..!!!
Προσωπικά δεν συμφωνώ


* `WebDriverWait` polls the page until the `EC.presence_of_all_elements_located` expected condition holds, that is, until at least one matching element is present; if the timeout expires first, it raises a `TimeoutException`.
* This ensures that the script waits for comments to appear on the page before attempting to extract and print them.

# Traversing a Single Domain

* Traversing a single domain means going through the pages that reside in one internet domain (web site).

* For example, imagine that we want to crawl through Wikipedia, starting from an article, and travelling to the other Wikipedia pages that are linked from this article.

* As a start, say that the initial article is the Wikipedia page for the recent pandemic.

In [105]:
html = requests.get("http://en.wikipedia.org/wiki/Coronavirus_disease_2019").content
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all("a")

In [106]:
for link in links:
    if 'href' in link.attrs: 
        print(link.attrs['href']) # get the href attribute of the a tag

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=COVID-19
/w/index.php?title=Special:UserLogin&returnto=COVID-19
/w/index.php?title=Special:CreateAccount&returnto=COVID-19
/w/index.php?title=Special:UserLogin&returnto=COVID-19
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Nomenclature
#Symptoms_and_signs
#Complications
#Cause
#Transmission
#Virology
#SARS-CoV-2_variants
#Pathophysiology
#Respiratory_tract
#Nervous_system
#Gastrointestinal_tract
#Cardiovascular_system
#Ki

* If you look at the list of links produced, you’ll notice that all the articles you’d expect are there:
   * `/wiki/Coronavirus_disease`
   * `/wiki/COVID-19_pandemic`
   * ... 
   
* However, there are also links that are not very informative:
    * `#cite_note-9`
    * `//foundation.wikimedia.org/wiki/Privacy_policy`


* If you examine the links that point to article pages (as opposed to other internal pages), they all have three things in common:

   1. They reside within the `div` with the `id` set to `bodyContent`.
   
   2. The URLs do not contain colons.
   
   3. The URLs begin with `/wiki/`.<br />
   
* We will implement rule 1 with a `find()` call and rules 2 and 3 with a <span style="color:red; font-family:courier;">Regular Expression</span>, so that we retrieve only the desired article links.
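* The pattern for rules 2 and 3 can be checked in isolation before it is applied to the page. The sketch below runs it against a few hrefs taken from the output above.

```python
import re

# ^(/wiki/) : the href must start with /wiki/   (rule 3)
# [^:]*$    : and contain no colon up to its end (rule 2)
article_link = re.compile(r"^(/wiki/)[^:]*$")

hrefs = [
    "/wiki/COVID-19_pandemic",
    "/wiki/Help:Contents",
    "#cite_note-9",
    "//foundation.wikimedia.org/wiki/Privacy_policy",
]

results = [bool(article_link.match(href)) for href in hrefs]
for href, is_article in zip(hrefs, results):
    print(href, is_article)
```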


In [107]:
import re
for link in soup.find("div", 
                      {"id":"bodyContent"}).find_all("a", # rule 1
                         href=re.compile("^(/wiki/)[^:]*$")): # rule 2 and 3
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/COVID-19_pandemic
/wiki/Coronavirus_diseases
/wiki/SARS-CoV-2
/wiki/Medical_specialty
/wiki/Infectious_disease_(medical_specialty)
/wiki/Signs_and_symptoms
/wiki/Symptoms_of_COVID-19
/wiki/Complication_(medicine)
/wiki/Pneumonia
/wiki/Sepsis
/wiki/Acute_respiratory_distress_syndrome
/wiki/Kidney_failure
/wiki/Respiratory_failure
/wiki/Pulmonary_fibrosis
/wiki/Cytokine_release_syndrome
/wiki/Pediatric_multisystem_inflammatory_syndrome
/wiki/Long_COVID
/wiki/Long_COVID
/wiki/SARS-CoV-2
/wiki/Medical_diagnosis
/wiki/Reverse_transcription_polymerase_chain_reaction
/wiki/CT_scan
/wiki/Rapid_antigen_test
/wiki/COVID-19_vaccine
/wiki/Quarantine
/wiki/Social_distancing
/wiki/Management_of_COVID-19
/wiki/Contagious_disease
/wiki/SARS-CoV-2
/wiki/COVID-19_pandemic_in_Hubei
/wiki/COVID-19_pandemic
/wiki/Symptoms_of_COVID%E2%80%9119
/wiki/Breathing_difficulties
/wiki/Anosmia
/wiki/Ageusia
/wiki/Incubation_period
/wiki/Asymptomatic
/wiki/Pneumonia
/wiki/Dyspnea
/wiki/Hypoxia_(medical)
/wiki/R

* Having a script that finds all article links in a single, hardcoded Wikipedia article is interesting, but we can do better. Let's restructure the code into the following two parts.

* A single function, `get_links()`, that takes in a Wikipedia article URL of the form `/wiki/<Article_Name>` and returns a list of all linked article URLs in the same form.

* Some code that calls `get_links()` with some starting article, chooses a random article link from the returned list, and calls `get_links()` again, until we stop the program, or until we reach a limit, or until there are no article links found on the new page.

In [109]:
import datetime
import random; random.seed(42) 

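def get_links(article_url):
    """Return the article links found in the body of a Wikipedia page."""
    try: 
        html = requests.get("https://en.wikipedia.org" + article_url).content
    except requests.exceptions.RequestException as rex: 
        print(f"Unable to get {article_url}: reason {rex}")
        return []  # an empty list ends the walk; None would break len(links)
    bsObj = BeautifulSoup(html, 'html.parser')
    body = bsObj.find("div", {"id":"bodyContent"})
    if body is None:  # the page has no bodyContent div
        return []
    return body.find_all("a", href=re.compile("^(/wiki/)[^:]*$"))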

num = 20
links = get_links("/wiki/Coronavirus_disease_2019")
while len(links) > 0 and num > 0:
    new_article = links[random.randint(0, len(links)-1)].attrs["href"] 
    print(new_article)
    num -= 1
    links = get_links(new_article)

/wiki/COVID-19_pandemic_in_Arkansas
/wiki/Responses_to_the_COVID-19_pandemic_in_February_2020
/wiki/World_Health_Organization%27s_response_to_the_COVID-19_pandemic
/wiki/Soberana_Plus
/wiki/COVID-19_vaccination_in_Thailand
/wiki/Pandemic_predictions_and_preparations_prior_to_the_COVID-19_pandemic
/wiki/Economic_impact_of_the_COVID-19_pandemic_in_Malaysia
/wiki/COVID-19_vaccine_card
/wiki/COVID-19_vaccination_in_the_Republic_of_Ireland
/wiki/County_Cork
/wiki/County_Londonderry
/wiki/Purple_saxifrage
/wiki/Interim_Register_of_Marine_and_Nonmarine_Genera
/wiki/Tony_Rees_(scientist)
/wiki/Thesis
/wiki/American_English
/wiki/American_English_vocabulary
/wiki/Concourse
/wiki/Shopping_malls
/wiki/Greenfield_land


# Crawling an Entire Site

* We took a random walk through a website, going from link to link. 

* But what if you need to systematically catalog or search every page on a site?

* This is useful for a number of things, for example:

   * Generating site maps.
   
   * Gathering data (say, a set of related articles).

* The general approach to an exhaustive site crawl is to start with a top-level page (such as the home page), and search for a list of all internal links on that page. 

* Every one of those links is then crawled, and additional lists of links are found on each one of them, triggering another round of crawling.

* Clearly this is a situation that can blow up very quickly. 

* If every page has 10 internal links, and a website is five pages deep (a fairly typical depth for a medium-size website), then the number of pages you need to crawl is $10^5$, or 100,000 pages, before you can be sure that you’ve exhaustively covered the website. 

* Another way to think of this is as an epidemic with $R_0=10$ (e.g., as in chickenpox). If each infected person infects ten others and each round of infection takes one day, then in five days the worst-case scenario is 100,000 people infected, starting from a single infection on day 0. 

* Strangely enough, while “5 pages deep and 10 internal links per page” are fairly typical dimensions for a website, there are very few websites with 100,000 or more pages. The reason, of course, is that the vast majority of internal links are duplicates.

* In the following example, we go for an adjustable depth.

In [42]:
pages = set()

def get_links(current_page, pages, total_pages, max_depth, depth):
    num_pages_visited = len(pages) + 1 # count the visited pages
    if num_pages_visited > total_pages or depth > max_depth:
        return # stop criterion
    pages.add(current_page) # mark the page as visited, then print and request it
    print(f'{current_page}, depth: {depth}, visited: {num_pages_visited}') 
    try:
        r = requests.get("https://en.wikipedia.org" + current_page)
    except requests.exceptions.RequestException as rex:
        print(f"Unable to get {current_page}: reason {rex}")
    else:
        # parse the new webpage and find all links therein
        html = r.content
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all("a", 
                                  href=re.compile("^(/wiki/)[^:]*$")):
            if 'href' in link.attrs:
                if link.attrs['href'] not in pages:
                    # for any *new* link, call thy self
                    new_page = link.attrs['href'] 
                    get_links(new_page, pages, total_pages, max_depth, depth+1)

get_links("/wiki/Tsipouro", pages, 50, 10, 0)
print(len(pages))

/wiki/Tsipouro, depth: 0, visited: 1
/wiki/Greek_language, depth: 1, visited: 2
/wiki/Proto-Greek_language, depth: 2, visited: 3
/wiki/Pre-Greek_substrate, depth: 3, visited: 4
/wiki/Pre-Indo-European_languages, depth: 4, visited: 5
/wiki/Proto-Indo-European_language, depth: 5, visited: 6
/wiki/PIE_(disambiguation), depth: 6, visited: 7
/wiki/Pie, depth: 7, visited: 8
/wiki/Pi, depth: 8, visited: 9
/wiki/Pi_(letter), depth: 9, visited: 10
/wiki/Greek_alphabet, depth: 10, visited: 11
/wiki/Alpha, depth: 10, visited: 12
/wiki/Nu_(letter), depth: 10, visited: 13
/wiki/Beta, depth: 10, visited: 14
/wiki/Xi_(letter), depth: 10, visited: 15
/wiki/Gamma, depth: 10, visited: 16
/wiki/Omicron, depth: 10, visited: 17
/wiki/Delta_(letter), depth: 10, visited: 18
/wiki/Epsilon, depth: 10, visited: 19
/wiki/Rho, depth: 10, visited: 20
/wiki/Zeta, depth: 10, visited: 21
/wiki/Sigma, depth: 10, visited: 22
/wiki/Eta, depth: 10, visited: 23
/wiki/Tau, depth: 10, visited: 24
/wiki/Theta, depth: 10, vis

* Initially, `get_links()` is called with a user-provided Wikipedia page.

* Then, each link on the page is checked against `pages`, the set of pages that the script has already encountered. 

* If the link is new, it is added to the set, printed to the screen, and the `get_links()` function is called recursively on it.

* Note how quickly the results drift away from the starting topic (e.g., `/wiki/Pi_(letter), depth: 9, visited: 10`). 

* The previous function, `get_links()`, traverses the Wikipedia graph in a [depth-first search manner](https://en.wikipedia.org/wiki/Depth-first_search).

* We can also traverse the Wikipedia graph in a [breadth-first manner](https://en.wikipedia.org/wiki/Breadth-first_search).

In [110]:
from collections import deque

def get_links_bfs(starting_page, total_pages, max_depth):

    pages = set()
    SENTINEL = "__SENTINEL__" # use to count depth
    depth = 0
    num_pages_visited = 0
    
    q = deque()
    q.appendleft(starting_page)
    q.appendleft(SENTINEL)
    while len(q) > 0:
        if num_pages_visited >= total_pages or depth > max_depth:
            print(num_pages_visited)
            return
        current_page = q.pop()
        if current_page == SENTINEL:
            depth += 1
            continue
        try:
            r = requests.get("https://en.wikipedia.org" + current_page)
        except requests.exceptions.RequestException as rex:
            print(f"Unable to get {current_page}: reason {rex}")
        else:
            num_pages_visited += 1
            print(f'{current_page}, depth: {depth}, visited: {num_pages_visited}')
            html = r.content
            soup = BeautifulSoup(html, 'html.parser')
            for link in soup.find_all("a", 
                                      href=re.compile("^(/wiki/)[^:]*$")):
                if 'href' in link.attrs:
                    if link.attrs['href'] not in pages:
                        # We have encountered a new page
                        new_page = link.attrs['href'] 
                        q.appendleft(new_page)
                        pages.add(new_page)
            q.appendleft(SENTINEL)

get_links_bfs("/wiki/Tsipouro", 50, 2)

/wiki/Tsipouro, depth: 0, visited: 1
/wiki/Main_Page, depth: 1, visited: 2
/wiki/Tsipouro, depth: 1, visited: 3
/wiki/Greek_language, depth: 1, visited: 4
/wiki/Romanization_of_Greek, depth: 1, visited: 5
/wiki/Greece, depth: 1, visited: 6
/wiki/Thessaly, depth: 1, visited: 7
/wiki/Epirus_(region), depth: 1, visited: 8
/wiki/Macedonia_(Greece), depth: 1, visited: 9
/wiki/Crete, depth: 1, visited: 10
/wiki/Tsikoudia, depth: 1, visited: 11
/wiki/Distilled_beverage, depth: 1, visited: 12
/wiki/Alcohol_by_volume, depth: 1, visited: 13
/wiki/Pomace, depth: 1, visited: 14
/wiki/Wine_press, depth: 1, visited: 15
/wiki/Anise, depth: 1, visited: 16
/wiki/Greek_Orthodox_Church, depth: 1, visited: 17
/wiki/Monk, depth: 1, visited: 18
/wiki/Mount_Athos, depth: 1, visited: 19
/wiki/Grape, depth: 1, visited: 20
/wiki/Crusher/destemmer, depth: 1, visited: 21
/wiki/Ethanol_fermentation, depth: 1, visited: 22
/wiki/Whiskey, depth: 1, visited: 23
/wiki/Alcoholic_beverage, depth: 1, visited: 24
/wiki/Mez
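
* Two details of `get_links_bfs()` are worth a second look. The start page is never added to `pages`, which is why `/wiki/Tsipouro` shows up again at depth 1 in the output. And because a sentinel is pushed after every fetched page, the depth counter may advance more than once per level once the crawl goes past depth 1. A minimal sketch of an alternative, which stores the depth next to each queued page, is given here on a made-up in-memory graph; `TOY_LINKS` and `bfs_with_depth()` are illustrative names, not part of the notebook.

```python
from collections import deque

# Hypothetical link graph; in the crawler these lists come from parsed pages.
TOY_LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C", "/wiki/A"],
    "/wiki/B": ["/wiki/D"],
    "/wiki/C": ["/wiki/D", "/wiki/E"],
    "/wiki/D": ["/wiki/F"],
    "/wiki/E": [],
    "/wiki/F": [],
}

def bfs_with_depth(start, total_pages, max_depth):
    """Return (page, depth) pairs in breadth-first order."""
    seen = {start}               # the start page counts as seen
    queue = deque([(start, 0)])  # each entry carries its own depth
    visited = []
    while queue and len(visited) < total_pages:
        page, depth = queue.popleft()
        visited.append((page, depth))
        if depth == max_depth:
            continue             # do not expand pages at the depth limit
        for link in TOY_LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

result = bfs_with_depth("/wiki/A", total_pages=50, max_depth=2)
for page, depth in result:
    print(f"{page}, depth: {depth}")
```

* With `max_depth=2`, `/wiki/F` (three links away from the start) is never visited, and `/wiki/A` is visited only once despite linking to itself.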

* ChatGPT can help with drafting such routines
* Or it can add comments / [docstrings](https://en.wikipedia.org/wiki/Docstring) to existing ones

In [112]:
from collections import deque
import requests
from bs4 import BeautifulSoup
import re

def get_links_bfs(starting_page, total_pages, max_depth):
    """
    Perform breadth-first search (BFS) to collect links from web pages.

    Args:
        starting_page (str): The URL of the starting page.
        total_pages (int): The maximum number of pages to visit.
        max_depth (int): The maximum depth of traversal.

    Returns:
        None: This function does not return a value but prints visited page information.
    """

    # Set to store visited pages
    pages = set()
    
    # Sentinel value to mark the end of a depth level
    SENTINEL = "__SENTINEL__"  # Used to count depth
    
    # Initialize depth and page count
    depth = 0
    num_pages_visited = 0

    # Initialize a deque for BFS traversal
    q = deque()
    
    # Start with the initial page
    q.appendleft(starting_page)
    
    # Add the sentinel value to indicate the start of a new depth level
    q.appendleft(SENTINEL)
    
    while len(q) > 0:
        # Check if we have visited enough pages or reached the maximum depth
        if num_pages_visited >= total_pages or depth > max_depth:
            print(num_pages_visited)
            return

        # Pop the current page from the front of the queue
        current_page = q.pop()

        if current_page == SENTINEL:
            # Increase the depth when SENTINEL is encountered
            depth += 1
            continue

        try:
            # Send an HTTP GET request to the current page
            r = requests.get("https://en.wikipedia.org" + current_page)
        except requests.exceptions.RequestException as rex:
            print(f"Unable to get {current_page}: reason {rex}")
        else:
            num_pages_visited += 1
            print(f'Page: {current_page}, Depth: {depth}, Visited: {num_pages_visited}')
            html = r.content
            soup = BeautifulSoup(html, 'html.parser')
            
            # Find all links on the page that match a specific pattern (Wikipedia articles)
            for link in soup.find_all("a", href=re.compile("^(/wiki/)[^:]*$")):
                if 'href' in link.attrs:
                    if link.attrs['href'] not in pages:
                        # We have encountered a new page, so add it to the queue
                        new_page = link.attrs['href']
                        q.appendleft(new_page)
                        pages.add(new_page)
            
            # Add SENTINEL to mark the end of the current depth level
            q.appendleft(SENTINEL)

# Example usage:
get_links_bfs("/wiki/Tsipouro", 50, 2)

Page: /wiki/Tsipouro, Depth: 0, Visited: 1
Page: /wiki/Main_Page, Depth: 1, Visited: 2
Page: /wiki/Tsipouro, Depth: 1, Visited: 3
Page: /wiki/Greek_language, Depth: 1, Visited: 4
Page: /wiki/Romanization_of_Greek, Depth: 1, Visited: 5
Page: /wiki/Greece, Depth: 1, Visited: 6
Page: /wiki/Thessaly, Depth: 1, Visited: 7
Page: /wiki/Epirus_(region), Depth: 1, Visited: 8
Page: /wiki/Macedonia_(Greece), Depth: 1, Visited: 9
Page: /wiki/Crete, Depth: 1, Visited: 10
Page: /wiki/Tsikoudia, Depth: 1, Visited: 11
Page: /wiki/Distilled_beverage, Depth: 1, Visited: 12
Page: /wiki/Alcohol_by_volume, Depth: 1, Visited: 13
Page: /wiki/Pomace, Depth: 1, Visited: 14
Page: /wiki/Wine_press, Depth: 1, Visited: 15
Page: /wiki/Anise, Depth: 1, Visited: 16
Page: /wiki/Greek_Orthodox_Church, Depth: 1, Visited: 17
Page: /wiki/Monk, Depth: 1, Visited: 18
Page: /wiki/Mount_Athos, Depth: 1, Visited: 19
Page: /wiki/Grape, Depth: 1, Visited: 20
Page: /wiki/Crusher/destemmer, Depth: 1, Visited: 21
Page: /wiki/Ethano

* Or perhaps wrap it up into a modularised component.

In [138]:
%%writefile crawler.py

import argparse
import requests
from bs4 import BeautifulSoup
import re
import logging
from collections import deque
from urllib.parse import urlparse, urlunparse

def crawl_web(root_url, starting_url, config):
    """
    Crawl web pages and collect links using BFS.

    Args:
        root_url (str): The root URL for the website.
        starting_url (str): The URL of the starting page.
        config (dict): Configuration dictionary containing settings.
    """

    total_pages = config["total_pages"]
    max_depth = config["max_depth"]

    # Set to store visited pages
    visited_pages = set()

    # Sentinel value to mark the end of a depth level
    SENTINEL = "__SENTINEL__"

    # Initialize depth and page count
    depth = 0
    num_pages_visited = 0

    # Initialize a deque for BFS traversal
    queue = deque()
    queue.appendleft(root_url+starting_url)
    queue.appendleft(SENTINEL)

    while queue:
        if num_pages_visited >= total_pages or depth > max_depth:
            logging.info("Total pages visited: %d", num_pages_visited)
            return

        current_url = queue.pop()

        if current_url == SENTINEL:
            depth += 1
            continue

        try:
            response = requests.get(current_url)
            response.raise_for_status()  # Raise an exception for bad HTTP responses
        except requests.exceptions.RequestException as rex:
            logging.error("Unable to get %s: %s", current_url, rex)
        else:
            num_pages_visited += 1
            logging.info("Page: %s, Depth: %d, Visited: %d", current_url, depth, num_pages_visited)
            html = response.content
            soup = BeautifulSoup(html, "html.parser")

            # Find all links on the page that match a specific pattern
            for link in soup.find_all("a", href=re.compile("^(/wiki/)[^:]*$")):
                if "href" in link.attrs:
                    new_url = link.attrs["href"]
                    if new_url not in visited_pages:
                        # Ensure the URL has the correct scheme
                        if not urlparse(new_url).scheme:
                            new_url = urlunparse(urlparse(root_url)._replace(path=new_url))
                        queue.appendleft(new_url)
                        visited_pages.add(new_url)

            # Add SENTINEL to mark the end of the current depth level
            queue.appendleft(SENTINEL)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)  # Configure logging

    # Create a command-line argument parser
    parser = argparse.ArgumentParser(description="Web Crawler")
    
    # Add command-line arguments for the root URL and starting URL
    parser.add_argument("root_url", help="Root URL of the website")
    parser.add_argument("starting_url", help="URL of the starting page")
        
    # Parse the command-line arguments
    args = parser.parse_args()

    # Configuration settings
    config = {
        "total_pages": 50,
        "max_depth": 2
    }

    # Call the crawl_web function with the root URL and starting URL from command-line arguments
    crawl_web(args.root_url, args.starting_url, config)


Overwriting crawler.py


In [139]:
!python crawler.py "https://en.wikipedia.org" "/wiki/Tsipouro"

INFO:root:Page: https://en.wikipedia.org/wiki/Tsipouro, Depth: 0, Visited: 1
INFO:root:Page: https://en.wikipedia.org/wiki/Main_Page, Depth: 1, Visited: 2
INFO:root:Page: https://en.wikipedia.org/wiki/Main_Page, Depth: 1, Visited: 3
INFO:root:Page: https://en.wikipedia.org/wiki/Tsipouro, Depth: 1, Visited: 4
INFO:root:Page: https://en.wikipedia.org/wiki/Tsipouro, Depth: 1, Visited: 5
INFO:root:Page: https://en.wikipedia.org/wiki/Tsipouro, Depth: 1, Visited: 6
INFO:root:Page: https://en.wikipedia.org/wiki/Greek_language, Depth: 1, Visited: 7
INFO:root:Page: https://en.wikipedia.org/wiki/Romanization_of_Greek, Depth: 1, Visited: 8
INFO:root:Page: https://en.wikipedia.org/wiki/Greece, Depth: 1, Visited: 9
INFO:root:Page: https://en.wikipedia.org/wiki/Thessaly, Depth: 1, Visited: 10
INFO:root:Page: https://en.wikipedia.org/wiki/Epirus_(region), Depth: 1, Visited: 11
INFO:root:Page: https://en.wikipedia.org/wiki/Macedonia_(Greece), Depth: 1, Visited: 12
INFO:root:Page: https://en.wikipedia.
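
* The log lists `Main_Page` and `Tsipouro` more than once. In `crawl_web()` the membership test is made on the relative `href`, while `visited_pages` stores the absolute URL, so the test never matches. A minimal sketch of one possible remedy, resolving each link before testing it, follows; `new_absolute_urls()` is a hypothetical helper written for this illustration.

```python
from urllib.parse import urljoin

def new_absolute_urls(root_url, hrefs, visited):
    """Resolve hrefs against root_url and return only the unseen absolute URLs."""
    fresh = []
    for href in hrefs:
        absolute = urljoin(root_url, href)  # "/wiki/X" -> "https://host/wiki/X"
        if absolute not in visited:
            visited.add(absolute)
            fresh.append(absolute)
    return fresh

root = "https://en.wikipedia.org"
visited = {urljoin(root, "/wiki/Tsipouro")}  # mark the start page as seen
hrefs = ["/wiki/Main_Page", "/wiki/Main_Page", "/wiki/Tsipouro", "/wiki/Greek_language"]
fresh = new_absolute_urls(root, hrefs, visited)
print(fresh)
```

* Only `Main_Page` and `Greek_language` are returned, each once, because both the set and the test now use the same absolute form.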

# Collecting Data Across an Entire Site

* Of course, web crawlers would be fairly boring if all they did was hop from one page to the other. 

* In order to make them useful, we need to be able to do something on the page while we’re there. 

* Let's look at how to build a scraper that collects the title and the first paragraph of content.

* All page titles (on all pages, regardless of their status as an article page, an edit history page, or any other page) sit under `h1` tags, and these are the only `h1` tags on the page.

* All body text lives under the `div#bodyContent` tag. However, if we want to get more specific and access just the first paragraph of text, we might be better off using `div#mw-content-text`→`p` (selecting the first paragraph tag only). 

* This is true for all content pages except file pages (for example: https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg), which do not have sections of content text.

In [140]:
def get_links(article_url, pages, total_pages, max_depth, depth):
    num_visited_pages = len(pages) + 1
    if num_visited_pages > total_pages or depth > max_depth:
        return
    pages.add(article_url)
    print('-' * 10)
    print(num_visited_pages)
    print('-' * 10)
    print(article_url)
    try:
        r = requests.get("https://en.wikipedia.org" + article_url)
    except requests.exceptions.RequestException as rex:
        print(f"Unable to get {article_url} reason {rex}")
        return  # without a response there is nothing to parse or follow
    else:
        html = r.content
        soup = BeautifulSoup(html, 'html.parser')
        try:
            # here we will print the wanted attributes 
            print(soup.h1.get_text())
            print(soup.find(id ="mw-content-text").find_all("p")[0]) 
        except (AttributeError, IndexError):
            # a missing h1 or div raises AttributeError; no <p> raises IndexError
            print("This page is missing something.")

    for link in soup.find_all("a", 
                              href=re.compile("^(/wiki/)((?!:).)*$")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages: 
                new_page = link.attrs['href'] 
                get_links(new_page, pages, total_pages, max_depth, depth+1)


pages = set()
get_links("/wiki/Main_Page", pages, 20, 1, 0)
print(len(pages))

----------
1
----------
/wiki/Main_Page
Main Page
<p>The <b><a href="/wiki/1899_Kentucky_gubernatorial_election" title="1899 Kentucky gubernatorial election">1899 Kentucky gubernatorial election</a></b> was held on November 7, 1899. The <a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a> incumbent, <a href="/wiki/William_O%27Connell_Bradley" title="William O'Connell Bradley">William Bradley</a>, was <a href="/wiki/Term_limits_in_the_United_States" title="Term limits in the United States">term-limited</a>. The <a href="/wiki/Democratic_Party_(United_States)" title="Democratic Party (United States)">Democrats</a> chose <a href="/wiki/William_Goebel" title="William Goebel">William Goebel</a>. Republicans nominated <a href="/wiki/William_S._Taylor_(Kentucky_politician)" title="William S. Taylor (Kentucky politician)">William Taylor</a>. Taylor won by a vote of 193,714 to 191,331. The vote was challenged on grounds of <a href="/wiki/Elect

# Getting all the Links in a Site

* To get all the links in a site, we need to get all its *internal* links and all its *external* links.

* Internal links point to pages inside the site.

* External links point to pages outside the site.

In [141]:
# Retrieves a list of all internal links found on a page
def get_internal_links(soup, include_url):
    internal_links = set()
    # Matches absolute or protocol-relative links whose host part contains
    # the site name; purely relative links such as "/about" are not matched
    for link in soup.find_all("a",
                             href=re.compile("^(https?:)?/{0,2}[^/]*" 
                                             + include_url)):
        if link.attrs['href'] is not None:
            internal_links.add(link.attrs['href'])
    return internal_links

In [142]:
# Retrieves a list of all external links found on a page
def get_external_links(soup, exclude_url):
    external_links = set()
    # Finds all links that start with "http" or "https" that do
    # not contain the current URL (built with a negative look ahead)
    for link in soup.find_all("a",
                             href=re.compile("^https?((?!:.+"
                                             + exclude_url + ").)*$")):
        if link.attrs['href'] is not None:
            external_links.add(link.attrs['href'])
    return external_links

In [143]:
# Gets the domain name of a URL
def get_domain_name(address):
    p = re.compile('https?://')
    schemeless_address = p.sub("", address, count=1)
    hostname = schemeless_address.split("/")[0]
    hostname_parts = hostname.split(".")[-2:]
    return '.'.join(hostname_parts)
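
* Keeping the last two labels of the hostname is a heuristic. The sketch below restates it with `urllib.parse.urlparse` to show where it works and where it does not; `registered_domain_guess()` is a hypothetical name, and the second URL is only an example of a two-label public suffix.

```python
from urllib.parse import urlparse

def registered_domain_guess(address):
    """Return the last two labels of the hostname (a heuristic only)."""
    hostname = urlparse(address).hostname or ""  # None when the scheme is missing
    return ".".join(hostname.split(".")[-2:])

print(registered_domain_guess("http://www.lifo.gr/now/world"))  # prints lifo.gr
print(registered_domain_guess("https://www.bbc.co.uk/news"))    # prints co.uk
```

* For suffixes such as `.co.uk` the heuristic returns the suffix rather than the site, so a crawl keyed on it would treat every `.co.uk` host as internal.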

In [144]:
# Collects a list of all internal and external URLs found on the site
def get_all_links(site_url, all_internal_links, all_external_links, depth=-1):
    if depth == 0:
        return
    try:
        r = requests.get(site_url)
    except requests.exceptions.RequestException as rex:
        print(f"Unable to get {site_url} reason: {rex}")
    else:
        html = r.content
        soup = BeautifulSoup(html, 'html.parser', from_encoding="latin-1")
        domain = get_domain_name(site_url)
        internal_links = get_internal_links(soup, domain)
        external_links = get_external_links(soup, domain)
        for link in external_links:
            if link not in all_external_links:
                all_external_links.add(link)
                print("E: " + link)
        for link in internal_links:
            if link not in all_internal_links:
                all_internal_links.add(link)
                print("I: " + link)
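                # only links without an http(s) scheme need to be completed
                if not link.startswith(("http://", "https://")):
                    if link.startswith("//"):  # protocol-relative link
                        link = "http:" + link
                    elif link.startswith("/"):
                        link = "http://" + domain + link
                    else:
                        link = "http://" + domain + "/" + link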
                get_all_links(link, all_internal_links, all_external_links, 
                              depth-1)

* Then we can run the program with, for example:

In [145]:
all_internal_links = set()
all_external_links = set()
get_all_links("http://www.lifo.gr", all_internal_links, all_external_links, 2)
print('Number of internal links:', len(all_internal_links))
print('Number of external links:', len(all_external_links))

E: https://podcasts.apple.com/gr/podcast/%CF%81%CF%8E%CF%84%CE%B1-%CE%BC%CE%B5-%CF%8C-%CF%84%CE%B9-%CE%B8%CE%B5%CF%82/id1648475977?l=el
E: https://www.lifoshop.gr/
E: https://open.spotify.com/show/1FJYeudGkl3vOKZIsxHNF6
E: https://open.spotify.com/show/5IsJp7Fd9h6J2lQepcOMt6
E: https://www.lifoshop.gr/product/afieroma-vivlio/
E: https://play.google.com/store/apps/details?id=gr.lifo.app
E: https://open.spotify.com/show/5mLh1VHjX6mKtBXTDvrUOx
E: https://apps.apple.com/us/app/lifo/id1673713585
E: https://www.lifoshop.gr/product/oi-100-kalyteres-apantiseis/
E: https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5zaW1wbGVjYXN0LmNvbS9hY1FDMk5KMA?sa=X&ved=0CAMQ4aUDahcKEwjogv2on5HvAhUAAAAAHQAAAAAQBA
E: https://www.lifoshop.gr/product/klamata/
E: https://www.lifoshop.gr/product/fthino-fagito-stin-athina/
E: https://www.facebook.com/lifo.mag
E: https://instagram.com/lifomag
E: https://www.lifoshop.gr/product/lean-to-narkotiko-tis-genias-toy-trap/
E: https://podcasts.google.com/feed/aHR0cHM6Ly9mZW

I: https://www.lifo.gr/now/world/portogalia-paraitithike-o-prothypoyrgos-meta-apo-ereyna-gia-diafthora
I: https://ampa.lifo.gr/renatalks/h-rena-apanta-o-aparavatos-kanonas-ton-scheseon-apo-apostasi/
I: https://www.lifo.gr/issues/view/788
I: https://www.lifo.gr/now/sport/pethane-o-thrylikos-nikos-gioytsos
I: https://www.lifo.gr/now/world/o-proedros-toy-israil-apanta-stin-tzoli-den-ehei-paei-pote-sti-gaza
I: https://www.lifo.gr/now/world/ehoynt-mparak-israil-ehei-liges-ebdomades-perithorio-gia-na-exaleipsei-ti-hamas
I: https://www.lifo.gr/agora/business-news/ioannis-sarmas-ekdilosi-pros-timin-toy-sto-stayros-niarhos
I: https://www.lifo.gr/now/entertainment/o-boy-george-epistrefei-sto-mprontgoyei-meta-apo-20-hronia
I: https://www.lifo.gr/now/world/nioy-tzersei-epistatis-sholeioy-erihne-hlorini-kai-somatika-ygra-sto-fagito-toy-kylikeioy
I: https://www.lifo.gr/now/world/i-serena-goyiliams-pire-ton-titlo-toy-fashion-icon
I: https://mikropragmata.lifo.gr/wtf/psekasmena-sxolia-voreio-selas/
Nu
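
* The internal/external split above works with regular expressions over raw `href` values. As a hedged alternative, the sketch below first resolves each `href` against the page URL and then compares hostnames. `split_links()` and the sample `hrefs` are assumptions made for this illustration; the hosts are taken from the output above, the paths are shortened or invented.

```python
from urllib.parse import urljoin, urlparse

def split_links(page_url, hrefs, domain):
    """Partition hrefs into (internal, external) sets of absolute URLs."""
    internal, external = set(), set()
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves "/x", "x" and "//host/x"
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue                        # skip mailto:, javascript:, ...
        host = parts.hostname or ""
        if host == domain or host.endswith("." + domain):
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external

hrefs = [
    "/now/world",
    "//www.lifo.gr/issues/view/788",
    "https://ampa.lifo.gr/renatalks/",
    "https://www.lifoshop.gr/",
    "mailto:info@lifo.gr",
]
internal, external = split_links("https://www.lifo.gr/", hrefs, "lifo.gr")
print(sorted(internal))
print(sorted(external))
```

* Comparing whole hostname labels keeps `www.lifoshop.gr` external even though its name starts with `lifo`.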

* We want to let the user select what the program will do, by providing command line arguments.
* As previously, we will use argparse:

```
python get_all_site_links.py -h
usage: get_all_site_links.py [-h] [-d DEPTH] starting_site

positional arguments:
  starting_site         starting site

optional arguments:
  -h, --help            show this help message and exit
  -d DEPTH, --depth DEPTH
                        depth of crawling
```

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("starting_site", help="starting site")
parser.add_argument("-d", "--depth", type=int, default=-1,
                    help="depth of crawling")
args = parser.parse_args()

all_internal_links = set()
all_external_links = set()
get_all_links(args.starting_site, all_internal_links, all_external_links,
              args.depth)

```
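
* Inside a notebook there is no command line to read from, but `parse_args()` also accepts an explicit list of arguments. The sketch below uses that to try out the parser; `build_parser()` is a hypothetical wrapper around the same two arguments.

```python
import argparse

def build_parser():
    """Build the same parser as above, wrapped in a function for reuse."""
    parser = argparse.ArgumentParser(prog="get_all_site_links.py")
    parser.add_argument("starting_site", help="starting site")
    parser.add_argument("-d", "--depth", type=int, default=-1,
                        help="depth of crawling")
    return parser

# Passing a list to parse_args() bypasses sys.argv.
args = build_parser().parse_args(["http://www.lifo.gr", "-d", "2"])
print(args.starting_site, args.depth)  # prints http://www.lifo.gr 2

defaults = build_parser().parse_args(["http://www.lifo.gr"])
print(defaults.depth)  # prints -1
```

* With the default of `-1`, `get_all_links()` never reaches `depth == 0`, so the crawl has no depth limit.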

### Crawling a game forum

* As our last task, let's crawl threads from the Warmane forum that contain Greek posts with one or more replies.
* The [Warmane](https://forum.warmane.com/forumdisplay.php?f=20) project imitates outdated games and serves educational purposes.
* There is no obvious `robots.txt` file, so we read the Terms of Service to check whether scraping is forbidden.
* Let's parse the first page (several exist).
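
* For sites that do publish a `robots.txt`, the check can be scripted with the standard library's `urllib.robotparser`. The rules in this sketch are invented for illustration and are not taken from any real site; `example.com` is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT taken from a real site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(robots_lines)  # parse() accepts an iterable of lines

allowed = rp.can_fetch("*", "https://example.com/forumdisplay.php?f=20")
blocked = rp.can_fetch("*", "https://example.com/private/page.html")
print(allowed, blocked)  # prints True False
```

* Against a live site, `set_url()` followed by `read()` would download the file instead of `parse()`.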

In [157]:
warmane_url = "https://forum.warmane.com/forumdisplay.php?f=20&page=1"

In [158]:
response = requests.get(warmane_url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')

In [159]:
# the posts use a CSS class named `title` 
posts = soup.find_all(class_='title')
for post in posts:
    print(post.get_text())

Ψαχνω Έλληνες για παρέα ή casual Guild - LORDAERON x1 / Horde
??????? ???????; (Icecrown)
Psaxnw ellhniko guild ston Icecrown
Icecrown Impromptu ICC raids and community!
Anazitisi Ellhnikoy Guild
8a ginei ellhniko guild Frostmourne?
Ordo Hellenicus Raiding Guild
Ψαχνω για Alliance Ελληνικο guild στο Icecrown realm
Frostmourne S3 Elliniko guild h pareaki?
Psaxnw atoma na pexoume parea :)
Guild ellhniko
Recruit Member Elliniko guild
psaxno elliniko guild lordaeron/horde
[Horde] <Iceclowns Citadel> - Icecrown Realm
psaxno filous
Ellhniko End Game Guild ston Icecrown/Horde-Recruitment.
Wotlk Lordaeron Guild
Ψαχνω ελληνες για bg!
Icecrown Server/ WoW guild **Auto-Attack** is recruiting Greek officers
Greek guild se icecrown/all ?


### Greeklish

* There is a language mix, with posts both in Greek and Greeklish.
* We can try separating them by identifying the language.
* There are already libraries for this job, like [langdetect](https://pypi.org/project/langdetect/), which we are going to use for this task.

In [151]:
#!pip install langdetect
from langdetect import detect

# Filter for Greek posts
greek_posts = [post for post in posts if 
               detect(post.get_text(strip=True)) == 'el']
for post in greek_posts:
    print(post.get_text())

Ψαχνω Έλληνες για παρέα ή casual Guild - LORDAERON x1 / Horde
Ψαχνω για Alliance Ελληνικο guild στο Icecrown realm
Ψαχνω ελληνες για bg!


* Another library we could use is called [fasttext-langdetect](https://pypi.org/project/fasttext-langdetect), which is based on [fastText](https://fasttext.cc/).
* fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
* In its reported benchmark, it outperforms `langdetect` across languages, including Modern Greek (by 0.2 points in recall). Let's use it for our language detection task. 

In [162]:
# !pip install fasttext-langdetect
from ftlangdetect import detect as ft_detect

ft_detect(text="Είμαι ένα κείμενο", low_memory=False)

{'lang': 'el', 'score': 1.0}

In [165]:
greek_posts = [post for post in posts if 
               ft_detect(post.get_text(strip=True))['lang'] == 'el']
for post in greek_posts:
    print(post.get_text())

Ψαχνω Έλληνες για παρέα ή casual Guild - LORDAERON x1 / Horde
Ψαχνω για Alliance Ελληνικο guild στο Icecrown realm
Ψαχνω ελληνες για bg!


* The rest are likely posts in Greeklish, but let's check this hypothesis. 

In [169]:
greeklish_posts = [post for post in posts if 
                   ft_detect(post.get_text(strip=True))['lang'] != 'el']
for post in greeklish_posts:
    print(post.get_text())

??????? ???????; (Icecrown)
Psaxnw ellhniko guild ston Icecrown
Icecrown Impromptu ICC raids and community!
Anazitisi Ellhnikoy Guild
8a ginei ellhniko guild Frostmourne?
Ordo Hellenicus Raiding Guild
Frostmourne S3 Elliniko guild h pareaki?
Psaxnw atoma na pexoume parea :)
Guild ellhniko
Recruit Member Elliniko guild
psaxno elliniko guild lordaeron/horde
[Horde] <Iceclowns Citadel> - Icecrown Realm
psaxno filous
Ellhniko End Game Guild ston Icecrown/Horde-Recruitment.
Wotlk Lordaeron Guild
Icecrown Server/ WoW guild **Auto-Attack** is recruiting Greek officers
Greek guild se icecrown/all ?


* Let's try to address this task from scratch, by building regular expressions to match posts written in Latin characters.

In [170]:
import re

def is_greeklish(text):
    # Creating a regular expression pattern to match Latin characters (A-Z, a-z)
    greeklish_pattern = re.compile(r'[A-Za-z]+', re.UNICODE)
    # Searching for the pattern in the text
    match = greeklish_pattern.search(text)
    # If a match is found, return True (indicating Greeklish is present)
    return match is not None

# Testing
post = "Αυτή είναι μια δοκιμαστική πρόταση σε Greeklish: kalinixta."
if is_greeklish(post):
    print("The post is written in Greeklish.")
else:
    print("The post is not written in Greeklish.")

The post is written in Greeklish.


In [171]:
# Filter for Greeklish posts
greeklish_posts = [post for post in posts if 
                   is_greeklish(post.get_text(strip=True))]
for post in greeklish_posts:
    print(post.get_text())

Ψαχνω Έλληνες για παρέα ή casual Guild - LORDAERON x1 / Horde
??????? ???????; (Icecrown)
Psaxnw ellhniko guild ston Icecrown
Icecrown Impromptu ICC raids and community!
Anazitisi Ellhnikoy Guild
8a ginei ellhniko guild Frostmourne?
Ordo Hellenicus Raiding Guild
Ψαχνω για Alliance Ελληνικο guild στο Icecrown realm
Frostmourne S3 Elliniko guild h pareaki?
Psaxnw atoma na pexoume parea :)
Guild ellhniko
Recruit Member Elliniko guild
psaxno elliniko guild lordaeron/horde
[Horde] <Iceclowns Citadel> - Icecrown Realm
psaxno filous
Ellhniko End Game Guild ston Icecrown/Horde-Recruitment.
Wotlk Lordaeron Guild
Ψαχνω ελληνες για bg!
Icecrown Server/ WoW guild **Auto-Attack** is recruiting Greek officers
Greek guild se icecrown/all ?
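
* The filter keeps every title, because `is_greeklish()` returns `True` as soon as a single Latin letter appears, and the Greek titles also contain words such as "Guild" or "bg". A hedged sketch of a stricter heuristic labels each title by the scripts it contains. `script_of()` and the Unicode ranges used here (the Greek and Greek Extended blocks) are choices made for this illustration.

```python
import re

GREEK = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")  # Greek and Greek Extended blocks
LATIN = re.compile(r"[A-Za-z]")

def script_of(text):
    """Label a title by the scripts it contains (not full language detection)."""
    has_greek = bool(GREEK.search(text))
    has_latin = bool(LATIN.search(text))
    if has_greek and has_latin:
        return "mixed"
    if has_greek:
        return "greek"
    if has_latin:
        return "latin"  # Greeklish OR English: the script alone cannot tell
    return "other"

titles = [
    "Ψαχνω ελληνες για bg!",
    "Psaxnw ellhniko guild ston Icecrown",
    "Wotlk Lordaeron Guild",
]
for title in titles:
    print(script_of(title), "|", title)
```

* The last two titles both come out as `latin`, so separating Greeklish from English still needs a further signal, such as a word list or a language detector.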


<h3 style="color:salmon">Homework</h3>

* Crawl the entire forum and build a CSV file with Greek and Greeklish posts. 
* The language of each post should be identified and recorded in a second column.
* Also keep other metadata, including the author and the date.
* Add another column with `reply-to` information, so that the threads can be recreated.
* You should be able to present a code snippet, reconstructing a thread from your CSV.