In [None]:
from bs4 import BeautifulSoup
import requests
import random

# Recaps

## HTTP

A [Hypertext Transfer Protocol (HTTP)](https://en.wikipedia.org/wiki/HTTP) request is made by a client, to a named host, which is located on a server.

An HTTP request contains the following elements:
- A request line.
- A series of HTTP headers, or header fields.
- A message body, if needed, which is often [JavaScript Object Notation (JSON)](https://www.w3schools.com/js/js_json_objects.asp).

HTTP request methods:
- GET
- POST
- PUT
- DELETE

[and more ...](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)
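
As a rough sketch of how these pieces fit together, the standard-library `urllib.request.Request` class can build a request without sending it, so each element can be inspected. The URL and payload here are made-up placeholders, not a real API.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload, used only for illustration
url = "https://example.com/api/birds"
payload = {"name": "chicken", "legs": 2}

# Build (but do not send) a POST request with a JSON body
req = Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method(), req.full_url)  # method and target, as in the request line
print(req.header_items())              # the HTTP headers
print(req.data)                        # the message body, as bytes
```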

Example Request
<img src="https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview/http_request.png"/>

Example Response
<img src="https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview/http_response.png"/>

The HTTP response will have:
- A [status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), indicating if the request was successful or not, and why.
- A status message, a non-authoritative short description of the status code.
- HTTP headers, like those for requests.
- Optionally, a body containing the fetched resource, which can be a blob or a JSON object as well.
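
To make the status code and status message a bit more concrete, here is a small sketch using the standard-library `http.HTTPStatus` enum. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirection"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "informational"
    return f"{code} {status.phrase} ({outcome})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

The first digit of the code gives the broad category, and the phrase is the short human-readable status message.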

## Databases

<img src="https://images.prismic.io/nightborn/7e215705-8aa6-4ff4-94bf-5a6a801b92a4_thumbnail_website2.jpg?auto=compress,format"/>

One of the most common types of database is the relational database:

<b>Relational Database Management System (RDBMS), or simply relational database:</b> a collection that organizes data in predefined relationships, where data is stored in one or more tables of columns and rows.
- Structured Query Language (SQL) is a domain-specific language used for programming and managing data held in a relational database. https://www.w3schools.com/sql/

Some examples include
- Microsoft SQL Server
- MySQL
- PostgreSQL
- SQLite
- Oracle Database
- MariaDB
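
As a minimal sketch of tables, rows and SQL in practice, Python ships with the `sqlite3` module, which talks to SQLite. The table name, columns and numbers below are made up for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, n_links INTEGER)")
conn.executemany(
    "INSERT INTO pages (title, n_links) VALUES (?, ?)",
    [("Chicken", 1200), ("Soup", 400), ("Domestication", 650)],  # invented values
)

# SQL query: pick the rows that match a condition, in alphabetical order
rows = conn.execute(
    "SELECT title FROM pages WHERE n_links > ? ORDER BY title", (500,)
).fetchall()
print(rows)  # [('Chicken',), ('Domestication',)]
conn.close()
```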


<b>[NoSQL](https://www.couchbase.com/resources/why-nosql/) database:</b> stores information in a form other than the tables of columns and rows used by relational databases, often as JSON documents. NoSQL stands for “not only SQL” rather than “no SQL” at all. Types of NoSQL databases include:
- Document databases
- Key-value stores
- Wide-column databases
- Graph databases

Some examples include:
- MongoDB
- Apache Cassandra
- Couchbase
- Amazon DynamoDB
- Redis
- Neo4j
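
To give a feel for the key-value and document models, here is a rough sketch using plain Python dictionaries as stand-ins. This is not a real database, and all names and values are invented.

```python
import json

# Key-value store stand-in: look up a value by its key
kv_store = {"page:Chicken": "visited", "page:Soup": "pending"}

# Document store stand-in: each record is a self-contained JSON-like document,
# and documents do not need to share the same fields
documents = [
    {"title": "Chicken", "links": ["/wiki/Domestication", "/wiki/Egg"]},
    {"title": "Soup", "links": ["/wiki/Broth"], "language": "en"},
]

# "Query" the documents: titles of documents that link to /wiki/Egg
titles = [doc["title"] for doc in documents if "/wiki/Egg" in doc["links"]]
print(titles)  # ['Chicken']
print(json.dumps(documents[1]))
```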

# Web Crawling

The goal of this notebook is to go to a Wikipedia page,
- scrape all the links from this page,
- store them somehow,
- pick a link at random to perform the same process again.

This will repeat for a set number of iterations, but could run infinitely in theory.. until something crashes.

This is to introduce you to some concepts behind web-crawling - perhaps it will stimulate some ideas about how you might make a more directed and intentional web-crawler with some specific goal in mind...

The first thing to do is get an idea of what we are working with. Going to some random Wikipedia article (in this case.. chicken) we can see a few things:


1. The URLs are of the form `https://en.wikipedia.org/wiki/Chicken` - which is convenient and neat.
2. The HTML markup also seems quite neat and links are in clear `<a>` tags.
3. Links to other Wikipedia articles seem to be of the shortened form `/wiki/Domestication` (as seen in the pic).
4. There are _a lot_ of links on a page.

> If you didn't know, all I did in the pic above is just right click on a link and click `Inspect` - this opens up the console and expands the HTML to reveal the specific tag you are _inspecting_. This is really handy. I'm using Firefox but I'd imagine all good browsers (...Firefox or Chrome) have this feature.
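
Since internal links come in the shortened form, they need to be turned back into full URLs before they can be requested. One way to sketch this is with the standard-library `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Chicken"
short_link = "/wiki/Domestication"

# Resolve the shortened link against the page it was found on
full_url = urljoin(page_url, short_link)
print(full_url)  # https://en.wikipedia.org/wiki/Domestication
```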

## Getting Started

As with any task, you need to break it down into manageable chunks as soon as possible. First of all I want to make sure I can make an HTTP request to a Wikipedia page and retrieve the page content.

I'll be using the [Requests](https://docs.python-requests.org/en/latest/) package today, but [urllib](https://docs.python.org/3/library/urllib.html) would also work. They pretty much do the same thing, but it seems Requests is used more often and is a slightly nicer package to use.

In [None]:
query = 'chicken'

url = 'https://en.wikipedia.org/wiki/' + query
page = requests.get(url)
soup = BeautifulSoup(page.content, features="html.parser")
print(soup)

That seemed to work fine. Notice I had to run BeautifulSoup on `page.content`, because `page` itself is just the HTTP response:

In [None]:
print(type(page))

Here we have a status code of `200` meaning the request is successful.

In [None]:
print(page.status_code)

In [None]:
for method in dir(page):
    print(method)

Let's do some BeautifulSoup magic and grab all the links from the page. An HTML link is always in an `<a>` tag, and specifically is under the `href` attribute of the `<a>` tag.

I am putting this in a `try` `except` block here as _not every_ `<a>` tag will necessarily have an `href` attribute. Try running:

```python
links = []

for a in soup.find_all("a"):
    links.append(a["href"])
```

You'll get an error and the whole loop is ruined. Getting used to when and where to use `try` `except` isn't always obvious, but it is a way of _catching_ an error, handling it in some way, and then _continuing_ as opposed to simply crashing :(
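
As an aside, `try` `except` is not the only way to cope with a missing attribute. Here is a sketch of two alternatives that BeautifulSoup offers, run on a tiny hand-written HTML string (an invented example, not the Wikipedia page):

```python
from bs4 import BeautifulSoup

# A tiny hand-written page: the second <a> tag has no href attribute
html = '<a href="/wiki/Egg">Egg</a> <a name="top">Top</a> <a href="#History">History</a>'
soup_demo = BeautifulSoup(html, features="html.parser")

# Option 1: .get() returns None instead of raising when the attribute is missing
with_get = [a.get("href") for a in soup_demo.find_all("a") if a.get("href") is not None]

# Option 2: ask find_all for only the <a> tags that have an href at all
with_filter = [a["href"] for a in soup_demo.find_all("a", href=True)]

print(with_get)     # ['/wiki/Egg', '#History']
print(with_filter)  # ['/wiki/Egg', '#History']
```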

In [None]:
links = []

for a in soup.find_all("a"):
    try:
        links.append(a["href"])
    except KeyError:
        # This <a> tag has no href attribute, so skip it
        pass

for link in links:
    print(link)

This is a great start! You can see we have _lots_ of internal Wikipedia links (links to other Wikipedia articles).. This could be the start of a Wikipedia specific crawler.

The links which start with a `#` are fragment links (anchors) pointing to a section of the same page, so that you could send someone a link that opens already scrolled to a specific point on the page.
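
If you ever need to separate the page from the part after the `#`, the standard-library `urllib.parse.urldefrag` does this. The `#Etymology` fragment below is just an illustrative example.

```python
from urllib.parse import urldefrag

link = "https://en.wikipedia.org/wiki/Chicken#Etymology"

# Split the link into the page URL and the fragment after the '#'
page_url, fragment = urldefrag(link)
print(page_url)  # https://en.wikipedia.org/wiki/Chicken
print(fragment)  # Etymology
```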

Now I want to just grab all the Wikipedia links and filter everything else out.

In [None]:
filtered = [link for link in links if link.startswith('/wiki/')]

for f in filtered:
    print(f)

That got all internal Wikipedia links, but there are also images (`.jpg`, `.png`, `.tif`) in there. There are also a bunch of other things so I am gonna do that again. The cell below does the same check above, but then also checks if any of the other junk I don't want is in the link, and of course skips it if so.

In [None]:
ignores = ['png', 'jpg', 'jpeg', 'isbn', 'svg', 'identifier', \
           'File', 'Special', 'Template', 'Mailto', 'Portal', \
           'Help', 'Category', 'Talk', 'Wikipedia', 'Main_Page']

filtered = []

for link in links:
    if link.startswith('/wiki/'):
        valid = True
        for ignore in ignores:
            if ignore in link:
                valid = False
                break
        if valid:
            filtered.append(link)

for f in filtered:
    print(f)
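
The nested loop with the `valid` flag can also be written more compactly with the built-in `any()`. This is only a sketch of the same check, using made-up sample links and a shortened ignore list so it runs on its own:

```python
# Made-up sample links, standing in for a scraped list of links
sample_links = [
    "/wiki/Domestication",
    "/wiki/File:Rooster.jpg",
    "#History",
    "/wiki/Special:Search",
    "/wiki/Egg",
    "https://example.com/wiki/Egg",
]
sample_ignores = ["jpg", "File", "Special"]

# Keep a link only if it is internal and contains none of the ignore strings
sample_filtered = [
    link for link in sample_links
    if link.startswith("/wiki/") and not any(ig in link for ig in sample_ignores)
]
print(sample_filtered)  # ['/wiki/Domestication', '/wiki/Egg']
```

Like the `break` in the loop version, `any()` stops checking as soon as one ignore string is found.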

An alternative way to find links with a particular pattern would be to use [regular expressions](https://www.regular-expressions.info/).
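
As a small taste of the idea, here is a sketch that uses Python's `re` module to spot links ending in an image extension. The sample links are invented, and the pattern covers only a few extensions.

```python
import re

# Match links that end in a common image extension (case-insensitive)
image_pattern = re.compile(r"\.(png|jpe?g|svg|tiff?)$", re.IGNORECASE)

sample_links = [
    "/wiki/Egg",
    "/wiki/File:Rooster.JPG",
    "/wiki/File:Hen.tif",
    "/wiki/Portable_Network_Graphics",
]
non_images = [link for link in sample_links if not image_pattern.search(link)]
print(non_images)  # ['/wiki/Egg', '/wiki/Portable_Network_Graphics']
```

The `$` anchors the match to the end of the link, so the extension only counts when it really is the file ending.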


#### <span style="color:red"> Exercise 1 </span>

Practice using BeautifulSoup and NLTK: get all the text from the HTML tags in the chicken page and count the number of times the word "chicken" is used. There should be roughly 184 occurrences.

#### <span style="color:red"> Exercise 2 </span>

Go over this [Python doc](https://docs.python.org/3/library/re.html) and use regular expressions to grab the links that point to articles. How does this improve the code?

We now have a good number of valid links to other Wikipedia articles..

Now we can really start crawling! For now let's just choose a link at random and then see what that Wikipedia article has for us...

In [None]:
random_wiki = random.choice(filtered)
url = 'https://en.wikipedia.org' + random_wiki
page = requests.get(url)
soup = BeautifulSoup(page.content, features="html.parser")

print(f"URL: {url}")

new_links = []

for a in soup.find_all("a"):
    try:
        new_links.append(a["href"])
    except KeyError:
        # This <a> tag has no href attribute, so skip it
        pass
    
new_filtered = []

for link in new_links:
    if link.startswith('/wiki/'):
        valid = True
        for ignore in ignores:
            if ignore in link:
                valid = False
                # As soon as we know the link is not valid, there is no point
                # checking the rest of the ignores, so we break:
                break
        if valid:
            new_filtered.append(link)
            
for f in new_filtered:
    print(f)

This is OK.. but we are just repeating ourselves (it isn't very [DRY](... add link to DRY ...)), and this doesn't really set us up to automate the process.

Let's turn these mini routines into functions.

In [None]:
def get_soup(wiki_suffix):
    # Build the full URL from a suffix such as '/wiki/Chicken'
    url = 'https://en.wikipedia.org' + wiki_suffix
    page = requests.get(url)
    soup = BeautifulSoup(page.content, features="html.parser")
    
    return soup

def link_is_valid(link):
    ignores = ['png', 'jpg', 'jpeg', 'isbn', 'svg', 'identifier', \
           'File', 'Special', 'Template', 'Mailto', 'Portal', \
           'Help', 'Category', 'Talk', 'Wikipedia', 'Main_Page']
    
    # Anything that is not an internal /wiki/ link is invalid straight away
    if not link.startswith('/wiki/'):
        return False
    
    for ignore in ignores:
        if ignore in link:
            return False
    return True
    

def get_links(soup):
    links = []
    for a in soup.find_all("a"):
        try:
            link = a["href"]
        except KeyError:
            # This <a> tag has no href attribute, so skip it
            continue
        if link_is_valid(link):
            links.append(link)
    
    return links

#### <span style="color:red"> Exercise 3 </span>

Add a function that, given a word, counts how many times it appears in a page.

We have compartmentalised the few small routines in the code above and _abstracted_ them away into functions which are concise and are pretty much self-explanatory by their function names. This is starting to feel much nicer.. And something we could begin to turn into _software_.

In [None]:
random_wiki = random.choice(filtered)
soup = get_soup(random_wiki)
links = get_links(soup)

for l in links:
    print(l)

And finally we can automate the process. This is pretty simple really, we just do it a bunch of times!

In [None]:
def crawl(seed):
    links_visited = []
    suffix = '/wiki/' + seed
    # we don't want this to run forever so we only follow 10 links
    for i in range(10):
        soup = get_soup(suffix)
        links = get_links(soup)
        if not links:
            # dead end: no valid links to choose from, so stop early
            break
        suffix = random.choice(links)
        links_visited.append(suffix)
    return links_visited



In [None]:
links_visited = crawl('soup')

for lv in links_visited:
    print(f"Visited: {lv}")
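
A random walk like this can easily wander back to a page it has already seen. Here is a sketch of one way to avoid that, using a made-up link graph in place of real pages so it runs offline. The names `fake_pages` and `crawl_without_repeats` are invented for this example.

```python
import random

# A made-up link graph standing in for real pages, so this runs offline
fake_pages = {
    "/wiki/Soup": ["/wiki/Broth", "/wiki/Chicken"],
    "/wiki/Broth": ["/wiki/Soup"],
    "/wiki/Chicken": ["/wiki/Soup", "/wiki/Egg"],
    "/wiki/Egg": [],
}

def crawl_without_repeats(start, links_for, steps=10, rng=random):
    """Random walk that never visits the same page twice."""
    visited = [start]
    current = start
    for _ in range(steps):
        # Only consider links we have not been to yet
        candidates = [link for link in links_for(current) if link not in visited]
        if not candidates:
            break  # dead end: no unvisited links on this page
        current = rng.choice(candidates)
        visited.append(current)
    return visited

path = crawl_without_repeats("/wiki/Soup", fake_pages.get, rng=random.Random(0))
print(path)
```

To try it on real pages, `links_for` could be something like `lambda suffix: get_links(get_soup(suffix))`.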

#### <span style="color:red"> Exercise 4 </span>

Try the above with a new seed

## Homework

Your task now is to do _pretty much_ what I have done in this notebook, but with another source as your starting point.

To take it further, try and find a more meaningful direction in your crawling. Perhaps you could actually read what words are in the link, or in the page, or find something which would allow your web-crawler to make a decision about _where_ it would like to go to next.

It would also be great to _store_ this journey. [Perhaps some more metadata into a text file](https://www.w3schools.com/python/python_file_write.asp).
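
As a starting point, here is one possible sketch for saving a journey as one JSON record per line. The file name, the record fields and the helper names `save_journey` and `load_journey` are all assumptions made for this example, so adapt them to whatever metadata you care about.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def save_journey(links_visited, path):
    """Write one JSON record per visited link, in order."""
    with open(path, "w", encoding="utf-8") as f:
        for step, link in enumerate(links_visited):
            record = {
                "step": step,
                "link": link,
                "saved_at": datetime.now(timezone.utc).isoformat(),
            }
            f.write(json.dumps(record) + "\n")

def load_journey(path):
    """Read the records back as a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up journey, written to a temporary file for demonstration
journey_file = os.path.join(tempfile.gettempdir(), "journey.jsonl")
save_journey(["/wiki/Soup", "/wiki/Broth"], journey_file)
print(load_journey(journey_file))
```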

Perhaps the web-crawler doesn't only move forward but can turn back (return to an older link) and start a new path. How would you record this journey?