In [4]:
from bs4 import BeautifulSoup
import requests
import random

# Recaps

## HTTP

A [Hypertext Transfer Protocol (HTTP)](https://en.wikipedia.org/wiki/HTTP) request is made by a client, to a named host, which is located on a server.

An HTTP request contains the following elements:
- A request line.
- A series of HTTP headers, or header fields.
- A message body, if needed, which is often [JavaScript Object Notation (JSON)](https://www.w3schools.com/js/js_json_objects.asp).

HTTP request methods:
- GET
- POST
- PUT
- DELETE

[and more ...](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)
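
To make the parts of a request concrete, here is a minimal sketch that assembles the raw text of an HTTP/1.1 request by hand. The helper name `build_http_request`, the `/items` path and the `example.org` host are assumptions made up for illustration; in practice a library such as Requests builds this text for you.

```python
import json

def build_http_request(method, path, host, payload=None):
    """Assemble the text of an HTTP/1.1 request: request line, headers, body."""
    body = json.dumps(payload) if payload is not None else ""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    if body:
        # Headers that describe the body are only needed when there is one
        lines.append("Content-Type: application/json")
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the headers from the (optional) body
    return "\r\n".join(lines) + "\r\n\r\n" + body

print(build_http_request("GET", "/wiki/Chicken", "en.wikipedia.org"))
print(build_http_request("POST", "/items", "example.org", {"name": "chicken"}))
```

The GET request has no body, so it ends right after the blank line, while the POST request carries its JSON payload after it.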

Example Request
<img src="https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview/http_request.png"/>

Example Response
<img src="https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview/http_response.png"/>

The HTTP response will have:
- A [status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), indicating if the request was successful or not, and why.
- A status message, a non-authoritative short description of the status code.
- HTTP headers, like those for requests.
- Optionally, a body containing the fetched resource, which can be a blob or a JSON object as well.
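
As a small illustration of status codes and status messages, the sketch below uses Python's standard `http.HTTPStatus` enum to look up the standard phrase for a code and sort it into a rough outcome class. The function name `describe_status` and the outcome labels are assumptions chosen for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code, its standard reason phrase and a rough outcome class."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    if code < 200:
        outcome = "informational"
    elif code < 300:
        outcome = "success"
    elif code < 400:
        outcome = "redirection"
    elif code < 500:
        outcome = "client error"
    else:
        outcome = "server error"
    return f"{code} {status.phrase} ({outcome})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```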

## Databases

<img src="https://images.prismic.io/nightborn/7e215705-8aa6-4ff4-94bf-5a6a801b92a4_thumbnail_website2.jpg?auto=compress,format"/>

One of the most common types of database is the:

<b>Relational Database Management System (RDBMS), or simply relational database:</b> a collection that organizes data in predefined relationships, where data is stored in one or more tables made up of columns and rows.
- [Structured Query Language (SQL)](https://www.w3schools.com/sql/) is a domain-specific language used for programming and managing data held in a relational database.

Some examples include
- Microsoft SQL Server
- MySQL
- PostgreSQL
- SQLite
- Oracle Database
- MariaDB
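
To get a feel for tables, rows and SQL, here is a minimal sketch using the `sqlite3` module that ships with Python. The `pages` table, its columns and the link counts are made-up values for illustration only.

```python
import sqlite3

# An in-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, link_count INTEGER)"
)
conn.executemany(
    "INSERT INTO pages (title, link_count) VALUES (?, ?)",
    [("Chicken", 120), ("Bird", 340), ("Animal", 510)],  # made-up numbers
)

# Select only the rows that satisfy a condition, sorted by a column
rows = conn.execute(
    "SELECT title, link_count FROM pages WHERE link_count > ? ORDER BY link_count DESC",
    (200,),
).fetchall()
print(rows)
conn.close()
```

The `?` placeholders let the database driver insert the values safely instead of building the SQL string by hand.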


<b>[NoSQL](https://www.couchbase.com/resources/why-nosql/) database:</b> typically stores information in JSON-like documents or other non-tabular structures instead of the columns and rows used by relational databases. NoSQL stands for “not only SQL” rather than “no SQL” at all. Types of NoSQL databases include:
- Document databases
- Key-value stores
- Wide-column databases
- Graph databases

Some examples include:
- MongoDB
- Apache Cassandra
- Couchbase
- Amazon DynamoDB
- Redis
- Neo4j
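
The document and key-value ideas can be imitated with nothing more than a dictionary and the `json` module. This is only a toy stand-in, not how any of the databases above work internally; the `put` and `get` helpers and the key names are assumptions for this sketch.

```python
import json

store = {}  # toy stand-in for a key-value / document store

def put(key, document):
    # Each value is a whole JSON document; documents need not share a schema
    store[key] = json.dumps(document)

def get(key):
    return json.loads(store[key])

put("page:Chicken", {"title": "Chicken", "links": ["/wiki/Bird", "/wiki/Egg_as_food"]})
put("page:Bird", {"title": "Bird", "taxonomy": {"class": "Aves"}})

print(get("page:Chicken")["links"])
print(get("page:Bird")["taxonomy"]["class"])
```

Notice that the two documents have different fields, which a relational table with fixed columns would not allow without changing the schema.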

# Web Crawling

The goal of this notebook is to go to a Wikipedia page, and then:
- scrape all the links from this page,
- store them somehow,
- pick a link at random to perform the same process again.

This will repeat for a set number of iterations, but could run infinitely in theory... until something crashes.

This is to introduce you to some concepts behind web-crawling - perhaps it will stimulate some ideas about how you might make a more directed and intentional web-crawler with some specific goal in mind...

The first thing to do is get an idea of what we are working with. Going to some random Wikipedia article (in this case.. chicken) we can see a few things:


1. The URLs are of the form `https://en.wikipedia.org/wiki/Chicken` - which is convenient and neat.
2. The HTML markup also seems quite neat and links are in clear `<a>` tags.
3. Links to other Wikipedia articles seem to be of the shortened form `/wiki/Domestication` (as seen in the pic).
4. There are _a lot_ of links on a page.
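
Since the links on a page come in shortened forms, it helps to see how they resolve to full URLs. A minimal sketch with the standard library's `urljoin`, using three kinds of `href` values that show up on a Wikipedia page:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Chicken"

hrefs = [
    "/wiki/Domestication",                           # path relative to the site root
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",  # protocol-relative link
    "#Description",                                  # anchor within the same page
]

for href in hrefs:
    print(urljoin(base, href))
```

Each shortened link is resolved against the page it was found on, which is what a browser does when you click it.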

> If you didn't know, all I did in the pic above is just right click on a link and click `Inspect` - this opens up the console and expands the HTML to reveal the specific tag you are _inspecting_. This is really handy. I'm using Firefox but I'd imagine all good browsers (...Firefox or Chrome) have this feature.

## Getting Started

As with any task, you need to break it down into manageable chunks as soon as possible. First of all I want to make sure I can make an HTTP request to a Wikipedia page and retrieve the page content.

I'll be using the [Requests](https://docs.python-requests.org/en/latest/) package today, but [URLLib](https://docs.python.org/3/library/urllib.html) would also work. They pretty much do the same thing, but it seems Requests is used more often and is a slightly nicer package to use.
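
For comparison, this is roughly what preparing the same request looks like with the standard library's `urllib`. The sketch only builds the request object and inspects it, so it runs without a network connection; the `User-Agent` string is a made-up value for illustration.

```python
from urllib.request import Request

url = "https://en.wikipedia.org/wiki/Chicken"
req = Request(url, headers={"User-Agent": "crawler-tutorial/0.1 (hypothetical)"})

print(req.get_method())              # the HTTP method that would be used
print(req.full_url)
print(req.host)
print(req.get_header("User-agent"))  # urllib stores header names capitalised like this

# Actually sending it would be: urllib.request.urlopen(req).read()
```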

In [5]:
query = 'chicken'

url = 'https://en.wikipedia.org/wiki/' + query
page = requests.get(url)
soup = BeautifulSoup(page.content, features="html.parser")
print(soup)

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Chicken - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feat

That seemed to work fine. Notice I had to run BeautifulSoup on `page.content`, because `page` itself is just the HTTP response:

In [6]:
print(type(page))

<class 'requests.models.Response'>


Here we have a status code of `200` meaning the request is successful.

In [7]:
print(page.status_code)

200


In [8]:
for method in dir(page):
    print(method)

__attrs__
__bool__
__class__
__delattr__
__dict__
__dir__
__doc__
__enter__
__eq__
__exit__
__format__
__ge__
__getattribute__
__getstate__
__gt__
__hash__
__init__
__init_subclass__
__iter__
__le__
__lt__
__module__
__ne__
__new__
__nonzero__
__reduce__
__reduce_ex__
__repr__
__setattr__
__setstate__
__sizeof__
__str__
__subclasshook__
__weakref__
_content
_content_consumed
_next
apparent_encoding
close
connection
content
cookies
elapsed
encoding
headers
history
is_permanent_redirect
is_redirect
iter_content
iter_lines
json
links
next
ok
raise_for_status
raw
reason
request
status_code
text
url


Let's do some BeautifulSoup magic and grab all the links from the page. An HTML link is always in an `<a>` tag, and specifically is under the `href` attribute of the `<a>` tag.

I am putting this in a `try` `except` block here as _not every_ `<a>` tag will necessarily have an `href` attribute. Try running:

```python
links = []

for a in soup.find_all("a"):
    links.append(a["href"])
```

You'll get an error and the whole loop is ruined. Getting used to when and where to use `try` `except` isn't always obvious, but it is a way of _catching_ an error, handling it in some way, and then _continuing_ as opposed to simply crashing :(
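
Here is a minimal standalone sketch of the same idea, with plain dictionaries standing in for the attributes of three `<a>` tags (an assumption so that it runs without any page). A missing attribute on a BeautifulSoup tag raises `KeyError`, just like a missing dictionary key, and tags also offer a `.get` method that returns `None` instead of raising.

```python
# Plain dictionaries stand in for the attributes of three <a> tags
tags = [{"href": "/wiki/Bird"}, {"name": "top"}, {"href": "#Origin"}]

links = []
for attrs in tags:
    try:
        links.append(attrs["href"])
    except KeyError:
        # Only the "missing key" error is caught; any other error still surfaces
        continue

# The same result without exceptions: .get returns None when the key is missing
links_via_get = [attrs.get("href") for attrs in tags if attrs.get("href") is not None]

print(links)
print(links_via_get)
```

Catching a specific exception is usually safer than a bare `except:`, which would also hide typos and other genuine bugs.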

In [9]:
links = []

for a in soup.find_all("a"):
    try:
        links.append(a["href"])
    except KeyError:  # this <a> tag has no href attribute
        pass
    
for link in links:
    print(link)

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Chicken
/w/index.php?title=Special:UserLogin&returnto=Chicken
/w/index.php?title=Special:CreateAccount&returnto=Chicken
/w/index.php?title=Special:UserLogin&returnto=Chicken
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Nomenclature
#Description
#Reproduction_and_life-cycle
#Origin_and_dispersal
#Origin
#Domestication
#Dispersal
#Diseases
#Use_by_humans
#Farming
#As_pets
#Cockfighting
#In_science
#In_culture,_folklore,_and

This is a great start! You can see we have _lots_ of internal Wikipedia links (links to other Wikipedia articles).. This could be the start of a Wikipedia specific crawler.

The links which start with a `#` are fragment links (anchors) that point to a section within the same page, so that you could send someone a link that is already scrolled to a specific point on the page.
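
If you want to treat `/wiki/Page` and `/wiki/Page#Section` as the same page, the fragment can be split off with the standard library's `urldefrag`. A minimal sketch on three links of the kind printed above:

```python
from urllib.parse import urldefrag

hrefs = ["#Domestication", "/wiki/Wikipedia:Protection_policy#semi", "/wiki/Bird"]

for href in hrefs:
    # urldefrag returns the link without its fragment, plus the fragment itself
    path, fragment = urldefrag(href)
    print(repr(path), repr(fragment))
```

A link that is only a fragment leaves an empty path, which is an easy way to spot links that never leave the current page.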

Now I want to just grab all the Wikipedia links and filter everything else out.

In [10]:
filtered = [link for link in links if link.startswith('/wiki/')]

for f in filtered:
    print(f)

/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
/wiki/Chicken
/wiki/Talk:Chicken
/wiki/Chicken
/wiki/Chicken
/wiki/Special:WhatLinksHere/Chicken
/wiki/Special:RecentChangesLinked/Chicken
/wiki/Wikipedia:File_Upload_Wizard
/wiki/Special:SpecialPages
/wiki/Wikipedia:Good_articles*
/wiki/Wikipedia:Protection_policy#semi
/wiki/Chicken_as_food
/wiki/Chicken_(disambiguation)
/wiki/Chooks_(disambiguation)
/wiki/Rooster_(disambiguation)
/wiki/Cockerel_(Faberg%C3%A9_egg)
/wiki/File:Male_and_female_chicken_sitting_together.jpg
/wiki/Conservation_status
/wiki/Taxonomy_(biology)
/wiki/Template:Taxonomy/Gallus
/wiki/Eukaryote
/wiki/Animal
/wiki/Chordate
/wiki/Bird
/wiki/Galliformes
/wiki/

That got all internal Wikipedia links, but there are also images (`.jpg`, `.png`, `.tif`) in there. There are also a bunch of other things so I am gonna do that again. The cell below does the same check above, but then also checks if any of the other junk I don't want is in the link, and of course skips it if so.

In [11]:
ignores = ['png', 'jpg', 'jpeg', 'isbn', 'svg', 'identifier', \
           'File', 'Special', 'Template', 'Mailto', 'Portal', \
           'Help', 'Category', 'Talk', 'Wikipedia', 'Main_Page']

filtered = []

for link in links:
    if link.startswith('/wiki/'):
        valid = True
        for ignore in ignores:
            if ignore in link:
                valid = False
                break
        if valid:
            filtered.append(link)

for f in filtered:
    print(f)

/wiki/Chicken
/wiki/Chicken
/wiki/Chicken
/wiki/Chicken_as_food
/wiki/Chicken_(disambiguation)
/wiki/Chooks_(disambiguation)
/wiki/Rooster_(disambiguation)
/wiki/Cockerel_(Faberg%C3%A9_egg)
/wiki/Conservation_status
/wiki/Taxonomy_(biology)
/wiki/Eukaryote
/wiki/Animal
/wiki/Chordate
/wiki/Bird
/wiki/Galliformes
/wiki/Phasianidae
/wiki/Junglefowl
/wiki/Binomial_nomenclature
/wiki/Carl_Linnaeus
/wiki/10th_edition_of_Systema_Naturae
/wiki/Synonym_(taxonomy)
/wiki/L.
/wiki/Domestication
/wiki/Red_junglefowl
/wiki/Southeast_Asia
/wiki/Chicken_as_food
/wiki/Egg_as_food
/wiki/Pet
/wiki/Cockfight
/wiki/Cultural_references_to_chickens
/wiki/Capon
/wiki/Neutered
/wiki/Chick_(young_bird)
/wiki/Hen_and_Chicken_Islands
/wiki/Comb_(anatomy)
/wiki/Bird
/wiki/Diurnality
/wiki/Junglefowl
/wiki/Bird_flight
/wiki/Flight_muscle
/wiki/Wattle_(anatomy)
/wiki/Sexual_dimorphism
/wiki/Mutation
/wiki/Omnivore
/wiki/Lizard
/wiki/Mouse
/wiki/Breed
/wiki/Gregarious
/wiki/Herd
/wiki/Egg_incubation
/wiki/Dominance_

An alternative way to find links with a particular pattern would be to use [regular expressions](https://www.regular-expressions.info/#:~:text=A%20regular%20expression%20%28regex%20or,with%20wildcard%20notations%20such%20as%20*).
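
As one possible sketch of the regular expression approach, the pattern below accepts a link only if it is `/wiki/` followed by characters that are neither `:` nor `#`. The pattern and the name `is_article_link` are assumptions for this example, not the only way to do it.

```python
import re

# /wiki/ followed by one or more characters that are neither ':' nor '#'
ARTICLE_LINK = re.compile(r"/wiki/[^:#]+")

def is_article_link(link):
    # fullmatch requires the whole string to fit the pattern
    return ARTICLE_LINK.fullmatch(link) is not None

sample = [
    "/wiki/Chicken",
    "/wiki/Talk:Chicken",
    "/wiki/File:Male_and_female_chicken_sitting_together.jpg",
    "#Origin",
    "/w/index.php?title=Special:UserLogin&returnto=Chicken",
    "/wiki/Cockerel_(Faberg%C3%A9_egg)",
    "/wiki/Outline_of_genetics#Branches_of_genetics",
]

print([link for link in sample if is_article_link(link)])
```

Excluding the colon drops namespaced pages such as `Talk:` and `File:` in one go, but it would also drop any article whose title contains a colon, and it still lets `/wiki/Main_Page` through, so treat it as a starting point.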


#### <span style="color:red"> Exercise 1 </span>

Practice using BeautifulSoup and NLTK: get all the text from the HTML of the chicken page and count the number of times the word "chicken" is used. There should be about 184 occurrences.

In [12]:
from nltk.tokenize import word_tokenize

url = 'https://en.wikipedia.org/wiki/chicken'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
tokens = word_tokenize(soup.text)

query_word = "chicken"
word_count = 0

for i in tokens:
    if query_word == i.lower(): # or "in" instead of == picks up all chickens
        word_count += 1

print(word_count)

76
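
The comment in the cell above distinguishes exact matching (`==`) from substring matching (`in`). A small sketch of how the two counts diverge on a made-up sentence; the `tokenize` helper is a crude regular expression stand-in for NLTK's `word_tokenize`, used here only so that the example runs without extra downloads.

```python
import re

def tokenize(text):
    # Crude stand-in for word_tokenize: runs of letters, lowercased
    return re.findall(r"[a-z]+", text.lower())

text = "Chicken and chickens: a chicken-keeper keeps Chickens."
tokens = tokenize(text)

exact = sum(1 for t in tokens if t == "chicken")      # only the word itself
substring = sum(1 for t in tokens if "chicken" in t)  # also plurals and the like

print(tokens)
print(exact, substring)
```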


#### <span style="color:red"> Exercise 2 </span>

Go over this [Python doc](https://docs.python.org/3/library/re.html) and use regular expressions to grab the links that point to articles. How does this improve the code?

We now have a good number of valid links to other Wikipedia articles..

Now we can really start crawling! For now let's just choose a link at random and then see what that Wikipedia article has for us...

In [13]:
random_wiki = random.choice(filtered)
url = 'https://en.wikipedia.org' + random_wiki
page = requests.get(url)
soup = BeautifulSoup(page.content, features="html.parser")

print(f"URL: {url}")

new_links = []

for a in soup.find_all("a"):
    try:
        new_links.append(a["href"])
    except KeyError:  # this <a> tag has no href attribute
        pass
    
new_filtered = []

for link in new_links:
    if link.startswith('/wiki/'):
        valid = True
        for ignore in ignores:
            if ignore in link:
                valid = False
                # As soon as we know the link is not valid, there is no point
                # checking the rest of the ignores, so we break:
                break
        if valid:
            new_filtered.append(link)
            
for f in new_filtered:
    print(f)

URL: https://en.wikipedia.org/wiki/Forced_molting
/wiki/Forced_molting
/wiki/Forced_molting
/wiki/Forced_molting
/wiki/Molting
/wiki/Chicken
/wiki/Feather
/wiki/Eggshell
/wiki/Intensive_farming
/wiki/Vertical_integration
/wiki/Poultry
/wiki/Husbandry
/wiki/Alfalfa
/wiki/Sodium
/wiki/Calcium
/wiki/Iodine
/wiki/Zinc
/wiki/Department_for_Environment,_Food_and_Rural_Affairs
/wiki/Corticosterone
/wiki/Lymphocyte
/wiki/Leukocyte
/wiki/Salmonella
/wiki/Chicken
/wiki/Poultry
/wiki/Chicken_as_food
/wiki/List_of_chicken_dishes
/wiki/List_of_chicken_breeds
/wiki/Capon
/wiki/Poularde
/wiki/Poussin_(chicken)
/wiki/Broiler
/wiki/Chicken_terminology
/wiki/List_of_chicken_colours
/wiki/List_of_poultry_feathers
/wiki/Chicken_farming
/wiki/Battery_cage
/wiki/Free_range
/wiki/Furnished_cages
/wiki/Yarding
/wiki/Chicken_tractor
/wiki/Poultry_farming
/wiki/Broiler_industry
/wiki/Debeaking
/wiki/Hatchery
/wiki/Chick_sexing
/wiki/Chick_culling
/wiki/Candling
/wiki/Abnormal_behaviour_of_birds_in_captivity
/wi

This is OK.. but we are just repeating ourselves (it isn't very DRY, as in "Don't Repeat Yourself"), and this doesn't really set us up to automate the process.

Let's turn these mini routines into functions.

In [14]:
def get_soup(wiki_suffix):
    url = 'https://en.wikipedia.org' + wiki_suffix
    page = requests.get(url)
    soup = BeautifulSoup(page.content, features="html.parser")
    
    return soup

def link_is_valid(link):
    ignores = ['png', 'jpg', 'jpeg', 'isbn', 'svg', 'identifier', \
           'File', 'Special', 'Template', 'Mailto', 'Portal', \
           'Help', 'Category', 'Talk', 'Wikipedia', 'Main_Page']
    
    # Anything that is not an internal /wiki/ link is rejected straight away
    if not link.startswith('/wiki/'):
        return False
    for ignore in ignores:
        if ignore in link:
            return False
    return True
    

def get_links(soup):
    links = []
    for a in soup.find_all("a"):
        try:
            link = a["href"]
        except KeyError:
            # This <a> tag has no href attribute, so skip it
            continue
        if link_is_valid(link):
            links.append(link)
    
    return links


def count_words(soup, word):
    # Exact, case-sensitive match against each token
    word_count = 0
    tokens = word_tokenize(soup.text)
    for i in tokens:
        if i == word:
            word_count += 1
    return word_count

#### <span style="color:red"> Exercise 3 </span>

Add a function that, given a word, counts how many times it appears in the page.

We have compartmentalised the few small routines in the code above and _abstracted_ them away into functions which are concise and are pretty much self-explanatory by their function names. This is starting to feel much nicer.. And something we could begin to turn into _software_.

In [15]:
random_wiki = random.choice(filtered)
soup = get_soup(random_wiki)
links = get_links(soup)
num = count_words(soup, "and")
print(num)
for l in links:
    print(l)

150
/wiki/Genome
/wiki/Genome
/wiki/Genome
/wiki/Introduction_to_genetics
/wiki/Genome_(disambiguation)
/wiki/Genetics
/wiki/Chromosome
/wiki/DNA
/wiki/RNA
/wiki/Heredity
/wiki/Nucleotide
/wiki/Mutation
/wiki/Genetic_variation
/wiki/Allele
/wiki/Amino_acid
/wiki/Outline_of_genetics
/wiki/Index_of_genetics_articles
/wiki/Introduction_to_genetics
/wiki/History_of_genetics
/wiki/Evolution
/wiki/Molecular_evolution
/wiki/Population_genetics
/wiki/Mendelian_inheritance
/wiki/Quantitative_genetics
/wiki/Molecular_genetics
/wiki/Geneticist
/wiki/DNA_sequencing
/wiki/Genetic_engineering
/wiki/Genomics
/wiki/Medical_genetics
/wiki/Outline_of_genetics#Branches_of_genetics
/wiki/Classical_genetics
/wiki/Conservation_genetics
/wiki/Cytogenetics
/wiki/Ecological_genetics
/wiki/Immunogenetics
/wiki/Microbial_genetics
/wiki/Molecular_genetics
/wiki/Population_genetics
/wiki/Quantitative_genetics
/wiki/Personalized_medicine
/wiki/Molecular_biology
/wiki/Genetics
/wiki/Nucleotide
/wiki/DNA
/wiki/RNA
/w

And finally we can automate the process. This is pretty simple really, we just do it a bunch of times!

In [16]:
def crawl(seed):
    links_visited = []
    suffix = '/wiki/' + seed
    # we don't want this to run forever so we only navigate 10 down
    for i in range(10):
        soup = get_soup(suffix)
        links = get_links(soup)
        if not links:
            # random.choice raises IndexError on an empty list, so stop here
            break
        suffix = random.choice(links)
        links_visited.append(suffix)
    return links_visited



In [17]:
links_visited = crawl('soup')

for lv in links_visited:
    print(f"Visited: {lv}")

Visited: /wiki/Mirepoix_(cuisine)
Visited: /wiki/Parsnip
Visited: /wiki/Falcarindiol
Visited: /wiki/Nutraceutical
Visited: /wiki/Vitamins
Visited: /wiki/Energy_drink
Visited: /wiki/Calcium_supplement
Visited: /wiki/Dietary_supplement
Visited: /wiki/Low-carbohydrate_diet
Visited: /wiki/Pollotarianism
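
A random walk like this can wander back to pages it has already seen, or land on a page with no usable links. The sketch below shows one way to handle both, using a tiny made-up link graph (`TOY_WEB`) in place of real pages so that it runs offline; the names `TOY_WEB` and `crawl_once` are assumptions for this example.

```python
import random

# Hypothetical link graph standing in for real pages: page -> links found on it
TOY_WEB = {
    "Soup": ["Broth", "Stew", "Soup"],
    "Broth": ["Soup", "Stock"],
    "Stew": ["Soup", "Broth"],
    "Stock": [],  # a dead end
}

def crawl_once(seed, steps, rng):
    visited = [seed]
    current = seed
    for _ in range(steps):
        # Only consider links we have not been to yet
        candidates = [link for link in TOY_WEB.get(current, []) if link not in visited]
        if not candidates:
            break  # dead end: stop instead of crashing
        current = rng.choice(candidates)
        visited.append(current)
    return visited

# Passing in a seeded random generator makes a run repeatable
print(crawl_once("Soup", 10, random.Random(0)))
```

In a real crawler the dictionary lookup would be replaced by fetching the page and extracting its links.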


#### <span style="color:red"> Exercise 4 </span>

Try the above with a new seed

In [18]:

links_visited = crawl('Cherry_blossom')

for lv in links_visited:
    print(f"Visited: {lv}")

Visited: /wiki/Prunus_padus
Visited: /wiki/Rosaceae
Visited: /wiki/Velloziaceae
Visited: /wiki/Boraginaceae
Visited: /wiki/Vitaceae
Visited: /wiki/Sarraceniaceae
Visited: /wiki/Himantandraceae
Visited: /wiki/Plocospermataceae
Visited: /wiki/Achariaceae
Visited: /wiki/Fabaceae


## Homework

Your task now is to do _pretty much_ what I have done in this notebook, but with another source as your starting point.

To take it further, try to find a more meaningful direction in your crawling. Perhaps you could actually read what words are in the link, or in the page, or find something which would allow your web-crawler to make a decision about _where_ it would like to go to next.
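
One very simple way to give the crawler a direction is to score each candidate link by the keywords it contains and follow the best one. This is only a sketch of the idea; the function names `score` and `choose_link` and the keyword lists are assumptions for illustration.

```python
def score(link, keywords):
    """Count how many of the keywords appear in the link (case-insensitive)."""
    text = link.lower()
    return sum(1 for keyword in keywords if keyword.lower() in text)

def choose_link(links, keywords):
    """Pick the highest-scoring link, or None if no link matches any keyword."""
    best = max(links, key=lambda link: score(link, keywords), default=None)
    if best is None or score(best, keywords) == 0:
        return None
    return best

links = ["/wiki/Bird", "/wiki/Chicken_as_food", "/wiki/Egg_as_food", "/wiki/Southeast_Asia"]

print(choose_link(links, ["food", "egg"]))
print(choose_link(links, ["physics"]))
```

When nothing matches, the crawler could fall back to a random choice, so the walk keeps going.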

It would also be great to _store_ this journey. [Perhaps some more metadata into a text file](https://www.w3schools.com/python/python_file_write.asp).
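
A minimal sketch of storing a journey with a little metadata, writing one JSON object per line. The file name `journey.jsonl`, the field names and the helpers `save_journey` and `load_journey` are all assumptions for this example.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def save_journey(journey, path):
    """Write one JSON object per line, so the file is easy to read back or extend."""
    with open(path, "w", encoding="utf-8") as f:
        for step, link in enumerate(journey):
            record = {
                "step": step,
                "link": link,
                "saved_at": datetime.now(timezone.utc).isoformat(),
            }
            f.write(json.dumps(record) + "\n")

def load_journey(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["link"] for line in f]

journey = ["/wiki/Mirepoix_(cuisine)", "/wiki/Parsnip", "/wiki/Falcarindiol"]
path = os.path.join(tempfile.gettempdir(), "journey.jsonl")  # hypothetical file name

save_journey(journey, path)
print(load_journey(path))
```

Using `with open(...)` closes the file automatically, even if an error occurs halfway through writing.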

Perhaps the web-crawler doesn't only move forward but can turn back (return to an older link) and start a new path. How would you record this journey?
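
One possible way to think about turning back is to keep the current path on a stack: when a page has nothing new to offer, pop it and return to the previous page. The sketch below does this on a tiny made-up link graph and records every move; `TOY_WEB`, `crawl_with_backtracking` and the `"visit"` / `"back"` labels are assumptions for this example.

```python
# Hypothetical link graph standing in for real pages: page -> links found on it
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

def crawl_with_backtracking(seed, max_visits):
    journey = [("visit", seed)]
    visited = {seed}
    stack = [seed]  # the path from the seed to the current page
    while stack and len(visited) < max_visits:
        current = stack[-1]
        unvisited = [link for link in TOY_WEB.get(current, []) if link not in visited]
        if unvisited:
            next_page = unvisited[0]  # a real crawler might choose at random here
            visited.add(next_page)
            stack.append(next_page)
            journey.append(("visit", next_page))
        else:
            stack.pop()  # nothing new here: turn back
            if stack:
                journey.append(("back", stack[-1]))
    return journey

for move, page in crawl_with_backtracking("A", 10):
    print(move, page)
```

Recording the moves as pairs keeps the forward steps and the returns distinguishable when you read the journey back later.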

In [41]:
#web scraper homework

import nltk

url = "https://www.thesun.co.uk/"

def soupify(url): # creates soup object out of url
    page_content = requests.get(url).content
    soup = BeautifulSoup(page_content, "html.parser")
    return soup


def find_links(soup): # finds all links connected with url on page
    page_links = soup.find_all("a")
    href_links = []
    for i in page_links:
        try:
            if i["href"] in href_links:
                pass
            else:
                href_links.append(i["href"])
        except KeyError:  # this <a> tag has no href attribute
            pass
    href_links = [link for link in href_links if link.startswith(url)]
    return href_links


def find_titles(soup): # looks through links for titles
    words = nltk.tokenize.word_tokenize(soup.text)
    titles = [word for word in words if word.istitle()]
    return titles

def get_words(soup):
    words = word_tokenize(soup.text)
    words_filtered =  [word.lower() for word in words if word.isalpha()]
    pos_tagged = nltk.pos_tag(words_filtered)
    nouns = filter(lambda x:x[1]=='NN',pos_tagged)
    return nouns

soup = soupify(url)
main_links = find_links(soup)

all_titles = []
all_words = []
freq_file = open("sun_freq_dist.txt", "w")

print(len(main_links))

for i in range(len(main_links)):
    current_url = random.choice(main_links)
    soup = soupify(current_url)
    main_links.remove(current_url)
    words = get_words(soup)
    #all_titles += (find_titles(soup))
    all_words += words
    print(f"Visited: {current_url}")
    freq_file.write(current_url + "\n")

fdist = nltk.FreqDist()  # FreqDist is reached through the nltk import above

for word in all_words:
    fdist[word] += 1

for i in fdist.most_common():
    freq_file.write(str(i) + "\n")
freq_file.close()






170
Visited: https://www.thesun.co.uk/shopping/26229461/save-15-off-hotel-chocolat-gift-mothers-day/
Visited: https://www.thesun.co.uk/tv/tv-news/
Visited: https://www.thesun.co.uk/fabulous/real-life/
Visited: https://www.thesun.co.uk/news/26431412/search-operation-ufo-norway-lake-roros/
Visited: https://www.thesun.co.uk/motors/26424928/jaguar-ispace-m62-lost-control-crash/
Visited: https://www.thesun.co.uk/travel/26438314/butlins-all-inclusive-breaks-seaside-holiday-parks/
Visited: https://www.thesun.co.uk/sport/26439912/arsenal-sweden-kristoffer-olsson-brain-clots-update/
Visited: https://www.thesun.co.uk/tvandshowbiz/26431774/katie-price-hits-back-boyfriend-jj-slater/
Visited: https://www.thesun.co.uk/news/26440578/rookie-cop-court-affair-claims-criminal/
Visited: https://www.thesun.co.uk/news/politics/
Visited: https://www.thesun.co.uk/money/26401486/natwest-buy-now-pay-later-amex-credit-card-closing/
Visited: https://www.thesun.co.uk/tv/reality/
Visited: https://www.thesun.co.uk/s