**PySDS Week 3 Lecture3. V.1**
Last author: B. Hogan

# Week 3. Day 3. Pseudocode and Web data access  

Learning goals
- Understand the function of pseudocode. 
- Be able to use informal pseudocode
- Understand basic text scraping
- Understand how to build a basic spider 

# Section 1. Pseudocode and pseudo-pseudocode. 

Pseudocode is a means by which we articulate what we want to do with code without being too careful syntactically. It's about clearing away the specifics or abstracting them from the code. Often, well-written pseudocode can translate very easily into running code. But the purpose is to get a sense of how our code is going to run in general.

## Pseudo-pseudocode?

Pseudocode is not quite computer code. But it is often written in a format that is close to formal. Have a look at Springer's LNCS books (Lecture Notes in Computer Science) to see that even the pseudocode itself is quite formal. Here is an example below: 

Now there's no such thing, really, as pseudo-pseudocode. But here I want to suggest that pseudocode varies in its syntactic clarity. More formal pseudocode uses specific mathematical symbols or follows the general syntax of a specific language. More informal pseudocode is simply a set of instructions, written in an inconsistent or conversational style. This is not a bad thing. The function of pseudocode is to help you organize your thoughts. If you are trying to organize your thoughts and writing it in a certain way helps, then don't fret over its formality. However, when you go to share this with someone else, the more formal it is, the less likely that there will be ambiguity about what you meant. 
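As a small illustration of this range of formality, here is a sketch of one task (finding the largest number in a list) written first as informal pseudocode, then as more formal pseudocode, then as Python. The task and the name `find_largest` are chosen purely for illustration and are not part of the lecture material.

```python
# Informal pseudocode (conversational):
#   look at each number in the list, remember the biggest one so far,
#   and at the end hand back whatever we remembered
#
# More formal pseudocode:
#   largest <- first element of numbers
#   FOR EACH x IN numbers:
#       IF x > largest THEN largest <- x
#   RETURN largest

def find_largest(numbers):
    """Return the largest item in a non-empty list (hypothetical example)."""
    largest = numbers[0]
    for x in numbers:
        if x > largest:
            largest = x
    return largest

print(find_largest([3, 9, 2, 7]))
```

Notice that the formal version maps almost line for line onto the Python, while the informal version leaves details such as the starting value unstated.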

Below we will have two exercises in pseudocode. In the first, I'll give you some code and you write the pseudocode. Then vice versa: I'll give some pseudocode and you write the code. 

In [None]:
# Cleaning out the vowels - Working Code: 
# 
def vowelSeparator(text):
    if not isinstance(text, str):
        return (None,None)
    
    in_vowels = []
    in_consonants = []
    
    for i in text:
        if not i.isalpha():
            continue
        if i.lower() in 'aeiouy':  # lower() so that capital vowels count too
            in_vowels.append(i)
        else:
            in_consonants.append(i)
    return (in_vowels,in_consonants)

print( vowelSeparator("help" )  )
print( vowelSeparator("help %£$% erwerwr" )  )
print( vowelSeparator(4))

In [None]:
# The fibonacci sequence: Working Code
#
# A mathematical sequence found in nature, 
# such as in the shape of a nautilus shell.
# See: https://en.wikipedia.org/wiki/Nautilus#Shell

def getFibonacci2(n=10):
    '''Return the first n numbers of the Fibonacci sequence
    
    Notes: This version uses two counter variables.'''
    n1 = 1
    n2 = 1
    out_list = [n1,n2]
    if n <= 2:
        return out_list[:max(n, 0)]  # so that n=1 returns one number, not two
    
    for i in range(2,n):
        out_list.append(n1 + n2)
        n2 = n1 + n2
        n1 = n2 - n1 # This tricky line is because we have already
                     # modified the value of n2. To give n1 the old
                     # value of n2, we subtract the n1 that we just
                     # added to n2.
    return out_list

def getFibonacci3(n=10):
    '''Return the first n numbers of the Fibonacci sequence
    
    Notes: This version uses three counter variables.'''

    n1 = 1
    n2 = 1
    n3 = None
    out_list = [n1,n2]
    if n <= 2:
        return out_list[:max(n, 0)]  # so that n=1 returns one number, not two
    
    for i in range(2,n):
        n3 = n1 + n2
        out_list.append(n3)
        n1 = n2  
        n2 = n3 
        
    return out_list


getFibonacci3(7)
        

See the [Wikipedia page](https://en.wikipedia.org/wiki/Pseudocode) for some clear examples of pseudocode across languages. In fact, they present the 'fozzie bear' (or in actuality, the 'fizz buzz') program. Notice the comparative pseudocode for this algorithm that takes into account some of the ways that different languages handle the same algorithm. 
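For comparison with the pseudocode versions on that page, here is a minimal Python sketch of fizz buzz. The function name `fizz_buzz` and the choice to return a list of strings are assumptions made for this example; other formulations are just as valid.

```python
def fizz_buzz(limit=15):
    """Return the fizz buzz sequence from 1 to limit as a list of strings."""
    out = []
    for i in range(1, limit + 1):
        if i % 15 == 0:        # divisible by both 3 and 5
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out

print(fizz_buzz(15))
```

The order of the tests matters: the check for 15 has to come first, otherwise multiples of 15 would be caught by the check for 3.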

# Section 2. Basic skills for spiders and scrapers

One of the goals for today will be to consider web scrapers and web spiders.

- **Web scraper**: A program for taking in a page from the web in HTML and extracting the important details. To scrape a page is often to create meaningful variables in a data structure other than HTML. Web scraping is a central component of the web. It is how search engines know what is on a page and thus when to present that page as an option in search results.   

- **Web spider/web crawler**: A program for navigating the network structure of a part of the web. It begins with a seed set of $n \geq 1$ pages and then looks for the relevant URLs on each page. For each of the URLs it will download that page and repeat this process. Crawlers range in complexity from the one we will build today to...Google. Indeed, Google started as a crawler that indexed the web, like Hotbot, Lycos, Yahoo and Altavista before it. Interestingly, Google initially succeeded because it took into account the structure of the web as learned from its crawler. This was done through the [PageRank algorithm](https://en.wikipedia.org/wiki/PageRank).


## Section 2.1 The web scraper

Web scrapers are extremely bespoke. In the paper, [Hogan and Berry (2010)](http://comprop.oii.ox.ac.uk/wp-content/uploads/sites/37/2011/06/Hogan-Berry_City_and_Community_Craigslist.pdf), I had to actually rebuild my scraper a few times, as Craigslist would change how the page was laid out, thus necessitating a new scraper. It is worth pointing out that scrapers are among the most fragile parts of data science. Much of the semantics of webpages as seen by the user comes from a combination of layout, order, section and other features that are obvious when viewing the page, but hard to see programmatically. Modern scrapers vary from the super basic 'find all the links' variety to ones that also read `<meta>` tags. 

There are a variety of web scrapers out there as packages to be used. 

- ```html```: The basic one would be the Python ```html``` package (in particular ```html.parser```), which can read HTML pages, return tags and hence navigate the HTML structure. It's like a simplified BeautifulSoup.
- ```beautifulsoup```: the generic markup parser that we have already seen for xml. In the demo below we will see it in use on html pages. 
- ```scrapy```: Another python package for scraping pages. I've not used it.  
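To make the idea of scraping concrete, here is a minimal sketch using only the standard library's `html.parser`. It takes a small HTML string and pulls out the title and the links into a dictionary. The class name `TitleAndLinks`, the function `scrape` and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Collect the page title and all href values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            for key, value in attrs:
                if key == "href" and value is not None:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self.in_title:
            self.title += data

# Made-up page used instead of a live download
sample_html = """
<html><head><title>Example page</title></head>
<body><a href="/about">About</a> <a href="https://example.org/">Elsewhere</a></body>
</html>
"""

def scrape(html_text):
    parser = TitleAndLinks()
    parser.feed(html_text)
    return {"title": parser.title, "links": parser.links}

print(scrape(sample_html))
```

Even in this tiny case the scraper depends on where things sit in the page, which hints at why scrapers break when a site changes its layout.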

## Example scraping: Twitter Replies

Twitter does not easily facilitate the collection of replies to tweets (so far as I know; this information changes quite regularly). But we can get the replies (at least the first set thereof*) from the webpage. The main Twitter webpage is quite busy, so we can use a simpler version of the page at http://mobile.twitter.com/.  

Below is the code to download a Twitter mobile page. It will download the page of replies to a tweet I sent. You'll see that this works, but as I walk through the example, it will become clear that this is not always efficient and that it can be very fragile. 

In class we will discuss how to put this into a series of functions so that we can use it more readily. 

In [None]:
from bs4 import BeautifulSoup
import urllib.request 
import pandas as pd
import re

username = "blurky"
tweet_id = 1054644948031692800
url = "https://mobile.twitter.com/%s/status/%s" % ( username, tweet_id)

req = urllib.request.Request( url, headers={'User-Agent': 'OII SDS class 2018.1/Hogan'})
infile = urllib.request.urlopen(req)
text = infile.read()
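# Decode the bytes and turn the two-word class "timeline replies"
# into a single class name so it is easy to search for below.
text = text.decode("utf-8").replace("timeline replies","timeline_replies")
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly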

reply_text = []
reply_ids = []
usernames = []

x = soup.find_all(class_ ="timeline_replies")
y = x[0].find_all(class_ ="tweet-text")
for i in y:
    j = i.find_all(class_="dir-ltr")
    reply_text.append(j[0].text)
    reply_ids.append(i["data-id"])

for i in reply_ids: 
    # Raw string so that \w reaches the regex engine unchanged
    user_id_re = re.compile(r'href="/\w*/reply/%s' % i)
    user_text = user_id_re.findall(text)
    if len(user_text) >  0:
        usernames.append(user_text[0].split("/")[1])
        
columns = ["reply_text","tweet_id","username"]
if len(reply_text) == len(reply_ids) == len(usernames):
    reply_df = pd.DataFrame(list(zip(reply_text,reply_ids,usernames)),columns=columns)
else:
    print("mismatched series -  check parser")
    reply_df = pd.DataFrame(columns=columns)  # empty, so the next line still runs
    
reply_df

## Section 2.2: Web Crawlers 

Web crawlers need to start from somewhere. The least imaginative way would be to do an IP scan. With roughly 4.3 billion possible IPv4 addresses, many of which are not even available, it makes sense to be a bit more focused.
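As a rough check on the scale involved, the size of each address space follows directly from the address length: 32 bits for IPv4 and 128 bits for IPv6. The scan rate of one million probes per second below is an assumption picked only to make the arithmetic concrete.

```python
ipv4_addresses = 2 ** 32    # IPv4 addresses are 32 bits long
ipv6_addresses = 2 ** 128   # IPv6 addresses are 128 bits long

probes_per_second = 1_000_000          # assumed scan rate, purely illustrative
seconds_per_year = 60 * 60 * 24 * 365

print(f"IPv4: {ipv4_addresses:,} addresses")
print(f"IPv4 scan time: {ipv4_addresses / probes_per_second / 3600:.1f} hours")
print(f"IPv6: {ipv6_addresses:.3e} addresses")
print(f"IPv6 scan time: {ipv6_addresses / probes_per_second / seconds_per_year:.3e} years")
```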

Instead of iterating through IP addresses (which is not even thermodynamically possible with the introduction of IPv6) we should start with a seed set of pages. For each page, we would find all the URLs on the page (or at least many of the URLs) and then follow each one in turn. Beyond that we will see in the pseudocode how to do it. Recall that this is going to be pretty basic. To do things a bit more complex you will want to leverage pre-built Python packages that abstract some of these details away. Some popular packages are:
- ```Selenium```: This is a browser controller. It allows Python to click on buttons and controls in a browser such as Firefox as if it were a user. You can interact with a page, log in, get things behind password protection, submit data on forms, etc. But tweaking this is going to be a real nuisance. The good news is that, because it drives a real browser, it can execute JavaScript on the page, and the browser can also be run 'headless' (without a visible window). 
- ```Mechanize```: This was originally a module for the ```perl``` language. It was ported over to Python over a decade ago, and the two are maintained separately. Mechanize doesn't interface with a browser so much as set up an instance of a browser in Python. When this lecture was written it was still in use but not for Python 3, so it was probably not the best choice here. 
- ```MechanicalSoup```: As you might guess from the nod to BeautifulSoup in its name, it is a package for interacting with pages like a browser. It is meant to be the successor to Mechanize for Python 3 and uses BeautifulSoup to navigate pages. 

Bear in mind that these packages are not, properly speaking, web crawlers. They only facilitate the process of crawling the web. In no case do these packages do the crawling out of the box. So, we will show the crawling and then you can explore this yourself later. 
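Before touching the network, the crawling logic (start from a seed set, collect the links on each page, follow each new one in turn) can be sketched on a toy 'web' held in a dictionary. Everything here is hypothetical: the URLs, the `toy_web` data and the `crawl` function are made up for illustration and are not the spider used later in this section.

```python
from collections import deque

# A toy "web": each URL maps to the URLs it links to (made-up data).
toy_web = {
    "a.example/": ["a.example/1", "a.example/2"],
    "a.example/1": ["a.example/2", "a.example/3"],
    "a.example/2": ["a.example/"],
    "a.example/3": [],
}

def crawl(seed_set, get_links, max_pages=10):
    """Breadth-first crawl: visit each page once, in order of discovery."""
    seen = set(seed_set)          # every URL we have ever queued
    to_visit = deque(seed_set)    # URLs still waiting to be visited
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                to_visit.append(link)
    return visited

print(crawl(["a.example/"], lambda url: toy_web.get(url, [])))
```

Passing `get_links` in as a function keeps the crawl logic separate from the downloading, so the same loop could later be given a function that fetches real pages.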

In [None]:
# Class file: from https://saintlad.com/make-a-web-crawler/
# Tweaks by Bernie Hogan
# All comments by Jake Kovoor unless prefixed by BH. 

from html.parser import HTMLParser  
from urllib.request import urlopen  
from urllib import parse
import time

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.saintlad.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.saintlad.com/somepage.html
                    
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links
                    # (unless the URL contains "/study", which we skip):
                    if "/study" in newUrl:
                        continue
                    self.links.append(newUrl)
                    
    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example)
        # BH: I changed this to text/html in rather than == text/html, since 
        #     some pages have text/html; encoding=utf-8.
        # (The "or ''" guards against a response with no Content-Type header.)
        if 'text/html' in (response.getheader('Content-Type') or ''):
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            # errors="replace" avoids a crash on pages that are not valid UTF-8
            htmlString = htmlBytes.decode("utf-8", errors="replace")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "",[]

In [None]:
# Spider by Bernie Hogan based on pseudocode above
# Extra features added to help with crawling 
# (i.e. pseudocode is not complete)
def spider(seed_set,stop_word,max_pages,sleep=0.1): 
    page_count = 0
    pages_with_word = []
    pages_without_word = []
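    # Build the collections inside the try block, so that a non-iterable
    # first argument is caught here rather than raising an error.
    try: 
        all_pages = set(seed_set)
        pages_to_visit  = list(all_pages)
    except TypeError: 
        print("Spider expects a collection of URLs as first argument.")
        return (None, None)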

    parser = LinkParser()
    
    while pages_to_visit and page_count < max_pages:
                
        url = pages_to_visit.pop(0)  # Take the first page off the list
        page_count += 1

        data, links = parser.getLinks(url)

        if stop_word in data:  # only follow links from pages containing the word
            for i in links: 
                if i not in all_pages:
                    all_pages.add(i)
                    pages_to_visit.append(i)
            pages_with_word.append(url)
        else: 
            pages_without_word.append(url)

        time.sleep(sleep)
    return (pages_with_word,pages_without_word)

In [None]:
withwordlist, withoutwordlist = spider(["https://www.oii.ox.ac.uk/people"],"network",10)
print(withwordlist)
print(withoutwordlist)

## Note on best practices 

Search engine crawlers tend to avoid certain areas of a website at the site's request. These requests are listed in a specific file on the site called ```robots.txt```, which sits immediately under the domain name. These files vary a lot but tend to ask crawlers not to use the site's search function or scrape private information. Just think of a major site and check it out for yourself! 

- [Instagram](https://www.instagram.com/robots.txt)
- [BBC](https://www.bbc.co.uk/robots.txt)
- [Yahoo](https://www.yahoo.com/robots.txt)
- [Superbad](http://superbad.com/robots.txt) (My browser start page - it's fun!)
- [Reddit](https://www.reddit.com/robots.txt)
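
A crawler can check these rules in code before fetching a page. Below is a minimal sketch using the standard library's `urllib.robotparser`. The `robots.txt` content, the `example.com` URLs, the agent name `ExampleBot` and the helper `allowed` are all made up for illustration; the rules are parsed from a string so that the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, not copied from any real site.
robots_txt = """
User-agent: *
Disallow: /search
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def allowed(url, agent="ExampleBot"):
    """Return True if the parsed rules let this agent fetch the URL."""
    return rp.can_fetch(agent, url)

print(allowed("https://example.com/people"))
print(allowed("https://example.com/search?q=network"))
```

For a live site you would instead point the parser at the real file with `rp.set_url(...)` followed by `rp.read()`, and then call `can_fetch` before each download.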