# Parsing / Scraping Web Data With Python

This notebook provides a brief overview of scraping web data, parsing it, and doing rudimentary analyses. 
It's focused on pulling text data.

Note: This quick overview does assume a fair amount of prior knowledge. In particular:
- HTML knowledge. This assumes some level of awareness of how webpages work and are organized. 
    - If you want to know a bit about this, check out: http://www.htmldog.com
- Python knowledge. Here, I assume some awareness of basic Python / programming
    - If you want to learn more about Python, try the 'Python_Basic' or 'Python_Advanced' notebooks

In [7]:
# Import Web Data Packages
import lxml                        # LMXL is a package for processing HTML/XML data - http://lxml.de
import requests                    # Requests is a package for getting webpage data - http://docs.python-requests.org/
from bs4 import BeautifulSoup      # BeautifulSoup is a package for pulling data out of HTML/XML 
                                   #    http://www.crummy.com/software/BeautifulSoup 

# Import a function from natural language toolkit to use on our web-scraped text
# This function will be useful, but not explored, as it is not the focus of this Notebook
from nltk import FreqDist

In [13]:
# Grab a webpage. The requests 
# Let's analyze the Beautiful Soup Documentation Page
#webpage = requests.get("http://www.crummy.com/software/BeautifulSoup/bs4/doc/")

In [14]:
# Parse the webpage
webpage_soup = BeautifulSoup(webpage.content, 'lxml')

In [15]:
# We can pull things out of the webpage, for example the title, 
print webpage_soup.title

<title id="pageTitle">Facebook - Log In or Sign Up</title>


In [6]:
# We can search through different tags on the website, and see what they contain.
# For example, we can collect a list of all the links used on the website.
list_of_links = list()
for link in webpage_soup.find_all('a'):
    list_of_links.append(link.get('href'))
    
# Print the first few entries in our list of links
for i in range(0,6):              # We could also do 'print list_of_refs[0:6]', but that wouldn't add new lines between entries
    print list_of_links[i]

genindex.html
#
#beautiful-soup-documentation
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
#porting-code-to-bs4


In [7]:
# Notice that these links include internal and external links. 
#    '# something' - is a link to elsewhere on the same HTML webpage
#    'http:...'    - is a link to a different HTML webpage (to a different website)

# Let's only print out external links
for link in webpage_soup.find_all('a'):
    if 'http' in link.get('href'):
        print(link.get('href'))

http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
http://kondou.com/BS4/
http://coreapython.hosting.paran.com/etc/beautifulsoup4.html
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
http://www.crummy.com/software/BeautifulSoup/download/4.x/
http://lxml.de/
http://code.google.com/p/html5lib/
http://example.com/elsie
http://example.com/lacie


In [8]:
# We can also try to look at the text that is on a webpage
# There is the '.get_text()' command that is useful for this
webpage_text = webpage_soup.get_text()

# 'get_text' returns unicode for all the text it found. This could be (and is in this case) a lot of text.
print 'Data type of our stored text is: ', type(webpage_text)
print 'Number of characters in our text is: ', len(webpage_text) 

# Let's have a peak at what our text looks like. Let's print the first couple hundred characters.
webpage_text[0:200]

Data type of our stored text is:  <type 'unicode'>
Number of characters in our text is:  68252


u"\n\n\nBeautiful Soup Documentation \u2014 Beautiful Soup 4.4.0 documentation\n\n\n\n      var DOCUMENTATION_OPTIONS = {\n        URL_ROOT:    './',\n        VERSION:     '4.4.0',\n        COLLAPSE_INDEX: false,\n    "

In [9]:
# Notice above that 'get_text()' called on the whole webpage gets a lot of text we don't want. 
# To try to focus in on the text we want, let's focus in on 'p' tags in the HTML

# Get a list of all the paragraph tags in our web page
webpage_paragraphs = webpage_soup.find_all('p')

# Loop through all the paragraphs, collecting the text contained inside them. 
all_text = ''                              # Initialize a variable to collect all the text in
for p in webpage_paragraphs:
    all_text = all_text + p.get_text()
  
# Now let's again look at what we have. Let's print the type, length, and a couple hundred characters of what we pulled out.
print 'Type of all_text is: ', type(all_text)
print 'Length of all text is: ', len(all_text)
print ''
print all_text[0:200]

Type of all_text is:  <type 'unicode'>
Length of all text is:  36180

Beautiful Soup is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It c


In [10]:
# Again, we have a long piece of unicode, but what it contains looks a lot more like what we want!

# Notice that right now we just have continuous unicode. Really we're interested in words. 
# It would be much easier to work with a list of words. 

# Let's split up the unicode variable we have into words
words = all_text.split()

# Let's have a look at what we have
print type(words)
print words[0:30]

# Note that we have a list of words. The 'u' before each item means they are unicode variables. That is fine for now. 

<type 'list'>
[u'Beautiful', u'Soup', u'is', u'a', u'Python', u'library', u'for', u'pulling', u'data', u'out', u'of', u'HTML', u'and', u'XML', u'files.', u'It', u'works', u'with', u'your', u'favorite', u'parser', u'to', u'provide', u'idiomatic', u'ways', u'of', u'navigating,', u'searching,', u'and', u'modifying']


In [11]:
# We can now start analyzing our extracted data!
# Let's try by simply looking for the most frequency words on this webpage. 
# We will use the FreqDist function from nltk to do this. 

# FreqDist calculates the frequency distribution for all the words we give it. 
word_freq = FreqDist(words)
# most_common will plot the n most common words
word_freq.most_common(5)

[(u'the', 398), (u'a', 229), (u'of', 131), (u'you', 117), (u'to', 106)]

In [12]:
# Unsurprisingly and uninterestingly, the most common words on this webpage are words like 'the' and 'a'. 
# That doesn't actually tell us very much. Let's try to get rid of those words to see the more interesting words.

# First let's make a list of words we don't think are interesting
bad_words = ('a', 'by', 'from', 'which', 'such', 'has', 'also', 'to', 'and', 'be', 'into', 'the', 'you', 'will', 
             'this', 'have', 'use', 'of', 'is', 'in', 'that', 'can', 'it', 'or', 'as', 'for', 'an', 'but', 'not', 
             'all', 'on', 'are', 'with','at', 'was','its', 'than', 'been', 'used', 'using', 'through')

# Now, let's loop through our list of words, only keeping the words if they are not in our list of bad words. 
words_cleaned = list()
for word in words:
    # If the word is not in our list of bad words, we add it to our new list of good words
    # Note that we are forcing all words to be all lowercase (with '.lower())
    if word.lower() not in bad_words:
        words_cleaned.append(word.lower())

# NOTE: usually, you will not make your own list of bad words, but rather use tools like nltk to do this for you. 
# Check out the notebook on 'Text Mining', which covers nltk, for more information about this.

In [232]:
# Now lets look at which words are the most common in the our cleaned scraped data
word_freq = FreqDist(words_cleaned)
word_freq.most_common(5)

# This is a bit more interesting. 
# The most common words are the package names, and related words like 'tag', 'document' and 'string'
# There is a ton more we can do with this, but that's text mining - not the main topic here. 

[(u'Soup', 106),
 (u'Beautiful', 104),
 (u'tag', 89),
 (u'document', 71),
 (u'string', 48)]

#### Bigger Analyses

Already we have the tools to do some kinds of basic analysis of web data, that could be informative. 

If I had some other information, such as website traffic data, or user ratings, I could already ask questions like:
- Does the number of internal links correlate with traffic? With user ratings? What about the number of external links?

This would require searching through multiple webpages. We could try and analyze package tutorial webpages, and try to 
see if we can predict something about their use (traffic or ratings) from their composition. 

So far we have looked 

More likely, we want to do some sort of scraping - such as pulling data off multiple webpages for some kind of further analysis. 
Let's look at something like that. 

Let's say I want to analyze something about the way different programming languages are discussed online. 

I decide on some subset of languages, and decide I'm going to do a pilot study only looking at Wikipedia pages.

In [249]:
## Set up things for web scraping wikipedia pages for programming languages

# URL for wikipedia pages
wiki_url = 'https://en.wikipedia.org/wiki/'

# A dictionary including languages of interest (python, matlab, R, etc.) and the URL extension for their wiki page
languages = {'Python': 'Python_(programming_language)', 'Matlab': 'MATLAB', 'R': 'R_(programming_language)', 
             'Java':'Java_(programming_language)', 'SQL': 'SQL', 'JavaScript':'JavaScript'}

In [265]:
# Loop through each of the languages in the dictionary and pull some data, do some basica analyses
for language in languages:
    
    ## Get the website data
    # Get the full URL for languages wiki page
    language_url = wiki_url + languages[language]
    # Request the webpage
    language_wikipage = requests.get(language_url)
    # Parse the HTML with BeautifulSoup
    language_wikipage_soup = BeautifulSoup(language_wikipage.content, 'lxml')
    
    ### Do some analyses
    
    ## We could get all the links from the pages, like we did before, to analyze if we want
    # Initialize lists to store the links
    internal_links = list()
    citations = list()
    external_links = list()
    other_links = list()
    # Loop through all the links on the page
    for link in language_wikipage_soup.find_all('a'):
        # Only examine link if it has a 'href', meaning it actually links somewhere
        if link.get('href'):
            # If 'wiki' is in the href, it points to a wikipedia page. Collect these.
            if 'wiki' in link.get('href'):
                internal_links.append(link.get('href'))
            # If 'cite' is in the href, it is a citation link. Collect these. 
            elif 'cite' in link.get('href'):
                citations.append(link.get('href'))
            # If 'http' is in the href, it is some kind of external link. Collect these. 
            elif 'http' in link.get('href'):
                external_links.append(link.get('href'))
            # Collect any other links
            else:
                other_links.append(link.get('href'))
            
    # Get counts of each of the different link types
    n_links_external = size(external_links)
    n_links_internal = size(internal_links)
    n_citations = size(citations)
    n_other_links = size(other_links)
    
    ## Analyze some of the text
    # Get a list of all the paragraph tags in our web page
    language_wiki_paragraphs = language_wikipage_soup.find_all('p')
    # Loop through all the paragraphs, collecting the text contained inside them. 
    all_text = ''
    for p in language_wiki_paragraphs:
        all_text = all_text + p.get_text()
    # Split up the words
    words = all_text.split()
    
    # Remove uninteresting words
    bad_words = ('a', 'by', 'from', 'which', 'such', 'has', 'also', 'to', 'and', 'be', 'into', 'the', 'you', 'will', 
                 'this', 'have', 'use', 'of', 'is', 'in', 'that', 'can', 'it', 'or', 'as', 'for', 'an', 'but', 'not', 
                 'all', 'on', 'are', 'with', 'at', 'was', 'its', 'than', 'been', 'used', 'using', 'through')
    words_cleaned = list()
    for word in words:
        if word.lower() not in bad_words:
            words_cleaned.append(word.lower())
    
    # Calculate word frequency
    word_freq = FreqDist(words_cleaned)
    
    ## Print out Results
    print '\n', language, '\n'
    
    print word_freq.most_common(7)

# Note we didn't print all the things we extracted, and we only extracted a small part of the pages we pulled. 
# There is a lot more we could do here, for example, comparing some how well linked (internally and externally) 
# the different languages are. Could a better linked wiki page be indicative of a better supported language?


Java 

[(u'java', 122), (u'class', 36), (u'method', 27), (u'sun', 19), (u'implementation', 16), (u'memory', 15), (u'public', 15)]

Python 

[(u'python', 75), (u'language', 21), (u'programming', 19), (u'standard', 15), (u'c', 13), (u'languages', 11), (u"python's", 10)]

JavaScript 

[(u'javascript', 85), (u'web', 32), (u'code', 19), (u'language', 18), (u'browsers', 16), (u'browser', 15), (u'netscape', 14)]

R 

[(u'r', 40), (u'data', 15), (u'statistical', 11), (u'r,', 7), (u'packages,', 6), (u'programming', 6), (u'packages', 6)]

Matlab 

[(u'matlab', 39), (u'function', 10), (u'other', 9), (u'value', 8), (u'programming', 7), (u'functions', 7), (u'array', 6)]

SQL 

[(u'sql', 58), (u'data', 25), (u'relational', 18), (u'standard', 16), (u'query', 13), (u'value', 12), (u'database', 12)]


#### Wrapping Up

The analyses we ran above are toy analyses (they aren't going to tell us anything amazing), but if you know something about these languages, some of this might look reasonable:
- R is a language known for statistical analysis of data
- Java calls it's functions 'methods', has classes etc.
- JavaScript is often used on the web / in browsers
- SQL is a databasing langauge, with tools to query the database
- and so on...

There are a lot more programming languages out there. Maybe we could cluster there main uses and attributes by doing a this kind of web crawling and text analysis at a bigger scale? We could use something like word clouds to visualize the data. Or, if we stumble upon a new language, maybe the quickest way to get a sense of what it does is to run this on it's wiki page. 


The basic tools to get data and start looking at it are here, but we've skipped quite a lot of important steps 
Things to update for a real analyses:
- Saving the web data:
    - We probably don't want to scrape and analyze all together, but rather scrape the data, save it nicely, and then analyze.
- Data cleaning:
    - Ensuring the quality of the data is really important, something we did not do here. Notice that in some of our most popular words, having punctuation is causing duplicate words, and that is not how we want it to work. 
- Doing a better (real) analyses, in this case probably some kind of textual analysis:
    - For example, what if we did a sentiment analysis on the wiki pages, and tried to predict language popularity?
- Getting other data:
    - Here we just scraped text from websites. In practice, you might want to scrape other types of data, for example, polling a website periodically to extract something like stock data. Or pulling pictures to train your computer vision algortihm with. Or monitoring websites to send you an alert when something changes. You might also be visiting websites that the scraper would have to log into with some kind of credentials that you prodive the script with. There is a lot more one could do with web scraping, but at it's core calling a website, getting the data, and parsing the parts you need like we did here is the basis of getting data from the web. 