# Crawl web pages

*If all else fails, you can crawl a website*

* crawl only when the information is not available through an API
* beware of copyright restrictions on the content you collect
* don't send requests too often; pause between them (e.g. `time.sleep(2)`)
* use a library that will clean up and parse the HTML for you
* beware that the structure of the web page might change at ANY time (unlike APIs, which are generally stable)
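
The advice about not crawling too often can be sketched as a small loop that pauses between consecutive requests. This is a minimal illustration, not a complete crawler: `fetch_politely` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` stands in for a real downloader (for instance a function wrapping `requests.get`) so that the sketch runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # wait before every request except the first one
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    # hypothetical stand-in for a real download; returns a dummy page
    return "<html>" + url + "</html>"


pages = fetch_politely(
    ["http://example.com/a", "http://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,  # kept short for the demo; use a couple of seconds on real sites
)
print(sorted(pages))
```

Passing the downloader in as an argument keeps the delay logic separate from the HTTP library, so the same loop can be reused with any fetching function.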


### Simple example with BeautifulSoup HTML parser

In [1]:
from bs4 import BeautifulSoup # pip install beautifulsoup4

html_sample_doc = """
<html><head><title>The Dormouse's story</title></head>
  <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_sample_doc, 'html.parser') # does all the parsing

for link in soup.find_all('a'):                      # searching for a particular tag
    print(link.get('href'))                          # get properties

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
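
For comparison, the same link extraction can be sketched with only the Python standard library. This is an alternative to the BeautifulSoup example above, not a replacement for it: `html.parser.HTMLParser` does not repair messy markup the way a dedicated library tends to, and `LinkCollector` and `extract_links` are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p>"
)
print(extract_links(sample))
```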


----

### 2nd Example: PubMed Similar Articles

In this case, we want to extract the links from PubMed's "Similar articles" box in the right-hand sidebar.

![pubmed_similar](pubmed_similar.png)

Since the sidebar is loaded with an additional asynchronous (JavaScript) request, one needs to:

* maintain a web-scraping session using `requests.Session`
* parse the URL that is used to load the sidebar
* follow that link and get the links from the `div` with `class="portlet_content"`
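
The steps above depend on turning the `href` values found in the page, which are often relative, into full URLs. A minimal sketch of that resolution step with `urljoin` follows; in Python 3 the function lives in `urllib.parse`. The helper name `resolve_links` and the sample `href` values are assumptions made for illustration.

```python
from urllib.parse import urljoin

base_url = "http://www.ncbi.nlm.nih.gov"


def resolve_links(base, hrefs):
    """Resolve each href against the base URL; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


resolved = resolve_links(
    base_url,
    [
        "/pubmed/25087184",          # site-relative link
        "http://example.com/elsie",  # already absolute
    ],
)
for link in resolved:
    print(link)
```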

In [2]:
from urllib.parse import urljoin # Python 3 (in Python 2: from urlparse import urljoin)
from bs4 import BeautifulSoup
import requests # pip install requests

base_url = 'http://www.ncbi.nlm.nih.gov'
website = base_url + '/pubmed/?term=mtap+prmt'

# parse the main page and grab the link to the side bar
session = requests.Session()
soup = BeautifulSoup(session.get(website).content, 'html.parser')

# this CSS selector can be found by inspecting the website source code
url = urljoin(base_url, soup.select('div#disc_col a.disc_col_ph')[0]['href'])

# parsing the side bar
soup = BeautifulSoup(session.get(url).content, 'html.parser')

for i, a in enumerate(soup.select('div.portlet_content ul li.brieflinkpopper a')):
    print(i, a.text, urljoin(base_url, a.get('href')))

0 The metabolite 5'-methylthioadenosine signals through the adenosine receptor A2B in melanoma. http://www.ncbi.nlm.nih.gov/pubmed/25087184
1 Down-regulation of methylthioadenosine phosphorylase (MTAP) induces progression of hepatocellular carcinoma via accumulation of 5'-deoxy-5'-methylthioadenosine (MTA). http://www.ncbi.nlm.nih.gov/pubmed/21356366
2 Quantitative analysis of 5'-deoxy-5'-methylthioadenosine in melanoma cells by liquid chromatography-stable isotope ratio tandem mass spectrometry. http://www.ncbi.nlm.nih.gov/pubmed/18996776
3 Review Protein arginine methylation of non-histone proteins and its role in diseases. http://www.ncbi.nlm.nih.gov/pubmed/24296620
4 Review Plant PRMTs broaden the scope of arginine methylation. http://www.ncbi.nlm.nih.gov/pubmed/22624881
5 Aryl Pyrazoles as Potent Inhibitors of Arginine Methyltransferases: Identification of the First PRMT6 Tool Compound. http://www.ncbi.nlm.nih.gov/pubmed/26101569
6 PRMT1 Is a Novel Regulator of Epithelial-Mesenchy