# Python Study Group (PSG)

Today we will review the Python package [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/ "BeautifulSoup Homepage").

1. First, we will apply this library to parse an [XML](https://en.wikipedia.org/wiki/XML "definition") file downloaded from [NCBI's BLAST program](https://blast.ncbi.nlm.nih.gov/Blast.cgi "BLAST Homepage").
2. Finally, we will see how coupling it with the [requests](http://docs.python-requests.org/en/master/ "Requests Homepage") library provides most of the tools required for building a generic web crawler.

### Setup


#### Get the Data

The data for this tutorial is available [here](https://github.com/ComBEE-UW-Madison/PythonStudyGroup.git "COMBEE GitHub").

You may clone the repo and follow along with the command:

```bash
git clone https://github.com/ComBEE-UW-Madison/PythonStudyGroup.git
```

Then navigate to the data directory inside the cloned COMBEE repo:

`cd /Path/to/PythonStudyGroup/2019Spring/data`


#### Pip installing BeautifulSoup and requests

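First, make sure the BeautifulSoup and requests libraries are installed with `pip`. BeautifulSoup is published on PyPI as `beautifulsoup4`, and the `xml` and `lxml` parsers used below also require the `lxml` package.

```bash
pip install --user beautifulsoup4
pip install --user lxml
pip install --user requests
```

The `--user` flag installs the libraries under your own `$HOME`, so superuser permissions are not required.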




In [None]:
# Import our BeautifulSoup object
from bs4 import BeautifulSoup
import requests
!ls

We have a BLAST output in XML format in our `$HOME/COMBEE/data` directory.

In [None]:
# fp = 'markup.xml'
fp = 'BLASTalignment.xml'
# Parse the BLAST XML output; the context manager closes the file for us
with open(fp) as fh:
    soup = BeautifulSoup(fh, features='xml')
print(type(soup))

In [None]:
dir(soup)
# [c for c in soup.children]
# soup

In [None]:
print(soup.prettify())

In [None]:
soup.BlastOutput_program.string

In [None]:
children = list()
for child in soup.children:
    children.append(child)

In [None]:
l = [('gene A',3), ('gene B',3), ('gene C',5)]
d = {elem[0].replace('gene ','transcript'):elem[1] for elem in l}
# d = {elem[0].replace('gene ','transcript') for elem in l}    
type(d)

In [None]:
children = [child for child in soup.children]
# print(children)
seq = soup.Hsp_hseq.text
# print(seq)
print(soup.Hit_accession.string)
print(soup.Hsp_identity.text)

In [None]:
# print(children)
tags = [tag for tag in soup.find_all(True)]
print(len(tags))
print(tags)

In [None]:
hits = [hit_acc.string for hit_acc in soup.find_all('Hit_accession')]
print(hits)
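
To try `find_all` without the full BLAST file, here is a minimal sketch on a hand-written XML snippet. The snippet, its accession values, and the names `snippet`, `demo_soup` and `hit_summary` are made up for illustration; only the tag names imitate the BLAST output used above. It assumes `lxml` is installed, since the `xml` parser depends on it.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written snippet that imitates the BLAST XML tag names
snippet = """
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <Hit>
    <Hit_accession>ACC_0001</Hit_accession>
    <Hit_hsps><Hsp><Hsp_identity>98</Hsp_identity></Hsp></Hit_hsps>
  </Hit>
  <Hit>
    <Hit_accession>ACC_0002</Hit_accession>
    <Hit_hsps><Hsp><Hsp_identity>75</Hsp_identity></Hsp></Hit_hsps>
  </Hit>
</BlastOutput>
"""

demo_soup = BeautifulSoup(snippet, features='xml')

# Pair each hit's accession with the identity count of its first HSP.
# find() searches nested tags too, so it reaches Hsp_identity inside Hit_hsps.
hit_summary = {
    hit.find('Hit_accession').get_text(): int(hit.find('Hsp_identity').get_text())
    for hit in demo_soup.find_all('Hit')
}
print(hit_summary)
```

Searching inside each `Hit` tag, rather than across the whole document, keeps every accession attached to its own alignment values.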

In [None]:
d = [c for c in soup.descendants]
len(d)
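
The difference between `.children` and `.descendants` is easier to see on a tiny document. The sketch below uses a made-up one-line markup and the built-in `html.parser` (an assumption chosen so it runs without `lxml`); the names `tiny`, `direct` and `everything` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup with no whitespace between the tags
tiny = BeautifulSoup('<hit><acc>ACC_0001</acc><len>42</len></hit>', 'html.parser')

# .children yields only the nodes directly below the soup object: <hit>
direct = list(tiny.children)

# .descendants walks the whole tree: <hit>, <acc>, its text, <len>, its text
everything = list(tiny.descendants)

print(len(direct), len(everything))
```

Text nodes count as descendants too, which is why the total is larger than the number of tags.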

### An example of a web-crawler (spider)

Here we use...

- requests
- random
- string
- multiprocessing
- BeautifulSoup 

to quickly request and parse HTML markup, retrieving the links for a recursive link explorer.

In [None]:
import string
import random

import requests as r

from multiprocessing import Pool
from bs4 import BeautifulSoup

In [None]:
# Uncomment to check a library's methods
# dir(random)
# dir(string)
# dir(r)
# dir(BeautifulSoup)

In [None]:
# Now we create a function to generate a random url such as http://abc.com
def random_url(url_length=3):
    # Random lowercase letters form the domain name (url_length characters)
    starting = ''.join(random.choice(string.ascii_lowercase) for _ in range(url_length))
    url = ''.join(['http://', starting, '.com'])
    return url

In [None]:
for i in range(10):
    url = random_url(i+2)
    print(url)

In [None]:
def handle_local_links(url, link):
    # Local links (e.g. '/about') need the site url prepended before they can be requested
    if link.startswith('/'):
        return ''.join([url, link])
    else:
        return link
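
`handle_local_links` covers links that start with `/`. Relative links such as `page2.html` pass through unchanged and cannot be requested on their own. As a possible alternative (not the approach this notebook uses), the standard library's `urllib.parse.urljoin` resolves these cases against the page the link was found on. The URLs and the names `base` and `resolved` below are made up.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found
base = 'http://abc.com/docs/index.html'

# A root-relative link, a relative link, and an absolute link
found = ['/about', 'page2.html', 'https://example.org/x']

# urljoin leaves absolute links alone and resolves the others against base
resolved = [urljoin(base, link) for link in found]
print(resolved)
```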

In [None]:
def get_links(url):
    try:
        # The timeout keeps an unresponsive host from blocking the worker
        response = r.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'lxml')
        # Hopes of avoiding navigation bars
        body = soup.body
        # href=True skips anchors without an href, which would otherwise yield None
        links = [link.get('href') for link in body.find_all('a', href=True)]
        links = [handle_local_links(url, link) for link in links]
        return links
    except Exception as e:
        # Any failure (unknown domain, timeout, page without a body) yields no links
        print(e)
        return []
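
The link-extraction step of `get_links` can be tried offline on a hand-written page. An `<a>` tag without an `href` makes `link.get('href')` return `None`; passing `href=True` to `find_all` is one way to skip such tags. The markup and the names `html`, `page` and `hrefs` are made up, and the built-in `html.parser` is used here only so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one local link, one absolute link, one anchor without href
html = """
<html><body>
  <a href="/about">About</a>
  <a href="http://xyz.com/page">Elsewhere</a>
  <a name="top">No href here</a>
</body></html>
"""

page = BeautifulSoup(html, 'html.parser')

# href=True keeps only the anchors that actually carry an href attribute
hrefs = [a['href'] for a in page.body.find_all('a', href=True)]
print(hrefs)
```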

## Logic for scraping the web.

I've provided a simple example. Perhaps this could be applied to selecting a random organism from NCBI.

Pseudocode:
```
urls = random_urls()
num_urls_found = 0
while num_urls_found < 100:
    markups = [retrieve_url_markup(url) for url in urls]
    urls = [url for markup in markups for url in get_urls(markup)]
    num_urls_found += len(urls)
```
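
The loop above can be sketched as runnable code without touching the network. Everything here is an assumption made for illustration: `FAKE_WEB` is a made-up dictionary standing in for real pages, `get_urls` replaces the request-and-parse steps with a lookup, and the `seen` set is an addition that keeps the crawl from revisiting pages.

```python
# Hypothetical in-memory "web": each url maps to the links found on that page
FAKE_WEB = {
    'http://a.com': ['http://b.com', 'http://c.com'],
    'http://b.com': ['http://c.com', 'http://d.com'],
    'http://c.com': ['http://a.com'],
    'http://d.com': [],
}

def get_urls(url):
    # Stand-in for requesting a page and parsing its links
    return FAKE_WEB.get(url, [])

def crawl(start_urls, max_urls=100):
    seen = set(start_urls)
    frontier = list(start_urls)
    while frontier and len(seen) < max_urls:
        # Flatten the links of every page in the current frontier
        found = [link for url in frontier for link in get_urls(url)]
        # Follow only urls not visited yet, so cycles terminate
        frontier = [link for link in dict.fromkeys(found) if link not in seen]
        seen.update(frontier)
    return seen

visited = crawl(['http://a.com'])
print(sorted(visited))
```

Swapping `get_urls` for a function that fetches and parses real pages would turn this into a live crawler.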

### Follow-up

How would this be performed? What markups would need to be retrieved and how could you parse them to access the organism's genome?

In [None]:
def main():
    num_searches = 50
    # Assumed limit on crawl rounds so the loop terminates; adjust as needed
    max_rounds = 3
    parse_us = [random_url() for _ in range(num_searches)]
    # The context manager shuts the pool down once all rounds have finished
    with Pool(processes=num_searches) as p:
        for _ in range(max_rounds):
            data = p.map(get_links, parse_us)
            # Flatten the list of link lists into a single list of urls
            data = [url for url_list in data for url in url_list]
            if not data:
                break
            parse_us = data
            # Overwrite the file with the links found in this round
            with open('urls.text', 'w') as fh:
                fh.write(str(data))
            print('Written {}'.format(fh.name))


In [None]:
print(url)

In [None]:
resp = r.get('https://www.crummy.com/software/BeautifulSoup/')

In [None]:
type(resp)

In [None]:
dir(resp)

In [None]:
resp.ok

In [None]:
resp.text
soup = BeautifulSoup(resp.text, 'lxml')

In [None]:
soup