
# Scraping a webpage using BeautifulSoup

[`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/) is a Python library for searching a document and extracting what we need from it. I use it to access the data for each population on the `Geno 2.0 Next Generation` [webpage](https://genographic.nationalgeographic.com/reference-populations-next-gen/). The overall workflow is as follows:
1. Identify a source, whether a website URL or a locally saved file.
2. Use a parser to parse the HTML of the source. Python's built-in `html.parser` is always available, but to run this notebook you need to install the `html5lib` library, which parses sources written in HTML5. For installation, see [here](https://pypi.python.org/pypi/html5lib).
3. Find the HTML elements, such as `div` or `a`, that hold the required information. We can also select elements with a certain `id` or `class`.
4. Use commands such as `findAll` or `find` to find all matches, or only the first match, of the information you are looking for.
5. Optionally post-process the results to put them in the required format. Here, I collect them in an ordered dictionary so the dataset can be converted to JSON at the end.
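
The five steps above can be sketched on a tiny, made-up HTML string before touching the real page. Everything in this example (the `populations` id, the `pop-1`/`pop-2` classes, the names and numbers) is invented for illustration, and it uses the built-in `html.parser` so that it runs without `html5lib`. It also uses `find_all`, which is the current spelling of the `findAll` method used later in this notebook.

```python
from collections import OrderedDict
import json

from bs4 import BeautifulSoup

# Step 1: the source. A small inline string stands in for the real page
# (all ids, classes and values here are made up for this sketch).
html = """
<div id="populations">
  <div class="pop-1"><p>Alpha</p><h3>60%</h3></div>
  <div class="pop-2"><p>Beta</p><h3>40%</h3></div>
</div>
"""

# Step 2: parse the HTML with the parser that ships with Python.
demo_soup = BeautifulSoup(html, "html.parser")

# Step 3: locate the element that holds the information, here by id.
container = demo_soup.find("div", id="populations")

# Step 4: find every <div> inside the container.
rows = container.find_all("div")

# Step 5: post-process into an ordered dictionary and convert it to JSON.
record = OrderedDict()
for row in rows:
    name = row.find("p").get_text()
    percent = row.find("h3").get_text()
    record[name] = percent.rstrip("%")  # drop the trailing percent sign

as_json = json.dumps(record)
print(as_json)  # {"Alpha": "60", "Beta": "40"}
```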

This script uses three libraries:
* `BeautifulSoup`: To scrape the webpage
* `collections`: To hold items in an ordered dictionary
* `json`: To save extracted data in JSON format

In [1]:
from bs4 import BeautifulSoup
from collections import OrderedDict
import json

### Specify the source

I have saved a local copy of the webpage in the `webpage` directory.

In [2]:
# url to scrape
url_to_scrape = 'https://genographic.nationalgeographic.com/reference-populations-next-gen/'
# local file to scrape
path_to_scrape = "./webpage/Reference Populations - Geno 2.0 Next Generation.html"
# Create a BeautifulSoup object from the HTML content;
# the `with` block closes the file once parsing is done
with open(path_to_scrape) as file_to_scrape:
    soup = BeautifulSoup(file_to_scrape, "html5lib")
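
If `html5lib` is not installed, the call above fails. One possible way around this, shown here only as a sketch, is a small helper that falls back to the built-in parser. The helper name `make_soup` and its parameters are hypothetical and not part of Beautiful Soup; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is unavailable. Note that the two parsers can build slightly different trees for malformed HTML, so results on the real page may differ.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="html5lib", fallback="html.parser"):
    """Parse markup with the preferred parser, or the fallback if it is missing.

    `make_soup` is a hypothetical helper written for this notebook.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one instead.
        return BeautifulSoup(markup, fallback)


# Works on a string...
soup_demo = make_soup("<p>Hello, soup</p>")
print(soup_demo.find("p").get_text())

# ...and on an open file. The encoding below is an assumption about the saved copy:
# with open("./webpage/Reference Populations - Geno 2.0 Next Generation.html",
#           encoding="utf-8") as f:
#     soup = make_soup(f)
```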

### Looking into soup!
Let's see what is inside the variable `soup`. It contains all the HTML elements in the webpage. Looking through the code, I realized that the information I'm interested in is wrapped in `<div>` elements that look like this:

```
<div class="pop-211">
...
</div>
```
The class name is `pop-x`, where `x` ranges from 200 to 260. But we don't need to know the exact range if we catch the error raised for missing identifiers with Python's `try`/`except`. Within each of these `<div>` elements, there are a few `<li>` items that look like the following block:


```
<li class="pop-id-2105" style="width:8%;">
            <div class="wp-autosomal-bar-label">
                <p>Eastern Africa</p>
                <div class="wp-autosomal-bar-line"></div>
            </div>
            <div class="wp-autosomal-bar-section">
                <h3>2%</h3>
            </div>
        </li>
```
We are interested in the strings within `<p>` (`<p>Eastern Africa</p>`) and `<h3>` (`<h3>2%</h3>`) tags. So the idea is this:

1. Find all `<div>` elements with `class=pop-x`.
2. Extract the text within `<p>` elements in `<div class="wp-autosomal-bar-label">`.
3. Extract the text within `<h3>` elements in `<div class="wp-autosomal-bar-section">`.

Step 2 gives the population name and step 3 gives the percentage. We're ready to implement the algorithm.
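
Before running this on the whole page, the three steps can be tried on the single `<li>` block shown earlier. This is a minimal sketch: the snippet is wrapped in a `<div class="pop-211">` and a `<ul>`, and the `<ul>` wrapper is an assumption about the page rather than something taken from it.

```python
from bs4 import BeautifulSoup

# The <li> block from above, placed inside one population <div>.
# The <ul> wrapper is assumed for the sake of the example.
snippet = """
<div class="pop-211">
  <ul>
    <li class="pop-id-2105" style="width:8%;">
      <div class="wp-autosomal-bar-label">
        <p>Eastern Africa</p>
        <div class="wp-autosomal-bar-line"></div>
      </div>
      <div class="wp-autosomal-bar-section">
        <h3>2%</h3>
      </div>
    </li>
  </ul>
</div>
"""

demo = BeautifulSoup(snippet, "html.parser")

# Step 1: the <div> for one population
block = demo.find("div", class_="pop-211")

# Step 2: text of each <p> inside a label <div>
labels = [d.find("p").get_text(strip=True)
          for d in block.find_all("div", class_="wp-autosomal-bar-label")]

# Step 3: text of each <h3> inside a section <div>
percents = [d.find("h3").get_text(strip=True)
            for d in block.find_all("div", class_="wp-autosomal-bar-section")]

pairs = list(zip(labels, percents))
print(pairs)  # [('Eastern Africa', '2%')]
```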
 

### Implementation

In [32]:
# create an empty parent list that will hold
# one ordered dictionary per population
dic = []

for identifier in range(200,270):
    # make sure you use a wide enough range
    # to include all possible numbers
    
    # create an ordered dictionary to keep 
    # all info about genetic contributions
    # of this identifier
    d = OrderedDict()
    
    try:
        # find the `div` element corresponding to `identifier`.
        # This contains all the HTML within that <div>.
        # Indexing with [0] raises IndexError when nothing matches.
        data = soup.findAll('div', class_="pop-"+str(identifier))[0]
        
        # Population selected to find its genetic contributions
        population_label = data.findAll('h3')[0].get_text()
        d['title'] = population_label
        
        # How much each region contributes to the selected population:
        # find <div>s with the mentioned classes 
        label = [key.find('p').text for key in data.findAll('div',class_="wp-autosomal-bar-label")]
        percent = [key.find('h3').text for key in data.findAll('div',class_="wp-autosomal-bar-section")]
        
        # make sure that the number of labels
        # and the number of percentages match!
        if (len(label)==len(percent)):
            # if yes, put them in an ordered dictionary
            for i in range(len(label)):
                d[label[i]]=percent[i].split('%')[0]
        
        # append the ordered dictionary to the parent list
        dic.append(d)
    
    except IndexError:
        # no <div> exists for this identifier, so skip it
        continue
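
As an alternative to looping over a numeric range and catching the error for missing identifiers, the `class_` argument also accepts a compiled regular expression, so all `pop-x` blocks can be matched in one call. The sketch below uses a made-up HTML string (the class names other than the `pop-x` pattern, and all the titles, are invented), and it has not been checked against the real page.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup: two population blocks and two elements that should be skipped.
html = """
<div class="pop-200"><h3>First</h3></div>
<div class="pop-id-5"><h3>Not a population block</h3></div>
<div class="pop-245"><h3>Second</h3></div>
<div class="other"><h3>Ignored</h3></div>
"""
demo = BeautifulSoup(html, "html.parser")

# Match class names that are exactly "pop-" followed by digits.
pop_class = re.compile(r"^pop-\d+$")
blocks = demo.find_all("div", class_=pop_class)

titles = [b.find("h3").get_text() for b in blocks]
print(titles)  # ['First', 'Second']
```

With this approach the range of identifiers never has to be guessed, at the cost of relying on the class naming pattern staying the same.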

Now we can see what the dictionary for each population looks like. It contains a `title` followed by a set of `labels` and `values`, one pair for each genetic type.

In [34]:
dic[1]

OrderedDict([('title', 'Ashkenazi Jewish'),
             ('Eastern Europe', '2'),
             ('Finland & Northern Siberia', '28'),
             ('Eastern Asia', '18'),
             ('Central Asia', '42'),
             ('Asia Minor', '8'),
             ('Southern Asia', '2')])

### Saving the results
Finally, we can save all the results in a `JSON` (JavaScript Object Notation) file.

In [5]:
with open('data.json', 'w') as outfile:
    json.dump(dic, outfile)
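
To check that the file can be read back with the key order intact, something like the following sketch may help. It writes a small stand-in list (the population and region names are placeholders, not scraped data) to a temporary directory instead of overwriting `data.json`, then reloads it with `object_pairs_hook=OrderedDict`.

```python
import json
import os
import tempfile
from collections import OrderedDict

# Placeholder records with the same shape as `dic`; the values are invented.
records = [
    OrderedDict([("title", "Example population"),
                 ("Region A", "60"),
                 ("Region B", "40")]),
]

# Write to a temporary location so the real output file is left alone.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as outfile:
    json.dump(records, outfile, indent=2)  # indent makes the file easier to read

# Read the file back, rebuilding each object as an OrderedDict.
with open(path) as infile:
    loaded = json.load(infile, object_pairs_hook=OrderedDict)

print(loaded == records)  # True
```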