The purpose of this notebook is to explore how to parse data from AddGene.

As of now, the following attributes are desired for all plasmids:
- [X] name (str)
- [X] plasmid id # (int)
- [X] purpose (str)
- [X] publication (str)
- [X] sequence information (link to GenBank file)
- [ ] vector backbone name (str)
- [ ] vector backbone url (str)
- [ ] vector type (str[])
- [ ] tag/fusion protein (str[])
- [ ] bacterial resistance(s) (str[] for now)
- [ ] growth temp (str for now)
- [ ] growth strain(s) (str)
- [ ] copy number (str)
- [ ] gene/insert name (str)

# Config bs4

In [77]:
from bs4 import BeautifulSoup, Tag, NavigableString
from requests import get

This `dict` will be used to store all plasmid data

In [5]:
plasmid_data = {}

In [43]:
# id for "pBABE puro EGFP"
plasmid_id = 128041

def addgene_url(plasmid_id: int) -> str:
    return f'https://www.addgene.org/{plasmid_id}/'

url = addgene_url(plasmid_id)

In [7]:
page = get(url)

In [8]:
page = BeautifulSoup(page.text, 'html.parser')

This seems to be the selector to get the element that contains most if not all of the plasmid data

In [9]:
raw_plasmid_info = page.find_all('section', 'addgene-panel-catalog-item')

In [10]:
raw_plasmid_info = raw_plasmid_info[0]

### Entire Page

In [11]:

print(raw_plasmid_info)

<section class="panel panel-default addgene-panel-catalog-item addgene-panel-plasmid-item">
<div class="panel-heading">
<div class="row row-vertically-centered">
<div id="plasmid-flame-container">
<span class="addgene-flame-with-popover addgene-flame addgene-flame-medium" data-click-away="true" data-toggle="popover"></span>
</div>
<h1>
<span class="material-name"> pBABE puro EGFP </span>
<br/>
<small>
                          (Plasmid
                          
                          #<span id="addgene-item-id">128041</span>)
                        </small>
</h1>
<div class="text-right" id="print-link-container">
<a class="print-link" href="javascript:window.print();">
<span class="glyphicon glyphicon-print"></span>
                            
                             Print
                        </a>
</div>
</div>
</div>
<div class="panel-body">
<div class="material-properties">
<!-- Top section, images and basic data -->
<div class="row" id="top-section">
<div class="col-x

Now that we found the `section` with plasmid data, let's start clean it, then begin extracting data.

In [112]:
# strip whitespace
cleaned = [i.strip() for i in str(raw_plasmid_info).split('\n')]
cleaned = "".join(cleaned)
raw_plasmid_info = BeautifulSoup(cleaned)

In [13]:
with open(f'{plasmid_id}.html', 'w') as f:
    f.write(str(raw_plasmid_info))

In [69]:
# plasmid name
plasmid_data['name'] = raw_plasmid_info.find(attrs={'class': 'material-name'}).contents[0]

' pBABE puro EGFP '

In [79]:
# addgene plasmid id
plasmid_data['id'] = int(raw_plasmid_info.find(attrs={'id': 'addgene-item-id'}).contents[0])

## Extracting desc
Extracting the purpose and sequences will be a little more involved

In [118]:
plasmid_desc = raw_plasmid_info.find(attrs={'id': 'plasmid-description-list'})

In [125]:
def extract_field_content(desc, string: str):
    items = desc.find_all('li')
    for i in items:
        if i.find(string=string):
            return i.find(attrs={'class': 'field-content'}).contents


In [126]:
extract_field_content(plasmid_desc, "Purpose")

['\n',
 <strong>(Empty Backbone)</strong>,
 '\n                            \n                            Retroviral vector with N terminal EGFP\n                        ']

In order to get string, just concatenate all tag contents

Also, there needs to be a helper function that removes all empty strings

In [127]:
def get_strings(arr: []) -> []:
    extracted = []
    for i in arr:
        extracted.append(i.get_text())
        
    cleaned = []
    for i in extracted:
        stripped = i.strip()
        if len(stripped):
            cleaned.append(stripped)
    return cleaned

In [129]:
plasmid_data['purpose'] = ' '.join(get_strings(extract_field_content(plasmid_desc, "Purpose")))

In [131]:
plasmid_data

{'name': ' pBABE puro EGFP ',
 'id': 128041,
 'purpose': '(Empty Backbone) Retroviral vector with N terminal EGFP'}

In [132]:
extract_field_content(plasmid_desc, "Publication")

['\n',
 <a href="/browse/article/28203582/">
                             
 
                             
                                 Emory Custom Cloning Core Plasmids - Oskar Laur (unpublished)
                             
 
                             
                             </a>,
 '\n',
 <span>
 <small>
                                     ( <a href="#how-to-cite">How to cite <span class="glyphicon glyphicon-arrow-down"></span></a>)
                                 </small>
 </span>,
 '\n']

In [166]:
def get_link(desc, string: str) -> str:
    for i in extract_field_content(desc, string):
        if type(i) == Tag:
            if 'href' in i.attrs:
                return i['href']

In [173]:
plasmid_data['publication_link'] = get_link(plasmid_desc, "Publication")

### Get link to sequence page

In [172]:
plasmid_data['sequence_link'] = plasmid_desc.find(attrs={'id': 'sequence_information'}).a.attrs['href']

'/128041/sequences/'

## Extracting details

In [113]:
plasmid_details = raw_plasmid_info.find(attrs={'id': 'detail-sections'})

### Vector Backbone

In [42]:
with open(f'{plasmid_id}_details.html', 'w') as f:
    f.write(str(plasmid_details))

In [44]:
for i in plasmid_details:
    print(i)



<div class="col-xs-12">
<section>
<h2><span class="title">Ordering</span></h2>
<table class="add-to-cart-table table-striped table-condensed" id="ordering-table">
<thead>
<tr>
<th>Item</th>
<th>Catalog #</th>
<th>Description</th>
<th>Quantity</th>
<th colspan="2">Price (USD)</th>
</tr>
</thead>
<tbody>
<tr id="row-128041">
<td style="width: 7em;">Plasmid</td>
<td style="width: 7em;">128041</td>
<td>
<span id="format-details-128041"> Standard format: Plasmid sent in bacteria as agar stab </span>
</td>
<td class="text-center" style="width: 5.25em;">
<span data-add-to-cart-quantity="1" data-item-id="128041">1</span>
</td>
<td class="text-left price-cell" style="width: 5em;">
                
                
                
                $<span data-item-id="128041" data-marginal-price="" data-marginal-price-item-data='{"itemId": "128041", "bulkGroup": "1", "unitPrices": [{"minQuantity": 1, "bulkUnitPrice": 85.0}, {"minQuantity": 6, "bulkUnitPrice": 85.0}, {"minQuantity": 21, "bulkUn

['\n',
 <div class="field-label">Vector backbone</div>,
 '\n                        \n                        pBABE puro\n                        \n                        ',
 <div style="clear: left;">
 <a href="/vector-database/query/?q_vdb=pBABE%20puro" rel="noopener noreferrer" target="_blank"><span class="glyphicon glyphicon-new-window"></span> (Search Vector Database)</a>
 </div>,
 '\n']

In [114]:
def get_detail_section(title: str, sections):
    for section in sections:
        element_title = section.find(attrs={'class': 'title'})
        if element_title.text.strip().lower() == title.lower():
            return section

def extract_string(tag) -> str:
    for s in tag.children:
        if type(s) == NavigableString:
            if s == '\n':
                continue
            return str(s).strip()

def extract_detail_fields(section) -> {}:
    fields = {}
    for field_element in section.find_all(attrs={'class': 'field'}):
        key = extract_string(field_element.find(attrs={'class': 'field-label'}))
        value = extract_string(field_element)
        # try extracting an array
        if value is None:
            element = field_element.find(attrs={'class': "addgene-document-list"})
            if element:
                value = [i.text for i in element.children]
        
        # attempt to grab the href
        a = field_element.find('a')
        if a:
            href = a.attrs['href'].strip()
        else:
            href = None
        
        fields[key] = {'value': value, 'href': href}
        
    return fields


detail_sections = ['backbone', 'growth in bacteria', 'gene/insert']
sections = plasmid_details.find_all('section')
for title in detail_sections:
    chunk = get_detail_section(title, sections)
    print(extract_detail_fields(chunk))


{'Vector backbone': {'value': 'pBABE puro', 'href': '/vector-database/query/?q_vdb=pBABE%20puro'}, 'Vector type': {'value': 'Mammalian Expression, Retroviral', 'href': None}, '/ Fusion Protein': {'value': [], 'href': None}}
{'Bacterial Resistance(s)': {'value': 'Ampicillin, 100 μg/mL', 'href': None}, 'Growth Temperature': {'value': '37°C', 'href': None}, 'Growth Strain(s)': {'value': 'NEB Stable', 'href': None}, 'Copy number': {'value': 'High Copy', 'href': None}}
{'Gene/Insert name': {'value': 'None', 'href': None}, 'Tag/ Fusion Protein': {'value': [], 'href': None}}


# Redo desc

In [132]:
def parse_field(tag: Tag) -> {}:
    label = tag.find(attrs={'class': 'field-label'}).text
    print(label)
    _content = tag.find(attrs={'class': 'field-content'})
    # try extracting an array
    if _content is None:
        element = tag.find('ul')
        
        if element:
            values = []
            for i in element.children:
                href = i.find('a').attrs['href']
                values.append({'value': i.text, 'href': href})
            return {label: {'value': values}}
        print('this happened')
            
    else:
        a = _content.find('a')
        if a:
            value = a.text
            href = str(a['href'])
        else:
            value = _content.text
            href = None
        return {label: {'value': value, 'href': href}}

for field in plasmid_desc.find_all(attrs={'class': 'field'}):
    print(parse_field(field))


Purpose
{'Purpose': {'value': '(Empty Backbone)Retroviral vector with N terminal EGFP', 'href': None}}
Depositing Lab
{'Depositing Lab': {'value': 'Oskar Laur', 'href': '/browse/pi/4513/'}}
Publication
{'Publication': {'value': 'Emory Custom Cloning Core Plasmids - Oskar Laur (unpublished)', 'href': '/browse/article/28203582/'}}
Sequence Information
{'Sequence Information': {'value': [{'value': 'Sequences (1)', 'href': '/128041/sequences/'}]}}


AttributeError: 'NoneType' object has no attribute 'text'

{<div class="field-label">Purpose</div>: {'value': '(Empty Backbone)Retroviral vector with N terminal EGFP', 'href': None}}
{<div class="field-label">Depositing Lab</div>: {'value': 'Oskar Laur', 'href': '/browse/pi/4513/'}}
{<div class="field-label">Publication</div>: {'value': 'Emory Custom Cloning Core Plasmids - Oskar Laur (unpublished)', 'href': '/browse/article/28203582/'}}
{<div class="field-label">Sequence Information</div>: {'value': [{'value': 'Sequences (1)', 'href': '/128041/sequences/'}]}}
this happened
None
