# Web Scraping and Parsing Inhibitor Data in Python

KIDFamMap is a database listing thousands of known kinase inhibitors. 

Kinases are listed on the web page at http://gemdock.life.nctu.edu.tw/kidfammap/browse.php, and each one links to a page of information about that kinase. That page in turn links to a further page, which contains a table of inhibitors for the kinase.

Here we use the scrapy package to create and run a spider that scrapes the relevant information from the website's HTML code.
- First, the URLs for each kinase's web page are obtained from the main page.
- These URLs are then followed, taking us to each kinase's web page.
- From each of those pages we extract the URL of the page containing that kinase's table of inhibitors.
- We then follow these URLs and extract the rows of each inhibitor table.

Once the data has been extracted from the website, it is cleaned and inserted into a pandas data frame.

After that, the following steps take place:
- Use information in the data frame to generate a column of URLs that our web app can use to display images of the inhibitors' chemical structures.
- Make kinase and inhibitor names uppercase.
- Translate kinase names to match the "Entry name" column in the kinase table.
- Remove duplicate rows.
- Remove unnecessary columns.
- Split the data frame into two: one listing the kinase-inhibitor pairs, and one listing each inhibitor alongside all of its information. The two data sets can then be linked via the inhibitor name in our relational database.
- Make primary key columns.
- Export as CSV files.

Template code and general information about scrapy were taken from https://www.datacamp.com/courses/web-scraping-with-python

Import packages

In [14]:
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
import urllib.parse 
import urllib.request 

Make a spider

In [2]:
# Use the scrapy.Spider class to make your own spider

class InhibitorSpider( scrapy.Spider ):
    
    name = "inhibitor_spider"
    
    # Define the first action to take
        
    def start_requests( self ): 
        
        # Define which URL to start at
        
        url = 'http://gemdock.life.nctu.edu.tw/kidfammap/browse.php'

        # Go to the website at the above URL and get a response object
        # Define what to do with the response object
        # i.e. send it to the parse method defined below
        
        yield scrapy.Request( url = url, callback = self.parse )
            
    # Using the previous response object, get the URLs for each 
    # kinase's web page, and go to those websites  
    
    def parse( self, response ):
        
        # Each kinase listed on the page is a hyperlink, leading to
        # a page of information for that kinase.
        # Define a CSS locator to point to the URLs in those
        # hyperlinks and extract them as strings.
        
        links = response.css('table > tr > td > a::attr(href)').extract()
        
        # Go to the kinases' web pages using the new URLs
        # and send the response objects to the parse2 method below
        
        for link in links:
            yield response.follow(url = link, callback = self.parse2)
    
    # Using the previous response objects, get the URLs for each 
    # inhibitor table.
    
    def parse2( self, response ):
        
        # Each kinase's web page has a hyperlink to another page 
        # containing a table of inhibitors for that kinase.
        # Define a CSS locator to point to the URLs in those
        # hyperlinks and extract them as strings.
        
        inhib_links = response.css('a.show_inhibitor::attr(href)').extract()
        
        # Go to the inhibitor table pages using the new URLs
        # and send the response objects to the parse3 method below
        
        for ilink in inhib_links:
            yield response.follow(url = ilink, callback = self.parse3)

    # Using the previous response objects, get information from the 
    # table of inhibitors for each kinase
    
    def parse3( self, response ):
        
        # Each inhibitor list web page has information we'd like to
        # extract and place into a list "inhibs" (a module-level list,
        # which we must initialise in the next cell rather than here).
        # Define a CSS locator to point to the rows of the inhibitor
        # table and extract each row as a string of HTML.
       
        raw = response.css('div.result tbody > tr').extract()
        
        # For each kinase, append the list of rows describing its
        # inhibitors to "inhibs"
        
        inhibs.append(raw)
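
`response.follow` accepts relative URLs and resolves them against the URL of the page they were found on, which is why the spider can pass the extracted `href` strings straight through. The sketch below reproduces that resolution with the standard library only. The relative `href` value is an assumption, chosen so that the result matches a URL seen in the crawl log; the real pages may write their links differently.

```python
from urllib.parse import urljoin, urlparse, parse_qs

# URL of the page the link was found on (the spider's start URL)
page_url = "http://gemdock.life.nctu.edu.tw/kidfammap/browse.php"

# Hypothetical relative href, assumed for illustration
href = "ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MYLK4"

# Resolve the relative link against the page URL
absolute = urljoin(page_url, href)

# The kinase name can be recovered from the query string
kinase = parse_qs(urlparse(absolute).query)["Query_ProteinName"][0]

print(absolute)
print(kinase)
```

Reading the kinase name back out of the URL in this way can be handy when checking which page a given response belongs to.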

Run the spider: crawl KIDFamMap for inhibitors

In [3]:
# Our data will be returned to "inhibs"

inhibs = [] 

# Run the spider

process = CrawlerProcess()
process.crawl(InhibitorSpider)
process.start()

# N.B. The kernel must be restarted before running this cell again,
# because the Twisted reactor started by process.start() cannot be restarted

2020-01-15 19:21:41 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-01-15 19:21:41 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2020-01-15 19:21:41 [scrapy.crawler] INFO: Overridden settings: {}
2020-01-15 19:21:41 [scrapy.extensions.telnet] INFO: Telnet Password: 0c32db61d08fe7b4
2020-01-15 19:21:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-01-15 19:21:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defau

2020-01-15 19:21:47 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=MAP3K9&Query_Pid=MAP3K9&Query_Cid=&Query_Family=> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2020-01-15 19:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=MYLK4&Query_Pid=MYLK4&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MYLK4)
2020-01-15 19:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=MAPKAPK3&Query_Pid=MAPKAPK3&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MAPKAPK3)
2020-01-15 19:21:47 [scrapy.core.engine] DEBUG: Crawled 

2020-01-15 19:21:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=KSR2&Query_Pid=KSR2&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=KSR1)
2020-01-15 19:21:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=IRAK1> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=RYK> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=ROS1> (referer: http://gemdock.life.nctu.edu.tw/kidfammap

2020-01-15 19:21:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=INSRR> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=INSR&Query_Pid=INSR&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=ROR2)
2020-01-15 19:21:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=FGR> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=FGFR4> (referer: http://gemdock.life.nctu.edu.tw/kidfamma

2020-01-15 19:21:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MYO3B> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MYO3A> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MINK1> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MAP4K5> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:55 [scrapy.core.engine] DEBUG: Crawled (200) <

2020-01-15 19:21:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=CAMK1G&Query_Pid=CAMK1G&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=ULK3)
2020-01-15 19:21:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=MAP2K6&Query_Pid=MAP2K6&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MAP2K3)
2020-01-15 19:21:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=NEK1> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?Pro

2020-01-15 19:21:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MAK> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:21:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=DYRK2&Query_Pid=DYRK2&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=PRPF4B)
2020-01-15 19:21:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=MAPK1&Query_Pid=MAPK1&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=NLK)
2020-01-15 19:21:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQu

2020-01-15 19:22:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=CSNK1A1L> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=CSNK1A1> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=STK33> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=STK17A> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:02 [scrapy.core.engine] DEBUG: Crawled (2

2020-01-15 19:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=STK32B> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=STK32A> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=SGK3> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=SGK2> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <G

2020-01-15 19:22:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=LOC375449> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=PRKCB&Query_Pid=PRKCB&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=PRKCH)
2020-01-15 19:22:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=GRK5> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=GRK7> (referer: http://gemdock.life.nctu.edu.tw/k

2020-01-15 19:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=TNK2> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=PRKCA&Query_Pid=PRKCA&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=PRKCE)
2020-01-15 19:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=TEK> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=SYK> (referer: http://gemdock.life.nctu.edu.tw/kidfamma

2020-01-15 19:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=FLT1> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=BRAF&Query_Pid=BRAF&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=BRAF)
2020-01-15 19:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=RET&Query_Pid=RET&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=RET)
2020-01-15 19:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType

2020-01-15 19:22:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=SLK> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=LCK&Query_Pid=LCK&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=LCK)
2020-01-15 19:22:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=ABL2&Query_Pid=ABL2&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=ABL2)
2020-01-15 19:22:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=

2020-01-15 19:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=CAMKK2> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=TTK&Query_Pid=TTK&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=TTK)
2020-01-15 19:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=EIF2AK2> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=MAP2K4&Query_Pid=MAP2K4&Query_Cid=&Query_Family=> (referer: ht

2020-01-15 19:22:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=MAPK12&Query_Pid=MAPK12&Query_Cid=&Query_Family=> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=MAPK12)
2020-01-15 19:22:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=CDK9> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=CDK7> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=CDK6> (referer: http://gemdock.life.nctu.edu.tw/kid

2020-01-15 19:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=RPS6KA5> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=RPS6KA1> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=PRKCI> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://gemdock.life.nctu.edu.tw/kidfammap/ProcessQuery.php?ProteinQueryType=x&Query_ProteinName=PRKCQ> (referer: http://gemdock.life.nctu.edu.tw/kidfammap/browse.php)
2020-01-15 19:22:16 [scrapy.core.engine] DEBUG: Crawled (200

Clean up the data in "inhibs" and store in "inhibitors"

In [5]:
inhibitors = []

for i in inhibs: # for each kinase
    for j in i: # for each kinase-inhibitor relationship
        chemical = []
        inh = j.split("</td>") # Split row into individual fields
        for k in inh: # for each field, remove unnecessary characters
            field = k.replace("<tr>","") 
            field = field.replace("\r","")
            field = field.replace("\t","")
            field = field.replace("<td>","")
            field = field.replace("\n","")
            chemical.append(field) # Make a row of cleaned, separate fields
        inhibitors.append(chemical) # Add this row to "inhibitors"
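
To see what the cleaning loop produces, the sketch below applies the same steps to a single row. The row is made up and has only three cells; the real rows have more. Note that splitting on `</td>` leaves a trailing fragment holding the closing `</tr>` tag, which is presumably why the column names defined in the next cell end with "To_Remove".

```python
def split_row(row_html):
    """Split one <tr> string into cleaned cell strings (same steps as the loop above)."""
    cells = []
    for piece in row_html.split("</td>"):
        # Strip the opening tags and whitespace characters from each piece
        for junk in ("<tr>", "<td>", "\r", "\n", "\t"):
            piece = piece.replace(junk, "")
        cells.append(piece)
    return cells

# Made-up row for illustration only
row = "<tr>\r\n\t<td>1</td>\r\n\t<td>ABL1</td>\r\n\t<td>Imatinib</td>\r\n</tr>"

cells = split_row(row)
print(cells)  # ['1', 'ABL1', 'Imatinib', '</tr>']
```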

Define column names, based on those on the KIDFamMap website

In [6]:
headers = ["Index","Kinase","Inhibitor","Partial_Img_URL",
           "Ki_nM","IC50_nM","Kd_nM","EC50_nM","POC","Source","Link","To_Remove"]

Save the inhibitors information in a Pandas data frame

In [7]:
inhibitors_df = pd.DataFrame(inhibitors, columns = headers)

Using information in the data frame, generate a column of URLs for the inhibitors' chemical structure images. Our web app can subsequently use these to display images.

In [8]:
IMG_URL = []

for n,i in enumerate(inhibitors_df.Partial_Img_URL):
    URL = 'http://gemdock.life.nctu.edu.tw/kidfammap/data/png/'
    URL += str(inhibitors_df.Source[n])+"/"
    URL += str(i)+".png"
    IMG_URL.append(URL)

IMG_URL = pd.Series(IMG_URL)
inhibitors_df = inhibitors_df.assign(IMG_URL = IMG_URL)
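
The same column can also be built without an explicit loop, since pandas concatenates string columns element-wise. This is a minimal sketch on a made-up two-row frame; the column names follow the notebook, but the `Source` and `Partial_Img_URL` values are invented.

```python
import pandas as pd

# Made-up rows; only the column names come from the notebook
df = pd.DataFrame({
    "Source": ["SourceA", "SourceB"],
    "Partial_Img_URL": ["123", "456"],
})

base = "http://gemdock.life.nctu.edu.tw/kidfammap/data/png/"

# Element-wise string concatenation: base + Source + "/" + name + ".png"
df["IMG_URL"] = (
    base + df["Source"].astype(str) + "/" + df["Partial_Img_URL"].astype(str) + ".png"
)

print(df["IMG_URL"].tolist())
```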

Make "kinase" and "inhibitor" entries uppercase.

In [9]:
uppercase_kinase = []
uppercase_inhib = []

for n,i in enumerate(inhibitors_df.Kinase):
    uppercase_kinase.append(str(i).upper())
    uppercase_inhib.append(inhibitors_df.Inhibitor[n].upper())

uppercase_kinase = pd.Series(uppercase_kinase)
uppercase_inhib = pd.Series(uppercase_inhib)

inhibitors_df = inhibitors_df.assign(Kinase = uppercase_kinase)
inhibitors_df = inhibitors_df.assign(Inhibitor = uppercase_inhib)
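
As an alternative to the loop, the `.str` accessor applies a string method to every entry of a column. A short sketch on invented names:

```python
import pandas as pd

# Made-up names for illustration
df = pd.DataFrame({
    "Kinase": ["Abl1", "src"],
    "Inhibitor": ["imatinib", "Dasatinib"],
})

# Uppercase both columns; astype(str) guards against non-string entries
for col in ("Kinase", "Inhibitor"):
    df[col] = df[col].astype(str).str.upper()

print(df)
```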

Translate kinase names using UniProt

In [12]:
kinase_list = list(inhibitors_df.Kinase.drop_duplicates())
len(kinase_list)

148

In [15]:
# URL of UniProt's ID mapping service
# (this is the legacy "uploadlists" endpoint; it may no longer respond)
url = "https://www.uniprot.org/uploadlists/"

# Define parameters
params = {
"from": "GENENAME", # Assume kinase names are gene names
"to": "ID", # Retrieve UniProtKB entry names, e.g. "SRC_HUMAN"
"format": "tab", # Produce tab-delimited output
"query": "", # The query gene name is filled in during the loop
}

# Create an empty list to store the results in
results=[]

for i in kinase_list: 
    params["query"] = str(i) # Enter the kinase's gene name
    data = urllib.parse.urlencode(params) 
    data = data.encode("utf-8")
    req = urllib.request.Request(url, data) # Build a POST request for UniProt
    with urllib.request.urlopen(req) as f: # Run the query
        response = f.read()
    line = response.decode("utf-8") 
    results.append(line) # Store results
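
Because a `data` argument is passed to `urllib.request.Request`, each query is sent as a POST request whose body is the URL-encoded parameter dictionary. The sketch below builds that body for one gene name without contacting any server; "SRC" is just an example query.

```python
from urllib.parse import urlencode

# Same parameters as above, with an example gene name filled in
params = {"from": "GENENAME", "to": "ID", "format": "tab", "query": "SRC"}

# urlencode() joins the key=value pairs with "&"; encode() turns the
# string into the bytes that Request() expects as its data argument
body = urlencode(params).encode("utf-8")

print(body)  # b'from=GENENAME&to=ID&format=tab&query=SRC'
```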

In [16]:
# Split up the search results into a list of lists

results2 = [r.split() for r in results]

In [17]:
# Make a dictionary of proteins

proteindict={}

for i in results2: # i = ["From", "To", query, ID, query, ID, ...] for one kinase
    if len(i) > 2: # skip results with no matches (empty, or header only)
        humans = 0
        for e in range(3,len(i),2): # For every other item in the list (i.e. a potentially correct ID)
            if "_HUMAN" in str(i[e]): # check how many "human" options there are
                 humans += 1 
        if humans > 1: # If there are multiple "human" options
            for j in range(3,len(i),2):
                if str(i[j]) == str(i[2]) + "_HUMAN": # keep the one that equates to the original ID with
                    # suffix "_HUMAN"
                    proteindict[str(i[2])] = str(i[j])
        if humans == 1: # If there is only one "human" option, choose it
            for h in range(3,len(i),2):
                if "_HUMAN" in str(i[h]):
                    proteindict[str(i[2])] = str(i[h])
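
The selection rule in the loop above can be written as a small function, which makes it easier to test on its own. The sketch below keeps the original rule and adds one fallback that is not part of the notebook: when several human entries come back and none equals the gene name plus "_HUMAN", it keeps the only entry whose prefix is at most five characters long. This relies on the UniProt naming convention that reviewed (Swiss-Prot) entry names use a short mnemonic before the underscore, while unreviewed (TrEMBL) entry names use the six- or ten-character accession. The function name and the sample tokens are invented for illustration.

```python
def pick_human_entry(tokens):
    """Choose one *_HUMAN entry name from a split() result.

    tokens looks like ['From', 'To', gene, entry, gene, entry, ...].
    Returns None when no unambiguous choice can be made.
    """
    if len(tokens) <= 2:            # no matches: empty or header only
        return None
    gene = tokens[2]
    human = [t for t in tokens[3::2] if t.endswith("_HUMAN")]
    if not human:
        return None
    exact = gene + "_HUMAN"
    if exact in human:              # original rule for several candidates
        return exact
    if len(human) == 1:             # original rule for a single candidate
        return human[0]
    # Added fallback (an assumption, not in the notebook): prefer the
    # single candidate that looks like a reviewed entry name
    reviewed = [t for t in human if len(t.split("_")[0]) <= 5]
    return reviewed[0] if len(reviewed) == 1 else None

# Made-up response tokens for illustration only
tokens = ["From", "To", "CHEK1", "CHK1_HUMAN",
          "CHEK1", "A0A0X0X0X0_HUMAN", "CHEK1", "CHK1_MOUSE"]
print(pick_human_entry(tokens))  # CHK1_HUMAN
```

With a function like this, the dictionary could be filled by calling it once per entry of `results2` and storing every result that is not `None`.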

In [18]:
# Translate the kinases in the data frame and insert into new column

data = []

for n,i in enumerate(inhibitors_df.Kinase): # Iterate over original data frame's IDs
    if i in proteindict.keys():
        kinase = proteindict.get(i)
        data.append(kinase)
    else:
        data.append(str(i) + "_NOT_FOUND_IN_UNIPROT")

data = pd.Series(data)
inhibitors_df = inhibitors_df.assign(HUMAN_KINASE = data)
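
The same lookup can be expressed with `Series.map`, which returns a missing value wherever a name is absent from the dictionary. That also makes it easy to list the kinases that could not be translated. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up subset of the translation dictionary and data frame
proteindict = {"SRC": "SRC_HUMAN"}
df = pd.DataFrame({"Kinase": ["SRC", "CHEK1", "SRC"]})

# Names missing from the dictionary become NaN
translated = df["Kinase"].map(proteindict)

# Kinases with no translation, listed once each
missing = df.loc[translated.isna(), "Kinase"].unique().tolist()

# Fill the gaps with the same placeholder used in the loop above
df["HUMAN_KINASE"] = translated.fillna(df["Kinase"] + "_NOT_FOUND_IN_UNIPROT")

print(missing)
print(df)
```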

In [27]:
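### UH OH! Many kinases were not translated. ###
# A likely cause: UniProt entry names do not always begin with the gene
# name, so when a query returns several human entries, none of them may
# equal the gene name + "_HUMAN".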

inhibitors_df.HUMAN_KINASE.unique()

array(['ADRBK1_NOT_FOUND_IN_UNIPROT', 'DAPK3_HUMAN', 'DAPK2_HUMAN',
       'DAPK1_HUMAN', 'PASK_HUMAN', 'MKNK2_HUMAN', 'MYLK4_HUMAN',
       'MAPKAPK3_NOT_FOUND_IN_UNIPROT', 'CAMK2G_NOT_FOUND_IN_UNIPROT',
       'CAMK4_NOT_FOUND_IN_UNIPROT', 'CASK_HUMAN',
       'MAP3K9_NOT_FOUND_IN_UNIPROT', 'MAPK2_HUMAN', 'AVR2A_HUMAN',
       'PIM1_HUMAN', 'CHEK1_NOT_FOUND_IN_UNIPROT', 'PIM2_HUMAN',
       'BMPR1B_NOT_FOUND_IN_UNIPROT', 'KSR2_HUMAN',
       'TNK2_NOT_FOUND_IN_UNIPROT', 'MERTK_HUMAN',
       'CHEK2_NOT_FOUND_IN_UNIPROT', 'TEK_NOT_FOUND_IN_UNIPROT',
       'RAF1_HUMAN', 'ITK_HUMAN', 'BMPR2_HUMAN',
       'TGFBR1_NOT_FOUND_IN_UNIPROT', 'FYN_HUMAN', 'IRAK4_HUMAN',
       'SRC_HUMAN', 'MET_HUMAN', 'ALK_HUMAN', 'INSR_HUMAN', 'FGFR1_HUMAN',
       'IGF1R_HUMAN', 'FGFR2_HUMAN', 'KIT_HUMAN', 'CSK_HUMAN',
       'FES_HUMAN', 'LYN_HUMAN', 'EPHA7_HUMAN', 'EPHA3_HUMAN',
       'EPHB4_HUMAN', 'BTK_HUMAN', 'PAK1_HUMAN', 'HCK_HUMAN',
       'OXSR1_HUMAN', 'TNIK_HUMAN', 'STRADA_NOT_FOUND_IN_UNIPROT'

Make a temporary column combining the inhibitor and kinase names, to check for duplicates

In [180]:
unique = []

for n,i in enumerate(inhibitors_df.Inhibitor):
    uniq = str(i)+str(inhibitors_df.Kinase[n])
    unique.append(uniq)

unique = pd.Series(unique)
inhibitors_df = inhibitors_df.assign(UNIQUE = unique)

Drop any duplicate kinase-inhibitor pairs and reset the indices

In [181]:
inhibitors_df = inhibitors_df.drop_duplicates(subset="UNIQUE")
inhibitors_df = inhibitors_df.reset_index(drop=True)
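
`drop_duplicates` also accepts a list of columns, which removes the need for the temporary "UNIQUE" column. It additionally avoids a corner case of plain concatenation, where two different pairs produce the same combined string. The names below are deliberately artificial to show that corner case; it is unlikely to occur with real kinase and inhibitor names.

```python
import pandas as pd

# Artificial names: "AB"+"C" and "A"+"BC" both concatenate to "ABC"
df = pd.DataFrame({
    "Inhibitor": ["AB", "A", "AB"],
    "Kinase":    ["C", "BC", "C"],
})

# Concatenated key, as in the notebook: all three rows look identical
n_concat_keys = (df["Inhibitor"] + df["Kinase"]).nunique()

# Comparing the two columns directly keeps both distinct pairs
deduped = df.drop_duplicates(subset=["Inhibitor", "Kinase"]).reset_index(drop=True)

print(n_concat_keys)
print(deduped)
```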

Remove any columns not required for the web app

In [182]:
inhibitors_df = inhibitors_df.drop(["Index","To_Remove","Partial_Img_URL","Link","UNIQUE"], axis = 1)

Make data frame of kinase-inhibitor pairs

In [183]:
inhib_kin_df = inhibitors_df[['Kinase', 'Inhibitor']]

Make data frame of inhibitors

In [184]:
inhibitors_df = inhibitors_df.drop_duplicates(subset="Inhibitor")
inhibitors_df = inhibitors_df.drop(["Kinase"], axis = 1)
inhibitors_df = inhibitors_df.reset_index(drop=True)

Make a column of primary keys

In [185]:
prim_key = []

count = 1

for i in inhibitors_df.Inhibitor:
    key = "IN"+"{:07d}".format(count)
    prim_key.append(key)
    count += 1

prim_key = pd.Series(prim_key)

inhibitors_df = inhibitors_df.assign(ID_IN = prim_key)

In [186]:
prim_key = []

count = 1

for i in inhib_kin_df.Inhibitor:
    key = "KI"+"{:07d}".format(count)
    prim_key.append(key)
    count += 1

prim_key = pd.Series(prim_key)

inhib_kin_df = inhib_kin_df.assign(ID_KI = prim_key)
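
Both key columns follow the same pattern: a two-letter prefix followed by a zero-padded, seven-digit counter starting at 1. A small helper can generate either; the function name `make_keys` is hypothetical.

```python
def make_keys(prefix, n):
    """Return n keys such as 'IN0000001', numbered from 1."""
    return [f"{prefix}{i:07d}" for i in range(1, n + 1)]

keys = make_keys("IN", 3)
print(keys)  # ['IN0000001', 'IN0000002', 'IN0000003']
```

Under that assumption, the two cells above would reduce to calls such as `inhibitors_df.assign(ID_IN=make_keys("IN", len(inhibitors_df)))`.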

Write to csv

In [187]:
inhib_kin_df.to_csv("inhib_kin.csv", index = False)

In [188]:
inhibitors_df.to_csv("inhibitors.csv", index = False)