# Web scraping KIDFamMap 1 of 2:
## Scraping kinase IDs and making a list of URLs

KIDFamMap is a database listing thousands of known kinase inhibitors. 

340 kinases are listed on the "browse" page at URL http://gemdock.life.nctu.edu.tw/kidfammap/browse.php.

Each kinase acts as a hyperlink that can be clicked on to navigate to a "kinase" page, containing information about that kinase.

The "kinase" page includes a link to an "inhibitors" page, containing a list of inhibitors for either
- the same kinase, or
- a "reference kinase"

A simple spider might take the following steps:
- scrape and follow every "kinase" page URL on the "browse" page
- scrape and follow the "inhibitors" page URL on each "kinase" page
- scrape and parse the inhibitor details from each "inhibitors" page

Such a spider would not return the inhibitor details for all 340 kinases: any kinase whose page links to a "reference kinase", rather than to itself, would be skipped.

However, it is possible to manually insert the kinase ID of interest into any "inhibitors" page's URL, in order to find its inhibitor information. Some kinases simply do not have any inhibitors listed on KIDFamMap, but many more do.
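
As a minimal sketch of inserting a kinase ID into an "inhibitors" page URL, the helper below fills in the URL pattern used later in this notebook. The function name `inhibitor_url` and the ID `"ABL1"` are illustrative placeholders only; the real IDs are whatever the spider scrapes from the "browse" page. The sketch uses `urllib.parse.urlencode`, which escapes characters such as spaces, whereas the loop further down uses plain string concatenation.

```python
from urllib.parse import urlencode

# Base address of the "inhibitors" page, taken from the URL pattern
# used later in this notebook
BASE_URL = "http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php"


def inhibitor_url(kinase_id):
    """Build the "inhibitors" page URL for one kinase ID (hypothetical helper)."""
    query = urlencode({
        "QueryType": "Protein",
        "QueryName": kinase_id,
        "Query_Pid": kinase_id,
    })
    return f"{BASE_URL}?{query}"


# "ABL1" is a placeholder ID and may not match KIDFamMap's own spelling
example_url = inhibitor_url("ABL1")
print(example_url)
```

For IDs made up only of letters and digits, this should give the same URL as the concatenation in the loop below.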

Here we use the Scrapy package to make a spider that scrapes all 340 kinase IDs from KIDFamMap. We then generate the 340 "inhibitors" page URLs using a loop. These URLs will be used in "Web-scraping-KIDFamMap_2-of-2_Inhibitor-information-and-csv-generation.ipynb" to scrape KIDFamMap for the inhibitor details and produce two tables for our database: "inhib_kin.csv" and "inhibitors.csv".

Import packages

In [None]:
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess

Make a spider

In [None]:
# Use the scrapy.Spider class to make a kinase-name-scraping spider

class KinaseSpider( scrapy.Spider ):
    
    name = "kinase_spider"
    
    # Define the first action to take
        
    def start_requests( self ): 
        
        # Define which URL to follow
        
        url = 'http://gemdock.life.nctu.edu.tw/kidfammap/browse.php'

        # Request the web page at the above URL. Scrapy returns a
        # response object, which contains the HTML code for that page,
        # and passes it to the callback, i.e. the parse method defined
        # below
        
        yield scrapy.Request( url = url, callback = self.parse )
            
    # Using the HTML in the previous response object, get the kinase 
    # names
    
    def parse( self, response ):
        
        # Use an XPath locator to select the text of every link inside a
        # table cell in the HTML and extract the results as a list of
        # strings. That list is then appended to the global list
        # "kinases" (which we must initialise in the next cell rather
        # than here)
        
        kins = response.xpath( '//td/a/text()' ).extract()
        
        kinases.append( kins )
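
To see what the locator `//td/a/text()` picks out, here is a small offline sketch on a made-up table. It is not the spider itself: the standard library's `xml.etree.ElementTree` stands in for Scrapy's selector, it only handles well-formed markup, and the sample HTML and kinase names are invented for illustration rather than copied from the KIDFamMap page.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; the real "browse" page may be
# structured differently
sample_html = """
<table>
  <tr><td><a href="#">KINASE_A</a></td><td><a href="#">KINASE_B</a></td></tr>
  <tr><td>no link here</td><td><a href="#">KINASE_C</a></td></tr>
</table>
""".strip()

root = ET.fromstring(sample_html)

# ".//td/a" finds every <a> element that is a direct child of a <td>;
# taking .text mirrors the trailing "/text()" step of the XPath
link_texts = [a.text for a in root.findall(".//td/a")]
print(link_texts)
```

Cells without a link contribute nothing to the result, which is why only the linked kinase names are collected.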

Run the spider: crawl KIDFamMap for kinases

In [None]:
kinases = []

# Run the spider

process = CrawlerProcess()
process.crawl( KinaseSpider )
process.start()

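# N.B. process.start() can only be called once per kernel session
# (the underlying Twisted reactor cannot be restarted), so restart
# the kernel before running this cell again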

The spider appended its whole list of kinase names as a single element, so the kinases are all stored in the first element of the "kinases" list

In [None]:
kinases = kinases[ 0 ]
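
The reason for the extra indexing step can be shown in isolation. In this small sketch the names are placeholders, not real scraped values: `append` adds the whole scraped list as one element, whereas `extend` would add each name separately and make the unwrapping unnecessary.

```python
# Placeholder names standing in for the scraped kinases
scraped = ["KINASE_A", "KINASE_B"]

# append() adds the list itself as a single element, giving a nested
# list, so the names have to be unwrapped with [0]
nested = []
nested.append(scraped)
unwrapped = nested[0]

# extend() adds each name as its own element, giving a flat list
flat = []
flat.extend(scraped)

print(len(nested), len(flat))
```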

Make URLs from the kinases list

In [None]:
kinase_urls = []

for kinase in kinases:
    url = "http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=" + str( kinase ) + "&Query_Pid=" + str( kinase )
    kinase_urls.append( url )

Print the URLs

In [None]:
print( kinase_urls )
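
This notebook does not say how the URLs reach the second notebook. One possible approach, sketched here as an assumption, is to write them to a plain text file, one URL per line, and read them back later. The helper names `save_urls` and `load_urls`, the file name `kinase_urls.txt` and the demo kinase names are all hypothetical.

```python
import tempfile
from pathlib import Path


def save_urls(urls, path):
    """Write one URL per line to a text file (hypothetical helper)."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")


def load_urls(path):
    """Read the URLs back, skipping any blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line for line in lines if line]


# Demo with placeholder URLs, written to a temporary directory so that
# nothing is left behind
demo_urls = [
    "http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=KINASE_A&Query_Pid=KINASE_A",
    "http://gemdock.life.nctu.edu.tw/kidfammap/show_inhibitor.php?QueryType=Protein&QueryName=KINASE_B&Query_Pid=KINASE_B",
]

with tempfile.TemporaryDirectory() as tmp:
    url_file = Path(tmp) / "kinase_urls.txt"
    save_urls(demo_urls, url_file)
    reloaded = load_urls(url_file)

print(reloaded == demo_urls)
```

In the notebook itself, `save_urls( kinase_urls, "kinase_urls.txt" )` would write the real list to the working directory.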