# Download SCMP

This notebook scrapes every record SCMP holds for horses that have raced in Hong Kong. Starting from the page of JSON that HKRacing generates for an empty search, it runs through the list of every horse on the site and records every race each horse has taken part in, along with other variables such as sire, dam, trainer, jockey and odds.

## Import  

In [29]:
import pandas as pd 
import numpy as np
import json 
import csv
import os
# pandas >= 1.0 exposes json_normalize at the top level; the pandas.io.json path is deprecated
from pandas import json_normalize

import requests 

import scrapy
from scrapy.crawler import CrawlerProcess

## Retrieve IDs

Scrape the page of JSON that HKRacing generates when you search for a horse name. In this case we're scraping the page that results from searching for a single empty space, since this returns the full list of all horses recorded on HKRacing.

In [30]:
url = 'https://www.scmp.com/sport/racing/ajax/horses-search/HorseName/%20'
hk_url = 'https://www.scmp.com/sport/racing/stats/horses/{}/{}'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
requested_json = requests.get(url, headers=headers)
type(requested_json)

requests.models.Response
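
The `%20` at the end of the search URL is a percent-encoded space. As a minimal sketch, the same URL can be assembled for any search term with the standard library. `build_search_url` and `SEARCH_URL` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
from urllib.parse import quote

# Same endpoint as `url` above, with the search term left as a placeholder
SEARCH_URL = 'https://www.scmp.com/sport/racing/ajax/horses-search/HorseName/{}'

def build_search_url(term):
    # Percent-encode the term so spaces and reserved characters are URL-safe
    return SEARCH_URL.format(quote(term, safe=''))

# A single space reproduces the "list every horse" URL used in this notebook
all_horses_url = build_search_url(' ')
```

Building the URL this way keeps the encoding explicit if the search is ever narrowed to a specific name.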

Decode the response returned by the request and slice off the outer {data:...} wrapper so that the result isn't nested and can be easily normalized. Once that wrapper has been removed, parse the remaining string into a Python dictionary.

In [31]:
decoded_json = requested_json.content.decode()
sliced_json = decoded_json[8:-1]
loaded_json = json.loads(sliced_json)
type(loaded_json)

dict
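
Slicing by fixed character positions works only while the wrapper stays exactly eight characters long. A possible alternative, sketched below, is to parse the whole payload and then index into it. This assumes the payload is a JSON object whose outer key is `data`, which is what the `[8:-1]` slice implies but is not confirmed here. The `sample` string is made up for illustration, and `unwrap_data` is a hypothetical helper.

```python
import json

def unwrap_data(raw_text):
    # Parse the full payload, then step inside the outer "data" key (assumed name)
    parsed = json.loads(raw_text)
    return parsed["data"]

# Made-up payload with the same shape the notebook appears to receive
sample = '{"data":{"result":[{"horse_id":"V614","horse_name":"A SHIN HIKARI"}]}}'

loaded = unwrap_data(sample)

# The notebook's slicing approach gives the same dictionary on this sample
sliced = json.loads(sample[8:-1])
```

Indexing by key fails loudly with a `KeyError` if the wrapper changes, whereas a wrong slice tends to fail with a less obvious JSON parsing error.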

Convert the JSON dictionary into a DataFrame that we can iterate through while scraping the full page for each horse.

In [32]:
normal_json = json_normalize(loaded_json['result'])
indexed_json = normal_json.reset_index()
indexed_json.head()

Unnamed: 0,index,horse_id,horse_name
0,0,B340,B340
1,1,V614,A SHIN HIKARI
2,2,B345,ABOVE
3,3,A328,ABSOLUCOOL
4,4,P626,ABSOLUTELY WIN
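
Each row of this table supplies the two values that fill the `hk_url` template: the horse ID and a URL slug derived from the horse name. The sketch below shows that transformation in isolation, using the same string operations the spider applies. `build_horse_url` is a hypothetical helper name.

```python
# Same template as `hk_url` defined earlier in the notebook
hk_url = 'https://www.scmp.com/sport/racing/stats/horses/{}/{}'

def build_horse_url(horse_id, horse_name):
    # Slug: trim whitespace, lower-case, and replace spaces with hyphens
    slug = horse_name.strip().lower().replace(" ", "-")
    return hk_url.format(horse_id, slug)

example_url = build_horse_url('V614', 'A SHIN HIKARI')
```

Here `example_url` is `https://www.scmp.com/sport/racing/stats/horses/V614/a-shin-hikari`.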


## Build a Spider

Create a spider using Scrapy to crawl the SCMP website. For each entry in the JSON retrieved earlier, the spider visits that horse's page and retrieves the fields listed below. Everything is saved to the file designated under custom_settings.

In [33]:
class HKSpider(scrapy.Spider):

    name = "hkracing_spider"
    allowed_domains = ['www.scmp.com']

    custom_settings = {
        'FEED_FORMAT':'csv', 
        'FEED_URI':'hkhorses.csv'
    }
    
    def start_requests(self):
        # Queue one request per horse returned by the search
        for horse_id, horse_name in zip(indexed_json['horse_id'], indexed_json['horse_name']):
            # URL slug: lower-case name with spaces replaced by hyphens
            slug = horse_name.strip().lower().replace(" ", "-")
            yield scrapy.Request(hk_url.format(horse_id, slug), self.get_horse)
    
    def get_horse(self, response):
        print("processing: " + response.url)
        
        name=response.xpath("//div[@class='wrapper']//div[@class='header']//h1/text()").extract()
        sire=response.xpath("//div[@class='wrapper']//div[@class='details']//p/text()[11]").extract()
        dame=response.xpath("//div[@class='wrapper']//div[@class='details']//p/text()[13]").extract()
        date=response.xpath("//div[@class='race-table']//tbody//tr//td[1]//text()").extract()
        race_number=response.xpath("//div[@class='race-table']//tbody//tr//td[2]//a/text()").extract()
        track=response.xpath("//div[@class='race-table']//tbody//tr//td[3]/text()").extract()
        distance=response.xpath("//div[@class='race-table']//tbody//tr//td[4]/text()").extract()
        cl=response.xpath("//div[@class='race-table']//tbody//tr//td[5]/text()").extract()
        rank=response.xpath("//div[@class='race-table']//tbody//tr//td[6]/text()").extract()
        trainer=response.xpath("//div[@class='race-table']//tbody//tr//td[7]/text()").extract()
        weight=response.xpath("//div[@class='race-table']//tbody//tr//td[8]/text()").extract()
        jockey=response.xpath("//div[@class='race-table']//tbody//tr//td[9]/text()").extract()
        dr=response.xpath("//div[@class='race-table']//tbody//tr//td[10]/text()").extract()
        gr=response.xpath("//div[@class='race-table']//tbody//tr//td[11]/text()").extract()
        win_time=response.xpath("//div[@class='race-table']//tbody//tr//td[15]/text()").extract()
        last_qtr=response.xpath("//div[@class='race-table']//tbody//tr//td[16]/text()").extract()
        section_time=response.xpath("//div[@class='race-table']//tbody//tr//td[17]/text()").extract()
        ln_running=response.xpath("//div[@class='race-table']//tbody//tr//td[18]/text()").extract()
        w_m=response.xpath("//div[@class='race-table']//tbody//tr//td[19]/text()").extract()
        horse_wt=response.xpath("//div[@class='race-table']//tbody//tr//td[20]/text()").extract()
        rt=response.xpath("//div[@class='race-table']//tbody//tr//td[21]/text()").extract()
        odds_on=response.xpath("//div[@class='race-table']//tbody//tr//td[22]/text()").extract()
        odds_last=response.xpath("//div[@class='race-table']//tbody//tr//td[23]/text()").extract()
        
        
        row_data=zip(date, race_number, track, distance, cl, rank, trainer, weight, jockey, dr, gr, win_time, last_qtr, section_time, ln_running, w_m, horse_wt, rt, odds_on, odds_last)
        
        
        for item in row_data: 
            scraped_info = {
                'name':name, 
                'sire':sire, 
                'dame':dame, 
                'date':item[0], 
                'race_number':item[1], 
                'track':item[2], 
                'distance':item[3], 
                'cl':item[4], 
                'rank':item[5], 
                'trainer':item[6], 
                'weight':item[7], 
                'jockey':item[8], 
                'dr':item[9], 
                'gr':item[10], 
                'win_time':item[11], 
                'last_qtr':item[12], 
                'section_time':item[13], 
                'ln_running':item[14], 
                'w_m':item[15], 
                'horse_wt':item[16], 
                'rt':item[17], 
                'odds_on':item[18], 
                'odds_last':item[19],
            }
            yield scraped_info 
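
One caveat with extracting each column separately and combining the lists with `zip`: an XPath `text()` step returns nothing for an empty cell, so that column's list comes up short and `zip` can drop or shift rows. Whether SCMP's tables contain empty cells has not been checked here. The sketch below illustrates the effect, and a row-by-row alternative, on made-up markup. It uses the standard library's `ElementTree` rather than Scrapy selectors so that it runs on its own; `rows_to_dicts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up table: the second row has an empty "track" cell
SAMPLE = (
    "<table><tbody>"
    "<tr><td>11-12-16</td><td>ST tf g A</td><td>2000</td></tr>"
    "<tr><td>13-12-15</td><td></td><td>2000</td></tr>"
    "</tbody></table>"
)
FIELDS = ["date", "track", "distance"]

def rows_to_dicts(markup, fields):
    # Walk the table one <tr> at a time so cells stay attached to their row
    root = ET.fromstring(markup)
    records = []
    for tr in root.iter("tr"):
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        records.append(dict(zip(fields, cells)))
    return records

records = rows_to_dicts(SAMPLE, FIELDS)

# Column-wise extraction for comparison: the empty cell yields no text node
root = ET.fromstring(SAMPLE)
dates = [tr.findall("td")[0].text for tr in root.iter("tr")]
tracks = [tr.findall("td")[1].text for tr in root.iter("tr")
          if tr.findall("td")[1].text]
column_wise = list(zip(dates, tracks))
```

On this sample the row-wise version keeps both races, with an empty string for the missing track, while the column-wise version keeps only one. In the spider the same idea would mean looping over the `tr` selectors and reading each `td` relative to its row.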

## Retrieve and Store Data

Run the Scrapy spider only if the data hasn't already been downloaded, and send a browser-like user agent while doing so.

In [34]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.settings['LOG_LEVEL']='WARNING'

if not(os.path.isfile('hkhorses.csv')): 
    process.crawl(HKSpider)
    process.start()

2019-05-10 15:22:01 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
2019-05-10 15:22:01 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17763-SP0


Double-check here that the data loads properly and looks the way you'd expect.

In [35]:
hk_csv = pd.read_csv('hkhorses.csv')
hk_csv.shape

(17449, 23)

In [36]:
hk_csv.head(5)

Unnamed: 0,name,sire,dame,date,race_number,track,distance,cl,rank,trainer,...,gr,win_time,last_qtr,section_time,ln_running,w_m,horse_wt,rt,odds_on,odds_last
0,A SHIN HIKARI (V614) 榮進之光,Deep Impact,Catalina,11-12-16,257,ST tf g A,2000,G1,10,M. Sakaguchi,...,H,2:00.9,24.1,73.34 23.46 25.38,1-1-1-10,7.75,1134,129,6.8,8.6
1,A SHIN HIKARI (V614) 榮進之光,Deep Impact,Catalina,13-12-15,251,ST tf g A,2000,G1,1,M. Sakaguchi,...,H,2:00.6,23.6,73.39 23.59 23.62,1-1-1-1,1.0,1111,114,28.0,38.0
2,ABSOLUTELY WIN (P626) 肯定贏,Oasis Dream,Five Fields,02-03-13,432,ST tf g C,1400,G3,11,G. W. Moore,...,B,1:22.0,23.6,35.43 23.00 24.60,1-1-1-11,6.25,1111,94,6.8,7.1
3,ADONIS (A324) 明月昇輝,Exceed And Excel,Mythical Play,14-04-19,575,ST tf g C,1200,5,5,L. Ho,...,H/PC,1:10:00,23.79,24.90 21.95 23.68,8-6-5,3.25,1043,34,3.8,2.7
4,ADONIS (A324) 明月昇輝,Exceed And Excel,Mythical Play,24-03-19,520,ST tf g C+3,1400,5,4,L. Ho,...,H/PC,1:23:23,24.13,36.61 23.09 23.78,8-8-6-4,1.5,1033,34,9.1,12.0
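
The `win_time` column in this preview mixes two notations (`2:00.9` and `1:10:00`). A minimal sketch of normalising both to seconds follows. It assumes the three-part form means minutes:seconds:hundredths, which is a guess from the sample rows and should be checked against the source pages. `win_time_to_seconds` is a hypothetical helper.

```python
def win_time_to_seconds(text):
    parts = text.strip().split(":")
    if len(parts) == 2:
        # "m:ss.f" form, e.g. "2:00.9"
        minutes, seconds = parts
        return int(minutes) * 60 + float(seconds)
    if len(parts) == 3:
        # "m:ss:hh" form, e.g. "1:23:23" (assumed: hh = hundredths of a second)
        minutes, seconds, hundredths = parts
        return int(minutes) * 60 + int(seconds) + int(hundredths) / 100
    raise ValueError("unrecognised win_time format: " + repr(text))

# Values taken from the preview above
converted = [win_time_to_seconds(t) for t in ["2:00.9", "1:10:00", "1:23:23"]]
```

Applied to the DataFrame, this could be used as `hk_csv['win_time'].map(win_time_to_seconds)`, provided every value matches one of the two forms.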
