In [2]:
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

Welcome to this walk-through on a simple scraper.  The program made here accomplishes three tasks:

1. Collects a list of URLs from a starting page
2. Scrapes each URL on the list automatically
3. Parses data from text fields and tables into a CSV file

As a simple example, we will use Scrapy to scrape information about every character in Super Smash Bros. 4 (SSB4) from the Super Smash Bros. Wiki.

## Data storage

Scrapy provides a useful tool for storing parsed data: the `Item` class, which behaves much like a dictionary with a fixed set of keys.  We subclass it and declare the fields we would like the item to have.  Let's try some basic info for each character: their name, their original game, a short description and their list of special moves.

In [3]:
class Character(scrapy.Item):
    
    name = scrapy.Field()
    game = scrapy.Field()
    description = scrapy.Field()
    specials = scrapy.Field()

Now let's work out how to get the list of URLs so we can scrape the information for each character.  Normally we would define this as a method of our own spider class, but for the sake of storytelling we can define it at module level first.

The function receives the `response` to the spider's `GET` request.  From here we can use Scrapy's built-in XPath selectors to work through the HTML of the page.

If you visit [the site](https://www.ssbwiki.com/Super_Smash_Bros._for_Wii_U) on Google Chrome, it's easy to view the HTML in a nice tree format by right-clicking the element you'd like to scrape and hitting "inspect".


The main page lists all the characters in a table, each with a link to its own page, where we will find the data to fill in our `Character` item.

There are multiple ways to reach the part of the page you'd like.  Below, I select the first (`[1]`) table whose `class` attribute equals `wikitable`, then select all `<a>` tags under that node.  This gives us a list of selectors, each wrapping an element that looks like this:

`<a href="/Mario_(SSB4)" title="Mario (SSB4)">Mario</a>`

Sweet!  Now we can extract the address from each hyperlink by continuing our XPath from the current selection, which is what the leading full stop (`.`) does.  We target the `href` attribute and `extract()` each instance.  Then we keep only the links containing `(SSB4)`, drop duplicates with `set`, and prepend the root of the wiki to each path, and we have a working list ready to scrape!

In [4]:
def parse_URLS(response):
                               # //table              : all tables
                               # [@class='wikitable'] : with class = 'wikitable' 
                               # [1]                  : 1st appearance
                               # //a                  : all 'a' tags from that node
    characters = response.xpath('//table[@class="wikitable"][1]//a')

                               # .      : continue from the current selection (each <a> above)
                               # @href  : every href attribute at or below that node
    urls = characters.xpath('.//@href').extract()
    urls = [('https://www.ssbwiki.com' + url) for url in set(urls) if '(SSB4)' in url]

    return urls # list of strings

If you got lost after reading all those XPath arguments, me too!  For the second half I'll be using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to do the same kind of job, as it has a much friendlier API.  We create a BeautifulSoup instance from the response's HTML string and then traverse it.

So let's work out what to do on each character page to fill out our `Item`.  [Here's Mario's](https://www.ssbwiki.com/Mario_(SSB4)).  The name is easy: it's an `h1` tag with a recognizable `id`, so we use the `find` method and feed in the attribute via a dictionary.  The `text` attribute drops the tags and returns the page title, then we just remove the trailing ` (SSB4)` to get the name in a format we like.
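
A word of caution when trimming that ending: `str.rstrip` treats its argument as a set of characters to remove, not as a literal suffix.  Names such as Mario are unaffected, but a title that itself ends in one of those characters would be cut short.  The sketch below shows the difference; `strip_suffix` is a hypothetical helper and `'Game B (SSB4)'` is a made-up title used only to trigger the problem.

```python
SUFFIX = ' (SSB4)'


def strip_suffix(title, suffix=SUFFIX):
    """Remove `suffix` from the end of `title` only if it is really there."""
    return title[:-len(suffix)] if title.endswith(suffix) else title


real_title = 'Mario (SSB4)'
made_up_title = 'Game B (SSB4)'  # hypothetical title, not a real page

print(real_title.rstrip(SUFFIX))     # Mario
print(made_up_title.rstrip(SUFFIX))  # Game   (the trailing " B" is eaten too)
print(strip_suffix(made_up_title))   # Game B
```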

The description too is fairly simple - it's just the first paragraph tag.

The last two involve traversing a table to find the correct info.  To find `game` we select the table via its distinctive `class`, then use `findAll('tr')` to return a list of every row in the table.  The game is in the third row, second column, so we select the row with index `[2]`.  We do the same for the columns using the `td` tag and index `[1]`, then extract the text and strip the line break at the end.
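
Here is the row-then-column lookup on its own, as a sketch against a tiny hand-written table.  The cell contents are assumptions for illustration and do not reproduce the wiki's real infobox; only the `class` value and the indexing pattern come from the walkthrough.  It uses `find_all`, the current spelling of `findAll` in BeautifulSoup 4.

```python
from bs4 import BeautifulSoup

# Made-up infobox: the value we want sits in the third row, second cell
html = """
<table class="infobox bordered">
  <tr><td>Row one</td><td>a</td></tr>
  <tr><td>Row two</td><td>b</td></tr>
  <tr><td>Universe</td><td>Mario
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Passing the full string matches the exact value of the class attribute
table = soup.find('table', {'class': 'infobox bordered'})

rows = table.find_all('tr')          # list of every row
cells = rows[2].find_all('td')       # third row -> its cells
game = cells[1].text.strip('\n')     # second cell, minus the line break

print(repr(game))
```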

For the special moves, we are extracting four different elements from a box much further down.  You may notice this box has no distinguishable attributes!  Trickier to select, but of course not impossible.

To get there, I selected an element slightly above it: the `span` whose `id` is `Moveset`.  I then moved up the tree to its parent `h2` header.  From there we can move sideways to the next sibling table with class `wikitable` and select it.

Then selecting the elements is done as before.  Note that this table is not laid out identically for all characters, because their normal moves have different properties.  For this reason I indexed the rows I wanted from the bottom using negative integers.
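
The up-then-sideways navigation can be sketched on a simplified page fragment.  Everything in the HTML below is an assumption made for illustration: the real table is much longer, which is why the walkthrough uses offsets such as `-13`, whereas this toy table only needs `-2` and `-1`.

```python
from bs4 import BeautifulSoup

# Simplified stand-in: a heading, some text, then an unlabelled moves table
html = """
<div>
  <h2><span class="mw-headline" id="Moveset">Moveset</span></h2>
  <p>Some introductory text.</p>
  <table class="wikitable">
    <tr><th>Input</th><th>Name</th></tr>
    <tr><td>Neutral attack</td><td>Jab</td></tr>
    <tr><td>Neutral special</td><td>Fireball</td></tr>
    <tr><td>Side special</td><td>Cape</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

anchor = soup.find('span', {'id': 'Moveset'})   # the easy-to-find element
header = anchor.find_parent('h2')               # move up to the heading
moves = header.find_next_sibling('table', {'class': 'wikitable'})  # move sideways

# Count rows from the bottom so extra rows at the top do not shift the result
rows = moves.find_all('tr')
specials = [rows[i].find_all('td')[1].text for i in (-2, -1)]

print(specials)
```

`find_next_sibling` skips over the paragraph in between, so the table does not have to follow the heading immediately.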

In [5]:
def parse_char(response):
    soup = BeautifulSoup(response.text, 'html.parser')
    char = Character()

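    # replace() removes the literal ' (SSB4)'; rstrip(' (SSB4)') would strip a set of characters instead
    char['name'] = soup.find('h1', {'id':'firstHeading'}).text.replace(' (SSB4)', '')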
    char['description'] = soup.p.text.strip('\n')
    char['game'] = soup.find('table', {'class':'infobox bordered'}).findAll('tr')[2].findAll('td')[1].text.strip('\n')
    moves = soup.find('span', {'id':'Moveset'}).find_parent('h2').find_next_sibling('table',{'class':'wikitable'})

    char['specials'] = []
    for num in [-13, -10, -7, -4]:
        move = moves.findAll('tr')[num].findAll('td')[1]
        char['specials'].append(move.text)

    return char

Now we're ready to crawl!  We set up a spider with the parsing functions we have just defined.  Its `parse()` method gets the list of URLs via `parse_URLS()` and yields a separate request for each one.  We specify the callback for those requests within the class, where we use `parse_char()` to return a character item for each URL.  All these character items are then written to a CSV file, correctly formatted!

In [6]:
class SmashSpider(scrapy.Spider):  # plain Spider: CrawlSpider uses parse() for its own rule handling
    name = "smash"
    start_urls = ["https://www.ssbwiki.com/Super_Smash_Bros._for_Wii_U"]

    def parse(self, response):
        urls = parse_URLS(response)
        
        for url in urls:
            yield scrapy.Request(url, callback=self.secondary_parse)

    def secondary_parse(self, response):
        char = parse_char(response)
        
        yield char

Then run the code below to begin the crawl and save the data!

In [7]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'smash_characters.csv'
})

process.crawl(SmashSpider)
process.start()

2018-08-02 10:51:13 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-02 10:51:13 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.0.3, Platform Windows-10-10.0.17134-SP0
2018-08-02 10:51:13 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'smash_characters.csv', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2018-08-02 10:51:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2018-08-02 10:51:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',


2018-08-02 10:51:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Jigglypuff_(SSB4)>
{'description': 'Jigglypuff (プリン, Purin) is a playable character in Super '
                'Smash Bros. 4. After initially being seen several times '
                'during the Super Smash Bros. for Wii U 50-Fact Extravaganza '
                'on October 23rd, 2014,[1] it was formally added to the '
                'official website on November 5th, 2014. Jigglypuff is once '
                'again voiced by Rachael Lillis in English and Mika Kanai in '
                'Japanese, albeit via recycled voice clips. As in previous '
                'games, it also has different voice actresses in French and '
                'German.',
 'game': 'Pokémon',
 'name': 'Jigglypuff',
 'specials': ['Rollout', 'Pound', 'Sing', 'Rest']}
2018-08-02 10:51:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Pikachu_(SSB4)>
{'description': 'Pikachu (ピカチュウ, Pikachu) is a

2018-08-02 10:51:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Link_(SSB4)>
{'description': 'Link (リンク, Link) is a playable character in Super Smash Bros. '
                '4. His return to the series was announced during the E3 '
                'Nintendo Direct on June 11th, 2013.[1] He was also among the '
                'first wave of amiibo that are compatible with SSB4. Akira '
                "Sasanuma reprises his role as Link's voice actor, albeit via "
                'voice clips recycled from The Legend of Zelda: Twilight '
                'Princess.',
 'game': 'The Legend of Zelda',
 'name': 'Link',
 'specials': ["Hero's Bow", 'Gale Boomerang', 'Spin Attack', 'Bomb']}
2018-08-02 10:51:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Ness_(SSB4)>
{'description': 'Ness (ネス, Ness) returns as a playable character in Super '
                'Smash Bros. 4. Ness was officially confirmed on October 3rd, '
                '201

2018-08-02 10:51:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Charizard_(SSB4)>
{'description': 'Charizard (リザードン, Lizardon) is a playable character in Super '
                'Smash Bros. 4. Its return to the series was announced during '
                'a Super Smash Bros. Direct on April 8th, 2014, during which '
                'its fellow Pokémon representative Greninja was also '
                "revealed.[1] Shin'ichirō Miki reprises his role as "
                "Charizard's voice actor, albeit via re-recorded voice clips "
                'that match how it sounds in the Pokémon anime.',
 'game': 'Pokémon',
 'name': 'Charizard',
 'specials': ['Flamethrower', 'Flare Blitz', 'Fly', 'Rock Smash']}
2018-08-02 10:51:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Little_Mac_(SSB4)>
{'description': 'Little Mac (リトル・マック, Little Mac) is a playable character in '
                'Super Smash Bros. 4. He was revealed during a Ninte

2018-08-02 10:51:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Bayonetta_(SSB4)>
{'description': 'Bayonetta (ベヨネッタ, Bayonetta) is a character and newcomer in '
                'Super Smash Bros. 4, and is the seventh and final '
                'downloadable character. She was announced alongside Corrin '
                'during the Super Smash Bros. - Final Video Presentation on '
                'December 15th, 2015 and both were released on February 3rd, '
                '2016. She is the sixth third-party character to be introduced '
                "in SSB4, following fellow SEGA character Sonic, Capcom's Mega "
                "Man and Ryu, Bandai Namco's Pac-Man and Square Enix's Cloud. "
                'Bayonetta was added to the game as the winner of the Smash '
                'Bros. Fighter Ballot, being the highest-voted character in '
                'Europe and among the top 5 in North America, making her the '
                'overall #1 wor

2018-08-02 10:51:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Mario_(SSB4)>
{'description': 'Mario (マリオ, Mario) is a playable character in Super Smash '
                'Bros. 4. He was confirmed on June 11th, 2013 during the E3 '
                '2013 Nintendo Direct.[1] He was also one of the main subjects '
                "of the Developer's Direct for Super Smash Bros. later during "
                'E3 2013.[2] He was among the first wave of amiibo figurines '
                'for SSB4. Mario is once again voiced by Charles Martinet, who '
                'also reprises his long-time role as Luigi, Wario, and '
                'Waluigi, albeit via recycled voice clips from Super Smash '
                'Bros. Brawl. [3]',
 'game': 'Mario',
 'name': 'Mario',
 'specials': ['Fireball', 'Cape', 'Super Jump Punch', 'F.L.U.D.D.']}
2018-08-02 10:51:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Sheik_(SSB4)>
{'description': 'Sheik 

2018-08-02 10:51:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ssbwiki.com/Peach_(SSB4)> (referer: https://www.ssbwiki.com/Super_Smash_Bros._for_Wii_U)
2018-08-02 10:51:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ssbwiki.com/Yoshi_(SSB4)> (referer: https://www.ssbwiki.com/Super_Smash_Bros._for_Wii_U)
2018-08-02 10:51:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ssbwiki.com/Corrin_(SSB4)> (referer: https://www.ssbwiki.com/Super_Smash_Bros._for_Wii_U)
2018-08-02 10:51:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ssbwiki.com/Peach_(SSB4)>
{'description': 'Peach (ピーチ, Peach) is a playable character in Super Smash '
                'Bros. 4. She was confirmed on September 12th, 2013,[1] the '
                'day before the 28th anniversary of Super Mario Bros., the '
                'landmark NES game in which she debuted. She is among the '
                'first wave of compatible amiibo figures.',
 'game': 'Mario',
 'na

2018-08-02 10:51:31 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-02 10:51:31 [scrapy.extensions.feedexport] INFO: Stored csv feed (58 items) in: smash_characters.csv
2018-08-02 10:51:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 18042,
 'downloader/request_count': 59,
 'downloader/request_method_count/GET': 59,
 'downloader/response_bytes': 2204095,
 'downloader/response_count': 59,
 'downloader/response_status_count/200': 59,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 2, 9, 51, 31, 217577),
 'item_scraped_count': 58,
 'log_count/DEBUG': 118,
 'log_count/INFO': 8,
 'request_depth_max': 1,
 'response_received_count': 59,
 'scheduler/dequeued': 59,
 'scheduler/dequeued/memory': 59,
 'scheduler/enqueued': 59,
 'scheduler/enqueued/memory': 59,
 'start_time': datetime.datetime(2018, 8, 2, 9, 51, 13, 954983)}
2018-08-02 10:51:31 [scrapy.core.engine] INFO: Spider closed (finished)
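
The run above used Scrapy 1.5.1, where `FEED_FORMAT` and `FEED_URI` are the way to configure the export.  From Scrapy 2.1 onward those two settings are deprecated in favour of a single `FEEDS` dictionary.  As a sketch, assuming a newer Scrapy, the equivalent settings would look roughly like this (the variable name `settings` is just for illustration):

```python
# Settings for Scrapy 2.1+: one FEEDS entry per output file
settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEEDS': {
        'smash_characters.csv': {'format': 'csv'},
    },
}

print(settings['FEEDS'])
```

This dictionary would be passed to `CrawlerProcess(settings)` in place of the one in the cell above.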


In [9]:
import pandas as pd
pd.read_csv('smash_characters.csv')

| | description | game | name | specials |
|---|---|---|---|---|
| 0 | Ryu (リュウ, Ryū) is a playable character in Supe... | Street Fighter | Ryu | Hadoken,Tatsumaki Senpukyaku,Shoryuken,Focus A... |
| 1 | Robin (ルフレ, Reflet) is a playable newcomer in ... | Fire Emblem | Robin | Thunder,Arcfire,Elwind,Nosferatu |
| 2 | Captain Falcon (キャプテン・ファルコン, Captain Falcon) r... | F-Zero | Captain Falcon | Falcon Punch,Raptor Boost,Falcon Dive,Falcon Kick |
| 3 | Olimar (ピクミン＆オリマー, Pikmin & Olimar) is a playa... | Pikmin | Olimar | Pikmin Pluck,Pikmin Throw,Winged Pikmin,Pikmin... |
| 4 | Ike (アイク, Ike) is a playable character in Supe... | Fire Emblem | Ike | Eruption,Quick Draw,Aether,Counter |
| 5 | Jigglypuff (プリン, Purin) is a playable characte... | Pokémon | Jigglypuff | Rollout,Pound,Sing,Rest |
| 6 | Pikachu (ピカチュウ, Pikachu) is a playable charact... | Pokémon | Pikachu | Thunder Jolt,Skull Bash,Quick Attack,Thunder |
| 7 | Dr. Mario (Dr. マリオ, Dr. Mario) is a playable c... | Mario | Dr. Mario | Megavitamins,Super Sheet,Super Jump Punch,Dr. ... |
| 8 | Lucas (リュカ, Lucas) is a playable character in ... | EarthBound | Lucas | PK Freeze,PK Fire,PK Thunder,PSI Magnet |
| 9 | A Mii Swordfighter (剣術タイプ, Fencing Type), or M... | Super Smash Bros. | Mii Swordfighter | Gale Strike,Airborne Assault,Stone Scabbard,Bl... |
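
Notice that `specials` was a list in each item, but in the CSV it comes back as one comma-joined string.  Here is a minimal sketch of splitting it into a list again, assuming no move name itself contains a comma.  To stay self-contained it reads a two-row inline sample copied from the output above (without the `description` column) rather than the real file.

```python
import io

import pandas as pd

# Inline sample standing in for smash_characters.csv
csv_text = (
    'game,name,specials\n'
    'Fire Emblem,Robin,"Thunder,Arcfire,Elwind,Nosferatu"\n'
    'Pokémon,Jigglypuff,"Rollout,Pound,Sing,Rest"\n'
)

df = pd.read_csv(io.StringIO(csv_text))

# Turn the comma-joined string back into a Python list per row
df['specials'] = df['specials'].str.split(',')

print(df.loc[0, 'specials'])
```

If the list structure matters downstream, exporting to a JSON feed instead of CSV would avoid the round trip, since JSON can store lists directly.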
