## API Scrape
After scraping explainxkcd.com the old-fashioned way, I discovered that I could use MediaWiki's built-in API to query the database. I also discovered that xkcd.com has an EXTREMELY simple API for the same data I had scraped from explainxkcd.com: you can get a JSON object with the title, title text, and transcript of each comic directly from the xkcd server. So I did a bunch of extra work for nothing, but I value the experience.
<br><br>
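As a rough illustration of how simple the xkcd.com API is, here is a minimal sketch using only the standard library. The `info.0.json` endpoint and the `num`, `title`, `alt`, and `transcript` field names are my understanding of the xkcd JSON interface and are not taken from this notebook; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

XKCD_BASE = "https://xkcd.com"


def comic_json_url(number=None):
    """Return the JSON endpoint for one comic, or for the latest comic if number is None."""
    if number is None:
        return f"{XKCD_BASE}/info.0.json"
    return f"{XKCD_BASE}/{number}/info.0.json"


def pick_fields(payload):
    """Keep only the fields used in this project; 'alt' is assumed to hold the title text."""
    return {
        "comic_number": payload.get("num"),
        "title": payload.get("title"),
        "title_text": payload.get("alt"),
        "transcript": payload.get("transcript"),
    }


def fetch_comic(number=None, timeout=10):
    """Download one comic's metadata and trim it (this performs a network request)."""
    with urlopen(comic_json_url(number), timeout=timeout) as resp:
        return pick_fields(json.load(resp))
```

Calling `fetch_comic(614)` would request `https://xkcd.com/614/info.0.json`; any missing field simply comes back as `None`.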
I went on to use the explainxkcd.com API to get topics for each comic. This was a little tricky because I was simultaneously struggling to learn XPath syntax, API query syntax, and the particular way that explainxkcd's wiki is set up, all while hacking Scrapy to do my bidding. I am very happy to have something that finally worked. <br><br>
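To make the XPath part concrete, here is a small sketch of what a `categorymembers` response looks like and how the titles are pulled out of it. The XML below is a hand-written sample shaped like the API's output (the `pageid` values are made up), and it uses the standard library's `ElementTree` rather than Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real responses carry more attributes and many more <cm> elements.
SAMPLE_RESPONSE = """
<api>
  <query>
    <categorymembers>
      <cm pageid="101" ns="14" title="Category:Boomerangs" />
      <cm pageid="102" ns="14" title="Category:Butterfly net" />
    </categorymembers>
  </query>
</api>
""".strip()


def member_titles(xml_text):
    """Return the title attribute of every <cm> element under query/categorymembers."""
    root = ET.fromstring(xml_text)  # root is the <api> element
    return [cm.get("title") for cm in root.findall("./query/categorymembers/cm")]


print(member_titles(SAMPLE_RESPONSE))
```

The Scrapy XPath `/api/query/categorymembers/cm/@title` walks the same path: it starts at the document root, descends to each `cm` element, and selects its `title` attribute.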

## User's Note:
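If you want to run this script, you must restart the kernel before each crawl, because the Twisted reactor that Scrapy runs on cannot be started a second time in the same process. It's really built for my machine, since it depends on another JSON file from a different scrape.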

In [1]:
import pandas as pd
import numpy as np
import scrapy
from scrapy.crawler import CrawlerProcess


class CatSpider(scrapy.Spider):
    name = "Cats"
    
    # Here is where we insert our API call.
    start_urls = [
        'https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Comics_by_topic&cmlimit=max'
        ]

    # Identifying the information we want from the query response and extracting it using xpath.
    def parse(self, response):
        yield {
            'topics': response.xpath('/api/query/categorymembers/cm/@title').extract()
            }
            
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'all_topics.json',
    # We are querying the wiki's API rather than crawling its pages, so robots.txt checks are switched off.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': True,
})
                                         

# Starting the crawler with our spider.
process.crawl(CatSpider)
process.start()
print('Something happened!!')

2018-01-17 02:56:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-01-17 02:56:05 [scrapy.utils.log] INFO: Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'FEED_FORMAT': 'json', 'FEED_URI': 'all_topics.json', 'HTTPCACHE_ENABLED': True, 'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)'}
2018-01-17 02:56:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2018-01-17 02:56:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware'

Something happened!!


In [1]:
import pandas as pd
import numpy as np

In [10]:
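# The feed holds a single item whose 'topics' field lists every category title.
tpx = pd.read_json('all_topics.json')
topics = pd.DataFrame()
topics['topic'] = tpx.loc[0].topics
# Strip the leading 'Category:' prefix (9 characters) from each title.
topics['topic'] = topics.topic.apply(lambda x: x[9:])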

In [3]:
df = pd.read_csv('cleaned_data.csv')
df.index = df.comic_number

In [15]:
def replace_space(string):
    # Category titles use underscores in place of spaces in wiki URLs.
    return string.replace(' ', '_')
        
topics['topic_'] = topics.topic.apply(replace_space)
topic_urls = iter(topics.topic_)
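
Swapping spaces for underscores is enough for topic names like `Butterfly net`. As a hedged sketch, a topic containing other URL-significant characters (the `R&D` below is a hypothetical example, not a real category) would also need percent-encoding, which `urllib.parse.quote` can handle. The `category_members_url` helper is my own naming.

```python
from urllib.parse import quote

API_URL = "https://www.explainxkcd.com/wiki/api.php"


def category_members_url(topic):
    """Build the categorymembers query URL for one topic name."""
    # Underscores replace spaces; quote() percent-encodes anything else that is unsafe.
    title = quote(topic.replace(" ", "_"), safe="")
    return (
        f"{API_URL}?action=query&format=xml&list=categorymembers"
        f"&cmtitle=Category:{title}&cmlimit=max"
    )


print(category_members_url("Butterfly net"))
```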

In [5]:

import scrapy
from scrapy.crawler import CrawlerProcess

class TopicCrawlSpider(scrapy.Spider):
    name = "topic_crawl"
    
    # Here is where we insert our API call.
    next_topic = next(topic_urls)
    start_urls = [
        'https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:{}&cmlimit=max'.format(next_topic)
        ]

    # Identifying the information we want from the query response and extracting it using xpath.
    def parse(self, response):
        yield {
            'comics': response.xpath('/api/query/categorymembers/cm/@title').extract()
            }
        
        try:
            next_topic = 'https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:{}&cmlimit=max'.format(next(topic_urls))
        except StopIteration:
            # Every topic has been requested, so let the crawl finish.
            return
        # Request the next topic and parse it the same way we did above.
        yield scrapy.Request(next_topic, callback=self.parse)
        
            
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'comics_by_topic.json',
    # We are querying the wiki's API rather than crawling its pages, so robots.txt checks are switched off.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': True,
})
                                         

# Starting the crawler with our spider.
process.crawl(TopicCrawlSpider)
process.start()
print('Something happened!!')

2018-01-17 10:52:57 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-01-17 10:52:57 [scrapy.utils.log] INFO: Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'FEED_FORMAT': 'json', 'FEED_URI': 'comics_by_topic.json', 'HTTPCACHE_ENABLED': True, 'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)'}
2018-01-17 10:52:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2018-01-17 10:52:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddle

2018-01-17 10:52:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Boomerangs&cmlimit=max>
{'comics': ['445: I Am Not Good with Boomerangs', '475: Further Boomerang Difficulties', '939: Arrow', '1000: 1000 Comics', '1350: Lorenz']}
2018-01-17 10:52:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Butterfly_net&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Boomerangs&cmlimit=max) ['cached']
2018-01-17 10:52:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Butterfly_net&cmlimit=max>
{'comics': ['1110: Click and Drag', '1193: Externalities', '1243: Snare', '1523: Microdrones', '1622: Henge', '1635: Birdso

2018-01-17 10:52:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Conspiracy_theory&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Computers&cmlimit=max) ['cached']
2018-01-17 10:52:59 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Conspiracy_theory&cmlimit=max>
{'comics': ['250: Snopes', '258: Conspiracy Theories', '690: Semicontrolled Demolition', '966: Jet Fuel', '1081: Argument Victory', '1224: Council of 300', '1274: Open Letter', '1400: D.B. Cooper', '1664: Mycology', '1677: Contrails', '1717: Pyramid Honey', '1803: Location Reviews']}
2018-01-17 10:52:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Categ

2018-01-17 10:53:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Food&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Fiction&cmlimit=max) ['cached']
2018-01-17 10:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Food&cmlimit=max>
{'comics': ['18: Snapple', '27: Meat Cereals', '30: Donner', '38: Apple Jacks', '140: Delicious', '141: Parody Week: Achewood', '142: Parody Week: Megatokyo', '149: Sandwich', '328: Eggs', '341: 1337: Part 1', '388: Fuck Grapefruit', '418: Stove Ownership', '425: Fortune Cookies', '434: xkcd Goes to the Airport', '442: xkcd Loves the Discovery Channel', '452: Mission', '472: House of Pancakes', '654: Nachos', '677: Asshole', '839: Explorers', '915: Connoisseur', '974: The Ge

2018-01-17 10:53:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Kites&cmlimit=max>
{'comics': ['235: Kite', '268: Choices: Part 5', '442: xkcd Loves the Discovery Channel', '482: Height', '1000: 1000 Comics', '1378: Turbine', '1608: Hoverboard', '1614: Kites', "1756: I'm With Her"]}
2018-01-17 10:53:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Language&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Kites&cmlimit=max) ['cached']
2018-01-17 10:53:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Language&cmlimit=max>
{'comics': ['36: Scientists', '37: Hyphen', '41: Old Drawing', '72: Classhole', '75

2018-01-17 10:53:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Music&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Math&cmlimit=max) ['cached']
2018-01-17 10:53:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Music&cmlimit=max>
{'comics': ['56: The Cure', "61: Stacey's Dad", '70: Guitar Hero', '97: A Simple Plan', '118: 50 Ways', '119: Worst Band Name Ever', '132: Music Knowledge', '134: Myspace', '159: Boombox', '161: Accident', '193: The Perfect Sound', '206: Reno Rhymes', "210: 90's Flowchart", '274: With Apologies to The Who', '321: Thighs', '324: Tapping', '339: Classic', '343: 1337: Part 3', '344: 1337: Part 4', '345: 1337: Part 5', '368: Bass', '389: Keeping Time', '400: Important Life Lesson

2018-01-17 10:53:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Public_speaking&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Protip&cmlimit=max) ['cached']
2018-01-17 10:53:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Public_speaking&cmlimit=max>
{'comics': ['153: Cryptography', '285: Wikipedian Protester', '365: Slides', '545: Neutrality Schmeutrality', '685: G-Spot', '690: Semicontrolled Demolition', '756: Public Opinion', '829: Arsenic-Based Life', '867: Herpetology', '975: Occulting Telescope', '1000: 1000 Comics', '1073: Weekend', '1090: Formal Languages', '1661: Podium', '1672: Women on 20s', '1736: Manhattan Project', '1781: Artifacts', '1827: Survivorship Bias', 'Conservation']}
2018-01-1

2018-01-17 10:53:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Roomba&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Romance&cmlimit=max) ['cached']
2018-01-17 10:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Roomba&cmlimit=max>
{'comics': ['413: New Pet', '506: Theft of the Magi', '908: The Cloud', '1183: Rose Petals', '1193: Externalities', '1558: Vet', '1881: Drone Training']}
2018-01-17 10:53:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Sarcasm&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Roomba&cmlimit=ma

2018-01-17 10:53:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Smartphones&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Sheeple&cmlimit=max) ['cached']
2018-01-17 10:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Smartphones&cmlimit=max>
{'comics': ['596: Latitude', '864: Flying Cars', '989: Cryogenics', '1036: Reviews', '1304: Glass Trolling', '1372: Smartwatches', '1373: Screenshot', '1411: Loop', '1422: My Phone is Dying', '1427: iOS Keyboard', '1457: Feedback', '1705: Pokémon Go', '1710: Walking Into Things', '1711: Snapchat', '1731: Wrong', '1787: Voice Commands', '1789: Phone Numbers', '1801: Decision Paralysis', '1802: Phone', '1813: Vomiting Emoji', '1814: Color Pattern', '1815: Flag', 

2018-01-17 10:53:05 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Time&cmlimit=max>
{'comics': ['926: Time Vulture', '994: Advent Calendar', '1061: EST', '1086: Eyelash Wish Log', '1190: Time', '1335: Now', '1491: Stories of the Past and Future', '1524: Dimensions', '1577: Advent', '1624: 2016', '1655: Doomsday Clock', '1658: Estimating Time', '1686: Feel Old', '1688: Map Age Guide', '1704: Gnome Ann', '1786: Trash', '1806: Borrow Your Laptop', '1822: Existential Bug Reports', '1849: Decades', '1930: Calendar Facts', '1935: 2018', 'Category:Daylight saving time', 'Category:Time management', 'Category:Comics to make one feel old', 'Category:Time travel']}
2018-01-17 10:53:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Tips&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/ap

2018-01-17 10:53:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Virtual_Assistants&cmlimit=max>
{'comics': ['1363: xkcd Phone', '1707: xkcd Phone 4', '1807: Listening', '1931: Virtual Assistant']}
2018-01-17 10:53:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Wind_turbine&cmlimit=max> (referer: https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Virtual_Assistants&cmlimit=max) ['cached']
2018-01-17 10:53:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.explainxkcd.com/wiki/api.php?action=query&format=xml&list=categorymembers&cmtitle=Category:Wind_turbine&cmlimit=max>
{'comics': ['556: Alternative Energy Revolution', '1119: Undoing', '1378: Turbine']}
2018-01-17 10:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <G

Something happened!!


In [13]:
topics_by_comic = pd.read_json('comics_by_topic.json')
topics_by_comic.index = topics.index
topics['comics'] = topics_by_comic

In [23]:
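for row in topics.index:
    topic = topics.loc[row].topic_
    comics = topics.loc[row].comics
    # One indicator column per topic, defaulting to 0.
    df[topic] = 0
    for comic in comics:
        try:
            # Titles look like '445: I Am Not Good with Boomerangs'.
            number = int(comic.split(':')[0])
        except ValueError:
            # Entries without a comic number (e.g. subcategories) are skipped.
            continue
        df.at[number, topic] = 1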
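
The loop above relies on each title starting with a comic number. The crawl log shows a few entries that do not, such as `'Conservation'` and `'Category:Time travel'`. A minimal sketch of that parsing step on its own, with a hypothetical helper name, might look like this:

```python
def comic_number(title):
    """Return the leading comic number from a title like '939: Arrow', or None if there is none."""
    prefix, sep, _ = title.partition(":")
    if sep and prefix.strip().isdecimal():
        return int(prefix)
    return None


# Titles taken from the crawl output above.
titles = ["1935: 2018", "Category:Time travel", "Conservation", "341: 1337: Part 1"]
numbers = [n for n in map(comic_number, titles) if n is not None]
print(numbers)  # [1935, 341]
```

Splitting on the first colon only means titles that contain further colons, like `341: 1337: Part 1`, still resolve to the right comic.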
        

In [33]:
df["Banned_from_conferences"].sum()

6.0

In [47]:
tpx = list(topics.topic_)
topics['no_comics'] = list(df[tpx].sum(axis=0))

In [61]:
print('most popular topic: ', topics.loc[topics.no_comics.idxmax()].topic)
print('most rare topic: ', topics.loc[topics[topics.no_comics > 0].no_comics.idxmin()].topic)

most popular topic:  Computers
most rare topic:  Optimization


In [69]:
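# Drop rows with no comic number, such as any rows added above for comics that were not in cleaned_data.csv.
df = df[df.comic_number.notnull()].copy()
df.comic_number = df.comic_number.apply(int)
df.to_csv('comics_and_topics.csv', index=False)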