# 5.2.4 [API/Scrapy Challenge](https://courses.thinkful.com/data-201v1/project/5.2.4)

## Challenge

Do a little scraping or API-calling of your own.  Pick a new website and see what you can get out of it.  Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.  

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  

Once you have your data, compute some statistical summaries and/or visualizations that give you some new insights into your scraping topic of interest.  Write up a report from scraping code to summary and share it with your mentor.

### Twitter API
I did a little reading on the Twitter API. It offers free access to the past 7 days or 30 days of tweets, queried by search term. The specific endpoint I wanted to try required filling out an application, which seemed like a bit much for this practice exercise, so I moved on from this approach.

### Wikipedia API
The practice problem of pulling the titles of all pages that link to a specific page is something I have heard of colleagues doing to harvest metadata on various topics, so I wanted to see if I could pull all the television series that link to a specific genre.
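
The query string used by the spider packs several parameters into one long URL. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library, which makes each parameter easier to read and change. The helper name `build_linkshere_url` is made up for this illustration; the parameter values are the ones from the spider's start URL.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'


def build_linkshere_url(title, lhcontinue=None):
    """Build a MediaWiki 'linkshere' query URL (hypothetical helper)."""
    params = {
        'action': 'query',           # run a query
        'format': 'xml',             # ask for an XML response
        'prop': 'linkshere',         # pages that link to the given title
        'titles': title,
        'lhprop': 'title|redirect',  # urlencode turns '|' into '%7C'
    }
    # The continuation token is only present from the second request onward.
    if lhcontinue is not None:
        params['lhcontinue'] = lhcontinue
    return '{}?{}'.format(API_ENDPOINT, urlencode(params))


url = build_linkshere_url('Action_(TV_series)')
print(url)
```

Note that `urlencode` also percent-encodes the parentheses in the title, so the printed URL differs cosmetically from the hand-written one while naming the same page.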

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess


class WikiSpider(scrapy.Spider):
    name = "WS"
    
    # Here is where we insert our API call.
    start_urls = [
        'https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Action_(TV_series)&lhprop=title%7Credirect'
        ]

    # Identifying the information we want from the query response and extracting it using xpath.
    def parse(self, response):
        for item in response.xpath('//lh'):
            # The ns code identifies the type of page the link comes from.  '0' means it is a Wikipedia entry.
            # Other codes indicate links from 'Talk' pages, etc.  Since we are only interested in entries, we filter:
            if item.xpath('@ns').extract_first() == '0':
                yield {
                    'title': item.xpath('@title').extract_first() 
                    }
        # Getting the information needed to continue to the next ten entries.
        next_page = response.xpath('continue/@lhcontinue').extract_first()
        
        # Recursively calling the spider to process the next ten entries, if they exist.
        if next_page is not None:
            next_page = '{}&lhcontinue={}'.format(self.start_urls[0],next_page)
            yield scrapy.Request(next_page, callback=self.parse)
            
    
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'action_links.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'LearningtoCrawl (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # CLOSESPIDER_PAGECOUNT stops the crawl after about 12 responses. Each query returns
    # 10 links by default, so this caps the run at roughly 120 links before filtering.
    'CLOSESPIDER_PAGECOUNT' : 12
})
                                         

# Starting the crawler with our spider.
process.crawl(WikiSpider)
process.start()
print('Complete')

Complete
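
To make the `parse` logic concrete, here is a small sketch that applies the same filtering and continuation steps to a hand-written XML fragment, using `xml.etree.ElementTree` as a stand-in for Scrapy's selectors. The fragment only imitates the shape of a `linkshere` response: the page IDs and the `lhcontinue` value are invented, and the Talk page entry is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT real API output; IDs and the token are made up.
SAMPLE_XML = """
<api>
  <continue lhcontinue="12345" continue="||" />
  <query>
    <pages>
      <page pageid="1" ns="0" title="Action (TV series)">
        <linkshere>
          <lh pageid="2" ns="0" title="Glee (TV series)" />
          <lh pageid="3" ns="1" title="Talk:Glee (TV series)" />
          <lh pageid="4" ns="0" title="John Vargas" />
        </linkshere>
      </page>
    </pages>
  </query>
</api>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Keep only links from namespace 0 (regular entries), as the spider does.
titles = [lh.get('title') for lh in root.iter('lh') if lh.get('ns') == '0']

# <continue> is a direct child of the root element, which is why the
# relative path 'continue/@lhcontinue' finds it in the spider.
cont = root.find('continue')
next_token = cont.get('lhcontinue') if cont is not None else None

# Build the follow-up URL the same way the spider does.
base_url = ('https://en.wikipedia.org/w/api.php?action=query&format=xml'
            '&prop=linkshere&titles=Action_(TV_series)&lhprop=title%7Credirect')
next_url = '{}&lhcontinue={}'.format(base_url, next_token) if next_token else None

print(titles)
print(next_url)
```

When the response has no `<continue>` element, `next_url` is `None` and the crawl would stop, which mirrors the `if next_page is not None` check.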


In [2]:
import pandas as pd

# Checking data 
action=pd.read_json('action_links.json', orient='records')
print(action.shape)
print(action.tail())

(108, 1)
                               title
103                 Glee (TV series)
104      Hasten Down the Wind (song)
105                      John Vargas
106  List of TV Guide covers (1990s)
107                       Dave Jeser


### Playing with Scrapy
It may be more efficient to scrape, or use an API to pull, all of the links from one of Wikipedia's series-by-genre list pages (oddly, there are a few differently worded lists).

Either way, I would need to pull a list of genre links and iterate to see which series pages link to each genre, or else scrape a list of series names and then scrape the genre from each article.

My goal with this exercise is just to get comfortable using XML and Scrapy.

In [1]:
# Importing in each cell because of the kernel restarts.
import scrapy
import re
from scrapy.crawler import CrawlerProcess

class GenreSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "genre_list"
    
    # URL(s) to start with.
    start_urls = [
        'https://en.wikipedia.org/wiki/List_of_comedy_television_series',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
        for article in response.xpath('//*[@id="mw-content-text"]/div/div[*]/ul/li'):            
            # Yield a dictionary with the values we want.
            yield {
                'name': article.xpath('i/a/@title').extract_first(),
                'link': article.xpath('i/a/@href').extract_first(),
                'text': article.xpath('text()').extract()
            }
            #print(article.xpath('/ul/li[*]/i/a/@title'))

process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'genre_list.json',       # Name our storage file.
    'LOG_ENABLED': False,          # Turn off logging for now.
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'LearningToCrawl (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True
})

# Start the crawler with our spider.
process.crawl(GenreSpider)
process.start()
print('Success!')           

Success!
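
The relative XPath expressions in `parse` are easier to follow against a single list item. The sketch below uses `xml.etree.ElementTree` as a stand-in for Scrapy's selectors on one hand-written `<li>`, whose shape is inferred from the XPath in the spider rather than copied from the live page.

```python
import xml.etree.ElementTree as ET

# Hand-written list item, assumed to match the structure the spider targets.
SAMPLE_ITEM = (
    '<li><i><a href="/wiki/Workaholics" title="Workaholics">Workaholics</a></i>'
    ' (2011–2017)</li>'
)

li = ET.fromstring(SAMPLE_ITEM)

# Equivalent of the 'i/a' step: the <a> inside the <i> inside this <li>.
anchor = li.find('i/a')

# Equivalent of 'text()': the text nodes that sit directly inside <li>.
# In ElementTree those are li.text plus the .tail of each child element.
direct_text = [t for t in [li.text] + [child.tail for child in li] if t]

item = {
    'name': anchor.get('title'),
    'link': anchor.get('href'),
    'text': direct_text,
}
print(item)
```

Because the years sit outside the `<i>` element, they come back as a separate text node with a leading space, which is why the `text` column holds single-item lists.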


In [2]:
import pandas as pd

# Checking whether we got data 

films=pd.read_json('genre_list.json', orient='records')
print(films.shape)
print(films.tail())

(1372, 3)
                                link                      name  \
1367  /wiki/Wizards_of_Waverly_Place  Wizards of Waverly Place   
1368        /wiki/WKRP_in_Cincinnati        WKRP in Cincinnati   
1369          /wiki/The_Wonder_Years          The Wonder Years   
1370               /wiki/Workaholics               Workaholics   
1371  /wiki/Wrecked_(U.S._TV_series)  Wrecked (U.S. TV series)   

                   text  
1367     [ (2007–2012)]  
1368     [ (1978–1982)]  
1369     [ (1988–1993)]  
1370     [ (2011–2017)]  
1371  [ (2016–present)]  
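
The `link` column holds relative paths and the `text` column holds raw strings in lists, so a cleaning step is needed before any summary by year. Here is a minimal sketch using three of the rows printed above; the `tidy` function and its output field names are made up for this illustration, and the regular expression assumes the "(start–end)" pattern seen in the output.

```python
import re
from urllib.parse import urljoin

BASE = 'https://en.wikipedia.org'

# Three rows copied from the tail of the output above.
rows = [
    {'name': 'Wizards of Waverly Place',
     'link': '/wiki/Wizards_of_Waverly_Place', 'text': [' (2007–2012)']},
    {'name': 'Workaholics',
     'link': '/wiki/Workaholics', 'text': [' (2011–2017)']},
    {'name': 'Wrecked (U.S. TV series)',
     'link': '/wiki/Wrecked_(U.S._TV_series)', 'text': [' (2016–present)']},
]

# A start year, optionally followed by a dash and an end year or 'present'.
YEARS = re.compile(r'\((\d{4})(?:\s*[–-]\s*(\d{4}|present))?\)')


def tidy(row):
    """Return a cleaned copy of one scraped row (hypothetical helper)."""
    match = YEARS.search(''.join(row['text']))
    start = end = None
    if match:
        start = int(match.group(1))
        # end stays None for 'present' and for single-year entries.
        if match.group(2) and match.group(2) != 'present':
            end = int(match.group(2))
    return {
        'name': row['name'],
        'url': urljoin(BASE, row['link']),
        'start_year': start,
        'end_year': end,
    }


tidy_rows = [tidy(r) for r in rows]
for r in tidy_rows:
    print(r)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` to continue the analysis in pandas.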
