## Web Scrape (Scrapy) 

I decided to use Scrapy for this project because I have previous experience with it. Scrapy is a great tool because it gives us a framework to build our scrape around. It starts by creating a new project folder that assigns each step of the scrape to its own Python file. I have found that Scrapy works best when run from the command line. I built the scrape in Atom, so I will be adding the key parts of my build to this notebook.

The core parts of a Scrapy build are:
- The items file
- The pipelines file
- A couple of small tweaks in settings
- Building our spider for crawling


## Items

This is where we designate which items we want to scrape from the website. We define our own Scrapy Item class and declare a field for each feature we want to target. I have added them in the order they appear on Box Office Mojo.

In [None]:
import scrapy

class BoxofficeItem(scrapy.Item):
    title = scrapy.Field()
    domestic_revenue = scrapy.Field()
    world_revenue = scrapy.Field()
    distributor = scrapy.Field()
    opening_weekend_revenue = scrapy.Field()
    no_opening_theaters = scrapy.Field()
    budget = scrapy.Field()
    Release_date = scrapy.Field()
    MPAA_rating = scrapy.Field()
    run_time = scrapy.Field()
    genres = scrapy.Field()
    days_in_release = scrapy.Field()


## Spider

This is where we define how we want the website to be scraped. A spider tells Scrapy how to "crawl" between our designated links and how to extract our desired data from each page. You can use BeautifulSoup to parse, but I decided to use XPath expressions through Scrapy's built-in selectors, which are built on lxml.

Here is how the scraping cycle works in my spider:
1. Generate an initial request to crawl from our starting URL.
    - I am targeting the top domestic box office pages. This makes things pretty easy because all we have to do is change the year in our URL to get to the next page. I accomplished this with a simple for loop.
2. We then outline callback functions to parse our selected items and return item objects.
    - This is where we can choose which selector library to use, e.g. BeautifulSoup or lxml.
3. These items are sent through our pipeline and into our CSV file.
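
The year loop from step 1 can be sketched on its own, outside of Scrapy. This is a minimal illustration, assuming the yearly pages keep the `/year/<year>/` pattern used in the spider. `BASE_URL` and `build_year_urls` are hypothetical names introduced here, not part of Scrapy.

```python
# Hypothetical helper: builds the list of yearly chart URLs for the spider.
BASE_URL = "https://www.boxofficemojo.com/year/{}/"


def build_year_urls(first_year, last_year):
    """Return one yearly chart URL per year, including both end years."""
    return [BASE_URL.format(year) for year in range(first_year, last_year + 1)]


start_urls = build_year_urls(2010, 2021)
print(len(start_urls))
print(start_urls[0])
print(start_urls[-1])
```

A list built this way could be assigned to `start_urls` in the spider in place of the hand-written list of years.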


In [None]:
import scrapy

from boxofficeinfo.items import BoxofficeItem

class BoxofficeSpider(scrapy.Spider):
    name = "Boxofficeinfo"
    allowed_domains = ["boxofficemojo.com"]
    start_urls = [
    "https://www.boxofficemojo.com/year/2010/"
    ]

    for year in [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]: #desired years
        start_urls.append("https://www.boxofficemojo.com/year/"+str(year)+"/") #for loop to move to the next page

    def parse(self, response): #outline which area of the site I want to parse from, all of my features are in one table
        for tr in response.xpath('//*[@id="table"]/div/table/tr')[1:]:
            href = tr.xpath('./td[2]/a/@href')
            url = response.urljoin(href[0].extract())
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response): #these 3 values are in their own separate section, they require a different xpath
        item = BoxofficeItem()
        item["title"] = response.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0].extract()
        item["domestic_revenue"] = response.xpath('//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')[0].extract()
        item["world_revenue"] = response.xpath('//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[3]/span[2]/a/span/text()')[0].extract()

        elements = [] 
        for div in response.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div')[0:]:
            elements.append(' '.join(div.xpath('./span[1]/text()')[0].extract().split()))
        '''
        To get the xpath, we use the browser's inspect feature on our desired page, then use
        "copy xpath" on the targeted part of the page. This took me a few tries; you can run
        scrapy shell with the url in the command line and try the xpath there to test for the
        correct location.
        Some pages have missing information, or have it in different places. The elements list
        built above helps us grab the correct information from each page. I have outlined
        if-else statements for every element so that our spider won't stop running if a value
        is missing. The opening revenue also has the number of opening theaters attached, so I
        added two statements to keep them separate.
        '''
        #Distributor
        if 'Distributor' in elements:
            d = elements.index('Distributor') + 1
            loc_dist = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(d)
            item["distributor"] = response.xpath(loc_dist)[0].extract()
        else:
            item["distributor"] = "N/A"

        # Opening Revenue
        if 'Opening' in elements:
            o = elements.index('Opening') + 1
            loc_open_rev = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/span/text()'.format(o)
            try:
                # the key must match the field declared in BoxofficeItem
                item["opening_weekend_revenue"] = response.xpath(loc_open_rev)[0].extract()
            except IndexError:  # the label is present but the value is missing
                item["opening_weekend_revenue"] = "N/A"
        else:
            item["opening_weekend_revenue"] = "N/A"

        # Opening Theaters
        if 'Opening' in elements:
            o = elements.index('Opening') + 1
            loc_open_theater = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(o)
            try:
                # the key must match the field declared in BoxofficeItem
                item["no_opening_theaters"] = response.xpath(loc_open_theater)[0].extract().split()[0]
            except IndexError:  # the label is present but the value is missing
                item["no_opening_theaters"] = "N/A"
        else:
            item["no_opening_theaters"] = "N/A"

        # Budget
        if 'Budget' in elements:
            b = elements.index('Budget') + 1
            loc_budget = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/span/text()'.format(b)
            item["budget"] = response.xpath(loc_budget)[0].extract()
        else:
            item["budget"] = "N/A"
            
        #Release Date (already had r below, used s for season)
        if 'Release Date' in elements: 
            s = elements.index('Release Date') + 1
            loc_release = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(s)
            item["Release_date"] = response.xpath(loc_release)[0].extract() #just want the first value here
        else:
            item["Release_date"] = "N/A"

        # MPAA
        if 'MPAA' in elements:
            m = elements.index('MPAA') + 1
            loc_MPAA = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(m)
            item["MPAA_rating"] = response.xpath(loc_MPAA)[0].extract()
        else:
            item["MPAA_rating"] = "N/A"
            
        #Run Time
        if 'Run Time' in elements:
            t = elements.index('Run Time') + 1
            loc_run_time = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(t)
            item["run_time"] = response.xpath(loc_run_time)[0].extract()
        else:
            item["run_time"] = "N/A"

        # Genres
        if 'Genres' in elements:
            g = elements.index('Genres') + 1
            loc_genres = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(g)
            item["genres"] = ",".join(response.xpath(loc_genres)[0].extract().split())
        else:
            item["genres"] = "N/A"

        # In Release
        if 'In Release' in elements:
            r = elements.index('In Release') + 1
            loc_release = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(r)
            # the key must match the field declared in BoxofficeItem
            item["days_in_release"] = response.xpath(loc_release)[0].extract().split()[0]
        else:
            item["days_in_release"] = "N/A"
        yield item   
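
The if/else blocks in the spider all follow the same pattern: find the label in `elements`, turn its position into a 1-based `div` index, and fall back to "N/A" when the label is missing. Here is a minimal sketch of that lookup in plain Python. `ROW_XPATH` and `label_xpath` are hypothetical names introduced for illustration, and the sample `elements` list is made up rather than scraped.

```python
# Same row template the spider uses for plain-text values
ROW_XPATH = '//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'


def label_xpath(elements, label, template=ROW_XPATH):
    """Return the XPath for the row holding `label`, or None if it is absent.

    XPath positions are 1-based, so we add 1 to the Python list index.
    """
    if label not in elements:
        return None
    return template.format(elements.index(label) + 1)


# Made-up labels standing in for what the spider collects from one page
elements = ["Distributor", "Opening", "Release Date", "Genres"]

print(label_xpath(elements, "Release Date"))  # row 3 in this sample list
print(label_xpath(elements, "Budget"))        # label missing, so None
```

In the spider, a `None` result would map to "N/A", and any other result would be passed to `response.xpath`.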

## Item Pipeline

After an item has been scraped by a spider, it gets sent to the item pipeline, which processes it through our outlined components, executed sequentially. Each component is a Python class that implements a simple method: receive the item, perform our action (in this case, write to CSV), and decide whether the item should continue through the pipeline or be dropped and no longer processed.



In [None]:
import csv

class BoxofficePipeline(object): # Make sure this name matches ITEM_PIPELINES in the settings file

    def __init__(self): # here is my action to be performed. Write to csv with my selected column names
        self.csvfile = open("boxoffice_date.csv", "w", newline='')
        self.csvwriter = csv.writer(self.csvfile)
        # one header per value appended in process_item, in the same order
        self.csvwriter.writerow(["Title", "Domestic_Revenue", "World_Revenue", "Distributor", "Opening_Weekend_Revenue", "no_Opening_Theaters", "Budget", "Release Date", "MPAA_Rating", "Run_Time", "Genre", "Days_In_Release"])

    def process_item(self, item, spider):
        # the keys must match the field names declared in BoxofficeItem
        row = []
        row.append(item["title"])
        row.append(item["domestic_revenue"])
        row.append(item["world_revenue"])
        row.append(item["distributor"])
        row.append(item["opening_weekend_revenue"])
        row.append(item["no_opening_theaters"])
        row.append(item["budget"])
        row.append(item["Release_date"])
        row.append(item["MPAA_rating"])
        row.append(item["run_time"])
        row.append(item["genres"])
        row.append(item["days_in_release"])
        self.csvwriter.writerow(row)
        return item

    def close_spider(self, spider): # called by Scrapy when the spider finishes
        self.csvfile.close()
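
As an alternative to building each row by hand, the same write step can be sketched with `csv.DictWriter` from the standard library, which keeps the header and the row values tied to one column list. This is only an illustration and not the pipeline I ran: it writes to an in-memory buffer, uses a shortened, made-up item, and `FIELDS` and `write_items` are hypothetical names.

```python
import csv
import io

# Shortened column list for the example; the real item has more fields
FIELDS = ["title", "domestic_revenue", "budget", "run_time"]


def write_items(items, out):
    """Write dict-like items to `out` as CSV, filling missing keys with "N/A"."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="N/A", lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(dict(item))


# Made-up item with two of the four fields missing
buffer = io.StringIO()
write_items([{"title": "Example Movie", "domestic_revenue": "$1,000"}], buffer)
print(buffer.getvalue())
```

Because `restval` supplies the default, a missing field becomes "N/A" in the file without a separate else branch for each column.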


## Settings

The final step with Scrapy is to check that the correct settings are applied. The great thing about the generated settings file is that it only contains settings that are considered important or commonly used, and the majority of them are commented out by default.

All we have to do is remove the # in front of the settings we want to adjust. For this scrape I only needed to change a couple of settings.


In [None]:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#this is where I added the name of the pipeline class that we created in the pipelines file
ITEM_PIPELINES = {
    'boxofficeinfo.pipelines.BoxofficePipeline': 300,
}
#naming our bot. It is good practice to identify ourselves to the site we are scraping
BOT_NAME = 'boxofficeinfo'

SPIDER_MODULES = ['boxofficeinfo.spiders']
NEWSPIDER_MODULE = 'boxofficeinfo.spiders'

# Obey all of the rules from the site's robots.txt
ROBOTSTXT_OBEY = True
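
Since the aim above is to be a polite visitor, a few other standard Scrapy settings may be worth considering. The fragment below is a sketch of optional additions to the settings file, not settings I changed for this scrape. The contact URL and the delay value are placeholder assumptions.

```python
# Optional additions to settings.py (placeholder values, adjust to your project)

# Identify the crawler in the User-Agent header; the URL is a placeholder
USER_AGENT = "boxofficeinfo (+https://example.com/contact)"

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 1.0

# Let Scrapy adjust the delay based on how quickly the server responds
AUTOTHROTTLE_ENABLED = True
```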

After customizing these files, we can run the spider from the command line by calling `scrapy crawl Boxofficeinfo` inside the project directory.

Scrapy Documentation:
https://docs.scrapy.org/en/latest/