# Time travel Time!

Now we will use all our collected links to get data from the Wayback Machine.

If you haven't heard of the Wayback Machine before, it is an ongoing subproject of the Internet Archive. They go out and collect snapshots of what websites looked like over time. In total they hold the largest accessible collection of website history on the planet. Perhaps Google beats them out, but there's no way to leverage Google's collection for free (as far as I know).

Note that not all the links we found in our gathering step will have data in the Wayback Machine. There are at least two reasons for this.

One, websites are fluid things that change constantly, and subpages get added frequently even outside of a pandemic emergency. As of the time I write this, our links reflect retailers' pages in December 2020, so certain pages may simply not exist if we look back to January 2020. I am assuming that most of the worthwhile material about retailers' COVID mitigations was not created and then removed before December 2020, since the pandemic isn't over. That gives me some confidence that pages with pandemic information on them will have a history that we can track over the last 12 months.

Two, the Wayback Machine has limited resources. They manage an incredible amount of scraping, but some of the links that I've collected simply don't have historical snapshots. Of the 120 that I collected most recently, about 5 had 0 snapshots. I believe this is a sufficiently small amount of missed data, and it doesn't indicate that the collection methods are faulty. That said, some pages have more than 0 snapshots but nothing like daily coverage. To get around these issues we would effectively have to take the place of the Wayback Machine and run daily scrapes going forward in time with a collection of relevant links. That is outside the scope of our current project. Hopefully these restrictions won't prevent us from identifying interesting trends in which terms showed up on which retailer pages at which times.
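To make "more than 0 snapshots, but nothing like daily coverage" concrete, here is a minimal sketch that summarises how well a list of snapshot timestamps covers a period. The helper name `coverage_summary` and the sample timestamps are made up for illustration; the only thing taken from the Wayback Machine is the 14-digit yyyymmddhhmmss timestamp format.

```python
from datetime import datetime


def coverage_summary(timestamps):
    """Summarise how well a list of Wayback timestamps covers a period.

    `timestamps` are 14-digit strings (yyyymmddhhmmss). Returns the number of
    distinct days with at least one snapshot and the longest gap, in days,
    between consecutive covered days.
    """
    # keep only the yyyymmdd part and drop duplicates within a day
    days = sorted({datetime.strptime(ts[:8], "%Y%m%d").date() for ts in timestamps})
    if not days:
        return {"days_covered": 0, "longest_gap_days": None}
    # difference in days between each covered day and the next one
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    return {"days_covered": len(days), "longest_gap_days": max(gaps, default=0)}


# hypothetical timestamps, not real Wayback Machine data
sample = ["20200301120000", "20200301180000", "20200305090000", "20200420000000"]
summary = coverage_summary(sample)
print(summary)
```

A page with a long gap like this one can still be useful, but any trend we read off it is only as fine-grained as the snapshots allow.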


First, let's import all the needed packages.

In [9]:
# our scraper
import scrapy
# regular expressions library, useful for extracting text
import re
# shows progress bars
from tqdm import tqdm
# parallel programming, lets us write faster code
from multiprocessing import Process,Queue
# Crawler Process used to start a scrapy spider
from scrapy.crawler import CrawlerProcess
# beautiful soup for parsing webpages and getting visible text
from bs4 import BeautifulSoup as bs
# basic web requests, for getting list of wayback snapshots for page if any 
import requests as rq
# url encoder, needed for certain wayback snapshots
from requests.utils import quote
# help us read json and json line files
import json 

Now we define a function which returns a spider class for the crawler process to start. Defining the class inside the function, rather than outside it, lets us provide variables that the class can operate on: `url`, `retailer`, `from_date`, and `to_date`. Respectively, these are the website URL, the retailer name, the start of the date range we want in yyyymmdd format, and the end of the date range in yyyymmdd format.


In [10]:
def url_make_spider(url,retailer,from_date,to_date):
    # start with no response so the error handler can tell whether the request itself failed
    res = None
    # catch errors when wayback has no snapshots for this page in the date range
    try:
        # encode the url safely, this makes it possible to pass urls that have queries in them
        clean_url = quote(url,safe="")
        # make a request to get the list of historical snapshots
        res = rq.get(f'https://web.archive.org/cdx/search/cdx?url={clean_url}&from={from_date}&to={to_date}'
                     '&output=json&fl=timestamp,original,statuscode,digest')
        # this returns a json array with a header line (think csv columns), parse it once
        rows = json.loads(res.content)
        # the first line of the returned content isn't data
        header = rows[0]
        # get the rest of the data, we only really care about what's in the first column, the timestamps
        res_data = rows[1:]
    except Exception as e:
        # problem is likely just that there were no wayback machine scrapes for this page in the date range
        print("err problem with url",e)
        # res is still None if the request itself failed, so only print content when we have a response
        if res is not None:
            print(res.content)
        # log it to a file so we can tell which pages didn't get data
        with open("problem_log","a") as phile:
            phile.write(f"{e} {url}\n")
        return
    # we now construct a list of timestamps that fall on different days
    # we don't want to scrape at any granularity finer than 1 day because it increases the number of results dramatically
    # and the pages may not have any significant changes within a day
    single_days = []
    # this holds the exact timestamp to give to the wayback machine to return a page at that date
    long_form = []
    # loop over the data
    for r in res_data:
        # the first 8 characters hold the date (yyyymmdd) without going into hours, minutes and seconds
        single_date = r[0][:8]
        # if we haven't seen this day yet, record it and keep the long form version of the timestamp
        if single_date not in single_days:
            single_days.append(single_date)
            long_form.append(r[0])
    
    # urls gets used in start_requests, templated out for each of our long form timestamps
    urls = [f'http://web.archive.org/web/{timestamp}id_/{url}' for timestamp in long_form]
    class MySpider(scrapy.Spider):
        #  spider definition
        name='retail'
        # auto throttle so the server isn't overloaded by our collection
        # this helps them not want to kick us off
        # delay our downloads by 1 second also
        # aim for about 3 requests in flight to the wayback machine at once, this can go as high as 10 I think, but
        # again they might decide to kick you if you make too many simultaneous requests
        custom_settings = {
            'AUTOTHROTTLE_ENABLED': True,
            'AUTOTHROTTLE_DEBUG': True,
            'DOWNLOAD_DELAY':1,
            'AUTOTHROTTLE_TARGET_CONCURRENCY':3,
        }
        # make a request for each appropriate timestamp, these get queued and processed roughly 3 at a time
        def start_requests(self):
            for url in urls:
                yield scrapy.Request(url=url,callback=self.parse)
        # for each webpage we get back, save the visible text as a line in our results log (described lower down)
        def parse(self,response):
            # convert to bs, naming the parser explicitly so bs4 does not warn or pick a different one per machine
            soup = bs(response.body,"html.parser")
            # put a specific separator between the visible text of each html element
            text = soup.get_text('--sep--')
            yield {"website":response.url,"text":text,"retailer":retailer}
            

    return MySpider
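The one-snapshot-per-day filtering inside `url_make_spider` can be tried on its own without touching the network. The sketch below assumes rows shaped like the CDX data rows (timestamp first); the helper name `first_snapshot_per_day`, the sample rows, and the `example.com` address are hypothetical. It uses a set for the "have we seen this day" check, which is a small variation on the list used above, not the notebook's own code.

```python
def first_snapshot_per_day(rows):
    """Return the first full timestamp seen for each calendar day.

    `rows` mimics the CDX data rows after the header is removed: each row is a
    list whose first element is a 14-digit timestamp (yyyymmddhhmmss).
    """
    seen_days = set()
    kept = []
    for row in rows:
        # the first 8 characters are the yyyymmdd date
        day = row[0][:8]
        if day not in seen_days:
            seen_days.add(day)
            kept.append(row[0])
    return kept


# hypothetical rows in the order timestamp, original, statuscode, digest
sample_rows = [
    ["20200301120000", "https://example.com/covid", "200", "AAAA"],
    ["20200301180000", "https://example.com/covid", "200", "BBBB"],
    ["20200302070000", "https://example.com/covid", "200", "BBBB"],
]
timestamps = first_snapshot_per_day(sample_rows)
# same URL template as in url_make_spider
snapshot_urls = [
    f"http://web.archive.org/web/{ts}id_/https://example.com/covid" for ts in timestamps
]
print(snapshot_urls)
```

Two snapshots on 1 March collapse into one request, so the spider would fetch two pages here instead of three.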



Now we will create a function that makes use of the function we created above.

`run_spider` takes a website URL, a retailer name, a `from_date`, and a `to_date`. It then starts a spider crawling the Wayback Machine.

Note the `FEEDS` setting. This specifies where the output should go when we get data from the spider. For each retailer we will store the raw website data in the `./data_processing` folder, under another folder with the retailer name. The actual file will be named `timed_scrapes_{from_date}_{to_date}.jl`, with the `from_date` and `to_date` values filled in.

The final notebook will look through the data_processing folder for our result files and put them all together.
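As a rough idea of what that gathering step could look like, here is a minimal sketch that walks the folder layout described above and reads every JSON lines file into a list of dictionaries. The function name `load_timed_scrapes` is hypothetical, and the final notebook may well do this differently.

```python
import json
from pathlib import Path


def load_timed_scrapes(base_dir="./data_processing"):
    """Read every timed_scrapes_*.jl file found one folder below base_dir."""
    records = []
    # one sub-folder per retailer, each holding one or more result files
    for path in sorted(Path(base_dir).glob("*/timed_scrapes_*.jl")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                # skip blank lines, every other line is one JSON object
                if line:
                    records.append(json.loads(line))
    return records
```

Each record should carry the `website`, `text`, and `retailer` keys that the spider yields, so calling `load_timed_scrapes()` after a crawl gives one flat list covering all retailers.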

In [11]:


def run_spider(url,retailer,from_date,to_date):
    # create a CrawlerProcess with certain settings
    # the f"..." string is a formatted string, allowing us to put variables in specific positions in the string
    process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1), retail covid research',
            "FEEDS":{
                f"./data_processing/{retailer}/timed_scrapes_{from_date}_{to_date}.jl":{"format":"jsonlines"},
            }
    })
    # call the url_make_spider function which sets up our spider class with url, retailer, from_date, and to_date
    myspider_class = url_make_spider(url,retailer,from_date,to_date)
    # url_make_spider returns None when the snapshot list could not be fetched, so there is nothing to crawl
    if myspider_class is None:
        return
    # provide the spider class to the crawl process
    process.crawl(myspider_class)
    # go ahead and start
    process.start() 
    # nothing happens after this until the crawling is finished

Now we will actually start collecting data from the Wayback Machine.

First we load all the retailer pages that we identified, as of the present day, as holding important pandemic information.

In [12]:
websites = set()
with open("./data_processing/all_urls.jl") as phile:
    for line in phile:
        j_ob = json.loads(line)
        websites.add((j_ob["website"],j_ob["retailer"]))

print(len(websites))


620


Now we establish the bounds of the time range that we care about. Feel free to modify these values whenever you rerun this notebook. The pattern to follow is yyyymmdd: 4 digits for the year, 2 for the month, and 2 for the day.
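Because a mistyped date only shows up later as an empty or odd result, it can help to check the values before crawling. This is a minimal sketch; `check_date_range` is a hypothetical helper, not part of the notebook. The explicit length check is there because `datetime.strptime` with `%Y%m%d` will accept some 7-digit strings.

```python
from datetime import datetime


def check_date_range(from_date, to_date):
    """Validate two yyyymmdd strings and return them as date objects."""
    for value in (from_date, to_date):
        # strptime alone is lenient about missing leading zeros, so check the shape first
        if len(value) != 8 or not value.isdigit():
            raise ValueError(f"expected 8 digits in yyyymmdd form, got {value!r}")
    # strptime raises ValueError for impossible dates such as month 13
    start = datetime.strptime(from_date, "%Y%m%d").date()
    end = datetime.strptime(to_date, "%Y%m%d").date()
    if start > end:
        raise ValueError(f"from_date {from_date} is after to_date {to_date}")
    return start, end


start, end = check_date_range("20200110", "20201230")
print(start, end)
```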


In [13]:

from_date = "20200110"
to_date ="20201230"

Then we go through this entire list, creating a history-grabbing spider for each page. The cell starts with `%%capture`, which suppresses all output while it runs so that your computer doesn't become sluggish from the amount of output; remove that line if you want to see the output. Also note that this is not parallelized code. Each spider runs in its own process, but we wait for it to finish before starting the next one, because we don't want to hammer the Wayback Machine with requests. It's like a really wise senior citizen: we can learn lots if we are patient.

In [None]:
%%capture
# loop and set up a progress bar with tqdm
# each value in websites is a (website, retailer) tuple, so we can unpack it in the for loop
for website,retailer in tqdm(websites):
    # run each spider in its own process and wait for it to finish before starting the next
    p = Process(target=run_spider,args=(website,retailer,from_date,to_date))
    p.start()
    p.join()