# COMM 4P35 - Web Archives Tutorial


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BrockDSL/ARCH_Data_Explore/blob/main/COMM_4P35_Activity.ipynb)


## Part 1 - Analyzing changes to Canada.ca pages

This notebook uses a subset of the data from the [COVID in Niagara Archive](https://archive-it.org/collections/13781). We'll use Google Colab to explore how some pages from the [canada.ca](https://canada.ca) domain have changed over the course of the pandemic.


In [None]:
# Load the libraries we need

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)  # None = do not truncate cell contents (-1 is deprecated)
pd.set_option('display.max_rows', 200)

import difflib
from IPython import display

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
%matplotlib inline

### Step 1

We'll load up the CSV file of data that represents our crawls of the canada.ca pages and randomly display one row of this spreadsheet.

In [None]:
web_page_text = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/ARCH_Data_Explore/main/snap_shot_canada_ca.csv")

web_page_text['crawl_date']= pd.to_datetime(web_page_text['crawl_date'],format='%Y%m%d')
# Add a column with the length (in characters) of each capture's content. Useful for later calculations
web_page_text["length"] = web_page_text["content"].str.len()
web_page_text.sample(1)
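
Note that the `length` column counts characters, not words. If a word count is more useful for your analysis, a minimal sketch on a toy table could look like the following. The rows and the `word_count` column name are invented for illustration and are not part of the archive data.

```python
import pandas as pd

# Toy stand-in for web_page_text; these rows are made up for illustration.
toy = pd.DataFrame({
    "url": ["https://example.org/a", "https://example.org/b"],
    "content": ["Wash your hands often", "Stay home"],
})

# Character count, the same measure the notebook stores in "length".
toy["length"] = toy["content"].str.len()

# Word count: split each string on whitespace, then count the pieces.
toy["word_count"] = toy["content"].str.split().str.len()

print(toy[["length", "word_count"]])
```

Either measure should show the same overall trend on a plot, since longer pages tend to have both more characters and more words.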

In [None]:
print("Total number of web page captures in this archive subset: " + str(len(web_page_text)))

### Step 2

Let's look at how many times the top 25 URLs in this archive have been crawled. 

In [None]:
web_page_text.groupby(["url"]).count().sort_values(by="crawl_date",ascending=False)[0:25]
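
Another way to get the same kind of tally is `value_counts()`, which returns the number of rows per URL already sorted from most to least frequent. The sketch below uses invented rows, and `crawl_counts` is a name introduced here for illustration.

```python
import pandas as pd

# Invented rows standing in for the archive data.
toy = pd.DataFrame({
    "url": ["/a", "/b", "/a", "/c", "/a", "/b"],
    "crawl_date": pd.to_datetime(
        ["20200301", "20200301", "20200401", "20200401", "20200501", "20200501"],
        format="%Y%m%d",
    ),
})

# Number of captures per URL, most-crawled first; head(25) keeps the top 25.
crawl_counts = toy["url"].value_counts().head(25)
print(crawl_counts)
```

Unlike the `groupby(...).count()` call, this returns a single column of counts rather than one count per column.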

### Step 3

Let's look at a specific URL... We set it in the next cell

In [None]:
URL = "https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection/prevention-risks/covid-19-improving-indoor-ventilation.html"

... with that set, let's plot out the change in content length of that page

In [None]:
url_data = web_page_text[web_page_text['url'] == URL].sort_values(by="crawl_date")

plt.plot(url_data['crawl_date'],url_data['length'])
plt.xticks(rotation=45)
plt.title("Content length (characters) by crawl for \n" + URL)
plt.show()
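
To locate a jump numerically rather than by eye, one option is to compute the change in length between consecutive crawls. This is only a sketch on invented numbers; `change` and `biggest` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Invented crawl history for a single page.
toy = pd.DataFrame({
    "crawl_date": pd.to_datetime(
        ["20210101", "20210201", "20210301", "20210401"], format="%Y%m%d"
    ),
    "length": [1000, 1020, 4500, 4480],
}).sort_values(by="crawl_date")

# Change in length relative to the previous crawl (the first row has no predecessor).
toy["change"] = toy["length"].diff()

# Row with the largest absolute change between consecutive crawls.
biggest = toy.loc[toy["change"].abs().idxmax()]
print(biggest["crawl_date"].date(), biggest["change"])
```

With the notebook's data, the same two lines could be applied to `url_data` to find the crawl date at which the page changed the most.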


### Step 4

Curious. We see a huge step in page length.

Let's open up both versions of this page on the Internet Archive and see if we can spot the differences between them.


In [None]:
max_page = url_data[url_data['length'] == url_data['length'].max()]
max_page_date = str(max_page['crawl_date'].values[0]).split('T')[0].replace('-','')


print("\n\nLongest version of this page on the Internet Archive was captured "\
      + max_page_date + "\n" \
      + "Open this version on Internet Archive \n"
      + "https://web.archive.org/web/" \
      + max_page_date + "/" + URL)



min_page = url_data[url_data['length'] == url_data['length'].min()]
min_page_date = str(min_page['crawl_date'].values[0]).split('T')[0].replace('-','')


print("\n\nShortest version of this page on the Internet Archive was captured "\
      + min_page_date + "\n" \
      + "Open this version on Internet Archive \n"
      + "https://web.archive.org/web/" \
      + min_page_date + "/" + URL)
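
Both print statements build the same kind of link, so the pattern can be wrapped in a small helper. This is a sketch: `wayback_url` is a hypothetical function name, the example URL is a placeholder, and the link format is simply the `https://web.archive.org/web/<YYYYMMDD>/<url>` pattern used in the cell above.

```python
from datetime import date


def wayback_url(capture_date, url):
    """Build a Wayback Machine link from a date-like object and a page URL."""
    # strftime is available on datetime.date, datetime.datetime and pandas Timestamp.
    return "https://web.archive.org/web/" + capture_date.strftime("%Y%m%d") + "/" + url


# Placeholder date and URL, for illustration only.
link = wayback_url(date(2021, 3, 1), "https://example.org/page.html")
print(link)
```

In this notebook, `wayback_url(max_page['crawl_date'].iloc[0], URL)` should produce the same link as the printed output for the longest version.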


Do you notice any differences in these pages?
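
Spotting differences by eye can be hard on long pages. The `difflib` module, imported at the top of this notebook, can list the lines that were added or removed between two versions. The two strings below are invented stand-ins for the `content` of the shortest and longest captures, and `added` is a name introduced here for illustration.

```python
import difflib

# Invented text standing in for two captures of the same page.
old_text = "Open windows when possible.\nUse exhaust fans."
new_text = "Open windows when possible.\nUse exhaust fans.\nConsider a portable air filter."

# unified_diff compares two lists of lines and marks additions (+) and removals (-).
diff_lines = list(difflib.unified_diff(
    old_text.splitlines(),
    new_text.splitlines(),
    fromfile="shortest",
    tofile="longest",
    lineterm="",
))
print("\n".join(diff_lines))

# Keep only the added lines, skipping the "+++" file header.
added = [line[1:] for line in diff_lines
         if line.startswith("+") and not line.startswith("+++")]
```

In the notebook, `min_page['content'].iloc[0]` and `max_page['content'].iloc[0]` could be passed in place of the sample strings. If the archived content turns out to be stored as one long line, splitting it into sentences first may give a more readable diff.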

## Part 2 - Run your own analysis

We'll now look at a selection of pages from a different domain in that dataset. Here we will use [ontario.ca](https://ontario.ca).


In [None]:
P2_web_page_text = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/ARCH_Data_Explore/main/snap_shot_ontario_ca.csv")
P2_web_page_text.dropna(inplace=True)

In [None]:
P2_web_page_text['crawl_date']= pd.to_datetime(P2_web_page_text['crawl_date'],format='%Y%m%d')
# Add a column with the length (in characters) of each capture's content. Useful for later calculations
P2_web_page_text["length"] = P2_web_page_text["content"].str.len()
    
P2_web_page_text.sample(1)

Top 25 URLs crawled in this Archive

In [None]:
P2_web_page_text.groupby(["url"]).count().sort_values(by="crawl_date",ascending=False)[0:25]

Open the [CSV file](file.csv) and look through it using Excel or something similar. Try to find an interesting URL that shows some changes in page length. You can experiment by setting the `P2_URL` variable in the next cell.
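
As an alternative to browsing the spreadsheet, candidate URLs can be ranked by how much their length varies across crawls. The sketch below runs on invented rows; `summary`, `ranked` and `length_range` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Invented rows standing in for P2_web_page_text.
toy = pd.DataFrame({
    "url": ["/a", "/a", "/b", "/b", "/c"],
    "length": [100, 900, 500, 520, 300],
})

# Shortest and longest capture per URL, plus the number of captures.
summary = toy.groupby("url")["length"].agg(["min", "max", "count"])
summary["length_range"] = summary["max"] - summary["min"]

# URLs whose length changed the most come first.
ranked = summary.sort_values(by="length_range", ascending=False)
print(ranked)
```

URLs near the top of such a ranking are reasonable candidates for `P2_URL`, though a large range can also come from a failed or partial capture rather than a real content change.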

In [None]:
P2_URL = ""  # paste a URL from the CSV between the quotes before running the next cell

In [None]:
P2_url_data = P2_web_page_text[P2_web_page_text['url'] == P2_URL].sort_values(by="crawl_date")


plt.plot(P2_url_data['crawl_date'],P2_url_data['length'])
plt.xticks(rotation=45)
plt.title("Content length (characters) by crawl for \n" + P2_URL)
plt.show()


P2_max_page = P2_url_data[P2_url_data['length'] == P2_url_data['length'].max()]
P2_max_page_date = str(P2_max_page['crawl_date'].values[0]).split('T')[0].replace('-','')


print("\n\nLongest version of this page on the Internet Archive was captured "\
      + P2_max_page_date + "\n" \
      + "Open this version on Internet Archive \n"
      + "https://web.archive.org/web/" \
      + P2_max_page_date + "/" + P2_URL)



P2_min_page = P2_url_data[P2_url_data['length'] == P2_url_data['length'].min()]
P2_min_page_date = str(P2_min_page['crawl_date'].values[0]).split('T')[0].replace('-','')


print("\n\nShortest version of this page on the Internet Archive was captured "\
      + P2_min_page_date + "\n" \
      + "Open this version on Internet Archive \n"
      + "https://web.archive.org/web/" \
      + P2_min_page_date + "/" + P2_URL)


Describe the changes you see in the page between the shortest and longest versions.