# COMM 4P35 - Web Archives Tutorial


## Part 1 - Analyzing changes to Canada.ca pages

This notebook uses a subset of the data from the [COVID in Niagara Archive](https://archive-it.org/collections/13781). We'll use Google Colab to explore how some pages from the [canada.ca](https://canada.ca) domain changed over the course of the pandemic.

Notebooks are made up of 'cells'. Some hold formatted text (HTML), others hold code. If you click in a code cell, a 'play' button appears in the left-hand margin. Clicking that button runs the code.

Scroll through this page, reading the details and clicking on the play button in each one of the code cells.

In [None]:
# Load the libraries we need

import pandas as pd
pd.set_option('display.max_columns', None)
# None means "no limit"; the older value -1 is no longer accepted by recent pandas versions
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 200)

from textblob import TextBlob
import nltk


import difflib
from IPython import display

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import matplotlib.pyplot as plt
%matplotlib inline

### Step 1.

We'll load up the CSV file of data that represents our crawls of the canada.ca pages. We'll add some extra processing:

- We calculate the length of each entry and add it as a new column
- We calculate the sentiment of each entry and add it as two new columns

Then we display one randomly chosen row of this spreadsheet.
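
For context, TextBlob reports sentiment as two numbers: polarity, from -1.0 (negative) to 1.0 (positive), and subjectivity, from 0.0 (objective) to 1.0 (subjective). The sketch below gives a rough feel for how a word-list approach can produce such scores. It is a toy illustration only: `TOY_LEXICON`, its values, and `toy_sentiment` are invented here and are not TextBlob's lexicon or algorithm, which is more elaborate.

```python
# Toy lexicon-based scorer. TOY_LEXICON and its values are invented for
# illustration; they are NOT TextBlob's real lexicon or algorithm.
TOY_LEXICON = {
    # word: (polarity, subjectivity)
    "safe": (0.5, 0.5),
    "good": (0.7, 0.6),
    "severe": (-0.6, 0.8),
    "risk": (-0.4, 0.4),
}


def toy_sentiment(text):
    """Average the scores of the known words; unknown words are ignored."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not hits:
        # No scored words at all: treat the text as neutral and objective.
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity


print(toy_sentiment("Vaccines are safe and good."))   # positive polarity
print(toy_sentiment("There is a severe risk."))       # negative polarity
print(toy_sentiment("Canada has ten provinces."))     # no scored words
```

Because scores like these are averages over the words on a page, adding or removing a paragraph can shift them even when the rest of the page stays the same.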

In [None]:

#Open up the CSV file of data
web_page_text = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/ARCH_Data_Explore/main/snap_shot_canada_ca.csv")


#Make sure the date column is treated as a Date
web_page_text['crawl_date']= pd.to_datetime(web_page_text['crawl_date'],format='%Y%m%d')


# Add an extra column with the length (in characters) of each crawl. Useful for later calculations
web_page_text["length"] = web_page_text["content"].str.len()
    
    
    
# Add two extra columns to the data that show the calculated 'sentiment' of the entries

polarity = []
subjectivity = []


for entry in web_page_text.content:
    # TextBlob scores polarity from -1 to 1 and subjectivity from 0 to 1
    score = TextBlob(entry)
    polarity.append(score.sentiment.polarity)
    subjectivity.append(score.sentiment.subjectivity)
    
web_page_text['polarity'] = polarity
web_page_text['subjectivity'] = subjectivity

    
    
    
# A random 'sample' of 1 record
web_page_text.sample(1)

In [None]:
print("Total number of web page captures in this archive subset: " + str(len(web_page_text)))

### Step 2

Let's look at how many times the top 25 URLs in this archive have been crawled. 

In [None]:
web_page_text.groupby(["url"]).count().sort_values(by="crawl_date",ascending=False)[0:25]
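
If only the crawl count per URL is needed, `value_counts` gives a more compact table than counting every column. The following is a minimal sketch on a few made-up rows; the `example.org` URLs and the `demo` DataFrame are placeholders, not part of the archive data.

```python
import pandas as pd

# Made-up rows standing in for the archive CSV (placeholder URLs).
demo = pd.DataFrame({
    "url": ["https://example.org/a", "https://example.org/b",
            "https://example.org/a", "https://example.org/a"],
    "crawl_date": ["20200301", "20200301", "20200315", "20200401"],
})

# Count captures per URL, most-crawled first, and keep at most the top 25.
crawl_counts = demo["url"].value_counts().head(25)
print(crawl_counts)
```

In the notebook itself, the same call would be applied to `web_page_text["url"]`.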

### Step 3

Let's look at a specific URL... We set it in the next cell

In [None]:
URL="https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html"

... with that set, let's plot out the change in content length of that page

In [None]:
url_data = web_page_text[web_page_text['url'] == URL].sort_values(by="crawl_date")

plt.plot(url_data['crawl_date'],url_data['length'])
plt.xticks(rotation=45)
plt.title("Page length (characters) by crawl for \n" + URL)
plt.show()


### Step 4

Curious. We see changes in the page length. Running the cell below will generate a link to the Internet Archive for each distinct page length and tell you when that version was first harvested. Try a few of the links to see if you can spot what was added to the page.
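
The links follow the Wayback Machine pattern of a date stamp followed by the original address; when no capture exists for that exact stamp, the Wayback Machine redirects to the nearest one. A minimal sketch of that pattern, where `wayback_link` is a hypothetical helper name and the date is an arbitrary example rather than a known capture:

```python
from datetime import date


def wayback_link(crawl_date, url):
    """Build a Wayback Machine link for the capture nearest to crawl_date."""
    stamp = crawl_date.strftime("%Y%m%d")  # e.g. 2020-03-15 -> "20200315"
    return "https://web.archive.org/web/" + stamp + "/" + url


# The date below is an arbitrary example, not a known capture date.
link = wayback_link(
    date(2020, 3, 15),
    "https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html",
)
print(link)
```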

In [None]:

unique_days = url_data.groupby("length").first().sort_values(by='crawl_date')
print("\n")
for index, row in unique_days.iterrows():
    date = str(row['crawl_date']).split(' ')[0].replace('-','')
    length = len(row['content'])
    print("Date of crawl: ",date, ". Length of page: ",length)
    print("View on Internet Archive: https://web.archive.org/web/" + date + "/" + URL)
    print("\n")
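
Rather than spotting the additions by eye, the text of two captures can also be compared with `difflib`, which is imported at the top of this notebook. This is a minimal sketch using two made-up strings; in the notebook, the two strings would come from the `content` column of two different crawls.

```python
import difflib

# Two made-up versions of a page's text (placeholders for two `content` values).
older = "Wash your hands often.\nStay home if you are sick.\n"
newer = "Wash your hands often.\nWear a mask indoors.\nStay home if you are sick.\n"

# unified_diff marks added lines with '+' and removed lines with '-'.
diff_lines = list(difflib.unified_diff(
    older.splitlines(),
    newer.splitlines(),
    fromfile="older crawl",
    tofile="newer crawl",
    lineterm="",
))
print("\n".join(diff_lines))
```

If the archived text has no line breaks, splitting it into sentences first (for example on `". "`) is one way to get a readable diff.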

### Step 5 - Graphing Changes in Sentiment

Let's map out the changes in sentiment scores for all of the capture dates for this URL.


In [None]:
plt.plot(url_data['crawl_date'],url_data['polarity'])
plt.xticks(rotation=45)
plt.title("Polarity change for URL:\n"+URL)
plt.ylabel("Polarity")
plt.xlabel("Date of Crawl")
plt.show()


plt.plot(url_data['crawl_date'],url_data['subjectivity'])
plt.xticks(rotation=45)
plt.title("Subjectivity change for URL:\n"+URL)
plt.ylabel("Subjectivity")
plt.xlabel("Date of Crawl")
plt.show()


## To conclude

By comparing successive harvests of a particular page on the Internet Archive, we can see how the sentiment scores shift as the page length changes. In this way we can see how additions and deletions to a page change both its content and its intention.

## Part 2 - Run your own analysis

We'll now look at a selection of pages from a different domain in that dataset. Here we will use [ontario.ca](https://ontario.ca).


In [None]:
P2_web_page_text = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/ARCH_Data_Explore/main/snap_shot_ontario_ca.csv")
P2_web_page_text.dropna(inplace=True)

In [None]:
P2_web_page_text['crawl_date']= pd.to_datetime(P2_web_page_text['crawl_date'],format='%Y%m%d')
# Add an extra column with the length (in characters) of each crawl. Useful for later calculations
P2_web_page_text["length"] = P2_web_page_text["content"].str.len()
    
P2_web_page_text.sample(1)

The top 25 URLs crawled in this archive:

In [None]:
P2_web_page_text.groupby(["url"]).count().sort_values(by="crawl_date",ascending=False)[0:25]

Try to find an interesting URL in the list you just created that shows some changes in page length. You can experiment by setting the `P2_URL` variable in the next cell to that URL.
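
One way to shortlist candidates is to count how many distinct page lengths each URL has and how far apart the shortest and longest versions are. The following is a minimal sketch on made-up rows; `demo`, `summary`, and the `example.org` URLs are placeholders, and in the notebook the same grouping would be applied to `P2_web_page_text`.

```python
import pandas as pd

# Made-up rows standing in for P2_web_page_text (placeholder URLs and text).
demo = pd.DataFrame({
    "url": ["https://example.org/a"] * 3 + ["https://example.org/b"] * 2,
    "content": ["short", "a bit longer", "the longest text here", "same", "same"],
})
demo["length"] = demo["content"].str.len()

# For each URL: number of distinct lengths, and the gap between the
# shortest and longest version (in characters).
summary = demo.groupby("url")["length"].agg(
    versions="nunique",
    spread=lambda s: s.max() - s.min(),
).sort_values("versions", ascending=False)
print(summary)
```

URLs near the top of `summary` changed length most often, which makes them reasonable starting points for `P2_URL`.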

In [None]:
P2_URL = ""

Now run the next cell to perform the analysis.

In [None]:
P2_url_data = P2_web_page_text[P2_web_page_text['url'] == P2_URL].sort_values(by="crawl_date")

# Recompute the length (in characters) of each crawl for this URL
P2_url_data["length"] = P2_url_data["content"].str.len()

#Add two extra columns for sentiment scores
P2_polarity = []
P2_subjectivity = []

for entry in P2_url_data.content:
    # TextBlob scores polarity from -1 to 1 and subjectivity from 0 to 1
    score = TextBlob(entry)
    P2_polarity.append(score.sentiment.polarity)
    P2_subjectivity.append(score.sentiment.subjectivity)
    
P2_url_data['polarity'] = P2_polarity
P2_url_data['subjectivity'] = P2_subjectivity


print("Analysis for: ",P2_URL,"\n")

#Find all changes in page length for this URL

P2_unique_days = P2_url_data.groupby("length").first().sort_values(by='crawl_date')

for index, row in P2_unique_days.iterrows():
    date = str(row['crawl_date']).split(' ')[0].replace('-','')
    print("Date of crawl: ",date)
    print("Length of page: ",len(row['content']))
    print("Polarity: ", row['polarity'])
    print("Subjectivity: ", row['subjectivity'])
    # Use P2_URL here so the link points at the page chosen for Part 2
    print("View on Internet archive https://web.archive.org/web/" + date + "/" + P2_URL)
    print("\n")

# Graph page length and sentiment

# Plot the page length (in characters) of each crawl
plt.plot(P2_url_data['crawl_date'],P2_url_data['length'])
plt.xticks(rotation=45)
plt.title("Page length (characters) by crawl for \n" + P2_URL)
plt.show()

plt.plot(P2_url_data['crawl_date'],P2_url_data['polarity'])
plt.xticks(rotation=45)
plt.title("Polarity change for URL:\n"+P2_URL)
plt.ylabel("Polarity")
plt.xlabel("Date of Crawl")
plt.show()

plt.plot(P2_url_data['crawl_date'],P2_url_data['subjectivity'])
plt.xticks(rotation=45)
plt.title("Subjectivity change for URL:\n"+P2_URL)
plt.ylabel("Subjectivity")
plt.xlabel("Date of Crawl")
plt.show()

Describe the changes you see in the page between the shortest and longest versions.