<a href="https://colab.research.google.com/github/BrockDSL/ARCH_Data_Explore/blob/main/Municipal_Document_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


![dsl logo](https://github.com/BrockDSL/ARCH_Data_Explore/blob/main/dsl_logo.png?raw=true)


# Municipal Document Similarity

This notebook loads different archived snapshots of the [URLs of interest](https://github.com/BrockDSL/ARCH_Data_Explore/blob/main/urls_of_interest.txt) and compares their text similarity using [spaCy](https://spacy.io/).
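For models that ship with word vectors, such as `en_core_web_md`, spaCy's default `Doc.similarity` is the cosine similarity between the averaged token vectors of the two documents. The sketch below reproduces that calculation in plain Python on tiny made-up vectors; the `toy_vectors` table is purely illustrative and is not real spaCy data.

```python
import math

# Hypothetical 3-dimensional "word vectors" for illustration only;
# the real en_core_web_md vectors have 300 dimensions.
toy_vectors = {
    "covid": [0.9, 0.1, 0.0],
    "testing": [0.7, 0.3, 0.1],
    "symptoms": [0.8, 0.2, 0.1],
    "parking": [0.0, 0.2, 0.9],
}

def doc_vector(tokens):
    """Average the vectors of all tokens in a document, dimension by dimension."""
    vectors = [toy_vectors[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = doc_vector(["covid", "testing"])
doc_b = doc_vector(["covid", "symptoms"])
doc_c = doc_vector(["parking"])

sim_ab = cosine_similarity(doc_a, doc_b)  # two documents on a related topic
sim_ac = cosine_similarity(doc_a, doc_c)  # unrelated topic
print(sim_ab, sim_ac)
```

Because averaging smooths out word-level differences, long documents on a related topic tend to score close to 1, which is worth keeping in mind when reading the scores this notebook produces.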


In [None]:

#Install spaCy tools in a separate cell so the downloads only run once

!pip install spacy==3.2.0
!pip install --upgrade --no-cache-dir gdown
# The medium model (en_core_web_md) includes the word vectors needed for similarity
# After running this cell, press Ctrl-M . to restart the runtime, then proceed
!python -m spacy download en_core_web_md


In [None]:
#Restart the runtime automatically so the newly installed packages are picked up
import os
os.kill(os.getpid(), 9)

In [1]:
#Import libraries

import pandas as pd
import spacy
import gdown
from google.colab import files

pd.set_option('display.max_colwidth', None)  # None means no truncation of cell text
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)

print("Libraries Loaded")

Libraries Loaded


In [None]:
#Verify package versions

#!pip show gdown
#!pip show spacy

In [3]:
#spaCy pipeline that will handle similarity
nlp = spacy.load("en_core_web_md")

In [4]:
#Load dataset for comparison
gdown.download("https://drive.google.com/u/0/uc?id=1oKNphdZkuNfeh-beuTkcIBo_EFLWO9zX&export=download","municipal_data.csv.gz",quiet=False)
!gunzip municipal_data.csv.gz
archive_data = pd.read_csv("municipal_data.csv")
#Drop columns that are not needed
del(archive_data['Unnamed: 0'])
del(archive_data['index'])
del(archive_data['mime_type_web_server'])
del(archive_data['mime_type_tika'])
del(archive_data['language'])
archive_data['crawl_date']= pd.to_datetime(archive_data['crawl_date'],format='%Y-%m-%d')
print("Data Loaded")

Downloading...
From: https://drive.google.com/u/0/uc?id=1oKNphdZkuNfeh-beuTkcIBo_EFLWO9zX&export=download
To: /content/municipal_data.csv.gz
100%|██████████| 51.3M/51.3M [00:00<00:00, 116MB/s]


Data Loaded
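As an alternative to shelling out to `gunzip`, pandas can read a gzip-compressed CSV directly (compression is inferred from the `.gz` extension), and `DataFrame.drop` can remove several columns in one call. A minimal sketch on a small stand-in file; `sample.csv.gz` and its rows are made up for illustration, and only the column names mirror the dataset loaded above.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV that stands in for municipal_data.csv.gz.
sample_csv = (
    "index,url,crawl_date,language,content\n"
    "0,https://example.org/a,2020-11-06,en,first page\n"
    "1,https://example.org/a,2020-11-20,en,second page\n"
)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample_csv)

# Compression is inferred from the .gz suffix; parse_dates converts crawl_date on load.
sample = pd.read_csv(sample_path, parse_dates=["crawl_date"])

# errors="ignore" skips any listed column that is absent from the file.
unneeded = ["Unnamed: 0", "index", "mime_type_web_server", "mime_type_tika", "language"]
sample = sample.drop(columns=unneeded, errors="ignore")

print(sample.dtypes)
```

Reading the compressed file in place also means the cell can be re-run without `gunzip` stopping to ask about an existing output file.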


In [42]:
#URLs of interest
url_list = pd.read_csv("https://raw.githubusercontent.com/BrockDSL/ARCH_Data_Explore/main/urls_of_interest.txt",header=None)
url_list.columns = ["base_url"]

In [None]:
#archive_data.head()


## URL Selection


In [43]:
url_list

| | base_url |
|---|---|
| 0 | https://www.niagararegion.ca/health/covid-19/default.aspx |
| 1 | https://www.niagararegion.ca/health/covid-19/protect-yourself.aspx |
| 2 | https://www.niagararegion.ca/health/covid-19/employee-information.aspx |
| 3 | https://www.niagararegion.ca/health/covid-19/municipal-bills.aspx |
| 4 | https://www.niagararegion.ca/health/covid-19/testing.aspx |
| 5 | https://www.niagararegion.ca/health/covid-19/symptoms.aspx |
| 6 | https://www.niagararegion.ca/health/covid-19/resources.aspx |
| 7 | https://www.niagararegion.ca/health/covid-19/self-isolation.aspx |
| 8 | https://www.niagararegion.ca/health/covid-19/social-support.aspx |
| 9 | https://www.notl.com/COVID-19/ |


In [63]:
#@title URLs to Compare
#@markdown Enter the number from the first column of the table above to select the base URL for the comparison
base_url_choice = 0 #@param {type:"integer"}
#@markdown Enter the number from the first column of the table above to select the URL to compare against
compare_to_choice = 19 #@param {type:"integer"}


In [None]:
#List the crawl dates available for the base URL
archive_data[archive_data['url'] == url_list.iloc[base_url_choice]['base_url']][['crawl_date']]

In [110]:
#@title Base URL Version Choice
#@markdown Enter the index (leftmost number in the table above) of the crawl to use for the base URL

base_url_version_choice = 31069 #@param {type:"integer"}


In [None]:
#List the crawl dates available for the comparison URL
archive_data[archive_data['url'] == url_list.iloc[compare_to_choice]['base_url']][['crawl_date']]

In [70]:
#@title Compare-To URL Version Choice

#@markdown Enter the index (leftmost number in the table above) of the crawl to use for the comparison URL
compare_to_url_version_choice = 28385 #@param {type:"integer"}
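The index numbers entered above are row labels of `archive_data`, so they can change if the dataset is re-sorted or filtered. One possible alternative is to pick a snapshot by URL and crawl date instead. The helper below, `get_snapshot`, is a hypothetical name and not part of the original notebook; it is shown on a small made-up frame that only borrows the `url`, `crawl_date`, and `content` column names.

```python
import pandas as pd

# Made-up stand-in for archive_data with the same column names.
demo = pd.DataFrame({
    "url": ["https://example.org/a", "https://example.org/a", "https://example.org/b"],
    "crawl_date": pd.to_datetime(["2020-11-06", "2020-11-20", "2020-11-06"]),
    "content": ["first", "second", "other"],
})

def get_snapshot(frame, url, crawl_date):
    """Return the first row matching url and crawl_date, or None if nothing matches."""
    wanted = pd.Timestamp(crawl_date)
    matches = frame[(frame["url"] == url) & (frame["crawl_date"] == wanted)]
    if matches.empty:
        return None
    return matches.iloc[0]

row = get_snapshot(demo, "https://example.org/a", "2020-11-20")
print(row["content"])
```

Returning `None` for a missing snapshot makes it easy to report a clear message instead of failing later with an index error.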

In [152]:
#Build both documents and compare them

base_url_f = url_list.iloc[base_url_choice]['base_url']
#The version choices are index labels from the tables above, so look them up with .loc
base_url_ts = archive_data.loc[base_url_version_choice, 'crawl_date']

comp_url_f = url_list.iloc[compare_to_choice]['base_url']
comp_url_ts = archive_data.loc[compare_to_url_version_choice, 'crawl_date']

print("Comparison\n")


doc_base = archive_data[archive_data['url']== base_url_f]
doc_base = doc_base[doc_base['crawl_date'] == base_url_ts]
doc_base = doc_base.head(1)
bdate = str(doc_base.crawl_date.values[0]).split('T')[0].split(' ')[0].replace('-','')

print("Base URL: ", base_url_f)
print("Crawl date: ", base_url_ts)
print("IA link: https://web.archive.org/web/" + bdate + "/" + base_url_f)

comp_base = archive_data[archive_data['url']== comp_url_f]
comp_base = comp_base[comp_base['crawl_date'] == comp_url_ts]
comp_base = comp_base.head(1)
cdate = str(comp_base.crawl_date.values[0]).split('T')[0].split(' ')[0].replace('-','')

print("\n")
print("Comp URL: ",comp_url_f)
print("Crawl date: ",comp_url_ts)
print("IA link: https://web.archive.org/web/" + cdate + "/" + comp_url_f)

print("\nSimilarity Score")
db = nlp(doc_base.content.values[0])
dc = nlp(comp_base.content.values[0])

print(db.similarity(dc))

Comparison

Base URL:  https://www.niagararegion.ca/health/covid-19/default.aspx
Crawl date:  2020-11-20 00:00:00
IA link: https://web.archive.org/web/20201120/https://www.niagararegion.ca/health/covid-19/default.aspx


Comp URL:  http://portcolborne.ca/page/covid-19
Crawl date:  2020-11-06 00:00:00
IA link: https://web.archive.org/web/20201106/http://portcolborne.ca/page/covid-19

Similarity Score
0.9780733164429761
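The Internet Archive links in the comparison cell are built by turning a crawl date into a `YYYYMMDD` string and placing it between the Wayback Machine prefix and the original URL. A small sketch of that step using `strftime` rather than string splitting; `wayback_link` is an illustrative helper name, not something defined in the notebook.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"

def wayback_link(url, crawl_date):
    """Build a Wayback Machine link from a URL and a date-like object with strftime."""
    return WAYBACK_PREFIX + crawl_date.strftime("%Y%m%d") + "/" + url

link = wayback_link("http://portcolborne.ca/page/covid-19", datetime(2020, 11, 6))
print(link)
```

A pandas `Timestamp` also provides `strftime`, so a value taken from the `crawl_date` column could be passed in the same way.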


In [None]:
#Run cell to show Base Document
doc_base


In [None]:
#Run cell to show Compare to Document
comp_base