# Text Analysis Notebook 
This notebook contains a variety of text analysis tools for working with and visualizing web archive data. 


## Set-Up
The following cells set up the Google Colab environment for working with web archive data. They download and install several packages and may take a couple of minutes to run. 

### Install Dependencies
Downloads and installs dependencies for running the Archives Unleashed Toolkit (AUT) and the All Our Yesterdays Toolkit (AOYTK). 

In [2]:
%%capture
# Run this cell only in Colab. It downloads and installs the dependencies required by the AUT
# and creates the environment variables required to use Java, Spark and PySpark.
# When running this script from the Docker image, these variables will already have been set appropriately.
!apt-get update
!apt-get install -y openjdk-11-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz" > spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.11.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.1.1-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-1.1.0-fatjar.jar --py-files aut-1.1.0.zip pyspark-shell'

In [3]:
%%capture
# download the AUT
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0-fatjar.jar"
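
Before starting Spark, it can help to confirm that the set-up cells did what they were meant to do. The following is a minimal sketch, not part of the AUT or AOYTK: the helper name `missing_setup_items` is hypothetical, and it assumes the two AUT files were downloaded into the current working directory, as the `wget` calls above do by default.

```python
import os

# Names taken from the set-up cells above.
REQUIRED_ENV = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_SUBMIT_ARGS")
REQUIRED_FILES = ("aut-1.1.0.zip", "aut-1.1.0-fatjar.jar")


def missing_setup_items(env=None, base_dir="."):
    """Return the unset environment variables and absent AUT files (hypothetical helper)."""
    env = os.environ if env is None else env
    # An empty string counts as unset.
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    # Assumption: the AUT files sit directly in base_dir.
    missing += [
        name for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(base_dir, name))
    ]
    return missing


problems = missing_setup_items()
print("Setup looks complete." if not problems else f"Missing: {problems}")
```

An empty list suggests the environment is ready; anything listed points to a cell that may need to be re-run.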

### Mount Google Drive 
Mounting Google Drive lets working files and data persist across multiple runs of the notebook. 


In [4]:
from google.colab import drive 
drive.mount("/content/drive/")

Mounted at /content/drive/


### Load AOYTK

Upload `aoytk.py` to the workspace and then run the following cell. 

Note: once the package is available through pip, it can be installed directly and the module will no longer need to be uploaded manually. 

In [5]:
import aoytk
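
If the import fails, the usual cause is that `aoytk.py` has not been uploaded to the workspace yet. As a rough sketch, the standard library can check whether a module is importable before attempting the import; `module_available` is a hypothetical helper name, not part of AOYTK.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found on the import path."""
    return importlib.util.find_spec(name) is not None


if not module_available("aoytk"):
    print("aoytk.py was not found; upload it to the workspace and try again.")
```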

### Set Working Directory
Select a working folder for reading and writing data. 

In [6]:
aoytk.display_path_select()

Text(value='', description='Folder path:')

Button(description='Submit', style=ButtonStyle())

Folder path set to: /content/drive/MyDrive/AOY/


## Data Exploration 
The following cells provide tools for exploring web-archive derivatives (which are saved in the working folder selected in the setup stage). 

### Domain Frequencies

*Note:* This section is still in development; much of it will be moved into `aoytk.py`.

In [8]:
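import os
import ipywidgets as widgets
# list the derivative files (CSV or Parquet) found in the working folder
derivative_files = [x for x in os.listdir(aoytk.path) if x.endswith((".csv", ".parquet", ".pqt"))]
file_options = widgets.Dropdown(description="Derivative file:", options=derivative_files)
display(file_options)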

Dropdown(description='Derivative file:', options=('all-text.csv',), value='all-text.csv')

In [9]:
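import os
import pandas as pd
# os.path.join avoids a doubled slash when aoytk.path already ends with one
selected_file = os.path.join(aoytk.path, file_options.value)
# the dropdown also lists Parquet files, so pick the reader by extension
read = pd.read_csv if selected_file.endswith(".csv") else pd.read_parquet
data = read(selected_file)
data.head()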

Unnamed: 0,crawl_date,domain,url,content
0,20060622205609,gca.ca,http://www.gca.ca/indexcms/?organizations&orgi...,Green Communities Canada | Our Member Organiza...
1,20060622205609,ppforum.com,http://www.ppforum.com/en/speeches/index.asp?t...,Speeches - Public Policy Forum Building Better...
2,20060622205609,communist-party.ca,http://communist-party.ca/calendar/cal_week.ph...,Calendar CPC Coming Events Thursday 22 June 20...
3,20060622205610,canadafirst.net,http://www.canadafirst.net/immi_crime/canada_t...,TERRORIST DESTINATION Canada's borders come un...
4,20060622205610,web.net,http://www.web.net/~ccr/edboardmtg.html,Meeting with editorial board MEETING WITH EDIT...
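
The `crawl_date` values in this preview appear to be 14-digit timestamps of the form `YYYYMMDDhhmmss`. Assuming that holds for the whole file, a small standard-library helper can turn them into `datetime` objects for later time-based analysis. The name `parse_crawl_date` is hypothetical, and the sample values are copied from the preview above.

```python
from datetime import datetime


def parse_crawl_date(value):
    """Convert a crawl timestamp (assumed to be YYYYMMDDhhmmss) into a datetime."""
    return datetime.strptime(str(value), "%Y%m%d%H%M%S")


# Two timestamps taken from the preview rows.
sample = [20060622205609, 20060622205610]
parsed = [parse_crawl_date(value) for value in sample]
print(parsed[0].isoformat())  # 2006-06-22T20:56:09
```

In the notebook, the same helper could be applied to the whole column with `data["crawl_date"].map(parse_crawl_date)`, provided every value follows the assumed format.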


In [11]:
data["domain"].value_counts().head(10)

communist-party.ca        39
greenparty.ca             39
ccsd.ca                   22
partimarijuana.org        22
gca.ca                    20
canadianactionparty.ca    19
westernblockparty.com     18
policyalternatives.ca     17
egale.ca                  17
web.net                   16
Name: domain, dtype: int64
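
Raw counts are hard to compare across collections of different sizes, so it can be useful to express each domain as a share of all captured pages. The sketch below illustrates the idea with the standard library only; `domain_shares` is a hypothetical helper and the sample list is made up for illustration. In pandas, `data["domain"].value_counts(normalize=True)` gives the equivalent proportions.

```python
from collections import Counter


def domain_shares(domains, top_n=10):
    """Return (domain, count, share) tuples for the top_n most frequent domains."""
    counts = Counter(domains)
    total = sum(counts.values())
    # most_common() is empty for an empty input, so no division by zero occurs.
    return [(domain, count, count / total) for domain, count in counts.most_common(top_n)]


# Illustrative input only; in the notebook this would be data["domain"].
example = ["greenparty.ca"] * 3 + ["web.net"]
for domain, count, share in domain_shares(example):
    print(f"{domain:<20} {count:>4} {share:6.1%}")
```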