<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/master/aut_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with AUT and PySpark

## Environment

## Setup PySpark

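The following cells download the Archives Unleashed Toolkit (AUT) Python package and JAR, install Java 8, Maven, Spark 2.4.4, and findspark, and then start a PySpark context with AUT loaded.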


In [0]:
%%capture

!wget "https://www.dropbox.com/s/c7dypk5uepmq920/aut.zip"
!wget "https://www.dropbox.com/s/ggz7w1q42dciupf/aut-0.18.2-SNAPSHOT-fatjar.jar"

In [2]:
!ls

aut-0.18.2-SNAPSHOT-fatjar.jar	aut.zip  sample_data


In [0]:
%%capture

!apt-get update
!apt-get install -y -qq openjdk-8-jdk-headless
!apt-get install -y -qq maven

!curl -L "https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz" > spark-2.4.4-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-0.18.2-SNAPSHOT-fatjar.jar --py-files aut.zip pyspark-shell'

In [0]:
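import findspark
findspark.init()  # Make the Spark install under SPARK_HOME importable; must run before importing pyspark.

import pyspark
from pyspark.sql import SQLContext

# The context picks up the AUT jar and aut.zip from PYSPARK_SUBMIT_ARGS set above.
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)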
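
If the Spark context fails to start, a likely cause is a configured path that does not exist. The following is a minimal sketch for checking the environment variables set above before initializing Spark. The helper `missing_spark_paths` is hypothetical and is not part of findspark or AUT.

```python
import os
import shlex


def missing_spark_paths(env):
    """Return a list of configured paths that do not exist on disk.

    Hypothetical helper: checks JAVA_HOME, SPARK_HOME, and every file
    passed to --jars / --py-files in PYSPARK_SUBMIT_ARGS.
    """
    missing = []
    for name in ("JAVA_HOME", "SPARK_HOME"):
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(f"{name}={path}")

    args = shlex.split(env.get("PYSPARK_SUBMIT_ARGS", ""))
    for flag in ("--jars", "--py-files"):
        if flag in args:
            idx = args.index(flag) + 1
            if idx < len(args):
                # Both flags accept a comma-separated list of files.
                for item in args[idx].split(","):
                    if not os.path.exists(item):
                        missing.append(f"{flag} {item}")
    return missing


problems = missing_spark_paths(dict(os.environ))
print(problems or "all configured paths exist")
```

An empty list means every configured directory and file was found. Relative file names are resolved against the current working directory, which matches how the jar and zip are referenced above.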

## Data

The cell below downloads sample data that you might want to use with the Archives Unleashed Toolkit. The ARC and WARC files are drawn from the Canadian Political Parties & Political Interest Groups Archive-It Collection, collected by the University of Toronto. We are grateful that they have provided this material to us.

If you use their material, please cite it as follows (this example cites a website):

    University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051004191340/http://canadianactionparty.ca/Default2.asp


In [0]:
%%capture
!mkdir -p data
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz?raw=true" -O data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz?raw=true" -O data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz

In [11]:
!ls data

ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz
ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
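
The listing confirms that the files exist, but not that they are valid archives: a failed or redirected download can leave an empty or non-gzip file under the expected name. As a quick sanity check, the sketch below looks for the two-byte gzip signature at the start of each file. The helper `is_gzip` is hypothetical and is not part of AUT.

```python
import glob

GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file starts with the two-byte gzip signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Check every downloaded archive in data/ (the loop is empty if none exist).
for path in sorted(glob.glob("data/*.gz")):
    print(path, "ok" if is_gzip(path) else "NOT gzip, re-download")
```

This only inspects the header. It does not verify that the whole archive decompresses cleanly.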


In [0]:
from aut import *
from pyspark.sql.functions import desc

In [0]:
archive = WebArchive(sc, sqlContext, "data")

In [14]:
webpages = archive.webpages()
webpages.show(10, True)

+----------+--------------------+--------------------+--------------------+--------+--------------------+
|crawl_date|                 url|mime_type_web_server|      mime_type_tika|language|             content|
+----------+--------------------+--------------------+--------------------+--------+--------------------+
|  20091218|http://www.equalv...|           text/html|application/xhtml...|      en|HTTP/1.1 200 OK
...|
|  20091218|http://www.libera...|           text/html|application/xhtml...|      en|HTTP/1.1 200 OK
...|
|  20091218|http://www.canadi...|           text/html|application/xhtml...|      en|HTTP/1.1 200 OK
...|
|  20091218|http://www.equalv...|           text/html|application/xhtml...|      en|HTTP/1.1 200 OK
...|
|  20091218|http://www.libera...|           text/html|application/xhtml...|      en|HTTP/1.1 200 OK
...|
|  20091218|http://greenparty...|           text/html|application/xhtml...|      fr|HTTP/1.1 200 OK
...|
|  20091218|http://www.equalv...|           te

In [15]:
links = archive.links()
links.show(10, True)

+----------+--------------------+--------------------+--------------------+
|crawl_date|                 src|                dest|              anchor|
+----------+--------------------+--------------------+--------------------+
|  20091218|http://www.equalv...|http://www.equalv...|                    |
|  20091218|http://www.equalv...|http://www.equalv...|       RSS SUBSCRIBE|
|  20091218|http://www.equalv...|http://www.equalv...|Bulletin d’AVE - ...|
|  20091218|http://www.equalv...|http://www.equalv...|MORE ABOUT EV'S Y...|
|  20091218|http://www.equalv...|http://www.thesta...|Coyle: Honouring ...|
|  20091218|http://www.equalv...|http://gettingtot...|Getting to the Ga...|
|  20091218|http://www.equalv...|http://www.snapde...|                    |
|  20091218|http://www.libera...|http://www.libera...|Liberal Party of ...|
|  20091218|http://www.libera...|http://www.libera...|   Michael Ignatieff|
|  20091218|http://www.libera...|http://www.libera...|        Introduction|
+----------+--------------------+--------------------+--------------------+
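
A common next step with `src`/`dest` pairs is to collapse each URL to its host and count host-to-host links. The sketch below shows the idea in plain Python on a handful of rows. The helpers `host` and `count_host_links` are hypothetical, and the URLs are placeholders, not values from this collection.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def count_host_links(pairs):
    """Count (source host, destination host) pairs, skipping self-links."""
    counts = Counter()
    for src, dest in pairs:
        s, d = host(src), host(dest)
        if s and d and s != d:
            counts[(s, d)] += 1
    return counts


# Hypothetical (src, dest) pairs standing in for collected DataFrame rows.
sample = [
    ("http://www.example.org/a", "http://www.example.com/x"),
    ("http://example.org/b", "http://example.com/y"),
    ("http://www.example.org/a", "http://www.example.org/c"),
]
print(count_host_links(sample).most_common())
```

With the DataFrame above, a small set of pairs could be collected with something like `[(r.src, r.dest) for r in links.select("src", "dest").take(100)]`. For a full collection, the same aggregation is better expressed in Spark so the data is not pulled to the driver.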

In [16]:
images = archive.images()
images.show(10, True)

+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|mime_type_tika|width|height|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|http://farm3.stat...|4047878934_ef12ba...|      jpg|          image/jpeg|    image/jpeg|  100|    75|e1a376f170b815f49...|2165fd2908950e9f6...|/9j/4AAQSkZJRgABA...|
|http://farm3.stat...|4047881126_fc6777...|      jpg|          image/jpeg|    image/jpeg|   75|   100|371a2a5142c611405...|933f937c949826696...|/9j/4AAQSkZJRgABA...|
|http://farm3.stat...|4047879492_a72dd8...|      jpg|          image/jpeg|    image/jpeg|  100|    75|8877679361cde970d...|31dbaaed2f7194c95...|/9j/4AAQSkZJRgABA...|
|htt

In [17]:
image_links = archive.image_links()
image_links.show(10, True)

+--------------------+--------------------+
|                 src|           image_url|
+--------------------+--------------------+
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
|http://www.equalv...|http://www.equalv...|
+--------------------+--------------------+
only showing top 10 rows



In [18]:
pdfs = archive.pdfs()
pdfs.show(10, True)

+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://partimarij...|Massicotti_Affida...|      pdf|     application/pdf|application/pdf|4daa676e867d0ac65...|5d7c895db1b592aaa...|JVBERi0xLjIKJSDi4...|
|http://www.web.ne...|      securityqs.PDF|      pdf|     application/pdf|application/pdf|eadd48d19fd55e103...|2ab70423309828af9...|JVBERi0xLjIgDQol4...|
|http://partimarij...|           Ewing.pdf|      pdf|     application/pdf|application/pdf|8e43fec319e76e0f5...|930cd5ecf521b2b8f...|JVBERi0xLjIKJSDi4...|
|http://www.libera...|           35050.pdf|      pdf|     application/pdf|ap

In [19]:
audio = archive.audio()
audio.show(10, True)

+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                 url|      filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://www.canadi...|   COLLINS1.RA|       ra|   audio/x-realaudio|audio/x-pn-realaudio|0128cb24f439f13a7...|ff1f9fdc00805d8fe...|LnJh/QAEAAAucmE0A...|
|http://www.animal...|2006-01-13.mp3|      mp3|          audio/mpeg|          audio/mpeg|e4b3825ea1ecae26d...|990919d05d6cd4bdb...|//NAxAkWuVKkX9gQA...|
+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
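
The `bytes` column appears to hold each file's content as a base64 string. A minimal sketch for writing one such value back to disk, assuming that encoding, might look like this. The helper `save_binary` and the default `extracted` directory are hypothetical, not part of AUT.

```python
import base64
import os


def save_binary(b64_text, filename, out_dir="extracted"):
    """Decode a base64 string and write it to out_dir/filename.

    Only the base name of `filename` is used, so a value such as
    '../x.mp3' cannot escape out_dir.
    """
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, os.path.basename(filename))
    with open(target, "wb") as handle:
        handle.write(base64.b64decode(b64_text))
    return target
```

For a single row, this could be used as `row = audio.first()` followed by `save_binary(row["bytes"], row["filename"])`.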



In [20]:
video = archive.video()
video.show(10, True)

+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|http://v2.cache7....|       videoplayback|      flv|         video/x-flv|   video/x-flv|670586aafcb4824b5...|07e88d18aea50510d...|RkxWAQUAAAAJAAAAA...|
|http://www.bloc.o...|            bloc.wmv|      wmv|      video/x-ms-wmv|video/x-ms-wmv|fc16dd3c9c289a7ce...|1a77f9f3d9b18d31a...|MCaydY5mzxGm2QCqA...|
|http://www.noshar...|       HomaCBClQ.WMV|      wmv|          text/plain|video/x-ms-wmv|ef89c319f8ccd119a...|46e34725a78df33d0...|MCaydY5mzxGm2QCqA...|
|http://www.bloc.o...|        16juin02.wmv|      wmv|      video/x-ms-wmv|video/x-

In [21]:
spreadsheets = archive.spreadsheets()
spreadsheets.show(10, True)

+---+--------+---------+--------------------+--------------+---+----+-----+
|url|filename|extension|mime_type_web_server|mime_type_tika|md5|sha1|bytes|
+---+--------+---------+--------------------+--------------+---+----+-----+
+---+--------+---------+--------------------+--------------+---+----+-----+



In [22]:
presentation_program_files = archive.presentation_program()
presentation_program_files.show(10, True)

+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                 url|filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://www.afn.ca...| aig.pps|      pps|application/vnd.m...|application/vnd.m...|f38d64504487dd373...|b7d60930a981e2bc2...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
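
The two MIME type columns can disagree, as in the video output, where a `.WMV` file is served as `text/plain`. One way to cross-check a row is to inspect the leading bytes of its content. The sketch below assumes the `bytes` column is base64 and matches a few well-known file signatures. The helper `sniff` is hypothetical, and the signature list is deliberately short.

```python
import base64

# Well-known leading byte signatures; not an exhaustive list.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"%PDF-", "pdf"),
    (b"FLV", "flv"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy MS Office container
]


def sniff(b64_prefix):
    """Guess a file type from the first bytes of a base64 string.

    `b64_prefix` must have a length that is a multiple of 4 so that it
    decodes without padding errors.
    """
    head = base64.b64decode(b64_prefix)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


# First 12 characters of `bytes` values shown in the outputs of this notebook.
for prefix in ("/9j/4AAQSkZJ", "JVBERi0xLjIK", "RkxWAQUAAAAJ", "0M8R4KGxGuEA"):
    print(prefix, "->", sniff(prefix))
```

The last prefix is taken from the `.pps` row above. Legacy PowerPoint and Word files share the same OLE2 container signature, so this check cannot tell them apart.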



In [23]:
word_processor_files = archive.word_processor()
word_processor_files.show(10, True)

+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|    mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|http://canadianac...|Some_facts_about_...|      doc|  application/msword|application/msword|f35c8570d81a0f4f5...|64378b21c8ea6bce5...|0M8R4KGxGuEAAAAAA...|
|http://www.nawl.c...|Pub_Brief_Antiter...|      doc|  application/msword|application/msword|b0528837322957073...|35f8fdc77d6e92b40...|0M8R4KGxGuEAAAAAA...|
|http://www.equalv...|       layton-en.doc|      doc|  application/msword|application/msword|3c28c798bfcc25ffe...|f9a0f96ab31de9cdd...|0M8R4KGxGuEAAAAAA...|
|http://www.nawl.c...|PRESS_MDStatement...|      doc|  app

In [24]:
text_files = archive.text_files()
text_files.show(10, True)

+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|http://agoracosmo...|            html.mes|      mes|          text/plain|    text/plain|7dee99acd58abc66c...|096c51df2828f49b8...|PGJyPg0KPHA+PGZvb...|
|http://agoracosmo...|           html.mes1|     mes1|          text/plain|    text/plain|58c9b7de5042206c3...|ede3e2b202a8f61d5...|PGJyPg0KPHA+PGZvb...|
|http://www.noshar...|       HomaCBClQ.WMV|      WMV|          text/plain|video/x-ms-wmv|ef89c319f8ccd119a...|46e34725a78df33d0...|MCaydY5mzxGm2QCqA...|
|http://www.conser...|20060411-videogut...|      flv|          text/plain|   video