
<img src="https://cloud.archivesunleashed.org/assets/logo-8d2126e162dc682078284bb8f5585e4365fbad6dc04aa2afbae747626bd815ea.png" height="100px" width="100px">

# Welcome

Welcome to the Archives Unleashed Cloud Visualization Demo in Jupyter Notebook for your collection { Collection Name } id { Collection Id }. 

This demonstration takes the main derivatives from the Archives Unleashed Cloud and applies common Python data analysis techniques to produce basic information about the collection you provided.

This product is in beta, so if you encounter any problems, please open an [issue in our GitHub repository](https://github.com/archivesunleashed/auk/issues) to let us know what went wrong.

If you have some basic Python coding experience, you can change the code we provided to suit your own needs.

### How Jupyter Notebooks Work:

If you have no previous experience with Jupyter Notebooks, the most important thing to understand is that pressing `Shift` + `Enter` (or `Return`) runs the Python code inside a cell (a code window) and displays its output directly below it.

The cell that begins with `# RUN THIS FIRST` should be the first place you go. It sets the basic variables (e.g., where your derivative files are located) for the notebook. The two cells after it install and import the required libraries. After that, run the remaining cells from top to bottom.

Unfortunately, we cannot support code that you have written yourself. We recommend that you use `File > Make a Copy` before changing any of the code in this notebook. That way, you can always return to the basic visualizations we have offered here. Of course, you can also re-download the Jupyter Notebook from your Archives Unleashed Cloud account.
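
Because every later cell reads from the derivative files, it can help to confirm that they are in place before running anything. The following is a minimal sketch and not part of the Archives Unleashed Cloud tooling: the helper name `missing_derivatives` is hypothetical, and the list of file suffixes is an assumption based on the file names used in the `# RUN THIS FIRST` cell.

```python
from pathlib import Path

# Suffixes assumed from the file names in the "RUN THIS FIRST" cell.
DERIVATIVE_SUFFIXES = [
    "-fulltext.txt",
    "-gephi.gexf",
    "-fullurls.txt",
    "-filtered_text.zip",
]

def missing_derivatives(data_dir, coll_id):
    """Return the expected derivative paths that do not exist in data_dir."""
    expected = [Path(data_dir) / (coll_id + suffix) for suffix in DERIVATIVE_SUFFIXES]
    return [str(path) for path in expected if not path.is_file()]

# Example: list any files missing for collection 4656 in the data/ folder.
print(missing_derivatives("data/", "4656"))
```

An empty list means all of the assumed files were found. Any paths that are printed point to derivatives that still need to be downloaded or moved into the data folder.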

In [1]:
# RUN THIS FIRST
# This cell sets the collection ID and the paths to the derivative
# files for your collection.
coll_id = "4656"
auk_fp = "data/"
auk_full_text = auk_fp + coll_id + "-fulltext.txt"
auk_gephi = auk_fp + coll_id + "-gephi.gexf"
auk_graphml = auk_fp + coll_id + "-gephi.graphml"
auk_domains = auk_fp + coll_id + "-fullurls.txt"
auk_filtered_text = auk_fp + coll_id + "-filtered_text.zip"

In [None]:
# The following script will attempt to install the necessary dependencies
# for the visualisations. You may prefer to install these on your
# own in the command line.

import sys
!{sys.executable} -m pip install -U pip
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install nltk
!{sys.executable} -m pip install wordcloud

In [None]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.draw.dispersion import dispersion_plot as dp
import wordcloud as wc
import matplotlib.pyplot as plt
from collections import Counter

# Text Analysis

The following functions use the NLTK Python library to find the most frequently used words in the collection. Depending on the size of the full-text file, this may take a while to compute.
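
As a rough illustration of what the next cell computes, the sketch below counts the most frequent words in a small in-memory sample instead of the full-text derivative. The sample lines, the assumed `(crawl date, domain, URL, text)` layout of each record, and the helper name `top_words` are all assumptions made for this example. It also uses a plain `str.split` in place of NLTK's `word_tokenize` so that it runs without any downloads.

```python
from collections import Counter

# Hypothetical sample lines, assuming each record looks like
# (crawl_date,domain,url,text).
sample_lines = [
    "(20180101,example.com,http://example.com/a,archives preserve the web)",
    "(20190304,example.com,http://example.com/b,web archives need tools)",
]

def top_words(lines, total=3, min_length=4):
    """Return the `total` most common words of at least `min_length` characters."""
    counts = Counter()
    for line in lines:
        # Split on the first three commas only, so commas in the text survive.
        text = line.rstrip("\n").rstrip(")").split(",", 3)[3]
        counts.update(
            word.lower() for word in text.split() if len(word) >= min_length
        )
    return counts.most_common(total)

print(top_words(sample_lines))
```

The length filter plays the same role as the `len(x[1]) > 3` check in the notebook's own function: it drops short words such as "the" that would otherwise dominate the counts.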

In [None]:
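# You can change the value of top to get more results.
# Citation for the nltk library: Bird, Loper and Klein (2009); see the Bibliography.
top = 30

def get_textfile():
    """Tokenize the text field of every line in the full-text derivative."""
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            # Split on the first three commas only, so that commas inside
            # the text field do not cut it short.
            tokens += word_tokenize(line.split(",", 3)[3])
    return tokens

def get_text_years():
    """Return a (year, text) tuple for every line in the full-text derivative."""
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            split_line = line.split(",", 3)
            # Skip the first character of the first field and keep the
            # next four as the year.
            tokens.append((split_line[0][1:5], split_line[3]))
    return tokens

def get_top_tokens(total=20):
    """Return the most frequent tokens longer than three characters as (count, token) pairs."""
    tokens = get_textfile()
    tokens = [(value, key) for key, value in Counter(tokens).items()]
    tokens = [pair for pair in tokens if len(pair[1]) > 3]
    tokens.sort(reverse=True)
    return tokens[:total]

def get_top_tokens_by_year(total=20):
    """Return (year, most common tokens) pairs, one for each year found."""
    tokens = get_text_years()
    sep = {year: "" for year, _ in tokens}
    for year, text in tokens:
        sep[year] = sep[year] + " " + text
    return [(year, Counter(word_tokenize(text)).most_common(total)) for year, text in sep.items()]

# Get a list of the top words in the collection.
get_top_tokens(top)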

In [None]:
# Get a list of the top tokens, separated by year.
get_top_tokens_by_year()

In [None]:
text = get_textfile()
dp(text, ["he", "she"])  # Dispersion plot of where "he" and "she" occur in the text.

# Bibliography

Bird, Steven, Edward Loper and Ewan Klein (2009), *Natural Language Processing with Python*. O’Reilly Media Inc.

Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.