
<img src="https://raw.githubusercontent.com/archivesunleashed/archivesunleashed.org/master/themes/hugo-material-docs/static/images/cropped-logo.png" height="200px" width="500px">

# Welcome

Welcome to the Archives Unleashed Cloud Visualization Demo in Jupyter Notebook for your collection. This demonstration takes the main derivatives from the Cloud and uses Python to analyze and produce information about your collection.

This product is in beta. If you run into a bug or would like to see a feature added, please open an [issue in our GitHub repository](https://github.com/archivesunleashed/auk/issues) to let us know.

If you have some basic Python coding experience, you can change the code we provided to suit your own needs.

Unfortunately, we cannot support code that you have written yourself. We recommend using `File > Make a Copy` before changing any code in this notebook. That way, you can always return to the basic visualizations offered here. You can also re-download the Jupyter Notebook file from your Archives Unleashed Cloud account.

### How Jupyter Notebooks Work:

If you have not used Jupyter Notebooks before, the most important thing to know is that pressing `Shift` + `Enter` (or `Shift` + `Return`) runs the Python code in the selected cell and displays its output directly below that cell.

Start with the cell titled `# RUN THIS FIRST`. It imports all the libraries and sets basic variables (e.g. where your derivative files are located) for the notebook. After that, the rest of the notebook should run without further setup.


In [None]:
# RUN THIS FIRST

# This cell sets up all the necessary libraries and dependencies
# for your collection.
coll_id = "4656"
auk_fp = "data/"
auk_full_text = auk_fp + coll_id + "-fulltext.txt"
auk_gephi = auk_fp + coll_id + "-gephi.gexf"
auk_graphml = auk_fp + coll_id + "-gephi.graphml"
auk_domains = auk_fp + coll_id + "-fullurls.txt"
auk_filtered_text = auk_fp + coll_id + "-filtered_text.zip"

# The following script will attempt to install the necessary dependencies
# for the visualisations. You may prefer to install these on your
# own in the command line.
import sys
from collections import Counter

try:  # a library for manipulating column data.
    import pandas as pd
except ImportError:
    !{sys.executable} -m pip install pandas
    import pandas as pd  # import again now that it is installed

try:
    import matplotlib.pyplot as plt # a library for Plotting
except ImportError:
    !{sys.executable} -m pip install matplotlib
    import matplotlib.pyplot as plt

try:
    import numpy as np # a library for complex mathematics
except ImportError:
    !{sys.executable} -m pip install numpy
    import numpy as np
    
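try:
    import nltk  # a library for natural language processing
    from nltk.tokenize import word_tokenize
    from nltk.draw.dispersion import dispersion_plot as dp
except ImportError:
    !{sys.executable} -m pip install nltk
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.draw.dispersion import dispersion_plot as dp
# word_tokenize needs the Punkt models (recent NLTK may ask for 'punkt_tab').
nltk.download('punkt')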

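Before moving on, it can help to confirm that the derivative files named in the setup cell are where the notebook expects them. The following is a minimal sketch, not part of the original notebook: `find_missing` is a hypothetical helper, and the paths simply repeat the naming pattern used above.

```python
from pathlib import Path

def find_missing(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]

# Same naming pattern as the setup cell; adjust if your files live elsewhere.
coll_id = "4656"
auk_fp = "data/"
expected = [
    auk_fp + coll_id + "-fulltext.txt",
    auk_fp + coll_id + "-gephi.gexf",
    auk_fp + coll_id + "-fullurls.txt",
]

for path in find_missing(expected):
    print("Missing derivative file:", path)
```

If nothing is printed, all of the listed files were found.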
# Text Analysis

The following set of functions uses the [Natural Language Toolkit](https://www.nltk.org) Python library to find the most frequently used words in the collection, and to break those counts down by year or by domain.

In [None]:
# You can change the value of `top` to get more results. 
top = 30

def clean_domain(s):
    stop_words = ["com", "org", "net", "edu"]
    ret = ""
    dom = s.split(".")
    if len(dom) <3:
        ret = dom[0]
    elif dom[-2] in stop_words:
        ret = dom[-3]
    else:
        ret = dom[1]
    return ret

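def get_textfile(minlen=3):
    # Collect every token longer than `minlen` characters from the text field.
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            # maxsplit=3 keeps commas inside the text field from truncating it.
            tokens += word_tokenize(line.split(",", 3)[3])
    tokens = [x for x in tokens if len(x) > minlen]
    return tokens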

def get_text_domains(minlen=3):
    # Build one (domain, word counts) pair per line of the full-text file.
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            # maxsplit=3 keeps commas inside the text field from truncating it.
            split_line = line.split(',', 3)
            tokens.append((clean_domain(split_line[1]), Counter([x for x in word_tokenize(split_line[3]) if len(x) > minlen])))
    return tokens

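def get_text_years(minlen=3):
    # Build one (year, word counts) pair per line of the full-text file.
    tokens = []
    with open(auk_full_text) as fin:
        for line in fin:
            # maxsplit=3 keeps commas inside the text field from truncating it.
            split_line = line.split(',', 3)
            tokens.append((split_line[0][1:5], Counter([x for x in word_tokenize(split_line[3]) if len(x) > minlen])))
    return tokens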

def year(minlen=3):
    return get_text_years(minlen)

def domain(minlen=3):
    return get_text_domains(minlen)

def get_top_tokens(total=20, minlen=3):
    return Counter(get_textfile(minlen)).most_common(total)

def get_top_tokens_by(fun, total=20, minlen=3):
    tokens = fun(minlen)
    sep = {k[0]: Counter() for k in tokens}
    for key, value in tokens:
        sep[key] += value
    return [(key, val.most_common(total)) for key, val in sep.items()]

Now that you have run the cell above and defined these functions, you can use them in a variety of ways.
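The functions above assume that each line of the full-text derivative looks roughly like `(crawl_date,domain,url,text)`. That layout is inferred from how the code indexes each line, not from a format specification, so treat the sketch below as an illustration only. `parse_line` is a hypothetical helper and the sample line is invented.

```python
def parse_line(line):
    """Split one full-text line into (year, domain, url, text).

    Assumes the layout "(YYYYMMDD...,domain,url,text)". Splitting at most
    three times keeps commas inside the text field intact. A line with
    fewer than four fields raises ValueError.
    """
    date, domain, url, text = line.rstrip("\n").split(",", 3)
    year = date[1:5]          # skip the opening parenthesis
    if text.endswith(")"):    # drop the closing parenthesis, if present
        text = text[:-1]
    return year, domain, url, text

# Invented sample line, for illustration only.
sample = "(20160504,www.example.com,http://www.example.com/,Hello, world)"
print(parse_line(sample))
# ('2016', 'www.example.com', 'http://www.example.com/', 'Hello, world')
```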

### Text by Year

In [None]:
# Get the set of available years in the collection 
set([x[0] for x in get_text_years()])

Now we can build a separate list of word counts for an individual year in this collection. The example below selects all items from 2016. If your collection does not cover 2016, change the year to one of the values returned above.

In [None]:
year_results = [t[1] for t in get_text_years() if t[0] == "2016"]

In [None]:
# print the first ten results from the year specified above
year_results[:10]

In [None]:
# You may now want to export these results so you can work with them.
# The file will appear in the same directory as this notebook;
# change the output path if you want it somewhere else.

with open("results-2016.txt", "w") as output_file:
    for value in year_results:
        output_file.write(str(value) + "\n")  # one record per line

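The export above writes each `Counter` using its Python text representation, which is awkward to load again later. As one possible alternative (an assumption, not something the Cloud requires), the sketch below writes one JSON object per line and reads the file back. `export_counters` is a hypothetical helper, the sample data is invented, and the file is written to the system's temporary directory.

```python
import json
import os
import tempfile
from collections import Counter

def export_counters(counters, path):
    """Write each Counter as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as output_file:
        for counter in counters:
            output_file.write(json.dumps(counter) + "\n")

# Invented sample data standing in for `year_results`.
sample_results = [Counter(["archive", "archive", "web"]), Counter(["crawl"])]
sample_path = os.path.join(tempfile.gettempdir(), "results-sample.jsonl")
export_counters(sample_results, sample_path)

# Reading the file back gives plain dictionaries of word counts.
with open(sample_path, encoding="utf-8") as input_file:
    loaded = [json.loads(line) for line in input_file]
print(loaded)
```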
### Text by Domain

In [None]:
# Get the set of available domains in the collection 
set([x[0] for x in get_text_domains()])

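The names in this set are shortened by `clean_domain`, so they will not look like full host names. The sketch below repeats that function so it can run on its own and applies it to a few made-up host names, which may help when choosing a value for the filter in the next cell.

```python
def clean_domain(s):
    # Same heuristic as the notebook's clean_domain, repeated here so this
    # example is self-contained.
    stop_words = ["com", "org", "net", "edu"]
    dom = s.split(".")
    if len(dom) < 3:
        return dom[0]
    if dom[-2] in stop_words:
        return dom[-3]
    return dom[1]

# Hypothetical host names, for illustration only.
for host in ["example.com", "www.example.com", "www.example.com.au"]:
    print(host, "->", clean_domain(host))
```

All three hosts map to `example`. The heuristic is positional, though: a host such as `news.blog.example.org` yields `blog`, so some sites may appear under an unexpected key.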
In [None]:
# Keep only the results for the given domain and see how many there are.

domain_results = [t[1] for t in get_text_domains() if t[0] == "nanaimodailynews"]
len(domain_results)

In [None]:
# print the first five results from the domain specified above
domain_results[:5]

In [None]:
# You may now want to export these results so you can work with the text of one domain.
# The file will appear in the same directory as this notebook;
# change the output path if you want it somewhere else.

with open("results-domain.txt", "w") as output_file:
    for value in domain_results:
        output_file.write(str(value) + "\n")  # one record per line

## Overall Collection Characteristics

In [None]:
# Get a list of the top words in the collection
# (regardless of year).
get_top_tokens(top)

In [None]:
# Get a list of the top tokens, separated by year.
get_top_tokens_by(year, top)

In [None]:
# Get a list of top tokens, separated by domain.
get_top_tokens_by(domain, top, 4)

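`get_top_tokens_by` works by adding together the per-document `Counter` objects that share the same key (a year or a domain) and then taking the most common words for each key. A minimal sketch of that idea on invented data, using a `defaultdict` in place of the notebook's pre-filled dictionary:

```python
from collections import Counter, defaultdict

# Invented (key, Counter) pairs in the shape returned by year() or domain().
records = [
    ("2015", Counter({"archive": 2, "library": 1})),
    ("2016", Counter({"archive": 1, "election": 3})),
    ("2016", Counter({"election": 1, "library": 2})),
]

merged = defaultdict(Counter)
for key, counts in records:
    merged[key] += counts   # Counter addition sums the counts per word

top_by_key = {key: counts.most_common(2) for key, counts in merged.items()}
print(top_by_key)
# {'2015': [('archive', 2), ('library', 1)], '2016': [('election', 4), ('library', 2)]}
```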
In [None]:
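# Create a dispersion plot, showing where each of the listed words appears
# in the text.
# minlen=1 keeps short words such as "he" and "she", which the default
# minlen=3 would filter out before plotting.
text = get_textfile(minlen=1)
dp(text, ["he", "she"]) # uses the nltk dispersion plot library (dp).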

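A dispersion plot marks, for each target word, the positions in the token list where that word occurs. The sketch below computes those positions directly on an invented token list, without drawing anything. `word_offsets` is a hypothetical helper, not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Invented token list, for illustration only.
tokens = ["she", "said", "that", "he", "left", "before", "she", "arrived"]
print(word_offsets(tokens, ["he", "she"]))
# {'he': [3], 'she': [0, 6]}
```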
# Bibliography

Bird, Steven, Edward Loper and Ewan Klein (2009), *Natural Language Processing with Python*. O’Reilly Media Inc.

Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.