<a href="https://colab.research.google.com/github/DigitalHistory-Lund/elam_stm_prep/blob/main/ELAM_STM_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to a Notebook (for Structural Topic Modelling the ELAM corpus)

This _Notebook_ has been set up to run in Google
Colab, though it could also be downloaded and
executed in another Jupyter environment.

_Notebooks_ are divided into cells (like this one) that
can contain code or text. There are multiple ways of running the code in a cell; the two simplest are clicking the little _play_ button in the cell's top-left corner, or clicking the cell and pressing \<ctrl> + \<enter>.

In [None]:
#@markdown # Setup
#@markdown To set up the Structural Topic Modelling interface, run this cell once,
#@markdown and then restart the runtime from the _Runtime_ menu.

#@markdown It does not need to be run again.

import os

!pip install rpy2==3.5.1

if not os.path.exists('data'):
    GITHUB_PRIVATE_KEY = """-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
QyNTUxOQAAACCQpcICmfcuRfJO4cjrtgRW5a3n6iPm5cDJqBCS6UIbgAAAAJghg+oXIYPq
FwAAAAtzc2gtZWQyNTUxOQAAACCQpcICmfcuRfJO4cjrtgRW5a3n6iPm5cDJqBCS6UIbgA
AAAEClBxMiTc/vCb1FiTcc0mbuBvH4QdbrxhJGmt+jFh/wlJClwgKZ9y5F8k7hyOu2BFbl
refqI+blwMmoEJLpQhuAAAAAFG1hdGpvaGFASFQ1Q0cxMTM1VlNIAQ==
-----END OPENSSH PRIVATE KEY-----
    """

    # Create the directory if it doesn't exist.
    ! mkdir -p /root/.ssh
    # Write the key
    with open("/root/.ssh/id_ed25519", "w") as f:
        f.write(GITHUB_PRIVATE_KEY)
    # Add github.com to our known hosts
    ! ssh-keyscan -t ed25519 github.com >> ~/.ssh/known_hosts
    # Restrict the key permissions, or else SSH will complain.
    ! chmod go-rwx /root/.ssh/id_ed25519

    # Note the `git@github.com` syntax, which will fetch over SSH instead of
    # HTTP.
    ! git clone git@github.com:DigitalHistory-Lund/elam_prep_data.git data

if not os.path.exists('stm'):
    ! git clone git@github.com:DigitalHistory-Lund/elam_stm_prep.git stm

# -y answers the confirmation prompt so the installs do not wait for input.
!apt-get install -y r-cran-stm
!apt-get install -y r-cran-tm
!apt-get install -y r-cran-igraph
!wget https://raw.githubusercontent.com/aurelberra/stopwords/master/stopwords_latin.txt -c
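The setup cell writes the SSH key first and tightens its permissions afterwards with `chmod`. As a minimal sketch of an alternative, the file can be created with owner-only permissions from the start using only the Python standard library. The helper name `write_private_file` and the placeholder content are assumptions for illustration and are not part of the notebook's repositories.

```python
import os
import stat
import tempfile
from pathlib import Path


def write_private_file(path: Path, content: str) -> None:
    """Write `content` to `path`, readable and writable by the owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # os.open applies the 0o600 mode when the file is created, so a new file
    # is never briefly readable by group or others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    # A file that already existed keeps its old mode, so set it explicitly.
    os.chmod(path, 0o600)


# Demonstration in a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    key_path = Path(tmp) / ".ssh" / "id_ed25519"
    write_private_file(key_path, "placeholder, not a real key\n")
    mode = stat.S_IMODE(key_path.stat().st_mode)
    print(oct(mode))
```

In the notebook itself the target would be `/root/.ssh/id_ed25519`, as in the cell above.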

In [None]:
#@markdown Run this cell to display the README.md file from the code repository.
#@markdown It contains some basic explanations of what is to come.
import IPython
with open('stm/README.md', 'r', encoding='utf8') as f:
    readme = f.read()
IPython.display.Markdown(readme)

In [None]:
#@markdown Optional cell: Mount your Google Drive and save all experiments there.
from google.colab import drive
drive.mount('/gdrive')
import os
os.makedirs('/gdrive/MyDrive/stm_notebook_data/', exist_ok=True)

# Link the same Drive folder that was created above.
!ln -s "/gdrive/MyDrive/stm_notebook_data" /content/data/corpora
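Running `ln -s` a second time reports an error because the link already exists. The following is a minimal sketch of a re-runnable version in plain Python. The helper name `link_directory` is an assumption for illustration, and the demonstration uses a temporary directory in place of the real Drive mount.

```python
import os
import tempfile
from pathlib import Path


def link_directory(target: Path, link: Path) -> bool:
    """Create `link` pointing at `target`; return True if a link was created."""
    target.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        # The cell was run before: leave whatever is already there untouched.
        return False
    link.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(target, link, target_is_directory=True)
    return True


# Demonstration with stand-ins for the Drive folder and the corpora path.
with tempfile.TemporaryDirectory() as tmp:
    drive_dir = Path(tmp) / "gdrive" / "MyDrive" / "stm_notebook_data"
    corpora = Path(tmp) / "content" / "data" / "corpora"
    first = link_directory(drive_dir, corpora)
    second = link_directory(drive_dir, corpora)
    resolved_ok = corpora.resolve() == drive_dir.resolve()
    print(first, second, resolved_ok)
```

The second call returns `False` instead of failing, so the cell can safely be executed more than once.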


In [None]:
# Loading the Corpus class

from stm.corpus_manager import Corpus
from google.colab import files


In [None]:
# Setting up the STM management and GUI objects.
corpus = Corpus(root_data_path='data', database_name='corpus.sqlite3')
stm = corpus.stm
plotter = stm.plotter

In [None]:
#@markdown # 1 Generating the corpus
#@markdown Select:
#@markdown 1. which level to aggregate (group) the data on,
#@markdown 2. whether to use the raw paragraphs or the lemmatized paragraphs,
#@markdown 3. the minimum term length to use,
#@markdown 4. custom stopwords, by editing the right-hand text area:
#@markdown  - All terms added to this box will be added to the stopwords
#@markdown  - Terms written with a "-" prefix will be removed, e.g. "-!" will remove the exclamation point from the stopwords list

#@markdown Then press "Save settings" to generate and load the new corpus into memory
corpus.gui
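To make the stopword and term-length settings concrete, here is a minimal sketch of how such rules could be applied to a list of tokens. It is not the code used by `corpus.gui`. The function names are hypothetical, and treating the minimum length as inclusive is an assumption.

```python
def apply_stopword_edits(base, edits):
    """Return a new stopword set after applying user edits.

    Entries starting with "-" remove the rest of the entry from the set;
    all other non-empty entries are added.
    """
    stopwords = set(base)
    for raw in edits:
        entry = raw.strip()
        if not entry:
            continue
        if entry.startswith("-") and len(entry) > 1:
            stopwords.discard(entry[1:])
        else:
            stopwords.add(entry)
    return stopwords


def filter_tokens(tokens, stopwords, min_length):
    """Keep tokens that are long enough and are not stopwords."""
    return [t for t in tokens if len(t) >= min_length and t not in stopwords]


# Example: add "ad" to the stopwords and remove "!" from them.
base = {"et", "in", "!"}
edited = apply_stopword_edits(base, ["ad", "-!"])
print(sorted(edited))
print(filter_tokens(["et", "ad", "rex", "!", "regnum"], edited, min_length=3))
```

Here "!" is no longer a stopword after the edit, but it is still dropped from the tokens because it is shorter than the minimum term length.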

In [None]:
#@markdown # 2 Fit the STM model
#@markdown Select:
#@markdown 1. the number of topics
#@markdown 2. the maximum number of iterations to train the model (training stops earlier if the model converges)
#@markdown     - Note: For very large numbers, fitting can take well over half an hour.
#@markdown 3. whether to facet the model by the authors
#@markdown 4. whether to facet the model by the work (title)

#@markdown Press "Fit stm" to fit the model, and wait.
stm.gui
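The iteration setting is an upper bound: training ends as soon as the model stops improving. A minimal sketch of that stopping rule, using a toy objective instead of a real topic model, might look as follows. The function names and the relative-change criterion are assumptions for illustration. The R `stm` package exposes comparable controls (`max.em.its` and `emtol`), but whether this notebook's interface passes them through in the same way is not confirmed here.

```python
def run_until_converged(step, max_iterations, tolerance=1e-5):
    """Call `step()` until the relative change in its return value is small.

    `step` performs one training iteration and returns the current value of
    the objective. Returns a tuple (iterations_run, converged).
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        current = step()
        if previous is not None:
            change = abs(current - previous) / max(abs(previous), 1e-12)
            if change < tolerance:
                return iteration, True
        previous = current
    return max_iterations, False


def make_toy_step():
    """Toy objective that approaches 1.0, halving the gap on each call."""
    state = {"gap": 1.0}

    def step():
        state["gap"] /= 2
        return 1.0 - state["gap"]

    return step


# A generous limit lets the loop stop early; a tight limit cuts it off.
print(run_until_converged(make_toy_step(), max_iterations=100))
print(run_until_converged(make_toy_step(), max_iterations=5))
```

With the generous limit the loop reports convergence well before 100 iterations; with a limit of 5 it stops at the cap without having converged.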

In [None]:
#@markdown # 3 Plotting

#@markdown Select the type of plot and which topics to visualize, then press
#@markdown "Visualize" to generate (or, alternatively, load) the plot.
#@markdown See the README.md file for more details on the types of plots.

#@markdown Note: All files are accessible through the file browser on the left side of the Colab interface.
plotter.gui

In [None]:
#@markdown ## All steps in one cell
#@markdown Running this cell will show a window with all three interfaces,
#@markdown separated by tabs.

#@markdown __Warning__: Running this seems to interfere with the above visualisations. 
from ipywidgets import Tab
tabber = Tab()
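tabber.children = [corpus.gui, stm.gui, plotter.gui]
# set_title works in both ipywidgets 7 and 8; the `titles` attribute is only
# honoured from ipywidgets 8 onwards.
for index, title in enumerate(['Corpus settings', 'STM fitting', 'Visualization']):
    tabber.set_title(index, title)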
tabber

In [None]:
#@markdown ## Download output from current model
#@markdown Run this cell to download a zip file with the corpus, the model and all
#@markdown visualizations of the current model.
files.download(plotter.zipper())
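How `plotter.zipper()` builds its archive is not shown in this notebook. As a rough sketch of the general idea, a folder of output files can be bundled into a single zip file with the standard library. The helper name `zip_directory` and the file names below are placeholders, not the notebook's actual output layout.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_stem: Path) -> Path:
    """Zip the contents of `source_dir` and return the path of the .zip file."""
    # make_archive appends the ".zip" extension to `archive_stem` itself.
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=source_dir)
    return Path(archive)


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model_output"
    (model_dir / "plots").mkdir(parents=True)
    (model_dir / "model.txt").write_text("placeholder model file\n")
    (model_dir / "plots" / "summary.txt").write_text("placeholder plot\n")

    archive_path = zip_directory(model_dir, Path(tmp) / "current_model")
    with zipfile.ZipFile(archive_path) as zf:
        # Skip directory entries and list only the files in the archive.
        names = sorted(n for n in zf.namelist() if not n.endswith("/"))
    print(archive_path.name, names)
```

In Colab, the resulting path could then be handed to `files.download`, as the cell above does with the path returned by `plotter.zipper()`.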