# Topic Modeling with DARIAH topics

We use the Python library DARIAH Topics to run topic modeling on the AO3 corpus: https://dariah-de.github.io/Topics/  
Issue: the library is designed to work with plain .txt files, while our corpus is stored in an R environment (an .RData file).  
We therefore need to convert the R environment into .txt files, which can be done directly in Python.

## 1. Preparation
Install and import the libraries.

In [None]:
!pip install dariah
!pip install pyreadr
!pip install langdetect

In [None]:
import dariah
import pyreadr
from langdetect import detect

import re
import seaborn as sns
import matplotlib.pyplot as plt

## 2. Corpus loading

Load the corpus from the R environment.  
**Note:** you first have to upload the "AO3_corpus.RData" file via the "Files" panel on the left.

In [None]:
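# read every object stored in the .RData file (returned as a dict-like of DataFrames)
result = pyreadr.read_r('AO3_corpus.RData')
# list the names of the R objects that were loaded
print(result.keys())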

In [None]:
# metadata table (one row per text)
my_df = result["my_df"]
# select the single column to get a Series (a vector) rather than a DataFrame
all_texts = result["all_texts"]["all_texts"]

## 3. Corpus cleaning
Remove texts that are too short or not in English.  
**Note:** these are the same operations already performed for stylometry in R.

In [None]:
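# create unique ids from the original row index (used later to look up the texts)
my_df["ID"] = my_df.index.values
# drop short texts (value in the "length" column below 1000)
my_df = my_df.drop(my_df[my_df["length"] < 1000].index)
# detect the language of each text from its incipit
my_df["lang"] = [detect(x) for x in my_df["incipit"]]
# remove non-English texts
my_df = my_df.drop(my_df[my_df["lang"] != 'en'].index)
# show the first rows
my_df.head()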
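
The language detection step can fail on some inputs: `langdetect.detect` raises an exception when a text has no usable features (for example an empty string or digits only). The sketch below wraps an arbitrary detector so that such rows get a placeholder label instead of stopping the loop. The names `safe_detect` and `toy_detect` are hypothetical and not part of the notebook; `toy_detect` is only a stand-in so that the example runs without extra packages.

```python
def safe_detect(text, detect_fn, default="unknown"):
    """Return detect_fn(text), or `default` when detection is not possible."""
    # Missing values and blank strings cannot be classified
    if not isinstance(text, str) or not text.strip():
        return default
    try:
        return detect_fn(text)
    except Exception:
        # langdetect raises LangDetectException on undetectable input
        return default


def toy_detect(text):
    """Stand-in detector for illustration only (NOT langdetect)."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en" if " the " in f" {text.lower()} " else "xx"


incipits = ["The night was dark.", "12345", "", None]
langs = [safe_detect(x, toy_detect) for x in incipits]
print(langs)
```

In the notebook, the same wrapper could be called as `safe_detect(x, detect)` inside the list comprehension. Since langdetect is non-deterministic by default, setting `DetectorFactory.seed = 0` (after `from langdetect import DetectorFactory`) before the first detection should make the results reproducible.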

## 4. Corpus creation

Now everything is ready to write the corpus out as ".txt" files, one file per text.

In [None]:
# create new directory (-p: no error if it already exists)
!mkdir -p corpus

# loop on metadata (my_df) to write texts (all_texts)
for i in my_df["ID"]:
  
  # define filename: strip every non-word character from author and title
  author = re.sub(r'\W+', '', str(my_df.loc[i,'author']))
  title = re.sub(r'\W+', '', str(my_df.loc[i,'title']))  
  filename = 'corpus/'+author+'_'+title+'.txt'
  
  # write file (explicit UTF-8 encoding; the file is closed automatically)
  with open(filename, 'w', encoding='utf-8') as text_file:
    text_file.write(all_texts[i])
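
One caveat of building filenames from author and title: two texts that share both (or differ only in punctuation) end up with the same filename, and the later one silently overwrites the earlier one. A minimal sketch of one way to avoid this, by appending a counter to repeated names, is shown below. The helper `make_filename` and the sample authors and titles are hypothetical, for illustration only.

```python
import re
from pathlib import Path


def make_filename(author, title, used, directory="corpus"):
    """Build a sanitized, unique .txt path from author and title."""
    # Same cleaning as in the notebook: drop every non-word character
    author = re.sub(r"\W+", "", str(author))
    title = re.sub(r"\W+", "", str(title))
    stem = f"{author}_{title}"
    # Append _2, _3, ... until the name has not been used yet
    candidate = stem
    counter = 2
    while candidate in used:
        candidate = f"{stem}_{counter}"
        counter += 1
    used.add(candidate)
    return Path(directory) / f"{candidate}.txt"


# Made-up metadata; the first two entries collide after cleaning
sample = [
    ("ocean eyes", "Chasing Shadows"),
    ("ocean eyes", "Chasing Shadows!"),
    ("J. Doe", "Untitled"),
]
used = set()
names = [make_filename(a, t, used) for a, t in sample]
for path in names:
    print(path.name)
```

The `used` set has to be shared across all calls within one run, so it is created once before the loop.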

## 5. Topic modeling

The training can start! (it might take a few minutes)

In [None]:
model, vis = dariah.topics(directory="corpus",
               stopwords=100,
               num_topics=10,
               num_iterations=1000)
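
According to the dariah documentation, an integer passed as `stopwords` is treated as a threshold for the most frequent words in the corpus, which are removed before training (a list of stopwords can be passed instead). The sketch below illustrates that idea on a toy corpus with plain Python. It is not dariah's implementation: the whitespace tokenization and the helper `most_frequent_words` are simplifying assumptions.

```python
from collections import Counter


def most_frequent_words(documents, n):
    """Return the n most frequent lowercase tokens across all documents."""
    counts = Counter()
    for text in documents:
        # Naive whitespace tokenization (an assumption for this sketch)
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]


docs = [
    "the cat and the dog",
    "the bird and the sea",
    "the cat and dog",
]

# With a threshold of 2, the two most frequent words become stopwords
stop = most_frequent_words(docs, 2)

# Remove the stopwords from every document
filtered = [[w for w in d.lower().split() if w not in stop] for d in docs]
print(stop)
print(filtered)
```

On a real corpus, a threshold such as 100 mostly removes function words; whether it also removes words that matter for the analysis is worth checking by inspecting the list.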

## 6. Results

Visualize the results (as tables and plots)

In [None]:
# table with all values

model.topic_document.head()

In [None]:
# see topic/document heatmap

%matplotlib inline
vis.topic_document()

In [None]:
# dariah's default heatmap can be hard to read with many documents;
# plotting the table with seaborn directly gives more control over size and colors

plt.figure(figsize=(50,50))
sns.heatmap(model.topic_document, cmap="Blues")
plt.show()
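
A fixed `figsize=(50,50)` produces a very large image regardless of how many topics and documents there are. As an alternative, the figure size can be derived from the shape of the table. The helper below is a hypothetical sketch: the cell size and the minimum and maximum side lengths (in inches) are arbitrary assumptions to be tuned.

```python
def heatmap_figsize(n_rows, n_cols, cell=0.5, min_side=4.0, max_side=40.0):
    """Return a (width, height) in inches scaled to the table shape."""
    # One cell takes `cell` inches; clamp each side to [min_side, max_side]
    width = min(max(n_cols * cell, min_side), max_side)
    height = min(max(n_rows * cell, min_side), max_side)
    return (width, height)


# Example shapes: 10 rows with 60 columns, and 10 rows with 500 columns
size_small = heatmap_figsize(10, 60)
size_large = heatmap_figsize(10, 500)
print(size_small, size_large)
```

In the notebook this could be used as `n_rows, n_cols = model.topic_document.shape` followed by `plt.figure(figsize=heatmap_figsize(n_rows, n_cols))`.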

In [None]:
# show the words that make up a single topic
vis.topic("topic0")

In [None]:
# show the topic distribution of a single document (filename without ".txt")
vis.document("ocean_eyes_221_ChasingShadows")