# Topic Modelling Approaches
The aim here is to use BERTopic to topic model *The Years*: first on the published text, and then, if I can get hold of material from the genetic dossier, on that material for comparison.

## Imports and Whatnot

In [20]:
# import nltk libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# download the nltk data packages used below (tokenizer, stopword list, WordNet)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# import BERTopic
from bertopic import BERTopic
from umap import UMAP

[nltk_data] Downloading package punkt to /Users/Joshua/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Joshua/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Joshua/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Data Munging (Where does that word come from?)

In [10]:
# read data, joining the lines into one long string
path = "data/years-raw.txt"
with open(path, 'r') as file:
    data = file.read().replace('\n', ' ')

# tokenize data
tokens = word_tokenize(data)

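# remove punctuation by keeping only purely alphanumeric tokens
# (this also drops hyphenated words and contraction pieces such as "n't")
tokens = [word for word in tokens if word.isalnum()]

# remove stopwords (case-insensitive match against the NLTK English list)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# lemmatize tokens (WordNetLemmatizer treats every word as a noun by default)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]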
print(lemmatized_tokens)
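
BERTopic embeds each item of the list it is given as a separate document, so a flat list of tokens becomes a collection of one-word documents. A minimal sketch of an alternative, assuming sentences are the unit of analysis: the helper `split_into_documents`, its `min_words` threshold, and the sample text are all invented for illustration, and the regex splitter is a stdlib stand-in for `nltk.sent_tokenize`, which handles abbreviations and quotations far better.

```python
import re

def split_into_documents(text, min_words=3):
    """Split raw text into sentence-level documents (hypothetical helper)."""
    # Collapse newlines and repeated whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Drop very short fragments, which give the model little to work with.
    return [s for s in sentences if len(s.split()) >= min_words]

# Invented sample text, not a quotation from the novel.
sample = "The rain fell all morning. Nobody spoke at breakfast.\nYes. She looked out of the window!"
docs = split_into_documents(sample)
print(docs)
```

A list like `docs` could be passed to `fit_transform` in place of the token list. Whether sentences, paragraphs, or sliding windows work best for this text is an open question that this sketch does not settle.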




## BERTopic implies an ERNIETopic, in this essay I will,,,

### Generating topic model

In [14]:
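# fit BERTopic; every item in the input list is treated as one document,
# so here each lemmatized token is embedded and clustered on its own
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(lemmatized_tokens)
# save the fitted model so it can be reloaded without refitting
topic_model.save("model")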

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
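
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set in the first cell of the notebook, before anything that uses the tokenizers is imported or run; choosing `"false"` over `"true"` is my assumption, matching what the library falls back to anyway.

```python
import os

# Set this before importing bertopic (or anything else that loads the
# Hugging Face tokenizers), so the choice is explicit from the start.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```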

### Visualising data

In [21]:
topic_model = BERTopic.load("model")

topic_model.get_topic_info()


|      | Topic | Count | Name | Representation | Representative_Docs |
|------|-------|-------|------|----------------|---------------------|
| 0    | -1    | 2198  | `-1_spotted_habit_victoria_form` | [spotted, habit, victoria, form, pencil, dazed... | [spotted, spotted, habit] |
| 1    | 0     | 1724  | `0_said_stated__` | [said, stated, , , , , , , , ] | [said, said, said] |
| 2    | 1     | 475   | `1_looked___` | [looked, , , , , , , , , ] | [looked, looked, looked] |
| 3    | 2     | 449   | `2_thought___` | [thought, , , , , , , , , ] | [thought, thought, thought] |
| 4    | 3     | 437   | `3_eleanor_mildred__` | [eleanor, mildred, , , , , , , , ] | [Eleanor, Eleanor, Eleanor] |
| ...  | ...   | ...   | ... | ... | ... |
| 1458 | 1457  | 10    | `1457_crossing_crisscrossed__` | [crossing, crisscrossed, , , , , , , , ] | [crossing, crossing, crossing] |
| 1459 | 1458  | 10    | `1458_litter_cat__` | [litter, cat, , , , , , , , ] | [litter, litter, litter] |
| 1460 | 1459  | 10    | `1459_mend_mended_wedd_loosed` | [mend, mended, wedd, loosed, dandy, shepherded... | [mend, mend, mend] |
| 1461 | 1460  | 10    | `1460_eleanor___` | [eleanor, , , , , , , , , ] | [Eleanor, Eleanor, Eleanor] |
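
In BERTopic, topic `-1` is the outlier bucket rather than a topic proper, so it is worth setting aside before ranking or plotting. A rough sketch under stated assumptions: `largest_topics` and `plot_largest_topics` are hypothetical helpers, the `(Topic, Count)` pairs are copied from the first rows of the table above, and the plotting function is only defined here, not called, since it needs the saved `"model"` from the earlier cell.

```python
def largest_topics(rows, n=3):
    """Return the n largest (topic, count) pairs, skipping the -1 outlier bucket."""
    real = [(topic, count) for topic, count in rows if topic != -1]
    return sorted(real, key=lambda pair: pair[1], reverse=True)[:n]

# (Topic, Count) pairs copied from the first rows of the table above.
rows = [(-1, 2198), (0, 1724), (1, 475), (2, 449), (3, 437)]
print(largest_topics(rows))

def plot_largest_topics(model_path="model", n=10):
    """Load the saved model and build a bar chart of its n largest topics."""
    from bertopic import BERTopic  # imported lazily: only needed when plotting
    model = BERTopic.load(model_path)
    return model.visualize_barchart(top_n_topics=n)
```

In a notebook, calling `plot_largest_topics()` as the last expression of a cell should display the figure inline.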
