# Advanced

## Task 01: Develop a custom topic model tailored to the linguistic features of a specific medieval language.

Description: This code applies topic modeling to Medieval Latin texts, using linguistic knowledge (orthographic normalization, abbreviation expansion, and stop-word removal) to improve the accuracy and interpretability of the results. The aim is to discover meaningful topics within the texts and to assess their quality with a coherence score.

Hints:

- Research relevant linguistic resources (e.g., dictionaries, grammars).
- Experiment with different model architectures (e.g., LDA with linguistic constraints).
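
One way to read "LDA with linguistic constraints" is as a seeded topic-word prior: gensim's `LdaModel` accepts an `eta` argument that may be a matrix of shape `(num_topics, num_terms)`. The sketch below builds such a matrix in plain Python. The `build_seeded_eta` helper, the seed lists, and the prior weights are hypothetical choices for illustration and are not part of the task's code.

```python
def build_seeded_eta(token2id, seed_words, num_topics, base=0.01, boost=1.0):
    """Return a num_topics x vocab_size prior that favours seed words.

    token2id   -- mapping from token to integer id (e.g. dictionary.token2id)
    seed_words -- {topic_index: [tokens expected in that topic]} (hypothetical)
    base/boost -- illustrative prior weights, not tuned values
    """
    vocab_size = len(token2id)
    eta = [[base] * vocab_size for _ in range(num_topics)]
    for topic, words in seed_words.items():
        for word in words:
            word_id = token2id.get(word)
            if word_id is not None:  # skip seeds that are missing from the vocabulary
                eta[topic][word_id] = boost
    return eta


# Toy vocabulary standing in for a gensim dictionary's token2id mapping.
token2id = {"pater": 0, "noster": 1, "uerbum": 2, "deus": 3}
seeds = {0: ["pater", "noster"], 1: ["uerbum", "deus", "absent"]}
eta = build_seeded_eta(token2id, seeds, num_topics=2)
print(eta)
```

Converted with `numpy.asarray(eta)`, the matrix could be passed as `eta=` when constructing the model. Whether seeding helps depends on the corpus and on how well the seed lists are chosen.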


**Libraries:**

* **nltk:**
    * `tokenize`: Splits text into words (tokens).
* **gensim:**
    * `corpora`: Creates a dictionary and corpus for topic modeling.
    * `models`: Provides the base `LdaModel` for customization and `CoherenceModel` for evaluation.
* **string:** Supplies `string.punctuation` for removing punctuation tokens.
* **numpy:** Performs numerical calculations (imported but not used in the current code).
* **matplotlib.pyplot:** Visualizes results (not used in the current code, but could be for future analysis).
* **collections.defaultdict:** Provides convenient dictionary-like storage (imported but not used in the current code).
* **re:** Provides regular expression matching operations, used here for orthographic normalization.

**Code Structure:**

1. **Preprocessing with Linguistic Knowledge:**
   * `preprocess_medieval_latin` function:
     * Tokenizes text.
     * Converts to lowercase.
     * Removes punctuation and numbers.
     * Normalizes Medieval Latin orthography (e.g., 'j' to 'i', 'v' to 'u').
     * Resolves common abbreviations (e.g., 'dns' to 'dominus').
     * Removes stop words.

2. **Morphological Analysis (Optional):**
   * `get_stem` function (placeholder):
     * Intended for stemming words to their base forms. Requires implementation or integration with a Medieval Latin stemmer.

3. **Custom Dictionary Creation:**
   * `MedievalLatinDictionary` class:
     * Inherits from `gensim.corpora.Dictionary`.
     * Creates a dictionary with additional stem information for each token.

4. **Custom LDA Model:**
   * `MedievalLatinLDA` class:
     * Inherits from `gensim.models.LdaModel`.
     * Allows for incorporating linguistic knowledge during topic inference (this part is left as a placeholder for customization).

5. **Main Process:**
   * Load sample Medieval Latin texts.
   * Preprocess texts using the custom function.
   * Create a custom dictionary.
   * Create a corpus using the dictionary.
   * Train the custom LDA model.
   * Print the top words for each topic.
   * Calculate and print the topic coherence score.
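
Step 1 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the regular-expression tokenizer stands in for `word_tokenize`, and the abbreviation and stop-word tables are the small samples from this task rather than a complete resource. The names `normalize_token` and `preprocess` are hypothetical.

```python
import re

# Sample tables taken from the task description; real corpora need far larger ones.
ABBREVIATIONS = {"dns": "dominus", "xps": "christus"}
STOPWORDS = {"et", "in", "ad", "ut", "cum", "non", "qui", "ab", "ex", "de"}

# j -> i and v -> u, applied after lowercasing.
ORTHOGRAPHY = str.maketrans({"j": "i", "v": "u"})


def normalize_token(token):
    """Expand a known abbreviation, otherwise normalize spelling."""
    token = token.lower()
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    return token.translate(ORTHOGRAPHY)


def preprocess(text):
    """Tokenize on letter runs, normalize, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letters only, no digits or punctuation
    tokens = [normalize_token(t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("In principio erat Verbum, et Verbum erat apud Deum."))
print(preprocess("Dns Jesus xps venit."))
```

In this sketch abbreviations are looked up before spelling is normalized, so table keys can be written as they appear in the source. The task's code applies the two steps in the opposite order, which works as long as the keys are already in normalized spelling.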

**Key Points and Refinements:**

* **Linguistic Adaptation:** The code targets Medieval Latin: it normalizes orthography and expands abbreviations during preprocessing, and leaves `get_stem` as a hook for morphological analysis.
* **Customization Potential:** The `MedievalLatinDictionary` and `MedievalLatinLDA` subclasses provide a framework for incorporating more sophisticated linguistic rules and knowledge into the topic modeling process.
* **Topic Coherence:** The `c_v` coherence score from gensim's `CoherenceModel` gives a quantitative proxy for how interpretable the discovered topics are.
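
The `get_stem` placeholder could be filled with a crude suffix stripper such as the one sketched below. It is only an illustration: the suffix list is a small hypothetical sample of Latin nominal and verbal endings, it will over- and under-stem, and a lemmatizer built for Latin would be preferable on real data.

```python
# Hypothetical, deliberately short list of Latin endings, longest first.
SUFFIXES = sorted(
    ["ibus", "orum", "arum", "ntur", "tur", "um", "us", "is", "am", "em",
     "os", "as", "ae", "nt", "a", "e", "i", "o", "t", "s", "m"],
    key=len,
    reverse=True,
)
MIN_STEM_LENGTH = 3  # avoid reducing short words to nothing


def get_stem(word):
    """Strip the longest matching suffix while keeping at least MIN_STEM_LENGTH letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for form in ["dominus", "dominum", "dominorum", "deus", "deum", "uerbum", "es"]:
    print(form, "->", get_stem(form))
```

On the sample tokens this maps `deus` and `deum` to the same stem, which is the kind of merging a topic model benefits from on heavily inflected text.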



import nltk
nltk.download('punkt')  # tokenizer data; NLTK 3.9+ looks for 'punkt_tab' instead

from nltk.tokenize import word_tokenize
import gensim
from gensim import corpora
import string
import re
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from gensim.models import LdaModel
from gensim.models.ldamodel import LdaState

# 1. Preprocessing with linguistic knowledge
def preprocess_medieval_latin(text):
    # Tokenize
    tokens = word_tokenize(text.lower())

    # Remove numbers and punctuation, including multi-character punctuation tokens such as '...'
    tokens = [
        token for token in tokens
        if not token.isdigit() and not all(ch in string.punctuation for ch in token)
    ]

    # Apply Medieval Latin specific rules
    tokens = [normalize_medieval_latin(token) for token in tokens]

    # Remove stopwords
    stopwords = set(['et', 'in', 'ad', 'ut', 'cum', 'non', 'qui', 'ab', 'ex', 'de'])
    tokens = [token for token in tokens if token not in stopwords]

    return tokens

def normalize_medieval_latin(token):
    # Apply Medieval Latin orthographic normalizations: 'j' -> 'i', 'v' -> 'u'
    token = re.sub(r'[jJ]', 'i', token)
    token = re.sub(r'[vV]', 'u', token)

    # Normalize common abbreviations
    abbreviations = {
        'dns': 'dominus',
        'xps': 'christus',
        # Add more abbreviations here or load them from an external source
    }
    return abbreviations.get(token, token)

# 2. Morphological analysis
def get_stem(word):
    # Implement or use a Medieval Latin stemmer
    # This is a placeholder function
    return word

# 3. Custom dictionary creation
class MedievalLatinDictionary(corpora.Dictionary):
    def __init__(self, documents):
        self.stems = {}
        super().__init__(documents)
        self._create_stems()

    def _create_stems(self):
        for token in self.token2id:
            self.stems[token] = get_stem(token)

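    def doc2bow(self, document, allow_update=False, return_missing=False):
        # Override to map tokens to their stems before counting.
        # Caveat: the vocabulary is built from unstemmed tokens, so once
        # get_stem() changes a token, its stem is only counted if that stem
        # is itself in the vocabulary.
        stemmed_doc = [self.stems.get(token, token) for token in document]
        return super().doc2bow(stemmed_doc, allow_update, return_missing)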

# 4. Custom LDA model
class MedievalLatinLDA(LdaModel):
    def __init__(self, corpus=None, id2word=None, num_topics=100, **kwargs):
        super().__init__(corpus=corpus, id2word=id2word, num_topics=num_topics, **kwargs)

    def get_document_topics(self, bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False):
        # Override to incorporate linguistic knowledge
        # This is where you could add constraints based on syntax or semantics
        return super().get_document_topics(bow, minimum_probability, minimum_phi_value, per_word_topics)

# 5. Main process
# Sample Medieval Latin texts (replace with your corpus)
texts = [
    "Pater noster, qui es in caelis, sanctificetur nomen tuum.",
    "In principio erat Verbum, et Verbum erat apud Deum, et Deus erat Verbum.",
    # Add more texts
]

# Preprocess texts
processed_texts = [preprocess_medieval_latin(text) for text in texts]

# Create custom dictionary
dictionary = MedievalLatinDictionary(processed_texts)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Train custom LDA model
num_topics = 5  # Adjust as needed
lda_model = MedievalLatinLDA(corpus=corpus, id2word=dictionary, num_topics=num_topics)

# Print top words for each topic
print("Top words for each topic:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")

# Calculate topic coherence
coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=processed_texts, dictionary=dictionary, coherence='c_v')
coherence = coherence_model.get_coherence()
print(f"Topic Coherence: {coherence}")


Top words for each topic:
Topic: 0 
Words: 0.077*"erat" + 0.077*"uerbum" + 0.077*"es" + 0.077*"noster" + 0.077*"caelis" + 0.077*"tuum" + 0.077*"pater" + 0.077*"nomen" + 0.077*"sanctificetur" + 0.077*"deum"

Topic: 1 
Words: 0.077*"uerbum" + 0.077*"erat" + 0.077*"es" + 0.077*"noster" + 0.077*"pater" + 0.077*"caelis" + 0.077*"sanctificetur" + 0.077*"tuum" + 0.077*"nomen" + 0.077*"principio"

Topic: 2 
Words: 0.077*"uerbum" + 0.077*"erat" + 0.077*"es" + 0.077*"noster" + 0.077*"caelis" + 0.077*"nomen" + 0.077*"pater" + 0.077*"sanctificetur" + 0.077*"principio" + 0.077*"tuum"

Topic: 3 
Words: 0.254*"uerbum" + 0.254*"erat" + 0.095*"apud" + 0.095*"deus" + 0.095*"principio" + 0.095*"deum" + 0.016*"es" + 0.016*"tuum" + 0.016*"sanctificetur" + 0.016*"caelis"

Topic: 4 
Words: 0.125*"nomen" + 0.125*"sanctificetur" + 0.125*"pater" + 0.125*"tuum" + 0.125*"caelis" + 0.125*"noster" + 0.125*"es" + 0.021*"erat" + 0.021*"uerbum" + 0.021*"principio"

Topic Coherence: 0.08105956953597629
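
The repeated weight of 0.077 in topics 0 to 2 is consistent with a near-uniform distribution over the vocabulary: the two sample sentences yield 13 distinct tokens after preprocessing, and 1/13 ≈ 0.077. The check below reproduces that count with the standard library. The regular-expression tokenizer and this `preprocess` helper are stand-ins (assumptions) for the NLTK-based function in the task code.

```python
import re

STOPWORDS = {"et", "in", "ad", "ut", "cum", "non", "qui", "ab", "ex", "de"}
texts = [
    "Pater noster, qui es in caelis, sanctificetur nomen tuum.",
    "In principio erat Verbum, et Verbum erat apud Deum, et Deus erat Verbum.",
]


def preprocess(text):
    """Lowercase, keep letter runs, apply j -> i and v -> u, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t.replace("j", "i").replace("v", "u") for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


vocabulary = sorted({token for text in texts for token in preprocess(text)})
uniform_weight = 1 / len(vocabulary)

print(len(vocabulary))           # 13
print(round(uniform_weight, 3))  # 0.077
```

With two short documents and five topics, most topics receive almost no evidence, so the low coherence score likely reflects the toy corpus rather than the method. A larger corpus is needed before the topics or the score can be interpreted.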


## Solution

1. The solutions are fully flexible.