# Beginner

## Task 01: Identify common themes in a corpus of Old English texts (or any other text)

Task: Identify common themes in a corpus of Old English texts.

Description: Use a simple topic model (e.g., LDA) to find recurring topics in a collection of Old English poems or prose.

Hints:

- Preprocess the text by removing stopwords and lemmatizing.
- Start with a small number of topics (e.g., 5-10).


This guide will walk you through the code, explaining each part and helping you implement it.

1. Setting Up the Environment:

    NLTK: This code uses NLTK, a library for natural language processing. NLTK needs a few data packages before it can tokenize, filter stopwords, and lemmatize. The first section of the code contains three lines that start with nltk.download(). Run them once in your coding environment (a Jupyter Notebook or a Python script) to download those resources. Recent NLTK releases may ask for an additional resource named punkt_tab; if tokenizing raises a LookupError, download the resource named in the error message.

2. Preparing Your Texts:

    Sample Texts: The code includes two sample Old English passages from the opening of Beowulf. Replace them with the texts you want to analyze. Each text goes inside its own pair of quotation marks in the texts list, and the entries are separated by commas.
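
If your texts live in separate files rather than in the code itself, the list can also be built by reading them from disk. The following is a minimal sketch, assuming one UTF-8 plain-text file per text in a single folder. The function name load_texts and the folder name my_texts are hypothetical and not part of the original code.

```python
from pathlib import Path


def load_texts(folder):
    """Read every .txt file in `folder` and return the contents as a list of strings."""
    texts = []
    # sorted() keeps the order stable, so "Text 1", "Text 2", ... always
    # refer to the same files between runs.
    for path in sorted(Path(folder).glob("*.txt")):
        # Old English uses characters such as æ, þ and ð, so read as UTF-8.
        texts.append(path.read_text(encoding="utf-8"))
    return texts


# Hypothetical usage: replace "my_texts" with the folder that holds your files.
# texts = load_texts("my_texts")
```

The resulting list can be used in place of the hand-written texts list.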

3. Preprocessing Function:

    Understanding the Function: This part of the code defines a function called preprocess that takes text as input and performs some cleaning tasks. Let's break down what it does:
    
    - Tokenization: It splits the text into smaller units (tokens), mostly individual words, using word_tokenize. Imagine cutting up a sentence into individual words.
    - Removing Punctuation and Numbers: It drops tokens that are punctuation marks (like commas and periods) or numbers, using a list comprehension with a condition.
    - Removing Stopwords: It removes common words that carry little meaning (like "the", "a", "is") using the stopword list from NLTK. This list covers modern English, so Old English function words such as "ða" or "hu" are not removed.
    - Lemmatization: It reduces words to their base form (e.g., "kings" becomes "king") using NLTK's WordNetLemmatizer. The lemmatizer only knows modern English, so most Old English words pass through unchanged.

    Running the Function: You don't need to call this function manually. The code later applies it to all your texts automatically.
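
To make the filtering steps concrete, here is a minimal sketch that uses only the Python standard library and an explicit loop instead of NLTK. It is not the notebook's preprocess function: the name clean_tokens, the whitespace-based tokenization, and the tiny Old English stopword set are all assumptions made for illustration (the set is not a vetted list).

```python
import string

# Illustrative only: a handful of Old English function words, not a vetted list.
OLD_ENGLISH_STOPWORDS = {"we", "in", "hu", "ða", "þa", "ond", "se"}


def clean_tokens(text, stop_words=OLD_ENGLISH_STOPWORDS):
    """Split text on whitespace, then drop punctuation, numbers and stopwords."""
    kept = []
    for raw in text.casefold().split():
        # Strip punctuation stuck to the word, e.g. "hwæt!" -> "hwæt".
        token = raw.strip(string.punctuation)
        if not token:            # the token was punctuation only
            continue
        if token.isdigit():      # the token is a number
            continue
        if token in stop_words:  # the token is a common function word
            continue
        kept.append(token)
    return kept


print(clean_tokens("Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon"))
# ['hwæt', 'gardena', 'geardagum', 'þeodcyninga', 'þrym', 'gefrunon']
```

A stopword list suited to the language of your corpus usually matters more for topic quality than the choice of tokenizer.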

4. Processing Your Texts:

    This part uses the preprocess function on each text in your list and stores the cleaned texts in a new list called processed_texts.

5. Creating the Dictionary:

    Understanding Dictionaries: The code creates a "dictionary" using the corpora.Dictionary class from gensim. This dictionary is a mapping that records every unique word found in your processed texts and assigns each one an integer ID.
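
The idea behind the dictionary can be sketched in plain Python. This only illustrates the word-to-ID mapping and is not gensim's implementation; the helper name build_vocabulary is hypothetical, and gensim may assign different ID numbers to the same words.

```python
def build_vocabulary(processed_texts):
    """Map each unique token to an integer ID, in order of first appearance."""
    token_to_id = {}
    for tokens in processed_texts:
        for token in tokens:
            if token not in token_to_id:
                # The next free ID equals the number of words seen so far.
                token_to_id[token] = len(token_to_id)
    return token_to_id


sample = [["scyld", "scefing", "eorlas"], ["eorlas", "egsode"]]
vocabulary = build_vocabulary(sample)
print(vocabulary)  # {'scyld': 0, 'scefing': 1, 'eorlas': 2, 'egsode': 3}
```

With gensim, the same kind of lookup is available through the dictionary's token2id attribute.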

6. Creating the Corpus:

    Understanding Corpus: The code creates a "corpus" by calling the dictionary's doc2bow method on each processed text. Think of a corpus as your collection of documents represented numerically based on the dictionary. Each document becomes a bag-of-words: a list of (word ID, count) pairs, where the count is how many times that word appears in the document.
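
A bag-of-words conversion can be sketched with the standard library as follows. This is a simplified illustration rather than gensim's code; the helper name to_bag_of_words and the small vocabulary are hypothetical.

```python
from collections import Counter


def to_bag_of_words(tokens, token_to_id):
    """Convert a token list into sorted (word ID, count) pairs."""
    # Tokens that are missing from the vocabulary are skipped.
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


# Hypothetical vocabulary; in the real code the IDs come from the dictionary.
token_to_id = {"scyld": 0, "scefing": 1, "eorlas": 2}
doc = ["eorlas", "scyld", "eorlas", "unknown"]
bow = to_bag_of_words(doc, token_to_id)
print(bow)  # [(0, 1), (2, 2)]
```

Word order is discarded in this representation; only the counts per word remain, which is all that LDA uses.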

7. Training the LDA Model:

    LDA Model: This code uses a technique called Latent Dirichlet Allocation (LDA) to identify hidden topics in your texts. The gensim.models.LdaMulticore class trains the LDA model, using several CPU cores in parallel.
    Number of Topics: You can adjust the num_topics variable to specify how many hidden topics you want the model to find in your texts. In this code it is set to 5. Experiment with different values to see how they affect the results. With only the two short sample texts, several topics will look nearly identical; distinct topics need a larger corpus.

8. Printing the Topics:

    Understanding Topics: After training the model, the code uses the model's print_topics method to show the most relevant words for each of the identified topics, each word preceded by its weight in that topic. Passing -1 returns all topics. This helps you understand what each topic represents.
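
Each topic is printed as a single string of weight*"word" terms joined by plus signs. If you want the words and weights as data rather than text, one option is to parse that string. A minimal sketch using the standard library follows; the helper name parse_topic is hypothetical, and the sample string is shortened from Topic 0 in this guide's sample output.

```python
import re

# Matches one term such as 0.078*"æþelingas" and captures weight and word.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


topic_0 = '0.078*"æþelingas" + 0.078*"ða" + 0.077*"gardena"'
print(parse_topic(topic_0))
# [('æþelingas', 0.078), ('ða', 0.078), ('gardena', 0.077)]
```

When working with the trained model directly, gensim's show_topic method returns word and weight pairs without any string parsing.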

9. Printing Dominant Topic per Text:

    This part analyzes each of your texts and finds its most dominant topic (out of the num_topics topics the model identified). It then prints the text number, the dominant topic ID, and a score: the probability the model assigns to that topic for that text.
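
Picking the dominant topic amounts to finding the pair with the highest probability. The sketch below shows that step in isolation with max, as an alternative to sorting the whole list. The helper name dominant_topic and the example distribution are hypothetical; only the 0.93 score for topic 0 echoes the sample output.

```python
def dominant_topic(topic_distribution):
    """Return the (topic ID, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])


# Hypothetical distribution for one text: (topic ID, probability) pairs.
example = [(0, 0.93), (1, 0.02), (2, 0.02), (3, 0.02), (4, 0.01)]
topic_id, score = dominant_topic(example)
print(f"Dominant topic: {topic_id} (Score: {score:.2f})")
# Dominant topic: 0 (Score: 0.93)
```

Note that gensim's get_document_topics leaves out topics whose probability falls below its minimum_probability threshold, so the list it returns can be shorter than num_topics.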
       

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import "SOMETHING IS MISSING FROM HERE THAT HELPS TOKENIZING"
from nltk.stem import WordNetLemmatizer
import "THE PACKAGE NAME IS MISSING"
from gensim import corpora
import string

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample Old English texts (replace with your own texts; the rest of the code works unchanged)
texts = [
    "Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon.",
    "Oft Scyld Scefing sceaþena þreatum, monegum mægþum, meodosetla ofteah, egsode eorlas.",
    # Add more texts here
]

# Preprocessing function
def preprocess(text):
    # Tokenize
    tokens = word_tokenize(text."A KIND OF TRANSFORMATION IS MISSING FROM HERE"())

    # Remove punctuation and numbers
    tokens = [token for token in tokens if token not in string.punctuation and not token.isdigit()]

    # Remove stopwords (note: these are modern English stopwords)
    stop_words = set(stopwords.words('english'))
    tokens = ["SOMETHING for SOMETHING in SOMETHING" if token not in stop_words]

    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens

# Preprocess all texts
processed_texts = [preprocess(text) for text in texts]

# Create dictionary
dictionary = corpora.Dictionary(processed_texts)

# Create corpus
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Train LDA model
num_topics = 5  # You can adjust this
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics)

# Print topics
print("Top topics:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")

# Print dominant topic for each text
for i, corp in enumerate(corpus):
    top_topics = lda_model.get_document_topics(corp)
    top_topic = sorted(top_topics, key=lambda x: x[1], reverse=True)[0]
    print(f"Text {i + 1} - Dominant topic: {top_topic[0]} (Score: {top_topic[1]:.2f})")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Top topics:
Topic: 0 
Words: 0.078*"æþelingas" + 0.078*"ða" + 0.077*"gardena" + 0.077*"gefrunon" + 0.077*"geardagum" + 0.076*"hwæt" + 0.075*"fremedon" + 0.075*"hu" + 0.073*"þeodcyninga" + 0.072*"þrym"

Topic: 1 
Words: 0.045*"monegum" + 0.045*"gardena" + 0.045*"scyld" + 0.045*"ellen" + 0.045*"mægþum" + 0.045*"egsode" + 0.045*"eorlas" + 0.045*"þrym" + 0.045*"scefing" + 0.045*"hwæt"

Topic: 2 
Words: 0.072*"ellen" + 0.069*"þrym" + 0.066*"þeodcyninga" + 0.063*"hu" + 0.062*"fremedon" + 0.061*"hwæt" + 0.059*"geardagum" + 0.059*"gefrunon" + 0.058*"gardena" + 0.057*"ða"

Topic: 3 
Words: 0.078*"scefing" + 0.078*"meodosetla" + 0.078*"oft" + 0.078*"ofteah" + 0.078*"sceaþena" + 0.078*"mægþum" + 0.078*"scyld" + 0.078*"þreatum" + 0.078*"eorlas" + 0.078*"monegum"

Topic: 4 
Words: 0.046*"egsode" + 0.045*"þrym" + 0.045*"gefrunon" + 0.045*"hu" + 0.045*"geardagum" + 0.045*"ða" + 0.045*"fremedon" + 0.045*"mægþum" + 0.045*"eorlas" + 0.045*"sceaþena"

Text 1 - Dominant topic: 0 (Score: 0.93)
Text 2 - Dom

## Solution

1: from nltk.tokenize import word_tokenize

2: import gensim

3: tokens = word_tokenize(text.lower())

4: tokens = [token for token in tokens if token not in stop_words]