<a href="https://colab.research.google.com/github/GanB/language-models-paper/blob/master/statistical_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stemming and Lemmatization

## Stemming

Stemming is a process used in natural language processing and information retrieval to reduce words to their base or root form, known as the "stem." The purpose of stemming is to normalize words so that variations of the same word can be treated as identical during analysis or search. For example, the words "run," "runs," and "running" would all be reduced to the stem "run." Irregular forms such as "ran" are generally left unchanged, because suffix-stripping rules cannot relate them to "run."

Stemming algorithms typically remove common suffixes from words to extract the stem. This process involves rules or heuristics that operate on the word's structure to remove endings such as "-s," "-ed," or "-ing." The resulting stem may not always be a valid word or may be a partial word, but it serves as a common representation for related word forms.
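
The suffix-stripping idea can be sketched in a few lines. This is a toy illustration only, not the Porter algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are assumptions made for this example.

```python
# Toy suffix-stripping stemmer: illustrative only, not the Porter algorithm.
# The suffix list and the minimum stem length are assumptions for this sketch.
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for w in ["talking", "talked", "talks", "cats", "running", "ran"]:
    print(f"{w} -> {toy_stem(w)}")
```

Here "running" becomes "runn", which shows how a naive rule can yield a stem that is not a valid word. Real stemmers add further rules (for example, for doubled consonants) to handle such cases.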

Stemming can be useful in various natural language processing tasks, such as text mining, information retrieval, and search engines. By reducing words to their stems, it becomes easier to group similar words together and perform operations like searching or indexing based on the root form of a word. However, it's important to note that stemming algorithms are not always perfect and may produce incorrect stems or remove parts that change the word's meaning.

Popular stemming algorithms:

* Porter Stemming Algorithm:  It applies a set of rules to strip common English suffixes from words. It is simple and efficient but may sometimes produce stems that are not actual words.

* Snowball Stemming Algorithm: Snowball is a stemming framework created by Martin Porter, the author of the original Porter algorithm. It supports multiple languages, and its English stemmer refines the original Porter rules. Snowball allows for easier customization and the addition of new languages.

* Lancaster Stemming Algorithm: The Lancaster stemming algorithm is an aggressive stemming algorithm that applies a series of rules to remove suffixes from words. It is known for its fast execution speed but can sometimes produce very aggressive stems, leading to more drastic reductions than other algorithms.

* Lovins Stemming Algorithm: The Lovins stemming algorithm is based on a set of stemming rules developed by Julie Beth Lovins and published in 1968. It focuses on reducing words to their root forms by applying various transformations. This algorithm is often used for information retrieval tasks.

* Porter2 Stemming Algorithm (also known as the English Stemmer): This is an updated version of the Porter stemming algorithm, designed to improve the accuracy of stemming for English words. It is the algorithm behind Snowball's English stemmer, addresses some of the limitations of the original Porter algorithm, and is commonly used in search engines and information retrieval systems.



In [None]:
import nltk
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

words = ['studying', 'studied', 'running', 'ran', 'sleeping', 'slept', 'flies',
         'dies', 'meeting', 'talking', 'talks', 'talked', 'cherries',
         'generously', 'cats', 'better', 'rocks', 'wolves' ]

stemmed_words = [stemmer.stem(word) for word in words]

print("PorterStemmer:")

for word, stem in zip(words, stemmed_words):
    print(f"Word: {word} | Stem: {stem}")

PorterStemmer:
Word: studying | Stem: studi
Word: studied | Stem: studi
Word: running | Stem: run
Word: ran | Stem: ran
Word: sleeping | Stem: sleep
Word: slept | Stem: slept
Word: flies | Stem: fli
Word: dies | Stem: die
Word: meeting | Stem: meet
Word: talking | Stem: talk
Word: talks | Stem: talk
Word: talked | Stem: talk
Word: cherries | Stem: cherri
Word: generously | Stem: gener
Word: cats | Stem: cat
Word: better | Stem: better
Word: rocks | Stem: rock
Word: wolves | Stem: wolv


In [None]:
import nltk
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

words = ['studying', 'studied', 'running', 'ran', 'sleeping', 'slept', 'flies',
         'dies', 'meeting', 'talking', 'talks', 'talked', 'cherries',
         'generously', 'cats', 'better', 'rocks', 'wolves' ]

stemmed_words = [stemmer.stem(word) for word in words]

print("SnowballStemmer:")
for word, stem in zip(words, stemmed_words):
    print(f"Word: {word} | Stem: {stem}")

SnowballStemmer:
Word: studying | Stem: studi
Word: studied | Stem: studi
Word: running | Stem: run
Word: ran | Stem: ran
Word: sleeping | Stem: sleep
Word: slept | Stem: slept
Word: flies | Stem: fli
Word: dies | Stem: die
Word: meeting | Stem: meet
Word: talking | Stem: talk
Word: talks | Stem: talk
Word: talked | Stem: talk
Word: cherries | Stem: cherri
Word: generously | Stem: generous
Word: cats | Stem: cat
Word: better | Stem: better
Word: rocks | Stem: rock
Word: wolves | Stem: wolv


## Lemmatization

Lemmatization is a natural language processing technique that aims to determine the base or dictionary form of a word, called the "lemma." Unlike stemming, which truncates words to their root form, lemmatization considers the word's context and grammatical role to derive the canonical form.

The process of lemmatization involves analyzing words based on their part of speech (POS) tags, such as noun, verb, adjective, adverb, etc., and applying morphological rules to transform them to their base form. This base form is typically a valid word that can be found in a dictionary.

For example, the lemma of the word "running" would be "run," and the lemma of "better" would be "good." Lemmatization takes into account factors like tense, plurality, and inflection to ensure accurate normalization.
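
The role of the POS tag can be illustrated with a toy dictionary lookup. This is a minimal sketch, not how any particular library works: the table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical, and the entries are hand-written for this example.

```python
# Toy dictionary-based lemmatizer. The lookup table is hand-written for
# illustration; a real lemmatizer draws on a full lexicon such as WordNet.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",   # adjective reading
    ("better", "r"): "well",   # adverb reading
    ("wolves", "n"): "wolf",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better", "a"))  # looked up as an adjective
print(toy_lemmatize("better", "n"))  # no noun entry, returned unchanged
print(toy_lemmatize("ran", "v"))     # irregular verb form
```

The same surface form maps to different lemmas depending on the tag, which is why lemmatizers generally need POS information to resolve words such as "better".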

Lemmatization offers more accurate results compared to stemming because it considers the context and semantics of words. It can be useful in various natural language processing applications, such as text analysis, information retrieval, machine translation, and sentiment analysis.

However, lemmatization can be computationally more expensive than stemming due to the need for dictionary lookups and morphological analysis. It also requires the availability of linguistic resources like POS taggers and lemmatization rules specific to the language being processed.

Popular Lemmatization algorithms:

* WordNet Lemmatizer: WordNet is a lexical database that includes information about word senses and relationships. The WordNet lemmatizer maps words to their corresponding lemmas based on the WordNet database. It is commonly used in applications that require English lemmatization.

* Stanford CoreNLP Lemmatizer: Stanford CoreNLP is a popular natural language processing toolkit that provides lemmatization functionality. Its lemmatizer utilizes a combination of rule-based approaches and machine learning techniques to determine the lemma of a word. It supports multiple languages and can handle various word forms and inflections.

* spaCy Lemmatizer: spaCy is a widely used library for natural language processing in Python. It includes a lemmatizer component that applies lookup tables and rules, using the word's POS tag where available. spaCy supports multiple languages and provides accurate lemmatization results.

* TreeTagger: TreeTagger is a part-of-speech tagger and lemmatizer developed by Helmut Schmid at the Institute for Computational Linguistics of the University of Stuttgart. It is a probabilistic tagger that uses decision trees to estimate tag probabilities, and it also outputs lemmas for several languages. TreeTagger is known for its accuracy and wide language coverage.

* Morpha: Morpha is a lemmatization tool that applies morphological analysis to derive lemmas. It uses finite-state transducers to handle different word forms and inflections. Morpha is particularly useful for English lemmatization and is known for its speed and accuracy.


In [None]:
nltk.download("wordnet")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ['studying', 'studied', 'running', 'ran', 'sleeping', 'slept', 'flies',
         'dies', 'meeting', 'talking', 'talks', 'talked', 'cherries',
         'generously', 'cats', 'better', 'rocks', 'wolves' ]

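# Without a `pos` argument, WordNetLemmatizer treats every word as a noun,
# so verb forms such as "running" or "studied" are returned unchanged.
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]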


print('WordNetLemmatizer:')
for word, lemma in zip(words, lemmatized_words):
    print(f"Word: {word} | Lemma: {lemma}")

WordNetLemmatizer:
Word: studying | Lemma: studying
Word: studied | Lemma: studied
Word: running | Lemma: running
Word: ran | Lemma: ran
Word: sleeping | Lemma: sleeping
Word: slept | Lemma: slept
Word: flies | Lemma: fly
Word: dies | Lemma: dy
Word: meeting | Lemma: meeting
Word: talking | Lemma: talking
Word: talks | Lemma: talk
Word: talked | Lemma: talked
Word: cherries | Lemma: cherry
Word: generously | Lemma: generously
Word: cats | Lemma: cat
Word: better | Lemma: better
Word: rocks | Lemma: rock
Word: wolves | Lemma: wolf


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

sentence = "The quick brown fox jumps over the lazy dog."

tokens = nltk.word_tokenize(sentence)

pos_tags = nltk.pos_tag(tokens)

lemmatized_words = []
for word, pos in pos_tags:
    if pos.startswith('N'):  # Noun
        lemma = lemmatizer.lemmatize(word, pos='n')
    elif pos.startswith('V'):  # Verb
        lemma = lemmatizer.lemmatize(word, pos='v')
    elif pos.startswith('J'):  # Adjective
        lemma = lemmatizer.lemmatize(word, pos='a')
    elif pos.startswith('R'):  # Adverb
        lemma = lemmatizer.lemmatize(word, pos='r')
    else:
        lemma = lemmatizer.lemmatize(word)
    lemmatized_words.append(lemma)

for token, pos, lemma in zip(tokens, pos_tags, lemmatized_words):
    print(f"Token: {token} | POS: {pos[1]} | Lemma: {lemma}")


Token: The | POS: DT | Lemma: The
Token: quick | POS: JJ | Lemma: quick
Token: brown | POS: NN | Lemma: brown
Token: fox | POS: NN | Lemma: fox
Token: jumps | POS: VBZ | Lemma: jump
Token: over | POS: IN | Lemma: over
Token: the | POS: DT | Lemma: the
Token: lazy | POS: JJ | Lemma: lazy
Token: dog | POS: NN | Lemma: dog
Token: . | POS: . | Lemma: .


# Count Vectorizer

CountVectorizer is a feature extraction technique commonly used in natural language processing (NLP) to convert a collection of text documents into a matrix of token counts.

The CountVectorizer performs the following steps:

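* Tokenization: It breaks down the text into individual words or terms called tokens. By default it converts the text to lowercase, treats punctuation as a separator, and keeps only tokens of two or more alphanumeric characters, which is why a single-letter word such as "I" does not appear in the vocabulary. It also allows customization of tokenization rules.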

* Vocabulary Building: It constructs a vocabulary of unique tokens from the text data. Each unique token becomes a feature in the resulting matrix. The vocabulary is typically represented as a dictionary where the keys are the tokens, and the values are the indices or positions of the tokens in the matrix.

* Counting: It counts the occurrence of each token in each document. The resulting matrix, called the document-term matrix, represents the frequency of tokens in the text data. Each row corresponds to a document, and each column corresponds to a token in the vocabulary. The matrix contains the count of each token in each document.

The resulting document-term matrix can be used as input to various machine learning algorithms, such as classification or clustering algorithms. It represents the text data in a numerical form that algorithms can understand and process.
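
The three steps can be sketched without any library, which makes the structure of the document-term matrix explicit. This is a minimal sketch: the names `tokenize`, `vocabulary`, and `matrix` are illustrative, and the regular expression is an assumption chosen to mirror scikit-learn's default token pattern.

```python
import re
from collections import Counter

documents = [
    "I love cats",
    "I love dogs",
    "Cats and dogs are pets",
    "Dogs are loyal",
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (an assumption that mirrors scikit-learn's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


# Tokenization
tokenized = [tokenize(doc) for doc in documents]

# Vocabulary building: sorted unique tokens mapped to column indices
unique_tokens = sorted({token for doc in tokenized for token in doc})
vocabulary = {token: index for index, token in enumerate(unique_tokens)}

# Counting: one row per document, one column per vocabulary entry
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts.get(token, 0) for token in vocabulary])

print("Vocabulary:", list(vocabulary))
for row in matrix:
    print(row)
```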

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love cats",
    "I love dogs",
    "Cats and dogs are pets",
    "Dogs are loyal",
]

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(documents)

vocabulary = vectorizer.get_feature_names_out()
print("Vocabulary:", vocabulary)

print("Document-Term Matrix:")
print(X.toarray())


Vocabulary: ['and' 'are' 'cats' 'dogs' 'love' 'loyal' 'pets']
Document-Term Matrix:
[[0 0 1 0 1 0 0]
 [0 0 0 1 1 0 0]
 [1 1 1 1 0 0 1]
 [0 1 0 1 0 1 0]]


# TF-IDF


TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic used in natural language processing and information retrieval to determine the importance of a term (word) in a document relative to a collection of documents.

TF-IDF combines two components: term frequency (TF) and inverse document frequency (IDF).

* Term Frequency (TF): It measures the frequency of a term (word) within a document. It is calculated by dividing the number of occurrences of a term in a document by the total number of terms in that document. The intuition behind TF is that the more frequent a term is in a document, the more likely it is to be important for that document.

* Inverse Document Frequency (IDF): It measures the rarity or uniqueness of a term across all documents in a collection. IDF is calculated by taking the logarithm of the ratio between the total number of documents and the number of documents containing the term. The intuition behind IDF is that terms that occur in a smaller number of documents are more informative and carry more weight compared to terms that occur in a larger number of documents.

The TF-IDF score of a term in a document is obtained by multiplying its term frequency (TF) in the document with the inverse document frequency (IDF) of the term.
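
Written out as formulas, the definitions above take the following form. The symbols are assumptions introduced here for illustration: $f_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

% Example: a term that appears in 2 of 4 documents (natural logarithm)
\mathrm{idf}(t) = \log \frac{4}{2} = \log 2 \approx 0.693
```

Libraries often use variants of these formulas (for example, smoothing the IDF or normalizing each document vector), so scores from a given implementation may not match this textbook form exactly.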

The TF-IDF representation of a document collection can be used for various tasks, such as document retrieval, text classification, and information extraction. It allows us to represent documents in a numerical form that captures the importance of terms within the documents and across the collection.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text documents
documents = [
    "I love cats",
    "I love dogs",
    "Cats and dogs are pets",
    "Dogs are loyal",
]

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Learn the vocabulary and transform the documents into a TF-IDF matrix
X = vectorizer.fit_transform(documents)

# Get the vocabulary (tokens) and the TF-IDF matrix
vocabulary = vectorizer.get_feature_names_out()
print("Vocabulary:", vocabulary)

print("TF-IDF Matrix:")
print(X.toarray())


Vocabulary: ['and' 'are' 'cats' 'dogs' 'love' 'loyal' 'pets']
TF-IDF Matrix:
[[0.         0.         0.70710678 0.         0.70710678 0.
  0.        ]
 [0.         0.         0.         0.62922751 0.77722116 0.
  0.        ]
 [0.52338122 0.41263976 0.41263976 0.33406745 0.         0.
  0.52338122]
 [0.         0.55349232 0.         0.44809973 0.         0.70203482
  0.        ]]
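
The values in this matrix differ from the plain TF × IDF definition because scikit-learn's default settings use raw term counts, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The following sketch recomputes the matrix by hand under those settings; the helper names (`tokenize`, `tfidf_row`) are illustrative and not part of scikit-learn.

```python
import math
import re
from collections import Counter

documents = [
    "I love cats",
    "I love dogs",
    "Cats and dogs are pets",
    "Dogs are loyal",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(documents)

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_row(tokens):
    """Raw count times IDF for each vocabulary term, then L2-normalize."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf = [tfidf_row(doc) for doc in tokenized]
for row in tfidf:
    print([round(value, 8) for value in row])
```

In the second document, "love" (found in two documents) receives a higher weight than "dogs" (found in three), which is the IDF effect at work.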


# One Hot Encoding

One-hot encoding is a process used to represent categorical variables as binary vectors. It is a technique commonly employed in machine learning and data preprocessing tasks.

In one-hot encoding, each category or value in a categorical variable is transformed into a binary vector representation. This binary vector has the length equal to the number of unique categories in the variable. It contains all zeros except for a single one at the index corresponding to the category.

Here's an example to illustrate one-hot encoding:

Suppose we have a categorical variable "Color" with three possible categories: "Red", "Green", and "Blue". One-hot encoding would transform this variable into three binary features: "Color_Red", "Color_Green", and "Color_Blue".

| Color | Color_Red | Color_Green | Color_Blue |
|-------|-----------|-------------|------------|
| Red   | 1         | 0           | 0          |
| Green | 0         | 1           | 0          |
| Blue  | 0         | 0           | 1          |

In this example, each row represents an instance or observation with a specific color. The one-hot encoded representation indicates the presence of a particular color by setting the corresponding binary feature to 1 and the others to 0.

One-hot encoding is useful for machine learning algorithms as it allows categorical variables to be represented in a format that can be easily understood and processed by models. It enables the inclusion of categorical information in numerical computations and avoids any ordinal relationship assumption between categories.
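
The mapping itself is simple enough to write by hand. This is a minimal sketch with illustrative names (`index_of`, `one_hot`); it assigns columns in order of first appearance, whereas scikit-learn's `OneHotEncoder` sorts the categories, so its column order can differ.

```python
categories = ["Red", "Green", "Blue"]

# Assign each category a column index in order of first appearance.
# (scikit-learn's OneHotEncoder sorts categories, so its columns may be ordered differently.)
index_of = {category: index for index, category in enumerate(categories)}


def one_hot(category):
    """Return a vector of zeros with a single 1 at the category's index."""
    vector = [0] * len(index_of)
    vector[index_of[category]] = 1
    return vector


for category in categories:
    print(category, one_hot(category))
```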

In [None]:
from sklearn.preprocessing import OneHotEncoder

categories = ['Red', 'Green', 'Blue']

encoder = OneHotEncoder()

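# fit_transform expects a 2D input (one row per sample).
# Output columns follow the sorted categories: Blue, Green, Red.
encoded_data = encoder.fit_transform([[category] for category in categories]).toarray()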

print(encoded_data)


[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love cats",
    "I love dogs",
    "Cats and dogs are pets",
    "Dogs are loyal",
]

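# binary=True records the presence (1) or absence (0) of each token instead of
# its count, so each row marks every vocabulary word found in the sentence.
vectorizer = CountVectorizer(binary=True)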

X = vectorizer.fit_transform(sentences)

vocabulary = vectorizer.get_feature_names_out()
print("Vocabulary:", vocabulary)

print("One-Hot Encoded Matrix:")
print(X.toarray())


Vocabulary: ['and' 'are' 'cats' 'dogs' 'love' 'loyal' 'pets']
One-Hot Encoded Matrix:
[[0 0 1 0 1 0 0]
 [0 0 0 1 1 0 0]
 [1 1 1 1 0 0 1]
 [0 1 0 1 0 1 0]]


# Information Gain



Information gain is a concept used in the field of machine learning and decision trees to measure the effectiveness of a particular attribute in classifying or predicting a target variable. It quantifies the amount of information provided by an attribute in reducing the uncertainty about the target variable.

In the context of decision trees, information gain is typically used to determine the best attribute to split the data at each node of the tree. The attribute with the highest information gain is chosen as the splitting criterion because it provides the most useful information for making predictions.

To calculate information gain, the concept of entropy is used. Entropy is a measure of impurity or disorder in a set of examples. A set with low entropy means that the examples are predominantly of the same class, while a high entropy indicates a more evenly distributed set.

The information gain of an attribute is calculated by taking the entropy of the original set and subtracting the weighted average of the entropies of the subsets created by splitting the data on that attribute. The attribute with the highest information gain is chosen as the best split.
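
In symbols, with notation assumed here for illustration ($S$ is the set of examples, $p_c$ is the proportion of examples in class $c$, and $S_v$ is the subset of $S$ where attribute $A$ takes value $v$):

```latex
H(S) = -\sum_{c} p_c \log_2 p_c

IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)

% Example: a set with 2 examples of one class and 3 of another
H(S) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971
```

If an attribute splits such a set into subsets that each contain a single class, every $H(S_v)$ is zero and the information gain equals the full entropy of about 0.971.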

In summary, information gain quantifies the reduction in entropy achieved by splitting the data on a particular attribute and is used to select the most informative attribute for decision tree construction or feature selection in machine learning tasks.

In [1]:
import math

def calculate_entropy(data):
    # Count the occurrences of each label in the dataset
    label_counts = {}
    for row in data:
        label = row[-1]
        if label not in label_counts:
            label_counts[label] = 0
        label_counts[label] += 1

    # Calculate the entropy
    entropy = 0.0
    num_examples = len(data)
    for label in label_counts:
        probability = label_counts[label] / num_examples
        entropy -= probability * math.log2(probability)

    return entropy

def split_data(data, attribute_index, attribute_value):
    # Split the data based on the given attribute and its value
    subsets = []
    for row in data:
        if row[attribute_index] == attribute_value:
            subset = row[:attribute_index] + row[attribute_index+1:]
            subsets.append(subset)

    return subsets

def calculate_information_gain(data, attribute_index):
    # Calculate the information gain for the given attribute

    # Calculate the entropy of the original dataset
    entropy_original = calculate_entropy(data)

    # Get the unique values of the attribute
    attribute_values = set(row[attribute_index] for row in data)

    # Calculate the weighted average entropy of the subsets
    weighted_entropy = 0.0
    num_examples = len(data)
    for value in attribute_values:
        subsets = split_data(data, attribute_index, value)
        subset_entropy = calculate_entropy(subsets)
        subset_weight = len(subsets) / num_examples
        weighted_entropy += subset_weight * subset_entropy

    # Calculate the information gain
    information_gain = entropy_original - weighted_entropy

    return information_gain

# Example usage
dataset = [
    ['Short', 'No', 'No', 'Not spam'],
    ['Long', 'Yes', 'No', 'Spam'],
    ['Long', 'No', 'Yes', 'Spam'],
    ['Short', 'No', 'No', 'Not spam'],
    ['Medium', 'No', 'No', 'Not spam']
]

# Calculate the information gain for each attribute
attribute_indices = [0, 1, 2]
for attribute_index in attribute_indices:
    information_gain = calculate_information_gain(dataset, attribute_index)
    print(f"Information Gain for Attribute {attribute_index}: {information_gain}")


Information Gain for Attribute 0: 0.9709505944546686
Information Gain for Attribute 1: 0.3219280948873623
Information Gain for Attribute 2: 0.3219280948873623


In [None]:
!pip install lupyne[graphql,rest]


# Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a generative statistical model used for topic modeling. It is a popular algorithm for analyzing collections of documents and identifying the underlying topics that occur within them. LDA assumes that each document is a mixture of various topics, and each topic is characterized by a distribution of words.

In LDA, a document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words. The model assumes that the documents are generated in the following way:

1. The number of words in the document is determined.
2. For each word in the document:
   * Randomly choose a topic from the document's topic distribution.
   * Randomly choose a word from the chosen topic's word distribution.

The process of generating the documents is reversed during the inference phase, where given a collection of documents, LDA aims to discover the underlying topic distributions and word distributions that are most likely to have generated the observed data.

LDA uses a Dirichlet prior to model the topic distributions and word distributions. The Dirichlet distribution is a continuous probability distribution over the simplex (a multi-dimensional generalization of a triangle) that is commonly used to model distributions of proportions. It allows the topic and word distributions to have a smooth distribution of probabilities over the possible values.
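
The generative story can be simulated directly with the standard library. This is a sketch under stated assumptions: the vocabulary, the two topic-word distributions, and the `alpha` values are made up for illustration, and the topic-word distributions are fixed by hand here, whereas in full LDA they are also drawn from a Dirichlet prior. The helper names `sample_dirichlet` and `generate_document` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_dists, vocabulary, alpha, num_words, rng):
    # Draw this document's topic mixture from the Dirichlet prior
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(num_words):
        # Choose a topic from the document's topic distribution
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose a word from the chosen topic's word distribution
        word = rng.choices(vocabulary, weights=topic_word_dists[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)
vocabulary = ["sky", "blue", "sun", "bright"]
# Hypothetical topic-word distributions, one row per topic
topic_word_dists = [
    [0.45, 0.45, 0.05, 0.05],  # a "sky" topic
    [0.05, 0.05, 0.45, 0.45],  # a "sun" topic
]

theta, words = generate_document(
    topic_word_dists, vocabulary, alpha=[0.5, 0.5], num_words=8, rng=rng
)
print("Topic mixture:", [round(p, 3) for p in theta])
print("Generated words:", words)
```

Inference runs this process in reverse: given only the generated words, it estimates the topic mixtures and topic-word distributions that plausibly produced them.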

By applying LDA to a collection of documents, it becomes possible to extract the underlying topics and understand the themes that exist within the corpus. LDA has applications in various fields, including information retrieval, natural language processing, text mining, and recommendation systems.

In [2]:
from gensim import corpora
from gensim.models import LdaModel

# Sample documents
documents = [
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun"
]

# Tokenize the documents
tokenized_docs = [doc.lower().split() for doc in documents]

# Create a dictionary from the tokenized documents
dictionary = corpora.Dictionary(tokenized_docs)

# Convert tokenized documents to bag of words representation
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Perform LDA
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Print the topics and their corresponding words
for topic_id, topic_words in lda_model.show_topics(num_topics=-1, num_words=5):
    print(f"Topic {topic_id}: {topic_words}")

# Get the topic distribution for a specific document
document_index = 2
document_topics = lda_model.get_document_topics(corpus[document_index])
print(f"Topic distribution for document {document_index}: {document_topics}")


Topic 0: 0.214*"the" + 0.167*"is" + 0.119*"sky" + 0.118*"bright" + 0.118*"sun"
Topic 1: 0.168*"the" + 0.101*"sun" + 0.101*"bright" + 0.099*"sun," + 0.099*"we"
Topic distribution for document 2: [(0, 0.92483795), (1, 0.075162075)]


# Lucene

At its core, Lucene is a full-text search library that allows developers to create, index, and search large volumes of text data. It provides an inverted index structure, which enables fast search operations by mapping terms to the documents that contain them. This allows for efficient retrieval of relevant documents based on user queries.
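
The inverted index idea can be shown with a small in-memory sketch. This is not Lucene's implementation or API: the names `inverted_index` and `search` are illustrative, tokenization is a plain whitespace split, and queries require every term to match (boolean AND), a simplification chosen for this example.

```python
from collections import defaultdict

documents = {
    0: "This is the first document",
    1: "This is the second document",
}

# Build the inverted index: term -> set of IDs of documents containing it
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)


def search(query):
    """Return IDs of documents containing every query term (boolean AND)."""
    postings = [inverted_index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


print(search("first document"))
print(search("document"))
```

Because each term points straight to its documents, a query only touches the postings of its own terms rather than scanning every document.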

Some key features of Lucene include:

* Indexing: Lucene supports the indexing of documents, which involves tokenizing, analyzing, and storing the textual content of documents. It provides a flexible API for indexing various types of data, such as plain text, HTML, XML, and more.

* Searching: Lucene allows for both simple and complex search queries. It supports various query types, including term queries, phrase queries, wildcard queries, fuzzy queries, and more. Lucene provides efficient algorithms for scoring and ranking search results based on relevance.

* Ranking: Lucene employs scoring models to determine the relevance of documents to a given query. It uses factors such as term frequency, inverse document frequency, and vector space models to assign scores to documents and rank them accordingly.

* Text Analysis: Lucene includes a range of text analysis capabilities, such as tokenization, stemming, stop-word removal, and synonym expansion. These features enhance search accuracy and help handle different linguistic variations.

* Highlighting: Lucene offers the ability to highlight search terms within the retrieved documents, making it easier for users to identify the context in which the terms appear.

* Faceted Search: Lucene supports faceted search, which allows users to explore data along multiple dimensions or facets. Facets can be defined on different document attributes, such as categories, tags, or any other metadata.

In [3]:
import lucene

from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

# Initialize Lucene
lucene.initVM()

# Specify the path to the index directory
index_dir = "/path/to/index/directory"

# Create the index writer
directory = SimpleFSDirectory(Paths.get(index_dir))
analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
writer = IndexWriter(directory, config)

# Create documents and add them to the index
doc1 = Document()
doc1.add(Field("content", "This is the first document", TextField.TYPE_STORED))
writer.addDocument(doc1)

doc2 = Document()
doc2.add(Field("content", "This is the second document", TextField.TYPE_STORED))
writer.addDocument(doc2)

# Commit changes and close the index writer
writer.commit()
writer.close()

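# Create the index searcher. The writer is already closed, so the reader is
# opened directly from the index directory.
from org.apache.lucene.index import DirectoryReader
reader = DirectoryReader.open(directory)
searcher = IndexSearcher(reader)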

# Perform a search
query_text = "first document"
query_parser = QueryParser("content", analyzer)
query = query_parser.parse(query_text)
hits = searcher.search(query, 10)

# Print search results
print("Search results:")
for hit in hits.scoreDocs:
    doc_id = hit.doc
    score = hit.score
    doc = searcher.doc(doc_id)
    content = doc.get("content")
    print(f"Document ID: {doc_id}, Score: {score}, Content: {content}")

# Close the index searcher and reader
searcher.getIndexReader().close()
directory.close()


ModuleNotFoundError: ignored