# Lemmatization and full text search (FTS)

The task focuses on using a full text search engine (ElasticSearch) to perform basic search
operations in a text corpus.

## Tasks

Task objective (8 points):
1. Install ElasticSearch (ES).
2. Install an ES plugin for Polish: https://github.com/allegro/elasticsearch-analysis-morfologik

Steps to accomplish tasks 1 and 2 in the terminal:
* Navigate to the elastic directory (from the nlp repository on GitHub): `cd elastic`
* Start the services with Docker Compose: `docker compose up -d`
* To install the plugin, run: `docker exec -it elastic-search-1 ./bin/elasticsearch-plugin install https://github.com/allegro/elasticsearch-analysis-morfologik/releases/download/v8.15.2/elasticsearch-analysis-morfologik-8.15.2.zip`
* Restart the Docker Compose services: `docker compose restart`

In [73]:
!pip install elasticsearch



In [74]:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200, 'scheme': 'http'}])

if es.ping():
    print("ElasticSearch works correct!")
else:
    print("No connection with ElasticSearch.")


ElasticSearch works correct!


In [75]:
# Pass the cat API options as keyword arguments instead of the deprecated `params` dict.
print(es.cat.plugins(
    v=True,
    h='name,component,version,description'
))

name   component           version description
node-1 analysis-morfologik 8.15.2  Morfologik Polish Lemmatizer plugin for Elasticsearch



Imports

In [76]:
import time
import datasets
import numpy as np
import elasticsearch
import pandas as pd
from datasets import load_dataset

3. Define an ES analyzer for Polish texts containing:
   1. standard tokenizer
   2. synonym filter with alternative forms for months, e.g. `kwiecień`, `kwi`, `IV`.
   3. lowercase filter
   4. Morfologik-based lemmatizer
   5. lowercase filter (looks strange, but Morfologik produces capitalized base forms for proper names, so we have to lowercase them once more).

Polish analyzer with synonyms

In [77]:
synonyms_settings = {
    "analysis": {
        "analyzer": {
            "synonyms_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "polish_synonyms",
                    "morfologik_stem",
                ]
            }
        },
        "filter": {
            "polish_synonyms": {
                "type": "synonym",
                "synonyms": [
                    "styczeń, sty, I",
                    "luty, lut, II",
                    "marzec, mar, III",
                    "kwiecień, kwi, IV",
                    "maj, V",
                    "czerwiec, cze, VI",
                    "lipiec, lip, VII",
                    "sierpień, sie, VIII",
                    "wrzesień, wrz, IX",
                    "październik, paź, X",
                    "listopad, lis, XI",
                    "grudzień, gru, XII"
                ]
            }
        }
    }
}


In [78]:
synonyms_mappings = {
    "properties": {
        "text": {
            "type": "text",
            "analyzer": "synonyms_analyzer"
        }
    }
}

4. Define another analyzer for Polish, without the synonym filter.

Polish analyzer without synonyms

In [79]:
no_synonyms_settings = {
    "analysis": {
        "analyzer": {
            "no_synonyms_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "morfologik_stem",
                ]
            }
        }
    }
}

In [80]:
no_synonyms_mappings = {
    "properties": {
        "text": {
            "type": "text",
            "analyzer": "no_synonyms_analyzer"
        }
    }
}

5. Define an ES index for storing the contents of the corpus [FiQA-PL](https://huggingface.co/datasets/clarin-knext/fiqa-pl) using both analyzers.
   Use different names for the fields analyzed with a different pipeline.

In [81]:
# es.indices.delete(index="synonyms_index")
es.indices.create(index="synonyms_index", settings=synonyms_settings, mappings=synonyms_mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'synonyms_index'})

In [82]:
# es.indices.delete(index="no_synonyms_index")
es.indices.create(index="no_synonyms_index", settings=no_synonyms_settings, mappings=no_synonyms_mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'no_synonyms_index'})

Both indices were created successfully.
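
Before loading any data, it can be useful to check what tokens each analyzer actually produces. The following is a minimal sketch, assuming the `es` client and the index and analyzer names defined above; `analyzed_tokens` is a hypothetical helper introduced only for illustration. It relies on the client's `indices.analyze` call, which wraps the ES `_analyze` endpoint.

```python
def analyzed_tokens(es, index, analyzer, text):
    """Return the tokens that an analyzer defined on `index` produces for `text`."""
    # Passing `index` lets ES resolve custom analyzers from that index's settings.
    response = es.indices.analyze(index=index, analyzer=analyzer, text=text)
    return [token["token"] for token in response["tokens"]]


# Example usage (requires a running ES instance with the indices created above):
# print(analyzed_tokens(es, "synonyms_index", "synonyms_analyzer", "kwietnia"))
# print(analyzed_tokens(es, "no_synonyms_index", "no_synonyms_analyzer", "kwietnia"))
```

Comparing the two outputs for the same input should show whether the synonym filter adds the alternative month forms on top of the lemmatized token.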

6. Load the data to the ES index.

In [83]:
dataset = load_dataset("clarin-knext/fiqa-pl", "corpus")

In [84]:
df = pd.DataFrame(dataset['corpus'])

In [85]:
df_text = df['text']

In [86]:
df

| | _id | title | text |
|---|---|---|---|
| 0 | 3 | | Nie mówię, że nie podoba mi się też pomysł szk... |
| 1 | 31 | | Tak więc nic nie zapobiega fałszywym ocenom po... |
| 2 | 56 | | Nigdy nie możesz korzystać z FSA dla indywidua... |
| 3 | 59 | | Samsung stworzył LCD i inne technologie płaski... |
| 4 | 63 | | Oto wymagania SEC: Federalne przepisy dotycząc... |
| ... | ... | ... | ... |
| 57633 | 599946 | | >Cóż, po pierwsze, drogi to coś więcej niż hob... |
| 57634 | 599953 | | Tak, robią. Na dotacje dla firm farmaceutyczny... |
| 57635 | 599966 | | >To bardzo smutne, że nie rozumiesz ludzkiej n... |
| 57636 | 599975 | | „Czy Twój CTO pozwolił dużej grupie użyć „„adm... |



Loading data to the indexes

In [88]:
for idx, text in enumerate(df_text):
    document = {
        "text": text,
    }
    es.index(index="synonyms_index", id=idx, document=document)
    es.index(index="no_synonyms_index", id=idx, document=document)
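
The loop above sends one HTTP request per document and index. As a hedged alternative, the sketch below batches the same documents with the `bulk` helper from `elasticsearch.helpers`. The function names `generate_actions` and `bulk_load` are hypothetical, and the document IDs follow the same `enumerate` convention as the loop above.

```python
def generate_actions(texts, index_names):
    """Yield one bulk action per (document, index) pair."""
    for doc_id, text in enumerate(texts):
        for index_name in index_names:
            yield {"_index": index_name, "_id": doc_id, "_source": {"text": text}}


def bulk_load(es, texts, index_names):
    """Index all texts with the bulk helper; returns (success_count, errors)."""
    # Imported lazily so the generator above can be used without the client installed.
    from elasticsearch.helpers import bulk
    return bulk(es, generate_actions(texts, index_names))


# Example usage (requires a running ES instance):
# bulk_load(es, df_text, ["synonyms_index", "no_synonyms_index"])
```

The resulting index contents should be the same as with the per-document loop; only the number of round trips to ES changes.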

In [89]:
es.count(index="synonyms_index")['count']

57638

In [90]:
es.count(index="no_synonyms_index")['count']

57638

7. Determine the number of documents and the number of matches containing the word `kwiecień` (in any form) including and excluding the synonyms.

Including synonyms

In [91]:
query_synonyms = {
    "match": {
        "text":{
            "query":"kwiecień",
            "analyzer":"synonyms_analyzer"
        }
    }
}

In [92]:
result = es.search(index="synonyms_index", query=query_synonyms)
print("Got", result['hits']['total']['value'],  "documents for" , "synonyms_index")

Got 306 documents for synonyms_index


Excluding synonyms

In [93]:
query_no_synonyms = {
    "match": {
        "text":{
            "query":"kwiecień",
            "analyzer":"no_synonyms_analyzer"
        }
    }
}

In [94]:
result = es.search(index="no_synonyms_index", query=query_no_synonyms)
print("Got", result['hits']['total']['value'],  "documents for" , "no_synonyms_index")

Got 257 documents for no_synonyms_index


8. Download the QA pairs for the [FiQA-PL dataset](https://huggingface.co/datasets/clarin-knext/fiqa-pl-qrels).

In [95]:
dataset_QA = load_dataset("clarin-knext/fiqa-pl-qrels")
dataset_QA_test = dataset_QA['test']
df_qa_test = pd.DataFrame(dataset_QA['test'])

In [96]:
dataset_queries = datasets.load_dataset("clarin-knext/fiqa-pl", "queries")
df_queries = pd.DataFrame(dataset_queries['queries'])

In [97]:
df_qa_test['query-id'].unique()

array([    8,    15,    18,    26,    34,    42,    56,    68,    89,
          90,    94,    98,   104,   106,   109,   475,   503,   504,
         515,   529,   547,   549,   559,   570,   585,   588,   594,
         603,   604,   620,   622,   659,   672,   684,   687,   689,
         691,   699,   701,   715,   721,   744,   750,   753,   766,
         776,   810,   813,   849,   852,   853,   858,   859,   864,
         879,   885,   895,   904,   928,   929,   932,   939,   945,
         957,   988,  1074,  1085,  1090,  1150,  1157,  1159,  1198,
        1230,  1281,  1284,  1297,  1306,  1309,  1310,  1321,  1322,
        1391,  1393,  1415,  1416,  1441,  1451,  1469,  1530,  1670,
        1676,  1736,  1748,  1783,  1812,  1815,  1819,  1824,  1826,
        1832,  1871,  1877,  1889,  1915,  1920,  1933,  1948,  1994,
        2010,  2051,  2070,  2075,  2076,  2088,  2108,  2118,  2154,
        2181,  2183,  2204,  2264,  2296,  2306,  2316,  2318,  2330,
        2334,  2348,

In [98]:
df_queries

| | _id | title | text |
|---|---|---|---|
| 0 | 0 | | Co jest uważane za wydatek służbowy w podróży ... |
| 1 | 4 | | Wydatki służbowe - ubezpieczenie samochodu pod... |
| 2 | 5 | | Rozpoczęcie nowego biznesu online |
| 3 | 6 | | „Dzień roboczy” i „termin płatności” rachunków |
| 4 | 7 | | Nowy właściciel firmy – Jak działają podatki d... |
| ... | ... | ... | ... |
| 6643 | 4102 | | Jak mogę ustalić, czy moja stopa zwrotu jest „... |
| 6644 | 3566 | | Gdzie mogę kupić akcje, jeśli chcę zainwestowa... |
| 6645 | 94 | | Wykorzystywanie punktów kart kredytowych do op... |
| 6646 | 2551 | | Jak znaleźć tańszą alternatywę dla tradycyjnej... |


9. Compute NDCG@5 for the QA dataset (the test subset) for the following setups:
   * synonyms enabled and disabled,
   * lemmatization in the query enabled and disabled.

In [99]:
K = 5

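def calc_ndcg_k(scores):
    # Reject score vectors whose length differs from K.
    if len(scores) != K:
        raise ValueError(f"Expected {K} scores, got {len(scores)}")
    # Logarithmic position discounts for ranks 1..K.
    discounts = np.log2(np.arange(2, len(scores) + 2))
    dcg = np.sum(scores / discounts)
    # The ideal DCG used here is the DCG of the same scores sorted in descending order.
    idcg = np.sum(sorted(scores, reverse=True) / discounts)
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return ndcg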

df_queries_text = df_queries['text']
arr = np.array([0.0 for i in range(K)])
tmp = set()
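
Note that `calc_ndcg_k` normalizes by the best ordering of the documents that were actually retrieved, so relevant documents that are missing from the top K do not lower the score. A more common definition builds the ideal ranking from all relevant documents in the qrels. The sketch below illustrates that variant under two assumptions: binary relevance (a document is either listed in the qrels for the query or it is not), and hypothetical function names `dcg` and `ndcg_at_k`. It is not the function used to produce the numbers reported in this notebook, and its scores would generally differ.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of gains (rank 1 first)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """NDCG@k with binary gains and an ideal ranking built from the ground truth."""
    relevant = set(relevant_ids)
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in retrieved_ids[:k]]
    # The ideal ranking puts as many relevant documents as possible at the top.
    ideal_dcg = dcg([1.0] * min(k, len(relevant)))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant hit at rank 1, but three relevant documents exist for the query:
print(ndcg_at_k([1, 8, 9, 10, 11], [1, 2, 3]))
```

With this definition the example query scores below 1.0, because two of the three relevant documents were not retrieved.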

In [100]:
es_synonyms = {
        "match": {
            "text":{
                "query":"To fill up",
                "analyzer":"synonyms_analyzer"
            }
        }
    }

es_no_lemmatizaion = {
        "match": {
            "text":{
                "query":"To fill up",
                "analyzer":"standard"
            }
        }
    }

es_no_synonyms = {
        "match": {
            "text":{
                "query":"To fill up",
                "analyzer":"no_synonyms_analyzer"
            }
        }
    }

Synonyms Enabled

In [101]:
ndcg = 0
iterator = 0

for query_id in df_qa_test['query-id'].unique():
    query = df_queries[df_queries['_id'] == str(query_id)].iloc[0]['text']
    es_synonyms['match']['text']['query'] = query
    resp = es.search(index="synonyms_index", query=es_synonyms)
    corpus_ids = df_qa_test[df_qa_test['query-id'] == query_id]['corpus-id']
    
    tmp = set()
    for idx in corpus_ids:
        _id = df[df['_id'] == str(idx)].index.tolist()[0]
        tmp.add(_id)
        
    for idx, val in enumerate(resp['hits']['hits'][:K]):
        _id = np.float64(val['_id'])
        if _id in tmp:
            arr[idx] = 3
        else:
            arr[idx] = 0
    ndcg += calc_ndcg_k(arr)
    iterator += 1

In [102]:
mean_ndcg = ndcg / iterator
print("NDCG for synonyms is:", mean_ndcg)

NDCG for synonyms is: 0.2669137178379146


Synonyms Disabled

In [103]:
ndcg = 0
iterator = 0

for query_id in df_qa_test['query-id'].unique():
    query = df_queries[df_queries['_id'] == str(query_id)].iloc[0]['text']
    es_no_synonyms['match']['text']['query'] = query
    resp = es.search(index="no_synonyms_index", query=es_no_synonyms)
    corpus_ids = df_qa_test[df_qa_test['query-id'] == query_id]['corpus-id']
    
    tmp = set()
    for idx in corpus_ids:
        _id = df[df['_id'] == str(idx)].index.tolist()[0]
        tmp.add(_id)
        
    for idx, val in enumerate(resp['hits']['hits'][:K]):
        _id = np.float64(val['_id'])
        if _id in tmp:
            arr[idx] = 3
        else:
            arr[idx] = 0
    ndcg += calc_ndcg_k(arr)
    iterator += 1

In [104]:
mean_ndcg = ndcg / iterator
print("NDCG for no synonyms is:", mean_ndcg)

NDCG for no synonyms is: 0.2657322972429152


No Lemmatization

In [105]:
no_lemmatization_mappings = {
    "properties": {
        "text": {
            "type": "text",
            "analyzer": "standard"
        }
    }
}

In [106]:
# es.indices.delete(index="no_lemmatization_index")
es.indices.create(index="no_lemmatization_index", mappings=no_lemmatization_mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'no_lemmatization_index'})

In [107]:
for idx, text in enumerate(df_text):
    document = {
        "text": text,
    }
    es.index(index="no_lemmatization_index", id=idx, document=document)

In [108]:
es.count(index="no_lemmatization_index")['count']

57638

In [109]:
ndcg = 0
iterator = 0

for query_id in df_qa_test['query-id'].unique():
    query = df_queries[df_queries['_id'] == str(query_id)].iloc[0]['text']
    es_no_lemmatizaion['match']['text']['query'] = query
    resp = es.search(index="no_lemmatization_index", query=es_no_lemmatizaion)
    corpus_ids = df_qa_test[df_qa_test['query-id'] == query_id]['corpus-id']
    
    tmp = set()
    for idx in corpus_ids:
        _id = df[df['_id'] == str(idx)].index.tolist()[0]
        tmp.add(_id)
        
    for idx, val in enumerate(resp['hits']['hits'][:K]):
        _id = np.float64(val['_id'])
        if _id in tmp:
            arr[idx] = 3
        else:
            arr[idx] = 0
    ndcg += calc_ndcg_k(arr)
    iterator += 1

In [110]:
mean_ndcg = ndcg / iterator
print("NDCG for no lemmatization is:", mean_ndcg)

NDCG for no lemmatization is: 0.207829023930387


The NDCG@5 results show that enabling synonyms improves search relevance only marginally: the score rises from 0.2657 to 0.2669.
The much lower score without lemmatization (0.2078) indicates that lemmatization has a significant impact on relevance. Both synonyms and lemmatization improve search effectiveness, with lemmatization having by far the stronger influence.

Answer the following questions (2 points):
1. What are the strengths and weaknesses of regular expressions versus full text search regarding processing of text?
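Regular expressions are good for precise pattern matching, such as identifying specific structures in text, but complex patterns are hard to write and maintain, matching has to scan the whole text, and there is no relevance ranking. Full text search scales better thanks to its index, ranks results by relevance, and supports broader, fuzzier searches (e.g. matching inflected forms), but it cannot express detailed character-level patterns the way regular expressions can.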

2. Can an LLM be applied in the context of searching for documents? Justify your answer, excluding the obvious observation that an LLM can be used to formulate the answer.
Yes. LLMs can improve search by capturing the meaning behind a query, so relevant documents are retrieved even when the keywords do not match exactly. They are especially helpful for complex queries and for summarizing information, but they require a lot of computing power and may need fine-tuning for specific domains. Overall, LLMs can make search results more context-aware.

## Hints

1. Full text search engines were developed for storing and searching textual data.
1. The most popular FTSes are SOLR and ElasticSearch (ES).
1. Some relational databases support full text search, but usually it is limited and not easy to adapt.
1. Both for SOLR and ES there are plugins supporting Polish.
1. FTSes use an *inverted index* to store the data. At loading time the text is split by a *tokenizer* into
   *tokens* and individual tokens go through *filters*. The resulting tokens are placed as keys in a hash-like
   structure. The values are so-called *posting lists*, containing pointers to the documents where the tokens come from.
1. The minimal FTS configuration requires two elements: a tokenizer and a set of filters (the set might be empty in the extreme
   case). **Changing the configuration of an index does not result in the new definitions being applied to the already
   stored documents.** In such cases the index has to be *rebuilt*, meaning that the documents have to be loaded once
   again.
1. FTSes contain a large number of tokenizers, e.g. they may know semantics of HTML documents and treat HTML tags as
   tokens. Some popular tokenizers include:
   1. *standard tokenizer* - based on the Unicode tokenization rules,
   1. *whitespace tokenizer* - which splits the tokens by white spaces,
   1. *url tokenizer* - which keeps the URLs as indivisible tokens.
1. Some tokens such as commas and full stops might be removed at the stage of filtering. Filtering of common tokens reduces the index size.
1. Some popular filters include:
   1. *lowercase filter* - which downcases the letters,
   1. *ASCII folding filter* - which removes Polish diacritics,
   1. *stop token filter* - which removes the specified tokens (described above),
   1. *lemmatizers* - which find the base form of a word,
   1. etc. (present implementation of ES has more than 50 filters)
1. **Lemmatization** is the process of replacing an inflected form of a word with its base form, e.g.
   the form *psu* is replaced with *pies*. Note that many forms are ambiguous, e.g.
   *goli* can have the following base forms: *golić*, *gol* and *goły*. To overcome the ambiguity, FTSes
   take a very pragmatic approach: for a given inflected form, all possible base forms are put in the index.
   Even though this is not valid from a linguistic point of view, it works well in practice.
1. The [Term vector API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html) lets you retrieve useful
   statistics of a given term in a particular document or in the whole document collection.
1. A comparison of Polish retrieval models is available at: https://huggingface.co/spaces/sdadas/pirb
1. In the `elastic` directory there's a basic configuration for running ElasticSearch with `docker compose`:
   1. You can use the `docker-compose.yml` configuration - it will start ES with the morfologik plugin installed.
   2. You can also modify the `Dockerfile` configuration and run it locally.
   3. In the `query.sh` file there's a check for ES showing if it is possible to connect to the instance using `curl`. 
   4. The correct output from curl is `[]` meaning there aren't any indices defined.
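
The indexing pipeline described in the hints (tokenizer, filters, posting lists, and indexing every possible base form of an ambiguous word) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `BASE_FORMS` is a hypothetical two-entry dictionary standing in for a real lemmatizer such as Morfologik, the regex tokenizer only approximates the standard tokenizer, and real engines store far richer posting lists.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary: inflected form -> all possible base forms.
BASE_FORMS = {
    "psu": ["pies"],
    "goli": ["golić", "gol", "goły"],
}


def analyze(text):
    """Tokenize, lowercase, and replace each token with all of its base forms."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    terms = []
    for token in tokens:
        # Unknown words are kept unchanged; ambiguous words yield several terms.
        terms.extend(BASE_FORMS.get(token, [token]))
    return terms


def build_inverted_index(documents):
    """Map each term to the set of document IDs it occurs in (its posting list)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents matching any analyzed query term (OR semantics)."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


documents = {1: "Daj psu kość", 2: "Goli chłopcy", 3: "Pies szczeka"}
index = build_inverted_index(documents)
print(search(index, "pies"))  # [1, 3]
print(search(index, "goły"))  # [2]
```

Because the same `analyze` function is applied to both documents and queries, the query *pies* also finds the document containing the inflected form *psu*, and the ambiguous form *goli* can be found through any of its base forms.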