# Similarity Modulation

**Student Information**
AmirMohammad Babaei
9831011

Here we are going to implement another similarity other that the BM25 which is the default in Elastic. We want you to implement a tf-idf similarity and test it with same queries in phase2 so that you can get a sense of how well your Elastic tf-idf works. Follow the instructions and fill where ever it says # TODO.  <br>
You can contact me in case of any problems via Telegram: @mahvash_sp

In [3]:
!pip install parsivar

Processing /home/amir01/.cache/pip/wheels/54/0a/38/7d0b1aabbd644340a94fb8685fd20d9f35814d735973d07f40/parsivar-0.2.3-py3-none-any.whl
Processing /home/amir01/.cache/pip/wheels/23/18/48/8fd6ec11da38406b309470566d6f099c04805d2ec61d7829e7/nltk-3.4.5-py3-none-any.whl
[31mERROR: hazm 0.7.0 has requirement nltk==3.3, but you'll have nltk 3.4.5 which is incompatible.[0m
Installing collected packages: nltk, parsivar
  Attempting uninstall: nltk
    Found existing installation: nltk 3.3
    Uninstalling nltk-3.3:
      Successfully uninstalled nltk-3.3
Successfully installed nltk-3.4.5 parsivar-0.2.3


In [41]:
from elasticsearch import Elasticsearch, helpers
from parsivar import Normalizer, Tokenizer, FindStems
from string import punctuation
import re
import json
import pickle
import warnings

In [22]:
# import data in json format
file_name = 'IR_data_news_12k.json'
def load_docs(file_name):
    with open(file_name) as f:
        data = json.load(f)
    return data
    
data['0']['title']

'اعلام زمان قرعه کشی جام باشگاه های فوتسال آسیا'

In [14]:
stopwords_remove = True
stemming = True

In [9]:
def load_stopwords(stopwords_path):
    stopwords_set = None
    with open(stopwords_path, 'r') as f:
        stopwords_set = set(f.read().split())
    return stopwords_set

In [12]:
def remove_stopwords(words, persian_stopwords_path='./persian-stopwords.txt'):
    persian_stopwords = load_stopwords(persian_stopwords_path)
    return [word for word in words if word not in persian_stopwords]

In [19]:
def preprocess(text, stopwords_remove=True, stemming=True):
    normalizer = Normalizer()
    tokenizer = Tokenizer()
    stemmer = FindStems()
    
    pure_text = re.sub(f'[{punctuation}؟،٪×÷»«]+', '', text)
    normal_text = normalizer.normalize(pure_text)
    res = tokenizer.tokenize_words(normal_text)
    if stemming:
        res = list(map(stemmer.convert_to_stem, res))
    if stopwords_remove:
        res = remove_stopwords(res)
    
    return res

print(preprocess('سلام من امروز می خواهم این تابع را ۳ بار امتحان کنم.', stopwords_remove=stopwords_remove, stemming=stemming))

['سلام', 'امروز', 'خواست&خواه', 'تابع', '3', 'امتحان', 'کرد&کن']


In [24]:
def preprocess_contents(docs_dict, stopwords_remove=True, stemming=True):
    
    for docID, body in docs_dict.items():
        body['content'] = preprocess(body['content'], stopwords_remove, stemming)
        
    return docs_dict

preprocessed_docs = preprocess_contents(load_docs(file_name), stopwords_remove=stopwords_remove, stemming=stemming)

In [None]:
# Filter warnings
warnings.filterwarnings('ignore')

In [25]:
# data keys
preprocessed_docs['0'].keys()

dict_keys(['title', 'content', 'tags', 'date', 'url', 'category'])

In [43]:
def save_index(index, filename):
    with open(filename, 'wb') as outp:  # Overwrites any existing file.
        pickle.dump(index, outp, pickle.HIGHEST_PROTOCOL)
        print(f'docs saved in {filename}')
        
save_index(preprocessed_docs, 'preprocessed_IR_docs.pkl')

docs saved in preprocessed_IR_docs.pkl


After starting your Elasticsearch on your pc (localhost:9200 is the default) we have to connect to it via the following piece of code


In [26]:
import configparser

config = configparser.ConfigParser()
config.read('example.ini')

['example.ini']

In [27]:
# Here we try to connect to Elastic
es = Elasticsearch(
    cloud_id=config['ELASTIC']['cloud_id'],
    http_auth=(config['ELASTIC']['user'], config['ELASTIC']['password'])
)

  es = Elasticsearch(


## Create tf-idf Index

### Create Index

In [28]:
# Name of index 
sm_index_name = 'tfidf_index'

In [30]:
# Delete index if one does exist
if es.indices.exists(index=sm_index_name):
    es.indices.delete(index=sm_index_name)

# Create index    
es.indices.create(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'tfidf_index'})

### Add documents

In here we used the bulk doc formatter which was introduced in the first subsection of phase 3. <br>
You can find out more in [Here](https://stackoverflow.com/questions/61580963/insert-multiple-documents-in-elasticsearch-bulk-doc-formatter).

In [31]:

from elasticsearch.helpers import bulk

def bulk_sync():
    actions = [
        {
            '_index': sm_index_name,
            '_id':doc_id,
            '_source': doc
        } for doc_id,doc in preprocessed_docs.items()
    ]
    bulk(es, actions)
    
    


In [32]:
# run the function to add documents
bulk_sync()

In [33]:
# Check index
es.count(index = sm_index_name)

ObjectApiResponse({'count': 12202, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}})

### Configuring a similarity

In order to configure a new similarity function you have to change the similarity from the settings api of the index. This can be done via the function 'put_settings' in python. What we do is to change the 'default' similarity function in Elastic so that it uses the replaced similarity instead. Type of this similarity is set to 'scripted' because tf-idf is not among the pre-defined similarity functions in Elastic anymore. As this similarity is a scripted type the source code of it must be written **by you** and passed to it.<br>
> In order for the changes to be applied, first we close the index and change the settings and then reopen it<br>

Write the tf-idf code in a string and pass it as a value to the "source" key. <br>
You can find the variables needed in your code in [Here](https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-similarity-context.html).

Sourse Code of tf-idf similarity
```java
double tf = Math.log(1 + doc.freq); 
double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)); 
double norm = 1/Math.sqrt(doc.length); 
return query.boost * tf * idf * norm;
```

In [34]:
# TODO : uncomment the code bellow, write the tf-idf code in here
source_code = "double tf = Math.log(1 + doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)); double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"

In [35]:
# closing the index
es.indices.close(index=sm_index_name)

# applying the settings
es.indices.put_settings(index=sm_index_name, 
                            settings={
                                "similarity": {
                                      "default": {
                                        "type": "scripted",
                                        "script": {
                                          # TODO : uncomment the code bellow and pass the suitable parameter
                                          "source": source_code
                                        }
                                      }
                                }
                            }
                       )

# reopening the index
es.indices.open(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

### Query

In this section you have to test your index with same queries you tested phase2. The goal here is to observe how different or simillar your tf-idf Elastic implementation works.

In [36]:
# A function that creates appropriate body for our match content type query
def get_query(text):
    body ={
      "query": {
        "match": {
          "content": f"{text}"
        }
      }
    }
    
    return body

In [37]:
queries = [
    #TODO : add your queries in string format to this list
    get_query("ایران"),
    get_query("قهرمانی تیم ملی ایران"),
    get_query("استمهال"),
    get_query("مناقشات سیاسی خاورمیانه")
]

In [38]:
queries

[{'query': {'match': {'content': 'ایران'}}},
 {'query': {'match': {'content': 'قهرمانی تیم ملی ایران'}}},
 {'query': {'match': {'content': 'استمهال'}}},
 {'query': {'match': {'content': 'مناقشات سیاسی خاورمیانه'}}}]

In [39]:
all_res_tfidf = []


for q in queries:
    res_tfidf = es.search(index=sm_index_name, body=get_query(q), explain=True)
    all_res_tfidf.append(dict(res_tfidf))

  res_tfidf = es.search(index=sm_index_name, body=get_query(q), explain=True)


In [40]:
for res, q in zip(all_res_tfidf, queries):
    print(q)
    for doc in res['hits']['hits']:
        print(f"Score: [{doc['_score']}]: {doc['_source']['url']}")
    print("----------------------------")

{'query': {'match': {'content': 'ایران'}}}
Score: [0.6754794]: https://www.farsnews.ir/news/14001030000095/آینده-سردار-آزمون-رسما-مشخص-شد
Score: [0.27887025]: https://www.farsnews.ir/news/14001107000656/آمار-بازی-ایران-عراق-یوزها-زهردار-از-شیرهای-بین-النهرین
Score: [0.26513416]: https://www.farsnews.ir/news/14001217001047/رقابتهای-بین-المللی-تکواندو-جام-فجر|هر-چهار-طلاهای-روز-نخست-به-مردان
Score: [0.25457254]: https://www.farsnews.ir/news/14001219000570/مسابقات-قهرمانی-آسیا|-پیروزی-دلچسب-دختران-هندبالیست-جوان-ایران-مقابل
Score: [0.25030154]: https://www.farsnews.ir/news/14000921000552/با-حکم-رئیس‌جمهور-مختارپور-رئیس-سازمان-اسناد-و-کتابخانه-ملی-شد
Score: [0.24964733]: https://www.farsnews.ir/news/14001219000747/جام-ریاست-فدراسیون-جهانی-تکواندو|-روز-طلایی-بانوان-ایرانی-و-قزاق
Score: [0.2478899]: https://www.farsnews.ir/news/14001220000571/نشر-خبر-توقیف-دو-نفتکش-ایرانی-در-این-مقطع-از-مذاکرات-ترفند-آمریکا-برای
Score: [0.24641925]: https://www.farsnews.ir/news/14001118000913/کامیابی‌نیا-در-

<div dir="rtl">
    همانطور که در ادامه مشاهده می شود، داک مربوط به «آینده سردار آزمون رسما مشخص شد» در نتیجه سه کوئری وجود دارد که با توجه به بررسی متن آن متوجه شدم که این داک به اشتباه در این کوئری ها آمده است. این داک دارای کلمه ایران، قهرمانی، مناقشات، سیاسی، خاورمیانه و استمهال نمی باشد و اما دارای تعداد زیادی کلمه تیم می باشد.
</div>

![image.png](./images/azmoon-1.png)