# Similarity Modulation

Here we implement a similarity function other than BM25, which is the default in Elastic. We want you to implement a tf-idf similarity and test it with the same queries you used in phase 2, so that you can get a sense of how well your Elastic tf-idf works. Follow the instructions and fill in the code wherever it says # TODO.  <br>
You can contact me in case of any problems via Telegram: @mahvash_sp

In [1]:
from elasticsearch import Elasticsearch, helpers
import json
import warnings

In [2]:
# import data in json format
import os

os.chdir('../../')
file_name = os.path.join(os.getcwd(), 'Phase_1', 'assets', 'IR_data_news_12k.json')

# The data is Persian text, so read it explicitly as UTF-8
with open(file_name, encoding='utf-8') as f:
    data = json.load(f)

In [3]:
# Filter warnings
warnings.filterwarnings('ignore')

In [4]:
# data keys
data['0'].keys()

dict_keys(['title', 'content', 'tags', 'date', 'url', 'category'])

In [23]:
import codecs
from os import path

from hazm import Normalizer, Stemmer, word_tokenize

normalizer = Normalizer()

# from Phase_1.src.utils import data_path
default_stop_words = path.join(os.getcwd(), 'Phase_1', 'src', 'data', 'stopwords.dat')


class StopWord:
    """Remove stop words from an iterable of tokens.

    >>> StopWord().clean(["در","تهران","کی","بودی؟"])
    ['تهران', 'کی', 'بودی؟']
    >>> StopWord(normal=True).clean(["در","تهران","کی","بودی؟"])
    ['تهران', 'بودی؟']
    """

    def __init__(self, file_path=default_stop_words, normal=False):
        self.file_path = file_path
        self.normal = normal
        self.normalizer = Normalizer().normalize
        self.stop_words = self.init(file_path, normal)

    def init(self, file_path, normal):
        # Read the stop-word list once and close the file when done
        with codecs.open(file_path, "r", encoding="utf-8") as f:
            words = [line.strip("\r\n") for line in f]
        if normal:
            words = [self.normalizer(word) for word in words]
        return set(words)

    def set_normalizer(self, func):
        self.normalizer = func
        self.stop_words = self.init(self.file_path, self.normal)

    def __getitem__(self, item):
        return item in self.stop_words

    def __str__(self):
        return str(self.stop_words)

    def clean(self, iterable_of_strings, return_generator=False):
        if return_generator:
            return filter(lambda item: not self[item], iterable_of_strings)
        else:
            return list(filter(lambda item: not self[item], iterable_of_strings))

# Build the stemmer and stop-word filter once instead of on every call
stemmer = Stemmer()
stop_word_filter = StopWord(normal=False)

def preprocess_content(content):
    """Normalize, tokenize, remove stop words, stem, and re-join with spaces."""
    content = normalizer.normalize(content)
    tokens = word_tokenize(content)
    tokens = stop_word_filter.clean(tokens)
    tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(tokens)

# Preprocess the content of every document in place
for doc in data.values():
    doc['content'] = preprocess_content(doc['content'])

After starting Elasticsearch on your machine, connect to it with the following piece of code. The default address is localhost:9200; the cell below uses port 55003, so adjust the URL to match your own setup.


In [24]:
# Here we try to connect to Elastic
es = Elasticsearch("http://localhost:55003")

## Create tf-idf Index

### Create Index

In [25]:
# Name of index 
sm_index_name = 'tfidf_index'

In [26]:
# Delete the index if it already exists
if es.indices.exists(index=sm_index_name):
    es.indices.delete(index=sm_index_name)

# Create index    
es.indices.create(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'tfidf_index'})

### Add documents

Here we use the bulk doc formatter that was introduced in the first subsection of phase 3. <br>
You can find out more [here](https://stackoverflow.com/questions/61580963/insert-multiple-documents-in-elasticsearch-bulk-doc-formatter).
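As a minimal sketch of the bulk format, each document becomes one action dictionary holding the target index, the document id, and the document body. The helper name `iter_bulk_actions` and the sample documents are hypothetical and only serve as an illustration; the generator form avoids building the whole action list in memory.

```python
def iter_bulk_actions(docs, index_name):
    """Yield one bulk action per document (hypothetical helper for illustration)."""
    for doc_id, doc in docs.items():
        yield {
            "_index": index_name,  # index the document is written to
            "_id": doc_id,         # reuse the key of the data dict as the document id
            "_source": doc,        # the document body itself
        }


# Tiny made-up sample, standing in for the real news data
sample_docs = {"0": {"title": "a"}, "1": {"title": "b"}}
actions = list(iter_bulk_actions(sample_docs, "tfidf_index"))
print(actions[0])
```

Since `helpers.bulk` accepts any iterable of actions, a generator like this could be passed to it directly in place of a list.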

In [27]:

from elasticsearch.helpers import bulk

def bulk_sync():
    """Index every document in `data` with a single bulk request."""
    actions = [
        {
            '_index': sm_index_name,
            '_id': doc_id,
            '_source': doc
        } for doc_id, doc in data.items()
    ]
    bulk(es, actions)

In [28]:
# run the function to add documents
bulk_sync()

In [29]:
# Check index
es.count(index = sm_index_name)

ObjectApiResponse({'count': 10500, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}})

### Configuring a similarity

To configure a new similarity function, you change the similarity through the settings API of the index, which is available in Python as the function 'put_settings'. Here we replace the 'default' similarity of the index so that Elastic scores with our function instead of BM25. The type of this similarity is set to 'scripted' because tf-idf is no longer among the predefined similarity functions in Elastic. Because it is a scripted similarity, its source code must be written **by you** and passed to it.<br>
> For the changes to be applied, we first close the index, then change the settings, and finally reopen it.<br>

Write the tf-idf code as a string and pass it as the value of the "source" key. <br>
You can find the variables available to your code [here](https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-similarity-context.html).
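Before writing the Painless script, it can help to have a plain Python reference for the per-term score. The sketch below assumes the common log-scaled variant, tf = 1 + ln(freq) and idf = ln(N / df); the function name and the sample numbers are made up for illustration. It also shows one pitfall to watch for: the counts exposed to the script are, as far as the linked page documents them, integer-valued, and dividing two integers truncates the ratio before the logarithm is taken.

```python
import math


def tfidf_term_score(freq, doc_count, doc_freq):
    """Reference tf-idf score of one term in one document (illustrative sketch).

    freq      : occurrences of the term in the document
    doc_count : number of documents that have the field
    doc_freq  : number of documents that contain the term
    """
    tf = 1 + math.log(freq)
    idf = math.log(doc_count / doc_freq)  # true (floating-point) division
    return tf * idf


# Made-up numbers: a term occurring 3 times, found in 4000 of 10500 documents
exact = tfidf_term_score(freq=3, doc_count=10500, doc_freq=4000)

# Same score with integer division, which drops the fractional part of N / df
truncated = (1 + math.log(3)) * math.log(10500 // 4000)

print(exact, truncated)
```

The truncated score is noticeably lower, which is why forcing floating-point division (for example by multiplying one operand by 1.0) is worth doing in the script.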

In [30]:
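# field.docCount and term.docFreq are integers: multiply by 1.0 so the idf ratio is not truncated
source_code = "double tf = 1 + Math.log(doc.freq); double idf = Math.log((field.docCount * 1.0) / term.docFreq); return tf * idf;"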

In [31]:
# closing the index
es.indices.close(index=sm_index_name)

# applying the settings
es.indices.put_settings(index=sm_index_name, 
                            settings={
                                "similarity": {
                                      "default": {
                                        "type": "scripted",
                                        "script": {
                                          "source": source_code
                                        }
                                      }
                                }
                            }
                       )

# reopening the index
es.indices.open(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})
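If the settings update fails while the index is closed, the index stays closed and later searches will fail. One way to guard against this, shown here as a sketch rather than a required part of the exercise, is a small context manager that always reopens the index. The name `closed_index` is hypothetical; it only relies on the `indices.close` and `indices.open` calls used above.

```python
from contextlib import contextmanager


@contextmanager
def closed_index(client, index_name):
    """Close an index for the duration of the block, then reopen it (sketch)."""
    client.indices.close(index=index_name)
    try:
        yield
    finally:
        # Runs even if the settings update raises, so the index is not left closed
        client.indices.open(index=index_name)


# Intended usage with the objects defined in this notebook:
# with closed_index(es, sm_index_name):
#     es.indices.put_settings(index=sm_index_name, settings={...})
```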

### Query

In this section you have to test your index with the same queries you tested in phase 2. The goal here is to observe how differently or similarly your Elastic tf-idf implementation behaves compared with your phase 2 results.
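One simple way to put a number on "how different or similar" two result lists are is the Jaccard overlap of their top-k entries. The sketch below is only one possible measure, and the function name and sample URLs are hypothetical; it ignores the order of results inside the top k.

```python
def jaccard_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap of the top-k items of two ranked lists (illustrative)."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists are treated as identical
    return len(top_a & top_b) / len(top_a | top_b)


# Made-up result lists standing in for phase 2 and Elastic tf-idf URLs
phase2_urls = ["url_a", "url_b", "url_c"]
elastic_urls = ["url_b", "url_c", "url_d"]
print(jaccard_at_k(phase2_urls, elastic_urls, k=3))
```

A value of 1.0 means both systems returned the same top-k documents, and 0.0 means they share none.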

In [32]:
# Build the request body for a match query on the content field
def get_query(text):
    body = {
        "query": {
            "match": {
                "content": text
            }
        }
    }
    return body

In [33]:
queries = [
    #TODO : add your queries in string format to this list
    "تیم استقلال",
    "مجلس",
    "یوفا",
    "نرخ استخاره",
    "تیبو کورتوا",
    "تحریم جهانی علیه کشور جهان سومی ایران در جمعه"
]

In [34]:
all_res_tfidf = []


for q in queries:
    res_tfidf = es.search(index=sm_index_name, body=get_query(q), explain=True)
    all_res_tfidf.append(dict(res_tfidf))
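Each hit in a search response carries a `_score`, and because the search is run with `explain=True`, an `_explanation` entry as well. The sketch below pulls the URL and score out of one response so that rankings can be inspected side by side; the helper name `top_hits` and the sample response are made up for illustration and only mimic the shape of a real response.

```python
def top_hits(response, n=10):
    """Return (url, score) pairs for the first n hits of a search response (sketch)."""
    hits = response["hits"]["hits"][:n]
    return [(hit["_source"]["url"], hit["_score"]) for hit in hits]


# Hand-written stand-in for one element of all_res_tfidf
sample_response = {
    "hits": {
        "hits": [
            {"_score": 7.5, "_source": {"url": "url_a"},
             "_explanation": {"value": 7.5, "description": "sum of:", "details": []}},
            {"_score": 6.0, "_source": {"url": "url_b"},
             "_explanation": {"value": 6.0, "description": "sum of:", "details": []}},
        ]
    }
}

for url, score in top_hits(sample_response):
    print(f"{score:.2f}  {url}")
```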

In [22]:
for res, q in zip(all_res_tfidf, queries):
    print(q)
    for doc in res['hits']['hits']:
        print(doc['_source']['url'])
    print("----------------------------")

تیم استقلال
https://www.farsnews.ir/news/14001206000566/گفت‌و‌گوی-ویژه-با-الهامی-جعبه-سیاه-استقلالِ-حجازی-و-آقا-فیروز-از
https://www.farsnews.ir/news/14001127000238/گفت‌و‌گوی-ویژه-با-زلزله-استقلال-به-خاکسپاری-برادرم-نرسیدم-کاپیتان
https://www.farsnews.ir/news/14001024000498/6-نکته-از-لیگ-برتر-در-نیم-فصل-اول|-پرسپولیس-و-5-تیم-رکورددار-حاشیه-
https://www.farsnews.ir/news/14000930000263/گفت‌وگوی-ویژه-با-مورچه-اتمی-استقلال-از-توصیه‌های-کی-روش-و-قلعه‌نویی
https://www.farsnews.ir/news/14001210000357/اکبرپور-می-خواهند-استقلال-قهرمان-نشود-امیدوارم-رقابت-با-پرسپولیس
https://www.farsnews.ir/news/14001015000879/هفته-چهاردهم-لیگ-برتر|کورس-سرخابی‌ها-و-سپاهان-برای-قهرمانی-نیم-فصل
https://www.farsnews.ir/news/14001207000966/هفته-بیستم-لیگ-برتر|-استقلال-20-می-شود-پرسپولیس-مقابل-دومین-تیم
https://www.farsnews.ir/news/14001115000506/معوقه-از-هفته-شانزدهم-لیگ-برتر|-صدرنشینی-مقتدرانه-استقلال-با-شکست
https://www.farsnews.ir/news/14001207001061/آجورلو-از-چه-می-ترسید-که-پشت-درهای-بسته-تصمیم-می-گیرید-از-مجیدی