# Search Engine (TF-IDF)
**TF-IDF** stands for “Term Frequency-Inverse Document Frequency”. It is a technique for quantifying the words in a set of documents: each word is assigned a weight that signifies its importance within a document and across the corpus.

Take a sentence such as *“This building is so tall”*. It's easy for us to understand because we know the semantics of the words and of the sentence. But how will a computer understand it? A computer can only process data in numerical form. For this reason, we vectorize the text so that the computer can work with it.

Once the documents are vectorized, we can perform tasks such as finding relevant documents, ranking, clustering and so on. This is what happens when you perform a Google search. The web pages are called documents, and the text you search with is called a query. Google maintains a fixed representation of all the documents. When you search with a query, Google computes the relevance of the query to each document, ranks the documents by relevance and shows you the top k. All of this is done using the vectorized forms of the query and the documents. Although Google's algorithms are highly sophisticated and optimized, this is their underlying structure.

Terminology

- **t**: term (word)
- **d**: document (set of words)
- **N**: number of documents in the corpus
- **corpus**: the complete set of documents
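
To make the terminology concrete, here is a minimal sketch of one common TF-IDF variant: the raw count of term t in document d, multiplied by log(N / df), where df is the number of documents containing t. The toy corpus and the names `corpus`, `df` and `tf_idf` are made up for illustration; other weighting variants exist, and this is not necessarily the one used later in this notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, not taken from the chat export).
corpus = {
    "doc_1": "this building is so tall",
    "doc_2": "this street is so long",
    "doc_3": "tall trees line the street",
}

N = len(corpus)  # number of documents in the corpus

# Term counts per document: tf[d][t]
tf = {d: Counter(text.split()) for d, text in corpus.items()}

# Document frequency: in how many documents does each term appear?
df = Counter()
for counts in tf.values():
    df.update(counts.keys())


def tf_idf(t, d):
    """Weight of term t in document d, using raw counts and log(N / df)."""
    if df[t] == 0:
        return 0.0  # term never seen in the corpus
    return tf[d][t] * math.log(N / df[t])
```

A word that appears in only one document, such as "building", gets a higher weight there than a word shared by several documents, such as "this".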

## Step #1: Read Files

1. Read `data/files_path.txt`, which lists all the documents you have to read.
2. Read the files listed in `data/files_path.txt` and create a dictionary where keys are file names and values are file contents.

```python
docs = {
    "file_1": "content_1",
    "file_2": "content_2",
    ...
}
```
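
The two steps above could be sketched as follows. This assumes that the listing file contains one path per line, that the paths resolve from the current working directory, and that all files are UTF-8 encoded; the helper name `read_docs` is hypothetical.

```python
from pathlib import Path


def read_docs(list_path):
    """Return {file name: file content} for every path listed in list_path."""
    docs = {}
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:  # skip blank lines in the listing
            continue
        path = Path(line)
        docs[path.name] = path.read_text(encoding="utf-8")
    return docs
```

It would be called as `docs = read_docs("data/files_path.txt")`. The cells below take a different route and load a single JSON chat export instead.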

In [19]:
import json

In [83]:
with open("./result.json") as f:
    data = json.load(f)

In [84]:
data

{'name': 'Machine Learning with Python',
 'type': 'private_supergroup',
 'id': 10017328477,
 'messages': [{'id': -999029553,
   'type': 'service',
   'date': '2021-01-27T17:32:53',
   'actor': 'Ali Hejazi',
   'actor_id': 4368073731,
   'action': 'create_group',
   'title': 'Machine Learning with Python',
   'members': ['Ali Hejazi', 'Fatemeh Modarres'],
   'text': ''},
  {'id': -999029551,
   'type': 'service',
   'date': '2021-01-27T17:33:06',
   'actor': 'Ali Hejazi',
   'actor_id': 4368073731,
   'action': 'edit_group_photo',
   'photo': '(File not included. Change data exporting settings to download.)',
   'width': 640,
   'height': 640,
   'text': ''},
  {'id': -999029549,
   'type': 'service',
   'date': '2021-01-27T17:33:43',
   'actor': 'Ali Hejazi',
   'actor_id': 4368073731,
   'action': 'migrate_to_supergroup',
   'text': ''},
  {'id': 1,
   'type': 'service',
   'date': '2021-01-27T17:33:43',
   'actor': 'Machine Learning with Python',
   'actor_id': 10017328477,
   'actio

In [22]:
data['messages'][100]

{'id': 100,
 'type': 'message',
 'date': '2021-02-04T15:46:52',
 'from': 'Sara Gh',
 'from_id': 4393680157,
 'reply_to_message_id': 85,
 'text': 'بچه ها جون، برای هر کلاس، تاپیک کلاس خودش لینک زوم هستش. روی تاپیک کلاس اون روز کلیک کنید.'}

In [23]:
data['messages'][100]['text'].split()

['بچه',
 'ها',
 'جون،',
 'برای',
 'هر',
 'کلاس،',
 'تاپیک',
 'کلاس',
 'خودش',
 'لینک',
 'زوم',
 'هستش.',
 'روی',
 'تاپیک',
 'کلاس',
 'اون',
 'روز',
 'کلیک',
 'کنید.']
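
Note that a plain `split()` keeps punctuation attached to the words, so 'کلاس' and 'کلاس،' count as two different tokens. A minimal sketch of a tokenizer that strips punctuation from token edges is shown below. The function name `tokenize` and the punctuation set are assumptions, and the rest of this notebook keeps using plain `split()`, so its counts would differ from what this tokenizer produces.

```python
# Punctuation stripped from token edges. This set is an assumption and can
# be extended; the escapes are the Arabic/Persian comma, semicolon and
# question mark.
PUNCT = ".,!?:;()\"'\u060c\u061b\u061f"


def tokenize(text):
    """Split on whitespace and strip punctuation from both ends of each token."""
    tokens = (tok.strip(PUNCT) for tok in text.split())
    return [tok for tok in tokens if tok]  # drop tokens that were only punctuation
```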

## Step #2: Extract Unique Words in all Documents
Create a set of all words (vocab) and print the number of unique words.

In [144]:
vocab = set()
for msg in data['messages']:
    # 'text' is sometimes a list rather than a string; skip those messages
    if not isinstance(msg['text'], list):
        vocab.update(msg['text'].split())
len(vocab)

685
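
The loop above skips every message whose `text` is a list. If those messages should be counted too, a helper along these lines could flatten them first. This is only a sketch: it assumes that list items are either plain strings or dictionaries carrying a `text` key, which the output shown in this notebook does not confirm, and `message_text` is a hypothetical name.

```python
def message_text(msg):
    """Return a message's text as one string, whether stored as str or list."""
    text = msg.get("text", "")
    if isinstance(text, str):
        return text
    parts = []
    for item in text:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, dict):
            # Assumed shape: a formatted fragment with its content under 'text'
            parts.append(item.get("text", ""))
    return "".join(parts)
```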

In [25]:
# Overall word counts across all messages (not yet split per document)
tf_dict = dict()
for msg in data['messages']:
    if isinstance(msg['text'], list):
        continue
    for word in msg['text'].split():
        tf_dict[word] = tf_dict.get(word, 0) + 1
tf_dict

{'سلام': 5,
 'ب': 2,
 'همگی،': 1,
 'علی،': 1,
 'امشب': 4,
 'کلاس': 14,
 'رفع': 1,
 'اشکال': 1,
 'چه': 4,
 'ساعتی': 1,
 'هست؟': 2,
 'وقت': 3,
 'نشد': 1,
 'چیزی': 2,
 'بذارم.': 1,
 'همون': 5,
 'شنبه': 3,
 'ادامه': 1,
 'میدیم.': 1,
 'هفته\u200cی': 1,
 'بعد': 10,
 'چند': 2,
 'جلسه': 9,
 'میذارم.': 1,
 'علی': 18,
 'دستت': 6,
 'درد': 3,
 'نکنه': 2,
 'میشه': 8,
 'لینک': 15,
 'شنیه': 1,
 'قبل': 2,
 'رو': 31,
 'هم': 16,
 'بذاری؟': 1,
 'بچه\u200cهای': 1,
 'این': 10,
 'همگی': 1,
 'درس': 2,
 'خونن\u200cها': 1,
 '😁': 2,
 'اوکی': 1,
 'مرسیییی': 1,
 'نه': 2,
 'آقا': 1,
 'دو': 5,
 'خوبه.': 1,
 'عالی': 5,
 '.': 1,
 'یک': 1,
 'طلا': 3,
 'قربانت': 4,
 '👌': 5,
 '👍👍': 1,
 'ساعته': 1,
 'ایشالا؟': 1,
 'آره': 6,
 '👍🏼👍🏼': 1,
 'ممنون': 8,
 'دمت': 3,
 'گرم': 3,
 'استاد': 3,
 'حجازی': 1,
 '👍': 3,
 'الان': 12,
 'کلا': 4,
 'پسورد': 1,
 'از': 14,
 'روش': 2,
 'برداشتم.': 1,
 'ببین': 2,
 'کار': 4,
 'می\u200cکنه': 4,
 'یا': 5,
 'نه.': 1,
 'بله': 4,
 'با': 7,
 'تشکر': 1,
 'اقدام': 1,
 'فوری': 1,
 'اگه': 6,
 'کسی': 5,
 '

## Step #3: Extract Number of Words in each Document

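1. Count the words in each document by creating a dictionary named `tf_dict`, where keys are document names and values are nested dictionaries.
2. In each nested dictionary, keys are words and values are the corresponding word frequencies.

In the code below, all messages from one sender are treated as a single document, so the document names are sender names.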

In [86]:
tf_dict = {}
for msg in data['messages']:
    # Skip list-valued and empty texts (the service messages shown above have empty text)
    if isinstance(msg['text'], list) or not msg['text']:
        continue

    # One nested dictionary of word counts per sender
    word_counts = tf_dict.setdefault(msg['from'], {})
    for word in msg['text'].split():
        word_counts[word] = word_counts.get(word, 0) + 1

tf_dict['Sara Gh']

{'دمت': 2,
 'گرم': 2,
 'استاد': 1,
 'حجازی': 1,
 'بچه': 2,
 'ها': 2,
 'جون': 1,
 'این': 1,
 'جلسه': 1,
 'اول': 1,
 'کلاس': 3,
 'هستش': 1,
 'علی👌': 1,
 'من': 2,
 'الان': 2,
 'روش': 1,
 'کلیک': 2,
 'کردم': 1,
 'access': 1,
 'request': 1,
 'نیاز': 1,
 'داره': 1,
 'اره': 1,
 'درست': 1,
 'شد.': 1,
 'عالیییی👌👌': 1,
 'مرسییی': 1,
 'علی': 1,
 'جان🙏🏻🙏🏻': 1,
 'جون،': 1,
 'برای': 1,
 'هر': 1,
 'کلاس،': 1,
 'تاپیک': 2,
 'خودش': 1,
 'لینک': 1,
 'زوم': 1,
 'هستش.': 1,
 'روی': 1,
 'اون': 1,
 'روز': 1,
 'کنید.': 1,
 'محمد': 1,
 'ریاضی؟': 1,
 'عالی': 1,
 'برادر👌🏻👌🏻': 1,
 'عههه': 1,
 'دیدم': 1,
 'اینو😅😅': 1}
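
The same per-sender counting can also be sketched with `collections.Counter` and `defaultdict`, which removes the manual "is the key already there" checks. The function name `count_words_by_sender` and the sample messages are hypothetical; the sample only mimics the shape of `data['messages']`.

```python
from collections import Counter, defaultdict


def count_words_by_sender(messages):
    """Return {sender: Counter of words} for messages with non-empty string text."""
    counts = defaultdict(Counter)
    for msg in messages:
        text = msg.get("text")
        if not isinstance(text, str) or not text:
            continue  # skip list-valued and empty texts
        counts[msg["from"]].update(text.split())
    return counts


# Tiny hypothetical sample in the same shape as data['messages'].
sample = [
    {"from": "A", "text": "hello hello world"},
    {"from": "B", "text": "world"},
    {"from": "A", "text": ["skipped", "list"]},
    {"actor": "A", "text": ""},
]
sample_counts = count_words_by_sender(sample)
```

On the real data this would be called as `count_words_by_sender(data['messages'])`.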

## Step #4: Create `tf` (Term Frequency)

1. Create a dictionary where keys are words and values are lists.
2. Each list holds the word's frequency in every document, in a fixed document order.

```python
tf = {
    word_1: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    word_2: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    word_3: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    ...
    word_n: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
}
```

| | doc_1 | doc_2 | ... | doc_n |
| -- | -- | -- | -- | -- |
| word_1 | 10 | 4 | ... | 14 |
| word_2 | 8 | 11 | ... | 4 |
| word_3 | 3 | 5 | ... | 1 |

We have already collected the unique words in the variable `vocab`.

Now we create a set of unique documents (senders):

In [160]:
sender = set()
for msg in data['messages']:
    if not isinstance(msg['text'], list) and msg['text']:
        sender.add(msg['from'])

sender

{'Adrian Kamyab',
 'Ali Hejazi',
 'Ali Naeim abadi',
 'Amin',
 'Amir Sajedian',
 'Arash',
 'Arefe B',
 'Farnoush Azour',
 'Fatemeh Modarres',
 'Gazal',
 'Hadi',
 'M',
 'Machine Learning with Python',
 'Mahsa Taheri',
 'Mohammad Riazi',
 'Mohammad pmb',
 'Mohsen',
 'Mostafa',
 'Peyman T',
 'Sara Gh'}

In [200]:
tf = {}
for word in vocab:
    tf[word] = [0]*len(sender)
    # Column order follows the iteration order of the set `sender`
    for ind, name in enumerate(sender):
        if word in tf_dict[name]:
            tf[word][ind] = tf_dict[name][word]
            
tf

{'کردم': [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 'روز\u200cهای': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'باهاش': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'میذارن': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'اون': [0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'علی،': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'نژادی': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'نمیدونم': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'رعیت': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'کرد،': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'عصبی': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'هفت': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 'hyperlink': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 'میخواد': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [11]:
from tqdm import tqdm

In [12]:
tf = {}

In [13]:
for w in tqdm(vocab):
    vector = []
    for name, word_freq in tf_dict.items():
        vector.append(word_freq.get(w, 0))
        
    tf[w] = vector

100%|██████████| 685/685 [00:00<00:00, 130553.83it/s]


## Step #5: Search
Ask the user to enter a query, then use the dot product of vectors to find the most relevant documents.

Example:

- query: "محسن نقش"
- output: [doc_28, doc_4, ..., doc_19]

In [14]:
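import numpy as np
# Element-wise product of two word vectors; argmax gives the index of the document where the product is largest
np.argmax([i*j for i, j in zip(tf["هستم"], tf["خروس"])])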

0

In [15]:
list(tf_dict.keys())[0]

'Fatemeh Modarres'

In [16]:
query = input("Enter a phrase:")

Enter a phrase:علی
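
One way the search step could be completed is sketched below. It builds a query vector over the vocabulary, takes its dot product with every document column of `tf`, and returns the best-scoring documents. The function name `search`, the `top_k` parameter and the toy data are assumptions made for illustration, and the scores use raw term counts without any IDF weighting.

```python
import numpy as np


def search(query, tf, doc_names, top_k=3):
    """Rank documents by the dot product of the query and document vectors."""
    words = list(tf)  # fixes the word order for this call
    # Rows are words, columns are documents.
    matrix = np.array([tf[w] for w in words], dtype=float)
    # Query vector: how often each vocabulary word occurs in the query.
    query_tokens = query.split()
    q = np.array([query_tokens.count(w) for w in words], dtype=float)
    scores = q @ matrix  # one score per document
    order = np.argsort(-scores, kind="stable")[:top_k]
    # Keep only documents that share at least one word with the query.
    return [(doc_names[i], scores[i]) for i in order if scores[i] > 0]


# Hypothetical toy data in the same shape as `tf` and `tf_dict.keys()`.
toy_tf = {"link": [2, 0, 1], "class": [1, 3, 0], "zoom": [0, 0, 4]}
toy_docs = ["doc_a", "doc_b", "doc_c"]
results = search("zoom link", toy_tf, toy_docs)
```

With the variables from this notebook, the call would be `search(query, tf, list(tf_dict.keys()))`, assuming `tf` was built by iterating over `tf_dict.items()` so that the columns follow the same order as the keys.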
