# 02. Practice Session

> Covering Data Types, Functions, and IO.

## Search Engine (TF-IDF)

> **TF-IDF** stands for “Term Frequency-Inverse Document Frequency”. It is a technique for quantifying words in documents: each word is given a weight that signifies its importance within a document and across the corpus.

Suppose I give you a sentence such as _“This building is so tall”_. It is easy for us to understand because we know the meaning of the words and of the sentence as a whole. But how can a computer understand it? A computer can only process data in numerical form, so we vectorize the text, that is, convert it into vectors of numbers that the computer can work with.

Once the documents are vectorized, we can perform further tasks such as finding relevant documents, ranking, and clustering. The same thing happens when you perform a Google search: the web pages are the documents, and the text you search with is the query. Google maintains a fixed representation for all of the documents. When you search with a query, Google computes the relevance of the query to each document, ranks the documents by relevance, and shows you the top k. All of this is done using the vectorized forms of the query and the documents. Google's algorithms are highly sophisticated and optimized, but this is their underlying structure.
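The retrieval loop described above can be sketched in a few lines. This is only a toy illustration, not how any real search engine works: the three documents, the `vectorize`, `cosine`, and `top_k` helpers, and the choice of raw word counts with cosine similarity are all assumptions made for the example.

```python
import math

# Hypothetical toy corpus; names and contents are made up for illustration.
toy_docs = {
    "doc_a": "the building is tall",
    "doc_b": "the river is wide and the river is deep",
    "doc_c": "a tall building near the river",
}

# Fixed vocabulary order so every vector has the same layout.
toy_vocab = sorted({w for text in toy_docs.values() for w in text.split()})


def vectorize(text):
    """Turn text into a list of raw word counts, one slot per vocabulary word."""
    words = text.split()
    return [words.count(w) for w in toy_vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, k=2):
    """Return the names of the k documents most similar to the query."""
    q = vectorize(query)
    scores = {name: cosine(q, vectorize(text)) for name, text in toy_docs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


ranking = top_k("tall building")
print(ranking)
```

Here `doc_b` shares no words with the query, so its score is 0 and it drops out of the top two. Raw counts treat every word as equally informative, which is the weakness that TF-IDF weighting is meant to address.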

Terminology
- **t**: term (word)
- **d**: document (set of words)
- **N**: number of documents in the corpus
- **corpus**: the total document set
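
For reference, one common textbook form of the weight can be written with these symbols. The document-frequency symbol $\mathrm{df}(t)$ is not defined in the list above and is introduced here as an assumption. Many variants exist (smoothing, other logarithm bases, normalized term frequency), and this session does not state which one it intends.

```latex
\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}
```

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, and $\mathrm{df}(t)$ is the number of documents in the corpus that contain $t$.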

### Step #1: Read Files

> 1. Read `data/files_path.txt`, which lists the paths of all the documents you have to read.
> 2. Read the files listed in `data/files_path.txt` and create a dictionary where keys are file names and values are file contents.

```python
docs = {
    "file_1": "content_1",
    "file_2": "content_2",
    ...
} 
```

In [19]:
docs = {}

In [16]:
# Read the list of document paths, one per line, skipping blank lines.
with open("./data/docs/docs.txt") as f:
    files = [line.strip() for line in f if line.strip()]

In [22]:
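for file_path in files:
    # Use the file name without directory or extension as the key.
    name = file_path.split("/")[-1].split(".")[0]
    # Explicit UTF-8 because the corpus contains non-ASCII text.
    with open(file_path, encoding="utf-8") as f:
        # Append if two paths share a name; otherwise start a new entry.
        docs[name] = docs.get(name, "") + f.read()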

### Step #3: Extract Unique Words in all Documents

> Create a set of all words (`vocab`) and print the number of unique words.

In [40]:
vocab = set()
for content in docs.values():
    # Split on whitespace and add every word; the set drops duplicates.
    vocab.update(content.split())

In [41]:
print(len(vocab))

9004


### Step #2: Extract Number of Words in each Document

> 1. Extract words in each document by creating a dictionary named `tf_dict` where keys are document names and values are another dictionary.
> 2. In the nested dictionary, keys are words and values are the corresponding word frequency.

In [33]:
tf_dict = {}
for name, content in docs.items():
    # Count how many times each word occurs in this document.
    word_dict = {}
    for w in content.split():
        word_dict[w] = word_dict.get(w, 0) + 1
    tf_dict[name] = word_dict
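
The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below uses a made-up two-document `sample_docs` dictionary (an assumption for illustration, standing in for `docs`) and builds the same nested structure.

```python
from collections import Counter

# Hypothetical stand-in for `docs`; contents are made up for illustration.
sample_docs = {
    "file_1": "red green red",
    "file_2": "green blue",
}

# Counter counts each whitespace-separated word; dict() converts it back
# to a plain dictionary so the result has the same shape as `tf_dict`.
sample_tf_dict = {
    name: dict(Counter(content.split()))
    for name, content in sample_docs.items()
}

print(sample_tf_dict)
```

Writing the loop by hand is still worthwhile in a session on data types, since it shows exactly what `Counter` does internally.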

### Step #4: Create `tf` (Term Frequency)

> 1. Create a dictionary where keys are words and values are lists.
> 2. Each list holds the word's frequency in every document, in the same document order for all words.

```python
tf = {
    word_1: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    word_2: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    word_3: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    ...
    word_n: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
}
```

| |doc_1|doc_2|...|doc_n|
|--|--|--|--|--|
|word_1|10|4|...|14|
|word_2|8|11|...|4|
|word_3|3|5|...|1|

In [49]:
from tqdm import tqdm

In [42]:
tf = {}

In [51]:
for w in tqdm(vocab):
    # One count per document, in the insertion order of tf_dict; 0 if the word is absent.
    tf[w] = [word_freq.get(w, 0) for word_freq in tf_dict.values()]

100%|██████████| 9004/9004 [00:00<00:00, 119980.79it/s]


In [55]:
max(tf["محسن"])

149

In [53]:
tf.keys()
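
With `tf` in this shape, a natural next step toward TF-IDF is the document frequency of each word: the number of documents in which it occurs at least once. The sketch below is an assumption about where the exercise goes next, not part of the original notebook. `sample_tf`, `n_docs`, `df`, and `idf` are hypothetical names, and the unsmoothed `log(N / df)` form is only one of several common IDF variants.

```python
import math

# Hypothetical stand-in for `tf`: each list holds one count per document.
sample_tf = {
    "red": [2, 0, 1],
    "green": [1, 1, 1],
    "blue": [0, 3, 0],
}

# N is the number of documents, i.e. the length of any frequency list.
n_docs = len(next(iter(sample_tf.values())))

# Document frequency: in how many documents the word appears at least once.
df = {w: sum(1 for c in counts if c > 0) for w, counts in sample_tf.items()}

# Inverse document frequency, unsmoothed natural-log variant.
idf = {w: math.log(n_docs / df[w]) for w in df}

print(df)
print(idf)
```

A word that appears in every document, such as `"green"` here, gets an IDF of 0, while a word confined to a single document gets the largest value.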

