# Assignment 2 IRS - With Synonyms

Information retrieval is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. The core purpose of this assignment is to give you a feel for how an information retrieval system (IRS) works. Follow the steps listed below and, by the end, you will have built your own small IRS. So, let's start.

In [12]:
# required imports
import numpy as np
import fnmatch
import os

Suppose we have 3 files containing data:

### File Contents

!["This is my book" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f1.png?raw=true)
!["This is my pen" - File 2](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f2.png?raw=true)
!["This book is interesting" - File 3](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f3.png?raw=true)

# Step 1 Create Files with Dummy data

Create a few files with dummy data of your own choice, as shown above.
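A minimal sketch of this step, assuming a directory named `files` (the path used for `file_path` in the next step). The helper name `create_dummy_files` is hypothetical, and the sentences are placeholders modelled on the examples above; use any text you like.

```python
import os

def create_dummy_files(directory, contents):
    """Write each (file name -> text) pair in `contents` into `directory`."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    for name, text in contents.items():
        with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
            f.write(text)

# Placeholder sentences (assumed); replace them with your own dummy data.
dummy_data = {
    "f1.txt": "This is my book",
    "f2.txt": "This is my pen",
    "f3.txt": "This book is interesting",
}

if __name__ == "__main__":
    create_dummy_files("./files", dummy_data)
```

Keeping the contents in a dict makes it easy to add more files later without touching the helper.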

# Step 2 Traverse Directories

Now traverse the directory and record every file in a dict variable (files_dict).

In [13]:
# Here we have initialized some variables, you can add more if required.

file_count = 0             # file_count to count number of files
files_dict = {}            # files_dict to store the word count of every file
unique_word_set = set()    # unique_word_set to store all the unique words in a set
file_path = './files'      # file_path to store the path of the directory containing files


In [14]:
#Your code starts here
for file in os.listdir(file_path):
    if fnmatch.fnmatch(file, '*.txt') and file != 'synonyms.txt':
        file_count += 1
        with open(os.path.join(file_path, file), 'r') as f:
            files_dict[file] = 0
            for line in f:
                for word in line.split():
                    files_dict[file] += 1
#Your code ends here       

Displaying the count of files.

In [15]:
print("\nTotal Number  of files\n", file_count)


Total Number  of files
 3


Displaying Dictionary containing all files.

In [16]:
print("\nDictionary containing  files\n", files_dict)


Dictionary containing  files
 {'f1.txt': 4, 'f2.txt': 4, 'f3.txt': 4}


# Step 3 Extract Unique Vocabulary

Write code to collect all the unique words from every file, store them in a set, and print the set.

In [20]:
#Your code starts here    
# 3. collecting the unique words from all the files
for path in os.listdir(file_path):
    # process only .txt files, skipping the synonyms file
    if fnmatch.fnmatch(path, '*.txt') and path != 'synonyms.txt':
        with open(os.path.join(file_path, path), 'r') as file:
            # read the file and split it into words
            words = file.read().split()
            # plain str.lower keeps the set entries as ordinary Python strings
            lowercase_words = [word.lower() for word in words]
            # add the words to the set
            unique_word_set.update(lowercase_words)

print('Unique word set:', unique_word_set)
#Your code ends here

Unique word set: {'this', 'my', 'is', 'book', 'pen', 'interesting'}


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o1.png?raw=true)


# Step 4 Create Term Document Matrix

Create a term-document matrix using the bag-of-words approach, and display its contents both initially and after it has been filled.

1. Create the term-document matrix such that the columns are the unique words and the rows are the files.
2. Write code to count the appearances of every unique word across all the files and store the result in a dictionary of words.
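A minimal sketch of the matrix layout described above, using in-memory text instead of files on disk. The names `docs`, `word_to_col`, `file_to_row`, and `term_doc` are hypothetical. The vocabulary is sorted here as a design choice: Python sets have no guaranteed iteration order, so sorting keeps the column indices reproducible between runs.

```python
import numpy as np

# Hypothetical in-memory stand-ins for the files on disk.
docs = {
    "f1.txt": "This is my book",
    "f2.txt": "This is my pen",
}

# Collect the lowercase vocabulary across all documents.
vocabulary = set()
for text in docs.values():
    vocabulary.update(text.lower().split())

# Sorting fixes the column order; iterating a set directly may vary between runs.
word_to_col = {word: col for col, word in enumerate(sorted(vocabulary))}
file_to_row = {name: row for row, name in enumerate(docs)}

# Rows are files, columns are unique words; every cell starts at zero.
term_doc = np.zeros((len(file_to_row), len(word_to_col)), dtype=int)

print(word_to_col)
print(term_doc.shape)
```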

In [21]:
#Your code starts here    
# create a matrix of size file_count x unique_word_set and initialize it with zeros
matrix = np.zeros((file_count, len(unique_word_set)))
print('Matrix:', matrix)

dict_of_words = {}
dict_of_files = {}
        
        
for idx,val in enumerate(unique_word_set):
    dict_of_words[val] = idx
print('dictionary of unique words\n',dict_of_words)

for i,file in enumerate(files_dict):
    dict_of_files[file] = i
print('dictionary of files\n',dict_of_files)

#Your code ends here

Matrix: [[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
dictionary of unique words
 {'this': 0, 'my': 1, 'is': 2, 'book': 3, 'pen': 4, 'interesting': 5}
dictionary of files
 {'f1.txt': 0, 'f2.txt': 1, 'f3.txt': 2}


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o2.png?raw=true)

# Step 5 Fill Term Document Matrix

1. Fill the term-document matrix by checking whether each unique word exists in a file.
2. If it exists, put a 1 in the term-document matrix (e.g. TERM_DOC_MATRIX[file][word] = 1).
3. Do the same for all the files present in the directory.
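The three steps above can be sketched on a toy example. This is an illustration under assumptions, not the required solution: `build_binary_matrix` is a hypothetical helper, and the two documents are made up. Note that the matrix records presence only, so a repeated word still yields a single 1.

```python
import numpy as np

def build_binary_matrix(docs):
    """Return (matrix, file_to_row, word_to_col) for a dict of file name -> text."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    vocabulary = sorted({word for words in tokens.values() for word in words})
    word_to_col = {word: col for col, word in enumerate(vocabulary)}
    file_to_row = {name: row for row, name in enumerate(docs)}

    matrix = np.zeros((len(docs), len(vocabulary)), dtype=int)
    for name, words in tokens.items():
        for word in words:
            # Presence only: repeated words still produce a single 1.
            matrix[file_to_row[name], word_to_col[word]] = 1
    return matrix, file_to_row, word_to_col

# Made-up documents for illustration.
docs = {"a.txt": "red red apple", "b.txt": "green apple"}
matrix, file_to_row, word_to_col = build_binary_matrix(docs)
print(word_to_col)
print(matrix)
```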

In [22]:
#Your code starts here    
# iterate through all the files and set a 1 for every word present in each file
for path in os.listdir(file_path):
    # process only .txt files, skipping the synonyms file
    if fnmatch.fnmatch(path, '*.txt') and path != 'synonyms.txt':
        with open(os.path.join(file_path, path), 'r') as file:
            # read the file and split the words
            words = file.read().split()
            lowercase_words = np.char.lower(words)
            for word in lowercase_words:
                matrix[dict_of_files[path]][dict_of_words[word]] = 1

print('Dictionary of unique words\n',dict_of_words)
print(matrix)

#Your code ends here

Dictionary of unique words
 {'this': 0, 'my': 1, 'is': 2, 'book': 3, 'pen': 4, 'interesting': 5}
[[1. 1. 1. 1. 0. 0.]
 [1. 1. 1. 0. 1. 0.]
 [0. 1. 1. 1. 0. 1.]]


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o4.png?raw=true)


# Step 6 Ask for a user Query

For the user query, create a column vector whose length equals the number of unique words in the set.
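To show where this vector is heading, here is a small sketch of how it can later be filled from a query string. The function `query_to_vector` and the example mapping `word_to_col` are assumptions for illustration; in the notebook the vector starts as all zeros and is filled only after the query has been expanded.

```python
import numpy as np

def query_to_vector(query, word_to_col):
    """Return a (vocabulary size x 1) column vector with 1 for each known query word."""
    vector = np.zeros((len(word_to_col), 1))
    for word in query.lower().split():
        col = word_to_col.get(word)  # None when the word is outside the vocabulary
        if col is not None:
            vector[col, 0] = 1
    return vector

# Example mapping (assumed); in the notebook this comes from the unique word set.
word_to_col = {"book": 0, "is": 1, "my": 2, "pen": 3, "this": 4}
print(query_to_vector("my PEN please", word_to_col).ravel())
```

Words outside the vocabulary, such as "please" here, are ignored.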

In [23]:
#Your code starts here    
# creating a column matrix of size unique_word_set x 1 and initializing it with zeros
column_matrix = np.zeros((len(unique_word_set), 1))
print('Column matrix:', column_matrix)

#Your code ends here

Column matrix: [[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]]


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o5.png?raw=true)


In [24]:
query = input("\nWrite something for searching:  ")
print("Query is:", query)

Query is: some ballpoint


### Expected Output

![Expected Output of query](images/Query.png)

# Step 7 Load Synonyms

In [25]:
synonym_file_path = os.path.join(file_path, 'synonyms.txt')  # portable path, works on any OS
synonyms_dict = {} # dictionary to store synonyms
with open(synonym_file_path, 'r') as file:
    for line in file:
        if ':' not in line:
            continue  # skip blank or malformed lines
        key, values = line.split(':', 1)
        # strip whitespace around the key and around every synonym
        synonyms_dict[key.strip()] = [val.strip() for val in values.split(',')]
        
#your code ends here
print("\nSynonyms Dictionary\n")
synonyms_dict


Synonyms Dictionary



{'write': ['compose', 'draft', 'author', 'create'],
 'file': ['document', 'record', 'dossier', 'report'],
 'example': ['illustration', 'instance', 'sample', 'demonstration'],
 'query': ['question', 'inquiry', 'search', 'request'],
 'synonym': ['equivalent', 'substitute', 'alternate', 'replacement'],
 'retrieve': ['fetch', 'recover', 'obtain', 'bring back'],
 'system': ['framework', 'structure', 'organization', 'arrangement'],
 'search': ['seek', 'look for', 'explore', 'examine'],
 'lost': ['misplaced', 'missing', 'forgotten', 'mislaid'],
 'pen': ['write', 'ink', 'ballpoint', 'fountain'],
 'paper': ['document', 'sheet', 'form', 'letter'],
 'book': ['novel', 'volume', 'publication', 'tome'],
 'read': ['peruse', 'scan', 'study', 'look at'],
 'interesting': ['fascinating', 'engaging', 'intriguing', 'absorbing'],
 'computer': ['machine', 'device', 'processor', 'laptop'],
 'software': ['program', 'application', 'app', 'platform']}

### Expected Output

![Synonym Dict Example](images/Synonym_dict.png)
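The loading code in this step implies that synonyms.txt holds one entry per line in the form `word: synonym1, synonym2, ...`. The sketch below parses that assumed format from an in-memory string so it can be tried without a file. The function name `parse_synonyms` and the sample lines are illustrative.

```python
import io

def parse_synonyms(lines):
    """Parse lines of the form 'word: syn1, syn2' into a dict of lists."""
    synonyms = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, values = line.split(":", 1)
        synonyms[key.strip().lower()] = [
            v.strip().lower() for v in values.split(",") if v.strip()
        ]
    return synonyms

# Sample content in the assumed format, including a blank line.
sample = io.StringIO("pen: write, ink, ballpoint, fountain\n\nbook: novel, volume\n")
print(parse_synonyms(sample))
```

Because the function accepts any iterable of lines, the same code works with an open file object.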

# Step 8 Extend User Query

In [27]:
expanded_query = []
# Write code to expand the query using synonyms
#your code starts here
for word in query.split():
    for key,values in synonyms_dict.items():
        if word in values:
            expanded_query.append(key)
            break
    expanded_query.append(word)

#your code ends here

print("Expanded Query")
print(expanded_query)

Expanded Query
['some', 'pen', 'ballpoint']


### Expected Output

![Extended Query](images/Expanded_Query.png)
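The expansion loop in this step scans every dictionary entry for each query word. One alternative, sketched here under the assumption that each synonym should map to the first headword that lists it, is to build a reverse index once and look words up directly. The names `build_reverse_index` and `expand_query` are hypothetical.

```python
def build_reverse_index(synonyms):
    """Map each synonym to the first headword that lists it."""
    reverse = {}
    for headword, values in synonyms.items():
        for value in values:
            reverse.setdefault(value.lower(), headword)  # keep the first match
    return reverse

def expand_query(query, reverse):
    """Return query words, each preceded by its headword when it is a known synonym."""
    expanded = []
    for word in query.lower().split():
        if word in reverse:
            expanded.append(reverse[word])
        expanded.append(word)
    return expanded

# Small sample dictionary (a subset of the entries shown in Step 7).
synonyms = {"pen": ["write", "ink", "ballpoint"], "book": ["novel", "volume"]}
reverse = build_reverse_index(synonyms)
print(expand_query("some ballpoint", reverse))  # ['some', 'pen', 'ballpoint']
```

Unlike the loop in this step, the sketch lowercases the query first, so "Ballpoint" and "ballpoint" expand the same way.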

Now use the expanded query to find the relevant documents.

In [None]:
# Check whether each word of the query exists in the set of unique words
# If it exists, set the entry for that word in the query column vector to 1


In [29]:
#Your code starts here    
for word in expanded_query:
    lower_word = word.lower()
    if lower_word in dict_of_words:
        column_matrix[dict_of_words[lower_word]] = 1

print(column_matrix)
#Your code ends here

[[0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]]


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o6.png?raw=true)


# Step 9 Display Resultant Vector

Display:
1. Resultant vector.
2. Max value in resultant vector.
3. Index of max value in resultant vector.
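A worked sketch of the scoring, using the matrix and query vector values printed in the example outputs of the earlier steps. Multiplying the term-document matrix by the query vector gives one score per file, equal to the number of query words that file contains. The zero-score check is an addition of this sketch, not part of the assignment.

```python
import numpy as np

# Rows are files, columns are vocabulary words (values copied from the example outputs).
matrix = np.array([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1],
])
query_vector = np.array([[0], [0], [0], [0], [1], [0]])  # only 'pen' is set

# One score per file: the number of query words present in that file.
scores = (matrix @ query_vector).ravel()
best = int(np.argmax(scores))  # on ties, argmax returns the first maximum

if scores[best] == 0:
    print("No file matches the query.")
else:
    print("scores:", scores, "best row:", best, "best score:", scores[best])
```

Flattening with `ravel()` makes `scores[best]` a plain number rather than a one-element array.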


In [30]:
#Your code starts here  
result = np.dot(matrix, column_matrix)
print(result)
index = np.argmax(result)
print('max_index:', index)
print('max_value:', result[index])
#Your code ends here

[[0.]
 [1.]
 [0.]]
max_index: 1
max_value: [1.]


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o7.png?raw=true)


# Step 10 Display the contents of file


Write the code to identify the name of the file with the maximum value in the resultant vector and display its contents.

In [31]:
#Your code starts here    
def find_file_name(dictionary, target_index):
    for file_name, index in dictionary.items():
        if index == target_index:
            return file_name
    return None

file_name = find_file_name(dict_of_files, index)
with open(os.path.join(file_path, file_name), 'r') as file:
    print(file.read())
#Your code ends here

This is my pen


Congratulations! You are now able to build your own small IRS, which works even if the query does not contain exactly the same words as the files.