# Assignment 1 IRS

Information retrieval is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. This assignment gives you a first taste of an information retrieval system (IRS): follow the steps below and, by the end, you will have built a small IRS of your own. Let's start.

In [1]:
# required imports
import numpy as np
import fnmatch
import os


Suppose we have three files containing the following data:

### File Contents

<img src="1.png"/>
<img src="2.png"/>
<img src="3.png"/>

# Step 1 Create Files with Dummy data

Create a few files with dummy data of your own choice, as shown above.
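
As a rough sketch of this step, the snippet below writes three small text files. Only the content of `file_1.txt` appears in this notebook's output; the other two sentences, the `sample_texts` and `demo_dir` names, and the use of a temporary directory in place of the real `files` folder are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical sample sentences; replace them with any text you like.
sample_texts = {
    "file_1.txt": "This is my book",
    "file_2.txt": "This is my pen",
    "file_3.txt": "My book is interesting",
}

# A temporary directory stands in for the notebook's "files" folder.
demo_dir = Path(tempfile.mkdtemp()) / "files"
demo_dir.mkdir()

for name, text in sample_texts.items():
    (demo_dir / name).write_text(text, encoding="utf-8")

created = sorted(p.name for p in demo_dir.iterdir())
print(created)
```

To use it for the assignment, point `demo_dir` at the `files` folder next to the notebook instead of a temporary directory.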

# Step 2 Traverse Directories

Now traverse the directory and store every matching file in a dict variable (files_dict).
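
A minimal, self-contained sketch of such a traversal is shown here. It assumes a throwaway directory with made-up file names (`notes.md` is included only to show a non-matching file being skipped), and `demo_files_dict` is an illustrative name rather than the variable used in the cells below.

```python
import fnmatch
import os
import tempfile

# Build a throwaway directory so the sketch runs on its own (assumption:
# the real notebook reads from a "files" folder next to it).
demo_dir = tempfile.mkdtemp()
for name in ["file_2.txt", "file_1.txt", "notes.md"]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("dummy")

demo_files_dict = {}
# sorted() makes the file-to-index mapping reproducible, since
# os.listdir() returns entries in arbitrary order.
for name in sorted(os.listdir(demo_dir)):
    if fnmatch.fnmatch(name, "file_*.txt"):
        # Number only the matching files, so indices stay contiguous.
        demo_files_dict[name] = len(demo_files_dict)

print(demo_files_dict)  # {'file_1.txt': 0, 'file_2.txt': 1}
```

Numbering only the matching files keeps each index usable as a row of the term-document matrix built later.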

In [2]:
# Here we have initialized some variables; you can add more if required.

file_count = 0             # number of matching files found
files_dict = {}            # maps each file name to its index
unique_word_set = set()    # all the unique words across the files


In [3]:
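#Your code starts here
filename_pattern = "file_*.txt"
# Directory containing the notebook (assumes the notebook is run from that directory)
abs_notebook_path = os.path.dirname(os.path.realpath("IRS.ipynb"))

files_dir_path = os.path.join(abs_notebook_path, "files")
# Sort the listing so the file-to-index mapping is reproducible
dir_file_list = sorted(os.listdir(files_dir_path))

for file in dir_file_list:
    if fnmatch.fnmatch(file, filename_pattern):
        # Index only matching files so every index is a valid matrix row
        files_dict[file] = file_count
        file_count += 1
#Your code ends here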

Displaying the count of files.

In [4]:
print("\nTotal number of files\n", file_count)


Total number of files
 3


Displaying the dictionary containing all files.

In [5]:
print("\nDictionary containing files\n", files_dict)


Dictionary containing files
 {'file_1.txt': 0, 'file_2.txt': 1, 'file_3.txt': 2}


# Step 3 Extract Unique Vocabulary

In [6]:
# Write code to collect the unique words from every file into a set and print them

In [7]:
#Your code starts here
def get_words_from_files(filename):
    """Return the lowercased words of a file as a list."""
    with open(f"{files_dir_path}/{filename}", "r") as f:
        # split() with no argument splits on any whitespace, including newlines
        return [w.lower() for w in f.read().split()]

for file in dir_file_list:
    if fnmatch.fnmatch(file, filename_pattern):
        data = get_words_from_files(file)
        for word in data:
            unique_word_set.add(word)
print("unique words in files\n",unique_word_set)
print("count of files\n", file_count)
#Your code ends here

unique words in files
 {'this', 'book', 'interesting', 'pen', 'my', 'is'}
count of files
 3


### Expected Output

<img src="4.png"/>

# Step 4 Create Term Document Matrix

Create the term-document matrix using the bag-of-words approach, and display its contents before and after filling.
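
To make the idea concrete, here is a small sketch on two in-memory sentences rather than files. The `docs`, `vocab`, `col_of`, and `matrix` names are illustrative assumptions, and the vocabulary is sorted only so that the column order is predictable.

```python
import numpy as np

docs = ["this is my book", "this is my pen"]  # hypothetical documents

# Sorting the vocabulary fixes the column order.
vocab = sorted({word for doc in docs for word in doc.split()})
col_of = {word: j for j, word in enumerate(vocab)}

# Rows are documents, columns are vocabulary words.
matrix = np.zeros((len(docs), len(vocab)), dtype=int)
print(matrix)  # all zeros initially

for i, doc in enumerate(docs):
    for word in doc.split():
        matrix[i, col_of[word]] = 1  # binary presence, not a count

print(vocab)
print(matrix)
```

Replacing `= 1` with `+= 1` would store word counts instead of presence flags.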

In [8]:
# Create the term-document matrix so that the columns are the unique words and the rows are the files
# Write code to count the appearances of every unique word across all the files and store them in a dictionary of words

In [9]:
#Your code starts here
term_doc_matrix = np.zeros((file_count,len(unique_word_set)))

unique_word_dict = {word:idx for idx, word in enumerate(unique_word_set)}

print("TERM DOC MATRIX INITIALLY\n",term_doc_matrix)
print("dictionary of unique words\n", unique_word_dict)
print("dictionary of files\n", files_dict)
#Your code ends here

TERM DOC MATRIX INITIALLY
 [[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
dictionary of unique words
 {'this': 0, 'book': 1, 'interesting': 2, 'pen': 3, 'my': 4, 'is': 5}
dictionary of files
 {'file_1.txt': 0, 'file_2.txt': 1, 'file_3.txt': 2}


### Expected Output

<img src="5.png"/>

# Step 5 Fill Term Document Matrix

In [10]:
# Fill the term-document matrix by checking whether each unique word exists in a file
# If it exists, set the matching entry to 1 (e.g. term_doc_matrix[file][word] = 1)
# Do the same for all the files present in the directory

In [11]:
#Your code starts here

for file in dir_file_list:
    if fnmatch.fnmatch(file, filename_pattern):
        # Words are already lowercased by get_words_from_files
        for word in get_words_from_files(file):
            if word in unique_word_dict:
                term_doc_matrix[files_dict[file], unique_word_dict[word]] = 1
print("TERM DOC MATRIX after filling\n",term_doc_matrix)
#Your code ends here

TERM DOC MATRIX after filling
 [[1. 1. 0. 0. 1. 1.]
 [1. 0. 0. 1. 1. 1.]
 [0. 1. 1. 0. 1. 1.]]


### Expected Output

<img src="6.png"/>

# Step 6 Ask for a user Query

In [12]:
# For the user query, make a column vector whose length equals the number of unique words in the set

In [13]:
#Your code starts here    
column_vector = np.zeros((len(unique_word_set),1))
print("col_vector initially\n", column_vector)
#Your code ends here

col_vector initially
 [[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]]


### Expected Output

<img src="7.png"/>

In [14]:
query = input("\nWrite something for searching:  ")
# Check every word of query if it exists in the set of unique words or not
# If exists then increment the count of that word in word dictionary
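
The counting described in these comments can be sketched as a small function that takes the query as an argument instead of reading it with `input()`. The `query_to_vector` helper, the `demo_index` vocabulary, and the sample query are hypothetical and exist only for this illustration.

```python
import numpy as np

def query_to_vector(query, word_index):
    """Return a column vector of per-word counts for the query."""
    vector = np.zeros((len(word_index), 1))
    for word in query.lower().split():
        if word in word_index:  # ignore out-of-vocabulary words
            vector[word_index[word], 0] += 1
    return vector

# Hypothetical vocabulary index, fixed here so the result is reproducible.
demo_index = {"book": 0, "is": 1, "my": 2, "pen": 3}
demo_vector = query_to_vector("My red pen is my pen", demo_index)
print(demo_vector.ravel())  # [0. 1. 2. 2.]
```

Words that are not in the vocabulary, such as "red" here, simply contribute nothing to the vector.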


In [15]:
#Your code starts here    
# Lowercase once and split on any whitespace so extra spaces are ignored
for word in query.lower().split():
    if word in unique_word_set:
        column_vector[unique_word_dict[word]][0] += 1

print(column_vector)
#Your code ends here

[[1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]]


### Expected Output

<img src="8.png"/>

# Step 7 Display Resultant Vector

Display the following:
1. Resultant vector.
2. Max value in resultant vector.
3. Index of max value in resultant vector.
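
One possible way to obtain these three values is a matrix-vector product. The sketch below copies the filled term-document matrix and the query vector from the outputs printed earlier in this notebook; `tdm`, `q`, `scores`, `best_score`, and `best_index` are illustrative names.

```python
import numpy as np

# Values copied from the filled term-document matrix and the query vector
# shown in this notebook's outputs.
tdm = np.array([[1, 1, 0, 0, 1, 1],
                [1, 0, 0, 1, 1, 1],
                [0, 1, 1, 0, 1, 1]], dtype=float)
q = np.array([[1], [1], [1], [0], [0], [1]], dtype=float)

scores = tdm @ q                   # one score per file, shape (3, 1)
best_score = scores.max()
best_index = int(scores.argmax())  # first index when several rows tie

print(scores.ravel(), best_score, best_index)
```

Each score is the number of query words found in that file, weighted by how often the word occurs in the query.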


In [16]:
#Your code starts here  
# The score of each file is the dot product of its row with the query vector,
# so a word repeated in the query is still matched
resultant_matrix = term_doc_matrix @ column_vector
max_value = np.amax(resultant_matrix)
# np.argmax returns the first index of the maximum, so ties go to the earliest file
max_value_index = int(np.argmax(resultant_matrix))
print(resultant_matrix)
print("Maximum in resultant is: ", max_value)
print("Index of maximum in resultant is: ", max_value_index)
#Your code ends here

[[3.]
 [2.]
 [3.]]
Maximum in resultant is:  3.0
Index of maximum in resultant is:  0


### Expected Output

<img src="9.png"/>

# Step 8 Display the contents of file


In [17]:
#Write the code to identify the file_name having maximum value in the resultant vector and display its contents.

In [18]:
#Your code starts here    
for file_name,file_idx in files_dict.items():
    if file_idx == max_value_index:
        print("The file with maximum resultant is: ", file_name)
        print("The content of {} is: ".format(file_name))
        with open(f"{files_dir_path}/{file_name}", "r") as f:
            print(f.read())
#Your code ends here

The file with maximum resultant is:  file_1.txt
The content of file_1.txt is: 
This is my book
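
Note that the resultant vector in Step 7 gives file_1.txt and file_3.txt the same score of 3, while only the first of them is displayed. A hedged sketch of how all tied files could be listed follows; `demo_scores`, `demo_files`, `index_to_file`, and `best_files` are illustrative names, with values copied from the outputs above.

```python
import numpy as np

demo_scores = np.array([[3.0], [2.0], [3.0]])  # resultant vector from Step 7
demo_files = {"file_1.txt": 0, "file_2.txt": 1, "file_3.txt": 2}

# Invert the mapping so a row index leads back to its file name.
index_to_file = {idx: name for name, idx in demo_files.items()}

# Indices of every row whose score equals the maximum.
tied = np.flatnonzero(demo_scores.ravel() == demo_scores.max())
best_files = [index_to_file[int(i)] for i in tied]
print(best_files)  # ['file_1.txt', 'file_3.txt']
```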


Congratulations! You are now able to build your own small IRS.