# Assignment 2 IRS

Information retrieval is the process of obtaining, from a collection of information system resources, those that are relevant to an information need. This assignment gives you a first taste of an information retrieval system (IRS). Follow the steps listed below and, by the end, you will have built a small IRS of your own. So, let's start.

In [243]:
# required imports
import numpy as np
import fnmatch
import os
import re

Suppose we have 3 files containing the following data:

### File Contents

!["This is my book" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f1.png?raw=true)
!["This is my pen" - File 2](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f2.png?raw=true)
!["This book is interesting" - File 3](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f3.png?raw=true)

# Step 1 Create Files with Dummy data

Create a few files containing dummy data of your own choice, as shown above.
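The dummy files can also be created from code. The following is a minimal sketch, not part of the assignment template: the helper name `create_dummy_files`, the `f1.txt`, `f2.txt`, ... naming, the temporary directory, and the sample sentences are all assumptions made for illustration.

```python
import os
import tempfile

def create_dummy_files(directory, contents):
    """Write each string in `contents` to f1.txt, f2.txt, ... inside `directory`."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for i, text in enumerate(contents, start=1):
        path = os.path.join(directory, f"f{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths

# Hypothetical location: a fresh temporary directory, so no existing file is overwritten.
demo_dir = tempfile.mkdtemp()
created = create_dummy_files(
    demo_dir,
    ["This is my book", "This is my pen", "This book is interesting"],
)
print(created)
```

To reuse the files in the later steps, pass `demo_dir` (or any directory of your choice) as the directory to traverse.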

# Step 2 Traverse Directories

Now traverse the directory and all of its subdirectories, and store every file path in a dict variable (files_dict).

In [244]:
# Here we have initialized some variables; you can add more if required.
file_count = 0             # number of files found
files_dict = {}            # maps each file id to its file path
unique_word_set = set()    # all the unique words across the files
words_dict = {}            # maps each unique word to its column index in the term-document matrix
ids_dict = {}              # maps each column index back to its word
word_count = {}            # total number of occurrences of each word across all files

In [245]:
#Your code starts here
# traverse_directories takes 'directory', the path to the directory to scan,
# and uses os.walk to visit that directory and all of its subdirectories recursively.
# For each file it builds the full path by joining the directory name and the
# file name with os.path.join, then calls process_file on that path.
def traverse_directories(directory):
    for root, dirs, files in os.walk(directory):  #traversing the directory
        # Process each file in the directory
        for file in files:
            file_path = os.path.join(root, file)  #making file path for each file by joining it with directory name
            process_file(file_path)

# process_file takes 'file_path', a string holding the path to a file. It:
# a) stores the path in files_dict, using the current file id as the key
# b) increments the global file_count by 1

def process_file(file_path):
    global file_count
    global files_dict
    file_id = file_count
    # Assign the file path to the files_dict dictionary
    files_dict[file_id] = file_path
    # Increment the file_count variable
    file_count += 1

directory = "C:\\Users\\User\\OneDrive\\Desktop\\check"
traverse_directories(directory)
#Your code ends here
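`fnmatch` is imported at the top of the notebook but not used in the traversal. If the directory may contain files other than the text files, one option is to filter file names by pattern while walking. The sketch below is only an illustration under assumptions: the helper name `collect_text_files`, the `*.txt` pattern, and the demo file names are hypothetical, and it returns a new dictionary instead of updating the global variables.

```python
import fnmatch
import os
import tempfile

def collect_text_files(directory, pattern="*.txt"):
    """Return {file_id: file_path} for files under `directory` whose name matches `pattern`."""
    found = {}
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # visit subdirectories in a stable order
        for name in sorted(fnmatch.filter(files, pattern)):
            found[len(found)] = os.path.join(root, name)
    return found

# Small self-contained demo in a temporary directory (hypothetical file names).
demo_dir = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "notes.md"):
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
        f.write("dummy")

text_files = collect_text_files(demo_dir)
print(text_files)  # only the two .txt files are kept
```

Sorting the names also makes the file ids reproducible, since `os.walk` does not promise any particular order for the files in a directory.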

Displaying the count of files.

In [246]:
print("\nTotal Number  of files\n", file_count)


Total Number  of files
 3


Displaying Dictionary containing all files.

In [247]:
print("\nDictionary containing  files\n", files_dict)


Dictionary containing  files
 {0: 'C:\\Users\\User\\OneDrive\\Desktop\\check\\f1.txt', 1: 'C:\\Users\\User\\OneDrive\\Desktop\\check\\f2.txt', 2: 'C:\\Users\\User\\OneDrive\\Desktop\\check\\f3.txt'}


# Step 3 Extract Unique Vocabulary

Write code to collect the unique words from all the files into a set and print them.

In [248]:
#Your code starts here    
def extract_unique_vocabulary():
    global unique_word_set
    # Read the content of every file
    for file_id, file_path in files_dict.items():
        with open(file_path, 'r') as f:
            content = f.read()
        words = re.findall(r'\w+', content)
        # Convert the words to lowercase
        words = [word.lower() for word in words]
        # Add the words to unique_word_set using set union
        unique_word_set |= set(words)
        # Count the frequency of each word in the file
        for word in words:
            # If the word has not been seen before, add it to the word_count dictionary with a count of 1
            if word not in word_count:
                word_count[word] = 1
            # If the word has been seen before, increment its count
            else:
                word_count[word] += 1
extract_unique_vocabulary()
#print(word_count)
print("\nUnique vocabulary in all files:")
print(unique_word_set)
'''
for word in sorted(unique_word_set):
    print(word)
'''
print("\nTotal Number of unique words\n", len(unique_word_set))
print(word_count)
#Your code ends here


Unique vocabulary in all files:
{'i', 'play', 'pakistan', 'team', 'cricket'}

Total Number of unique words
 5
{'i': 1, 'play': 1, 'cricket': 2, 'team': 2, 'pakistan': 2}


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o1.png?raw=true)
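The counting loop in this step can also be written with `collections.Counter`. The sketch below works on in-memory strings instead of files so that it runs on its own; the helper name `tokenize` and the three sample strings are assumptions chosen for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase `text` and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

# Hypothetical stand-ins for the contents of three files.
docs = ["I play cricket", "Cricket team Pakistan", "Pakistan team"]

counts = Counter()
for doc in docs:
    counts.update(tokenize(doc))  # add this document's token counts

vocabulary = set(counts)          # the keys of the counter are the unique words
print(sorted(vocabulary))
print(dict(counts))
```

`Counter.update` adds counts rather than replacing them, so it plays the role of the explicit `if word not in word_count` branch.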


# Step 4 Create Term Document Matrix

Create a term-document matrix using the bag-of-words approach, and display its contents both initially and after it has been filled.

- Create the term-document matrix so that the columns are the unique words and the rows are the files
- Write code to count the appearances of every unique word across all the files and store the result in a dictionary of words

In [249]:
#Your code starts here    
# Create the term-document matrix: the columns are the unique words and the rows are the files
term_doc_matrix = np.zeros((file_count, len(unique_word_set)))
print(term_doc_matrix)

# Assign each unique word a column index and store the mapping in words_dict
def dict_for_unique_words():
    i = 0
    for word in unique_word_set:
        words_dict[word]=i
        i+=1
dict_for_unique_words()
print("Dictionary of Unique Words")
print(words_dict)
print("Dictionary of Files")
print(files_dict)
#Your code ends here

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
Dictionary of Unique Words
{'i': 0, 'play': 1, 'pakistan': 2, 'team': 3, 'cricket': 4}
Dictionary of Files
{0: 'C:\\Users\\User\\OneDrive\\Desktop\\check\\f1.txt', 1: 'C:\\Users\\User\\OneDrive\\Desktop\\check\\f2.txt', 2: 'C:\\Users\\User\\OneDrive\\Desktop\\check\\f3.txt'}


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o2.png?raw=true)

# Step 5 Fill Term Document Matrix

- Fill the term-document matrix by checking whether each unique word exists in a file
- If it exists, put a 1 in term_doc_matrix (e.g. TERM_DOC_MATRIX[file][word] = 1)
- Do the same for all the files present in the directory

In [250]:
#Your code starts here    
# Create a dictionary that maps each word ID back to its word
for w,i in words_dict.items():
    ids_dict[i]=w
#print(ids_dict)

print("Dictionary of unique words")
print(words_dict)

# Populate the term-document matrix
for file_id, file_path in files_dict.items():
    # Use a context manager so the file is closed after reading
    with open(file_path, 'r') as f:
        file_words = [w.lower() for w in re.findall(r'\w+', f.read())]
    print(file_words)
    for word, word_id in words_dict.items():
        if word in file_words:
            # Set the value at the corresponding cell in the term-document matrix to 1
            term_doc_matrix[file_id][word_id] = 1

# Print the term-document matrix
print('\nTerm Document Matrix')
print(term_doc_matrix)


#Your code ends here

Dictionary of unique words
{'i': 0, 'play': 1, 'pakistan': 2, 'team': 3, 'cricket': 4}
['i', 'play', 'cricket']
['cricket', 'team', 'pakistan']
['pakistan', 'team']

Term Document Matrix
[[1. 1. 0. 0. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 1. 1. 0.]]


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o4.png?raw=true)
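The iteration order of a set of strings is not guaranteed to be the same from one run of Python to the next, so the column order of the matrix can change between runs. The sketch below shows one way to make it reproducible by sorting the vocabulary before assigning column indices. It uses in-memory token lists so that it runs on its own; the names `token_lists`, `vocab`, `word_to_col`, and `matrix` are assumptions for illustration.

```python
import numpy as np

# Hypothetical tokenized contents of three files.
token_lists = [
    ["i", "play", "cricket"],
    ["cricket", "team", "pakistan"],
    ["pakistan", "team"],
]

# Sorting fixes the column order regardless of set iteration order.
vocab = sorted(set(word for tokens in token_lists for word in tokens))
word_to_col = {word: col for col, word in enumerate(vocab)}

# Rows are files, columns are words; a cell is 1 if the word occurs in the file.
matrix = np.zeros((len(token_lists), len(vocab)))
for row, tokens in enumerate(token_lists):
    for word in set(tokens):
        matrix[row, word_to_col[word]] = 1

print(vocab)
print(matrix)
```

Iterating over the words of each file, rather than over the whole vocabulary, also avoids testing every vocabulary word against every file.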


# Step 6 Ask for a user Query

For the user query, create a column vector whose length equals the number of unique words in the set.

In [251]:
#Your code starts here    
col_vector= np.zeros(len(unique_word_set))             # Initialize the column vector to all zeros
col_vector=col_vector.reshape(len(unique_word_set),1)  # reshape used to convert it into column vector
print(col_vector)
#Your code ends here

[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o5.png?raw=true)


In [252]:
query = input("\nWrite something for searching  ")
# Check whether each word of the query exists in the set of unique words.
# If it does, increment the count of that word in the word dictionary.
query_words=query.split()                # Split the query into individual words on any whitespace
# Convert the words to lowercase
query_words = [word.lower() for word in query_words]
#print(query_words)                      # uncomment to see the query words
for word in query_words:
    if word in unique_word_set:
        word_count[word]+=1


KeyboardInterrupt: Interrupted by user

In [None]:
#Your code starts here  
# Check every word of the query to see if it exists in the set of unique words
for word in query_words:
    # If a word exists in the set, increment the value at the corresponding index in the column vector
    if word in unique_word_set:
        col_vector[words_dict[word]]+=1
print(col_vector)

#Your code ends here

### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o6.png?raw=true)
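Building the query vector can be wrapped in a small function so that it can be tested without typing input. This is a sketch under assumptions: the function name `build_query_vector` and the sample `word_to_col` mapping are hypothetical, and the query is tokenized with the same `\w+` pattern used for the files so that punctuation in the query does not prevent a match.

```python
import re
import numpy as np

def build_query_vector(query, word_to_col):
    """Return a column vector counting how often each vocabulary word occurs in `query`."""
    vector = np.zeros((len(word_to_col), 1))
    for word in re.findall(r"\w+", query.lower()):
        if word in word_to_col:          # ignore words outside the vocabulary
            vector[word_to_col[word], 0] += 1
    return vector

# Hypothetical word-to-column mapping for a five-word vocabulary.
word_to_col = {"cricket": 0, "i": 1, "pakistan": 2, "play": 3, "team": 4}
query_vector = build_query_vector("Pakistan cricket team!", word_to_col)
print(query_vector.ravel())
```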


# Step 7 Display Resultant Vector

Display 
1. Resultant vector.
2. Max value in resultant vector.
3. Index of max value in resultant vector.


In [255]:
#Your code starts here  
#print(term_doc_matrix.shape,col_vector.shape)        #to display the shapes of the matrix and the vector
# Compute the dot product of the term-document matrix and the column vector
resultant_vector = np.dot(term_doc_matrix, col_vector)
res=np.max(resultant_vector)                         #np.max() returns the maximum value in the resultant vector
max_index=np.argmax(resultant_vector)                #np.argmax() returns the index of that maximum value

print("Result")
print(resultant_vector)
print("Max Index")
print(max_index)
print("Max")
print(res)
#print(resultant_vector.shape)                       #for order of resultant vector
#Your code ends here

Result
[[0.]
 [0.]
 [0.]]
Max Index
0
Max
0.0


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o7.png?raw=true)
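When no query word is in the vocabulary, every score is 0 and `np.argmax` still returns index 0, so the first file would be reported even though nothing matched. The sketch below shows one possible guard; the function name `best_match` and the sample matrix and query are assumptions for illustration.

```python
import numpy as np

def best_match(term_doc_matrix, query_vector):
    """Return (index of best-scoring file or None, scores) for a query column vector."""
    scores = term_doc_matrix @ query_vector   # one score per file (row)
    if not scores.any():                      # all scores are zero: no file matched
        return None, scores
    return int(np.argmax(scores)), scores

# Hypothetical 3-file, 5-word binary term-document matrix and query vector.
matrix = np.array([[1, 1, 0, 1, 0],
                   [1, 0, 1, 0, 1],
                   [0, 0, 1, 0, 1]], dtype=float)
query = np.array([[1], [0], [1], [0], [1]], dtype=float)

best, scores = best_match(matrix, query)
print(best, scores.ravel())

no_hit, _ = best_match(matrix, np.zeros((5, 1)))
print(no_hit)
```

Each score is the number of query-word occurrences matched by that file, so the second file (index 1), which contains all three query words, scores highest here.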


# Step 8 Display the contents of file


Write the code to identify the file that has the maximum value in the resultant vector and display its contents.

In [None]:
#Your code starts here    
file_name=files_dict[max_index]                  
with open(file_name, 'r') as file:
    content = file.read()
    print(content)

#Your code ends here

Congratulations! You are now able to build your own small IRS.