In [1]:
from google.colab import drive
drive.mount('/content/drive/')

%cd '/content/drive/My Drive/Colab Notebooks/Essential-NLP-master/Essential-NLP-master/cisi.zip (Unzipped Files)'

Mounted at /content/drive/
/content/drive/My Drive/Colab Notebooks/Essential-NLP-master/Essential-NLP-master/cisi.zip (Unzipped Files)


# Chapter 3 Document Similarity TF-IDF

Exploring the book **Getting Started with Natural Language Processing**

https://www.manning.com/books/getting-started-with-natural-language-processing

https://www.baeldung.com/cs/ml-similarities-in-text#:~:text=The%20simplest%20way%20to%20compute%20the%20similarity%20between,of%20all%20the%20word%20vectors%20in%20the%20document.

The idea behind TF-IDF is that we first compute the number of documents in which a word appears in. If a word appears in many documents, it will be less relevant in the computation of the similarity, and vice versa. We call this value the inverse document frequency or IDF, and we can compute it as:

$$
idf(word) = log(\frac{N}{|\{d \in C : word \in d\}|})
$$

In the formula, C is the corpus, N is the total number of documents in it, and the denominator is the number of documents that contain our word.

We can compute the IDF just once, as a preprocessing step, for each word in our corpus and it will tell us how significant that word is in the corpus itself.

At this point, instead of using the raw word counts, we can compute the document vectors by weighing it with the IDF. For each document, we’ll compute the count for each word, transform it into a frequency (that is, dividing the count by the total number of words in the document), and then multiply by the IDF.

Given that, the final score for each word will be:
$$
score(word) = frequency(word) \cdot idf(word)
$$


In [2]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def read_documents():
    f = open("cisi/CISI.ALL")
    merged = ""
    
    for a_line in f.readlines():
        if a_line.startswith("."):
            merged += "\n" + a_line.strip()
        else:
            merged += " " + a_line.strip()
    
    documents = {}

    content = ""
    doc_id = ""

    for a_line in merged.split("\n"):
        if a_line.startswith(".I"):
            doc_id = a_line.split(" ")[1].strip()
        elif a_line.startswith(".X"):
            documents[doc_id] = content
            content = ""
            doc_id = ""
        else:
            content += a_line.strip()[3:] + " "
    f.close()
    return documents

In [None]:
documents = read_documents()
print(len(documents))
print(documents.get("1"))

1460
 18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. 


In [None]:
docs_list = [documents[i] for i in documents.keys()]

In [None]:
def read_queries():
    f = open("cisi/CISI.QRY")
    merged = ""
    
    for a_line in f.readlines():
        if a_line.startswith("."):
            merged += "\n" + a_line.strip()
        else:
            merged += " " + a_line.strip()
    
    queries = {}

    content = ""
    qry_id = ""

    for a_line in merged.split("\n"):
        if a_line.startswith(".I"):
            if not content=="":
                queries[qry_id] = content
                content = ""
                qry_id = ""
            qry_id = a_line.split(" ")[1].strip()
        elif a_line.startswith(".W") or a_line.startswith(".T"):
            content += a_line.strip()[3:] + " "
    queries[qry_id] = content
    f.close()
    return queries

In [None]:
queries = read_queries()
print(len(queries))
print(queries.get("1"))

112
What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles? 


In [None]:
vectorizer = TfidfVectorizer()

def process_tfidf_similarity(base_document, docs_list):
  docs_list=[base_document]+docs_list
  embeddings = vectorizer.fit_transform(docs_list)
  cosine_similarities = cosine_similarity(embeddings[0], embeddings[1:]).flatten()
  index = np.argmax(cosine_similarities)

  print("Most similar document by TF-IDF with the score:", np.round(cosine_similarities[index],4))
  print('query:', base_document)
  print('document:',docs_list[index])

In [None]:
process_tfidf_similarity( queries['20'], docs_list)

Most similar document by TF-IDF with the score: 0.3164
query: Testing automated information systems. 
document: User Evaluation of Information Retrieval Systems Cleverdon, C.W. While Fairthorne may not have been the first person to recognize it, certainly, for this author, Fairthorne was the first to make explicit the fundamental problems of information retrieval systems, namely the clash between OBNA and ABNO (Only-But-Not-All and All-But-Not-Only). Although it was not until 1958 that the terms occur in Fairthorne's writings, the concept had been discussed in many meetings of the AGARD Documentation Panel and elsewhere.  Originally it was considered that to meet these two requirements, it might be necessary to have two separate systems, and the test of the UNITERM system in 1954 was based on the hypothesis that a 'Marshalling' system (e.g. U.D.C.) was fundamentally different from a 'Retrieval' system (e.g. UNITERM).  While the idea persisted in this form for some time, it gradually ev