## Assignment 3: Exploring IR & NLP
- In this assignment we are going to implement various IR techniques <b><i>from scratch</i></b>. Please do not use available libraries unless it is specified that you may.
- You are required to submit 6 different functions for this assignment. You can add additional helper functions, but only 6 will be tested.
- You will be granted 10 marks for clean, documented code.
- Student Name: Nikhil Shankar Chirakkal Sivasankaran
- ID: 9026254


In [61]:
sample_sentences = [
    "Python is a versatile programming language, python proved its importance in various domains.",
    "JavaScript is widely used for web development.",
    "Java is known for its platform independence.",
    "Programming involves writing code to solve problems.",
    "Data structures are crucial for efficient programming.",
    "Algorithms are step-by-step instructions for solving problems.",
    "Version control systems help manage code changes in collaboration.",
    "Debugging is the process of finding and fixing errors in python code.",
    "Web frameworks simplify the development of web applications.",
    "Artificial intelligence can be applied in various programming tasks."
]

#### PART A: Preprocessing (15 Marks)
- You are required to preprocess the text and apply the tokenization process.<br/>
- Preprocessing should include tokenization, normalization, stemming <b>OR</b> lemmatization, and Named Entity Recognition (NER).<br/>
- You need to make sure that Named Entities are not broken into separate tokens, but should be normalized by case-folding only. <br/>
- The output of this step should be a list of tokenized sentences: [[sentence1_token1, sentence1_token2, ...], [sentence2_token1, ...], ...] <br/>
- Please write the functionality of clean_sentences as explained in the comment (Please do comment your code at each essential step) <br/>
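
As a rough illustration of the expected output shape and of the named-entity requirement, here is a minimal standard-library sketch. It is not the NLTK/spaCy pipeline used in this notebook: `KNOWN_ENTITIES` and `tokenize_keep_entities` are hypothetical names, the entity list is hard-coded instead of coming from an NER model, and stemming/lemmatization is left out.

```python
import re

# Hypothetical entity list for illustration; a real pipeline would get these from an NER model.
KNOWN_ENTITIES = ["New York", "Python"]

def tokenize_keep_entities(sentence, entities=KNOWN_ENTITIES):
    """Tokenize a sentence, keeping each listed entity as one case-folded token."""
    word = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    # Longest entities first so a multi-word entity wins over a shorter overlap.
    ordered = sorted(entities, key=len, reverse=True)
    alternatives = [r"\b" + re.escape(e) + r"\b" for e in ordered] + [word]
    pattern = re.compile("|".join(alternatives))
    # Case-folding is the only normalization applied in this sketch.
    return [match.group(0).lower() for match in pattern.finditer(sentence)]

tokenized = [
    tokenize_keep_entities(s)
    for s in ["Python is popular in New York.", "Step-by-step guides help."]
]
print(tokenized)
# [['python', 'is', 'popular', 'in', 'new york'], ['step-by-step', 'guides', 'help']]
```

Note that "New York" survives as a single token, which is the behaviour the third bullet asks for.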

In [62]:
import nltk
import numpy as np
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
# Stop-word removal
from nltk.corpus import stopwords
# NER with NLTK
from nltk import pos_tag, ne_chunk
# NER with spaCy
import spacy
# Lemmatization
from nltk.stem import WordNetLemmatizer

print(nltk.__version__)
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

## You are allowed for PART A to use any library that would help you in the task.
def clean_sentences(sentences=None):
    ## This function takes a list of sentences as input
    ## This function returns a list of tokenized sentences
    debug = False  # set to True to print every intermediate step
    step = 0
    # Reject empty or missing input before doing any work
    if not verify_input(sentences):
        raise ValueError("Please give a list of sentences as input")
    test_doc = " ".join(sentences)
    if debug: print(f'Step {(step := step + 1)}: {test_doc}')
    tokenized_test_doc = [word_tokenize(sentence) for sentence in sentences]
    if debug: print(f'Step {(step := step + 1)}: {tokenized_test_doc}')
    if debug: display(print_sentence_summary(tokenized_test_doc))
    ner = perform_ner(tokenized_test_doc)
    ner2 = perform_ner_using_spacy(tokenized_test_doc)
    ner.extend(ner2)

    special_char_removed_tok_test_doc = remove_special_characters(tokenized_test_doc)
    if debug: print(f'Step {(step := step + 1)}: {special_char_removed_tok_test_doc}')
    lowercased_scr_tok_test_doc = perform_lowercasing(special_char_removed_tok_test_doc)
    if debug: print(f'Step {(step := step + 1)}: {lowercased_scr_tok_test_doc}')
    rsw_lw_scr_tok_test_doc = remove_stop_words(lowercased_scr_tok_test_doc)
    if debug: print(f'Step {(step := step + 1)}: {rsw_lw_scr_tok_test_doc}')
    if debug: display(print_sentence_summary(rsw_lw_scr_tok_test_doc))
    lemma_sentences = perform_lemmatization(rsw_lw_scr_tok_test_doc, ner)
    if debug: print(f'Step {(step := step + 1)}: {lemma_sentences}')
    return lemma_sentences

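def verify_input(sentences=None):
    # Valid input is a non-empty list (or tuple) of sentences. None is rejected
    # here so the caller raises ValueError instead of a TypeError from len(None).
    return isinstance(sentences, (list, tuple)) and len(sentences) > 0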
    
def perform_ner(tokenized_sentences):
    all_named_entities = []
    ner_recognized = []
    for sentence in tokenized_sentences:
        pos_tags = pos_tag(sentence)
        named_entities = ne_chunk(pos_tags)
        # Store named entities for each sentence
        all_named_entities.append(named_entities)
    
    # Extract and print named entities
    for sentence_entities in all_named_entities:
        for subtree in sentence_entities:
            if isinstance(subtree, nltk.Tree):
                entity_type = subtree.label()
                entity_value = " ".join([leaf[0] for leaf in subtree.leaves()])
                ner_recognized.append(entity_value)
    return ner_recognized

def perform_ner_using_spacy(tokenized_sentences):
    ner_recognized = []
    nlp = spacy.load("en_core_web_sm")
    for sentence in tokenized_sentences:
        doc = nlp(" ".join(sentence))
        for ent in doc.ents:
            # Store the entity text so it is comparable with the NLTK entity strings
            ner_recognized.append(ent.text)
    return ner_recognized

def perform_lemmatization(tokenized_sentences, ner_words):
    lemmatizer = WordNetLemmatizer()
    # Named entities are kept unchanged (case-folded only); every other token is lemmatized
    lemmatized_words = [
        [word if word in ner_words else lemmatizer.lemmatize(word) for word in sentence]
        for sentence in tokenized_sentences
    ]
    return lemmatized_words
    
#This function takes in tokenized sentence and removes special characters
def remove_special_characters(tokenized_sentences):
    special_characters = ['!','"','#','$','%','&','_','(',')','*','+','/',':','','<','=','>','@','[','\\',']','^','`','{','|','}','~','\t', ' ', ',', '.']
    cleaned_tokenized_sentences = []
    for sentence in tokenized_sentences:
        cleaned = []
        for word in sentence:
            if word not in special_characters:
                cleaned.append(word)
        cleaned_tokenized_sentences.append(cleaned)
    return cleaned_tokenized_sentences

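def remove_stop_words(tokenized_sentences):
    # Build the stop-word set once instead of reloading the list for every token
    english_stop_words = set(stopwords.words('english'))
    stop_words_removal = [[word for word in sentence if word not in english_stop_words] for sentence in tokenized_sentences]
    return stop_words_removal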

def perform_lowercasing(tokenized_sentences):
    lower = [ [ word.lower() for word in sentence] for sentence in tokenized_sentences]
    return lower

def print_sentence_summary(matrix):
    results = []
    for index, entry in enumerate(matrix):
        sentenceName = f'Sentence-{index+1}'
        total_words = len(entry)
        unique_words = len(set(entry))
        results.append({"Sentence": sentenceName,  "Total Words": total_words, "Unique Words": unique_words})
    # Create and return the DataFrame
    return pd.DataFrame(results)
    
cleaned_sentence = clean_sentences(sentences=sample_sentences)

3.9.1


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


#### PART B: Building IR Sentence-Word Representation (30 Marks)

- Question B-1: Create a method that takes as an input a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences. The method MUST return the <b>inverted index</b> that is sufficient to represent the document. Assume that each sentence is a document and the sentence ID starts from 1. (10)
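
To make the target structure concrete, here is a small sketch on a toy corpus. `build_inverted_index` and `toy_docs` are illustrative names rather than part of the required submission, and the use of `collections.defaultdict` is just one convenient way to write it.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the list of document IDs (starting at 1) that contain it."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs, start=1):
        # set() records each document once per token, even if the token repeats.
        for term in sorted(set(tokens)):
            index[term].append(doc_id)
    return dict(index)

toy_docs = [["web", "code", "web"], ["code", "python"]]
toy_index = build_inverted_index(toy_docs)
print(toy_index)
# {'code': [1, 2], 'web': [1], 'python': [2]}
```

Because documents are visited in increasing ID order, each posting list comes out sorted without an extra sort.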

In [63]:
def get_inverted_index(list_of_sentence_tokens):
    # Map each token to the list of sentence IDs (starting from 1) that contain it
    inverted_index = {}
    for index, sentence in enumerate(list_of_sentence_tokens, 1):
        # set() ensures each sentence ID is recorded only once per token
        for word in set(sentence):
            if word not in inverted_index:
                inverted_index[word] = []
            inverted_index[word].append(index)
    return inverted_index

get_inverted_index(cleaned_sentence)

{'versatile': [1],
 'python': [1, 8],
 'language': [1],
 'proved': [1],
 'programming': [1, 4, 5, 10],
 'various': [1, 10],
 'domain': [1],
 'importance': [1],
 'used': [2],
 'development': [2, 9],
 'widely': [2],
 'web': [2, 9],
 'javascript': [2],
 'java': [3],
 'known': [3],
 'independence': [3],
 'platform': [3],
 'writing': [4],
 'involves': [4],
 'solve': [4],
 'code': [4, 7, 8],
 'problem': [4, 6],
 'efficient': [5],
 'crucial': [5],
 'data': [5],
 'structure': [5],
 'step-by-step': [6],
 'algorithm': [6],
 'solving': [6],
 'instruction': [6],
 'change': [7],
 'help': [7],
 'control': [7],
 'version': [7],
 'manage': [7],
 'collaboration': [7],
 'system': [7],
 'process': [8],
 'fixing': [8],
 'finding': [8],
 'debugging': [8],
 'error': [8],
 'simplify': [9],
 'application': [9],
 'framework': [9],
 'intelligence': [10],
 'applied': [10],
 'artificial': [10],
 'task': [10]}

- Question B-2: Create a method that takes as an input a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences. The method MUST return the <b>Positional index</b> that is sufficient to represent the document. Assume that each sentence is a document and the sentence ID starts from 1, and the first token in the list is at position 0. Make sure to consider multiple appearance of the same token. (10)
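
The difference from the inverted index is that every occurrence is recorded, not just the presence of a token. A minimal sketch on a toy corpus follows; the function and variable names are illustrative only.

```python
def build_positional_index(docs):
    """Map each token to {document ID: [positions]}, with IDs from 1 and positions from 0."""
    index = {}
    for doc_id, tokens in enumerate(docs, start=1):
        for position, term in enumerate(tokens):
            # Repeated tokens simply append another position to the same list.
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

toy_docs = [["web", "code", "web"], ["code", "python"]]
toy_positions = build_positional_index(toy_docs)
print(toy_positions)
# {'web': {1: [0, 2]}, 'code': {1: [1], 2: [0]}, 'python': {2: [1]}}
```

The raw term count in a document is then `len(index[term][doc_id])`, and the document frequency is `len(index[term])`.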

In [64]:
def get_positional_index(list_of_sentence_tokens):
    # Map each token to {sentence ID: [positions]}; IDs start at 1, positions at 0
    positional_index = {}
    for index, sentence in enumerate(list_of_sentence_tokens, 1):
        for position, word in enumerate(sentence):
            if word not in positional_index:
                positional_index[word] = {}
            if index not in positional_index[word]:
                positional_index[word][index] = []
            # Repeated tokens append one position per occurrence
            positional_index[word][index].append(position)
    return positional_index

get_positional_index(cleaned_sentence)

{'python': {1: [0, 4], 8: [5]},
 'versatile': {1: [1]},
 'programming': {1: [2], 4: [0], 5: [4], 10: [4]},
 'language': {1: [3]},
 'proved': {1: [5]},
 'importance': {1: [6]},
 'various': {1: [7], 10: [3]},
 'domain': {1: [8]},
 'javascript': {2: [0]},
 'widely': {2: [1]},
 'used': {2: [2]},
 'web': {2: [3], 9: [0, 4]},
 'development': {2: [4], 9: [3]},
 'java': {3: [0]},
 'known': {3: [1]},
 'platform': {3: [2]},
 'independence': {3: [3]},
 'involves': {4: [1]},
 'writing': {4: [2]},
 'code': {4: [3], 7: [5], 8: [6]},
 'solve': {4: [4]},
 'problem': {4: [5], 6: [4]},
 'data': {5: [0]},
 'structure': {5: [1]},
 'crucial': {5: [2]},
 'efficient': {5: [3]},
 'algorithm': {6: [0]},
 'step-by-step': {6: [1]},
 'instruction': {6: [2]},
 'solving': {6: [3]},
 'version': {7: [0]},
 'control': {7: [1]},
 'system': {7: [2]},
 'help': {7: [3]},
 'manage': {7: [4]},
 'change': {7: [6]},
 'collaboration': {7: [7]},
 'debugging': {8: [0]},
 'process': {8: [1]},
 'finding': {8: [2]},
 'fixing': {8: 

- Question B-3: Create a method that takes as an input a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences. The method MUST return the <b>TF-IDF Matrix</b> that is sufficient to represent the documents. The tokens are expected to be sorted, as are the document IDs. Assume that each sentence is a document and the sentence ID starts from 1. (10) You are not allowed to use any libraries.
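
As a worked example of a single matrix cell, the sketch below assumes the variant used in this notebook's implementation: term frequency normalized by document length, and a base-10 logarithm for the IDF. The numbers are for the token `python` in sentence 1 of `sample_sentences` after cleaning (2 occurrences among 9 tokens, present in 2 of the 10 sentences); `tf_idf` is an illustrative helper name.

```python
import math

def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """TF-IDF with length-normalized TF and a base-10 IDF (assumed variant)."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / doc_freq)
    return tf * idf

python_doc1 = tf_idf(term_count=2, doc_length=9, num_docs=10, doc_freq=2)
print(round(python_doc1, 6))
# 0.155327
```

A token that appears in every document gets an IDF of `log10(1) = 0`, so its whole column is zero.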

In [65]:
import math

def get_TFIDF_matrix(list_of_sentence_tokens):
    # Build a {word: {doc_id: tf-idf}} dictionary, then flatten it into a docs x sorted-tokens matrix
    tfidf_dictionary = {}
    positional_index = get_positional_index(list_of_sentence_tokens)
    for index, sentence in enumerate(list_of_sentence_tokens, 1):
        for word in set(sentence):
            tf = len(positional_index[word][index])/len(sentence)
            # To optimize, the IDF could be computed once per word outside this loop
            # (from the positional index) and stored in a dictionary, instead of
            # being recalculated for every sentence.
            idf = math.log10(len(list_of_sentence_tokens)/len(positional_index[word]))
            tf_idf_value = tf * idf
            if word not in tfidf_dictionary:
                tfidf_dictionary[word] = {}
            if index not in tfidf_dictionary[word]:
                tfidf_dictionary[word][index] = tf_idf_value
    tf_idf_matrix = []
    for index in range(1, len(list_of_sentence_tokens)+1):
        list_tf_idf = []
        for word in sorted(tfidf_dictionary):
            value = tfidf_dictionary[word][index] if index in tfidf_dictionary[word] else 0
            list_tf_idf.append(value)
        tf_idf_matrix.append(list_tf_idf)
    df = pd.DataFrame(tf_idf_matrix, index=[f'Doc{index}' for index in range(1, len(list_of_sentence_tokens)+1)], columns=sorted(tfidf_dictionary))
    pd.set_option('display.max_columns', None)  # Display all columns
    display(df)
    return tf_idf_matrix

tf_idf_matrix = get_TFIDF_matrix(cleaned_sentence)
print("OK")

Unnamed: 0,algorithm,application,applied,artificial,change,code,collaboration,control,crucial,data,debugging,development,domain,efficient,error,finding,fixing,framework,help,importance,independence,instruction,intelligence,involves,java,javascript,known,language,manage,platform,problem,process,programming,proved,python,simplify,solve,solving,step-by-step,structure,system,task,used,various,versatile,version,web,widely,writing
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.044216,0.111111,0.155327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077663,0.111111,0.0,0.0,0.0,0.0
Doc2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.139794,0.2,0.0
Doc3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc4,0.0,0.0,0.0,0.0,0.0,0.087146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.116495,0.0,0.066323,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667
Doc5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.079588,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc6,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc7,0.0,0.0,0.0,0.0,0.125,0.06536,0.125,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0
Doc8,0.0,0.0,0.0,0.0,0.0,0.074697,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.142857,0.142857,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.099853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc9,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116495,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23299,0.0,0.0
Doc10,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.116495,0.0,0.0,0.0,0.0,0.0


OK


#### PART C- Measuring Documents Similarity

##### Create a method that takes as an input: (15)
 - a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences.
 - A method name: "tfidf", "inverted"
 - A Search Query
 - Return the rank of the sentences based on the given method and a query <br>

***Hint: For inverted index we just want documents that have the query word/words, for tfidf you must show the ranking based on highest tfidf score***
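
Whichever method produces the per-document scores (a count of matching tokens for "inverted", a sum of tf-idf values for "tfidf"), the final ranking step is the same. A minimal sketch follows; `rank_by_score` and the scores in `toy_scores` are made up for illustration, and breaking ties by ascending document ID is an assumption rather than something the assignment specifies.

```python
def rank_by_score(scores):
    """Return the IDs of documents with a non-zero score, highest score first."""
    # A zero score means no query token matched the document, so it is left out.
    matched = {doc_id: score for doc_id, score in scores.items() if score != 0}
    # Sort by descending score; ties fall back to ascending document ID (assumption).
    return sorted(matched, key=lambda doc_id: (-matched[doc_id], doc_id))

toy_scores = {1: 0.10, 2: 0.0, 3: 0.07, 5: 0.14}
ranking = rank_by_score(toy_scores)
print(ranking)
# [5, 1, 3]
```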

In [73]:
def get_ranked_documents(list_of_sentence_tokens, method_name, search_query):
    # Returns the document IDs ranked for the given method and search query
    ## If the method is "inverted", rank the documents by the number of matching tokens
    ## If the method is "tfidf", use the tf-idf score equation from the slides and rank by that score
    ## The document with the highest relevance is ranked first
    ## The returned list holds the document IDs, highest-ranked first
    print("\n\n--------------------------------------------------------------------------")
    for sentence in list_of_sentence_tokens:
        print(sentence)
    print(f"\nMethodName: {method_name} \nSearch: {search_query}")
    if method_name == 'tfidf':
        return get_ranked_documents_based_ontfidf(list_of_sentence_tokens, search_query)
    elif method_name == 'inverted':
        return get_ranked_documents_based_on_inverted_index(list_of_sentence_tokens, search_query)
    else:
        raise ValueError("method_name should be either tfidf or inverted") 
    
def get_sorted_list_of_tokens(cleaned_tokens):
    all_tokens = [word for sentence in cleaned_tokens for word in sentence]
    sorted_tokens = sorted(set(all_tokens))
    return sorted_tokens

def get_ranked_documents_based_ontfidf(list_of_sentence_tokens, search_query):
    cs_tfidf = clean_sentences(list_of_sentence_tokens)
    list_of_sorted_tokens = get_sorted_list_of_tokens(cs_tfidf)
    cleaned_queries = clean_sentences([search_query])[0]
    #create tfidf for the list of sentences
    tfidf_matrix = get_TFIDF_matrix(cs_tfidf)

    # Check whether at least one query token appears in the corpus vocabulary
    intersection = list(set(list_of_sorted_tokens) & set(cleaned_queries))

    # No query token appears anywhere in the dataset.
    # This could also be handled inside the loop, but returning early saves computation.
    if len(intersection) == 0: return []

    # Drop query tokens that are not in the corpus vocabulary
    cleaned_queries = intersection

    scores = {}
    for index, doc in enumerate(cs_tfidf):
        score = 0
        for token in cleaned_queries:
            score += tfidf_matrix[index][list_of_sorted_tokens.index(token)]
        scores[index+1] = score
    df = pd.DataFrame([scores])
    display(df)

    # Documents with a zero score matched no query token, so they are dropped.
    filtered_dict = {key: value for key, value in scores.items() if value != 0}
    print(filtered_dict)

    # Sort the document IDs by score, descending
    sorted_keys = sorted(filtered_dict, key=lambda key: filtered_dict[key], reverse=True)
    print(f"Returned Documents: {sorted_keys}")
    return sorted_keys

def get_ranked_documents_based_on_inverted_index(list_of_sentence_tokens, search_query):
    cs_inverted = clean_sentences(list_of_sentence_tokens)
    list_of_sorted_tokens = get_sorted_list_of_tokens(cs_inverted)
    cleaned_queries = clean_sentences([search_query])[0]
    inverted_index = get_inverted_index(cs_inverted)
    # Check whether at least one query token appears in the corpus vocabulary
    intersection = list(set(list_of_sorted_tokens) & set(cleaned_queries))
    # No query token appears anywhere in the dataset.
    # This could also be handled inside the loop, but returning early saves computation.
    if len(intersection) == 0: return []
    # Drop query tokens that are not in the corpus vocabulary
    cleaned_queries = intersection
    scores = {}
    for token in cleaned_queries:
        doclist = inverted_index[token]
        for docNum in doclist:
            if docNum in scores:
                scores[docNum] += 1
            else:
                scores[docNum] = 1
    # Documents with a zero score matched no query token, so they are dropped.
    filtered_dict = {key: value for key, value in scores.items() if value != 0}
    print(filtered_dict)
    # Sort the document IDs by score, descending
    sorted_keys = sorted(filtered_dict, key=lambda key: filtered_dict[key], reverse=True)
    print(f"Returned Documents: {sorted_keys}")
    return sorted_keys



In [74]:
#testing
sentences = [
    "Mohanlal is one of the greatest actors",
    "Mammootty is also a great actor",
    "Mammootty and Mohanlal has acted together in lot of movies.",
    "Harikrishnans is one of the movies they acted together.",
    "Theatre experience is better than Netflix and chill."
]

expectedResult1 = [1, 3]
expectedResult2 = [5, 1, 3]
expectedResult3 = [4]
expectedResult4 = []
expectedResult5 = []

result1 = get_ranked_documents(sentences, "tfidf", "Mohanlal")
result2 = get_ranked_documents(sentences, "tfidf", "Mohanlal Netflix")
result3 = get_ranked_documents(sentences, "tfidf", "Harikrishnans")
result4 = get_ranked_documents(sentences, "tfidf", "Fahadh")
result5 = get_ranked_documents(sentences, "tfidf", "act")

assert result1 == expectedResult1
assert result2 == expectedResult2
assert result3 == expectedResult3
assert result4 == expectedResult4
assert result5 == expectedResult5

result1inverted = get_ranked_documents(sentences, "inverted", "Mohanlal")
result2inverted = get_ranked_documents(sentences, "inverted", "Mohanlal Netflix")
result3inverted = get_ranked_documents(sentences, "inverted", "Harikrishnans")
result4inverted = get_ranked_documents(sentences, "inverted", "Fahadh")
result5inverted = get_ranked_documents(sentences, "inverted", "act")





--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together in lot of movies.
Harikrishnans is one of the movies they acted together.
Theatre experience is better than Netflix and chill.

MethodName: tfidf 
Search: Mohanlal


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,mohanlal,movie,netflix,one,theatre,together
Doc1,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.099485,0.0,0.0
Doc2,0.0,0.099485,0.174743,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.066323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116495,0.066323,0.066323,0.066323,0.0,0.0,0.0,0.066323
Doc4,0.079588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.0,0.0,0.079588,0.0,0.079588,0.0,0.079588
Doc5,0.0,0.0,0.0,0.139794,0.139794,0.139794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.139794,0.0


Unnamed: 0,1,2,3,4,5
0,0.099485,0,0.066323,0,0


{1: 0.0994850021680094, 3: 0.06632333477867293}
Returned Documents: [1, 3]


--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together in lot of movies.
Harikrishnans is one of the movies they acted together.
Theatre experience is better than Netflix and chill.

MethodName: tfidf 
Search: Mohanlal Netflix


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,mohanlal,movie,netflix,one,theatre,together
Doc1,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.099485,0.0,0.0
Doc2,0.0,0.099485,0.174743,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.066323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116495,0.066323,0.066323,0.066323,0.0,0.0,0.0,0.066323
Doc4,0.079588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.0,0.0,0.079588,0.0,0.079588,0.0,0.079588
Doc5,0.0,0.0,0.0,0.139794,0.139794,0.139794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.139794,0.0


Unnamed: 0,1,2,3,4,5
0,0.099485,0,0.066323,0,0.139794


{1: 0.0994850021680094, 3: 0.06632333477867293, 5: 0.13979400086720378}
Returned Documents: [5, 1, 3]


--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together in lot of movies.
Harikrishnans is one of the movies they acted together.
Theatre experience is better than Netflix and chill.

MethodName: tfidf 
Search: Harikrishnans


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,mohanlal,movie,netflix,one,theatre,together
Doc1,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.099485,0.0,0.0
Doc2,0.0,0.099485,0.174743,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.066323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116495,0.066323,0.066323,0.066323,0.0,0.0,0.0,0.066323
Doc4,0.079588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.0,0.0,0.079588,0.0,0.079588,0.0,0.079588
Doc5,0.0,0.0,0.0,0.139794,0.139794,0.139794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.139794,0.0


Unnamed: 0,1,2,3,4,5
0,0,0,0,0.139794,0


{4: 0.13979400086720378}
Returned Documents: [4]


--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together in lot of movies.
Harikrishnans is one of the movies they acted together.
Theatre experience is better than Netflix and chill.

MethodName: tfidf 
Search: Fahadh


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,mohanlal,movie,netflix,one,theatre,together
Doc1,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.099485,0.0,0.0
Doc2,0.0,0.099485,0.174743,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.066323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116495,0.066323,0.066323,0.066323,0.0,0.0,0.0,0.066323
Doc4,0.079588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.0,0.0,0.079588,0.0,0.079588,0.0,0.079588
Doc5,0.0,0.0,0.0,0.139794,0.139794,0.139794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.139794,0.0




--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together in lot of movies.
Harikrishnans is one of the movies they acted together.
Theatre experience is better than Netflix and chill.

MethodName: tfidf 
Search: act


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,mohanlal,movie,netflix,one,theatre,together
Doc1,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.099485,0.0,0.0
Doc2,0.0,0.099485,0.174743,0.0,0.0,0.0,0.174743,0.0,0.0,0.0,0.099485,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.066323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116495,0.066323,0.066323,0.066323,0.0,0.0,0.0,0.066323
Doc4,0.079588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.0,0.0,0.079588,0.0,0.079588,0.0,0.079588
Doc5,0.0,0.0,0.0,0.139794,0.139794,0.139794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.0,0.139794,0.0




--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together in lot of movies.
Harikrishnans is one of the movies they acted together.
Theatre experience is better than Netflix and chill.

MethodName: inverted 
Search: Mohanlal
{1: 1, 3: 1}
Returned Documents: [1, 3]


--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together in lot of movies.
Harikrishnans is one of the movies they acted together.
Theatre experience is better than Netflix and chill.

MethodName: inverted 
Search: Mohanlal Netflix
{1: 1, 3: 1, 5: 1}
Returned Documents: [1, 3, 5]


--------------------------------------------------------------------------
Mohanlal is one of the greatest actors
Mammootty is also a great actor
Mammootty and Mohanlal has acted together

#### PART D- TFIDF with a TWIST (30 Marks)

##### TFIDF with Custom Weighting Based on Document Length and Term Position
- You are expected to implement a twisted version of the TF-IDF vectorizer, that incorporates two additional features:
    - Document Length
    - Term Position
- This twist aims to assign weight based on Modified Term Frequency (MTF) and Modified inverse Document Frequency (MIDF)
1. Modified Term Frequency (MTF):
    - MTF is calculated by taking the position of the term into account.
    - The assumption is the closer the term appears to the beginning of the document, the higher the weight should be.
    - $$\text{MTF}(t, d) = \frac{f(t, d)}{1 + \text{position}(t, d)}$$
        - Where f(t,d) is the raw count of term t in document d.
        - position(t,d) is the position of the first occurrence of term t in document d.
2. Modified Inverse Document Frequency (MIDF):
    - MIDF is calculated taking into consideration the document length.
    - The assumption is that the IDF should be inversely proportional not only to the number of documents the term appears in, but also to the average length of the documents where the term appears.
    - Hence, longer documents are less significant for a term's relevance.
    - $$\text{MIDF}(t) = \log \left( \frac{N}{\text{df}(t) \times \frac{1}{M} \sum_{d \in D_{t}} |d|} \right)$$

        - N is the total number of documents
        - df(t) is the document frequency
        - M is a constant for scaling
        - $\sum_{d \in D_{t}} |d|$ is the sum of the lengths of all documents that contain t
        - |d| is the length of document d
3. Final Weight (MTF-MIDF):
    - The combined weight is calculated as: MTF(t,d) * MIDF(t)
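
The three definitions above can be checked by hand on one term. The sketch below assumes a base-10 logarithm (the formula only says "log") and M = 5. The numbers describe a term that occurs once, at position 0, in a single four-token document out of six documents; `mtf` and `midf` are illustrative helper names.

```python
import math

def mtf(raw_count, first_position):
    """Modified term frequency: raw count damped by the first position of the term."""
    return raw_count / (1 + first_position)

def midf(num_docs, doc_lengths_with_term, M=5):
    """Modified IDF; doc_lengths_with_term lists |d| for every document containing the term."""
    df = len(doc_lengths_with_term)
    scaled_length = sum(doc_lengths_with_term) / M
    return math.log10(num_docs / (df * scaled_length))  # base 10 is an assumption

weight = mtf(raw_count=1, first_position=0) * midf(num_docs=6, doc_lengths_with_term=[4], M=5)
print(round(weight, 6))
# 0.875061
```

Here MTF = 1 / (1 + 0) = 1 and MIDF = log10(6 / (1 × 4/5)) = log10(7.5), so the same term at position 3 would receive a quarter of this weight.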

##### Part 4-A: Implement the function logic for getting modified tf-idf weightings. (20 Marks)
<b><u>NOTE: M is a scaling factor; setting it to 5 in our example would be sufficient. However, you need to explore what increasing and decreasing it represents.</u></b>
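
One way to start exploring M: rearranging the MIDF formula gives log(N·M / (df · Σ|d|)) = log(N / (df · Σ|d|)) + log(M), so changing M shifts every term's MIDF by the same constant. The sketch below illustrates this with a base-10 logarithm (an assumption) for a hypothetical term found in two documents of lengths 4 and 6 out of six documents.

```python
import math

def midf(num_docs, doc_lengths_with_term, M):
    """Modified IDF with scaling constant M (base-10 log assumed)."""
    df = len(doc_lengths_with_term)
    return math.log10(num_docs / (df * sum(doc_lengths_with_term) / M))

base = midf(6, [4, 6], M=1)
for M in (1, 5, 10, 100):
    shifted = midf(6, [4, 6], M=M)
    # The last column equals log10(M): the shift does not depend on the term.
    print(M, round(shifted, 4), round(shifted - base, 4))
```

With M = 1 the value is negative for this term, while larger M pushes it above zero. Because the shift is additive but MTF multiplies the result, changing M can alter the relative weights of different terms within a document.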

In [121]:
def get_modified_tfidf_matrix_withM(list_of_sentence_tokens, M=5):
    # Build a {word: {doc_id: MTF * MIDF}} dictionary, then flatten it into a docs x sorted-tokens matrix
    mtfmidf_dictionary = {}
    positional_index = get_positional_index(list_of_sentence_tokens)
    print(positional_index)
    for index, sentence in enumerate(list_of_sentence_tokens, 1):
        for word in set(sentence):
            min_value = min(positional_index[word][index])
            tf = len(positional_index[word][index])/(1 + min_value)
            # To optimize, the MIDF could be computed once per word outside this loop
            # (from the positional index) and stored in a dictionary, instead of
            # being recalculated for every sentence.
            sum_of_words_in_docs = 0
            for j in positional_index[word]:
                sum_of_words_in_docs += len(list_of_sentence_tokens[j-1])
            denominator = (1/M)*sum_of_words_in_docs
            idf = math.log10(len(list_of_sentence_tokens)/(len(positional_index[word])*denominator))
            tf_idf_value = tf * idf
            if word not in mtfmidf_dictionary:
                mtfmidf_dictionary[word] = {}
            if index not in mtfmidf_dictionary[word]:
                mtfmidf_dictionary[word][index] = tf_idf_value
    mtf_midf_matrix = []
    for index in range(1, len(list_of_sentence_tokens)+1):
        list_tf_idf = []
        for word in sorted(mtfmidf_dictionary):
            value = mtfmidf_dictionary[word][index] if index in mtfmidf_dictionary[word] else 0
            list_tf_idf.append(value)
        mtf_midf_matrix.append(list_tf_idf)
    df = pd.DataFrame(mtf_midf_matrix, index=[f'Doc{index}' for index in range(1, len(list_of_sentence_tokens)+1)], columns=sorted(mtfmidf_dictionary))
    pd.set_option('display.max_columns', None)  # Display all columns
    display(df)
    return mtf_midf_matrix, df  # matrix as a list of rows, plus the labelled DataFrame

def get_modified_tfidf_matrix(list_of_sentence_tokens):
    matrix, df = get_modified_tfidf_matrix_withM(list_of_sentence_tokens, M=5)
    return matrix

sentences_mtfmidf = [
    "Messi is the greatest player of all time",
    "Mohanlal is one of the greatest actors",
    "Mammootty is also a great actor",
    "Mammootty and Mohanlal has acted together in lot of movies.",
    "Harikrishnans is one of the movies they acted together.",
    "Theatre experience is better than Netflix and chill."
]
cs_mtfmidf = clean_sentences(sentences_mtfmidf)
mtf_midf_matrix = get_modified_tfidf_matrix(cs_mtfmidf)
print("OK")


{'messi': {1: [0]}, 'greatest': {1: [1], 2: [2]}, 'player': {1: [2]}, 'time': {1: [3]}, 'mohanlal': {2: [0], 4: [1]}, 'one': {2: [1], 5: [1]}, 'actor': {2: [3], 3: [3]}, 'mammootty': {3: [0], 4: [0]}, 'also': {3: [1]}, 'great': {3: [2]}, 'acted': {4: [2], 5: [3]}, 'together': {4: [3], 5: [4]}, 'lot': {4: [4]}, 'movie': {4: [5], 5: [2]}, 'harikrishnans': {5: [0]}, 'theatre': {6: [0]}, 'experience': {6: [1]}, 'better': {6: [2]}, 'netflix': {6: [3]}, 'chill': {6: [4]}}


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,messi,mohanlal,movie,netflix,one,player,theatre,time,together
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.136501,0.0,0.0,0.0,0.875061,0.0,0.0,0.0,0.0,0.291687,0.0,0.218765,0.0
Doc2,0.0,0.06825,0.0,0.0,0.0,0.0,0.0,0.091,0.0,0.0,0.0,0.0,0.176091,0.0,0.0,0.110924,0.0,0.0,0.0,0.0
Doc3,0.0,0.06825,0.437531,0.0,0.0,0.0,0.291687,0.0,0.0,0.0,0.176091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc4,0.0449,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.139794,0.176091,0.0,0.088046,0.02245,0.0,0.0,0.0,0.0,0.0,0.033675
Doc5,0.033675,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.778151,0.0,0.0,0.0,0.0,0.0449,0.0,0.110924,0.0,0.0,0.0,0.02694
Doc6,0.0,0.0,0.0,0.259384,0.15563,0.389076,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.194538,0.0,0.0,0.778151,0.0,0.0


OK
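
As a sanity check on the table above, the Doc1 entry for `messi` can be reproduced by hand. This assumes, as in the code above, a base-10 log and an MTF equal to the term count divided by (1 + first position). After cleaning, Doc1 has 4 tokens, N = 6, df(messi) = 1 and M = 5:

$$\text{MIDF}(\text{messi}) = \log_{10}\left(\frac{6}{1 \times \frac{1}{5} \times 4}\right) = \log_{10}(7.5) \approx 0.8751$$

$$\text{MTF}(\text{messi}, \text{Doc1}) \times \text{MIDF}(\text{messi}) = \frac{1}{1 + 0} \times 0.8751 \approx 0.8751$$

This agrees with the value 0.875061 shown for Doc1 in the table.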


##### Part 4-B: Experiment with the effect of changing M and comment on what you think M is for and why it is added. (5)

- <b> Your answer here</b>

In [119]:
def printmax_min(df):
    # Get the maximum value, its row, and column
    max_value = df.max().max()
    max_position = df.stack().idxmax()  # (row, column)

    # Get the minimum value, its row, and column
    min_value = df.min().min()
    min_position = df.stack().idxmin()  # (row, column)

    # Create a new DataFrame with results
    result_df = pd.DataFrame({
        "Type": ["Max", "Min"],
        "Value": [max_value, min_value],
        "Row": [max_position[0], min_position[0]],
        "Column": [max_position[1], min_position[1]]
    })
    display(result_df)

_, mtf_midf_matrix_m1 = get_modified_tfidf_matrix_withM(cs_mtfmidf, 1)
_, mtf_midf_matrix_m10 = get_modified_tfidf_matrix_withM(cs_mtfmidf, 10)
_, mtf_midf_matrix_m100 = get_modified_tfidf_matrix_withM(cs_mtfmidf, 100)

print("----------------------------")
print(f"M = 1")
printmax_min(mtf_midf_matrix_m1)
print("----------------------------\n")

print("----------------------------")
print(f"M = 10")
printmax_min(mtf_midf_matrix_m10)
print("----------------------------\n")

print("----------------------------")
print(f"M = 100")
printmax_min(mtf_midf_matrix_m100)
print("----------------------------\n")


{'messi': {1: [0]}, 'greatest': {1: [1], 2: [2]}, 'player': {1: [2]}, 'time': {1: [3]}, 'mohanlal': {2: [0], 4: [1]}, 'one': {2: [1], 5: [1]}, 'actor': {2: [3], 3: [3]}, 'mammootty': {3: [0], 4: [0]}, 'also': {3: [1]}, 'great': {3: [2]}, 'acted': {4: [2], 5: [3]}, 'together': {4: [3], 5: [4]}, 'lot': {4: [4]}, 'movie': {4: [5], 5: [2]}, 'harikrishnans': {5: [0]}, 'theatre': {6: [0]}, 'experience': {6: [1]}, 'better': {6: [2]}, 'netflix': {6: [3]}, 'chill': {6: [4]}}


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,messi,mohanlal,movie,netflix,one,player,theatre,time,together
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.212984,0.0,0.0,0.0,0.176091,0.0,0.0,0.0,0.0,0.058697,0.0,0.044023,0.0
Doc2,0.0,-0.106492,0.0,0.0,0.0,0.0,0.0,-0.14199,0.0,0.0,0.0,0.0,-0.522879,0.0,0.0,-0.238561,0.0,0.0,0.0,0.0
Doc3,0.0,-0.106492,0.088046,0.0,0.0,0.0,0.058697,0.0,0.0,0.0,-0.522879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc4,-0.18809,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.522879,0.0,-0.261439,-0.094045,0.0,0.0,0.0,0.0,0.0,-0.141068
Doc5,-0.141068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.079181,0.0,0.0,0.0,0.0,-0.18809,0.0,-0.238561,0.0,0.0,0.0,-0.112854
Doc6,0.0,0.0,0.0,0.026394,0.015836,0.039591,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019795,0.0,0.0,0.079181,0.0,0.0


{'messi': {1: [0]}, 'greatest': {1: [1], 2: [2]}, 'player': {1: [2]}, 'time': {1: [3]}, 'mohanlal': {2: [0], 4: [1]}, 'one': {2: [1], 5: [1]}, 'actor': {2: [3], 3: [3]}, 'mammootty': {3: [0], 4: [0]}, 'also': {3: [1]}, 'great': {3: [2]}, 'acted': {4: [2], 5: [3]}, 'together': {4: [3], 5: [4]}, 'lot': {4: [4]}, 'movie': {4: [5], 5: [2]}, 'harikrishnans': {5: [0]}, 'theatre': {6: [0]}, 'experience': {6: [1]}, 'better': {6: [2]}, 'netflix': {6: [3]}, 'chill': {6: [4]}}


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,messi,mohanlal,movie,netflix,one,player,theatre,time,together
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.287016,0.0,0.0,0.0,1.176091,0.0,0.0,0.0,0.0,0.39203,0.0,0.294023,0.0
Doc2,0.0,0.143508,0.0,0.0,0.0,0.0,0.0,0.191344,0.0,0.0,0.0,0.0,0.477121,0.0,0.0,0.261439,0.0,0.0,0.0,0.0
Doc3,0.0,0.143508,0.588046,0.0,0.0,0.0,0.39203,0.0,0.0,0.0,0.477121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc4,0.145243,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.477121,0.0,0.238561,0.072621,0.0,0.0,0.0,0.0,0.0,0.108932
Doc5,0.108932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.079181,0.0,0.0,0.0,0.0,0.145243,0.0,0.261439,0.0,0.0,0.0,0.087146
Doc6,0.0,0.0,0.0,0.359727,0.215836,0.539591,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.269795,0.0,0.0,1.079181,0.0,0.0


{'messi': {1: [0]}, 'greatest': {1: [1], 2: [2]}, 'player': {1: [2]}, 'time': {1: [3]}, 'mohanlal': {2: [0], 4: [1]}, 'one': {2: [1], 5: [1]}, 'actor': {2: [3], 3: [3]}, 'mammootty': {3: [0], 4: [0]}, 'also': {3: [1]}, 'great': {3: [2]}, 'acted': {4: [2], 5: [3]}, 'together': {4: [3], 5: [4]}, 'lot': {4: [4]}, 'movie': {4: [5], 5: [2]}, 'harikrishnans': {5: [0]}, 'theatre': {6: [0]}, 'experience': {6: [1]}, 'better': {6: [2]}, 'netflix': {6: [3]}, 'chill': {6: [4]}}


Unnamed: 0,acted,actor,also,better,chill,experience,great,greatest,harikrishnans,lot,mammootty,messi,mohanlal,movie,netflix,one,player,theatre,time,together
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.787016,0.0,0.0,0.0,2.176091,0.0,0.0,0.0,0.0,0.725364,0.0,0.544023,0.0
Doc2,0.0,0.393508,0.0,0.0,0.0,0.0,0.0,0.524677,0.0,0.0,0.0,0.0,1.477121,0.0,0.0,0.761439,0.0,0.0,0.0,0.0
Doc3,0.0,0.393508,1.088046,0.0,0.0,0.0,0.725364,0.0,0.0,0.0,1.477121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc4,0.478576,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,1.477121,0.0,0.738561,0.239288,0.0,0.0,0.0,0.0,0.0,0.358932
Doc5,0.358932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.079181,0.0,0.0,0.0,0.0,0.478576,0.0,0.761439,0.0,0.0,0.0,0.287146
Doc6,0.0,0.0,0.0,0.69306,0.415836,1.039591,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.519795,0.0,0.0,2.079181,0.0,0.0


----------------------------
M = 1


Unnamed: 0,Type,Value,Row,Column
0,Max,0.176091,Doc1,messi
1,Min,-0.522879,Doc2,mohanlal


----------------------------

----------------------------
M = 10


Unnamed: 0,Type,Value,Row,Column
0,Max,1.176091,Doc1,messi
1,Min,0.0,Doc1,acted


----------------------------

----------------------------
M = 100


Unnamed: 0,Type,Value,Row,Column
0,Max,2.176091,Doc1,messi
1,Min,0.0,Doc1,acted


----------------------------



- When M = 1, the minimum value across all cells is about -0.52 (for "mohanlal" in Doc2; "mammootty" has the same value). Ideally the minimum should be 0, because empty entries are filled with zero.
- This was fine for tf-idf, where we can always be sure that tf * idf is never less than 0.
- In mtf-midf, however, the document lengths appear in the denominator inside the log. For words that occur in many documents, or in long documents, this denominator can grow larger than N, which makes the midf value negative. The mtf-midf value can therefore be negative as well, so we can't rely on filling the empty cells of the matrix with zeros.
- Filling them with zeros gives absent words a higher mtf-midf value than some words that actually occur in the document, which can make the whole calculation incorrect.
- This is why M has to be chosen so that no calculated mtf-midf value can go below zero. Increasing M shifts every midf value up by log10(M) and decreasing it shifts them down: in the outputs above, the maximum goes from 0.176 to 1.176 to 2.176 for M = 1, 10 and 100.
- A simple choice that always guarantees this is to set M to the sum of all document lengths. Since df(t) is at most N, and the summed length of the documents containing t is at most the total length, the argument of the log is then at least 1, so the value never goes below zero.
- The drawback of this approach is that with a huge number of documents the sum of document lengths becomes very large, which pushes the scores of rare words to very large values. That is the other side of the problem.
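
The non-negativity condition can be checked numerically. The sketch below is only an illustration: the token lists are reconstructed from the positional index printed above, and `smallest_safe_m` and `midf` are hypothetical helpers (not part of the graded functions) that assume the same base-10 MIDF as the implementation.

```python
import math

# Tokenized documents, reconstructed from the positional index printed above.
docs = [
    ["messi", "greatest", "player", "time"],
    ["mohanlal", "one", "greatest", "actor"],
    ["mammootty", "also", "great", "actor"],
    ["mammootty", "mohanlal", "acted", "together", "lot", "movie"],
    ["harikrishnans", "one", "movie", "acted", "together"],
    ["theatre", "experience", "better", "netflix", "chill"],
]

def smallest_safe_m(docs):
    """Smallest M for which every MIDF value is >= 0 (hypothetical helper)."""
    n_docs = len(docs)
    worst = 0
    for term in {t for d in docs for t in d}:
        containing = [d for d in docs if term in d]
        # MIDF(t) >= 0  <=>  N * M >= df(t) * (sum of lengths of docs containing t)
        worst = max(worst, len(containing) * sum(len(d) for d in containing))
    return worst / n_docs

def midf(term, docs, M):
    """MIDF as defined above, with a base-10 log (assumed, as in the code)."""
    containing = [d for d in docs if term in d]
    total_len = sum(len(d) for d in containing)
    return math.log10(len(docs) / (len(containing) * total_len / M))

m_min = smallest_safe_m(docs)
total_length = sum(len(d) for d in docs)
print(m_min, total_length)
```

On this corpus the bound works out to 22/6 (about 3.67), which is consistent with M = 5 producing no negative entries and M = 1 producing several. The sum of all document lengths (28 here) is a looser but always safe choice.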

##### Part 4-C: Do you think Modified TF-Modified IDF is a good technique? Please comment and explain your thoughts.(5)

- <b> Your answer here</b>

- In my opinion, although the intuition behind Modified TF-Modified IDF is smart, the equation needs to handle some edge cases, namely the below-zero issue explained in the previous answer.
- There is no generalized way to figure out an appropriate value for M.
- One way to improve it, as explained in the previous answer, is to use the sum of all document lengths as the value of M.
- The issue with that approach is that some rare words can take very large values.
- Another approach is to change how empty values are filled in the mtf-midf matrix. Unlike tf-idf, empty entries would be filled with NaN (in Python, treated as missing) or with -infinity, depending on the language and variable type of the implementation. This way, even if some entries are negative, the empty cells never outrank them.
- The drawback, again, is that in tf-idf zero acts as a natural lower limit, so documents scoring zero can be excluded from the results. With mtf-midf, -infinity would have to serve this purpose.
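
The effect of the fill value on ranking can be illustrated with a small sketch. It uses the M = 1 scores for "mohanlal" copied from the output in Part 4-B; `rank` is a hypothetical helper, and -infinity is used rather than NaN because NaN does not take part in ordering comparisons in Python.

```python
import math

# mtf-midf scores for the term "mohanlal" with M = 1, copied from the Part 4-B output.
scores = {"Doc2": -0.522879, "Doc4": -0.261439}
all_docs = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5", "Doc6"]

def rank(scores, all_docs, fill):
    """Rank documents by score, using `fill` for documents without the term."""
    filled = {doc: scores.get(doc, fill) for doc in all_docs}
    # sorted() is stable, so tied documents keep their original order
    return sorted(all_docs, key=lambda doc: filled[doc], reverse=True)

zero_filled = rank(scores, all_docs, 0.0)
inf_filled = rank(scores, all_docs, -math.inf)

print(zero_filled)  # documents without the term come first
print(inf_filled)   # documents containing the term come first
```

With zero-filling, the four documents that do not contain the term outrank the two that do. With -infinity, Doc4 and Doc2 move to the top, and any document scoring -infinity can be dropped from the results in the same way a zero score is dropped in plain tf-idf.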