## Assignment 3: Exploring IR & NLP
- In this assignment we are going to implement various IR techniques <b><i>From Scratch</i></b>, Please don't use available libraries except if specified that you can use it.
- You are required to submit 6 different functions for this assignment, you can additional helper functions but only 6 will be tested.
- You will be granted 10 marks for clean code and documenting the code.
- Student Name: Tai Siang Huang
- ID: 9006413


In [32]:
sample_sentences = [
    "Python is a versatile programming language, python proved its importance in various domains.",
    "JavaScript is widely used for web development.",
    "Java is known for its platform independence.",
    "Programming involves writing code to solve problems.",
    "Data structures are crucial for efficient programming.",
    "Algorithms are step-by-step instructions for solving problems.",
    "Version control systems help manage code changes in collaboration.",
    "Debugging is the process of finding and fixing errors in python code.",
    "Web frameworks simplify the development of web applications.",
    "Artificial intelligence can be applied in various programming tasks."
]

#### PART A: Preprocessing (15 Marks)
- You are required to preprocess the text and apply the tokenization process.<br/>
- Proprocessing should include tokenization, normalization, stemming <b>OR</b> lemmatization, and Named Entity Recognition (NER).<br/>
- You need to make sure that Named Entities are not broken into separate tokens, but should be normalized by case-folding only. <br/>
- The output of this step should be list of tokenized sentences. [[sentence1_token1, sentence1_token2, .. .], [sentence2_token1, .. .], .. .] <br/>
- Please write the functionality of clean_sentences as explained in the comment (Please do comment your code at each essential step) <br/>

In [33]:
import spacy
import re
import math
import numpy as np
import pandas as pd

nlp = spacy.load("en_core_web_sm")

In [34]:
## You are allowed for PART A to use any library that would help you in the task.
def clean_sentences(sentences=None):
    ## This function takes as an input list of sentences
    tokenized_sentences = []
    for i, sentence in enumerate(sentences, start=1):
        doc = nlp(sentence)

        entities = doc.ents 
        print(entities)  # only JavaScript and Java are recognized as entities, there are no inseparable NER 

        tokens = []
        for token in doc:
            # print(token)
            # remove punctuation stop words and spaces
            if not token.is_punct and not token.is_space and not token.is_stop:
                # normalize the token
                token = token.text.lower()
                # remove punctuation and digits
                token = re.sub(r'[^a-zA-Z0-9 ]', '', token)

                tokens.append(token)
        # add the tokenized sentence to the list
        tokenized_sentences.append(tokens)
        # print("----tokenized_sentences----",tokenized_sentences)

    ## This function returns a list of tokenized_sentences
    return tokenized_sentences

clean_sentences(sample_sentences)

()
(JavaScript,)
(Java,)
()
()
()
()
()
()
()


[['python',
  'versatile',
  'programming',
  'language',
  'python',
  'proved',
  'importance',
  'domains'],
 ['javascript', 'widely', 'web', 'development'],
 ['java', 'known', 'platform', 'independence'],
 ['programming', 'involves', 'writing', 'code', 'solve', 'problems'],
 ['data', 'structures', 'crucial', 'efficient', 'programming'],
 ['algorithms', 'step', 'step', 'instructions', 'solving', 'problems'],
 ['version',
  'control',
  'systems',
  'help',
  'manage',
  'code',
  'changes',
  'collaboration'],
 ['debugging', 'process', 'finding', 'fixing', 'errors', 'python', 'code'],
 ['web', 'frameworks', 'simplify', 'development', 'web', 'applications'],
 ['artificial', 'intelligence', 'applied', 'programming', 'tasks']]

#### PART B: Building IR Sentence-Word Representation (30 Marks)

- Question B-1: Create a method that takes as an input a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences. The method MUST return the <b>inverted index</b> that is sufficient to represent the document. Assume that each sentence is a document and the sentence ID starts from 1. (10)

In [35]:
def get_inverted_index(list_of_sentence_tokens):
    ## TODO: Implement the functionality that will return the inverted index
    output = {}
    # start from 1 to match the sentence number
    for i, sentence in enumerate(list_of_sentence_tokens, start=1):
        # run through each token in the sentence
        for token in sentence:
            if token not in output:
                #  create a new entry in the dictionary
                output[token] = []
            # append the sentence number to the list of the token
            output[token].append(i)

    return  output #THIS IS A PLACEHOLDER FOR THE OUTPUT YOU NEED TO OVERWRITE

get_inverted_index(clean_sentences(sample_sentences))

()
(JavaScript,)
(Java,)
()
()
()
()
()
()
()


{'python': [1, 1, 8],
 'versatile': [1],
 'programming': [1, 4, 5, 10],
 'language': [1],
 'proved': [1],
 'importance': [1],
 'domains': [1],
 'javascript': [2],
 'widely': [2],
 'web': [2, 9, 9],
 'development': [2, 9],
 'java': [3],
 'known': [3],
 'platform': [3],
 'independence': [3],
 'involves': [4],
 'writing': [4],
 'code': [4, 7, 8],
 'solve': [4],
 'problems': [4, 6],
 'data': [5],
 'structures': [5],
 'crucial': [5],
 'efficient': [5],
 'algorithms': [6],
 'step': [6, 6],
 'instructions': [6],
 'solving': [6],
 'version': [7],
 'control': [7],
 'systems': [7],
 'help': [7],
 'manage': [7],
 'changes': [7],
 'collaboration': [7],
 'debugging': [8],
 'process': [8],
 'finding': [8],
 'fixing': [8],
 'errors': [8],
 'frameworks': [9],
 'simplify': [9],
 'applications': [9],
 'artificial': [10],
 'intelligence': [10],
 'applied': [10],
 'tasks': [10]}

- Question B-2: Create a method that takes as an input a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences. The method MUST return the <b>Positional index</b> that is sufficient to represent the document. Assume that each sentence is a document and the sentence ID starts from 1, and the first token in the list is at position 0. Make sure to consider multiple appearance of the same token. (10)

In [36]:
def get_positional_index(list_of_sentence_tokens):
    ## TODO: Implement the functionality that will return the positional index
    output = {} 
    # start from 1 to match the sentence number
    for i, sentence in enumerate(list_of_sentence_tokens, start=1):
        # run through each token in the sentence
        for position, token in enumerate(sentence):
            # check if the token is already in the dictionary
            if token not in output:
                #  create a new entry in the dictionary
                output[token] = {}

            # check if the sentence number is already in the dictionary
            if i not in output[token]:
                #  initialize a new list for the sentence number
                output[token][i] = []

            # append the position of the token in the sentence
            output[token][i].append(position)

    return  output #THIS IS A PLACEHOLDER FOR THE OUTPUT YOU NEED TO OVERWRITE

get_positional_index(clean_sentences(sample_sentences))

()
(JavaScript,)
(Java,)
()
()
()
()
()
()
()


{'python': {1: [0, 4], 8: [5]},
 'versatile': {1: [1]},
 'programming': {1: [2], 4: [0], 5: [4], 10: [3]},
 'language': {1: [3]},
 'proved': {1: [5]},
 'importance': {1: [6]},
 'domains': {1: [7]},
 'javascript': {2: [0]},
 'widely': {2: [1]},
 'web': {2: [2], 9: [0, 4]},
 'development': {2: [3], 9: [3]},
 'java': {3: [0]},
 'known': {3: [1]},
 'platform': {3: [2]},
 'independence': {3: [3]},
 'involves': {4: [1]},
 'writing': {4: [2]},
 'code': {4: [3], 7: [5], 8: [6]},
 'solve': {4: [4]},
 'problems': {4: [5], 6: [5]},
 'data': {5: [0]},
 'structures': {5: [1]},
 'crucial': {5: [2]},
 'efficient': {5: [3]},
 'algorithms': {6: [0]},
 'step': {6: [1, 2]},
 'instructions': {6: [3]},
 'solving': {6: [4]},
 'version': {7: [0]},
 'control': {7: [1]},
 'systems': {7: [2]},
 'help': {7: [3]},
 'manage': {7: [4]},
 'changes': {7: [6]},
 'collaboration': {7: [7]},
 'debugging': {8: [0]},
 'process': {8: [1]},
 'finding': {8: [2]},
 'fixing': {8: [3]},
 'errors': {8: [4]},
 'frameworks': {9: [1

- Question B-3: Create a method that takes as an input a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences. The method MUST return the <b>TF-IDF Matrix</b> that is sufficient to represent the documents, the tokens are expected to be sorted as well as documentIDs. Assume that each sentence is a document and the sentence ID starts from 1. (10) You are not allowed to use any libraries.

In [37]:
def get_TFIDF_matrix(list_of_sentence_tokens):
    ## TODO: Implement the functionality that will return the tf-idf matrix

    print("list_of_sentence_tokens", list_of_sentence_tokens)
    # unique tokens from the list
    unique_tokens = set()
    for sentence in list_of_sentence_tokens:
        for token in sentence:
            unique_tokens.add(token)
    # unique_tokens = sorted(unique_tokens)

    # compute the DF (Document Frequency) 
    DF = {}
    for token in unique_tokens:
        # initialize the DF of the token to 0
        DF[token] = 0
        # run through each sentence
        for sentence in list_of_sentence_tokens:
            if token in sentence:
                # plus 1 to the DF of the token
                DF[token] += 1
    print("DF", DF)

    # compute the IDF (Inverse Document Frequency) log(N/DF)
    IDF = {}
    # N is the number of sentences
    N = len(list_of_sentence_tokens)
    for token in unique_tokens:
        IDF[token] = math.log(N / DF[token])
    print("IDF", IDF)

    
    # convert to this structure [[token1_TF, token2_TF], [document2]]
    # convert to this structure [[token1_TF_IDF, token2_TF_IDF], [document2]]
    # create the same structure as the list_of_sentence_tokens but fill it with 0
    # Initialize TF and TF_IDF matrices with zeros
    TF = []
    TF_IDF = []
    for i, sentence in enumerate(list_of_sentence_tokens):
        for j, token in enumerate(sentence):
            # create a new list for each sentence
            if i >= len(TF):
                TF.append([0] * len(sentence))
                TF_IDF.append([0] * len(sentence))
            # initialize the TF and TF-IDF of the token to 0
            TF[i][j] = 0
            TF_IDF[i][j] = 0

    print("TF", TF)
    print("TF_IDF", TF_IDF)

    # compute the TF (Term Frequency)
    # (TF = count(token) / len(sentence))
    for i, sentence in enumerate(list_of_sentence_tokens):
        # print("sentence",i, sentence)
        # run through each token in the sentence
        for j, token in enumerate(sentence):
            # print("token", j,token)
            # count the number of times the token appears in the sentence
            count = sentence.count(token)
            # compute the TF of the token
            TF[i][j] = count / len(sentence)
    #         print("TF[i][j]",i,j, TF[i][j])
    #     print("TF[i]",i, TF[i])
    print("TF", TF)

    # compute the TF-IDF matrix
    # TF_IDF = TF * IDF
    #  run through each sentence
    for i, sentence in enumerate(list_of_sentence_tokens):
        for j, token in enumerate(sentence):
            # compute the TF-IDF of the token
            TF_IDF[i][j] = TF[i][j] * IDF[token]

    print("TF_IDF", TF_IDF)



    return  TF_IDF #THIS IS A PLACEHOLDER FOR THE OUTPUT YOU NEED TO OVERWRITE

get_TFIDF_matrix(clean_sentences(sample_sentences))

()
(JavaScript,)


(Java,)
()
()
()
()
()
()
()
list_of_sentence_tokens [['python', 'versatile', 'programming', 'language', 'python', 'proved', 'importance', 'domains'], ['javascript', 'widely', 'web', 'development'], ['java', 'known', 'platform', 'independence'], ['programming', 'involves', 'writing', 'code', 'solve', 'problems'], ['data', 'structures', 'crucial', 'efficient', 'programming'], ['algorithms', 'step', 'step', 'instructions', 'solving', 'problems'], ['version', 'control', 'systems', 'help', 'manage', 'code', 'changes', 'collaboration'], ['debugging', 'process', 'finding', 'fixing', 'errors', 'python', 'code'], ['web', 'frameworks', 'simplify', 'development', 'web', 'applications'], ['artificial', 'intelligence', 'applied', 'programming', 'tasks']]
DF {'debugging': 1, 'programming': 4, 'known': 1, 'intelligence': 1, 'version': 1, 'widely': 1, 'javascript': 1, 'proved': 1, 'applied': 1, 'development': 2, 'instructions': 1, 'simplify': 1, 'problems': 2, 'involves': 1, 'importance': 1, 'process

[[0.40235947810852507,
  0.28782313662425574,
  0.11453634148426939,
  0.28782313662425574,
  0.40235947810852507,
  0.28782313662425574,
  0.28782313662425574,
  0.28782313662425574],
 [0.5756462732485115,
  0.5756462732485115,
  0.40235947810852507,
  0.40235947810852507],
 [0.5756462732485115,
  0.5756462732485115,
  0.5756462732485115,
  0.5756462732485115],
 [0.15271512197902584,
  0.3837641821656743,
  0.3837641821656743,
  0.20066213405432268,
  0.3837641821656743,
  0.26823965207235],
 [0.4605170185988092,
  0.4605170185988092,
  0.4605170185988092,
  0.4605170185988092,
  0.18325814637483104],
 [0.3837641821656743,
  0.7675283643313486,
  0.7675283643313486,
  0.3837641821656743,
  0.3837641821656743,
  0.26823965207235],
 [0.28782313662425574,
  0.28782313662425574,
  0.28782313662425574,
  0.28782313662425574,
  0.28782313662425574,
  0.15049660054074201,
  0.28782313662425574,
  0.28782313662425574],
 [0.32894072757057796,
  0.32894072757057796,
  0.32894072757057796,
  0.3

#### PART C- Measuring Documents Similarity

##### Create a method that takes as an input: (15)
 - a 2-dimensional list where each of the inner dimensions is a sentence list of tokens, and the outer dimension is the list of the sentences.
 - A method name: "tfidf", "inverted"
 - A Search Query
 - Return the rank of the sentences based on the given method and a query <br>

***Hint: For inverted index we just want documents that have the query word/words, for tfidf you must show the ranking based on highest tfidf score***

In [38]:
def get_ranked_documents(list_of_sentence_tokens, method_name, search_query):
    # TODO: Implement the functionality that returns the rank of the documents based on the method given and the search query
    # Preprocess the query consistently with documents
    query_tokens = clean_sentences([search_query])[0]
    # unique tokens in the query
    query_tokens_set = set(query_tokens)

    # fill the scores with 0 [document_0_score, document_1_score, document_2_score, ...]
    scores = [0] * len(list_of_sentence_tokens)
    # print("scores", scores)
    ## If the method is "inverted" then rank the documents based on the number of matching tokens
    if method_name == "inverted":
        # get the inverted index
        inverted_index = get_inverted_index(list_of_sentence_tokens)        
        # run through each token in the sentence
        for token in query_tokens_set:
            # check if the token is in the inverted index
            if token in inverted_index:
                # run through each sentence that contains the token
                # print("inverted_index[token]",token, inverted_index[token])
                for sentence_id in inverted_index[token]:
                    print("sentence_id", sentence_id)
                    # increment the score of the sentence
                    scores[sentence_id - 1] += 1
            else:
                scores    
        
        print(scores)


    ## If the method is "tfidf" then use the tfidf score equation in slides and return ranking based on the score
    elif method_name == "tfidf":
        # get the TF-IDF matrix
        tfidf_matrix = get_TFIDF_matrix(list_of_sentence_tokens)

        # get the index of the tokens in the TF-IDF matrix
        # same token will be replaced with the TF-IDF score
        # readable_conversion = {document1_id:{token1:TF_IDF, token2:TF_IDF}, document2_id:{token1:TF_IDF, token2:TF_IDF}}
        readable_conversion = {}
        for i, sentence_thidfs in enumerate(tfidf_matrix):
            readable_conversion[i + 1] = {}
            for j, tfidf in enumerate(sentence_thidfs):
                # check if the tfidf is in the inverted index
                if tfidf != 0:
                    readable_conversion[i + 1][list_of_sentence_tokens[i][j]] = tfidf
        print("readable_conversion", readable_conversion)

        # run through each readable_conversion
        for i, document in readable_conversion.items():
            # print("document", i, document)
            # run through each token in the query
            for token in query_tokens_set:
                # check if the token is in the document
                if token in document:
                    # i - 1 because the document id starts from 1 
                    scores[i - 1] += document[token]
        print("scores", scores)
    else:
        raise ValueError("Method name must be either 'inverted' or 'tfidf'")
    ## The document with highest relevance should be ranked first
    ## list method should return the index of the documents based on highest ranking first
    rank_list = []
    
    # rank the sentences based on the score
    rank_list = []
    for i, score in enumerate(scores):
        if score > 0: 
            rank_list.append((i, score))
        else:
            rank_list.append((i, 0))
    
    # sort the rank_list by score in descending order
    rank_list.sort(key=lambda x: x[1], reverse=True)
    # get the document id
    rank_list = [item[0] for item in rank_list]
    return rank_list

get_ranked_documents(clean_sentences(sample_sentences), "inverted", "Python programming")
# get_ranked_documents(clean_sentences(sample_sentences), "tfidf", "Python programming")

()
(JavaScript,)
(Java,)
()
()
()
()
()
()
()
()
sentence_id 1
sentence_id 1
sentence_id 8
sentence_id 1
sentence_id 4
sentence_id 5
sentence_id 10
[3, 0, 0, 1, 1, 0, 0, 1, 0, 1]


[0, 3, 4, 7, 9, 1, 2, 5, 6, 8]

#### PART D- TFIDF with a TWIST (30 Marks)

##### TFIDF with Custom Weighting Based on Document Length and Term Position
- You are expected to implement a twisted version of the TF-IDF vectorizer, that incorporates two additional features:
    - Document Length
    - Term Position
- This twist aims to assign weight based on Modified Term Frequency (MTF) and Modified inverse Document Frequency (MIDF)
1. Modified Term Frequency (MTF):
    - MTF is calculated by taking into consideration the position of the term into account
    - The assumption is the closer the term appears to the beginning of the document, the higher the weight should be.
    - $$\text{MTF}(t, d) = \frac{f(t, d)}{1 + \text{position}(t, d)}$$
        - Where f(t,d) is the raw count of term t in document d.
        - position(t,d) is the position of the first occurence of term t in document d.
2. Modified Inverse Document Frequency (MIDF):
    - MIDF is calculated taking into consideration the document length.
    - The assumption is that the IDF should be inversely proportion not only to the number of documents it appears at, but also to the average length of documents where the term appears. 
    - Hence, longer documents are less significant for a term's relevance.
    - $$\text{MIDF}(t) = \log \left( \frac{N}{\text{df}(t) \times \frac{1}{M} \sum_{d \in D_{t}} |d|} \right)$$

        - N is the total number of documents
        - df(t) is the document frequency
        - M is a constant for scaling
        - $${\sum_{d \in D_{t}} |d|}$$
                 is the sum of the lengths of all documents that contain t
        - |d| is the length of document d
3. Final Weight (MTF-MIDF):
    - The Combined is calculated as : MTF(t,d)*MIDF(t)

##### Part 4-A: Implement the function logic for getting modified tf-idf weightings. (20 Marks)
<b><u>NOTE: M is a scaling factor, setting it to 5 in our example would be sufficient. However, you need to explore what does increasing and decreasing it represent.</u></b>

In [39]:
def get_modified_tfidf_matrix(list_of_sentence_tokens):
    ## TODO: Implement the functionality that will return the modified tf-idf matrix
    
    # set a constant for scaling
    M = 5
    
    # unique tokens from the list
    unique_tokens = set()
    for sentence in list_of_sentence_tokens:
        for token in sentence:
            unique_tokens.add(token)
    # unique_tokens = sorted(unique_tokens)

    # compute the DF (Document Frequency) and the sum of lengths of the sentences that contain the token
    DF = {}
    sum_lengths = {}
    for token in unique_tokens:
        # initialize the DF of the token to 0
        DF[token] = 0
        sum_lengths[token] = 0
        # run through each sentence
        for sentence in list_of_sentence_tokens:
            if token in sentence:
                # plus 1 to the DF of the token
                DF[token] += 1
                # compute the sum of lengths of the sentences that contain the token
                sum_lengths[token] += len(sentence)

    print("DF", DF)
    print("sum_lengths", sum_lengths)

    # compute the MIDF (Modified Inverse Document Frequency) log(N*M/(DF*sum_length_of_sentence))
    MIDF = {}
    # N is the number of sentences
    N = len(list_of_sentence_tokens)
    for token in unique_tokens:
        MIDF[token] = math.log(N * M / (DF[token] * sum_lengths[token]))
    print("MIDF", MIDF)

    
    # convert to this structure [[token1_MTF, token2_MTF], [document2]]
    # convert to this structure [[token1_MTF_MIDF, token2_MTF_MIDF], [document2]]
    # create the same structure as the list_of_sentence_tokens but fill it with 0
    # Initialize MTF and MTF_MIDF matrices with zeros
    MTF = []
    MTF_MIDF = []
    for i, sentence in enumerate(list_of_sentence_tokens):
        for j, token in enumerate(sentence):
            # create a new list for each sentence
            if i >= len(MTF):
                MTF.append([0] * len(sentence))
                MTF_MIDF.append([0] * len(sentence))
            # initialize the MTF and MTF-MIDF of the token to 0
            MTF[i][j] = 0
            MTF_MIDF[i][j] = 0

    # print("MTF", MTF)
    # print("MTF_MIDF", MTF_MIDF)

    positional_index = get_positional_index(list_of_sentence_tokens)

    # compute the MTF (Term Frequency)
    # (MTF = count(token) / 1 + positions(token))
    for i, sentence in enumerate(list_of_sentence_tokens):
        # print("sentence",i, sentence)
        # run through each token in the sentence
        for j, token in enumerate(sentence):

            # print("token", j,token)
            if token in positional_index and i in positional_index[token]:
                first_pos = positional_index[token][i][0]  # First position
            else:
                first_pos = 0
            # print("first_pos",first_pos)
            # count the number of times the token appears in the sentence
            count = sentence.count(token)
            # compute the MTF of the token
            MTF[i][j] = count / (1 + first_pos)
        #     print("MTF[i][j]",i,j, MTF[i][j])
        # print("MTF[i]",i, MTF[i])
    print("MTF", MTF)

    # compute the MTF-MIDF matrix
    # MTF_MIDF = MTF * MIDF
    #  run through each sentence
    for i, sentence in enumerate(list_of_sentence_tokens):
        for j, token in enumerate(sentence):
            # compute the MTF-MIDF of the token
            MTF_MIDF[i][j] = MTF[i][j] * MIDF[token]

    print("MTF_MIDF", MTF_MIDF)


    return  MTF_MIDF #THIS IS A PLACEHOLDER FOR THE OUTPUT YOU NEED TO OVERWRITE

get_modified_tfidf_matrix(clean_sentences(sample_sentences))

()
(JavaScript,)
(Java,)
()
()
()
()
()
()
()
DF {'debugging': 1, 'programming': 4, 'known': 1, 'intelligence': 1, 'version': 1, 'widely': 1, 'javascript': 1, 'proved': 1, 'applied': 1, 'development': 2, 'instructions': 1, 'simplify': 1, 'problems': 2, 'involves': 1, 'importance': 1, 'process': 1, 'java': 1, 'step': 1, 'solving': 1, 'errors': 1, 'frameworks': 1, 'code': 3, 'control': 1, 'manage': 1, 'versatile': 1, 'language': 1, 'independence': 1, 'finding': 1, 'applications': 1, 'writing': 1, 'data': 1, 'help': 1, 'systems': 1, 'artificial': 1, 'python': 2, 'domains': 1, 'platform': 1, 'solve': 1, 'algorithms': 1, 'collaboration': 1, 'changes': 1, 'fixing': 1, 'web': 2, 'tasks': 1, 'structures': 1, 'efficient': 1, 'crucial': 1}
sum_lengths {'debugging': 7, 'programming': 24, 'known': 4, 'intelligence': 5, 'version': 8, 'widely': 4, 'javascript': 4, 'proved': 8, 'applied': 5, 'development': 10, 'instructions': 6, 'simplify': 6, 'problems': 12, 'involves': 6, 'importance': 8, 'process'

[[1.0216512475319814,
  1.8325814637483102,
  -0.6523251860396901,
  1.8325814637483102,
  1.0216512475319814,
  1.8325814637483102,
  1.8325814637483102,
  1.8325814637483102],
 [2.5257286443082556,
  2.5257286443082556,
  0.9162907318741551,
  0.9162907318741551],
 [2.5257286443082556,
  2.5257286443082556,
  2.5257286443082556,
  2.5257286443082556],
 [-0.6523251860396901,
  2.120263536200091,
  2.120263536200091,
  -0.2311117209633867,
  2.120263536200091,
  0.7339691750802005],
 [2.302585092994046,
  2.302585092994046,
  2.302585092994046,
  2.302585092994046,
  -0.6523251860396901],
 [2.120263536200091,
  4.240527072400182,
  4.240527072400182,
  2.120263536200091,
  2.120263536200091,
  0.7339691750802005],
 [1.8325814637483102,
  1.8325814637483102,
  1.8325814637483102,
  1.8325814637483102,
  1.8325814637483102,
  -0.2311117209633867,
  1.8325814637483102,
  1.8325814637483102],
 [1.9661128563728327,
  1.9661128563728327,
  1.9661128563728327,
  1.9661128563728327,
  1.966112

##### Part 4-B: Experiment the effect of changing M and comment on what do you think M is for and why is it added. (5) 

- <b> Your answer here

- **M** is a scaling factor in the MIDF formula that balances the denominator, which the both number of documents a term appears in and their total length.
- If term appears in `longer` documents, the average length of the documents will `increase`, causing the denominator to `increase`, thereby `reducing` the MIDF value.
- If term appears in `shorter` documents, the average length of the documents will `decrease`, causing the denominator to `decrease`, thereby `increasing` the MIDF value.
- Additionally,
    - Increasing **M** will reduce the penalty for terms appearing in longer documents.
    - Decreasing **M** will increase the penalty for terms in longer documents.
- The reason of add **M**, can represent how document length impacts term relevance then providing flexibility for different tasks.

##### Part 4-C: Do you think Modified TF-Modified IDF is a good technique? Please comment and explain your thoughts.(5)

- <b> Your answer here</b>

- <u>**MTF is sensitivity with the position**</u>, prioritizes terms appearing earlier, which for the documents where the beginning, like titles, is more significant.
- <u>**MIDF penalizes terms in widely documents**</u>, could potentially improve the connection between collections with varying document sizes.
- <u>**Effectiveness depends on the context**</u>, it is a good technique for applications where term position and document length are critical, for instance news articles.
- However, it should be used carefully in diverse research corpora where document length and structure vary widely.