# NOTE:

1. Please implement the TF-IDF function so that each word in a sentence is assigned its TF-IDF value. Because the corpus has 4 sentences of 6 words each, the function should return a 4 x 6 matrix in which the rows represent sentences and the columns represent the words of each sentence, in order. We wish to keep it simple in the beginning.

2. In reality, a TF-IDF function should return a matrix in which the rows represent sentences and the columns represent the words of the vocabulary (i.e., the features). Every sentence vector in this matrix is 'd'-dimensional, where d = number of unique words in the corpus (i.e., the vocabulary size).
Every cell in a sentence vector corresponds to a particular word in the vocabulary. If the word is not present in the current sentence, the cell is assigned 0; otherwise it is assigned the word's TF-IDF value.
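
The vocabulary-indexed layout described in point 2 can be sketched as follows. This is a minimal illustration, not the output the grader expects: the helper name `tfidf_vocabulary_matrix` is hypothetical, and the unsmoothed `tf * ln(N / df)` weighting is an assumption chosen to match the simplified function in this notebook.

```python
from math import log

import numpy as np


def tfidf_vocabulary_matrix(corpus):
    """Hypothetical helper: one row per sentence, one column per vocabulary word."""
    # Tokenise on whitespace and lower-case every word.
    docs = [doc.lower().split() for doc in corpus]

    # A sorted vocabulary gives every word a fixed column index.
    vocabulary = sorted({word for doc in docs for word in doc})
    column_of = {word: j for j, word in enumerate(vocabulary)}

    # Document frequency: number of sentences that contain each word.
    doc_freq = {word: 0 for word in vocabulary}
    for doc in docs:
        for word in set(doc):
            doc_freq[word] += 1

    # Cells stay 0 for words that are absent from a sentence.
    matrix = np.zeros((len(docs), len(vocabulary)))
    for i, doc in enumerate(docs):
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            idf = log(len(docs) / doc_freq[word])
            matrix[i, column_of[word]] = tf * idf
    return vocabulary, matrix


corpus = [
    'this is the first document mostly',
    'this document is the second document',
    'and this is the third one',
    'is this the first document here',
]
vocabulary, matrix = tfidf_vocabulary_matrix(corpus)
print(matrix.shape)  # (4, 11): 4 sentences, 11 unique words
```

Sorting the vocabulary is one way to make the column order reproducible; any fixed ordering would serve the same purpose.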

# **Implement TF-IDF from scratch**

In this assignment, you will implement TF-IDF vectorization of text from scratch using only Python, its inbuilt data structures, and NumPy. You will then verify the correctness of your implementation using a "grader" function/cell (provided by us), which compares your output against reference values.

The grader function helps you validate the correctness of your code.

Please submit the final Colab notebook in the classroom ONLY after you have verified your code using the grader function/cell.
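
For reference, the quantities to be computed can be written as below. The assignment text does not spell out a formula, so this form (natural logarithm, no smoothing) is an assumption; it is, however, consistent with the reference values in the grader cell. The symbols $N$ (number of sentences in the corpus) and $\mathrm{df}(t)$ (number of sentences containing the word $t$) are notation introduced here.

$$
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d},
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

As a worked example, the word "first" appears once in the 6-word sentence 'this is the first document mostly' and in 2 of the 4 sentences, so $\text{tf-idf} = \frac{1}{6} \ln\frac{4}{2} \approx 0.1155$, which rounds to 0.12. A word such as "this", which appears in all 4 sentences, gets $\ln\frac{4}{4} = 0$.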

**(FAQ) Why bother implementing a function to compute TF-IDF when it is already available in major libraries?**

Ans.
1. It helps you improve your coding proficiency.
2. It helps you obtain a deeper understanding of the concepts and how they work internally. Knowledge of the internals will also help you debug problems better.
3. Many product-based startups and companies focus on this in their interviews to gauge your depth and clarity of understanding along with your programming skills. Hence, most top universities include implementations of some ML algorithms/concepts as mandatory assignments.

**NOTE: DO NOT change the "grader" functions or code snippets written by us. Please add your code in the suggested locations.**

Ethics Code:
1. You are welcome to consult online resources while implementing the code.
2. You can also discuss the implementation with your classmates over Slack.
3. However, the code you write and submit must be yours ONLY. Your code will be compared against other students' code and online code snippets to check for plagiarism. If your code is found to be plagiarised, you will be awarded zero marks for all assignments, which carry a 10% weightage in the final marks for this course.

In [4]:
# Corpus to be used for this assignment

corpus = [
     'this is the first document mostly',
     'this document is the second document',
     'and this is the third one',
     'is this the first document here',
]

In [5]:
# Please implement this function and write your code wherever asked. Do NOT change the code snippets provided by us.
import numpy as np
from math import log

def computeTFIDF(corpus):
  """Given a list of sentences as "corpus", return the TF-IDF vectors for all the 
  sentences in the corpus as a numpy 2D matrix. 
  
  Each row of the 2D matrix must correspond to one sentence 
  and each column corresponds to a word in the text corpus. 
  
  Please order the rows in the same order as the 
  sentences in the input "corpus". 
    
  Ignore punctuation symbols such as commas, full stops,
  exclamation marks and question marks in the input corpus.
  
  For example, if the corpus contains sentences with these
  9 distinct words, ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'],
  then the first column of the 2D matrix will correspond to the word "and", the second column will
  correspond to the word "document" and so on.
  
  Write this function using basic Python code, inbuilt data structures and NumPy ONLY.

  Implement the code as efficiently as possible using the inbuilt data structures of Python.
  """

  ##############################################################
  ####   YOUR CODE BELOW  as per the above instructions #######
  ##############################################################
  tf_idf_mat=[]

  total_documents_in_corpus=len(corpus)
  # Collect every lower-cased word in the corpus
  corpus_words=[]
  for doc in corpus:
    for word in doc.split(" "):
      corpus_words.append(word.lower())
  
  term, counts = np.unique(np.array(corpus_words),return_counts=True)
  corpus_dict=dict(zip(term, counts))
  
  # Total unique words in corpus 
  total_unique_words_in_corpus=len(corpus_dict)
  print(corpus_dict)
  
  # Document frequency (not the term frequency) for the IDF calculation:
  # dictionary = { word : number of documents in the corpus that contain it }
  # Whole tokens are compared here; a substring test would also count
  # documents where the word is only part of another word ("is" in "this").
  word_present_in_n_docs={}

  for doc in corpus:
    for word in set(doc.lower().split(" ")):
      word_present_in_n_docs[word]=word_present_in_n_docs.get(word,0)+1

  # Calculating TF-IDF
  for doc in corpus:
    # Split into words and lower-case them so that "This" and "this" count
    # as the same word. The assignment corpus is already lower-case, but
    # normalising keeps the function safe for other inputs.
    doc_word=doc.split(" ")
    doc_word=[word.lower() for word in doc_word ]
    current_doc_length=len(doc_word)

    # Word counts for the current document
    unique, counts = np.unique(np.array(doc_word),return_counts=True)
    curr_doc_word_counts=dict(zip(unique, counts))

    tf_idf_doc_vec=[]
    for word in doc_word:
      # Calculating TF
      word_count_in_current_doc=curr_doc_word_counts[word]
      tf_val_in_curr_word_doc=word_count_in_current_doc/current_doc_length

      # Calculating IDF
      idf_val_in_curr_word_doc=log(total_documents_in_corpus/word_present_in_n_docs[word])

      # Calculating TF-IDF
      tf_idf_val=round((tf_val_in_curr_word_doc*idf_val_in_curr_word_doc),2)
      tf_idf_doc_vec.append(tf_idf_val)
      
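    tf_idf_mat.append(tf_idf_doc_vec)

  # Convert once, after every document has been processed. Doing this inside
  # the loop repeats the work and leaves tf_idf undefined for an empty corpus.
  tf_idf=np.array(tf_idf_mat)
  return tf_idf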
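
The docstring asks for punctuation to be ignored, which the splitting on spaces used in `computeTFIDF` does not do; the assignment corpus happens to contain none. A minimal sketch of a tokenizer that handles it is shown below. The helper name `tokenize` and the choice of `string.punctuation` (ASCII punctuation only) are assumptions for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def tokenize(sentence):
    """Hypothetical helper: lower-case, drop punctuation, split on whitespace."""
    return sentence.lower().translate(_PUNCT_TABLE).split()


print(tokenize('Is this the first document, here?'))
# ['is', 'this', 'the', 'first', 'document', 'here']
```

Deleting punctuation outright also joins the parts of words such as "don't" into "dont"; replacing punctuation with a space instead would be an equally reasonable choice, depending on the corpus.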


# Grader Cell
Please execute the following Grader cell to verify the correctness of your implementation above. This cell will print "Success" if your implementation of computeTFIDF() is correct; otherwise, it will print "Failed". Make sure you get a "Success" before you submit the code in the classroom.

In [6]:
###########################################
## GRADER CELL: Do NOT Change this.
# This cell will print "Success" if your implementation of the computeTFIDF() is correct.
# Else, it will print "Failed"
###########################################
import numpy as np

# compute TF-IDF using the computeTFIDF() function
X_custom = computeTFIDF(corpus)

# Reference grader array - DO NOT MODIFY IT
X_grader = np.array(
    [[0, 0, 0, 0.12, 0.05, 0.23],
     [0, 0.1, 0, 0, 0.23, 0.1],
     [0.23, 0, 0, 0, 0.23, 0.23],
     [0, 0, 0, 0.12, 0.05, 0.23]]
     )

# compare X_grader and X_custom
comparison = ( X_grader == X_custom )
isEqual = comparison.all()

if isEqual:
  print("******** Success ********")
else:
  print("####### Failed #######")
  print("\nX_grader = \n\n", X_grader)
  print("\n","*"*50)
  print("\nX_custom = \n\n", X_custom)


{'and': 1, 'document': 4, 'first': 2, 'here': 1, 'is': 4, 'mostly': 1, 'one': 1, 'second': 1, 'the': 4, 'third': 1, 'this': 4}
******** Success ********
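
The grader compares the two arrays with `==`, so a result passes only if every value is rounded to two decimals exactly as the reference is. When debugging a "Failed" result, a tolerance-based comparison can help tell a rounding problem apart from a wrong formula. The sketch below is a supplementary self-check, not a replacement for the grader; the sample values and the tolerance of 0.005 are assumptions for illustration.

```python
import numpy as np

reference = np.array([0.12, 0.05, 0.23])

# A result that was not rounded fails exact equality even though it
# agrees with the reference to two decimal places.
unrounded = np.array([0.1155, 0.0479, 0.2310])

print((reference == unrounded).all())                 # False
print(np.allclose(reference, unrounded, atol=0.005))  # True
```

If the exact comparison fails but `np.allclose` succeeds, the formula is probably right and only the rounding step differs.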
