## Term Frequency and Inverse Document Frequency

In [1]:
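# Import the library
import nltk
# One-time resource downloads, if they are not already installed
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')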

In [2]:
# Input text data
paragraph = """Real Madrid simply show no sign of letting up. The LaLiga table-toppers saw off Alavés at 
               the Di Stéfano to make it eight wins on the bounce and retain the four-point buffer at the 
               summit with three games to go. The Madrid goals came from Karim Benzema, who converted 
               from the spot, whilst Marco Asensio was also on the mark for the hosts, who recorded a fifth 
               successive shutout. Ferland Mendy started at left wing-back, with Lucas Vázquez occupying 
               the right wing-back berth and inside the first minute, the pair were involved in the madridistas' 
               first forward foray, which culminated in Luka Modric sending his effort wide of the target. 
               The Alavés response wasn't long in coming and Joselu's headed effort struck the crossbar, 
               whilst Raphaël Varane cleared a Lucas Pérez's follow-up off the line. It looked as if we were 
               in store for a high-tempo affair and just after the 10-minute mark, Mendy once again showed what a 
               threat he is down the left. Ximo Navarro upended the Frenchman in the area and Benzema stepped 
               up to make it 1-0. With 12 minutes gone, Toni Kroos’ did his best to find the top corner, before a 
               fierce Mendy cross nearly forced Camarasa to turn into his own net on 17’. The Blanquiazules refused
               to roll over though, with Oliver Burke proving a constant nuisance for the defence and testing
               Thibaut Courtois, despite the final chances before the break falling to Rodrygo and Benzema. 
               After the restart, referee Gil Manzano retired injured and by the time the 50th minute came around, 
               Madrid had added to their advantage. Benzema and Asensio raced through on goal, up against Roberto, 
               and the Balearic Island-born forward stroked home with ease, though his goal was originally ruled 
               out for offside before being correctly awarded by VAR."""

In [3]:
import re  # Regular expressions, used to strip punctuation and other non-letter characters
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Creating the object for Stemming and Lemmatization
ps = PorterStemmer()
wordnet = WordNetLemmatizer()

### Lemmatization

In [4]:
# Split the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

# Store the cleaned sentences
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # Replace every character other than a-z and A-Z with a space (this also drops accented letters such as 'é')
    review = review.lower()  # Convert the sentence to lowercase
    review = review.split()  # Split the sentence into a list of words
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]  # Drop stopwords and lemmatize the rest
    review = ' '.join(review)
    corpus.append(review)
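
One side effect of the pattern `[^a-zA-Z]` is worth checking on this text: accented letters are not in `a-z`, so names such as "Alavés" get split. The sketch below uses only the standard library to show the effect on a short fragment. The second pattern, `[\W\d_]`, is a possible alternative and not part of the original notebook; it keeps Unicode letters while still removing punctuation, digits and underscores.

```python
import re

# Fragment taken from the paragraph; "\u00e9" is the letter 'é'
fragment = "The Alav\u00e9s response wasn't long"

# Pattern used in the notebook: anything outside a-z / A-Z becomes a space
ascii_only = re.sub('[^a-zA-Z]', ' ', fragment)

# Assumed alternative: remove non-word characters, digits and underscores,
# which leaves accented letters untouched
unicode_aware = re.sub(r'[\W\d_]', ' ', fragment)

print(ascii_only)
print(unicode_aware)
```

Switching patterns would change the vocabulary, so the matrix shape reported later in this notebook would no longer match.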

In [6]:
print(sentences)
print()
print(corpus)

['Real Madrid simply show no sign of letting up.', 'The LaLiga table-toppers saw off Alavés at \n               the Di Stéfano to make it eight wins on the bounce and retain the four-point buffer at the \n               summit with three games to go.', 'The Madrid goals came from Karim Benzema, who converted \n               from the spot, whilst Marco Asensio was also on the mark for the hosts, who recorded a fifth \n               successive shutout.', "Ferland Mendy started at left wing-back, with Lucas Vázquez occupying \n               the right wing-back berth and inside the first minute, the pair were involved in the madridistas' \n               first forward foray, which culminated in Luka Modric sending his effort wide of the target.", "The Alavés response wasn't long in coming and Joselu's headed effort struck the crossbar, \n               whilst Raphaël Varane cleared a Lucas Pérez's follow-up off the line.", 'It looked as if we were \n               in store for a high-te

### Applying TF-IDF

In [7]:
# Importing the library
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating the object
tf = TfidfVectorizer()
# Fit the model to the corpus
X = tf.fit_transform(corpus).toarray()

In [8]:
# Shape of the matrix
X.shape

(11, 152)
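
The shape reads as (number of sentences, number of distinct terms): 11 cleaned sentences and a vocabulary of 152 words. A minimal sketch on a made-up three-sentence corpus shows how to check which term sits behind each column. The names `toy_corpus`, `toy_tf` and `toy_X` are hypothetical; `vocabulary_` is the fitted vectorizer's term-to-column mapping.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: three already-cleaned "sentences"
toy_corpus = ["madrid win game", "madrid score goal", "goal ruled offside"]

toy_tf = TfidfVectorizer()
toy_X = toy_tf.fit_transform(toy_corpus).toarray()

# Rows are documents, columns are vocabulary terms
print(toy_X.shape)

# Columns follow the alphabetical order of the terms
toy_terms = sorted(toy_tf.vocabulary_)
print(toy_terms)

# With the default settings each row is scaled to unit L2 length
toy_norms = np.linalg.norm(toy_X, axis=1)
print(toy_norms)
```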

In [9]:
import numpy as np
import sys
np.set_printoptions(threshold=sys.maxsize)
# View the matrix
print(X)

[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.42390093
  0.         0.         0.         0.         0.         0.31865342
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         

##### Notice that, unlike Bag of Words, the vectors here contain zeros and decimal values rather than integer counts. This is because each entry is the Term Frequency multiplied by the Inverse Document Frequency. Bag of Words weights every occurrence equally, so it cannot tell whether a word is distinctive or simply common across all sentences. TF-IDF down-weights words that appear in many sentences, which makes the vectors more informative as input to a model.
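
To make the multiplication concrete, here is a small standard-library sketch of the calculation. It assumes whitespace tokenization, a raw count for the term frequency, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row, which is how `TfidfVectorizer` behaves with its default settings as far as I understand them. The function name `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({word for words in tokenised for word in words})
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    df = Counter(word for words in tokenised for word in set(words))

    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

    rows = []
    for words in tokenised:
        counts = Counter(words)  # term frequency as a raw count
        raw = [counts[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = ["madrid win game", "madrid score goal", "goal ruled offside"]
vocab, rows = tfidf_rows(toy_docs)

# Weights of the first document, one line per vocabulary term
for word, weight in zip(vocab, rows[0]):
    print(f"{word:8s} {weight:.4f}")
```

In the first document, "madrid" also occurs in the second document, so it receives a lower weight than "game" or "win", which occur only once in the whole corpus. Words absent from a document stay at zero, which is why most of the matrix above is zeros.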

### Now, it's your turn to try this out by yourself. Till then, PEACE...✌️ 