### TF-IDF

- **TF** stands for **Term Frequency**  
TF = No. of repetitions of a word in a sentence / No. of words in the sentence
    
- **IDF** stands for **Inverse Document Frequency**  
IDF = log(No. of sentences / No. of sentences containing the word)

The TF-IDF score of a word in a sentence is the product TF * IDF.
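As a quick illustration, the textbook definitions can be computed by hand on three short reviews. This is only a sketch: the helper names `tf` and `idf` are made up for this example, the natural log is an assumption (any base works up to a constant factor), and the regex only approximates how a real tokenizer would split the text.

```python
import math
import re

corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Assumption: lowercase and keep tokens of 2+ word characters (so "I" is dropped)
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]


def tf(term, doc):
    """Occurrences of `term` in one sentence, divided by that sentence's length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Log of (number of sentences / number of sentences containing `term`).

    Raises ZeroDivisionError if `term` appears in no sentence.
    """
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing)


print(tf("loved", docs[0]))                        # 1 of 6 tokens
print(idf("loved", docs))                          # appears in 1 of 3 sentences
print(idf("movie", docs))                          # appears in all 3 sentences -> 0.0
print(tf("loved", docs[0]) * idf("loved", docs))   # TF-IDF of "loved" in sentence 1
```

A word that occurs in every sentence, such as "movie", gets an IDF of zero under this definition, which is why it carries no weight.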

In [7]:
## Unigrams (default)

from sklearn.feature_extraction.text import TfidfVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the vocabulary (array of feature names, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print(" ")
print("Document-Term Matrix:\n", X_dense)
print(X_dense.shape)   # 3 documents x 12 vocabulary terms

Vocabulary: ['but' 'fantastic' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay'
 'terrible' 'the' 'was']
 
Document-Term Matrix:
 [[0.         0.52523431 0.         0.         0.39945423 0.52523431
  0.31021184 0.         0.         0.         0.31021184 0.31021184]
 [0.44514923 0.         0.44514923 0.         0.         0.
  0.26291231 0.44514923 0.44514923 0.         0.26291231 0.26291231]
 [0.         0.         0.         0.52523431 0.39945423 0.
  0.31021184 0.         0.         0.52523431 0.31021184 0.31021184]]
(3, 12)
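The values above do not match the plain TF * IDF formula, because `TfidfVectorizer` by default uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the matrix by hand under those defaults; the regex is an assumption that approximates the default tokenizer, and `tfidf_row` is a helper name invented for this example.

```python
import math
import re
from collections import Counter

corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Assumption: approximate the default tokenizer (lowercase, tokens of 2+ word characters)
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF, as used when smooth_idf=True (the default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Raw count * smoothed IDF for every vocabulary term, L2-normalized."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in docs]
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.8f}")
```

For the first review this gives about 0.5252 for "fantastic" and "loved", 0.3995 for "it", and 0.3102 for "movie", "the", and "was", in line with the first row printed above.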


In [8]:
## Bigrams

from sklearn.feature_extraction.text import TfidfVectorizer
# Sample corpus of movie reviews
corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]

# Initialize the TfidfVectorizer to extract bigrams only (pairs of adjacent words)
vectorizer = TfidfVectorizer(ngram_range=(2,2))

# Fit and transform the corpus to a document-term matrix
X = vectorizer.fit_transform(corpus)

# Convert the document-term matrix into a dense format (optional for visualization)
X_dense = X.toarray()

# Get the vocabulary (array of bigram feature names, ordered by column index)
vocab = vectorizer.get_feature_names_out()

# Print the vocabulary and document-term matrix
print("Vocabulary:", vocab)
print(" ")
print("Document-Term Matrix:\n", X_dense)
print(X_dense.shape)   # 3 documents x 12 bigrams

Vocabulary: ['but not' 'hated the' 'it was' 'loved the' 'movie it' 'movie was'
 'not great' 'okay but' 'the movie' 'was fantastic' 'was okay'
 'was terrible']
 
Document-Term Matrix:
 [[0.         0.         0.40619178 0.53409337 0.40619178 0.
  0.         0.         0.31544415 0.53409337 0.         0.        ]
 [0.43238509 0.         0.         0.         0.         0.43238509
  0.43238509 0.43238509 0.2553736  0.         0.43238509 0.        ]
 [0.         0.53409337 0.40619178 0.         0.40619178 0.
  0.         0.         0.31544415 0.         0.         0.53409337]]
(3, 12)
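To see where the bigram vocabulary comes from, the pairs can be built by hand by sliding a window of two tokens over each review. This is a rough sketch: `bigrams` is a helper invented for this example, and the regex is an assumption that approximates the default tokenizer rather than the library's actual code path.

```python
import re

corpus = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, but not great.",
    "I hated the movie, it was terrible.",
]


def bigrams(text):
    """Return the adjacent word pairs of `text` as space-joined strings."""
    # Assumption: lowercase and keep tokens of 2+ word characters (so "I" is dropped)
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


bigram_vocab = sorted({bg for text in corpus for bg in bigrams(text)})

print(bigrams(corpus[0]))
print(bigram_vocab)
print(len(bigram_vocab))  # 12
```

Only "the movie" occurs in all three reviews, which is why it has the lowest weight in every row of the matrix above.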

