
In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
documents = [
    "I love programming in Python",
    "Python programming is Fun",
    "I love machine learning"
]

In [8]:
stop_words = set(stopwords.words('english'))

In [17]:
filtered_docs = [
    " ".join([word for word in word_tokenize(doc.lower()) if word.isalpha() and word not in stop_words])
    for doc in documents
]

In [19]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [20]:
filtered_docs

['love programming python', 'python programming fun', 'love machine learning']

In [21]:
vectorizer = TfidfVectorizer()

In [22]:
tfidf_matrix = vectorizer.fit_transform(filtered_docs)

In [23]:
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:\n", df_tfidf)



TF-IDF Matrix:
         fun  learning     love   machine  programming    python
0  0.000000  0.000000  0.57735  0.000000     0.577350  0.577350
1  0.680919  0.000000  0.00000  0.000000     0.517856  0.517856
2  0.000000  0.622766  0.47363  0.622766     0.000000  0.000000


In [26]:
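# filtered_docs already holds space-joined strings; joining them again
# would split every word into single characters.
clean_docs = filtered_docs

print("Cleaned Documents:", clean_docs)

Cleaned Documents: ['love programming python', 'python programming fun', 'love machine learning']


In [27]:
# Term frequency: count of each term divided by the number of terms in the document
analyzer = vectorizer.build_analyzer()  # the same tokenizer TfidfVectorizer applies
tf_values = []
for doc in clean_docs:
    word_counts = {}
    words = analyzer(doc)
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    doc_len = len(words)
    tf_values.append({word: count / doc_len for word, count in word_counts.items()})
print("\nTerm Frequencies (TF):")
for i, tf in enumerate(tf_values):
    print(f"Doc {i+1}:", tf)


Term Frequencies (TF):
Doc 1: {'love': 0.3333333333333333, 'programming': 0.3333333333333333, 'python': 0.3333333333333333}
Doc 2: {'python': 0.3333333333333333, 'programming': 0.3333333333333333, 'fun': 0.3333333333333333}
Doc 3: {'love': 0.3333333333333333, 'machine': 0.3333333333333333, 'learning': 0.3333333333333333}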


In [28]:
#inverse document frequency
idf_scores = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
print("\nInverse Document Frequency (IDF):")
for word, score in idf_scores.items():
    print(f"{word}: {score:.4f}")



Inverse Document Frequency (IDF):
fun: 1.6931
learning: 1.6931
love: 1.2877
machine: 1.6931
programming: 1.2877
python: 1.2877
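
The IDF values printed above can be checked by hand. The following is a minimal sketch, assuming the smoothed formula that scikit-learn documents for its default smooth_idf=True setting, idf = ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the number of documents containing the term. The names docs, n_docs, and doc_freq are illustrative and not part of the notebook.

```python
import math

# Filtered documents, copied from the notebook output.
docs = ["love programming python", "python programming fun", "love machine learning"]
n_docs = len(docs)

# Document frequency: how many documents contain each term.
doc_freq = {}
for doc in docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default).
idf = {
    term: math.log((1 + n_docs) / (1 + count)) + 1
    for term, count in sorted(doc_freq.items())
}

for term, score in idf.items():
    print(f"{term}: {score:.4f}")
```

Terms that appear in two documents (love, programming, python) get ln(4/3) + 1 ≈ 1.2877, and terms that appear in one document get ln(4/2) + 1 ≈ 1.6931, which agrees with vectorizer.idf_.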


In [29]:
print("\nTF-IDF scores for each document:")
for i, row in enumerate(tfidf_matrix.toarray()):
    print(f"Doc {i+1}:")
    for word, score in zip(vectorizer.get_feature_names_out(), row):
        if score > 0:
            print(f"  {word}: {score:.4f}")


TF-IDF scores for each document:
Doc 1:
  love: 0.5774
  programming: 0.5774
  python: 0.5774
Doc 2:
  fun: 0.6809
  programming: 0.5179
  python: 0.5179
Doc 3:
  learning: 0.6228
  love: 0.4736
  machine: 0.6228
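
These scores are not simply TF × IDF with a length-normalized TF. A rough sketch of what appears to happen under TfidfVectorizer's defaults: each raw term count is multiplied by the smoothed IDF, and the resulting row is then scaled to unit L2 length. The helper tfidf_row and the variable names below are hypothetical, written only to illustrate that calculation in plain Python.

```python
import math

# Filtered documents, copied from the notebook output.
docs = ["love programming python", "python programming fun", "love machine learning"]
n_docs = len(docs)

# Vocabulary in alphabetical order, matching get_feature_names_out().
vocab = sorted({term for doc in docs for term in doc.split()})

# Smoothed IDF per term (assumed default: smooth_idf=True).
doc_freq = {term: sum(term in doc.split() for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Hypothetical helper: raw count * idf, then L2-normalize the row."""
    tokens = doc.split()
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(weight * weight for weight in raw))
    return [weight / norm for weight in raw]


rows = [tfidf_row(doc) for doc in docs]
for i, row in enumerate(rows):
    nonzero = {term: round(w, 4) for term, w in zip(vocab, row) if w > 0}
    print(f"Doc {i+1}:", nonzero)
```

For Doc 2 the unnormalized weights are 1.6931 (fun), 1.2877 (programming), and 1.2877 (python); dividing by their L2 norm of about 2.4866 gives 0.6809, 0.5179, and 0.5179, matching the output above.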


How this program uses all TF-IDF concepts

    Tokenization → nltk.word_tokenize

    Lowercasing & Punctuation Removal → doc.lower() and word.isalpha()

    Stopword Removal → stopwords.words('english')

    TF Calculation → Term count ÷ total terms in the document (computed manually; TfidfVectorizer itself uses raw counts and L2-normalizes each row)

    IDF Calculation → idf = ln((1 + N) / (1 + df)) + 1 (scikit-learn default, smooth_idf=True)

    TF-IDF Matrix → Generated using TfidfVectorizer

    Readable Output → pandas.DataFrame for matrix & dictionary for TF/IDF