<a href="https://colab.research.google.com/github/MedAzzam/TF-IDF-from-scratch/blob/main/Build_TF_IDF_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF-IDF From Scratch

![TF-IDF](https://miro.medium.com/v2/1*swXqNsBUqcysa72mcb1_kw.png)

# Libraries & data

In [None]:
import pandas as pd
import numpy as np
import os
import nltk

from nltk import word_tokenize

In [None]:
dir_path = "C:/Users/pc/Documents/Python files/NLP/bbc"

data = []

for foldername in os.listdir(dir_path):
    if not os.path.isdir(os.path.join(dir_path, foldername)):
        continue
    for filename in os.listdir(os.path.join(dir_path, foldername)):
        if filename.endswith(".txt"):
            with open(os.path.join(dir_path, foldername, filename), "r") as f:
                content = f.read()
                data.append({"foldername": foldername, "filename": filename, "content": content})

In [None]:
df = pd.DataFrame(data)

df.to_csv("output.csv", index=False)

In [None]:
df.head()

Unnamed: 0,foldername,filename,content
0,business,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...
1,business,002.txt,Dollar gains on Greenspan speech\n\nThe dollar...
2,business,003.txt,Yukos unit buyer faces loan claim\n\nThe owner...
3,business,004.txt,High fuel prices hit BA's profits\n\nBritish A...
4,business,005.txt,Pernod takeover talk lifts Domecq\n\nShares in...


In [None]:
nltk.download('punkt')  # tokenizer models used by word_tokenize; newer NLTK releases may need 'punkt_tab' instead

In [None]:
df.drop("filename", axis=1)

Unnamed: 0,foldername,content
0,business,Ad sales boost Time Warner profit\n\nQuarterly...
1,business,Dollar gains on Greenspan speech\n\nThe dollar...
2,business,Yukos unit buyer faces loan claim\n\nThe owner...
3,business,High fuel prices hit BA's profits\n\nBritish A...
4,business,Pernod takeover talk lifts Domecq\n\nShares in...
...,...,...
2220,tech,BT program to beat dialler scams\n\nBT is intr...
2221,tech,Spam e-mails tempt net shoppers\n\nComputer us...
2222,tech,Be careful how you code\n\nA new European dire...
2223,tech,US cyber security chief resigns\n\nThe man mak...


### Populate word2idx
   - **convert documents into sequences of ints / ids / indices**

In [None]:
idx = 0
word2idx = {}
tokenized_docs = []
for doc in df['content']:
    words = word_tokenize(doc.lower())
    doc_as_int = []
    for word in words:
        if word not in word2idx:
            word2idx[word] = idx
            idx += 1

        # Save for later
        doc_as_int.append(word2idx[word])
    tokenized_docs.append(doc_as_int)
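
To make the result of this loop concrete, here is a minimal sketch on a toy two-document corpus. It assumes plain `str.split()` in place of `word_tokenize` so that it runs without any NLTK data, and the `toy_*` names and sentences are made up for illustration.

```python
# Toy corpus (hypothetical, for illustration only)
toy_docs = ["the cat sat", "the dog sat down"]

toy_word2idx = {}
toy_tokenized = []
for doc in toy_docs:
    doc_as_int = []
    # Assumption: whitespace tokenizer instead of nltk.word_tokenize
    for word in doc.lower().split():
        if word not in toy_word2idx:
            toy_word2idx[word] = len(toy_word2idx)  # next free index
        doc_as_int.append(toy_word2idx[word])
    toy_tokenized.append(doc_as_int)

print(toy_word2idx)   # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
print(toy_tokenized)  # [[0, 1, 2], [0, 3, 2, 4]]
```

Each document becomes a list of vocabulary indices, and a word seen again (here `the` and `sat`) reuses the index it was given the first time.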

 - **reverse mapping (index → word)**
 - **since the indices run from 0 to V-1, this could also be stored as a list instead of a dict**

In [None]:
idx2word = {v:k for k, v in word2idx.items()}
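
Because the indices are assigned consecutively starting at 0, the reverse mapping can also live in a list whose positions are the indices. A minimal sketch, using a small hypothetical `toy_word2idx` rather than the real vocabulary:

```python
toy_word2idx = {"the": 0, "cat": 1, "sat": 2}  # hypothetical mapping

# Position in the list is the word's index
toy_idx2word = [None] * len(toy_word2idx)
for word, i in toy_word2idx.items():
    toy_idx2word[i] = word

print(toy_idx2word[1])  # cat
```

Lookups work the same way as with the dict (`toy_idx2word[j]`), so code that indexes the reverse mapping by integer would not need to change.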

 - **number of documents**

In [None]:
N = len(df['content'])

 - **number of words**

In [None]:
V = len(word2idx)

 - **instantiate term-frequency matrix**
 - **note: scikit-learn's `CountVectorizer` could have been used instead**

In [None]:
tf = np.zeros((N,V))

 - **populate term-frequency matrix**

In [None]:
for i, doc_as_int in enumerate(tokenized_docs):
    for j in doc_as_int:
        tf[i, j] += 1  # count every occurrence of term j in document i

 - **compute document frequencies and IDF**

In [None]:
document_freq = np.sum(tf > 0, axis=0)  # document frequency (shape = (V,))
idf = np.log(N / document_freq)
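
Written as a formula, the quantity computed in this cell for each term $t$ is the unsmoothed inverse document frequency:

$$
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}
$$

Here $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain $t$ (the entries of `document_freq`). `np.log` is the natural logarithm. Every term in the vocabulary was taken from at least one document, so $\mathrm{df}(t) \ge 1$ and the division is always defined.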

 - **compute TF-IDF**

In [None]:
tf_idf = tf * idf
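
The product relies on NumPy broadcasting: `tf` has shape `(N, V)` and `idf` has shape `(V,)`, so each row of `tf` is scaled column by column. The same arithmetic can be sketched without any dependencies on a tiny count matrix; the `toy_*` values below are hypothetical.

```python
import math

# Hypothetical toy counts: 2 documents (rows) x 3 terms (columns)
toy_tf = [[2, 1, 0],
          [1, 0, 3]]
n_docs = len(toy_tf)
n_terms = len(toy_tf[0])

# Document frequency: in how many documents each term occurs
toy_df = [sum(1 for row in toy_tf if row[j] > 0) for j in range(n_terms)]
toy_idf = [math.log(n_docs / d) for d in toy_df]

# Equivalent of tf * idf: every row is scaled column-wise by idf
toy_tf_idf = [[row[j] * toy_idf[j] for j in range(n_terms)] for row in toy_tf]

print(toy_df)  # [2, 1, 1]
```

The first term occurs in both documents, so its IDF is `log(2 / 2) = 0` and it contributes nothing to either row, no matter how often it appears.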

In [None]:
np.random.seed(123)

 - **pick a random document and show its top 5 terms, ranked by TF-IDF score**

In [None]:
i = np.random.choice(N)
row = df.iloc[i]
print("Label:", row['foldername'])
print("content:", row['content'].split("\n",1)[0])
print("Top 5 terms:")

scores = tf_idf[i]
indices = (-scores).argsort()

for j in indices[:5]:
    print(idx2word[j])

Label: tech
content: IBM puts cash behind Linux push
Top 5 terms:
beefing
premise
tinker
digit
suite
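
The ranking step works by sorting the negated scores, which orders the indices from highest to lowest score. A dependency-free sketch of the same idea, with hypothetical scores and vocabulary (ties may be ordered differently than by `argsort`):

```python
toy_scores = [0.0, 1.2, 0.4, 2.5]            # hypothetical TF-IDF scores for one document
toy_vocab = ["the", "linux", "cash", "ibm"]  # hypothetical index -> word list

# Indices ordered from highest to lowest score
order = sorted(range(len(toy_scores)), key=lambda j: -toy_scores[j])
top2 = [toy_vocab[j] for j in order[:2]]

print(top2)  # ['ibm', 'linux']
```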
