<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/tf_idf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF-IDF
### term frequency–inverse document frequency

## Pre-processing


In [1]:
import urllib.request  # the request submodule must be imported explicitly
import re

First we need a set of individual files to run the tf-idf analysis on. We reuse the Shakespeare data from earlier and split it by play, starting a new file wherever a play's first act begins, so that one work corresponds to one file.

In [2]:
url = 'https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/Shakespeare_alllines.txt'
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")
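work_count = 0
outfile = None
for line in lines:
    # Each play starts at its first act; the marker line includes the quotes
    if line == "\"ACT I\"":
        if outfile is not None:
            outfile.close()  # finish the previous play before starting the next
        work_count += 1
        outfile = open("work_"+str(work_count)+".txt","w")
    if outfile is not None:  # skip any lines before the first play
        outfile.write(line+"\n")
if outfile is not None:
    outfile.close()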
print("Divided into {} files".format(work_count))

Divided into 36 files


## Writing functions for tf-idf calculation

In [3]:
# Document frequency: the number of documents that contain the term

def number_of_docs_with_term(word):
    found = 0
    for i in range(1,work_count+1):
        infile = open("work_"+str(i)+".txt", "r")
        for line in infile:
            line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]','',line).strip()
            tokens = re.findall(r'\b\w+\b', line)
            if word in tokens:
                found += 1
                break
        infile.close()
    return found


def doc_appearance(word):
    found = number_of_docs_with_term(word)
    print("Word \"{}\" was found in {} out of {} files.".format(word, found, work_count))

doc_appearance("Romeo")
doc_appearance("battle")
doc_appearance("said")

Word "Romeo" was found in 1 out of 36 files.
Word "battle" was found in 21 out of 36 files.
Word "said" was found in 36 out of 36 files.
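
The function above re-reads every file for each query. As a minimal sketch, assuming the documents fit in memory, the document frequency of every term can instead be collected in one pass. The `document_frequencies` helper and the `toy_docs` list are hypothetical names used only for illustration, and the tokenization is simplified compared with the cell above.

```python
import re
from collections import Counter

def document_frequencies(docs):
    """Map each token to the number of documents it occurs in."""
    df = Counter()
    for text in docs:
        # A set counts each token at most once per document
        df.update(set(re.findall(r'\b\w+\b', text)))
    return df

# Hypothetical toy corpus standing in for the work_*.txt files
toy_docs = [
    "Romeo loves Juliet",
    "the battle was lost",
    "Juliet said the battle was over",
]
df = document_frequencies(toy_docs)
print(df["Juliet"], df["battle"], df["Romeo"], df["king"])  # prints: 2 2 1 0
```

A `Counter` returns 0 for terms it has never seen, so a missing word needs no special handling.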


In [4]:
# Term frequency: how many times the word occurs in one document
def word_frequency(word, doc_id):
    freq = 0
    with open("work_"+str(doc_id)+".txt", "r") as infile:  # closes the file when done
        for line in infile:
            line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]','',line).strip()
            tokens = re.findall(r'\b\w+\b', line)
            freq += tokens.count(word)
    return freq


def frequency(word, doc_id):
    w_freq = word_frequency(word, doc_id)
    filename = "work_"+str(doc_id)+".txt"
    print("Found word \"{}\" in file \"{}\" a total of {} times.".format(word, filename, w_freq))

frequency("king", 28)
frequency("Romeo", 28)
frequency("said", 28)

Found word "king" in file "work_28.txt" a total of 1 times.
Found word "Romeo" in file "work_28.txt" a total of 128 times.
Found word "said" in file "work_28.txt" a total of 14 times.
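
Note that the comparison `token == word` is case-sensitive, so "king" and "King" are counted as different words. The sketch below illustrates the effect on a made-up sentence; `term_counts` and `sample` are hypothetical names, not part of the notebook.

```python
import re
from collections import Counter

def term_counts(text, lowercase=False):
    """Count tokens in a text, optionally folding case first."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r'\b\w+\b', text))

sample = "The King is dead. Long live the king!"
exact = term_counts(sample)                    # case-sensitive, as in word_frequency
folded = term_counts(sample, lowercase=True)   # case-insensitive
print(exact["king"], folded["king"])  # prints: 1 2
```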


In [5]:
# Calculate tf-idf as tf * (ln((1 + N) / (1 + df)) + 1), with N = work_count
import numpy as np
def tf_idf(word, doc_id):
    inverse_doc_freq = np.log((1+work_count)/(number_of_docs_with_term(word)+1))+1
    score = word_frequency(word, doc_id) * inverse_doc_freq
    return score


def print_score(word, doc_id):
    filename = "work_"+str(doc_id)+".txt"
    score = tf_idf(word, doc_id)
    print("The tf-idf score for word \"{}\" in file \"{}\" is: {}".format(word, filename, score))

print_score("battle", 28)
print_score("Romeo", 28)
print_score("Juliet", 28)

The tf-idf score for word "battle" in file "work_28.txt" is: 0.0
The tf-idf score for word "Romeo" in file "work_28.txt" is: 501.4746537067877
The tf-idf score for word "Juliet" in file "work_28.txt" is: 165.0783643268774
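
Written out, the score computed by `tf_idf` has the following form. The symbols are chosen here for illustration: $N$ is the number of documents (`work_count`), $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \left( \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1 \right)

% Worked example for "Romeo" in work_28, using the counts printed above:
% tf = 128, df = 1, N = 36
\mathrm{tfidf}(\text{Romeo}, 28) = 128 \left( \ln\frac{37}{2} + 1 \right) \approx 128 \times 3.9178 \approx 501.47
```

The score for "battle" is 0.0 because the word does not occur in work_28, so its term frequency is 0 regardless of the idf factor.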


## Alternative: scikit-learn library for tf-idf calculation
By default, `TfidfVectorizer` also applies idf smoothing (`smooth_idf=True`) and L2 normalization (`norm='l2'`), so its tf-idf scores differ from the ones computed above.

The hand-written formula already uses the same smoothed idf, so the differences come mainly from the normalization, from lowercasing (which is why the queries below use "romeo" rather than "Romeo"), and from the `stop_words='english'` filter.
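
To see what L2 normalization does to raw scores, here is a minimal pure-Python sketch on a made-up corpus. It uses the same smoothed idf as the hand-written `tf_idf` function and then divides each document vector by its Euclidean length. It is only an illustration of the normalization step, not a reimplementation of `TfidfVectorizer`; the names `docs`, `smoothed_idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and split on whitespace
docs = [
    "romeo loves juliet",
    "juliet said romeo",
    "the battle said romeo",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Same smoothed formula as in the hand-written tf_idf function
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens, normalize=True):
    counts = Counter(tokens)
    raw = {term: tf * smoothed_idf(term) for term, tf in counts.items()}
    if not normalize:
        return raw
    # L2 normalization: divide by the Euclidean length of the vector
    length = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / length for term, v in raw.items()}

raw = tfidf_vector(tokenized[0], normalize=False)
unit = tfidf_vector(tokenized[0])
print(raw)
print(unit)
```

After normalization every score lies between 0 and 1 and the squared scores of a document sum to 1, which is why the library's values are much smaller than the raw products computed earlier. The ranking of terms within a document is unchanged.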

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


text_files = []
for i in range(1,work_count+1):
    text_files.append("work_"+str(i)+".txt")

text_titles = [text.split(".")[0] for text in text_files]


# Initialize and run TfidfVectorizer once; refitting for every query is unnecessary
tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

# Create a long-format DataFrame (document, term, tfidf) out of the resulting tf–idf matrix
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})


# Get tf-idf score data for word (terms are lowercased by the vectorizer)
def vectorizer_score(word, doc_id):
    in_doc = tfidf_df[tfidf_df['document'] == 'work_'+str(doc_id)]
    print(in_doc[in_doc['term'] == word], "\n")


vectorizer_score("juliet", 28)
vectorizer_score("romeo", 28)
vectorizer_score("said", 28)


       document    term     tfidf
612231  work_28  juliet  0.225776 

       document   term     tfidf
618023  work_28  romeo  0.583629 

       document  term     tfidf
618237  work_28  said  0.015305 

