In information retrieval, tf–idf, TF-IDF, or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.

In [1]:
import pandas as pd
import pickle
import string
import re
import os

from statistics import mean
import numpy as np


In [2]:
df = pd.read_csv('movie_reviews.tsv', sep='\t')

In [7]:
class Ranking:
    """
    Class that handles ranking algorithms.

    Attributes
    -----------
    directory: where will the dictionaries be loaded/saved.
    document_frequencies_path: path to save document_frequencies dict.
    term_frequencies_path: path to save term_frequencies dict.
    document_length_path: path to save document_length dict.
    document_frequencies: dict <word> -> <number of documents appearing>
    term_frequencies: dict <document_id> -> <word> -> <number of occurrences>
    document_length: dict <document_id> -> <number of words>
    num_documents: number of documents stored.
    ids: ids of the documents stored.
    avg_length: average length of the documents stored.

    Methods
    --------
    _get_dictionaries: Loads the dictionaries needed to calculate the scoring functions.
    _get_dictionaries_from_file: Given a file_name returns the dictionary if the file exists or returns an empty dict.
    add_document: It adds a document to the dictionaries needed to compute the documents scores.
    delete_document: It deletes the given document from the dictionaries needed to compute the documents scores.
    save: Saves the dictionaries to a given path.
    get_tfidf_score: Given a query, computes the TFIDF score for every document in the corpus.
    get_bm25_scores: Given a query, computes the BM25+ score for every document in the corpus.
    most_similar_threshold: Given a query it returns the k most relevant documents or the ones
                        that are over a given threshold.
    """

    def __init__(self, directory='Data/ranking_dict/'):
        """
        Init Ranking class.
        :param directory: where will the dictionaries be loaded/saved.
        """

        if not os.path.exists(directory):
            os.makedirs(directory)

        self.directory = directory
        self.document_frequencies_path = os.path.join(directory, 'document_frequencies.p')
        self.term_frequencies_path = os.path.join(directory, 'term_frequencies.p')
        self.document_length_path = os.path.join(directory, 'document_length.p')

        self._get_dictionaries()
        self.num_documents = len(self.document_frequencies)
        self.ids = list(self.term_frequencies.keys())

        if len(self.document_length) > 0:
            self.avg_length = mean(self.document_length.values())
        else:
            self.avg_length = None

    def _get_dictionaries(self):
        """
        Loads the dictionaries needed to calculate the scoring functions.
        """

        self.document_frequencies = self._get_dictionaries_from_file(self.document_frequencies_path)
        self.term_frequencies = self._get_dictionaries_from_file(self.term_frequencies_path)
        self.document_length = self._get_dictionaries_from_file(self.document_length_path)
    
    @staticmethod
    def _get_dictionaries_from_file(file_name):
        """
        Given a file_name returns the dictionary if the file exists or returns an empty dict.
        :param file_name: path where the file is to be found. <str>
        :return dictionary stored in the pickle file if found, emtpy dict if not found. <dict>
        """

        if os.path.isfile(file_name):
            with open(file_name, 'rb') as fp:
                return pickle.load(fp)
        else:
            return {}

    def add_document(self, document_id, document):
        """
        It adds a document to the dictionaries needed to compute the documents scores.
        It gets rid off punctuation.
        :param document_id: unique id for the document. <str>
        :param document: document to be added. It can be either a string or a json file from amazon api.
        """
        if document_id in self.ids:
            raise Exception("You provided an ID it's already stored.")

        words = split_words(document)

        actual_frequencies = {}
        words_set = set()
        length = 0
        for word in words:
            word = word.lower()
            if word in string.punctuation:
                continue
            length += 1
            if word in actual_frequencies:
                actual_frequencies[word] += 1
            else:
                actual_frequencies[word] = 1
            words_set.add(word)
        for word in words_set:
            if word in self.document_frequencies:
                self.document_frequencies[word] += 1
            else:
                self.document_frequencies[word] = 1
        self.document_length[document_id] = length
        self.term_frequencies[document_id] = actual_frequencies

        # RECALCULATE VALUES
        self.avg_length = mean(self.document_length.values())
        self.num_documents = len(self.document_frequencies)
        self.ids = list(self.term_frequencies.keys())

        self.save()

    def delete_document(self, document_id):
        """
        It deletes the given document from the dictionaries needed to compute the documents scores.
        :param document_id: unique id for the document. <str>

        :raises Exception if the given document_id is not stored.
        """

        if document_id not in self.ids:
            raise Exception("You provided an ID it's not stored.")

        word_set = self.term_frequencies[document_id].keys()
        for word in word_set:
            self.document_frequencies[word] -= 1
            if self.document_frequencies[word] == 0:
                del self.document_frequencies[word]

        del self.document_length[document_id]
        del self.term_frequencies[document_id]

        # RECALCULATE VALUES
        self.ids = list(self.term_frequencies.keys())
        if len(self.ids) != 0:
            self.avg_length = mean(self.document_length.values())
            self.num_documents = len(self.document_frequencies)
        else:
            self.avg_length = 0
            self.num_documents = 0

        self.save()

    def save(self, path=None):
        """
        Saves the dictionaries to a given path. If path is None, it will save into the directory
        specified in the init method.
        :param path: directory where the dictionaries are going to be saved. <str> (default: None)
        """

        if path is None:
            with open(self.document_frequencies_path, 'wb') as fp:
                pickle.dump(self.document_frequencies, fp, protocol=pickle.HIGHEST_PROTOCOL)

            with open(self.term_frequencies_path, 'wb') as fp:
                pickle.dump(self.term_frequencies, fp, protocol=pickle.HIGHEST_PROTOCOL)

            with open(self.document_length_path, 'wb') as fp:
                pickle.dump(self.document_length, fp, protocol=pickle.HIGHEST_PROTOCOL)

        else:
            if not os.path.exists(path):
                os.makedirs(path)

            document_frequencies_path = os.path.join(path, 'document_frequencies.p')
            term_frequencies_path = os.path.join(path, 'term_frequencies.p')
            document_length_path = os.path.join(path, 'document_length.p')

            with open(document_frequencies_path, 'wb') as fp:
                pickle.dump(self.document_frequencies, fp, protocol=pickle.HIGHEST_PROTOCOL)

            with open(term_frequencies_path, 'wb') as fp:
                pickle.dump(self.term_frequencies, fp, protocol=pickle.HIGHEST_PROTOCOL)

            with open(document_length_path, 'wb') as fp:
                pickle.dump(self.document_length, fp, protocol=pickle.HIGHEST_PROTOCOL)

    def get_tfidf_scores(self, query, filtered_ids=None):
        """
        Given a query, computes the TFIDF score for every document in the corpus.
        :param query: query to find relevant documents from. <str>
        :param filtered_ids: if specified, it will only compute the score of the provided ids.
                        If None, it will compute the score for all the documents. <list of str (ids)>
                        (default: None)
        :return: dictionary with document_id as key and the score as value. <dict>. <document_id> -> <score>.
        """
        length = {}
        scores = {}

        if filtered_ids is None:
            ids = self.ids
        else:
            ids = filtered_ids

        for ind in ids:
            length[ind] = 0
            scores[ind] = 0

        for term in re.findall(r"[\w']+|[.,!?;]", query.strip()):
            term = term.lower()
            if (term not in self.document_frequencies) or (len(term) <= 2):
                continue
            df = self.document_frequencies[term]
            wq = np.log(self.num_documents / df)
            for ind in ids:
                if ind not in self.ids:
                    continue
                document_dict = self.term_frequencies[ind]
                if term not in document_dict:
                    scores[ind] += 0
                    continue
                tf = document_dict[term]
                length[ind] += tf ** 2
                wd = 1 + np.log(tf)
                scores[ind] += wq * wd
        for ind in ids:
            if length[ind] == 0:
                continue
            scores[ind] /= np.sqrt(length[ind])

        return scores
    def most_similar(self, query, threshold=None, k=100, func='tfidf', filtered_ids=None):
        """
        Given a query it returns the k most relevant documents or the ones that are over a given threshold.
        :param query: query to find relevant documents from. <str>
        :param threshold: min scoring value of the sorted document_ids. <int> (default: None)
        :param k: number of most relevant documents to return. <int> (default: 100)
        :param func: whether you want to use bm25+ or tfidf scoring. <str> (default: bm25)
        :param filtered_ids: if specified, it will only compute the score of the provided ids.
                        If None, it will compute the score for all the documents. <list of str (ids)>
                        (default: None)
        :return: list of strings
        """
        if func == 'tfidf':
            scores = self.get_tfidf_scores(query, filtered_ids)
        most_similar = sorted(scores.items(), key=lambda x: x[1], reverse=True)

        if threshold is None:
            ids = [doc_id for doc_id, _ in most_similar[:k]]

        else:
            ids = [doc_id for doc_id, score in most_similar if score >= threshold]
            if k is not None:
                ids = ids[:k]

        return ids

    def get_most_similar(self, query, k=1):

        tfidf_scores = self.get_tfidf_scores(query)

        candidates_scores = sorted(tfidf_scores.items(), key=lambda x: x[1], reverse=True)
        # best_index, long_score = candidate_scores[0]

        # return best_index, long_score
        return candidates_scores[:k]

    def get_candidates(self, df, column):

        self.reset()
        for index, text in zip(df.index, df[column]):
            if index not in self.ids:
                self.add_document(index, text)

    def reset(self):

        for document_id in self.ids:
            self.delete_document(document_id)

        self.save()


In [12]:
os.path.exists("Data/ranking_dict")

True

In [4]:
import pandas as pd
import json
import os
from tqdm import tqdm
import sys
from langdetect import detect
import re
import pickle
import string


def save_dictionaries(df):
    """
    Calculates the dictionaries needed to compute TFIDF score.
        term_frequencies: list of dicts of term frequencies within a document.
        document_frequencies: number of documents containing a given term.
        document_length
    The dictionaries will be stored in the directory Data/ranking_dict/
    """
    directory = 'Data/ranking_dict/'
    if not os.path.exists(directory):
        os.makedirs(directory)
    # CREATE TEXT DICTIONARIES

    term_frequencies = {}  # dict of dicts id -> word -> frequency within a document
    document_frequencies = {}  # dict word -> number of documents containing the term
    document_length = {}  # dict id -> document length

    for id in list(df.id):
        term_frequencies[id] = {}
        document_length[id] = 0

    print('Processing the corpus...\n')
    for id, document in tqdm(zip(list(df.id), list(df.review))):
        actual_frequencies = {}
        words_set = set()
        length = 0
        if type(document) != type('a'):
            continue
        for word in re.findall(r"[\w']+|[.,!?;]", document.strip()):
            word = word.lower()
            if word in string.punctuation:
                continue
            length += 1
            if word in actual_frequencies:
                actual_frequencies[word] += 1
            else:
                actual_frequencies[word] = 1
            words_set.add(word)
        for word in words_set:
            if word in document_frequencies:
                document_frequencies[word] += 1
            else:
                document_frequencies[word] = 1
        document_length[id] = length
        term_frequencies[id] = actual_frequencies

    # Save dictionaries into files
    with open('Data/ranking_dict/document_frequencies.p', 'wb') as fp:
        pickle.dump(document_frequencies, fp, protocol=pickle.HIGHEST_PROTOCOL)

    with open('Data/ranking_dict/term_frequencies.p', 'wb') as fp:
        pickle.dump(term_frequencies, fp, protocol=pickle.HIGHEST_PROTOCOL)

    with open('Data/ranking_dict/document_length.p', 'wb') as fp:
        pickle.dump(document_length, fp, protocol=pickle.HIGHEST_PROTOCOL)

    print(len(term_frequencies))


In [8]:
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [9]:
df.shape

(25000, 3)

In [15]:
save_dictionaries(df)

Processing the corpus...



25000it [00:05, 4219.05it/s]


25000


In [16]:
model = Ranking("Data/ranking_dict/")

In [17]:
model.

'Data/ranking_dict/document_frequencies.p'

In [50]:
model.ids[:10]

['5814_8',
 '2381_9',
 '7759_3',
 '3630_4',
 '9495_8',
 '8196_8',
 '7166_2',
 '10633_1',
 '319_1',
 '8713_10']

In [56]:
model.get_most_similar("film about car crash")

[('2190_2', 7.641138076110774)]

In [57]:
df[df.id=='2190_2'].review.values

array(['**Possible Spoilers Ahead**<br /><br />     Jason (a.k.a. Herb) Evers is a brilliant brain surgeon who, along with wife Virginia Leith, is involved in the most lackluster onscreen car crash ever. Leith is decapitated and the doctor takes her severed noggin back to his mansion and rejuvenates the head in his lab. The mansion\'s exterior was allegedly filmed at Tarrytown\'s Lyndhurst estate; the lab scenes were apparently shot in somebody\'s basement. The bandaged head is kept alive on \\lab equipment\\" that\'s almost cheap-looking enough for Ed Wood. Some of the library music\x96the movie\'s high point\x96later turned up in Andy Milligan\'s THE BODY BENEATH. Leith\'s head has some heavy metaphysical discourses with another of Ever\'s misfires, a mutant chained in the closet. Meanwhile, the good doc prowls strip joints looking for a body worthy of his wife\'s gabby noodle. The ending, in uncut prints, features some ahead-of-its-time splatter and dismemberment when the zucchini-h

In [59]:
ranking = model.most_similar("emotional, romantic movie that involves brain work", k=3)

In [60]:
for id in ranking:
    print(df[df.id=='2190_2'].review.values)

['**Possible Spoilers Ahead**<br /><br />     Jason (a.k.a. Herb) Evers is a brilliant brain surgeon who, along with wife Virginia Leith, is involved in the most lackluster onscreen car crash ever. Leith is decapitated and the doctor takes her severed noggin back to his mansion and rejuvenates the head in his lab. The mansion\'s exterior was allegedly filmed at Tarrytown\'s Lyndhurst estate; the lab scenes were apparently shot in somebody\'s basement. The bandaged head is kept alive on \\lab equipment\\" that\'s almost cheap-looking enough for Ed Wood. Some of the library music\x96the movie\'s high point\x96later turned up in Andy Milligan\'s THE BODY BENEATH. Leith\'s head has some heavy metaphysical discourses with another of Ever\'s misfires, a mutant chained in the closet. Meanwhile, the good doc prowls strip joints looking for a body worthy of his wife\'s gabby noodle. The ending, in uncut prints, features some ahead-of-its-time splatter and dismemberment when the zucchini-headed 