# Introduction to Machine Learning with Python - Exercise 02

In this exercise, we will use our knowledge of Python classes and dictionaries to extract keywords from movie titles.

## Keywords extraction

Consider the following document:

    I love pizza. I can't imagine living without pizza more than one day. That's how I love it!
    
Are there certain keywords that pop out?

### TF-IDF Method

The TF-IDF (Term Frequency - Inverse Document Frequency) method is a technique used for extracting keywords by comparing the frequency of a word in a document to its frequency in the entire corpus. This approach is based on the intuition that words which appear frequently in a specific document but not commonly throughout the corpus are significant for that document.

(*) _Corpus_ refers to the collection of all documents.

#### Mathematical Definition

The TF-IDF score for a word $w$ in a document $d$ is defined as:
$$ t_{w,d} = tf_{w,d} \times idf_w $$

where:
- $tf_{w,d}$ represents the term frequency of $w$ in $d$.
- $idf_w$ denotes the inverse document frequency of $w$.

#### Inverse Document Frequency (IDF)

IDF is calculated using a logarithmic transformation:
$$ idf_w = \log \left( \frac{N}{df_w} \right) $$

where:
- $N$ is the total number of documents in the corpus.
- $df_w$ is the number of documents containing the word $w$.

The logarithm compresses the ratio $N / df_w$, which can span several orders of magnitude in a large corpus. Without it, very rare terms would receive weights so large that the IDF factor would swamp the term frequency. At the other extreme, a term that appears in every document gets $idf_w = \log(1) = 0$ and contributes nothing to the score.
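To make the effect of the logarithm concrete, here is a small sketch that evaluates the IDF formula for two document frequencies. The counts (4803 titles, 1305 of them containing "the") are the ones reported for the title corpus later in this exercise; the helper name `idf` is only illustrative.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency with the natural logarithm (illustrative helper)."""
    return math.log(n_docs / doc_freq)


n_docs = 4803  # total number of titles in the corpus

# A very common word: the raw ratio is about 3.68, the IDF about 1.30
print(f"'the':     N/df = {n_docs / 1305:8.2f}  idf = {idf(n_docs, 1305):.2f}")

# A word found in a single title: the raw ratio is 4803, the IDF only about 8.48
print(f"rare word: N/df = {n_docs / 1:8.2f}  idf = {idf(n_docs, 1):.2f}")
```

The raw ratios differ by a factor of more than a thousand, while the two IDF values differ by a factor of about 6.5.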

#### Your turn: split a document into words

In [42]:
def split_document(document):
    """
    Splits a document into words

    Removes punctuation and numbers, and converts to lowercase

    Parameters
    ----------
    document : str
        The document to split

    Returns
    -------
    list
        A list of words
    """
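    # Characters that are treated as word separators
    punctuation = list(".,;:!?-$&@()[]'\"")
    numbers = [str(n) for n in range(10)]
    # Replace every punctuation mark and digit with a space
    for p in punctuation:
        document = document.replace(p, " ")
    for n in numbers:
        document = document.replace(n, " ")
    # Normalize case, then split on whitespace
    document = document.lower()
    return document.split()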


doc = "I love pizza. I can't imagine living without pizza more than one day. That's how I love it!"
result = set(split_document(doc))
expected = {
    "i",
    "love",
    "pizza",
    "can",
    "t",
    "imagine",
    "living",
    "without",
    "more",
    "than",
    "one",
    "day",
    "that",
    "s",
    "how",
    "it",
}
assert result == expected, f"Expected {expected}, got {result}"
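As an aside, the same tokenization can be sketched with a single regular expression instead of repeated `replace` calls. The function name `split_document_re` is hypothetical; the character class is meant to list the same punctuation marks and digits that `split_document` removes.

```python
import re

# The same separators as in split_document: the listed punctuation plus digits 0-9
_SEPARATORS = re.compile(r"[.,;:!?\-$&@()\[\]'\"0-9]")


def split_document_re(document):
    """Regex-based variant of split_document (illustrative sketch)."""
    # Replace each separator with a space, lowercase, then split on whitespace
    return _SEPARATORS.sub(" ", document).lower().split()


doc = "I love pizza. I can't imagine living without pizza more than one day. That's how I love it!"
tokens = split_document_re(doc)
print(tokens[:6])
```

Because the apostrophe is a separator, "can't" becomes the two tokens "can" and "t", exactly as in the loop-based version.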

One way to keep count of tokens is to use a dictionary. The keys of the dictionary are the tokens, and the values are the counts. We will use `.setdefault()` to initialize the count to 0 if the token is not yet in the dictionary.

In [43]:
def keep_count(items, result=None):
    """
    Keeps count of items in a dictionary
    Args:
        items: A list of items
        result: If not None, the dictionary to update

    Returns:
        dict: A dictionary with the counts of the items
    """
    if result is None:
        result = {}  # initialize the dictionary
    for item in items:
        result.setdefault(
            item, 0
        )  # initialize the count to 0 if the item is not yet in the dictionary
        result[item] += 1  # increment the count
    return result


result = keep_count(["a", "b", "a", "c", "b", "a", "d"])
print(f"After counting: {result}")
expected = {"a": 3, "b": 2, "c": 1, "d": 1}
assert result == expected

# update an existing dictionary
result = keep_count(["a", "b", "a", "c", "b", "a", "d"], result=result)
print(f"After updating: {result}")

After counting: {'a': 3, 'b': 2, 'c': 1, 'd': 1}
After updating: {'a': 6, 'b': 4, 'c': 2, 'd': 2}
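The dictionary approach above is one way to count; the standard library's `collections.Counter` offers another. The sketch below reproduces the two results shown above and is not part of the exercise solution.

```python
from collections import Counter

items = ["a", "b", "a", "c", "b", "a", "d"]

# Counting from scratch, like keep_count(items)
first = Counter(items)
print(f"After counting: {dict(first)}")

# Updating existing counts, like keep_count(items, result=result)
updated = first.copy()
updated.update(items)
print(f"After updating: {dict(updated)}")
```

`Counter` is a `dict` subclass, so it can be used wherever the plain dictionary of counts is expected.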


#### Your turn: compute term frequency

Once we have the counts, we can compute the term frequency (TF) for each token. The term frequency is the number of times a token appears in a document divided by the total number of tokens in the document.

In [60]:
def compute_tf(document):
    """
    Computes the term frequency for each token in a document
    Args:
        document: A single document as a string
    Returns:
        dict: A dictionary with the term frequency of each token in the document
    """
    terms = split_document(document)
    term_count = keep_count(terms)
    # Term frequency = occurrences of the term / total number of tokens
    tf = {term: count / len(terms) for term, count in term_count.items()}
    return tf


def compute_doc_freq(corpus):
    """
    Computes the document frequency for each token in the corpus
    Args:
        corpus: A list of documents
    Returns:
        dict: A dictionary with the number of documents in which each token appears
    """
    doc_freq = {}
    for document in corpus:
        # Use a set so that each term is counted at most once per document
        terms = set(split_document(document))
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return doc_freq


corpus = [
    "I love pizza. I can't imagine living without pizza more than one day. That's how I love it!",
    "My bus is late. I hate waiting for the bus. I wish I had a car.",
    "I love my car. Driving is the best thing I know. I love it!",
]
result = compute_doc_freq(corpus)
print(f"Document frequencies: {result}")

Document frequencies: {'day': 1, 'pizza': 1, 'how': 1, 'it': 2, 'one': 1, 'than': 1, 'without': 1, 'more': 1, 'that': 1, 's': 1, 'imagine': 1, 'can': 1, 'i': 3, 'love': 2, 't': 1, 'living': 1, 'my': 2, 'late': 1, 'a': 1, 'had': 1, 'waiting': 1, 'the': 2, 'bus': 1, 'is': 2, 'for': 1, 'hate': 1, 'car': 2, 'wish': 1, 'thing': 1, 'best': 1, 'know': 1, 'driving': 1}
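With term frequencies and document frequencies in hand, the TF-IDF formula from the Mathematical Definition section can be applied to this three-document corpus. The following is a self-contained sketch, not the exercise solution. It assumes a simplified tokenizer (any run of ASCII letters), which yields the same tokens as `split_document` for these three documents.

```python
import math
import re
from collections import Counter

corpus = [
    "I love pizza. I can't imagine living without pizza more than one day. That's how I love it!",
    "My bus is late. I hate waiting for the bus. I wish I had a car.",
    "I love my car. Driving is the best thing I know. I love it!",
]


def tokenize(text):
    # Simplifying assumption: a token is any run of ASCII letters
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# Document frequency: count each token once per document
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# TF-IDF scores for the first document
tokens = tokenized[0]
counts = Counter(tokens)
scores = {
    term: (count / len(tokens)) * math.log(len(corpus) / doc_freq[term])
    for term, count in counts.items()
}

best = max(scores, key=scores.get)
print(f"Top keyword: {best}")
print(f"Score of 'i': {scores['i']}")
```

In this sketch "pizza" comes out on top: it occurs twice in the first document and in no other. The word "i" is the most frequent token but appears in all three documents, so its IDF, and therefore its score, is zero.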


Let's read the data from our tmdb_5000_movies.csv file and compute the term frequency for each movie title.

In [63]:
import csv

# Read the movie titles; the context manager closes the file when done
with open("tmdb_5000_movies.csv", "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    titles = [row["original_title"] for row in reader]
print(f"Read {len(titles)} titles")

Read 4803 titles


#### Let's compute IDF

In [64]:
import math


def compute_idf(doc_count, doc_freq):
    """
    Computes the inverse document frequency for each token in the corpus
    Args:
        doc_count: The number of documents
        doc_freq: A dictionary with the number of documents in which each token appears
    Returns:
        dict: A dictionary with the inverse document frequency of each token
    """
    idf = {}
    for term, count in doc_freq.items():
        idf[term] = math.log(doc_count / float(count))
    return idf

In [65]:
# now, let's extract the keywords from a title
title = titles[100]
print(f"Title: {title}")
title_tf = compute_tf(title)
print(f"TF: {title_tf}")

Title: The Curious Case of Benjamin Button
TF: {'the': 0.16666666666666666, 'curious': 0.16666666666666666, 'case': 0.16666666666666666, 'of': 0.16666666666666666, 'benjamin': 0.16666666666666666, 'button': 0.16666666666666666}


In [66]:
# First, compute the document frequencies for the entire corpus
doc_freq = compute_doc_freq(titles)
doc_count = len(titles)

# top 10 most frequent terms
sorted_terms = sorted(doc_freq.items(), key=lambda x: x[1], reverse=True)
print(f"Most frequent terms: {sorted_terms[:10]}")

Most frequent terms: [('the', 1305), ('of', 422), ('s', 176), ('a', 174), ('and', 125), ('in', 111), ('to', 101), ('man', 75), ('i', 55), ('love', 54)]


In [69]:
# Compute IDF using the document frequencies
idf = compute_idf(doc_count, doc_freq)
# top 10 terms with the highest IDF
sorted_terms = sorted(idf.items(), key=lambda x: x[1], reverse=True)
print(f"Terms with highest IDF: {sorted_terms[:10]}")

Terms with highest IDF: [('avatar', 8.476996001664824), ('spectre', 8.476996001664824), ('rises', 8.476996001664824), ('tangled', 8.476996001664824), ('ultron', 8.476996001664824), ('solace', 8.476996001664824), ('quantum', 8.476996001664824), ('chest', 8.476996001664824), ('caspian', 8.476996001664824), ('armies', 8.476996001664824)]


In [76]:
# Compute TF-IDF for the specific title
tf_idf = {term: title_tf.get(term, 0) * idf.get(term, 0) for term in title_tf}

# Sort and print the terms
sorted_terms = sorted(tf_idf.items(), key=lambda x: x[1], reverse=True)
sorted_terms = [t[0] for t in sorted_terms]
print(f"{'Term':20} {'Frequency':>10} {'IDF':>10} {'TF-IDF':>10}")
for term in sorted_terms[:10]:
    print(
        f"{term:20} {title_tf.get(term, 0):10.2f} {idf.get(term, 0):10.2f} {tf_idf.get(term, 0):10.2f}"
    )

Term                  Frequency        IDF     TF-IDF
button                     0.17       8.48       1.41
curious                    0.17       7.78       1.30
case                       0.17       7.78       1.30
benjamin                   0.17       7.78       1.30
of                         0.17       2.43       0.41
the                        0.17       1.30       0.22


In [80]:
class TFIDF:
    """
    A class to compute TF-IDF scores for a given corpus.

    Methods:
        fit(corpus): Learns the IDF for each term in the corpus.
        transform(document): Computes the TF-IDF score for each term in a single document.
    """

    def __init__(self):
        self.idf = {}

    def _split(self, document: str) -> list:
        """
        Tokenizes the input document into a list of words.

        Replaces punctuation and digits with spaces and converts the text to lowercase.
        """
        document = str(document)
        punctuation = list(".,;:!?-$&@()[]'\"")
        numbers = [str(n) for n in range(10)]
        for p in punctuation:
            document = document.replace(p, " ")
        for n in numbers:
            document = document.replace(n, " ")
        document = document.lower()
        return document.split()

    def _count_terms(self, words: list) -> dict:
        """
        Counts the occurrences of each term in the given list of words.
        """
        term_count = {}
        for word in words:
            term_count[word] = term_count.get(word, 0) + 1
        return term_count

    def _compute_tf(self, term_count: dict, doc_len: int) -> dict:
        """
        Computes the Term Frequency (TF) for each term: its count divided by
        the total number of terms in the document.
        """
        tf = {}
        for word, count in term_count.items():
            tf[word] = count / float(doc_len)
        return tf

    def _compute_idf(self, doc_count: int, doc_freq: dict) -> None:
        """
        Computes the Inverse Document Frequency (IDF) for each term and stores it in self.idf.

        In short: idf = math.log(number of documents in the corpus /
        number of documents in which the term appears).
        """
        for word, count in doc_freq.items():
            self.idf[word] = math.log(doc_count / float(count))

    def fit(self, corpus) -> None:
        """
        Fits the TF-IDF model to the given corpus by learning the IDF for each term.
        """
        doc_count = len(corpus)
        doc_freq = {}

        for document in corpus:
            # Use a set so that each term is counted at most once per document
            words = set(self._split(document))
            for word in words:
                doc_freq[word] = doc_freq.get(word, 0) + 1

        self._compute_idf(doc_count, doc_freq)

    def transform(self, document: str) -> dict:
        """
        Transforms a single document into TF-IDF scores for each term.

        Terms that were not seen during fit get an IDF, and thus a score, of 0.
        """
        tf_idf = {}
        words = self._split(document)
        term_count = self._count_terms(words)
        tf = self._compute_tf(term_count, len(words))

        for word, tf_value in tf.items():
            idf = self.idf.get(word, 0.0)
            tf_idf[word] = tf_value * idf

        return tf_idf

In [81]:
tfidf = TFIDF()
tfidf.fit(titles)
tfidf.transform(titles[100])

{'the': 0.21717294698467174,
 'curious': 1.297308136850813,
 'case': 1.297308136850813,
 'of': 0.40533178127146874,
 'benjamin': 1.297308136850813,
 'button': 1.4128326669441373}
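
The dictionary returned by `transform` still has to be turned into keywords. One possible way, sketched here with a hypothetical helper `top_keywords`, is to sort the terms by score and keep the first few. The scores are copied from the output above.

```python
def top_keywords(scores, k=3):
    """Return the k terms with the highest TF-IDF score (illustrative helper)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Scores reported above for "The Curious Case of Benjamin Button"
scores = {
    "the": 0.21717294698467174,
    "curious": 1.297308136850813,
    "case": 1.297308136850813,
    "of": 0.40533178127146874,
    "benjamin": 1.297308136850813,
    "button": 1.4128326669441373,
}

print(top_keywords(scores, k=1))
print(top_keywords(scores, k=4))
```

With `k=4` the helper keeps the four content words and drops "of" and "the", which carry the lowest scores.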

In [86]:
import pandas as pd

df = pd.read_csv("tmdb_5000_movies.csv").sort_values("popularity", ascending=False)
df

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
546,74000000,"[{""id"": 10751, ""name"": ""Family""}, {""id"": 16, ""...",http://www.minionsmovie.com/,211672,"[{""id"": 3487, ""name"": ""assistant""}, {""id"": 179...",en,Minions,"Minions Stuart, Kevin and Bob are recruited by...",875.581305,"[{""name"": ""Universal Pictures"", ""id"": 33}, {""n...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2015-06-17,1156730962,91.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Before Gru, they had a history of bad bosses",minions,6.4,4571
95,165000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 18, ""...",http://www.interstellarmovie.net/,157336,"[{""id"": 83, ""name"": ""saving the world""}, {""id""...",en,Interstellar,Interstellar chronicles the adventures of a gr...,724.247784,"[{""name"": ""Paramount Pictures"", ""id"": 4}, {""na...","[{""iso_3166_1"": ""CA"", ""name"": ""Canada""}, {""iso...",2014-11-05,675120017,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Mankind was born on Earth. It was never meant ...,interstellar,8.1,10867
788,58000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.foxmovies.com/movies/deadpool,293660,"[{""id"": 2095, ""name"": ""anti hero""}, {""id"": 307...",en,Deadpool,Deadpool tells the origin story of former Spec...,514.569956,"[{""name"": ""Twentieth Century Fox Film Corporat...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2016-02-09,783112979,108.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Witness the beginning of a happy ending,deadpool,7.4,10995
94,170000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...",http://marvel.com/guardians,118340,"[{""id"": 8828, ""name"": ""marvel comic""}, {""id"": ...",en,Guardians of the Galaxy,"Light years from Earth, 26 years after being a...",481.098624,"[{""name"": ""Marvel Studios"", ""id"": 420}, {""name...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2014-07-30,773328629,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,All heroes start somewhere.,guardians of the galaxy,7.9,9742
127,150000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.madmaxmovie.com/,76341,"[{""id"": 2964, ""name"": ""future""}, {""id"": 3713, ...",en,Mad Max: Fury Road,An apocalyptic story set in the furthest reach...,434.278564,"[{""name"": ""Village Roadshow Pictures"", ""id"": 7...","[{""iso_3166_1"": ""AU"", ""name"": ""Australia""}, {""...",2015-05-13,378858340,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,What a Lovely Day.,mad max: fury road,7.2,9427
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4625,0,"[{""id"": 27, ""name"": ""Horror""}]",,426067,[],en,Midnight Cabaret,A Broadway producer puts on a play with a Devi...,0.001389,[],[],1990-01-01,0,94.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The hot spot where Satan's waitin'.,midnight cabaret,0.0,0
4118,0,[],,325140,[],en,Hum To Mohabbat Karega,"Raju, a waiter, is in love with the famous TV ...",0.001186,[],[],2000-05-26,0,0.0,[],Released,,hum to mohabbat karega,0.0,0
4727,0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 18, ""nam...",,65448,"[{""id"": 378, ""name"": ""prison""}, {""id"": 209476,...",en,Penitentiary,A hitchhiker named Martel Gordone gets in a fi...,0.001117,[],[],1979-12-01,0,99.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"There's only one way out, and 100 fools stand ...",penitentiary,4.9,8
3361,0,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 28, ""nam...",,77156,[],en,Alien Zone,A man who is having an affair with a married w...,0.000372,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1978-11-22,0,90.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Don't you dare go in there!,alien zone,4.0,3


In [89]:
df.original_title.head().tolist()

['Minions',
 'Interstellar',
 'Deadpool',
 'Guardians of the Galaxy',
 'Mad Max: Fury Road']

In [1]:
import os
os.getcwd()

'/Users/barcohen/Desktop/שנה ג׳ /מבוא ללמידת מכונה/Ex1'