# Film Exploration

This notebook cleans and explores a film dataset, then uses the film descriptions to find similar films.

## Imports

In [124]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Setup

In [125]:
# Show the full text of each column instead of truncating it
pd.set_option("display.max_colwidth", None)

## Data sourcing

In [126]:
films = pd.read_csv("data/16k_Movies.csv")

## Data cleaning

In [127]:
# Drop the leftover index column and the columns we won't use

films = films.drop(columns=["Unnamed: 0", "Written by", "Directed by"])

In [128]:
# Drop rows with any missing values

films = films.dropna()

In [129]:
# Parse release dates as datetimes

films["Release Date"] = pd.to_datetime(films["Release Date"])

In [130]:
# Standardise durations to total minutes
# (values not in "<hours> h <minutes> m" form become 0 and are dropped below)

duration_df = films["Duration"].str.extract(r"(\d+) h (\d+) m")

duration_df = duration_df.fillna(0)

duration_df["total_duration"] = duration_df[0].astype(int) * 60 + duration_df[1].astype(int)

films["Duration"] = duration_df["total_duration"]

films = films[films["Duration"] > 0]
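The pattern above only matches durations that contain both an hours part and a minutes part; everything else ends up as 0 and is filtered out. If the dataset also contains values such as "2 h" or "45 m" (an assumption, not checked here), a more forgiving parser might look like the sketch below. `parse_duration` and `DURATION_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical pattern: the hours part and the minutes part are both optional,
# so "2 h", "45 m" and "1 h 41 m" are all accepted.
DURATION_PATTERN = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")

def parse_duration(text: str) -> int:
    """Return the duration in minutes, or 0 if the text cannot be parsed."""
    match = DURATION_PATTERN.match(text)
    if match is None:
        return 0
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)

examples = ["1 h 41 m", "2 h", "45 m", "unknown"]
parsed = {text: parse_duration(text) for text in examples}
print(parsed)
```

In the notebook this could be applied with `films["Duration"].apply(parse_duration)` in place of the `str.extract` step, keeping the final `> 0` filter to drop anything that still fails to parse.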

In [131]:
# Simplify column names

films.columns = films.columns.str.lower().str.replace(" ", "_")
films = films.rename(columns={
    "no_of_persons_voted": "votes"
})

In [132]:
films.sample()

|  | title | release_date | description | rating | votes | duration | genres |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 8917 | Gifted | 2017-04-07 | Frank Adler (Chris Evans) is a single man raising a child prodigy – his spirited young niece Mary (Mckenna Grace) – in a coastal town in Florida. Frank’s plans for a normal school life for Mary are foiled when the seven-year-old’s mathematical abilities come to the attention of Frank’s formidable mother Evelyn (Lindsay Duncan) whose plans for her granddaughter threaten to separate Frank and Mary. | 7.5 | 166 | 101 | Drama |


## Text cleaning

In [133]:
keywords = films["description"]

In [134]:
# Case-fold

keywords = keywords.str.lower()


In [135]:
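# Get a stopwords list and customise it (a set makes membership checks fast)

stops = set(stopwords.words("english"))
# word_tokenize splits contractions into pieces such as "n't" and "'ve"
stops.update(["'ve", "n't", "re-release", "starring", "directed", "award", "stars"])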

In [136]:
# Tokenise the text

keyword_tokens = keywords.apply(word_tokenize)

In [137]:
# Purge tokens that are not of interest

def remove_unwanted_words(tokens: list[str]) -> list[str]:
    """Returns a list of tokens filtered for undesirables."""

    return [t for t in tokens
            if t not in stops
            and len(t) >= 3
            and not t.isdigit()]

keyword_tokens = keyword_tokens.apply(remove_unwanted_words)

In [138]:
keyword_tokens

# Ideas for next steps:
# - more cleaning
# - recommendations (similarity)
# - sentiment (how positive/negative)
# - word clouds / most frequent keywords

0        [masterwork, krzysztof, kieślowski, one, twentieth, century, greatest, achievements, visual, storytelling, originally, made, polish, television, dekalog, focuses, residents, housing, complex, late-communist, poland, whose, lives, become, subtly, intertwined, face, emotional, dilemmas, deeply, personal, universally, human, ten, hour-long, films, drawing, ten, commandments, thematic, inspiration, overarching, structure, grapple, deftly, complex, moral, existential, questions, concerning, life, death, love, hate, truth, passage, time, shot, nine, different, cinematographers, stirring, music, zbigniew, preisner, compelling, performances, established, unknown, actors, alike, dekalog, arrestingly, explores, unknowable, forces, shape, lives, also, available, longer, theatrical, versions, series, fifth, sixth, films, short, film, killing, short, film, love, janus, films]
1        [krzysztof, kieslowski, closes, three, colors, trilogy, grand, fashion, incandescent, meditation, fate, ch

## Vectorisation

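Vectorisation turns something complex, such as a film description, into a long list of numbers, which can be read as a point in a multi-dimensional space. In the example below, each distinct word is one dimension, and each line of text becomes a row marking whether that word is present (T) or absent (F).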

One ring to rule them all  
One ring to find them  
One ring to bring them all  
And in the darkness bind them  


| One | ring | to | rule | them | all | find | bring | and | in | the | darkness | bind |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T | T | T | T | T | T | F | F | F | F | F | F | F |
| T | T | T | F | T | F | T | F | F | F | F | F | F |
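The idea behind the table can be sketched in plain Python: collect every distinct word as a column, then mark for each line whether the word occurs in it. This is only an illustration of boolean vectorisation by hand, not what the notebook's vectoriser does internally; the names `lines`, `vocabulary` and `boolean_vectors` are made up for the example.

```python
lines = [
    "One ring to rule them all",
    "One ring to find them",
    "One ring to bring them all",
    "And in the darkness bind them",
]

# Build the vocabulary in order of first appearance (case-folded).
vocabulary: list[str] = []
for line in lines:
    for word in line.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

# One boolean vector per line: True where the line contains the word.
boolean_vectors = [
    [word in line.lower().split() for word in vocabulary]
    for line in lines
]

print(vocabulary)
for vector in boolean_vectors:
    print(" ".join("T" if flag else "F" for flag in vector))
```

Each line is now a point in a 13-dimensional space, one dimension per distinct word.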


In [180]:
# Combine tokens back into strings

keyword_strings = keyword_tokens.apply(" ".join)

In [201]:
# Alternative: raw word counts instead of TF-IDF weights
# c_vec = CountVectorizer(max_features=10000)
t_vec = TfidfVectorizer(max_features=10000)

keyword_vectors = t_vec.fit_transform(keyword_strings)

In [202]:
keyword_vectors.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(12505, 10000))

## Similarity

In [203]:
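# 5734 is a row position in the matrix (films.iloc[5734]), not an index label.
# Slicing the sparse matrix keeps the row two-dimensional and avoids
# building the full dense array.
joana_film_vector = keyword_vectors[5734:5735]

In [204]:
joana_film_similarities = cosine_similarity(joana_film_vector, keyword_vectors)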

In [205]:
joana_film_similarities

array([[0.02920762, 0.        , 0.07624641, ..., 0.        , 0.        ,
        0.        ]], shape=(1, 12505))
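To make the scores above less opaque, here is a minimal hand-written sketch of cosine similarity on toy vectors. It is an illustration of the formula only (dot product divided by the product of the vector lengths), not the library's implementation; the function name `cosine` and the example vectors are invented for this sketch.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # choice made for this sketch: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

query = [1.0, 1.0, 0.0]
same_direction = [2.0, 2.0, 0.0]   # same words, just "more" of them
partial_overlap = [1.0, 0.0, 0.0]  # shares one of the two words
no_overlap = [0.0, 0.0, 3.0]       # shares no words at all

scores = {
    "same_direction": cosine(query, same_direction),
    "partial_overlap": cosine(query, partial_overlap),
    "no_overlap": cosine(query, no_overlap),
}
print(scores)
```

Because only the direction of the vectors matters, a long description and a short one that use the same words in the same proportions score close to 1, while descriptions with no shared vocabulary score 0.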

In [206]:
films["j_score"] = joana_film_similarities[0]

In [207]:
films.sort_values("j_score", ascending=False)["title"].head(10)

7564     Café Society            
3356     Roger Dodger            
1334     Night Moves             
1333     Night Moves             
14864    St. Elmo's Fire         
3978     Anaïs in Love           
6245     1984                    
658      Los Angeles Plays Itself
8428     The Bubble              
8429     The Bubble              
Name: title, dtype: object

In [208]:
ben_vector = t_vec.transform(["magic dragons wizards fire space"])

In [209]:
ben_vector

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5 stored elements and shape (1, 10000)>

In [210]:
films["b_score"] = cosine_similarity(ben_vector, keyword_vectors)[0]

In [211]:
films.sort_values("b_score", ascending=False)["title"].head(10)

2756     How to Train Your Dragon 2               
4510     Harry Potter and the Order of the Phoenix
14180    SpaceCamp                                
6883     Harry Potter and the Sorcerer's Stone    
15351    Your Highness                            
3493     Raya and the Last Dragon                 
13065    Fire and Ice                             
10180    Return to Space                          
4067     Fire Will Come                           
8993     Make Believe                             
Name: title, dtype: object

In [192]:
"dragons" in t_vec.get_feature_names_out()

True

In [None]:
# Three ways to weight a word in a document vector:
# boolean: is the word present?
# count:   how many times does the word appear?
# tf-idf:  how often does the word appear, relative to how common it is across all documents?
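The three weighting schemes can be compared on a toy corpus. The sketch below uses the simple textbook form of tf-idf, count × log(N / document frequency); this is an assumption made for clarity, and scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalises each row by default, so its numbers will differ. The corpus and the helper `weights` are made up for the example.

```python
import math
from collections import Counter

documents = [
    "dragon dragon fire",
    "dragon wizard",
    "space fire fire fire",
]
tokenised = [doc.split() for doc in documents]
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# Document frequency: in how many documents does each word appear?
doc_freq = {w: sum(w in tokens for tokens in tokenised) for w in vocabulary}

def weights(tokens: list[str]) -> dict[str, dict[str, float]]:
    """Weight every vocabulary word for one document under each scheme."""
    counts = Counter(tokens)
    n_docs = len(tokenised)
    return {
        "boolean": {w: float(w in counts) for w in vocabulary},
        "count": {w: float(counts[w]) for w in vocabulary},
        # Textbook tf-idf (not scikit-learn's exact formula).
        "tfidf": {w: counts[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary},
    }

first = weights(tokenised[0])
for scheme, values in first.items():
    print(scheme, values)
```

In the first document "dragon" appears twice, so it gets twice the count of "fire"; under tf-idf both are scaled down equally because each occurs in two of the three documents, whereas a word found in only one document would keep more of its weight.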

In [200]:
keyword_tokens[12588]

['larry',
 'daley',
 'ben',
 'stiller',
 'heads',
 'london',
 'revitalize',
 'magic',
 'life-giving',
 'tablet']