# Film cleaning

This notebook cleans and explores a film dataset.

## Imports

In [75]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## Setup

In [76]:
# Show full column contents so long descriptions are not truncated
pd.set_option("display.max_colwidth", None)

## Data sourcing

In [77]:
films = pd.read_csv("data/16k_Movies.csv")

In [78]:
films.head()

,Unnamed: 0,Title,Release Date,Description,Rating,No of Persons Voted,Directed by,Written by,Duration,Genres
0,0,Dekalog (1988),"Mar 22, 1996","This masterwork by Krzysztof Kieślowski is one of the twentieth century’s greatest achievements in visual storytelling. Originally made for Polish television, Dekalog focuses on the residents of a housing complex in late-Communist Poland, whose lives become subtly intertwined as they face emotional dilemmas that are at once deeply personal and universally human. Its ten hour-long films, drawing from the Ten Commandments for thematic inspiration and an overarching structure, grapple deftly with complex moral and existential questions concerning life, death, love, hate, truth, and the passage of time. Shot by nine different cinematographers, with stirring music by Zbigniew Preisner and compelling performances from established and unknown actors alike, Dekalog arrestingly explores the unknowable forces that shape our lives. Also available are the longer theatrical versions of the series’ fifth and sixth films: A Short Film About Killing and A Short Film About Love. [Janus Films]",7.4,118,Krzysztof Kieslowski,"Krzysztof Kieslowski, Krzysztof Piesiewicz",9 h 32 m,Drama
1,1,Three Colors: Red,"Nov 23, 1994","Krzysztof Kieslowski closes his Three Colors trilogy in grand fashion, with an incandescent meditation on fate and chance, starring Irène Jacob as a sweet-souled yet somber runway model in Geneva whose life dramatically intersects with that of a bitter retired judge, played by Jean-Louis Trintignant. Meanwhile, just down the street, a seemingly unrelated story of jealousy and betrayal unfolds. Red is an intimate look at forged connections and a splendid final statement from a remarkable filmmaker at the height of his powers. [Criterion]",8.3,241,Krzysztof Kieslowski,"Krzysztof Kieslowski, Krzysztof Piesiewicz, Agnieszka Holland, Edward Zebrowski, Edward Klosinski, Marcin Latallo",1 h 39 m,"Drama,Mystery,Romance"
2,2,The Conformist,"Oct 22, 1970","Set in Rome in the 1930s, this re-release of Bernardo Bertolucci's 1970 breakthrough feature stars Jean-Louis Trintignant as a Mussolini operative sent to Paris to locate and eliminate an old professor who fled Italy when the fascists came to power.",7.3,106,Bernardo Bertolucci,"Alberto Moravia, Bernardo Bertolucci",1 h 47 m,Drama
3,3,Tokyo Story,"Mar 13, 1972","Yasujiro Ozu’s Tokyo Story follows an aging couple, Tomi and Sukichi, on their journey from their rural village to visit their two married children in bustling, postwar Tokyo. Their reception is disappointing: too busy to entertain them, their children send them off to a health spa. After Tomi falls ill she and Sukichi return home, while the children, grief-stricken, hasten to be with her. From a simple tale unfolds one of the greatest of all Japanese films. Starring Ozu regulars Chishu Ryu and Setsuko Hara, the film reprises one of the director’s favorite themes—that of generational conflict—in a way that is quintessentially Japanese and yet so universal in its appeal that it continues to resonate as one of cinema’s greatest masterpieces. [Janus Films]",8.1,147,Yasujirô Ozu,"Kôgo Noda, Yasujirô Ozu",2 h 16 m,Drama
4,4,The Leopard (re-release),"Aug 13, 2004","Set in Sicily in 1860, Luchino Visconti's spectacular 1963 adaptation of Giuseppe di Lampedusa's international bestseller is one of the cinema's greatest evocations of the past, achingly depicting the passing of an ancient order. (Film Forum)",7.8,85,Luchino Visconti,"Giuseppe Tomasi di Lampedusa, Suso Cecchi D'Amico, Pasquale Festa Campanile, Enrico Medioli, Massimo Franciosa, Luchino Visconti",3 h 7 m,"Drama,History"


In [79]:
# Drop the leftover index column ("Unnamed: 0") and the columns not needed here

films = films.drop(columns=["Unnamed: 0", "Written by", "Directed by"])

In [80]:
# Drop rows that have a missing value in any column

films = films.dropna()
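
Called with no arguments, `dropna()` discards a row if any column is empty, so it can help to count missing values per column first. A minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the films frame
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Rating": [7.4, None, 8.1],
    "Genres": ["Drama", "Drama", None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Alternative to a blanket dropna(): drop a row only when a required column is missing
cleaned = toy.dropna(subset=["Rating"])
```

Whether a `subset` is preferable depends on which columns the later analysis actually needs.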

In [81]:
films["Release Date"] = pd.to_datetime(films["Release Date"])
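
The dates shown above follow the pattern `Mar 22, 1996`. Assuming every row uses that pattern, passing an explicit `format` makes the parsing stricter, and `errors="coerce"` turns anything that does not fit into `NaT`. A small sketch on hypothetical sample values (month names are assumed to be English):

```python
import pandas as pd

# Hypothetical sample in the same style as the "Release Date" column
dates = pd.Series(["Mar 22, 1996", "Nov 23, 1994", "not a date"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")
```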

In [82]:
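# Split strings such as "1 h 39 m" into an hours column (0) and a minutes column (1)
duration_df = films["Duration"].str.extract(r"(\d+) h (\d+) m")

# Strings that do not match the pattern become NaN; set them to 0 so they can be filtered out later
duration_df = duration_df.fillna(0)

# Total running time in minutes
duration_df["total_duration"] = duration_df[0].astype(int) * 60 + duration_df[1].astype(int)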

In [83]:
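# Replace the text duration with the total in minutes
films["Duration"] = duration_df["total_duration"]

# Remove films whose duration could not be parsed (total of 0 minutes)
films = films[films["Duration"] > 0]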
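
The pattern used above requires both an hours part and a minutes part, so a value such as `2 h` or `45 m` would be set to 0 and then filtered out. Whether the dataset contains such values is an assumption. If it does, a more tolerant parser could look like the sketch below (`parse_duration_minutes` and `DURATION_PATTERN` are hypothetical names, not part of the notebook):

```python
import re

# Hours and minutes are each optional, e.g. "9 h 32 m", "2 h", "45 m"
DURATION_PATTERN = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def parse_duration_minutes(text: str) -> int:
    """Return the running time in minutes, or 0 if the text cannot be parsed."""
    match = DURATION_PATTERN.match(text)
    if match is None:
        return 0
    hours, minutes = match.groups()
    # A missing part is captured as None and counts as 0
    return int(hours or 0) * 60 + int(minutes or 0)
```

It could be applied with `films["Duration"].map(parse_duration_minutes)` once missing values have been dropped, since the function expects strings.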

In [84]:
films.dtypes

Title                  object        
Release Date           datetime64[ns]
Description            object        
Rating                 float64       
No of Persons Voted    object        
Duration               int64         
Genres                 object        
dtype: object

In [85]:
films.columns = films.columns.str.lower().str.replace(" ", "_")
films = films.rename(columns={
    "no_of_persons_voted": "votes"
})
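
The `dtypes` output above lists the vote count as `object`, which suggests it is stored as text. The cause is not visible in the notebook; the sketch below assumes thousands separators or placeholder strings and uses hypothetical sample values:

```python
import pandas as pd

# Hypothetical values; the real column may contain other non-numeric forms
votes = pd.Series(["118", "241", "1,024", "tbd"])

votes_numeric = pd.to_numeric(
    votes.str.replace(",", "", regex=False),  # strip thousands separators
    errors="coerce",                          # unparseable strings become NaN
).astype("Int64")                             # nullable integer dtype keeps missing values
```

On the real data the same steps would be applied to `films["votes"]`, after checking which non-numeric strings actually occur.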

In [86]:
films.sample()

,title,release_date,description,rating,votes,duration,genres
12529,The Mummy Returns,2001-05-04,"Deep within a shadowy chamber in the British Museum of London, an ancient force of terror is about to be reborn. It is 1933, the Year of the Scorpion. Eight years have passed since dashing legionnaire Rick O'Connell (Fraser) and fearless Egyptologist Evelyn (Weisz) fought for their lives against a 3000-year-old enemy named Imhotep (Vosloo). (Universal Pictures)",3.4,533,130,"Action,Adventure,Fantasy,Thriller"


## Text cleaning

In [87]:
keywords = films["description"]

In [88]:
keywords = keywords.str.lower()

# Text-cleaning issues to consider:
# - Stopwords (grammatically useful but carry little meaning)
# - Bigrams
# - Misspellings
# - Synonyms, homonyms, and words with multiple definitions
# - Letter case
# - Plurals
# - Punctuation
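
Of the items listed, bigrams are the quickest to prototype. A rough sketch using only the standard library, on a hypothetical token list rather than the notebook's data:

```python
from collections import Counter

# Hypothetical token list, already lower-cased and tokenised
tokens = ["short", "film", "about", "killing", "short", "film", "about", "love"]

# Pair each token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))

# Frequent pairs hint at multi-word phrases that may be worth keeping together
bigram_counts = Counter(bigrams)
```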


In [93]:
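# Use a set for fast membership tests when filtering tokens
stops = set(stopwords.words("english"))
# word_tokenize splits contractions into pieces such as "'ve" and "n't"
stops.update(["'ve", "n't", "re-release", "starring", "directed", "award"])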

In [94]:
keyword_tokens = keywords.apply(word_tokenize)

In [97]:
def remove_unwanted_words(tokens: list[str]) -> list[str]:
    """Return the tokens that are not stopwords, have at least three characters, and are not purely numeric."""

    return [t for t in tokens
            if t not in stops
            and len(t) >= 3
            and not t.isdigit()]

keyword_tokens = keyword_tokens.apply(remove_unwanted_words)
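
The length and digit checks still let through tokens made only of punctuation, such as `...`. One possible extension, sketched here, also requires at least one letter. `STOPS` is a tiny hypothetical stand-in for the NLTK list, and `remove_unwanted_words_strict` is not part of the notebook:

```python
# Tiny hypothetical stand-in for the full stopword collection
STOPS = {"the", "and", "n't", "'ve"}


def remove_unwanted_words_strict(tokens: list[str]) -> list[str]:
    """Keep tokens that are not stopwords, have 3+ characters,
    are not purely numeric, and contain at least one letter."""
    return [
        t for t in tokens
        if t not in STOPS
        and len(t) >= 3
        and not t.isdigit()
        and any(ch.isalpha() for ch in t)
    ]


sample = ["the", "mummy", "returns", "...", "1933", "it", "3000-year-old", "n't"]
filtered = remove_unwanted_words_strict(sample)
```

Hyphenated tokens such as `3000-year-old` survive because they contain letters. Whether they should be kept is a separate decision.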

In [98]:
keyword_tokens

0        [masterwork, krzysztof, kieślowski, one, twentieth, century, greatest, achievements, visual, storytelling, originally, made, polish, television, dekalog, focuses, residents, housing, complex, late-communist, poland, whose, lives, become, subtly, intertwined, face, emotional, dilemmas, deeply, personal, universally, human, ten, hour-long, films, drawing, ten, commandments, thematic, inspiration, overarching, structure, grapple, deftly, complex, moral, existential, questions, concerning, life, death, love, hate, truth, passage, time, shot, nine, different, cinematographers, stirring, music, zbigniew, preisner, compelling, performances, established, unknown, actors, alike, dekalog, arrestingly, explores, unknowable, forces, shape, lives, also, available, longer, theatrical, versions, series, fifth, sixth, films, short, film, killing, short, film, love, janus, films]
1        [krzysztof, kieslowski, closes, three, colors, trilogy, grand, fashion, incandescent, meditation, fate, ch