# Objectives

- This notebook builds a machine learning model that predicts the genre(s) of a movie from the text of its plot.
- The dataset contains 42,204 movies. Hyperparameters are tuned on an 80/20 train-test split, and the final model is trained on the full dataset.
- The model-related objects are pickled (saved) and later used to make predictions on new data.

### NOTE: I have created a few helper scripts (present in the same directory as this notebook) to keep things clean and modular.

In [1]:
from collections import namedtuple
from itertools import product
from nltk.corpus import stopwords
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
import config
import data_reader
import predict
import text_cleaner
import utils

In [3]:
pd.set_option('display.max_columns', None)

# Load the pre-processed data, or read the raw data and clean it from scratch

In [4]:
%%time

if config.LOAD_PREPROCESSED_DATA:
    try:
        df_movie = utils.pickle_load(filepath=config.PATH_MOVIE_DATA_PREPROCESSED)
        print("Success!")
    except FileNotFoundError:
        print(
            "Error!",
            "The preprocessed data cannot be loaded because the Pickle file doesn't exist on your device.",
            "Go to config.py (in the current directory) and set `LOAD_PREPROCESSED_DATA` to False.",
            "This will create a Pickle file that can then be loaded (on your next run) by setting `LOAD_PREPROCESSED_DATA` to True.",
            "",
            sep="\n"
        )
        raise  # Re-raise so that later cells don't fail with a NameError on `df_movie`
else:
    df_movie = data_reader.get_movie_data()
    stopwords_list = list(set(stopwords.words('english')))
    df_movie['Plot'] = df_movie['Plot'].apply(text_cleaner.clean_text, stopwords_list=stopwords_list)
    utils.pickle_save(data_obj=df_movie, filepath=config.PATH_MOVIE_DATA_PREPROCESSED)
    print("Success!")

Success!
Wall time: 812 ms
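The helper `text_cleaner.clean_text` lives in a separate script and is not shown in this notebook. The sketch below is only a guess at the kind of cleaning it performs (lowercasing, dropping non-letters, removing stopwords). The function body and the tiny `sample_stopwords` list are assumptions for illustration; the stemmed tokens visible in the data (for example `centuri`) suggest the real helper also applies a stemmer, which is omitted here.

```python
import re

def clean_text(text, stopwords_list):
    """Hypothetical stand-in for text_cleaner.clean_text (the real helper is not shown)."""
    stopword_set = set(stopwords_list)        # set membership checks are fast
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # replace anything that is not a letter or whitespace
    tokens = [tok for tok in text.split() if tok not in stopword_set]
    return " ".join(tokens)

# Tiny stand-in for stopwords.words('english')
sample_stopwords = ["the", "in", "of", "a", "is"]
cleaned = clean_text("Set in the second half of the 22nd century!", stopwords_list=sample_stopwords)
print(cleaned)  # set second half nd century
```

Because `stopwords_list` is a keyword argument, the function can be passed to `Series.apply` in the same way as in the cell above.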


# Viewing the data

In [5]:
df_movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42204 entries, 0 to 42203
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   MovieId    42204 non-null  object
 1   MovieName  42204 non-null  object
 2   Genre      42204 non-null  object
 3   Plot       42204 non-null  object
dtypes: object(4)
memory usage: 1.3+ MB


In [6]:
df_movie.head()

|   | MovieId | MovieName | Genre | Plot |
|---|---|---|---|---|
| 0 | 975900 | Ghosts of Mars | [Thriller, Science Fiction, Horror, Adventure,... | set second half nd centuri film depict mar pla... |
| 1 | 9363483 | White Of The Eye | [Thriller, Erotic thriller, Psychological thri... | seri murder rich young women throughout arizon... |
| 2 | 261236 | A Woman in Flames | [Drama] | eva upper class housewif becom frustrat leav a... |
| 3 | 18998739 | The Sorcerer's Apprentice | [Family Film, Fantasy, Adventure, World cinema] | everi hundr year evil morgana return claim fin... |
| 4 | 6631279 | Little city | [Romantic comedy, Ensemble Film, Comedy-drama,... | adam san francisco base artist work cab driver... |
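Each movie carries a list of genres, so the target is multi-label rather than a single class. The pure-Python sketch below illustrates the idea behind the `MultiLabelBinarizer` used later in this notebook: one column per distinct genre (in sorted order) and a 1 wherever a movie has that genre. The function name `binarize_labels` and the toy genre lists are made up for illustration; this is not scikit-learn's implementation.

```python
def binarize_labels(label_lists):
    """Pure-Python illustration of multi-label binarization."""
    # One column per distinct label, in sorted order
    classes = sorted({label for labels in label_lists for label in labels})
    column_of = {label: idx for idx, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column_of[label]] = 1
        matrix.append(row)
    return classes, matrix

genres = [["Thriller", "Horror"], ["Drama"], ["Drama", "Thriller"]]
classes, y = binarize_labels(genres)
print(classes)  # ['Drama', 'Horror', 'Thriller']
print(y)        # [[0, 1, 1], [1, 0, 0], [1, 0, 1]]
```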


# Tune the hyperparameters

In [7]:
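def get_f1_score_by_thresh(thresh, X_test, y_test, ovr_clf):
    """
    Returns the micro-averaged F1 score on the test set when a genre is
    predicted whenever its predicted probability is at least `thresh`.
    """
    y_pred_prob = ovr_clf.predict_proba(X=X_test)
    y_pred_new = (y_pred_prob >= thresh).astype(int)
    f1_score_by_thresh = f1_score(y_true=y_test, y_pred=y_pred_new, average="micro")
    return round(f1_score_by_thresh, 5)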

def get_runs(dictionary_hyperparams):
    """
    Takes in dictionary of hyperparams and returns list of runs wherein
    each run contains hyperparams to be used for experimentation.
    """
    run_tuple = namedtuple('Run', dictionary_hyperparams.keys())
    runs_list = []
    for value in product(*dictionary_hyperparams.values()):
        runs_list.append(run_tuple(*value))
    return runs_list
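As a small worked example of `get_runs`, the sketch below repeats the function in compact form and applies it to a 2 x 2 grid, which yields four runs (the Cartesian product of the value lists). The shuffle at the end uses `random.Random(...).sample`, a standard-library alternative to the pandas-based shuffle used in the next cell; the seed value 0 is an arbitrary choice for illustration.

```python
from collections import namedtuple
from itertools import product
import random

def get_runs(dictionary_hyperparams):
    """Returns one namedtuple per combination of hyperparameter values."""
    Run = namedtuple('Run', dictionary_hyperparams.keys())
    return [Run(*values) for values in product(*dictionary_hyperparams.values())]

runs = get_runs({'max_df': [0.7, 0.8], 'max_features': [7000, 8000]})
print(len(runs))  # 4
print(runs[0])    # Run(max_df=0.7, max_features=7000)

# Shuffle the order of runs reproducibly (the seed is arbitrary)
shuffled = random.Random(0).sample(runs, k=len(runs))
```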

In [10]:
dictionary_hyperparams = {
    'max_df': [0.65, 0.7, 0.75, 0.8, 0.85],
    'max_features': [7000, 8000, 9000, 10000, 11000]
}

runs = get_runs(dictionary_hyperparams=dictionary_hyperparams)
runs = pd.Series(data=runs).sample(len(runs)).tolist() # Shuffle order of runs
runs

[Run(max_df=0.65, max_features=9000),
 Run(max_df=0.7, max_features=7000),
 Run(max_df=0.7, max_features=8000),
 Run(max_df=0.85, max_features=10000),
 Run(max_df=0.8, max_features=9000),
 Run(max_df=0.8, max_features=8000),
 Run(max_df=0.7, max_features=9000),
 Run(max_df=0.85, max_features=7000),
 Run(max_df=0.75, max_features=8000),
 Run(max_df=0.75, max_features=7000),
 Run(max_df=0.65, max_features=8000),
 Run(max_df=0.65, max_features=7000),
 Run(max_df=0.65, max_features=10000),
 Run(max_df=0.65, max_features=11000),
 Run(max_df=0.85, max_features=11000),
 Run(max_df=0.75, max_features=10000),
 Run(max_df=0.85, max_features=9000),
 Run(max_df=0.8, max_features=10000),
 Run(max_df=0.75, max_features=11000),
 Run(max_df=0.8, max_features=7000),
 Run(max_df=0.75, max_features=9000),
 Run(max_df=0.8, max_features=11000),
 Run(max_df=0.7, max_features=11000),
 Run(max_df=0.85, max_features=8000),
 Run(max_df=0.7, max_features=10000)]
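The two hyperparameters being searched control the vocabulary: `max_df` drops terms that appear in too large a fraction of documents, and `max_features` then keeps only the most frequent remaining terms. The pure-Python sketch below illustrates that selection logic on a toy corpus. The function `select_vocabulary` and the toy documents are assumptions for illustration, and ties are broken alphabetically here, which is a choice of this sketch. It also shows that two `max_df` values give the same vocabulary when no term falls between them, which may be why rows with equal `max_features` share identical scores in the results table later in this notebook.

```python
from collections import Counter

def select_vocabulary(documents, max_df, max_features):
    """Toy illustration of vocabulary pruning by max_df (a proportion) and max_features."""
    n_docs = len(documents)
    doc_freq = Counter()    # number of documents containing each term
    term_freq = Counter()   # total number of occurrences of each term
    for doc in documents:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Keep terms whose document frequency does not exceed max_df * n_docs
    kept = [term for term in term_freq if doc_freq[term] <= max_df * n_docs]
    # Keep the max_features most frequent terms (ties broken alphabetically in this sketch)
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])

docs = ["film film hero hero hero", "film villain villain", "film hero", "film plot"]
print(select_vocabulary(docs, max_df=1.0, max_features=2))   # ['film', 'hero']
print(select_vocabulary(docs, max_df=0.75, max_features=2))  # ['hero', 'villain']
print(select_vocabulary(docs, max_df=0.9, max_features=2))   # ['hero', 'villain']
```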

In [46]:
%%time

df_hyperparams_used = pd.DataFrame()

# The labels and the train-test split do not depend on the hyperparameters, so build them once
mlb = MultiLabelBinarizer()
X = df_movie['Plot']
y = mlb.fit_transform(df_movie['Genre'])
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for run in runs:
    print(run)

    # Create features (fit the vectorizer on the training plots only)
    tfidf = TfidfVectorizer(max_df=run.max_df, max_features=run.max_features)
    X_train = tfidf.fit_transform(raw_documents=X_train_text)
    X_test = tfidf.transform(raw_documents=X_test_text)

    # Build classifier
    log_reg = LogisticRegression()
    ovr_clf = OneVsRestClassifier(estimator=log_reg)
    ovr_clf.fit(X=X_train, y=y_train)

    # Record the micro-averaged F1 score at each candidate probability threshold
    df_temp = pd.DataFrame(data={
        'max_df': run.max_df,
        'max_features': run.max_features,
        'f1_score_thresh50': get_f1_score_by_thresh(thresh=0.5, X_test=X_test, y_test=y_test, ovr_clf=ovr_clf),
        'f1_score_thresh30': get_f1_score_by_thresh(thresh=0.3, X_test=X_test, y_test=y_test, ovr_clf=ovr_clf),
        'f1_score_thresh25': get_f1_score_by_thresh(thresh=0.25, X_test=X_test, y_test=y_test, ovr_clf=ovr_clf),
        'f1_score_thresh20': get_f1_score_by_thresh(thresh=0.2, X_test=X_test, y_test=y_test, ovr_clf=ovr_clf),
    }, index=[0])
    df_hyperparams_used = pd.concat(objs=[df_hyperparams_used, df_temp], axis=0, ignore_index=True, sort=False)

f1_score_columns = ['f1_score_thresh50', 'f1_score_thresh30', 'f1_score_thresh25', 'f1_score_thresh20']
df_hyperparams_used['highest_f1_score'] = df_hyperparams_used[f1_score_columns].max(axis=1)
df_hyperparams_used.sort_values(by='highest_f1_score', ascending=False, ignore_index=True, inplace=True)

Run(max_df=0.65, max_features=9000)
Run(max_df=0.7, max_features=7000)
Run(max_df=0.7, max_features=8000)
Run(max_df=0.85, max_features=10000)
Run(max_df=0.8, max_features=9000)
Run(max_df=0.8, max_features=8000)
Run(max_df=0.7, max_features=9000)
Run(max_df=0.85, max_features=7000)
Run(max_df=0.75, max_features=8000)
Run(max_df=0.75, max_features=7000)
Run(max_df=0.65, max_features=8000)
Run(max_df=0.65, max_features=7000)
Run(max_df=0.65, max_features=10000)
Run(max_df=0.65, max_features=11000)
Run(max_df=0.85, max_features=11000)
Run(max_df=0.75, max_features=10000)


KeyboardInterrupt: 

In [49]:
df_hyperparams_used

|   | max_df | max_features | f1_score_thresh50 | f1_score_thresh30 | f1_score_thresh25 | f1_score_thresh20 | highest_f1_score |
|---|---|---|---|---|---|---|---|
| 0 | 0.7 | 7000 | 0.32421 | 0.44383 | 0.46282 | 0.47459 | 0.47459 |
| 1 | 0.85 | 7000 | 0.32421 | 0.44383 | 0.46282 | 0.47459 | 0.47459 |
| 2 | 0.75 | 7000 | 0.32421 | 0.44383 | 0.46282 | 0.47459 | 0.47459 |
| 3 | 0.65 | 7000 | 0.32421 | 0.44383 | 0.46282 | 0.47459 | 0.47459 |
| 4 | 0.7 | 8000 | 0.32247 | 0.44337 | 0.4635 | 0.47414 | 0.47414 |
| 5 | 0.8 | 8000 | 0.32247 | 0.44337 | 0.4635 | 0.47414 | 0.47414 |
| 6 | 0.75 | 8000 | 0.32247 | 0.44337 | 0.4635 | 0.47414 | 0.47414 |
| 7 | 0.65 | 8000 | 0.32247 | 0.44337 | 0.4635 | 0.47414 | 0.47414 |
| 8 | 0.65 | 9000 | 0.32069 | 0.44351 | 0.46311 | 0.47409 | 0.47409 |
| 9 | 0.8 | 9000 | 0.32069 | 0.44351 | 0.46311 | 0.47409 | 0.47409 |
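The score columns compare micro-averaged F1 at several probability thresholds. The pure-Python sketch below shows how such a comparison works on toy data: micro-averaging pools true positives, false positives and false negatives over every (movie, genre) cell before computing F1, and a lower threshold turns more probabilities into positive predictions. The helper names and the toy numbers are assumptions for illustration; they are not the notebook's data, and the toy example contains no false positives, so it only shows the recall side of the trade-off.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool counts over every (sample, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += (t == 1 and p == 1)
            fp += (t == 0 and p == 1)
            fn += (t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_by_threshold(y_true, y_prob, thresholds):
    """Scores one set of predicted probabilities at several thresholds."""
    scores = {}
    for thresh in thresholds:
        y_pred = [[int(p >= thresh) for p in row] for row in y_prob]
        scores[thresh] = round(micro_f1(y_true, y_pred), 5)
    return scores

# Toy data: 3 movies, 3 genres
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_prob = [[0.6, 0.3, 0.1], [0.1, 0.7, 0.05], [0.35, 0.1, 0.25]]
scores = f1_by_threshold(y_true, y_prob, thresholds=[0.5, 0.3, 0.2])
print(scores)  # {0.5: 0.57143, 0.3: 0.88889, 0.2: 1.0}
```

Passing the probabilities in once and looping over thresholds avoids recomputing them for every threshold.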


# Build the model (on 100% of the dataset, using the best set of hyperparameters)

In [58]:
%%time

mlb = MultiLabelBinarizer()
X = df_movie['Plot'].values
y = mlb.fit_transform(df_movie['Genre'])

# Create features
tfidf = TfidfVectorizer(max_df=0.85, max_features=7000)
X = tfidf.fit_transform(raw_documents=X)

# Build classifier
log_reg = LogisticRegression()
ovr_clf = OneVsRestClassifier(estimator=log_reg)
ovr_clf.fit(X=X, y=y)

Wall time: 2min 26s


In [None]:
# %%time

# mlb = MultiLabelBinarizer()
# X = df_movie['Plot']
# y = mlb.fit_transform(df_movie['Genre'])

# # Train-test split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# # Create features
# tfidf = TfidfVectorizer(max_df=0.85, max_features=7000)
# X_train = tfidf.fit_transform(raw_documents=X_train)
# X_test = tfidf.transform(raw_documents=X_test)

# # Build classifier
# log_reg = LogisticRegression()
# ovr_clf = OneVsRestClassifier(estimator=log_reg)
# ovr_clf.fit(X=X_train, y=y_train)
# y_pred = ovr_clf.predict(X=X_test)

# Pickling (saving) model-related objects

In [59]:
utils.pickle_save(data_obj=mlb, filepath=config.PATH_MODEL_MLB)
utils.pickle_save(data_obj=tfidf, filepath=config.PATH_MODEL_TFIDF)
utils.pickle_save(data_obj=ovr_clf, filepath=config.PATH_MODEL_OVR_CLF)
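The helpers `utils.pickle_save` and `utils.pickle_load` are defined in a separate script that is not shown here. A minimal sketch of how such helpers could look, assuming they are thin wrappers around the standard `pickle` module with the same keyword arguments used in this notebook (`data_obj`, `filepath`); the bodies and the example file name are assumptions for illustration.

```python
import os
import pickle
import tempfile

def pickle_save(data_obj, filepath):
    """Hypothetical stand-in for utils.pickle_save: write an object to disk."""
    with open(filepath, "wb") as file_handle:
        pickle.dump(data_obj, file_handle, protocol=pickle.HIGHEST_PROTOCOL)

def pickle_load(filepath):
    """Hypothetical stand-in for utils.pickle_load: raises FileNotFoundError if the file is missing."""
    with open(filepath, "rb") as file_handle:
        return pickle.load(file_handle)

# Round trip through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    pickle_save(data_obj={"max_df": 0.85, "max_features": 7000}, filepath=path)
    restored = pickle_load(filepath=path)

print(restored)  # {'max_df': 0.85, 'max_features': 7000}
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.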

# Predict for new data

In [None]:
# Delete the in-memory model-related objects (raises NameError if any of them is not defined)
del mlb, tfidf, ovr_clf
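The helper `predict.predict_genres_from_plot` used in the following cells is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it vectorizes the plot with the fitted TF-IDF object, takes per-genre probabilities from the one-vs-rest classifier, and keeps genres at or above a threshold. The function name `predict_genres`, the default `thresh=0.2`, and the optional `clean` callable are assumptions for illustration; the real helper presumably loads the three pickled objects itself and may use a different threshold.

```python
def predict_genres(text, tfidf, ovr_clf, mlb, thresh=0.2, clean=None):
    """
    Hypothetical sketch of a genre predictor.
    `tfidf`, `ovr_clf` and `mlb` are the fitted objects pickled above;
    `clean` is an optional text-cleaning callable applied before vectorizing.
    """
    if clean is not None:
        text = clean(text)
    features = tfidf.transform([text])                   # one document in, one row out
    probabilities = ovr_clf.predict_proba(features)[0]   # one probability per genre
    # mlb.classes_ lists the genres in the same order as the probability columns
    return tuple(
        genre for genre, prob in zip(mlb.classes_, probabilities) if prob >= thresh
    )
```

With the objects reloaded from their pickle files, a call could look like `predict_genres(text, tfidf, ovr_clf, mlb)`.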

In [77]:
text = """
Long-time friends and small-time criminals Eddie, Tom, Soap, and Bacon put together £100,000 so that Eddie, a genius card sharp, can buy into one of "Hatchet" Harry Lonsdale's high-stakes three-card brag games. The game is rigged, however, and the friends end up massively indebted to Harry for £500,000. Harry then sends his debt collector Big Chris, who is often accompanied by his son, Little Chris, to ensure that the debt is paid within a week.

Harry is also interested in a pair of expensive antique shotguns that are up for auction and gets his enforcer Barry "the Baptist" to hire a pair of thieves, Gary and Dean, to steal them from a bankrupt lord. The two turn out to be highly incompetent and unwittingly sell the shotguns to Nick "the Greek", a local fence. After learning this, an enraged Barry threatens the two into getting the guns back.

Eddie returns home one day and overhears his neighbours—a gang of robbers led by a brutal man called "Dog"—planning a heist on some cannabis growers loaded with cash and drugs. Eddie relays this information to the group, intending for them to rob the neighbours as they come back from their heist. In preparation for the robbery, Tom visits Nick the Greek to buy weapons, and ends up buying the two antique shotguns.

The neighbours' heist gets underway, and despite a gang member being killed by his own Bren gun, and an incriminating encounter with a traffic warden, the job is a success; they return home with a duffel bag filled with money and a van loaded with bags of marijuana. Eddie and his friends ambush them as planned, and later return to stash their loot next door. They then have Nick fence the drugs to Rory Breaker, a gangster with a reputation for violence. Rory agrees to the deal but later learns that the drugs were stolen from his own growers. Rory threatens Nick into giving him Eddie's address and brings along one of the growers, Winston, to identify the robbers.

Eddie and his friends spend the night at Eddie's father's bar to celebrate. Meanwhile, Dog's crew accidentally learns that their neighbours are the ones who robbed them, and set up an ambush in Eddie's flat. Rory and his gang arrive instead and a shootout ensues, resulting in the deaths of all but Dog and Winston. Winston leaves with the drugs; Dog leaves with the two shotguns and the money but is waylaid by Big Chris, who knocks him out and takes everything. Gary and Dean, having learned who bought the shotguns and not knowing that Chris works for Harry, follows Chris to Harry's place.

Chris delivers the money and guns to Harry, but when he returns to his car he finds Dog holding Little Chris at knifepoint, demanding the money be returned to him. Chris complies and starts the car. Meanwhile, Gary and Dean burst into Harry's office, starting a confrontation that ends up killing them both, and Harry and Barry as well.

Returning to see the carnage at their flat and their loot missing, Eddie and his friends head to Harry's, but when they discover Harry's corpse they decide to take the money for themselves. Before they are able to leave, Chris crashes into their car to disable Dog, and then brutally bludgeons him to death with his car door. He then takes the debt money back from the unconscious friends, but allows Tom to leave with the antique shotguns after a brief standoff in Harry's office.

The friends are arrested but declared innocent of recent events after the traffic warden identifies Dog and his crew as the culprits. Back at the bar, they send Tom out to dispose of the antique shotguns—the only remaining evidence linking them to the case. Chris then arrives to give back the duffel bag, from which he has taken all the money for himself and his son, and which is empty except for a catalogue of antique weapons. Leafing through the catalogue, the friends learn that the shotguns are actually quite valuable (worth £250,000 to £300,000), and quickly call Tom. The film ends with Tom leaning over the side of a bridge, with his mobile phone stuffed in his mouth and ringing, as he prepares to drop the shotguns into the River Thames.
"""

tuple_genres = predict.predict_genres_from_plot(text=text)
utils.prettify_genres(tuple_genres=tuple_genres)

'Genres are Crime Fiction, Drama, Thriller'

In [78]:
text = """
Chak De! India opens in Delhi during the final minutes of a Hockey World Cup match between Pakistan and India, with Pakistan leading 1–0. When Indian team captain Kabir Khan (Shah Rukh Khan) is fouled, he takes a penalty stroke. His shot just misses, costing India the match. Soon afterwards, media outlets circulate a photograph of Khan shaking hands with the Pakistani captain. The sporting gesture is misunderstood, and the Muslim Khan is suspected of "throwing" the game out of sympathy towards Pakistan. Religious prejudice forces him and his mother (Joyshree Arora) from their family home.

Seven years later Mr. Tripathi (Anjan Srivastav), the head of India's hockey association, meets with Khan's friend and hockey advocate Uttam Singh (Mohit Chauhan) to discuss the Indian women's hockey team. According to Tripathi, the team has no future since the only long-term role for women is to "cook and clean". Uttam, however, tells him that Kabir Khan (whom no one has seen for seven years) wants to coach the team. Initially skeptical, Tripathi agrees to the arrangement.

Khan finds himself in charge of a group of 16 young women (each representing a different state), divided by their competitive nature and regional prejudices. Komal Chautala (Chitrashi Rawat), a village girl from Haryana, clashes with Preeti Sabarwal (Sagarika Ghatge) from Chandigarh; short-tempered Balbir Kaur (Tanya Abrol) from Punjab bullies Rani Dispotta (Seema Azmi) and Soimoi Kerketa (Nisha Nair), who are from remote villages in Jharkhand. Mary Ralte (Kimi Laldawla) from Mizoram and Molly Zimik (Masochon "Chon Chon" Zimik22), from Manipur in North-East India, face widespread racial discrimination, and sexually suggestive comments from some strangers. Team captain Vidya Sharma (Vidya Malvade) must choose between hockey and the wishes of her husband Rakesh's (Nakul Vaid) family, and Preeti's fiancé—Abhimanyu Singh (Vivan Bhatena), vice-captain of the India national cricket team—feels threatened by her involvement with the team.

Khan realizes that he can make the girls winners only if he can help them overcome their differences. During his first few days as coach he benches several players who refuse to follow his rules—including Bindiya Naik (Shilpa Shukla), his most experienced player. In response, Bindiya repeatedly encourages the other players to defy Khan. When she finally succeeds, Khan angrily resigns; however, he invites the staff and team to a farewell lunch at McDonald's. During the lunch, local boys make a pass at Mary; Balbir attacks them, triggering a brawl between the boys and the team. Khan, recognizing that they are acting as one for the first time, prevents the staff from intervening; he only stops a man from hitting one of the women from behind with a cricket bat, telling him that there are no cowards in hockey. In an about-face, after the fight the women ask Khan to remain as their coach.

The team faces new challenges. When Tripathi refuses to send the women's team to Australia for the World Cup, Khan proposes a match against the men's team. Although his team loses, their performance inspires Tripathi to send them to Australia after all. Bindiya is upset with Khan for choosing Vidya over her as the captain of the team. The result sees a loss in the tournament with a 7–0 to Australia. When Khan confronts Bindiya about her behavior on the field, Bindiya responds by seducing Khan, to which he rejects her advances and asks her to stay away from the game. Khan goes on to train the girls and again, which is followed by victories over England, Spain, South Africa, New Zealand, Argentina. Just before their game with Korea, Khan approaches Bindiya to go back in the field and break the strategy of 'Man to Man' marking by Korean team so they can win the match. Bindiya goes on to the field and with the help of Gunjan Lakhani manages to beat South Korea. They are again matched with Australia for the final; this time, they defeat the Hockeyroos for the World Cup. When the team returns home, their families treat them with greater respect and Khan, his good name restored, returns with his mother to their ancestral home.
"""

tuple_genres = predict.predict_genres_from_plot(text=text)
utils.prettify_genres(tuple_genres=tuple_genres)

'Genres are Drama, Sports'

In [79]:
text = """
Anti-Terrorism Squad (ATS) officer Daanish Ali (Farhan Akhtar) lives with his wife Ruhana (Aditi Rao Hydari) and little daughter Noorie. One day, while Daanish is driving with his wife and daughter, he spots the terrorist Farooq Rameez and chases him. Noorie is killed in the ensuing shootout while Rameez escapes. Ruhana is shattered and blames Daanish for Noorie's death. Daanish later kills Rameez during a police operation, angering his seniors, who wanted Rameez alive to find out which politician he was going to meet.

A grief-stricken Daanish is about to kill himself at Noorie's grave when a mysterious van appears, driving off when Daanish yells at it. Daanish finds a wallet lying where the van was, goes to return it, and meets its owner, a handicapped chess master named Pandit Omkar Nath Dhar (Amitabh Bachchan), who turns out to have been Noorie's chess teacher. Pandit starts teaching Daanish chess and tells him about Nina, Pandit's daughter who also died. Nina (Vaidehi Parashurami) had been teaching chess to Ruhi (Mazel Vyas), the daughter of a government minister, Yazaad Qureshi (Manav Kaul), and had fallen down the stairs at Qureshi's house and died. Pandit is convinced that her death was not an accident. Intrigued, Daanish tries to meet Qureshi, but police officers at the minister's office threaten him with arrest. Daanish later meets Ruhi at her school to ask her about Nina, but Ruhi is taken away by Qureshi who threatens her for talking to Daanish. That night Pandit is brutally attacked by Wazir (Neil Nitin Mukesh), a hitman sent by Qureshi, who warns Pandit and Daanish to stop chasing Qureshi.

Pandit leaves for Kashmir, where Qureshi is headed. Wazir calls Daanish and threatens to kill Pandit. Daanish frantically chases Pandit's van but Wazir blows it up killing Pandit. Determined to exact revenge and discover who Wazir is, Daanish makes a plan with the Superintendent of Police (SP) Vijay Mallik (John Abraham). During Qureshi's speech, SP detonates explosives, causing a panic, and holds back the police to give Daanish time to act. Qureshi escapes in the confusion and goes to where he and his daughter are staying, but Daanish breaks in and asks him about Wazir. Qureshi claims he does not know anybody called Wazir. Crying, Ruhi reveals to Daanish that Qureshi is not her father – he is actually one of the militants who massacred her entire village and posed as her father when Indian Army troops arrived. Ruhi had told Nina this, so Qureshi killed Nina. Daanish realises that Rameez and the other terrorists had come to meet Qureshi, and kills him.

A few days later, while watching Ruhana's play – based on chess and dedicated to Pandit – Daanish realises that Pandit was in fact the play's "weak pawn" who befriended a "strong rook" (Daanish) who would kill a "wicked king" (Qureshi). Shocked, Daanish finds Pandit's housekeeper, who says that she did not actually see Wazir on the night he attacked Pandit. She gives Daanish a USB pen-drive which Pandit had told her to give to Daanish if he came looking for Wazir. The pen-drive contains a video of Pandit, who explains that Wazir never existed – he was just a persona created by Pandit, who knew that, due to his handicap, he was powerless against Qureshi. Pandit intentionally dropped his wallet near Noorie's grave so Daanish could find it and befriend him. His knife wounds from Wazir's attack were self-inflicted, and he had used voice recordings to pose as Wazir on the phone. Pandit had sacrificed himself to ensure that Daanish would kill Qureshi and get revenge for both of them. Shaken by the revelation, Daanish and Ruhana reunite.
"""

tuple_genres = predict.predict_genres_from_plot(text=text)
utils.prettify_genres(tuple_genres=tuple_genres)

'Genres are Action, Thriller'