Multi-label natural language processing (NLP) classification model to predict movies' genres from dialogue.
No libraries beyond the Anaconda distribution of Python should be necessary to run the code here. The code should run with no issues using Python version 3.
The motivation behind undertaking this project was to explore whether text patterns could be uncovered in movies' dialogue to act as indicators of their genres. The steps of the process were as follows:
- Extracting and transforming the raw data contained within the source files and loading to a SQLite database
- Exploring the training data to uncover evidence of class and label imbalances
- Implementing NLP techniques to convert raw text data into a matrix of feature variables
- Building and tuning a classification model to produce a multi-output genre prediction when passed a text quote
A more in-depth discussion of the process of building the model can be found in a Medium post linked here.
The data for this project was obtained via a publication from Cornell University (see acknowledgements below).
Of the files provided, there were three data sets of interest to this project: `movie_conversations`, `movie_lines`, and `movie_titles_metadata`. The steps taken to extract the necessary data and transform it into a format suitable for the project were as follows:
- Read the data from each of the three `.txt` files into pandas dataframes
- Assign a conversation ID to each exchange contained in the conversations data set
- Melt the conversations dataframe such that each line of dialogue appears on a separate row with the corresponding conversation ID
- Join the melted dataframe with the lines data set to retrieve the actual text for each line ID
- Join the separate rows via the conversation ID such that the entirety of each exchange appears in text format on an individual row
- Finally, join the dataframe of text conversations with the movie metadata to retrieve the genres for each text document, and load the final dataframe to a SQLite database
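The melt-and-join steps above can be sketched with pandas. The dataframes below are tiny synthetic stand-ins for the corpus files (the real files use a ` +++$+++ ` field separator, and the column names here are illustrative):

```python
import sqlite3

import pandas as pd

# Synthetic stand-ins for the conversations, lines, and metadata files.
conversations = pd.DataFrame({
    "movie_id": ["m0", "m0"],
    "line_ids": [["L1", "L2"], ["L3", "L4"]],
})
lines = pd.DataFrame({
    "line_id": ["L1", "L2", "L3", "L4"],
    "text": ["Hello.", "Hi there.", "Ready?", "Let's go."],
})
titles = pd.DataFrame({
    "movie_id": ["m0"],
    "genres": [["comedy", "romance"]],
})

# Assign a conversation ID to each exchange.
conversations["conv_id"] = range(len(conversations))

# Melt: one row per line of dialogue, keyed by its conversation ID.
melted = conversations.explode("line_ids").rename(columns={"line_ids": "line_id"})

# Join with the lines data to recover the text for each line ID.
melted = melted.merge(lines, on="line_id", how="left")

# Re-join on the conversation ID so each exchange becomes one text document.
docs = (
    melted.groupby(["conv_id", "movie_id"])["text"]
    .apply(" ".join)
    .reset_index(name="dialogue")
)

# Attach the genres from the metadata and load to a SQLite database.
docs = docs.merge(titles, on="movie_id", how="left")
with sqlite3.connect(":memory:") as conn:
    docs.astype({"genres": str}).to_sql(
        "dialogue", conn, if_exists="replace", index=False
    )
```

With the real files, the database target would be the local `dialogue.db` rather than an in-memory connection.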
A more complete description of the raw data files can be found in a README within the corpus folder in the repository, as provided by the publication.
Once the raw data had been transformed into a workable format, the steps for processing the data into a model-friendly representation were as follows:
- Remove all punctuation and special characters from the text using a regular expression
- Tokenise each document in the text
- Lemmatise tokens, set to lower case, strip whitespace and filter out stop words
- Convert documents into a matrix of vectorised token counts
- Transform matrix into tf-idf representation
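A minimal sketch of this preprocessing chain using scikit-learn (the documents here are made up, and the lemmatisation step is noted but omitted to keep the snippet dependency-free):

```python
import re

from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfTransformer,
)

def tokenise(document):
    """Remove punctuation/special characters via a regular expression,
    lower-case, split into tokens, and filter out stop words.
    (The full pipeline also lemmatises each token, e.g. with NLTK's
    WordNetLemmatizer, which is omitted here.)"""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", document).lower()
    return [tok.strip() for tok in cleaned.split() if tok not in ENGLISH_STOP_WORDS]

docs = ["Hello there!", "Are you ready? Let's go."]

# Convert documents into a matrix of vectorised token counts...
counts = CountVectorizer(tokenizer=tokenise).fit_transform(docs)

# ...then transform the matrix into its tf-idf representation.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)
```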
This process rendered the exchanges into a matrix of features suitable for training a Decision Tree classification model, with the 24 genres acting as a multi-label target variable.
The NLP preprocessing and model fitting were bundled into a single pipeline to prevent data leakage and render the model selection and hyper-parameter tuning processes more efficient.
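A pipeline of this shape might look as follows (step names, toy data, and grid values are illustrative, not the project's actual configuration). Because vectorisation happens inside the pipeline, each cross-validation fold fits the vocabulary on its training split only, which is what prevents leakage during tuning:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Bundle preprocessing and the multi-output classifier into one estimator.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier())),
])

# Toy training data: four documents, two binary genre labels each.
X = ["let's rob the bank tonight", "i love you so much",
     "the heist is on", "marry me"]
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # e.g. columns = (crime, romance)

# Hyper-parameter tuning operates over the whole pipeline at once.
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__estimator__min_samples_split": [2, 4]},
    cv=2,
)
grid.fit(X, y)

# A fitted pipeline yields a multi-output prediction for a raw text quote.
preds = grid.predict(["we rob banks"])
print(preds.shape)  # one row of predictions, one column per genre label
```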
- `cornell_movie_dialogs_corpus`: folder containing raw source data files
- `movie_dialogue_etl.py`: Python program to extract and transform the raw data and load it to a local SQLite database
- `dialogue.db`: SQLite database containing the training data set
- `movie_dialogue_clf.ipynb`: notebook used to construct the classification model
The data for this project was obtained courtesy of the publication of the Cornell Movie-Dialogs Corpus from Cristian Danescu-Niculescu-Mizil at Cornell University.