Multi-label natural language processing (NLP) classification model to predict movies' genres from dialogue.
No libraries beyond the Anaconda distribution of Python should be necessary to run the code here. The code should run with no issues using Python version 3.
The motivation behind undertaking this project was to explore whether text patterns could be uncovered in movies' dialogue to act as indicators of their genres. The steps of the process were as follows:
- Extracting and transforming the raw data contained within the source files and loading to a SQLite database
- Exploring the training data to uncover evidence of class and label imbalances
- Implementing NLP techniques to convert raw text data into a matrix of feature variables
- Building and tuning a classification model to produce a multi-output genre prediction when passed a text quote
A more in-depth discussion of the process of building the model can be found in a Medium post linked here.
The data for this project was obtained via a publication from Cornell University (see acknowledgements below).
Of the files provided, there were three data sets of interest to this project: `movie_conversations`, `movie_lines`, and `movie_titles_metadata`. The steps taken to extract the necessary data and transform it into a format suitable for the project were as follows:
- Read the data from each of the three `.txt` files into pandas dataframes
- Assign a conversation ID to each exchange contained in the conversations data set
- Melt the conversations dataframe such that each line of dialogue appears on a separate row with the corresponding conversation ID
- Join the melted dataframe with the lines data set to retrieve the actual text for each line ID
- Join the separate rows via the conversation ID such that the entirety of each exchange appears in text format on an individual row
- Finally, join the dataframe of text conversations with the movie metadata to retrieve the genres for each text document, and load the final dataframe to a SQLite database
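The melt-and-join steps above can be sketched with pandas. The dataframes below are tiny synthetic stand-ins for the corpus files (the real files use a ` +++$+++ ` field separator, and the column names here are illustrative):

```python
import sqlite3

import pandas as pd

# Synthetic stand-ins for the conversations, lines, and metadata files.
conversations = pd.DataFrame({
    "movie_id": ["m0", "m0"],
    "line_ids": [["L1", "L2"], ["L3", "L4"]],
})
lines = pd.DataFrame({
    "line_id": ["L1", "L2", "L3", "L4"],
    "text": ["Hello.", "Hi there.", "Ready?", "Let's go."],
})
titles = pd.DataFrame({
    "movie_id": ["m0"],
    "genres": [["comedy", "romance"]],
})

# Assign a conversation ID to each exchange.
conversations["conv_id"] = range(len(conversations))

# Melt: one row per line of dialogue, keyed by its conversation ID.
melted = conversations.explode("line_ids").rename(columns={"line_ids": "line_id"})

# Join with the lines data to recover the text for each line ID.
melted = melted.merge(lines, on="line_id", how="left")

# Re-join on the conversation ID so each exchange becomes one text document.
docs = (
    melted.groupby(["conv_id", "movie_id"])["text"]
    .apply(" ".join)
    .reset_index(name="dialogue")
)

# Attach the genres from the metadata and load to a SQLite database.
docs = docs.merge(titles, on="movie_id", how="left")
with sqlite3.connect(":memory:") as conn:
    docs.astype({"genres": str}).to_sql(
        "dialogue", conn, if_exists="replace", index=False
    )
```

With the real files, the database target would be the local `dialogue.db` rather than an in-memory connection.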
A more complete description of the raw data files can be found in a README within the corpus folder in the repository, as provided by the publication.
Once the raw data had been transformed into a workable format, the steps for processing the data into a model-friendly representation were as follows:
- Remove all punctuation and special characters from the text using a regular expression
- Tokenise each document in the text
- Lemmatise tokens, set to lower case, strip whitespace and filter out stop words
- Convert documents into a matrix of vectorised token counts
- Transform matrix into tf-idf representation
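A minimal sketch of this preprocessing chain using scikit-learn (the documents here are made up, and the lemmatisation step is noted but omitted to keep the snippet dependency-free):

```python
import re

from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfTransformer,
)

def tokenise(document):
    """Remove punctuation/special characters via a regular expression,
    lower-case, split into tokens, and filter out stop words.
    (The full pipeline also lemmatises each token, e.g. with NLTK's
    WordNetLemmatizer, which is omitted here.)"""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", document).lower()
    return [tok.strip() for tok in cleaned.split() if tok not in ENGLISH_STOP_WORDS]

docs = ["Hello there!", "Are you ready? Let's go."]

# Convert documents into a matrix of vectorised token counts...
counts = CountVectorizer(tokenizer=tokenise).fit_transform(docs)

# ...then transform the matrix into its tf-idf representation.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)
```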
This process rendered the exchanges into a matrix of features suitable for training a Decision Tree classification model, with the 24 genres acting as a multi-label target variable.
The NLP preprocessing and model fitting were bundled into a single pipeline to prevent data leakage and render the model selection and hyper-parameter tuning processes more efficient.
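A pipeline of this shape might look as follows (step names, toy data, and grid values are illustrative, not the project's actual configuration). Because vectorisation happens inside the pipeline, each cross-validation fold fits the vocabulary on its training split only, which is what prevents leakage during tuning:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Bundle preprocessing and the multi-output classifier into one estimator.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier())),
])

# Toy training data: four documents, two binary genre labels each.
X = ["let's rob the bank tonight", "i love you so much",
     "the heist is on", "marry me"]
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # e.g. columns = (crime, romance)

# Hyper-parameter tuning operates over the whole pipeline at once.
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__estimator__min_samples_split": [2, 4]},
    cv=2,
)
grid.fit(X, y)

# A fitted pipeline yields a multi-output prediction for a raw text quote.
preds = grid.predict(["we rob banks"])
print(preds.shape)  # one row of predictions, one column per genre label
```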
- `cornell_movie_dialogs_corpus`: folder containing raw source data files
- `movie_dialogue_etl.py`: Python program to extract and transform the raw data and load it to a local SQLite database
- `dialogue.db`: SQLite database containing the training data set
- `movie_dialogue_clf.ipynb`: notebook used to construct the classification model
The data for this project was obtained courtesy of the publication of the Cornell Movie-Dialogs Corpus from Cristian Danescu-Niculescu-Mizil at Cornell University.