# Cornell Movie Dialog Dataset
A collection of conversations extracted from movie scripts, created by researchers at *Cornell University*.

- Website: https://www.cs.cornell.edu/~cristian/Chameleons_in_imagined_conversations.html

- Dataset: http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip

- Paper: https://www.cs.cornell.edu/~cristian/papers/chameleons.pdf

**Citation:**
>Danescu-Niculescu-Mizil, C., & Lee, L. (2011).
>
> *Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.*
>
> In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.



**NLP Group Assignment**
- Elif Gamze GULITER
- Romane KULESZA
- Volkan MAZLUM
- Juan Pablo RAMIREZ

**Politecnico di Milano**

The purpose of this notebook is to ...

### Connect to Drive
**(optional for Google Colab users)**

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

%cd /content/drive/My Drive/NLP Cornell Movie Dataset/

Mounted at /content/drive/
/content/drive/My Drive/NLP Cornell Movie Dataset


### Download and unzip the Cornell Movie Dialog dataset
**(optional)**

In [1]:
# Download and unzip the Cornell Movie Dialog dataset
download = False
if download:
  !curl -L -o cornell_movie_dialogs.zip http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
  !unzip cornell_movie_dialogs.zip
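
The cell above relies on IPython shell magics (`!curl`, `!unzip`), which only work inside a notebook. As a rough alternative, the sketch below does the same with the standard library. The helper name `download_and_unzip` and the local file names are assumptions made for illustration; the download is skipped when the archive is already on disk.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the dataset link at the top of this notebook
CORPUS_URL = "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"

def download_and_unzip(url, zip_path, extract_dir):
  """Hypothetical helper: fetch the archive if missing, extract it, return member names."""
  zip_path = Path(zip_path)
  if not zip_path.exists():
    # Only hit the network when the archive is not already present
    urllib.request.urlretrieve(url, str(zip_path))
  with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(extract_dir)
    return sorted(archive.namelist())

download = False  # same opt-in switch as the cell above
if download:
  names = download_and_unzip(CORPUS_URL, "cornell_movie_dialogs.zip", ".")
  print(f"Extracted {len(names)} archive members")
```

Keeping the `download` flag off by default means re-running the notebook does not fetch the archive again.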

### Import Libraries

In [2]:
# Fix randomness and hide warnings
seed = 42

import os
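os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # silence TensorFlow C++ log output
# PYTHONHASHSEED is read at interpreter start-up, so setting it here
# only affects child processes, not the already running kernel
os.environ['PYTHONHASHSEED'] = str(seed)
os.environ['MPLCONFIGDIR'] = os.getcwd()+'/configs/'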

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=Warning)

import numpy as np
np.random.seed(seed)

import logging

import random
random.seed(seed)
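
To check that seeding behaves as intended, the minimal sketch below re-seeds Python's `random` module and compares the draws. `draw_sample` is a hypothetical helper used only for this illustration; `np.random.seed` can be checked in the same way.

```python
import random

def draw_sample(seed, n=3):
  """Hypothetical helper: re-seed the generator and return n pseudo-random floats."""
  random.seed(seed)
  return [random.random() for _ in range(n)]

# The same seed reproduces the same sequence of draws
same_seed_matches = draw_sample(42) == draw_sample(42)
print(f"Same seed gives identical draws: {same_seed_matches}")
```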

In [3]:
# Import NLP libraries
import re

In [4]:
# Import other libraries
import ast
import pandas as pd
import matplotlib.pyplot as plt
plt.rc('font', size=16)

### Load Cornell Movie Dialog Dataset

In [5]:
class CornellMovieDialogDataset:
  """
  Class to represent the Cornell Movie Dialog Dataset.

  Attributes:
    - base_path (str): The base path where the dataset files are located.
    - movies (dict): A dictionary to store movie metadata (indexed by movie ID).
        - title
        - year
        - IMDB_rating
        - IMDB_votes
        - genres
    - characters (dict): A dictionary to store character metadata (indexed by character ID).
        - name
        - movie_id
        - gender
        - pos_credits
    - utterances (dict): A dictionary to store movie lines (indexed by utterance ID).
        - character_id
        - movie_id
        - text
    - conversations (list): A list to store conversation data (represented as a list of utterance IDs).
  """
  def __init__(self, base_path):
    self.base_path = base_path
    self.movies = {}
    self.characters = {}
    self.utterances = {}
    self.conversations = []
    self.load_data()

  def load_data(self):
    """
    Loads the data from the dataset files into the class attributes.
    Files are read with the dataset's original ISO-8859-1 encoding.
    """
    # Load movie titles
    with open(self.base_path + "/movie_titles_metadata.txt", "r", encoding="iso-8859-1") as file:
      for line in file:
        parts = [part.strip() for part in line.split("+++$+++")]
        self.movies[parts[0]] = {
            'title': parts[1],
            # raw strings avoid invalid-escape warnings for \d in the patterns
            'year': int(re.match(r'\d+', parts[2])[0]),
            'IMDB_rating': float(re.match(r'[-+]?\d*\.\d+', parts[3])[0]),
            'IMDB_votes': int(re.match(r'\d+', parts[4])[0]),
            'genres': ast.literal_eval(parts[5])
        }

    # Load characters
    with open(self.base_path + "/movie_characters_metadata.txt", "r", encoding="iso-8859-1") as file:
      for line in file:
        parts = [part.strip() for part in line.split("+++$+++")]
        self.characters[parts[0]] = {
            'name': parts[1],
            'movie_id': parts[2],
            # ignore movie title (redundant)
            'gender': parts[4],
            'pos_credits': parts[5]
        }

    # Load movie lines
    with open(self.base_path + "/movie_lines.txt", "r", encoding="iso-8859-1") as file:
      for line in file:
        parts = [part.strip() for part in line.split("+++$+++")]
        self.utterances[parts[0]] = {
            'character_id': parts[1],
            'movie_id': parts[2],
            # ignore character name (redundant)
            'text': parts[4]
        }

    # Load conversations
    with open(self.base_path + "/movie_conversations.txt", "r", encoding="iso-8859-1") as file:
      for line in file:
        parts = [part.strip() for part in line.split("+++$+++")]
        self.conversations.append(ast.literal_eval(parts[3]))
        # ignore character ids / movie id (redundant)

  def print_summary(self):
    """
    Prints a summary of the dataset.
    """
    print(f'Number of movies: {len(self.movies)}')
    print(f'Number of characters: {len(self.characters)}')
    print(f'Number of utterances: {len(self.utterances)}')
    print(f'Number of conversations: {len(self.conversations)}')

  def print_random_conversation(self):
    """
    Prints a random conversation from the dataset.
    """
    conversation = random.choice(self.conversations)
    movie = self.movies[self.utterances[conversation[0]]['movie_id']]
    print(f"{movie['title']} ({movie['year']})")
    for line in conversation:
      print(f"- {self.characters[self.utterances[line]['character_id']]['name']}: {self.utterances[line]['text']}")
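
To make the parsing in `load_data` concrete, here is a small standalone sketch of how one raw line is split on the `+++$+++` separator and converted into typed fields. The two sample lines are written in the corpus format purely for illustration, and `split_fields` is a hypothetical helper that is not part of the class above.

```python
import ast
import re

SEPARATOR = "+++$+++"

def split_fields(line):
  """Hypothetical helper: split one raw corpus line into stripped fields."""
  return [part.strip() for part in line.split(SEPARATOR)]

# Illustrative lines in the corpus format (values are for demonstration only)
title_line = "m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']\n"
conversation_line = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']\n"

title_fields = split_fields(title_line)
movie = {
  'title': title_fields[1],
  'year': int(re.match(r'\d+', title_fields[2])[0]),
  'IMDB_rating': float(re.match(r'[-+]?\d*\.\d+', title_fields[3])[0]),
  'IMDB_votes': int(re.match(r'\d+', title_fields[4])[0]),
  # the genre list is stored as a Python literal, so literal_eval recovers it
  'genres': ast.literal_eval(title_fields[5]),
}
utterance_ids = ast.literal_eval(split_fields(conversation_line)[3])

# The regex keeps only the leading digits, so a year with a trailing
# suffix (a made-up value here) would still parse to an integer
suffixed_year = int(re.match(r'\d+', '1989/I')[0])

print(movie)
print(utterance_ids)
print(suffixed_year)
```

Using `ast.literal_eval` rather than `eval` restricts the input to plain Python literals, which is the safer choice when reading text files.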


In [8]:
# Instantiate dataset object and load data
DATASET_FOLDER = 'data'
dataset = CornellMovieDialogDataset(os.path.join(os.getcwd(), DATASET_FOLDER))


In [9]:
dataset.print_summary()

Number of movies: 617
Number of characters: 9035
Number of utterances: 304713
Number of conversations: 83097
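
The counts alone do not show whether the structures are consistent with one another. One cheap sanity check, sketched here on toy stand-in data rather than the real corpus, is to confirm that every utterance ID referenced by a conversation exists in the utterance dictionary. `find_missing_utterances` is a hypothetical helper, not part of the class above.

```python
def find_missing_utterances(conversations, utterances):
  """Hypothetical helper: IDs referenced by conversations but absent from utterances."""
  return sorted({uid for conv in conversations for uid in conv if uid not in utterances})

# Toy stand-ins shaped like dataset.utterances and dataset.conversations
toy_utterances = {
  'L1': {'character_id': 'u0', 'movie_id': 'm0', 'text': 'Hello.'},
  'L2': {'character_id': 'u1', 'movie_id': 'm0', 'text': 'Hi.'},
}
toy_conversations = [['L1', 'L2'], ['L2', 'L3']]

missing = find_missing_utterances(toy_conversations, toy_utterances)
print(missing)  # ['L3']
```

On the loaded corpus the equivalent call would be `find_missing_utterances(dataset.conversations, dataset.utterances)`; an empty list would mean that every conversation can be fully resolved.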


In [12]:
dataset.print_random_conversation()

a walk to remember (2002)
- REV. SULLIVAN: What's Landon Carter up to?
- JAMIE: Up to?
- REV. SULLIVAN: I thought we had rid ourselves of his disagreeable companionship.


In [6]:
#!pip3 install convokit
#from convokit import Corpus, download
#corpus = Corpus(filename=download("movie-corpus"))
#dir(corpus)