# Cornell Movie Dialog Dataset Analysis by 'Webwizards'


## Contributors:
  * [Rishikesh Miriyala](https://github.com/Rishi24109)
  * [Nishith Ranjan Biswas](https://github.com/Nishith170217)


## Overview

This repository hosts the analysis of the Cornell Movie Dialogs dataset for a Natural Language Processing course project at Politecnico di Milano (Polimi). This project aims to leverage various NLP techniques to explore and model movie script dialogues, enabling a deeper understanding of conversational dynamics in films. This README outlines the dataset details, analysis methods, and insights gained from modeling efforts.

## Dataset Description

The Cornell Movie Dialogs dataset is a comprehensive compilation of movie character dialogues and associated metadata. Here are some key details:

* Source: [Cornell Movie Dialogs Dataset on Hugging Face](https://huggingface.co/datasets/cornell_movie_dialog)
* Reference Paper: [Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs](https://arxiv.org/abs/1106.3077)
* Content Description: The dataset contains movie dialogue scripts accompanied by detailed metadata, such as film title, characters involved, and more.
* Document Type: Dialogues exchanged between movie characters.
* Size: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies.
* Movie metadata included:
  1. genres
  2. release year
  3. IMDB rating
  4. number of IMDB votes
* Character metadata included:
  1. gender (for 3,774 characters)
  2. position on movie credits (3,321 characters)

* Primary Tasks:
  * Film Dialog Generation: Generate contextually appropriate responses based on previous dialogue exchanges.
  * Prediction of Metadata: Predict metadata attributes like film title or character traits based on specific dialogues.

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd

## Importing Dataset

In [2]:
conversations = pd.read_csv(
    "/Users/nishithranjanbiswas/Desktop/NLP/Cornell-Movie-Dialog-Analysis-NLP-Course-Project-/data/movie_conversations.tsv", 
    sep='\t', 
    encoding='ISO-8859-2',
    names = ['charID_1', 'charID_2', 'movieID', 'conversation']
)

In [3]:
conversations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83097 entries, 0 to 83096
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   charID_1      83097 non-null  object
 1   charID_2      83097 non-null  object
 2   movieID       83097 non-null  object
 3   conversation  83097 non-null  object
dtypes: object(4)
memory usage: 2.5+ MB
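
The absolute path used in the loading cell is specific to one machine. The following is a minimal sketch of a more portable alternative, assuming the TSV files sit in a `data/` folder at the repository root (as the path suggests) and that the notebook is run from that root. `DATA_DIR` and `data_file` are hypothetical names introduced here for illustration only.

```python
from pathlib import Path

# Assumption: the notebook runs from the repository root, which contains data/.
DATA_DIR = Path("data")


def data_file(name, data_dir=DATA_DIR):
    """Build the path to one of the dataset files inside the data directory."""
    return data_dir / name


conversations_path = data_file("movie_conversations.tsv")
lines_path = data_file("movie_lines.tsv")
print(conversations_path.name)
```

The resulting `Path` objects can be passed directly to `pd.read_csv` in place of the hard-coded strings.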


In [4]:
conversations.head()

,charID_1,charID_2,movieID,conversation
0,u0,u2,m0,['L194' 'L195' 'L196' 'L197']
1,u0,u2,m0,['L198' 'L199']
2,u0,u2,m0,['L200' 'L201' 'L202' 'L203']
3,u0,u2,m0,['L204' 'L205' 'L206']
4,u0,u2,m0,['L207' 'L208']
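
The `conversation` column is read as a plain string such as `['L194' 'L195' 'L196' 'L197']`, not as a Python list. Below is a small sketch of one way to recover the individual line IDs, assuming every ID has the form `L` followed by digits, as in the rows shown above. `parse_line_ids` is a hypothetical helper and not part of the original notebook.

```python
import re


def parse_line_ids(raw):
    """Extract line IDs such as 'L194' from a string like "['L194' 'L195']"."""
    return re.findall(r"L\d+", raw)


example = "['L194' 'L195' 'L196' 'L197']"
print(parse_line_ids(example))  # ['L194', 'L195', 'L196', 'L197']
```

On the real data this could be applied with `conversations['conversation'].apply(parse_line_ids)`.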


In [5]:
lines = pd.read_csv(
    "/Users/nishithranjanbiswas/Desktop/NLP/Cornell-Movie-Dialog-Analysis-NLP-Course-Project-/data/movie_lines.tsv", 
    encoding='utf-8-sig', 
    sep='\t', 
    on_bad_lines="skip", 
    header = None,
    names = ['lineID', 'charID', 'movieID', 'charName', 'text'],
    index_col=['lineID']
)

In [6]:
lines.info()

<class 'pandas.core.frame.DataFrame'>
Index: 293202 entries, L1045 to L666256
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   charID    288917 non-null  object
 1   movieID   288917 non-null  object
 2   charName  288874 non-null  object
 3   text      288663 non-null  object
dtypes: object(4)
memory usage: 11.2+ MB
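
The `info()` output reports 293,202 rows but only 288,663 non-null `text` values, so 4,539 rows carry no dialogue text. One possible way to inspect and drop such rows is sketched below on a tiny hand-made frame. The first two rows are taken from the dataset preview, while the third row (`L0`) is invented purely to illustrate a missing record; `toy_lines` and `clean_lines` are illustrative names.

```python
import pandas as pd

# Two real rows from the preview plus one invented, fully empty row.
toy_lines = pd.DataFrame(
    {
        "charID": ["u0", "u2", None],
        "movieID": ["m0", "m0", None],
        "charName": ["BIANCA", "CAMERON", None],
        "text": ["They do not!", "They do to!", None],
    },
    index=pd.Index(["L1045", "L1044", "L0"], name="lineID"),
)

# Count missing values per column, then keep only rows that have text.
missing_per_column = toy_lines.isna().sum()
clean_lines = toy_lines.dropna(subset=["text"])

print(missing_per_column)
print(len(clean_lines))
```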


In [7]:
lines.head()

lineID,charID,movieID,charName,text
L1045,u0,m0,BIANCA,They do not!
L1044,u2,m0,CAMERON,They do to!
L985,u0,m0,BIANCA,I hope so.
L984,u2,m0,CAMERON,She okay?
L925,u0,m0,BIANCA,Let's go.


In [8]:
characters = pd.read_csv(
    "/Users/nishithranjanbiswas/Desktop/NLP/Cornell-Movie-Dialog-Analysis-NLP-Course-Project-/data/movie_characters_metadata.tsv", 
    sep='\t', 
    header = None,
    on_bad_lines= "skip",
    # 'score' holds the character's position in the movie credits ('?' when unlabeled)
    names = ['charID','charName','movieID','movieName','gender','score'],
    index_col=['charID']
)

In [9]:
characters.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9034 entries, u0 to u9034
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   charName   9015 non-null   object
 1   movieID    9017 non-null   object
 2   movieName  9017 non-null   object
 3   gender     9017 non-null   object
 4   score      9017 non-null   object
dtypes: object(5)
memory usage: 423.5+ KB


In [10]:
characters.head()

charID,charName,movieID,movieName,gender,score
u0,BIANCA,m0,10 things i hate about you,f,4
u1,BRUCE,m0,10 things i hate about you,?,?
u2,CAMERON,m0,10 things i hate about you,m,3
u3,CHASTITY,m0,10 things i hate about you,?,?
u4,JOEY,m0,10 things i hate about you,m,6
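
The preview shows `?` used as a placeholder for unknown `gender` and `score` values, which is why both columns are loaded as `object`. The following sketch shows one way to turn those placeholders into proper missing values, using three rows copied from the preview. It assumes the values are read as strings; `toy_characters` and `cleaned` are illustrative names.

```python
import pandas as pd

# Three rows copied from the characters preview.
toy_characters = pd.DataFrame(
    {
        "charName": ["BIANCA", "BRUCE", "CAMERON"],
        "gender": ["f", "?", "m"],
        "score": ["4", "?", "3"],
    },
    index=pd.Index(["u0", "u1", "u2"], name="charID"),
)

cleaned = toy_characters.copy()
# Keep gender only where it is known; '?' becomes a missing value.
cleaned["gender"] = cleaned["gender"].where(cleaned["gender"] != "?")
# Convert score to numbers; anything non-numeric (such as '?') becomes NaN.
cleaned["score"] = pd.to_numeric(cleaned["score"], errors="coerce")

print(cleaned)
```

After this step, `cleaned['gender'].notna().sum()` gives the number of characters with a known gender, which can be compared with the 3,774 figure quoted in the dataset description.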


In [11]:
titles = pd.read_csv(
    "/Users/nishithranjanbiswas/Desktop/NLP/Cornell-Movie-Dialog-Analysis-NLP-Course-Project-/data/movie_titles_metadata.tsv",
    sep='\t',
    header=None,
    on_bad_lines="skip",
    names=['movieID', 'title', 'year', 'ratingIMDB', 'votes', 'genresIMDB'],
    index_col=['movieID']
)

In [12]:
titles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 617 entries, m0 to m616
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       616 non-null    object 
 1   year        616 non-null    object 
 2   ratingIMDB  616 non-null    float64
 3   votes       616 non-null    float64
 4   genresIMDB  616 non-null    object 
dtypes: float64(2), object(3)
memory usage: 28.9+ KB


In [13]:
titles.head()

movieID,title,year,ratingIMDB,votes,genresIMDB
m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
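
Like the `conversation` column, `genresIMDB` is stored as a string such as `['comedy' 'romance']`. A minimal sketch of parsing it and counting genre frequencies follows, using the five rows shown above and assuming each genre is wrapped in single quotes. `parse_genres` is a hypothetical helper.

```python
import re
from collections import Counter


def parse_genres(raw):
    """Return the quoted genre names found in a string like "['comedy' 'romance']"."""
    return re.findall(r"'([^']*)'", raw)


# The genresIMDB values of the five movies in the preview.
raw_genres = [
    "['comedy' 'romance']",
    "['adventure' 'biography' 'drama' 'history']",
    "['action' 'crime' 'drama' 'thriller']",
    "['adventure' 'mystery' 'sci-fi']",
    "['action' 'comedy' 'crime' 'drama' 'thriller']",
]

genre_counts = Counter(g for raw in raw_genres for g in parse_genres(raw))
print(genre_counts["drama"])  # 3
```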


In [14]:
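# Merge the datasets.
# Note: joining on 'movieID' alone pairs every conversation with every line
# of the same movie (a many-to-many join), which is why the result grows to
# roughly 45 million rows.
merged_df = pd.merge(conversations, lines, on='movieID')
merged_df = pd.merge(merged_df, titles, on='movieID')
#merged_df = pd.merge(merged_df, characters, on='movieID')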

In [15]:
merged_df.describe()

,ratingIMDB,votes
count,45023770.0,45023770.0
mean,7.0018,49362.16
std,1.152501,58872.15
min,2.5,9.0
25%,6.4,11969.0
50%,7.2,27791.0
75%,7.9,68749.0
max,9.3,419312.0


In [16]:
merged_df.head()

,charID_1,charID_2,movieID,conversation,charID,charName,text,title,year,ratingIMDB,votes,genresIMDB
0,u0,u2,m0,['L194' 'L195' 'L196' 'L197'],u0,BIANCA,They do not!,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,u0,u2,m0,['L194' 'L195' 'L196' 'L197'],u2,CAMERON,They do to!,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
2,u0,u2,m0,['L194' 'L195' 'L196' 'L197'],u0,BIANCA,I hope so.,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
3,u0,u2,m0,['L194' 'L195' 'L196' 'L197'],u2,CAMERON,She okay?,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
4,u0,u2,m0,['L194' 'L195' 'L196' 'L197'],u0,BIANCA,Let's go.,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
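
In the preview above, the conversation `['L194' 'L195' 'L196' 'L197']` is paired with lines such as "They do not!" (line `L1045`), which are not part of that conversation. This happens because a merge on `movieID` matches each conversation with every line of the same movie. A possible alternative, sketched here under the assumption that each conversation should be joined only to its own lines, is to expand the conversation into one row per line ID and join on `lineID`. The two conversations are copied from the preview; the utterance texts are placeholders because the real text of these lines is not shown in this document.

```python
import re
import pandas as pd

# Two conversations copied from the conversations preview.
toy_conversations = pd.DataFrame({
    "charID_1": ["u0", "u0"],
    "charID_2": ["u2", "u2"],
    "movieID": ["m0", "m0"],
    "conversation": ["['L198' 'L199']", "['L207' 'L208']"],
})

# Placeholder texts: stand-ins for the real utterances.
toy_lines = pd.DataFrame({
    "lineID": ["L198", "L199", "L207", "L208"],
    "movieID": ["m0"] * 4,
    "text": ["<L198 text>", "<L199 text>", "<L207 text>", "<L208 text>"],
})

# Join on movieID only: every conversation meets every line of the movie.
naive = pd.merge(toy_conversations, toy_lines, on="movieID")

# One row per (conversation, lineID), then join each row to its own line.
exploded = toy_conversations.assign(
    lineID=toy_conversations["conversation"].apply(lambda s: re.findall(r"L\d+", s))
).explode("lineID")
joined = exploded.merge(toy_lines[["lineID", "text"]], on="lineID", how="left")

print(len(naive), len(joined))  # 8 4
```

In the notebook, `lines` is indexed by `lineID`, so `lines.reset_index()` would be needed before a join of this kind.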


In [17]:
from collections import Counter

In [19]:
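# Caution: iterating over a DataFrame yields its column labels, so this counts column names, not dialogue words.
word_counts = Counter(word for sentence in merged_df for word in sentence.split())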

In [20]:
vocabulary = list(word_counts.keys())

In [21]:
print("Vocabulary:", vocabulary)
print("Vocabulary Size:", len(vocabulary))

Vocabulary: ['charID_1', 'charID_2', 'movieID', 'conversation', 'charID', 'charName', 'text', 'title', 'year', 'ratingIMDB', 'votes', 'genresIMDB']
Vocabulary Size: 12
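
The 12 entries above are the column names of `merged_df`, not words from the dialogues. A minimal sketch of counting words in the utterances themselves is given below, using the five lines from the `lines` preview plus one `None` to stand in for a missing text. Lowercasing and whitespace splitting are simplifying assumptions made here, so punctuation stays attached to words.

```python
from collections import Counter

# The five utterances from the lines preview, plus one missing value.
texts = ["They do not!", "They do to!", "I hope so.", "She okay?", "Let's go.", None]

# Skip missing entries, lowercase, and split on whitespace.
word_counts = Counter(
    word
    for sentence in texts
    if isinstance(sentence, str)
    for word in sentence.lower().split()
)
vocabulary = list(word_counts.keys())

print("Vocabulary Size:", len(vocabulary))  # Vocabulary Size: 11
```

On the real data, the same expression could iterate over `lines['text']` rather than `merged_df`, since the merged frame repeats each utterance many times.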
