<a href="https://colab.research.google.com/github/AngelaRemolina/NLP-Film-Dialog-Generation/blob/main/Metadata-Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load data

In [293]:
from google.colab import drive
drive.mount('/content/drive')

import os

path = 'Colab Notebooks/NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


'/content/drive/MyDrive/Colab Notebooks/NLP/Project'

In [294]:
dataset = 'cornell-movie-dialogs-small'

These are the files in the dataset. Let's see what is inside them.

In [295]:
with open(f'{dataset}/movie_lines.txt', encoding='utf-8') as f:
    lines = f.readlines()

with open(f'{dataset}/movie_conversations.txt', encoding='utf-8') as f:
    conversations = f.readlines()

with open(f'{dataset}/movie_titles_metadata.txt', encoding='latin-1') as f:
    titles = f.readlines()

with open(f'{dataset}/movie_characters_metadata.txt', encoding='latin-1') as f:
    characters = f.readlines()


The lines and conversations files are connected by line ID: each conversation lists the IDs of the lines it contains.

> According to the dataset documentation, these files contain
* 220,579 conversational exchanges between 10,292 pairs of movie characters
* involves 9,035 characters from 617 movies
* in total 304,713 utterances

Also on the titles file we can see

> Movie metadata included:
* genres
* release year
* IMDB rating
* number of IMDB votes

And this information for the characters file

> Character metadata included:
* gender (for 3,774 characters)
* position on movie credits (3,321 characters)


For easier visualization, we'll load the files into pandas dataframes, starting with the lines.


Each column of the dataset is separated by the string `+++$+++`, so we'll pass that as the separator to the split function. We'll also strip the trailing `\n` from each line.
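As a small illustration of that split, here is a sketch on a single raw line. The line is rebuilt from the first row of the dataframe output in this notebook, and the `SEPARATOR` name is only an assumption for the example.

```python
# One raw line in the movie_lines.txt layout, rebuilt from the first row shown in this notebook
raw_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"

# Hypothetical constant name; the separator includes the surrounding spaces
SEPARATOR = " +++$+++ "

# Strip the trailing newline, then split on the literal separator
fields = raw_line.rstrip("\n").split(SEPARATOR)
print(fields)  # ['L1045', 'u0', 'm0', 'BIANCA', 'They do not!']
```

Plain `str.split` takes the separator literally. pandas' `Series.str.split` may interpret a multi-character pattern as a regular expression, which is why `+` and `$` are escaped in the dataframe version.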

In [297]:
import re

In [298]:
def clean_text(text): #adapted from https://github.com/REDFOX1899/Chatbot/blob/master/Chatbot.py
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    # specific contractions first: the generic "n't" rule would otherwise turn "won't" into "wo not"
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)

    # from some common specific mistakes found in this dataset
    text = re.sub(r"youíre", "you are", text)
    text = re.sub(r"óó", "", text)

    return text
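When substitutions like these are chained, rule order affects the result: a general pattern such as `n't` applied before `won't` leaves nothing for the specific rule to match. The sketch below shows this on a made-up sentence. Both helper functions are hypothetical and exist only for this comparison.

```python
import re

def expand_generic_first(text):
    # The generic rule runs first, so "won't" has already become "wo not"
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"won't", "will not", text)  # no longer matches anything
    text = re.sub(r"can't", "cannot", text)    # no longer matches anything
    return text

def expand_specific_first(text):
    # Specific contractions are handled before the generic "n't" rule
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    return text

sample = "i won't go and i can't stay"
print(expand_generic_first(sample))   # i wo not go and i ca not stay
print(expand_specific_first(sample))  # i will not go and i cannot stay
```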

In [299]:
import pandas as pd

In [300]:
# create dataframe with lines
df_lines = pd.DataFrame({'line_text': lines})

# split into columns
df_lines = df_lines['line_text'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_lines.columns = ['line_ID', 'speaker_ID', 'movie','speaker','text']

# delete new line character and other blank character
df_lines = df_lines.applymap(lambda x: x.rstrip() if isinstance(x, str) else x)
df_lines['text'] = df_lines['text'].str.rstrip('\n')

# Add column clean text
df_lines['clean_text'] = df_lines['text'].apply(lambda x: clean_text(x))

#might be needed?
# # add column speaker + text
# df_lines['line'] = df_lines['speaker'] + ": " + df_lines['clean_text']

df_lines.head(5)

Unnamed: 0,line_ID,speaker_ID,movie,speaker,text,clean_text
0,L1045,u0,m0,BIANCA,They do not!,they do not
1,L1044,u2,m0,CAMERON,They do to!,they do to
2,L985,u0,m0,BIANCA,I hope so.,i hope so
3,L984,u2,m0,CAMERON,She okay?,she okay
4,L925,u0,m0,BIANCA,Let's go.,let's go


Now let's do the same thing with all the other txt files

In [301]:
import ast

In [302]:
# create dataframe with conversations
df_conv = pd.DataFrame({'conv': conversations})

# split into columns
df_conv = df_conv['conv'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_conv.columns = ['speaker1_ID', 'speaker2_ID', 'movie_ID','lines_list']

# delete new line character
df_conv['lines_list'] = df_conv['lines_list'].str.rstrip('\n')

# set lines_list to list type
df_conv['lines_list'] = df_conv['lines_list'].apply(ast.literal_eval)

df_conv.head(5)

Unnamed: 0,speaker1_ID,speaker2_ID,movie_ID,lines_list
0,u0,u2,m0,"[L194, L195, L196, L197]"
1,u0,u2,m0,"[L198, L199]"
2,u0,u2,m0,"[L200, L201, L202, L203]"
3,u0,u2,m0,"[L204, L205, L206]"
4,u0,u2,m0,"[L207, L208]"


In [303]:
# create dataframe with movie titles
df_title = pd.DataFrame({'title': titles})

# split into columns
df_title = df_title['title'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_title.columns = ['movie_ID','title','year','IMBD_rating','IMBD_votes','genres']

# cast types to what they are
df_title['IMBD_rating'] = df_title['IMBD_rating'].astype(float)
df_title['IMBD_votes'] = df_title['IMBD_votes'].astype(int)
# Clean 'year' column using regex (for cases like ' 1989/I ')
df_title['year'] = df_title['year'].apply(lambda x: re.sub(r'\D', '', x))  # \D matches any non-digit character
df_title['year'] = df_title['year'].astype(int)

# delete new line character
df_title['genres'] = df_title['genres'].str.rstrip('\n')

# set genres_list to list type
df_title['genres'] = df_title['genres'].apply(ast.literal_eval)
df_title.head(5)

Unnamed: 0,movie_ID,title,year,IMBD_rating,IMBD_votes,genres
0,m0,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
1,m1,1492: conquest of paradise,1992,6.2,10421,"[adventure, biography, drama, history]"
2,m2,15 minutes,2001,6.1,25854,"[action, crime, drama, thriller]"
3,m3,2001: a space odyssey,1968,8.4,163227,"[adventure, mystery, sci-fi]"
4,m4,48 hrs.,1982,6.9,22289,"[action, comedy, crime, drama, thriller]"


In [304]:
# create dataframe with characters
df_chars = pd.DataFrame({'characters': characters})

# split into columns
df_chars = df_chars['characters'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_chars.columns = ['Character_ID','name','movie_ID','movie_title','gender','credits_pos']

# delete new line character
df_chars['credits_pos'] = df_chars['credits_pos'].str.rstrip('\n')

# cast credits_pos to int (-1 if unknown)
df_chars.loc[df_chars['credits_pos'] == '?', 'credits_pos'] = -1
df_chars['credits_pos'] = df_chars['credits_pos'].astype(int)  # max(df_chars['credits_pos']) = 1000

df_chars.head(5)

Unnamed: 0,Character_ID,name,movie_ID,movie_title,gender,credits_pos
0,u0,BIANCA,m0,10 things i hate about you,f,4
1,u1,BRUCE,m0,10 things i hate about you,?,-1
2,u2,CAMERON,m0,10 things i hate about you,m,3
3,u3,CHASTITY,m0,10 things i hate about you,?,-1
4,u4,JOEY,m0,10 things i hate about you,m,6


To make the conversations and lines easier to handle, we'll join them into a single dataframe.

In [305]:
# Create a guide index
df_conv['index'] = df_conv.index
# Expand list into all sub item lines
expanded_lines = df_conv.explode('lines_list')
# Reset index of expanded df
expanded_lines.reset_index(drop=True, inplace=True)
# merge the line with line ids
merged_df = pd.merge(
    expanded_lines,
    df_lines,
    left_on='lines_list',
    right_on='line_ID',
    how='inner'
)

# Select relevant columns and group by the guide index
merged_df = merged_df[['speaker1_ID', 'speaker2_ID', 'movie_ID', 'lines_list', 'index', 'clean_text']]
merged_df = merged_df.groupby(['index','speaker1_ID', 'speaker2_ID', 'movie_ID'])['clean_text'].apply(list).reset_index()
merged_df = merged_df.rename(columns={'clean_text': 'dialog'})

# convert dialog to string, not list
merged_df['dialog'] = merged_df['dialog'].apply(lambda x: ';'.join(x))
dialog_df = merged_df.drop(['index'], axis=1)

dialog_df

Unnamed: 0,speaker1_ID,speaker2_ID,movie_ID,dialog
0,u0,u2,m0,can we make this quick roxanne korrine and an...
1,u0,u2,m0,you are asking me out that is so cute what is...
2,u0,u2,m0,no no it's my fault we did not have a proper ...
3,u0,u2,m0,why;unsolved mystery she used to be really po...
4,u0,u2,m0,gosh if only we could find kat a boyfriend;let...
...,...,...,...,...
83092,u9028,u9031,m616,do you think she might be interested in someo...
83093,u9028,u9031,m616,choose your targets men that is right watch th...
83094,u9030,u9034,m616,colonel durnford william vereker i hear you h...
83095,u9030,u9034,m616,your orders mr vereker;i am to take the sikali...


In [306]:
dialog_df['dialog'][501]

'i can tell from the tone of your voice dave that you are upset why do not you take a stress pill and get some rest;hal i am in command of this ship i order you to release the manual hibernation control;i am sorry dave but in accordance with subroutine c15324 quote when the crew are dead or incapacitated the computer must assume control unquote i must therefore override your authority now since you are not in any condition to intel ligently exercise it;hal unless you follow my instructions i shall be forced to disconnect you'

In [307]:
dialog_df['movie_ID'][501]

'm3'

## Metadata prediction task

Predict the metadata associated with a given dialog.

**Input:** Dialog

**Output:**

Movie metadata:

* genres
* release year
* IMDB rating
* number of IMDB votes

Character metadata:

* gender
* position on movie credits
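Since a movie can carry several genres at once, the genre output is a multi-label target. One possible encoding, shown here only as a sketch and not as the approach this notebook commits to, is a binary indicator matrix built with scikit-learn's `MultiLabelBinarizer`. The genre lists are copied from rows of the titles dataframe; the variable names are assumptions.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists taken from three rows of the titles dataframe
genres = [
    ["comedy", "romance"],
    ["action", "crime", "drama", "thriller"],
    ["drama", "romance"],
]

# Each distinct genre becomes one 0/1 column
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

print(list(mlb.classes_))  # ['action', 'comedy', 'crime', 'drama', 'romance', 'thriller']
print(Y.tolist())          # [[0, 1, 0, 0, 1, 0], [1, 0, 1, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
```

Each column of `Y` can then be treated as its own binary classification problem.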

Let's merge the metadata into the dialogs dataset to make things easier.

In [308]:
movie_metadata = pd.merge(
    dialog_df,
    df_title,
    left_on='movie_ID',
    right_on='movie_ID',
    how='inner'
)

Shuffle the dataset

In [309]:
movie_metadata = movie_metadata.sample(frac=1)
movie_metadata

Unnamed: 0,speaker1_ID,speaker2_ID,movie_ID,dialog,title,year,IMBD_rating,IMBD_votes,genres
57017,u6168,u6173,m411,we miss you;thanks for the presents;we love th...,jurassic park,1993,7.9,153737,"[action, adventure, family, sci-fi, action, ad..."
4261,u511,u514,m32,pass them potatoes lincoln;y'all let me know i...,black snake moan,2006,7.1,28509,[drama]
14848,u1594,u1598,m105,can i ask you about melanie;sure;what is your ...,jackie brown,1997,7.6,85496,"[crime, drama, thriller]"
35987,u3819,u3820,m252,so you are talking to me;when i have something...,a walk to remember,2002,7.1,38751,"[drama, romance]"
69169,u7554,u7573,m511,i swear it he wants romeo for ned and the admi...,shakespeare in love,1998,7.4,78654,"[comedy, drama, romance]"
...,...,...,...,...,...,...,...,...,...
32808,u3520,u3523,m232,no i mean it those seals are not telling us d...,the abyss,1989,7.6,51699,"[action, adventure, drama, sci-fi, thriller]"
42560,u4525,u4535,m299,maybe the asian design major slipped her some ...,clerks.,1994,8.0,90972,[comedy]
61245,u6664,u6689,m445,by what name are you known;there are some who ...,monty python and the holy grail,1975,8.4,157683,"[adventure, comedy]"
10600,u1155,u1165,m76,marcus aurelius has died;he left us at dawn,gladiator,2000,8.4,286067,"[action, adventure, drama]"


Vectorize the dialogs with the TF-IDF vectorizer (computed in the 'otherTasks' notebook).

In [318]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english', max_features=1000) # without max_features there are more than 17000 columns
X = vectorizer.fit_transform(movie_metadata['dialog'])

In [319]:
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,able,absolutely,accept,accident,account,act,acting,actually,address,admit,...,wrong,wrote,ya,yeah,year,years,yes,yesterday,york,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.106984,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.435941,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83092,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
83093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
83094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
83095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


Ignore the following; these are just notes for understanding.

# Notes to better understand what is being done

Example of how the vectorizer works

In [317]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example dialogs
dialogues = [
    "Hello, how are you?",
    "I'm good, thank you! How about you?",
    "I'm fine too, thanks for asking."
]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000)

# Transform the dialogs into a TF-IDF feature matrix
X = vectorizer.fit_transform(dialogues)

# Convert to a DataFrame for display
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_df)

      about       are    asking      fine       for      good     hello  \
0  0.000000  0.562829  0.000000  0.000000  0.000000  0.000000  0.562829   
1  0.411973  0.000000  0.000000  0.000000  0.000000  0.411973  0.000000   
2  0.000000  0.000000  0.447214  0.447214  0.447214  0.000000  0.000000   

        how     thank    thanks       too       you  
0  0.428046  0.000000  0.000000  0.000000  0.428046  
1  0.313316  0.411973  0.000000  0.000000  0.626632  
2  0.000000  0.000000  0.447214  0.447214  0.000000  


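Logistic regression is a classification model. scikit-learn's `LogisticRegression` handles both binary and multiclass targets. If it is fit on a numeric variable (year, rating, etc.), each distinct value is treated as a separate class, so a regression model is usually a better fit for those targets.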
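To make the difference concrete, here is a small sketch on made-up numbers (not taken from the dataset) that fits a classifier and a regressor on the same numeric target. `LinearRegression` is used only as an illustrative regressor, not as a recommendation for this task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: one feature and a numeric "year" target that grows linearly with it
X_num = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_num = np.array([1990, 1992, 1994, 1996, 1998, 2000])

clf = LogisticRegression(max_iter=1000).fit(X_num, y_num)  # each year is one class
reg = LinearRegression().fit(X_num, y_num)                 # year is a continuous value

x_new = np.array([[3.5]])
print(clf.predict(x_new))  # always one of the years seen during training
print(reg.predict(x_new))  # can fall between training values (about 1995 here)
```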

In [313]:
from sklearn.linear_model import LogisticRegression
import numpy as np

In [314]:
def generatePredictionsY(x_train, y_train, x_test):
    # train example 1
    # x = df['dialog'] #input
    # y = df['genre'] #output

    # train example 2
    # x = df['line'] #input
    # y = df['gender'] #output

    model = LogisticRegression(max_iter=1000)
    model.fit(X=np.array(x_train), y=y_train)

    predictions = model.predict(X=np.array(x_test))
    return predictions

The idea is to implement this with cross-validation in order to obtain predictions for each test fold.
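One way this could look, as a sketch on synthetic data rather than the notebook's final implementation, uses scikit-learn's `cross_val_predict`, which returns for every sample the prediction made while that sample was in the held-out fold. The data and the names `X_demo`, `y_demo` and `cv` are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic binary problem standing in for the real features and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# 5 folds: each sample is predicted by a model trained on the other 4 folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
preds = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print(preds.shape)               # one prediction per sample
print((preds == y_demo).mean())  # out-of-fold accuracy
```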

In [315]:
# Example: predict year from IMDB rating and number of votes

# train with 10
x_train = movie_metadata.iloc[0:10][['IMBD_rating','IMBD_votes']]
y_train = movie_metadata.iloc[0:10]['year']

# print(x_train)
# print()
# print(y_train)
# print()

# test with 5
x_test = movie_metadata.iloc[10:15][['IMBD_rating','IMBD_votes']]
print(x_test)
print()

p = generatePredictionsY(x_train, y_train, x_test)
print(p)

       IMBD_rating  IMBD_votes
38789          4.5         269
27912          6.4       28682
55300          7.4       80077
67744          7.6        1085
68879          6.1         259

[1993 1993 1993 1993 1993]


In [316]:
movie_metadata.loc[52595]['year']

1995