# Conversations

This data frame lists the `character_ID`s for the two characters in the conversation, the `movie_ID`, and the dialogue through references to `line_ID`s.

## Table of Contents
- [1st section](#Loading-in-the-data-and-basic-summary) is where I import the files and get an idea of the data shape and contents
- [2nd section](#Fixing-the-dialogue-column) is where I fix the dialogue list issue and explode the data frame
- [3rd section](#Exploration) is where I look into some different aspects of the conversations
- [4th section](#Pickling-the-data-and-creating-a-csv) is where the data are saved and exported
- [Conclusion](#Conclusion) summarizes the notebook

In [1]:
# import packages
import numpy as np
import pandas as pd
from ast import literal_eval 

## Loading in the data and basic summary

In [2]:
# raw string so the regex escapes (\s, \+, \$) are not treated as string escapes
conversations_df = pd.read_csv('../data/movie_conversations.txt', sep=r'\s+\+\+\+\$\+\+\+\s?',
                               names=['character1_ID', 'character2_ID', 'movie_ID', 'dialogue'],
                               dtype={'character1_ID': 'string', 'character2_ID': 'string', 'movie_ID': 'string'},
                               engine='python')

The fields in the raw file are separated by ` +++$+++ `, and the file has no header row. The README that accompanies the original data describes each column, so I used it to create the column names.
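
To illustrate how the separator pattern splits a row, here is a minimal sketch using Python's `re` module. The sample line is an assumption, reconstructed from the first row of the data frame shown in this notebook rather than copied from the raw file.

```python
import re

# Same separator pattern that is passed to pd.read_csv (written here as a raw string).
SEPARATOR = r'\s+\+\+\+\$\+\+\+\s?'

# Assumed sample row, reconstructed from the first row of the data frame.
sample_line = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

# Split the row into its four fields.
fields = re.split(SEPARATOR, sample_line)
print(fields)
```

Note that the last field comes out as a single string that merely looks like a list.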

In [3]:
conversations_df.shape

(83097, 4)

In [4]:
conversations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83097 entries, 0 to 83096
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   character1_ID  83097 non-null  string
 1   character2_ID  83097 non-null  string
 2   movie_ID       83097 non-null  string
 3   dialogue       83097 non-null  object
dtypes: object(1), string(3)
memory usage: 2.5+ MB


In [5]:
conversations_df.head()

|   | character1_ID | character2_ID | movie_ID | dialogue |
|---|---|---|---|---|
| 0 | u0 | u2 | m0 | ['L194', 'L195', 'L196', 'L197'] |
| 1 | u0 | u2 | m0 | ['L198', 'L199'] |
| 2 | u0 | u2 | m0 | ['L200', 'L201', 'L202', 'L203'] |
| 3 | u0 | u2 | m0 | ['L204', 'L205', 'L206'] |
| 4 | u0 | u2 | m0 | ['L207', 'L208'] |


This data frame will only make sense once it is merged with the others, which supply the character names, movie names, and utterances (referenced here by line ID in the `dialogue` column).

## Fixing the dialogue column

The `dialogue` column was read in as strings that look like lists (for example, `"['L194', 'L195']"`) rather than as actual lists, which made manipulation very challenging. The following code parses each string into a list; the approach comes from [this](https://stackoverflow.com/questions/32742976/how-to-read-a-column-of-csv-as-dtype-list-using-pandas) Stack Overflow thread.

In [6]:
# literal_eval (imported above) only parses Python literals, so it is safer than eval
conversations_df["dialogue"] = conversations_df["dialogue"].fillna("[]").apply(literal_eval)
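
As a minimal sketch of what this parsing step does to a single cell, the example below uses `ast.literal_eval`, which is imported at the top of the notebook. The sample value is taken from the first row shown earlier; the rejected expression is only an illustration of input that is not a plain literal.

```python
from ast import literal_eval

# How one cell of the dialogue column looks right after read_csv: a string.
raw_value = "['L194', 'L195', 'L196', 'L197']"

# Parse the string into a real Python list.
parsed = literal_eval(raw_value)
print(type(raw_value).__name__, '->', type(parsed).__name__)

# literal_eval refuses anything that is not a plain literal, such as a function call.
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because only literals are accepted, unexpected content in the file raises an error instead of being executed.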

To help with downstream analysis, I am creating a `conversation_ID` column that duplicates the index of the data frame.

In [7]:
conversations_df['conversation_ID'] = conversations_df.index

In [8]:
# rearrange the columns
conversations_df = conversations_df[['conversation_ID', 'character1_ID', 'character2_ID', 'movie_ID', 'dialogue']]

In [9]:
conversations_df.head()

|   | conversation_ID | character1_ID | character2_ID | movie_ID | dialogue |
|---|---|---|---|---|---|
| 0 | 0 | u0 | u2 | m0 | [L194, L195, L196, L197] |
| 1 | 1 | u0 | u2 | m0 | [L198, L199] |
| 2 | 2 | u0 | u2 | m0 | [L200, L201, L202, L203] |
| 3 | 3 | u0 | u2 | m0 | [L204, L205, L206] |
| 4 | 4 | u0 | u2 | m0 | [L207, L208] |


To support one of the downstream analyses, I will keep a version of the data frame as is, with each `dialogue` entry still a list.

In [10]:
# copy so that later changes to conversations_df cannot affect the intact version
conversations_intact_df = conversations_df.copy()

Below, I unnest the `dialogue` column so that I can merge this data frame with `utterances_df` and keep the corresponding conversation IDs (duplicated from the index) to link the lines together in the Analysis notebook.

In [11]:
conversations_df = conversations_df.explode('dialogue')
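
A minimal sketch of what `explode` does, on a toy data frame (the two rows are made up for illustration and only mimic the structure of the real data). Each list element gets its own row, the other columns and the index value are repeated, and the original frame is left unchanged.

```python
import pandas as pd

# Toy stand-in for the real data frame: one list of line IDs per conversation.
toy_df = pd.DataFrame({
    'conversation_ID': [0, 1],
    'dialogue': [['L194', 'L195', 'L196'], ['L198', 'L199']],
})

# explode returns a new frame with one row per list element.
toy_exploded = toy_df.explode('dialogue')
print(toy_exploded)

# The original frame still holds the lists.
print(toy_df)
```

The repeated index values are why a separate `conversation_ID` column is convenient: it survives a later `reset_index` or merge.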

In [12]:
conversations_df.head()

|   | conversation_ID | character1_ID | character2_ID | movie_ID | dialogue |
|---|---|---|---|---|---|
| 0 | 0 | u0 | u2 | m0 | L194 |
| 0 | 0 | u0 | u2 | m0 | L195 |
| 0 | 0 | u0 | u2 | m0 | L196 |
| 0 | 0 | u0 | u2 | m0 | L197 |
| 1 | 1 | u0 | u2 | m0 | L198 |
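
With one line ID per row, the frame can be joined to the utterances. The sketch below shows one way that merge might look. The contents of `utterances_df` are not shown in this notebook, so the `toy_utterances` frame, its `line_ID` and `text` column names, and the sample text are all hypothetical.

```python
import pandas as pd

# Toy version of the exploded conversations frame.
exploded = pd.DataFrame({
    'conversation_ID': [0, 0, 1],
    'dialogue': ['L194', 'L195', 'L198'],
})

# Hypothetical stand-in for utterances_df; column names and text are invented.
toy_utterances = pd.DataFrame({
    'line_ID': ['L194', 'L195', 'L198'],
    'text': ['first line', 'second line', 'third line'],
})

# Left merge keeps every conversation row and attaches the matching utterance.
merged = exploded.merge(toy_utterances, left_on='dialogue', right_on='line_ID', how='left')
print(merged[['conversation_ID', 'line_ID', 'text']])
```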


## Exploration

In [13]:
# on average how many lines/turns per conversation?
conversations_df.groupby('conversation_ID')['dialogue'].count().mean()

3.6669554857576085

In [14]:
# what's the shortest conversation?
conversations_df.groupby('conversation_ID')['dialogue'].count().min()

2

In [15]:
# and the longest?
conversations_df.groupby('conversation_ID')['dialogue'].count().max()

89

That is a very long conversation! Let's see what some of the other long conversations are.

In [16]:
conversations_df.groupby('conversation_ID')['dialogue'].count().sort_values(ascending=False).head(20)

conversation_ID
42477    89
73134    59
70355    56
45571    55
11348    54
19670    53
42520    52
35650    52
32342    49
45572    45
19584    45
33010    44
33022    44
21215    43
40734    43
42599    42
19675    41
10803    41
29302    40
29397    40
Name: dialogue, dtype: int64
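
The summaries in this section all start from the same per-conversation count, so it can be computed once and reused. The sketch below does this on a toy frame (made-up rows) and also checks that the counts match the list lengths in the unexploded version.

```python
import pandas as pd

# Toy intact frame: one list of line IDs per conversation.
toy_intact = pd.DataFrame({
    'conversation_ID': [0, 1, 2],
    'dialogue': [['L1', 'L2'], ['L3', 'L4', 'L5', 'L6'], ['L7', 'L8', 'L9']],
})
toy_exploded = toy_intact.explode('dialogue')

# Compute the per-conversation line counts once and reuse them.
turns = toy_exploded.groupby('conversation_ID')['dialogue'].count()
print(turns.mean(), turns.min(), turns.max())
print(turns.sort_values(ascending=False).head(2))

# The same counts are available from the intact frame as the length of each list.
list_lengths = toy_intact.set_index('conversation_ID')['dialogue'].map(len)
counts_match = bool((turns.values == list_lengths.values).all())
print(counts_match)
```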

I will come back to some of these in the [discourse analysis notebook](../analysis_notebooks/Discourse_Analysis.ipynb) to see what I can uncover.

## Pickling the data and creating a csv

In [17]:
import pickle

In [18]:
# pickle the data to use in other notebooks for further analysis
# (the with-blocks close each file automatically)
with open('conversations_df.pkl', 'wb') as f1:
    pickle.dump(conversations_df, f1, -1)

with open('conversations_intact_df.pkl', 'wb') as f2:
    pickle.dump(conversations_intact_df, f2, -1)

In [19]:
conversations_df.to_csv('../new_data/conversations_df.csv', header=True)
conversations_intact_df.to_csv('../new_data/conversations_intact_df.csv', header=True)
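
One difference between the two export formats is worth keeping in mind when loading the intact version elsewhere: a pickle preserves the lists, while a CSV stores them as text. The sketch below demonstrates this in memory on a toy frame (made-up rows, no files written); the `converters` argument is one possible way to parse the column again on load.

```python
import io
import pickle
from ast import literal_eval

import pandas as pd

toy_intact = pd.DataFrame({
    'conversation_ID': [0, 1],
    'dialogue': [['L194', 'L195'], ['L198', 'L199']],
})

# Pickle round trip: the lists come back as lists.
from_pickle = pickle.loads(pickle.dumps(toy_intact, -1))
print(type(from_pickle['dialogue'].iloc[0]).__name__)

# CSV round trip: the lists come back as strings.
buffer = io.StringIO()
toy_intact.to_csv(buffer, header=True)
buffer.seek(0)
from_csv = pd.read_csv(buffer, index_col=0)
print(type(from_csv['dialogue'].iloc[0]).__name__)

# Parsing the column while reading restores the lists.
buffer.seek(0)
from_csv_parsed = pd.read_csv(buffer, index_col=0, converters={'dialogue': literal_eval})
print(type(from_csv_parsed['dialogue'].iloc[0]).__name__)
```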

## Conclusion

Of the data files, this one was the cleanest. The only issue was that the `dialogue` column read in as strings rather than lists. I converted the strings to lists, duplicated the index to create a `conversation_ID`, and unnested/exploded the lists. I will use `conversation_ID` in some downstream analysis and synthesis.