# Converting the Friends dataset into ConvoKit format

This notebook describes how we converted the Friends dataset (https://github.com/emorynlp/character-mining) into a Corpus with ConvoKit.

In [1]:
!pip3 install convokit
# !python3 -m spacy download en


In [0]:
import requests
import json
from tqdm import tqdm
from convokit import Corpus, User, Utterance

## The Friends Dataset

The original dataset (https://github.com/emorynlp/character-mining) contains a set of 10 JSON files, each of which represents a complete transcript of 1 season of <i>Friends</i>. Since the data are available in JSON format from this GitHub repo, we download the raw data directly using the `requests` module. You will not need to download raw data files to use this script.
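As a minimal sketch of how the ten per-season file URLs can be built, the helper below zero-pads the season number. The name `season_url` is hypothetical (it does not appear in the notebook); the base URL is the one used in the cells that follow.

```python
# Base URL taken from the notebook's download cells; {:02d} zero-pads the season.
BASE_URL = ('https://raw.githubusercontent.com/emorynlp/character-mining/'
            'master/json/friends_season_{:02d}.json')

def season_url(season):
    """Return the raw JSON URL for a season number between 1 and 10."""
    if not 1 <= season <= 10:
        raise ValueError('season must be between 1 and 10')
    return BASE_URL.format(season)

urls = [season_url(i) for i in range(1, 11)]
print(urls[0])
```

Each URL can then be passed to `requests.get` and parsed with `json.loads`, as the notebook does.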

## Gather information about the corpus
The **corpus.json** file records the number of episodes, scenes, utterances, and speakers.
When counting utterances, we ignore those that have no speaker.
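The counting rule can be sketched on toy data as follows. The function name `count_corpus` and the toy season are invented for illustration; only the key names (`episodes`, `scenes`, `utterances`, `speakers`) come from the dataset as used in this notebook.

```python
def count_corpus(seasons):
    """Count episodes, scenes, speaker-attributed utterances, and unique speakers."""
    counts = {'num_episodes': 0, 'num_scenes': 0, 'num_utterances': 0}
    speakers = set()
    for season in seasons:
        for episode in season['episodes']:
            counts['num_episodes'] += 1
            for scene in episode['scenes']:
                counts['num_scenes'] += 1
                for utterance in scene['utterances']:
                    if utterance['speakers']:  # skip lines with no speaker
                        counts['num_utterances'] += 1
                        speakers.update(utterance['speakers'])
    counts['num_speakers'] = len(speakers)
    return counts

# Toy data (invented), shaped like one season file
toy = [{'episodes': [{'scenes': [{'utterances': [
    {'speakers': ['Monica Geller']},
    {'speakers': []},
    {'speakers': ['Joey Tribbiani', 'Chandler Bing']},
]}]}]}]
result = count_corpus(toy)
print(result)
```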

In [0]:
num_episodes = 0
num_scenes = 0
num_utterances = 0
speakers = set()
for i in range(1,11):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  num_episodes += len(episodes)
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    num_scenes += len(scenes)
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      for l in range(len(utterances)):
        utterance = utterances[l]
        speaker = utterance['speakers']
        speakers.update(speaker)
        num_utterances += 1 if len(speaker) != 0 else 0
corpus = {'friends': 'friends corpus', 'num_episodes': num_episodes, 'num_scenes': num_scenes, 'num_utterances': num_utterances, 'num_speakers': len(speakers)}

In [4]:
print(corpus)

{'friends': 'friends corpus', 'num_episodes': 236, 'num_scenes': 3107, 'num_utterances': 61338, 'num_speakers': 700}


## Generating user information
Since our dataset doesn't have any existing user information, we extract speaker information from the conversations. For each user, we record the episode in which they first appear and guess their gender from their first name using the `gender_guesser` module.

Users are indexed by their name, which is a `<str>`. For each user, we create an object with:

- <b>first_appearance:</b> the episode in which he or she first appeared
- <b>gender:</b> the character's gender, as defined by the `gender_guesser` module's guess of his/her name
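A minimal sketch of this bookkeeping, assuming episodes are visited in broadcast order so that the first time a speaker is seen is their first appearance. The function `collect_users` and the stand-in `guess` function are hypothetical; in the notebook the guess comes from `gender_guesser`.

```python
def collect_users(episodes, guess_gender):
    """Map each speaker name to their first episode and a guessed gender."""
    users = {}
    for episode in episodes:
        for scene in episode['scenes']:
            for utterance in scene['utterances']:
                for speaker in utterance['speakers']:
                    if speaker not in users:  # only the first sighting is recorded
                        users[speaker] = {
                            'first_appearance': episode['episode_id'],
                            'gender': guess_gender(speaker.split()[0]),
                        }
    return users

# Stand-in guesser (invented lookup table, not the gender_guesser module)
toy_genders = {'Monica': 'female', 'Ross': 'male'}

def guess(first_name):
    return toy_genders.get(first_name, 'unknown')

toy_episodes = [
    {'episode_id': 's01_e01', 'scenes': [{'utterances': [
        {'speakers': ['Monica Geller']}]}]},
    {'episode_id': 's01_e02', 'scenes': [{'utterances': [
        {'speakers': ['Monica Geller', 'Ross Geller']}]}]},
]
toy_users = collect_users(toy_episodes, guess)
print(toy_users)
```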

In [5]:
! pip3 install gender_guesser
import gender_guesser.detector as gender
d = gender.Detector()



In [6]:
users = {}
for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      for l in range(len(utterances)):
        utterance = utterances[l]
        speaker_list = utterance['speakers']
        for speaker in speaker_list:
          if speaker not in users:
            users[speaker] = {'first_appearance': episode['episode_id'], 'gender': d.get_gender(speaker.split()[0])}

100%|██████████| 10/10 [00:03<00:00,  3.72it/s]


Sanity-checking the user data, we should see the correct genders assigned to the 6 friends:

In [7]:
print("number of users in the data = {}/700".format(len(users)))
print("Monica Geller object: ", users["Monica Geller"])
print("Joey Tribbiani object: ", users["Joey Tribbiani"])
print("Chandler Bing object: ", users["Chandler Bing"])
print("Phoebe Buffay object: ", users["Phoebe Buffay"])
print("Ross Geller object: ", users["Ross Geller"])
print("Rachel Green object: ", users["Rachel Green"])

number of users in the data = 700/700
Monica Geller object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Joey Tribbiani object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Chandler Bing object:  {'first_appearance': 's01_e01', 'gender': 'mostly_male'}
Phoebe Buffay object:  {'first_appearance': 's01_e01', 'gender': 'female'}
Ross Geller object:  {'first_appearance': 's01_e01', 'gender': 'male'}
Rachel Green object:  {'first_appearance': 's01_e01', 'gender': 'female'}


We then create a User object for each unique character in the dataset.

In [0]:
corpus_users = {k: User(name=k, meta=v) for k,v in users.items()}

In [9]:
print(corpus_users['Monica Geller'].name)
print(corpus_users['Monica Geller'].meta)

Monica Geller
{'first_appearance': 's01_e01', 'gender': 'female'}


## Generating Utterances

We then loop through the data to generate a list of all utterances in the series. To align with the Utterance schema ConvoKit expects, we construct for each utterance:

- **id:** index of the utterance

- **user:** the user who authored the utterance; the speaker in our case

- **root:** id of the conversation root of the utterance; the first utterance in the scene, in our case

- **reply_to:** id of the utterance to which this utterance replies; None if the utterance is not a reply.

- **timestamp:** time of the utterance (None for us -- the dataset does not contain this information)

- **text:** textual content of the utterance

We also attach the following metadata:
- **tokens:** a tokenized representation of the text (handy for sentence separation)
- **character_entities:** intended to identify who the user is speaking to and/or about. Available for some but not all utterances; `None` if unavailable.
- **emotion:** emotion labels for each token. Available for some but not all utterances; `None` if unavailable.
- **caption:** the begin time, end time, and text sans punctuation. Only available for seasons 6-9; `None` if unavailable.
- **transcript_with_note:** a version of the text with an action note (e.g. "(to Ross) Hand me the coffee" vs. "Hand me the coffee"). Available for some but not all utterances; `None` if unavailable.
- **tokens_with_note:** a tokenized representation of the above.
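The `root` and `reply_to` fields follow a simple chaining rule, sketched here with plain tuples instead of ConvoKit objects. The helper `link_scene` and the toy scene are invented for illustration; the rule itself (root is the first utterance of the scene, and a speaker-less line resets the reply chain) mirrors the loop in the next cell.

```python
def link_scene(utterances):
    """Return {utterance_id: (root, reply_to)} for one scene."""
    links = {}
    if not utterances:
        return links
    root = utterances[0]['utterance_id']  # first utterance in the scene
    prev_utt = None
    for utterance in utterances:
        if not utterance['speakers']:
            prev_utt = None  # a speaker-less line breaks the reply chain
            continue
        links[utterance['utterance_id']] = (root, prev_utt)
        prev_utt = utterance['utterance_id']
    return links

# Toy scene (invented ids and speakers)
toy_scene = [
    {'utterance_id': 'u001', 'speakers': ['Monica Geller']},
    {'utterance_id': 'u002', 'speakers': []},
    {'utterance_id': 'u003', 'speakers': ['Joey Tribbiani']},
    {'utterance_id': 'u004', 'speakers': ['Ross Geller']},
]
links = link_scene(toy_scene)
print(links)
```

In this toy scene, `u003` has no `reply_to` because the speaker-less `u002` precedes it, while `u004` replies to `u003`.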

In [10]:
all_utterances = {}



for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      
      root = utterances[0] #set the root as the first utterance in the scene for now
      
      prev_utt = None

      for l in range(len(utterances)):
        utterance = utterances[l]
        
        speaker = utterance['speakers']
        
        if len(speaker) == 0:
          prev_utt = None
          continue
        
        # Add meta       
        meta = {
            'tokens': utterance.get('tokens'),
            'character_entities': utterance.get('character_entities'),
            'emotion': utterance.get('emotion'),
            'caption': utterance.get('caption'),
            'transcript_with_note': utterance.get('transcript_with_note'),
            'tokens_with_note': utterance.get('tokens_with_note')
        }
        
        # Create the Utterance, including meta
        all_utterances[utterance['utterance_id']] = Utterance(
            id=utterance['utterance_id'],
            user=corpus_users[speaker[0]],
            root=root['utterance_id'],
            reply_to=prev_utt,
            timestamp=None,
            text=utterance['transcript'],
            meta=meta
        )
        
        # Get the prev_utt for the next iteration
        prev_utt = utterance['utterance_id']


100%|██████████| 10/10 [00:04<00:00,  2.04it/s]


In [11]:
print("This corpus has {}/61309 utterances".format(len(all_utterances)))

This corpus has 61338/61309 utterances


In [0]:
#all_utterances['s01_e18_c05_u021']

## Creating the corpus from a list of utterances

We now create the corpus from our dict of utterances. Note that we allow ConvoKit to create conversation IDs automatically after loading the utterance list.

In [0]:
utterance_list = [utt for k, utt in all_utterances.items()]

In [0]:
friends_corpus = Corpus(utterances=utterance_list, version=1)

Sanity checks for the number of conversations in the dataset and the first 5 conversations:

In [15]:
print("number of conversations in the dataset={}".format(len(friends_corpus.get_conversation_ids())))

number of conversations in the dataset=3099


In [18]:
convo_ids = friends_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(friends_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['s01_e01_c01_u001', 's01_e01_c01_u002', 's01_e01_c01_u003', 's01_e01_c01_u004', 's01_e01_c01_u006', 's01_e01_c01_u007', 's01_e01_c01_u008', 's01_e01_c01_u010', 's01_e01_c01_u011', 's01_e01_c01_u012', 's01_e01_c01_u013', 's01_e01_c01_u014', 's01_e01_c01_u015', 's01_e01_c01_u016', 's01_e01_c01_u017', 's01_e01_c01_u018', 's01_e01_c01_u019', 's01_e01_c01_u021', 's01_e01_c01_u022', 's01_e01_c01_u023', 's01_e01_c01_u024', 's01_e01_c01_u025', 's01_e01_c01_u026', 's01_e01_c01_u027', 's01_e01_c01_u028', 's01_e01_c01_u029', 's01_e01_c01_u030', 's01_e01_c01_u031', 's01_e01_c01_u032', 's01_e01_c01_u033', 's01_e01_c01_u034', 's01_e01_c01_u035', 's01_e01_c01_u036', 's01_e01_c01_u037', 's01_e01_c01_u038', 's01_e01_c01_u039', 's01_e01_c01_u040', 's01_e01_c01_u041', 's01_e01_c01_u042', 's01_e01_c01_u044', 's01_e01_c01_u045', 's01_e01_c01_u047', 's01_e01_c01_u048', 's01_e01_c01_u049', 's01_e01_c01_u050', 's01_e01_c01_u051', 's01_e01_c01_u052', 's01_e01_c01_u053', 's01_e01_c01_u05

Summary stats for the corpus:

In [19]:
friends_corpus.print_summary_stats()

Number of Users: 699
Number of Utterances: 61338
Number of Conversations: 3099


## Updating Gender for our Main Characters

In [20]:
users['Chandler Bing']['gender'] = 'male'
users['Carol Willick']['gender'] = 'female'
print(users['Chandler Bing'])

{'first_appearance': 's01_e01', 'gender': 'male'}


# D1. Exploring Dataset
## Taking character entities and placing at conversation level
For each scene, we extract the unique people mentioned during the conversation (from the character entities) and the unique speakers.
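One way to sketch this per-scene extraction is shown below. It assumes each character entity is a list whose third element is the character's name (the cell below reads it as `li[2]`), stores the gender counts as integers rather than lists, and starts every scene with fresh counts. The helper `scene_people` and the toy gender table are hypothetical stand-ins for the notebook's loop and `gender_guesser`.

```python
def scene_people(utterances, guess_gender):
    """Collect mentioned entities and speakers for one scene, with gender counts."""
    mentioned, speakers = set(), set()
    for utterance in utterances:
        spk = utterance.get('speakers') or []
        speakers.update(spk)
        for sentence in utterance.get('character_entities') or []:
            for entity in sentence:
                name = entity[2]  # assumed layout: [start, end, name]
                if not spk or name != spk[0]:  # skip self-references
                    mentioned.add(name)

    def count(names, labels):
        # Number of names whose guessed gender is one of the given labels
        return sum(guess_gender(n.split()[0]) in labels for n in names)

    return {
        'character_entities': mentioned,
        'ce_m': count(mentioned, {'male'}),
        'ce_f': count(mentioned, {'female'}),
        'speakers': speakers,
        'spkr_m': count(speakers, {'male', 'mostly_male'}),
        'spkr_f': count(speakers, {'female'}),
    }

# Toy data (invented)
toy_genders = {'Monica': 'female', 'Rachel': 'female', 'Paul': 'male'}
toy_scene = [
    {'speakers': ['Monica Geller'], 'character_entities': [[[0, 1, 'Paul']], []]},
    {'speakers': ['Rachel Green'], 'character_entities': [[[2, 3, 'Rachel Green']]]},
    {'speakers': []},
]
people = scene_people(toy_scene, lambda name: toy_genders.get(name, 'unknown'))
print(people)
```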

In [21]:

all_ce = {}


for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      ces = set()
      spkrs = set()
      # Note: c_f/c_m and s_f/s_m are only reassigned when a name is added below,
      # so a scene that adds none keeps the lists from the previous scene.
      for l in range(len(utterances)):
        utterance = utterances[l]
        if 'character_entities' in utterance.keys():
          character_entities = utterance['character_entities']
          speakers = utterance['speakers']
          for char in character_entities:
            for li in char:
              name = li[2]
              if name != speakers[0]:
                ces.add(name)
                # Recount the genders of all entities mentioned so far in the scene
                ceg = [d.get_gender(n.split()[0]) for n in ces]
                c_f = [1 for cg in ceg if cg == 'female']
                c_m = [1 for cg in ceg if cg == 'male']
        if 'speakers' in utterance.keys():
          speakers = utterance['speakers']
          for sp in speakers:
            spkrs.add(sp)
            # Recount the genders of all speakers so far in the scene
            spg = [d.get_gender(z.split()[0]) for z in spkrs]
            s_f = [1 for sg in spg if sg == 'female']
            s_m = [1 for sg in spg if sg in ('male', 'mostly_male')]
      all_ce[scene['scene_id']] = {'character_entities': ces, 'ce_m': c_m, 'ce_f': c_f, 'speakers': spkrs, 'spkr_m': s_m, 'spkr_f': s_f}

100%|██████████| 10/10 [00:18<00:00,  1.67s/it]


In [0]:
#print(all_ce['s05_e01_c01'])

In [0]:
#all_ce.keys()

## Extracting Romantic Words from Conversation
For each scene, we'll check what romantic (and nonromantic) words are used during the conversation.

In [24]:
from google.colab import files
uploaded = files.upload()

Saving RomanticWords.txt to RomanticWords.txt


In [0]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

We read the words from RomanticWords.txt and store their stems. We use only single words rather than phrases to reduce the time needed to check each utterance.
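The word check can be sketched without NLTK as follows. The helper `split_words` is hypothetical, and its default `stem=str.lower` is only a stand-in for the Porter stemmer used in the notebook; unlike the notebook's loop, the sketch drops empty strings before the lookup.

```python
import re

def split_words(tokens, romantic_stems, stem=str.lower):
    """Split tokenized sentences into romantic and non-romantic words."""
    romantic, non_romantic = [], []
    for sentence in tokens:
        for word in sentence:
            word = re.sub(r'[^\w\s]', '', word)  # strip punctuation
            if not word:
                continue  # the token was punctuation only
            if stem(word) in romantic_stems:
                romantic.append(word)
            else:
                non_romantic.append(word)
    return romantic, non_romantic

# Toy tokens and word list (invented)
toy_tokens = [['I', 'love', 'you', '!'], ['Kiss', 'me', '.']]
rom, non_rom = split_words(toy_tokens, {'love', 'kiss'})
print(rom, non_rom)
```

Passing `stem=PorterStemmer().stem` would bring the sketch closer to what the notebook does.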

In [0]:
# Use a set for fast membership checks, and close the file when done
with open('RomanticWords.txt', 'r') as f:
    romantic_words = {ps.stem(line.rstrip('\n')) for line in f}

In [0]:
#print(romantic_words)

In [28]:
import re

extract_romantic = {}

for i in tqdm(range(1,11)):
  season_number = '0'+str(i) if i < 10 else '10'
  json_file = 'https://raw.githubusercontent.com/emorynlp/character-mining/master/json/friends_season_'+str(season_number)+'.json'
  r = requests.get(json_file)
  
  season = json.loads(r.text)
  episodes = season['episodes']
  for j in range(len(episodes)):
    episode = episodes[j]
    scenes = episode['scenes']
    for k in range(len(scenes)):
      scene = scenes[k]
      utterances = scene['utterances']
      romantic = []
      non_romantic = []
      for l in range(len(utterances)):
        utterance = utterances[l]
        if len(utterance['speakers']) == 0 :
          continue
        tokens = utterance['tokens']
        for token in tokens:
          for word in token:
            word=re.sub(r'[^\w\s]', '', word) #remove all punctuation so it isn't included as a word
            if ps.stem(word) in romantic_words:
              romantic.append(word)
            else:
                if len(word)!=0:
                  non_romantic.append(word)
      extract_romantic[scene['scene_id']] = {'romantic words': romantic, 'nonromantic words': non_romantic}

100%|██████████| 10/10 [00:14<00:00,  1.48s/it]


In [0]:
#extract_romantic.keys()

In [0]:
#print(extract_romantic['s01_e01_c01'])

# Assessing the Romantic Words and Genders
We create a dataframe of the total number of romantic and nonromantic words used per scene, along with per-scene counts of character entities and speakers, in total and by sex. We then use these variables to calculate statistics on the types of conversations occurring.

In [0]:
r= [(scene_id, len(extract_romantic[scene_id]['romantic words']), len(extract_romantic[scene_id]['nonromantic words'])) for scene_id in extract_romantic.keys()]

In [0]:
c=[(scene_id, len(all_ce[scene_id]['character_entities']), len(all_ce[scene_id]['ce_m']), len(all_ce[scene_id]['ce_f']) , len(all_ce[scene_id]['speakers']), len(all_ce[scene_id]['spkr_f']), len(all_ce[scene_id]['spkr_m'])) for scene_id in all_ce.keys()]

In [0]:
import pandas as pd
c1 = pd.DataFrame(c, columns=['scene_id', 'character_entities', 'ce_m', 'ce_f', 'speakers', 'spkr_f', 'spkr_m'])
r1 = pd.DataFrame(r, columns=['scene_id', 'romantic words', 'nonromantic words'])
df = pd.merge(c1, r1, on='scene_id')

# The season is the part of the scene id before the first underscore, e.g. 's01' -> 1
df['season'] = df['scene_id'].str.split('_', n=1).str[0].str.replace('s', '', regex=False).astype(int)
#print (df.dtypes)

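Calculating the percentage of romantic words used in each scene. The function adds 1 to the romantic-word count and to the total, which avoids division by zero for scenes with no words.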
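As a small worked example of the formula used in the next cell, the sketch below applies it to a few hand-picked counts. The name `romantic_percent` is hypothetical; the arithmetic is the same as in `calculate_average`.

```python
def romantic_percent(n_romantic, n_nonromantic):
    """Percentage of romantic words, with 1 added to the count and to the total."""
    return (n_romantic + 1) / (n_romantic + n_nonromantic + 1) * 100

print(romantic_percent(0, 0))    # a scene with no counted words
print(romantic_percent(1, 98))   # one romantic word out of 99
print(romantic_percent(0, 469))  # no romantic words out of 469
```

Because of the added 1, a scene with no counted words scores 100, and a scene with no romantic words still scores slightly above zero.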

In [34]:
def calculate_average(row):
    return ((row['romantic words']+1)/(row['romantic words']+row['nonromantic words']+1))*100

df['avg_rom'] = df.apply(calculate_average, axis=1)

# Mean, minimum, and maximum of the per-scene percentage
avg_rom = [df['avg_rom'].mean(), df['avg_rom'].min(), df['avg_rom'].max()]
print(avg_rom)

[2.186444081665329, 0.2127659574468085, 100.0]


Calculating the number of scenes with only male or only female speakers (at least two speakers per scene).

In [35]:
#Female Total
f_all = len(df[(df['spkr_m']==0) & (df['spkr_f']>=2)]) # (255)
print(f_all)

f_14=len(df[(df['spkr_m']==0) & (df['spkr_f']>=2) & (df['season']<=4)]) # (127)
print(f_14)


#Male total 
m_all = len(df[(df['spkr_f']==0) & (df['spkr_m']>=2)]) # (343)
print(m_all)

m_14 = len(df[(df['spkr_f']==0) & (df['spkr_m']>=2) & (df['season']<=4)]) # (155)
print(m_14)



255
127
343
155


Calculating the number of scenes with only male or only female speakers that have an above-average rate of romantic words and in which 50% or more of the character entities are of the opposite sex. NOTE: This is only one way in which we can define romantic relationships.
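The selection rule can be sketched as a plain predicate over one row of the dataframe. The function `is_romantic_single_gender` and the toy rows are invented for illustration; the column names and the 2.18 and 50 thresholds are the ones used in the cells below.

```python
def is_romantic_single_gender(row, only, rom_threshold=2.18, min_opposite_share=50):
    """True if a scene has speakers of one gender only and meets both thresholds."""
    if only == 'female':
        same, other, opposite_share = row['spkr_f'], row['spkr_m'], row['ce_mp']
    elif only == 'male':
        same, other, opposite_share = row['spkr_m'], row['spkr_f'], row['ce_fp']
    else:
        raise ValueError("only must be 'female' or 'male'")
    return (other == 0 and same >= 2
            and row['avg_rom'] >= rom_threshold
            and opposite_share >= min_opposite_share)

# Toy rows (invented values)
rows = [
    {'spkr_f': 2, 'spkr_m': 0, 'avg_rom': 3.0, 'ce_mp': 66.7, 'ce_fp': 33.3},
    {'spkr_f': 2, 'spkr_m': 1, 'avg_rom': 3.0, 'ce_mp': 66.7, 'ce_fp': 33.3},
    {'spkr_f': 0, 'spkr_m': 3, 'avg_rom': 1.0, 'ce_mp': 20.0, 'ce_fp': 80.0},
]
flags = [is_romantic_single_gender(r, 'female') for r in rows]
print(flags)
```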

In [36]:
#Proportion of male character entities (calculating only for seasons 1-4)
def calculate_propm(row):
    return ((row['ce_m']+1)/(row['character_entities']+1))*100

df.apply(calculate_propm, axis=1)
df['ce_mp'] = df.apply(calculate_propm, axis=1)


ce_mp=[]
x= df['ce_mp'].iloc[0:1304].mean() 
ce_mp.append(x)

y= df['ce_mp'].iloc[0:1304].min()
ce_mp.append(y)

z= df['ce_mp'].iloc[0:1304].max()
ce_mp.append(z)


print(ce_mp)

#Proportion of female character entities
def calculate_propf(row):
    return ((row['ce_f']+1)/(row['character_entities']+1))*100

df.apply(calculate_propf, axis=1)
df['ce_fp'] = df.apply(calculate_propf, axis=1)


ce_fp=[]
x= df['ce_fp'].iloc[0:1304].mean()
ce_fp.append(x)

y= df['ce_fp'].iloc[0:1304].min()
ce_fp.append(y)

z= df['ce_fp'].iloc[0:1304].max()
ce_fp.append(z)


print(ce_fp)

[43.76059440586523, 12.5, 500.0]
[44.058208472973504, 11.11111111111111, 600.0]


In [0]:
# only female with higher than average romantic language (& >=50% male character entities for seasons 1-4) (6)
f_mr_14 = len(df[(df['spkr_m']==0) & (df['spkr_f']>=2) & (df['avg_rom']>=2.18) & (df['ce_mp']>=50) & (df['season']<=4)])
f_mr_all = len(df[(df['spkr_m']==0) & (df['spkr_f']>=2) & (df['avg_rom']>=2.18)])

# only male with higher than average romantic language (& >=50% female character entities for seasons 1-4) (10)
m_fr_14 = len(df[(df['spkr_f']==0) & (df['spkr_m']>=2) & (df['avg_rom']>=2.18) & (df['ce_fp']>=50)  & (df['season']<=4)])
m_fr_all = len(df[(df['spkr_f']==0) & (df['spkr_m']>=2) & (df['avg_rom']>=2.18)])

In [38]:
# of the conversations with only female speakers, what fraction used above-average romantic language
print(f_mr_14/f_14)
print(f_mr_all/f_all)

# of the conversations with only male speakers, what fraction used above-average romantic language
print(m_fr_14/m_14)
print(m_fr_all/m_all)

0.047244094488188976
0.23529411764705882
0.06451612903225806
0.2594752186588921
