Katherine Kairis, kak275@pitt.edu, 10/12/2017

In [1]:
from bs4 import BeautifulSoup
import glob
import re

In [2]:
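#Collect the transcript files, skipping corpus-header.xml by name (glob does not guarantee file order)
transcripts = sorted(glob.glob('VOICE/VOICE1.0XML/XML/*.xml'))
transcripts = [t for t in transcripts if not t.endswith('corpus-header.xml')]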

In [3]:
#Create two dictionaries: one containing information about the participants, and one containing the conversations
participants = {}
conversations = {}

# Getting info about the participants
The participant_info function extracts information about the participants and stores it in the "participants" dictionary. The keys of the dictionary are the participants' ID numbers. The values are sub-dictionaries that include the participant's role, age, sex, and occupation (if listed). The sub-dictionaries also include the participants' L1s, which are stored in lists (since some participants have multiple L1s).

In [4]:
def participant_info(contents):
    
    #Get all of the participants in the given conversation
    people = contents.find('listPerson', {'type': 'identified'}).findAll('person')
    
    for p in people:
        #info is a sub-dictionary that contains a single participant's information. It will be
        #stored as a value in the participants dictionary.
        info = {}
        info['role'] = p['role']
        info['age'] = p.age.get_text()
        info['sex'] = p.sex.get_text()
        
        #In some cases, the occupation isn't listed. If it is included, get the text of the occupation field.
        #If it isn't included, None will be stored as the occupation, since p.occupation returns None.
        try:
            info['occupation'] = p.occupation.get_text()
        except AttributeError:
            info['occupation'] = p.occupation
        
        #Get a list of the languages that the participant speaks. Iterate through the list, and add them to the
        #dictionary according to the speaker's level (e.g. L1).
        languages = p.findAll('langKnown')
        for l in languages:
            level = l['level']
            language = l['tag']
        
            if level in info:
                info[level].append(language)
            else:
                info[level] = [language]
    
        #Get the participant's ID number, and make it a key in the participants dictionary. The value will be
        #the info dictionary
        name = p['xml:id']
        participants[name] = info
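
The grouping of languages by level can be tried in isolation, without the corpus files. The sketch below is an illustration, not the notebook's own code: `known_languages` is a hypothetical list of (level, tag) pairs standing in for one speaker's `langKnown` elements, modeled on the EDcon250_S2 entry, and `dict.setdefault` replaces the `if level in info` check.

```python
# Hypothetical (level, tag) pairs standing in for one speaker's <langKnown> elements.
known_languages = [("L1", "ger-AT"), ("L1", "eng-US")]

# Sub-dictionary for one participant, shaped like the ones participant_info builds.
info = {"role": "participant", "age": "25-34", "sex": "female", "occupation": None}

for level, language in known_languages:
    # setdefault creates an empty list the first time a level is seen,
    # then the language tag is appended to that list.
    info.setdefault(level, []).append(language)

print(info["L1"])
```

Storing every level as a list means that speakers with one L1 and speakers with several can be handled by the same code later on.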

# Getting lines of the conversation from the file
The conversation_lines function gets each line from the current conversation. The lines are stored as lists in the "conversations" dictionary, whose keys are the names of the XML files. For now, I decided to keep the lines in their XML format, since the XML contains a lot of annotations that could be useful later on, such as the speaker, pauses, and intonation markings. Stripping the tags to get plain text is simple, so I could change this later on.

In [5]:
def conversation_lines(file, contents):
    file_name = file.split("/")[-1]
    text_body = contents.body
    xml_lines = text_body.findAll('u')
    conversations[file_name] = xml_lines
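
As a rough illustration of how the tags could be stripped later, the sketch below converts one utterance to plain text. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages; on the stored BeautifulSoup tags, `get_text()` would play the same role. The helper name `utterance_to_text` is hypothetical, and the sample string is the EDcon496 utterance shown in the output further down.

```python
import xml.etree.ElementTree as ET

# One utterance from EDcon496.xml, copied from the notebook output.
sample = (
    '<u who="#EDcon496_S1" xml:id="EDcon496_u_1"> e<c type="lengthening"/>r leads so '
    '<pause/> ma<c type="lengthening"/>n i\'m still stuck on lead '
    'du<c type="lengthening"/>de <pause dur="PT3S"/></u>'
)

def utterance_to_text(xml_string):
    """Hypothetical helper: return (speaker ID, plain text) for one <u> element."""
    u = ET.fromstring(xml_string)
    # itertext() yields the text between the tags and drops the tags themselves.
    raw = "".join(u.itertext())
    # Collapse the extra whitespace left behind where <pause/> tags were.
    text = " ".join(raw.split())
    speaker = u.get("who").lstrip("#")
    return speaker, text

speaker, text = utterance_to_text(sample)
print(speaker, ":", text)
```

Note that this discards the lengthening and pause annotations entirely, which is the trade-off behind keeping the XML for now.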

# Processing the XML files
This section iterates through all of the files (except for corpus-header.xml) in the VOICE1.0XML/XML directory. It calls conversation_lines and participant_info to extract some important parts of the data from the corpus.

In [6]:
for t in transcripts:
    #Use a context manager so that each file is closed after it is read
    with open(t, 'r') as file:
        text = file.read()
    xml_contents = BeautifulSoup(text, 'xml')
    conversation_lines(t, xml_contents)
    participant_info(xml_contents)

In [7]:
conversations['EDcon496.xml'][0]

<u who="#EDcon496_S1" xml:id="EDcon496_u_1"> e<c type="lengthening"/>r leads so <pause/> ma<c type="lengthening"/>n i'm still stuck on lead du<c type="lengthening"/>de <pause dur="PT3S"/></u>

In [8]:
participants['EDcon250_S2']

{'L1': ['ger-AT', 'eng-US'],
 'age': '25-34',
 'occupation': None,
 'role': 'participant',
 'sex': 'female'}

# Data statistics

## Size of corpus

In [9]:
num_conversations = len(conversations)
num_participants = len(participants)
num_lines = 0

for c in conversations:
    num_lines += len(conversations[c])
    
print("Number of conversations:", num_conversations)
print("Average length of conversation:", num_lines/num_conversations)
print("Number of participants:", num_participants)

Number of conversations: 151
Average length of conversation: 698.8675496688742
Number of participants: 1260


## Number of native English speakers
I noticed that some of the participants in the conversations are native English speakers. Since I would like to separate native and non-native speakers in order to compare the two groups, I will have to remove these participants from this corpus. As of now, I'm not sure if I will include them in a native English corpus, or just completely discard them. Luckily, I won't have to delete too many participants from this corpus: only 87 of the 1260 participants (about 6.9%) have English listed as an L1.

In [10]:
#Get native English speakers
native_speakers = []

#There are multiple ways that English is listed as an L1 ("eng", "eng-US", "eng-CA", "eng-GB", "eng-GY", "eng-AU", etc)
#I used a regular expression to find all of these instances
r = re.compile("eng.*")

for person in participants:
    
    #Returns a list of the participant's L1s that start with "eng" (r.match only matches at the start of the tag).
    #If the list is not empty, the participant has English listed as an L1.
    english = list(filter(r.match, participants[person]['L1']))
    
    if len(english) != 0:
        #print(person, ':', participants[person])
        native_speakers.append(person)

In [11]:
print("Number of native speakers:", len(native_speakers))
print("Total number of participants:", len(participants))
print("\nNative English speakers:")
print(native_speakers)

Number of native speakers: 87
Total number of participants: 1260

Native English speakers:
['EDcon250_S2', 'EDcon496_S2', 'EDcon521_S1', 'EDint328_S3', 'EDint330_S2', 'EDint330_S4', 'EDsed251_S3', 'EDsed301_S6', 'EDsed362_S1', 'EDsed362_S11', 'EDsed362_S14', 'EDsed362_S17', 'EDsed363_S3', 'EDsed364_S7', 'EDwgd497_S4', 'EDwgd5_S3', 'EDwgd6_S7', 'EDwgd6_S11', 'EDwsd15_S13', 'EDwsd242_S7', 'EDwsd302_S13', 'EDwsd303_S13', 'EDwsd304_S9', 'EDwsd306_S11', 'EDwsd590_S13', 'EDwsd9_S2', 'LEcon329_S4', 'LEcon545_S7', 'LEcon545_S1', 'LEcon547_S4', 'LEcon548_S4', 'LEcon548_S5', 'LEcon562_S2', 'LEcon562_S3', 'LEcon562_S6', 'PBmtg280_S3', 'PBmtg280_S4', 'PBpan10_S9', 'PBpan28_S7', 'PBpan28_S9', 'PBqas411_S1', 'PBqas412_S6', 'POcon549_S4', 'POcon591_S10', 'POmtg404_S5', 'POmtg439_S2', 'POmtg439_S3', 'POmtg444_S5', 'POmtg444_S9', 'POmtg447_S2', 'POmtg447_S3', 'POmtg546_S9', 'POprc522_S7', 'POprc522_S8', 'POprc558_S6', 'POprc559_S10', 'POprc559_S11', 'POwgd12_S1', 'POwgd12_S6', 'POwgd37_S10', 'POwgd375_
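
Once the native speakers are identified, splitting the participants into two groups could look something like the sketch below. This is only an illustration under stated assumptions: `sample_participants` is a hypothetical miniature of the participants dictionary (only the EDcon250_S2 entry comes from the output above; the `demo_` speakers and their language tags are invented), `has_english_l1` is a hypothetical helper, and `str.startswith` stands in for the regular expression.

```python
# Hypothetical miniature of the participants dictionary. Only EDcon250_S2 is
# taken from the notebook output; the demo_ entries are invented.
sample_participants = {
    "EDcon250_S2": {"L1": ["ger-AT", "eng-US"]},
    "demo_S1": {"L1": ["fre-FR"]},
    "demo_S2": {"L1": ["eng"]},
}

def has_english_l1(info):
    """Hypothetical helper: True if any listed L1 tag starts with 'eng'."""
    # .get() with a default avoids a KeyError if a speaker has no L1 entry.
    return any(tag.startswith("eng") for tag in info.get("L1", []))

native = [pid for pid, info in sample_participants.items() if has_english_l1(info)]
non_native = [pid for pid in sample_participants if pid not in native]

print("Native:", native)
print("Non-native:", non_native)
```

A speaker such as EDcon250_S2, who lists both German and English as L1s, ends up in the native group here; whether bilingual speakers belong there is a separate decision.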

## "Unclear" speech
I also noticed that a lot of lines in the conversations had an "unclear" tag. I was thinking about removing these lines, but I was worried about possibly removing a large amount of the data. 9877 of the 105529 utterances (about 9.4%) had at least one "unclear" annotation, so I wouldn't lose a lot of data if I removed these lines.

In [12]:
num_utterances = 0
num_unclear = 0

for c in conversations:
    num_utterances += len(conversations[c])
    for line in conversations[c]:
        if line.findChildren('unclear'):
            num_unclear += 1

In [13]:
print("Total number of utterances:", num_utterances)
print("Number of utterances with at least one \"unclear\" tag:", num_unclear)

Total number of utterances: 105529
Number of utterances with at least one "unclear" tag: 9877
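
To put the two counts in proportion, and to show what dropping the affected lines might look like, here is a small sketch. It again uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and the two sample utterances are invented for illustration (the real files may also carry namespaces and attributes that this toy input ignores).

```python
import xml.etree.ElementTree as ET

# Counts reported in the output above.
num_utterances = 105529
num_unclear = 9877
share_unclear = num_unclear / num_utterances
print(f"Share of utterances with an unclear tag: {share_unclear:.1%}")

# Invented utterances: the first contains an <unclear> element, the second does not.
sample_utterances = [
    ET.fromstring('<u who="#demo_S1">yeah <unclear>that one</unclear></u>'),
    ET.fromstring('<u who="#demo_S2">okay</u>'),
]

# Keep only utterances with no <unclear> descendant at any depth.
clear_only = [u for u in sample_utterances if u.find(".//unclear") is None]
print("Utterances kept:", len(clear_only))
```

Filtering at the utterance level removes the whole line even when only one word is unclear, so the amount of text lost could be larger than the tagged spans alone.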


# Sharing data
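The license for the corpus seems fairly lenient, so I think that I could share as much data as I would like, as long as I attribute the corpus, keep the use non-commercial, and share under the same or a similar license. The following comes from the corpus's license: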
* Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

* You are free:
    * to Share — to copy, distribute and transmit the work
    * to Remix — to adapt the work

* Under the following conditions:
    * Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
    * Noncommercial — You may not use this work for commercial purposes.
    * Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.