Alicia Sigmon

als333@pitt.edu

11/19/2017

## Discourse Analysis of the Australian Radio Talkback Corpus

#### About the Corpus
- The Australian Radio Talkback corpus contains raw text files of transcribed telephone conversations.
- Each conversation records the speaker's role (presenter, expert, or caller), name, and gender.
- The transcripts also mark other verbal cues, such as laughter (<laughs>) and places where a speaker says something during another speaker's turn (<E1 yeah>). Corrections to the transcription appear in curly braces {}.
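
As a rough illustration of these markup conventions, the sketch below pulls the three kinds of annotation out of a single line with regular expressions. The pattern names and the shortened sample line (adapted from ABCE1-raw.txt) are assumptions for illustration; they are inferred from examples, not from any corpus specification.

```python
import re

# Hypothetical patterns, inferred from examples in the transcripts.
BACKCHANNEL = re.compile(r'<([PCE]\d+) ([^<>]+)>')   # e.g. <E1 sounds reasonable>
VERBAL_CUE = re.compile(r'<([a-z]+)>')               # e.g. <laughs>, <inaudible>
CORRECTION = re.compile(r'\{([^{}]+)\}')             # e.g. {composting}

line = "[P1] He's also known <E1 sounds reasonable> for cosposting {composting} toilets <laughs>."

backchannels = BACKCHANNEL.findall(line)   # (speaker, words) pairs
cues = VERBAL_CUE.findall(line)            # single-word cues only
corrections = CORRECTION.findall(line)     # corrected spellings
print(backchannels, cues, corrections)
```

Markers such as <,> are deliberately not matched by any of these patterns, and real files may need looser expressions.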

#### Discourse Analysis Goals
- comparing speakers by role and gender
- aspects to consider:
    - back channels (yeah, uh-huh, that's great!)
    - vocabulary size (avg word length)
    - number of turns, sentences, and words
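
To make these measures concrete, here is a minimal sketch of how turns, words, and average word length could be tallied per speaker. The toy `turns` list is invented for illustration, and whitespace tokenization is a simplifying assumption; the real analysis may well use a proper tokenizer.

```python
from collections import defaultdict

# Toy (speaker, utterance) pairs; invented for illustration only.
turns = [
    ("P1", "thanks for that"),
    ("E1", "I guess yeah yeah"),
    ("P1", "he is also known for his orchids"),
]

stats = defaultdict(lambda: {"turns": 0, "words": 0, "chars": 0})
for speaker, utterance in turns:
    words = utterance.split()                      # whitespace tokenization (a simplification)
    stats[speaker]["turns"] += 1
    stats[speaker]["words"] += len(words)
    stats[speaker]["chars"] += sum(len(w) for w in words)

# Average word length per speaker = total characters / total words.
avg_word_len = {s: d["chars"] / d["words"] for s, d in stats.items()}
print(dict(stats), avg_word_len)
```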

In [1]:
%pprint

Pretty printing has been turned OFF


In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
from nltk.corpus import PlaintextCorpusReader
import pandas as pd
import nltk
import numpy
import glob

In [4]:
ART_fids=glob.glob('C:\\Users\\sigmo\\Documents\\Data_Science\\Discourse-Analysis-ART-Corpus\\data_files\\AustralianRadioTalkback\\files\\Raw\\*.txt')

**Each filename consists of 3-5 letters followed by a number (for example NAT4, ABCE1, ABCNE1). They all end in -raw.txt.**

In [5]:
# glob returns full paths. If only the file name is wanted, os.path.basename(x) strips the directory part.

print("There are " + str(len(ART_fids)) + " files in this corpus:") # 26
for x in ART_fids:
    print(x)

There are 26 files in this corpus:
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\ABCE1-raw.txt
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\ABCE2-raw.txt
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\ABCE3-raw.txt
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\ABCE4-raw.txt
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\ABCNE1-raw.txt
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\ABCNE2-raw.txt
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\COME1-raw.txt
C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_f
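
One way to trim full paths like the ones listed above down to bare file names is sketched here. It uses `pathlib.PureWindowsPath` so that the backslash-separated path is parsed the same way on any operating system; the variable names are my own, and `os.path.basename` would do the same job when run on Windows.

```python
from pathlib import PureWindowsPath

# Example path copied from the file listing; the raw string keeps the backslashes literal.
path = r'C:\Users\sigmo\Documents\Data_Science\Discourse-Analysis-ART-Corpus\data_files\AustralianRadioTalkback\files\Raw\ABCE1-raw.txt'

name = PureWindowsPath(path).name         # final path component
show_id = name.replace('-raw.txt', '')    # drop the suffix shared by every file
print(name, show_id)
```

Note that `str.replace` returns a new string rather than changing the original, so its result has to be assigned or printed.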

In [6]:
dir(ART_fids)

['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

In [7]:
rawtext_dict = {}
for fid in ART_fids:
    last_slash = fid.rindex('\\') # the key will be only the file name, not the entire location
    with open(fid, 'r') as f:     # the with block closes the file automatically
        rawtext_dict[fid[last_slash+1:]] = f.read()

In [8]:
rawtext_dict['ABCE1-raw.txt'][:1000]
# This excerpt shows the speaker information format, e.g. [Presenter 1: Simon Marnie, M] and the later short form [P1].

"[Presenter 1: Simon Marnie, M] Thanks for that John Hall now John Hall will be listening for the next hour 'cos Angus Stewart is here to take your calls eight-triple-three-one-thousand one-eight-hundred-eight-hundred-seven-oh-two something in the garden that's causing you problems give us a call right now and Angus can I mean y'know he is known in the trade as Mr popergation {propagation} Mr propagation. He's also known for his passion for natives and his love of o orchids am I right so far.\n\n[Expert 1: Angus Stewart, M] I guess yeah yeah <laughs>.\n\n[P1] He's also known <E1 sounds reasonable> for his ability to open cosposting {composting} toilets so he can tell you anything worm farm problems certainly helped us and although I'm still confused about dry ingredients we might talk about that as well but eight-triple-three-one-thousand one-eight-hundred-eight-hundred-seven-oh-two fine sunny day today top temperatures on the coast of twenty-seven inland thirty degrees Bowral enjoying

**Original Speaker Data Format in the Raw Text Files:**

   *There are multiple exceptions to this format that I will address later in the script.*

All speaker information is contained within [ ].

The first time a speaker appears, the tag gives the following, in order:
    - Speaker Type
        - Presenter, Caller, or Expert
    - Speaker Number
        - the first presenter in a file is Presenter 1, the second presenter is Presenter 2, etc.
    - The Speaker's Name
        - ex: Simon Marnie
    - The Speaker's Gender
        - M / F

On every later line by the same speaker, the tag is just P, C, or E followed by the speaker's number.
    - Ex: [Presenter 1: Simon Marnie, M] is later written as [P1]
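
The tag format described above could be parsed along these lines. This is only a sketch: the two patterns and the `parse_tag` helper are hypothetical names, and they cover the regular format only, not the exceptions mentioned earlier.

```python
import re

# Hypothetical patterns for the regular tag format; irregular tags will not match.
FULL_TAG = re.compile(r'^\[(Presenter|Caller|Expert) (\d+): ([^,\]]+), ([MF])\]')
SHORT_TAG = re.compile(r'^\[([PCE])(\d+)\]')

def parse_tag(line):
    """Return (role_letter, number, name, gender); name and gender are None for short tags."""
    m = FULL_TAG.match(line)
    if m:
        role, num, name, gender = m.groups()
        return role[0], int(num), name, gender
    m = SHORT_TAG.match(line)
    if m:
        return m.group(1), int(m.group(2)), None, None
    return None   # no recognizable speaker tag at the start of the line

first = parse_tag("[Presenter 1: Simon Marnie, M] Thanks for that John Hall")
later = parse_tag("[P1] He's also known")
print(first, later)
```

Reducing the full role name to its first letter lets the long and short forms of a tag share one key, such as ('P', 1).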

In [9]:
print(rawtext_dict['NAT4-raw.txt']) # both NAT4-raw.txt and NAT5-raw.txt are contained within this file

[Presenter 1: Tony Delroy, M] Eighteen minutes past ten eighteen past nine in Queensland eighteen past seven in Western Australia. Navigating the world of treatment for menopause can be a nightmare. In the last month we've seen the cancellation of a seven year trial in the states <,> sending more shockwaves through the community. Already a topic that causes a lotta confusion hormone replacement therapy or H R T has been controversial since its widespread use in the early nineties but does that mean it's not beneficial for some people. Menopause is different for each woman <,> uh some women suffer few side effects during the change as it's affectionately known and uh others suffer severe weight loss loss of sex drive uh mood swings and of course the dreaded hot flushes. So uh how do we make sense of everything that's on offer. To help us out tonight Dr Barry Wren is a guest. S one of Australia's leading researchers in gynaecological research and the author of a book called Understanding

In [10]:
# NAT4-raw.txt holds two radio talk shows; split it into separate NAT4-raw.txt and NAT5-raw.txt entries

seg2_start = rawtext_dict['NAT4-raw.txt'].rindex('[Presenter 1')   # Location of last mention of Presenter 1
seg1 = rawtext_dict['NAT4-raw.txt'][:seg2_start]
seg2 = rawtext_dict['NAT4-raw.txt'][seg2_start:]
seg2[:500]

rawtext_dict['NAT4-raw.txt'] = seg1
rawtext_dict['NAT5-raw.txt'] = seg2

files = sorted(rawtext_dict.keys()) # creates an accurate file list (ART_fids is missing NAT5-raw.txt!!)

"[Presenter 1: John Cleary, M] So now it's welcome first to our expert panel. Dr Brian Edgar he is director of theology and public policy for the Evangelical Alliance a mainstream Protestant agency which have a website that canvases electoral issues. Brian welcome <Expert 1: Brian Edgar, M thank you John> to you.\n\n[E1] Thank you very much glad to <P1 Victoria> be here.\n\n[P1] Victoria Kearney is one of the coordinators of a website called PolMin which looks at lobbying for policies in harmony with"

In [11]:
# Line splitting using regular expressions 
    # some lines appeared to contain \r while most contained \n
import re
foo = "Hello world\n\nhow are you\na new line\r\nanother newline\n"
re.split(r'[\n\r]+', foo)
re.split(r'[\n\r]+', foo.strip())

['Hello world', 'how are you', 'a new line', 'another newline', '']

['Hello world', 'how are you', 'a new line', 'another newline']

In [12]:
# Splitting the texts by line:
    # Trial run on the first 3 files and their first 20 lines:

for fid in files[:3]: # files includes  NAT5-raw.txt
    rawtext = rawtext_dict[fid] # already includes NAT5-raw.txt
    rawlines = re.split(r'[\n\r]+', rawtext.strip())[:20]
    print(fid)
    for l in rawlines:
        if ']' in l:
            where = l.index(']')
            speaker = l[:where+1]
            utterance = l[where+2:]
            print(speaker+' '+utterance)
        else: 
            print('***'+l+'******')    # {program advert}. What to do with these? 
    print()

ABCE1-raw.txt
[Presenter 1: Simon Marnie, M] Thanks for that John Hall now John Hall will be listening for the next hour 'cos Angus Stewart is here to take your calls eight-triple-three-one-thousand one-eight-hundred-eight-hundred-seven-oh-two something in the garden that's causing you problems give us a call right now and Angus can I mean y'know he is known in the trade as Mr popergation {propagation} Mr propagation. He's also known for his passion for natives and his love of o orchids am I right so far.
[Expert 1: Angus Stewart, M] I guess yeah yeah <laughs>.
[P1] He's also known <E1 sounds reasonable> for his ability to open cosposting {composting} toilets so he can tell you anything worm farm problems certainly helped us and although I'm still confused about dry ingredients we might talk about that as well but eight-triple-three-one-thousand one-eight-hundred-eight-hundred-seven-oh-two fine sunny day today top temperatures on the coast of twenty-seven inland thirty degrees Bowral e
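
One possible answer to the question of what to do with untagged lines such as {program advert} is to keep them apart from the speaker turns. The sketch below is an assumption about how that might look, not the approach taken in this notebook: `split_line` is a hypothetical helper, and the sample text is modeled on the corpus rather than read from a file.

```python
import re

def split_line(line):
    """Split one raw line into (speaker_tag, utterance); the tag is None for untagged lines."""
    if line.startswith('[') and ']' in line:
        tag, _, rest = line.partition(']')   # split at the first closing bracket
        return tag + ']', rest.strip()
    return None, line.strip()

# Sample text modeled on the corpus; the middle line is a standalone annotation.
sample = "[Expert 1: Angus Stewart, M] I guess yeah yeah <laughs>.\n\n{program advert}\r\n[P1] He's also known"

rows = [split_line(l) for l in re.split(r'[\n\r]+', sample.strip())]
turns = [(tag, text) for tag, text in rows if tag is not None]   # speaker turns
notes = [text for tag, text in rows if tag is None]              # adverts, music, cuts
print(turns, notes)
```

Using `strip()` on the remainder avoids relying on exactly one space following the closing bracket.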

In [24]:
# Data Cleaning:

rawtext_dict['ABCE4-raw.txt'] = rawtext_dict['ABCE4-raw.txt'].replace('[E2]', '[E1]')

# There is no E2 in this segment, but E2 is erroneously mentioned. Replacing. 

# $ grep Expert  ABCE4-raw.txt 
# [Expert 1: Ric Nattrass, M] Uh blue-tongues'd be {break} unlikely ...

# $ grep E2  ABCE4-raw.txt 
# [E2] Yeah.
# [E2] Yeah okay so your yours up there is the spotted catbird if you're on <C3 mm> ...


rawtext_dict['COME2-raw.txt'] = rawtext_dict['COME2-raw.txt'].replace('[C5: Jenny, F]', '[Caller 5: Jenny, F]')


# $ grep "C5" COME2-raw.txt 
# [C5: Jenny, F] Hello how are you.
# [C5] That's good. Um I was just wondering for some information on my house at ...
# [C5] Oh have I.

rawtext_dict['ABCE3-raw.txt'] = rawtext_dict['ABCE3-raw.txt'].replace('[Caller 11, Robyn, F]', '[Caller 11: Robyn, F]')
rawtext_dict['COME1-raw.txt'] = rawtext_dict['COME1-raw.txt'].replace('[Caller 23, Maureen, F]', '[Caller 23: Maureen, F]')
rawtext_dict['NAT7-raw.txt'] = rawtext_dict['NAT7-raw.txt'].replace('[Caller 12, Brian, M]', '[Caller 12: Brian, M]')
rawtext_dict['NAT8-raw.txt'] = rawtext_dict['NAT8-raw.txt'].replace('[Caller 10, Brett, M]', '[Caller 10: Brett, M]')


# $ grep -P 'Caller \d+,' *
# ABCE3-raw.txt:[Caller 11, Robyn, F] Hi um I read the book quite a while ago and ...
# COME1-raw.txt:[Caller 23, Maureen, F] Yes good morning.
# NAT7-raw.txt:[Caller 12, Brian, M] Yeah.
# NAT8-raw.txt:[Caller 10, Brett, M] How're you going.


rawtext_dict['COME3-raw.txt'] = rawtext_dict['COME3-raw.txt'].replace('[Caller 9 Maureen, F]', '[Caller 9: Maureen, F]')

# $ grep -P '\[\S+ \d+ ' *
# COME3-raw.txt:[Caller 9 Maureen, F] Good morning Dr Graham.


rawtext_dict['COME3-raw.txt'] = rawtext_dict['COME3-raw.txt'].replace('[CE1]', '[E1]')

# $ grep CE1 *
# COME3-raw.txt:[CE1] If it did become serious <,> 


rawtext_dict['COME6-raw.txt'] = rawtext_dict['COME6-raw.txt'].replace('P1a', 'P1')
rawtext_dict['COME6-raw.txt'] = rawtext_dict['COME6-raw.txt'].replace('[P1b Paul Murray, M]', '[P1]').replace('P1b', 'P1')

# COME6 presenter encoding scheme was generally messed up.


rawtext_dict['COMNE4-raw.txt'] = rawtext_dict['COMNE4-raw.txt'].replace('[C14: Noelene, F]', '[Caller 14: Noelene, F]')

rawtext_dict['NAT1-raw.txt'] = rawtext_dict['NAT1-raw.txt'].replace('[C11]', '[C10]')

rawtext_dict['NAT5-raw.txt'] = rawtext_dict['NAT5-raw.txt'].replace('[E1] Thank you very much glad to', '[Expert 1: Brian Edgar, M] Thank you very much glad to')
rawtext_dict['NAT5-raw.txt'] = rawtext_dict['NAT5-raw.txt'].replace('<Expert 1: Brian Edgar, M thank you John>', '<E1 thank you John>')

rawtext_dict['COME3-raw.txt'] = rawtext_dict['COME3-raw.txt'].replace('[C4 Nah <E1 it ih thi> and because','[C4] Nah <E1 it ih thi> and because')
rawtext_dict['COME3-raw.txt'] = rawtext_dict['COME3-raw.txt'].replace('[E1 No no no <P1 we haven\'t had one> we','[E1] No no no <P1 we haven\'t had one> we')

# This is Marianna's only line, and her speech is untranscribed, so I will skip her as a speaker
# print(rawtext_dict['NAT7-raw.txt'])
# rawtext_dict['NAT7-raw.txt'] = rawtext_dict['NAT7-raw.txt'].replace('{Caller 2: Marianna, F untranscribed overseas caller 04:32-07:18}','[Caller 2: Marianna, F untranscribed overseas caller 04:32-07:18]')

# print(rawtext_dict['COME3-raw.txt'])
rawtext_dict['COME3-raw.txt'] = rawtext_dict['COME3-raw.txt'].replace('of\ncholesterol excess cigarette smoking and of course a lack of exercise and uh central truncal obesity','of cholesterol excess cigarette smoking and of course a lack of exercise and uh central truncal obesity')

rawtext_dict['COME1-raw.txt'] = rawtext_dict['COME1-raw.txt'].replace('remember the name\n<inaudible> chrysanthemums','remember the name <inaudible> chrysanthemums')

# grep -P "Juicy" "COME6-raw.txt"
rawtext_dict['COME6-raw.txt'] = rawtext_dict['COME6-raw.txt'].replace('from a beautiful name called Juicy.\n\nJuicy.','from a beautiful name called Juicy. Juicy.')
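
Because `str.replace` silently does nothing when the target text is absent, a mistyped fix can go unnoticed. A table-driven variant that reports such misses might look like the sketch below. The `apply_fixes` helper and the toy dictionary are hypothetical; the toy stands in for `rawtext_dict` so the example runs without the corpus files.

```python
def apply_fixes(texts, fixes):
    """Apply (file_id, old, new) replacements in order; return the fixes whose old text was not found."""
    missed = []
    for fid, old, new in fixes:
        if old in texts[fid]:
            texts[fid] = texts[fid].replace(old, new)
        else:
            missed.append((fid, old))
    return missed

# Toy stand-in for rawtext_dict, using two errors listed in the cleaning cell.
toy = {'COME3-raw.txt': '[Caller 9 Maureen, F] Good morning Dr Graham.\n\n[CE1] If it did become serious <,>'}
fixes = [
    ('COME3-raw.txt', '[Caller 9 Maureen, F]', '[Caller 9: Maureen, F]'),
    ('COME3-raw.txt', '[CE1]', '[E1]'),
    ('COME3-raw.txt', '[C99]', '[C9]'),   # deliberately absent, to show the check
]
missed = apply_fixes(toy, fixes)
print(toy['COME3-raw.txt'])
print(missed)
```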

In [28]:
# Print every line that has no speaker tag (mostly {...} annotations) to look for formatting errors.
# The cleaning cell above fixes the errors found this way.

for fid in files:
    rawtext = rawtext_dict[fid]
    rawlines = re.split(r'[\n\r]+', rawtext.strip())
    print(fid)
    for l in rawlines:
        if ']' in l:
            where = l.index(']')
            speaker = l[:where+1]
            utterance = l[where+2:]
            #print(speaker+' '+utterance)
        else: 
            print('***'+l+'*****')    # {program advert}. What to do with these? 
    print()

ABCE1-raw.txt
***{program advert}*****
***{program advert}*****
***{program advert 11:20-11:47}*****
***{program advert 19:53-20:23}*****
***{cut}*****
***{program advert}*****
***{program advert}*****
***{program advert 30:34-31:03}*****
***{program advert}*****
***{music}*****
***{music}*****
***{music}*****
***{music}*****
***{music}*****
***{program advert 37:19-37:32}*****
***{program advert}*****

ABCE2-raw.txt
***{program advert 14:53-15:24}*****
***{cut, program advert}*****
***{program advert}*****
***{program advert 24:28-24:58}*****
***{Untranscribed news bulletins 25:27-26:51}*****
***{program advert 34:22-34:33}*****
***{program advert 39:52-40:22}*****
***{program advert}*****

ABCE3-raw.txt
***{untranscribed book reading 21:44-22:12}*****

ABCE4-raw.txt

ABCNE1-raw.txt
***{program advert: 0:15:15- 0:15:46}*****
***{untranscribed poem, 25:51-26:24}*****
***{Ends 26:34}*****

ABCNE2-raw.txt
***{program advert 11:08-11:39}*****
***{program advert 18:25-18:56}*****
***{Ends 