Katherine Kairis, kak275@pitt.edu, 11/2/2017

NEW CONTINUING -- This file, along with BNC_data and VOICE_data, continues from the first progress report.

In [1]:
import pickle
import nltk

# Getting Data
Open the pickle files created in BNC_data and VOICE_data.

In [2]:
with open('VOICE_tokenized.p', 'rb') as f:
    VOICE_toks = pickle.load(f)

In [3]:
with open('VOICE_tagged.p', 'rb') as f:
    VOICE_tags = pickle.load(f)

In [4]:
with open('BNC_tokenized.p', 'rb') as f:
    BNC_toks = pickle.load(f)

In [5]:
with open('BNC_tagged.p', 'rb') as f:
    BNC_tags = pickle.load(f)
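
The four loading cells above all follow the same pattern, so they could be folded into one small helper. This is only a sketch: `load_pickle` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes each file holds a single pickled object.

```python
import pickle

def load_pickle(path):
    """Load a single pickled object from path (hypothetical helper)."""
    # The with-block closes the file even if pickle.load raises an error.
    with open(path, 'rb') as f:
        return pickle.load(f)

# Possible use with the files from this notebook:
# VOICE_toks = load_pickle('VOICE_tokenized.p')
```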

# Comparing the Two Data Sets

## Comparing Utterance Lengths
On average, utterances in the BNC are longer than those in VOICE: the average utterance length is 10.028 tokens in the BNC and 8.583 tokens in VOICE.

In [6]:
#Return a list of integers, one per utterance, giving the number of tokens in that utterance
def utterance_lengths(dictionary):
    lengths = []
    for file in dictionary:
        for key in dictionary[file]:
            l = len(dictionary[file][key])
            lengths.append(l)
    return lengths

In [7]:
#Average utterance length in VOICE (non-native speakers)
VOICE_utterance_lengths = utterance_lengths(VOICE_toks)
sum(VOICE_utterance_lengths)/len(VOICE_utterance_lengths)

8.582855347973343

In [8]:
#Average utterance length in BNC (native speakers)
BNC_utterance_lengths = utterance_lengths(BNC_toks)
sum(BNC_utterance_lengths)/len(BNC_utterance_lengths)

10.027982402143955
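
A mean can be pulled up by a small number of very long utterances, so it may be worth looking at the median alongside it. The sketch below runs on a made-up dictionary and assumes the same `{file: {utterance_id: [tokens]}}` layout that `utterance_lengths` iterates over; `length_summary` and the toy data are hypothetical and not part of the analysis.

```python
from statistics import mean, median

def length_summary(dictionary):
    """Return (mean, median) utterance length for a nested corpus dictionary."""
    lengths = [len(tokens)
               for utterances in dictionary.values()
               for tokens in utterances.values()]
    return mean(lengths), median(lengths)

# Made-up data in the assumed {file: {utterance_id: [tokens]}} layout
toy = {
    'file1': {'u1': ['hello', 'there'], 'u2': ['yeah']},
    'file2': {'u1': ['i', 'think', 'so', 'too', 'really', 'yes']},
}
avg_len, median_len = length_summary(toy)
print(avg_len, median_len)
```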

## Comparing Words and Bigrams
Looking at the most frequent words and bigrams reveals only a few subtle differences between the two corpora, so they may not be extremely useful for comparing native and non-native speakers. However, I only looked at the 50 most frequent bigrams, so I will probably look into this more. Other bigrams that aren't in this list could be helpful; a moderately common bigram in VOICE that is nonexistent in BNC could indicate that a person is not a native speaker. Bigrams could also be useful in comparing the L1 groups in the Vienna-Oxford International Corpus of English.
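
The idea of a bigram that is moderately common in one corpus but absent from the other can be sketched as a simple filter over two frequency tables. This is an illustration only: it uses `collections.Counter` (which `nltk.FreqDist` subclasses) on made-up utterances, and the names `count_bigrams` and `bigrams_missing_from` and the `min_count` threshold are assumptions, not part of the analysis.

```python
from collections import Counter

def count_bigrams(utterances):
    """Count adjacent token pairs within each utterance."""
    counts = Counter()
    for tokens in utterances:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def bigrams_missing_from(source_counts, other_counts, min_count=2):
    """Bigrams seen at least min_count times in source and never in other."""
    return [(bigram, n) for bigram, n in source_counts.most_common()
            if n >= min_count and bigram not in other_counts]

# Made-up utterances standing in for the two corpora
toy_voice = [['i', 'i', 'think'], ['i', 'i', 'mean'], ['i', 'think', 'so']]
toy_bnc = [['i', 'think', 'so'], ['i', 'mean', 'it']]

missing = bigrams_missing_from(count_bigrams(toy_voice), count_bigrams(toy_bnc))
print(missing)
```

A threshold such as `min_count` keeps one-off bigrams, which are likely to be noise, out of the result.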

### Comparing Words
Just by looking at the 50 most frequent words in the two corpora, there doesn't seem to be a huge difference between the common words in VOICE and the common words in BNC. One small difference between the two is that VOICE has 'er', 'erm', and 'hh' among the most frequent tokens. In fact, 'er' is the second most frequent token in VOICE, while it has a lower frequency ranking in BNC.

In [9]:
#Returns a list of all of the words from the corpus/dictionary
def get_words(dictionary):
    words = []
    for file in dictionary:
        for key in dictionary[file]:
            words.extend(dictionary[file][key])
    return words

In [10]:
VOICE_words = get_words(VOICE_toks)

In [11]:
BNC_words = get_words(BNC_toks)

In [12]:
VOICE_word_freqs = nltk.FreqDist(VOICE_words)

In [13]:
BNC_word_freqs = nltk.FreqDist(BNC_words)

In [14]:
#Use a freqdist to get the 50 most frequent words in each corpus
VOICE_most_common = VOICE_word_freqs.most_common(50)
BNC_most_common = BNC_word_freqs.most_common(50)

In [15]:
#Print the most frequent words in the two corpora side-by-side (VOICE is in the left column, BNC is in the right column)
index = 0
while index < len(VOICE_most_common):
    print(VOICE_most_common[index], '\t\t', BNC_most_common[index])
    index += 1

('the', 25148) 		 ('the ', 393381)
('er', 19846) 		 ('and ', 250917)
('i', 16392) 		 ('i ', 235255)
('and', 15075) 		 ('to ', 224282)
('it', 14197) 		 ('you ', 204805)
('to', 13780) 		 ('a ', 197649)
('you', 13745) 		 ("'s ", 188842)
('yeah', 11614) 		 ('of ', 169443)
('that', 11307) 		 ('that ', 149614)
('we', 10596) 		 ('it ', 131515)
("'s", 10578) 		 ('in ', 130367)
('a', 10160) 		 ('it', 119305)
('of', 9822) 		 ("n't ", 117817)
('in', 9712) 		 ('is ', 86448)
('is', 9352) 		 ('we ', 74748)
('mhm', 8131) 		 ('that', 72881)
('but', 6743) 		 ('was ', 72775)
('have', 6636) 		 ('i', 71967)
('so', 6529) 		 ('on ', 70371)
('this', 6069) 		 ('have ', 66723)
('[', 5304) 		 ('they ', 64022)
(']', 5304) 		 ('for ', 63343)
('do', 5212) 		 ('yeah', 62068)
('for', 4929) 		 ('you', 61105)
('okay', 4618) 		 ('but ', 61070)
('yes', 4497) 		 ('er ', 60972)
('erm', 4449) 		 ('what ', 58560)
('they', 4339) 		 ('be ', 55666)
('no', 4226) 		 ('so ', 53129)
("n't", 4213) 		 ('he ', 51580)
('not', 4093) 		
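
The raw counts above appear to come from corpora of quite different sizes, so relative frequencies may be easier to compare than counts. A rough sketch on made-up tokens; `per_thousand` is a hypothetical helper and the numbers it prints here say nothing about the real corpora.

```python
from collections import Counter

def per_thousand(counts):
    """Convert raw counts to occurrences per 1,000 tokens."""
    total = sum(counts.values())
    return {word: 1000 * n / total for word, n in counts.items()}

# Made-up tokens, not taken from VOICE or BNC
toy_counts = Counter(['the', 'er', 'the', 'i', 'er', 'the', 'yeah', 'i'])
rates = per_thousand(toy_counts)
for word, _ in toy_counts.most_common(3):
    print(word, round(rates[word], 1))
```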

### Comparing Bigrams
Many of the common bigrams are shared between the two corpora. However, there are a few interesting common bigrams in VOICE. In VOICE, there are several bigrams that repeat the same word, like ('i', 'i') and ('the', 'the'). These seem to come from utterances that contain stuttering or hesitations, which could possibly be used

In [16]:
#Uses NLTK's bigram function to get the bigrams from the corpus.
#Returns a list of all the bigrams
def get_bigrams(dictionary):
    bigrams = []
    for file in dictionary:
        for key in dictionary[file]:
            pairs = list(nltk.bigrams(dictionary[file][key]))
            bigrams.extend(pairs)
    return bigrams

In [17]:
VOICE_bigrams = get_bigrams(VOICE_toks)

In [18]:
BNC_bigrams = get_bigrams(BNC_toks)

In [19]:
VOICE_bigram_freqs = nltk.FreqDist(VOICE_bigrams)

In [20]:
BNC_bigram_freqs = nltk.FreqDist(BNC_bigrams)

In [21]:
#Use freqdists to get the 50 most frequent bigrams
VOICE_most_common = VOICE_bigram_freqs.most_common(50)
BNC_most_common = BNC_bigram_freqs.most_common(50)

In [22]:
#Print the most frequent bigrams in the two corpora side-by-side (VOICE is in the left column, BNC is in the right column)
index = 0
while index < len(VOICE_most_common):
    print(VOICE_most_common[index], '\t\t', BNC_most_common[index])
    index += 1

(('it', "'s"), 6104) 		 (('it', "'s "), 64813)
(('that', "'s"), 2679) 		 (('that', "'s "), 43246)
(('do', "n't"), 2384) 		 (('do', "n't "), 39746)
(('i', 'think'), 2251) 		 (('in ', 'the '), 34104)
(('in', 'the'), 2087) 		 (('of ', 'the '), 33454)
(('of', 'the'), 1917) 		 (('i', "'m "), 24995)
(('we', 'have'), 1693) 		 (('i ', 'think '), 22604)
(('this', 'is'), 1549) 		 (('on ', 'the '), 19677)
(('you', 'know'), 1526) 		 (('i ', 'do'), 19270)
(('i', "'m"), 1500) 		 (('i', "'ve "), 17626)
(('have', 'to'), 1417) 		 (('you', "'re "), 17135)
(('i', 'mean'), 1384) 		 (('to ', 'the '), 16575)
(('you', 'have'), 1326) 		 (("'ve ", 'got '), 16499)
(('yeah', 'yeah'), 1264) 		 (('there', "'s "), 16343)
(('i', 'do'), 1254) 		 (('it ', 'was '), 15613)
(('the', 'the'), 1206) 		 (("'s ", 'a '), 15461)
(('and', 'then'), 1143) 		 (('and ', 'i '), 15408)
(('er', 'er'), 1123) 		 (('you ', 'know '), 15236)
(('if', 'you'), 1106) 		 (('to ', 'be '), 14952)
(('and', 'er'), 1096) 		 (('i ', 'mean '), 14701)
(
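
Repeated-word bigrams such as ('the', 'the') and ('er', 'er') can also be counted directly rather than read off the top of the list. A minimal sketch on made-up utterances; `repetition_rate` is a hypothetical name, and the share of bigrams that are repetitions is only one possible way to quantify this.

```python
def repetition_rate(utterances):
    """Share of within-utterance bigrams whose two tokens are identical."""
    total = 0
    repeated = 0
    for tokens in utterances:
        for first, second in zip(tokens, tokens[1:]):
            total += 1
            if first == second:
                repeated += 1
    # Avoid dividing by zero when no utterance has two or more tokens
    return repeated / total if total else 0.0

# Made-up utterances, not taken from either corpus
toy_utterances = [['the', 'the', 'meeting'], ['yeah', 'yeah'], ['i', 'think', 'so']]
toy_rate = repetition_rate(toy_utterances)
print(toy_rate)
```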

### Comparing Stop Word Use

In [23]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

In [24]:
#Return a list of the tokens in word_list that are stop words (duplicates are kept)
def get_stopwords(word_list):
    stop_list = []
    for w in word_list:
        if w in stopWords:
            stop_list.append(w)
    return stop_list

#### Proportion of stop words

In [25]:
#Proportion of stop words in VOICE (non-native speakers)
VOICE_stop = get_stopwords(VOICE_words)
len(VOICE_stop)/len(VOICE_words)

0.48718556623933323

In [26]:
#Proportion of stop words in BNC (native speakers)
BNC_stop = get_stopwords(BNC_words)
len(BNC_stop)/len(BNC_words)

0.0798851434861071
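
The BNC proportion is far lower than the VOICE one, but many BNC tokens in the frequency lists above are printed with a trailing space (for example 'the ' next to 'it'), and a token like 'the ' will not match the stop word 'the'. The sketch below shows the effect of stripping whitespace before the membership test. It uses a tiny made-up stop list and made-up tokens, so it says nothing about what the real BNC proportion would be; `stop_word_proportion` is a hypothetical helper.

```python
def stop_word_proportion(tokens, stop_words, strip=False):
    """Proportion of tokens found in stop_words, optionally after stripping whitespace."""
    if strip:
        tokens = [t.strip() for t in tokens]
    hits = sum(1 for t in tokens if t in stop_words)
    return hits / len(tokens)

# Made-up stop list and tokens; some tokens carry a trailing space
toy_stop_words = {'the', 'it', 'and'}
toy_tokens = ['the ', 'it', 'and ', 'dog ', 'the']

raw_share = stop_word_proportion(toy_tokens, toy_stop_words)
stripped_share = stop_word_proportion(toy_tokens, toy_stop_words, strip=True)
print(raw_share, stripped_share)
```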

#### Common stop words

In [27]:
VOICE_stop_freqs = nltk.FreqDist(VOICE_stop)

In [28]:
BNC_stop_freqs = nltk.FreqDist(BNC_stop)

In [29]:
#Use the list of stop words from the corpora and freqdists to get the 15 most frequent stop words in each corpus
VOICE_most_common = VOICE_stop_freqs.most_common(15)
BNC_most_common = BNC_stop_freqs.most_common(15)

In [30]:
#Print the most common stop words for each corpus side-by-side (VOICE is in the left column, BNC is in the right column)
index = 0
while index < len(VOICE_most_common):
    print(VOICE_most_common[index], '\t\t', BNC_most_common[index])
    index += 1

('the', 25148) 		 ('it', 119305)
('i', 16392) 		 ('that', 72881)
('and', 15075) 		 ('i', 71967)
('it', 14197) 		 ('you', 61105)
('to', 13780) 		 ('do', 48598)
('you', 13745) 		 ('there', 33245)
('that', 11307) 		 ('we', 32951)
('we', 10596) 		 ('they', 31664)
('a', 10160) 		 ('no', 28468)
('of', 9822) 		 ('he', 23090)
('in', 9712) 		 ('is', 16740)
('is', 9352) 		 ('did', 16221)
('but', 6743) 		 ('what', 16204)
('have', 6636) 		 ('she', 13622)
('so', 6529) 		 ('now', 10075)
