# Preprocessing of the Switchboard Dialog Act Corpus

This notebook preprocesses the data in the Switchboard Dialog Act Corpus (SwDA). To generate natural-language questions, we use SwDA question structures as a reference point. The code concentrates on the POS tags provided by the spaCy library. Since we want to generate polar questions for the 20 Questions game, we explore the POS templates of yes/no questions (marked with the act tag "qy" in the SwDA corpus).

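The five most frequent patterns show that yes/no questions typically follow the POS-tag pattern **VERB**-**PRON**-**?**: four of the five start with VERB, PRON, and the fifth contains the same pair after a leading PROPN. For our purpose, this gives us question structures such as: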


|POS-Tags|Example|  
|---|---|
|VERB, PRON, VERB, DET, NOUN|"Does it have (a) tail?"|
|VERB, PRON, VERB, ADP, DET|"Does it live in a ...?"|
|VERB, PRON, ADJ|"Is it big?"| 
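
To make the table concrete, here is a minimal sketch that checks whether a tag sequence starts with one of these templates. The names `TEMPLATES` and `matching_template` are hypothetical and not part of the notebook, and the tag list in the example is written by hand rather than produced by spaCy.

```python
# Hypothetical helper, not part of the original notebook.
# Templates from the table above, stored as tuples of POS tags.
TEMPLATES = [
    ("VERB", "PRON", "VERB", "DET", "NOUN"),  # "Does it have (a) tail?"
    ("VERB", "PRON", "VERB", "ADP", "DET"),   # "Does it live in a ...?"
    ("VERB", "PRON", "ADJ"),                  # "Is it big?"
]


def matching_template(tags, templates=TEMPLATES):
    """Return the first template that is a prefix of `tags`, or None."""
    tags = tuple(tags)
    for template in templates:
        if tags[:len(template)] == template:
            return template
    return None


# Hand-written tag sequence for "Is it big?" (assumed tagging, for illustration).
print(matching_template(["VERB", "PRON", "ADJ"]))
```

Storing the templates as tuples keeps them hashable, so they can be compared directly with the keys of a `Counter` of tag prefixes.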

In [2]:
import swda
from swda import CorpusReader
from collections import Counter
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')
corpus = CorpusReader('swda')

In [3]:
"""This piece of code creates a dataframe with useful information for the upcoming question-generation part."""
    
len_qy = 0

# indicate column names, this is the information we need
df = pd.DataFrame(columns=['Index', 'Tag','Text','POS'])

# iterate over swda transcripts and append information to dataframe
for trans in corpus.iter_transcripts():
    for utt in trans.utterances:
        if utt.act_tag == "qy":
            df.loc[len_qy] = [utt.utterance_index, utt.act_tag, utt.text, utt.pos]
            len_qy += 1

# get information about dataframe structure
df.info()

transcript 1155


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3788 entries, 0 to 3787
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Index   3788 non-null   object
 1   Tag     3788 non-null   object
 2   Text    3788 non-null   object
 3   POS     3788 non-null   object
dtypes: object(4)
memory usage: 148.0+ KB
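
Appending rows one at a time with `df.loc` works for the 3788 matching utterances, but it tends to get slow on larger selections. A minimal sketch of an alternative collects plain dictionaries first, so the dataframe can be built once at the end. `collect_rows` and the `Utt` stand-in are hypothetical; the attribute names are the ones used in the cell above.

```python
from collections import namedtuple

# Hypothetical stand-in for an SwDA utterance, so the sketch runs without the corpus.
Utt = namedtuple("Utt", ["utterance_index", "act_tag", "text", "pos"])


def collect_rows(utterances, act_tag="qy"):
    """Return one dict per utterance whose act tag equals `act_tag`."""
    return [
        {"Index": u.utterance_index, "Tag": u.act_tag, "Text": u.text, "POS": u.pos}
        for u in utterances
        if u.act_tag == act_tag
    ]


sample = [
    Utt(83, "qy", "Were you --", "Were/VBD you/PRP --/:"),
    Utt(84, "sd", "I was.", "I/PRP was/VBD ./."),  # made-up non-question utterance
]
rows = collect_rows(sample)
print(rows)
```

With the real corpus, passing the collected list to `pd.DataFrame(rows)` should give the same four columns, with a default range index instead of the manually counted one.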


In [4]:
"""We now add a new column with spaCy POS tags to our dataframe so that we can compare them
    with the tags that ship with the corpus. The spaCy tags are much easier to process later on."""

pos_tags=[]

# access the text column of each row and create POS-Tags via SpaCy
for row in df["Text"]: 
    doc = nlp(row)
    tags = []
    for token in doc: 
        tags.append(token.pos_) 
    pos_tags.append(tags)

    
"""Cleaning step to remove POS-Tags from our list that are not informative,
    i.e. "SPACE" = blank spaces, "SYM" = other symbols, "PUNCT" = punctuation symbols, 
    "X" = other and "INTJ" = interjections."""

clean = ["SPACE", "SYM", "PUNCT", "X", "INTJ"]

for item in pos_tags:
    for element in clean: 
        while element in item: 
            item.remove(element)     
      
# create new column containing spaCy POS-Tags
df["Spacy_POS"]= pos_tags

# convert datatype of new column to string
df['Spacy_POS']= df['Spacy_POS'].astype(str)

# this is what our dataframe looks like after adding the spaCy POS tags and performing the cleaning step
df.head()

| |Index|Tag|Text|POS|Spacy_POS|
|---|---|---|---|---|---|
|0|83|qy|Were you --|Were/VBD you/PRP --/:|['VERB', 'PRON']|
|1|46|qy|Are you in Texas? /|Are/VBP you/PRP in/IN Texas/NNP ?/.|['VERB', 'PRON', 'ADP', 'PROPN']|
|2|29|qy|I probably would have done, {D you know, } jus...|I/PRP probably/RB would/MD have/VB done/VBN ,/...|['PRON', 'ADV', 'VERB', 'VERB', 'VERB', 'NOUN'...|
|3|1|qy|Are you a Vietnam veteran, Dudley? \<Music>. /|Are/VBP you/PRP a/DT Vietnam/NNP veteran/NN ,/...|['VERB', 'PRON', 'DET', 'PROPN', 'NOUN', 'PROP...|
|4|5|qy|Do you have family who were in the Vietnam War? /|Do/VBP you/PRP have/VB family/NN who/WP were/V...|['VERB', 'PRON', 'VERB', 'NOUN', 'PRON', 'VERB...|
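
The cleaning step can also be written as a single pass over each tag list. The sketch below does that and, as an optional extra, maps `AUX` to `VERB`. That mapping is an assumption on our side: in the output above, auxiliaries such as "Are" and "Do" are tagged `VERB`, whereas more recent spaCy English models may tag them as `AUX`, which would change the resulting patterns. `normalize_tags` is a hypothetical name, not part of the notebook.

```python
# Tags the notebook treats as uninformative.
UNINFORMATIVE = {"SPACE", "SYM", "PUNCT", "X", "INTJ"}


def normalize_tags(tags, merge_aux=True):
    """Drop uninformative tags; optionally relabel AUX as VERB (assumption, see text)."""
    kept = [tag for tag in tags if tag not in UNINFORMATIVE]
    if merge_aux:
        kept = ["VERB" if tag == "AUX" else tag for tag in kept]
    return kept


# Hand-written example: tags as a newer model might produce them
# for "Does it have a tail?" (assumed tagging, for illustration).
print(normalize_tags(["AUX", "PRON", "VERB", "DET", "NOUN", "PUNCT"]))
```

Setting `merge_aux=False` keeps the tags exactly as the tagger produced them, apart from the removed categories.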


In [5]:
"""We want to know how the sentences start, i.e. their general structure.
    This piece of code counts the unique patterns of POS tags (the first five tags of each sentence) found in the corpus."""

pos_5 = [] #using first 5 pos tags of every sentence 

for tag in pos_tags:
    pos_5.append(tag[:5])
    
# count unique sentence-POS combinations
unique_pos = Counter([tuple(i) for i in pos_5])

"""Let's have a look at the most frequent POS-Tag templates"""
   
most_used_5 = sorted(unique_pos, key=unique_pos.get, reverse=True)[:5]

print("We have", len(unique_pos), "unique POS-Tag combinations.")

# sort the (count, pattern) pairs from unique_pos in descending order of frequency
freq = sorted(((v,k) for k,v in unique_pos.items()), reverse=True) 

# create frequency dataframe 
freq_df = pd.DataFrame(freq, columns=["Occurrences", "POS_Tags"])
print("\nThe five most used combinations are:\n", freq_df.head())

We have 1332 unique POS-Tag combinations.

The five most used combinations are:
    Occurrences                         POS_Tags
0          153    (VERB, PRON, VERB, DET, NOUN)
1           90                     (VERB, PRON)
2           74  (VERB, PRON, CCONJ, VERB, PRON)
3           69   (PROPN, VERB, PRON, VERB, DET)
4           61               (VERB, PRON, VERB)
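
`Counter` can also return the ranking directly through `most_common`, which avoids the manual sort. Below is a minimal sketch on hand-made tag lists. `top_prefixes` is a hypothetical helper; patterns with equal counts come back in first-seen order, which can differ from the tie order of the sort used above.

```python
from collections import Counter


def top_prefixes(tag_lists, prefix_len=5, n=5):
    """Count the first `prefix_len` tags of each list and return the `n` most common."""
    counts = Counter(tuple(tags[:prefix_len]) for tags in tag_lists)
    return counts.most_common(n)


# Toy data, not taken from the corpus.
toy = [
    ["VERB", "PRON", "VERB", "DET", "NOUN", "ADP"],
    ["VERB", "PRON", "VERB", "DET", "NOUN"],
    ["VERB", "PRON"],
]
print(top_prefixes(toy, n=2))
```

Each returned item is a `(pattern, count)` pair, so the result can be passed to `pd.DataFrame` with the column order `["POS_Tags", "Occurrences"]`.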
