# Homework 1

Problem Description:

In this homework you will analyze presidential debate texts. Specifically, you will try to identify rhetorical questions in the debate text. We will operate under the assumption that a rhetorical question is a sentence that is a question but is not answered by another speaker. In the context of this debate text, we will assume that a rhetorical question is any question that occurs within a single speaker's response.

So for example, the following would be considered rhetorical for our purposes:

I have one last part here. I decided I was dumb and didn't understand it, so I called the "Who's Who" of the folks that have been around it. ==An example rhetorical question ==> And I said, "Why won't everybody go south?" They say, "It would be disruptive." ==An example rhetorical question ==> I said, "For how long?" I finally got them up for 12 to 15 years. ==An example rhetorical question ==> And I said, "Well, how does it stop being disruptive?" And that is, when their jobs come up from $1 an hour to $6 an hour, and ours go down to $6 an hour, then it's leveled again. But in the meantime, you've wrecked the country with these kinds of deals. We've got to cut it out.

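A key task here will be to iterate through all the debates and segment the speech into sentences. Then identify the sentences that are questions or contain questions. For each such sentence, the only thing to check is whether it falls at the end of a speaker's turn: a question followed by more speech from the same speaker counts as rhetorical, while a question that ends the turn does not.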
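The check described above can be sketched as follows. This is a minimal illustration rather than a reference solution: `split_sentences` is a hypothetical regex-based stand-in for a real sentence tokenizer such as NLTK's Punkt, `rhetorical_candidates` is likewise a made-up name, and the sketch assumes that one speaker's turn is already available as a single string.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break on whitespace that follows
    '.', '!' or '?', optionally followed by a closing quote."""
    parts = re.split(r'(?<=[.!?])\s+|(?<=[.!?]["\'])\s+', text.strip())
    return [p for p in parts if p]


def rhetorical_candidates(turn_text):
    """Return (question, following sentence) pairs for every question
    in the turn that is NOT the final sentence of the turn."""
    sentences = split_sentences(turn_text)
    return [
        (sentence, sentences[k + 1])
        for k, sentence in enumerate(sentences[:-1])  # skip the last sentence
        if '?' in sentence
    ]


turn = ('I said, "For how long?" I finally got them up for 12 to 15 years. '
        'Is that fair?')
for question, follow_up in rhetorical_candidates(turn):
    print(question, '->', follow_up)
```

The final sentence of the turn is skipped on purpose: a question that ends a turn may be answered by the next speaker, so it is not counted.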

For stretch goals that could count as extra credit, you can try to identify who asked the rhetorical question (which would require tracking which part of the debate you are in). You could also try to identify what the response to the question was. You could further examine the ways in which rhetorical questions might differ from other statements.

In [1]:
import nltk.data
from os import listdir
from os.path import isfile, join
from nltk.util import bigrams 
from nltk.tokenize import TreebankWordTokenizer
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
treebank_tokenizer = TreebankWordTokenizer()

dir_base = "C:/Users/23140/Desktop/f19_ds_nlp-master/homeworks/homework_1/data/"

def read_file(filename):
    # Use a context manager so the file handle is closed after reading.
    with open(filename, encoding='utf-8') as input_file:
        return input_file.read()
    
def read_directory_files(directory):
    file_texts = []
    files = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in files:
        file_text = read_file(join(directory, f))
        ##print(file_text)
        file_texts.append({"file":f, "content": file_text })
    return file_texts
    
text_corpus = read_directory_files(dir_base)

In [2]:
def process_debate(debate_object):
    ## Read from the function argument, not the global loop variable `debate`
    debate_name = debate_object["file"]
    debate_text = debate_object["content"]
    
    possible_rhetorical_questions = []
    
    ## Deal with debates with '.' after each speaker's name
    if debate_name in ['80_debate_1','84_debate_1','92_debate_1','96_debate_1']:
        debaters = ['President Bush.','Governor Clinton.','GOVERNOR REAGAN.','THE PRESIDENT.',
                    'The President.','Mr. Mondale.','Senator Dole.']
        punkt_sentences = sentence_tokenizer.tokenize(debate_text)
        for i in range(len(punkt_sentences)):  ## Iterate through each sentence
            if punkt_sentences[i] in debaters:  ## The beginning of a speaker's content
                ## Stop one short of the end so punkt_sentences[j+1] always exists
                for j in range(i+1,len(punkt_sentences)-1):
                    tokenized_j = treebank_tokenizer.tokenize(punkt_sentences[j])
                    if '?' in tokenized_j:  ## Find a question sentence
                        tokenized_j_next = treebank_tokenizer.tokenize(punkt_sentences[j+1])
                        ## Next, check if the next sentence is a statement and has the same speaker
                        if (len(tokenized_j_next) != 3 and '?' not in tokenized_j_next and 
                            punkt_sentences[j+1] != 'Q.' and 'Q.' not in tokenized_j_next):
                            possible_rhetorical_questions.append((debate_name,punkt_sentences[i],punkt_sentences[j],punkt_sentences[j+1]))
                    elif len(tokenized_j) == 3:  ## The end of a speaker's content
                        i = j
                        break

    ## Deal with debates with ':' after each speaker's name
    if debate_name in ['80_debate_2','88_debate_1']:
        speakers_list = []
        punkt_sentences = sentence_tokenizer.tokenize(debate_text)
        for i in range(len(punkt_sentences)):  ## Iterate through each sentence to find all speakers' names
            tokenized_i = treebank_tokenizer.tokenize(punkt_sentences[i])
            if ':' in tokenized_i: 
                if tokenized_i.index(':') == 1:
                    speakers_list.append(tokenized_i[0])
        for i in range(len(punkt_sentences)):  ## Iterate again to find all rhetorical questions
            tokenized_i = treebank_tokenizer.tokenize(punkt_sentences[i])
            if tokenized_i[0] in speakers_list:
                for j in range(i+1,len(punkt_sentences)-1):  ## Stop one short so punkt_sentences[j+1] always exists
                    tokenized_j = treebank_tokenizer.tokenize(punkt_sentences[j])
                    if '?' in tokenized_j:
                        tokenized_j_next = treebank_tokenizer.tokenize(punkt_sentences[j+1])
                        if ('?' not in tokenized_j_next and ':' not in tokenized_j_next):
                            possible_rhetorical_questions.append((debate_name,tokenized_i[0],punkt_sentences[j],punkt_sentences[j+1]))
                    elif ':' in tokenized_j:
                        i = j
                        break
    
    
    return possible_rhetorical_questions

In [3]:
all_rhetorical_questions = []
for debate in text_corpus:
    all_rhetorical_questions.append(process_debate(debate))

for i in range(len(all_rhetorical_questions)):
    print('There are '+str(len(all_rhetorical_questions[i]))+' potential rhetorical questions in this debate.')
    print('\n')
    for j in range(len(all_rhetorical_questions[i])):
        print(all_rhetorical_questions[i][j])
    print('\n\n')

There are 6 potential rhetorical questions in this debate.


('80_debate_1', 'GOVERNOR REAGAN.', 'I talked to a man just briefly there who asked me one simple question: "Do I have reason to hope that I can someday take care of my family again?', 'Nothing has been done."')
('80_debate_1', 'GOVERNOR REAGAN.', 'The question would be, "Have you any ideas of what you would do if you were there?"', 'And I said, well, yes.')
('80_debate_1', 'GOVERNOR REAGAN.', 'And I think that anyone that\'s seeking this position, as well as other people, probably, have thought to themselves, "What about this, what about that?"', 'These are just ideas of what I would think of if I were in that position and had access to the information, in which I would know all the options that were open to me.')
('80_debate_1', 'GOVERNOR REAGAN.', 'Second — the one that says, "Well, tell me, what are some of those ideas?"', 'First of all, I would be fearful that I might say something that was presently under way or in nego

Remarks:
1. Method Description:
For our purposes, a rhetorical question is a question followed by a statement from the same speaker. My approach is therefore to iterate through each sentence while keeping track of the speaker, and to collect those questions that are NOT at the end of a speaker's content.
2. Implementation Characteristics:
To judge whether a new speaker has started, I first extract the list of all possible speakers' names. Then, if the sentence right after a question matches the name list, I regard that as a change of speakers. One thing to mention is that I assumed the majority of rhetorical questions appear in the two debaters' words, because the hosts and questioners in the debate were mostly interested in finding out what the two debaters thought about certain problems as potential presidents. As a result, the number of rhetorical questions in my output might be smaller than in other students' outputs.
3. Output Analysis:
My output follows the format (debate name, speaker name, rhetorical question, potential answer), which includes the parts for extra credit.
After checking all the potential rhetorical questions in my output manually, I found that more than 95% of them are exactly what I want, i.e. they follow both the form (a question followed by a statement from the same speaker) and the logic (the statement is the answer to the question).
4. Further Improvement:
First, I will look at the "bad" candidates in my output and make my algorithm more adaptable to different forms of sentences and texts.
Second, I will try to consider all speakers, including hosts, and see whether they have rhetorical questions. After that, I will test my assumption and modify it if necessary.
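
The speaker tracking described in the remarks above could be sketched as follows. This is an illustrative sketch, not the code that produced the output: it assumes every turn starts on its own line with an upper-case label followed by a colon (for example `REAGAN:`), and the names `SPEAKER_LABEL` and `split_turns` are hypothetical. It keeps every speaker, including hosts, so that the debaters-only assumption can be tested afterwards.

```python
import re

# Assumed label format: capital letters, spaces, or periods at the start
# of a line, ending in a colon (e.g. "REAGAN:" or "MR. SMITH:").
SPEAKER_LABEL = re.compile(r'^([A-Z][A-Z .]*):\s*(.*)$')


def split_turns(transcript):
    """Group transcript lines into (speaker, text) turns.

    Lines without a speaker label are treated as a continuation of the
    current turn; unlabeled lines before the first label are dropped.
    """
    turns = []  # each entry is [speaker, list of text pieces]
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LABEL.match(line)
        if match:
            turns.append([match.group(1), [match.group(2)]])
        elif turns:
            turns[-1][1].append(line)
    return [(speaker, ' '.join(p for p in pieces if p))
            for speaker, pieces in turns]


sample = (
    "MODERATOR: Governor, your response?\n"
    "REAGAN: Why is that? Because it costs more.\n"
    "It always has.\n"
    "MODERATOR: Thank you.\n"
)
for speaker, text in split_turns(sample):
    print(speaker, '|', text)
```

Because each turn carries its speaker, any question found inside a turn can be attributed directly, without comparing neighbouring sentences against a name list.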

Best wishes,
Guangji