## Toucan AI mini-project: Question Classifier

In [2]:
import spacy 
nlp = spacy.load('en_core_web_sm')

In [5]:
import nltk
nltk.download('punkt')

I am building the question classifier from heuristics, using four conditions:

1) Whether the first word of the sentence is one of the interrogative words

2) Whether the sentence contains a '?' mark

3) Whether the second word or the last word of the sentence is one of the wh_words

4) If the sentence contains 'what':
   - Check for ',' or '.' immediately before 'what' using spaCy POS tags
   - Check for a preposition or a past-tense verb immediately before 'what' using spaCy POS tags
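
The first three conditions need only a list of tokens, so they can be sketched without any NLP library. The snippet below is a simplified illustration rather than the classifier itself: `first_three_rules` and the shortened word sets are hypothetical, and the tokens are assumed to be already split and lowercased.

```python
# Shortened word sets for illustration only (assumed; the notebook uses longer lists)
INTERROGATIVES = {"who", "what", "which", "how", "do", "did", "is", "are", "can"}
WH_WORDS = {"who", "where", "why", "when", "what", "which", "whose", "whom"}


def first_three_rules(tokens):
    """Return the number of the first matching condition (1-3), or 0 if none match."""
    if not tokens:
        return 0
    if tokens[0] in INTERROGATIVES:  # condition 1: interrogative first word
        return 1
    if "?" in tokens:  # condition 2: question mark anywhere
        return 2
    # condition 3: wh-word in second or last position
    if (len(tokens) > 1 and tokens[1] in WH_WORDS) or tokens[-1] in WH_WORDS:
        return 3
    return 0


print(first_three_rules(["did", "it", "rain"]))                 # 1
print(first_three_rules(["it", "rained", ",", "right", "?"]))   # 2
print(first_three_rules(["in", "which", "case", "did", "he"]))  # 3
print(first_three_rules(["it", "rained", "all", "day", "."]))   # 0
```

Returning the condition number instead of a plain 0/1 is only a convenience here: it shows which rule fired for each example.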

The list of POS tags is as follows, with examples of what each POS stands for:

- **CC** coordinating conjunction
- **CD** cardinal number
- **DT** determiner
- **EX** existential there (like: “there is” … think of it like “there exists”)
- **FW** foreign word
- **IN** preposition/subordinating conjunction
- **JJ** adjective ‘big’
- **JJR** adjective, comparative ‘bigger’
- **JJS** adjective, superlative ‘biggest’
- **LS** list marker 1
- **MD** modal could, will
- **NN** noun, singular ‘desk’
- **NNS** noun plural ‘desks’
- **NNP** proper noun, singular ‘Harrison’
- **NNPS** proper noun, plural ‘Americans’
- **PDT** predeterminer ‘all the kids’
- **POS** possessive ending parent’s
- **PRP** personal pronoun I, he, she
- **PRP$** possessive pronoun my, his, hers
- **RB** adverb very, silently
- **RBR** adverb, comparative better
- **RBS** adverb, superlative best
- **RP** particle give up
- **TO** to, as in go ‘to’ the store
- **UH** interjection, errrrrrrrm
- **VB** verb, base form take
- **VBD** verb, past tense took
- **VBG** verb, gerund/present participle taking
- **VBN** verb, past participle taken
- **VBP** verb, sing. present, non-3rd person take
- **VBZ** verb, 3rd person sing. present takes
- **WDT** wh-determiner which
- **WP** wh-pronoun who, what
- **WP$** possessive wh-pronoun whose
- **WRB** wh-adverb where, when

In [6]:
# List of interrogative words: question words, auxiliaries, and negated auxiliaries
# Note: nltk.word_tokenize splits contractions ("didn't" -> "did", "n't"),
# so the first token of such a sentence is the bare auxiliary
interogative_words = ["who", "where", "why", "when", "what", "which", "how", "whose", "whom", 
                      "do", "does", "did", "am", "is", "are", "was", "were", "have", "has", 
                      "had", "can", "could", "shall", "should", "may", "will", "didn't",
                      "doesn't", "haven't", "isn't", "aren't", "can't", "couldn't", "wouldn't",
                      "won't", "shouldn't", "weren't", "wasn't", "hasn't", "hadn't"]

interogative_words = set(interogative_words)

# List of wh-words
wh_words = ["who", "where", "why", "when", "what", "which", "whose", "whom"]
wh_words = set(wh_words)

In [7]:
def get_double_tags(sentence):
    """
    Return the list of adjacent POS-tag pairs in the sentence,
    each pair joined with '-' (for example 'IN-WP')
    """
    doc = nlp(sentence)

    # fine-grained tag of every token, in sentence order
    pos = [token.tag_ for token in doc]

    # pair each tag with the tag that follows it
    return ["-".join(pair) for pair in zip(pos, pos[1:])]
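
The pairing step in `get_double_tags` can be seen in isolation on a hand-tagged sentence. This is a minimal sketch: `tag_bigrams` is a hypothetical helper, and the tags below are assigned by hand for illustration, not taken from spaCy output.

```python
def tag_bigrams(tags):
    """Join each tag with its successor, e.g. ['IN', 'WDT'] -> ['IN-WDT']."""
    return ["-".join(pair) for pair in zip(tags, tags[1:])]


# Hand-assigned tags for "In what year did it happen" (assumed, not spaCy output)
tags = ["IN", "WDT", "NN", "VBD", "PRP", "VB"]
pairs = tag_bigrams(tags)

print(pairs)              # ['IN-WDT', 'WDT-NN', 'NN-VBD', 'VBD-PRP', 'PRP-VB']
print("IN-WDT" in pairs)  # True
```

A sentence of n tokens yields n-1 pairs, so a preposition directly before 'what' shows up as a single `IN-WP` or `IN-WDT` entry that a membership test can find.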

In [11]:
def QuestionClassifier(sentence):
    """
    This function first tokenizes the sentence and then applies rules to
    check whether it is a question. Returns 1 for a question, 0 otherwise
    """
    tokens = nltk.word_tokenize(sentence.lower())

    # an empty line has no tokens and cannot be a question
    if not tokens:
        return 0

    # check if the first word of the sentence is one of the interrogative words
    if tokens[0] in interogative_words:
        ans = 1

    # check if the sentence contains a '?' mark
    elif "?" in tokens:
        ans = 1

    # for sentences like 'In which case did a German man claim....' &
    # 'One of FIS' agenda items was to force women to start doing what'
    # (the length check avoids an IndexError on one-token sentences)
    elif (len(tokens) > 1 and tokens[1] in wh_words) or tokens[-1] in wh_words:
        ans = 1

    # check if the sentence contains 'what' elsewhere, skipping the
    # first two and the last two tokens
    elif 'what' in tokens[2:-2]:
        double_tags = get_double_tags(sentence)

        # for sentences that contain ', what' and '. what'
        if ((',-WP' in double_tags) or (',-WDT' in double_tags) or 
             ('.-WP' in double_tags) or ('.-WDT' in double_tags)):
            ans = 1

        # for sentences that contain a preposition with 'what' such as 'in what', 'to what'
        # and a verb in past tense with 'what'
        elif (('IN-WP' in double_tags) or ('IN-WDT' in double_tags) or 
              ('VBN-WP' in double_tags) or ('VBD-WP' in double_tags) or
              ('VBD-WDT' in double_tags)):
            ans = 1
        else:
            ans = 0

    else:
        ans = 0

    return ans

In [22]:
# Read the input file, one sentence per line
with open('test-inputs.txt', 'r', encoding="utf8") as f:
    texts = [line.replace('\n', '') for line in f]

Open a new file and write the results, with 1 denoting 'Question' and 0 denoting 'Not a question'.

In [25]:
# Create a new text file and write one result per line
with open("results.txt", "w", encoding="utf8") as f:
    for text in texts:
        ans = QuestionClassifier(text)
        f.write(str(ans) + "\n")

In [58]:
# Check the results text file
with open('results.txt', 'r', encoding="utf8") as f:
    k = [line.replace('\n', '') for line in f]

In [59]:
len(k)

24295

The results text file has the same number of lines as the input file (24295 lines).
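
The same line-count check can be done programmatically instead of comparing numbers by eye. The sketch below is one possible way to do it: `count_lines` and `same_line_count` are hypothetical helpers, and the demo uses temporary files rather than the notebook's real data.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a UTF-8 text file without loading it all into memory."""
    with open(path, "r", encoding="utf8") as f:
        return sum(1 for _ in f)


def same_line_count(input_path, results_path):
    """Return True if both files have the same number of lines."""
    return count_lines(input_path) == count_lines(results_path)


# Demo on temporary files (stand-ins for the input and results files)
with tempfile.TemporaryDirectory() as tmp:
    inputs = os.path.join(tmp, "inputs.txt")
    results = os.path.join(tmp, "results.txt")
    with open(inputs, "w", encoding="utf8") as f:
        f.write("Is it raining?\nIt is raining.\n")
    with open(results, "w", encoding="utf8") as f:
        f.write("1\n0\n")
    print(same_line_count(inputs, results))  # True
```

With the notebook's files, the call would be `same_line_count('test-inputs.txt', 'results.txt')`.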