# Question Answering System

Our aim is to find the correct answer to the question a user asks. The user's input may differ slightly from the reference question stored in the training dataset.

In [1]:
import string
import nltk
from nltk.corpus import wordnet
import sys
import re
try:
    from Levenshtein.StringMatcher import StringMatcher as SequenceMatcher
except ImportError:
    from difflib import SequenceMatcher

In [2]:
training_questions1 = []
training_answers1 = []
testing_questions1 = []
testing_answers1 = []

# Each dataset alternates lines: a question, then its answer.
with open('training_dataset.txt') as training_dataset1:
    for count, sentence in enumerate(training_dataset1):
        if count % 2 == 0:
            training_questions1.append(sentence.rstrip('\n'))
        else:
            training_answers1.append(sentence.rstrip('\n'))

with open('test_dataset.txt') as testing_dataset:
    for count, sentence in enumerate(testing_dataset):
        if count % 2 == 0:
            testing_questions1.append(sentence.rstrip('\n'))
        else:
            # Lower-case the expected answers; the last line may lack a newline.
            testing_answers1.append(sentence.rstrip('\n').lower())

Let's first take a look at the answers extracted from the test dataset.

In [3]:
testing_answers1

['hi',
 'nothing much',
 'greetings',
 "i'm doing good",
 "i'm doing good",
 "i'm nameless",
 'anything that you want',
 'the date specified on your reservation',
 'whenever you want to',
 'i am your little assistant',
 'hi',
 'hi',
 "i don't have an address",
 "i don't have phone number either",
 "you can't call me, you can only talk to me here",
 "you can't call me, you can only talk to me here",
 'sure, what is their phone number',
 "sure, i'm getting you someone to talk with you",
 'take the metro and stop at the terminus, we are 5mins away from there',
 'you need to take the tgv then at paris train station you take the subway',
 'you need to take the tgv then at paris train station you take the subway',
 'none',
 'you need to take the tgv then at paris train station you take the subway',
 'you need to take the periph and drive for 30mins',
 'you need to take the periph and drive for 80mins',
 'you need to take the periph and drive for 40mins',
 'sure, about what',
 'sure, about wh

First, let us remove the stopwords from the dataset.
### Stopwords
Stopwords are the words that occur most frequently across all documents and carry little meaning of their own, such as 'is', 'are', 'the' and 'of'.  
Let's make a list of stopwords.

In [4]:
#stopwords = nltk.corpus.stopwords.words('english')
stopwords = ['of','to','is','the','are','at','i','if']
stopwords.extend(string.punctuation)
stopwords.append('')
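
As a quick illustration of what this list is used for, here is a minimal sketch that drops stopwords from a whitespace-tokenized sentence. The helper name `remove_stopwords` and the sample sentence are assumptions made for this example; the notebook itself filters stopwords inside the matching functions further down.

```python
import string

# Same stopword list as in the cell above.
stopwords = ['of', 'to', 'is', 'the', 'are', 'at', 'i', 'if']
stopwords.extend(string.punctuation)
stopwords.append('')

def remove_stopwords(sentence):
    """Return the lower-cased tokens of `sentence` that are not stopwords."""
    tokens = sentence.lower().split(' ')
    return [token for token in tokens if token not in stopwords]

print(remove_stopwords('What is the address of the hotel'))
# ['what', 'address', 'hotel']
```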

Now, there are a few approaches we can try here:  
1. Normal String matching
2. Token matching
3. Lemma matching
4. Partial String Matching
5. **LSA**
6. **Jaccard Coefficient**

### Normal String matching
This won't work because the new question may contain an extra space, or a word may be added or missing.  
Example:-  
1. How are you?
2. How are you today?
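
A minimal sketch of why exact matching is too strict. The function name `exact_match` and the one-item question list are assumptions for illustration only.

```python
def exact_match(query, known_questions):
    """Return True only if `query` equals a known question character for character."""
    return query in known_questions

known_questions = ['how are you']
print(exact_match('how are you', known_questions))        # True
print(exact_match('how are you today', known_questions))  # False (extra word)
print(exact_match('how  are you', known_questions))       # False (extra space)
```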

### Token matching
Token:- A word obtained by breaking a sentence into words.  
Example:-  
sentence = 'what are you doing'  
token = ['what','are','you','doing']  
Token matching behaves much like string matching, so it only gives results when the question asked matches a question in the training dataset word for word.
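
The sketch below shows token matching with a plain whitespace tokenizer. The names `tokenize` and `token_match` are assumptions for this example. It tolerates differences in spacing and letter case, but still fails as soon as one word is added.

```python
def tokenize(sentence):
    """Split a sentence into lower-case word tokens on whitespace."""
    return sentence.lower().split()

def token_match(query, question):
    """Return True if both sentences produce exactly the same token list."""
    return tokenize(query) == tokenize(question)

print(tokenize('what are you doing'))
# ['what', 'are', 'you', 'doing']
print(token_match('What  are you doing', 'what are you doing'))       # True
print(token_match('what are you doing today', 'what are you doing'))  # False
```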

### Lemma matching
Lemmatization:- There are two related techniques, *stemming* and *lemmatization*.  
###### Stemming vs Lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing each word to its base form correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.  
Example:- The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
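
To make the difference concrete, here is a toy sketch. Both helpers and the tiny lookup table are assumptions made for illustration; real stemmers and the WordNet lemmatizer used later in this notebook are far more elaborate.

```python
# Toy lookup table standing in for a real vocabulary (illustrative only).
TOY_LEMMAS = {'better': 'good', 'was': 'be', 'rooms': 'room'}

def crude_stem(word):
    """Chop a few common endings off a word, without any dictionary."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the toy table; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ('rooms', 'better'):
    print(word, crude_stem(word), toy_lemmatize(word))
# rooms room room
# better better good
```

Suffix stripping handles "rooms" but cannot relate "better" to "good", which is exactly the case where a dictionary look-up is needed.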

### Partial String Matching
We can find what percentage of the 'query' matches each question in the 'training dataset' and pick the closest question.
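
A minimal sketch of this idea using `difflib.SequenceMatcher` from the standard library. The helper name `match_percentage` and the sample questions are assumptions for illustration.

```python
from difflib import SequenceMatcher

def match_percentage(query, question):
    """Return how similar two strings are, as a percentage from 0 to 100."""
    return 100 * SequenceMatcher(None, query, question).ratio()

questions = ['how are you', 'what is your name']
query = 'how are you today'

# Pick the stored question with the highest match percentage.
best = max(questions, key=lambda question: match_percentage(query, question))
print(best)
# how are you
```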

### LSA
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. 

### Jaccard Coefficient
The Jaccard coefficient is a ratio that measures the similarity between two strings.  
a = number of distinct tokens shared by the 2 strings  
b = number of distinct tokens across the 2 strings  
$$\text{Jaccard Coefficient} = \dfrac{a}{b} $$
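
As a worked example, the sketch below computes the coefficient on whitespace tokens. The function name `jaccard_coefficient` is an assumption; the notebook's own version, further down, works on lemmatized tokens with stopwords removed.

```python
def jaccard_coefficient(sentence_a, sentence_b):
    """Jaccard coefficient of the sets of whitespace-separated tokens."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # both sentences are empty
    return len(tokens_a & tokens_b) / float(len(union))

# 3 shared tokens (how, are, you) out of 4 distinct tokens in total.
print(jaccard_coefficient('how are you', 'how are you today'))
# 0.75
```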

# What are we going to do?

Well, we are going to follow these steps:-

1. First, break the sentence into tokens.
2. Then lemmatize the tokens to get their base words.
3. Remove the tokens that are stopwords.
4. Find the Jaccard coefficient of the sentences.
5. If no answer is found, use partial_noun_lemma_match.
6. If an answer is still not found, use levenshtein_distance.
7. If no answer is found up to step 6, return 'None'.
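
The fallback structure of these steps can be sketched as a cascade of scoring stages. Everything in this block is a simplified assumption for illustration: `cascade_answer`, `token_jaccard` and `char_similarity` are hypothetical helpers, only two stages are shown, and no lemmatization or stopword removal is done.

```python
from difflib import SequenceMatcher

def token_jaccard(a, b):
    """Jaccard coefficient on whitespace tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / float(len(union)) if union else 0.0

def char_similarity(a, b):
    """Character-level similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()

def cascade_answer(query, questions, answers, stages):
    """Try each (scorer, threshold) stage in order; return the first confident answer."""
    for scorer, threshold in stages:
        scores = [scorer(query, question) for question in questions]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            return answers[best]
    return 'None'  # no stage was confident enough

questions = ['how are you', 'what is your name']
answers = ["i'm doing good", "i'm nameless"]
stages = [(token_jaccard, 0.6), (char_similarity, 0.4)]
print(cascade_answer('how are you today', questions, answers, stages))
# i'm doing good
```

Ordering the stages from strictest to most permissive means a loose character-level match is only consulted when the token-level match has nothing confident to offer.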

In [5]:
def get_pos(pos_tag):
    if pos_tag[1][0] == 'J':
        return (pos_tag[0], wordnet.ADJ)
    elif pos_tag[1][0] == 'V':
        return (pos_tag[0], wordnet.VERB)
    elif pos_tag[1][0] == 'R':
        return (pos_tag[0], wordnet.ADV)
    elif pos_tag[1][0] == 'N':
        return (pos_tag[0], wordnet.NOUN)
    else:
        return (pos_tag[0], wordnet.NOUN)

Some words cause problems, such as contractions ("i'm", "what's") and spelling variants ("check-in", "checkin", "wifi"). So, we rewrite those words to a common form. With more time, I might have used autocorrection code to fix misspelled words, but I preferred not to.

In [6]:
def correct_words(sentence):
    output = []
    for word in sentence.lower().split(' '):
        if word == "i'm":
            output.append('i am')
        elif word == "what's":
            output.append('what is')
        elif word == "where's":
            output.append('what is')
        elif word == 'check-out':
            output.append('check out')
        elif word == 'check-in':
            output.append('check in')
        elif word == 'checkin':
            output.append('check in')
        elif word == 'checkout':
            output.append('check out')
        elif word in ('wi-fi', 'wifi'):
            output.append('wi fi')
        elif word in ('yours', 'yourself', 'yourselves'):
            output.append('your')
        elif word in ('ours', 'ourself', 'ourselves', 'me', 'my', 'myself'):
            output.append('i')
        elif word in stopwords:
            pass
        else:
            word = re.sub('[^a-zA-Z]',' ',word).split(' ')[0]
            output.append(word)
    return ' '.join(output)
 

In [7]:
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
def data_cleaning(query_sentence,data_sentence):
    pos_a = map(get_pos, nltk.pos_tag(nltk.word_tokenize(query_sentence)))
    pos_b = map(get_pos, nltk.pos_tag(nltk.word_tokenize(data_sentence)))
    lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \
                    if token.lower().strip(string.punctuation) not in stopwords]
    lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \
                    if token.lower().strip(string.punctuation) not in stopwords]
    #print(lemmae_a)
    #print(lemmae_b)
    ratio = len(set(lemmae_a).intersection(lemmae_b)) / float(len(set(lemmae_a).union(lemmae_b)))
    return (ratio > 0.60,ratio)

I measured the accuracy up to this point and found that sentences like  
'how to get there from switzerland'  
are not matched. So, if a sentence is not matched in the previous stage, we match it via partial_noun_lemma_match.

In [8]:
def partial_noun_lemma_match(a, b):
    """Check if a and b are matches."""
    pos_a = map(get_pos, nltk.pos_tag(nltk.word_tokenize(a)))
    pos_b = map(get_pos, nltk.pos_tag(nltk.word_tokenize(b)))
    lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \
                    if pos == wordnet.NOUN and token.lower().strip(string.punctuation) not in stopwords]
    lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \
                    if pos == wordnet.NOUN and token.lower().strip(string.punctuation) not in stopwords]
    #if a == 'how to get there from switzerland':
    #    print(lemmae_a)
    #    print(lemmae_b)
    # Calculate Jaccard similarity
    try:
        ratio = len(set(lemmae_a).intersection(lemmae_b)) / float(len(set(lemmae_a).union(lemmae_b)))
        return (ratio >= 0.60,ratio)
    except ZeroDivisionError:
        # Neither sentence contains a noun, so there is nothing to compare.
        return (False, 0.0)

I wanted to improve the results on the second dataset. So, I explored the dataset and found that Levenshtein distance would do the task.  
### Levenshtein_distance
The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. 

For example (taken from Wikipedia), the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:  
kitten → sitten (substitution of "s" for "k")  
sitten → sittin (substitution of "i" for "e")  
sittin → sitting (insertion of "g" at the end).  
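
The example can be checked with a short dynamic-programming sketch. The function name `edit_distance` is an assumption for this illustration. Note that the `levenshtein_distance` function defined below does not return this raw edit count; it returns a similarity ratio between 0 and 1 computed by `SequenceMatcher`.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(edit_distance('kitten', 'sitting'))
# 3
```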

In [9]:
def levenshtein_distance(statement, other_statement):
    '''
    Return a similarity score between 0 and 1, rounded to two decimals,
    for the two statements after rewriting them with correct_words.
    Despite the function name, a higher value means a closer match.
    statement:- question user asked
    other_statement:- question in training_data
    '''
    statement = correct_words(statement)
    other_statement = correct_words(other_statement)    
    PYTHON = sys.version_info[0]
    if not statement or not other_statement:
        return 0

    if PYTHON < 3:
        statement_text = unicode(statement.lower())
        other_statement_text = unicode(other_statement.lower())
    else:
        statement_text = str(statement.lower())
        other_statement_text = str(other_statement.lower())

    similarity = SequenceMatcher(None,statement_text,other_statement_text)
    percent = int(round(100 * similarity.ratio())) / 100.0
    return percent

Combining Everything

In [10]:
# -*- coding: utf-8 -*-
"""
Created on Sun Feb 26 01:03:14 2017

@author: Prashant Singh 
@E-mail: prashant.rahul5@gmail.com
"""

import string
import nltk
from nltk.corpus import wordnet
import sys
import re

try:
    from Levenshtein.StringMatcher import StringMatcher as SequenceMatcher
except ImportError:
    from difflib import SequenceMatcher
    
def file_read(train_file,test_file):
    '''
    Read the file
    train_file :- Training_File
    test_file :- Testing_File
    return :- training_question,training_answers,testing_questions,testing_answers
    '''
    training_questions1 = []
    training_answers1 = []
    testing_questions1 = []
    testing_answers1 = []
    # Each file alternates lines: a question, then its answer.
    with open(train_file) as training_dataset1:
        for count, sentence in enumerate(training_dataset1):
            if count % 2 == 0:
                training_questions1.append(sentence.rstrip('\n').lower())
            else:
                training_answers1.append(sentence.rstrip('\n').lower())

    with open(test_file) as testing_dataset:
        for count, sentence in enumerate(testing_dataset):
            if count % 2 == 0:
                testing_questions1.append(sentence.rstrip('\n').lower())
            else:
                # rstrip keeps the final answer even when the file lacks a trailing newline.
                testing_answers1.append(sentence.rstrip('\n').lower())

            
    return training_questions1,training_answers1,testing_questions1,testing_answers1

def argmax(array):
    index = 0
    maximum = -999
    for i in range(len(array)):
        if array[i] > maximum:
            maximum = array[i]
            index = i
    return index

def get_pos(pos_tag):
    '''
    return:- tag of word
    '''
    if pos_tag[1][0] == 'J':
        return (pos_tag[0], wordnet.ADJ)
    elif pos_tag[1][0] == 'V':
        return (pos_tag[0], wordnet.VERB)
    elif pos_tag[1][0] == 'R':
        return (pos_tag[0], wordnet.ADV)
    elif pos_tag[1][0] == 'N':
        return (pos_tag[0], wordnet.NOUN)
    else:
        return (pos_tag[0], wordnet.NOUN)
        
def correct_words(sentence):
    '''
    Change words to desired words
    sentence :- input sentence
    return :- correct_sentence
    '''
    output = []
    for word in sentence.lower().split(' '):
        if word == "i'm":
            output.append('i am')
        elif word == "what's":
            output.append('what is')
        elif word == "where's":
            output.append('what is')
        elif word == 'check-out':
            output.append('check out')
        elif word == 'check-in':
            output.append('check in')
        elif word == 'checkin':
            output.append('check in')
        elif word == 'checkout':
            output.append('check out')
        elif word in ('wi-fi', 'wifi'):
            output.append('wi fi')
        elif word in ('yours', 'yourself', 'yourselves'):
            output.append('your')
        elif word in ('ours', 'ourself', 'ourselves', 'me', 'my', 'myself'):
            output.append('i')
        elif word in stopwords:
            pass
        else:
            word = re.sub('[^a-zA-Z]',' ',word).split(' ')[0]
            output.append(word)
    return ' '.join(output)

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def data_cleaning(query_sentence,data_sentence):
    '''
    This function finds the Jaccard Coefficient of query_sentence and data_sentence
    query_sentence:- question user asked
    data_sentence:- question in training_data 
    return:- (ratio > 0.6,ratio)
    '''
    pos_a = map(get_pos, nltk.pos_tag(nltk.word_tokenize(query_sentence)))
    pos_b = map(get_pos, nltk.pos_tag(nltk.word_tokenize(data_sentence)))
    lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \
                    if token.lower().strip(string.punctuation) not in stopwords]
    lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \
                    if token.lower().strip(string.punctuation) not in stopwords]
    #print(lemmae_a)
    #print(lemmae_b)
    ratio = len(set(lemmae_a).intersection(lemmae_b)) / float(len(set(lemmae_a).union(lemmae_b)))
    return (ratio > 0.60,ratio)
    
    
def levenshtein_distance(statement, other_statement):
    '''
    Return a similarity score between 0 and 1, rounded to two decimals,
    for the two statements after rewriting them with correct_words.
    Despite the function name, a higher value means a closer match.
    statement:- question user asked
    other_statement:- question in training_data
    '''
    statement = correct_words(statement)
    other_statement = correct_words(other_statement)    
    PYTHON = sys.version_info[0]
    if not statement or not other_statement:
        return 0

    if PYTHON < 3:
        statement_text = unicode(statement.lower())
        other_statement_text = unicode(other_statement.lower())
    else:
        statement_text = str(statement.lower())
        other_statement_text = str(other_statement.lower())

    similarity = SequenceMatcher(None,statement_text,other_statement_text)
    percent = int(round(100 * similarity.ratio())) / 100.0
    return percent


def partial_noun_lemma_match(a, b):
    '''
    This function finds the Jaccard Coefficient of a and b using noun tokens only
    a:- question user asked
    b:- question in training_data
    return:- (ratio >= 0.6,ratio)
    '''
    pos_a = map(get_pos, nltk.pos_tag(nltk.word_tokenize(a)))
    pos_b = map(get_pos, nltk.pos_tag(nltk.word_tokenize(b)))
    lemmae_a = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_a \
                    if pos == wordnet.NOUN and token.lower().strip(string.punctuation) not in stopwords]
    lemmae_b = [lemmatizer.lemmatize(token.lower().strip(string.punctuation), pos) for token, pos in pos_b \
                    if pos == wordnet.NOUN and token.lower().strip(string.punctuation) not in stopwords]
    #if a == 'how to get there from switzerland':
    #    print(lemmae_a)
    #    print(lemmae_b)
    # Calculate Jaccard similarity
    try:
        ratio = len(set(lemmae_a).intersection(lemmae_b)) / float(len(set(lemmae_a).union(lemmae_b)))
        return (ratio >= 0.60,ratio)
    except ZeroDivisionError:
        # Neither sentence contains a noun, so there is nothing to compare.
        return (False, 0.0)





In [11]:

def head_function(training_questions,training_answers,testing_questions,testing_answers):
    '''
    This function calls all the matching functions and counts the queries answered correctly.
    It first calls data_cleaning. If no answer is found it calls partial_noun_lemma_match,
    and if there is still no answer it calls levenshtein_distance.
    If no answer is found up to this stage, the answer is 'None'.
    return:- count of answers correctly identified.
    '''
    count = 0
    for i,target_sentence in enumerate(testing_questions):
        target_sentence = target_sentence.lower()
        print('----------------')
        answer = ''
        # Stage 1: Jaccard coefficient against every training question.
        ratios = []
        for sentence in training_questions:
            sentence = sentence.lower()
            ratios.append(data_cleaning(target_sentence, sentence)[1])

        if(ratios[argmax(ratios)] > 0.6):
            answer = training_answers[argmax(ratios)]
            print(target_sentence,answer,testing_answers[i])
        else:
            # Stage 2: replace the last ratio with the noun-only score of the last
            # training question, then accept the best ratio if it is >= 0.6.
            noun_match = partial_noun_lemma_match(target_sentence, sentence)
            ratios[-1] = noun_match[1] if noun_match else 0.0
            if(ratios[argmax(ratios)] >= 0.6):
                answer = training_answers[argmax(ratios)]
                print(target_sentence,answer,testing_answers[i])
            else:
                # Stage 3: character-level similarity against every training question.
                distance = [levenshtein_distance(target_sentence,sentence)
                            for sentence in training_questions]
                if(max(distance) >= 0.4):
                    answer = training_answers[argmax(distance)]
                    print(target_sentence,answer,testing_answers[i])
                else:
                    answer = 'None'
        if(answer == testing_answers[i]):
            count +=1
    return count

In [12]:
stopwords = ['of','to','is','the','are','at','if','am']
stopwords.extend(string.punctuation)
stopwords.append('')

In [13]:
print('Output is printed in following form:-')
print('(asked_question,our_answer,real_answer)')
training_questions,training_answers,testing_questions,testing_answers = \
                                file_read('training_dataset.txt','test_dataset.txt')
count_of_correct_answers = head_function(training_questions,training_answers,testing_questions,testing_answers)
print()
print('**************************************************************************')
print('Number of Correct answers:-',count_of_correct_answers)
print('**************************************************************************')

Output is printed in following form:-
(asked_question,our_answer,real_answer)
----------------
('hey', 'hi', 'hi')
----------------
("what's up", 'nothing much', 'nothing much')
----------------
('greetings', 'greetings', 'greetings')
----------------
('how are you', "i'm doing good", "i'm doing good")
----------------
('how are you doing today', "i'm doing good", "i'm doing good")
----------------
("what's your name", "i'm nameless", "i'm nameless")
----------------
('what can you do', 'anything that you want', 'anything that you want')
----------------
('when may i check in', 'the date specified on your reservation', 'the date specified on your reservation')
----------------
('when will i be able to i check out', 'whenever you want to', 'whenever you want to')
----------------
('who are you', 'i am your little assistant', 'i am your little assistant')
----------------
('hi', 'hi', 'hi')
----------------
('hello', 'hi', 'hi')
----------------
("what's your address", "i don't have an a

In [14]:
print('Output is printed in following form:-')
print('(asked_question,our_answer,real_answer)')
training_questions,training_answers,testing_questions,testing_answers = \
                                file_read('training_dataset_2.txt','test-data.txt')
count_of_correct_answers = head_function(training_questions,training_answers,testing_questions,testing_answers)
print('**************************************************************************')
print('Number of Correct Answers:-',count_of_correct_answers)
print('**************************************************************************')

Output is printed in following form:-
(asked_question,our_answer,real_answer)
----------------
('how can we book a room? ', 'hello, you can book online on hyphen.ai, on our mobile app, by giving us a call on 555 800 4567, or by email on reservations@hyphen.ai. do not hesitate to let us know if we can be of any other help. best wishes', 'hello, you can book online on hyphen.ai, on our mobile app, by giving us a call on 555 800 4567, or by email on reservations@hyphen.ai. do not hesitate to let us know if we can be of any other help. best wishes')
----------------
('i want to book a room', 'hello, you can book online on hyphen.ai, on our mobile app, by giving us a call on 555 800 4567, or by email on reservations@hyphen.ai. do not hesitate to let us know if we can be of any other help. best wishes', 'hello, you can book online on hyphen.ai, on our mobile app, by giving us a call on 555 800 4567, or by email on reservations@hyphen.ai. do not hesitate to let us know if we can be of any oth


********************************************************************  
Correctly Identified Answers in set 1:- $\frac{32}{33}$

Correctly Identified Answers in set 2:- $\frac{86}{93}$
********************************************************************  

# Wrong Answers in testing set
The test set has a wrong answer for the following question:   
'can i book tee times online'  
If we correct the answer in the testing dataset, the number of correctly identified answers in set 2 becomes 87 out of 93.
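
For reference, the counts reported above can be turned into percentages with a few lines of arithmetic. The helper name `accuracy_percent` is an assumption; the counts are the ones stated in this notebook.

```python
def accuracy_percent(correct, total):
    """Return the share of correct answers as a percentage."""
    return 100.0 * correct / total

results = [('set 1', 32, 33),
           ('set 2', 86, 93),
           ('set 2 with the corrected test answer', 87, 93)]
for name, correct, total in results:
    print('%s: %.1f%%' % (name, accuracy_percent(correct, total)))
# set 1: 97.0%
# set 2: 92.5%
# set 2 with the corrected test answer: 93.5%
```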