                                             
                                             
                                             Company: I2Decisions
                                               Author : Melvin Roy V
                                                 Date   : 1/4/2017
                                          Topic  : Sentence Matching Algorithm
                                          
                                          

'''
NLP Basics

What is Natural Language Processing?

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

Applications:

Spam detection, part-of-speech tagging, named entity recognition (NER), sentiment analysis, parsing, machine translation,
information extraction, question answering, paraphrasing, summarization, dialog

Resources: 
1. https://www.youtube.com/watch?v=nfoudtpBV68 - Awesome videos by Professor Dan Jurafsky & Chris Manning
2. https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/
3. Natural Language Processing with Python - Steven Bird, Ewan Klein, Edward Loper
4. Python 3 Text Processing with NLTK 3 Cookbook - Jacob Perkins

'''

In [6]:
'''
NLP Basic Terminologies:

Tokenization : Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or 
other meaningful elements called tokens.

Source: https://en.wikipedia.org/wiki/Tokenization

'''

#Tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
sentence = "How to Contact Chamberlain Technical Support?"

print(sent_tokenize(sentence))
print(word_tokenize(sentence))


['How to Contact Chamberlain Technical Support?']
['How', 'to', 'Contact', 'Chamberlain', 'Technical', 'Support', '?']
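
To make the idea of tokenization concrete without NLTK, here is a minimal sketch that uses only the standard library. The helper `simple_word_tokenize` is hypothetical and is only a rough regex approximation; it is not the algorithm that `word_tokenize` uses, and it will differ on contractions, abbreviations, and similar cases.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks (rough sketch)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("How to Contact Chamberlain Technical Support?")
print(tokens)
```

For this particular sentence the sketch yields the same seven tokens as the NLTK output above.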


In [72]:
'''

NLP Basic Terminologies:

Stopwords : Stop words are words that are filtered out before or after processing of natural language data (text).

Source : https://en.wikipedia.org/wiki/Stop_words

'''

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "How to Contact Chamberlain Technical Support?"

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stopwords.
# Note: the membership test is case-sensitive, so "How" is kept even though "how" is a stopword.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

['How', 'to', 'Contact', 'Chamberlain', 'Technical', 'Support', '?']
['How', 'Contact', 'Chamberlain', 'Technical', 'Support', '?']
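
Notice that 'How' survives in the output above while 'to' is removed: the stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The small stopword set and the helper `remove_stopwords` are assumptions made for illustration; they are not the NLTK list or an NLTK function.

```python
# Hypothetical, hand-picked stopword set for illustration only (not the NLTK list)
SMALL_STOP_WORDS = {"how", "to", "the", "a", "is"}

def remove_stopwords(tokens, stop_words=SMALL_STOP_WORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stop_words]

word_tokens = ['How', 'to', 'Contact', 'Chamberlain', 'Technical', 'Support', '?']
filtered = remove_stopwords(word_tokens)
print(filtered)
```

Lowercasing only for the comparison keeps the original capitalization of the tokens that remain.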


In [77]:
'''

NLP Basic Terminologies:

Stemming : Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root
form, generally a written word form.

Source : https://en.wikipedia.org/wiki/Stemming

'''

from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["contact","contacted","contacting"]

# Each inflected form reduces to the same stem
for w in example_words:
    print(ps.stem(w))

contact
contact
contact


So let's start.

Today I am going to show you how a sentence matching algorithm works in the backend. I am going to compare the Api.ai algorithm with our Python code.

Suppose you have a sentence of interest and would like to match it against all the sentences in a corpus.

How would you go about it?

What issues will you face?
    a. Capitalization
    b. Punctuation
    c. White-spacing
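
As a quick illustration of these three issues, the sketch below compares a target string with one variant per issue, first with `==` and then after a simple normalization. The helper `normalize` and the example strings are assumptions made for this illustration, and the code uses only the Python 3 standard library.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

target = "Technical Support"
variants = {
    "capitalization": "technical support",
    "punctuation": "Technical Support?",
    "white-spacing": "Technical  Support",
}

for issue, variant in variants.items():
    exact = (target == variant)                             # plain string equality
    normalized = (normalize(target) == normalize(variant))  # equality after cleanup
    print(issue, exact, normalized)
```

Each variant fails the plain comparison and passes once both sides are normalized.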

Let's look at the example below.


In [9]:
#Exact Matching

target_sentence = "How to Contact Chamberlain Technical Support?"

sentences = ["How to Contact Chamberlain Technical Support?",
             "how to  contact Chamberlain technical support?",
             "Contact Chamberlain Technical Support?",
             "Contacting Chamberlain Technical Support?",
             "Chamberlain Technical Support",
             "Technical Support",
             "Tech Support",
             "Call technical support",
             "Chamberlain Technical Support",]


In [11]:
# Exact Match
def exact_match(a, b):
    """Check if a and b are matches."""
    return (a == b)

for sentence in sentences:
    print(exact_match(target_sentence, sentence), sentence)

(True, 'How to Contact Chamberlain Technical Support?')
(False, 'how to  contact Chamberlain technical support?')
(False, 'Contact Chamberlain Technical Support?')
(False, 'Contacting Chamberlain Technical Support?')
(False, 'Chamberlain Technical Support')
(False, 'Technical Support')
(False, 'Tech Support')
(False, 'Call technical support')
(False, 'Chamberlain Technical Support')


What is the solution?

Fuzzy Sentence Matching 

Fuzzy logic is an approach to computing based on "degrees of truth" rather than the usual "true or false" (1 or 0) Boolean logic on which the modern computer is based. 

Source: whatis.techtarget.com/definition/fuzzy-logic


Types:
    
1. Exact case-insensitive token match
2. Exact case-insensitive token match after stopword removal
3. Exact case-insensitive token match after stopword removal and stemming
4. Case-insensitive token set similarity after stopword removal (Jaccard similarity index)
5. Case-insensitive token set similarity after stopword removal and stemming (Jaccard similarity index)
   

In [15]:
# Exact Case-Insensitive Token Match 
import nltk
import string

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(string.punctuation)
stopwords.append('')
tokenizer = nltk.tokenize.TreebankWordTokenizer()

def is_ci_token_match(a, b):
    """Check if a and b are matches."""
    tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a)]
    tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b)]

    return (tokens_a == tokens_b)

for sentence in sentences:
    print(is_ci_token_match(target_sentence, sentence), sentence)

(True, 'How to Contact Chamberlain Technical Support?')
(True, 'how to  contact Chamberlain technical support?')
(False, 'Contact Chamberlain Technical Support?')
(False, 'Contacting Chamberlain Technical Support?')
(False, 'Chamberlain Technical Support')
(False, 'Technical Support')
(False, 'Tech Support')
(False, 'Call technical support')
(False, 'Chamberlain Technical Support')


In [16]:
# Exact Case-Insensitive Token Match After Removing Stopwords

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(string.punctuation)
stopwords.append('')
tokenizer = nltk.tokenize.TreebankWordTokenizer()

def is_ci_token_stopword_match(a, b):
    """Check if a and b are matches."""
    tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a) \
                    if token.lower().strip(string.punctuation) not in stopwords]
    tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b) \
                    if token.lower().strip(string.punctuation) not in stopwords]
    
    return (tokens_a == tokens_b)

for sentence in sentences:
    print(is_ci_token_stopword_match(target_sentence, sentence), sentence)

(True, 'How to Contact Chamberlain Technical Support?')
(True, 'how to  contact Chamberlain technical support?')
(True, 'Contact Chamberlain Technical Support?')
(False, 'Contacting Chamberlain Technical Support?')
(False, 'Chamberlain Technical Support')
(False, 'Technical Support')
(False, 'Tech Support')
(False, 'Call technical support')
(False, 'Chamberlain Technical Support')


In [17]:
# Exact Case-Insensitive Token Match After Removing Stopwords & Stemming

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(string.punctuation)
stopwords.append('')
tokenizer = nltk.tokenize.TreebankWordTokenizer()
stemmer = nltk.stem.snowball.SnowballStemmer('english')

def is_ci_token_stopword_stem_match(a, b):
    """Check if a and b are matches."""
    tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a) \
                    if token.lower().strip(string.punctuation) not in stopwords]
    tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b) \
                    if token.lower().strip(string.punctuation) not in stopwords]
    stems_a = [stemmer.stem(token) for token in tokens_a]
    stems_b = [stemmer.stem(token) for token in tokens_b]

    return (stems_a == stems_b)

for sentence in sentences:
    print(is_ci_token_stopword_stem_match(target_sentence, sentence), sentence)

(True, 'How to Contact Chamberlain Technical Support?')
(True, 'how to  contact Chamberlain technical support?')
(True, 'Contact Chamberlain Technical Support?')
(True, 'Contacting Chamberlain Technical Support?')
(False, 'Chamberlain Technical Support')
(False, 'Technical Support')
(False, 'Tech Support')
(False, 'Call technical support')
(False, 'Chamberlain Technical Support')


                                                
                                                #Approximate Sentence Matching


In the three methods above, we transformed or removed elements of the input sequences and then compared the resulting sequences for an exact match.

Now we are going to relax our rules a little. Instead of exact sequence matching, we will do a set similarity match. As the set similarity measure we can use the Jaccard similarity index, which is based on the simple set operations union and intersection.

Formula:
    
    
    J(A,B) = |A ∩ B| / |A ∪ B| , 0 <= J(A,B) <= 1
    

Example:

    Sentence 1: How to Contact Chamberlain Technical Support?
    Sentence 2: Contact Chamberlain Technical Support
    
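    Token sets (lowercased, punctuation stripped, stopwords kept):
    S1 = {how, to, contact, chamberlain, technical, support}
    S2 = {contact, chamberlain, technical, support}

    |S1 ∩ S2| = 4
    |S1 ∪ S2| = 6

    J(S1,S2) = 4/6 ≈ 0.67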
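
The worked example can be checked with a few lines of Python 3. This is a minimal sketch: the helper `jaccard` is a name chosen for illustration, the token lists are typed in by hand (lowercased, punctuation stripped, stopwords kept), and returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard index of two iterables, each treated as a set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(union)

s1 = ["how", "to", "contact", "chamberlain", "technical", "support"]
s2 = ["contact", "chamberlain", "technical", "support"]

score = jaccard(s1, s2)
print(round(score, 2))
```

The intersection has 4 tokens and the union has 6, so the score is 4/6.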


In [18]:
# Case-insensitive token set similarity after stopword removal using the Jaccard similarity index

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(string.punctuation)
stopwords.append('')

# Create tokenizer
tokenizer = nltk.tokenize.TreebankWordTokenizer()

def is_ci_token_stopword_set_match(a, b, threshold=0.5):
    """Check if a and b are matches."""
    tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a) \
                    if token.lower().strip(string.punctuation) not in stopwords]
    tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b) \
                    if token.lower().strip(string.punctuation) not in stopwords]

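    # Calculate Jaccard similarity; if both token sets are empty, report no match
    # instead of dividing by zero (a design choice)
    union = set(tokens_a).union(tokens_b)
    if not union:
        return False
    ratio = len(set(tokens_a).intersection(tokens_b)) / float(len(union))
    return (ratio >= threshold)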

for sentence in sentences:
    print(is_ci_token_stopword_set_match(target_sentence, sentence), sentence)


(True, 'How to Contact Chamberlain Technical Support?')
(True, 'how to  contact Chamberlain technical support?')
(True, 'Contact Chamberlain Technical Support?')
(True, 'Contacting Chamberlain Technical Support?')
(True, 'Chamberlain Technical Support')
(True, 'Technical Support')
(False, 'Tech Support')
(False, 'Call technical support')
(True, 'Chamberlain Technical Support')
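
To see why each sentence lands on one side of the 0.5 threshold, it helps to print the raw Jaccard score instead of a boolean. The sketch below is a standard-library approximation written for Python 3: `token_set`, `jaccard_score`, and the two-word stopword set are assumptions made for illustration, so the tokenization is not identical to the NLTK pipeline used in the cell above.

```python
import re

# Hypothetical minimal stopword set; enough to cover these example sentences
STOP_WORDS = {"how", "to"}

def token_set(text):
    """Lowercase the text, keep word tokens only, and drop stopwords."""
    return {t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS}

def jaccard_score(a, b):
    """Jaccard index of the token sets of two sentences."""
    set_a, set_b = token_set(a), token_set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

target = "How to Contact Chamberlain Technical Support?"
candidates = [
    "Contact Chamberlain Technical Support?",
    "Contacting Chamberlain Technical Support?",
    "Chamberlain Technical Support",
    "Technical Support",
    "Tech Support",
    "Call technical support",
]

scores = [(jaccard_score(target, c), c) for c in candidates]
for score, sentence in scores:
    print(f"{score:.2f}  {sentence}")
```

Under these assumptions "Technical Support" scores exactly 0.5 and passes only because the comparison is `>=`, while "Call technical support" scores 0.4 and "Tech Support" scores 0.2, so both fall below the threshold.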


In [19]:
# Case-insensitive token set similarity after stopword removal and stemming using the Jaccard similarity index

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(string.punctuation)
stopwords.append('')

# Create tokenizer and stemmer
tokenizer = nltk.tokenize.TreebankWordTokenizer()
stemmer = nltk.stem.snowball.SnowballStemmer('english')

def is_ci_stem_stopword_set_match(a, b, threshold=0.5):
    """Check if a and b are matches."""
    tokens_a = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(a) \
                    if token.lower().strip(string.punctuation) not in stopwords]
    tokens_b = [token.lower().strip(string.punctuation) for token in tokenizer.tokenize(b) \
                    if token.lower().strip(string.punctuation) not in stopwords]
    stems_a = [stemmer.stem(token) for token in tokens_a]
    stems_b = [stemmer.stem(token) for token in tokens_b]

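    # Calculate Jaccard similarity; if both stem sets are empty, report no match
    # instead of dividing by zero (a design choice)
    union = set(stems_a).union(stems_b)
    if not union:
        return False
    ratio = len(set(stems_a).intersection(stems_b)) / float(len(union))
    return (ratio >= threshold)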

for sentence in sentences:
    print(is_ci_stem_stopword_set_match(target_sentence, sentence), sentence)

(True, 'How to Contact Chamberlain Technical Support?')
(True, 'how to  contact Chamberlain technical support?')
(True, 'Contact Chamberlain Technical Support?')
(True, 'Contacting Chamberlain Technical Support?')
(True, 'Chamberlain Technical Support')
(True, 'Technical Support')
(False, 'Tech Support')
(False, 'Call technical support')
(True, 'Chamberlain Technical Support')


In [None]:
                                                        
    
                                                        Thank you
                                                
    