# To run this notebook, first install the libraries it uses
## Install each library with the command "pip install <library name>"

The libraries used are:
- pandas
- nltk
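
Before running the cells, it can help to confirm that both libraries are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The notebook imports pandas and nltk; an empty list means both are installed.
print(missing_packages(["pandas", "nltk"]))
```

Any name printed in the list still needs to be installed with pip.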
    
## Also make sure the paths to the files training_data.tsv and eval_data.txt are changed to match your setup.

# The approach used here is chunking, an NLP technique, implemented with NLTK.

In [1]:
import pandas as pd
df = pd.read_csv("/Users/harshpanwar/Desktop/phrase_extractor/training_data.tsv", sep="\t")

### View the first 5 rows of the dataframe

In [2]:
df.head()

Unnamed: 0,sent,label
0,Make remainder,Not Found
1,Set a reminder on date 23rd November'2016,Not Found
2,I need a daily wake up call,wake up call
3,remind me 6 pm today eveng,Not Found
4,Hi Pls to make one reminder for me,Not Found


In [3]:
len(df)

9819

## Remove the rows that don't have a reminder (label "Not Found")

In [4]:
df = df[df.label != 'Not Found']
df = df.reset_index(drop=True)

In [5]:
len(df)

5907

In [6]:
df.head()

Unnamed: 0,sent,label
0,I need a daily wake up call,wake up call
1,Remind me at 28 December for recharge,recharge
2,Can u pls remind me at 7pm on 8 Jan,on 8 Jan
3,What is my next reminder?,What
4,Set a reminder on 4 th Dec of going to meet so...,meet sonal miss


In [7]:
import nltk
from nltk.util import ngrams
#nltk.download('averaged_perceptron_tagger')
#nltk.download('tagsets')

In [8]:
sentence = df.sent[11]

In [9]:
sentence

'Remind me to buy eggs on next Monday and Tuesday at 9pm'

In [10]:
token = nltk.word_tokenize(sentence)

In [11]:
token

['Remind',
 'me',
 'to',
 'buy',
 'eggs',
 'on',
 'next',
 'Monday',
 'and',
 'Tuesday',
 'at',
 '9pm']

In [12]:
nltk.pos_tag(token)

[('Remind', 'VB'),
 ('me', 'PRP'),
 ('to', 'TO'),
 ('buy', 'VB'),
 ('eggs', 'NNS'),
 ('on', 'IN'),
 ('next', 'JJ'),
 ('Monday', 'NNP'),
 ('and', 'CC'),
 ('Tuesday', 'NNP'),
 ('at', 'IN'),
 ('9pm', 'CD')]

In [13]:
nltk.help.upenn_tagset("JJ")

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


In [14]:
grammar = ('''
    NP: {<DT>?<JJ>*<NN>} # NP
    ''')

In [15]:
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tagged

[('Remind', 'VB'),
 ('me', 'PRP'),
 ('to', 'TO'),
 ('buy', 'VB'),
 ('eggs', 'NNS'),
 ('on', 'IN'),
 ('next', 'JJ'),
 ('Monday', 'NNP'),
 ('and', 'CC'),
 ('Tuesday', 'NNP'),
 ('at', 'IN'),
 ('9pm', 'CD')]

In [16]:
tree = chunkParser.parse(tagged)

In [17]:
for subtree in tree.subtrees():
    print(subtree)

(S
  Remind/VB
  me/PRP
  to/TO
  buy/VB
  eggs/NNS
  on/IN
  next/JJ
  Monday/NNP
  and/CC
  Tuesday/NNP
  at/IN
  9pm/CD)


### To better visualize the chunking, run "tree.draw()" below, which draws the parse tree with the POS tag of each word.

In [18]:
#tree.draw()

In [19]:
#Significant words
sig_words = [s[0] for s in tagged if s[1].startswith('N') or s[1].startswith('V')]
sig_words

['Remind', 'buy', 'eggs', 'Monday', 'Tuesday']

In [20]:
n = 5
for i in range(1,n+1):
    out = list(ngrams(token, i))
    print(out,"\n")

[('Remind',), ('me',), ('to',), ('buy',), ('eggs',), ('on',), ('next',), ('Monday',), ('and',), ('Tuesday',), ('at',), ('9pm',)] 

[('Remind', 'me'), ('me', 'to'), ('to', 'buy'), ('buy', 'eggs'), ('eggs', 'on'), ('on', 'next'), ('next', 'Monday'), ('Monday', 'and'), ('and', 'Tuesday'), ('Tuesday', 'at'), ('at', '9pm')] 

[('Remind', 'me', 'to'), ('me', 'to', 'buy'), ('to', 'buy', 'eggs'), ('buy', 'eggs', 'on'), ('eggs', 'on', 'next'), ('on', 'next', 'Monday'), ('next', 'Monday', 'and'), ('Monday', 'and', 'Tuesday'), ('and', 'Tuesday', 'at'), ('Tuesday', 'at', '9pm')] 

[('Remind', 'me', 'to', 'buy'), ('me', 'to', 'buy', 'eggs'), ('to', 'buy', 'eggs', 'on'), ('buy', 'eggs', 'on', 'next'), ('eggs', 'on', 'next', 'Monday'), ('on', 'next', 'Monday', 'and'), ('next', 'Monday', 'and', 'Tuesday'), ('Monday', 'and', 'Tuesday', 'at'), ('and', 'Tuesday', 'at', '9pm')] 

[('Remind', 'me', 'to', 'buy', 'eggs'), ('me', 'to', 'buy', 'eggs', 'on'), ('to', 'buy', 'eggs', 'on', 'next'), ('buy', 'eggs',
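
`ngrams` slides a window of length `n` across the token list, so a list of `k` tokens yields `k - n + 1` tuples (12 tokens give 11 bigrams above). A minimal standard-library sketch of the same idea follows; `sliding_ngrams` is a hypothetical name used only for illustration and is not an NLTK function.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Remind", "me", "to", "buy", "eggs"]
print(sliding_ngrams(tokens, 2))
# [('Remind', 'me'), ('me', 'to'), ('to', 'buy'), ('buy', 'eggs')]
```

When `n` is larger than the number of tokens, the range is empty and the function returns an empty list.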

In [26]:
def phrase_extractor(sentence):
    # Tokenize, POS-tag and chunk the sentence into noun phrases (NP).
    words = nltk.word_tokenize(sentence)
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    parser = nltk.RegexpParser(grammar)
    t = parser.parse(nltk.pos_tag(words))
    a = [s for s in t.subtrees() if s.label() == "NP"]
    c = []    # candidate phrase built from each NP chunk
    num = []  # number of words kept in each candidate
    # Date, time and trigger words that are dropped from a candidate phrase.
    key  = ["monday","tuesday", "wednesday", "thursday","friday","saturday","sunday","today","tomorrow","yesterday", "reminder", "remind", "th", "pm","am"]
    for chunk in a:
        kept = [word for word, tag in chunk if word.lower() not in key]
        c.append(" ".join(kept))
        num.append(len(kept))
    # A candidate must keep at least two words to count as a reminder.
    if not c or max(num) <= 1:
        return "Not Found"
    # Return the first candidate with the most kept words.
    return c[num.index(max(num))]
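
The selection step inside `phrase_extractor` does not depend on NLTK and can be exercised on its own. The sketch below assumes the NP chunks have already been reduced to plain lists of words; `pick_phrase` and `KEY` are illustrative names, and the input chunks are hand-written, not taken from the dataset.

```python
# Date, time and trigger words that phrase_extractor drops from each chunk.
KEY = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday",
       "sunday", "today", "tomorrow", "yesterday", "reminder", "remind",
       "th", "pm", "am"}

def pick_phrase(chunks):
    """Mirror the selection step: filter keywords, keep the first longest chunk."""
    kept = [[w for w in chunk if w.lower() not in KEY] for chunk in chunks]
    if not kept or max(len(k) for k in kept) <= 1:
        return "Not Found"
    best = max(kept, key=len)  # max() returns the first of equally long chunks
    return " ".join(best)

print(pick_phrase([["the", "weekly", "meeting"], ["call"]]))  # the weekly meeting
print(pick_phrase([["Monday"], ["eggs"]]))                    # Not Found
```

The second call shows why single-word chunks never produce a reminder: once keywords are removed, no chunk keeps more than one word.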

### Now we use the phrase_extractor function created above to extract a phrase/reminder from a sentence

In [27]:
print(sentence,"\n")
print("Reminder extracted  :   ", phrase_extractor(sentence))

Remind me to buy eggs on next Monday and Tuesday at 9pm 

Reminder extracted  :    Not Found


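### No reminder was extracted here: "eggs" is tagged NNS and the day names NNP, so no word carries the NN tag that the grammar requires and no NP chunk is formed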

## We apply the same method to a single sentence (index 5) from the eval_data.txt dataset

In [28]:
with open("eval_data.txt", 'r') as f:
    lines = [line.rstrip('\n') for line in f]
print(lines[5])
print("Reminder extracted  :   ",phrase_extractor(lines[5]))

Reminder 11am tomorrorw buy bell for papu
Reminder extracted  :    tomorrorw buy


### We apply the function phrase_extractor to every line in the eval_data.txt dataset and store the output as a csv file

In [29]:
import csv
with open('eval_data.csv', mode='w', newline='') as csv_file:
    fields = ['sent', 'label']
    writer = csv.DictWriter(csv_file, fieldnames=fields)
    writer.writeheader()
    for i in range(len(lines)):
        writer.writerow({'sent':lines[i],'label':phrase_extractor(lines[i])})

### To evaluate the extractor, we create a new csv file "eval_data_accuracy.csv" with three fields: 'sent', 'Actual_label' and 'Predicted_label'. 'sent' holds the sentences from the 'sent' column of the dataframe df. 'Actual_label' holds the reminders already stored in the training dataset. 'Predicted_label' holds the reminders generated by applying phrase_extractor to each sentence.

In [30]:
with open('eval_data_accuracy.csv', mode='w', newline='', encoding = 'utf-8') as csv_file:
    fields = ['sent', 'Actual_label', 'Predicted_label']
    writer = csv.DictWriter(csv_file, fieldnames=fields)
    writer.writeheader()
    count = 0
    for i in range(len(df)):
        actual = str(df['label'][i])
        # Run the extractor once per row and reuse the result.
        predicted = phrase_extractor(str(df['sent'][i]))
        writer.writerow({'sent':df['sent'][i], 'Actual_label':df['label'][i], 'Predicted_label':predicted})
        
        if actual == predicted:
            count = count+1
            
print ("Accuracy = ", (count/len(df))*100, "%")

Accuracy =  1.946842728965634 %
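
The figure above counts a prediction only when it equals the label string exactly. A softer measure, shown here as a sketch and not as the notebook's method, is the token overlap (Jaccard similarity) between the predicted and the labelled phrase; `token_overlap` is a hypothetical helper.

```python
def token_overlap(predicted, actual):
    """Jaccard similarity between the lowercase word sets of two phrases."""
    p, a = set(predicted.lower().split()), set(actual.lower().split())
    if not p and not a:
        return 1.0  # two empty phrases are treated as identical
    return len(p & a) / len(p | a)

print(token_overlap("meet sonal", "meet sonal miss"))  # 2 shared words out of 3
print(token_overlap("Not Found", "recharge"))          # 0.0
```

Averaging this score over all rows would give partial credit to predictions that recover part of the labelled reminder.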


- The accuracy can be increased by changing the grammar considered above. The NN tag matches only singular common nouns, so plural nouns (NNS), proper nouns (NNP) and verbs such as "buy" never enter a chunk.
- The accuracy also depends on the labels in the dataset. A prediction counts as correct only when it matches the "label" value in training_data.tsv exactly, so a reasonable prediction is scored as wrong whenever the stored label is itself questionable (for example, the label "What" for "What is my next reminder?").
- By removing the rows labelled Not Found, we reduced the dataset from 9819 to 5907 rows, a decrease of about 40%. Every remaining row has a reminder, so each "Not Found" prediction counts as an error.
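
As a sketch of the first point, a broader chunk grammar can be tried on the tagged example sentence. The pattern below is an assumption chosen for illustration, not a tuned grammar: it accepts any noun tag (NN, NNS, NNP, ...) and an optional leading verb, and the chunk label `REM` is a made-up name. The tags are hard-coded from the `pos_tag` output shown earlier, so the snippet needs no tagger download.

```python
import nltk

# POS tags copied from the pos_tag output for the example sentence.
tagged = [("Remind", "VB"), ("me", "PRP"), ("to", "TO"), ("buy", "VB"),
          ("eggs", "NNS"), ("on", "IN"), ("next", "JJ"), ("Monday", "NNP"),
          ("and", "CC"), ("Tuesday", "NNP"), ("at", "IN"), ("9pm", "CD")]

# Hypothetical grammar: optional verb, optional determiner,
# any adjectives, then one or more nouns of any kind.
grammar = "REM: {<VB.*>?<DT>?<JJ>*<NN.*>+}"
tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the words of every REM chunk, left to right.
chunks = [[word for word, tag in st.leaves()]
          for st in tree.subtrees() if st.label() == "REM"]
print(chunks)
```

For this sentence the chunks should be "buy eggs", "next Monday" and "Tuesday". After the day names are dropped, as `phrase_extractor` does, "buy eggs" is the longest remaining phrase. Whether such a grammar raises the overall accuracy would have to be measured on the full dataset.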
