# NLP for Tio part 2

Recap:
 - Classifying words into parts of speech gives us the foundation. In the sentence "I am feeling very sad and lonely", sad is tagged as an adjective (JJ) and lonely as an adverb (RB). Note that RB is the tagger's label: lonely is really an adjective here, so the tags are not always right.
 
What next:
 - I have played around with combining the part-of-speech tags into larger pieces, to work out how to get some meaning from the text.
   - Option 1: looking at chunking.
   - Option 2: looking at linguistic structures.
  

In [1]:
# Import the modules and packages 

import os   # os is a module for navigating your machine (e.g., file directories).
import nltk # nltk stands for Natural Language Toolkit and is useful for text mining.
import re   # re is for regular expressions, which we use later.
from nltk import word_tokenize
print("1. Successfully imported necessary modules")
print("")

1. Successfully imported necessary modules



In [2]:
# Adjacent string literals are joined, so the line break adds no stray spaces to the text.
document = ("Jane moved out last week. I am feeling very sad and very lonely. I miss the little things, "
            "like how she leaves her coffee cup in the sink. I don't like these feelings.")
#I have left it as "Jane" rather than "she". Jane will be recognised as a Named Entity

In [3]:
document

"Jane moved out last week. I am feeling very sad and very lonely. I miss the little things, like how she leaves her coffee cup in the sink. I don't like these feelings."

In [4]:
#Turn text into tokens.
tokens = nltk.word_tokenize(document)
#Tag the tokens
tagged = nltk.pos_tag(tokens)
tagged

[('Jane', 'NNP'),
 ('moved', 'VBD'),
 ('out', 'RP'),
 ('last', 'JJ'),
 ('week', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('feeling', 'VBG'),
 ('very', 'RB'),
 ('sad', 'JJ'),
 ('and', 'CC'),
 ('very', 'RB'),
 ('lonely', 'RB'),
 ('.', '.'),
 ('I', 'PRP'),
 ('miss', 'VBP'),
 ('the', 'DT'),
 ('little', 'JJ'),
 ('things', 'NNS'),
 (',', ','),
 ('like', 'IN'),
 ('how', 'WRB'),
 ('she', 'PRP'),
 ('leaves', 'VBZ'),
 ('her', 'PRP$'),
 ('coffee', 'NN'),
 ('cup', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('sink', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('do', 'VBP'),
 ("n't", 'RB'),
 ('like', 'VB'),
 ('these', 'DT'),
 ('feelings', 'NNS'),
 ('.', '.')]

## Option 1: Chunking

Chunking takes the POS-tagged tokens and groups them into labelled multi-token sequences (chunks). There are different types of chunking. I simply tried noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases.

In [5]:
print(tagged)

[('Jane', 'NNP'), ('moved', 'VBD'), ('out', 'RP'), ('last', 'JJ'), ('week', 'NN'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('feeling', 'VBG'), ('very', 'RB'), ('sad', 'JJ'), ('and', 'CC'), ('very', 'RB'), ('lonely', 'RB'), ('.', '.'), ('I', 'PRP'), ('miss', 'VBP'), ('the', 'DT'), ('little', 'JJ'), ('things', 'NNS'), (',', ','), ('like', 'IN'), ('how', 'WRB'), ('she', 'PRP'), ('leaves', 'VBZ'), ('her', 'PRP$'), ('coffee', 'NN'), ('cup', 'NN'), ('in', 'IN'), ('the', 'DT'), ('sink', 'NN'), ('.', '.'), ('I', 'PRP'), ('do', 'VBP'), ("n't", 'RB'), ('like', 'VB'), ('these', 'DT'), ('feelings', 'NNS'), ('.', '.')]


In [6]:
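#Define a chunk grammar: an NP is an optional determiner (DT),
#then any number of adjectives (JJ), then one singular noun (NN).
grammar = "NP: {<DT>?<JJ>*<NN>}"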

In [7]:
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print(result)

(S
  Jane/NNP
  moved/VBD
  out/RP
  (NP last/JJ week/NN)
  ./.
  I/PRP
  am/VBP
  feeling/VBG
  very/RB
  sad/JJ
  and/CC
  very/RB
  lonely/RB
  ./.
  I/PRP
  miss/VBP
  the/DT
  little/JJ
  things/NNS
  ,/,
  like/IN
  how/WRB
  she/PRP
  leaves/VBZ
  her/PRP$
  (NP coffee/NN)
  (NP cup/NN)
  in/IN
  (NP the/DT sink/NN)
  ./.
  I/PRP
  do/VBP
  n't/RB
  like/VB
  these/DT
  feelings/NNS
  ./.)
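
The grammar above only matches singular common nouns (`<NN>`), which is why `things/NNS`, `feelings/NNS` and `Jane/NNP` are left out and `coffee cup` is split into two chunks. Below is a minimal sketch of a broader rule. The pattern and the names `broader_grammar`, `broader_cp` and `noun_phrases` are my own assumptions for illustration, and the tags are hand-copied from the output above so that no tagger model is needed.

```python
import nltk

# POS tags hand-copied from the tagger output above (third sentence only).
tagged_sentence = [
    ("I", "PRP"), ("miss", "VBP"), ("the", "DT"), ("little", "JJ"),
    ("things", "NNS"), (",", ","), ("like", "IN"), ("how", "WRB"),
    ("she", "PRP"), ("leaves", "VBZ"), ("her", "PRP$"), ("coffee", "NN"),
    ("cup", "NN"), ("in", "IN"), ("the", "DT"), ("sink", "NN"), (".", "."),
]

# Assumed rule: optional determiner or possessive pronoun, any adjectives,
# then one or more nouns of any kind (NN, NNS, NNP, NNPS).
broader_grammar = r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}"
broader_cp = nltk.RegexpParser(broader_grammar)
broader_result = broader_cp.parse(tagged_sentence)

# Collect the words of every NP subtree, in sentence order.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in broader_result.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

With this rule `her coffee cup` comes out as a single chunk, which is probably closer to what we want when looking for "the little things" a person mentions.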


In [8]:
result.draw()

In [9]:
#The tree for the whole paragraph (chunking1_para) is very flat, too long and messy.
#Let's try just one shorter sentence.

sentence = ("I am feeling very sad and very lonely.")
tokens1 = nltk.word_tokenize(sentence)
#Tag the tokens
tagged1 = nltk.pos_tag(tokens1)
tagged1

[('I', 'PRP'),
 ('am', 'VBP'),
 ('feeling', 'VBG'),
 ('very', 'RB'),
 ('sad', 'JJ'),
 ('and', 'CC'),
 ('very', 'RB'),
 ('lonely', 'RB'),
 ('.', '.')]

In [10]:
cp = nltk.RegexpParser(grammar)
result1 = cp.parse(tagged1)
print(result1)

(S
  I/PRP
  am/VBP
  feeling/VBG
  very/RB
  sad/JJ
  and/CC
  very/RB
  lonely/RB
  ./.)


In [11]:
result1.draw()

These are two examples of chunking sentences and drawing the result as a tree diagram. Note that the short sentence yields no NP chunks at all, because it contains no token tagged NN, so its tree is completely flat.
 - The chunking could be made more sophisticated using chunking regex rules (see ch7). Before we dive into that, we would need to determine what we want out of it.
 - We could also define what we want to exclude from a chunk. This is called chinking.
 - There is a section in ch8 about evaluating chunking. I am not sure whether it is relevant or how we would apply it.
  

## Option 2: Looking at linguistic structures

In [23]:
#define a grammar
from nltk import CFG

In [31]:
nltk.CFG

nltk.grammar.CFG

In [34]:
grammar?

Type:        str
String form: NP: {<DT>?<JJ>*<NN>}
Length:      20
Docstring:  
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.


In [37]:
grammar = nltk.grammar.CFG("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

TypeError: __init__() missing 1 required positional argument: 'productions'
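
The TypeError comes from calling the `CFG` constructor directly: it expects a start symbol and a list of productions, not a string. A sketch of one way around it is to hand the same text to `CFG.fromstring`, which parses the string into productions. The names `toy_grammar`, `toy_parser` and `toy_trees` are mine, and `ChartParser` is just one parser choice for checking that the grammar works.

```python
import nltk
from nltk import CFG

# Same grammar text as in the failing cell, parsed with CFG.fromstring.
toy_grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
print(toy_grammar.start(), len(toy_grammar.productions()))

# Quick check on a sentence whose words all appear in the grammar.
toy_parser = nltk.ChartParser(toy_grammar)
toy_trees = list(toy_parser.parse("Mary saw Bob".split()))
for tree in toy_trees:
    print(tree)
```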

In [41]:
#ok back to documentation, we're going to have to just find a grammar. Any grammar.

from nltk.parse.generate import generate, demo_grammar
grammar = CFG.fromstring(demo_grammar)
print(grammar)

Grammar with 13 productions (start state = S)
    S -> NP VP
    NP -> Det N
    PP -> P NP
    VP -> 'slept'
    VP -> 'saw' NP
    VP -> 'walked' PP
    Det -> 'the'
    Det -> 'a'
    N -> 'man'
    N -> 'park'
    N -> 'dog'
    P -> 'in'
    P -> 'with'


OK - we have a grammar. Ish.
NLTK defines a grammar as: 
 - A grammar is a compact characterization of a potentially infinite set of sentences; we say that a tree is well-formed according to a grammar, or that a grammar licenses a tree.
 - A grammar is a formal model for describing whether a given phrase can be assigned a particular constituent or dependency structure.

Will need to come back to this.

Also, I am not sure how chunking relates to grammars. How do the two concepts relate?

In [43]:
sr_parse = nltk.ShiftReduceParser(grammar)
sent = 'I am feeling very sad and very lonely'.split()
result2 = sr_parse.parse(sent)
result2
  

<generator object ShiftReduceParser.parse at 0x000002A5039AC780>
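
The cell above only shows a generator object because `parse` is lazy: nothing is parsed until we iterate over it. There is a second catch: the demo grammar has no rules for words like "I" or "lonely", so our sentence is not covered by it. The sketch below shows both points, using a sentence I made up from the demo grammar's own vocabulary. The names `demo_cfg`, `covered_sent`, `parsed_trees` and `our_sentence_covered` are assumptions for illustration.

```python
import nltk
from nltk import CFG
from nltk.parse.generate import demo_grammar

demo_cfg = CFG.fromstring(demo_grammar)
sr_parser = nltk.ShiftReduceParser(demo_cfg)

# A sentence built only from words the demo grammar knows.
covered_sent = "the man saw a dog".split()

# parse() returns a generator, so wrap it in list() to actually run the parser.
parsed_trees = list(sr_parser.parse(covered_sent))
for tree in parsed_trees:
    print(tree)

# Our own sentence uses words the grammar has no rules for.
our_sent = "I am feeling very sad and very lonely".split()
try:
    demo_cfg.check_coverage(our_sent)
    our_sentence_covered = True
except ValueError as err:
    our_sentence_covered = False
    print("Not covered:", err)
```

So to parse the sad-and-lonely sentence this way, we would first need a grammar whose terminals include those words (or their POS tags).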

In [None]:
#http://www.nltk.org/howto/generate.html
#http://www.nltk.org/book_1ed/ch08.html

In [15]:
tree1 = nltk.Tree('NP', ['Alice'])
print(tree1)



(NP Alice)


In [16]:
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
print(tree2)


(NP the rabbit)


In [None]:
tree2.draw()
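
The two small trees can be nested to build a whole sentence, which is the kind of structure a parser returns. A minimal sketch follows; the verb "chased" and the names `tree3` and `tree4` are illustrative choices of mine, not from our data.

```python
import nltk

tree1 = nltk.Tree('NP', ['Alice'])
tree2 = nltk.Tree('NP', ['the', 'rabbit'])

# A tree's children can be plain words or other trees.
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)

# Trees behave like lists: index into children, read labels, collect the words.
print(tree4[1].label())
print(tree4.leaves())
```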

In [None]:
#Other
# I think the subclauses Sanjay was talking about are covered in 9.3 - Extending a Feature based Grammar.