# NLP for Tio

In [1]:
# Import the modules and packages

import os    # os is a module for navigating your machine (e.g., file directories).
import nltk  # nltk stands for Natural Language Toolkit and is useful for text mining.
import re    # re is for regular expressions, available for any extra pre-processing.
from nltk import word_tokenize
print("1. Successfully imported necessary modules")
print("")


1. Successfully imported necessary modules



The stages to pre-process a text are:
 - 1) Raw text (the data is a string)
 - 2) Segmentation into tokens (the string is turned into a list of strings)
      - 2a) Some standardisation: turning to lower case and spell checking
      - This is also where any other pre-processing via RegEx would go, if and as needed.
 - 3) Part-of-speech tagging (we now have a list of tuples)
 - 4) Named Entity Recognition
  
It is possible to wrap several stages together; however, for these purposes I have done each stage one by one.
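
As a rough illustration of wrapping stages together, the sketch below chains stage functions in order. `run_pipeline` and the two stand-in stages are hypothetical names, not part of NLTK; in this notebook the real stages would be calls such as `nltk.word_tokenize` and `nltk.pos_tag`.

```python
def run_pipeline(data, stages):
    """Pass data through each stage in order; each stage receives the previous stage's output."""
    for stage in stages:
        data = stage(data)
    return data


# Stand-in stages (assumptions for illustration only).
def simple_tokenize(text):
    # Whitespace split; a real pipeline would use nltk.word_tokenize instead.
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


result = run_pipeline("She moved out", [simple_tokenize, lowercase])
print(result)
```

Keeping the stages as separate functions means each one can still be run and inspected on its own, as is done in the rest of this notebook.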
 

### 1) - start with some raw text as a string

In [3]:
document = ("She moved out last weeek. I am feeling vry sad and very lonely. I miss the little things, \
            like how she leaves her coffee cup in the sink and doesn't wash it up. I don't like these feelings.")
            # A few typos are deliberate.
            # "lonely" and "sad" are in one sentence, as that has featured in discussions.
            # "feeling" is deliberately used both as a verb and as a noun.
  

In [4]:
document

"She moved out last weeek. I am feeling vry sad and very lonely. I miss the little things,             like how she leaves her coffee cup in the sink and doesn't wash it up. I don't like these feelings."

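The output above shows a run of spaces after "things,", which comes from the indentation of the continued line inside the string literal. One way to tidy this, sketched here with the `re` module imported earlier, is to collapse each run of whitespace into a single space. The helper name `normalise_whitespace` is an assumption for illustration.

```python
import re


def normalise_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


sample = "I miss the little things,             like how she leaves her coffee cup"
cleaned = normalise_whitespace(sample)
print(cleaned)
```

Judging by the token list in the next step, the extra spaces do not seem to affect tokenisation, so this is mainly for readability.
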
### 2) Segmentation into tokens

In [5]:
# Break the document down into word and punctuation tokens. This gives us a list of strings.
tokens = nltk.word_tokenize(document) #[2]

In [6]:
tokens # Tokens are a list of strings.

['She',
 'moved',
 'out',
 'last',
 'weeek',
 '.',
 'I',
 'am',
 'feeling',
 'vry',
 'sad',
 'and',
 'very',
 'lonely',
 '.',
 'I',
 'miss',
 'the',
 'little',
 'things',
 ',',
 'like',
 'how',
 'she',
 'leaves',
 'her',
 'coffee',
 'cup',
 'in',
 'the',
 'sink',
 'and',
 'does',
 "n't",
 'wash',
 'it',
 'up',
 '.',
 'I',
 'do',
 "n't",
 'like',
 'these',
 'feelings',
 '.']

#### 2a) Some standardisation - specifically turning to lower case and then spell checking

In [7]:
tokens_lower = [word.lower() for word in tokens]
print(tokens_lower)

['she', 'moved', 'out', 'last', 'weeek', '.', 'i', 'am', 'feeling', 'vry', 'sad', 'and', 'very', 'lonely', '.', 'i', 'miss', 'the', 'little', 'things', ',', 'like', 'how', 'she', 'leaves', 'her', 'coffee', 'cup', 'in', 'the', 'sink', 'and', 'does', "n't", 'wash', 'it', 'up', '.', 'i', 'do', "n't", 'like', 'these', 'feelings', '.']


In [8]:
# Now for a spell check, using the autocorrect package (this is not part of NLTK)
#!pip install autocorrect
from autocorrect import Speller
check = Speller(lang='en')

In [10]:
correct_spell = []

for word in tokens_lower:
    correct_spell.append(check(word))    

print(correct_spell[:25])

# Spell checking needs attention: the spell checker used here turns "loney" into "money", but it works OK
# with the errors I put in here.

['she', 'moved', 'out', 'last', 'week', '.', 'i', 'am', 'feeling', 'very', 'sad', 'and', 'very', 'lonely', '.', 'i', 'miss', 'the', 'little', 'things', ',', 'like', 'how', 'she', 'leaves']
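
Given the concern noted above about miscorrections, one option is to apply a small table of project-specific corrections before falling back to the general spell checker, and to skip tokens that are not plain words. The sketch below assumes this approach; `correct_tokens`, `overrides` and `toy_check` are hypothetical names, and `toy_check` only stands in for a real checker such as the `Speller` instance `check`.

```python
def correct_tokens(tokens, check, overrides=None):
    """Spell-correct tokens, preferring manual overrides over the checker."""
    overrides = overrides or {}
    corrected = []
    for token in tokens:
        if token in overrides:
            # A known miscorrection: use the hand-picked replacement.
            corrected.append(overrides[token])
        elif token.isalpha():
            # Ordinary word: defer to the general spell checker.
            corrected.append(check(token))
        else:
            # Punctuation and pieces such as "n't" are left unchanged.
            corrected.append(token)
    return corrected


def toy_check(word):
    # Stand-in checker (assumption) that mimics the "loney" -> "money" behaviour.
    fixes = {"weeek": "week", "vry": "very", "loney": "money"}
    return fixes.get(word, word)


tokens = ["weeek", ".", "loney", "n't", "vry"]
fixed = correct_tokens(tokens, toy_check, overrides={"loney": "lonely"})
print(fixed)
```

In the notebook, `check` could be passed in place of `toy_check`, with the overrides table growing as miscorrections are spotted.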


### 3) Parts of speech tagging

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. This step assigns grammatical information to each word of the sentence.


For a list of the different parts of speech see this link:
https://www.guru99.com/pos-tagging-chunking-nltk.html

In [24]:
# A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag
# to each word.
tagged = nltk.pos_tag(correct_spell)
tagged
# Note that the first use, "feeling", is tagged as a gerund verb (VBG) and the second, "feelings", as a plural
# noun (NNS). This is a good thing.
# "sad" is tagged as an adjective (JJ), but "lonely" is tagged as an adverb (RB) even though it is an
# adjective here, so the tagger is not always right.
# Interestingly, "i" is not tagged consistently (NN, VB and NNS); lower-casing "I" may be the cause.

[('she', 'PRP'),
 ('moved', 'VBD'),
 ('out', 'RB'),
 ('last', 'JJ'),
 ('week', 'NN'),
 ('.', '.'),
 ('i', 'NN'),
 ('am', 'VBP'),
 ('feeling', 'VBG'),
 ('very', 'RB'),
 ('sad', 'JJ'),
 ('and', 'CC'),
 ('very', 'RB'),
 ('lonely', 'RB'),
 ('.', '.'),
 ('i', 'VB'),
 ('miss', 'VBP'),
 ('the', 'DT'),
 ('little', 'JJ'),
 ('things', 'NNS'),
 (',', ','),
 ('like', 'IN'),
 ('how', 'WRB'),
 ('she', 'PRP'),
 ('leaves', 'VBZ'),
 ('her', 'PRP$'),
 ('coffee', 'NN'),
 ('cup', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('sink', 'NN'),
 ('and', 'CC'),
 ('does', 'VBZ'),
 ("n't", 'RB'),
 ('wash', 'VB'),
 ('it', 'PRP'),
 ('up', 'RP'),
 ('.', '.'),
 ('i', 'NNS'),
 ('do', 'VBP'),
 ("n't", 'RB'),
 ('like', 'VB'),
 ('these', 'DT'),
 ('feelings', 'NNS'),
 ('.', '.')]
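
Because `tagged` is a list of `(word, tag)` tuples, selecting words by part of speech is a simple filter. The sketch below uses a short excerpt copied from the output above; `words_with_tag` is a hypothetical helper name.

```python
def words_with_tag(tagged_tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# Excerpt of the tagger output shown above.
tagged_sample = [
    ("i", "NN"), ("am", "VBP"), ("feeling", "VBG"),
    ("very", "RB"), ("sad", "JJ"), ("and", "CC"),
    ("very", "RB"), ("lonely", "RB"), (".", "."),
]

adjectives = words_with_tag(tagged_sample, "JJ")
verbs = words_with_tag(tagged_sample, "VB")
print(adjectives, verbs)
```

Note that "lonely" is missed by the adjective filter because it was tagged RB, so tagging errors carry through to anything built on the tags.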

### 4) Named Entity Recognition

Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. I think that, in the context of Tio, we need to improve our understanding of how the people in the conversations will be identified as separate entities.

The NLTK chunk package may be able to help us. Chunking identifies non-overlapping groups of tokens in unrestricted text. NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

In [40]:
chunk = nltk.ne_chunk(tagged, binary = True)
print(chunk)

(S
  she/PRP
  moved/VBD
  out/RB
  last/JJ
  week/NN
  ./.
  i/NN
  am/VBP
  feeling/VBG
  very/RB
  sad/JJ
  and/CC
  very/RB
  lonely/RB
  ./.
  i/VB
  miss/VBP
  the/DT
  little/JJ
  things/NNS
  ,/,
  like/IN
  how/WRB
  she/PRP
  leaves/VBZ
  her/PRP$
  coffee/NN
  cup/NN
  in/IN
  the/DT
  sink/NN
  and/CC
  does/VBZ
  n't/RB
  wash/VB
  it/PRP
  up/RP
  ./.
  i/NNS
  do/VBP
  n't/RB
  like/VB
  these/DT
  feelings/NNS
  ./.)
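
The sample text contains no proper nouns, so the tree above has no NE subtrees. For text that does, the chunks can be pulled out by walking the top level of the tree: in NLTK, chunk subtrees expose `label()` and `leaves()`, while ordinary tokens remain plain `(word, tag)` tuples. `extract_entities` is a hypothetical helper name, `DemoChunk` is only a stand-in so the sketch runs without NLTK data, and the demo sentence is invented for illustration.

```python
def extract_entities(tree):
    """Collect (label, text) pairs for each chunk subtree at the top level of a tree."""
    entities = []
    for node in tree:
        # Chunk subtrees have a label() method; plain tokens are (word, tag) tuples.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


class DemoChunk(list):
    """Minimal stand-in for an NLTK chunk subtree (assumption, for demonstration only)."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label

    def leaves(self):
        return list(self)


demo_tree = [
    ("she", "PRP"),
    ("visited", "VBD"),
    DemoChunk("NE", [("New", "NNP"), ("York", "NNP")]),
]
demo_entities = extract_entities(demo_tree)
print(demo_entities)

# Intended use in this notebook (requires nltk and its chunker data):
# entities = extract_entities(nltk.ne_chunk(tagged, binary=True))
```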


### Where next? Some thoughts.
 - NER needs a bit more attention. I plan to go over the UKDS presentation from 29 June again and look at the accompanying code files. I haven't yet got to grips with chunking properly.
 - I also need to talk to S, O and M about where we are trying to get to.
 - I think we need to think about building Tio's corpus as Tio learns. Need to reread chapter 11 of the NLTK book?
 - We also need to think about how this relates to the architecture, in terms of how we evaluate each statement versus the cumulative conversation.

### 5) Sources used
See Figure 1.1: Simple Pipeline Architecture for an Information Extraction System for more detail https://www.nltk.org/book/ch07.html
Source for the POS tagging material: chapter 5 of the NLTK Python book, on categorizing and tagging words https://www.nltk.org/book/ch05.html

https://ukdataservice.ac.uk/media/622724/textminingbasic16june2020.pdf
https://ukdataservice.ac.uk/media/622739/textminingadvanced29june2020.pdf
https://github.com/PeppermintT/Training_sessions