## NLP
Natural language processing (NLP) enables computers to understand, interpret, and generate human language.

Simple process flow:

- input -> text processing -> natural language understanding -> dialog management -> natural language generation -> output

**Level of understanding**

## Import NLP library

In [1]:
import nltk

## Stemming, Lemmatization
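- Stemming: reduce a word to its base form by cutting off suffixes with fixed rules; the result need not be a real word (e.g. `flies` becomes `fli`)
- Lemmatization: reduce a word to its dictionary form (lemma) by taking conditions such as grammar into account (e.g. its part of speech)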
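
To make the difference concrete, here is a minimal toy sketch in plain Python. It is not the Porter algorithm and not a WordNet lookup: the suffix list `SUFFIXES`, the table `LEMMA_TABLE`, and both function names are hypothetical and exist only for illustration. The stemmer chops a suffix blindly, while the lemmatizer looks the word up.

```python
# Toy illustration only: NOT the Porter algorithm and NOT a WordNet lookup.
SUFFIXES = ("ing", "es", "s")  # hypothetical suffix rules, tried in this order

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_TABLE = {"flies": "fly", "flying": "fly", "flew": "fly", "flown": "fly"}


def toy_stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


for w in ["flies", "flying", "flew"]:
    print(f"{w:<10}{toy_stem(w):<10}{toy_lemmatize(w):<10}")
```

Cutting is cheap but can produce non-words (`fli`) and misses irregular forms (`flew`); a lookup handles both but only for words it knows.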

Download WordNet (the dataset the lemmatizer looks words up in; the Porter stemmer is rule-based and does not need it)

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### Import the stemming and lemmatization functions

In [3]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [4]:
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()

### Display function for comparing stemming and lemmatization

In [5]:
def display(text):
  print(f"{'Word':<12}{'Lemma':<12}{'Stem':<12}")
  print('-' * 36)
  for word in text:
      print(f"{word:<12}{lemma.lemmatize(word):<12}{stemmer.stem(word):<12}")

### Example 1:

In [6]:
Word_list = ["fly", "flies", "flying", "flew", "flown"]
display(Word_list)

Word        Lemma       Stem        
------------------------------------
fly         fly         fli         
flies       fly         fli         
flying      flying      fli         
flew        flew        flew        
flown       flown       flown       

The lemmatizer leaves `flying`, `flew`, and `flown` unchanged because `lemmatize` treats every word as a noun unless a part of speech is given. Passing `pos='v'`, as in `lemma.lemmatize(word, pos='v')`, lets WordNet map these verb forms to `fly`.


### Example 2:

In [14]:
Word_list = ["Universe", "University", "Universal"]
display(Word_list)

Word        Lemma       Stem        
------------------------------------
Universe    Universe    univers     
University  University  univers     
Universal   Universal   univers     


### Example 3:

In [8]:
Word_list = "The painting looks beautiful".split()
display(Word_list)

Word        Lemma       Stem        
------------------------------------
The         The         the         
painting    painting    paint       
looks       look        look        
beautiful   beautiful   beauti      


## Stopwords
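- Stopwords are very common words that carry little meaning on their own (e.g. a, an, am) and are often ignored during analysis
  - not recommended for the Thai language
- Download stopwords (the dataset used to build the stopword list)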

In [9]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

In [10]:
def remove_stopwords(text):
  output = [i for i in text if i not in stopwords]
  return output

In [11]:
text = "The is a painting and it looks beautiful".split()
print("Original text: ", text)
print("Remove stopwords: ", remove_stopwords(text))

Original text:  ['The', 'is', 'a', 'painting', 'and', 'it', 'looks', 'beautiful']
Remove stopwords:  ['The', 'painting', 'looks', 'beautiful']
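
Note that `'The'` survives in the output above: the stopword list is all lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. To stay self-contained it uses a small hypothetical set `STOPWORDS` instead of the NLTK list, and the function name `remove_stopwords_ci` is also an assumption.

```python
# Hypothetical mini stopword set; the notebook itself uses NLTK's English list.
STOPWORDS = {"the", "is", "a", "and", "it"}


def remove_stopwords_ci(tokens):
    """Drop tokens whose lowercase form is a stopword; keep original casing."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = "The is a painting and it looks beautiful".split()
print("Original text: ", tokens)
print("Remove stopwords: ", remove_stopwords_ci(tokens))
```

Using a `set` rather than a list also makes each membership check constant time, which matters on longer texts.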


## Normalisation
- Straightforward: a dictionary lookup that maps abbreviations, slang spellings, and emoticons to a standard form
- Useful when analyzing informal text

In [38]:
norm_dict = {
    '2moro': 'tomorrow',
    '2mrrw': 'tomorrow',
    '2morrow': 'tomorrow',
    '2mrw': 'tomorrow',
    'tomrw': 'tomorrow',
    'tmr': 'tomorrow',
    'omw': 'on my way',
    'rn': 'right now',
    'b4': 'before',
    'otw': 'on the way',
    ':D': 'smile',
    ':)': 'smile',
    'j-)': 'smile'
}

In [45]:
def normalize_text(text):
  res = [norm_dict[w] if w in norm_dict else w for w in text]
  return res

In [47]:
word_list = ["tmr", "omw", ':D', "b4"]
normalize_text(word_list)

['tomorrow', 'on my way', 'smile', 'before']
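
The lookup above works on a list of tokens that already match the dictionary keys exactly. As a rough sketch, the same idea can be applied to a whole sentence with a case-insensitive fallback. `NORM_DICT` here is a small subset of the dictionary above, and `normalize_sentence` is a hypothetical helper name; splitting on whitespace is a simplifying assumption.

```python
# Subset of the normalisation dictionary, repeated so the cell is self-contained.
NORM_DICT = {
    "tmr": "tomorrow",
    "omw": "on my way",
    "b4": "before",
    ":D": "smile",
}


def normalize_sentence(sentence):
    """Replace each whitespace-separated token that has a dictionary entry."""
    out = []
    for token in sentence.split():
        # Try the exact token first (":D" is case-sensitive),
        # then its lowercase form, and otherwise keep the token unchanged.
        out.append(NORM_DICT.get(token, NORM_DICT.get(token.lower(), token)))
    return " ".join(out)


print(normalize_sentence("OMW see you tmr :D"))
```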

## Noise Removal
Remove unnecessary characters, such as punctuation, digits, and HTML markup, from words

In [49]:
import pandas as pd
import re

In [50]:
def scrub_words(text):
    # remove html markup
    text = re.sub(r'<.*?>', '', text)
    # replace non-word characters (punctuation, symbols) and digits with spaces
    text = re.sub(r'[\W\d]', ' ', text)
    # strip leading and trailing whitespace
    text = text.strip()
    return text

In [54]:
raw_words = ["..trouble..", "trouble<", "trouble!", "<a>trouble</a>", "1.trouble"]
cleaned_words = [scrub_words(w) for w in raw_words]
stemdf = pd.DataFrame({'raw_word': raw_words,'cleaned_word':cleaned_words})
stemdf = stemdf[['raw_word','cleaned_word']]
stemdf

|   | raw_word | cleaned_word |
|---|---|---|
| 0 | `..trouble..` | trouble |
| 1 | `trouble<` | trouble |
| 2 | `trouble!` | trouble |
| 3 | `<a>trouble</a>` | trouble |
| 4 | `1.trouble` | trouble |
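
Two details of the pattern `[\W\d]` are worth seeing in action. In Python 3, `\W` is Unicode-aware for `str` patterns, so accented letters count as word characters and are kept unless the `re.ASCII` flag is passed. Also, each removed character becomes a space, so longer inputs end up with runs of spaces. The sketch below is a variant under those assumptions; the name `scrub_text` and the `ascii_only` parameter are hypothetical.

```python
import re


def scrub_text(text, ascii_only=False):
    """Variant of scrub_words with optional ASCII-only matching."""
    # With re.ASCII, \W matches everything outside [a-zA-Z0-9_].
    flags = re.ASCII if ascii_only else 0
    # remove html markup
    text = re.sub(r"<.*?>", "", text)
    # replace non-word characters and digits with spaces
    text = re.sub(r"[\W\d]", " ", text, flags=flags)
    # collapse runs of whitespace left behind by the replacements
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(scrub_text("<b>café</b> no.1!"))
print(scrub_text("<b>café</b> no.1!", ascii_only=True))
```

The first call keeps `café` intact, while the second drops the `é` because it falls outside the ASCII word characters.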


## Text Enrichment / Augmentation
Adding more semantics to words or input, for example by looking up synonyms

In [67]:
from nltk.corpus import wordnet
syns = wordnet.synsets("program")  # synsets (groups of synonymous word senses) in WordNet that include the word "program"
syns

[Synset('plan.n.01'),
 Synset('program.n.02'),
 Synset('broadcast.n.02'),
 Synset('platform.n.02'),
 Synset('program.n.05'),
 Synset('course_of_study.n.01'),
 Synset('program.n.07'),
 Synset('program.n.08'),
 Synset('program.v.01'),
 Synset('program.v.02')]

In [68]:
[s.lemmas()[0].name() for s in syns]

['plan',
 'program',
 'broadcast',
 'platform',
 'program',
 'course_of_study',
 'program',
 'program',
 'program',
 'program']