# NLP: Intro to Natural Language Processing

Based on Laurence Moroney's tutorials

## Contents
0. Install packages
1. A basic example of encoding and decoding / aka creating embeddings
2. Tokenization
3. Sequencing
4. NLP @ Picnic (just notes)

## 0. Install packages

In [4]:
!pip install tensorflow

Collecting tensorflow
  Using cached tensorflow-2.12.0-cp310-cp310-macosx_10_15_x86_64.whl (230.1 MB)
Collecting google-pasta>=0.1.1
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting astunparse>=1.6.0
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting absl-py>=1.0.0
  Using cached absl_py-1.4.0-py3-none-any.whl (126 kB)
Collecting jax>=0.3.15
  Using cached jax-0.4.8-py3-none-any.whl
Collecting gast<=0.4.0,>=0.2.1
  Using cached gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting grpcio<2.0,>=1.24.3
  Using cached grpcio-1.54.0.tar.gz (23.5 MB)
  Preparing metadata (setup.py) ... done
Collecting tensorboard<2.13,>=2.12
  Using cached tensorboard-2.12.3-py3-none-any.whl (5.6 MB)
Collecting google-auth<3,>=1.6.3
  Using cached google_auth-2.17.3-py2.py3-none-any.whl (178 kB)
Collecting google-auth-oauthlib<1.1,>=0.5
  Using cached google_auth_oauthlib-1.0.0-py2.py3-none-any.whl (18 kB)
Collecting cachetools<6.0,>=2.0.0
  Using cached cache

In [None]:
!pip install tiktoken

## 1. A basic example of encoding and decoding / aka creating embeddings

In [6]:
#create a dummy text
text = 'the quick brown fox jumps over the lazy dog!'

In [7]:
# build a set (each character appears once), then sort it into a list
print(sorted(set(text)))

[' ', '!', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [8]:
# all the unique characters that occur in this text
chars = sorted(set(text))
vocab_size = len(chars)
print(vocab_size)
print(''.join(chars))  # join the characters into a single string

28
 !abcdefghijklmnopqrstuvwxyz


In [9]:
# create a mapping from characters to integers
stoi = str_to_int = { ch:i for i,ch in enumerate(chars) } #stoi = string to integer
itos = int_to_str = { i:ch for i,ch in enumerate(chars) } #itos = integer to string

# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s] 
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l]) 

print(encode("hello world!"))
print(decode(encode("hello world!")))

[9, 6, 13, 13, 16, 0, 24, 16, 19, 13, 5, 1]
hello world!


In [10]:
print(chars[9],chars[6],chars[13], chars[13], chars[16])

h e l l o
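
The mapping above only knows the characters that occur in `text`, so encoding anything else (an uppercase letter, for example) raises a `KeyError`. Below is a minimal sketch of one way to handle that. The extra unknown id `UNK_ID`, the placeholder `UNK_CHAR`, and the helper names `encode_safe` and `decode_safe` are assumptions made for this illustration and are not part of the original notebook.

```python
# Character-level vocabulary built the same way as in the cells above.
text = 'the quick brown fox jumps over the lazy dog!'
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Assumption for this sketch: one extra id, placed after the known characters,
# stands in for every character that is not in the vocabulary.
UNK_ID = len(chars)
UNK_CHAR = '?'  # hypothetical placeholder shown when decoding an unknown id

def encode_safe(s):
    """Map each character to its id, or to UNK_ID if it was never seen."""
    return [stoi.get(c, UNK_ID) for c in s]

def decode_safe(ids):
    """Map ids back to characters; unknown ids become UNK_CHAR."""
    return ''.join(itos.get(i, UNK_CHAR) for i in ids)

print(encode_safe("Hello"))
print(decode_safe(encode_safe("Hello")))
```

Known characters still round-trip unchanged. Only the unseen `H` is lost, which is the price of a fixed vocabulary.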


In [11]:
# let's now encode the entire text dataset and store it into a torch.Tensor
# we use PyTorch: https://pytorch.org
import torch 
data = torch.tensor(encode(text), dtype=torch.long)
print(f'the shape is: {data.shape}')
print(f'the dtype is: {data.dtype}')
print(f'the tensor is: \n{data}')  # this is how the 44 characters of the text look to a model such as GPT

the shape is: torch.Size([44])
the dtype is: torch.int64
the tensor is: 
tensor([21,  9,  6,  0, 18, 22, 10,  4, 12,  0,  3, 19, 16, 24, 15,  0,  7, 16,
        25,  0, 11, 22, 14, 17, 20,  0, 16, 23,  6, 19,  0, 21,  9,  6,  0, 13,
         2, 27, 26,  0,  5, 16,  8,  1])


## 2. Tokenization

In [3]:
# Creating a word index with TensorFlow's Tokenizer
# source: https://www.youtube.com/watch?v=fNxaJsNG3-s
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

ModuleNotFoundError: No module named 'tensorflow'
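
Since the cell above failed on the import, here is a rough pure-Python sketch of the idea behind a word index: lowercase the text, drop punctuation, and give the most frequent words the lowest indices. The helper `build_word_index` and its simplified regex are assumptions for illustration. This is not the Keras implementation, whose filtering rules differ in detail.

```python
import re
from collections import Counter

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
]

def build_word_index(texts):
    """Lowercase, drop punctuation, and rank words by frequency (1 = most frequent)."""
    counts = Counter()
    for t in texts:
        # Simplification: a word is any run of letters, digits or apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", t.lower()))
    # most_common() keeps first-seen order among words with equal counts.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index(sentences)
print(word_index)
```

Note that `dog` and `dog!` end up as the same entry, and that `I` is stored as `i`.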

In [None]:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"

In [None]:
print(enc)
print(type(enc))

## 3. Sequencing - Turning sentences into data (NLP Zero to Hero - Part 2)

https://www.youtube.com/watch?v=r9QjkdSJZ2g

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences, padding='post')
print(word_index)
print(sequences)
print(padded)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
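
To make the role of `<OOV>` and of post-padding concrete, here is a small pure-Python sketch that reuses the word index printed above. The helpers `to_sequence` and `pad_post` and the two test sentences are hypothetical stand-ins written for this note. They only approximate what `texts_to_sequences` and `pad_sequences(..., padding='post')` do.

```python
import re

# Word index copied from the output above.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def to_sequence(sentence, index, oov_token='<OOV>'):
    """Look up each lowercased word; words missing from the index get the OOV id."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [index.get(w, index[oov_token]) for w in words]

def pad_post(seqs, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    longest = max(len(s) for s in seqs)
    return [s + [value] * (longest - len(s)) for s in seqs]

test_sentences = ['I really love my dog', 'You love my manatee!']
test_seqs = [to_sequence(s, word_index) for s in test_sentences]
print(test_seqs)
print(pad_post(test_seqs))
```

The unseen words `really` and `manatee` both map to index 1, so the sentence length is preserved even though the words themselves are unknown.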


## 4. NLP @ Picnic (just notes)

Definition: NLP enables machines to understand human language.

#### Use cases
- Chatbots
- Recommending answers to call center agents, e.g. based on sentiment
- Identifying trends in customer feedback
- Prioritization of tickets
- Routing tickets

Messages could be: positive feedback from happy customers, requests to adjust the delivery time, missing products, issues with products (e.g. freshness), payment questions, product suggestions.

Automation with NLP.
Picnic uses two models:
- positive classification model => assign a positivity score => then apply business rules (e.g. no question mark, no address, etc.) => then close the message automatically. Circa 20% of messages are positive.
- issue classification model => freshness, completeness, payment, adjustment, assortment suggestion
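
The "score, then business rules, then auto-close" flow of the positive model could look roughly like the sketch below. Everything specific here is an assumption for illustration: the threshold value, the crude address regex, and the function name `can_auto_close` do not come from the notes.

```python
import re

# Hypothetical cut-off; the notes do not say which score counts as "positive".
POSITIVITY_THRESHOLD = 0.9

# Hypothetical and deliberately crude address detector: a street-like word
# followed by a house number, e.g. "Kerkstraat 12" or "Main street 5".
ADDRESS_PATTERN = re.compile(
    r"\b\w*(?:straat|laan|weg|street|road)\s+\d+", re.IGNORECASE
)

def can_auto_close(message, positivity_score):
    """Return True only if the message is positive and passes every business rule."""
    if positivity_score < POSITIVITY_THRESHOLD:
        return False  # not positive enough
    if '?' in message:
        return False  # a question needs a human answer
    if ADDRESS_PATTERN.search(message):
        return False  # an address may signal a delivery request
    return True

print(can_auto_close("Great service, thank you!", 0.97))
print(can_auto_close("Thanks! Can you come on Thursday?", 0.95))
print(can_auto_close("Lovely, but next time deliver to Kerkstraat 12", 0.93))
```

Keeping the rules separate from the model means a high score alone never closes a message that still contains a request.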

Text classification:
Step 1: Text cleaning
    a. Text substitution, for example: thursday => $weekday, ? => $QUESTION
    b. Replace emojis with words
    c. Clean punctuation
    d. Tokenization
    e. Stemming
    f. Remove stop words (e.g. are, and, the, etc.)
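
The cleaning steps a to f above can be sketched in plain Python as follows. The word lists, the emoji mapping, and the `naive_stem` helper are tiny hypothetical stand-ins. A real pipeline would more likely use a proper stemmer and a full stop-word list from an NLP library.

```python
import re

WEEKDAYS = {'monday', 'tuesday', 'wednesday', 'thursday',
            'friday', 'saturday', 'sunday'}
STOP_WORDS = {'are', 'and', 'the', 'is', 'a', 'on', 'were'}  # tiny illustrative set
EMOJI_WORDS = {'😡': ' angry ', '👍': ' good '}  # hypothetical mapping

def naive_stem(word):
    """Very rough suffix stripping; only meant to show where stemming happens."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def clean(message):
    text = message.lower()
    text = text.replace('?', ' $QUESTION ')        # a. substitution of '?'
    for emoji, word in EMOJI_WORDS.items():        # b. emojis to words
        text = text.replace(emoji, word)
    text = re.sub(r"[^\w\s$]", ' ', text)          # c. drop punctuation, keep '$'
    tokens = text.split()                          # d. tokenization
    tokens = ['$weekday' if t in WEEKDAYS else t   # a. substitution of weekdays
              for t in tokens]
    tokens = [t if t.startswith('$') else naive_stem(t)  # e. stemming
              for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]    # f. stop words

print(clean("Can you deliver on Thursday? The bananas were missing 😡"))
```

The placeholders `$weekday` and `$QUESTION` survive cleaning on purpose, so that later steps can use them as features.
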
Step 2: Feature creation
    a. Text as a feature: bag of words, N-grams, TF-IDF (term frequency-inverse document frequency), BERT (embeddings)
    
    TF-IDF: numbers representing how relevant each word is in a document. Rare words are overweighted, frequently used words are underweighted. sklearn has the class TfidfVectorizer.
    BERT: can do sentiment analysis, question answering, text prediction, etc., and captures the essence of a text. BERT helps Google understand queries better.
    Two submodels:
        1. Masked Language Model (MLM) => bidirectional training; 15% of the words are masked during training, and the words on either side of a masked word are used to predict it. Learns context and doesn't require labeled data.
        2. Next Sentence Prediction (NSP) => binary classification task. Learns about relationships between sentences.

There is a combined loss function of NSP & MLM. 
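
As an illustration of the masking step in MLM, the sketch below hides roughly 15% of the tokens and remembers the originals as prediction targets. It is a simplification: the helper `mask_tokens`, the fixed seed, and the rounding rule are assumptions, and BERT's actual procedure additionally replaces some selected tokens with random words or leaves them unchanged instead of always using `[MASK]`.

```python
import random

MASK_TOKEN = '[MASK]'

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace about `mask_rate` of the tokens with MASK_TOKEN.

    Returns the masked token list and a dict {position: original token}
    holding the words the model would have to predict.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels

tokens = 'the quick brown fox jumps over the lazy dog and runs far away'.split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Because the targets come from the text itself, no human labeling is needed, which is the point made in the notes above.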

BERTje = Dutch BERT model developed at University of Groningen. 

Hugging Face provides really nice APIs:
from transformers import BertTokenizer, BertForSequenceClassification (or similar)

TF-IDF works better when other features are present, and it is computationally less expensive. BERT is better for unstructured text.
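
To show why TF-IDF is cheap and how it down-weights common words, here is a pure-Python sketch of the textbook formula tf × log(N / df). The three example messages and the helper `tf_idf` are made up for illustration. sklearn's `TfidfVectorizer` uses a smoothed idf and normalizes each vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [
    'the bananas were not fresh',
    'the delivery was late',
    'the bananas and the apples were missing',
]

def tf_idf(documents):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    tokenized = [d.split() for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f'{word:10s} {score:.3f}')
```

In the first message, `the` scores 0 because it occurs in every document, while `fresh` and `not` score highest because they occur nowhere else.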

Step 3: Classification

- logistic regression model
    positive score, probability of each issue type.
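
How a logistic regression turns raw model scores into the two outputs named above can be sketched as follows: a sigmoid gives the positive score, and a softmax gives one probability per issue type. The logit values are invented, and treating the issue model as a single multinomial model is an assumption; the notes do not say how it is set up.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

ISSUE_TYPES = ['freshness', 'completeness', 'payment',
               'adjustment', 'assortment suggestion']

# Hypothetical raw scores (weights . features + bias) for one message.
positive_logit = 2.2
issue_logits = [1.5, 0.2, -1.0, -0.5, -2.0]

positive_score = sigmoid(positive_logit)
issue_probs = dict(zip(ISSUE_TYPES, softmax(issue_logits)))

print(f'positive score: {positive_score:.2f}')
for issue, p in issue_probs.items():
    print(f'{issue:22s} {p:.2f}')
```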
    

Architecture overview: 
- Salesforce as system of record
- Python job to request all messages (first a batch job, now an API)
- then 3 AI steps
- send the classifications back to Salesforce