# NLP encoding and decoding and tokenizers

#### Contents
0. Install packages
1. A basic example of encoding and decoding
2. Tokenization with tiktoken
3. NLP @ Picnic

## 0. Install packages

In [30]:
!pip install torch



In [31]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.1.2-cp39-cp39-macosx_10_9_x86_64.whl (728 kB)
Collecting blobfile>=2
  Downloading blobfile-2.0.1-py3-none-any.whl (73 kB)
Collecting pycryptodomex~=3.8
  Downloading pycryptodomex-3.16.0-cp35-abi3-macosx_10_9_x86_64.whl (1.6 MB)
Installing collected packages: pycryptodomex, blobfile, tiktoken
Successfully installed blobfile-2.0.1 pycryptodomex-3.16.0 tiktoken-0.1.2


## 1. A basic example of encoding and decoding

In [14]:
#create a dummy text
text = 'the quick brown fox jumps over the lazy dog!'

In [15]:
# build a set (each character appears once), then sort it into a list
print(sorted(set(text)))

[' ', '!', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [16]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print(''.join(chars))  # join the characters into a single string

28
 !abcdefghijklmnopqrstuvwxyz


In [18]:
# create a mapping from characters to integers
stoi = str_to_int = { ch:i for i,ch in enumerate(chars) }
itos = int_to_str = { i:ch for i,ch in enumerate(chars) }

# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s] 
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l]) 

print(encode("hello world!"))
print(decode(encode("hello world!")))

[9, 6, 13, 13, 16, 0, 24, 16, 19, 13, 5, 1]
hello world!


In [20]:
print(chars[9],chars[6],chars[13], chars[13], chars[16])

h e l l o
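
Note that the `encode` lambda above raises a `KeyError` for any character that is not in `chars` (for example an uppercase letter). A minimal sketch of one way around this, assuming a reserved id 0 for unknown characters; the class name `CharTokenizer` and the `unk` marker are hypothetical and not part of the notebook, and the ids shift by one compared with the cell above:

```python
class CharTokenizer:
    """Character-level tokenizer with a fallback id for unseen characters."""

    def __init__(self, text, unk="\ufffd"):
        # Reserve id 0 for the unknown marker; real characters start at id 1.
        self.unk = unk
        self.itos = [unk] + sorted(set(text) - {unk})
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}

    def encode(self, s):
        # Unknown characters map to id 0 instead of raising KeyError.
        return [self.stoi.get(c, 0) for c in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("the quick brown fox jumps over the lazy dog!")
ids = tok.encode("Hello world!")  # 'H' is not in the vocabulary
print(ids)
print(tok.decode(ids))
```

The round trip is lossless only for characters in the vocabulary; every unknown character decodes to the same marker.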


In [28]:
# let's now encode the entire text dataset and store it into a torch.Tensor
# we use PyTorch: https://pytorch.org
import torch 
data = torch.tensor(encode(text), dtype=torch.long)
print(f'the shape is: {data.shape}')
print(f'the dtype is: {data.dtype}')
print(f'the tensor is: \n{data}') # this is how the 44 characters of the text look to a GPT-style model

the shape is: torch.Size([44])
the dtype is: torch.int64
the tensor is: 
tensor([21,  9,  6,  0, 18, 22, 10,  4, 12,  0,  3, 19, 16, 24, 15,  0,  7, 16,
        25,  0, 11, 22, 14, 17, 20,  0, 16, 23,  6, 19,  0, 21,  9,  6,  0, 13,
         2, 27, 26,  0,  5, 16,  8,  1])


## 2. Tokenization with tiktoken

source: https://github.com/openai/tiktoken

In [32]:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"

In [34]:
print(enc)
print(type(enc))

<Encoding 'gpt2'>
<class 'tiktoken.core.Encoding'>
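
tiktoken is a byte pair encoding (BPE) tokenizer. To get a feel for what that means, here is a toy sketch of BPE training in pure Python: start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token id. This is not tiktoken's implementation; the function names and the choice of 3 merges are assumptions made for the demo.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def decode(ids, merges):
    """Undo the merges, newest first, and turn the bytes back into text."""
    for pair, new_id in sorted(merges.items(), key=lambda kv: -kv[1]):
        expanded = []
        for t in ids:
            expanded.extend(pair if t == new_id else [t])
        ids = expanded
    return bytes(ids).decode("utf-8")


text = "the quick brown fox jumps over the lazy dog!"
ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
merges = {}
for step in range(3):  # 3 merges is an arbitrary choice for the demo
    pair = most_frequent_pair(ids)
    new_id = 256 + step
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(len(text), "bytes ->", len(ids), "tokens")
print(decode(ids, merges))
```

On this text the three merges shrink the 44 bytes to 38 tokens, because only the pairs inside the repeated "the " occur more than once. A real vocabulary such as `gpt2` is the result of many thousands of merges learned on a large corpus.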


## 3. NLP @ Picnic

Definition: NLP (natural language processing) enables machines to understand human language.

#### Use cases
- Chatbots
- Recommending answers to call center agents, e.g. based on sentiment
- Identifying trends in customer feedback
- Prioritization of tickets
- Routing tickets

Messages could be: positive feedback from happy customers, requests to adjust the delivery time, missing products, issues with products (e.g. freshness), payment questions, product suggestions.

Automation with NLP.
Picnic uses two models:
- positive classification model => assign a positivity score => apply business rules (e.g. no question mark, no address, etc.) => close the message automatically. Circa 20% of messages are positive.
- issue classification model => freshness, completeness, payment, adjustment, assortment suggestion
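
A minimal sketch of how the business rules behind the positive model could look, assuming the model returns a score between 0 and 1. The threshold value, the address heuristic, and all function names are hypothetical; the actual rules used at Picnic are not given in these notes.

```python
import re

POSITIVITY_THRESHOLD = 0.9  # hypothetical cut-off, not from the talk


def looks_like_address(message):
    # Crude stand-in: a word followed by a house number, e.g. "Main Street 12".
    return re.search(r"\b[A-Za-z]+\s+\d+[A-Za-z]?\b", message) is not None


def can_auto_close(message, positivity_score):
    """Close automatically only if the message is positive and needs no action."""
    if positivity_score < POSITIVITY_THRESHOLD:
        return False
    if "?" in message:  # a question needs an answer from an agent
        return False
    if looks_like_address(message):  # an address may signal a change request
        return False
    return True


print(can_auto_close("Great delivery, thank you!", 0.97))
print(can_auto_close("Thanks! Can you come earlier next time?", 0.95))
```

The rules act as a safety net on top of the model: a message is only closed when both the score and every rule agree.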

Text classification:
Step 1: Text cleaning
    a. Text substitution: example thursday => $weekday, ? => $QUESTION
    b. Replace emojis with words
    c. Clean punctuation
    d. Tokenization
    e. Stemming
    f. Remove stop words (e.g. are, and, the)
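
The cleaning steps a to f can be sketched in plain Python as follows. The emoji mapping, the stop-word list, and the suffix-stripping stemmer are tiny hypothetical stand-ins chosen for the demo; a real pipeline would use proper resources for each of them.

```python
import re
import string

WEEKDAYS = r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"
EMOJI_WORDS = {"\U0001F600": " happy ", "\U0001F621": " angry "}  # hypothetical mapping
STOP_WORDS = {"are", "and", "the", "a", "is", "to", "on"}  # hypothetical list
# Keep '$' so the placeholder tokens survive punctuation removal.
PUNCTUATION = string.punctuation.replace("$", "")


def naive_stem(token):
    # Very crude suffix stripping, only to illustrate the stemming step.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def clean(text):
    text = text.lower()
    text = re.sub(WEEKDAYS, "$weekday", text)            # a. text substitution
    text = text.replace("?", " $QUESTION ")
    for emoji, word in EMOJI_WORDS.items():              # b. emojis to words
        text = text.replace(emoji, word)
    text = text.translate(str.maketrans("", "", PUNCTUATION))  # c. punctuation
    tokens = text.split()                                # d. tokenization
    tokens = [t if t.startswith("$") else naive_stem(t) for t in tokens]  # e. stemming
    return [t for t in tokens if t not in STOP_WORDS]    # f. stop words


print(clean("Can you deliver on Thursday? The apples are missing \U0001F621"))
```

Placeholders such as `$weekday` let the model treat all weekdays as one feature instead of seven rare ones.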
Step 2: Feature creation
    a. Text as a feature: bag of words, N-grams, TF-IDF (term frequency-inverse document frequency), BERT (embeddings)
    
    TF-IDF: numbers representing how relevant each word is to a given document. Rare words are weighted up; frequently used words are weighted down. scikit-learn provides the class TfidfVectorizer.
    BERT: can do sentiment analysis, question answering, text prediction, etc., and captures the essence of a text. BERT helps Google understand queries better.
    Two submodels:
        1. Masked language model (MLM) => bidirectional training: 15% of the words are masked during training, and the words on either side of a masked word are used to predict it. Learns context and doesn't require labeled data.
        2. Next sentence prediction (NSP) => binary classification task. Learns about relationships between sentences.

There is a combined loss function of NSP & MLM. 
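
To make the MLM objective concrete, here is a minimal sketch of the masking step: pick roughly 15% of the positions, hide them, and keep the originals as prediction targets. The function name and its parameters are hypothetical. The original BERT recipe additionally replaces some of the selected tokens with a random token or leaves them unchanged; this sketch skips that detail.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Return (masked, labels): labels hold the original token at masked
    positions and None everywhere else."""
    rng = random.Random(seed)  # fixed seed so the demo is reproducible
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog and then naps in the sun".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model only receives `masked` and is scored on how well it predicts the non-`None` entries of `labels`, which is why no hand-labeled data is needed.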

BERTje = Dutch BERT model developed at University of Groningen. 

Hugging Face provides really nice APIs:
from transformers import BertTokenizer, BertForSequenceClassification (or similar)

TF-IDF works better when other features are present and is computationally less expensive. BERT is better for unstructured text.
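
A minimal sketch of the textbook TF-IDF weighting in pure Python, to show why rare words get a high weight and common words a low one. The example documents are made up. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ from this version.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [
    "the apples were not fresh".split(),
    "the delivery was late".split(),
    "the payment failed".split(),
]
w = tf_idf(docs)
print(w[0]["the"])    # occurs in every document, so its weight is 0.0
print(w[0]["fresh"])  # occurs in one document only, so it gets a positive weight
```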

Step 3: Classification

- logistic regression model
    Outputs: a positivity score and the probability of each issue type.
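
A minimal sketch of how logistic regression turns a feature vector into these two outputs: a sigmoid for the positivity score and a softmax over the issue types. All weights and features below are made up, and whether Picnic uses a multinomial or a one-vs-rest setup is not stated in these notes; the multinomial form is an assumption.

```python
import math

ISSUE_TYPES = ["freshness", "completeness", "payment", "adjustment", "assortment suggestion"]


def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, features):
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical 3-dimensional feature vector (e.g. three TF-IDF weights).
features = [0.4, 0.0, 0.2]
# Made-up (weights, bias) pairs; a trained model would learn these.
positive_model = ([-1.0, 0.5, 2.0], 0.1)
issue_models = [
    ([2.0, 0.0, -1.0], 0.0),   # freshness
    ([0.5, 1.0, 0.0], 0.0),    # completeness
    ([0.0, 2.0, 0.0], -0.5),   # payment
    ([0.0, 0.5, 0.5], 0.0),    # adjustment
    ([-0.5, 0.0, 1.0], 0.0),   # assortment suggestion
]

positive_score = sigmoid(linear(*positive_model, features))
issue_probs = softmax([linear(w, b, features) for w, b in issue_models])

print(f"positive score: {positive_score:.2f}")
for name, p in zip(ISSUE_TYPES, issue_probs):
    print(f"{name}: {p:.2f}")
```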
    

Architecture overview: 
- Salesforce as system of record
- Python job to request all messages (first batch job, now an API)
- then 3 AI steps
- send the classifications back to Salesforce
  

In [None]:
#Öykü 
#