#### Types of Tokenization Methods
***
1. Word Tokenization
2. Character Tokenization
3. Subword Tokenization

More info: [DataCamp Tokenization Explanation](https://www.datacamp.com/blog/what-is-tokenization)

In [8]:
import numpy as np
import torch
import torch.nn as nn

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd

In [147]:
stopwords_list = set(stopwords.words('english'))
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    '''Removes HTML tags: replaces anything between an opening < and a closing > with an empty string'''

    return TAG_RE.sub('', text)

def text_preprocess(sen):
    '''Lowercases a sentence and strips HTML tags, non-letters, single characters and stopwords'''

    sentence = sen.lower()

    # Remove HTML tags
    sentence = remove_tags(sentence)

    # Remove punctuation and numbers (each one is replaced by a space)
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Remove single characters, e.g. the "s" left behind once the apostrophe in "Mark's" has become a space
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Collapse the runs of spaces created by the substitutions above
    sentence = re.sub(r'\s+', ' ', sentence)

    # Remove stopwords together with the whitespace that follows them
    pattern = re.compile(r'\b(' + r'|'.join(stopwords_list) + r')\b\s*')
    sentence = pattern.sub('', sentence)

    return sentence.rstrip()

# Source for the code above: https://github.com/skillcate/sentiment-analysis-with-deep-neural-networks/blob/main/b2_preprocessing_function.py

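def n_grams(data, n: int = 3):
    '''Returns every window of n consecutive tokens from each tokenized sentence in data'''
    n_grams_list = []

    for tokens in data:
        # Slide a window of size n over the sentence; windows never span two sentences
        for j in range(len(tokens) - n + 1):
            n_grams_list.append(tokens[j: j + n])

    return n_grams_list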

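def clean_data(data):
    '''Preprocesses every sentence and splits it into a list of word tokens'''
    cleaned = []
    for sentence in data:
        # split() without arguments ignores leading and repeated whitespace
        cleaned.append(text_preprocess(sentence).split())

    return cleaned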

In [154]:
# Example Data

data = np.array(['Branden is good person\n',
 'Shlok is a great man\n',
 'Jason is a nice person\n',
 'David is a bad human\n',
 'Chris has a great personality\n'])


#### 1. Word Tokenization
***
- Split the sentence into separate words, then convert each word to an integer with a word_to_index mapping
- The most common approach, and particularly effective for languages with clear word boundaries

1. Benefits
    - Simplest way to separate speech or text into parts
2. Cons
    - Cannot handle unknown, or out-of-vocabulary (OOV), words
    - A temporary solution is to replace every unknown word with a common placeholder token, but then there is no way to tell whether two unknown words are the same or different
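
The word_to_index step and its out-of-vocabulary limitation can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the names `build_word_to_index`, `encode` and the `<UNK>` placeholder are assumptions introduced here, and the sentences are tokenized by hand.

```python
# Hypothetical helpers for illustration; these names do not come from the notebook.
UNK = "<UNK>"

def build_word_to_index(tokenized_sentences):
    """Map each distinct word to an integer id; id 0 is reserved for unknown words."""
    word_to_index = {UNK: 0}
    for sentence in tokenized_sentences:
        for word in sentence:
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)
    return word_to_index

def encode(sentence, word_to_index):
    """Replace each word by its id, falling back to the <UNK> id for OOV words."""
    return [word_to_index.get(word, word_to_index[UNK]) for word in sentence]

train = [["branden", "good", "person"], ["shlok", "great", "man"]]
vocab = build_word_to_index(train)

known = encode(["great", "person"], vocab)
unseen = encode(["sara", "interesting", "woman"], vocab)
print(known)   # [5, 3]
print(unseen)  # [0, 0, 0]
```

All three unseen words collapse to the same id, so the model can no longer tell them apart.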

In [153]:
# Using n_grams to create word tokens
n_grams(clean_data(data), 1)

[['branden'],
 ['good'],
 ['person'],
 ['shlok'],
 ['great'],
 ['man'],
 ['jason'],
 ['nice'],
 ['person'],
 ['david'],
 ['bad'],
 ['human'],
 ['chris'],
 ['great'],
 ['personality']]
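
With n=1 the n_grams helper simply wraps each word in its own list. A larger n yields sliding windows of consecutive words, and no window crosses a sentence boundary. A minimal sketch, re-declaring the helper so that it runs on its own and using hand-tokenized sentences as a stand-in for the output of clean_data:

```python
def n_grams(data, n=3):
    """Collect every window of n consecutive tokens from each tokenized sentence."""
    windows = []
    for tokens in data:
        for start in range(len(tokens) - n + 1):
            windows.append(tokens[start:start + n])
    return windows

# Assumed input: two sentences that have already been cleaned and split
sentences = [["branden", "good", "person"], ["david", "bad", "human"]]

unigrams = n_grams(sentences, 1)
bigrams = n_grams(sentences, 2)
print(bigrams)
# [['branden', 'good'], ['good', 'person'], ['david', 'bad'], ['bad', 'human']]
```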

In [31]:
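# Using nltk's word_tokenize function to create word tokens
# Assumption: example_sentence is not defined in the cells shown; the output below matches the preprocessed fourth sentence
example_sentence = text_preprocess(data[3])
word_tokenize(example_sentence)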

['david', 'bad', 'human']

#### 2. Character Tokenization
***
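- Split the sentence into characters, then convert each character to an integer with an index mapping (the character-level counterpart of word_to_index)
- Lets the tokenization process retain information about out-of-vocabulary (OOV) words, which word tokenization cannot

1. Benefits
    - Retains more information from the corpus, since no word has to be replaced by an unknown token
2. Cons
    - Increases the dimensionality of each word's representation, since a single word becomes a sequence of several tokens
    - Example: (My name) -> ("M", "y", "n", "a", "m", "e")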
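
Both points can be seen in a short sketch. The helper names `char_tokenize` and `build_char_to_index` are assumptions made for this illustration, and whitespace is dropped to match the example above.

```python
# Hypothetical helpers for illustration; these names do not come from the notebook.
def char_tokenize(text):
    """Split text into single characters, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

def build_char_to_index(texts):
    """Map each distinct character to an integer id in order of first appearance."""
    char_to_index = {}
    for text in texts:
        for ch in char_tokenize(text):
            if ch not in char_to_index:
                char_to_index[ch] = len(char_to_index)
    return char_to_index

tokens = char_tokenize("My name")
print(tokens)  # ['M', 'y', 'n', 'a', 'm', 'e']

corpus = ["jason nice person", "david bad human"]
char_to_index = build_char_to_index(corpus)

# "sara" is not a word in the corpus, yet every one of its characters is known
sara_ids = [char_to_index[ch] for ch in char_tokenize("sara")]
print(sara_ids)  # [2, 1, 9, 1]

# The cost: the same sentence is 3 word tokens but 13 character tokens
word_count = len("david bad human".split())
char_count = len(char_tokenize("david bad human"))
print(word_count, char_count)  # 3 13
```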


#### 3. Subword Tokenization
***