<a href="https://colab.research.google.com/github/RiverTwilight/Awesome-Machine-Learning-Playground/blob/master/Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

> By Rene Wang

This notebook provides a comprehensive explanation of how a basic language model operates, covering all the essential steps. Upon completion, you will possess a solid foundational understanding of Natural Language Processing (NLP), enabling you to delve deeper into the subject.

Note: It's recommended to read the [Handwritten_Digits_Detection]() notebook first; some basic concepts and code are explained there.

In [71]:
import os
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jieba
import pickle
!pip install kora colorama -q
from colorama import Fore, Back, Style
from kora import drive
drive.link_nbs()

# from Handwritten_Digits_Detection import Relu, Affine, AdaGard

To render Chinese characters in matplotlib, we first need to download a font that supports them.

In [9]:
!wget -O SourceHanSansSC-Normal.otf https://github.com/adobe-fonts/source-han-sans/blob/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf?raw=true

os.makedirs('/root/.config/matplotlib', exist_ok=True)

!cp SourceHanSansSC-Normal.otf /root/.config/matplotlib/

matplotlib.font_manager.fontManager.addfont('SourceHanSansSC-Normal.otf')

matplotlib.rc("font",family='Source Han Sans SC')

--2023-06-09 10:06:38--  https://github.com/adobe-fonts/source-han-sans/blob/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf?raw=true
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/adobe-fonts/source-han-sans/raw/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf [following]
--2023-06-09 10:06:38--  https://github.com/adobe-fonts/source-han-sans/raw/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/adobe-fonts/source-han-sans/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf [following]
--2023-06-09 10:06:39--  https://raw.githubusercontent.com/adobe-fonts/source-han-sans/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf
Resolving raw.githubusercontent.com (r

# Word Embedding: Count-based Method

To feed text into a neural network, the most common solution is to convert words into vectors, which is called **word embedding**. In this part we'll discuss how to convert natural-language text into vectors.

## Word co-occurrence matrix

Usually the meaning of a word depends on its context. A word co-occurrence matrix represents how many times each word appears around another word.

Consider this sentence:
```markdown
 You are my friend and I am his friend. 
```
First we remove the duplicate words to get the vocabulary. Then, counting only the immediate neighbours of each word, we can create a table like this:

\begin{array}{c|ccccccccc}
\text{ }&\text{You}&\text{are}&\text{my}&\text{friend}&\text{and}&\text{I}&\text{am}&\text{his}&\text{.}\\
\hline
\text{You}&0&1&0&0&0&0&0&0&0\\
\text{are}&1&0&1&0&0&0&0&0&0\\
\text{my}&0&1&0&1&0&0&0&0&0\\
\text{friend}&0&0&1&0&1&0&0&1&1\\
\text{and}&0&0&0&1&0&1&0&0&0\\
\text{I}&0&0&0&0&1&0&1&0&0\\
\text{am}&0&0&0&0&0&1&0&1&0\\
\text{his}&0&0&0&1&0&0&1&0&0\\
\text{.}&0&0&0&1&0&0&0&0&0\\
\end{array}

Each entry counts how many times the word in the column label appears in the context of the word in the row label. The context size is called **window_size**.
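As a minimal sketch of the counting described above, the snippet below builds the matrix for the example sentence. The helper name `cooccurrence_counts` and the pre-tokenized, lower-cased input are assumptions made for this illustration; they are not part of the notebook's own code.

```python
import numpy as np

def cooccurrence_counts(tokens, window_size=1):
    """Count how often each pair of words appears within window_size of each other."""
    vocab = list(dict.fromkeys(tokens))            # unique words in first-seen order
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.int32)

    for pos, word in enumerate(tokens):
        lo = max(0, pos - window_size)
        hi = min(len(tokens), pos + window_size + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:                     # a word is not its own context
                counts[index[word], index[tokens[ctx_pos]]] += 1
    return vocab, counts

# The example sentence, lower-cased and with the period as its own token
tokens = "you are my friend and i am his friend .".split()
vocab, counts = cooccurrence_counts(tokens, window_size=1)
print(vocab)
print(counts)
```

With a window size of 1 the matrix is symmetric, because "x is next to y" implies "y is next to x".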

In [11]:
import sys
import os
sys.path.append('..')
try:
    import urllib.request
except ImportError:
    raise ImportError('Use Python3!')
import pickle

url_base = 'https://raw.githubusercontent.com/tomsercu/lstm/master/data/'
key_file = {
    'train':'ptb.train.txt',
    'test':'ptb.test.txt',
    'valid':'ptb.valid.txt'
}
save_file = {
    'train':'ptb.train.npy',
    'test':'ptb.test.npy',
    'valid':'ptb.valid.npy'
}
vocab_file = 'ptb.vocab.pkl'

dataset_dir = '/content/drive/MyDrive/Project/NLP'

def _download(file_name):
    file_path = dataset_dir + '/' + file_name
    if os.path.exists(file_path):
        return

    print('Downloading ' + file_name + ' ... ')

    try:
        urllib.request.urlretrieve(url_base + file_name, file_path)
    except urllib.error.URLError:
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib.request.urlretrieve(url_base + file_name, file_path)

    print('Done')


def load_vocab():
    vocab_path = dataset_dir + '/' + vocab_file

    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name

    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()

    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word


def load_ptb_data(data_type='train'):
    '''
        :param data_type: type of data: 'train', 'test', or 'valid' ('val')
        :return: (corpus, word_to_id, id_to_word)
    '''
    if data_type == 'val': data_type = 'valid'
    save_path = dataset_dir + '/' + save_file[data_type]

    word_to_id, id_to_word = load_vocab()

    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word

    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name
    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()
    corpus = np.array([word_to_id[w] for w in words])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word

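def postprocess(text, split_policy="Default"):
    text = text.lower()
    text = text.replace('.', " .")

    if split_policy == "Chinese":
        words = list(jieba.cut(text))
    else:
        # split() without arguments also drops the empty strings that
        # leading, trailing, or repeated spaces would otherwise produce
        words = text.split()

    print(words)

    word_to_id = {}
    id_to_word = {}

    for word in words:
        # Assign a new id only the first time a word is seen, so that
        # word_to_id and id_to_word stay consistent with each other
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word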

In [12]:
def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        # Look i positions to the left and to the right, for i = 1..window_size
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

## PPMI & SVD

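But raw counts don't represent the link between words very well. Thus we introduce pointwise mutual information (PMI), where $P(x)$ denotes the probability of the event $x$ and $P(x, y)$ the probability that $x$ and $y$ occur together.

$$ PMI(x, y) = \log_{2}\frac{P(x, y)}{P(x)P(y)} $$

Because PMI tends to $-\infty$ when two words never co-occur, in practice we use the positive PMI (PPMI):

$$ PPMI(x, y) = \max(0, PMI(x, y)) $$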
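To make the formula concrete, here is a small worked example that estimates PMI from counts, using $P(x, y) \approx C(x, y)/N$ and $P(x) \approx C(x)/N$. The helper names are hypothetical, and the counts are assumed to come from the example sentence with a window size of 1, where the matrix entries sum to $N = 18$.

```python
import math

def pmi_from_counts(c_xy, c_x, c_y, n, eps=1e-8):
    """PMI estimated from counts: log2(C(x, y) * N / (C(x) * C(y)))."""
    # eps keeps log2 finite when the pair never co-occurs (c_xy == 0)
    return math.log2(c_xy * n / (c_x * c_y) + eps)

def ppmi_from_counts(c_xy, c_x, c_y, n, eps=1e-8):
    """Positive PMI: negative values are clipped to zero."""
    return max(0.0, pmi_from_counts(c_xy, c_x, c_y, n, eps))

# "you" co-occurs with "are" once; "you" has 1 neighbour in total, "are" has 2
pmi_you_are = pmi_from_counts(c_xy=1, c_x=1, c_y=2, n=18)
print(f"PMI(you, are)  = {pmi_you_are:.3f}")

# "friend" (4 neighbours) never co-occurs with "are" (2 neighbours)
ppmi_friend_are = ppmi_from_counts(c_xy=0, c_x=4, c_y=2, n=18)
print(f"PPMI(friend, are) = {ppmi_friend_are:.3f}")
```

The first pair scores high because "you" appears only next to "are"; the second is clipped to zero because the two words never co-occur.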

But as the vocabulary grows, using a full two-dimensional array like this becomes inefficient. So we introduce Singular Value Decomposition, or SVD, to reduce the dimension.

$$ X = U S V^{T} $$
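The decomposition can be sketched with NumPy as follows. The random matrix is only a stand-in for a PPMI matrix and `k` is an arbitrary choice for this illustration; keeping the first `k` columns of `U` gives one `k`-dimensional vector per word.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 6))            # stand-in for a 6-word PPMI matrix (assumption)

# np.linalg.svd returns U, the singular values S (largest first), and V transposed
U, S, Vt = np.linalg.svd(X)

k = 2                             # target dimension, chosen for illustration
word_vecs_k = U[:, :k]            # one k-dimensional vector per row (word)

# Best rank-k approximation of X built from the k largest singular values
X_rank_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print(word_vecs_k.shape)
print(np.linalg.norm(X - X_rank_k))   # reconstruction error of the rank-k version
```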

When the resulting word vectors are plotted, the closer two words are, the more similar they are. The following cell implements PPMI; SVD is applied later, in the test section.

In [13]:
def ppmi(C, verbose=False, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total//100+1) == 0:
                    print('%.1f%% done' % (100*cnt/total))

    return M

## Similarity Comparison

Our goal is to make the AI understand words based on their meaning, so we have to calculate the similarity between words.

For example, we want the machine to know that "I", "You" and "He" are similar words in English. In this part we will **calculate the similarity of vectors**.

Several measures can be used for this task:

* Euclidean Distance
* Cosine Similarity
* Manhattan Distance
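For comparison, the three measures listed above can be sketched in a few lines of NumPy. The function names and the two sample vectors are chosen for this illustration only.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(x - y))

def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return float(np.sum(np.abs(x - y)))

def cosine_similarity(x, y, eps=1e-8):
    """Cosine of the angle between two vectors; eps avoids division by zero."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a, twice the length

print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
print(cosine_similarity(a, b))
```

Note that the two distances grow with vector length, while cosine similarity depends only on direction, which is why `a` and `b` count as nearly identical under it.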

In [14]:
def cos_similarity(x, y, eps=1e-8):
    nx = x / np.sqrt(np.sum(x**2) + eps)
    ny = y / np.sqrt(np.sum(y**2) + eps)
    return np.dot(nx, ny)

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(word_to_id)

    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return

# Word-embedding: Predict-based Method

By using this method, we can get word embeddings that capture the connections between words better. This method is also called word2vec, which focuses on predicting a word from its given context, or vice versa.
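The notebook does not implement this part yet, so the following is only a sketch of the data-preparation step for the "predict a word from its context" direction: it turns a corpus of word ids into (context, target) pairs. The function name `make_context_target_pairs` and the toy corpus are assumptions made for this illustration.

```python
import numpy as np

def make_context_target_pairs(corpus, window_size=1):
    """Return (contexts, targets): each target word id with its surrounding ids."""
    targets = corpus[window_size:-window_size]
    contexts = []
    for idx in range(window_size, len(corpus) - window_size):
        # Collect the ids to the left and right of position idx, skipping idx itself
        neighbours = [corpus[idx + offset]
                      for offset in range(-window_size, window_size + 1)
                      if offset != 0]
        contexts.append(neighbours)
    return np.array(contexts), np.array(targets)

# Toy corpus of word ids (hypothetical); id 1 appears twice
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
contexts, targets = make_context_target_pairs(corpus, window_size=1)
print(contexts)
print(targets)
```

A model trained on these pairs learns to predict each target from its context; swapping the roles of the two arrays gives the opposite direction.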

# Test Word-embedding Data

You can skip this chapter if you don't want to learn about dataset preprocessing.

In [19]:
#@title Test Config
use_pretrained = True #@param {type:"boolean"}
split_policy = "Default" #@param ["Japanese", "Chinese", "Default"]
custom_data = "\u6211\u7231\u5317\u4EAC\u5929\u5B89\u95E8\uFF0C\u5929\u5B89\u95E8\u4E0A\u592A\u9633\u5347\u3002\u6211\u7231\u5317\u4EAC\u6545\u5BAB\uFF0C\u6545\u5BAB\u7684\u592A\u9633\u65E9\u5DF2\u5347\u8D77\u3002" #@param ["\u6211\u7231\u5317\u4EAC\u5929\u5B89\u95E8\uFF0C\u5929\u5B89\u95E8\u4E0A\u592A\u9633\u5347\u3002\u6211\u7231\u5317\u4EAC\u6545\u5BAB\uFF0C\u6545\u5BAB\u7684\u592A\u9633\u65E9\u5DF2\u5347\u8D77\u3002", " You are my friend and I am his friend. "] {allow-input: true}
data_source = "PTB" #@param ["PTB", "Custom"]


## Test Count-based data

We will use the functions we created earlier to generate word embeddings with the count-based method. If you don't need to see the data processing, you can skip the following code block and load the pre-trained embedding from the drive.

In [23]:
if not use_pretrained:
    corpus, word_to_id, id_to_word = load_ptb_data('train') if data_source == "PTB" else postprocess(custom_data, split_policy)

    vocab_size = len(word_to_id)

    C = create_co_matrix(corpus, vocab_size)

    print("Co-occurrence matrix")
    print(C)

    wordvec_size=100
    np.set_printoptions(precision=3)
    W = ppmi(C)

    print("PPMI of co-occurrence matrix")
    print(W)

    try:
        from sklearn.utils.extmath import randomized_svd
        U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=None)
    except ImportError:
        U, S, V = np.linalg.svd(W)

    print("SVDed PPMI matrix")
    print(U)

    if len(word_to_id) < 50:
        for word, word_id in word_to_id.items():
            plt.annotate(word, (U[word_id, 0], U[word_id, 1]))

        plt.scatter(U[:,0], U[:, 1], alpha=0.5)
        plt.show()

    wordvec_size = 100
    word_vecs = U[:, :wordvec_size]
    count_based_embedding = (word_vecs, word_to_id, id_to_word)

    with open(dataset_dir + "/count_based_embedding.pkl", 'wb') as f:
        pickle.dump(count_based_embedding, f)

else:

    with open(dataset_dir + "/count_based_embedding.pkl", 'rb') as f:
        if f:
            print("Embeddding Loaded")
            (word_vecs, word_to_id, id_to_word) = pickle.load(f)

Embeddding Loaded


In [25]:
querys = ['he', 'car', 'bread', 'watch', 'way']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)


[query] he
 she: 0.8356764912605286
 it: 0.7061509490013123
 yeargin: 0.5221908092498779
 that: 0.48959219455718994
 nobody: 0.46606236696243286

[query] car
 auto: 0.6837215423583984
 truck: 0.6087413430213928
 jewelry: 0.5681161284446716
 vehicle: 0.5538378953933716
 disk-drive: 0.5514360666275024

[query] bread
 peasants: 0.6837708950042725
 cubs: 0.6571336388587952
 insistence: 0.656091034412384
 toys: 0.6539305448532104
 viewpoint: 0.6513710021972656

[query] watch
 ivy: 0.5914743542671204
 dignity: 0.5657815337181091
 knock: 0.5625321269035339
 reconsider: 0.5484458208084106
 send: 0.5478222370147705

[query] way
 sign: 0.539264976978302
 knowledge: 0.5219477415084839
 chance: 0.4759937822818756
 dignity: 0.47081226110458374
 getting: 0.447970449924469



In conclusion, we've executed the following steps:

1. Initially, we generated a co-occurrence matrix to represent contextual relationships between words. In this matrix, every word was unique, ensuring no duplicates.

2. Following this, we transformed the co-occurrence matrix using the Positive Pointwise Mutual Information (PPMI) algorithm, thereby enhancing the representational quality.

3. To mitigate the memory footprint of the matrix, we employed Singular Value Decomposition (SVD) for dimensionality reduction. This effectively eliminated null values, resulting in a denser, more compact matrix.

4. Ultimately, we obtained a vector space representation for each word, facilitating efficient semantic analysis.

This process is generally referred to as the **Count-based Method** for creating word embeddings.

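The count-based method only needs to process the data once, but each word's embedding is less meaningful than one produced by the predict-based method. Besides, as the vocabulary grows, the cost rises quickly: the co-occurrence matrix has n×n entries, so memory grows quadratically with the vocabulary size n, and a full SVD of such a matrix takes on the order of n³ operations.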
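A quick back-of-the-envelope calculation shows how fast a dense n×n matrix grows. This sketch assumes 4 bytes per entry (as with `np.int32` or `np.float32`); the vocabulary sizes are arbitrary examples.

```python
def dense_matrix_bytes(vocab_size, bytes_per_entry=4):
    """Memory needed for a dense vocab_size x vocab_size matrix."""
    return vocab_size * vocab_size * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"n = {n:>7,}: {gigabytes:.3f} GB")
```

Multiplying the vocabulary size by 10 multiplies the memory by 100.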

## Test Predict-based data


In [None]:
# TODO

# Data Process


In [29]:
#@title Text Process Config
embedding_method = "Count-based" #@param ["word2vec", "Count-based"]
use_cached_embedding = False #@param {type:"boolean"}


## Normalization

We should be clear that real-world data is not as clean as we assumed. It usually:

* Differs in length
* Contains some non-word characters
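One possible way to handle both issues is sketched below: strip everything except letters, digits, and whitespace, then pad or truncate to a fixed length. The helper names, the `<pad>` token, and the English-only character filter are assumptions for this illustration; the notebook itself handles varying length by averaging word vectors instead.

```python
import re

def normalize_text(text):
    """Lower-case the text and replace non-alphanumeric characters with spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # English-only filter (assumption)
    return text.split()

def pad_or_truncate(tokens, length, pad_token="<pad>"):
    """Force a token list to exactly `length` items."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))

tokens = normalize_text("Hurry up!! Limited-time offer :)")
print(tokens)
print(pad_or_truncate(tokens, 8))
print(pad_or_truncate(tokens, 3))
```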

In [63]:
if use_cached_embedding:
    # Pick the cached file that matches the selected embedding method
    if embedding_method == "Count-based":
        embedding_file = "/count_based_embedding.pkl"
    else:  # "word2vec"
        embedding_file = "/predict_based_embedding.pkl"

    with open(dataset_dir + embedding_file, 'rb') as f:
        (word_vecs, word_to_id, id_to_word) = pickle.load(f)

def sentence_to_vector(sentence, model=None):
    # word_vecs has shape (vocab_size, wordvec_size): shape[0] is the number of
    # words in the vocabulary and shape[1] is the embedding dimension
    sentence_vector = np.zeros(word_vecs.shape[1])

    num_words = 0
    
    for word in sentence.lower().split(" "):
        if embedding_method == "Count-based":
            if word in word_to_id:
                sentence_vector += word_vecs[word_to_id[word]]
        else:
            sentence_vector += model.wv[word]
                
        num_words += 1
    
    # If the sentence is not empty, divide the sum by the number of words to get the average
    if num_words > 0:
        sentence_vector /= num_words
    
    return sentence_vector

print(sentence_to_vector("The New Jersey Devils and the Detroit Red Wings play Ice Hockey."))

[ 2.578e-02  2.828e-02  9.418e-03 -1.313e-02  4.411e-03 -8.618e-03
  2.677e-03  1.972e-02  7.540e-03 -8.971e-03  3.138e-02  6.585e-03
 -2.964e-03 -2.060e-02  7.670e-03 -7.219e-03 -1.078e-02  3.520e-03
 -5.052e-03  1.339e-02  4.126e-03 -1.232e-03 -2.193e-02 -3.325e-03
  1.934e-02  3.677e-03 -7.852e-03  2.615e-03 -8.546e-03 -7.703e-03
 -7.261e-03 -8.996e-03  1.437e-02 -9.756e-03 -1.098e-03 -8.365e-03
  5.561e-03 -1.681e-02 -8.022e-03 -1.235e-02  6.034e-03 -1.096e-02
  5.051e-03 -4.538e-03  4.897e-03  1.592e-02  5.728e-03 -5.001e-03
 -1.843e-02 -6.627e-03 -3.112e-03 -3.591e-03 -1.885e-03 -9.115e-03
  3.566e-03 -3.167e-03 -2.176e-03 -8.351e-03  1.270e-03 -6.808e-03
 -1.799e-02 -8.503e-03 -2.359e-03 -8.527e-03  2.015e-03 -8.298e-03
 -7.903e-03  1.642e-02  3.235e-03 -2.093e-03  3.239e-04  4.637e-03
 -5.446e-04  7.157e-03 -5.292e-03  1.335e-03 -8.376e-03  1.175e-03
  4.807e-03 -6.613e-03 -1.924e-05  1.102e-02 -1.196e-02 -4.392e-03
  9.361e-03 -3.458e-03 -6.001e-03  1.307e-02  2.744e-03 -1.024

# Trainer: Spam Message Filter

If you want to know how ne

In [None]:
#@title Trainer Config


## Dataset

In [102]:
def read_dataset(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()

    # Initialize empty lists to hold the labels and message vectors
    labels = []
    vectors = []

    for line in lines:
        # Split the line into label and message at the first whitespace character
        parts = line.strip().split(None, 1)

        # If the line doesn't have at least two parts, skip it
        if len(parts) < 2:
            continue

        label, message = parts

        # Convert the label to binary (1 for spam, 0 for ham)
        label = 1 if label == 'spam' else 0

        # Convert the message to a vector using the sentence_to_vector function
        vector = sentence_to_vector(message)

        # Add the label and vector to the lists
        labels.append(label)
        vectors.append(vector)

    return labels, vectors

labels, vectors = read_dataset('/content/SMSSpamCollection.txt')

# Optionally, you may want to convert the lists to numpy arrays for use with many machine learning libraries
labels = np.array(labels)
embedded_messages = np.array(vectors)

## Basic Classifier

In [105]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Assuming 'embedded_messages' is your list of message vectors and 'labels' is your list of labels

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(embedded_messages, labels, test_size=0.2)

# Convert the lists to numpy arrays
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

# Define the model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(X_train.shape[1],), kernel_regularizer=tf.keras.regularizers.l2(0.001)), # assuming input vectors are 1D
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer for binary classification
])


# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss: {loss}, Accuracy: {accuracy}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Loss: 0.2773182690143585, Accuracy: 0.9199288487434387


In [107]:
def predict_spam(model, sentence):
    # Convert the sentence to vector
    vector = sentence_to_vector(sentence)

    # Remember to match the shape of the input your model expects. 
    # If your model was trained on single samples of shape (N,), you may need to expand the dimensions of your input
    vector = np.expand_dims(vector, axis=0)

    # Use the model to predict the probability of the sentence being spam
    prediction = model.predict(vector, verbose=0)

    # Since we use a sigmoid activation function in our final layer, 
    # the output will be a number between 0 and 1 representing the probability that the sentence is spam.
    # We can convert this to a binary label by choosing a threshold (like 0.5) and classifying all sentences 
    # with a probability greater than the threshold as spam (1) and all others as not spam (0).
    # model.predict returns an array of shape (1, 1); take the scalar explicitly
    label = 1 if prediction[0, 0] > 0.5 else 0

    return label

sentences = [
    "Please call our customer service to get a free coupon",
    "What's your plan of this Saturaday?",
    "I love you Cathy",
    "For only 6 Rewards points, scroll through your social feed worry-free with 500MB data, valid for 1 day!",
    "[Leetcode] 42094 is You login code. Do not share it with anyone.",
    "Long time no see my old friend.",
    "Make everyone green with envy with our new collection. Shop now at www.fakeshoppingsite.com",
    "Your OKX verification code is: 443287. This code will expire in 10 minutes. Don't share this code with anyone; our employees will never ask for the code.",
    "Hurry up! Limited time offer at www.fakeofferwebsite.com"
]

for s in sentences:
    prediction = predict_spam(model, s)
    print(f"{Fore.RED + 'SPAM  ' + Fore.RESET if prediction == 1 else  Fore.GREEN +  'HAM   ' + Fore.RESET}{Fore.YELLOW + s + Fore.RESET}")

SPAM  Please call our customer service to get a free coupon
HAM   What's your plan of this Saturaday?
HAM   I love you Cathy
SPAM  For only 6 Rewards points, scroll through your social feed worry-free with 500MB data, valid for 1 day!
HAM   [Leetcode] 42094 is You login code. Do not share it with anyone.
HAM   Long time no see my old friend.
SPAM  Make everyone green with envy with our new collection. Shop now at www.fakeshoppingsite.com
HAM   Your OKX verification code is: 443287. This code will expire in 10 minutes. Don't share this code with anyone; our employees will never ask for the code.
HAM   Hurry up! Limited time offer at www.fakeofferwebsite.com


## Optimized Classifier

# Trainer: Translator

# Reference

1. Saito Kokih. Deep Learning from Scratch 4: Natural Language Processing[M]. Japan: O'Reilly Japan, 2018.

2. Khelifi Ahmed Aziz. Medium. Learn How to Write Markdown & LaTeX in The Jupyter Notebook (https://towardsdatascience.com/write-markdown-latex-in-the-jupyter-notebook-10985edb91fd)