<a href="https://colab.research.google.com/github/RiverTwilight/Awesome-Machine-Learning-Playground/blob/master/Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

> By Rene Wang

This notebook provides a comprehensive explanation of how a basic language model operates, covering all the essential steps. Upon completion, you will possess a solid foundational understanding of Natural Language Processing (NLP), enabling you to delve deeper into the subject.

Note: It is recommended to read [Handwritten_Digits_Detection]() first; some basic concepts and code are explained there.

*If you encounter problems while studying this notebook, feel free to submit an issue in this [repo](https://github.com/RiverTwilight/Awesome-Machine-Learning-Playground)*

In [None]:
import os
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import jieba
import time
import pickle
!pip install -q kora colorama
from colorama import Fore, Back, Style
from kora import drive
drive.link_nbs()
from Handwritten_Digits_Recognition import SoftmaxWithLoss, Adam

To render Chinese characters in matplotlib, we need to download a font first.

In [None]:
!wget -O SourceHanSansSC-Normal.otf https://github.com/adobe-fonts/source-han-sans/blob/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf?raw=true

os.makedirs('/root/.config/matplotlib', exist_ok=True)

!cp SourceHanSansSC-Normal.otf /root/.config/matplotlib/

matplotlib.font_manager.fontManager.addfont('SourceHanSansSC-Normal.otf')

matplotlib.rc("font",family='Source Han Sans SC')

--2023-06-12 07:48:39--  https://github.com/adobe-fonts/source-han-sans/blob/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf?raw=true
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/adobe-fonts/source-han-sans/raw/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf [following]
--2023-06-12 07:48:39--  https://github.com/adobe-fonts/source-han-sans/raw/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/adobe-fonts/source-han-sans/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf [following]
--2023-06-12 07:48:39--  https://raw.githubusercontent.com/adobe-fonts/source-han-sans/release/OTF/SimplifiedChinese/SourceHanSansSC-Normal.otf
Resolving raw.githubusercontent.com (raw.g

# Word Embedding: Count-based Method

To feed text into a neural network, the most common approach is to map each word to a vector, which is called **word embedding**. In this part we discuss how to convert natural-language text into vectors.

## Word co-occurrence matrix

The meaning of a word usually depends on its context. A word co-occurrence matrix records how many times each word appears near every other word.

Consider this sentence:
```markdown
 You are my friend and I am his friend.
```
First we list each distinct word once, then count how often every other word appears within two words on either side of it:

\begin{array}{cccccccccc}
\text{ }&\text{You}&\text{are}&\text{my}&\text{friend}&\text{and}&\text{I}&\text{am}&\text{his}&\text{.}\\
\text{You}&\text{0}&\text{1}&\text{1}&\text{0}&\text{0}&\text{0}&\text{0}&\text{0}&\text{0}\\
\text{are}&\text{1}&\text{0}&\text{1}&\text{1}&\text{0}&\text{0}&\text{0}&\text{0}&\text{0}\\
\text{my}&\text{1}&\text{1}&\text{0}&\text{1}&\text{1}&\text{0}&\text{0}&\text{0}&\text{0}\\
\text{friend}&\text{0}&\text{1}&\text{1}&\text{0}&\text{1}&\text{1}&\text{1}&\text{1}&\text{1}\\
\text{and}&\text{0}&\text{0}&\text{1}&\text{1}&\text{0}&\text{1}&\text{1}&\text{0}&\text{0}\\
\text{I}&\text{0}&\text{0}&\text{0}&\text{1}&\text{1}&\text{0}&\text{1}&\text{1}&\text{0}\\
\text{am}&\text{0}&\text{0}&\text{0}&\text{1}&\text{1}&\text{1}&\text{0}&\text{1}&\text{0}\\
\text{his}&\text{0}&\text{0}&\text{0}&\text{1}&\text{0}&\text{1}&\text{1}&\text{0}&\text{1}\\
\text{.}&\text{0}&\text{0}&\text{0}&\text{1}&\text{0}&\text{0}&\text{0}&\text{1}&\text{0}\\
\end{array}

Each row shows how many times the column words appear in the context of that row's word. The number of words considered on each side is called the **window size** (`window_size`); the table above uses a window size of 2.
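As a minimal sketch of the counting described above, the following builds the matrix for the example sentence. The helper name `count_cooccurrences`, the whitespace tokenization, and the window of two words on each side are assumptions made for this illustration; the notebook's own implementation follows in the next cells.

```python
import numpy as np

def count_cooccurrences(tokens, window_size=2):
    """Return (vocab, matrix); matrix[i, j] counts how often vocab[j]
    appears within `window_size` tokens of vocab[i]."""
    vocab = list(dict.fromkeys(tokens))          # unique words, first-seen order
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int32)

    for pos, word in enumerate(tokens):
        lo = max(0, pos - window_size)
        hi = min(len(tokens), pos + window_size + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:                   # skip the word itself
                matrix[index[word], index[tokens[ctx_pos]]] += 1
    return vocab, matrix

tokens = "You are my friend and I am his friend .".split()
vocab, matrix = count_cooccurrences(tokens, window_size=2)
print(vocab)
print(matrix)
```

Because "friend" occurs twice, its row accumulates the neighbours of both occurrences.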

In [None]:
import sys
import os
sys.path.append('..')
try:
    import urllib.request
except ImportError:
    raise ImportError('Use Python3!')
import pickle

url_base = 'https://raw.githubusercontent.com/tomsercu/lstm/master/data/'
key_file = {
    'train':'ptb.train.txt',
    'test':'ptb.test.txt',
    'valid':'ptb.valid.txt'
}
save_file = {
    'train':'ptb.train.npy',
    'test':'ptb.test.npy',
    'valid':'ptb.valid.npy'
}
vocab_file = 'ptb.vocab.pkl'

dataset_dir = '/content/drive/MyDrive/Project/NLP'

def _download(file_name):
    file_path = dataset_dir + '/' + file_name
    if os.path.exists(file_path):
        return

    print('Downloading ' + file_name + ' ... ')

    try:
        urllib.request.urlretrieve(url_base + file_name, file_path)
    except urllib.error.URLError:
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib.request.urlretrieve(url_base + file_name, file_path)

    print('Done')


def load_vocab():
    vocab_path = dataset_dir + '/' + vocab_file

    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name

    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()

    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word


def load_ptb_data(data_type='train'):
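    '''
        :param data_type: which split to load: 'train', 'test', or 'valid' ('val' is also accepted)
        :return: (corpus, word_to_id, id_to_word), where corpus is a NumPy array of word IDs
    '''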
    if data_type == 'val': data_type = 'valid'
    save_path = dataset_dir + '/' + save_file[data_type]

    word_to_id, id_to_word = load_vocab()

    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word

    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name
    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()
    corpus = np.array([word_to_id[w] for w in words])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word

def postprocess(text, split_policy="Default"):
    text = text.lower()
    text = text.replace('.', " .")

    if split_policy == "Chinese":
        words = list(jieba.cut(text))
    else:
        # split() without arguments drops the empty strings that leading,
        # trailing, or repeated spaces would otherwise produce
        words = text.split()

    print(words)

    word_to_id = {}
    id_to_word = {}

    for word in words:
        # Assign a new ID only the first time a word is seen, so the two
        # dictionaries stay consistent with each other
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word

In [None]:
def create_co_matrix(corpus, vocab_size, windows_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        # Visit every offset from 1 up to windows_size on both sides
        for i in range(1, windows_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix

## PPMI & SVD

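Raw counts do not capture the relationship between words very well: very frequent words such as "the" co-occur with almost everything, so a high count does not necessarily mean a strong association. Pointwise Mutual Information (PMI) corrects for this. Here $P(x)$ denotes the probability that word $x$ occurs and $P(x, y)$ the probability that $x$ and $y$ co-occur:

$$ \mathrm{PMI}(x, y) = \log_{2}\frac{P(x, y)}{P(x)P(y)} $$

PMI is negative when two words co-occur less often than chance would predict, and it tends to $-\infty$ when they never co-occur. To avoid this, we use Positive Pointwise Mutual Information (PPMI). While negative PMI values can carry useful information, in many applications only the strength of the positive association is of interest. PPMI clips PMI at zero, turning all negative values into zero:

$$ \mathrm{PPMI}(x, y) = \max(\mathrm{PMI}(x, y), 0) $$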
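A small worked example may make the two formulas more concrete. The counts below are hypothetical, and the helper names `pmi_from_counts` and `ppmi_from_counts` are chosen for this sketch only; probabilities are estimated as counts divided by the corpus size.

```python
import math

def pmi_from_counts(count_xy, count_x, count_y, total):
    """PMI from raw counts: log2( P(x, y) / (P(x) * P(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def ppmi_from_counts(count_xy, count_x, count_y, total):
    """PPMI clips negative PMI to zero; a pair that never co-occurs maps to 0."""
    if count_xy == 0:
        return 0.0
    return max(0.0, pmi_from_counts(count_xy, count_x, count_y, total))

# Hypothetical counts in a corpus of 10,000 tokens.
total = 10_000
pmi_the_car = pmi_from_counts(count_xy=10, count_x=1_000, count_y=20, total=total)
pmi_car_drive = pmi_from_counts(count_xy=5, count_x=20, count_y=10, total=total)
ppmi_rare_pair = ppmi_from_counts(count_xy=1, count_x=1_000, count_y=500, total=total)

print(round(pmi_the_car, 2))    # log2(5)
print(round(pmi_car_drive, 2))  # log2(250)
print(ppmi_rare_pair)           # negative PMI clipped to 0.0
```

Although "the" and "car" co-occur more often in absolute terms, "car" and "drive" receive the higher score because both words are rare.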

As the vocabulary grows, a dense two-dimensional array like this becomes inefficient: it has one row and one column per word, and most of its entries are typically zero. We therefore use Singular Value Decomposition (SVD) to reduce the dimensionality:

$$ X = U S V^{\mathsf{T}} $$

When the reduced vectors are plotted, the closer two words are, the more similar they are. The following cell implements PPMI; SVD is applied to the PPMI matrix later, in the test section.

![SVD](https://pic3.zhimg.com/v2-f249e8a4be916e51b9d537c8380ae6e2_b.jpg)
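The decomposition can be sketched on a tiny matrix with `np.linalg.svd`. The matrix values here are invented for illustration, and keeping the first `k` columns of `U` as word vectors mirrors what the test section of this notebook does; it is one common choice, not the only one.

```python
import numpy as np

# Hypothetical 4x4 PPMI-like matrix (values invented for illustration).
X = np.array([
    [0.0, 1.8, 0.8, 0.0],
    [1.8, 0.0, 0.8, 0.0],
    [0.8, 0.8, 0.0, 1.2],
    [0.0, 0.0, 1.2, 0.0],
])

U, S, Vt = np.linalg.svd(X)            # S holds the singular values, largest first
reconstructed = U @ np.diag(S) @ Vt    # equals X up to floating-point error

k = 2
reduced_vecs = U[:, :k]                # keep the first k columns as k-dimensional vectors

print(S)
print(reduced_vecs.shape)  # (4, 2)
```

Each row of `reduced_vecs` is a dense two-dimensional vector for one word, which is what makes a scatter plot of the vocabulary possible.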

In [None]:
def ppmi(C, verbose=False, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total//100+1) == 0:
                    print('%.1f%% done' % (100*cnt/total))

    return M

## Similarity Comparison

Our goal is for the model to treat words according to their meaning, so we need a way to measure how similar two words are.

For example, we want the machine to know that "I", "You" and "He" are similar words in English. In this part we will **calculate the similarity of vectors**.

Several measures can be used for this; the code below uses cosine similarity:

* Euclidean Distance
* Cosine Similarity
* Manhattan Distance
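The three measures listed above can be compared on a pair of small vectors. This is a minimal sketch; the function names are chosen for this example and the vectors are arbitrary.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between x and y."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return float(np.sum(np.abs(x - y)))

def cosine_similarity(x, y, eps=1e-8):
    """Cosine of the angle between x and y; eps guards against division by zero."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # same direction as x, twice the length

print(euclidean_distance(x, y))
print(manhattan_distance(x, y))   # 6.0
print(cosine_similarity(x, y))    # very close to 1
```

The two distances grow with vector length, while cosine similarity depends only on direction. That is one reason it is a popular choice for word vectors, whose lengths often reflect word frequency.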

In [None]:
def cos_similarity(x, y, eps=1e-8):
    nx = x / np.sqrt(np.sum(x**2) + eps)
    ny = y / np.sqrt(np.sum(y**2) + eps)
    return np.dot(nx, ny)

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(word_to_id)

    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return

# Word Embedding: Predict-based Method

This method yields embeddings that capture the relationships between words better. It is known as word2vec, and it focuses on predicting a word from its given context, or vice versa.
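The two prediction directions differ only in how training pairs are formed. Predicting a word from its context is the CBOW setup used later in this notebook; the reverse direction is usually called skip-gram. The sketch below shows both on a toy sentence; the helper names `cbow_pairs` and `skipgram_pairs` are hypothetical.

```python
def cbow_pairs(tokens, window_size=1):
    """(context words) -> target word, one pair per position with a full window."""
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window_size=1):
    """target word -> one context word at a time."""
    pairs = []
    for context, target in cbow_pairs(tokens, window_size):
        pairs.extend((target, c) for c in context)
    return pairs

tokens = ["you", "are", "my", "friend"]
print(cbow_pairs(tokens))      # [(['you', 'my'], 'are'), (['are', 'friend'], 'my')]
print(skipgram_pairs(tokens))  # [('are', 'you'), ('are', 'my'), ('my', 'are'), ('my', 'friend')]
```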

## Context Generation

In [None]:
def create_contexts_target(corpus, window_size=1):
    target = corpus[window_size: -window_size]
    context = []

    for idx in range(window_size, len(corpus)-window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:
                continue
            cs.append(corpus[idx + t])
        context.append(cs)

    return np.array(context), np.array(target)

def convert_one_hot(contexts, vocab_size):
    one_hots = []

    for context in contexts:
        if type(context) is np.ndarray:
            one_hots.append(convert_one_hot(context, vocab_size))
        else:
            labels = np.zeros(vocab_size)
            labels[context] = 1
            one_hots.append(labels)

    return np.array(one_hots)
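The cell above turns word IDs into one-hot vectors so they can be multiplied with a weight matrix. As a compact alternative sketch, the same shapes can be obtained by indexing an identity matrix; the function name `one_hot` and the sample IDs are assumptions for this illustration.

```python
import numpy as np

def one_hot(ids, vocab_size):
    """Turn an integer array of any shape into one-hot vectors along a new last axis."""
    ids = np.asarray(ids)
    return np.eye(vocab_size, dtype=np.int32)[ids]

target = np.array([1, 2, 3])                    # shape (3,)
contexts = np.array([[0, 2], [1, 3], [2, 4]])   # shape (3, 2)

print(one_hot(target, vocab_size=5).shape)    # (3, 5)
print(one_hot(contexts, vocab_size=5).shape)  # (3, 2, 5)
print(one_hot(target, vocab_size=5)[0])       # [0 1 0 0 0]
```

The last axis always has length `vocab_size`, which is why one-hot inputs become expensive for large vocabularies.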

## CBOW

CBOW (continuous bag-of-words) takes a window of surrounding words as input and tries to predict the target word at the center of the window.
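Before the layer classes, here is a minimal sketch of a single CBOW forward pass in plain NumPy, assuming a window size of 1 (one word on each side). The sizes, the random weights, and the word IDs are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 7, 3
W_in = 0.01 * rng.standard_normal((vocab_size, hidden_size))
W_out = 0.01 * rng.standard_normal((hidden_size, vocab_size))

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# One training example: context word IDs 0 and 2 surround the target word ID 1.
context_ids = np.array([[0, 2]])
contexts = np.eye(vocab_size)[context_ids]   # one-hot, shape (1, 2, vocab_size)

h0 = contexts[:, 0] @ W_in     # hidden vector of the left context word
h1 = contexts[:, 1] @ W_in     # hidden vector of the right context word
h = 0.5 * (h0 + h1)            # average the two context vectors
probs = softmax(h @ W_out)     # predicted distribution over the vocabulary

print(probs.shape)   # (1, 7)
```

Training adjusts `W_in` and `W_out` so that the probability assigned to the true target word increases; the rows of `W_in` then serve as the word vectors.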

In [None]:
class MatMul:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.x = None

    def forward(self, x):
        W, = self.params
        out = np.dot(x, W)
        self.x = x
        return out

    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)
        dW = np.dot(self.x.T, dout)
        self.grads[0][...] = dW
        return dx

In [None]:
class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')

        self.in_layer0 = MatMul(W_in)
        self.in_layer1 = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer = SoftmaxWithLoss(W_in)

        layers = [self.in_layer0, self.in_layer1, self.out_layer]
        self.params, self.grads = [], []

        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        self.word_vecs = W_in

    def forward(self, contexts, target):
        # contexts has shape (batch, 2, vocab_size): index 0 is the word
        # before the target, index 1 the word after it
        h0 = self.in_layer0.forward(contexts[:, 0])
        h1 = self.in_layer1.forward(contexts[:, 1])
        h = (h0 + h1) * 0.5
        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss

    def backward(self, dout=1):
        ds = self.loss_layer.backward(dout)
        da = self.out_layer.backward(ds)
        da *= 0.5

        self.in_layer1.backward(da)
        self.in_layer0.backward(da)

        return None

# Testing Word-embedding

You can skip this chapter if you are not interested in dataset preprocessing.

In [None]:
#@title Test Config
use_pretrained = False #@param {type:"boolean"}
split_policy = "Default" #@param ["Japanese", "Chinese", "Default"]
custom_data = " You are my friend and I am his friend. " #@param ["\u6211\u7231\u5317\u4EAC\u5929\u5B89\u95E8\uFF0C\u5929\u5B89\u95E8\u4E0A\u592A\u9633\u5347\u3002\u6211\u7231\u5317\u4EAC\u6545\u5BAB\uFF0C\u6545\u5BAB\u7684\u592A\u9633\u65E9\u5DF2\u5347\u8D77\u3002", " You are my friend and I am his friend. "] {allow-input: true}
data_source = "PTB" #@param ["PTB", "Custom"]

#@markdown ## CBOW Hyperparameters

window_size = 1 #@param {type:"integer"}
hidden_size = 5 #@param {type:"integer"}
batch_size = 3 #@param {type:"integer"}
max_epoch = 1000 #@param {type:"slider", min:100, max:10000, step:100}



## Test Count-based data

We will use the functions we created earlier to generate word embeddings with the count-based method. If you don't need to see the data processing, you can skip the following code block and load the pre-trained embedding from the drive.

In [None]:
if not use_pretrained:
    corpus, word_to_id, id_to_word = load_ptb_data('train') if data_source == "PTB" else postprocess(custom_data, split_policy)

    vocab_size = len(word_to_id)

    C = create_co_matrix(corpus, vocab_size)

    print("Co-occurrence matrix")
    print(C)

    wordvec_size=100
    np.set_printoptions(precision=3)
    W = ppmi(C)

    print("PPMI co-occurrence matrix")
    print(W)

    try:
        from sklearn.utils.extmath import randomized_svd
        U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=None)
    except ImportError:
        U, S, V = np.linalg.svd(W)

    print("SVDed PPMI matrix")
    print(U)

    if len(word_to_id) < 50:
        for word, word_id in word_to_id.items():
            plt.annotate(word, (U[word_id, 0], U[word_id, 1]))

        plt.scatter(U[:,0], U[:, 1], alpha=0.5)
        plt.show()

    wordvec_size = 100
    word_vecs = U[:, :wordvec_size]
    count_based_embedding = (word_vecs, word_to_id, id_to_word)

    with open(dataset_dir + "/count_based_embedding.pkl", 'wb') as f:
        pickle.dump(count_based_embedding, f)

else:

    with open(dataset_dir + "/count_based_embedding.pkl", 'rb') as f:
        (word_vecs, word_to_id, id_to_word) = pickle.load(f)
        print("Embedding loaded")

Embedding loaded


In [None]:
querys = ['he', 'car', 'bread', 'watch', 'apple']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)


[query] he
 she: 0.8356764912605286
 it: 0.7061509490013123
 yeargin: 0.5221908092498779
 that: 0.48959219455718994
 nobody: 0.46606236696243286

[query] car
 auto: 0.6837215423583984
 truck: 0.6087413430213928
 jewelry: 0.5681161284446716
 vehicle: 0.5538378953933716
 disk-drive: 0.5514360666275024

[query] bread
 peasants: 0.6837708950042725
 cubs: 0.6571336388587952
 insistence: 0.656091034412384
 toys: 0.6539305448532104
 viewpoint: 0.6513710021972656

[query] watch
 ivy: 0.5914743542671204
 dignity: 0.5657815337181091
 knock: 0.5625321269035339
 reconsider: 0.5484458208084106
 send: 0.5478222370147705

[query] apple
 impeachment: 0.4544225037097931
 convex: 0.44156163930892944
 printer: 0.4251265227794647
 disks: 0.41802775859832764
 chaos: 0.41405174136161804



In conclusion, we've executed the following steps:

1. Initially, we generated a co-occurrence matrix to represent contextual relationships between words. Each distinct word in the corpus gets exactly one row and one column.

2. Following this, we transformed the co-occurrence matrix using the Positive Pointwise Mutual Information (PPMI) algorithm, thereby enhancing the representational quality.

3. To shrink the matrix, we applied Singular Value Decomposition (SVD) for dimensionality reduction, turning the large, mostly zero matrix into dense, low-dimensional word vectors.

4. Ultimately, we obtained a vector space representation for each word, facilitating efficient semantic analysis.

This process is generally referred to as the **Count-based Method** for creating word embeddings.

The count-based method processes the training data only once, but the resulting embeddings tend to be less meaningful than those of the predict-based method. Besides, the co-occurrence matrix has n × n entries for a vocabulary of n words, so memory grows quadratically with vocabulary size, and a full SVD of such a matrix takes on the order of n³ operations.
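To make the n × n growth concrete, the following sketch estimates the memory of a dense matrix for a few vocabulary sizes. It assumes 4 bytes per entry (as with `np.int32` or `np.float32`); the function name `co_matrix_bytes` is hypothetical.

```python
def co_matrix_bytes(vocab_size, bytes_per_entry=4):
    """Memory needed for a dense vocab_size x vocab_size matrix."""
    return vocab_size * vocab_size * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} words -> {co_matrix_bytes(n) / 1e9:.3f} GB")
```

Doubling the vocabulary quadruples the memory, which is why a dense co-occurrence matrix quickly becomes impractical for large corpora.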

## Test Predict-based data


In [None]:
def remove_duplicate(params, grads):
    '''
    Merge duplicated (shared) weights in the parameter list into one,
    and add up the gradients that correspond to that weight.
    '''
    params, grads = params[:], grads[:]  # copy list

    while True:
        find_flg = False
        L = len(params)

        for i in range(0, L - 1):
            for j in range(i + 1, L):
                # When the weights are shared
                if params[i] is params[j]:
                    grads[i] += grads[j]  # accumulate the gradient
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)
                # When the weights are shared as a transposed matrix (weight tying)
                elif params[i].ndim == 2 and params[j].ndim == 2 and \
                     params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
                    grads[i] += grads[j].T
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)

                if find_flg: break
            if find_flg: break

        if not find_flg: break

    return params, grads

def clip_grads(grads, max_norm):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for grad in grads:
            grad *= rate

class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.loss_list = []
        self.eval_interval = None
        self.current_epoch = 0

    def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
        data_size = len(x)
        max_iters = data_size // batch_size
        self.eval_interval = eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0
        loss_count = 0

        start_time = time.time()
        for epoch in range(max_epoch):
            # Shuffle the data
            idx = np.random.permutation(np.arange(data_size))
            x = x[idx]
            t = t[idx]

            for iters in range(max_iters):
                batch_x = x[iters*batch_size:(iters+1)*batch_size]
                batch_t = t[iters*batch_size:(iters+1)*batch_size]

                # Compute the gradients and update the parameters
                loss = model.forward(batch_x, batch_t)
                model.backward()
                params, grads = remove_duplicate(model.params, model.grads)  # merge shared weights into one
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1

                # Evaluate
                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    avg_loss = total_loss / loss_count
                    elapsed_time = time.time() - start_time
                    print('| epoch %d |  iter %d / %d | time %d[s] | loss %.2f'
                          % (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, avg_loss))
                    self.loss_list.append(float(avg_loss))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1

    def plot(self, ylim=None):
        x = np.arange(len(self.loss_list))
        if ylim is not None:
            plt.ylim(*ylim)
        plt.plot(x, self.loss_list, label='train')
        plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
        plt.ylabel('loss')
        plt.show()

In [None]:
class Adam:
    '''
    Adam (http://arxiv.org/abs/1412.6980v8)
    '''
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = [], []
            for param in params:
                self.m.append(np.zeros_like(param))
                self.v.append(np.zeros_like(param))

        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for i in range(len(params)):
            self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i])
            self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i])

            params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)

if not use_pretrained:
    corpus, word_to_id, id_to_word = load_ptb_data('train') if data_source == "PTB" else postprocess(custom_data, split_policy)

    vocab_size = len(word_to_id)

    contexts, target = create_contexts_target(corpus, window_size=1)

    target = convert_one_hot(target, vocab_size)

    contexts = convert_one_hot(contexts , vocab_size)

    model = SimpleCBOW(vocab_size, hidden_size)
    optimizer = Adam()
    trainer = Trainer(model, optimizer)

    trainer.fit(contexts, target, max_epoch, batch_size)
    trainer.plot()

    predict_based_embedding = (model.word_vecs, word_to_id, id_to_word)


    with open(dataset_dir + "/predict_based_embedding.pkl", 'wb') as f:
        pickle.dump(predict_based_embedding, f)
else:
    with open(dataset_dir + "/predict_based_embedding.pkl", 'rb') as f:
        (word_vecs, word_to_id, id_to_word) = pickle.load(f)
        print("Embedding loaded")

# Data Process


In [None]:
#@title Text Process Config
embedding_method = "word2vec" #@param ["word2vec", "Count-based"]
use_cached_embedding = True #@param {type:"boolean"}


## Normalization

Keep in mind that real-world data is not as clean as we have assumed so far. Messages usually:

* Differ in length
* Contain characters that are not letters, such as punctuation and symbols
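Both issues can be handled with a little preprocessing. The sketch below lowercases the text, replaces every character that is not an ASCII letter, digit, or whitespace with a space, and then pads or truncates the token list to a fixed length. The function names, the ASCII-only pattern, and the `<pad>` token are assumptions for this illustration; the next cell instead deals with varying length by averaging word vectors.

```python
import re

def normalize_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, and split into tokens."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return cleaned.split()

def pad_or_truncate(tokens, length, pad_token="<pad>"):
    """Return exactly `length` tokens: cut long inputs, pad short ones."""
    return (tokens + [pad_token] * length)[:length]

tokens = normalize_text("Hello,   World!! 123")
print(tokens)                      # ['hello', 'world', '123']
print(pad_or_truncate(tokens, 5))  # ['hello', 'world', '123', '<pad>', '<pad>']
print(pad_or_truncate(tokens, 2))  # ['hello', 'world']
```

Note that the pattern above would also strip non-Latin text such as Chinese, so it needs to be adapted for other languages.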

In [None]:
if use_cached_embedding:
    if embedding_method == "Count-based":
        with open(dataset_dir + "/count_based_embedding.pkl", 'rb') as f:
            if f: (word_vecs, word_to_id, id_to_word) = pickle.load(f)
    elif embedding_method == "word2vec":
        with open(dataset_dir + "/predict_based_embedding.pkl", 'rb') as f:
            if f: (word_vecs, word_to_id, id_to_word) = pickle.load(f)


def sentence_to_vector(sentence, model=None):
    # word_vecs has one row per vocabulary word; shape[1] is the embedding dimension
    sentence_vector = np.zeros(word_vecs.shape[1])

    num_words = 0

    for word in sentence.lower().split(" "):
        if word in word_to_id:
            sentence_vector += word_vecs[word_to_id[word]]

        # Words missing from the vocabulary are still counted, so they act as zero vectors
        num_words += 1

    # If the sentence is not empty, divide the sum by the number of words to get the average
    if num_words > 0:
        sentence_vector /= num_words

    return sentence_vector

print(sentence_to_vector("Friend New Friend"))

[-0.75908351  0.75383226  0.86338345 -0.88820116  1.05683732]


# Trainer: Spam Message Filter

If you want to know how ne

In [None]:
#@title Trainer Config


## Dataset

The dataset is mainly from "".

In [None]:
def read_dataset(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()

    labels = []
    vectors = []

    for line in lines:
        parts = line.strip().split(None, 1)

        if len(parts) < 2:
            continue

        label, message = parts

        # Convert the label to binary (1 for spam, 0 for ham)
        label = 1 if label == 'spam' else 0

        # Convert the message to a vector using the sentence_to_vector function
        vector = sentence_to_vector(message)

        # Add the label and vector to the lists
        labels.append(label)
        vectors.append(vector)

    return labels, vectors

labels, vectors = read_dataset('/content/SMSSpamCollection.txt')

labels = np.array(labels)
embedded_messages = np.array(vectors)

## Basic Classifier

To learn more about each layer you can read this [notebook](https://colab.research.google.com/drive/18B-Fujnr7uDhfyERZzWHTI3-31anw5OH?usp=sharing) first.

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(embedded_messages, labels, test_size=0.2)

X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(X_train.shape[1],), kernel_regularizer=tf.keras.regularizers.l2(0.001)), # assuming input vectors are 1D
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer for binary classification
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test))

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss: {loss}, Accuracy: {Fore.BLUE}{accuracy}{Fore.RESET}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Loss: 0.2496841549873352, Accuracy: 0.9466192126274109


In [None]:
def predict_spam(model, sentence):
    vector = sentence_to_vector(sentence)
    vector = np.expand_dims(vector, axis=0)
    prediction = model.predict(vector, verbose=0)
    label = 1 if prediction > 0.5 else 0
    return label

sentences = [
    "Please call our customer service to get a free coupon",
    "What's your plan of this Saturaday?",
    "I love you Cathy",
    "For only 6 Rewards points, scroll through your social feed worry-free with 500MB data, valid for 1 day!",
    "[Leetcode] 42094 is You login code. Do not share it with anyone.",
    "Long time no see my old friend.",
    "Make everyone green with envy with our new collection. Shop now at www.fakeshoppingsite.com",
    "Your OKX verification code is: 443287. This code will expire in 10 minutes. Don't share this code with anyone; our employees will never ask for the code.",
    "Hurry up! Limited time offer at www.fakeofferwebsite.com"
]

for s in sentences:
    prediction = predict_spam(model, s)
    print(f"{Fore.RED + 'SPAM  ' + Fore.RESET if prediction == 1 else  Fore.GREEN +  'HAM   ' + Fore.RESET}{Fore.YELLOW + s + Fore.RESET}")

SPAM  Please call our customer service to get a free coupon
HAM   What's your plan of this Saturaday?
HAM   I love you Cathy
SPAM  For only 6 Rewards points, scroll through your social feed worry-free with 500MB data, valid for 1 day!
HAM   [Leetcode] 42094 is You login code. Do not share it with anyone.
HAM   Long time no see my old friend.
HAM   Make everyone green with envy with our new collection. Shop now at www.fakeshoppingsite.com
HAM   Your OKX verification code is: 443287. This code will expire in 10 minutes. Don't share this code with anyone; our employees will never ask for the code.
HAM   Hurry up! Limited time offer at www.fakeofferwebsite.com


# Trainer: Text Generator
