# 垃圾短信拦截器

使用[垃圾短信数据集](https://archive-beta.ics.uci.edu/ml/datasets/sentiment+labelled+sentences)
- 通过spacy获取短信向量表达
- 建立多层感知机进行分类



In [1]:
! pip config set global.index-url http://pypi.douban.com/simple
! pip config set install.trusted-host pypi.douban.com

! pip install spacy

Writing to /home/jovyan/.config/pip/pip.conf
Writing to /home/jovyan/.config/pip/pip.conf
Looking in indexes: http://pypi.douban.com/simple


In [2]:
! spacy download en_core_web_sm

2022-12-09 13:13:58.206534: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-09 13:13:58.207253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-09 13:13:58.207607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
^C

Aborted!


## 数据准备

首先下载数据集并解压

In [1]:
! [ -e smsspamcollection.zip ] || \
    wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
! [ -e "SMSSpamCollection" ] || \
    unzip smsspamcollection.zip -d .
! head SMSSpamCollection

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat me like aids patent.
ham	As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam	WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
spam	H

每行数据的格式为：
```
<label>\t<sms>
```
使用`pandas`读取数据

In [14]:
import pandas as pd

df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "sms"], header=None)
df

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


随机打乱数据集并获取一些基本信息

In [15]:
# prepare dataset
data = df.sample(frac=1)  # sample 100% => shuffle
n = 2  # number of classes
m = len(data)  # number of samples
m_train = int(m * 0.8)  # number of training samples

加载spacy模型备用

In [16]:
import spacy

nlp = spacy.load("en_core_web_sm")

获取训练集和测试集（需使用上面的`nlp`）

In [28]:
# Your code here to prepare train_data, train_targes, test_data, test_targets
# map labels to indexes, and sms to vectors
data["index"] = data["label"].map(lambda x: 1 if x == "spam" else 0)
data["vector"] = data["sms"].map(lambda x: nlp(x).vector)

# keep the first `m_train` as train data and the rest for test
train_data = data[:m_train]["vector"].to_numpy()
train_targets = data[:m_train]["index"].to_numpy()
test_data = data[m_train:]["vector"].to_numpy()
test_targets = data[m_train:]["index"].to_numpy()

data

Unnamed: 0,label,sms,index,vector
3084,ham,K..k:)how about your training process?,0,"[0.22786477, -0.48382202, 0.18847908, -0.31621..."
4164,ham,I told that am coming on wednesday.,0,"[-0.055274762, 0.39225176, 0.37901777, -0.3056..."
1691,spam,Sunshine Quiz Wkly Q! Win a top Sony DVD playe...,1,"[0.35940382, 0.03139996, -0.012604602, -0.2474..."
5098,spam,TheMob>Hit the link to get a premium Pink Pant...,1,"[0.25135776, 0.06056308, -0.028496165, 0.05041..."
1542,ham,Do u konw waht is rael FRIENDSHIP Im gving yuo...,0,"[0.15112671, -0.15580621, 0.008872468, -0.2764..."
...,...,...,...,...
1174,ham,Ü dun need to pick ur gf?,0,"[0.15914485, -0.03889931, 0.69562125, -0.15143..."
1106,ham,on hen night. Going with a swing,0,"[-0.27193192, -0.05489895, 0.67185766, -0.0789..."
2011,ham,Dunno lei... I thk mum lazy to go out... I nev...,0,"[0.010996377, -0.02518764, 0.13231944, -0.2391..."
4678,ham,Wewa is 130. Iriver 255. All 128 mb.,0,"[-0.34038374, 0.043166257, -0.09735346, -0.915..."


In [29]:
print(f"train spam: {train_targets.sum()}/{len(train_targets)}")
print(f"test spam: {test_targets.sum()}/{len(test_targets)}")

train spam: 598/4457
test spam: 149/1115


转为pytorch的`DataLoader`

In [30]:
# config
batch_size = 8
# make dataloader
from torch.utils.data import DataLoader

train_loader = DataLoader(list(zip(train_data, train_targets)), batch_size=batch_size, shuffle=True, num_workers=4)
test_loader = DataLoader(list(zip(test_data, test_targets)), batch_size=batch_size, shuffle=False, num_workers=4)

## 定义模型

In [31]:
import torch
from torch import nn

# device = "cuda:0" if torch.cuda.is_available() else "cpu"
device = "cpu"

# Your code here to define a MLP or whatever network you want
net = nn.Sequential(
    nn.Linear(96, 100),
    nn.ReLU(),
    nn.Linear(100, 2)
).to(device)

## 定义训练函数和工具

In [32]:
# get optimizer
from torch import optim

# tune only the MLP's parameter
optimizer = optim.Adam(net.parameters())

# setup the loss function
loss_fn = nn.CrossEntropyLoss()

In [37]:
from tqdm import tqdm
import torch

def train(train_loader, net, optimizer, loss_fn):
    # your code here to implement a training epoch
    # 0. set model to train mode
    net.train()
    # 1. initialize a correct sample counter and a loss collection
    correct = 0
    loss_list = []
    # 2. traverse for each sample in `train_loader`
    for inputs, labels in train_loader:
        # 0. move inputs and labels to device
        inputs.to(device)
        labels.to(device)
        # 1. zero the parameter gradients
        optimizer.zero_grad()
        # 2. get output with forward pass
        outputs = net(inputs)
        # 3. get loss value with loss_fn
        loss = loss_fn(outputs, labels)
        # 4. backward and take an optimize step
        loss.backward()
        optimizer.step()
        # 5. collect the correct sample and the loss
        pred = outputs.argmax(1)
        correct += (pred == labels).detach().cpu().numpy()
        loss_list.append(loss.detach().cpu().numpy())
    # 3. calculate the accuracy and the average loss
    accuracy = correct / len(train_loader)
    avg_loss = sum(loss_list) / len(loss_list)
    # 4. return the accuracy and the loss
    return accuracy, avg_loss

def test(test_loader, net, loss_fn):
    # your code here to implement a testing epoch
    # 0. set model to evaluation mode
    net.eval()
    # 1. initialize a correct sample counter and a loss collection
    correct = 0
    loss_list = []
    # 2. turn off gradient collection
    with torch.no_grad():
        # 3. traverse for each sample in `test_loader`
        for inputs, labels in test_loader:
            # 0. move inputs and labels to device
            inputs.to(device)
            labels.to(device)
            # 1. get output with forward pass
            outputs = net(inputs)
            # 2. get loss value with loss_fn
            loss = loss_fn(outputs, labels)
            # 3. collect the correct sample and the loss
            pred = outputs.argmax(1)
            correct += (pred == labels).detach().cpu().numpy()
            loss_list.append(loss.detach().cpu().numpy())
    # 4. calculate the accuracy and the average loss
    accuracy = correct / len(train_loader)
    avg_loss = sum(loss_list) / len(loss_list)
    # 5. return the accuracy and the loss 
    return accuracy, avg_loss

## 训练模型

In [38]:
accuracy_train_list = []
loss_train_list = []
accuracy_test_list = []
loss_test_list = []

for epoch in range(20):
    # your code here to train and test the model
    accuracy_train, loss_train = train(train_loader, net, optimizer, loss_fn)
    accuracy_train_list.append(accuracy_train)
    loss_train_list.append(loss_train)
    
    accuracy_test, loss_test = test(train_loader, net, loss_fn)
    accuracy_test_list.append(accuracy_test)
    loss_test_list.append(loss_test)


In [39]:
print(f"train_acc = {accuracy_train_list}, test_acc = {accuracy_test_list}")

train_acc = [array([0.98028674, 0.97849462, 0.97132616, 0.96594982, 0.96236559,
       0.96774194, 0.98207885, 0.97311828]), array([0.98387097, 0.97849462, 0.98387097, 0.97311828, 0.98387097,
       0.96774194, 0.98207885, 0.97491039]), array([0.97849462, 0.98028674, 0.98387097, 0.97849462, 0.98207885,
       0.97849462, 0.96774194, 0.98924731]), array([0.98566308, 0.98387097, 0.98387097, 0.98207885, 0.97849462,
       0.98387097, 0.9874552 , 0.99103943]), array([0.99103943, 0.9874552 , 0.98566308, 0.97491039, 0.9874552 ,
       0.99462366, 0.9874552 , 0.99103943]), array([0.9874552 , 0.99641577, 0.99283154, 0.9874552 , 0.98924731,
       0.98566308, 0.99103943, 0.99462366]), array([0.99103943, 0.99283154, 0.98924731, 0.9874552 , 0.99283154,
       0.98924731, 0.98566308, 0.98387097]), array([0.99283154, 0.99641577, 0.98924731, 0.99103943, 0.99641577,
       0.99462366, 0.98924731, 0.99283154]), array([1.        , 1.        , 0.98387097, 0.99641577, 0.99462366,
       0.99103943, 0.991