# Fudan PRML22 Spring Final Project

*Your name and Student ID: [Name], [Student ID]*

*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet, and a .pdf report file) with your assignment submission.*

**Congratulations, you have come to the last challenge!**

Having finished the past two assignments, we think all you gugs already have a solid foundation in the field of machine learning and deep learning. And now you are qualified to apply machine learning algorithms to the real-world tasks you are interested in, or start your machine learning research. 

**In this final project, you are free to choose a topic you are passionate about. The project can be an application one, a theoretical one or implementing your own amazing machine learning/deep learning framework like a toy pytorch. If you don't have any idea, we will also provide you with a default one you can play with.** 

**! Notice: If you want to work on your own idea, you have to email the TA (lip21[at]m.fudan.edu.cn) to give a simple project proposal first before May 22, 2022.** 

## Default Project: Natural Language Inference

![Sherlock](./img/inference.jpg)

The default final project this semester is a NLP task called "Natural Language Inference". Though deep neural networks have demonstrated astonishing performance in many tasks like text classification and generation, you might somehow think they are just "advanced statistics" but far from *intelligent* machines. One intelligent machine must be able to reason, you may think. And in this default final project, your aim is to design a machine which can conduct inference. The machine can know that "A man inspects the uniform of a figure in some East Asian country" is contradictory to "The man is sleeping", and "a soccer game with multiple males playing." entails "some men are playing a sport".

The dataset we use this time is the Original Chinese Natural Language Inference (OCNLI) dataset[1]. It is a chinese NLI dataset with about 50k training data and 3k development data. The sentence pairs in the dataset are labeled as "entailment", "neutral" and "contradiction". Due to they release the test data without its labels, we select 5k data pairs from the training data as labeled test data, and the other 45k data as your t. You can visit the [GitHub link](https://github.com/CLUEbenchmark/OCNLI) for more information.

After you finished the NLI task with the full 50k training set, you have to complete an advanced challenge. You have to select **at most 5k data** from the training set as labeled training set, leaving the other training data as unlabeled training set, then use these labeled and unlabeled data to finish the same NLI task. You can randomly choosing the 5k training data but can also think up some ideas to select more **important data** as labeled training data. Like assignment1, you may have to think how to use the unlabeled training data.

You can use the deep learning frameworks like paddle, pytorch, tensorflow in your experiment but not more high-level libraries like Huggingface. Please write down the version of them in the './requirements.txt' file.

**! Notice: You CAN NOT use any other people's pretrained model like 'bert-base-chinese' in this default project. You are encouraged to design your own model and algorithm, no matter it looks naive or not.**

NLI is a traditional but promising NLP task, and you can search the Google/Bing for more information. Some key words can be "natural language inference with attention", "training data selection", "semi-surpervised learning", "unsupervised representation learning" and so on.

## 1. Setup

import the libraries and load the dataset here.

In [255]:
# setup code
import json

%load_ext autoreload
%autoreload 2
%matplotlib inline

from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import jieba
import time
import math

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


True

In [256]:
dataset_path = './dataset'

train_data_file = dataset_path + '/train.json'
dev_data_file = dataset_path + '/dev.json'

In [257]:
def read_ocnli_file(data_file):
    # read the ocnli file. feel free to change it. 
    print ("loading data from ", data_file)
    
    text_outputs = []
    label_outputs = []
    
    label_to_idx = {"entailment": 0, "neutral": 1, "contradiction": 2}
    
    with open(data_file, 'r', encoding="utf-8") as f:
        line = f.readline()
        while line:
            line = json.loads(line.strip())
            text_a, text_b, label = line['sentence1'], line['sentence2'],line['label']
            label_id = label_to_idx[label.strip()]
            
            text_outputs.append((text_a,text_b))
            label_outputs.append(label_id)

            line = f.readline()
                
    print ("there are ", len(label_outputs), "sentence pairs in this file.")
    return text_outputs, label_outputs


training_data, training_labels = read_ocnli_file(train_data_file)
dev_data, dev_labels = read_ocnli_file(dev_data_file)

stop_words=[]

with open("stop_words.txt",'r',encoding="utf-8") as f:
    for line in f.readlines():
        line = line.strip('\n')
        stop_words.append(str(line))

loading data from  ./dataset/train.json
there are  45437 sentence pairs in this file.
loading data from  ./dataset/dev.json
there are  2950 sentence pairs in this file.


In [258]:
print ("training data samples: ", training_data[:5])
print ("training labels samples: ", training_labels[:5])

training data samples:  [('对,对,对,对,对,具体的答复.', '要的是抽象的答复'), ('当前国际形势仍处于复杂而深刻的变动之中', '一个月后将发生世界战争'), ('在全县率先推行宅基地有偿使用,全乡20年无须再扩大宅基地', '宅基地有偿使用获得较好成果,将在更大范围实施。'), ('上海马路上的喧声也是老调子', '上海有很多条马路'), ('那你看看第二封信什么时候到吧.', '第一封信已经收到了。')]
training labels samples:  [2, 1, 1, 1, 1]


## 2. Exploratory Data Analysis (5 points)

Your may have to explore the dataset and do some analysis first.

We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Lang`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
``word2count`` which will be used to replace rare words later.

In [259]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence:
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

In [260]:
training_premise = []
training_hypothesis = []
test_premise = []
test_hypothesis = []

def is_chinese(uchar):
    if uchar >= u'\u4e00' and uchar <= u'\u9fa5':
        return True
    else:
        return False

def format_str(content):
    content_str = ''
    for i in content:
        if is_chinese(i):
            content_str = content_str + i
    return content_str

for pair in training_data:
    training_premise.append(format_str(pair[0]))
    training_hypothesis.append(format_str(pair[1]))

for pair in dev_data:
    test_premise.append(format_str(pair[0]))
    test_hypothesis.append(format_str(pair[1]))

print(training_premise[:5])
print(len(training_premise))



['对对对对对具体的答复', '当前国际形势仍处于复杂而深刻的变动之中', '在全县率先推行宅基地有偿使用全乡年无须再扩大宅基地', '上海马路上的喧声也是老调子', '那你看看第二封信什么时候到吧']
45437


In [261]:
randnum = random.randint(0,100)
# randnum = 63
print(randnum)
random.seed(randnum)
random.shuffle(training_premise)
random.seed(randnum)
random.shuffle(training_hypothesis)
random.seed(randnum)
random.shuffle(training_labels)

21


split words

In [262]:
def split_words(datas):
    cut_words = map(lambda s: list(jieba.cut(s)), datas)
    return list(cut_words)

training_premise = split_words(training_premise)
training_hypothesis = split_words(training_hypothesis)
test_premise = split_words(test_premise)
test_hypothesis = split_words(test_hypothesis)

print(training_premise[:5])

def drop_stopwords(contents, stopwords):
    contents_clean = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
        contents_clean.append(line_clean)
    return contents_clean

training_premise = drop_stopwords(training_premise,stop_words)
training_hypothesis = drop_stopwords(training_hypothesis,stop_words)

test_premise = drop_stopwords(test_premise,stop_words)
test_hypothesis = drop_stopwords(test_hypothesis,stop_words)

print(training_premise[:5])
print(training_hypothesis[:5])


[['我', '猜', '你', '在', '在', '这边儿', '就是', '帮', '他', '这个'], ['深化', '城镇', '住房', '制度', '改革', '满足', '居民', '多层次', '住房', '需求', '努力实现', '住', '有所', '居', '的', '目标'], ['呃', '没有', '问题', '我', '没有', '问题'], ['坐', '着', '无话', '蒋丽莉', '便', '起身', '到', '角落', '弹钢琴', '东一句西', '一句', '琴声', '淙淙', '毕竟', '是', '一点', '鼓舞', '也', '是', '一点', '推动'], ['呃', '现在', '正在', '办', '签证', '哎']]
[['我', '猜', '你', '在', '在', '这边儿', '就是', '帮', '他', '这个'], ['深化', '城镇', '住房', '制度', '改革', '满足', '居民', '多层次', '住房', '需求', '努力实现', '住', '有所', '居', '目标'], ['呃', '没有', '问题', '我', '没有', '问题'], ['坐', '着', '无话', '蒋丽莉', '便', '起身', '到', '角落', '弹钢琴', '东一句西', '一句', '琴声', '淙淙', '毕竟', '是', '一点', '鼓舞', '也', '是', '一点', '推动'], ['呃', '现在', '正在', '办', '签证']]
[['我', '不', '确定', '你', '在', '这里', '干什么', '只能', '是', '猜测'], ['居民', '没有', '住房', '需求'], ['我', '处于', '巨大', '麻烦', '中'], ['蒋丽莉', '所弹', '钢琴', '已经', '有', '多年', '历史'], ['这个', '人', '不', '在', '对', '办理', '签证', '工作人员', '说话']]


In [263]:
Max = 0
Sum = 0
count = 0
for sentence in training_premise:
    count += 1
    Max = max(Max,len(sentence))
    Sum += len(sentence)
print("Max",Max)
print("Avg",Sum/count)
print(len(training_premise))

Max 29
Avg 11.471025815964962
45437


filter the train_set

In [264]:
del_list_train = []
del_list_test = []


MAX_LENGTH = 15
for i in range(len(training_premise)):
    if len(training_premise[i]) > MAX_LENGTH or len(training_hypothesis[i]) > MAX_LENGTH :
        del_list_train.append(i)

for idx in sorted(del_list_train, reverse = True):
    del training_premise[idx]
    del training_hypothesis[idx]
    del training_labels[idx]

for i in range(len(test_premise)):
    if len(test_premise[i]) > MAX_LENGTH or len(test_hypothesis[i]) > MAX_LENGTH :
        del_list_test.append(i)

for idx in sorted(del_list_test, reverse = True):
    del test_premise[idx]
    del test_hypothesis[idx]
    del dev_labels[idx]

In [265]:
Max = 0
Sum = 0
count = 0
for sentence in training_premise:
    count += 1
    Max = max(Max,len(sentence))
    Sum += len(sentence)
print("Max",Max)
print("Avg",Sum/count)
print(len(training_premise))

Max 15
Avg 8.94646326962675
34481


In [266]:
Chinese = Lang("CN")
for sentence in training_premise:
    Chinese.addSentence(sentence)
for sentence in training_hypothesis:
    Chinese.addSentence(sentence)
for sentence in test_premise:
    Chinese.addSentence(sentence)
for sentence in test_hypothesis:
    Chinese.addSentence(sentence)

print(Chinese.name, Chinese.n_words)

premise_vec = []
hypothesis_vec = []

test_premise_vec = []
test_hypothesis_vec = []

for sentence in training_premise:
    word_list = []
    for word in sentence:
        word_list.append(Chinese.word2index[word])
    premise_vec.append(word_list)

for sentence in training_hypothesis:
    word_list = []
    for word in sentence:
        word_list.append(Chinese.word2index[word])
    hypothesis_vec.append(word_list)

for sentence in test_premise:
    word_list = []
    for word in sentence:
        word_list.append(Chinese.word2index[word])
    test_premise_vec.append(word_list)

for sentence in test_hypothesis:
    word_list = []
    for word in sentence:
        word_list.append(Chinese.word2index[word])
    test_hypothesis_vec.append(word_list)

CN 23572


## 3. Methodology (50 points)

The encoder of this network is a RNN 

In [267]:
class Bi_Lstm(nn.Module):
    def __init__(self):
        super(Bi_Lstm,self).__init__() 
        self.embeding = nn.Embedding(Chinese.n_words+1,100)
        self.lstm = nn.LSTM(input_size = 100, hidden_size = 128,num_layers = 1,bidirectional = True,batch_first=True,dropout=0.5)#加了双向，输出的节点数翻2倍
        self.l1 = nn.BatchNorm1d(128)
        self.l2 = nn.ReLU()
        self.l3 = nn.Linear(256,3)#特征输入
        self.l4 = nn.Dropout(0.3)
        self.l5 = nn.BatchNorm1d(3)
    def attention_net(self, lstm_output, final_state):
        batch_size = len(lstm_output)
        hidden = final_state.view(batch_size, -1, 1)   # hidden : [batch_size, n_hidden * num_directions(=2), n_layer(=1)]
        attn_weights = torch.bmm(lstm_output, hidden).squeeze(2) # attn_weights : [batch_size, n_step]
        soft_attn_weights = F.softmax(attn_weights, 1)
        # context : [batch_size, n_hidden * num_directions(=2)]
        context = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
        return context, soft_attn_weights 
    def forward(self, x):
        x = self.embeding(x)
        #out,_ = self.lstm(x)
        out, (final_hidden_state, final_cell_state) = self.lstm(x)
        #选择最后一个时间点的output
        '''
        out = self.l1(out[:,-1,:])
        out = self.l2(out)
        out = self.l3(out)
        out = self.l4(out)
        out = self.l5(out)
        '''
        #output = out.transpose(0, 1) # output : [batch_size, seq_len, n_hidden]
        attn_output, attention = self.attention_net(out, final_hidden_state)
        #return self.out(attn_output), attention # model : [batch_size, num_classes]
        out = self.l3(attn_output)
        out = self.l4(out)
        out = self.l5(out)
        return out




train the model

In [268]:
def train(tensor, bilstm, optimizer, criterion,result, max_length=MAX_LENGTH):

    optimizer.zero_grad()


    loss = 0

    output=bilstm(tensor)


    loss+=criterion(output,result) 

    loss.backward()

    optimizer.step()

    return loss.item() 

In [269]:
data_length = len(premise_vec)

def trainIters(bilstm, n_iters,epoch_num=10, batch_size=64):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    

    criterion = nn.CrossEntropyLoss()

    result = torch.zeros(data_length,device=device).long()
    for i in range(data_length):
        result[i]=training_labels[i]

    batch_input = torch.zeros(batch_size,2*MAX_LENGTH,device=device).long()


    sum = 0
    count = 0
    for i in range(epoch_num):

        index = 0

        while index + batch_size < n_iters:

            for batch in range(batch_size):
                input_tensor = premise_vec[index+batch]
                target_tensor = hypothesis_vec[index+batch]
                # print(input_tensor)
                for iter in range(len(input_tensor)):
                    batch_input[batch][iter] = input_tensor[iter]
                for iter in range(len(target_tensor)):
                    batch_input[batch][iter+MAX_LENGTH] = target_tensor[iter]


            loss = train(batch_input, bilstm,optimizer, criterion,result[index:index+batch_size])
            print_loss_total += loss
            plot_loss_total += loss
            sum += loss
            count += 1
            progress = ((i*n_iters+index)/(epoch_num*n_iters))*100
            print("progress : ",round(progress,2),"%"," loss = ",loss," / ",sum/count)

            index += batch_size


    # showPlot(plot_losses)

Build and evaluate your model here.

In [270]:
bilstm = Bi_Lstm().to(device)
bilstm.train()

optimizer = torch.optim.SGD(bilstm.parameters(),lr=0.01)

trainIters(bilstm, 30000,  epoch_num=30,batch_size=256)



progress :  0.0 %  loss =  1.4536116123199463  /  1.4536116123199463
progress :  0.03 %  loss =  1.4090914726257324  /  1.4313515424728394
progress :  0.06 %  loss =  1.363506555557251  /  1.4087365468343098
progress :  0.09 %  loss =  1.3809815645217896  /  1.4017978012561798
progress :  0.11 %  loss =  1.3501108884811401  /  1.3914604187011719
progress :  0.14 %  loss =  1.4175978899002075  /  1.3958166639010112
progress :  0.17 %  loss =  1.471761703491211  /  1.4066659552710397
progress :  0.2 %  loss =  1.403518557548523  /  1.406272530555725
progress :  0.23 %  loss =  1.4863159656524658  /  1.415166245566474
progress :  0.26 %  loss =  1.2890655994415283  /  1.4025561809539795
progress :  0.28 %  loss =  1.2782249450683594  /  1.391253341328014
progress :  0.31 %  loss =  1.2473655939102173  /  1.3792626957098644
progress :  0.34 %  loss =  1.314460277557373  /  1.3742778943135188
progress :  0.37 %  loss =  1.3135640621185303  /  1.3699411920138769
progress :  0.4 %  loss =  1.

In [271]:
def accuracy(data_premise,data_hypothesis,label):
    batch_size = 32
    batch_input = torch.zeros(batch_size,2*MAX_LENGTH,device=device).long()
    index = 0

    sum = 0

    while index + batch_size < len(data_premise):

        for batch in range(batch_size):
            input_tensor = data_premise[index+batch]
            target_tensor = data_hypothesis[index+batch]
            # print(input_tensor)
            for iter in range(len(input_tensor)):
                batch_input[batch][iter] = input_tensor[iter]
            for iter in range(len(target_tensor)):
                batch_input[batch][iter+MAX_LENGTH] = target_tensor[iter]

        output = bilstm(batch_input)
        for j in range(batch_size):
            if output[j].argmax()==label[index+j]: 
                sum+=1

        index += batch_size
        
    return sum/len(data_premise)


print(accuracy(premise_vec,hypothesis_vec,training_labels))
print(accuracy(test_premise_vec,test_hypothesis_vec,dev_labels))



0.3730460253472927
0.35587975243147657


## 4. Attention Visualization (10 points)

Visualize the attention matrix in your model here.

## 5. Model Attack (30 points)

Attack your model here.

## 6. Conclusion (5 points)

Write down your conclusion here.

## Reference

[1] OCNLI: Original Chinese Natural Language Inference, arxiv: https://arxiv.org/abs/2010.05444