# Fudan PRML22 Spring Final Project

*Your name and Student ID: [Name], [Student ID]*

*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet, and a .pdf report file) with your assignment submission.*

**Congratulations, you have come to the last challenge!**

Having finished the past two assignments, you should all have a solid foundation in machine learning and deep learning. You are now qualified to apply machine learning algorithms to the real-world tasks that interest you, or to start your own machine learning research.

**In this final project, you are free to choose a topic you are passionate about. The project can be an application, a theoretical study, or an implementation of your own machine learning/deep learning framework, like a toy PyTorch. If you don't have an idea, we also provide a default project you can work on.**

**! Notice: If you want to work on your own idea, you must first email the TA (lip21[at]m.fudan.edu.cn) a brief project proposal before May 22, 2022.**

## Default Project: Natural Language Inference

![Sherlock](./img/inference.jpg)

The default final project this semester is an NLP task called "Natural Language Inference". Although deep neural networks have demonstrated astonishing performance in many tasks such as text classification and generation, you might still think they are just "advanced statistics", far from *intelligent* machines. An intelligent machine, you may think, must be able to reason. In this default final project, your aim is to design a machine that can perform inference. For example, the machine should recognize that "A man inspects the uniform of a figure in some East Asian country" contradicts "The man is sleeping", and that "a soccer game with multiple males playing." entails "some men are playing a sport".

The dataset we use this time is the Original Chinese Natural Language Inference (OCNLI) dataset[1]. It is a Chinese NLI dataset with about 50k training pairs and 3k development pairs. Each sentence pair is labeled "entailment", "neutral", or "contradiction". Because the test data is released without labels, we select 5k pairs from the training data as a labeled test set and leave the other 45k pairs as your training set. You can visit the [GitHub link](https://github.com/CLUEbenchmark/OCNLI) for more information.

After you finish the NLI task with the full training set, you have to complete an advanced challenge. Select **at most 5k pairs** from the training set as the labeled training set and treat the rest as unlabeled training data, then use both to solve the same NLI task. You can choose the 5k pairs at random, but you can also devise a way to select more **important data** as labeled training data. As in Assignment 1, you may have to think about how to use the unlabeled training data.
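The random baseline for the advanced challenge can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `split_labeled_unlabeled`, the `seed` parameter, and the toy data are assumptions made for the example, and a real run would pass the loaded training pairs and labels instead.

```python
import random


def split_labeled_unlabeled(pairs, labels, n_labeled=5000, seed=0):
    """Randomly keep labels for at most n_labeled pairs and hide the rest."""
    if len(pairs) != len(labels):
        raise ValueError("pairs and labels must have the same length")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    labeled_idx = sorted(indices[:n_labeled])
    unlabeled_idx = sorted(indices[n_labeled:])
    labeled = [(pairs[i], labels[i]) for i in labeled_idx]
    unlabeled = [pairs[i] for i in unlabeled_idx]  # labels deliberately dropped
    return labeled, unlabeled


# Toy data (made up); it only demonstrates the shapes involved.
toy_pairs = [("premise %d" % i, "hypothesis %d" % i) for i in range(10)]
toy_labels = [i % 3 for i in range(10)]
labeled, unlabeled = split_labeled_unlabeled(toy_pairs, toy_labels, n_labeled=4)
print(len(labeled), len(unlabeled))  # 4 6
```

A selection strategy that targets "important" pairs would replace the shuffle with a ranking of the indices, while keeping the same return shape.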

You can use deep learning frameworks such as Paddle, PyTorch, or TensorFlow in your experiments, but not higher-level libraries such as Huggingface. Please record their versions in the './requirements.txt' file.

**! Notice: You CANNOT use anyone else's pretrained model, such as 'bert-base-chinese', in this default project. You are encouraged to design your own model and algorithm, no matter how naive it looks.**

NLI is a traditional but promising NLP task, and you can search Google/Bing for more information. Useful keywords include "natural language inference with attention", "training data selection", "semi-supervised learning", and "unsupervised representation learning".

## 1. Setup

Import the libraries and load the dataset here.

In [7]:
# setup code
import json

%load_ext autoreload
%autoreload 2
%matplotlib inline

from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import jieba
import time
import math

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


True

In [10]:
dataset_path = '../dataset'

train_data_file = dataset_path + '/train.json'
dev_data_file = dataset_path + '/dev.json'

In [12]:
def read_ocnli_file(data_file):
    # read the ocnli file. feel free to change it. 
    print ("loading data from ", data_file)
    
    text_outputs = []
    label_outputs = []
    
    label_to_idx = {"entailment": 0, "neutral": 1, "contradiction": 2}
    
    with open(data_file, 'r', encoding="utf-8") as f:
        line = f.readline()
        while line:
            line = json.loads(line.strip())
            text_a, text_b, label = line['sentence1'], line['sentence2'],line['label']
            label_id = label_to_idx[label.strip()]
            
            text_outputs.append((text_a,text_b))
            label_outputs.append(label_id)

            line = f.readline()
                
    print ("there are ", len(label_outputs), "sentence pairs in this file.")
    return text_outputs, label_outputs


training_data, training_labels = read_ocnli_file(train_data_file)
dev_data, dev_labels = read_ocnli_file(dev_data_file)

stop_words=[]

with open("../stop_words.txt",'r',encoding="utf-8") as f:
    for line in f.readlines():
        line = line.strip('\n')
        stop_words.append(str(line))

loading data from  ../dataset/train.json
there are  45437 sentence pairs in this file.
loading data from  ../dataset/dev.json
there are  2950 sentence pairs in this file.


In [13]:
print ("training data samples: ", training_data[:5])
print ("training labels samples: ", training_labels[:5])

training data samples:  [('对,对,对,对,对,具体的答复.', '要的是抽象的答复'), ('当前国际形势仍处于复杂而深刻的变动之中', '一个月后将发生世界战争'), ('在全县率先推行宅基地有偿使用,全乡20年无须再扩大宅基地', '宅基地有偿使用获得较好成果,将在更大范围实施。'), ('上海马路上的喧声也是老调子', '上海有很多条马路'), ('那你看看第二封信什么时候到吧.', '第一封信已经收到了。')]
training labels samples:  [2, 1, 1, 1, 1]


## 2. Exploratory Data Analysis (5 points)

You may have to explore the dataset and do some analysis first.

We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Lang`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
``word2count`` which will be used to replace rare words later.

In [14]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence:
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
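The `word2count` dictionary is described as the basis for replacing rare words, but this worksheet never performs that step. A minimal sketch of the idea follows; it uses `collections.Counter` instead of `Lang` so that it stays self-contained, and the token name `<UNK>` and the `min_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token for rare words


def replace_rare_words(sentences, min_count=2, unk=UNK):
    """Replace every word seen fewer than min_count times with unk."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] >= min_count else unk for word in sentence]
        for sentence in sentences
    ]


# Toy segmented sentences; only "上海" and "有" occur twice.
toy_sentences = [["上海", "有", "很多", "马路"], ["上海", "马路上", "有", "喧声"]]
cleaned = replace_rare_words(toy_sentences)
print(cleaned)
# [['上海', '有', '<UNK>', '<UNK>'], ['上海', '<UNK>', '有', '<UNK>']]
```

Applying such a step before building the vocabulary would shrink it and give unseen development-set words an index to map to.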

In [15]:
training_premise = []
training_hypothesis = []

def is_chinese(uchar):
    # True for characters in the CJK range U+4E00..U+9FA5
    return u'\u4e00' <= uchar <= u'\u9fa5'

def format_str(content):
    content_str = ''
    for i in content:
        if is_chinese(i):
            content_str = content_str + i
    return content_str

for pair in training_data:
    training_premise.append(format_str(pair[0]))
    training_hypothesis.append(format_str(pair[1]))

print(training_premise[:5])
print(len(training_premise))



['对对对对对具体的答复', '当前国际形势仍处于复杂而深刻的变动之中', '在全县率先推行宅基地有偿使用全乡年无须再扩大宅基地', '上海马路上的喧声也是老调子', '那你看看第二封信什么时候到吧']
45437


Segment each sentence into words with jieba, then remove stop words.

In [16]:
def split_words(datas):
    cut_words = map(lambda s: list(jieba.cut(s)), datas)
    return list(cut_words)

training_premise = split_words(training_premise)
training_hypothesis = split_words(training_hypothesis)

print(training_premise[:5])

def drop_stopwords(contents, stopwords):
    contents_clean = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
        contents_clean.append(line_clean)
    return contents_clean

training_premise = drop_stopwords(training_premise,stop_words)
training_hypothesis = drop_stopwords(training_hypothesis,stop_words)

print(training_premise[:5])
print(training_hypothesis[:5])


Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\use\AppData\Local\Temp\jieba.cache
Loading model cost 0.641 seconds.
Prefix dict has been built successfully.


[['对', '对', '对', '对', '对', '具体', '的', '答复'], ['当前', '国际形势', '仍', '处于', '复杂', '而', '深刻', '的', '变动', '之中'], ['在', '全县', '率先', '推行', '宅基地', '有偿', '使用', '全乡', '年', '无须再', '扩大', '宅基地'], ['上海', '马路上', '的', '喧声', '也', '是', '老调子'], ['那', '你', '看看', '第二', '封信', '什么', '时候', '到', '吧']]
[['对', '对', '对', '对', '对', '具体', '答复'], ['当前', '国际形势', '仍', '处于', '复杂', '而', '深刻', '变动', '之中'], ['在', '全县', '率先', '推行', '宅基地', '有偿', '使用', '全乡', '年', '无须再', '扩大', '宅基地'], ['上海', '马路上', '喧声', '也', '是', '老调子'], ['那', '你', '看看', '第二', '封信', '什么', '时候', '到']]
[['要', '是', '抽象', '答复'], ['一个月', '后', '将', '发生', '世界', '战争'], ['宅基地', '有偿', '使用', '获得', '较', '好', '成果', '将', '在', '更', '大', '范围', '实施'], ['上海', '有', '很多', '条', '马路'], ['第一', '封信', '已经', '收到']]


In [17]:
Max = 0
Sum = 0
count = 0
for sentence in training_premise:
    count += 1
    Max = max(Max,len(sentence))
    Sum += len(sentence)
print("Max",Max)
print("Avg",Sum/count)
print(len(training_premise))

Max 29
Avg 11.471025815964962
45437
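Label balance is another quick check that fits this section. The sketch below counts labels with `collections.Counter`, using the same id encoding as `label_to_idx` in `read_ocnli_file`; the helper name `label_distribution` is an assumption, and the toy input is the five sample labels printed earlier, not the full training set.

```python
from collections import Counter

# Inverse of label_to_idx from read_ocnli_file
IDX_TO_LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_distribution(labels):
    """Return {label name: (count, share)} for a list of label ids."""
    counts = Counter(labels)
    total = len(labels)
    distribution = {}
    for idx in sorted(IDX_TO_LABEL):
        count = counts.get(idx, 0)
        share = count / total if total else 0.0
        distribution[IDX_TO_LABEL[idx]] = (count, share)
    return distribution


sample_labels = [2, 1, 1, 1, 1]  # the five training labels shown above
for name, (count, share) in label_distribution(sample_labels).items():
    print(f"{name}: {count} ({share:.0%})")
# entailment: 0 (0%)
# neutral: 4 (80%)
# contradiction: 1 (20%)
```

Running the same function on `training_labels` would show whether the three classes are roughly balanced.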


Filter the training set: drop every pair whose premise or hypothesis is longer than `MAX_LENGTH` words.

In [19]:
del_list = []


MAX_LENGTH = 10
for i in range(len(training_premise)):
    if len(training_premise[i]) > MAX_LENGTH or len(training_hypothesis[i]) > MAX_LENGTH :
        del_list.append(i)

for idx in sorted(del_list, reverse = True):
    del training_premise[idx]
    del training_hypothesis[idx]
    del training_labels[idx]

In [20]:
Max = 0
Sum = 0
count = 0
for sentence in training_premise:
    count += 1
    Max = max(Max,len(sentence))
    Sum += len(sentence)
print("Max",Max)
print("Avg",Sum/count)
print(len(training_premise))

Max 10
Avg 7.008875999648475
22758
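The filter keeps 22758 of the 45437 pairs, so roughly half of the training data is discarded. One alternative, sketched here under assumptions, is to truncate long sentences and pad short ones to a fixed length. The helper `pad_or_truncate` is hypothetical, and padding with index 0 is an assumption that mirrors the zero-filled batch tensors used later in `trainIters` (index 0 is also `SOS_token`, so a dedicated padding index may be cleaner).

```python
PAD_INDEX = 0  # assumption: reuse index 0 for padding, as the zero-filled batches do


def pad_or_truncate(token_ids, max_length, pad_index=PAD_INDEX):
    """Clip a list of word indices to max_length and right-pad it with pad_index."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_index] * (max_length - len(clipped))


short_sentence = pad_or_truncate([5, 8, 13], 5)
long_sentence = pad_or_truncate(list(range(2, 14)), 5)
print(short_sentence)  # [5, 8, 13, 0, 0]
print(long_sentence)   # [2, 3, 4, 5, 6]
```

Truncation loses the tail of long sentences, so the trade-off against dropping whole pairs is worth checking on the development set.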


In [21]:
Chinese = Lang("CN")
for sentence in training_premise:
    Chinese.addSentence(sentence)
for sentence in training_hypothesis:
    Chinese.addSentence(sentence)

print(Chinese.name, Chinese.n_words)

premise_vec = []
hypothesis_vec = []

for sentence in training_premise:
    word_list = []
    for word in sentence:
        word_list.append(Chinese.word2index[word])
    premise_vec.append(word_list)

for sentence in training_hypothesis:
    word_list = []
    for word in sentence:
        word_list.append(Chinese.word2index[word])
    hypothesis_vec.append(word_list)


CN 16185


## 3. Methodology (50 points)

The encoder of this network is an RNN (a single-layer GRU).

In [22]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size,batch_first=True)

    def forward(self, input, hidden):
        batch_size = input.shape[0]
        embedded = self.embedding(input).view(batch_size, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self,batch_size):
        return torch.zeros(1, batch_size, self.hidden_size, device=device)

In [23]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size,batch_first=True)
        # self.out = nn.Linear(hidden_size, output_size)
        self.out = nn.Linear(hidden_size, 1)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input, hidden):
        batch_size = input.shape[0]
        output = self.embedding(input).view(batch_size, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        # output = self.softmax(self.out(output[0]))
        # The GRU is batch_first, so take the single time step of every batch element
        output = self.out(output[:, 0])
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [24]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size, batch_first=True)
        # self.out = nn.Linear(self.hidden_size, self.output_size)
        self.out = nn.Linear(self.hidden_size, 1)


    def forward(self, input, hidden, encoder_outputs):
        batch_size = input.shape[0]
        embedded = self.embedding(input).view(batch_size, 1, -1)
        embedded = self.dropout(embedded)

        attn_input = torch.cat((embedded, hidden.view(batch_size, 1, -1)), 2).view(batch_size, -1)
        attn_weights = F.softmax(
            self.attn(attn_input), dim=-1)

        # attn_applied = torch.bmm(attn_weights.unsqueeze(0),
        #                          encoder_outputs.unsqueeze(0))

        attn_applied = torch.bmm(attn_weights.view(batch_size,1,-1), encoder_outputs)

        # output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = torch.cat((embedded, attn_applied), 2)
        output = self.attn_combine(output)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        # output = F.log_softmax(self.out(output), dim=-1)
        output = self.out(output)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

In [25]:
class Soft(nn.Module):
    def __init__(self, input_size, output_size):
        super(Soft, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.l1 = nn.Linear(self.input_size,1000)
        self.l2 = nn.Linear(1000,500)
        self.l3 = nn.Linear(500,250)
        self.l4 = nn.Linear(250,100)
        self.l5 = nn.Linear(100,50)
        self.l6 = nn.Linear(50,10)
        self.softmax = nn.Linear(10,self.output_size)

    def forward(self, x):
        x = torch.sigmoid(x)
        x = self.l1(x)
        x = torch.sigmoid(x)
        x = self.l2(x)
        x = torch.sigmoid(x)
        x = self.l3(x)
        x = torch.sigmoid(x)
        x = self.l4(x)
        x = torch.sigmoid(x)
        x = self.l5(x)
        x = torch.sigmoid(x)
        x = self.l6(x)
        x = torch.sigmoid(x)
        x = self.softmax(x)
        return x

Train the model.

In [26]:
def train(input_tensor, target_tensor, encoder, decoder,soft, encoder_optimizer, decoder_optimizer,soft_optimizer, criterion,result, max_length=MAX_LENGTH):
    batch_size = input_tensor.size(0)
    input_length = input_tensor.size(1)
    target_length = target_tensor.size(1)
    encoder_hidden = encoder.initHidden(batch_size)

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    soft_optimizer.zero_grad()

    # print(input_tensor.shape)
    

    # encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
    encoder_outputs = torch.zeros([batch_size, max_length, encoder.hidden_size], device=device)
    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[:,ei], encoder_hidden)
        encoder_outputs[:,ei,:] = encoder_output[:, 0]

    # decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_input = torch.zeros([batch_size,1], device=device, dtype=torch.long)
    
    decoder_hidden = encoder_hidden

    soft_input = torch.zeros(batch_size, max_length, device=device)

    for di in range(target_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_outputs)

        decoder_output = decoder_output.view(batch_size,-1)
        for i in range(batch_size):
            soft_input[i,di] = decoder_output[i]

        # loss += criterion(decoder_output, target_tensor[di])
        decoder_input = target_tensor[:,di]  # Teacher forcing


    soft_output=soft(soft_input)

    for i in range(batch_size):
        
        soft_one=soft_output[i].squeeze()
        result_one = result[i].squeeze()
        loss+=criterion(soft_one,result_one) 

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()
    soft_optimizer.step()

    return loss.item() 

In [27]:
data_length = len(premise_vec)

def trainIters(encoder, decoder,soft, n_iters, print_every=1000, plot_every=100, learning_rate=0.01, epoch_num=10, batch_size=64):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder.train()
    decoder.train()
    soft.train()

    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    # Optimize the classifier's own parameters (this previously received decoder.parameters())
    soft_optimizer = optim.Adam(soft.parameters(), lr=learning_rate)

    criterion = nn.CrossEntropyLoss()

    result = torch.zeros(data_length,device=device).long()
    for i in range(data_length):
        result[i]=training_labels[i]

    batch_input = torch.zeros(batch_size,MAX_LENGTH,device=device).long()
    batch_target = torch.zeros(batch_size,MAX_LENGTH,device=device).long()

    count = 0
    loss_sum = 0  # running sum of batch losses (avoids shadowing the built-in sum)

    # Never read past the end of the filtered data, even if n_iters is larger
    n_samples = min(n_iters, data_length)

    for epoch in range(epoch_num):

        index = 0

        while index + batch_size <= n_samples:

            # Clear tokens left over from the previous batch so short sentences stay zero-padded
            batch_input.zero_()
            batch_target.zero_()

            for batch in range(batch_size):
                input_tensor = premise_vec[index+batch]
                target_tensor = hypothesis_vec[index+batch]
                for pos in range(len(input_tensor)):
                    batch_input[batch][pos] = input_tensor[pos]
                for pos in range(len(target_tensor)):
                    batch_target[batch][pos] = target_tensor[pos]


            loss = train(batch_input, batch_target, encoder,
                        decoder,soft, encoder_optimizer, decoder_optimizer,soft_optimizer, criterion,result[index:index+batch_size])
            print_loss_total += loss
            plot_loss_total += loss
            count += 1
            loss_sum += loss

            print(index," data: loss = ",loss_sum/count)

            index += batch_size


    # showPlot(plot_losses)

Build and evaluate your model here.
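Evaluation is left open in this worksheet. As a minimal sketch, assuming predictions and gold labels are lists of label ids in the `label_to_idx` encoding (0 entailment, 1 neutral, 2 contradiction), accuracy and a confusion matrix could be computed as follows. The function names are illustrative and the toy predictions are made up.

```python
def accuracy(predictions, gold):
    """Fraction of positions where the predicted label id equals the gold label id."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def confusion_matrix(predictions, gold, n_classes=3):
    """matrix[g][p] counts examples with gold label g that were predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for p, g in zip(predictions, gold):
        matrix[g][p] += 1
    return matrix


toy_gold = [2, 1, 1, 1, 1]
toy_pred = [2, 1, 0, 1, 2]  # made-up predictions for illustration
print(accuracy(toy_pred, toy_gold))          # 0.6
print(confusion_matrix(toy_pred, toy_gold))  # [[0, 0, 0], [1, 2, 1], [0, 0, 1]]
```

On the real task, the predictions would come from taking the argmax of the classifier output for each development pair.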

In [29]:
hidden_size = 512
encoder1 = EncoderRNN(Chinese.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, Chinese.n_words, dropout_p=0.1).to(device)
soft1 = Soft(MAX_LENGTH,3).to(device)

trainIters(encoder1, attn_decoder1,soft1, 30000, print_every=10, learning_rate=0.01, epoch_num=10,batch_size=32)

0  data: loss =  36.19694519042969
32  data: loss =  35.62521743774414
64  data: loss =  35.727823893229164
96  data: loss =  35.761539459228516
128  data: loss =  35.957678985595706
160  data: loss =  35.906670888264976
192  data: loss =  36.1416380746024
224  data: loss =  36.0891809463501
256  data: loss =  36.06988525390625
288  data: loss =  36.00167007446289
320  data: loss =  36.11697491732511
352  data: loss =  36.10604985555013
384  data: loss =  35.9520137493427
416  data: loss =  35.911708286830354
448  data: loss =  35.953006490071616
480  data: loss =  35.83302068710327
512  data: loss =  35.82648894366096
544  data: loss =  35.7864793141683
576  data: loss =  35.864562586734166
608  data: loss =  35.87766437530517
640  data: loss =  35.79988479614258
672  data: loss =  35.84831914034757
704  data: loss =  35.88642319389012
736  data: loss =  35.90742556254069
768  data: loss =  35.88663803100586
800  data: loss =  35.89112868675819
832  data: loss =  35.86531646163375
864

KeyboardInterrupt: 
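A short calculation helps interpret the log. `train` adds `criterion` over every example in the batch, so each printed value is a running mean of per-batch sums rather than a per-example loss. A classifier that spreads probability uniformly over three classes has a cross-entropy of ln 3 per example. The value 35.9 below is a rounded reading of the printed losses, used only for illustration.

```python
import math

batch_size = 32       # batch_size passed to trainIters in the cell above
n_classes = 3         # entailment, neutral, contradiction
reported_loss = 35.9  # rounded value read from the printed output

chance_per_example = math.log(n_classes)
print(round(chance_per_example, 4))               # 1.0986
print(round(chance_per_example * batch_size, 2))  # 35.16
print(round(reported_loss / batch_size, 3))       # 1.122
```

The per-example loss of about 1.12 sits slightly above the uniform-guess value of about 1.10, which is consistent with a model that has not yet learned to separate the three classes.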

## 4. Attention Visualization (10 points)

Visualize the attention matrix in your model here.

## 5. Model Attack (30 points)

Attack your model here.
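One simple way to probe a model is to perturb the hypothesis and count how often a correct prediction flips. The sketch below swaps one adjacent pair of words; it is only one possible perturbation, not the attack this section requires. The names `swap_adjacent_tokens`, `attack_success_rate`, and `toy_predict` are hypothetical, and `toy_predict` is a stand-in rule, not the trained network.

```python
import random


def swap_adjacent_tokens(tokens, rng):
    """Return a copy of tokens with one randomly chosen adjacent pair swapped."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    perturbed = list(tokens)
    perturbed[i], perturbed[i + 1] = perturbed[i + 1], perturbed[i]
    return perturbed


def attack_success_rate(predict, examples, seed=0):
    """Share of correctly classified examples whose prediction changes after the swap."""
    rng = random.Random(seed)
    attacked = 0
    flipped = 0
    for premise, hypothesis, label in examples:
        if predict(premise, hypothesis) != label:
            continue  # only attack examples the model already gets right
        attacked += 1
        if predict(premise, swap_adjacent_tokens(hypothesis, rng)) != label:
            flipped += 1
    return flipped / attacked if attacked else 0.0


def toy_predict(premise, hypothesis):
    # Stand-in rule (assumption): entailment if the first hypothesis word occurs in the premise
    return 0 if hypothesis and hypothesis[0] in premise else 2


toy_examples = [
    (["上海", "马路上", "喧声"], ["上海", "马路"], 0),
    (["第一", "封信"], ["第一", "封信"], 0),
]
rate = attack_success_rate(toy_predict, toy_examples)
print(rate)  # 0.5
```

With a trained model, `predict` would wrap the encoder, decoder, and classifier, and the perturbation could be replaced by synonym substitution or word deletion.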

## 6. Conclusion (5 points)

Write down your conclusion here.

## Reference

[1] OCNLI: Original Chinese Natural Language Inference, arXiv: https://arxiv.org/abs/2010.05444