# <center>CSCI544 Homework2 Report</center>

In [1]:
import pandas as pd
import numpy as np
import torch.nn as nn
import torch
import torch.nn.functional as F

## 1. Dataset Generation

In [2]:
# read data
df = pd.read_csv('amazon_reviews_us_Kitchen_v1_00.tsv',
                 sep='\t',
                 usecols=['star_rating','review_body'])
# dropping the rows with missing values (eg., Nan)
df = df.dropna().reset_index(drop=True)

In [3]:
df['star_rating'].value_counts()

5.0    3124740
4.0     731718
1.0     426887
3.0     349552
2.0     241945
Name: star_rating, dtype: int64

Random select 250K reviews along with rates, 50K instances per rating score.

In [4]:
data_5 = (df[df['star_rating'] == 5]).sample(n=50000, random_state = 3)
data_4 = (df[df['star_rating'] == 4]).sample(n=50000, random_state = 3)
data_3 = (df[df['star_rating'] == 3]).sample(n=50000, random_state = 3)
data_2 = (df[df['star_rating'] == 2]).sample(n=50000, random_state = 3)
data_1 = (df[df['star_rating'] == 1]).sample(n=50000, random_state = 3)

use two datasets *data* and *data_without3* to store the generating data, the latter one does not contain the 3 stars rating.

In [5]:
from sklearn.utils import shuffle
# merge two sub dataset and shuffle
data = shuffle(pd.concat([data_5, data_4, data_3, data_2, data_1]))
data_without3 = shuffle(pd.concat([data_5, data_4, data_2, data_1]))

In [6]:
data = data.reset_index(drop=True)
data_without3 = data_without3.reset_index(drop=True)

Relable the 'star_rating', 4 and 5 denote to possible sentiment (class 1) which will be represented by *label 1*, 1 and 2 will be represented by *lable 2* and star 3 will be denoted to *label 3*.

In [7]:
data['label'] = data.star_rating.apply(lambda x: 1 if(x>=4) else (2 if(x<=2) else 3))
data['label'].value_counts()

1    100000
2    100000
3     50000
Name: label, dtype: int64

Relable the 'star_rating', 4 and 5 denote to possible sentiment (class 1) which will be represented by *label 1*, 1 and 2 will be represented by *lable 0* in the *data_without3* dataset.

In [8]:
data_without3['label'] = data_without3.star_rating.apply(lambda x: 1 if(x>=4) else 0)
data_without3['label'].value_counts()

0    100000
1    100000
Name: label, dtype: int64

After relabeling, there are 100,000 instances in *label 1* and *label 2*, respectively and 50,000 instances in *label 3*.

In [9]:
data = data[['review_body','label']]
data_without3 = data_without3[['review_body','label']]

In [10]:
# store data to local
data.to_csv('review_data.csv',index=False)
data_without3.to_csv('review_data_without3.csv',index=False)

### Train-Test Split

In [2]:
data = pd.read_csv('review_data.csv')
data_without3 = pd.read_csv('review_data_without3.csv')
data_without3.head()

Unnamed: 0,review_body,label
0,"Nice, sturdy. Arrived as promised. Head comes ...",1
1,Yuummmm!,1
2,I bought this item because I wanted a healthy ...,1
3,"I decided to put the coffee grinder, lid and a...",0
4,This pattern is stunning and certainly catches...,0


For the dataset which does not contain the class 3, there are 160,000 items in training set and 40,000 in testing set.

In [3]:
from sklearn.model_selection import train_test_split
label = data_without3['label']
reviews = data_without3.drop('label',axis=1)
X_train, X_test, y_train, y_test = train_test_split(reviews, label, random_state=42, test_size=0.2)
print(len(X_train), len(X_test), len(y_train), len(y_test))

160000 40000 160000 40000


For the dataset which contain class 3, there are 200,000 items in training set and 50,000 in testing set.

In [4]:
label1 = data['label']
reviews1 = data.drop('label',axis=1)
X_train1, X_test1, y_train1, y_test1 = train_test_split(reviews1, label1, random_state=42, test_size=0.2)
print(len(X_train1), len(X_test1), len(y_train1), len(y_test1))

200000 50000 200000 50000


In [8]:
y_train.to_csv('bin_label_train.csv',index=False)
y_test.to_csv('bin_label_test.csv',index=False)
y_train1.to_csv('ter_label_train.csv',index=False)
y_test1.to_csv('ter_label_test.csv',index=False)

## 2. Word Embedding

### (a) Pretrained Model

In [2]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

In [6]:
pairs = [
    ('well', 'excellent'),
    ('men','women')
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'well'	'excellent'	0.40
'men'	'women'	0.77


### (b) Train Word2Vec model

In [3]:
import gensim

In [18]:
reviews = []
for d in data.review_body:
    reviews.append(utils.simple_preprocess(d))

In [19]:
model = gensim.models.Word2Vec(sentences = reviews, vector_size=300, window=11, min_count=10)

In [20]:
model_path = 'w2v'
model.save(model_path)

In [4]:
model = gensim.models.Word2Vec.load('w2v')

In [9]:
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, model.wv.similarity(w1, w2)))

'well'	'excellent'	0.22
'men'	'women'	0.69


According to the similarity shown in two models mentioned above, it is obviously that pretrained model encoding semantic similarities between words better. I think the smaller the corpus and the stronger the relevance between documents, the generated word2vec model has greater advantages in checking the word similarity in documents, that is, the similarity of similar words is higher. However, due to the small corpus, many words that do not appear cannot be vectored.

# 3. Simple Models

In this section, using dataset without class 3 to train perceptron and SVM model.

## data cleaning

Perform contractions on the reviews.

Convert the all reviews into the lower case.

Remove the HTML and URLs from the reviews.

Remove non-alphabetical characters.

Remove the extra tab, Line break, spaces, etc. between the words.

The following are examples of cleaning X_train and X_test dataset, respectively.

In [13]:
import contractions
def contractionfunction(s):
    words = []
    for word in s.split():
        words.append(contractions.fix(word))
    new_str = ' '.join(words)
    return new_str

In [14]:
import re
def data_clean(X):
    l = []
    for r in X.review_body:
        # contraction
        r = contractionfunction(r)
        # change all letters into lower case
        r = r.lower()
        # dealing with html
        r = re.sub(r'</?\w+[^>]*>',' ',r)
        # dealing with urls
        r = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                   '', r)
        # dealing with non-alphabetical characters
        r = re.sub("[^a-z]+",' ',r)
        # dealing with \n, \b, \r, \t
        r = re.sub(r'\r|\n|\t','',r)
        # dealing with ,/./:/extra spaces....
        r = re.sub(r'[^\w\s]','',r)
        # perform contractions on the reviews
        l.append(r) 
    return l

In [15]:
# dealing with X_train
reviews_train = data_clean(X_train)
# dealing with X_test
reviews_test = data_clean(X_test)

In [16]:
# dealing with X_train1
reviews_train1 = data_clean(X_train1)
# dealing with X_test1
reviews_test1 = data_clean(X_test1)

## Pre-processing

### Using the same processes in homework1 to remove stop words and perform lemmatization

#### remove the stop words

In [17]:
from nltk.corpus import stopwords

def rm_stopwords(review):
    stop_words = set(stopwords.words('english'))
    words = [w for w in review.split(' ') if w not in stop_words]
    s = ' '.join(words)
    return s

In [18]:
# remove the stop words in train and test dataset respectively
reviews_train_rmstopwords = []
reviews_test_rmstopwords = []
for review in reviews_train:
    reviews_train_rmstopwords.append(rm_stopwords(review))
for review in reviews_test:
    reviews_test_rmstopwords.append(rm_stopwords(review))

In [19]:
# remove the stop words in train and test dataset respectively
reviews_train_rmstopwords1 = []
reviews_test_rmstopwords1 = []
for review in reviews_train1:
    reviews_train_rmstopwords1.append(rm_stopwords(review))
for review in reviews_test1:
    reviews_test_rmstopwords1.append(rm_stopwords(review))

#### perform lemmatization

In [20]:
from nltk.stem import WordNetLemmatizer

def per_lemmatize(review):
    lm = WordNetLemmatizer()
    words = [lm.lemmatize(w) for w in review.split(' ')]
    s = ' '.join(words)
    return s

In [21]:
reviews_train_lemmatize = []
reviews_test_lemmatize = []
for review in reviews_train_rmstopwords:
    reviews_train_lemmatize.append(per_lemmatize(review))
for review in reviews_test_rmstopwords:
    reviews_test_lemmatize.append(per_lemmatize(review))

In [22]:
reviews_train_lemmatize_split = []
reviews_test_lemmatize_split = []
for r in reviews_train_lemmatize:
    reviews_train_lemmatize_split.append(r.split(' '))
for r in reviews_test_lemmatize:
    reviews_test_lemmatize_split.append(r.split(' '))

In [23]:
import json
file_path1 = 'train_datawithout3.json'
with open(file_path1,'w') as f:
    json.dump(reviews_train_lemmatize_split, f)
file_path2 = 'test_datawithout3.json'
with open(file_path2,'w') as f:
    json.dump(reviews_test_lemmatize_split, f)

In [24]:
# deal with dataset with class 3
reviews_train_lemmatize1 = []
reviews_test_lemmatize1 = []
for review in reviews_train_rmstopwords1:
    reviews_train_lemmatize1.append(per_lemmatize(review))
for review in reviews_test_rmstopwords1:
    reviews_test_lemmatize1.append(per_lemmatize(review))
    
reviews_train_lemmatize_split1 = []
reviews_test_lemmatize_split1 = []
for r in reviews_train_lemmatize1:
    reviews_train_lemmatize_split1.append(r.split(' '))
for r in reviews_test_lemmatize1:
    reviews_test_lemmatize_split1.append(r.split(' '))
    
file_path3 = 'train_data.json'
with open(file_path3,'w') as f:
    json.dump(reviews_train_lemmatize_split1, f)
file_path4 = 'test_data.json'
with open(file_path4,'w') as f:
    json.dump(reviews_test_lemmatize_split1, f)

In [5]:
import json
file_path1 = 'train_datawithout3.json'
file_path2 = 'test_datawithout3.json'

reviews_train_lemmatize_split = json.load(open(file_path1))
reviews_test_lemmatize_split = json.load(open(file_path2))

file_path3 = 'train_data.json'
file_path4 = 'test_data.json'

reviews_train_lemmatize_split1 = json.load(open(file_path3))
reviews_test_lemmatize_split1 = json.load(open(file_path4))

y_train = pd.read_csv('bin_label_train.csv')
y_test = pd.read_csv('bin_label_test.csv')
y_train1 = pd.read_csv('ter_label_train.csv')
y_test1 = pd.read_csv('ter_label_test.csv')

## w2v

In this w2v section, using TF-IDF, pre-trained model and model trained by myself to vector the text

#### TF-IDF

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(reviews_train_lemmatize)
X_test_tfidf = vectorizer.transform(reviews_test_lemmatize)

#### Pre-trained model

Embeding the words using pre-trained model and when there are some words don't appear in the pre-trained model, I just discard them to simplize the preprocessing

In [169]:
def word_embed(review, m):
    doc = np.zeros(300)
    n = 0 
    for r in review:    
        if r in m:
            n += 1
            doc += m[r]
    if n>0:
        return doc/n
    else:
        return doc

In [28]:
x_train_pretrain = []
x_test_pretrain = []
for review in reviews_train_lemmatize_split:
    x_train_pretrain.append(word_embed(review, wv))
    
for review in reviews_test_lemmatize_split:
    x_test_pretrain.append(word_embed(review, wv))

In [29]:
# deal with dataset with class 3
x_train_pretrain1 = []
x_test_pretrain1 = []
for review in reviews_train_lemmatize_split1:
    x_train_pretrain1.append(word_embed(review, wv))
    
for review in reviews_test_lemmatize_split1:
    x_test_pretrain1.append(word_embed(review, wv))

#### My w2v model

Do the same things by using word2vec model trained by myself.

In [30]:
x_train_my  = []
x_test_my = []
for review in reviews_train_lemmatize_split:
    x_train_my.append(word_embed(review, model.wv))
    
for review in reviews_test_lemmatize_split:
    x_test_my.append(word_embed(review, model.wv))

In [170]:
x_train_my1  = []
x_test_my1 = []
for review in reviews_train_lemmatize_split1:
    x_train_my1.append(word_embed(review, model.wv))
    
for review in reviews_test_lemmatize_split1:
    x_test_my1.append(word_embed(review, model.wv))

## 3.1 Perceptron

In [18]:
# deal with the label, convert Series to ndarry type
y_train = y_train.values
y_test = y_test.values

y_train1 = y_train1.values
y_test1 = y_test1.values

In [33]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, confusion_matrix, roc_curve, roc_auc_score
def metric_measure(y_test_new, y_test_pred):  
    a_test = accuracy_score(y_test_new, y_test_pred)
    print('accuracy of test set is:',a_test)

In [35]:
from sklearn.linear_model import Perceptron
#tf-idf
clf1 = Perceptron()
clf1.fit(X_train_tfidf, y_train)
y_test_tfidf_pred = clf1.predict(X_test_tfidf)
#pre-trained
clf2 = Perceptron()
clf2.fit(x_train_pretrain, y_train)
y_test_pretrain_pred = clf2.predict(x_test_pretrain)
#my trained w2v
clf3 = Perceptron()
clf3.fit(x_train_my, y_train)
y_test_my_pred = clf3.predict(x_test_my)

In [36]:
print("TF_IDF:")
metric_measure(y_test, y_test_tfidf_pred)
print("\nword2vec-google-news-300:")
metric_measure(y_test, y_test_pretrain_pred)
print("\nmy Word2Vec:")
metric_measure(y_test, y_test_my_pred)

TF_IDF:
accuracy of test set is: 0.865175

word2vec-google-news-300:
accuracy of test set is: 0.746125

my Word2Vec:
accuracy of test set is: 0.817275


## 3.2 SVM

In [38]:
from sklearn.svm import LinearSVC
clf_svm1 = LinearSVC()
clf_svm1.fit(X_train_tfidf, y_train)
y_test_tfidf_pred = clf_svm1.predict(X_test_tfidf)
#pre-trained
clf_svm2 = LinearSVC()
clf_svm2.fit(x_train_pretrain, y_train)
y_test_pretrain_pred = clf_svm2.predict(x_test_pretrain)
#my trained w2v
clf_svm3 = LinearSVC()
clf_svm3.fit(x_train_my, y_train)
y_test_my_pred = clf_svm3.predict(x_test_my)



In [39]:
print("TF_IDF:")
metric_measure(y_test, y_test_tfidf_pred)
print("\nword2vec-google-news-300:")
metric_measure(y_test, y_test_pretrain_pred)
print("\nmy Word2Vec:")
metric_measure(y_test, y_test_my_pred)

TF_IDF:
accuracy of test set is: 0.891875

word2vec-google-news-300:
accuracy of test set is: 0.818975

my Word2Vec:
accuracy of test set is: 0.854825


# 4. Feedforward Neural Networks

## (a) Average Word Vector

### 4-(a)-1 Binary Classification

The following network is the a binary classification FNN with 2 hidden layers, each with 50 and 10 nodes, respectively.

In [40]:
class BinaryNet(nn.Module):
    def __init__(self):
        super(BinaryNet, self).__init__()
        hidden_1 = 50
        hidden_2 = 10
        self.fc1 = nn.Linear(300, hidden_1)
        # linear layer (n_hidden -> hidden_2)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
         # linear layer (n_hidden -> 2) 2 classes
        self.fc3 = nn.Linear(hidden_2, 2)
        # dropout layer (p=0.2)
        # dropout prevents overfitting of data
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # add hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        # add dropout layer
        x = self.dropout(x)
        # add output layer
        x = self.fc3(x)
        return x

In [41]:
def train_model(model, x, y, optimizer):
    # number of epochs to train the model
    n_epochs = 15

    for epoch in range(n_epochs):
        y_pred = model(x) # prep model for training
        loss = criterion(y_pred, y)
        print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch+1, loss))
        optimizer.zero_grad()
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss

Training the binary classification FNN using word2vec-google-news-300

In [42]:
model_binary1 = BinaryNet()
print(model_binary1)
# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_binary1.parameters(), lr=0.01)
x = torch.from_numpy(np.asarray(x_train_pretrain))
y = torch.from_numpy(y_train)
# Trian model
train_model(model_binary1, x.float(),y,optimizer)

BinaryNet(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
Epoch: 1 	Training Loss: 0.692728
Epoch: 2 	Training Loss: 0.679734
Epoch: 3 	Training Loss: 0.660719
Epoch: 4 	Training Loss: 0.637461
Epoch: 5 	Training Loss: 0.608692
Epoch: 6 	Training Loss: 0.578150
Epoch: 7 	Training Loss: 0.550201
Epoch: 8 	Training Loss: 0.527584
Epoch: 9 	Training Loss: 0.511844
Epoch: 10 	Training Loss: 0.502622
Epoch: 11 	Training Loss: 0.500934
Epoch: 12 	Training Loss: 0.496939
Epoch: 13 	Training Loss: 0.488979
Epoch: 14 	Training Loss: 0.484600
Epoch: 15 	Training Loss: 0.474112


In [43]:
y_pred_pretrain = np.array(model_binary1(torch.from_numpy(np.asarray(x_test_pretrain)).float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_pretrain)):
    if y_pred_pretrain[i]==y_test[i]:
        count += 1
print("accuracy of Binary Classification Feedforward Neural Networks by using word2vec-google-news-300 is:", count/40000)

accuracy of Binary Classification Feedforward Neural Networks by using word2vec-google-news-300 is: 0.7922


Training the binary classification FNN using my word2vec model

In [45]:
# initialize the NN
model_binary2 = BinaryNet()
print(model_binary2)
# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_binary2.parameters(), lr=0.01)

# Trian model
train_model(model_binary2, torch.from_numpy(np.asarray(x_train_my)).float(), torch.from_numpy(y_train),optimizer)

BinaryNet(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
Epoch: 1 	Training Loss: 0.695326
Epoch: 2 	Training Loss: 0.650984
Epoch: 3 	Training Loss: 0.579021
Epoch: 4 	Training Loss: 0.518114
Epoch: 5 	Training Loss: 0.484910
Epoch: 6 	Training Loss: 0.474535
Epoch: 7 	Training Loss: 0.470757
Epoch: 8 	Training Loss: 0.458524
Epoch: 9 	Training Loss: 0.442295
Epoch: 10 	Training Loss: 0.425144
Epoch: 11 	Training Loss: 0.412485
Epoch: 12 	Training Loss: 0.405563
Epoch: 13 	Training Loss: 0.400923
Epoch: 14 	Training Loss: 0.396845
Epoch: 15 	Training Loss: 0.392746


In [46]:
y_pred_my = np.array(model_binary2(torch.from_numpy(np.asarray(x_test_my)).float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_my)):
    if y_pred_my[i]==y_test[i]:
        count += 1
print("accuracy of Binary Classification Feedforward Neural Networks by using my word2vec model is:", count/40000)

accuracy of Binary Classification Feedforward Neural Networks by using my word2vec model is: 0.838425


### 4-(a)-2 Ternary Classification

In [24]:
def train_model_1(model, x, y, optimizer):
    # number of epochs to train the model
    n_epochs = 15

    for epoch in range(n_epochs):
        y_pred = model(x) # prep model for training
        loss = criterion(y_pred, y.float())
        print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch+1, loss))
        optimizer.zero_grad()
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss

The following it the network of ternary classification, with input 300 dimensions and output 4 dimensions (using one-hot to represents these three labels).

In [48]:
import torch.nn.functional as F
class TernaryNet(nn.Module):
    def __init__(self):
        super(TernaryNet, self).__init__()
        hidden_1 = 50
        hidden_2 = 10
        self.fc1 = nn.Linear(300, hidden_1)
        # linear layer (n_hidden -> hidden_2)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
         # linear layer (n_hidden -> 4) 3 classes
        self.fc3 = nn.Linear(hidden_2, 4)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        # add dropout layer
        x = self.dropout(x)
        # add output layer
        x = self.fc3(x)
        return x

Training Ternary classification FNN using word2vec-google-news-300

In [59]:
model_ternary1 = TernaryNet()
print(model_ternary1)
# specify loss function (categorical cross-entropy)
criterion = nn.BCEWithLogitsLoss()
optimizer1 = torch.optim.Adam(model_ternary1.parameters(), lr=0.01)
y_one_hot = torch.nn.functional.one_hot(torch.from_numpy(y_train1),4) 
# Trian model
train_model_1(model_ternary1, torch.from_numpy(np.asarray(x_train_pretrain1)).float(), y_one_hot, optimizer1)

TernaryNet(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=4, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
Epoch: 1 	Training Loss: 0.695023
Epoch: 2 	Training Loss: 0.667239
Epoch: 3 	Training Loss: 0.641101
Epoch: 4 	Training Loss: 0.611481
Epoch: 5 	Training Loss: 0.581244
Epoch: 6 	Training Loss: 0.555775
Epoch: 7 	Training Loss: 0.539521
Epoch: 8 	Training Loss: 0.532070
Epoch: 9 	Training Loss: 0.528539
Epoch: 10 	Training Loss: 0.523814
Epoch: 11 	Training Loss: 0.517178
Epoch: 12 	Training Loss: 0.509696
Epoch: 13 	Training Loss: 0.500452
Epoch: 14 	Training Loss: 0.491634
Epoch: 15 	Training Loss: 0.484169
Epoch: 16 	Training Loss: 0.477923
Epoch: 17 	Training Loss: 0.470929
Epoch: 18 	Training Loss: 0.465788
Epoch: 19 	Training Loss: 0.460675
Epoch: 20 	Training Loss: 0.456317
Epoch: 21 	Training Loss: 0.452520
Epoch: 22 	Training Loss: 0.448611
Ep

In [60]:
y_pred_pretrain = np.array(model_ternary1(torch.from_numpy(np.asarray(x_test_pretrain1)).float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_pretrain)):
    if y_pred_pretrain[i]==y_test1[i]:
        count += 1
print("accuracy of Ternary Classification Feedforward Neural Networks by using word2vec-google-news-300 is:", count/50000)

accuracy of Ternary Classification Feedforward Neural Networks by using word2vec-google-news-300 is: 0.59092


Training ternary classification FNN using my word2vec model

In [61]:
model_ternary2 = TernaryNet()
print(model_ternary2)
# specify loss function (categorical cross-entropy)
criterion = nn.BCEWithLogitsLoss()
optimizer1 = torch.optim.Adam(model_ternary2.parameters(), lr=0.01)
# Trian model
train_model_1(model_ternary2, torch.from_numpy(np.asarray(x_train_my1)).float(), y_one_hot, optimizer1)

TernaryNet(
  (fc1): Linear(in_features=300, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=4, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
Epoch: 1 	Training Loss: 0.728929
Epoch: 2 	Training Loss: 0.692136
Epoch: 3 	Training Loss: 0.649531
Epoch: 4 	Training Loss: 0.602641
Epoch: 5 	Training Loss: 0.562062
Epoch: 6 	Training Loss: 0.528674
Epoch: 7 	Training Loss: 0.500969
Epoch: 8 	Training Loss: 0.481685
Epoch: 9 	Training Loss: 0.469882
Epoch: 10 	Training Loss: 0.461428
Epoch: 11 	Training Loss: 0.451995
Epoch: 12 	Training Loss: 0.440891
Epoch: 13 	Training Loss: 0.430162
Epoch: 14 	Training Loss: 0.421607
Epoch: 15 	Training Loss: 0.415031
Epoch: 16 	Training Loss: 0.410670
Epoch: 17 	Training Loss: 0.403926
Epoch: 18 	Training Loss: 0.396525
Epoch: 19 	Training Loss: 0.390659
Epoch: 20 	Training Loss: 0.385260
Epoch: 21 	Training Loss: 0.382192
Epoch: 22 	Training Loss: 0.380088
Ep

In [62]:
y_pred_my = np.array(model_ternary2(torch.from_numpy(np.asarray(x_test_my1)).float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_my)):
    if y_pred_my[i]==y_test1[i]:
        count += 1
print("accuracy of Ternary Classification Feedforward Neural Networks by using my word2vec model is:", count/50000)

accuracy of Ternary Classification Feedforward Neural Networks by using my word2vec model is: 0.6601


According to the four networks mentioned above, binary classification models perform better than ternary classification models, and compareed with word2vec-google-news-300, the word2vec trained by myself has contributes to a higher accuracy in test set. The accuracy of these four models are concluded below.

Binary classification + word2vec-google-news-300: <font color=red>0.7922</font>

Binary classification + my word2vec: <font color=red>0.8384</font>

Ternary classification + word2vec-google-news-300: <font color=red>0.5909</font>

Ternary classification + my word2vec: <font color=red>0.6601</font>

## 4-(b) First 10 Word2Vec vectors

Similar with section w2v in 3, and also skip the unappearing words

In [6]:
def word_embed_1(review, m):
    doc = []
    n = 0 
    for r in review:
        if r in m:
            n += 1
            doc.append(m[r])
    while n!=10:
        doc.append(np.zeros(300))
        n += 1
    return doc

#### Word embeding by word2vec-google-news-300

In [None]:
x_train_pretrain_10 = []
x_test_pretrain_10 = []
for review in reviews_train_lemmatize_split:
    x_train_pretrain_10.append(word_embed_1(review, wv))
    
for review in reviews_test_lemmatize_split:
    x_test_pretrain_10.append(word_embed_1(review, wv))

In [9]:
# deal with dataset with class 3
x_train_pretrain1_10 = []
x_test_pretrain1_10 = []
for review in reviews_train_lemmatize_split1:
    x_train_pretrain1_10.append(word_embed_1(review, wv))
    
for review in reviews_test_lemmatize_split1:
    x_test_pretrain1_10.append(word_embed_1(review, wv))

#### Word embeding by my word2vec model

In [66]:
x_train_my_10 = []
x_test_my_10 = []
for review in reviews_train_lemmatize_split:
    x_train_my_10.append(word_embed_1(review, model.wv))
    
for review in reviews_test_lemmatize_split:
    x_test_my_10.append(word_embed_1(review, model.wv))

In [33]:
# deal with dataset with class 3
x_train_my1_10 = []
x_test_my1_10 = []
for review in reviews_train_lemmatize_split1:
    x_train_my1_10.append(word_embed_1(review, model.wv))
    
for review in reviews_test_lemmatize_split1:
    x_test_my1_10.append(word_embed_1(review, model.wv))

### 4-(b)-1 Binary Classification

The following is the network of a binary classification FNN

In [68]:
class BinaryNet_10(nn.Module):
    def __init__(self):
        super(BinaryNet_10, self).__init__()
        hidden_1 = 50
        hidden_2 = 10
        # linear layer (3000 -> hidden_1)
        self.fc1 = nn.Linear(300*10, hidden_1)
        # linear layer (n_hidden -> hidden_2)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
         # linear layer (n_hidden -> 2) 2 classes
        self.fc3 = nn.Linear(hidden_2, 2)
        # dropout layer (p=0.2)
        # dropout prevents overfitting of data
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # flatten the input
        x = x.view(-1, 300*10)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        # add dropout layer
        x = self.dropout(x)
        # add output layer
        x = self.fc3(x)
        return x

Training the binary classification FNN using word2vec-google-news-300

In [106]:
model_binary3 = BinaryNet_10()
print(model_binary3)
# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()
# specify optimizer (stochastic gradient descent) and learning rate = 0.01
optimizer = torch.optim.Adam(model_binary3.parameters(), lr=0.01)
x = torch.from_numpy(np.asarray(x_train_pretrain_10))
y = torch.from_numpy(y_train)
# del x_train_pretrain_10
# Trian model
train_model(model_binary3, x.float(),y ,optimizer)

BinaryNet_10(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
Epoch: 1 	Training Loss: 0.694531
Epoch: 2 	Training Loss: 0.676725
Epoch: 3 	Training Loss: 0.606086
Epoch: 4 	Training Loss: 0.580015
Epoch: 5 	Training Loss: 0.642458
Epoch: 6 	Training Loss: 0.600940
Epoch: 7 	Training Loss: 0.596777
Epoch: 8 	Training Loss: 0.547712
Epoch: 9 	Training Loss: 0.549594
Epoch: 10 	Training Loss: 0.560075
Epoch: 11 	Training Loss: 0.552329
Epoch: 12 	Training Loss: 0.539295
Epoch: 13 	Training Loss: 0.531621
Epoch: 14 	Training Loss: 0.533833
Epoch: 15 	Training Loss: 0.529248


In [107]:
xtest = torch.from_numpy(np.asarray(x_test_pretrain_10))
y_pred_pretrain = np.array(model_binary3(xtest.float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_pretrain)):
    if y_pred_pretrain[i]==y_test[i]:
        count += 1
print("accuracy of Ternary Classification Feedforward Neural Networks by using word2vec-google-news-300 and the fisrt 10 word vectors is:", count/50000)

accuracy of Ternary Classification Feedforward Neural Networks by using word2vec-google-news-300 and the fisrt 10 word vectors is: 0.59386


Training the binary classification FNN with my word2vec model

In [131]:
model_binary4 = BinaryNet_10()
print(model_binary4)
# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_binary4.parameters(), lr=0.01)
x = torch.from_numpy(np.asarray(x_train_my_10))
y = torch.from_numpy(y_train)
# Trian model
train_model(model_binary4, x.float(),y ,optimizer)

BinaryNet_10(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
Epoch: 1 	Training Loss: 0.698209
Epoch: 2 	Training Loss: 0.731390
Epoch: 3 	Training Loss: 0.561387
Epoch: 4 	Training Loss: 0.577182
Epoch: 5 	Training Loss: 0.567065
Epoch: 6 	Training Loss: 0.551155
Epoch: 7 	Training Loss: 0.531357
Epoch: 8 	Training Loss: 0.516819
Epoch: 9 	Training Loss: 0.507629
Epoch: 10 	Training Loss: 0.503787
Epoch: 11 	Training Loss: 0.504835
Epoch: 12 	Training Loss: 0.500498
Epoch: 13 	Training Loss: 0.494070
Epoch: 14 	Training Loss: 0.487584
Epoch: 15 	Training Loss: 0.481250


In [132]:
xtest = torch.from_numpy(np.asarray(x_test_my_10))
y_pred_my = np.array(model_binary4(xtest.float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_my)):
    if y_pred_my[i]==y_test[i]:
        count += 1
print("accuracy of Ternary Classification Feedforward Neural Networks by using my word2vec model and the fisrt 10 word vectors is:", count/50000)

accuracy of Ternary Classification Feedforward Neural Networks by using my word2vec model and the fisrt 10 word vectors is: 0.61254


### 4-(b)-2 Ternary Classification

The following is the network of a ternary classification FNN

In [21]:
class TernaryNet_10(nn.Module):
    def __init__(self):
        super(TernaryNet_10, self).__init__()
        hidden_1 = 50
        hidden_2 = 10
        # linear layer (3000 -> hidden_1)
        self.fc1 = nn.Linear(300*10, hidden_1)
        # linear layer (n_hidden -> hidden_2)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
         # linear layer (n_hidden -> 4) 3 classes
        self.fc3 = nn.Linear(hidden_2, 4)
        # dropout layer (p=0.2)
        # dropout prevents overfitting of data
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # flatten the input
        x = x.view(-1, 300*10)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        # add dropout layer
        x = self.dropout(x)
        # add output layer
        x = self.fc3(x)
        return x

Training the ternary classification FNN word2vec-google-news-300

In [22]:
model_ternary3 = TernaryNet_10()
print(model_ternary3)
# specify loss function (categorical cross-entropy)
criterion = nn.BCEWithLogitsLoss()
optimizer1 = torch.optim.Adam(model_ternary3.parameters(), lr=0.01)
y_one_hot = torch.nn.functional.one_hot(torch.from_numpy(y_train1),4) 
x = torch.from_numpy(np.asarray(x_train_pretrain1_10))

TernaryNet_10(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=4, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)


In [25]:
# Trian model
train_model_1(model_ternary3, x.float(), y_one_hot, optimizer1)

Epoch: 1 	Training Loss: 0.685573
Epoch: 2 	Training Loss: 0.635210
Epoch: 3 	Training Loss: 0.609810
Epoch: 4 	Training Loss: 0.596938
Epoch: 5 	Training Loss: 0.581282
Epoch: 6 	Training Loss: 0.560111
Epoch: 7 	Training Loss: 0.545001
Epoch: 8 	Training Loss: 0.533099
Epoch: 9 	Training Loss: 0.522376
Epoch: 10 	Training Loss: 0.510442
Epoch: 11 	Training Loss: 0.499474
Epoch: 12 	Training Loss: 0.488090
Epoch: 13 	Training Loss: 0.478162
Epoch: 14 	Training Loss: 0.470140
Epoch: 15 	Training Loss: 0.463374


In [26]:
y_pred_pretrain = np.array(model_ternary3(torch.from_numpy(np.asarray(x_test_pretrain1_10)).float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_pretrain)):
    if y_pred_pretrain[i]==y_test1[i]:
        count += 1
print("accuracy of Ternary Classification Feedforward Neural Networks by using word2vec-google-news-300 and the first 10 word vectors is:", count/50000)

accuracy of Ternary Classification Feedforward Neural Networks by using word2vec-google-news-300 and the first 10 word vectors is: 0.54856


Training the ternary classification FNN with my word2vec model

In [43]:
model_ternary4 = TernaryNet_10()
print(model_ternary4)
# specify loss function (categorical cross-entropy)
criterion = nn.BCEWithLogitsLoss()
optimizer1 = torch.optim.Adam(model_ternary4.parameters(), lr=0.01)
x = torch.from_numpy(np.asarray(x_train_my1_10))
# Trian model
train_model_1(model_ternary4, x.float(), y_one_hot, optimizer1)

TernaryNet_10(
  (fc1): Linear(in_features=3000, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=4, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
Epoch: 1 	Training Loss: 0.661970
Epoch: 2 	Training Loss: 0.780576
Epoch: 3 	Training Loss: 0.511252
Epoch: 4 	Training Loss: 0.488810
Epoch: 5 	Training Loss: 0.491856
Epoch: 6 	Training Loss: 0.475775
Epoch: 7 	Training Loss: 0.459634
Epoch: 8 	Training Loss: 0.451441
Epoch: 9 	Training Loss: 0.448487
Epoch: 10 	Training Loss: 0.443971
Epoch: 11 	Training Loss: 0.434198
Epoch: 12 	Training Loss: 0.424064
Epoch: 13 	Training Loss: 0.418814
Epoch: 14 	Training Loss: 0.415921
Epoch: 15 	Training Loss: 0.414045


In [44]:
y_pred_my = np.array(model_ternary4(torch.from_numpy(np.asarray(x_test_my1_10)).float()).argmax(axis=1))
count = 0
for i in range(0,len(y_pred_my)):
    if y_pred_my[i]==y_test1[i]:
        count += 1
print("accuracy of Ternary Classification Feedforward Neural Networks by using my word2vec model and the first 10 word vectors is:", count/50000)

accuracy of Ternary Classification Feedforward Neural Networks by using my word2vec model and the first 10 word vectors is: 0.5936


According to the four networks mentioned above, using only the first 10 vectors in the reviews, binary classification models perform better than ternary classification models, and compareed with word2vec-google-news-300, the word2vec trained by myself has contributes to a higher accuracy in test set. The accuracy of these four models are concluded below.

Binary classification + word2vec-google-news-300(first 10 vectors): <font color=red>0.5938</font>

Binary classification + my word2vec(first 10 vectors): <font color=red>0.6125</font>

Ternary classification + word2vec-google-news-300(first 10 vectors): <font color=red>0.5486</font>

Ternary classification + my word2vec(first 10 vectors): <font color=red>0.5936</font>

After comparing all of these 8 models in section 4, I can find out that although using the average word vectors performs better than using only the first 10 word vector, all of their performances are not better than simple models mentioned in section 3. Only one binary classification which using average word2vec trained by myself has the best accuracy in these 8 models, which is 0.8384. It is similar with the accuracy of Perceptron with my wrod2vec and SVM with my word2vec, and it is higher than perceptron and SVM using word2vec-google-news-300, but lower than those two models using TF-IDF.

The part 5 Recurrent Neural Network is in the file CSCI544-HW2_p2.ipynb.