# Project 1 - Sentiment Classification

In this project, you will carry out a sentiment analysis task.
You will build a model to predict the score (a.k.a. stars, from 1 to 5) of each review.
For each review, you are given a piece of text as well as some other features (explore them yourself!).
You can treat the target variable as categorical, ordinal, or numerical.
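
The choice of target type affects what the model outputs. As a minimal sketch (the helper name `to_star` is hypothetical and not part of the project code), a model that treats the stars as numerical produces real-valued predictions, which have to be mapped back onto the 1-5 scale before they can be scored as class labels:

```python
import math

def to_star(value, low=1, high=5):
    """Round a real-valued prediction half up and clip it to [low, high]."""
    rounded = math.floor(value + 0.5)
    return int(min(high, max(low, rounded)))

# Illustrative regression outputs, not real model predictions
raw_predictions = [0.3, 2.5, 3.49, 4.8, 6.2]
print([to_star(p) for p in raw_predictions])  # [1, 3, 3, 5, 5]
```

Rounding half up is used here instead of Python's built-in `round`, which rounds halves to the nearest even integer.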

Deadline (DDL): *April 6, 2021*
- *March 23, 2021*: release of the weak baseline's validation score (60.2%)
- *March 30, 2021*: release of the strong baseline's validation score

Submission: Each team leader is required to submit a `groupNo.zip` file on Canvas. It should contain:
- `pre.csv`: predictions on the test data (please make sure you can successfully evaluate your validation predictions on the validation data with the help of `evaluate.py`)
- report (1-2 pages, PDF)
- code (frameworks and programming languages are not restricted)

We will check your report against your code and its accuracy.

| Grade | Classifier (80%)                                                   | Report (20%)                      |
|-------|--------------------------------------------------------------------|-----------------------------------|
| 50%   | example code in tutorials or in Project 1 without any modification | submission                        |
| 60%   | an easy baseline that most students can outperform                 | algorithm you used                |
| 80%   | a competitive baseline that about half of the students can surpass | detailed explanation              |
| 90%   | a very competitive baseline without any special mechanism          | detailed explanation and analysis, such as explorative data analysis and ablation study |
| 100%  | a very competitive baseline with at least one mechanism            | excellent ideas, detailed explanation and solid analysis |


## Instruction Content

1. Load & Dump the data
    1. Load the data
    1. Dump the data
1. Preprocessing
    1. Text data processing recap
    1. Explorative data analysis
1. Learning Baselines


## 1. Preparation

In [30]:
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
ps = PorterStemmer()


import spacy
nlp = spacy.load('en_core_web_sm')

import keras as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
# from keras import metrics

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
import tqdm

### Load the data

Here is a function to load your data. Remember to put the dataset in the `data_2021_spring` folder.


In [14]:
def load_data(split_name='train', columns=('text', 'stars')):
    """Load one split; fall back to all columns if a requested column is missing."""
    columns = list(columns)
    # read the file once, then try to select the requested columns
    df = pd.read_csv(f'data_2021_spring/{split_name}.csv')
    try:
        print(f"select [{', '.join(columns)}] columns from the {split_name} split")
        df = df.loc[:, columns]
        print("succeed!")
    except KeyError:
        # e.g. the test split has no 'stars' column, so keep every column
        print("Failed, then try to ")
        print(f"select all columns from the {split_name} split")
    return df

In [15]:
train_df = load_data('train', columns=['text',\
                                       'stars','business_id','cool',\
                                       'date','funny','review_id','useful','user_id'])

select [text, stars, business_id, cool, date, funny, review_id, useful, user_id] columns from the train split
succeed!


In [16]:
# Columns of the dataset:
#   business_id
#   cool: whether it's cool
#   date: publication date
#   funny: whether it's funny
#   review_id
#   text: content
#   useful: whether it's useful
#   user_id
# load 10000 sentences as the training set
# head(0) displays only the column names
train_df.head(0)

Unnamed: 0,text,stars,business_id,cool,date,funny,review_id,useful,user_id


In [17]:
#Prepare the test set
test_df = load_data('test')
test_df.head(0)

select [text, stars] columns from the test split
Failed, then try to 
select all columns from the test split


Unnamed: 0,business_id,cool,date,funny,review_id,text,useful,user_id


#### Below is how to write a dataset into a .csv file, i.e. the format of the submission file.

In [18]:
#construct a dataset with genuine review_id and randomized stars
#np.random.randint excludes the upper bound, so (1, 6) yields stars from 1 to 5
random_ans = pd.DataFrame(data={
    'review_id': test_df['review_id'],
    'stars': np.random.randint(1, 6, size=len(test_df))
})

In [19]:
random_ans.head()

Unnamed: 0,review_id,stars
0,b8-ELBwhmDKcmcM8icT86g,2
1,rBpAJhIen_V-zLoXZIcROg,5
2,_pALaDG6se9OTkGGhyhnNA,4
3,ru8fpA1Uk0tTFtO5hLM49g,4
4,fRPgwuFoY6SriToXZyaOQA,3


In [20]:
#write this dataset into a .csv file, which should be the format of our submission
group_number = -1
random_ans.to_csv(f'{group_number}-random_ans.csv', index=False)
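
Before submitting, it can help to check the file yourself. The sketch below is not the course's `evaluate.py`; `check_submission` is a hypothetical helper, and it assumes (based on the example above) that the file has exactly the columns `review_id` and `stars`, that stars are integers from 1 to 5, and that rows follow the order of the test set:

```python
import csv
import io

def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumed format: header is exactly review_id,stars
    if reader.fieldnames != ["review_id", "stars"]:
        return ["header must be exactly: review_id,stars"]
    problems = []
    seen_ids = []
    for line_no, row in enumerate(reader, start=2):
        seen_ids.append(row["review_id"])
        if row["stars"] not in {"1", "2", "3", "4", "5"}:
            problems.append(f"line {line_no}: stars={row['stars']!r} is not in 1-5")
    if seen_ids != list(expected_ids):
        problems.append("review_id column does not match the test set order")
    return problems

# Placeholder ids for illustration only
good = "review_id,stars\na1,5\nb2,1\n"
bad = "review_id,stars\na1,0\nb2,3\n"
print(check_submission(good, ["a1", "b2"]))  # []
print(check_submission(bad, ["a1", "b2"]))   # ["line 2: stars='0' is not in 1-5"]
```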

## 2. Preprocessing

Preprocessing and feature engineering

In [27]:
#return the lowercased version of the texts
def lower(s):
    """
    :param s: a string, or a (possibly nested) list of strings
    return the same structure with every string lowercased
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: 'text mining is to identify useful information.'
    """
    if isinstance(s, list):
        return [lower(t) for t in s]
    if isinstance(s, str):
        return s.lower()
    else:
        raise NotImplementedError("unknown datatype")

#tokenize the texts
def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)


def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]

def n_gram(tokens, n=1):
    """
    :param tokens: a list of tokens, type: list
    :param n: the corresponding n-gram, type: int
    return a list of n-gram tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.'], 2
    Output: ['text mine', 'mine is', 'is to', 'to identifi', 'identifi use', 'use inform', 'inform .']
    """
    if n == 1:
        return tokens
    else:
        results = list()
        for i in range(len(tokens)-n+1):
            # tokens[i:i+n] will return a sublist from i th to i+n th (i+n th is not included)
            results.append(" ".join(tokens[i:i+n]))
        return results
    

def filter_stopwords(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of filtered tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    Output: ['text', 'mine', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     if token not in stopwords and not token.isnumeric():
    #         results.append(token)
    # return results

    return [token for token in tokens if token not in stopwords and not token.isnumeric()]



def get_onehot_vector(feats, feats_dict):
    """
    :param feats: a list of features, type: list
    :param feats_dict: a dict from features to indices, type: dict
    return a feature vector, type: np.ndarray
    """
    # initialize the vector as all zeros
    # (np.float was removed in recent NumPy releases; np.float64 is equivalent)
    vector = np.zeros(len(feats_dict), dtype=np.float64)
    for f in feats:
        # get the feature index, or -1 if the feature does not exist
        f_idx = feats_dict.get(f, -1)
        if f_idx != -1:
            # set the corresponding element as 1
            vector[f_idx] = 1
    return vector
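
The functions above do not show how `feats_dict` is built. As a minimal sketch, the hypothetical helper `build_feats_dict` below assigns each distinct training feature an index in order of first appearance, and `onehot` mirrors `get_onehot_vector` so that the block runs on its own:

```python
import numpy as np

def build_feats_dict(token_lists):
    """Map every distinct feature to a column index, in order of first appearance."""
    feats_dict = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in feats_dict:
                feats_dict[token] = len(feats_dict)
    return feats_dict

def onehot(feats, feats_dict):
    """Same behaviour as get_onehot_vector: features missing from the dict are skipped."""
    vector = np.zeros(len(feats_dict), dtype=np.float64)
    for f in feats:
        idx = feats_dict.get(f, -1)
        if idx != -1:
            vector[idx] = 1
    return vector

# Toy documents for illustration only
train_docs = [["good", "food"], ["bad", "food", "bad", "service"]]
feats_dict = build_feats_dict(train_docs)
print(feats_dict)  # {'good': 0, 'food': 1, 'bad': 2, 'service': 3}
print(onehot(["bad", "food", "unseen"], feats_dict))  # [0. 1. 1. 0.]
```

Note that the vector records presence only: "bad" appearing twice in a document would still give 1, and the unseen feature is dropped silently.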

In [28]:
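# Note: stopwords are filtered before lowercasing, so capitalized stopwords
# such as 'I', 'We' and 'It' are kept (see the output below).
test_df['tokens'] = test_df['text'].map(tokenize).map(filter_stopwords).map(lower)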
print(test_df['tokens'].head().to_string())

0    [i, took, up, train, union, station, catch, ai...
1    [we, worked, fitness, twist, part, best, frien...
2    [it, 's, typical, ,, average, ,, run-of-the-mi...
3    [we, went, outback, today, celebrate, daughter...
4    [we, went, see, nashville, unplugged, country,...
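
The output above still contains tokens such as `i`, `we` and `it`. The order of the steps is the likely reason: NLTK's English stopword list is lowercase, so a capitalized "We" passes the filter and is only lowercased afterwards. The sketch below illustrates this with a tiny stand-in stopword set and simplified helpers (both are assumptions for illustration, not the functions defined above):

```python
# Tiny lowercase stand-in for the NLTK English stopword list
STOPWORDS = {"we", "it", "is", "the", "to"}

def filter_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def lower(tokens):
    return [t.lower() for t in tokens]

tokens = ["We", "went", "to", "the", "Outback"]

filter_then_lower = lower(filter_stopwords(tokens))
lower_then_filter = filter_stopwords(lower(tokens))
print(filter_then_lower)  # ['we', 'went', 'outback']
print(lower_then_filter)  # ['went', 'outback']
```

If the intent is to drop all stopwords regardless of case, lowercasing first is the safer order.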


In [None]:
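# Inspect spaCy's token attributes on a sample sentence
# (the sentence is illustrative; any text can be passed to nlp)
doc = nlp('Text mining is to identify useful information.')
fmt = "{:10s},\t " * 8

for token in doc:
    print(fmt.format(token.text, token.lemma_, token.pos_, token.dep_,
            token.shape_, str(token.is_alpha), str(token.is_stop), 
                     str(list(token.children))))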

## 3. Classifier

In [31]:
train_df = load_data('train')[:5000]
valid_df = load_data('valid')

select [text, stars] columns from the train split
succeed!
select [text, stars] columns from the valid split
succeed!


In [33]:
x_train = train_df['text']
y_train = train_df['stars']

In [34]:
tfidf = TfidfVectorizer(tokenizer=tokenize)
lr = LogisticRegression()
steps = [('tfidf', tfidf),('lr', lr)]
pipe = Pipeline(steps)
print(pipe)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x7f01c792c0d0>)),
                ('lr', LogisticRegression())])


In [35]:
pipe.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x7f01c792c0d0>)),
                ('lr', LogisticRegression())])
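
The warning above reports that the solver hit its iteration limit and suggests raising `max_iter`. A minimal sketch of that change on toy data follows; the value 1000 is an arbitrary assumption, the toy texts are made up, and whether it improves accuracy on the real data has not been checked here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data for illustration only
toy_texts = ["great food", "terrible service", "great service", "terrible food"]
toy_stars = [5, 1, 5, 1]

pipe_more_iters = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # allow more solver iterations than the default
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe_more_iters.fit(toy_texts, toy_stars)
print(pipe_more_iters.predict(["great food and great service"]))
```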

In [36]:
x_valid = valid_df['text']
y_valid = valid_df['stars']
y_pred = pipe.predict(x_valid)
print(classification_report(y_valid, y_pred))
print("\n\n")
print(confusion_matrix(y_valid, y_pred))
print('accuracy', np.mean(y_valid == y_pred))

              precision    recall  f1-score   support

           1       0.66      0.88      0.75       517
           2       0.41      0.14      0.21       278
           3       0.44      0.47      0.45       344
           4       0.50      0.51      0.50       427
           5       0.70      0.67      0.68       434

    accuracy                           0.58      2000
   macro avg       0.54      0.53      0.52      2000
weighted avg       0.56      0.58      0.56      2000




[[456  23  21  11   6]
 [119  38  96  20   5]
 [ 64  22 160  87  11]
 [ 22   6  78 217 104]
 [ 32   3  11  99 289]]
accuracy 0.58
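
The numbers in the report can be recomputed from the confusion matrix, whose rows are true classes and columns are predicted classes. The sketch below copies the matrix printed above and derives accuracy, per-class recall and per-class precision from it:

```python
# Confusion matrix copied from the output above (rows: true stars 1-5)
confusion = [
    [456, 23, 21, 11, 6],
    [119, 38, 96, 20, 5],
    [64, 22, 160, 87, 11],
    [22, 6, 78, 217, 104],
    [32, 3, 11, 99, 289],
]
n_classes = len(confusion)

# accuracy = correctly classified (diagonal) / all samples
correct = sum(confusion[i][i] for i in range(n_classes))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# recall divides by the row sum, precision by the column sum
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
precision = [confusion[i][i] / sum(row[i] for row in confusion)
             for i in range(n_classes)]

print(correct, total, accuracy)          # 1160 2000 0.58
print([round(r, 2) for r in recall])     # [0.88, 0.14, 0.47, 0.51, 0.67]
print([round(p, 2) for p in precision])  # [0.66, 0.41, 0.44, 0.5, 0.7]
```

The low recall for 2-star reviews (38 of 278) comes mostly from the 119 reviews predicted as 1 star and the 96 predicted as 3 stars.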


In [37]:
train_text = train_df['text'].map(tokenize).map(filter_stopwords).map(stem)
valid_text = valid_df['text'].map(tokenize).map(filter_stopwords).map(stem)

In [39]:
# build the vocabulary from the training tokens only
word2id = {}
for tokens in train_text:
    for t in tokens:
        if t not in word2id:
            word2id[t] = len(word2id)
# reserve one more id for the padding token
word2id['<pad>'] = len(word2id)

In [40]:
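def texts_to_id_seq(texts, padding_length=500):
    # convert each token list to a fixed-length list of ids
    records = []
    for tokens in texts:
        record = []
        for t in tokens:
            # tokens not in the vocabulary share the extra id len(word2id)
            record.append(word2id.get(t, len(word2id)))
        if len(record) >= padding_length:
            # truncate long sequences
            records.append(record[:padding_length])
        else:
            # right-pad short sequences with the <pad> id
            records.append(record + [word2id['<pad>']] * (padding_length - len(record)))
    return records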
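
To make the truncation, padding and unknown-token behaviour concrete, here is a small sketch. `to_id_seq` is a hypothetical single-document variant that takes the vocabulary as an argument, and the toy vocabulary is made up:

```python
def to_id_seq(tokens, word2id, padding_length):
    """Convert one token list to a fixed-length id list (truncate or right-pad)."""
    unk_id = len(word2id)  # ids 0..len-1 are taken, so this id means "unknown"
    ids = [word2id.get(t, unk_id) for t in tokens][:padding_length]
    return ids + [word2id["<pad>"]] * (padding_length - len(ids))

toy_word2id = {"good": 0, "food": 1, "<pad>": 2}
print(to_id_seq(["good", "food"], toy_word2id, 4))           # [0, 1, 2, 2]
print(to_id_seq(["good", "pizza", "food"], toy_word2id, 4))  # [0, 3, 1, 2]
print(to_id_seq(["good"] * 6, toy_word2id, 4))               # [0, 0, 0, 0]
```

Because unknown tokens receive the id `len(word2id)`, an embedding table indexed by these ids needs `len(word2id) + 1` rows.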

In [41]:
train_seqs = texts_to_id_seq(train_text)
valid_seqs = texts_to_id_seq(valid_text)

In [44]:
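class MyDataset(Dataset):

    def __init__(self, seq, y):
        assert len(seq) == len(y)
        self.seq = seq
        # shift stars 1-5 to class indices 0-4; converting to an array keeps
        # indexing positional even if y is a pandas Series with a non-default index
        self.y = np.asarray(y) - 1

    def __getitem__(self, idx):
        return np.asarray(self.seq[idx]), self.y[idx]

    def __len__(self):
        return len(self.seq)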

In [45]:
batch_size = 16

train_loader = DataLoader(MyDataset(train_seqs, y_train), batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(MyDataset(valid_seqs, y_valid), batch_size=batch_size)
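
With a batch size of 16, the number of batches per epoch follows from the dataset sizes (5000 training and 2000 validation reviews). A minimal sketch of the arithmetic, where `num_batches` is a hypothetical helper that mirrors how a `DataLoader` counts batches when the last partial batch is kept (the default):

```python
import math

def num_batches(num_samples, batch_size, drop_last=False):
    """Batches per epoch; the last partial batch is kept unless drop_last is set."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

print(num_batches(5000, 16))  # 313
print(num_batches(2000, 16))  # 125
```

These counts match the totals shown in the progress bars of the training log.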

In [46]:
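class mlp(nn.Module):
    # Despite its name, this model is a 1-D CNN over token embeddings,
    # not a multilayer perceptron.
    def __init__(self):
        super(mlp, self).__init__()
        # +1 row for the id len(word2id) that texts_to_id_seq gives to unknown tokens
        self.embedding = nn.Embedding(num_embeddings=len(word2id)+1, embedding_dim=64)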
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels=64,
                      out_channels=64,
                      kernel_size=3,
                      stride=1),
            nn.MaxPool1d(kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Conv1d(in_channels=64,
                      out_channels=64,
                      kernel_size=3,
                      stride=1),
            nn.MaxPool1d(kernel_size=3, stride=1),
            nn.Dropout(0.5)
        )
        self.linear = nn.Linear(64, 5)
    
    def forward(self, x):
        x = self.embedding(x)
        x = torch.transpose(x, 1, 2)
        x = self.cnn(x)
        x = torch.max(x, dim=-1)[0]
        x = self.linear(x)
        return x
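
To follow the tensor shapes through `self.cnn`, recall that an unpadded `Conv1d` or `MaxPool1d` with dilation 1 shortens a sequence of length L to floor((L - k) / s) + 1, where k is the kernel size and s the stride. The sketch below applies this to the padded length of 500; `out_length` is a hypothetical helper, not a PyTorch function:

```python
def out_length(length, kernel_size, stride=1):
    """Output length of Conv1d/MaxPool1d with no padding and dilation 1."""
    return (length - kernel_size) // stride + 1

length = 500  # padding_length used by texts_to_id_seq
lengths = {}
for name in ["conv1", "pool1", "conv2", "pool2"]:
    length = out_length(length, kernel_size=3, stride=1)
    lengths[name] = length
print(lengths)  # {'conv1': 498, 'pool1': 496, 'conv2': 494, 'pool2': 492}
```

The final `torch.max(x, dim=-1)[0]` in `forward` then takes the maximum over the remaining 492 positions, leaving one 64-dimensional vector per review for the linear layer.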

In [47]:
model = mlp()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

In [48]:
for e in range(1, 11):
    print('epoch', e)
    model.train()
    total_acc = 0
    total_loss = 0
    total_count = 0
    with tqdm.tqdm(train_loader) as t:
        for x, y in t:
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            total_acc += (logits.argmax(1) == y).sum().item()
            total_count += y.size(0)
            total_loss += loss.item()
            optimizer.step()
            # 'loss' is the sum of batch-mean losses divided by the number of samples seen
            t.set_postfix({'loss': total_loss/total_count, 'acc': total_acc/total_count})

    model.eval()
    y_pred = []
    y_true = []
    # gradients are not needed for evaluation
    with torch.no_grad():
        with tqdm.tqdm(valid_loader) as t:
            for x, y in t:
                logits = model(x)
                y_pred += logits.argmax(1).tolist()
                y_true += y.tolist()
    print(classification_report(y_true, y_pred))
    print("\n\n")
    print(confusion_matrix(y_true, y_pred))

  0%|          | 1/313 [00:00<00:32,  9.53it/s, loss=0.11, acc=0.167]  

epoch 1


100%|██████████| 313/313 [00:16<00:00, 18.79it/s, loss=0.0961, acc=0.311]
100%|██████████| 125/125 [00:01<00:00, 82.00it/s]
  1%|          | 2/313 [00:00<00:20, 15.14it/s, loss=0.0784, acc=0.594]

              precision    recall  f1-score   support

           0       0.47      0.72      0.57       517
           1       0.00      0.00      0.00       278
           2       0.30      0.44      0.36       344
           3       0.40      0.12      0.18       427
           4       0.46      0.61      0.53       434

    accuracy                           0.42      2000
   macro avg       0.33      0.38      0.33      2000
weighted avg       0.36      0.42      0.36      2000




[[373   1  60   6  77]
 [139   0  88   9  42]
 [116   0 153  23  52]
 [ 90   0 154  50 133]
 [ 72   0  60  38 264]]
epoch 2


100%|██████████| 313/313 [00:15<00:00, 19.99it/s, loss=0.0817, acc=0.461]
100%|██████████| 125/125 [00:01<00:00, 86.31it/s]
  1%|          | 3/313 [00:00<00:13, 22.26it/s, loss=0.0702, acc=0.484]

              precision    recall  f1-score   support

           0       0.64      0.63      0.64       517
           1       0.30      0.11      0.16       278
           2       0.28      0.71      0.40       344
           3       0.43      0.07      0.13       427
           4       0.53      0.53      0.53       434

    accuracy                           0.43      2000
   macro avg       0.43      0.41      0.37      2000
weighted avg       0.46      0.43      0.40      2000




[[328  36 119   1  33]
 [ 71  31 151   5  20]
 [ 36  19 243  12  34]
 [ 33  13 232  32 117]
 [ 48   6 127  24 229]]
epoch 3


100%|██████████| 313/313 [00:15<00:00, 19.95it/s, loss=0.0688, acc=0.558]
100%|██████████| 125/125 [00:01<00:00, 94.13it/s]
  1%|          | 3/313 [00:00<00:13, 22.15it/s, loss=0.0621, acc=0.609]

              precision    recall  f1-score   support

           0       0.71      0.57      0.64       517
           1       0.28      0.30      0.29       278
           2       0.36      0.13      0.20       344
           3       0.32      0.72      0.45       427
           4       0.67      0.32      0.44       434

    accuracy                           0.44      2000
   macro avg       0.47      0.41      0.40      2000
weighted avg       0.50      0.44      0.43      2000




[[297 106  15  85  14]
 [ 59  84  22 108   5]
 [ 22  65  46 203   8]
 [ 15  28  36 307  41]
 [ 25  12   8 248 141]]
epoch 4


100%|██████████| 313/313 [00:15<00:00, 20.01it/s, loss=0.0574, acc=0.643]
100%|██████████| 125/125 [00:01<00:00, 92.16it/s]
  1%|          | 3/313 [00:00<00:14, 21.15it/s, loss=0.0432, acc=0.75] 

              precision    recall  f1-score   support

           0       0.69      0.64      0.66       517
           1       0.28      0.41      0.33       278
           2       0.38      0.30      0.33       344
           3       0.37      0.35      0.36       427
           4       0.55      0.56      0.55       434

    accuracy                           0.47      2000
   macro avg       0.45      0.45      0.45      2000
weighted avg       0.48      0.47      0.47      2000




[[329 110  29  16  33]
 [ 69 113  39  33  24]
 [ 36  95 103  88  22]
 [ 15  60  80 150 122]
 [ 25  31  22 114 242]]
epoch 5


100%|██████████| 313/313 [00:15<00:00, 20.16it/s, loss=0.0451, acc=0.724]
100%|██████████| 125/125 [00:01<00:00, 91.99it/s]
  1%|          | 3/313 [00:00<00:12, 25.02it/s, loss=0.0291, acc=0.906]

              precision    recall  f1-score   support

           0       0.69      0.69      0.69       517
           1       0.30      0.33      0.31       278
           2       0.36      0.32      0.34       344
           3       0.35      0.52      0.42       427
           4       0.63      0.35      0.45       434

    accuracy                           0.47      2000
   macro avg       0.47      0.44      0.44      2000
weighted avg       0.49      0.47      0.47      2000




[[359  80  29  33  16]
 [ 84  91  47  51   5]
 [ 33  79 111 113   8]
 [ 14  37  94 220  62]
 [ 28  18  27 208 153]]
epoch 6


100%|██████████| 313/313 [00:15<00:00, 19.66it/s, loss=0.0338, acc=0.811]
100%|██████████| 125/125 [00:01<00:00, 89.09it/s]
  1%|          | 2/313 [00:00<00:18, 16.97it/s, loss=0.0303, acc=0.833]

              precision    recall  f1-score   support

           0       0.61      0.79      0.69       517
           1       0.32      0.16      0.21       278
           2       0.33      0.42      0.37       344
           3       0.36      0.33      0.34       427
           4       0.58      0.49      0.53       434

    accuracy                           0.47      2000
   macro avg       0.44      0.44      0.43      2000
weighted avg       0.46      0.47      0.46      2000




[[407  27  42  17  24]
 [116  44  77  32   9]
 [ 67  38 145  76  18]
 [ 32  19 134 140 102]
 [ 46  10  43 124 211]]
epoch 7


100%|██████████| 313/313 [00:16<00:00, 18.43it/s, loss=0.0228, acc=0.873]
100%|██████████| 125/125 [00:01<00:00, 69.47it/s]
  1%|          | 2/313 [00:00<00:17, 17.56it/s, loss=0.0143, acc=0.958]

              precision    recall  f1-score   support

           0       0.71      0.62      0.66       517
           1       0.29      0.28      0.29       278
           2       0.31      0.48      0.38       344
           3       0.37      0.35      0.36       427
           4       0.60      0.46      0.52       434

    accuracy                           0.46      2000
   macro avg       0.46      0.44      0.44      2000
weighted avg       0.49      0.46      0.47      2000




[[322  92  59  18  26]
 [ 69  79  89  29  12]
 [ 28  59 165  77  15]
 [ 11  29 158 151  78]
 [ 23  15  62 134 200]]
epoch 8


100%|██████████| 313/313 [00:16<00:00, 19.25it/s, loss=0.0143, acc=0.931]
100%|██████████| 125/125 [00:01<00:00, 93.47it/s]
  1%|          | 2/313 [00:00<00:21, 14.20it/s, loss=0.0087, acc=1]

              precision    recall  f1-score   support

           0       0.64      0.74      0.69       517
           1       0.32      0.10      0.16       278
           2       0.34      0.51      0.41       344
           3       0.36      0.25      0.30       427
           4       0.52      0.57      0.54       434

    accuracy                           0.47      2000
   macro avg       0.43      0.44      0.42      2000
weighted avg       0.46      0.47      0.45      2000




[[384  18  56  15  44]
 [111  29  99  21  18]
 [ 51  27 177  61  28]
 [ 24  13 141 108 141]
 [ 30   5  51  99 249]]
epoch 9


100%|██████████| 313/313 [00:16<00:00, 19.53it/s, loss=0.00815, acc=0.972]
100%|██████████| 125/125 [00:01<00:00, 91.02it/s]
  1%|          | 3/313 [00:00<00:14, 21.27it/s, loss=0.00471, acc=0.979]

              precision    recall  f1-score   support

           0       0.60      0.76      0.67       517
           1       0.29      0.15      0.20       278
           2       0.31      0.52      0.39       344
           3       0.33      0.25      0.28       427
           4       0.61      0.41      0.49       434

    accuracy                           0.45      2000
   macro avg       0.43      0.42      0.41      2000
weighted avg       0.45      0.45      0.44      2000




[[394  32  57  14  20]
 [111  42  92  22  11]
 [ 58  41 180  52  13]
 [ 35  23 192 106  71]
 [ 55   8  69 123 179]]
epoch 10


100%|██████████| 313/313 [00:14<00:00, 21.12it/s, loss=0.00492, acc=0.987]
100%|██████████| 125/125 [00:01<00:00, 89.31it/s]

              precision    recall  f1-score   support

           0       0.61      0.73      0.67       517
           1       0.31      0.19      0.24       278
           2       0.33      0.41      0.36       344
           3       0.37      0.43      0.40       427
           4       0.61      0.39      0.47       434

    accuracy                           0.46      2000
   macro avg       0.44      0.43      0.43      2000
weighted avg       0.47      0.46      0.46      2000




[[379  49  41  30  18]
 [106  53  75  33  11]
 [ 57  40 140  93  14]
 [ 30  18 128 185  66]
 [ 47  11  45 163 168]]
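
The class labels 0-4 in the reports above correspond to stars 1-5, because `MyDataset` subtracts 1 from the targets. Predictions from this model therefore need to be shifted back before they are written to a submission file. A minimal sketch, where `indices_to_stars` is a hypothetical helper and the ids and predictions are placeholders:

```python
def indices_to_stars(class_indices):
    """Undo the `y - 1` shift applied in MyDataset."""
    return [int(i) + 1 for i in class_indices]

review_ids = ["id_a", "id_b", "id_c"]  # placeholder ids, not real review ids
predicted_indices = [0, 4, 2]          # e.g. the result of logits.argmax(1).tolist()
stars = indices_to_stars(predicted_indices)
rows = [{"review_id": r, "stars": s} for r, s in zip(review_ids, stars)]
print(stars)  # [1, 5, 3]
```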



