# HW04: ML and DL

Remember that these homework assignments are graded on completion. **You can skip one section without losing credit.**

## Load and Pre-process Text
We do sentiment analysis on the [Movie Review Data](https://www.cs.cornell.edu/people/pabo/movie-review-data/). If you would like to know more about the data, have a look at [the paper](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf) (but no need to do so).

In [None]:
# In this tutorial, we do sentiment analysis
# download the data
#!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
#!tar xf aclImdb_v1.tar.gz

!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
 
!tar xf scale_data.tar.gz 
!tar xf scale_whole_review.tar.gz

--2023-03-12 04:42:04--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4029756 (3.8M) [application/x-gzip]
Saving to: ‘scale_data.tar.gz’


2023-03-12 04:42:04 (13.5 MB/s) - ‘scale_data.tar.gz’ saved [4029756/4029756]

--2023-03-12 04:42:04--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8853204 (8.4M) [application/x-gzip]
Saving to: ‘scale_whole_review.tar.gz’


2023-03-12 04:42:05 (23.7 MB/s) - ‘scale_whole_review.tar.gz’ saved [8853204/8853204]



First, we load the data with the function provided below. Note that we also preprocess the text using gensim's simple_preprocess() function and then split the data into a training set and a test set.

In [None]:
import os
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split

def load_data():
    """Return the preprocessed review texts and their ratings."""
    examples, labels = [], []
    authors = os.listdir("scale_whole_review")
    for author in authors:
        # per author: one file of review ids and one of ratings, read in parallel line by line
        fn_ids = os.path.join("scaledata", author, "id." + author)
        fn_ratings = os.path.join("scaledata", author, "rating." + author)
        with open(fn_ids) as ids, open(fn_ratings) as ratings:
            for idx, rating in zip(ids, ratings):
                labels.append(float(rating.strip()))
                filename_text = os.path.join("scale_whole_review", author, "txt.parag", idx.strip() + ".txt")
                with open(filename_text, encoding='latin-1') as f:
                    # lowercase and tokenize with gensim, then re-join the tokens into one string
                    examples.append(" ".join(simple_preprocess(f.read())))
    return examples, labels

X, y = load_data()
# hold out 33% of the reviews as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print("text:", X_train[0], "\nlabel:", y_train[0])

text: stanley ipkiss whose letter to the local paper signed nice guys finish last had generated torrent of replies the year before has been undergoing change lately bank clerk ipkiss played with sweet sincerity by jim carrey discovers mask that like dr jekyll potion temporarily creates an all new person to understand how the mask works he turns to masks that people wear expert named dr neuman played with dripping sincerity and dead pan humor by ben stein although the doctor proves useless stanley finally discovers for himself what the mask does it magnifies your inner desires since ipkiss is an incurable romantic who spends his free time watching cartoons it is inevitable that the mask turns him into the world greatest lover and song and dance man after avoiding carrey for years was blown away by his performance in liar liar one of this year funniest films since the mask in was the movie that really launched his film career suggested we check it out one evening on vacation with the hel

## Vectorize the data

In [None]:
# train a TF_IDF Vectorizer on X_train and vectorize X_train and X_test
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(min_df=0.01, # keep terms that occur in at least 1% of docs
                        max_df=.5, # drop terms that occur in more than 50% of docs
                        stop_words='english',
                        ngram_range=(1,2)) # unigrams and bigrams

##TODO train vectorizer
vec.fit(X_train)
##TODO transform X_train to TF-IDF values
X_train_tfidf = vec.transform(X_train)
##TODO transform X_test to TF-IDF values
X_test_tfidf = vec.transform(X_test)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
##TODO scale both training and test data with the standard scaler

X_train_scaled = scaler.fit_transform(X_train_tfidf)
X_test_scaled = scaler.transform(X_test_tfidf)
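
The scaler above is created with `with_mean=False` because TF-IDF matrices are sparse: subtracting the column means would turn most zeros into non-zero values. The following is a minimal sketch on a hypothetical 3x2 sparse matrix (not the homework data). It shows that scaling without centering keeps the zero entries in place, while the default centering scaler rejects sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# hypothetical sparse feature matrix (rows = documents, columns = terms)
toy = csr_matrix(np.array([[1.0, 0.0],
                           [0.0, 2.0],
                           [3.0, 0.0]]))

# divide each column by its standard deviation, without centering
scaled = StandardScaler(with_mean=False).fit_transform(toy)
stays_sparse = issparse(scaled)
same_zero_pattern = bool(((scaled.toarray() == 0) == (toy.toarray() == 0)).all())

# centering is not supported for sparse input and raises a ValueError
try:
    StandardScaler().fit(toy)
    centering_rejected = False
except ValueError:
    centering_rejected = True

print(stays_sparse, same_zero_pattern, centering_rejected)
```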

## ElasticNet

In [None]:
##TODO train an elastic net on the transformed output of the scaler
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.01)

##TODO train the ElasticNet
en.fit(X_train_scaled, y_train)
##TODO predict the testset
predicted = en.predict(X_test_scaled)

from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, balanced_accuracy_score
##TODO print mean squared error and r2 score on the test set
print ("mse", mean_squared_error(y_test, predicted))
print ("r2", r2_score(y_test, predicted))

mse 0.01512374023090484
r2 0.5067135227411286
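
For reference, both reported metrics can be computed by hand. The sketch below uses made-up ratings and predictions (not the homework data) and follows the usual definitions: the mean squared error is the average squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# hypothetical true ratings and predictions (illustration only)
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.3, 0.4, 0.5, 0.8])

# mean squared error: average squared residual
mse = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("mse", mse)
print("r2", r2)
```

Read this way, an R² of about 0.51 means the predictions account for roughly half of the variance of the test ratings.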


## Logistic Regression

Next, we train a logistic regression model for binary prediction on these movie reviews. To get two bins, we transform the continuous ratings into two classes: one class contains all the negative ratings (value < 0.5), the other all the positive ratings (value >= 0.5).

In [None]:
y_train = [1 if i >= 0.5 else 0 for i in y_train]
y_test = [1 if i >= 0.5 else 0 for i in y_test]


In [None]:
##TODO train logistic regression 
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()

##TODO train a logistic regression on the scaled X_train
lr_clf = logistic_regression.fit(X_train_scaled, y_train)
##TODO predict the testset 
predicted = lr_clf.predict(X_test_scaled)

## note: predict() already returns class labels (0 or 1), so this 0.5 threshold leaves them unchanged;
## it would only be needed for continuous outputs (e.g. probabilities from predict_proba)
def map_predictions(predicted):
    predicted = [1 if i > 0.5 else 0 for i in predicted]
    return predicted

##TODO print the accuracy of our classifier on the testset
binary_predictions = map_predictions(predicted)
print (accuracy_score(y_test, binary_predictions))
## TODO print the 10 most informative words of the regression (the 10 words having the highest coefficients)
import numpy as np
id2word = vec.get_feature_names_out()
coefs = logistic_regression.coef_.squeeze()
indices = np.argsort(coefs)
print (logistic_regression.coef_.shape)
for i in indices[-10:]:
    print (coefs[i], id2word[i])

0.8202179176755447
(1, 5616)
0.17038391559368846 brilliance
0.17184606388124232 engaging
0.1754505048684661 area
0.1780437794159087 mysteries
0.17826925287087944 imaginative
0.18180467904844047 destruction
0.1822993821958528 effective
0.1828440074888503 surprisingly
0.18647911705118803 fine
0.20526358328144687 success


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
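
The warning above means the solver hit its iteration limit before converging. One common response, as the message itself suggests, is to raise `max_iter`. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration; the number of iterations needed on the homework data may differ) and turns the convergence warning into an error so that a failure to converge would be visible.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the scaled TF-IDF features (illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

with warnings.catch_warnings():
    # raise instead of warn if the solver still does not converge
    warnings.simplefilter("error", ConvergenceWarning)
    clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# n_iter_ holds the number of iterations the solver actually used
print(clf.n_iter_)
```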


# Deep Learning

## MLP

In [1]:
# Download the AG News dataset (same as hw01)
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
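# assumption: train.csv has no header row, so header=None keeps the first article as data
df = pd.read_csv('train.csv', header=None)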

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000) # only use 10K datapoints
df.head()

--2023-03-16 04:14:03--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29470338 (28M) [text/plain]
Saving to: ‘train.csv’


2023-03-16 04:14:03 (148 MB/s) - ‘train.csv’ saved [29470338/29470338]



| | label | title | lead | text |
|---|---|---|---|---|
| 103993 | business | Easymobile Closer to #39;Lean, Low-Cost Servi... | Moves by easyJet founder Stelios Haji-Ioannou ... | Easymobile Closer to #39;Lean, Low-Cost Servi... |
| 5164 | world | Bombs explode in Nepal capital | KATHMANDU (Reuters) - Nepal #39;s embattled go... | Bombs explode in Nepal capital KATHMANDU (Reut... |
| 54374 | sci/tech | Update: AMD's Q3 led by strong Opteron, Athlon... | As expected, Advanced Micro Devices Inc.'s (AM... | Update: AMD's Q3 led by strong Opteron, Athlon... |
| 107591 | sport | Blue Jays sign Menechino | Toronto, ON (Sports Network) - The Toronto Blu... | Blue Jays sign Menechino Toronto, ON (Sports N... |
| 32652 | sport | Rallying: Sparkling Solberg win means title wa... | It may not be enough to tilt the championship ... | Rallying: Sparkling Solberg win means title wa... |


In [2]:
# create a new variable "business" that takes value 1 if the label is business and 0 otherwise
df['business'] = df['label'].apply(lambda x: int(x=='business'))
y = df['business'].values
df['business'].head()

103993    1
5164      0
54374     0
107591    0
32652     0
Name: business, dtype: int64

In [3]:
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')

## pre-process text as you did in HW02
def tokenize(x):
    # lowercased lemmas, dropping stop words, punctuation, and digit-only tokens
    return [w.lemma_.lower() for w in nlp(x) if not w.is_stop and not w.is_punct and not w.is_digit]
df["tokens"] = df["text"].apply(lambda x: tokenize(x))
df["preprocessed"] = df['tokens'].apply(lambda x: ' '.join(x))

##TODO vectorize the pre-processed text using CountVectorizer
vectorizer = CountVectorizer(min_df=0.01, # keep terms that occur in at least 1% of docs
                        max_df=.9, # drop terms that occur in more than 90% of docs
                        max_features=1000,
                        stop_words='english',
                        ngram_range=(1,3))
X = vectorizer.fit_transform(df['preprocessed'])
pd.to_pickle(X,'X.pkl')



In [4]:
df["preprocessed"]

103993    easymobile close   39;lean low cost service la...
5164      bomb explode nepal capital kathmandu reuters n...
54374     update amd q3 lead strong opteron athlon sale ...
107591    blue jays sign menechino toronto sports networ...
32652     rallying sparkling solberg win mean title wait...
                                ...                        
57581     nuclear asset vanish iraq technology help prod...
23434     swede win date hewitt lleyton hewitt suppose p...
87698     pire feel wrath france national coach rarely p...
79756     union leader weigh late contract offer union l...
99454     henson play cowboys start henson wait like fin...
Name: preprocessed, Length: 10000, dtype: object

Your goal here is to use features from the vectorized text to predict whether the snippet is from a business article.

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from torchsummary import summary

## TODO build a MLP model with at least 2 hidden layers with ReLU activation, followed by dropout and an output layer with sigmoid activation
input_dim = X.shape[1]
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        # use nn.Sequential to sequentially stack modules
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 50), # hidden layer 1 (takes the input features)
            nn.ReLU(), # activation 1
            nn.Linear(50, 50), # hidden layer 2
            nn.ReLU(), # activation 2
            nn.Dropout(0.5), # dropout after hidden layer 2
            nn.Linear(50, 50), # hidden layer 3
            nn.ReLU(), # activation 3
            nn.Dropout(0.5), # dropout after hidden layer 3
            nn.Linear(50, 1), # output layer
            nn.Sigmoid(), # squash the output into (0, 1)
        )
        
    # define the forward propagation which is necessary for torch models
    def forward(self, x):
        return self.layers(x)


## TODO summarize the model using torchsummary
model_summarize = MLP()

summary(model_summarize, input_size=(input_dim,))


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                   [-1, 50]          21,050
              ReLU-2                   [-1, 50]               0
            Linear-3                   [-1, 50]           2,550
              ReLU-4                   [-1, 50]               0
           Dropout-5                   [-1, 50]               0
            Linear-6                   [-1, 50]           2,550
              ReLU-7                   [-1, 50]               0
           Dropout-8                   [-1, 50]               0
            Linear-9                    [-1, 1]              51
          Sigmoid-10                    [-1, 1]               0
Total params: 26,201
Trainable params: 26,201
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.10
Estimated Tot
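
The parameter counts in the summary can be checked by hand: a `nn.Linear(n_in, n_out)` layer has `n_in * n_out` weights plus `n_out` biases. The first layer's 21,050 parameters therefore imply an input dimension of 420. That value is inferred from the printed summary and is not stated elsewhere in the notebook, and the helper name below is hypothetical.

```python
def linear_params(n_in, n_out):
    """Number of trainable parameters in a fully connected layer."""
    return n_in * n_out + n_out  # weights + biases

inferred_input_dim = 420  # inferred from the summary: (21050 - 50) / 50

layer_sizes = [(inferred_input_dim, 50), (50, 50), (50, 50), (50, 1)]
per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(per_layer)

print(per_layer)  # [21050, 2550, 2550, 51]
print(total)      # 26201
```

The ReLU, Dropout, and Sigmoid layers have no trainable parameters, which is why they show 0 in the summary.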

In [17]:
## TODO fit the model using early stopping (patience = 5) to predict the business label (split can follow train:valid:test = 8:1:1)
# (hint: early stopping means that if the validation score does not improve for more than "patience" consecutive epochs, training should stop and the best model so far should be loaded)
import math
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# First prepare the datasets
class GenericDataset(Dataset):

  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return len(self.y)

  def __getitem__(self, index):
    return self.X[index], self.y[index]

tsize = math.ceil(0.1 * len(y))
X_train, X_valid, y_train, y_valid = train_test_split(X.toarray(), np.array(y), test_size=tsize)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=tsize)

train_dataset = GenericDataset(X_train, y_train)
valid_dataset = GenericDataset(X_valid, y_valid)
test_dataset = GenericDataset(X_test, y_test)

train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset=valid_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=32, shuffle=False)

mean_train_losses = []
mean_valid_losses = []
valid_acc_list = []
patience = 0
epochs = 100
best_score = 0

device = 'cpu'

model = MLP().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss() 

for epoch in range(epochs):
    # switch the model to train mode
    model.train()
    
    train_losses = []
    valid_losses = []
    for i, (Xs, labels) in enumerate(train_loader):
        
        optimizer.zero_grad()
        
        outputs = model(Xs.float().to(device))
        loss = loss_fn(outputs, labels.float().unsqueeze(1).to(device)) # shape (32,) -> (32,1)
        loss.backward()
        optimizer.step()
        
        train_losses.append(loss.item())
            
    model.eval()
    pred_labels = []
    true_labels = []
    with torch.no_grad():
        for i, (Xs, labels) in enumerate(valid_loader):
            outputs = model(Xs.float().to(device))
            loss = loss_fn(outputs, labels.float().unsqueeze(1).to(device))
            
            valid_losses.append(loss.item())
            
            # threshold the sigmoid outputs at 0.5 to get class labels
            predicted = (outputs.squeeze(1) > 0.5).long().tolist()
            pred_labels.extend(predicted)
            true_labels.extend(labels.tolist())
            
    mean_train_losses.append(np.mean(train_losses))
    mean_valid_losses.append(np.mean(valid_losses))
    
    accuracy = accuracy_score(true_labels, pred_labels)
    if accuracy > best_score:
      torch.save(model, 'best.pt')
      best_score = accuracy
      patience = 0 # reset patience
    else:
      patience += 1
    valid_acc_list.append(accuracy)
    # accuracy is a fraction in [0, 1], so it is printed without a percent sign
    print('epoch : {}, train loss : {:.4f}, valid loss : {:.4f}, valid acc : {:.2f}, patience: {}'\
         .format(epoch+1, np.mean(train_losses), np.mean(valid_losses), accuracy, patience))
    if patience > 5:
      print('epoch : {}, patience : {}, training early stops'.format(epoch+1, patience))
      break
    


epoch : 1, train loss : 0.4137, valid loss : 0.2714, valid acc : 0.89, patience: 0
epoch : 2, train loss : 0.2773, valid loss : 0.2663, valid acc : 0.89, patience: 1
epoch : 3, train loss : 0.2512, valid loss : 0.2591, valid acc : 0.90, patience: 0
epoch : 4, train loss : 0.2236, valid loss : 0.2667, valid acc : 0.90, patience: 1
epoch : 5, train loss : 0.1991, valid loss : 0.2892, valid acc : 0.89, patience: 2
epoch : 6, train loss : 0.1617, valid loss : 0.3098, valid acc : 0.89, patience: 3
epoch : 7, train loss : 0.1309, valid loss : 0.3392, valid acc : 0.88, patience: 4
epoch : 8, train loss : 0.0962, valid loss : 0.3790, valid acc : 0.90, patience: 5
epoch : 9, train loss : 0.0734, valid loss : 0.4282, valid acc : 0.89, patience: 6
epoch : 9, patience : 6, training early stops
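
The early-stopping rule used in the loop can be isolated into a small helper. This is a sketch with a hypothetical function name and made-up validation accuracies. It mirrors the rule above (stop once the score has failed to improve for more than `max_patience` consecutive epochs) and also reports which epoch held the best score, i.e. the checkpoint one would reload afterwards.

```python
def early_stopping_epochs(valid_scores, max_patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Hypothetical helper: training stops once the validation score has not
    improved for more than `max_patience` consecutive epochs.
    """
    best_score = 0
    best_epoch = None
    patience = 0
    for epoch, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            patience = 0  # improvement: reset the counter
        else:
            patience += 1
        if patience > max_patience:
            return epoch, best_epoch
    return len(valid_scores), best_epoch  # ran out of epochs without stopping


# made-up validation accuracies (illustration only)
scores = [0.890, 0.889, 0.901, 0.900, 0.893, 0.892, 0.884, 0.899, 0.891, 0.895]
stop_epoch, best_epoch = early_stopping_epochs(scores)
print(stop_epoch, best_epoch)
```

In the training loop above, the best model is written to `best.pt` each time the validation accuracy improves. Loading that file before evaluating on the test set would complete the procedure described in the hint.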
