# HOMEWORK 6: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from TRUE call-center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming) 

In this homework, you are asked to do the following tasks:
1. Data Cleaning
2. Preprocessing data for pytorch
3. Build and evaluate a model for "action" classification
4. Build and evaluate a model for "object" classification
5. Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go 


Note: we have removed phone numbers from the dataset for privacy purposes. 

In [None]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

--2023-02-26 12:42:32--  https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.8.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.8.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv [following]
--2023-02-26 12:42:32--  https://www.dropbox.com/s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb0ea74e3b0a9bcd634831ebbb4.dl.dropboxusercontent.com/cd/0/inline/B3P_WxkRmHfN0ZFCcX_2Ffk4eXNICPOytNjq0ViCnzdn0Bj3IN6Ih2QXri-4K-TlMpUQ8Q66VAyR7Sb1Tfx6_Tr9Z5vSC5N3U2-oYfrYxDPQwQ36eNRcN7-JDo8HCq--6u55-R_lP_aDLJvIwySw2PLKfkPw6UkvpE0OftNt6voVdQ/file# [following]
--2023-02-26 12:42:32--  https://ucb0ea74e3b0a9bcd634831ebbb4.dl.dropboxusercontent.com/cd/0/inline/B3P_WxkRmHfN0ZFCc

In [None]:
# !pip install pythainlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Import Libs

In [1]:
%matplotlib inline
import pandas
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import torch
import pandas as pd 

from torch.utils.data import Dataset
from IPython.display import display
from pythainlp.tokenize import word_tokenize
from collections import defaultdict
from sklearn.metrics import accuracy_score

## Loading data
First, we load the data from disk into a Dataframe.

A Dataframe is essentially a table, or 2D-array/Matrix with a name for each column.

In [2]:
data_df = pd.read_csv('clean-phone-data-for-students.csv')

Let's preview the data.

In [3]:
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

We call the DataFrame.describe() again.
Notice that there are 33 unique labels/classes for object and 10 unique labels for action that the model will try to predict.
But there are unwanted duplications e.g. Idd,idd,lotalty_card,Lotalty_card

Also note that, there are 13389 unqiue sentence utterances from 16175 utterances. You have to clean that too!

## #TODO 1: 
You will have to remove unwanted label duplications as well as duplications in text inputs. 
Also, you will have to trim out unwanted whitespaces from the text inputs. 
This shouldn't be too hard, as you have already seen it in the demo.



In [4]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [5]:
# TODO1: Data cleaning
data_df['Action'] = data_df['Action'].str.lower()
data_df['Object'] = data_df['Object'].str.lower()
data_df = data_df.drop_duplicates(subset=['Sentence Utterance'])
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,13389,13389,13389
unique,13389,8,26
top,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,service
freq,1,8658,2111


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

array(['enquire', 'report', 'cancel', 'buy', 'activate', 'request',
       'garbage', 'change'], dtype=object)

In [6]:
data = data_df.to_numpy()


array([['<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท',
        'enquire', 'payment'],
       ['internet ยังความเร็วอยุ่เท่าไหร ครับ', 'enquire', 'package'],
       ['ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ', 'report',
        'suspend'],
       ['พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อง โกลไล',
        'enquire', 'internet'],
       ['ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโทรออกไม่ได้คะ แต่เล่นเนตได้คะ',
        'report', 'phone_issues'],
       ['*2222 ใช้งานยังไง ขอรายละเอียดการสมัครหน่อย', 'enquire',
        'service'],
       ['<PHONE_NUMBER_REMOVED> เคยมีช่างมาซ่อมที่บ้าน แล้วโทรศัพท์ใช้งานไม่ได้ครับ',
        'enquire', 'nontruemove'],
       ['<PHONE_NUMBER_REMOVED> ค้างค่าบริการเท่าไหร่ครับ', 'enquire',
        'balance'],
       ['<PHONE_NUMBER_REMOVED> อินเตอร์เน็ตไฟ Adsl ไม่มีสัญญาณครับ',
        'enquire', 'nontruemove'],
       ['<PHONE_NUMBER_REMOVED> เค้าบอกจะส่งรหัสเน็ตม

## TODO2 : Assign index to word and labels in each sentences. 

Note that please use **word_tokenize** (https://pythainlp.github.io/docs/2.0/api/tokenize.html) as a function to tokenize each sentences.

In [11]:
# TODO2: assign index to each words and labels in sentence. 

unique_object = set(data_df.Object.unique().tolist())
unique_action = set(data_df.Action.unique().tolist())
unique_label = unique_action.union(unique_object)

label_to_idx = dict(zip(unique_label, range(len(unique_label))))
idx_to_label = dict(zip(range(len(unique_label)), unique_label))

for x in data:
    x[-2] = label_to_idx[x[-2]]
    x[-1] = label_to_idx[x[-1]]

data[:10]

array([['<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท',
        30, 17],
       ['internet ยังความเร็วอยุ่เท่าไหร ครับ', 30, 28],
       ['ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ', 31, 11],
       ['พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อง โกลไล', 30, 19],
       ['ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโทรออกไม่ได้คะ แต่เล่นเนตได้คะ',
        31, 0],
       ['*2222 ใช้งานยังไง ขอรายละเอียดการสมัครหน่อย', 30, 4],
       ['<PHONE_NUMBER_REMOVED> เคยมีช่างมาซ่อมที่บ้าน แล้วโทรศัพท์ใช้งานไม่ได้ครับ',
        30, 2],
       ['<PHONE_NUMBER_REMOVED> ค้างค่าบริการเท่าไหร่ครับ', 30, 24],
       ['<PHONE_NUMBER_REMOVED> อินเตอร์เน็ตไฟ Adsl ไม่มีสัญญาณครับ', 30,
        2],
       ['<PHONE_NUMBER_REMOVED> เค้าบอกจะส่งรหัสเน็ตมาให้ แต่ยังไม่ได้ส่งมาเลยค่ะ',
        30, 19]], dtype=object)

In [13]:
tokenized_data = []
for x in data:
  words = word_tokenize(x[0],keep_whitespace = False)
  tokenized_data.append(words)
tokenized_data[:5]

[['<',
  'PHONE',
  '_',
  'NUMBER',
  '_',
  'REMOVED',
  '>',
  'ผม',
  'ไป',
  'จ่าย',
  'เงิน',
  'ที่',
  'Counter',
  'Services',
  'เค้า',
  'เช็ต',
  '3276.25',
  'บาท',
  'เมื่อวาน',
  'ที่',
  'ผม',
  'เช็ค',
  'ที่',
  'ศูนย์',
  'บอก',
  'มี',
  'ยอด',
  '3057.79',
  'บาท'],
 ['internet', 'ยัง', 'ความเร็ว', 'อยุ่', 'เท่า', 'ไห', 'ร', 'ครับ'],
 ['ตะกี้',
  'ไป',
  'ชำระ',
  'ค่าบริการ',
  'ไป',
  'แล้ว',
  'แต่',
  'ยัง',
  'ใช้งาน',
  'ไม่',
  'ได้',
  'ค่ะ'],
 ['พี่',
  'ค่ะ',
  'ยัง',
  'ใช้',
  'internet',
  'ไม่',
  'ได้',
  'เลย',
  'ค่ะ',
  'เป็น',
  'เครื่อง',
  'โก',
  'ลไล'],
 ['ฮา',
  'โหล',
  'คะ',
  'พอดี',
  'ว่า',
  'เมื่อวาน',
  'เปิด',
  'ซิม',
  'ทรูมูฟ',
  'แต่',
  'มัน',
  'โทร',
  'ออก',
  'ไม่',
  'ได้',
  'คะ',
  'แต่',
  'เล่น',
  'เนต',
  'ได้',
  'คะ']]

In [14]:
word_to_idx ={}
idx_to_word ={}

for sentence in tokenized_data:
    for word in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)+1
            idx_to_word[word_to_idx[word]] = word
word_to_idx['UNK'] = len(word_to_idx)


In [15]:
def word2features(sent, i):
    word = sent[i]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent):
    return np.asarray([word2features(sent, i) for i in range(len(sent))])

In [16]:
dataset = np.asarray([sent2features(sent) for sent in tokenized_data])

  dataset = np.asarray([sent2features(sent) for sent in tokenized_data])


In [22]:
dataset[:10]

array([array([ 1,  2,  3,  4,  3,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
              17, 18, 11,  7, 19, 11, 20, 21, 22, 23, 24, 17])                   ,
       array([25, 26, 27, 28, 29, 30, 31, 32]),
       array([33,  8, 34, 35,  8, 36, 37, 26, 38, 39, 40, 41]),
       array([42, 41, 26, 43, 25, 39, 40, 44, 41, 45, 46, 47, 48]),
       array([49, 50, 51, 52, 53, 18, 54, 55, 56, 37, 57, 58, 59, 39, 40, 51, 37,
              60, 61, 40, 51])                                                   ,
       array([62, 63, 38, 64, 65, 66, 67, 68, 69]),
       array([ 1,  2,  3,  4,  3,  5,  6, 70, 22, 71, 72, 73, 11, 74, 36, 75, 38,
              39, 40, 32])                                                       ,
       array([ 1,  2,  3,  4,  3,  5,  6, 76, 35, 77, 32]),
       array([ 1,  2,  3,  4,  3,  5,  6, 78, 79, 80, 39, 22, 81, 32]),
       array([ 1,  2,  3,  4,  3,  5,  6, 14, 21, 82, 83, 84, 85, 72, 86, 37, 26,
              39, 40, 83, 72, 44, 41])                     

## TODO 2,3: Preprocessing data for pytorch
You will be using pytorch in this assignment. Please show us how you prepare your dataloader for pytorch.
Don't forget to split data into train, valdation, and test sets (normally the ratio will be 80:10:10 , respectively)

In [19]:
# import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchinfo

In [20]:
# TODO2: Preprocessing data for pytorch 
class TrueCallCenterDataset(Dataset):
  def __init__(self,data,labels=None):
    self.data = data 
    self.labels = labels

    if labels is not None: 
      assert len(data) == len(labels)  

  def __getitem__(self,idx):
    if self.labels is None: 
      return torch.LongTensor(self.data[idx])
    else: 
      return (
          torch.LongTensor(self.data[idx]), 
          torch.LongTensor(self.labels[idx])
      )

  def __len__(self):
    return len(self.data)


## TODO 3: Split the data

We recommend to use train_test_spilt from scikit-learn to split the data into train, validation, test set. 

In addition, it should split the data that distribution of the labels in train , validation, test set are similar. There is **stratify** variable handling this issue. 

In this case, you can choose whatever you want either "**Action**" or "**Object**" ;). 

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [31]:
# TODO3: split data into train, validation, test  
from sklearn.model_selection import train_test_split
random_seed = 2023

x_train_action , x_nottrain_action , al_train , al_nottrain = train_test_split(dataset,data[:,1],stratify = data[:,1], test_size=0.2,random_state=random_seed)
x_val_action , x_test_action , al_val , al_test = train_test_split(x_nottrain_action,al_nottrain, test_size=0.5,random_state=random_seed)

x_train_object , x_nottrain_object , ol_train , ol_nottrain = train_test_split(dataset,data[:,2],stratify = data[:,2], test_size=0.2,random_state=random_seed)
x_val_object , x_test_object , ol_val , ol_test = train_test_split(x_nottrain_object,ol_nottrain, test_size=0.5,random_state=random_seed)



## TODO 4: Build a model for classifying these texts.


In [43]:
class Encoder(nn.Module):
  def __init__(self,word_to_idx):
    super(Encoder,self).__init__()
    self.embed = nn.Embedding(len(word_to_idx),32)
    self.bigru = nn.GRU(32,32,bidirectional=True,batch_first=True)
    

  def forward(self,x):
    # print(x.shape)
    x = self.embed(x)
    # print(x.shape)
    out, _ = self.bigru(x)

    return out

class Classifier(nn.Module):

  def __init__(self):
    super(Classifier,self).__init__()
    self.dropout = nn.Dropout(0.2) 
    self.flatten = nn.Flatten()
    self.classifier = nn.Linear(7232, 33) 

  def forward(self,x):
    # print(x.shape)
    x = F.relu(self.dropout(x))
    # print(x.shape)
    x = self.flatten(x)
    # print(x.shape)
    out = self.classifier(x)

    return out

class Model(nn.Module):
  def __init__(self,encoder,classifier):
    super().__init__()
    self.encoder = encoder
    self.classifier = classifier
  def forward(self,x):

    x = self.encoder(x)

    x = self.classifier(x)

    return x 

## #TODO 3: Build and evaluate a model for "action" classification


In [56]:
## TODO 3.1: prepare dataloader 

from torch.nn.utils.rnn import pad_sequence

x_train = [torch.LongTensor(sentence) for sentence in x_train_action]
y_train = [torch.LongTensor([label]) for label in al_train]
x_val = [torch.LongTensor(sentence) for sentence in x_val_action]
y_val = [torch.LongTensor([label]) for label in al_val]

x_test = [torch.LongTensor(sentence) for sentence in x_test_action]
y_test = [torch.LongTensor([label]) for label in al_test]

x_train = pad_sequence(x_train, batch_first=True)
y_train = pad_sequence(y_train, batch_first=True)
x_val = pad_sequence(x_val, batch_first=True)
y_val = pad_sequence(y_val, batch_first=True)
x_test = pad_sequence(x_test, batch_first=True)
y_test = pad_sequence(y_test, batch_first=True)

maxlen = x_train.size(1)  

# Pad the sequence length of x_test to be maxlen 
remaining_len = x_train.size(1) - x_test.size(1)
remaining_mat = torch.zeros((x_test.size(0), remaining_len), dtype=torch.long) 
x_test = torch.cat((x_test, remaining_mat), dim=1) 

# Pad the sequence length of x_test to be maxlen 
remaining_len = x_train.size(1) - x_val.size(1)
remaining_mat = torch.zeros((x_val.size(0), remaining_len), dtype=torch.long) 
x_val = torch.cat((x_val, remaining_mat), dim=1) 

tensor([[30],
        [30],
        [30],
        [30],
        [30]])


In [48]:
train_dataset = TrueCallCenterDataset(x_train, y_train) 
val_dataset = TrueCallCenterDataset(x_val, y_val) 
test_dataset = TrueCallCenterDataset(x_test)

# print(train_dataset[0])

num_workers = 2
batch_size = 64

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True) 
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=True) 
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False) 

In [49]:
## TODO 3.2: setup model 
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = Model(Encoder(word_to_idx),Classifier()) 
model.to(device) 

optimizer = optim.Adam(model.parameters(), lr=1e-3) 
criterion = nn.CrossEntropyLoss()
print(torchinfo.summary(model))

num_epochs = 20

Layer (type:depth-idx)                   Param #
Model                                    --
├─Encoder: 1-1                           --
│    └─Embedding: 2-1                    133,088
│    └─GRU: 2-2                          12,672
├─Classifier: 1-2                        --
│    └─Dropout: 2-3                      --
│    └─Flatten: 2-4                      --
│    └─Linear: 2-5                       238,689
Total params: 384,449
Trainable params: 384,449
Non-trainable params: 0


In [53]:
## TODO 3.3: training loop
PATH = './action_model.pth'
min_val_loss = 1e10
for epoch in range(1, num_epochs+1): 
  running_loss = 0.0
  running_val_loss = 0.0
  model.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)

    pred = model(inputs)
    
    targets = targets.reshape(-1)

    loss = criterion(pred, targets) 

    loss.backward() 
    optimizer.step() 

    running_loss += loss.item()

  model.eval() 
  y_pred = [] 
  with torch.no_grad():
      for i,data in enumerate(val_dataloader, 0):
          inputs, labels = data
          inputs = inputs.to(device)
          labels = labels.to(device)
          '''
          Insert your code here

          '''
          # print(inputs.shape)
          preds = model(inputs)
          labels = labels.reshape(-1)
          loss = criterion(preds,labels)

          # print(loss.grad)

          running_val_loss += loss.item()
          
      avg_val_loss = running_val_loss/len(val_dataloader)

  if avg_val_loss < min_val_loss:
      torch.save(model.state_dict(), PATH)
      min_val_loss = avg_val_loss

  print("epoc :{}, running_loss :{}".format(epoch,running_loss))

epoc :1, running_loss :56.82822507619858
epoc :2, running_loss :50.37886633723974
epoc :3, running_loss :45.00737752020359
epoc :4, running_loss :39.55078262090683
epoc :5, running_loss :35.78563795238733
epoc :6, running_loss :31.510195165872574
epoc :7, running_loss :27.336409527808428
epoc :8, running_loss :24.814037401229143
epoc :9, running_loss :21.664752764627337
epoc :10, running_loss :19.505208730697632
epoc :11, running_loss :16.821817617863417
epoc :12, running_loss :15.120801562443376
epoc :13, running_loss :13.319533098489046
epoc :14, running_loss :11.993461695499718
epoc :15, running_loss :10.370506486855447
epoc :16, running_loss :9.328579120337963
epoc :17, running_loss :7.728622563648969
epoc :18, running_loss :7.189861802849919
epoc :19, running_loss :6.620659687556326
epoc :20, running_loss :6.242256939643994


In [55]:
model = Model(Encoder(word_to_idx), Classifier())
model.load_state_dict(torch.load('action_model.pth'))

<All keys matched successfully>

In [72]:
## TODO 3.5: evalaute on test set 
from sklearn.metrics import classification_report

predict = list()
label = al_test
# since we're not training, we don't need to calculate the gradients for our outputs
model.eval()
with torch.no_grad():
    for X_test in test_dataloader:
        Y_pred = model(X_test)
        _, pred = torch.max(Y_pred.data, 1)
        for p in pred:
            predict.append(p.item())

print("acc testing with test data:", sum(predict == label)/len(label) *100)

acc testing with test data: 84.91411501120238


In [None]:
y_pred_val = model(x_val)
print("Model Acc. on val data %f%%"
       % ((y_val.squeeze(1) == torch.argmax(y_pred_val,axis=1)).sum() / sentence_val_data_1.shape[0] * 100))

Model Acc. on val data 83.034378%


## #TODO 4: Build and evaluate a model for "object" classification



In [81]:
## TODO 4.1: prepare dataloader 
from torch.nn.utils.rnn import pad_sequence

x_train = [torch.LongTensor(sentence) for sentence in x_train_object]
y_train_2 = [torch.LongTensor([label]) for label in ol_train]
x_val = [torch.LongTensor(sentence) for sentence in x_val_object]
y_val = [torch.LongTensor([label]) for label in ol_val]

x_test = [torch.LongTensor(sentence) for sentence in x_test_object]
y_test = [torch.LongTensor([label]) for label in ol_test]

x_train = pad_sequence(x_train, batch_first=True)
y_train = pad_sequence(y_train, batch_first=True)
x_val = pad_sequence(x_val, batch_first=True)
y_val = pad_sequence(y_val, batch_first=True)
x_test = pad_sequence(x_test, batch_first=True)
y_test = pad_sequence(y_test, batch_first=True)

maxlen = max([x_train.size(1), x_val.size(1), x_test.size(1)])

remaining_len = maxlen - x_train.size(1)
remaining_mat = torch.zeros((x_train.size(0), remaining_len), dtype=torch.long) 
x_train = torch.cat((x_train, remaining_mat), dim=1) 

# Pad the sequence length of x_test to be maxlen 
remaining_len = maxlen - x_test.size(1)
remaining_mat = torch.zeros((x_test.size(0), remaining_len), dtype=torch.long) 
x_test = torch.cat((x_test, remaining_mat), dim=1) 

# Pad the sequence length of x_test to be maxlen 
remaining_len = maxlen - x_val.size(1)
remaining_mat = torch.zeros((x_val.size(0), remaining_len), dtype=torch.long) 
x_val = torch.cat((x_val, remaining_mat), dim=1) 

In [82]:
train_dataset = TrueCallCenterDataset(x_train, y_train) 
val_dataset = TrueCallCenterDataset(x_val, y_val) 
test_dataset = TrueCallCenterDataset(x_test)

# print(train_dataset[0])

num_workers = 2
batch_size = 64

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True) 
val_dataloader = DataLoader(val_dataset, batch_size=64, shuffle=True) 
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False) 

In [83]:
## TODO 4.2: setup model 
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model2 = Model(Encoder(word_to_idx),Classifier()) 
model2.to(device) 

optimizer = optim.Adam(model2.parameters(), lr=1e-3) 
criterion = nn.CrossEntropyLoss()
print(torchinfo.summary(model2))

num_epochs = 20

Layer (type:depth-idx)                   Param #
Model                                    --
├─Encoder: 1-1                           --
│    └─Embedding: 2-1                    133,088
│    └─GRU: 2-2                          12,672
├─Classifier: 1-2                        --
│    └─Dropout: 2-3                      --
│    └─Flatten: 2-4                      --
│    └─Linear: 2-5                       238,689
Total params: 384,449
Trainable params: 384,449
Non-trainable params: 0


In [85]:
## TODO 4.3: training loop
PATH = './object_model.pth'
min_val_loss = 1e10
for epoch in range(1, num_epochs+1): 
  running_loss = 0.0
  running_val_loss = 0.0
  model2.train() 
  for inputs, targets in train_dataloader: 
    optimizer.zero_grad() 

    inputs, targets = inputs.to(device), targets.to(device)

    pred = model2(inputs)
    
    targets = targets.reshape(-1)

    loss = criterion(pred, targets) 

    loss.backward() 
    optimizer.step() 

    running_loss += loss.item()

  model2.eval() 
  y_pred = [] 
  with torch.no_grad():
      for i,data in enumerate(val_dataloader, 0):
          inputs, labels = data
          inputs = inputs.to(device)
          labels = labels.to(device)
          preds = model2(inputs)
          labels = labels.reshape(-1)
          loss = criterion(preds,labels)
          running_val_loss += loss.item()
          
      avg_val_loss = running_val_loss/len(val_dataloader)

  if avg_val_loss < min_val_loss:
      torch.save(model2.state_dict(), PATH)
      min_val_loss = avg_val_loss

  print("epoc :{}, running_loss :{}".format(epoch,running_loss))

epoc :1, running_loss :268.4608509540558
epoc :2, running_loss :206.9984848499298
epoc :3, running_loss :174.43016189336777
epoc :4, running_loss :150.39867055416107
epoc :5, running_loss :134.20238164067268
epoc :6, running_loss :118.51393008232117
epoc :7, running_loss :106.38608959317207
epoc :8, running_loss :95.52411434054375
epoc :9, running_loss :87.53812626004219
epoc :10, running_loss :77.95145598053932
epoc :11, running_loss :69.765764772892
epoc :12, running_loss :63.28580106794834
epoc :13, running_loss :57.122556924819946
epoc :14, running_loss :51.275114350020885
epoc :15, running_loss :46.35949873179197
epoc :16, running_loss :41.33680732548237
epoc :17, running_loss :37.29395304620266
epoc :18, running_loss :33.83821315318346
epoc :19, running_loss :30.9090263992548
epoc :20, running_loss :27.744039051234722


In [88]:
model2 = Model(Encoder(word_to_idx), Classifier())
model2.load_state_dict(torch.load('object_model.pth'))

<All keys matched successfully>

In [89]:
## TODO 4.5: evalaute on test set  
predict = list()
label = al_test
# since we're not training, we don't need to calculate the gradients for our outputs
model.eval()
with torch.no_grad():
    for X_test in test_dataloader:
        Y_pred = model(X_test)
        _, pred = torch.max(Y_pred.data, 1)
        for p in pred:
            predict.append(p.item())

print("acc testing with test data:", sum(predict == label)/len(label) *100)

acc testing with test data: 45.556385362210605
