# HOMEWORK 5: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from a TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming) 

In this homework, you are asked to do the following tasks:
1. Data Cleaning
2. Preprocessing data for keras
3. Build and evaluate a model for "action" classification
4. Build and evaluate a model for "object" classification
5. Build and evaluate a multi-task model that does both "action" and "object" classification in one go


Note: we have removed phone numbers from the dataset for privacy purposes. 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import shutil
shutil.copy("/content/drive/MyDrive/fra501/Dataset/clean-phone-data.csv", "/content/clean-phone-data.csv")

'/content/clean-phone-data.csv'

## Import Libs

In [None]:
%matplotlib inline
import pandas
import sklearn
import numpy as np
from IPython.display import display

import matplotlib.pyplot as plt

## Loading data
First, we load the data from disk into a DataFrame.

A DataFrame is essentially a table, or 2D array/matrix, with a name for each column.

In [None]:
phone_df = pandas.read_csv('clean-phone-data.csv')

Let's preview the data.

In [None]:
data_df = phone_df.copy()

In [None]:
# Show the DataFrame (pandas displays only the first and last rows)
display(data_df)
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues
...,...,...,...
16170,เชื่อมต่ออินเตอร์เน็ตไม่ได้ค่ะ,enquire,internet
16171,โทรออกต่างประเทศค่ะ,enquire,idd
16172,ยอดเงินเหลือเท่าไหร่ค่ะ,enquire,balance
16173,ยอดเงินในระบบ,enquire,balance


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

Let's call DataFrame.describe() again.
Notice that there are 33 unique labels/classes for object and 10 unique labels for action that the model will try to predict.
But some of these are unwanted duplicates that differ only in capitalization, e.g. Idd, idd, IDD and loyalty_card, Loyalty_card.

Also note that there are only 13389 unique sentence utterances out of 16175 utterances. You have to clean that too!

## #TODO 1: 
You will have to remove unwanted label duplications as well as duplications in text inputs. 
Also, you will have to trim unwanted whitespace from the text inputs.
This shouldn't be too hard, as you have already seen it in the demo.
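
The cleaning steps described above can be sketched on a small in-memory sample without pandas. This is a minimal illustration, not the reference solution: the `rows` list and the `clean_rows` helper are hypothetical, and labels are assumed to differ only in capitalization.

```python
def clean_rows(rows):
    """Normalize labels, trim text, and drop repeated utterances.

    `rows` is a list of (utterance, action, object) tuples.
    """
    seen = set()
    cleaned = []
    for text, action, obj in rows:
        text = text.strip()              # trim leading/trailing whitespace
        action = action.strip().lower()  # 'Enquire' and 'enquire' become one label
        obj = obj.strip().lower()        # 'IDD', 'Idd' and 'idd' become one label
        if text in seen:                 # keep only the first copy of each utterance
            continue
        seen.add(text)
        cleaned.append((text, action, obj))
    return cleaned


# Hypothetical sample rows, made up for illustration.
rows = [
    ("  check balance ", "Enquire", "Balance"),
    ("check balance", "enquire", "balance"),
    ("call abroad", "enquire", "IDD"),
]
cleaned = clean_rows(rows)
print(cleaned)
```

Trimming before deduplicating matters here: utterances that differ only in surrounding whitespace are then recognized as duplicates.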



In [None]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [None]:
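# TODO1: Data cleaning
# Drop rows whose utterance text is an exact duplicate, keeping the first occurrence.
# Because this runs before the spaces are removed in the next cell, some duplicates
# remain afterwards (13282 unique utterances out of 13389 in the output below).
data_df = data_df.drop_duplicates(['Sentence Utterance'], ignore_index=True)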

In [None]:
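# Remove every space from the utterances (not only leading/trailing ones).
data_df['Sentence Utterance'] = data_df['Sentence Utterance'].str.replace(' ', '', regex=False)
# Lowercase the labels so that e.g. 'IDD', 'Idd' and 'idd' become one class.
data_df['Action'] = data_df['Action'].str.lower()
data_df['Object'] = data_df['Object'].str.lower()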

In [None]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,13389,13389,13389
unique,13282,8,26
top,ต้องการเปลี่ยนโปรโมชั่นค่ะ,enquire,service
freq,3,8658,2111


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

array(['enquire', 'report', 'cancel', 'buy', 'activate', 'request',
       'garbage', 'change'], dtype=object)

In [None]:
data_df.head()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED>ผมไปจ่ายเงินที่CounterSe...,enquire,payment
1,internetยังความเร็วอยุ่เท่าไหรครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้วแต่ยังใช้งานไม่ได้ค่ะ,report,suspend
3,พี่ค่ะยังใช้internetไม่ได้เลยค่ะเป็นเครื่องโกลไล,enquire,internet
4,ฮาโหลคะพอดีว่าเมื่อวานเปิดซิมทรูมูฟแต่มันโทรออ...,report,phone_issues


## #TODO 2: Preprocessing data for Keras
You will be using TensorFlow 2 Keras in this assignment. Please show us how you prepare your data for Keras.
Don't forget to split data into train and test sets (+ validation set if you want)
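
One common way to prepare text for a character-level Keras model is to map each character to an integer, pad every sequence to a fixed length, and one-hot encode the result. The sketch below does this in plain Python on made-up strings so the shapes are easy to inspect. The helper names (`build_vocab`, `encode`, `one_hot`) are hypothetical, and index 0 is assumed to be reserved for padding.

```python
def build_vocab(texts):
    """Map every distinct character to an index, reserving 0 for padding."""
    chars = sorted(set("".join(texts)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, char_to_ix, maxlen):
    """Convert text to indices and left-pad with zeros up to maxlen."""
    ids = [char_to_ix[ch] for ch in text][-maxlen:]  # keep the last maxlen characters
    return [0] * (maxlen - len(ids)) + ids


def one_hot(ids, vocab_size):
    """Turn each index into a vector with a single 1 at that index."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in ids]


texts = ["abc", "ca"]  # made-up sample utterances
char_to_ix = build_vocab(texts)
vocab_size = len(char_to_ix) + 1  # +1 for the padding index
maxlen = max(len(t) for t in texts)

encoded = [encode(t, char_to_ix, maxlen) for t in texts]
print(encoded)
print(one_hot(encoded[1], vocab_size))
```

Each utterance ends up as a `maxlen` by `vocab_size` grid, so a batch has shape (number of utterances, `maxlen`, `vocab_size`).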

In [None]:
# TODO2: Preprocessing data for Keras
input_chars = list(set(''.join(data_df['Sentence Utterance'])))
# action_chars = list(set(''.join(data_df['Action'])))
# object_chars = list(set(''.join(data_df['Object'])))

# +1 for padding
data_size, vocab_size = len(data_df), len(input_chars)+1 
print(data_size, vocab_size)
# action_vocab_size = len(action_chars)+1
# object_vocab_size = len(object_chars)+1
# print(data_size, vocab_size, action_vocab_size, object_vocab_size)

maxlen = len(max(data_df['Sentence Utterance'], key=len))
print('max len:', maxlen)

13389 152
max len: 449


In [None]:
sorted_chars= sorted(input_chars)
# sorted_action_chars= sorted(action_chars)
# sorted_object_chars= sorted(object_chars)
sorted_chars.insert(0,"<PAD>") #PADDING for input
# sorted_action_chars.insert(0,"<PAD>") #PADDING for action
# sorted_object_chars.insert(0,"<PAD>") #PADDING for object
#Input
char_to_ix = { ch:i for i,ch in enumerate(sorted_chars) }
ix_to_char = { i:ch for i,ch in enumerate(sorted_chars) } #reverse dictionary
# #Action
# action_char_to_ix = { ch:i for i,ch in enumerate(sorted_action_chars) }
# ix_to_action_char = { i:ch for i,ch in enumerate(sorted_action_chars) } #reverse dictionary
# #Object
# object_char_to_ix = { ch:i for i,ch in enumerate(sorted_object_chars) }
# ix_to_object_char = { i:ch for i,ch in enumerate(sorted_object_chars) } #reverse dictionary

print(ix_to_char)
# print(ix_to_action_char)
# print(ix_to_object_char)

{0: '<PAD>', 1: '\n', 2: '"', 3: '#', 4: '%', 5: '&', 6: "'", 7: '(', 8: ')', 9: '*', 10: '+', 11: ',', 12: '-', 13: '.', 14: '/', 15: '0', 16: '1', 17: '2', 18: '3', 19: '4', 20: '5', 21: '6', 22: '7', 23: '8', 24: '9', 25: '<', 26: '>', 27: '?', 28: '@', 29: 'A', 30: 'B', 31: 'C', 32: 'D', 33: 'E', 34: 'F', 35: 'G', 36: 'H', 37: 'I', 38: 'J', 39: 'K', 40: 'L', 41: 'M', 42: 'N', 43: 'O', 44: 'P', 45: 'R', 46: 'S', 47: 'T', 48: 'U', 49: 'V', 50: 'W', 51: 'X', 52: 'Y', 53: '_', 54: 'a', 55: 'b', 56: 'c', 57: 'd', 58: 'e', 59: 'f', 60: 'g', 61: 'h', 62: 'i', 63: 'j', 64: 'k', 65: 'l', 66: 'm', 67: 'n', 68: 'o', 69: 'p', 70: 'q', 71: 'r', 72: 's', 73: 't', 74: 'u', 75: 'v', 76: 'w', 77: 'x', 78: 'y', 79: 'z', 80: 'é', 81: 'ก', 82: 'ข', 83: 'ฃ', 84: 'ค', 85: 'ฆ', 86: 'ง', 87: 'จ', 88: 'ฉ', 89: 'ช', 90: 'ซ', 91: 'ฌ', 92: 'ญ', 93: 'ฎ', 94: 'ฐ', 95: 'ฑ', 96: 'ฒ', 97: 'ณ', 98: 'ด', 99: 'ต', 100: 'ถ', 101: 'ท', 102: 'ธ', 103: 'น', 104: 'บ', 105: 'ป', 106: 'ผ', 107: 'ฝ', 108: 'พ', 109: 'ฟ', 110:

In [None]:
temp = []
action_to_num = {}
for act in data_df['Action']:
  if act not in temp:
    action_to_num[act] = len(temp)
    temp.append(act)

temp = []
object_to_num = {}
for obj in data_df['Object']:
  if obj not in temp:
    object_to_num[obj] = len(temp)
    temp.append(obj)

print(action_to_num)
print(object_to_num)

{'enquire': 0, 'report': 1, 'cancel': 2, 'buy': 3, 'activate': 4, 'request': 5, 'garbage': 6, 'change': 7}
{'payment': 0, 'package': 1, 'suspend': 2, 'internet': 3, 'phone_issues': 4, 'service': 5, 'nontruemove': 6, 'balance': 7, 'detail': 8, 'bill': 9, 'credit': 10, 'promotion': 11, 'mobile_setting': 12, 'iservice': 13, 'roaming': 14, 'truemoney': 15, 'information': 16, 'lost_stolen': 17, 'balance_minutes': 18, 'idd': 19, 'garbage': 20, 'ringtone': 21, 'rate': 22, 'loyalty_card': 23, 'contact': 24, 'officer': 25}
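
To turn a predicted class index back into a label name, the mapping printed above can be inverted. A small sketch, assuming a prediction arrives as a list of per-class probabilities (the `probs` values here are made up, and only the first three action labels are used):

```python
# First three entries of the action mapping printed above.
action_to_num = {'enquire': 0, 'report': 1, 'cancel': 2}

# Invert the dictionary: class index -> label name.
num_to_action = {num: name for name, num in action_to_num.items()}

probs = [0.1, 0.7, 0.2]  # made-up softmax output for one utterance
predicted_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(num_to_action[predicted_index])
```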


In [None]:
df = data_df.copy()

In [None]:
for act in action_to_num:
  df.loc[df["Action"] == act, "Action"] = action_to_num[act]

for obj in object_to_num:
  df.loc[df["Object"] == obj, "Object"] = object_to_num[obj]

df.head()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED>ผมไปจ่ายเงินที่CounterSe...,0,0
1,internetยังความเร็วอยุ่เท่าไหรครับ,0,1
2,ตะกี้ไปชำระค่าบริการไปแล้วแต่ยังใช้งานไม่ได้ค่ะ,1,2
3,พี่ค่ะยังใช้internetไม่ได้เลยค่ะเป็นเครื่องโกลไล,0,3
4,ฮาโหลคะพอดีว่าเมื่อวานเปิดซิมทรูมูฟแต่มันโทรออ...,1,4


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from tensorflow.keras.layers import RepeatVector, Dense, Activation, Lambda, Reshape, SimpleRNN
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model, Model
import tensorflow.keras.backend as K
import numpy as np

In [None]:
train = df[:int(len(df)*0.7)].copy()
test = df[int(len(df)*0.7):].copy()
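
The split above takes the first 70% of rows in file order. If the rows happen to be grouped by label or by collection date, a shuffled split may give a more representative test set. A minimal sketch with a fixed seed follows; `shuffled_split` is a hypothetical helper, and using it would change the numbers reported in this notebook.

```python
import random


def shuffled_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of items reproducibly, then cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for row indices
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))
```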

In [None]:
def encode_utterances(sentences):
    # Map each character to its index, left-pad to maxlen, then one-hot encode.
    # Resulting shape: (n_samples, maxlen, vocab_size)
    indexed = [[char_to_ix[char] for char in line] for line in sentences]
    padded = pad_sequences(indexed, maxlen=maxlen)
    return to_categorical(padded, vocab_size)

train_X = encode_utterances(train['Sentence Utterance'])
train_act = train['Action']
train_obj = train['Object']

test_X = encode_utterances(test['Sentence Utterance'])
test_act = test['Action']
test_obj = test['Object']



In [None]:
print(train_X.shape, test_X.shape)

(9372, 449, 152) (4017, 449, 152)
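
These shapes explain why memory is tight. A rough estimate of the size of the one-hot tensors, assuming 4 bytes per element (float32, which we assume `to_categorical` produces here):

```python
def tensor_gib(shape, bytes_per_element=4):
    """Approximate size of a dense tensor in GiB."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**30


train_gib = tensor_gib((9372, 449, 152))
test_gib = tensor_gib((4017, 449, 152))
print(f"train_X: {train_gib:.2f} GiB, test_X: {test_gib:.2f} GiB")
```

Feeding integer indices into an embedding layer instead of one-hot vectors would avoid materializing tensors of this size, at the cost of changing the model.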


In [None]:
train_act = to_categorical(train_act, len(df.Action.unique()))
train_obj = to_categorical(train_obj, len(df.Object.unique()))
test_act = to_categorical(test_act, len(df.Action.unique()))
test_obj = to_categorical(test_obj, len(df.Object.unique()))

## #TODO 3: Build and evaluate a model for "action" classification


In [None]:
#TODO 3: Build and evaluate a model for "action" classification
def train_model(maxlen, hidden, n_class):
  # Despite its name, this function only builds the model; training happens in model.fit().
  X = Input(shape=(maxlen, vocab_size))  # one-hot characters: (maxlen, vocab_size)

  # SimpleRNN consumes the whole sequence; with return_state=True it returns
  # (output, final state), which are the same tensor when return_sequences is off.
  RNN_cell = SimpleRNN(hidden, return_state=True)
  output_layer = Dense(n_class, activation='softmax')  # softmax output layer

  # A manual loop over the maxlen time steps was dropped because it needed
  # too much processing power.
  a, _ = RNN_cell(X)
  out = output_layer(a)

  # Create the model instance
  model = Model(inputs=X, outputs=out)
  return model
  

In [None]:
model = train_model(maxlen, 16, len(df.Action.unique()))
opt = Adam(learning_rate=0.001) # optimizer (`lr` is a deprecated alias of `learning_rate`)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()



Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 449, 152)]        0         
                                                                 
 simple_rnn_1 (SimpleRNN)    [(None, 16),              2704      
                              (None, 16)]                        
                                                                 
 dense_1 (Dense)             (None, 8)                 136       
                                                                 
Total params: 2,840
Trainable params: 2,840
Non-trainable params: 0
_________________________________________________________________
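
The parameter counts in the summary can be checked by hand. A SimpleRNN layer has an input kernel, a recurrent kernel, and a bias; a Dense layer has a kernel and a bias. The helper names below are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias."""
    return input_dim * units + units * units + units


def dense_params(input_dim, units):
    """Kernel + bias."""
    return input_dim * units + units


rnn = simple_rnn_params(input_dim=152, units=16)
dense = dense_params(input_dim=16, units=8)
print(rnn, dense, rnn + dense)  # 2704 136 2840
```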


In [None]:
model.fit(train_X, train_act,verbose=1 ,epochs=50)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f82b45eaee0>

In [None]:
model.evaluate(test_X, test_act)



[0.8061073422431946, 0.7864077687263489]

*The result could likely be improved by training for more epochs, but we stop at 50 epochs due to insufficient RAM.
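
For context, the accuracy can be compared against a majority-class baseline. After cleaning, `describe()` reports that 'enquire' covers 8658 of the 13389 rows, so always predicting 'enquire' would be right about 65% of the time on the full dataset. The test split may have a somewhat different class balance, so treat this as a rough reference only.

```python
# Counts taken from the describe() output after cleaning.
total_rows = 13389
enquire_rows = 8658

baseline_accuracy = enquire_rows / total_rows
model_accuracy = 0.7864077687263489  # from the model.evaluate output above

print(f"baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}")
```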

## #TODO 4: Build and evaluate a model for "object" classification



In [None]:
#TODO 4: Build and evaluate a model for "object" classification
model = train_model(maxlen, 16, len(df.Object.unique()))
opt = Adam(learning_rate=0.001) # optimizer (`lr` is a deprecated alias of `learning_rate`)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()



Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 449, 152)]        0         
                                                                 
 simple_rnn_2 (SimpleRNN)    [(None, 16),              2704      
                              (None, 16)]                        
                                                                 
 dense_2 (Dense)             (None, 26)                442       
                                                                 
Total params: 3,146
Trainable params: 3,146
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.fit(train_X, train_obj,verbose=1 ,epochs=50)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f81d6d3cb20>

In [None]:
model.evaluate(test_X, test_obj)



[2.068596124649048, 0.3885984420776367]

*The result could likely be improved by training for more epochs, but we stop at 50 epochs due to insufficient RAM.

## #TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classification in one go

This can be a bit tricky if you are not familiar with the Keras functional API. PLEASE READ these webpages (https://www.tensorflow.org/guide/keras/functional, https://keras.io/getting-started/functional-api-guide/) before you start this task.

Your model will have two separate output layers: one for the action classification task and another for the object classification task.

This is a rough sketch of what your model might look like:
image --> https://drive.google.com/file/d/1r7M6tFyQDu6pJIxLd_fn2kBMjo_CWmUK/view?usp=share_link
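
Conceptually, a multi-task model computes one shared representation and feeds it to two independent softmax layers. The plain-Python sketch below illustrates only that forward pass with made-up weights. It is not Keras code, and `softmax`, `dense`, and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


# Shared representation, e.g. the final RNN state (made-up values).
shared = [0.5, -1.0]

# Two heads with their own weights: 2 action classes and 3 object classes.
act_probs = softmax(dense(shared, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
obj_probs = softmax(dense(shared, [[1.0, 1.0], [0.0, 0.0], [-1.0, 1.0]], [0.0, 0.0, 0.0]))

print(act_probs)
print(obj_probs)
```

In Keras, the same structure is expressed by passing a list of output tensors to `Model(inputs=..., outputs=[...])`.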

In [None]:
#TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go
def train_model(maxlen, hidden, n_act, n_obj):
  # Despite its name, this function only builds the model; training happens in model.fit().
  X = Input(shape=(maxlen, vocab_size))

  RNN_cell = SimpleRNN(hidden, return_state=True)    # shared recurrent layer
  output_act = Dense(n_act, activation='softmax')    # softmax head for "action"
  output_obj = Dense(n_obj, activation='softmax')    # softmax head for "object"

  # Both heads read the same final RNN state.
  a, _ = RNN_cell(X)
  out_act = output_act(a)
  out_obj = output_obj(a)

  # Create the model instance with two outputs
  model = Model(inputs=X, outputs=[out_act, out_obj])
  return model

In [None]:
model = train_model(maxlen, 16, len(df.Action.unique()), len(df.Object.unique()))
opt = Adam(learning_rate=0.001) # optimizer (`lr` is a deprecated alias of `learning_rate`)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()



Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, 449, 152)]   0           []                               
                                                                                                  
 simple_rnn_1 (SimpleRNN)       [(None, 16),         2704        ['input_2[0][0]']                
                                 (None, 16)]                                                      
                                                                                                  
 dense_2 (Dense)                (None, 8)            136         ['simple_rnn_1[0][0]']           
                                                                                                  
 dense_3 (Dense)                (None, 26)           442         ['simple_rnn_1[0][0]']     

In [None]:
model.fit(train_X, [train_act, train_obj],verbose=1 ,epochs=50)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7fae05f8dc70>

In [None]:
model.evaluate(test_X, [test_act, test_obj])



[3.009551763534546,
 0.8491876721382141,
 2.1603636741638184,
 0.7664924263954163,
 0.3798854947090149]

*The result could likely be improved by training for more epochs, but we stop at 50 epochs due to insufficient RAM.
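
With two outputs and one metric, the list returned by `evaluate` is ordered as total loss, action loss, object loss, action accuracy, object accuracy. This is the ordering we expect from Keras for this setup, and `model.metrics_names` can be used to confirm it. With no loss weights given, the total loss should be the sum of the two per-output losses, which can be checked against the numbers above:

```python
# Values copied from the model.evaluate output above.
total_loss = 3.009551763534546
action_loss = 0.8491876721382141
object_loss = 2.1603636741638184

# The difference should be tiny (floating-point rounding only).
difference = abs(total_loss - (action_loss + object_loss))
print(difference < 1e-5)
```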