# Assignment 2 - Text classification

In this homework, you will build models to classify utterances from the TRUE call center. There are two classification tasks:

1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

In this homework, you are asked to do the following tasks:

1. Data Cleaning
2. Preprocessing data for Keras
3. Build and evaluate a model for "action" classification
4. Build and evaluate a model for "object" classification
5. Build and evaluate a multi-task model that does both "action" and "object" classification in one go

Note: we have removed phone numbers from the dataset for privacy purposes.

Please submit

1. **A Colab worksheet** (a link to your worksheet or an .ipynb file)
   **to the NLP Google Classroom assignment 2 by Wed 21st Feb, 11.59 pm. Late submissions will be deducted 1% per day.**


In [5]:
!wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

--2024-02-20 22:41:20--  https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.18, 2620:100:6031:18::a27d:5112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv [following]
--2024-02-20 22:41:21--  https://www.dropbox.com/s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucd49587d27a7ffd18fccbad9ab0.dl.dropboxusercontent.com/cd/0/inline/CNod7hupc1DNriKz97hxmHXeDfrZY3r-luaE_5hLSpp8rYaxxvNpi1-zE5zqG_FSgK5HtcA36PsDERMm9WSxOZ2GT61DZf9arAHlvdsgzsf5Y2mSv9hUXlbmh_NSG5IW264/file# [following]
--2024-02-20 22:41:21--  https://ucd49587d27a7ffd18fccbad9ab0.dl.dropboxusercontent.com/cd/0/inline/CNod7hupc1DNriKz97hxmHXeDfrZY3r-luaE_5hLSpp8rYaxxvNpi1-zE

## Import Libs


In [1]:
%matplotlib inline
import pandas as pd
import sklearn
import numpy as np
from IPython.display import display

import matplotlib.pyplot as plt

## Loading data

First, we load the data from disk into a DataFrame.

A DataFrame is essentially a table, or 2D array/matrix, with a name for each column.


In [2]:
data_df = pd.read_csv("clean-phone-data-for-students.csv")
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

We call DataFrame.describe() again.
Notice that there are 33 unique labels/classes for Object and 10 unique labels for Action that the model will try to predict.
But some of these are unwanted duplicates that differ only in capitalization, e.g. Idd, idd, loyalty_card, Loyalty_card.

Also note that there are only 13389 unique sentence utterances among the 16175 utterances. You have to clean that too!

## #TODO 1:

You will have to remove duplicated labels as well as duplicated text inputs.
Also, you will have to trim unwanted whitespace from the text inputs.
This shouldn't be too hard, as you have already seen it in the demo.
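
The cleaning steps described above can be sketched as follows. This is a minimal illustration on a made-up three-row DataFrame, not the reference solution; the helper name `clean_labels_and_text` and the policy of keeping the first occurrence of each duplicated utterance are assumptions.

```python
import pandas as pd

# Tiny made-up sample that reuses the column names of the assignment CSV.
sample_df = pd.DataFrame({
    "Sentence Utterance": ["  check balance ", "check balance", "cancel roaming"],
    "Action": ["Enquire", "enquire", "cancel"],
    "Object": ["Balance", "balance", "roaming"],
})


def clean_labels_and_text(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase labels, trim whitespace, and drop repeated utterances."""
    out = df.copy()
    for col in ["Action", "Object"]:
        # "Enquire" and "enquire" collapse into a single label.
        out[col] = out[col].str.strip().str.lower()
    # Trim leading/trailing whitespace so near-identical texts compare equal.
    out["Sentence Utterance"] = out["Sentence Utterance"].str.strip()
    # Assumption: keep the first occurrence of each duplicated utterance.
    out = out.drop_duplicates(subset="Sentence Utterance", keep="first")
    return out


cleaned_df = clean_labels_and_text(sample_df)
print(cleaned_df)
```

Trimming before de-duplicating matters here: the first two rows only become identical once the surrounding whitespace is removed.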


In [3]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [4]:
# 1. Convert the Object and Action columns to lowercase
data_df["Object"] = data_df["Object"].str.lower()
data_df["Action"] = data_df["Action"].str.lower()
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

To lower case


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,8,26
top,บริการอื่นๆ,enquire,service
freq,97,10484,2528


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

array(['enquire', 'report', 'cancel', 'buy', 'activate', 'request',
       'garbage', 'change'], dtype=object)

In [None]:
### To Do 1 ###
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,8,26
top,บริการอื่นๆ,enquire,service
freq,97,10484,2528


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

array(['enquire', 'report', 'cancel', 'buy', 'activate', 'request',
       'garbage', 'change'], dtype=object)

Unnamed: 0,Sentence Utterance,Action,Object
count,13389,13389,13389
unique,13389,8,26
top,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,service
freq,1,8658,2111


{'enquire': 0, 'report': 1, 'cancel': 2, 'buy': 3, 'activate': 4, 'request': 5, 'garbage': 6, 'change': 7}
{0: 'enquire', 1: 'report', 2: 'cancel', 3: 'buy', 4: 'activate', 5: 'request', 6: 'garbage', 7: 'change'}
{'payment': 0, 'package': 1, 'suspend': 2, 'internet': 3, 'phone_issues': 4, 'service': 5, 'nontruemove': 6, 'balance': 7, 'detail': 8, 'bill': 9, 'credit': 10, 'promotion': 11, 'mobile_setting': 12, 'iservice': 13, 'roaming': 14, 'truemoney': 15, 'information': 16, 'lost_stolen': 17, 'balance_minutes': 18, 'idd': 19, 'garbage': 20, 'ringtone': 21, 'rate': 22, 'loyalty_card': 23, 'contact': 24, 'officer': 25}
{0: 'payment', 1: 'package', 2: 'suspend', 3: 'internet', 4: 'phone_issues', 5: 'service', 6: 'nontruemove', 7: 'balance', 8: 'detail', 9: 'bill', 10: 'credit', 11: 'promotion', 12: 'mobile_setting', 13: 'iservice', 14: 'roaming', 15: 'truemoney', 16: 'information', 17: 'lost_stolen', 18: 'balance_minutes', 19: 'idd', 20: 'garbage', 21: 'ringtone', 22: 'rate', 23: 'loyal

Unnamed: 0,Sentence Utterance,Action,Object,Action_Label,Object_Label
0,ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276...,enquire,payment,0,0
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package,0,1
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ,report,suspend,1,2
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อง...,enquire,internet,0,3
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโท...,report,phone_issues,1,4
...,...,...,...,...,...
16167,ต้องการทราบวันตัดรอบบิลค่ะ,enquire,bill,0,9
16170,เชื่อมต่ออินเตอร์เน็ตไม่ได้ค่ะ,enquire,internet,0,3
16172,ยอดเงินเหลือเท่าไหร่ค่ะ,enquire,balance,0,7
16173,ยอดเงินในระบบ,enquire,balance,0,7


## #TODO 2: Preprocessing data for Keras

You will be using TensorFlow 2 Keras in this assignment. Please show us how you prepare your data for Keras.
Don't forget to split the data into train and test sets (plus a validation set if you want).
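
One possible way to turn string labels into integer ids and hold out a test set is sketched below on made-up data. The names `toy_df`, `action_to_id` and `id_to_action`, the 30% test size, the random seed and the use of stratification are all assumptions for illustration, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the cleaned DataFrame.
toy_df = pd.DataFrame({
    "Sentence Utterance": [f"utterance {i}" for i in range(20)],
    "Action": ["enquire"] * 12 + ["report"] * 8,
})

# Map each label string to an integer id, in order of first appearance.
action_to_id = {label: i for i, label in enumerate(toy_df["Action"].unique())}
id_to_action = {i: label for label, i in action_to_id.items()}
toy_df["Action_Label"] = toy_df["Action"].map(action_to_id)

# Hold out 30% of the rows for testing; stratifying on the label keeps
# the class proportions similar in both splits.
train_df, test_df = train_test_split(
    toy_df,
    test_size=0.3,
    random_state=42,
    stratify=toy_df["Action_Label"],
)
print(len(train_df), len(test_df))
```

Keeping the inverse mapping (`id_to_action`) around makes it easy to print class names in the classification report later.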


In [21]:
#### TO DO 2: Preprocessing data for Keras ###

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from gensim.models import Word2Vec
import tensorflow as tf


Unnamed: 0,Sentence Utterance,Action,Object,Action_Label,Object_Label,feature_word2vec,feature_USE
0,ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276...,enquire,payment,0,0,"[-0.06751839995073776, -0.08544876910746098, 0...","(tf.Tensor(-0.04425638, shape=(), dtype=float3..."
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package,0,1,"[-0.02682400057092309, 0.019718199986891706, 0...","(tf.Tensor(-0.016392484, shape=(), dtype=float..."
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ,report,suspend,1,2,"[-0.05292649952960866, -0.15816636217225874, 0...","(tf.Tensor(-0.061429273, shape=(), dtype=float..."
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อง...,enquire,internet,0,3,"[-0.028056292192024344, 0.00807352715770214, 0...","(tf.Tensor(-0.010377238, shape=(), dtype=float..."
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโท...,report,phone_issues,1,4,"[-0.04323520032688975, -0.16180400032550096, 0...","(tf.Tensor(0.03725694, shape=(), dtype=float32..."
...,...,...,...,...,...,...,...
16167,ต้องการทราบวันตัดรอบบิลค่ะ,enquire,bill,0,9,"[-0.120816002599895, -0.06486314155959658, 0.0...","(tf.Tensor(-0.020553127, shape=(), dtype=float..."
16170,เชื่อมต่ออินเตอร์เน็ตไม่ได้ค่ะ,enquire,internet,0,3,"[-0.10178399900905788, -0.01846420131623745, 0...","(tf.Tensor(0.0008798843, shape=(), dtype=float..."
16172,ยอดเงินเหลือเท่าไหร่ค่ะ,enquire,balance,0,7,"[0.029824000550433993, -0.25461525237187743, 0...","(tf.Tensor(0.008985904, shape=(), dtype=float3..."
16173,ยอดเงินในระบบ,enquire,balance,0,7,"[0.2574733238046368, -0.11953299554685752, 0.0...","(tf.Tensor(0.11337892, shape=(), dtype=float32..."


## #TODO 3: Build and evaluate a model for "action" classification.

Please include the classification report for the test set you held out in the second step.
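
A classification report can be produced from predicted class probabilities roughly as follows. This is a sketch using made-up probabilities in place of real `model.predict(...)` output; the three class names and the `zero_division=0` setting are illustrative choices.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up class probabilities, shaped like the output of model.predict(...).
class_names = ["enquire", "report", "cancel"]
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
])
y_true = np.array([0, 1, 1, 2])

# Pick the most likely class for each row.
y_pred = probs.argmax(axis=1)

report = classification_report(
    y_true,
    y_pred,
    labels=list(range(len(class_names))),
    target_names=class_names,
    zero_division=0,  # report 0 instead of warning for never-predicted classes
)
print(report)
```

Passing `labels` explicitly keeps every class in the report even if a rare class is missing from the test predictions.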


In [None]:
# TODO 3: Build and evaluate a model for "action" classification

[0 0 0 ... 2 0 0]
[0 0 0 ... 2 0 0]
              precision    recall  f1-score   support

     enquire       0.90      0.88      0.89      2597
      report       0.68      0.74      0.71       433
      cancel       0.85      0.95      0.90       329
         buy       0.70      0.79      0.75       235
    activate       0.65      0.79      0.71       166
     request       0.55      0.32      0.40        85
     garbage       0.00      0.00      0.00        15
      change       0.80      0.77      0.78       157

    accuracy                           0.84      4017
   macro avg       0.64      0.65      0.64      4017
weighted avg       0.84      0.84      0.84      4017





## #TODO 4: Build and evaluate a model for "object" classification.

Please include the classification report for the test set you held out in the second step.
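
A single-task classifier over fixed-size sentence vectors might look like the sketch below. The model summary printed further down shows a 512-dimensional input followed by 1000-unit Dense layers with Dropout; the ReLU activation, dropout rate, softmax output, optimizer and loss used here are assumptions, and the random demo data only checks that the model runs.

```python
import numpy as np
import tensorflow as tf


def build_classifier(input_dim: int, num_classes: int,
                     hidden_units: int = 1000,
                     dropout_rate: float = 0.3) -> tf.keras.Model:
    """Feed-forward classifier: three Dense+Dropout blocks, softmax output."""
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(3):
        x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
        x = tf.keras.layers.Dropout(dropout_rate)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    # Sparse loss: targets are integer class ids, not one-hot vectors.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Random stand-in data: 32 sentence vectors and 26 object classes.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(32, 512)).astype("float32")
y_demo = rng.integers(0, 26, size=32)

# A small hidden size keeps the demo fast; the default is 1000.
object_model = build_classifier(input_dim=512, num_classes=26, hidden_units=64)
object_model.fit(x_demo, y_demo, epochs=1, batch_size=8, verbose=0)
demo_probs = object_model.predict(x_demo, verbose=0)
print(demo_probs.shape)
```

The same function could be reused for the action task by changing `num_classes` to the number of action labels.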


In [None]:
# TODO 4: Build and evaluate a model for "object" classification

In [None]:
# split train-test dataset on object label

[[ 0.06071522  0.02074529 -0.03020859 ... -0.08663857  0.01112783
   0.13158736]
 [-0.04441892 -0.0735823  -0.07116093 ... -0.02686526  0.03481807
   0.10086537]
 [-0.02523407 -0.00506832  0.00365995 ... -0.06739269  0.02490636
   0.08007193]
 ...
 [-0.05176339  0.00751107 -0.00891026 ... -0.00431294 -0.00437608
   0.05689482]
 [-0.03049629 -0.01125776 -0.06556503 ... -0.06462871 -0.00598627
   0.03758879]
 [-0.04742752  0.01085262  0.00931566 ... -0.08705591 -0.00978002
   0.11655738]]
5     1478
1     1258
3     1253
7     1037
11     801
2      511
0      449
4      407
9      378
8      229
16     208
12     197
15     174
14     172
6      172
17     162
19     144
10     121
21      55
23      47
18      35
20      34
22      25
13      15
25       7
24       3
Name: Object_Label, dtype: int64
5     633
1     539
3     537
7     445
11    344
2     219
0     192
4     174
9     162
8      98
16     89
12     84
14     74
15     74
6      74
17     69
19     62
10     52
21     24

start training
Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, 512)]             0         
                                                                 
 dense_20 (Dense)            (None, 1000)              513000    
                                                                 
 dropout_15 (Dropout)        (None, 1000)              0         
                                                                 
 dense_21 (Dense)            (None, 1000)              1001000   
                                                                 
 dropout_16 (Dropout)        (None, 1000)              0         
                                                                 
 dense_22 (Dense)            (None, 1000)              1001000   
                                                                 
 dropout_17 (Dropout)        (None, 1000)   

<keras.callbacks.History at 0x7fdec7073250>

                 precision    recall  f1-score   support

        payment       0.65      0.66      0.65       192
        package       0.67      0.71      0.69       539
        suspend       0.80      0.78      0.79       219
       internet       0.71      0.80      0.75       537
   phone_issues       0.59      0.65      0.62       174
        service       0.79      0.75      0.77       633
    nontruemove       0.52      0.34      0.41        74
        balance       0.77      0.82      0.79       445
         detail       0.54      0.38      0.44        98
           bill       0.70      0.64      0.67       162
         credit       0.81      0.73      0.77        52
      promotion       0.66      0.67      0.66       344
 mobile_setting       0.56      0.58      0.57        84
       iservice       0.00      0.00      0.00         7
        roaming       0.76      0.82      0.79        74
      truemoney       0.70      0.77      0.74        74
    information       0.62    



## #TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classification in one go

This can be tricky if you are unfamiliar with the Keras functional API. PLEASE READ these webpages (https://www.tensorflow.org/guide/keras/functional, https://keras.io/getting-started/functional-api-guide/) before you start this task.

Your model will have two separate output layers, one for the action classification task and another for the object classification task.

This is a rough sketch of what your model might look like:
![image](https://raw.githubusercontent.com/ekapolc/nlp_course/master/HW5/multitask_sketch.png)

Hint: You can search for how to do it with "Keras Single Input Multiple Outputs". One method is to concatenate [Output1, Output2] when building a shared multi-task model.
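
A single-input, two-output model can be sketched with the functional API as shown below. Note that this sketch uses two separate output heads rather than the concatenation approach mentioned in the hint; the layer sizes, dropout rate, layer names, optimizer and losses are assumptions, and the random data only demonstrates the expected shapes.

```python
import numpy as np
import tensorflow as tf


def build_multitask_model(input_dim: int, num_actions: int, num_objects: int,
                          hidden_units: int = 1000,
                          dropout_rate: float = 0.3) -> tf.keras.Model:
    """Shared hidden layer feeding two softmax heads (action and object)."""
    inputs = tf.keras.Input(shape=(input_dim,), name="sentence_vector")
    # Shared trunk: both tasks learn from the same hidden representation.
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(inputs)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    # One output head per task.
    action_out = tf.keras.layers.Dense(
        num_actions, activation="softmax", name="action")(x)
    object_out = tf.keras.layers.Dense(
        num_objects, activation="softmax", name="object")(x)
    model = tf.keras.Model(inputs=inputs, outputs=[action_out, object_out])
    # One loss per output, in the same order as the outputs list.
    model.compile(
        optimizer="adam",
        loss=["sparse_categorical_crossentropy",
              "sparse_categorical_crossentropy"],
    )
    return model


# Random stand-in data: 32 sentence vectors, 8 actions, 26 objects.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(32, 512)).astype("float32")
y_action_demo = rng.integers(0, 8, size=32)
y_object_demo = rng.integers(0, 26, size=32)

multitask_model = build_multitask_model(512, 8, 26, hidden_units=64)
multitask_model.fit(x_demo, [y_action_demo, y_object_demo],
                    epochs=1, batch_size=8, verbose=0)
action_probs, object_probs = multitask_model.predict(x_demo, verbose=0)
print(action_probs.shape, object_probs.shape)
```

Each head's probabilities can then be passed through `argmax` and evaluated with its own classification report.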


In [None]:
# TODO 5: Build and evaluate a model for multi-task classification

13389
[ 9667  1300  3121 ... 10425  5525   491]
[ 7482  8621  2186 ... 10027  4059 11725]


start training
Model: "model_6"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_7 (InputLayer)           [(None, 512)]        0           []                               
                                                                                                  
 dense_24 (Dense)               (None, 1000)         513000      ['input_7[0][0]']                
                                                                                                  
 dropout_18 (Dropout)           (None, 1000)         0           ['dense_24[0][0]']               
                                                                                                  
 dense_25 (Dense)               (None, 1000)         1001000     ['dropout_18[0][0]']             
                                                                             

<keras.callbacks.History at 0x7fdeb0c1b4f0>

(4017, 8) (4017, 26)
              precision    recall  f1-score   support

     enquire       0.88      0.93      0.90      2602
      report       0.79      0.72      0.76       437
      cancel       0.91      0.91      0.91       329
         buy       0.74      0.72      0.73       218
    activate       0.76      0.67      0.71       165
     request       0.77      0.39      0.52        88
     garbage       0.00      0.00      0.00        15
      change       0.85      0.80      0.82       163

    accuracy                           0.86      4017
   macro avg       0.71      0.64      0.67      4017
weighted avg       0.85      0.86      0.86      4017

                 precision    recall  f1-score   support

        payment       0.66      0.67      0.66       192
        package       0.65      0.74      0.69       539
        suspend       0.78      0.78      0.78       219
       internet       0.72      0.77      0.75       537
   phone_issues       0.61      0.67      

