# HOMEWORK 5: TEXT CLASSIFICATION
In this homework, you will build models to classify customer utterances from the TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

In this homework, you are asked to do the following tasks:
1. Data Cleaning
2. Preprocessing data for Keras
3. Build and evaluate a model for "action" classification
4. Build and evaluate a model for "object" classification
5. Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go 


Note: we have removed phone numbers from the dataset for privacy purposes. 

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
# import shutil
# shutil.copy("/content/drive/MyDrive/FRA 501 IntroNLP&DL/Dataset/clean-phone-data.csv", "/content/clean-phone-data.csv")

## Import Libs

In [3]:
# %matplotlib inline
import pandas
import sklearn
import numpy as np
from IPython.display import display

import matplotlib.pyplot as plt

## Loading data
First, we load the data from disk into a DataFrame.

A DataFrame is essentially a table, or a 2D array/matrix with a name for each column.

In [4]:
data_df = pandas.read_csv('data/clean-phone-data.csv')

Let's preview the data.

In [5]:
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

We call DataFrame.describe() again.
Notice that there are 33 unique labels for object and 10 unique labels for action that the model will try to predict.
But some of them are unwanted duplicates that differ only in capitalization, e.g. Idd, idd, IDD and loyalty_card, Loyalty_card.

Also note that there are only 13389 unique sentence utterances among the 16175 utterances. You have to clean that too!

## #TODO 1: 
You will have to remove unwanted label duplications as well as duplications in text inputs. 
Also, you will have to trim out unwanted whitespaces from the text inputs. 
This shouldn't be too hard, as you have already seen it in the demo.
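
The cleaning steps described above can be sketched on a toy table. The column names `text` and `label`, the helper `clean_frame`, and the sample rows are made up for illustration; the real data uses `Sentence Utterance`, `Action`, and `Object`. One design choice worth considering: stripping whitespace before calling `drop_duplicates` lets rows that differ only by leading or trailing spaces collapse into one.

```python
import pandas as pd

# Toy data (hypothetical), mimicking the kinds of noise seen in the real dataset.
toy_df = pd.DataFrame({
    "text": [" hello ", "hello", "bye  ", "bye"],
    "label": ["Enquire", "enquire", "Report", "report"],
})

def clean_frame(df):
    """Return a copy with lower-cased labels, stripped text, and no duplicate texts."""
    out = df.copy()
    out["label"] = out["label"].str.strip().str.lower()
    out["text"] = out["text"].str.strip()
    # Deduplicate after stripping so " hello " and "hello" count as the same input.
    return out.drop_duplicates("text", keep="first").reset_index(drop=True)

clean_df = clean_frame(toy_df)
print(clean_df)
```
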



In [6]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [7]:
data_df = data_df[["Sentence Utterance", "Action", "Object"]]
data_df.columns = ['input', 'raw_label_Action','raw_label_Object']
display(data_df.describe())
display(data_df.raw_label_Action.unique())

Unnamed: 0,input,raw_label_Action,raw_label_Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [8]:
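# TODO1: Data cleaning
# Lower-case the labels so that e.g. 'Enquire' and 'enquire' become one class
data_df = data_df.copy()  # explicit copy of the column subset, avoids SettingWithCopyWarning
data_df['clean_lb_Act'] = data_df['raw_label_Action'].str.lower()
data_df['clean_lb_Obj'] = data_df['raw_label_Object'].str.lower()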
display(data_df.describe())
print("clean_lb_Act")
display(data_df.clean_lb_Act.unique())
print("clean_lb_Obj")
display(data_df.clean_lb_Obj.unique())


Unnamed: 0,input,raw_label_Action,raw_label_Object,clean_lb_Act,clean_lb_Obj
count,16175,16175,16175,16175,16175
unique,13389,10,33,8,26
top,บริการอื่นๆ,enquire,service,enquire,service
freq,97,10377,2525,10484,2528


clean_lb_Act


array(['enquire', 'report', 'cancel', 'buy', 'activate', 'request',
       'garbage', 'change'], dtype=object)

clean_lb_Obj


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

In [9]:
# Drop both raw label columns now that the cleaned versions exist
data_df = data_df.drop(columns=['raw_label_Object', 'raw_label_Action'])
display(data_df.describe())

Unnamed: 0,input,clean_lb_Act,clean_lb_Obj
count,16175,16175,16175
unique,13389,8,26
top,บริการอื่นๆ,enquire,service
freq,97,10484,2528


In [10]:
data_df = data_df.drop_duplicates("input", keep="first")
display(data_df.describe())

Unnamed: 0,input,clean_lb_Act,clean_lb_Obj
count,13389,13389,13389
unique,13389,8,26
top,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,service
freq,1,8658,2111


In [11]:
data = data_df.to_numpy()
data

array([[' <PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท',
        'enquire', 'payment'],
       [' internet ยังความเร็วอยุ่เท่าไหร ครับ', 'enquire', 'package'],
       [' ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ', 'report',
        'suspend'],
       ...,
       ['ยอดเงินเหลือเท่าไหร่ค่ะ', 'enquire', 'balance'],
       ['ยอดเงินในระบบ', 'enquire', 'balance'],
       ['สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ', 'enquire', 'package']],
      dtype=object)

### Substitute Strings in Label

In [13]:
data = data_df.to_numpy()

unique_label = data_df.clean_lb_Obj.unique()

label_2_num_map = dict(zip(unique_label, range(len(unique_label))))
num_2_label_map = dict(zip(range(len(unique_label)), unique_label))

print("Create Mappings")
display(num_2_label_map)
display(label_2_num_map)

print("Before Mappings")
display(data[:, 2])
data[:,2] = np.vectorize(label_2_num_map.get)(data[:,2])

print("After Mappings")
display(data[:, 2])

Create Mappings


{0: 'payment',
 1: 'package',
 2: 'suspend',
 3: 'internet',
 4: 'phone_issues',
 5: 'service',
 6: 'nontruemove',
 7: 'balance',
 8: 'detail',
 9: 'bill',
 10: 'credit',
 11: 'promotion',
 12: 'mobile_setting',
 13: 'iservice',
 14: 'roaming',
 15: 'truemoney',
 16: 'information',
 17: 'lost_stolen',
 18: 'balance_minutes',
 19: 'idd',
 20: 'garbage',
 21: 'ringtone',
 22: 'rate',
 23: 'loyalty_card',
 24: 'contact',
 25: 'officer'}

{'payment': 0,
 'package': 1,
 'suspend': 2,
 'internet': 3,
 'phone_issues': 4,
 'service': 5,
 'nontruemove': 6,
 'balance': 7,
 'detail': 8,
 'bill': 9,
 'credit': 10,
 'promotion': 11,
 'mobile_setting': 12,
 'iservice': 13,
 'roaming': 14,
 'truemoney': 15,
 'information': 16,
 'lost_stolen': 17,
 'balance_minutes': 18,
 'idd': 19,
 'garbage': 20,
 'ringtone': 21,
 'rate': 22,
 'loyalty_card': 23,
 'contact': 24,
 'officer': 25}

Before Mappings


array(['payment', 'package', 'suspend', ..., 'balance', 'balance',
       'package'], dtype=object)

After Mappings


array([0, 1, 2, ..., 7, 7, 1], dtype=object)

In [15]:
unique_label = data_df.clean_lb_Act.unique()

label_2_num_map = dict(zip(unique_label, range(len(unique_label))))
num_2_label_map = dict(zip(range(len(unique_label)), unique_label))

print("Create Mappings")
display(num_2_label_map)
display(label_2_num_map)

print("Before Mappings")
display(data[:, 1])
data[:,1] = np.vectorize(label_2_num_map.get)(data[:,1])

print("After Mappings")
display(data[:, 1])

Create Mappings


{0: 'enquire',
 1: 'report',
 2: 'cancel',
 3: 'buy',
 4: 'activate',
 5: 'request',
 6: 'garbage',
 7: 'change'}

{'enquire': 0,
 'report': 1,
 'cancel': 2,
 'buy': 3,
 'activate': 4,
 'request': 5,
 'garbage': 6,
 'change': 7}

Before Mappings


array(['enquire', 'enquire', 'report', ..., 'enquire', 'enquire',
       'enquire'], dtype=object)

After Mappings


array([0, 0, 1, ..., 0, 0, 0], dtype=object)

In [16]:
data

array([[' <PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท',
        0, 0],
       [' internet ยังความเร็วอยุ่เท่าไหร ครับ', 0, 1],
       [' ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ', 1, 2],
       ...,
       ['ยอดเงินเหลือเท่าไหร่ค่ะ', 0, 7],
       ['ยอดเงินในระบบ', 0, 7],
       ['สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ', 0, 1]], dtype=object)

In [17]:
def strip_str(string):
    return string.strip()

# Trim leading and trailing whitespace from each input string.
# Note: stripping can create new duplicates, so ideally it runs before drop_duplicates.
print("Before")
print(data)
data[:,0] = np.vectorize(strip_str)(data[:,0])
print("After")
print(data)

Before
[[' <PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท'
  0 0]
 [' internet ยังความเร็วอยุ่เท่าไหร ครับ' 0 1]
 [' ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ' 1 2]
 ...
 ['ยอดเงินเหลือเท่าไหร่ค่ะ' 0 7]
 ['ยอดเงินในระบบ' 0 7]
 ['สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ' 0 1]]
After
[['<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท'
  0 0]
 ['internet ยังความเร็วอยุ่เท่าไหร ครับ' 0 1]
 ['ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ' 1 2]
 ...
 ['ยอดเงินเหลือเท่าไหร่ค่ะ' 0 7]
 ['ยอดเงินในระบบ' 0 7]
 ['สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ' 0 1]]


## #TODO 2: Preprocessing data for Keras
You will be using TensorFlow 2 Keras in this assignment. Please show us how you prepare your data for Keras.
Don't forget to split the data into train and test sets (plus a validation set if you want).
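
One possible way to turn the cleaned strings into fixed-length integer arrays that Keras can consume is character-level encoding, sketched below. This is an assumption about the approach, not the required one: Thai text does not separate words with spaces, so word-level features would need a Thai tokenizer, while characters need none. The helpers `build_char_vocab` and `encode_texts`, the reserved ids 0 (padding) and 1 (unknown), and `max_len=10` are all illustrative choices.

```python
import numpy as np

def build_char_vocab(texts):
    """Map each character to an integer id; 0 is reserved for padding, 1 for unknown."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx + 2 for idx, ch in enumerate(chars)}

def encode_texts(texts, vocab, max_len):
    """Encode texts as a (len(texts), max_len) int array, truncating or right-padding with 0."""
    encoded = np.zeros((len(texts), max_len), dtype="int32")
    for row, text in enumerate(texts):
        ids = [vocab.get(ch, 1) for ch in text[:max_len]]
        encoded[row, :len(ids)] = ids
    return encoded

# Toy inputs (hypothetical); in the notebook these would be the stripped utterances.
toy_texts = ["ยอดเงิน", "internet"]
vocab = build_char_vocab(toy_texts)
x = encode_texts(toy_texts, vocab, max_len=10)
print(x.shape)
```

The vocabulary should be built from the training split only, so that characters seen only in the test split map to the unknown id.
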

In [18]:
data

array([['<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter Services เค้าเช็ต 3276.25 บาท เมื่อวานที่ผมเช็คที่ศูนย์บอกมียอด 3057.79 บาท',
        0, 0],
       ['internet ยังความเร็วอยุ่เท่าไหร ครับ', 0, 1],
       ['ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ', 1, 2],
       ...,
       ['ยอดเงินเหลือเท่าไหร่ค่ะ', 0, 7],
       ['ยอดเงินในระบบ', 0, 7],
       ['สอบถามโปรโมชั่นปัจจุบันที่ใช้อยู่ค่ะ', 0, 1]], dtype=object)

In [19]:
# TODO2: Preprocessing data for Keras
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Embedding, Reshape, Activation, Input, Dense, Masking, Conv1D, Bidirectional, Conv2D

In [20]:
import tensorflow as tf
# tf.keras.utils.split_dataset is missing from this TensorFlow install, and it returns
# two tf.data.Dataset objects rather than (x, y), so split the NumPy array with scikit-learn.
train_data, test_data = train_test_split(data, test_size=0.2, shuffle=True, random_state=51)
x_train, y_train = train_data[:, 0], train_data[:, 1:].astype("int32")  # columns: action, object
x_test, y_test = test_data[:, 0], test_data[:, 1:].astype("int32")

## #TODO 3: Build and evaluate a model for "action" classification
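
A minimal sketch of one architecture that could serve as a starting point for this task, assuming the inputs have already been encoded as padded integer sequences with 0 as the padding id. The function name `build_action_model`, the layer sizes, `vocab_size=200`, and `max_len=50` are placeholders, not values from the assignment; `num_classes=8` matches the eight cleaned action labels.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_action_model(vocab_size, max_len, num_classes=8):
    """Embedding -> bidirectional GRU -> softmax over the action classes."""
    inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
    # mask_zero=True tells downstream layers to ignore padding positions (id 0).
    x = layers.Embedding(vocab_size, 32, mask_zero=True)(inputs)
    x = layers.Bidirectional(layers.GRU(32))(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    # Sparse loss because the labels are integer ids, not one-hot vectors.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Smoke test on random data (hypothetical sizes) to check the shapes line up.
action_model = build_action_model(vocab_size=200, max_len=50)
dummy_x = np.random.randint(1, 200, size=(4, 50))
probs = action_model.predict(dummy_x, verbose=0)
print(probs.shape)
```
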


In [None]:
#TODO 3: Build and evaluate a model for "action" classification

## #TODO 4: Build and evaluate a model for "object" classification
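
Both label sets are imbalanced: after deduplication, `service` covers 2111 of the 13389 rows for objects and `enquire` covers 8658 for actions. Accuracy alone can therefore look better than the model really is. The sketch below, on made-up toy labels, shows how a majority-class predictor scores on accuracy versus macro-averaged F1, which weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): class 0 dominates, as the top class does in the real data.
y_true = [0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0]  # a model that always predicts the majority class

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 instead of warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

Here accuracy is 0.60 while macro F1 is only 0.25, so reporting both gives a fairer picture of performance on the rare classes.
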



In [None]:
#TODO 4: Build and evaluate a model for "object" classification

## #TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go 

This can be a bit tricky if you are not familiar with the Keras functional API. Please read these pages (https://www.tensorflow.org/guide/keras/functional, https://keras.io/getting-started/functional-api-guide/) before you start this task.

Your model will have 2 separate output layers: one for the action classification task and another for the object classification task.

This is a rough sketch of what your model might look like:
image --> https://drive.google.com/file/d/1r7M6tFyQDu6pJIxLd_fn2kBMjo_CWmUK/view?usp=share_link
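
The two-output design described above can be sketched with the functional API as follows. This is only one possible layout, not the architecture from the linked image: a shared encoder feeds two softmax heads. The function name `build_multitask_model`, the output names `action` and `object`, the layer sizes, `vocab_size=200`, and `max_len=50` are illustrative assumptions; the class counts 8 and 26 come from the cleaned labels.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_multitask_model(vocab_size, max_len, num_actions=8, num_objects=26):
    """Shared embedding + GRU encoder with one softmax head per task."""
    inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, 32, mask_zero=True)(inputs)
    shared = layers.Bidirectional(layers.GRU(32))(x)
    # Naming the heads lets losses and labels be passed as dicts keyed by name.
    action_out = layers.Dense(num_actions, activation="softmax", name="action")(shared)
    object_out = layers.Dense(num_objects, activation="softmax", name="object")(shared)
    model = tf.keras.Model(inputs, [action_out, object_out])
    model.compile(
        optimizer="adam",
        loss={"action": "sparse_categorical_crossentropy",
              "object": "sparse_categorical_crossentropy"},
        metrics={"action": ["accuracy"], "object": ["accuracy"]},
    )
    return model

# Smoke test on random data (hypothetical sizes) to check both heads.
multitask_model = build_multitask_model(vocab_size=200, max_len=50)
dummy_x = np.random.randint(1, 200, size=(4, 50))
action_probs, object_probs = multitask_model.predict(dummy_x, verbose=0)
print(action_probs.shape, object_probs.shape)
```

When training, the labels would be passed per head, for example `model.fit(x, {"action": y_action, "object": y_object})`, and the total loss is the sum of the two per-task losses unless `loss_weights` is set.
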

In [None]:
#TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go