### Student Information
Name: Dylan Sienatra 施威任

Student ID: 110006232

GitHub ID: DylanSie

Kaggle name: DYLAN SIENATRA

Kaggle private scoreboard snapshot:

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) regarding Emotion Recognition on Twitter by this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants, and x is your rank. (ie. If there are 100 participants and you rank 3rd your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%.)   
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (You can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
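
As a quick check of the scoring rule in item 2, the formula can be written as a small function. This is only a sketch: the function name is made up, and treating any rank beyond 0.6N as the "Bottom 40%" is my reading of the rule, not something the instructions state explicitly.

```python
def kaggle_section_points(rank: int, n_participants: int) -> float:
    """Points (out of 30) for the Kaggle part, following the formula in item 2.

    Assumption: ranks beyond 0.6 * N count as the bottom 40% and get a flat 20.
    """
    cutoff = 0.6 * n_participants
    if rank > cutoff:
        return 20.0
    return (cutoff + 1 - rank) / cutoff * 10 + 20


# Worked example from the instructions: 100 participants, rank 3.
print(round(kaggle_section_points(3, 100), 2))
```

With 100 participants, rank 1 gives the full 30 points and rank 60 gives just over 20.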

### Importing Necessary Libraries


In [58]:
import pandas as pd
import numpy as np  # used by label_decode (np.argmax) in the DNN sections
import json
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from tqdm import tqdm
import nltk

### Preparing the Data
In this part, we load the three source files (data_identification.csv, emotion.csv, and tweets_DM.json) and take a first rough look at what each one contains.

In [59]:
data_identification = pd.read_csv('data_identification.csv')
emotion_data = pd.read_csv('emotion.csv')

In [60]:
data_identification

Unnamed: 0,tweet_id,identification
0,0x28cc61,test
1,0x29e452,train
2,0x2b3819,train
3,0x2db41f,test
4,0x2a2acc,train
...,...,...
1867530,0x227e25,train
1867531,0x293813,train
1867532,0x1e1a7e,train
1867533,0x2156a5,train


In [61]:
emotion_data

Unnamed: 0,tweet_id,emotion
0,0x3140b1,sadness
1,0x368b73,disgust
2,0x296183,anticipation
3,0x2bd6e1,joy
4,0x2ee1dd,anticipation
...,...,...
1455558,0x38dba0,joy
1455559,0x300ea2,joy
1455560,0x360b99,fear
1455561,0x22eecf,joy


In [62]:
tweets_data = []
with open('tweets_DM.json', 'r') as file:
    for line in file:
        tweets_data.append(json.loads(line))
tweets_df = pd.DataFrame([tweet['_source']['tweet'] for tweet in tweets_data])

In [63]:
tweets_df

Unnamed: 0,hashtags,tweet_id,text
0,[Snapchat],0x376b20,"People who post ""add me on #Snapchat"" must be ..."
1,"[freepress, TrumpLegacy, CNN]",0x2d5350,"@brianklaas As we see, Trump is dangerous to #..."
2,[bibleverse],0x28b412,"Confident of your obedience, I write to you, k..."
3,[],0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>
4,[],0x2de201,"""Trust is not the same as faith. A friend is s..."
...,...,...,...
1867530,"[mixedfeeling, butimTHATperson]",0x316b80,When you buy the last 2 tickets remaining for ...
1867531,[],0x29d0cb,I swear all this hard work gone pay off one da...
1867532,[],0x2a6a4f,@Parcel2Go no card left when I wasn't in so I ...
1867533,[],0x24faed,"Ah, corporate life, where you can date <LH> us..."
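
The loading loop above treats tweets_DM.json as one JSON object per line and keeps only the nested `_source` → `tweet` part. A minimal sketch of that pattern on two made-up records (the IDs and texts are invented, and the helper name `read_tweets` is hypothetical):

```python
import io
import json

# Two invented records in the nested shape the notebook indexes into.
sample = io.StringIO(
    '{"_source": {"tweet": {"hashtags": ["demo"], "tweet_id": "0x000001", "text": "hello #demo"}}}\n'
    '{"_source": {"tweet": {"hashtags": [], "tweet_id": "0x000002", "text": "second line"}}}\n'
)


def read_tweets(handle):
    """Yield the inner tweet dict from each non-empty JSON line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)["_source"]["tweet"]


tweets = list(read_tweets(sample))
print(tweets[0]["tweet_id"], tweets[1]["hashtags"])
```

Reading line by line avoids calling `json.load` on the whole file, which would fail because the file as a whole is not a single JSON document.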


### Data Preprocess and Cleaning
Before doing anything else, I merge the identification data, the emotion labels, and the tweets into one dataframe keyed on `tweet_id`, so that everything about a tweet sits in a single row. Next, we split the merged data into the training data (for model training later on) and the testing data (for the submission predictions later on). 

For preprocessing, I tokenize the text with NLTK (`nltk.word_tokenize`) and vectorize it with either `CountVectorizer` (bag of words) or `TfidfVectorizer`. After trying both, I kept TF-IDF because it gave better results in my experiments; it down-weights words that appear in most tweets, which tends to help classification.

Note: The preprocessing code appears inside each model-building section below.
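
The merges in the next cell use `how='left'`, which keeps every row of the identification table and leaves the label empty where no match exists. The same semantics can be sketched in plain Python. This is an illustration only: `left_join` is a hypothetical helper, the rows are invented, and pandas fills the gaps with NaN rather than `None`.

```python
def left_join(left_rows, right_rows, key):
    """Left join: keep every left row, fill unmatched right fields with None."""
    right_by_key = {row[key]: row for row in right_rows}
    right_fields = sorted({f for row in right_rows for f in row if f != key})
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for field in right_fields:
            merged[field] = match.get(field)
        joined.append(merged)
    return joined


identification = [
    {"tweet_id": "0x01", "identification": "train"},
    {"tweet_id": "0x02", "identification": "test"},
]
emotions = [{"tweet_id": "0x01", "emotion": "joy"}]

rows = left_join(identification, emotions, "tweet_id")
for row in rows:
    print(row)
```

This is why the test tweets end up with a missing `emotion` after merging: they have no row in the label file.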

In [64]:
merged_data = pd.merge(data_identification, tweets_df, on='tweet_id', how='left')
merged_data = pd.merge(merged_data, emotion_data, on='tweet_id', how='left')

In [65]:
# Check the combined data
merged_data

Unnamed: 0,tweet_id,identification,hashtags,text,emotion
0,0x28cc61,test,[],@Habbo I've seen two separate colours of the e...,
1,0x29e452,train,[],Huge Respect🖒 @JohnnyVegasReal talking about l...,joy
2,0x2b3819,train,"[spateradio, app]",Yoooo we hit all our monthly goals with the ne...,joy
3,0x2db41f,test,[],@FoxNews @KellyannePolls No serious self respe...,
4,0x2a2acc,train,[],@KIDSNTS @PICU_BCH @uhbcomms @BWCHBoss Well do...,trust
...,...,...,...,...,...
1867530,0x227e25,train,[rip],@BBCBreaking Such an inspirational talented pe...,disgust
1867531,0x293813,train,"[libtards, Hillary, lost, sad, growup, Trump]",And still #libtards won't get off the guy's ba...,sadness
1867532,0x1e1a7e,train,"[seeds, Joy, GLTChurch]",When you sow #seeds of service or hospitality ...,joy
1867533,0x2156a5,train,[],@lorettalrose Will you be displaying some <LH>...,trust


In [66]:
# Split the data into training and test data
train_data = merged_data[merged_data['identification'] == 'train']
test_data = merged_data[merged_data['identification'] == 'test']

In [67]:
# Check the training data
train_data

Unnamed: 0,tweet_id,identification,hashtags,text,emotion
1,0x29e452,train,[],Huge Respect🖒 @JohnnyVegasReal talking about l...,joy
2,0x2b3819,train,"[spateradio, app]",Yoooo we hit all our monthly goals with the ne...,joy
4,0x2a2acc,train,[],@KIDSNTS @PICU_BCH @uhbcomms @BWCHBoss Well do...,trust
5,0x2a8830,train,"[PUBG, GamersUnite, twitch, BeHealthy, StayPos...",Come join @ambushman27 on #PUBG while he striv...,joy
6,0x20b21d,train,"[strength, bones, God]",@fanshixieen2014 Blessings!My #strength little...,anticipation
...,...,...,...,...,...
1867530,0x227e25,train,[rip],@BBCBreaking Such an inspirational talented pe...,disgust
1867531,0x293813,train,"[libtards, Hillary, lost, sad, growup, Trump]",And still #libtards won't get off the guy's ba...,sadness
1867532,0x1e1a7e,train,"[seeds, Joy, GLTChurch]",When you sow #seeds of service or hospitality ...,joy
1867533,0x2156a5,train,[],@lorettalrose Will you be displaying some <LH>...,trust


In [68]:
# Check to see any missing values per column
missing_values = train_data.isnull().sum()
missing_values
# Data looks good with no missing values

tweet_id          0
identification    0
hashtags          0
text              0
emotion           0
dtype: int64

In [69]:
# Check the testing data
test_data

Unnamed: 0,tweet_id,identification,hashtags,text,emotion
0,0x28cc61,test,[],@Habbo I've seen two separate colours of the e...,
3,0x2db41f,test,[],@FoxNews @KellyannePolls No serious self respe...,
15,0x2466f6,test,[womendrivers],"Looking for a new car, and it says 1 lady owne...",
23,0x23f9e9,test,[robbingmembers],@cineworld “only the brave” just out and fount...,
31,0x1fb4e1,test,[],Felt like total dog 💩 going into open gym and ...,
...,...,...,...,...,...
1867495,0x2c4dc2,test,[kids],6 year old walks in astounded. Mum! Look how b...,
1867496,0x31be7c,test,[inspiringvolunteerawards2017],Only one week to go until the #inspiringvolunt...,
1867500,0x1ca58e,test,[],"I just got caught up with the manga for ""My He...",
1867515,0x35c8ba,test,[],Speak only when spoken to and make hot ass mus...,


In [70]:
# Emotion missing is to be expected because these are the values we will try to predict but other than that we found no missing values
missing_values_test = test_data.isnull().sum()
missing_values_test

tweet_id               0
identification         0
hashtags               0
text                   0
emotion           411972
dtype: int64

### Data Splitting for Training, Validation, and Test
In this part of the process, we split the labelled training data 80:20 into a training set and a validation set. The split is random but stratified on `emotion`, so both sets keep the same class proportions.
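
To show what stratifying means, here is a small plain-Python sketch that splits each class separately so the validation set keeps the class proportions. It is not scikit-learn's implementation, `stratified_split` is a hypothetical helper, and the labels are toy data.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx), sampling test_size of every class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


labels = ["joy"] * 50 + ["sadness"] * 30 + ["anger"] * 20
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

With an imbalanced label distribution like the one in this dataset, a stratified split makes sure rare classes such as anger still appear in the validation set at their usual rate.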

In [71]:
train_set, val_set = train_test_split(train_data, test_size=0.2, random_state=42, stratify=train_data['emotion'])

In [72]:
# Check the train set
train_set

Unnamed: 0,tweet_id,identification,hashtags,text,emotion
1489744,0x285cdd,train,[],Closed Buy 1.4 Lots EURUSD 1.2022 for +11.3 pi...,joy
825857,0x1e13d5,train,[Impressive],@FloWrestling Ohio and PA are always 1 and 2. ...,joy
1247950,0x2e186e,train,"[Atlanta, Falcons, Victory, Lions, Defense, De...",#Atlanta #Falcons 3-0 hang on for the W on the...,joy
1160634,0x2eee0a,train,[],Love isn't about what you do for me. Love for ...,joy
1619355,0x32200d,train,[sky],Good to see Day playing like he use too.#love ...,joy
...,...,...,...,...,...
1277740,0x2fd079,train,[WTF],"Hey @Colts O-line, allowing 6 sacks is not goi...",disgust
1538592,0x2c8457,train,[],"Thank you Todd Gurley for putting Rivers, Gree...",joy
1641930,0x32e118,train,[StrivingForGreatness],5-1 in fantasy now <LH> #StrivingForGreatness,joy
536125,0x27e949,train,[],@kenolin1 @NBCThisisUs Got my Kleenex <LH>,trust


In [73]:
# Check the validation set
val_set

Unnamed: 0,tweet_id,identification,hashtags,text,emotion
303298,0x252fde,train,[],I have the greatest wife in the world. <LH>,joy
1136045,0x29c585,train,[Insercure],Cheering in my mind as the hours go by #Inserc...,fear
71290,0x1e6dd3,train,[],I actually don't know what I'd do if I didn't ...,joy
745888,0x388173,train,"[internet, WiFi]",Finally I have #internet <LH> #WiFi,joy
18665,0x23b95b,train,"[walk, forest, sea]",What if Byron had the Iphone? <LH> #walk #fore...,sadness
...,...,...,...,...,...
900364,0x234e98,train,"[RIPower, powermarathon, tashaAlwaysHasToclean...",Look at how the Feds messed up our house! <LH>...,disgust
884807,0x33a3f4,train,[],It's calm just calm <LH>,joy
1490808,0x25821a,train,[],@data_monsters <LH> for the follow! Feel free ...,trust
1754413,0x302fc1,train,[sugarcon],"""[Soon] SugarCRM will be one of the most impor...",trust


### Model Building Naive Bayes (BOW and TFIDF)
In this part, I follow the same method as the Lab 2 master notebook. First I use a bag-of-words `CountVectorizer`, which counts word frequencies and uses those counts as features to train the model. Then I use TF-IDF, which weights each term's frequency by its inverse document frequency.

#### Feature Engineering BOW 500 features
Here, we tokenize with nltk.word_tokenize, keep the 500 most frequent tokens as features, and turn each tweet into a count vector that can be fed into the model.
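
As a rough illustration of what a capped bag-of-words vocabulary does, the sketch below keeps the most frequent tokens and counts them per document. It is an assumption-laden stand-in: it splits on whitespace instead of using nltk.word_tokenize, the helper names are made up, and tie-breaking may differ from `CountVectorizer`.

```python
from collections import Counter


def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens, indexed alphabetically."""
    totals = Counter(token for doc in docs for token in doc.lower().split())
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(token for token, _ in ranked[:max_features])
    return {token: i for i, token in enumerate(kept)}


def transform(docs, vocab):
    """Turn each document into a list of counts, one slot per vocabulary token."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.lower().split():
            if token in vocab:
                row[vocab[token]] += 1
        rows.append(row)
    return rows


docs = ["so happy so glad", "so sad today", "happy today"]
vocab = fit_vocabulary(docs, max_features=3)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

Tokens outside the kept vocabulary ("glad" and "sad" here) are simply dropped, which is the trade-off of a small `max_features`.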

In [83]:
# Load necessary libraries for the feature engineering
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


In [84]:
# build analyzers (bag-of-words)
BOW_500 = CountVectorizer(max_features=500, tokenizer=nltk.word_tokenize) 

# apply analyzer to training data
BOW_500.fit(train_set['text'])

train_data_BOW_features_500 = BOW_500.transform(train_set['text'])

## check dimension
train_data_BOW_features_500.shape



(1164450, 500)

In [85]:
# observe some feature names
feature_names_500 = BOW_500.get_feature_names_out()
feature_names_500[100:110]

array(['can', 'car', 'care', 'change', 'christ', 'christmas', 'class',
       'closed', 'come', 'comes'], dtype=object)

In [86]:
# for a classification problem, you need to provide both training & testing data
X_train = BOW_500.transform(train_set['text'])
y_train = train_set['emotion']

X_test = BOW_500.transform(val_set['text'])
y_test = val_set['emotion']

## take a look at data dimension is a good habit  :)
print('X_train.shape: ', X_train.shape)
print('y_train.shape: ', y_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_test.shape: ', y_test.shape)

X_train.shape:  (1164450, 500)
y_train.shape:  (1164450,)
X_test.shape:  (291113, 500)
y_test.shape:  (291113,)


#### Build the model using Naive Bayes Multinomial
Based on the results below, the accuracy is around 0.42 on both the training set and the validation set (labelled "Testing" in the output).

In [87]:
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)

y_test_pred_nb = NB_model.predict(X_test)
y_train_pred_nb = NB_model.predict(X_train)

acc_test_nb = accuracy_score(y_test, y_test_pred_nb)
acc_train_nb = accuracy_score(y_train, y_train_pred_nb)
print("Naive Bayes Accuracy Testing: ", round(acc_test_nb, 2))
print("Naive Bayes Accuracy Training: ", round(acc_train_nb, 2))

print(classification_report(y_test, y_test_pred_nb))
print(classification_report(y_train, y_train_pred_nb))

Naive Bayes Accuracy Testing:  0.42
Naive Bayes Accuracy Training:  0.42
              precision    recall  f1-score   support

       anger       0.17      0.11      0.14      7973
anticipation       0.45      0.43      0.44     49787
     disgust       0.29      0.32      0.30     27820
        fear       0.18      0.13      0.15     12800
         joy       0.49      0.63      0.55    103204
     sadness       0.37      0.37      0.37     38687
    surprise       0.41      0.11      0.18      9746
       trust       0.35      0.21      0.26     41096

    accuracy                           0.42    291113
   macro avg       0.34      0.29      0.30    291113
weighted avg       0.40      0.42      0.40    291113

              precision    recall  f1-score   support

       anger       0.17      0.12      0.14     31894
anticipation       0.46      0.43      0.44    199148
     disgust       0.29      0.32      0.30    111281
        fear       0.18      0.13      0.15     51199
     

#### Feature Engineering using TF-IDF Vectorizer with 1000 Features
Here, we tokenize with nltk.word_tokenize, keep 1000 features, and turn each tweet into a TF-IDF weighted vector that can be fed into the model.
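
For intuition, TF-IDF weights can be computed by hand on a toy corpus. The sketch below uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which matches scikit-learn's documented defaults as far as I know. It splits on whitespace rather than using nltk.word_tokenize, and `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, idf, rows) with L2-normalised TF-IDF rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, idf, rows


vocab, idf, rows = tfidf_rows(["happy happy day", "sad day", "day"])
print(vocab)
print({tok: round(value, 3) for tok, value in idf.items()})
```

A word that appears in every document ("day") gets the minimum idf of 1, while rarer words ("happy", "sad") get a higher weight.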

In [88]:
TFIDF_vectorizer = TfidfVectorizer(max_features = 1000, tokenizer = nltk.word_tokenize)
TFIDF_vectorizer.fit(train_set['text'])
train_data_TFIDF_features = TFIDF_vectorizer.transform(train_set['text'])
train_data_TFIDF_features.shape



(1164450, 1000)

In [89]:
# for a classification problem, you need to provide both training & testing data
X_train_TFIDF = TFIDF_vectorizer.transform(train_set['text'])
y_train_TFIDF = train_set['emotion']

X_test_TFIDF = TFIDF_vectorizer.transform(val_set['text'])
y_test_TFIDF = val_set['emotion']

## take a look at data dimension is a good habit  :)
print('X_train.shape: ', X_train_TFIDF.shape)
print('y_train.shape: ', y_train_TFIDF.shape)
print('X_test.shape: ', X_test_TFIDF.shape)
print('y_test.shape: ', y_test_TFIDF.shape)

X_train.shape:  (1164450, 1000)
y_train.shape:  (1164450,)
X_test.shape:  (291113, 1000)
y_test.shape:  (291113,)


#### Build The Multinomial Naive Bayes Model
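Based on the results below, accuracy is 0.46 on both the training set and the validation set, which is higher than with the 500-feature BOW model (0.42). The gain comes mostly from the majority class, though: recall for joy rises to 0.92 while recall for anger, surprise, and trust falls below 0.10, and the macro-average F1 on the validation set is actually lower (0.27 versus 0.30).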
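
Accuracy alone hides how the model treats the smaller classes. The macro average in a classification report is the unweighted mean of the per-class F1 scores, so it can be recomputed from the rounded validation numbers printed in this notebook. A small sketch (the dictionaries just copy the f1-score columns of the two validation reports):

```python
from statistics import mean

# Per-class validation F1, copied from the BOW (500 features) report.
bow_f1 = {
    "anger": 0.14, "anticipation": 0.44, "disgust": 0.30, "fear": 0.15,
    "joy": 0.55, "sadness": 0.37, "surprise": 0.18, "trust": 0.26,
}
# Per-class validation F1, copied from the TF-IDF (1000 features) report.
tfidf_f1 = {
    "anger": 0.07, "anticipation": 0.43, "disgust": 0.22, "fear": 0.27,
    "joy": 0.58, "sadness": 0.37, "surprise": 0.13, "trust": 0.12,
}

bow_macro = round(mean(list(bow_f1.values())), 2)
tfidf_macro = round(mean(list(tfidf_f1.values())), 2)
print("BOW macro F1:   ", bow_macro)
print("TF-IDF macro F1:", tfidf_macro)
```

Both values agree with the `macro avg` rows of the reports, and they show the TF-IDF Naive Bayes model trading balance across classes for better accuracy on joy.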

In [90]:
NB_model = MultinomialNB()
NB_model.fit(X_train_TFIDF, y_train_TFIDF)

y_test_pred_nb_TFIDF = NB_model.predict(X_test_TFIDF)
y_train_pred_nb_TFIDF = NB_model.predict(X_train_TFIDF)

acc_test_nb_TFIDF = accuracy_score(y_test_TFIDF, y_test_pred_nb_TFIDF)
acc_train_nb_TFIDF = accuracy_score(y_train_TFIDF, y_train_pred_nb_TFIDF)
print("Naive Bayes Accuracy Testing: ", round(acc_test_nb_TFIDF, 2))
print("Naive Bayes Accuracy Training: ", round(acc_train_nb_TFIDF, 2))

print(classification_report(y_test_TFIDF, y_test_pred_nb_TFIDF))
print(classification_report(y_train_TFIDF, y_train_pred_nb_TFIDF))

Naive Bayes Accuracy Testing:  0.46
Naive Bayes Accuracy Training:  0.46
              precision    recall  f1-score   support

       anger       0.89      0.04      0.07      7973
anticipation       0.60      0.33      0.43     49787
     disgust       0.53      0.14      0.22     27820
        fear       0.88      0.16      0.27     12800
         joy       0.42      0.92      0.58    103204
     sadness       0.50      0.30      0.37     38687
    surprise       0.85      0.07      0.13      9746
       trust       0.73      0.07      0.12     41096

    accuracy                           0.46    291113
   macro avg       0.68      0.25      0.27    291113
weighted avg       0.56      0.46      0.39    291113

              precision    recall  f1-score   support

       anger       0.88      0.04      0.08     31894
anticipation       0.60      0.34      0.43    199148
     disgust       0.55      0.14      0.23    111281
        fear       0.89      0.16      0.27     51199
     

#### File Prediction Submission
I submitted the predictions of the TF-IDF Multinomial Naive Bayes model and got a score of around 0.31 on Kaggle, which is not that good, so I tried another approach below.

Note: This is not the best prediction that was submitted. Uncomment the code below if you want to create a submission with this model.

In [91]:
# X_submission_TFIDF = TFIDF_vectorizer.transform(test_data['text'])
# submission_predictions = NB_model.predict(X_submission_TFIDF)

In [92]:
# submission = test_data[['tweet_id']].copy()
# submission['emotion'] = submission_predictions
# submission.rename(columns={'tweet_id': 'id'}, inplace=True)

In [93]:
# submission.to_csv('submission.csv', index=False)

### Deep Neural Network
I tried a different approach using DNN with Keras as the deep learning framework. In this part, I followed the template from the Lab 2 Master Notebook. I used TFIDF Vectorizer for feature engineering.

In [94]:
# Import Necessary Libraries
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


#### Feature Engineering using TF-IDF Vectorizer with 10,000 features, tokenized with nltk.word_tokenize


In [95]:
TFIDF_vectorizer = TfidfVectorizer(max_features = 10000, tokenizer = nltk.word_tokenize)
TFIDF_vectorizer.fit(train_set['text'])
train_data_TFIDF_features = TFIDF_vectorizer.transform(train_set['text']) 
train_data_TFIDF_features.shape





(1164450, 10000)

#### Preparing the Data (X, y)

In [96]:
# for a classification problem, you need to provide both training & testing data
X_train = TFIDF_vectorizer.transform(train_set['text']) 
y_train = train_set['emotion'] 

X_test = TFIDF_vectorizer.transform(val_set['text'])
y_test = val_set['emotion']

## take a look at data dimension is a good habit  :)
print('X_train.shape: ', X_train.shape)
print('y_train.shape: ', y_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_test.shape: ', y_test.shape)

X_train.shape:  (1164450, 10000)
y_train.shape:  (1164450,)
X_test.shape:  (291113, 10000)
y_test.shape:  (291113,)


In [97]:
import keras

#### Deal with categorical label (y)
We have to process the categorical label ourselves because the string values in train_set['emotion'] cannot be fed directly into the model. We use one-hot encoding for the 8 possible emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
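
The encode and decode round trip can be sketched without any libraries. This is an illustration under the assumption that classes are indexed in sorted order (which is what the `check label` output of this notebook shows); the helper names are made up.

```python
def fit_classes(labels):
    """Sorted unique labels; the position in this list is the class index."""
    return sorted(set(labels))


def one_hot(labels, classes):
    """Encode each label as a vector with a single 1.0 at its class index."""
    index = {name: i for i, name in enumerate(classes)}
    return [
        [1.0 if i == index[label] else 0.0 for i in range(len(classes))]
        for label in labels
    ]


def decode(rows, classes):
    """Map each row (one-hot or probabilities) back to its highest-scoring class."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in rows]


classes = fit_classes(
    ["joy", "trust", "anger", "fear", "sadness", "disgust", "surprise", "anticipation"]
)
encoded = one_hot(["joy", "trust"], classes)
print(classes)
print(encoded[0])
print(decode(encoded, classes))
```

The same `decode` step works on softmax outputs, since taking the arg max of a probability row picks the most likely emotion.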

In [98]:
## deal with label (string -> one-hot)

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(y_train)
print('check label: ', label_encoder.classes_)
print('\n## Before convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)

def label_encode(le, labels):
    enc = le.transform(labels)
    return keras.utils.to_categorical(enc)

def label_decode(le, one_hot_label):
    dec = np.argmax(one_hot_label, axis=1)
    return le.inverse_transform(dec)

y_train = label_encode(label_encoder, y_train)
y_test = label_encode(label_encoder, y_test)

print('\n\n## After convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)

check label:  ['anger' 'anticipation' 'disgust' 'fear' 'joy' 'sadness' 'surprise'
 'trust']

## Before convert
y_train[0:4]:
 1489744    joy
825857     joy
1247950    joy
1160634    joy
Name: emotion, dtype: object

y_train.shape:  (1164450,)
y_test.shape:  (291113,)


## After convert
y_train[0:4]:
 [[0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]]

y_train.shape:  (1164450, 8)
y_test.shape:  (291113, 8)


#### Building the Model
This DNN takes an input of size 10000, passes it through 2 hidden layers of 64 neurons each (with ReLU activations), and ends in a softmax output layer of size 8, one unit per emotion.
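
The parameter counts reported by `model.summary()` can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output. A short sketch (`dense_params` is a hypothetical helper):

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out


layer_sizes = [10000, 64, 64, 8]  # input, hidden 1, hidden 2, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Almost all parameters sit in the first layer, so the size of the TF-IDF vocabulary is what drives the size of this model.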

In [99]:
# I/O check
input_shape = X_train.shape[1]
print('input_shape: ', input_shape)

output_shape = len(label_encoder.classes_)
print('output_shape: ', output_shape)

input_shape:  10000
output_shape:  8


In [100]:
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers import ReLU, Softmax

# input layer
model_input = Input(shape=(input_shape, ))  # 10000
X = model_input

# 1st hidden layer
X_W1 = Dense(units=64)(X)  # 64
H1 = ReLU()(X_W1)

# 2nd hidden layer
H1_W2 = Dense(units=64)(H1)  # 64
H2 = ReLU()(H1_W2)

# output layer
H2_W3 = Dense(units=output_shape)(H2)  # 8
H3 = Softmax()(H2_W3)

model_output = H3

# create model
model = Model(inputs=[model_input], outputs=[model_output])

# loss function & optimizer
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# show model construction
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 10000)]           0         
                                                                 
 dense_9 (Dense)             (None, 64)                640064    
                                                                 
 re_lu_6 (ReLU)              (None, 64)                0         
                                                                 
 dense_10 (Dense)            (None, 64)                4160      
                                                                 
 re_lu_7 (ReLU)              (None, 64)                0         
                                                                 
 dense_11 (Dense)            (None, 8)                 520       
                                                                 
 softmax_3 (Softmax)         (None, 8)                 0   

#### Training the Model
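In this part, I train for 5 epochs with a batch size of 32 and pass the validation set as `validation_data` so that validation loss and accuracy are logged after every epoch.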
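
To get a feel for how much work these settings imply, the number of weight updates per epoch is the number of training rows divided by the batch size, rounded up. A small sketch using the row counts printed in this notebook (`steps_per_epoch` is a hypothetical helper):

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


split_steps = steps_per_epoch(1164450, 32)  # 80% training split
full_steps = steps_per_epoch(1455563, 32)   # all labelled data
print(split_steps, full_steps, split_steps * 5)
```

A larger batch size would cut the number of steps proportionally, at the cost of fewer updates per epoch.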

In [101]:
from keras.callbacks import CSVLogger

csv_logger = CSVLogger('training_log.csv')

# training setting
epochs = 5
batch_size = 32

# training!
history = model.fit(X_train, y_train, 
                    epochs=epochs, 
                    batch_size=batch_size, 
                    callbacks=[csv_logger],
                    validation_data = (X_test, y_test))
print('training finish')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
training finish


#### Predict on Testing Data
In this part, we predict the emotion for our validation set and then check the accuracy, which comes out at 0.57.

In [102]:
## predict
pred_result = model.predict(X_test, batch_size=128)
pred_result[:5]



array([[0.00252895, 0.05722999, 0.01630291, 0.01607423, 0.7738319 ,
        0.01578089, 0.02313867, 0.09511235],
       [0.04135317, 0.10112825, 0.04427046, 0.00963964, 0.54006714,
        0.14361677, 0.04491951, 0.07500505],
       [0.03028055, 0.04287689, 0.07142125, 0.07426297, 0.20240101,
        0.43660122, 0.06885865, 0.0732974 ],
       [0.07398282, 0.03299089, 0.10858422, 0.01257806, 0.62184805,
        0.09453171, 0.01581681, 0.03966746],
       [0.01162795, 0.07686212, 0.16254216, 0.01735673, 0.16955139,
        0.5127418 , 0.03950766, 0.00981012]], dtype=float32)

In [103]:
import numpy as np
pred_result = label_decode(label_encoder, pred_result)
pred_result[:5]

array(['joy', 'joy', 'sadness', 'joy', 'sadness'], dtype=object)

In [104]:
from sklearn.metrics import accuracy_score

print('testing accuracy: {}'.format(round(accuracy_score(label_decode(label_encoder, y_test), pred_result), 2)))

testing accuracy: 0.57


In [105]:
# Let's take a look at the training log
training_log = pd.read_csv("training_log.csv")
training_log

Unnamed: 0,epoch,accuracy,loss,val_accuracy,val_loss
0,0,0.549584,1.244094,0.567264,1.191116
1,1,0.585984,1.141873,0.57342,1.177691
2,2,0.603513,1.096664,0.573657,1.181969
3,3,0.616212,1.063311,0.57218,1.193152
4,4,0.627271,1.036109,0.569861,1.207552
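
Reading the log above, validation loss is lowest at epoch index 1 and rises afterwards while training accuracy keeps climbing, which suggests the model starts to overfit early. The sketch below finds that epoch from the logged values (copied from the table; the variable names are my own):

```python
import csv
import io

log_csv = """epoch,accuracy,loss,val_accuracy,val_loss
0,0.549584,1.244094,0.567264,1.191116
1,0.585984,1.141873,0.57342,1.177691
2,0.603513,1.096664,0.573657,1.181969
3,0.616212,1.063311,0.57218,1.193152
4,0.627271,1.036109,0.569861,1.207552
"""

rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(log_csv))
]

# Epoch with the lowest validation loss.
best = min(rows, key=lambda row: row["val_loss"])
# Gap between training and validation accuracy after the last epoch.
final_gap = rows[-1]["accuracy"] - rows[-1]["val_accuracy"]

print("best epoch:", int(best["epoch"]), "val_loss:", best["val_loss"])
print("final train/val accuracy gap:", round(final_gap, 4))
```

One possible follow-up, which I did not run here, would be Keras's `EarlyStopping` callback monitoring `val_loss` with `restore_best_weights=True`, so training stops once validation loss stops improving.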


#### Predict the Submission Testing Data
In this part, we predict the Kaggle test data and let Kaggle compute the score. This model reaches a score of 0.453 on Kaggle, which is much better than the Naive Bayes model (around 0.31).

Note: Uncomment the code below to make a submission with this model

In [106]:
# X_submission = TFIDF_vectorizer.transform(test_data['text'])

In [107]:
# # Predict probabilities for each class
# submission_pred_probs = model.predict(X_submission, batch_size=128)

# # Convert probabilities to emotion labels
# submission_pred = label_decode(label_encoder, submission_pred_probs)


In [108]:
# # Prepare submission dataframe
# submission_df = test_data[['tweet_id']].copy()  # Keep only tweet_id
# submission_df['emotion'] = submission_pred     # Add predicted emotion
# submission_df.rename(columns={'tweet_id': 'id'}, inplace=True)
# # Save submission to CSV
# submission_df.to_csv('submission.csv', index=False)
# print("Submission file created: 'submission.csv'")


### DNN TFIDF with all the data
This section repeats the previous approach with one difference: I use all of the labelled training data instead of splitting it 80:20 into training and validation sets. The aim is to give the model more data to learn from; the trade-off is that there is no validation set left to measure accuracy locally.

#### Feature Engineering using TF-IDF Vectorizer with 10,000 features, tokenized with nltk.word_tokenize

In [109]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Use 10k features
TFIDF_vectorizer = TfidfVectorizer(max_features = 10000, tokenizer = nltk.word_tokenize)
TFIDF_vectorizer.fit(train_data['text']) 
train_data_TFIDF_features = TFIDF_vectorizer.transform(train_data['text']) 
train_data_TFIDF_features.shape



(1455563, 10000)

#### Preparing the Data for DNN (X, y)

In [110]:
# Use all training data
X_train = TFIDF_vectorizer.transform(train_data['text'])
y_train = train_data['emotion']

# Print data dimensions
print('X_train.shape: ', X_train.shape)
print('y_train.shape: ', y_train.shape)

X_train.shape:  (1455563, 10000)
y_train.shape:  (1455563,)


#### Deal with categorical label (y) using one-hot encoding
We have to process the categorical label ourselves because the string values in train_data['emotion'] cannot be fed directly into the model. We use one-hot encoding for the 8 possible emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

In [111]:
import keras
## deal with label (string -> one-hot)

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(y_train)
print('check label: ', label_encoder.classes_)
print('\n## Before convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)


def label_encode(le, labels):
    enc = le.transform(labels)
    return keras.utils.to_categorical(enc)

def label_decode(le, one_hot_label):
    dec = np.argmax(one_hot_label, axis=1)
    return le.inverse_transform(dec)

y_train = label_encode(label_encoder, y_train)


print('\n\n## After convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)


check label:  ['anger' 'anticipation' 'disgust' 'fear' 'joy' 'sadness' 'surprise'
 'trust']

## Before convert
y_train[0:4]:
 1      joy
2      joy
4    trust
5      joy
Name: emotion, dtype: object

y_train.shape:  (1455563,)


## After convert
y_train[0:4]:
 [[0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0.]]

y_train.shape:  (1455563, 8)


#### Building the Model

This DNN takes an input of size 10000, passes it through 2 hidden layers of 64 neurons each (with ReLU activations), and ends in a softmax output layer of size 8, one unit per emotion.

In [112]:
# I/O check
input_shape = X_train.shape[1]
print('input_shape: ', input_shape)

output_shape = len(label_encoder.classes_)
print('output_shape: ', output_shape)

input_shape:  10000
output_shape:  8


In [113]:
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers import ReLU, Softmax

# input layer
model_input = Input(shape=(input_shape, ))  # 10k
X = model_input

# 1st hidden layer
X_W1 = Dense(units=64)(X)  # 64
H1 = ReLU()(X_W1)

# 2nd hidden layer
H1_W2 = Dense(units=64)(H1)  # 64
H2 = ReLU()(H1_W2)

# output layer
H2_W3 = Dense(units=output_shape)(H2)  # 8
H3 = Softmax()(H2_W3)

model_output = H3

# create model
model = Model(inputs=[model_input], outputs=[model_output])

# loss function & optimizer
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# show model construction
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 10000)]           0         
                                                                 
 dense_12 (Dense)            (None, 64)                640064    
                                                                 
 re_lu_8 (ReLU)              (None, 64)                0         
                                                                 
 dense_13 (Dense)            (None, 64)                4160      
                                                                 
 re_lu_9 (ReLU)              (None, 64)                0         
                                                                 
 dense_14 (Dense)            (None, 8)                 520       
                                                                 
 softmax_4 (Softmax)         (None, 8)                 0   

#### Training all the data

In this part, I again train for 5 epochs with a batch size of 32, this time without validation data.

In [114]:
from keras.callbacks import CSVLogger

csv_logger = CSVLogger('training_log.csv')

# training setting
epochs = 5
batch_size = 32

# training!
history = model.fit(X_train, y_train, 
                    epochs=epochs, 
                    batch_size=batch_size, 
                    callbacks=[csv_logger])
print('training finish')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
training finish


#### Predict and Create Submission
In this part, I create the submission file from the model trained on all the labelled data. Submitted to Kaggle, it gave my best score of 0.458.

In [115]:
# Preprocess test data
X_submission = TFIDF_vectorizer.transform(test_data['text'])

# Predict test data
submission_pred_probs = model.predict(X_submission, batch_size=128)
submission_pred = label_decode(label_encoder, submission_pred_probs)

# Create submission dataframe
submission_df = test_data[['tweet_id']].copy()  # Keep only tweet_id
submission_df['emotion'] = submission_pred     # Add predicted emotion
submission_df.rename(columns={'tweet_id': 'id'}, inplace=True)
# Save submission file
submission_df.to_csv('submission.csv', index=False)
print("Submission file created: 'submission.csv'")


Submission file created: 'submission.csv'
