# Introduction

This notebook is the major project submission for COMP7220/8220, on the language dataset and task. It contains the following sections:

*   a description of the selected conventional ML model;
*   some notes about the choices made in building the conventional ML model;
*   a description of the selected deep learning model;
*   some notes about the choices made in building the deep model; and
*   a discussion of the performance of the two models.



# Conventional ML Model

The final model that produced the best-performing predictions for the Kaggle submission (accuracy 54.183%) was a support vector machine with a linear kernel and C=1.0. The call also sets degree=3 and gamma=auto, but scikit-learn ignores both of these for a linear kernel.

The cell below imports the libraries needed for loading the datasets, then loads and inspects them.

In [0]:
# some initialisation code
import numpy as np
from os.path import join
from google.colab import drive
import pickle

drive.mount('/content/drive/')

def load_pickle(path):
    with open(path, 'rb') as f:
        file = pickle.load(f)
        print ('Loaded %s..' %path)
        return file

dataset_directory = '/content/drive/My Drive/20comp8220/proj/text_dataset/'  ## CHANGE TO YOUR OWN DIRECTORY

emotions = ['anger', 'fear', 'joy', 'sadness']

tweets_train = np.load(join(dataset_directory, 'text_train_tweets.npy'))
labels_train = np.load(join(dataset_directory, 'text_train_labels.npy'))
vocabulary = load_pickle(join(dataset_directory, 'text_word_to_idx.pkl'))

tweets_val = np.load(join(dataset_directory, 'text_val_tweets.npy'))
labels_val = np.load(join(dataset_directory, 'text_val_labels.npy'))

tweets_test_public = np.load(join(dataset_directory, 'text_test_public_tweets_rand.npy'))

tweets_test_private = np.load(join(dataset_directory, 'text_test_private_tweets.npy'))

print(len(vocabulary))
idx_to_word = {i: w for w, i in vocabulary.items()}
for i in range(7):
  print(i, idx_to_word[i])

sample = 1  ## YOU CAN TRY OUT OTHER TWEETS

print('sample tweet, stored form:')
print(tweets_train[sample])
print(labels_train[sample])

print('sample tweet, readable form:')
decode = []
for i in range(50):
  decode.append(idx_to_word[tweets_train[sample][i]])
print(decode)
print(emotions[labels_train[sample]])


print(tweets_train.shape)
print(labels_train.shape)
print(tweets_val.shape)
print(labels_val.shape)
print(tweets_test_public.shape)
print(tweets_test_private.shape)

Mounted at /content/drive/
Loaded /content/drive/My Drive/20comp8220/proj/text_dataset/text_word_to_idx.pkl..
13978
0 <NULL>
1 <START>
2 <END>
3 it
4 makes
5 me
6 so
sample tweet, stored form:
[ 1 23 24 20 25 19 26 27 28  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0]
0
sample tweet, readable form:
['<START>', 'lol', 'adam', 'the', 'bull', 'with', 'his', 'fake', 'outrage', '<END>', '<NULL>', '<NULL>', '

In [0]:
import pandas as pd
import numpy as np
from collections import defaultdict
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection,naive_bayes,svm
from sklearn.metrics import accuracy_score

The code below performs preprocessing on the text datasets.

Libraries for preprocessing

In [0]:
import numpy as np
from os.path import join
from google.colab import drive
import pickle
import pandas as pd
from collections import defaultdict
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection , naive_bayes , svm
from sklearn.metrics import accuracy_score
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Converting the index-encoded tweets back into words, dropping the special tokens.

In [0]:
def manage_data(raw_tweets):
    # Decode each index-encoded tweet into a string of words, dropping
    # special tokens such as <NULL>, <START> and <END> (they start with '<').
    check = '<'
    tweets = []
    for sample in range(len(raw_tweets)):
        decode = [idx_to_word[idx] for idx in raw_tweets[sample]]
        # Filter once per tweet rather than on every loop iteration.
        res = [word for word in decode if word[0] != check]
        tweets.append(" ".join(res))
    return tweets

In [0]:
tweets_train=manage_data(tweets_train)
tweets_test_public=manage_data(tweets_test_public)
tweets_val=manage_data(tweets_val)
tweets_test_private=manage_data(tweets_test_private)

Converting numpy datasets into pandas dataframe and renaming the columns of every dataset.

In [0]:
tweets_train=pd.DataFrame(tweets_train)
tweets_test_public=pd.DataFrame(tweets_test_public)
tweets_train=tweets_train.rename(columns={0:'text'})
tweets_test_public=tweets_test_public.rename(columns={0:'text'})
tweets_val=pd.DataFrame(tweets_val)
tweets_val=tweets_val.rename(columns={0:'text'})
tweets_train['label']=pd.DataFrame(labels_train)
tweets_val['label']=pd.DataFrame(labels_val)
tweets_test_private=pd.DataFrame(tweets_test_private)
tweets_test_private=tweets_test_private.rename(columns={0:'text'})

Removing the blank rows, converting the text to lowercase, and tokenizing the text for the training, testing and validation datasets.

In [0]:
tweets_train['text'].dropna(inplace =True)
tweets_test_public['text'].dropna(inplace =True)
tweets_test_private['text'].dropna(inplace =True)
tweets_val['text'].dropna(inplace=True)
tweets_train['text'] = [entry.lower() for entry in tweets_train['text']]
tweets_test_public['text'] = [entry.lower() for entry in tweets_test_public['text']]
tweets_test_private['text'] = [entry.lower() for entry in tweets_test_private['text']]
tweets_val['text']=[entry.lower() for entry in tweets_val['text'] ]
tweets_train['text']= [word_tokenize (entry) for entry in tweets_train['text']]
tweets_test_public['text']= [word_tokenize (entry) for entry in tweets_test_public['text']]
tweets_test_private['text']= [word_tokenize (entry) for entry in tweets_test_private['text']]
tweets_val['text']=[word_tokenize (entry) for entry in tweets_val['text']]


Part of speech tagging

In [0]:
tag_map = defaultdict (lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
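
The mapping above keys on the first letter of each Penn Treebank tag returned by `pos_tag`, and falls back to noun for everything else. A minimal sketch of that lookup follows. It uses the single-letter strings `'n'`, `'a'`, `'v'` and `'r'` in place of the `wn.NOUN`, `wn.ADJ`, `wn.VERB` and `wn.ADV` constants (an assumption made so the snippet runs without the WordNet corpus), and the sample tags and the name `tag_lookup` are illustrative.

```python
from collections import defaultdict

# Stand-ins for wn.NOUN, wn.ADJ, wn.VERB and wn.ADV (assumed values).
NOUN, ADJ, VERB, ADV = 'n', 'a', 'v', 'r'

tag_lookup = defaultdict(lambda: NOUN)  # any unknown tag falls back to noun
tag_lookup['J'] = ADJ
tag_lookup['V'] = VERB
tag_lookup['R'] = ADV

# Illustrative Penn Treebank tags: adjective, verb (gerund), adverb, determiner.
sample_tags = ['JJ', 'VBG', 'RB', 'DT']
wordnet_pos = [tag_lookup[tag[0]] for tag in sample_tags]
print(wordnet_pos)
```

Only the first character of the tag matters, so `VB`, `VBD` and `VBG` all map to the verb category, while tags such as `DT` or `IN` are treated as nouns.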

Removing stop words and non-alphabetic words, and then performing lemmatisation for the training, testing and validation datasets.

In [0]:
for index, entry in enumerate(tweets_train['text']):
  final_words = []
  word_lemmatized = WordNetLemmatizer() 
  for word, tag in pos_tag(entry):
    if word not in stopwords.words('english') and word.isalpha():
      word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]]) 
      final_words.append(word_final)
  tweets_train.loc[index, 'text_final'] = str(final_words)
print(tweets_train['text_final'].head())

0    ['make', 'fuck', 'irate', 'jesus', 'nobody', '...
1           ['lol', 'adam', 'bull', 'fake', 'outrage']
2    ['pass', 'away', 'early', 'morning', 'fast', '...
3    ['lol', 'wow', 'gon', 'na', 'say', 'really', '...
4    ['need', 'sushi', 'date', 'olive', 'guarded', ...
Name: text_final, dtype: object


In [0]:
for index, entry in enumerate(tweets_test_public['text']):
  final_words = []
  word_lemmatized = WordNetLemmatizer() 
  for word, tag in pos_tag(entry):
    if word not in stopwords.words('english') and word.isalpha():
      word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]]) 
      final_words.append(word_final)
  tweets_test_public.loc[index, 'text_final'] = str(final_words)
print(tweets_test_public['text_final'].head())

0    ['omg', 'mother', 'daughter', 'dull', 'ni', 'm...
1    ['happy', 'birthday', 'miss', 'excited', 'back...
2    ['ever', 'cry', 'middle', 'bomb', 'rest', 'som...
3        ['mentally', 'suffered', 'worthless', 'pain']
4    ['courage', 'driver', 'shot', 'bus', 'show', '...
Name: text_final, dtype: object


In [0]:
for index, entry in enumerate(tweets_test_private['text']):
  final_words = []
  word_lemmatized = WordNetLemmatizer() 
  for word, tag in pos_tag(entry):
    if word not in stopwords.words('english') and word.isalpha():
      word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]]) 
      final_words.append(word_final)
  tweets_test_private.loc[index, 'text_final'] = str(final_words)
print(tweets_test_private['text_final'].head())

0    ['whatever', 'decide', 'make', 'sure', 'make',...
1    ['accept', 'challenge', 'literally', 'even', '...
2    ['roommate', 'okay', 'spell', 'autocorrect', '...
3    ['cute', 'atsu', 'probably', 'shy', 'photo', '...
4    ['rooneys', 'fuck', 'untouchable', 'fuck', 'dr...
Name: text_final, dtype: object


In [0]:
for index, entry in enumerate(tweets_val['text']):
  final_words = []
  word_lemmatized = WordNetLemmatizer() 
  for word, tag in pos_tag(entry):
    if word not in stopwords.words('english') and word.isalpha():
      word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]]) 
      final_words.append(word_final)
  tweets_val.loc[index, 'text_final'] = str(final_words)
print(tweets_val['text_final'].head())

0         ['fume', 'hijacked', 'move', 'full', 'back']
1                    ['nightmare', 'dream', 'freedom']
2    ['cnn', 'really', 'need', 'get', 'business', '...
3    ['kikme', 'horny', 'kik', 'nude', 'girl', 'hor...
4    ['fuck', 'tag', 'picture', 'family', 'first', ...
Name: text_final, dtype: object
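
The four loops above apply the same steps to each dataset. One way to avoid the repetition is a single helper, sketched below under some assumptions: `clean_tweet` is a hypothetical name, and the stop-word set, the lemmatising function and the tag lookup are passed in so the sketch runs without any NLTK data. In this notebook they would correspond to `set(stopwords.words('english'))`, `WordNetLemmatizer().lemmatize` and `tag_map`.

```python
from collections import defaultdict

def clean_tweet(tagged_tokens, stop_words, lemmatize, pos_lookup):
    """Drop stop words and non-alphabetic tokens, then lemmatise the rest.

    tagged_tokens: (word, tag) pairs such as those returned by nltk.pos_tag.
    stop_words:    a set, so each membership test is cheap.
    lemmatize:     callable taking (word, pos) and returning the lemma.
    pos_lookup:    mapping from the first letter of a tag to a WordNet POS.
    """
    return [
        lemmatize(word, pos_lookup[tag[0]])
        for word, tag in tagged_tokens
        if word not in stop_words and word.isalpha()
    ]

# Toy stand-ins so the sketch runs without downloading any NLTK data.
toy_stop_words = {'the', 'with', 'his'}
toy_pos_lookup = defaultdict(lambda: 'n', {'J': 'a', 'V': 'v', 'R': 'r'})

def toy_lemmatize(word, pos):
    # Hypothetical lemmatiser: strips a trailing "s" from nouns only.
    return word[:-1] if pos == 'n' and word.endswith('s') else word

tagged = [('the', 'DT'), ('bulls', 'NNS'), ('with', 'IN'),
          ('fake', 'JJ'), ('outrage', 'NN'), ('!', '.')]
cleaned = clean_tweet(tagged, toy_stop_words, toy_lemmatize, toy_pos_lookup)
print(cleaned)
```

Building the stop-word set once, rather than calling `stopwords.words('english')` for every word, should also make the preprocessing noticeably faster.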


Splitting the training data into training (70%) and held-out (30%) sets.

In [0]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(tweets_train['text_final'],tweets_train['label'], test_size=0.3)

Encoding

In [0]:
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)  # reuse the mapping fitted on the training labels
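
An encoder fitted on the training labels is normally reused, not refitted, on any other split; otherwise a split that happens to miss a class receives a different mapping. A small sketch of the difference, using string labels from the `emotions` list (the label lists are made up for illustration; the labels in this notebook are already the integers 0 to 3, so the encoding leaves them unchanged):

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ['anger', 'fear', 'joy', 'sadness', 'joy']
held_out_labels = ['joy', 'sadness']  # illustrative split with two classes missing

# Fit once on the training labels, then reuse the same mapping.
enc = LabelEncoder()
enc.fit(train_labels)
reused = enc.transform(held_out_labels).tolist()

# Refitting on the held-out labels builds a new, incompatible mapping.
refitted = LabelEncoder().fit_transform(held_out_labels).tolist()
print(reused, refitted)
```

With the reused encoder, `joy` keeps the index it had during training; with the refitted one it is renumbered from zero.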

Using TF-IDF to vectorize words for training, validation and testing sets.

In [0]:
tfidf_vect = TfidfVectorizer(max_features=5000)
tfidf_vect.fit(tweets_train['text_final'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=5000,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

Performing data vectorization on training, testing and validation datasets.

In [0]:
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf = tfidf_vect.transform(tweets_test_public['text_final'])
X_val_tfidf = tfidf_vect.transform(tweets_val['text_final'])
X_private_tfidf=tfidf_vect.transform(tweets_test_private['text_final'])
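
Because `text_final` stores each tweet as the string form of a list, the vectorizer relies on its default `token_pattern` to pull the words back out before weighting them. The sketch below reproduces both steps by hand for the settings printed above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). The two-document corpus and the helper name `tfidf_rows` are illustrative assumptions, and the vocabulary cap imposed by `max_features` is ignored.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # the default pattern: words of 2+ characters

def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [re.findall(TOKEN_PATTERN, doc.lower()) for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

corpus = ["['lol', 'adam', 'bull', 'fake', 'outrage']",
          "['lol', 'wow', 'gon', 'na', 'say']"]
rows = tfidf_rows(corpus)
print(sorted(rows[0]))
print(rows[0]['lol'] < rows[0]['adam'])
```

The brackets, quotes and commas never match the pattern, so only the words survive. A word that appears in both documents (`lol`) receives a lower weight than one that appears in only one of them (`adam`).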

The cell below trains a support vector machine on the preprocessed data.

In [0]:
from sklearn import model_selection, naive_bayes, svm
# degree and gamma are ignored by the linear kernel; only C affects this model.
sv = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')

# Fit the training dataset.
sv.fit(X_train_tfidf, y_train)

# Predict the labels on the public test, training and validation datasets.
SVtest = sv.predict(X_test_tfidf)
SVtrain = sv.predict(X_train_tfidf)
SVvalid = sv.predict(X_val_tfidf)

In [0]:
print("Accuracy score",accuracy_score(SVtrain,y_train))
print("Accuracy score",accuracy_score(SVvalid,tweets_val['label']))

Accuracy score 0.964975845410628
Accuracy score 0.4438356164383562


# Making a csv file for predictions on public test data

In [0]:
import csv

In [0]:
with open('/content/drive/My Drive/20comp8220/proj/text_dataset/45765758-conv.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["ID", "Prediction"])
    # IDs are the 1-based row numbers of the public test set.
    for i in range(SVtest.shape[0]):
        writer.writerow([i + 1, SVtest[i]])

# Notes on the Conventional ML Model

For the final model, I chose the hyperparameters without a systematic search, picking values from the options listed in the scikit-learn SVC documentation (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In addition to the final model, I also tried a random forest model, which performed slightly worse (accuracy 52.805%). I think this is because the random forest shows a larger gap between training accuracy (0.979) and validation accuracy (0.428) than the SVM does (0.965 and 0.444), which indicates more overfitting, as shown below.

In [0]:
from sklearn.ensemble import RandomForestClassifier

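# Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100,random_state=123456)


clf.fit(X_train_tfidf,y_train)
# Note the names: predictions_rand holds the training-set predictions and
# predictions_train holds the public test-set predictions.
predictions_rand = clf.predict(X_train_tfidf)
predictions_train = clf.predict(X_test_tfidf)
randomforestpred=clf.predict(X_val_tfidf)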

In [0]:
print("Accuracy score->",accuracy_score(predictions_rand,y_train))
print("Accuracy score->",accuracy_score(randomforestpred,tweets_val['label']))

Accuracy score-> 0.9792673107890499
Accuracy score-> 0.4280821917808219
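
The gap between training and validation accuracy can be read off the scores printed above for both conventional models. A short calculation using those numbers (the variable names are illustrative):

```python
# Accuracy scores copied from the cell outputs above.
scores = {
    'linear SVM':    {'train': 0.964975845410628,  'val': 0.4438356164383562},
    'random forest': {'train': 0.9792673107890499, 'val': 0.4280821917808219},
}

gaps = {name: s['train'] - s['val'] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train - validation = {gap:.3f}")
```

Both models fit the training data far better than the validation data; the random forest's gap is about 0.03 larger than the SVM's.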


# Deep Learning Model

The final model that produced the best-performing predictions for the Kaggle submission (accuracy 54.699%) is a dense model with two dropout layers, with dropout rates of 0.7 and 0.8. The input is the training data after preprocessing: stop-word removal, lemmatization and TF-IDF vectorization.
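
A dropout rate in Keras is the fraction of a layer's outputs that is zeroed during training, with the surviving values scaled by 1/(1 - rate) so that the expected activation is unchanged; at prediction time the layer passes its inputs through untouched. A rough pure-Python sketch of the training-time behaviour (the function name `dropout_train` and the seed are illustrative and not part of Keras):

```python
import random

def dropout_train(activations, rate, rng):
    """Zero each activation with probability `rate`, rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10_000
dropped = dropout_train(activations, rate=0.8, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(round(zero_fraction, 2), round(mean_activation, 2))
```

With rates of 0.7 and 0.8, only about 30% and 20% of the units feeding the next layer are active on any training step, which is strong regularisation.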

In [0]:
import tensorflow as tf
from tensorflow import keras

In [0]:
from keras.models import Sequential
from keras import layers
input_dim = X_train_tfidf.shape[1]  # Number of features
modeldropout1 = Sequential()
modeldropout1.add(layers.Dense(1000, input_dim=input_dim, activation='relu'))
modeldropout1.add(layers.Dense(500,activation='relu'))
modeldropout1.add(layers.Dropout(0.7))
modeldropout1.add(layers.Dense(700,activation='relu'))
modeldropout1.add(layers.Dense(800,activation='relu'))
modeldropout1.add(layers.Dropout(0.8))
modeldropout1.add(layers.Dense(4, activation='softmax'))

Using TensorFlow backend.


In [0]:
modeldropout1.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
                   
modeldropout1.fit(X_train_tfidf, y_train,epochs=20, batch_size=128, verbose=1,validation_data=(X_val_tfidf, tweets_val['label']))

Train on 4968 samples, validate on 1460 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x7fa9c40d4ba8>

In [0]:
l=modeldropout1.predict_classes(X_test_tfidf)
f=modeldropout1.predict_classes(X_train_tfidf)
r=modeldropout1.predict_classes(X_val_tfidf)
newprivate=modeldropout1.predict_classes(X_private_tfidf)

In [0]:
print("Accuracy score",accuracy_score(f,y_train))
print("Accuracy score",accuracy_score(r,tweets_val['label']))

Accuracy score 0.9786634460547504
Accuracy score 0.42054794520547945


# Notes on the Deep Learning Model

For the final model, I chose the hyperparameters without a systematic search. The dropout layers were intended to reduce the difference between the training and validation accuracy scores, although in the run shown above the gap is still large (0.979 against 0.421).

In addition to the final model, I also tried a network with only dense layers. It reached an accuracy of 54.379%, which is 0.32 percentage points lower than the network with dense and dropout layers. This gap in performance is probably due to the absence of dropout layers, although performance can also vary with the dataset: the dense-only model might perform better on other datasets, while the model with dropout layers happens to generalize better on the public test set. The model with dense and dropout layers also has a slightly higher training accuracy (0.979) than the dense-only model (0.977).

The code below defines the neural network model with only dense layers.

In [0]:
from keras.models import Sequential
from keras import layers
input_dim = X_train_tfidf.shape[1]  # Number of features
model = Sequential()
model.add(layers.Dense(1000, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(500,activation='relu'))
model.add(layers.Dense(800,activation='relu'))
model.add(layers.Dense(4, activation='softmax'))

In [0]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
                   
model.fit(X_train_tfidf, y_train,epochs=20, batch_size=128, verbose=1,validation_data=(X_val_tfidf, tweets_val['label']))

Train on 4968 samples, validate on 1460 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x7fa91a766e10>

In [0]:
g=model.predict_classes(X_test_tfidf)
h=model.predict_classes(X_train_tfidf)
newl=model.predict_classes(X_private_tfidf)
f=model.predict_classes(X_val_tfidf)

In [0]:
print("accuracy score->",accuracy_score(h,y_train))
print("accuracy score->",accuracy_score(f,tweets_val['label']))

accuracy score-> 0.9768518518518519
accuracy score-> 0.4273972602739726


# Discussion of Model Performance and Implementation

Comparing my final conventional ML and deep learning models, the deep learning model performed better by 0.516 percentage points on the public test set (54.699% against 54.183%). The deep learning model was ranked 35 out of 57 submissions on the public test set. The same model was ranked 22 out of 49 submissions on the private test set with an accuracy of 64.153%, which is 9.454 percentage points higher than its accuracy on the public test set. So although the model did not do well on the public test set, it did better on the private test set. This suggests that the difference comes from the datasets rather than from a fault in the model: the same model gives different accuracy scores depending on the data it is evaluated on.
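
The accuracy differences between the submissions can be checked directly from the figures reported in this notebook. A quick calculation (the variable names are illustrative):

```python
# Kaggle accuracies (percent) as reported in this notebook.
conventional_public = 54.183   # linear SVM, public test set
deep_public = 54.699           # dense + dropout, public test set
dense_only_public = 54.379     # dense layers only, public test set
deep_private = 64.153          # dense + dropout, private test set

deep_vs_conventional = round(deep_public - conventional_public, 3)
dropout_vs_dense_only = round(deep_public - dense_only_public, 3)
private_vs_public = round(deep_private - deep_public, 3)
print(deep_vs_conventional, dropout_vs_dense_only, private_vs_public)
```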
	
