# Part 3.8 - Self Exploration
We will be experimenting with different pre-trained Word2Vec models.

First we will reimport everything we need.

In [13]:
import numpy as np
import json

with open('goemotions.json') as f:
	dataset = np.array(json.load(f))
	
posts = dataset[:, 0]
emotions = dataset[:, 1]
sentiments = dataset[:, 2]

In [14]:
from sklearn.preprocessing import LabelEncoder

# Emotions.
emotionsLabelEncoder = LabelEncoder()
encodedEmotions = emotionsLabelEncoder.fit_transform(emotions)

# Sentiments.
sentimentsLabelEncoder = LabelEncoder()
encodedSentiments = sentimentsLabelEncoder.fit_transform(sentiments)

## 3.1 - Loading
In this part we will load one of the pre-trained models found [here](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html).

In [15]:
import gensim.downloader as api
w2v = api.load('glove-twitter-100')



## 3.2 - Extraction
In this part we will tokenize the posts.

In [16]:
import nltk

tokenizedPosts = []

for post in posts:
	tokenizedPosts.append(nltk.tokenize.word_tokenize(post))

Since the later steps compute embeddings and hit rates separately for the training and test sets, we split the tokenized posts here.

In [17]:
from sklearn.model_selection import train_test_split

tokenizedPosts_emotions_train, tokenizedPosts_emotions_test, emotions_train, emotions_test = train_test_split(tokenizedPosts, encodedEmotions, test_size=0.20)
tokenizedPosts_sentiments_train, tokenizedPosts_sentiments_test, sentiments_train, sentiments_test = train_test_split(tokenizedPosts, encodedSentiments, test_size=0.20)
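
The two calls above shuffle independently, so the emotions and sentiments classifiers are evaluated on different test posts. As a hedged alternative, `train_test_split` accepts several arrays at once and splits them with the same shuffled indices. The sketch below uses toy data (`toy_posts`, `toy_emotions`, `toy_sentiments` are made-up stand-ins, and `random_state=0` is an arbitrary choice for reproducibility), not the notebook's actual split.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tokenized posts and the two encoded label arrays.
toy_posts = [[f"token{i}"] for i in range(10)]
toy_emotions = list(range(10))
toy_sentiments = [i % 3 for i in range(10)]

# One call, one shuffle: every output keeps the same row alignment.
(posts_train, posts_test,
 emotions_train_toy, emotions_test_toy,
 sentiments_train_toy, sentiments_test_toy) = train_test_split(
    toy_posts, toy_emotions, toy_sentiments,
    test_size=0.20, random_state=0)

print(len(posts_train), len(posts_test))
```

With a single split, both classifiers see the same training posts and are scored on the same held-out posts, which makes their results easier to compare.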

## 3.3 - Computing Embeddings
In this section we will use our Word2Vec model and our tokens to create embeddings for our posts.

First we define a helper that counts tokens, followed by the general embedding function.

In [18]:
def countTokens(posts_tokenized):
	counter = 0
	for post in posts_tokenized:
		counter += len(post)
	return counter

In [19]:
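def tokenizedPostsToEmbedding(posts_tokenized, w2v, w2v_size=300):
	# w2v_size must match the dimension of the vectors stored in w2v.
	postEmbeddings = []

	for post in posts_tokenized:
		# Start from the zero vector; it is kept as-is if none of the tokens are in our model.
		embedding = np.zeros(w2v_size)

		for token in post:
			if token in w2v:
				embedding += w2v[token] # Vector addition.

		# Average over all tokens in the post, including those missing from the model.
		# Empty posts keep the zero vector, which avoids a division by zero.
		if len(post) > 0:
			embedding = embedding / len(post)
		postEmbeddings.append(embedding)

	return np.array(postEmbeddings)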
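
To make the averaging step concrete, here is a small sketch with a made-up two-dimensional vocabulary. `toy_vectors` and `embed_post` are hypothetical names used only for illustration. The sketch contrasts two denominators: the total number of tokens (as in the function above) and the number of tokens actually found in the model, which is another common convention.

```python
import numpy as np

# Hypothetical 2-dimensional vocabulary standing in for the real model.
toy_vectors = {
    "good": np.array([1.0, 3.0]),
    "day": np.array([3.0, 1.0]),
}

def embed_post(tokens, vectors, size, skip_missing):
    """Average the vectors of one post's tokens.

    skip_missing=False divides by the total number of tokens;
    skip_missing=True divides only by the number of tokens found.
    """
    total = np.zeros(size)
    hits = 0
    for token in tokens:
        if token in vectors:
            total += vectors[token]
            hits += 1
    denominator = hits if skip_missing else len(tokens)
    if denominator == 0:
        return total  # Still the zero vector.
    return total / denominator

post = ["good", "day", "lol", "xd"]  # Two of the four tokens are known.
over_all_tokens = embed_post(post, toy_vectors, 2, skip_missing=False)
over_found_tokens = embed_post(post, toy_vectors, 2, skip_missing=True)
print(over_all_tokens, over_found_tokens)
```

Dividing by the total token count shrinks the embedding toward the zero vector whenever tokens are missing, so the choice of denominator interacts with the hit rate measured in the next section.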

Now we use this function on our different tokenized posts.

In [20]:
embeddedPosts_emotions_train = tokenizedPostsToEmbedding(tokenizedPosts_emotions_train, w2v, 100)
embeddedPosts_emotions_test = tokenizedPostsToEmbedding(tokenizedPosts_emotions_test, w2v, 100)
embeddedPosts_sentiments_train = tokenizedPostsToEmbedding(tokenizedPosts_sentiments_train, w2v, 100)
embeddedPosts_sentiments_test = tokenizedPostsToEmbedding(tokenizedPosts_sentiments_test, w2v, 100)

## 3.4 - Hit Rates
We will now find the hit rates of our model, that is, the fraction of tokens in each set that appear in the model's vocabulary.

First we define a general function.

In [21]:
def findHitRate(posts_tokenized, w2v):
	numberOfHits = 0
	for post in posts_tokenized:
		for token in post:
			if token in w2v:
				numberOfHits += 1
	return numberOfHits / countTokens(posts_tokenized)
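
As a quick sanity check of the definition, the sketch below computes a hit rate on toy data. A plain Python set (`toy_vocabulary`, a made-up name) stands in for the model's vocabulary, and `hit_rate` is an illustrative helper rather than the notebook's function.

```python
def hit_rate(posts_tokenized, vocabulary):
    """Fraction of all tokens that are present in the vocabulary."""
    total = sum(len(post) for post in posts_tokenized)
    if total == 0:
        return 0.0  # No tokens at all: report 0 instead of dividing by zero.
    hits = sum(token in vocabulary
               for post in posts_tokenized
               for token in post)
    return hits / total

toy_vocabulary = {"i", "love", "this"}
toy_posts = [["i", "love", "this"], ["omg", "this"]]

rate = hit_rate(toy_posts, toy_vocabulary)  # 4 of the 5 tokens are known.
print(rate)
```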

Now we can apply this function to our various sets of posts.

In [22]:
# Emotions Training Hit Rate.
hitRate_emotions_train = findHitRate(tokenizedPosts_emotions_train, w2v)
print(f"Hit Rate Of Emotions Training Set: {hitRate_emotions_train}")

# Emotions Testing Hit Rate.
hitRate_emotions_test = findHitRate(tokenizedPosts_emotions_test, w2v)
print(f"Hit Rate Of Emotions Testing Set: {hitRate_emotions_test}")

# Sentiments Training Hit Rate.
hitRate_sentiments_train = findHitRate(tokenizedPosts_sentiments_train, w2v)
print(f"Hit Rate Of Sentiments Training Set: {hitRate_sentiments_train}")

# Sentiments Testing Hit Rate.
hitRate_sentiments_test = findHitRate(tokenizedPosts_sentiments_test, w2v)
print(f"Hit Rate Of Sentiments Testing Set: {hitRate_sentiments_test}")

Hit Rate Of Emotions Training Set: 0.8454672167522564
Hit Rate Of Emotions Testing Set: 0.8451212457532424
Hit Rate Of Sentiments Training Set: 0.8454508667897014
Hit Rate Of Sentiments Testing Set: 0.8451843048510069


## 3.5 - Base Multi-Layered Perceptron
We will now train and predict using our base multi-layered perceptron.

In [23]:
from sklearn.neural_network import MLPClassifier

baseMLPClassifier_emotions = MLPClassifier(early_stopping=True)
baseMLPClassifier_sentiments = MLPClassifier(early_stopping=True)

For the emotions:

In [24]:
baseMLPClassifier_emotions.fit(embeddedPosts_emotions_train, emotions_train)
baseMLP_emotions_results = baseMLPClassifier_emotions.predict(embeddedPosts_emotions_test)
baseMLPClassifier_emotions.score(embeddedPosts_emotions_test, emotions_test)

0.37239553020602956

For the sentiments:

In [25]:
baseMLPClassifier_sentiments.fit(embeddedPosts_sentiments_train, sentiments_train)
baseMLP_sentiments_results = baseMLPClassifier_sentiments.predict(embeddedPosts_sentiments_test)
baseMLPClassifier_sentiments.score(embeddedPosts_sentiments_test, sentiments_test)

0.49627517169130486

## 3.6 - Top Multi-Layered Perceptron
A better-performing Multi-Layered Perceptron could be found with `GridSearchCV`, as in part 2. We do not run the search here because it takes too long, so the cells below are left commented out.

In [26]:
# from sklearn.model_selection import GridSearchCV

# topMLPParameters = {
#     'activation': ('logistic', 'tanh'),
#     'hidden_layer_sizes': ((30,20), (10,10,10)),
#     'solver': ('sgd', 'adam'),
#     'early_stopping': [True]
# }
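
To give a rough sense of why the search is slow, the sketch below counts how many models it would train. It assumes the default 5-fold cross-validation of `GridSearchCV` and the default refit of the best configuration on the full training set; `cv_folds` and the other variable names are illustrative.

```python
from itertools import product

# Same grid as in the commented-out cell above.
param_grid = {
    "activation": ("logistic", "tanh"),
    "hidden_layer_sizes": ((30, 20), (10, 10, 10)),
    "solver": ("sgd", "adam"),
    "early_stopping": [True],
}

# Every combination of parameter values is one candidate model.
combinations = list(product(*param_grid.values()))

cv_folds = 5  # Assumption: the GridSearchCV default.
fits_per_search = len(combinations) * cv_folds + 1  # +1 for the final refit.
total_fits = 2 * fits_per_search  # One search for emotions, one for sentiments.

print(len(combinations), fits_per_search, total_fits)
```

Each of these fits trains a full MLP on the embedded training set, which is why the search is skipped in this part.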

The search would first be fitted on the emotions.

In [27]:
# topMLPsearch_emotions = GridSearchCV(baseMLPClassifier_emotions, topMLPParameters)
# topMLPsearch_emotions.fit(embeddedPosts_emotions_train, emotions_train)
# topMLP_emotions_results = topMLPsearch_emotions.predict(embeddedPosts_emotions_test)
# topMLPsearch_emotions.score(embeddedPosts_emotions_test, emotions_test)

It would then be fitted on the sentiments.

In [28]:
# topMLPsearch_sentiments = GridSearchCV(baseMLPClassifier_sentiments, topMLPParameters)
# topMLPsearch_sentiments.fit(embeddedPosts_sentiments_train, sentiments_train)
# topMLP_sentiments_results = topMLPsearch_sentiments.predict(embeddedPosts_sentiments_test)
# topMLPsearch_sentiments.score(embeddedPosts_sentiments_test, sentiments_test)

## 3.7 - Performance Report
We will append the new results to the `performance-e3.txt` file.

First we will redefine the functions from part 2.

In [29]:
from sklearn import metrics

def stringifyConfusionMatrix(confusionMatrix):
	output = ""

	for row in confusionMatrix:
		for column in row:
			output += f"{column}\t"
		output += "\n"
	
	return output

def logPerformance(destination, title, emotionsActual, emotionsPredicted, sentimentsActual, sentimentsPredicted, emotionsPara=None, sentimentsPara=None):
	with open(destination, 'a') as outfile:
		outfile.write(f"\n# {title}\n")
		outfile.write("## Emotions:\n")
		if emotionsPara is not None:
			outfile.write("### Parameters:\n")
			outfile.write(f"{str(emotionsPara)}\n")
		outfile.write("### Confusion Matrix:\n")
		outfile.write(stringifyConfusionMatrix(metrics.confusion_matrix(emotionsActual, emotionsPredicted)))
		outfile.write("\n### Metrics:\n")
		outfile.write(metrics.classification_report(emotionsActual, emotionsPredicted))
		outfile.write("## Sentiments:\n")
		if sentimentsPara is not None:
			outfile.write("### Parameters:\n")
			outfile.write(f"{str(sentimentsPara)}\n")
		outfile.write("### Confusion Matrix:\n")
		outfile.write(stringifyConfusionMatrix(metrics.confusion_matrix(sentimentsActual, sentimentsPredicted)))
		outfile.write("\n### Metrics:\n")
		outfile.write(metrics.classification_report(sentimentsActual, sentimentsPredicted))

Now we will insert the information.

In [30]:
# For Base Multi-Layered Perceptron
logPerformance("performance-e3.txt", "Embedded Base Multi-Layered Perceptron", emotions_test, baseMLP_emotions_results, sentiments_test, baseMLP_sentiments_results)

# # For Top Multi-Layered Perceptron
# logPerformance("performance-e3.txt", "Embedded Top Multi-Layered Perceptron", emotions_test, topMLP_emotions_results, sentiments_test, topMLP_sentiments_results, topMLPsearch_emotions.best_params_, topMLPsearch_sentiments.best_params_)



## 3.8 - Self Exploration
As `performance-e3.txt` shows, our accuracy and F1 metrics are lower than they were with the previous model. Two factors could explain this.

First, the hit rate is slightly lower: about 84.5% here, compared with 85% before.

Second, the vectors have a lower dimension. This gives the classifier fewer features to work with, which could result in poorer performance.

We could test the second hypothesis by rerunning the experiment with the previous model, but with dimension 100 instead.
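
One hedged way to isolate the effect of dimension is to compare models from the same family that differ only in vector size. The gensim-data catalogue lists GloVe Twitter models with 25, 50, 100 and 200 dimensions; since they come from the same corpus, their vocabulary, and therefore the hit rate, should stay the same across sizes. In the sketch below, `compare_models` and `evaluate_model` are hypothetical names, and the actual download is left commented out because the models are large.

```python
def compare_models(model_names, load_model, evaluate):
    """Return {model name: score} by loading and evaluating each model in turn."""
    return {name: evaluate(load_model(name)) for name in model_names}

# Model names from the gensim-data catalogue; the suffix is the vector size.
GLOVE_TWITTER_MODELS = [
    "glove-twitter-25",
    "glove-twitter-50",
    "glove-twitter-100",
    "glove-twitter-200",
]

# Intended use in this notebook (hypothetical evaluate_model would embed the
# posts, train the base MLP and return its test accuracy):
# import gensim.downloader as api
# scores = compare_models(GLOVE_TWITTER_MODELS, api.load, evaluate_model)

# Dry run with stand-ins so the helper can be checked without downloading:
fake_load = lambda name: int(name.rsplit("-", 1)[1])  # "Model" is just its size.
fake_evaluate = lambda size: size / 1000              # Placeholder score.
dry_run_scores = compare_models(GLOVE_TWITTER_MODELS, fake_load, fake_evaluate)
print(dry_run_scores)
```

If accuracy rises steadily with vector size while the hit rate stays fixed, that would support the dimension hypothesis over the hit-rate one.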