The goal is to tune a neural network that predicts whether a joke is funny. Training uses the Jester dataset: https://eigentaste.berkeley.edu/dataset/

Dataset V3 was picked because it has more jokes than V1 and more ratings than both V1 and V4.

Load jokes text data

In [24]:
import pandas as pd
texts_frame = pd.read_excel('Dataset3JokeSet.xlsx', header=None)
joke_list = list(texts_frame.iloc[:, 0])
joke_list[:5]

['A man visits the doctor. The doctor says "I have bad news for you.You have cancer and Alzheimer\'s disease".  The man replies "Well,thank God I don\'t have cancer!"',
 'This couple had an excellent relationship going until one day he came home from work to find his girlfriend packing. He asked her why she was leaving him and she told him that she had heard awful things about him.   "What could they possibly have said to make you move out?"   "They told me that you were a pedophile."   He replied, "That\'s an awfully big word for a ten year old."',
 "Q. What's 200 feet long and has 4 teeth?   A. The front row at a Willie Nelson Concert.",
 "Q. What's the difference between a man and a toilet?   A. A toilet doesn't follow you around after you use it.",
 "Q.\tWhat's O. J. Simpson's Internet address?  A.\tSlash, slash, backslash, slash, slash, escape."]

In [25]:
len(joke_list)

150

Generate text embeddings.

In [17]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-cased')

No sentence-transformers model found with name C:\Users\szymo/.cache\torch\sentence_transformers\bert-base-cased. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at C:\Users\szymo/.cache\torch\sentence_transformers\bert-base-cased were not 

In [26]:
embeddings = model.encode(joke_list)

embeddings.shape

(150, 768)
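
The log above says the plain `bert-base-cased` checkpoint gets wrapped with mean pooling, which is how each joke ends up as a single 768-dimensional vector. Below is a minimal NumPy sketch of what mean pooling does, using made-up token vectors and a hypothetical `mean_pool` helper. It illustrates the idea only and is not the code sentence_transformers runs internally.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors of one text, ignoring padding positions.

    token_embeddings: (n_tokens, dim); attention_mask: (n_tokens,) with 1 = real token.
    """
    mask = attention_mask.astype(float)[:, None]      # (n_tokens, 1) for broadcasting
    summed = (token_embeddings * mask).sum(axis=0)    # sum over real tokens only
    return summed / max(mask.sum(), 1e-9)             # guard against an all-padding input

# Made-up example: 3 token positions, the last one is padding.
demo_tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
demo_mask = np.array([1, 1, 0])
demo_sentence_vector = mean_pool(demo_tokens, demo_mask)
print(demo_sentence_vector)  # [2. 3.]
```

The padding row does not affect the result, so jokes of different lengths map to vectors of the same size.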

Load numerical data (ratings of jokes)

In [84]:
ratings_frame = pd.read_excel('JESTER_3_RATINGS.xls', header=None)
ratings_frame.head


<bound method NDFrame.head of        0    1    2    3    4         5    6        7        8    9    ...  \
0       62   99   99   99   99   0.21875   99 -9.28125 -9.28125   99  ...   
1       34   99   99   99   99  -9.68750   99  9.93750  9.53125   99  ...   
2       18   99   99   99   99  -9.84375   99 -9.84375 -7.21875   99  ...   
3       82   99   99   99   99   6.90625   99  4.75000 -5.90625   99  ...   
4       27   99   99   99   99  -0.03125   99 -9.09375 -0.40625   99  ...   
...    ...  ...  ...  ...  ...       ...  ...      ...      ...  ...  ...   
54900   13   99   99   99   99  99.00000   99 -6.53125 -2.34375   99  ...   
54901    8   99   99   99   99  99.00000   99  8.93750  9.78125   99  ...   
54902   27   99   99   99   99  99.00000   99 -1.59375  4.53125   99  ...   
54903    8   99   99   99   99  99.00000   99 -7.40625  6.93750   99  ...   
54904   12   99   99   99   99  99.00000   99  4.25000  6.59375   99  ...   

        141   142   143   144   145   146   1

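Clean the rating data. A rating equal to 99 means that the given person did not rate the joke, so it is replaced with NaN before averaging. Column 0 is dropped because in the Jester format it holds the number of jokes each user rated, not a rating. The result is the mean rating per joke.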

In [85]:
ratings_frame = ratings_frame.iloc[:, 1:].replace(99, float('nan'))
ratings_frame = ratings_frame.mean()
ratings_frame.shape

(150,)
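
The cleaning step can be checked on a tiny table. This is a sketch with made-up ratings (the `toy_` names are hypothetical) that follows the same three operations: drop the count column, turn 99 into NaN, average per joke.

```python
import numpy as np
import pandas as pd

# Made-up ratings: column 0 = number of jokes the user rated, columns 1..3 = jokes.
toy_ratings = pd.DataFrame([
    [2, 99, 1.5, -3.0],
    [1, 99, 99, 5.0],
    [2, 99, 2.5, 4.0],
])

# Drop the count column, mark 99 as missing, then average each joke's column.
toy_means = toy_ratings.iloc[:, 1:].replace(99, np.nan).mean()

print(toy_means.index.tolist())  # [1, 2, 3]  (column labels survive, so they start at 1)
print(toy_means.loc[1])          # nan        (nobody rated joke 1)
print(toy_means.loc[2])          # 2.0
```

Note that the resulting Series is indexed by the original column labels, which start at 1, not by position.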

In [86]:
ratings_frame.describe()

count    140.000000
mean       1.619898
std        1.422424
min       -2.738766
25%        0.773734
50%        1.895642
75%        2.759035
max        3.660416
dtype: float64

Now X has 150 rows (shape 150 x 768) and y has 150 entries, but y contains some NaN values (jokes that nobody rated):

In [87]:
nan_indices = ratings_frame.index[ratings_frame.isnull()].tolist()
nan_indices

[1, 2, 3, 4, 6, 9, 10, 11, 12, 14]

Remove the NaN values from the ratings AND the corresponding joke embeddings

In [88]:
import numpy as np

ratings_frame = ratings_frame.dropna()

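# nan_indices holds column labels of the ratings sheet, i.e. joke IDs starting at 1
# (column 0 was dropped), while embeddings rows are 0-based positions in joke_list.
all_indices = np.arange(150).tolist()
for joke_id in nan_indices:
    print(joke_id)
    all_indices.remove(joke_id - 1)  # joke ID k sits in embeddings row k - 1

embeddings = embeddings[all_indices]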

1
2
3
4
6
9
10
11
12
14


After cleaning, both X and y have 140 rows

In [89]:
ratings_frame.shape

(140,)

In [90]:
embeddings.shape

(140, 768)
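
A boolean mask is one way to drop unrated jokes from both arrays at once without converting between joke IDs and row positions. This is a minimal sketch with made-up data and hypothetical `toy_` names; for a pandas Series such as `ratings_frame`, `ratings_frame.notna().to_numpy()` would give an equivalent mask.

```python
import numpy as np

# Made-up data: 4 jokes, the first and third have no mean rating.
toy_y = np.array([np.nan, 1.2, np.nan, -0.5])
toy_X = np.arange(8, dtype=float).reshape(4, 2)   # row i belongs to toy_y[i]

rated = ~np.isnan(toy_y)        # True where a rating exists
toy_X_clean = toy_X[rated]      # the same mask on both arrays keeps rows paired
toy_y_clean = toy_y[rated]

print(toy_X_clean.tolist())     # [[2.0, 3.0], [6.0, 7.0]]
print(toy_y_clean.tolist())     # [1.2, -0.5]
```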

Splitting the dataset into training and validation sets

In [91]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(
    embeddings,
    ratings_frame,
    test_size=0.25,
    random_state=0)


In [171]:
from collections import namedtuple

RunResults = namedtuple("RunResults", ["train_loss", "val_loss"])
NetworkParameters = namedtuple("NetworkParameters", ["set_name", "learning_rate", "hidden_layer_sizes", "alpha"])

In [172]:
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def run_perceptron(no_of_epochs: int, network_parameters: NetworkParameters):
    """Train for no_of_epochs epochs and record train/validation MSE after each one."""
    regressor = MLPRegressor(solver='sgd',
                             learning_rate='constant',
                             random_state=1,
                             max_iter=1,    # one epoch per fit() call, so MSE can be measured after every epoch
                             early_stopping=True,   # holds out 10% of train_X (default validation_fraction) inside fit()
                             warm_start=True,   # keep the weights from the previous fit() call instead of reinitializing
                             learning_rate_init=network_parameters.learning_rate,
                             hidden_layer_sizes=network_parameters.hidden_layer_sizes,
                             alpha=network_parameters.alpha
                             )
    
    training_mse = []
    validation_mse = []

    for _ in range(no_of_epochs):
        regressor.fit(train_X, train_y)
        # cost function for train set
        y_pred_train_set = regressor.predict(train_X)
        curr_train_score = mean_squared_error(train_y, y_pred_train_set)
        training_mse.append(curr_train_score)

        # cost function for validation set
        y_pred_val_set = regressor.predict(val_X)
        curr_valid_score = mean_squared_error(val_y, y_pred_val_set)
        validation_mse.append(curr_valid_score)

    return RunResults(training_mse, validation_mse)
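
An alternative way to measure the loss after every epoch is `MLPRegressor.partial_fit`, which performs a single pass over the data per call. The sketch below uses synthetic data and hypothetical `demo_` names in place of the joke embeddings, and it does not use early stopping, so it is a variation on the loop above rather than a drop-in replacement.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))            # synthetic stand-in for the embeddings
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1]   # synthetic stand-in for the mean ratings
X_tr, X_va, y_tr, y_va = X_demo[:45], X_demo[45:], y_demo[:45], y_demo[45:]

demo_regressor = MLPRegressor(solver="sgd", learning_rate_init=0.002,
                              hidden_layer_sizes=(16,), alpha=0.0, random_state=1)

demo_train_mse, demo_val_mse = [], []
for _ in range(20):
    demo_regressor.partial_fit(X_tr, y_tr)   # one pass over the training data
    demo_train_mse.append(mean_squared_error(y_tr, demo_regressor.predict(X_tr)))
    demo_val_mse.append(mean_squared_error(y_va, demo_regressor.predict(X_va)))

print(len(demo_train_mse), len(demo_val_mse))  # 20 20
```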

In [164]:
FIRST_NETWORK_PARAMETERS = NetworkParameters("first", 0.001, (100,), 0.0)

In [173]:
import plotly.express as px

def plot_metrics(hyperparameters: NetworkParameters):
    """Train with the given hyperparameters and plot both loss curves per epoch."""
    results = run_perceptron(750, hyperparameters)
    df = pd.DataFrame({"Train loss": results.train_loss, "Val loss": results.val_loss})

    fig = px.line(df, title=hyperparameters.set_name)
    fig.show()

In [179]:
plot_metrics(FIRST_NETWORK_PARAMETERS)


Stochastic Optimizer: Maximum iterations (1) reached and the optimization hasn't converged yet.



Tuning the network, beginning with the most important hyperparameter: the learning rate.

With a learning rate that is too low, convergence is slow: the loss decreases only a little per epoch.

With a learning rate that is too high, the loss oscillates, and training can diverge instead of converging.

A learning rate that changes during training (for example learning_rate='adaptive' or 'invscaling' in MLPRegressor with the sgd solver) often works better than a constant one.
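
The three regimes described above can be reproduced on a toy problem. This is a sketch that minimizes f(w) = w**2 with plain gradient descent; the function and the chosen rates are illustrative assumptions and have nothing to do with the joke network itself.

```python
def gradient_descent(learning_rate: float, steps: int = 20, start: float = 1.0) -> list[float]:
    """Minimize f(w) = w**2 with plain gradient descent and return every iterate."""
    w = start
    path = [w]
    for _ in range(steps):
        w -= learning_rate * 2 * w   # the gradient of w**2 is 2*w
        path.append(w)
    return path

slow = gradient_descent(0.01)      # |w| shrinks by only 2% per step
good = gradient_descent(0.4)       # |w| shrinks by 80% per step
bouncing = gradient_descent(0.9)   # w flips sign every step but still shrinks
diverging = gradient_descent(1.1)  # w flips sign and grows by 20% per step

for name, path in [("slow", slow), ("good", good),
                   ("bouncing", bouncing), ("diverging", diverging)]:
    print(f"{name:10s} final |w| = {abs(path[-1]):.3g}")
```

The "bouncing" run shows the oscillation pattern, and the "diverging" run shows how a slightly higher rate makes the iterates grow instead of shrink.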

In [176]:
hyperparameters_to_experiment = (
    NetworkParameters("Learning rate 0.01 - default times 10. Oscillations are clearly visible", 0.01, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Learning rate 0.001 - default", 0.001, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Learning rate 0.0001 - default divided by 10", 0.0001, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Learning rate 0.00001 - default divided by 100", 0.00001, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, FIRST_NETWORK_PARAMETERS.alpha),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


The examples above are educational, but the values are too far apart to pin down the best one. The optimal learning rate is probably higher than 0.001 but much lower than 0.01.

In [178]:
hyperparameters_to_experiment = (
    NetworkParameters("Learning rate 0.003 - default times 3", 0.003, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Learning rate 0.0025 - default times 2.5", 0.0025, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Learning rate 0.002 - default times 2", 0.002, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, FIRST_NETWORK_PARAMETERS.alpha),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


Increasing the learning rate by less extreme amounts gives a slightly better final validation loss for all 3 learning rates. In all of them the train loss ends up lower than the validation loss.

The "default * 3" model is clearly overfitted: the validation loss rises (0.80 to 0.81) while the train loss keeps falling (0.6 to 0.35 over the same period). The same can be said about the "default * 2.5" model, although there the rise in validation loss is smaller (0.801 to 0.806).

The "default * 2" model has the highest learning rate that does not result in overfitting. I picked it for further tuning of the other hyperparameters.

In [180]:
OPTIMAL_LEARNING_RATE = 0.002
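
The rule used above (validation loss rising after its minimum while the train loss keeps falling) can be written as a small check. This is a sketch with a hypothetical `summarize_validation_curve` helper and a made-up curve; the tolerance is an arbitrary assumption, not a value used in the experiments.

```python
def summarize_validation_curve(val_loss: list[float], tolerance: float = 0.0) -> dict:
    """Report where the validation loss was lowest and whether it rose afterwards."""
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    rise = val_loss[-1] - val_loss[best_epoch]
    return {"best_epoch": best_epoch, "best_loss": val_loss[best_epoch],
            "rise_after_best": rise, "overfitting": rise > tolerance}

# Made-up curve: falls, reaches a minimum, then creeps up again.
demo_val_loss = [1.20, 0.95, 0.85, 0.80, 0.802, 0.806, 0.81]
summary = summarize_validation_curve(demo_val_loss, tolerance=0.005)
print(summary["best_epoch"], summary["overfitting"])  # 3 True
```

Applied to the `val_loss` list returned by `run_perceptron`, a check like this would flag the same runs that were judged overfitted by eye.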

Tuning hidden layer size.

The input layer size is determined by the shape of the training data (768 features here), and the output layer of a regressor has a single node. The number and size of the hidden layers are hyperparameters that can be tuned.
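
To get a feeling for what a given hidden layer size means, it helps to count the trainable parameters. This is a sketch with a hypothetical `count_mlp_parameters` helper for a fully connected network with 768 inputs and one output, as described above.

```python
def count_mlp_parameters(n_inputs: int, hidden_layer_sizes: tuple, n_outputs: int = 1) -> int:
    """Number of weights and biases in a fully connected network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return sum(n_in * n_out + n_out          # weight matrix + bias vector per layer
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

for hidden in [(3,), (50,), (100,), (200,)]:
    print(hidden, count_mlp_parameters(768, hidden))
```

Even the default of 100 hidden units gives 77001 parameters, far more than the number of jokes available for training.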

In [186]:
hyperparameters_to_experiment = (
    NetworkParameters("Layer size 200 - default multiplied by 2", OPTIMAL_LEARNING_RATE, (200,), FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Layer size 100 - default", OPTIMAL_LEARNING_RATE, (100,), FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Layer size 50 - default divided by 2", OPTIMAL_LEARNING_RATE, (50,), FIRST_NETWORK_PARAMETERS.alpha),
    NetworkParameters("Layer size 3", OPTIMAL_LEARNING_RATE, (3,), FIRST_NETWORK_PARAMETERS.alpha),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


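With a very large hidden layer we see overfitting. A likely reason is that the layer size approaches the size of the input (768 in our case), so the network has far more weights (over 500,000 for 700 hidden units) than the roughly 105 training jokes can constrain.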

In [191]:
hyperparameters_to_experiment = (
    NetworkParameters("Layer size 700", OPTIMAL_LEARNING_RATE, (700,), FIRST_NETWORK_PARAMETERS.alpha),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


In [194]:
hyperparameters_to_experiment = (
    NetworkParameters("Layer size 90", OPTIMAL_LEARNING_RATE, (90,), FIRST_NETWORK_PARAMETERS.alpha),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


After testing with values of 90 and 110, the default value of 100 was picked as optimal.

The last hyperparameter to experiment with is alpha, the strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss.
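
The effect of alpha and of the sample size on the penalty can be seen in a few lines. This is a sketch with a hypothetical `l2_penalty` helper and made-up weights; the 0.5 factor and the exclusion of bias terms are assumptions about scikit-learn's internals, while the division by the sample size is what its documentation states.

```python
import numpy as np

def l2_penalty(weight_matrices: list, alpha: float, n_samples: int) -> float:
    """L2 term added to the loss: 0.5 * alpha * (sum of squared weights) / n_samples.

    The 0.5 factor and skipping the biases are assumptions, see the note above.
    """
    squared_sum = sum(float(np.sum(w ** 2)) for w in weight_matrices)
    return 0.5 * alpha * squared_sum / n_samples

demo_weights = [np.array([[1.0, 2.0], [3.0, 4.0]])]   # made-up 2x2 layer, squared sum = 30
print(l2_penalty(demo_weights, alpha=2.0, n_samples=10))    # 3.0
print(l2_penalty(demo_weights, alpha=2.0, n_samples=100))   # 0.3
```

For a fitted `MLPRegressor`, the weight matrices are available as `regressor.coefs_`, so the same helper could be applied to a trained model.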

In [199]:
hyperparameters_to_experiment = (
    NetworkParameters("Alpha 20", OPTIMAL_LEARNING_RATE, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, 20),
    NetworkParameters("Alpha 2", OPTIMAL_LEARNING_RATE, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, 2),
    NetworkParameters("Alpha 0.0001 - default", OPTIMAL_LEARNING_RATE, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, 0.0001),
    NetworkParameters("Alpha 0", OPTIMAL_LEARNING_RATE, FIRST_NETWORK_PARAMETERS.hidden_layer_sizes, 0),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


Changing the value of alpha had little effect on the curves. L2 regularization is used to prevent overfitting, and with the other hyperparameters at their current values the model does not overfit. A higher value of alpha results in stronger regularization (a simpler model).

To see the effect of alpha, we will increase the layer size to introduce overfitting intentionally.

In [208]:
hyperparameters_to_experiment = (
    NetworkParameters("Layer size 500, alpha 0", OPTIMAL_LEARNING_RATE, (500,), 0.0),
    NetworkParameters("Layer size 500 - alpha 1", OPTIMAL_LEARNING_RATE, (500,), 1),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


In [210]:
hyperparameters_to_experiment = (
    NetworkParameters("Layer size 500 - alpha 15", OPTIMAL_LEARNING_RATE, (500,), 15),
)

for param_set in hyperparameters_to_experiment:
    plot_metrics(param_set)


With this layer size, alpha 0 results in overfitting. Increasing alpha to 15 does in fact reduce overfitting and produces a better model (best loss of 0.85 instead of 0.87).

Testing the best tuned model on jokes from outside the dataset, to check in practice which jokes are predicted to be funny

In [232]:
external_joke_list = [
    "Knock! Knock! Whos there? Control Freak. Con… OK, now you say, Control Freak who?",
    "How do you drown a hipster? Throw him in the mainstream.",
    "What breed of dog can jump higher than buildings? Any dog, because buildings can't jump.",
]

In [233]:
test_embeddings = model.encode(external_joke_list)

test_embeddings.shape

(3, 768)

In [234]:
final_regressor = MLPRegressor(solver='sgd',
                               learning_rate='constant',
                               random_state=1,
                               max_iter=750,
                               early_stopping=True,
                               warm_start=True,
                               learning_rate_init=OPTIMAL_LEARNING_RATE,
                               hidden_layer_sizes=(100,),
                               alpha=0.001
                               )

In [235]:
final_regressor.fit(train_X, train_y)

In [236]:
final_regressor.predict(test_embeddings)

array([ 0.37364078, -0.44006145,  0.02445169], dtype=float32)
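
Jester ratings range from -10 to +10, so the predictions are easier to read next to a reference value. This is a sketch that ranks the three predictions and compares them with the mean over all rated jokes; the numbers are copied from the outputs above and the short labels are made up for readability.

```python
# Values copied from the notebook outputs; labels are shorthand for the three jokes.
predicted = {"control freak": 0.37364078, "hipster": -0.44006145, "dog": 0.02445169}
mean_of_rated_jokes = 1.619898   # "mean" row of ratings_frame.describe()

ranking = sorted(predicted.items(), key=lambda item: item[1], reverse=True)
for label, score in ranking:
    relation = "above" if score > mean_of_rated_jokes else "below"
    print(f"{label:14s} {score:+.2f}  ({relation} the dataset mean)")
```

All three predictions fall below the dataset mean, with the knock-knock joke ranked highest and the hipster joke lowest.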

SOURCES:
- SIIIW LECTURE
- documentation: scikit learn, plotly, numpy
- https://stackoverflow.com/questions/64516701/how-to-plot-correctly-loss-curves-for-training-and-validation-sets
- https://towardsdatascience.com/how-to-choose-the-optimal-learning-rate-for-neural-networks-362111c5c783
- https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/


LIBRARIES USED:
- pandas
- sentence_transformers
- scikit_learn
- numpy
- plotly