# 03 - Predict using GPT2 Model

This notebook walks through using the GPT-2 model trained in the previous steps to generate predictions.

Author:
- Santosh Yadaw
- santoshyadawprl@gmail.com

## a. Setup

In [16]:
import os
import ast
import random
import logging

from tqdm.auto import tqdm
import pandas as pd
import spacy
from scipy.spatial.distance import cosine

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tqdm.pandas()

In [2]:
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging

In [3]:
# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logger.info(f"device: {device}")

INFO:root:device: cuda


In [4]:
# Constants
HOME_PATH = os.path.split(os.getcwd())[0]
logger.info(f"HOME_PATH: {HOME_PATH}")

SPLIT_DATA_PATH = os.path.join(HOME_PATH,"data","processed","split_data.csv")
logger.info(f"SPLIT_DATA_PATH: {SPLIT_DATA_PATH}")

# Set the path to save gpt2 model
MODEL_PATH = os.path.join(HOME_PATH, "models")
logger.info(f"model_path: {MODEL_PATH}")

# GPT Inference constants
MAX_LENGTH = 100
NUM_RETURN_SEQUENCE = 1
NO_REPEAT_NGRAM_SIZE = 2
REPETITION_PENALTY = 1.5
TOP_P = 0.92
TEMPERATURE = 0.85
DO_SAMPLE = True
TOP_K = 125
EARLY_STOPPING = True

INFO:root:HOME_PATH: /home/jupyter/text-gen
INFO:root:SPLIT_DATA_PATH: /home/jupyter/text-gen/data/processed/split_data.csv
INFO:root:model_path: /home/jupyter/text-gen/models


In [5]:
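# Load validation data
data = pd.read_csv(SPLIT_DATA_PATH)
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
data_val = data[data["split"] == "val"].copy()
# Note: astype(str) turns missing reviews into the literal string "nan"
data_val["text"] = data_val["text"].astype(str)
data_val.head()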

Unnamed: 0,text,split
42218,bought media room great faster previous version,val
42219,second kindle would lost without convenient th...,val
42220,got wife loves easy read loves fact carry book,val
42221,every year never run,val
42222,works great watching tv shows plugged right ea...,val


In [6]:
# Loading trained model and tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained(MODEL_PATH)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(MODEL_PATH)

In [7]:
# Prep data for inference: keep only the first few words of each review as the prompt
def truncate_text(text: str, num_words: int = 4) -> str:
    # Split by space
    text_list_split = text.split(" ")

    # Retain the first `num_words` words (shorter reviews are kept whole)
    text_list_trunc = text_list_split[:num_words]

    # Return
    return " ".join(text_list_trunc)

data_val["trunc_text"] = data_val["text"].progress_apply(truncate_text)

  0%|          | 0/4691 [00:00<?, ?it/s]
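One consequence of keeping the first four words is worth checking: a review with four or fewer words is passed to the model unchanged, so the prompt already contains the whole reference text. The sketch below illustrates this with a hypothetical standalone helper (`first_n_words` is not part of the notebook; it only mirrors the truncation step) and two reviews taken from the validation preview.

```python
def first_n_words(text: str, n: int = 4) -> str:
    """Keep the first n space-separated words (mirrors the truncation step)."""
    return " ".join(text.split(" ")[:n])

long_review = "bought media room great faster previous version"
short_review = "bought kids really love"

# For a long review the prompt is a strict prefix of the review
print(first_n_words(long_review))

# For a review of four words or fewer the prompt is the entire review
print(first_n_words(short_review))
print(first_n_words(short_review) == short_review)
```

Rows like the second one can reach a perfect similarity score without the model generating anything new, so it may be worth reporting them separately when interpreting the evaluation.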


## b. Inference

In [8]:
# Generate inference

# Create a list for trunc text
trunc_list = data_val["trunc_text"].to_list()

def get_inference_gpt2(text: str) -> str:
    # Encode the text using the tokenizer and place it on the same device as the model
    text_ids = gpt2_tokenizer.encode(text, return_tensors="pt").to(gpt2_model.device)

    # Sample one continuation of the prompt
    # Note: early_stopping only has an effect when beam search is used
    generated_text_samples = gpt2_model.generate(
        text_ids,
        max_length=MAX_LENGTH,
        num_return_sequences=NUM_RETURN_SEQUENCE,
        no_repeat_ngram_size=NO_REPEAT_NGRAM_SIZE,
        repetition_penalty=REPETITION_PENALTY,
        top_p=TOP_P,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_k=TOP_K,
        early_stopping=EARLY_STOPPING,
    )

    return gpt2_tokenizer.decode(generated_text_samples[0], skip_special_tokens=True)

# Get res
res = []

for review in tqdm(trunc_list):
    res.append(get_inference_gpt2(review))
    
    
# Add back to original dataframe
data_val["gpt_text_gen"] = res

  0%|          | 0/4691 [00:00<?, ?it/s]
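The sampling constants used above (`TEMPERATURE`, `TOP_K`, `TOP_P`) control which tokens the model may pick at each step. The sketch below shows the idea on a toy distribution in plain Python. It is a simplified illustration under stated assumptions, not the `transformers` implementation: the logits are hypothetical, and `softmax` and `top_k_top_p_filter` are helper names introduced only for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k, top_p):
    """Return the token indices that survive top-k, then top-p filtering."""
    # Rank token indices from most to least probable and keep the top_k best
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise over the survivors, then keep the smallest prefix whose
    # cumulative probability reaches top_p
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    return kept

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax(toy_logits, temperature=0.85)

print(top_k_top_p_filter(probs, top_k=3, top_p=0.92))
print(top_k_top_p_filter(probs, top_k=3, top_p=0.80))
```

A temperature below 1 sharpens the distribution, top-k caps the number of candidates, and a lower top-p shrinks the candidate set further; the next token is then sampled from whatever survives.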


## c. Evaluation

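- Jaccard similarity: measures how much the generated text overlaps with the original text
- Semantic similarity: measures how semantically similar the model output and the reference text are, using cosine similarity between embeddings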

### i. Jaccard Similarity

The Jaccard similarity coefficient treats the two inputs as sets. It is defined as the size of the intersection of the two sets divided by the size of their union. We use it to measure how much of the text generated by GPT-2 overlaps with the original sentence: the higher the ratio, the more similar the two texts are. Note that the scoring cell below passes raw strings to the helper, so the sets are built from characters rather than words.
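To make the definition concrete, here is a small sketch that scores one pair from the validation set ("works" against "works great value") at both word level and character level. The `jaccard` helper is a standalone stand-in written for this example. The character-level value it prints matches the 0.384615 reported for that row further down, which is how the scores in this notebook were computed.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union, inputs treated as sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

original = "works"
generated = "works great value"

word_level = jaccard(original.split(), generated.split())  # sets of words
char_level = jaccard(original, generated)                  # sets of characters

print(f"word-level: {word_level:.6f}")
print(f"char-level: {char_level:.6f}")
```

The word-level score is lower because only one of three distinct words is shared, whereas short strings share most of their characters almost by default.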

In [31]:
# Helper functions
def jaccard_similarity(x, y):
    """Return the Jaccard similarity between two iterables, treated as sets.

    Pass lists of words for a word-level score; a raw string is treated as a set of characters.
    """
    set_x, set_y = set(x), set(y)
    intersection_cardinality = len(set_x & set_y)
    union_cardinality = len(set_x | set_y)

    return intersection_cardinality / float(union_cardinality)

def corpus(text: str):
    # Split a sentence into a list of words
    return text.split()

def count_words(text_list: list):
    # Number of words in an already-split sentence
    return len(text_list)

# Printing some examples
def view_generated_samples(index: int, data: pd.DataFrame):
    original_text = " ".join(data.iloc[index]["text_lists"])
    print(f"Original text: {original_text}")
    # Note: the [1:] slice drops the first prompt word, so the printed input omits it
    input_words = " ".join(data.iloc[index]["trunc_text"].split(" ")[1:])
    print(f"input_words: {input_words}")
    gpt2_text = data.iloc[index]["gpt_text_gen"]
    print(f"gpt2_text generated: {gpt2_text}")
    print("\n")

In [11]:
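# Calculate jaccard similarity
# Note: "text" and "gpt_text_gen" are raw strings here, so the sets are built over characters, not words
data_val["jaccard_score"] = data_val.progress_apply(lambda x: jaccard_similarity(x["text"], x["gpt_text_gen"]), axis=1)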

  0%|          | 0/4691 [00:00<?, ?it/s]


In [12]:
# Write down results using Jaccard
data_val.describe()

Unnamed: 0,jaccard_score
count,4691.0
mean,0.8086
std,0.126615
min,0.05
25%,0.736842
50%,0.809524
75%,0.888889
max,1.0


In [19]:
data_val

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists
42218,bought media room great faster previous version,val,bought media room great,bought media room great picture good sound,0.842105,"[bought, media, room, great, faster, previous,..."
42219,second kindle would lost without convenient th...,val,second kindle would lost,second kindle would lost without kindles kindl...,0.900000,"[second, kindle, would, lost, without, conveni..."
42220,got wife loves easy read loves fact carry book,val,got wife loves easy,got wife loves easy use portable plenty memory...,0.772727,"[got, wife, loves, easy, read, loves, fact, ca..."
42221,every year never run,val,every year never run,every year never run batteries,0.666667,"[every, year, never, run]"
42222,works great watching tv shows plugged right ea...,val,works great watching tv,works great watching tv shows movies wish came...,0.772727,"[works, great, watching, tv, shows, plugged, r..."
...,...,...,...,...,...,...
46904,features old rca tablet memory battery power p...,val,features old rca tablet,features old rca tablet got broken decided buy...,0.760000,"[features, old, rca, tablet, memory, battery, ..."
46905,always happy amazon didnt disappoint work grea...,val,always happy amazon didnt,always happy amazon didnt disappoint nice tablet,0.809524,"[always, happy, amazon, didnt, disappoint, wor..."
46906,im giving three stars havent used much watch s...,val,im giving three stars,im giving three stars due difficulty setting m...,0.875000,"[im, giving, three, stars, havent, used, much,..."
46907,bought kids really love,val,bought kids really love,bought kids really love,1.000000,"[bought, kids, really, love]"


In [23]:
# Split the original text into list of words then count
data_val["text_lists"] = data_val["text"].progress_apply(corpus)
data_val["word_count"] = data_val["text_lists"].progress_apply(count_words)

  0%|          | 0/4691 [00:00<?, ?it/s]

  0%|          | 0/4691 [00:00<?, ?it/s]


In [24]:
data_val

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count
42218,bought media room great faster previous version,val,bought media room great,bought media room great picture good sound,0.842105,"[bought, media, room, great, faster, previous,...",7
42219,second kindle would lost without convenient th...,val,second kindle would lost,second kindle would lost without kindles kindl...,0.900000,"[second, kindle, would, lost, without, conveni...",13
42220,got wife loves easy read loves fact carry book,val,got wife loves easy,got wife loves easy use portable plenty memory...,0.772727,"[got, wife, loves, easy, read, loves, fact, ca...",9
42221,every year never run,val,every year never run,every year never run batteries,0.666667,"[every, year, never, run]",4
42222,works great watching tv shows plugged right ea...,val,works great watching tv,works great watching tv shows movies wish came...,0.772727,"[works, great, watching, tv, shows, plugged, r...",9
...,...,...,...,...,...,...,...
46904,features old rca tablet memory battery power p...,val,features old rca tablet,features old rca tablet got broken decided buy...,0.760000,"[features, old, rca, tablet, memory, battery, ...",12
46905,always happy amazon didnt disappoint work grea...,val,always happy amazon didnt,always happy amazon didnt disappoint nice tablet,0.809524,"[always, happy, amazon, didnt, disappoint, wor...",9
46906,im giving three stars havent used much watch s...,val,im giving three stars,im giving three stars due difficulty setting m...,0.875000,"[im, giving, three, stars, havent, used, much,...",34
46907,bought kids really love,val,bought kids really love,bought kids really love,1.000000,"[bought, kids, really, love]",4


#### Explore samples with a Jaccard similarity score above 0.6

In [25]:
# Sample those with a jaccard similarity score above the 0.6 cut-off
data_val_higher_jac_score = data_val[data_val["jaccard_score"] > 0.6]
data_val_higher_jac_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count
42218,bought media room great faster previous version,val,bought media room great,bought media room great picture good sound,0.842105,"[bought, media, room, great, faster, previous,...",7
42219,second kindle would lost without convenient th...,val,second kindle would lost,second kindle would lost without kindles kindl...,0.900000,"[second, kindle, would, lost, without, conveni...",13
42220,got wife loves easy read loves fact carry book,val,got wife loves easy,got wife loves easy use portable plenty memory...,0.772727,"[got, wife, loves, easy, read, loves, fact, ca...",9
42221,every year never run,val,every year never run,every year never run batteries,0.666667,"[every, year, never, run]",4
42222,works great watching tv shows plugged right ea...,val,works great watching tv,works great watching tv shows movies wish came...,0.772727,"[works, great, watching, tv, shows, plugged, r...",9
...,...,...,...,...,...,...,...
46904,features old rca tablet memory battery power p...,val,features old rca tablet,features old rca tablet got broken decided buy...,0.760000,"[features, old, rca, tablet, memory, battery, ...",12
46905,always happy amazon didnt disappoint work grea...,val,always happy amazon didnt,always happy amazon didnt disappoint nice tablet,0.809524,"[always, happy, amazon, didnt, disappoint, wor...",9
46906,im giving three stars havent used much watch s...,val,im giving three stars,im giving three stars due difficulty setting m...,0.875000,"[im, giving, three, stars, havent, used, much,...",34
46907,bought kids really love,val,bought kids really love,bought kids really love,1.000000,"[bought, kids, really, love]",4


In [26]:
# Getting the statistics
data_val_higher_jac_score.describe()

Unnamed: 0,jaccard_score,word_count
count,4471.0,4471.0
mean,0.82377,14.2051
std,0.10586,14.45981
min,0.606061,1.0
25%,0.75,7.0
50%,0.818182,10.0
75%,0.894737,16.0
max,1.0,216.0


In [32]:
# Look at some samples
view_generated_samples(0, data_val_higher_jac_score)
view_generated_samples(10, data_val_higher_jac_score)
view_generated_samples(-1, data_val_higher_jac_score)

Original text: bought media room great faster previous version
input_words: media room great
gpt2_text generated: bought media room great picture good sound


Original text: keeps busy great tablet always home bored stuck netflix play games homework log work
input_words: busy great tablet
gpt2_text generated: keeps busy great tablet price


Original text: like bigger screen size allows read books without straining eyes allows text displayed
input_words: bigger screen size
gpt2_text generated: like bigger screen size allows read books without getting unwanted companion information needs clarity external speaker system good though




#### Explore samples with a Jaccard similarity score below 0.6

In [33]:
# Sample those with a jaccard similarity score below the 0.6 cut-off
data_val_low_jac_score = data_val[data_val["jaccard_score"] < 0.6]
data_val_low_jac_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count
42294,good ole aaa batteries purchased conjunction p...,val,good ole aaa batteries,good ole aaa batteries good price convenient a...,0.592593,"[good, ole, aaa, batteries, purchased, conjunc...",29
42330,echo learns use learn ask questions echos answ...,val,echo learns use learn,echo learns use learn every day,0.523810,"[echo, learns, use, learn, ask, questions, ech...",10
42381,casual user on-demand content devices tried sm...,val,casual user on-demand content,casual user on-demand content much easier stre...,0.516129,"[casual, user, on-demand, content, devices, tr...",163
42414,last long,val,last long,last long lasting batteries good price,0.533333,"[last, long]",2
42424,love fire hd returned 32gb,val,love fire hd returned,love fire hd returned wanted space works well,0.565217,"[love, fire, hd, returned, 32gb]",5
...,...,...,...,...,...,...,...
46790,works needed,val,works needed,works needed - ask questions give directions,0.500000,"[works, needed]",2
46807,works,val,works,works great value,0.384615,[works],1
46859,replaced older kindle new one great product wi...,val,replaced older kindle new,replaced older kindle new model love,0.590909,"[replaced, older, kindle, new, one, great, pro...",11
46870,dead,val,dead,dead far good,0.375000,[dead],1


In [34]:
# Getting the statistics
data_val_low_jac_score.describe()

Unnamed: 0,jaccard_score,word_count
count,193.0,193.0
mean,0.48635,16.440415
std,0.121811,36.040744
min,0.05,1.0
25%,0.444444,2.0
50%,0.538462,6.0
75%,0.571429,18.0
max,0.592593,401.0


In [36]:
# Printing some samples
view_generated_samples(0, data_val_low_jac_score)
view_generated_samples(10, data_val_low_jac_score)
view_generated_samples(-1, data_val_low_jac_score)

Original text: good ole aaa batteries purchased conjunction pack aa battery organizer somewhat conflicted purchase rechargeable ones estimated 10 year shelf life seldom need batteries felt would much better suited needs
input_words: ole aaa batteries
gpt2_text generated: good ole aaa batteries good price convenient available times


Original text: use device daily easy use sound incredible hear itt clearly throughout 1800 sq foot home
input_words: device daily easy
gpt2_text generated: use device daily easy use better apple ipad


Original text: good products
input_words: products
gpt2_text generated: good products similar kids best far know recommend




### Overall observation using Jaccard Similarity Score

1. The mean Jaccard similarity score on the validation set is about 0.81 (median 0.81, minimum 0.05, maximum 1.0); the 0.6 used above is only the cut-off for splitting the samples. Because the score was computed on raw strings, it measures the overlap between the sets of characters rather than words, so it overstates how similar the generated text is to the original.
2. The model tends to do better when the original sentence contains more words: the median word count is 10 for samples scoring above 0.6 and 6 for samples scoring below it.
3. Limitations of Jaccard similarity:
- It treats text as sets rather than vectors, so it captures neither magnitude nor direction and may not reflect the strength of the similarity.
- It does not consider the order or the context of the words, so it may miss semantic variations that GPT-2 could generate.
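The two limitations listed above can be seen on a pair of toy examples. This is an illustrative sketch only: the sentences are made up for the demonstration and `word_jaccard` is a hypothetical word-level helper, not the function used to produce the scores in this notebook.

```python
def word_jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two sentences."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

# Same words in a different order: the meaning flips, the score stays perfect
reordered = word_jaccard("battery good screen bad", "battery bad screen good")

# Similar meaning with different words: the score drops to zero
paraphrase = word_jaccard("works great", "functions well")

print(reordered, paraphrase)
```

Word order and synonyms are exactly what an embedding-based score is meant to pick up, which motivates the next section.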

### ii. Semantic Similarity Search - Word2vec Cosine Similarity

One of the pitfalls of Jaccard similarity is that it does not take into account the semantic meaning of the sentences. In natural language there are many ways to express the same thing, so two sentences can have the same meaning but be written differently. Hence we can use embeddings and calculate the cosine similarity (a measure of the similarity between two vectors) between the original and the GPT-2 generated text.

To calculate this similarity, we use a pretrained word2vec model to generate embeddings of the original text and the GPT-2 generated text, then compare the embeddings via cosine similarity.
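For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, and `scipy.spatial.distance.cosine` returns the cosine distance, which is one minus that value. The sketch below computes the similarity by hand on hypothetical 3-dimensional vectors (real sentence embeddings have many more dimensions); `cosine_similarity` is a helper name introduced only for this example.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings chosen to show the three characteristic cases
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not on vector length, and it can be negative, which is why a slightly negative minimum can show up in the statistics.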

In [37]:
# Helper functions
# Create embeddings using simply word2vec
def generate_word2vec_embedding(sentence: str):
    # generate the average of word embeddings
    return nlp(sentence).vector

def calculate_cosine_similarity_score(sentence_one: str, sentence_two: str):
    # encode the sentences into embeddings
    sentence_one_emb = generate_word2vec_embedding(sentence_one)
    sentence_two_emb = generate_word2vec_embedding(sentence_two)
    
    # calculate cosine similarity score
    cos_sim_score = 1 - cosine(sentence_one_emb, sentence_two_emb)
    return cos_sim_score

In [39]:
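# Load the pretrained spaCy pipeline used to generate the embeddings
# Note: en_core_web_sm ships without static word vectors, so .vector averages context-sensitive tensors;
# the larger en_core_web_md / en_core_web_lg pipelines include word vectors
nlp = spacy.load("en_core_web_sm")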

In [40]:
# Calculate cosine similarity score
data_val["cos_sim_score"] = data_val.progress_apply(lambda x: calculate_cosine_similarity_score(x["text"], x["gpt_text_gen"]), axis=1)

  0%|          | 0/4691 [00:00<?, ?it/s]


In [41]:
# Statistics on cosine similarity
data_val.describe()

Unnamed: 0,jaccard_score,word_count,cos_sim_score
count,4691.0,4691.0,4691.0
mean,0.8086,14.3123,0.785016
std,0.126615,15.933587,0.128655
min,0.05,1.0,-0.051194
25%,0.736842,7.0,0.707449
50%,0.809524,10.0,0.7833
75%,0.888889,16.0,0.859025
max,1.0,401.0,1.0


#### Explore samples with a cosine similarity score above 0.51

In [42]:
# Sample those with a cosine similarity score above the 0.51 cut-off
data_val_high_cos_sim_score = data_val[data_val["cos_sim_score"] > 0.51]
data_val_high_cos_sim_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count,cos_sim_score
42218,bought media room great faster previous version,val,bought media room great,bought media room great picture good sound,0.842105,"[bought, media, room, great, faster, previous,...",7,0.846796
42219,second kindle would lost without convenient th...,val,second kindle would lost,second kindle would lost without kindles kindl...,0.900000,"[second, kindle, would, lost, without, conveni...",13,0.695357
42220,got wife loves easy read loves fact carry book,val,got wife loves easy,got wife loves easy use portable plenty memory...,0.772727,"[got, wife, loves, easy, read, loves, fact, ca...",9,0.698367
42221,every year never run,val,every year never run,every year never run batteries,0.666667,"[every, year, never, run]",4,0.874663
42222,works great watching tv shows plugged right ea...,val,works great watching tv,works great watching tv shows movies wish came...,0.772727,"[works, great, watching, tv, shows, plugged, r...",9,0.750343
...,...,...,...,...,...,...,...,...
46904,features old rca tablet memory battery power p...,val,features old rca tablet,features old rca tablet got broken decided buy...,0.760000,"[features, old, rca, tablet, memory, battery, ...",12,0.693422
46905,always happy amazon didnt disappoint work grea...,val,always happy amazon didnt,always happy amazon didnt disappoint nice tablet,0.809524,"[always, happy, amazon, didnt, disappoint, wor...",9,0.904758
46906,im giving three stars havent used much watch s...,val,im giving three stars,im giving three stars due difficulty setting m...,0.875000,"[im, giving, three, stars, havent, used, much,...",34,0.661557
46907,bought kids really love,val,bought kids really love,bought kids really love,1.000000,"[bought, kids, really, love]",4,1.000000


In [43]:
data_val_high_cos_sim_score.describe()

Unnamed: 0,jaccard_score,word_count,cos_sim_score
count,4601.0,4601.0,4601.0
mean,0.814232,14.389263,0.792371
std,0.117004,15.951529,0.117733
min,0.25,1.0,0.510404
25%,0.73913,7.0,0.712125
50%,0.809524,10.0,0.786056
75%,0.894737,16.0,0.860545
max,1.0,401.0,1.0


In [44]:
# Printing some samples
view_generated_samples(0, data_val_high_cos_sim_score)
view_generated_samples(10, data_val_high_cos_sim_score)
view_generated_samples(-1, data_val_high_cos_sim_score)

Original text: bought media room great faster previous version
input_words: media room great
gpt2_text generated: bought media room great picture good sound


Original text: keeps busy great tablet always home bored stuck netflix play games homework log work
input_words: busy great tablet
gpt2_text generated: keeps busy great tablet price


Original text: like bigger screen size allows read books without straining eyes allows text displayed
input_words: bigger screen size
gpt2_text generated: like bigger screen size allows read books without getting unwanted companion information needs clarity external speaker system good though




#### Explore samples with a cosine similarity score below 0.51

In [45]:
# Sample those with a cosine similarity score below the 0.51 cut-off
data_val_low_cos_sim_score = data_val[data_val["cos_sim_score"] < 0.51]
data_val_low_cos_sim_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count,cos_sim_score
42428,started appletv traded roku wasnt happy switch...,val,started appletv traded roku,started appletv traded roku 3 kodi box streami...,0.714286,"[started, appletv, traded, roku, wasnt, happy,...",12,0.498593
42450,loved fast great fun kids cannot put,val,loved fast great fun,loved fast great fun apps games surfing web ea...,0.666667,"[loved, fast, great, fun, kids, cannot, put]",7,0.480220
42462,,val,,nan got 7 year old could play games looooooong...,0.086957,[nan],1,0.282779
42611,thanks,val,thanks,thanks great value batteries last longer durac...,0.300000,[thanks],1,0.263774
42624,pros - must amazon prime subscribers - lots me...,val,pros - must amazon,pros - must amazon prime,0.466667,"[pros, -, must, amazon, prime, subscribers, -,...",53,0.452748
...,...,...,...,...,...,...,...,...
46730,well made,val,well made,well made purchase free space phone use books ...,0.333333,"[well, made]",2,0.327132
46739,another great idea wife didnt think would good...,val,another great idea wife,another great idea wife use kindle e-readers t...,0.818182,"[another, great, idea, wife, didnt, think, wou...",14,0.458935
46749,light easy read nice backlight feature issue h...,val,light easy read nice,light easy read nice white screen,0.652174,"[light, easy, read, nice, backlight, feature, ...",22,0.483998
46784,amazon fanboy ill say right front device altho...,val,amazon fanboy ill say,amazon fanboy ill say theyre batteries,0.592593,"[amazon, fanboy, ill, say, right, front, devic...",55,0.358004


In [46]:
data_val_low_cos_sim_score.describe()

Unnamed: 0,jaccard_score,word_count,cos_sim_score
count,90.0,90.0,90.0
mean,0.520664,10.377778,0.408996
std,0.227484,14.532859,0.100283
min,0.05,1.0,-0.051194
25%,0.317708,1.0,0.344158
50%,0.555556,2.0,0.454024
75%,0.714286,15.5,0.480582
max,0.913043,69.0,0.509371


In [50]:
# Printing some samples
view_generated_samples(0, data_val_low_cos_sim_score)
view_generated_samples(20, data_val_low_cos_sim_score)
view_generated_samples(-2, data_val_low_cos_sim_score)

Original text: started appletv traded roku wasnt happy switched amazon fire tv couldnt happier
input_words: appletv traded roku
gpt2_text generated: started appletv traded roku 3 kodi box streaming options 1 hand 4 gifts xmas worth every penny


Original text: kindle paperwhite absolutely perfection perfect picking book instantly starting left lighting realistic feels like youre reading real paper definitely bargain
input_words: paperwhite absolutely perfection
gpt2_text generated: kindle paperwhite absolutely perfection e-reader


Original text: amazon fanboy ill say right front device although nice currently limited functions especially considering 200 price tag useful functions linking google instance added give echo much pains average rating nice extra cash sitting around dont mind waiting see useful functions added without android syncing functionsor ability control exampletv channels far amazons usual gotta right category
input_words: fanboy ill say
gpt2_text generated: amazon fa

### Overall observations on Cosine Similarity Score
1. The mean cosine similarity score between the original and GPT-2 generated text on the validation data is about 0.79, with a minimum of -0.05 and a maximum of 1.0; the 0.51 used above is only the cut-off for splitting the samples.
2. As with the Jaccard similarity score, the model tends to do better when the original sentence contains more words: the median word count is 10 for samples scoring above 0.51 and 2 for samples scoring below it.
3. Limitations of cosine similarity:
- In some instances the sentence generated by GPT-2 is identical to the original text but the cosine similarity is low, which is peculiar and worth investigating.

### Which evaluation metric to use?

## Improvements
1. Splitting the dataset: we could split the data so that each split is representative. For example, use a sentence transformer model to generate embeddings, cluster them to group the data, and then sample systematically from each group rather than splitting randomly. This is motivated by the GPT-2 generated text sometimes being identical to the original text, which may indicate some form of data leakage.
2. Train the model on a larger reviews dataset first, then fine-tune it on the current dataset.
3. Pretrained word2vec embeddings may not be the best way to evaluate the quality of the generated text, since this is a purely quantitative approach. A more qualitative assessment (coherence, etc.) may also be needed to fully evaluate the GPT-2 generated text.
4. Use sentence transformers to generate embeddings rather than word2vec.
5. Try more conventional text generation evaluation metrics such as BLEU or ROUGE.
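As a starting point for the last item, the sketch below computes a simplified ROUGE-1 (unigram overlap with clipped counts) in plain Python for the first validation pair shown earlier. It is an illustration under stated assumptions rather than a reference implementation: `rouge_1` is a helper name introduced here, tokenisation is a plain whitespace split, and an established package would be preferable for reported results.

```python
from collections import Counter

def rouge_1(reference: str, candidate: str):
    """Unigram-overlap precision, recall and F1 (a simplified ROUGE-1)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "bought media room great faster previous version"
candidate = "bought media room great picture good sound"

precision, recall, f1 = rouge_1(reference, candidate)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this pair the four shared words are exactly the prompt, so it may be fairer to score only the generated continuation against the remainder of the original review.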