# 03 - Predict using GPT2 Model

This notebook uses the GPT-2 model trained in the previous steps to generate predictions on the validation set and then evaluates them.

Author:
- Santosh Yadaw
- santoshyadawprl@gmail.com

## a. Setup

In [2]:
import os
import ast
import random
import logging

from tqdm.auto import tqdm
import pandas as pd
# import spacy
from scipy.spatial.distance import cosine

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tqdm.pandas()

In [3]:
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()  # root logger, so messages keep the "INFO:root:" prefix

In [4]:
# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
logger.info(f"device: {device}")

INFO:root:device: cuda


In [5]:
# Constants
HOME_PATH = os.path.split(os.getcwd())[0]
logger.info(f"HOME_PATH: {HOME_PATH}")

SPLIT_DATA_PATH = os.path.join(HOME_PATH,"data","processed","split_data.csv")
logger.info(f"SPLIT_DATA_PATH: {SPLIT_DATA_PATH}")

# Set the path to save gpt2 model
MODEL_PATH = os.path.join(HOME_PATH, "models")
logger.info(f"model_path: {MODEL_PATH}")

# GPT-2 inference constants
MAX_LENGTH = 100
NUM_RETURN_SEQUENCE = 1
NO_REPEAT_NGRAM_SIZE = 2
REPETITION_PENALTY = 1.5
TOP_P = 0.92
TEMPERATURE = 0.85
DO_SAMPLE = True
TOP_K = 125
EARLY_STOPPING = True

INFO:root:HOME_PATH: /home/jupyter/text-gen
INFO:root:SPLIT_DATA_PATH: /home/jupyter/text-gen/data/processed/split_data.csv
INFO:root:model_path: /home/jupyter/text-gen/models
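
The sampling constants in the cell above (`TEMPERATURE`, `TOP_K`, `TOP_P`) all reshape the next-token distribution before a token is drawn. The following is a minimal pure-Python sketch of that idea, not the `transformers` implementation: the function name `filter_next_token_probs` and the toy logits are hypothetical and only serve as illustration.

```python
import math

def filter_next_token_probs(logits, temperature=0.85, top_k=125, top_p=0.92):
    """Toy illustration of temperature, top-k and top-p (nucleus) filtering.

    `logits` is a hypothetical dict mapping candidate tokens to raw scores.
    Returns the renormalised probabilities of the tokens that survive.
    """
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Top-k: keep only the k highest-scoring tokens
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability)
    max_score = ranked[0][1]
    exps = [(tok, math.exp(score - max_score)) for tok, score in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1 before sampling
    kept_total = sum(p for _, p in nucleus)
    return {tok: p / kept_total for tok, p in nucleus}

# Made-up scores for four candidate tokens
toy_logits = {"great": 3.0, "easy": 2.0, "tablet": 1.0, "banana": -2.0}
print(filter_next_token_probs(toy_logits))
```

Lower `top_k` or `top_p` values remove more of the unlikely tokens, which tends to make the generated text more conservative.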


In [6]:
# Load validation data
data = pd.read_csv(SPLIT_DATA_PATH)
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
data_val = data[data["split"] == "val"].copy()
data_val["text"] = data_val["text"].astype(str)
data_val.head()


Unnamed: 0,text,split
42218,bought media room great faster previous version,val
42219,second kindle would lost without convenient th...,val
42220,got wife loves easy read loves fact carry book,val
42221,every year never run,val
42222,works great watching tv shows plugged right ea...,val


In [7]:
# Loading trained model and tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained(MODEL_PATH)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(MODEL_PATH)

In [8]:
# Prep data for inference: keep only the first few words of each review as the prompt
NUM_PROMPT_WORDS = 4

def truncate_text(text: str, num_words: int = NUM_PROMPT_WORDS) -> str:
    # Split on single spaces
    text_list_split = text.split(" ")

    # Keep the first `num_words` words (fixed at 4, not random)
    text_list_trunc = text_list_split[:num_words]

    # Join the retained words back into a prompt string
    return " ".join(text_list_trunc)

data_val["trunc_text"] = data_val["text"].progress_apply(truncate_text)

  0%|          | 0/4691 [00:00<?, ?it/s]



## b. Inference

In [9]:
# Generate inference

# Create a list for trunc text
trunc_list = data_val["trunc_text"].to_list()

def get_inference_gpt2(text: str) -> str:
    # Encode the prompt and place it on the same device as the model
    text_ids = gpt2_tokenizer.encode(text, return_tensors="pt").to(gpt2_model.device)

    # Sample one continuation of the prompt
    # (early_stopping only affects beam search, so it has no effect with pure sampling)
    generated_text_samples = gpt2_model.generate(
        text_ids,
        max_length=MAX_LENGTH,
        num_return_sequences=NUM_RETURN_SEQUENCE,
        no_repeat_ngram_size=NO_REPEAT_NGRAM_SIZE,
        repetition_penalty=REPETITION_PENALTY,
        top_p=TOP_P,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_k=TOP_K,
        early_stopping=EARLY_STOPPING,
    )

    # Decode the first (and only) returned sequence back into text
    return gpt2_tokenizer.decode(generated_text_samples[0], skip_special_tokens=True)

# Get res
res = []

for review in tqdm(trunc_list):
    res.append(get_inference_gpt2(review))
    
    
# Add back to original dataframe
data_val["gpt_text_gen"] = res

  0%|          | 0/4691 [00:00<?, ?it/s]



## c. Evaluation

- Jaccard similarity
- Cosine similarity on embeddings: a measure of how semantically similar the model output is to the reference text

### i. Jaccard Similarity

The Jaccard similarity coefficient treats the two objects being compared as sets. It is defined as the size of the intersection of the two sets divided by the size of their union. The intent here is to measure how many of the words generated by GPT-2 also appear in the original sentence: the higher the ratio, the larger the overlap. Note that the result depends on what is passed in. Lists of words give word-level overlap, whereas raw strings are compared character by character.
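
As a small worked example of the definition above, the sketch below computes the score both on word lists and on raw strings for one pair taken from the validation samples shown in this notebook. The helper name `jaccard` is only illustrative. Under these assumptions, the character-level value comes out at 8/15, which is the same as the `jaccard_score` of 0.533333 reported for this row further down.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, compared as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

original = "every year never run"
generated = "every year never run batteries great price"

# Word-level: split into tokens first
word_score = jaccard(original.split(), generated.split())

# Character-level: passing raw strings makes set() iterate over characters
char_score = jaccard(original, generated)

print(f"word-level: {word_score:.4f}")  # 4 shared words / 7 distinct words
print(f"char-level: {char_score:.4f}")  # 8 shared characters / 15 distinct characters
```

The two numbers answer different questions, so it is worth being explicit about which one a reported score refers to.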

In [10]:
# Helper functions
def jaccard_similarity(x, y):
    """Return the Jaccard similarity between two collections, compared as sets."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    # Two empty inputs have an empty union; treat them as identical
    if not union:
        return 1.0
    return len(set_x & set_y) / len(union)

def corpus(text: str) -> list:
    # Split the text into a list of whitespace-separated words
    return text.split()

def count_words(text_list: list) -> int:
    # Number of words in an already split text
    return len(text_list)

# Print the original text, the prompt and the generated text for one row
def view_generated_samples(index: int, data: pd.DataFrame):
    row = data.iloc[index]
    original_text = " ".join(row["text_lists"])
    print(f"Original text: {original_text}")
    print(f"input_words: {row['trunc_text']}")
    print(f"gpt2_text generated: {row['gpt_text_gen']}")
    print("\n")

In [11]:
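# Calculate jaccard similarity
# NOTE: "text" and "gpt_text_gen" are plain strings, so the sets built inside
# jaccard_similarity contain characters, not words. The scores reported below are
# therefore character-level; pass x["text"].split() and x["gpt_text_gen"].split()
# instead to obtain a word-level score.
data_val["jaccard_score"] = data_val.progress_apply(lambda x: jaccard_similarity(x["text"], x["gpt_text_gen"]), axis=1)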

  0%|          | 0/4691 [00:00<?, ?it/s]



In [13]:
# Split the original text into list of words then count
data_val["text_lists"] = data_val["text"].progress_apply(corpus)
data_val["word_count"] = data_val["text_lists"].progress_apply(count_words)

  0%|          | 0/4691 [00:00<?, ?it/s]



  0%|          | 0/4691 [00:00<?, ?it/s]



In [22]:
# Write down results using Jaccard
data_val.describe()

Unnamed: 0,jaccard_score,word_count
count,4691.0,4691.0
mean,0.806629,14.3123
std,0.128142,15.933587
min,0.055556,1.0
25%,0.727273,7.0
50%,0.809524,10.0
75%,0.888889,16.0
max,1.0,401.0


#### Explore samples with higher than average Jaccard similarity score

In [23]:
# Sample those with higher than average jaccard similarity score
mean_score = data_val.describe()["jaccard_score"]["mean"]
data_val_higher_jac_score = data_val[data_val["jaccard_score"] > mean_score]
data_val_higher_jac_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count
42219,second kindle would lost without convenient th...,val,second kindle would lost,second kindle would lost without kindles kinde...,0.904762,"[second, kindle, would, lost, without, conveni...",13
42226,gave echo 5 stars like amazon products ultimat...,val,gave echo 5 stars,gave echo 5 stars could take little bit time f...,0.807692,"[gave, echo, 5, stars, like, amazon, products,...",43
42227,love pricing quality always buy amazon batteries,val,love pricing quality always,love pricing quality always order,0.818182,"[love, pricing, quality, always, buy, amazon, ...",7
42229,tablet makes reading watching video enjoyable,val,tablet makes reading watching,tablet makes reading watching shows easy enjoy,0.952381,"[tablet, makes, reading, watching, video, enjo...",6
42230,great device amazon way go great quality shows,val,great device amazon way,great device amazon way go great streaming vid...,0.809524,"[great, device, amazon, way, go, great, qualit...",8
...,...,...,...,...,...,...,...
46902,bit skeptical first purchasing device roku gla...,val,bit skeptical first purchasing,bit skeptical first purchasing amazonbasics pr...,0.833333,"[bit, skeptical, first, purchasing, device, ro...",22
46903,wife loves neat works info endless music optio...,val,wife loves neat works,wife loves neat works videos well amazon prime,0.809524,"[wife, loves, neat, works, info, endless, musi...",10
46905,always happy amazon didnt disappoint work grea...,val,always happy amazon didnt,always happy amazon didnt disappoint job great...,0.863636,"[always, happy, amazon, didnt, disappoint, wor...",9
46906,im giving three stars havent used much watch s...,val,im giving three stars,im giving three stars instead five seems silly...,0.875000,"[im, giving, three, stars, havent, used, much,...",34


In [24]:
# Getting the statistics
data_val_higher_jac_score.describe()

Unnamed: 0,jaccard_score,word_count
count,2384.0,2384.0
mean,0.903722,12.459732
std,0.069567,11.757412
min,0.807692,1.0
25%,0.842105,6.0
50%,0.888889,9.0
75%,1.0,14.0
max,1.0,159.0


In [25]:
# Look at some samples
view_generated_samples(0, data_val_higher_jac_score)
view_generated_samples(10, data_val_higher_jac_score)
view_generated_samples(-1, data_val_higher_jac_score)

Original text: second kindle would lost without convenient throw purse take along wherever go love
input_words: second kindle would lost
gpt2_text generated: second kindle would lost without kindles kindel best purchase ive made dont know missing


Original text: good kids looking reasonable cost
input_words: good kids looking reasonable
gpt2_text generated: good kids looking reasonable cost tablet works well


Original text: bought kids really love
input_words: bought kids really love
gpt2_text generated: bought kids really love easy use




#### Explore samples with lower than average Jaccard similarity score

In [26]:
# Sample those with lower than average jaccard similarity score
mean_score = data_val.describe()["jaccard_score"]["mean"]
data_val_low_jac_score = data_val[data_val["jaccard_score"] < mean_score]
data_val_low_jac_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count
42218,bought media room great faster previous version,val,bought media room great,bought media room great sound wanted,0.789474,"[bought, media, room, great, faster, previous,...",7
42220,got wife loves easy read loves fact carry book,val,got wife loves easy,got wife loves easy use simple set,0.619048,"[got, wife, loves, easy, read, loves, fact, ca...",9
42221,every year never run,val,every year never run,every year never run batteries great price,0.533333,"[every, year, never, run]",4
42222,works great watching tv shows plugged right ea...,val,works great watching tv,works great watching tv shows netflix amazon p...,0.708333,"[works, great, watching, tv, shows, plugged, r...",9
42223,know bluetooth think auxiliary port older spea...,val,know bluetooth think auxiliary,know bluetooth think auxiliary port since tabl...,0.782609,"[know, bluetooth, think, auxiliary, port, olde...",7
...,...,...,...,...,...,...,...
46896,seems like quality varies batts work great las...,val,seems like quality varies,seems like quality varies based needs use,0.772727,"[seems, like, quality, varies, batts, work, gr...",12
46899,easy setup use like picture screen bought 5 ga...,val,easy setup use like,easy setup use like asking weather traffic etc,0.739130,"[easy, setup, use, like, picture, screen, boug...",12
46900,duds bought christmas could stick toys one wor...,val,duds bought christmas could,duds bought christmas couldnt happier,0.782609,"[duds, bought, christmas, could, stick, toys, ...",35
46904,features old rca tablet memory battery power p...,val,features old rca tablet,features old rca tablet different brands amazo...,0.727273,"[features, old, rca, tablet, memory, battery, ...",12


In [27]:
# Getting the statistics
data_val_low_jac_score.describe()

Unnamed: 0,jaccard_score,word_count
count,2307.0,2307.0
mean,0.706296,16.226701
std,0.092617,19.138406
min,0.055556,1.0
25%,0.666667,7.0
50%,0.727273,11.0
75%,0.769231,19.0
max,0.8,401.0


In [28]:
# Printing some samples
view_generated_samples(0, data_val_low_jac_score)
view_generated_samples(10, data_val_low_jac_score)
view_generated_samples(-1, data_val_low_jac_score)

Original text: bought media room great faster previous version
input_words: bought media room great
gpt2_text generated: bought media room great sound wanted


Original text: intending buy rechargeables bought mistake work fine though
input_words: intending buy rechargeables bought
gpt2_text generated: intending buy rechargeables bought amazonbasics use minor devices work great cost savings pretty decent quality


Original text: like bigger screen size allows read books without straining eyes allows text displayed
input_words: like bigger screen size
gpt2_text generated: like bigger screen size easy read books overall satisfied




### Overall observation using Jaccard Similarity Score

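1. The average Jaccard similarity score on the validation set is about 0.81, i.e. on average the generated and original texts share roughly 81% of their set elements. This looks like a good score, but it should be read with care: every generated text starts with the same four prompt words as its original, which pushes the score up, and because `jaccard_similarity` is applied to raw strings the sets are built from characters rather than words, which overstates word-level overlap.
2. In general, the Jaccard score is higher when the original sentence is shorter: the above-average group has a mean length of about 12.5 words, versus about 16.2 words for the below-average group.
3. Limitations of Jaccard similarity:
- It works on unweighted sets, so it ignores how often each element occurs and may not reflect the strength of the similarity
- It does not consider the order or the context of the words, so it may miss semantically equivalent variations generated by GPT-2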

### ii. Semantic Similarity Search - Word2vec Cosine Similarity

One of the pitfalls of Jaccard similarity is that it does not take the semantic meaning of the sentences into account. Language offers many ways to express the same thing, so two sentences can have the same meaning while being worded differently. We can therefore embed both texts and calculate the cosine similarity (a measure of the similarity between two vectors) between the original and the GPT-2 generated text.

To calculate this similarity, we use a pretrained spaCy model as a word2vec-style embedder: it produces one embedding for the original text and one for the GPT-2 generated text, each being the average of the token vectors. We then compare the two embeddings via cosine similarity.
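
The two steps described above (average the word vectors, then compare with cosine similarity) can be sketched in plain Python. The 3-dimensional vectors and the names `average_vectors` and `cosine_similarity` are made up for illustration; real embeddings have far more dimensions. Note that `scipy.spatial.distance.cosine` returns a distance, which is why the notebook computes `1 - cosine(...)` to get a similarity.

```python
import math

def average_vectors(vectors):
    """Element-wise mean of equally sized vectors (a simple sentence embedding)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors, made up for illustration only
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "good": [0.8, 0.2, 0.1],
    "tablet": [0.0, 0.9, 0.3],
    "device": [0.1, 0.8, 0.4],
}

sentence_one = average_vectors([toy_vectors[w] for w in "great tablet".split()])
sentence_two = average_vectors([toy_vectors[w] for w in "good device".split()])
print(cosine_similarity(sentence_one, sentence_two))
```

The two toy sentences share no words, so their word-level Jaccard score would be 0, yet their averaged vectors point in almost the same direction. This is the kind of similarity the embedding-based score is meant to capture.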

In [30]:
import spacy

In [31]:
# Helper functions
# Create a sentence embedding from the spaCy pipeline
def generate_word2vec_embedding(sentence: str):
    # Doc.vector is the average of the token vectors
    return nlp(sentence).vector

def calculate_cosine_similarity_score(sentence_one: str, sentence_two: str):
    # encode the sentences into embeddings
    sentence_one_emb = generate_word2vec_embedding(sentence_one)
    sentence_two_emb = generate_word2vec_embedding(sentence_two)
    
    # calculate cosine similarity score
    cos_sim_score = 1 - cosine(sentence_one_emb, sentence_two_emb)
    return cos_sim_score

In [32]:
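# Load the pretrained spaCy pipeline used to generate the embeddings
# Note: en_core_web_sm ships without static word vectors, so .vector is built from
# context-sensitive tensors; en_core_web_md or en_core_web_lg include word vectors
nlp = spacy.load("en_core_web_sm")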

In [33]:
# Calculate cosine similarity score
data_val["cos_sim_score"] = data_val.progress_apply(lambda x: calculate_cosine_similarity_score(x["text"], x["gpt_text_gen"]), axis=1)

  0%|          | 0/4691 [00:00<?, ?it/s]



In [34]:
# Statistics on cosine similarity
data_val.describe()

Unnamed: 0,jaccard_score,word_count,cos_sim_score
count,4691.0,4691.0,4691.0
mean,0.806629,14.3123,0.782896
std,0.128142,15.933587,0.130124
min,0.055556,1.0,0.021825
25%,0.727273,7.0,0.706458
50%,0.809524,10.0,0.782801
75%,0.888889,16.0,0.856873
max,1.0,401.0,1.0


#### Explore samples with higher than average cosine similarity score

In [36]:
# Sample those with higher than average cosine similarity score
mean_score = data_val.describe()["cos_sim_score"]["mean"]
data_val_high_cos_sim_score = data_val[data_val["cos_sim_score"] > mean_score]
data_val_high_cos_sim_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count,cos_sim_score
42220,got wife loves easy read loves fact carry book,val,got wife loves easy,got wife loves easy use simple set,0.619048,"[got, wife, loves, easy, read, loves, fact, ca...",9,0.891198
42221,every year never run,val,every year never run,every year never run batteries great price,0.533333,"[every, year, never, run]",4,0.796730
42225,great tablet lite portable exceptionally fast ...,val,great tablet lite portable,great tablet lite portable resolution makes pe...,0.666667,"[great, tablet, lite, portable, exceptionally,...",14,0.790777
42227,love pricing quality always buy amazon batteries,val,love pricing quality always,love pricing quality always order,0.818182,"[love, pricing, quality, always, buy, amazon, ...",7,0.801396
42228,keeps busy great tablet always home bored stuc...,val,keeps busy great tablet,keeps busy great tablet money,0.695652,"[keeps, busy, great, tablet, always, home, bor...",14,0.812312
...,...,...,...,...,...,...,...,...
46898,friend purchased kindle really impressed ease ...,val,friend purchased kindle really,friend purchased kindle really like setup extr...,0.869565,"[friend, purchased, kindle, really, impressed,...",19,0.853455
46902,bit skeptical first purchasing device roku gla...,val,bit skeptical first purchasing,bit skeptical first purchasing amazonbasics pr...,0.833333,"[bit, skeptical, first, purchasing, device, ro...",22,0.804096
46904,features old rca tablet memory battery power p...,val,features old rca tablet,features old rca tablet different brands amazo...,0.727273,"[features, old, rca, tablet, memory, battery, ...",12,0.809305
46905,always happy amazon didnt disappoint work grea...,val,always happy amazon didnt,always happy amazon didnt disappoint job great...,0.863636,"[always, happy, amazon, didnt, disappoint, wor...",9,0.947824


In [37]:
data_val_high_cos_sim_score.describe()

Unnamed: 0,jaccard_score,word_count,cos_sim_score
count,2344.0,2344.0,2344.0
mean,0.862033,12.073805,0.882647
std,0.113807,15.427113,0.077448
min,0.428571,1.0,0.782942
25%,0.777778,6.0,0.816977
50%,0.863636,8.0,0.856913
75%,1.0,13.0,0.981396
max,1.0,401.0,1.0


In [38]:
# Printing some samples
view_generated_samples(0, data_val_high_cos_sim_score)
view_generated_samples(10, data_val_high_cos_sim_score)
view_generated_samples(-1, data_val_high_cos_sim_score)

Original text: got wife loves easy read loves fact carry book
input_words: got wife loves easy
gpt2_text generated: got wife loves easy use simple set


Original text: smart amazon echo enjoying theses amazon echo life much easy excellent amazon echo
input_words: smart amazon echo enjoying
gpt2_text generated: smart amazon echo enjoying far great device


Original text: bought kids really love
input_words: bought kids really love
gpt2_text generated: bought kids really love easy use




#### Explore samples with lower than average cosine similarity score

In [39]:
# Sample those with lower than average cosine similarity score
mean_score = data_val.describe()["cos_sim_score"]["mean"]
data_val_low_cos_sim_score = data_val[data_val["cos_sim_score"] < mean_score]
data_val_low_cos_sim_score

Unnamed: 0,text,split,trunc_text,gpt_text_gen,jaccard_score,text_lists,word_count,cos_sim_score
42218,bought media room great faster previous version,val,bought media room great,bought media room great sound wanted,0.789474,"[bought, media, room, great, faster, previous,...",7,0.744388
42219,second kindle would lost without convenient th...,val,second kindle would lost,second kindle would lost without kindles kinde...,0.904762,"[second, kindle, would, lost, without, conveni...",13,0.660017
42222,works great watching tv shows plugged right ea...,val,works great watching tv,works great watching tv shows netflix amazon p...,0.708333,"[works, great, watching, tv, shows, plugged, r...",9,0.781023
42223,know bluetooth think auxiliary port older spea...,val,know bluetooth think auxiliary,know bluetooth think auxiliary port since tabl...,0.782609,"[know, bluetooth, think, auxiliary, port, olde...",7,0.733885
42224,good price batteries seem good quality price r...,val,good price batteries seem,good price batteries seem work well name brand...,0.739130,"[good, price, batteries, seem, good, quality, ...",8,0.779305
...,...,...,...,...,...,...,...,...
46900,duds bought christmas could stick toys one wor...,val,duds bought christmas could,duds bought christmas couldnt happier,0.782609,"[duds, bought, christmas, could, stick, toys, ...",35,0.609053
46901,really quick service glad discover amazon carr...,val,really quick service glad,really quick service glad bought item,0.909091,"[really, quick, service, glad, discover, amazo...",12,0.748612
46903,wife loves neat works info endless music optio...,val,wife loves neat works,wife loves neat works videos well amazon prime,0.809524,"[wife, loves, neat, works, info, endless, musi...",10,0.641326
46906,im giving three stars havent used much watch s...,val,im giving three stars,im giving three stars instead five seems silly...,0.875000,"[im, giving, three, stars, havent, used, much,...",34,0.664652


In [40]:
data_val_low_cos_sim_score.describe()

Unnamed: 0,jaccard_score,word_count,cos_sim_score
count,2347.0,2347.0,2347.0
mean,0.751296,16.547934,0.683272
std,0.117308,16.119669,0.089379
min,0.055556,1.0,0.021825
25%,0.695652,8.0,0.643253
50%,0.764706,12.0,0.706483
75%,0.826087,20.0,0.748282
max,1.0,163.0,0.782855


In [41]:
# Printing some samples
view_generated_samples(0, data_val_low_cos_sim_score)
view_generated_samples(20, data_val_low_cos_sim_score)
view_generated_samples(-2, data_val_low_cos_sim_score)

Original text: bought media room great faster previous version
input_words: bought media room great
gpt2_text generated: bought media room great sound wanted


Original text: love stick kinda slow navigating one much faster going use cancel cable told cable company wanted cancel going stream everything cut bundle cost 100 kept cable also still use firetv alot
input_words: love stick kinda slow
gpt2_text generated: love stick kinda slow navigating used voice search


Original text: im giving three stars havent used much watch shows moviesive aprox month sometimes screen goes blank idea whyis tabletor maybe appclueless im making big deal bought black friday money waste dont think id buy
input_words: im giving three stars
gpt2_text generated: im giving three stars instead five seems silly restrictions used many ways cut cable bill




### Overall observations on Cosine Similarity Score
1. The average cosine similarity score between the original and GPT-2 generated text on the validation data is around 0.78, with a minimum score of about 0.02 and a maximum score of 1.0.
2. As with the Jaccard similarity score, the cosine similarity score of the GPT-2 generated text is higher when the original sentences have fewer words (a mean of about 12.1 words in the above-average group versus about 16.5 in the below-average group).

## Improvements
1. Overall, the generated text is not identical to the original text. This is expected since we only trained the model for 6 epochs and the loss had not yet converged.
2. Splitting the dataset: we could split the data in a way that keeps each split representative. For example, we could use a sentence transformer model to generate embeddings, cluster them to group the data, and then sample systematically from each group rather than splitting randomly.
3. We could first fine-tune the model on a larger reviews dataset and then fine-tune it on the current dataset.
4. Cosine similarity on pretrained word2vec-style embeddings may not be the best way to evaluate the quality of the generated text, since it reduces each pair to a single quantitative score. A more qualitative assessment (coherence etc.) may be needed alongside it, as well as complementary metrics such as BLEU or ROUGE.
5. Use sentence transformers to generate the embeddings rather than word2vec.
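
To give an idea of what an overlap metric such as ROUGE adds, here is a simplified pure-Python sketch of ROUGE-1 (unigram overlap) applied to one pair from the samples above. It is an illustration under stated assumptions, not a reference implementation: the function name `rouge1` is hypothetical, there is no stemming or other normalisation, and a dedicated evaluation package would normally be used instead.

```python
from collections import Counter

def rouge1(reference: str, candidate: str):
    """Unigram-overlap precision, recall and F1 (a simplified ROUGE-1)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "bought media room great faster previous version"
candidate = "bought media room great sound wanted"
precision, recall, f1 = rouge1(reference, candidate)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Unlike the set-based Jaccard score, this variant counts repeated words and reports precision and recall separately, which shows whether the generated text is too short or adds unrelated words.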

## END