For this homework, make sure that you format your notebook nicely and cite all sources in the appropriate sections. Programmatically generate or embed any figures or graphs that you need.

Names: Haotian Zhang, Zain Alam

Step 1: Train your own word embeddings
--------------------------------

(describe the provided dataset that you have chosen here)

Between the two given datasets, we used the Spooky Author dataset.

Describe what data set you have chosen to compare and contrast with your chosen provided dataset. Make sure to describe where it comes from and its general properties.

(describe your dataset here)

We chose [Financial Sentiment Analysis](https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis) as our self-chosen data.

Description (copied from Kaggle): The following data is intended for advancing financial sentiment analysis research. It's two datasets (FiQA, Financial PhraseBank) combined into one easy-to-use CSV file. It provides financial sentences with sentiment labels.

Columns: 
Sentence: string
Sentiment: enum

We kept only the Sentence column of the dataset, which gives us a plain list of sentences.

In [1]:
# import your libraries here
import nltk
import numpy as np
import random
from collections import Counter
from nltk.tokenize import RegexpTokenizer
# libs used for preprocessing
from gensim.parsing.preprocessing import stem_text, remove_stopwords, strip_punctuation
from nltk import ngrams
# libs used for file reading and parsing
from csv import reader
from csv import writer

nltk.download('punkt')

#write our dataset into a csv file
def write_into_csv(data: list[list[str]], file_name: str):
  """This function takes in the data and writes it into a csv file.

  Args:
      data (List[List[str]]): list of sentences where each sentence is broken
                        into list of words.
      file_name (str): name of the file to be written
  """
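  # newline='' lets the csv module control line endings
  with open(file_name, 'w', newline='') as f:
    csv_writer = writer(f)
    csv_writer.writerows(data)

#read our dataset from our self-generated csv file
#placed here for running convenience
def read_from_csv(file_name: str) -> list[list[str]]:
  """Reads a csv file back into a list of rows (one list of tokens per sentence)."""
  data = []
  # the with block closes the file, so no explicit close() is needed
  with open(file_name, 'r', newline='') as f:
    csv_reader = reader(f)
    for row in csv_reader:
      data.append(row)
  return data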

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/zhanghaotian/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### a) Train embeddings on GIVEN dataset

In [None]:
def parse_csv(training_file_path: str, select_column: int, percentage: float) -> list[str]:
  """This function reads one column of a csv file
  and returns a random sample holding the requested fraction of its rows.

  Args:
      training_file_path (str): path to the csv file
      select_column (int): index of the column to be selected from the dataset
      percentage (float): fraction of the dataset needed, between 0 and 1
  Returns:
      List[str]: lines (the requested fraction of the dataset)
  """
  sentences = []
  with open(training_file_path, "r", encoding="utf8", errors="ignore") as csvfile:
    csv_reader = reader(csvfile)
    # skip the header row; the None default covers an empty file
    header = next(csv_reader, None)

    if header is not None:
      for row in csv_reader:
        sentences.append(row[select_column])

  # shuffle first so the slice below is a random sample
  random.shuffle(sentences)
  to_return = sentences[:int(len(sentences)*percentage)]

  return to_return

In [None]:
def preprocessing(running_lines: list[str], ngrams: int) -> list[list[str]]:
  """This function takes in the running text and returns the
  preprocessed text as a list of token lists. Five tasks are done as part of this:
    1. lower word case
    2. remove punctuation
    3. remove stopwords
    4. tokenize each sentence into words
    5. Add (ngrams - 1) <s> tokens and one </s> token to every sentence
  """
  preprocessed_lines = []
  tokenizer = RegexpTokenizer(r'\w+')
  for line in running_lines:
    line = line.lower()
    line = strip_punctuation(line)
    line = remove_stopwords(line)
    # split the cleaned string into a list of word tokens
    editing_line = tokenizer.tokenize(line)

    # pad with (ngrams - 1) start tokens so the first word has a full context
    for i in range(1, ngrams):
      editing_line.insert(0,"<s>")
    editing_line.append("</s>")
    preprocessed_lines.append(editing_line)
  return preprocessed_lines

In [None]:
halfspooky = parse_csv("train.csv", 1, 0.5)
halfspooky = preprocessing(halfspooky, 3)
random.shuffle(halfspooky)

In [None]:
#writing into csv file as a backup
write_into_csv(halfspooky, "half_spooky_training_set.csv")

In [None]:
from gensim.models import Word2Vec

# The dimension of word embedding. 
# This variable will be used throughout the program
# you may vary this as you desire
EMBEDDINGS_SIZE = 200

# Train the Word2Vec model from Gensim. 
# Below are the hyperparameters that are most relevant. 
# But feel free to explore other 
# options too:
# sg = 1
# window = 5
# vector_size = EMBEDDINGS_SIZE
# min_count = 1

w2v_model_half_spooky = Word2Vec(halfspooky, vector_size=EMBEDDINGS_SIZE, window=5, min_count=1, sg=1, sorted_vocab=1, workers=4)

In [None]:
# if you save your Word2Vec as the variable model, this will 
# print out the vocabulary size
print('Vocab size {}'.format(len(w2v_model_half_spooky.wv.key_to_index)))

In [None]:
# You can save file in txt format, then load later if you wish.
w2v_model_half_spooky.wv.save_word2vec_format('half_spooky_embeddings.txt', binary=False)

### b) Train embedding on YOUR dataset

In [None]:
financial_set = parse_csv("financial_data.csv", 0, 1)
financial_set = preprocessing(financial_set, 3)
random.shuffle(financial_set)

In [None]:
#writing into csv file as a backup
write_into_csv(financial_set, "financial_data_training_set.csv")

In [None]:
#the word embeddings for the financial data
w2v_model_financial = Word2Vec(financial_set, vector_size=EMBEDDINGS_SIZE, window=5, min_count=1, sg=1, sorted_vocab=1, workers=4)

In [None]:
# the vocabulary size for the financial data
print('Vocab size {}'.format(len(w2v_model_financial.wv.key_to_index)))

In [None]:
#save the word embeddings for the financial data
w2v_model_financial.wv.save_word2vec_format('financial_embeddings.txt', binary=False)

What text-normalization and pre-processing did you do and why?

Text normalization and pre-processing steps completed:

Lowercasing words:

Lowercasing is a common technique for reducing vocabulary size, because it merges variants such as "The" and "the" into a single token. It does erase the distinction between proper nouns and ordinary words, but there are few such words in the two datasets we chose.


Removing punctuation:

We remove punctuation for the same reason as lowercasing: it gives cleaner data and keeps punctuation marks out of the vocabulary.


Removing stopwords:

Removing stopwords further reduces the vocabulary size and eliminates very frequent words that can skew model predictions. These common words carry little content and dilute the more meaningful words in a sentence.

Tokenizing the text and adding sentence separator tokens:

Tokenizing splits each sentence into the word units that the embedding model and the language model operate on. The sentence separator tokens `<s>` and `</s>` give the first words of a sentence a full context window and let the language model learn where sentences begin and end. They make each word's position relative to the sentence boundaries explicit.
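
The steps described above can be sketched on a single sentence. This is a minimal illustration only: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for the much larger list that gensim's `remove_stopwords` applies in the notebook.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook itself relies on gensim's remove_stopwords.
TOY_STOPWORDS = {"the", "a", "of", "and", "was", "it"}

def toy_preprocess(sentence: str, ngram: int) -> list[str]:
    """Lowercase, drop punctuation and stopwords, then add boundary tokens."""
    # lowercasing plus a \w+ match removes punctuation and tokenizes in one step
    tokens = re.findall(r"\w+", sentence.lower())
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    # (ngram - 1) start tokens give the first real word a full context
    return ["<s>"] * (ngram - 1) + tokens + ["</s>"]

print(toy_preprocess("It was the Night of the storm, and dark.", 3))
```

With a trigram model, every sentence starts with two `<s>` tokens, which matches the seeds used later for sentence generation.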

Step 2: Evaluate the differences between the word embeddings
----------------------------

(make sure to include graphs, figures, and paragraphs with full sentences)
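
One possible way to compare two sets of embeddings is to look at the nearest neighbors of the same word under cosine similarity in each space. The sketch below is only an illustration: `cosine_similarity` and `nearest_neighbors` are hypothetical helpers, and the two-dimensional vectors are made up, not values from our trained models.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(word: str, embeddings: dict, k: int = 3) -> list:
    """Return the k words most similar to `word`, most similar first."""
    target = embeddings[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up toy vectors, for illustration only
toy = {"ghost": [1.0, 0.0], "spirit": [0.9, 0.1], "profit": [0.0, 1.0]}
print(nearest_neighbors("ghost", toy, k=2))
```

Running the same query against both embedding dicts would show whether a shared word keeps similar neighbors across the two domains.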

## Write down your analysis:

Cite your sources:
-------------

Step 3: Feedforward Neural Language Model
--------------------------

### a) First, encode  your text into integers

In [2]:
# Importing utility functions from Keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# The size of the ngram language model you want to train
# change as needed for your experiments
NGRAM = 3

# Initializing a Tokenizer
# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: Indexing starts from 1; index 0 is reserved)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)

halfspooky = read_from_csv("half_spooky_training_set.csv")
half_spooky_tokenizer = Tokenizer()
half_spooky_tokenizer.fit_on_texts(halfspooky)
half_spooky_encoded = half_spooky_tokenizer.texts_to_sequences(halfspooky)
financial = read_from_csv("financial_data_training_set.csv")
financial_tokenizer = Tokenizer()
financial_tokenizer.fit_on_texts(financial)
financial_encoded = financial_tokenizer.texts_to_sequences(financial)

2023-03-22 14:10:29.864620: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


### b) Next, prepare your sequences from text

#### Fixed ngram based sequences 

The training samples will be structured in the following format. 
Depending on which ngram model we choose, there will be (n-1) tokens 
in the input sequence (X) and we will need to predict the nth token (Y)

            X,                     y
    this,    process               however
    process, however               afforded
    however, afforded              me
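
The table above can be reproduced with a sliding window over one sentence. This is a small sketch on raw words rather than integer indices; `ngram_windows` is a hypothetical helper name used only here.

```python
def ngram_windows(tokens: list, n: int) -> list:
    """Slide a window of size n over tokens; return (context, target) pairs."""
    pairs = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # the first n-1 tokens are the input X, the last token is the label y
        pairs.append((window[:-1], window[-1]))
    return pairs

tokens = ["this", "process", "however", "afforded", "me"]
for context, target in ngram_windows(tokens, 3):
    print(context, "->", target)
```

A sentence of length m yields m - n + 1 samples, so sentences shorter than n tokens contribute nothing.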

In [3]:
def generate_ngram_training_samples(encoded: list, ngram:int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    encoded: list of encoded sentences, each a list of word indices
    ngram: the ngram size
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    training_samples = []
    for line in encoded:
        for i in range(len(line) - ngram + 1):
            training_samples.append(line[i:i+ngram])
    return training_samples


### c) Then, split the sequences into X and y and create a Data Generator

In [4]:
# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]

def generate_xy(encoded: list, ngram: int) -> tuple[list, list]:
    '''
    generates the X and y for the training samples
    Parameters:
    encoded: the encoded data
    ngram: the ngram size

    Returns:
    X: list of lists in the format [[x1, x2, ... , x(n-1)], ...]
    y: list of target word indices in the format [y1, y2, ...]
    '''
    training_samples = generate_ngram_training_samples(encoded, ngram)
    toreturn_X = []
    toreturn_y = []
    random.shuffle(training_samples)
    for sample in training_samples:
        toreturn_X.append(sample[:-1])
        toreturn_y.append(sample[-1])
    return toreturn_X, toreturn_y

In [5]:
def read_embeddings(filename, tokenizer) -> tuple[dict, dict]:
    '''Loads and parses the embeddings trained earlier.

    Parameters:
    filename: the path to the embeddings file
    tokenizer: the tokenizer fitted on the same training text as the embeddings

    Returns:
    word_to_embedding: a dict mapping words to their embeddings
    index_to_embedding: a dict mapping indices to their embeddings
    '''
    
    # you may find generating the following two dicts useful:
    # word to embedding : {'the':[0....], ...}
    # index to embedding : {1:[0....], ...} 
    # use your tokenizer's word_index to find the index of
    # a given word
    
    # code to read the embeddings file and return the above two dicts
    word_to_embedding = {}
    index_to_embedding = {}
    # the with block makes sure the file is closed after reading
    with open(filename, 'r') as f:
        # skip the header line, which holds the vocabulary size and dimension
        f.readline()
        for line in f:
            line = line.split()
            word_to_embedding[line[0]] = [float(x) for x in line[1:]]
            #index to embedding refer to the index of the word in the tokenizer
            index_to_embedding[tokenizer.word_index[line[0]]] = [float(x) for x in line[1:]]
    return word_to_embedding, index_to_embedding
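
The word2vec text format written by `save_word2vec_format(..., binary=False)` starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word. A minimal sketch of parsing it from an in-memory file is shown below; `parse_word2vec_text` is a hypothetical helper and the three-dimensional vectors are invented for the example.

```python
import io

def parse_word2vec_text(handle) -> dict:
    """Parse word2vec text format: header 'vocab dim', then 'word v1 v2 ...'."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        # skip blank or malformed lines instead of failing on them
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Invented toy content, not taken from our embedding files
toy_file = io.StringIO("2 3\n<s> 0.1 0.2 0.3\nnight -0.5 0.0 0.25\n")
vectors = parse_word2vec_text(toy_file)
print(vectors["night"])
```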




In [6]:
def data_generator(X: list, y: list, num_sequences_per_batch: int, embedding: dict, epochs: int):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels 
    (see the to_categorical function)
    
    '''
    num_classes = len(embedding) + 1
    for _ in range(epochs):
        # shuffle X and y together so every input keeps its own label
        pairs = list(zip(X, y))
        random.shuffle(pairs)
        # start every epoch with empty batches
        a = []
        b = []
        for context, label in pairs:
            temp = []
            for item in context:
                temp.extend(embedding[item])
            a.append(temp)
            b.append(to_categorical(label, num_classes=num_classes))
            if len(a) == num_sequences_per_batch:
                yield (np.array(a), np.array(b))
                a = []
                b = []
        # yield the final partial batch only if it is not empty
        if a:
            yield (np.array(a), np.array(b))
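
Batching with one-hot labels can be illustrated without Keras or NumPy. The sketch below is a simplified stand-in, not the notebook's generator: `one_hot` and `toy_batches` are hypothetical names, inputs are left as plain lists instead of concatenated embeddings, and inputs and labels are shuffled as pairs so each context stays attached to its target.

```python
import random

def one_hot(index: int, num_classes: int) -> list:
    """Return a list of zeros with a single 1.0 at position index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def toy_batches(X: list, y: list, batch_size: int, num_classes: int, seed: int = 0):
    """Yield (inputs, one-hot labels) batches for one pass over the data."""
    pairs = list(zip(X, y))
    # shuffle the pairs, not X and y separately, to keep them aligned
    random.Random(seed).shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        inputs = [x for x, _ in chunk]
        labels = [one_hot(label, num_classes) for _, label in chunk]
        yield inputs, labels

X = [[i] for i in range(5)]
y = [0, 1, 2, 3, 4]
for inputs, labels in toy_batches(X, y, batch_size=2, num_classes=5):
    print(len(inputs), inputs, labels)
```

Five samples with a batch size of two give three batches, the last one holding a single sample.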

In [7]:
# Examples
# initialize data_generator
# num_sequences_per_batch = 128 # this is the batch size
# steps_per_epoch = len(sequences)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch)

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
# sample[1].shape   # (batch_size, |V|) to_categorical
def prepare_data_for_feed_forward(tokenizer: Tokenizer, embedding_file: str, encoded: list, ngram: int):
    '''
    Returns data generator to be used by feed_forward, and steps_per_epoch
    Parameters:
    tokenizer: tokenizer used to encode the data
    embedding_file: file name containing the embeddings
    encoded: encoded data
    ngram: ngram used to generate the training samples
    Returns:
    train_generator: data generator to be used by feed_forward
    steps_per_epoch: number of batches per epoch
    '''
    # only the index -> embedding dict is needed to build the inputs
    _, index_to_embedding = read_embeddings(embedding_file, tokenizer)
    # use the ngram argument rather than the global NGRAM
    X, y = generate_xy(encoded, ngram)
    num_sequences_per_batch = 128 # this is the batch size
    steps_per_epoch = len(X)//num_sequences_per_batch  # Number of batches per epoch
    train_generator = data_generator(X, y, num_sequences_per_batch, embedding=index_to_embedding, epochs=3)
    sample=next(train_generator) # this is how you get data out of generators
    print(sample[0].shape) # (batch_size, (n-1)*EMBEDDING_SIZE), e.g. (128, 400)
    print(sample[1].shape) # (batch_size, |V|) to_categorical
    return (train_generator, steps_per_epoch)

### d) Train your models

In [8]:
# code to train a feedforward neural language model 
# on a set of given word embeddings
# make sure not to just copy + paste to train your two models

# Define the model architecture using Keras Sequential API
EMBEDDINGS_SIZE = 200
def feed_forward(embedding_size: int, vocab_size: int, ngram: int) -> Sequential:
    '''
    Returns a feed forward neural network model
    Parameters:
    embedding_size: the size of the embeddings
    vocab_size: the size of the vocabulary
    ngram: the ngram size

    returns: a compiled Keras model
    '''
    learning_rate = 0.1
    model = Sequential()
    # size the hidden layer from the embedding_size argument, not the global constant
    model.add(Dense(ngram * embedding_size, input_dim=(ngram - 1) * embedding_size, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=learning_rate), metrics=['accuracy'])
    return model
# code to train the model
spooky_model = feed_forward(EMBEDDINGS_SIZE, len(half_spooky_tokenizer.word_index) + 1, NGRAM)
financial_model = feed_forward(EMBEDDINGS_SIZE, len(financial_tokenizer.word_index) + 1, NGRAM)


2023-03-22 14:10:47.722553: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
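
As a rough check on the size of this architecture, the number of trainable parameters can be counted by hand: a dense layer with `inputs` inputs and `units` units has `inputs * units` weights plus `units` biases. The sketch below uses hypothetical helper names, and the vocabulary size 19106 is taken from the label shape printed for the Spooky model later in the notebook.

```python
def dense_params(inputs: int, units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units

def feed_forward_params(embedding_size: int, vocab_size: int, ngram: int) -> int:
    """Parameter count for the two-layer model defined in feed_forward."""
    input_dim = (ngram - 1) * embedding_size   # concatenated context embeddings
    hidden_units = ngram * embedding_size      # hidden layer width used above
    return dense_params(input_dim, hidden_units) + dense_params(hidden_units, vocab_size)

print(feed_forward_params(embedding_size=200, vocab_size=19106, ngram=3))
```

Almost all of the parameters sit in the softmax layer, because its width equals the vocabulary size.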


In [9]:
spooky_generator, spooky_steps_per_epoch = prepare_data_for_feed_forward(half_spooky_tokenizer, 'half_spooky_embeddings.txt', half_spooky_encoded, NGRAM)
# Start training the model
spooky_model.fit(x=spooky_generator, 
          steps_per_epoch=spooky_steps_per_epoch,
          epochs=3)

(128, 400)
(128, 19106)
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa633d773d0>

In [10]:
#model for our self-chosen dataset
financial_generator, financial_steps_per_epoch = prepare_data_for_feed_forward(financial_tokenizer, 'financial_embeddings.txt', financial_encoded, NGRAM)
financial_model.fit(x=financial_generator,
          steps_per_epoch=financial_steps_per_epoch,
          epochs=3)

(128, 400)
(128, 11267)
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa619f78bb0>

### e) Generate Sentences

In [37]:
# generate a sequence from the model
def generate_seq(model: Sequential, 
                 tokenizer: Tokenizer, 
                 seed: list,
                 word_to_embedding: dict, 
                 n_words: int = 20):
    '''
    Parameters:
        model: your neural network
        tokenizer: the keras preprocessing tokenizer
        seed: [w1, w2, w(n-1)]
        word_to_embedding: dictionary mapping word to embedding
        n_words: generate a sentence of length n_words
    Returns: string sentence
    '''
    # copy the seed so the caller's list is not modified
    sentence = list(seed)
    pres = list(seed)
    # the seed holds the n-1 context words the model expects
    context_size = len(seed)
    # index 0 is reserved by the tokenizer and maps to no word, so skip it
    word_range = list(range(1, len(tokenizer.word_index) + 1))
    for _ in range(n_words):
        input_X = []
        for item in pres:
            input_X += word_to_embedding[item]
        input_X = np.array([input_X])
        output_y = model.predict(input_X)
        #randomly choose the next word, weighted by the predicted probabilities
        weights = list(output_y[0])[1:]
        word_chosen = random.choices(word_range, weights)[0]
        next_word = tokenizer.index_word[word_chosen]
        sentence.append(next_word)
        if next_word == '</s>':
            break
        pres = sentence[-context_size:]
    return ' '.join(sentence)
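
The generation step draws the next word at random in proportion to the predicted probabilities instead of always taking the most likely word. A small sketch of that weighted draw with `random.choices` follows; `sample_index` is a hypothetical helper and the probability vector is made up.

```python
import random

def sample_index(probabilities: list, rng: random.Random) -> int:
    """Draw one index, with index i chosen with weight probabilities[i]."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

rng = random.Random(42)
# Made-up distribution over a four-word vocabulary
probs = [0.0, 0.7, 0.2, 0.1]
draws = [sample_index(probs, rng) for _ in range(1000)]
counts = {i: draws.count(i) for i in range(len(probs))}
print(counts)
```

Sampling keeps the generated sentences varied from run to run, while always choosing the highest-probability word would produce the same sentence for a given seed.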

In [38]:
spooky_word_to_embeddings = read_embeddings('half_spooky_embeddings.txt', half_spooky_tokenizer)[0]
financial_word_to_embeddings = read_embeddings('financial_embeddings.txt', financial_tokenizer)[0]
spooky_sentences = []
financial_sentences = [] 
for _ in range(50):
    spooky_sentences.append(generate_seq(spooky_model, half_spooky_tokenizer, ['<s>', '<s>'], spooky_word_to_embeddings,20))
    financial_sentences.append(generate_seq(financial_model, financial_tokenizer, ['<s>', '<s>'], financial_word_to_embeddings,20))



In [39]:
spooky_sentences

['<s> <s> second love school suddenly ancestors unlocked blood laundry dead love points suffering time there jedgment long looks city voiceless connected',
 '<s> <s> proceeded cottage cord great pleasant wolejko </s>',
 '<s> <s> roused fair boyhood lord undecayed melancholy passed nerves consciousness having gave night drawn palpable moment </s>',
 '<s> <s> game sunfish grasp frequently creator concealed hellish been mauvais far heerd i s returned sure ejaculation referred eleventh stations person',
 '<s> <s> present believe followed visible represent shine mass suitable near well old heartfelt cottage sprang exceptions gendarme happiness emerging candle details',
 '<s> <s> glowing waters appearance creatures things position recherchés casks endeavoured ulterior adduce dissolution canisters there chin lovely visage wrote trouble been',
 '<s> <s> roofed beauvais flashes consummation torpedoed modern better having t </s>',
 '<s> <s> ears place fireplace equal tinted cold if poorer annoya

In [42]:
with open('financial_lines.txt', 'w') as f:
    for item in financial_sentences:
        f.write( item + '\n')

### f) Compare your generated sentences

You may find it useful to run your HW 2 code on one of the datasets (or a subset of the dataset) that you used for this homework.

In [None]:
# def generate_txt_from_csv(filename: str):
#     lines = read_from_csv(filename + '.csv')
#     with open (filename + '.txt', 'w') as f:
#         for line in lines:
#             f.write(' '.join(line) + '\n')
# generate_txt_from_csv('half_spooky_training_set')
# generate_txt_from_csv('financial_data_training_set')

# used the code above to generate the txt files to input into hw2 code
# 50 sentences each are stored in hw2_spooky.txt and hw2_financial.txt

Sources Cited
----------------------------
