[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/satyaki-mallick/DeepLearningAssignment1/blob/master/Assignment_1.ipynb#scrollTo=7uut_aem5B5q)

# Assignment 1

<b>Group [fill in group number]</b>
* <b> Student 1 </b> : SATYAKI MALLICK + 1410881
* <b> Student 2 </b> : HUILIN ZHU+ 1378627

**Reading material**
* [1] Mikolov, Tomas, et al. "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)." arXiv preprint arXiv:1301.3781 (2013).

<b><font color='red'>NOTE</font></b> When submitting your notebook, please make sure that the training history of your model is visible in the output. This means that you should **NOT** clean your output cells of the notebook. Make sure that your notebook runs without errors in linear order.



# Question 1 - Keras implementation (10 pt)

### Word embeddings
Build word embeddings with a Keras implementation where the embedding vector has length 50, 150 and 300. Use the text of Alice in Wonderland for training. Use a window size of 2 to train the embeddings (`window_size` in the Jupyter notebook).

1. Build word embeddings of length 50, 150 and 300 using the Skipgram model
2. Build word embeddings of length 50, 150 and 300 using CBOW model
3. Analyze the different word embeddings:
    - Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in the paper. Do not use existing libraries for this task such as Gensim. 
Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. $e_{x}$ denotes the embedding of word $x$. We want to find the word $p$ in the vocabulary, where the embedding of $p$ ($e_p$) is the closest to the predicted embedding (i.e. result of the formula). Then, we can check if $p$ is the same word as the true word $t$.
    - Give at least 5 different  examples of analogies.
    - Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.

4. Discuss:
  - Given the same number of sentences as input, CBOW and Skipgram arrange the data into different numbers of training samples. Which one has more and why?


<b>HINT</b> See practical 3.1 for some helpful code to start this assignment.


### Import libraries

In [0]:
%tensorflow_version 2.x

In [2]:
import numpy as np
import keras.backend as K
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Reshape, Lambda
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing import sequence

# other helpful libraries
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as nn
from matplotlib import pylab
import pandas as pd

Using TensorFlow backend.


In [3]:
print(tf.__version__) #  check what version of TF is imported

2.2.0


### Import file

If you use Google Colab, you need to mount your Google Drive to the notebook when you want to use files that are located in your Google Drive. Paste the authorization code, from the new tab page that opens automatically when running the cell, in the cell below.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


Navigate to the folder in which `alice.txt` is located. Make sure to start the path with '/content/drive/My Drive/' if you want to load the file from your Google Drive.

In [0]:
cd '/content/drive/My Drive/Colab Notebooks/'

/content/drive/My Drive/Colab Notebooks


In [5]:
cd '/content/drive/My Drive/Colab Notebooks/DeepLearning/Practical3'

/content/drive/My Drive/Colab Notebooks/DeepLearning/Practical3


In [0]:
file_name = 'alice.txt'
corpus = open(file_name).readlines()

### Data preprocessing

See Practical 3.1 for an explanation of the preprocessing steps done below.

In [0]:
# Removes sentences with fewer than 3 words
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]

# remove punctuation in text and fit tokenizer on entire corpus
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)

# convert text to sequence of integer values
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus) # total number of words in the corpus
V = len(tokenizer.word_index) + 1 # vocabulary size: number of unique words + 1 (index 0 is reserved for padding)

In [8]:
n_samples, V

(27165, 2557)

In [9]:
# example of what the word-to-integer mapping in the tokenizer looks like
print(list((tokenizer.word_index.items()))[:5])

[('the', 1), ('and', 2), ('to', 3), ('a', 4), ('it', 5)]
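
The mapping above gives the smallest integers to the most frequent words and leaves index 0 unused, which is why `V` is computed as `len(tokenizer.word_index) + 1`. The idea can be sketched in plain Python. This is only an illustration on a made-up toy corpus; `build_word_index` is a hypothetical helper, not the Keras `Tokenizer` implementation (which additionally filters punctuation).

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # sort by descending frequency; sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

toy_corpus = ["the cat sat", "the cat ran", "the dog"]  # made-up example data
word_index = build_word_index(toy_corpus)

# +1 because index 0 is never assigned to a word (it is kept free for padding)
vocab_size = len(word_index) + 1

# each sentence becomes a list of integer indices
sequences = [[word_index[w] for w in s.split()] for s in toy_corpus]
print(word_index, vocab_size, sequences)
```

Keeping index 0 free matters later for CBOW, where short contexts are padded with zeros.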


In [0]:
# parameters
window_size = 2
window_size_corpus = 4

In [0]:
# def save_embeddings_old(model, model_name):
#   weights = model.get_weights()
#   embedding = weights[0]
#   f = open(model_name + '.txt', 'w')
#   f.write(" ".join([str(V-1), str(dim)]))
#   f.write("\n")

#   for word,i in tokenizer.word_index.items():
#     f.write(word)
#     f.write(" ")
#     f.write(" ".join(map(str, list(embedding[i,:]))))
#     f.write("\n")
#   f.close()

In [0]:
def save_embeddings(model, model_name):
  weights = model.get_weights()
  embedding = weights[0]
  np.savetxt('new_' + model_name + '.txt',embedding)

## Task 1.1 - Skipgram
Build word embeddings of length 50, 150 and 300 using the Skipgram model.

In [0]:
# prepare data for skipgram
def generate_data_skipgram(corpus, window_size, V):
    all_in = []   # centre word indices (model input)
    all_out = []  # one-hot encoded context words (model target)
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            p = index - window_size
            n = index + window_size + 1

            for i in range(p, n):
                # skip the centre word itself and positions outside the sentence
                if i != index and 0 <= i < L:
                    # add the input (centre) word
                    all_in.append(word)
                    # add the one-hot vector of the context word
                    all_out.append(to_categorical(words[i], V))

    return (np.array(all_in), np.array(all_out))

In [0]:
# create training data
x_skipgram , y_skipgram = generate_data_skipgram(corpus,window_size,V)

In [0]:
# x_skipgram: each entry is the integer index of a centre word, repeated once for every context word in its window

In [0]:
# Sample of y:
# each row is one training sample (a one-hot encoded word).
# each column corresponds to one index in the vocabulary.
# sample array for y for input [1, 0, 3, 4, 5, 0, 2, 1]
# array([[ 0.,  1.,  0.,  0.,  0.,  0.],
#        [ 1.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  1.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  1.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  1.],
#        [ 1.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  1.,  0.,  0.,  0.],
#        [ 0.,  1.,  0.,  0.,  0.,  0.]])

In [0]:
x_skipgram.shape, y_skipgram.shape

((94556,), (94556, 2557))
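
The 94,556 rows above are (centre word, context word) pairs. That is fewer than 4 × 27,165 = 108,660, because words near the start or end of a sentence have fewer than four neighbours. A minimal pure-Python sketch of the pairing, assuming a made-up sentence of word indices (`skipgram_pairs` is a hypothetical helper that mirrors the windowing logic without the one-hot encoding):

```python
def skipgram_pairs(sentence, window_size):
    """Return (centre, context) index pairs for one tokenised sentence."""
    pairs = []
    for index, centre in enumerate(sentence):
        lo = max(0, index - window_size)                   # clip window at sentence start
        hi = min(len(sentence), index + window_size + 1)   # clip window at sentence end
        for i in range(lo, hi):
            if i != index:                                 # the centre word is not its own context
                pairs.append((centre, sentence[i]))
    return pairs

toy_sentence = [11, 12, 13, 14, 15]  # hypothetical word indices
pairs = skipgram_pairs(toy_sentence, window_size=2)
print(len(pairs), pairs[:4])
```

For this five-word sentence the per-word context counts are 2, 3, 4, 3 and 2, so only the middle word gets the full window.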

In [0]:
V = y_skipgram.shape[1]

In [0]:
def skipgram_architechture(dim):
  model = Sequential()
  model.add(Embedding(input_dim=V, output_dim=dim, input_length=1, embeddings_initializer='glorot_uniform'))
  # input_length=1: each training sample is a single centre-word index
  # Reshape drops the length-1 sequence axis: (None, 1, dim) -> (None, dim)
  model.add(Reshape((dim,)))
  model.add(Dense(V, activation='softmax', kernel_initializer='glorot_uniform'))
  model.compile(optimizer='adadelta',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
  return model

### Skipgram for Embedding Vector Length 50

In [0]:
dim = 50
skipgram50 = skipgram_architechture(dim)
skipgram50.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1, 50)             127850    
_________________________________________________________________
reshape_2 (Reshape)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2557)              130407    
Total params: 258,257
Trainable params: 258,257
Non-trainable params: 0
_________________________________________________________________
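
The parameter counts in the summary can be checked by hand: the embedding layer stores one `dim`-sized row per vocabulary index, and the dense softmax layer has a `dim × V` weight matrix plus one bias per output word. A small sketch of that arithmetic (`word2vec_param_count` is a hypothetical helper name, with `V = 2557` taken from the output above):

```python
def word2vec_param_count(vocab_size, dim):
    """Return (embedding, dense, total) parameter counts for the two-layer model."""
    embedding = vocab_size * dim            # one dim-sized row per vocabulary index
    dense = dim * vocab_size + vocab_size   # weight matrix plus one bias per output word
    return embedding, dense, embedding + dense

for dim in (50, 150, 300):
    print(dim, word2vec_param_count(2557, dim))
```

For `dim = 50` this gives 127,850 + 130,407 = 258,257, matching the summary. The same formula applies to the CBOW model, since averaging the context vectors adds no parameters.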


<b>HINT</b>: To increase the training speed of your model, you can use the freely available GPU in Google Colab. Go to `Edit` --> `Notebook Settings` --> select `GPU` under `hardware accelerator`.

In [0]:
# train skipgram model
skipgram50.fit(x_skipgram, y_skipgram, batch_size=64, epochs=10)
save_embeddings(skipgram50, 'skipgram_50')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Skipgram for embedding vector length 150

In [0]:
# build, train and save the skipgram model with embedding vectors of length 150
dim = 150
skipgram150 = skipgram_architechture(dim)
skipgram150.summary()
skipgram150.fit(x_skipgram, y_skipgram, batch_size=64, epochs=10)
save_embeddings(skipgram150, 'skipgram_150')


Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1, 150)            383550    
_________________________________________________________________
reshape_5 (Reshape)          (None, 150)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 2557)              386107    
Total params: 769,657
Trainable params: 769,657
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Skipgram for embedding vector length 300

In [0]:
dim = 300
skipgram300 = skipgram_architechture(dim)
skipgram300.fit(x_skipgram, y_skipgram, batch_size=32, epochs=10)
save_embeddings(skipgram300, 'skipgram_300')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Task 1.2 - CBOW

Build word embeddings of length 50, 150 and 300 using CBOW model.

In [0]:
# prepare data for CBOW

# create training data

# create CBOW architecture

# train CBOW model

# save embeddings for vectors of length 50, 150 and 300 using CBOW model

Prepare data for CBOW



In [0]:
def generate_data_CBOW(corpus, window_size, V):
    maxlen = window_size*2
    all_in = []
    all_out = [] #one real label wt
    for line in corpus:
        sentence_length = len(line)
      
        for index, word in enumerate(line):  #for each word in the line, we create a little neighborhood [left,right]
            left = index - window_size
            right = index + window_size + 1
      
            in_words = []   #neighbor words of wt, used as input to predict wt       
      
            for i in range(left, right):
                if 0 <= i < sentence_length and i != index:
                    # Add the input word
                    in_words.append(line[i])

            
            all_in.append(in_words)
            all_out.append(to_categorical(word,V))
      
    all_in = sequence.pad_sequences(all_in, maxlen=maxlen)
                                      
    return (np.array(all_in),np.array(all_out))
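
To make the shape of the CBOW data concrete: each word yields exactly one sample, whose input is the list of its context words, left-padded with zeros to length `2 * window_size` (`pad_sequences` pads at the front by default). A pure-Python sketch on a made-up sentence, with `cbow_samples` as a hypothetical stand-in that skips the one-hot encoding of the target:

```python
def cbow_samples(sentence, window_size):
    """Return (padded_context, target) pairs for one tokenised sentence."""
    maxlen = 2 * window_size
    samples = []
    for index, target in enumerate(sentence):
        lo = max(0, index - window_size)
        hi = min(len(sentence), index + window_size + 1)
        context = [sentence[i] for i in range(lo, hi) if i != index]
        # left-pad with 0 so that every context has the same length
        padded = [0] * (maxlen - len(context)) + context
        samples.append((padded, target))
    return samples

toy_sentence = [11, 12, 13, 14, 15]  # hypothetical word indices
samples = cbow_samples(toy_sentence, window_size=2)
for context, target in samples:
    print(context, "->", target)
```

Only the middle word of this sentence has a full context; the others are padded with one or two zeros.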

Create training data

In [0]:
x , y = generate_data_CBOW(corpus,window_size,V)

Create CBOW architecture

In [0]:
def cbow_architechture(dim):
    cbow = Sequential()
    cbow.add(Embedding(input_dim=V, output_dim=dim, embeddings_initializer='glorot_uniform', input_length=window_size*2))
    cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,)))
    cbow.add(Dense(V, kernel_initializer='glorot_uniform', activation='softmax'))
    #multiclass classification->categorical_crossentropy loss, optimizer->PPT02 p24
    cbow.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
    return cbow
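
The `Lambda` layer averages the embeddings over axis 1, i.e. over the `window_size*2` context positions. Because the `Embedding` layer is created without `mask_zero`, padded positions (index 0) are looked up like any other index and take part in the mean. The sketch below illustrates this with plain lists; the tiny embedding table and `mean_context_vector` are made up for illustration and are not taken from the trained model.

```python
def mean_context_vector(context_indices, embedding):
    """Average the embedding rows of all context positions, padding included."""
    dim = len(embedding[0])
    rows = [embedding[i] for i in context_indices]
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

toy_embedding = [
    [0.0, 0.0],  # index 0: padding row (set to zero here only for readability)
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

# a context with one padded position followed by three real words
hidden = mean_context_vector([0, 1, 2, 3], toy_embedding)
print(hidden)
```

The result is divided by four positions rather than three words, so padded contexts are scaled differently from full ones. Whether that matters for embedding quality on this corpus is not something the notebook measures.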

### CBOW for Embedding Vector Length 50

Train CBOW model - embedding vector length 50

In [0]:
# dimension of word embedding
dim = 50

cbow = cbow_architechture(dim)

In [0]:
plot_model(cbow, show_shapes = True, show_layer_names=False)

In [0]:
cbow.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 4, 50)             127850    
_________________________________________________________________
lambda_3 (Lambda)            (None, 50)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 2557)              130407    
Total params: 258,257
Trainable params: 258,257
Non-trainable params: 0
_________________________________________________________________


In [0]:
# train CBOW model
cbow.fit(x, y, batch_size=64, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f327d0fdc50>

Save embedding

In [0]:
save_embeddings(cbow, 'cbow_{dim}'.format(dim=dim))

### CBOW for Embedding Vector Length 150

Train CBOW model - embedding vector length 150

In [0]:
# dimension of word embedding
dim = 150

cbow = cbow_architechture(dim)

In [0]:
plot_model(cbow, show_shapes = True, show_layer_names=False)

In [0]:
cbow.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 4, 150)            383550    
_________________________________________________________________
lambda_1 (Lambda)            (None, 150)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 2557)              386107    
Total params: 769,657
Trainable params: 769,657
Non-trainable params: 0
_________________________________________________________________


In [0]:
# train CBOW model
cbow.fit(x, y, batch_size=64, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f327d2a2a58>

Save embedding

In [0]:
save_embeddings(cbow, 'cbow_{dim}'.format(dim=dim))

### CBOW for Embedding Vector Length 300

Train CBOW model - embedding vector length 300

In [0]:
# dimension of word embedding
dim = 300

cbow = cbow_architechture(dim)

In [0]:
plot_model(cbow, show_shapes = True, show_layer_names=False)

In [0]:
cbow.summary()

In [0]:
# train CBOW model
cbow.fit(x, y, batch_size=64, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f327d1ac438>

Save embedding

In [0]:
save_embeddings(cbow, 'cbow_{dim}'.format(dim=dim))

## Task 1.3 - Analogy function

Implement your own function to perform the analogy task (see [1] for concrete examples). Use the same distance metric as in [1]. Do not use existing libraries for this task such as Gensim. Your function should be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" ($e_{king} - e_{queen} + e_{woman} \approx e_{man}$) is true. 

In a perfect scenario, we would like that this analogy ( $e_{king} - e_{queen} + e_{woman}$) results in the embedding of the word "man". However, it does not always result in exactly the same word embedding. The result of the formula is called the expected or the predicted word embedding. In this context, "man" is called the true or the actual word $t$. We want to find the word $p$ in the vocabulary, where the embedding of $p$ ($e_p$) is the closest to the predicted embedding (i.e. result of the formula). Then, we can check if $p$ is the same word as the true word $t$.  

You have to evaluate each analogy using each embedding for both the CBOW and Skipgram models. This means that for each analogy we have 6 outputs. Show the true word (with the similarity value between the predicted embedding and the true word embedding, i.e. `sim1`), the predicted word (with the similarity value between the predicted embedding and the embedding of the word in the vocabulary that is closest to this predicted embedding, i.e. `sim2`) and a boolean answer indicating whether the predicted word **exactly** equals the true word.

<b>HINT</b>: to visualize the results of the analogy tasks , you can print them in a table. An example is given below.


| Analogy task | True word (sim1)  | Predicted word (sim2) | Embedding | Correct?|
|------|------|------|------|------|
|  queen is to king as woman is to ? | man (sim1) | predicted_word (sim2) | SG_50 | True / False|

* Give at least 5 different examples of analogies.
* Compare the performance on the analogy tasks between the word embeddings and briefly discuss your results.
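
Before applying this to the trained embeddings, the procedure can be sketched on hand-made vectors: form $e_a - e_b + e_c$, compute the cosine similarity to every other word, and return the best match. The 2-d vectors below are invented so that the analogy works by construction; `cosine` and `solve_analogy` are hypothetical helper names, and the query words are excluded from the candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def solve_analogy(a, b, c, vectors):
    """Return (word, similarity) closest to e_a - e_b + e_c, excluding a, b and c."""
    predicted = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: cosine(predicted, v) for w, v in vectors.items() if w not in (a, b, c)}
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

toy_vectors = {  # hand-made vectors, purely illustrative
    "king": [0.9, 0.8], "queen": [0.9, 0.2],
    "man": [0.1, 0.8], "woman": [0.1, 0.2],
    "rabbit": [-0.7, -0.1],
}

# e_king - e_queen + e_woman should land near e_man
predicted_word, sim2 = solve_analogy("king", "queen", "woman", toy_vectors)
print(predicted_word, round(sim2, 3))
```

With real embeddings trained on a small corpus the nearest word will often not be the expected one, which is what the `Correct?` column in the results table records.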

In [0]:
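def embed(word, embedding, vocab_size=V, tokenizer=tokenizer):
  # look up the integer index of the word (the tokenizer lowercases its input)
  int_word = tokenizer.texts_to_sequences([word])[0]
  # a one-hot row vector times the embedding matrix selects the word's row, shape (1, dim)
  bin_word = to_categorical(int_word, vocab_size)
  return np.dot(bin_word, embedding)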

In [0]:
def closest_word(predicted_embedding, word1_embedding, word2_embedding, word3_embedding, embedding):
  # rows equal to the query words (or to the prediction itself) are not valid answers
  exclude_words = [predicted_embedding, word1_embedding, word2_embedding, word3_embedding]
  exclude_index = {0}  # index 0 is reserved for padding and is not a real word
  for i in range(V):
    for j in exclude_words:
      if np.array_equal(j, embedding[[i]]):
        exclude_index.add(i)

  include_index = [x for x in range(V) if x not in exclude_index]

  # cosine similarity between the prediction and every remaining word (candidates for sim2)
  sims = cosine_similarity(predicted_embedding, embedding[include_index])[0]
  # taking the maximum directly also works when all similarities are negative
  best = int(np.argmax(sims))
  index = include_index[best]

  return sims[best], tokenizer.index_word[index]

In [0]:
def analogy(word1, word2, word3, true_word):
  # word2 is to word1 as word3 is to true_word: e_word1 - e_word2 + e_word3 should be close to e_true_word
  task = word2 + ' is to ' + word1 + ' as ' + word3 + ' is to?'
  df = pd.DataFrame(columns=['Analogy task', 'True word(sim1)', 'Predicted word(sim2)', 'Embedding', 'Correct?'])
  models = ['new_skipgram_50', 'new_skipgram_150', 'new_skipgram_300', 'new_cbow_50', 'new_cbow_150', 'new_cbow_300']

  for i, model in enumerate(models):
    embedding = np.loadtxt(model + '.txt')

    word1_embedding = embed(word1, embedding)
    word2_embedding = embed(word2, embedding)
    word3_embedding = embed(word3, embedding)
    predicted_embedding = word1_embedding - word2_embedding + word3_embedding

    # sim1: cosine similarity between the predicted embedding and the true word
    sim1 = cosine_similarity(predicted_embedding, embed(true_word, embedding))[0][0]

    # sim2: cosine similarity between the predicted embedding and its nearest word
    sim2, predicted_word = closest_word(predicted_embedding, word1_embedding, word2_embedding, word3_embedding, embedding)

    t_word = true_word + '({val})'.format(val=sim1)
    p_word = predicted_word + '({val})'.format(val=sim2)

    # the tokenizer stores words in lowercase, so compare case-insensitively
    df.loc[i] = [task, t_word, p_word, model, true_word.lower() == predicted_word]

  return df

In [24]:
analogy('king', 'queen', 'woman', 'man')

| | Analogy task | True word(sim1) | Predicted word(sim2) | Embedding | Correct? |
|------|------|------|------|------|------|
| 0 | queen is to king as woman is to? | man(0.17645969789489707) | another(0.44990029377209617) | new_skipgram_50 | False |
| 1 | queen is to king as woman is to? | man(0.19671153018823417) | smallest(0.2675138931382219) | new_skipgram_150 | False |
| 2 | queen is to king as woman is to? | man(0.07420692745887787) | helpless(0.22614571125328742) | new_skipgram_300 | False |
| 3 | queen is to king as woman is to? | man(-0.1834220056211924) | untwist(0.4833274125243154) | new_cbow_50 | False |
| 4 | queen is to king as woman is to? | man(0.04072989896274954) | chair(0.2942734997372393) | new_cbow_150 | False |
| 5 | queen is to king as woman is to? | man(-0.047171602193436496) | slate(0.19357305524093144) | new_cbow_300 | False |


In [25]:
analogy('queen', 'king', 'man', 'woman')

| | Analogy task | True word(sim1) | Predicted word(sim2) | Embedding | Correct? |
|------|------|------|------|------|------|
| 0 | king is to queen as man is to? | woman(-0.12540904719785675) | blow(0.4647726281299349) | new_skipgram_50 | False |
| 1 | king is to queen as man is to? | woman(0.08479572510230286) | reeds(0.26357401801243774) | new_skipgram_150 | False |
| 2 | king is to queen as man is to? | woman(-0.005560951152718523) | glad(0.18838150416320984) | new_skipgram_300 | False |
| 3 | king is to queen as man is to? | woman(0.048848609622548486) | both(0.45667987666832793) | new_cbow_50 | False |
| 4 | king is to queen as man is to? | woman(0.02953608888181832) | dodo(0.25267461233221644) | new_cbow_150 | False |
| 5 | king is to queen as man is to? | woman(-0.05101961699696715) | killing(0.19488828603668693) | new_cbow_300 | False |


In [26]:
# french to France as english to England
analogy('France', 'french', 'english', 'England')

| | Analogy task | True word(sim1) | Predicted word(sim2) | Embedding | Correct? |
|------|------|------|------|------|------|
| 0 | french is to France as english is to? | England(0.010037594101873819) | pleasing(0.44191868532603623) | new_skipgram_50 | False |
| 1 | french is to France as english is to? | England(-0.18921131989457513) | into(0.30822511843626627) | new_skipgram_150 | False |
| 2 | french is to France as english is to? | England(-0.01815044798698159) | neat(0.1763568314689753) | new_skipgram_300 | False |
| 3 | french is to France as english is to? | England(0.24215522699192957) | dinah(0.46289635122688705) | new_cbow_50 | False |
| 4 | french is to France as english is to? | England(0.14641683462362312) | dunce(0.26379237704003605) | new_cbow_150 | False |
| 5 | french is to France as english is to? | England(0.024537948566496975) | handsome(0.18467368743144072) | new_cbow_300 | False |


In [27]:
# forget to forgetting as remember to remembering
analogy('forgetting', 'forget', 'remember', 'remembering')

| | Analogy task | True word(sim1) | Predicted word(sim2) | Embedding | Correct? |
|------|------|------|------|------|------|
| 0 | forget is to forgetting as remember is to? | remembering(-0.24206488594168962) | mind(0.45148866712301483) | new_skipgram_50 | False |
| 1 | forget is to forgetting as remember is to? | remembering(0.1439939235583136) | lived(0.30396290069689363) | new_skipgram_150 | False |
| 2 | forget is to forgetting as remember is to? | remembering(0.08139538661542113) | tray(0.2223919550464544) | new_skipgram_300 | False |
| 3 | forget is to forgetting as remember is to? | remembering(-0.22620641332522362) | emphasis(0.4888385678577706) | new_cbow_50 | False |
| 4 | forget is to forgetting as remember is to? | remembering(-0.07978948882359788) | floor(0.277579552107549) | new_cbow_150 | False |
| 5 | forget is to forgetting as remember is to? | remembering(0.02857875796476763) | spoke(0.21356521797240333) | new_cbow_300 | False |


In [28]:
# hand is to knee as eye is to ear
analogy('knee', 'hand', 'eye', 'ear')

| | Analogy task | True word(sim1) | Predicted word(sim2) | Embedding | Correct? |
|------|------|------|------|------|------|
| 0 | hand is to knee as eye is to? | ear(0.10204160267428564) | crawling(0.4519353000022144) | new_skipgram_50 | False |
| 1 | hand is to knee as eye is to? | ear(-0.04415578613275498) | chuckled(0.28050926165789153) | new_skipgram_150 | False |
| 2 | hand is to knee as eye is to? | ear(0.06165104262429445) | us(0.20435088668743157) | new_skipgram_300 | False |
| 3 | hand is to knee as eye is to? | ear(0.12768119631137134) | very(0.543729961961767) | new_cbow_50 | False |
| 4 | hand is to knee as eye is to? | ear(0.01635093705827108) | even(0.25069047840317404) | new_cbow_150 | False |
| 5 | hand is to knee as eye is to? | ear(-0.025087382341028818) | sighed(0.17720160555120615) | new_cbow_300 | False |


In [29]:
# he is to his as she is to hers
analogy('his', 'he', 'she', 'hers')

| | Analogy task | True word(sim1) | Predicted word(sim2) | Embedding | Correct? |
|------|------|------|------|------|------|
| 0 | he is to his as she is to? | hers(-0.07428201388995209) | settle(0.4295097532817805) | new_skipgram_50 | False |
| 1 | he is to his as she is to? | hers(-0.10372003520778603) | appeared(0.2843083464182061) | new_skipgram_150 | False |
| 2 | he is to his as she is to? | hers(0.014964269771963981) | shrill(0.19254202025984168) | new_skipgram_300 | False |
| 3 | he is to his as she is to? | hers(0.01937456269204181) | undoing(0.48574646530538657) | new_cbow_50 | False |
| 4 | he is to his as she is to? | hers(0.00012462515723521902) | each(0.24443508736304217) | new_cbow_150 | False |
| 5 | he is to his as she is to? | hers(-0.08031098742553902) | arithmetic(0.17426783895637374) | new_cbow_300 | False |


In [30]:
analogy('he', 'his', 'hers', 'she')

| | Analogy task | True word(sim1) | Predicted word(sim2) | Embedding | Correct? |
|------|------|------|------|------|------|
| 0 | his is to he as hers is to? | she(-0.061049304855692044) | small(0.4135019658150866) | new_skipgram_50 | False |
| 1 | his is to he as hers is to? | she(-0.04069604577476947) | excellent(0.30856708261080656) | new_skipgram_150 | False |
| 2 | his is to he as hers is to? | she(0.06013155981663805) | cucumber(0.19922338675414952) | new_skipgram_300 | False |
| 3 | his is to he as hers is to? | she(0.07197485680091115) | since(0.4787940904787158) | new_cbow_50 | False |
| 4 | his is to he as hers is to? | she(-0.050996869568261126) | it(0.24370894178848868) | new_cbow_150 | False |
| 5 | his is to he as hers is to? | she(0.07266589781509537) | forget(0.18188486620594188) | new_cbow_300 | False |


## Task 1.4 - Discussion
Answer the following question:
* Given the same number of sentences as input, CBOW and Skipgram arrange the data into different numbers of training samples. Which one has more and why?


Answer:

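Consider a sentence of n words (tokens, not unique words) and a window of L words on each side of the centre word, so that the window holds up to 2L context words. Skipgram creates one training sample for every (centre word, context word) pair. Each word therefore contributes up to 2L samples and the sentence yields at most 2nL samples; words near the start or end of a sentence have fewer neighbours and contribute fewer pairs. CBOW uses the context words differently: it combines all context words in the window into a single training sample that predicts the centre word, so the sentence yields exactly n samples. The conclusion is that Skipgram has more training samples, because it pairs each word with each of its context words separately. In this notebook, Skipgram produces 94,556 samples from 27,165 words, whereas `generate_data_CBOW` emits one sample per word.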
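
The counting argument can be checked with a short sketch. `sample_counts` is a hypothetical helper and the sentence lengths are made up; it applies the same window clipping as the data generators, but only counts samples instead of building them.

```python
def sample_counts(sentence_lengths, window_size):
    """Count Skipgram pairs and CBOW samples for sentences of the given lengths."""
    skipgram = 0
    cbow = 0
    for n in sentence_lengths:
        for index in range(n):
            lo = max(0, index - window_size)
            hi = min(n, index + window_size + 1)
            skipgram += hi - lo - 1   # one pair per context word inside the window
            cbow += 1                 # one sample per centre word
    return skipgram, cbow

skipgram_n, cbow_n = sample_counts([5, 3, 8], window_size=2)
print(skipgram_n, cbow_n)
```

For a single sentence of n ≥ L words the Skipgram count works out to 2nL − L(L+1), which is the 2nL upper bound minus the pairs lost at the two sentence boundaries. That closed form is a side calculation and does not hold for sentences shorter than L words.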

# Question 2 - Peer review (0 pt):
Finally, each group member must write a single paragraph outlining their opinion on the work distribution within the group. Did every group member contribute equally? Did you split up tasks in a fair manner, or jointly work through the exercises? Do you think that some members of your group deserve a different grade from others? You can use the table below to make an overview of how the tasks were divided:



| Student name | Task  |
|------|------|
|  Huilin Zhu  | task x |
| student name 2  | task x|
| everyone | task x|


In [0]:
# Does the order of first 2 words matter?
# How to find the predicted word from the predicted word embedding?
# How to find the closest word from the predicted word embedding?
# What is sim1 and sim2 ..?
# to determine closest, we should use mod of the value