### Code Explanation

1. **Importing pandas and numpy**:
    - `pandas`: This library is used for data manipulation and analysis, particularly with data structures like DataFrames.
    - `numpy`: This library is used for numerical computations, including support for large multi-dimensional arrays and matrices.

2. **Importing Tokenizer and to_categorical from Keras**:
    - `Tokenizer`: This class from `tensorflow.keras.preprocessing.text` is used for converting text into sequences of integers, which can be used for feeding into a neural network.
    - `to_categorical`: This function from `tensorflow.keras.utils` converts a class vector (integers) to a binary class matrix (one-hot encoding), which is the target format expected by the `categorical_crossentropy` loss used later.

3. **Importing model and layers from Keras**:
    - `Sequential`: This class from `tensorflow.keras.models` is a linear stack of layers.
    - `Dense`: This layer from `tensorflow.keras.layers` is a regular densely-connected NN layer.
    - `Embedding`: This layer from `tensorflow.keras.layers` is used to convert positive integers (indexes) into dense vectors of fixed size.
    - `Reshape`: This layer from `tensorflow.keras.layers` reshapes an output to a certain shape.
    - `Lambda`: This layer from `tensorflow.keras.layers` wraps arbitrary expressions as a `Layer` object.

4. **Importing Keras and TensorFlow backend**:
    - `keras`: This is the high-level neural networks API from TensorFlow.
    - `tensorflow`: The core open source library developed by Google for machine learning and deep learning.
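    - `keras.backend` (imported as `K`): This is the Keras backend module, which exposes low-level tensor operations such as `K.mean`. Historically it abstracted over several backends (e.g., TensorFlow, Theano); with `tensorflow.keras` the backend is TensorFlow.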
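
As a rough illustration of the one-hot encoding that `to_categorical` produces, the sketch below reimplements the idea in plain NumPy. The helper name `one_hot` is hypothetical and not part of Keras; it only mirrors the behaviour for a 1-D vector of class indices.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) matrix with a single 1 per row."""
    indices = np.asarray(indices, dtype=int)
    encoded = np.zeros((indices.size, num_classes), dtype="float32")
    # Set the column given by each index to 1 in the corresponding row
    encoded[np.arange(indices.size), indices] = 1.0
    return encoded

example = one_hot([0, 2, 1], num_classes=4)
print(example)
```

Each row contains exactly one `1`, in the column given by the class index, which is why the targets built later in the notebook have `V` columns.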


In [20]:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Reshape, Lambda
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras import backend as K  # same Keras distribution as the other imports

In [21]:
file = "alice.txt"
with open(file) as f:  # the context manager closes the file after reading
    corpus = f.readlines()

In [22]:
corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]
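
To make the effect of this filter concrete, here is a small sketch on made-up sample lines (they are illustrative, not read from `alice.txt`). A line survives only if it contains at least two space characters, which in practice drops blank lines and very short headings.

```python
lines = [
    "CHAPTER I\n",                              # one space: dropped
    "Down the Rabbit-Hole\n",                   # two spaces: kept
    "\n",                                       # no spaces: dropped
    "Alice was beginning to get very tired\n",  # six spaces: kept
]

kept = [line for line in lines if line.count(" ") >= 2]
print(kept)
```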

### Code Explanation

1. **Initializing the Tokenizer**:
    ```python
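    tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' + "'")
    ```
    - `Tokenizer`: This initializes the Tokenizer with the set of characters to strip from the text: the Keras default filter list plus the apostrophe (`'`). The backslash is written as `\\` so that Python does not warn about an invalid escape sequence.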

2. **Fitting the Tokenizer on Texts**:
    ```python
    tokenizer.fit_on_texts(corpus)
    ```
    - `fit_on_texts(corpus)`: This method updates the internal vocabulary based on the list of texts (corpus) provided, preparing the tokenizer to convert the text into sequences of integers.
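
The following is a simplified, pure-Python sketch of the idea behind `fit_on_texts` and `texts_to_sequences`, not the Keras implementation. It assumes lowercasing, replacing filtered characters with spaces, and ranking words by frequency starting at index 1; the helper names `split_words` and `build_word_index` and the sample sentences are hypothetical, and tie-breaking between equally frequent words may differ from Keras.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' + "'"

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer rank, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in split_words(text))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["Alice was beginning to get very tired", "Alice's sister was reading"]
word_index = build_word_index(texts)
sequences = [[word_index[w] for w in split_words(t)] for t in texts]

print(word_index)
print(sequences)
```

Because the indices start at 1, index 0 stays free. That is why the notebook later sets the vocabulary size to `len(tokenizer.word_index) + 1`.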


In [26]:
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' + "'")
tokenizer.fit_on_texts(corpus)

In [6]:
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus )
V = len(tokenizer.word_index) + 1

In [7]:
n_samples,V

(27165, 2557)

In [8]:
window_size = 2
window_size_corpus = 4
np.random.seed(42)

In [9]:
corpus

[[305, 7, 38, 1, 92, 595],
 [11, 13, 253, 3, 106, 30, 470, 8, 342, 76, 16, 379, 20, 1],
 [828, 2, 8, 343, 136, 3, 54, 134, 57, 596, 6, 23, 829, 65, 1],
 [323, 16, 379, 13, 830, 24, 5, 23, 45, 683, 57, 1447, 12],
 [5, 2, 31, 36, 1, 212, 8, 4, 323, 59, 11, 170, 683, 57],
 [27, 6, 13, 831, 12, 16, 344, 324, 15, 70, 15, 6, 58, 25, 1],
 [471, 160, 154, 16, 415, 30, 597, 2, 529, 325, 1, 1049],
 [8, 416, 4, 1448, 1449, 49, 28, 684, 1, 530, 8, 188, 39, 2],
 [1050, 1, 1450, 56, 279, 4, 148, 92, 22, 1451, 155, 228],
 [280, 76, 16],
 [40, 13, 136, 27, 30, 1051, 12, 14, 832, 67, 11, 89, 5, 27],
 [30, 93, 35, 8, 1, 83, 3, 254, 1, 92, 96, 3, 255, 108, 156],
 [108, 156, 7, 173, 28, 531, 56, 6, 59, 5, 124, 1052, 5],
 [1053, 3, 16, 14, 6, 256, 3, 55, 1452, 18, 32, 24, 18, 1, 62],
 [5, 21, 164, 86, 685, 24, 56, 1, 92, 1453, 180, 4, 417],
 [35, 8, 78, 1054, 472, 2, 109, 18, 5, 2, 43, 345, 20],
 [11, 1055, 3, 16, 204, 25, 5, 1454, 598, 16, 324, 14, 6, 23],
 [103, 128, 238, 4, 92, 22, 346, 4, 1054, 472, 57

### Function Explanation: `generate_data_skipgram`

This function generates (center word, context word) training pairs for the skip-gram model.

#### Parameters:
- `corpus`: A list of lists where each inner list represents a sentence or sequence of words.
- `window_size`: Integer, the size of the window around the center word.
- `V`: Integer, the vocabulary size.

#### Returns:
- A tuple `(all_in, all_out)` where:
  - `all_in`: NumPy array of center-word indices, one entry per (center, context) pair.
  - `all_out`: NumPy array of one-hot encoded context words, one row per pair.

#### Explanation:
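1. **Initialization**:
   ```python
   maxlen = window_size * 2
   all_in = []
   all_out = []
   ```
   - `all_in` and `all_out` accumulate the center words and their one-hot encoded context words. `maxlen` is the maximum number of context words per center word; the rest of the function does not use it.
2. **Pair generation**: For every word at position `index` in a sentence, each position `i` from `index - window_size` to `index + window_size` (excluding `index` itself) that lies inside the sentence yields one training pair: the center word is appended to `all_in` and the one-hot encoding of `words[i]` to `all_out`.
3. **Return**: Both lists are converted to NumPy arrays and returned as a tuple.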


In [10]:
def generate_data_skipgram(corpus, window_size, V):
    maxlen = window_size * 2  # maximum number of context words per center word (not used below)
    all_in = []
    all_out = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            # Window boundaries around the center word
            p = index - window_size
            n = index + window_size + 1
            for i in range(p, n):
                # Skip the center word itself and positions outside the sentence
                if i != index and 0 <= i < L:
                    all_in.append(word)
                    all_out.append(to_categorical(words[i], V))
    return np.array(all_in), np.array(all_out)

In [11]:
x_skip,y_skip = generate_data_skipgram(corpus,window_size,V)
x_skip.shape,y_skip.shape

((94556,), (94556, 2557))
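
The pair generation in `generate_data_skipgram` can be traced on a toy sentence. The sketch below follows the same windowing logic but returns plain index pairs instead of one-hot rows, so the output stays readable; the function name `skipgram_pairs` and the token indices are made up for the example.

```python
def skipgram_pairs(sentence, window_size):
    """Return (center, context) index pairs for one tokenized sentence."""
    pairs = []
    for index, center in enumerate(sentence):
        for i in range(index - window_size, index + window_size + 1):
            # Skip the center word and positions outside the sentence
            if i != index and 0 <= i < len(sentence):
                pairs.append((center, sentence[i]))
    return pairs

toy_sentence = [5, 8, 3, 9]
pairs = skipgram_pairs(toy_sentence, window_size=1)
print(pairs)
```

Every token contributes up to `2 * window_size` pairs, which is why `x_skip` has several times more rows than the corpus has tokens.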

### Code Explanation

#### Variable and Loop Initialization
- `dims = [50, 150, 300]`: The embedding dimensions to train; one model is built per value.
- `skipgram_models = []`: Initializes an empty list that collects the compiled `Sequential` model for each dimension.

#### Model Creation and Compilation Loop
- Iterates through each dimension (`dim`) in `dims`.
- **Model Creation**:
  - Creates a `Sequential` model (`skipgram`).
  - Adds an `Embedding` layer:
    - `input_dim=V`, `output_dim=dim`, `input_length=1`, `embeddings_initializer='glorot_uniform'`.
  - Adds a `Reshape` layer to reshape output to `(dim,)`.
  - Adds a `Dense` layer with `V` units, `softmax` activation, `kernel_initializer='glorot_uniform'`.
- **Compilation**:
  - Compiles the model with Adam optimizer, `categorical_crossentropy` loss, `accuracy` metric.
- Prints model summary.
- Appends `skipgram` model to `skipgram_models`.
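
The parameter counts reported by `skipgram.summary()` follow directly from these layer sizes. As a quick arithmetic check, the sketch below recomputes them for the vocabulary size `V = 2557` reported earlier; the helper name `skipgram_param_counts` is hypothetical.

```python
V = 2557  # vocabulary size reported earlier in the notebook

def skipgram_param_counts(vocab_size, dim):
    """Return (embedding, dense, total) parameter counts for one model."""
    embedding = vocab_size * dim            # one dim-sized vector per vocabulary index
    dense = dim * vocab_size + vocab_size   # weight matrix plus one bias per output unit
    return embedding, dense, embedding + dense

for dim in [50, 150, 300]:
    print(dim, skipgram_param_counts(V, dim))
```

For `dim = 50` this gives 127850 embedding parameters, 130407 dense parameters and 258257 in total, matching the summary of the first model. The `Reshape` layer adds no parameters.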


In [12]:
dims = [50, 150, 300]
skipgram_models = []
for dim in dims:
    skipgram = Sequential()
    skipgram.add(Embedding(input_dim=V, output_dim=dim, input_length=1, embeddings_initializer='glorot_uniform'))
    skipgram.add(Reshape((dim, )))
    skipgram.add(Dense(V, activation='softmax', kernel_initializer='glorot_uniform'))
    skipgram.compile(optimizer=keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    skipgram.summary()
    print("")
    skipgram_models.append(skipgram)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1, 50)             127850    
                                                                 
 reshape (Reshape)           (None, 50)                0         
                                                                 
 dense (Dense)               (None, 2557)              130407    
                                                                 
Total params: 258257 (1008.82 KB)
Trainable params: 258257 (1008.82 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 1, 150)            383550    
                                                        

In [13]:
for skipgram in skipgram_models:
    skipgram.fit(x_skip,y_skip,batch_size=64,epochs=13,verbose=1)
    print("")

In [14]:
weights = skipgram.get_weights()
embedding = weights[0]
embedding.shape

(2557, 300)

In [16]:
for skipgram in skipgram_models:
    weights = skipgram.get_weights()
    embedding = weights[0]
    # Header row: the word column followed by one column per embedding dimension
    columns = ["words"] + [f"value_{i+1}" for i in range(embedding.shape[1])]
    # One file per embedding size; the context manager closes it after writing
    with open(f"vector_skipgram_{len(embedding[0])}.txt", "w") as f:
        f.write(" ".join(columns))
        f.write("\n")
        for word, i in tokenizer.word_index.items():
            f.write(word)
            f.write(" ")
            f.write(" ".join(map(str, list(embedding[i, :]))))
            f.write("\n")

In [17]:

# Prepare the data for the CBOW model
def generate_data_cbow(corpus, window_size, V):
    all_in = []
    all_out = []

    # Iterate over all sentences
    for sentence in corpus:
        L = len(sentence)
        for index, word in enumerate(sentence):
            start = index - window_size
            end = index + window_size + 1

            # Empty list which will store the context words
            context_words = []
            for i in range(start, end):
                # Skip the 'same' word
                if i != index:
                    # Add a word as a context word if it is within the window size
                    if 0 <= i < L:
                        context_words.append(sentence[i])
                    else:
                        # Pad with zero if there are no words 
                        context_words.append(0)
            # Append the list with context words
            all_in.append(context_words)

            # Add one-hot encoding of the target word
            all_out.append(to_categorical(word, V))
                 
    return (np.array(all_in), np.array(all_out))

In [18]:
X_cbow, y_cbow = generate_data_cbow(corpus, window_size, V)
X_cbow.shape, y_cbow.shape

((27165, 4), (27165, 2557))
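
To see how the zero padding in `generate_data_cbow` plays out, the sketch below applies the same windowing to a toy sentence and keeps the target as a plain index instead of a one-hot row. The function name `cbow_contexts` and the token indices are made up for the example.

```python
def cbow_contexts(sentence, window_size):
    """Return (context, target) pairs; contexts are zero-padded to 2 * window_size."""
    examples = []
    for index, target in enumerate(sentence):
        context = []
        for i in range(index - window_size, index + window_size + 1):
            if i == index:
                continue  # the target word is not part of its own context
            # Use the neighbouring word if it exists, otherwise pad with 0
            context.append(sentence[i] if 0 <= i < len(sentence) else 0)
        examples.append((context, target))
    return examples

examples = cbow_contexts([7, 4, 9], window_size=2)
for context, target in examples:
    print(context, "->", target)
```

Each token yields exactly one example with `2 * window_size` context slots, which is consistent with `X_cbow` having one row per token and four columns.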

In [23]:
cbow_models = []

for dim in dims:
    cbow = Sequential()

    cbow.add(Embedding(input_dim=V, 
                       output_dim=dim, 
                       input_length=window_size*2, # Each input entry holds 2 * window_size context words
                       embeddings_initializer='glorot_uniform'))

    cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim, )))

    cbow.add(Dense(V, activation='softmax', kernel_initializer='glorot_uniform'))

    cbow.compile(optimizer=keras.optimizers.Adam(),
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])
    
    cbow.summary()
    print("")
    cbow_models.append(cbow)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 4, 50)             127850    
                                                                 
 lambda_1 (Lambda)           (None, 50)                0         
                                                                 
 dense_3 (Dense)             (None, 2557)              130407    
                                                                 
Total params: 258257 (1008.82 KB)
Trainable params: 258257 (1008.82 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 4, 150)            383550    
                                                       
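
The `Lambda` layer in the CBOW models averages the context embeddings along axis 1, turning an output of shape `(None, 4, dim)` into `(None, dim)`. The NumPy sketch below imitates that step with a randomly initialised toy matrix; the sizes and values are assumptions chosen for readability, not the notebook's trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 3  # toy sizes, far smaller than the notebook's
embedding_matrix = rng.normal(size=(vocab_size, dim))

# Two CBOW inputs with 2 * window_size = 4 context indices each; 0 marks padding
contexts = np.array([[0, 0, 4, 9],
                     [7, 4, 0, 0]])

looked_up = embedding_matrix[contexts]  # shape (2, 4, 3), like the Embedding output
averaged = looked_up.mean(axis=1)       # shape (2, 3), like the Lambda output

print(looked_up.shape, averaged.shape)
```

Note that the padding index 0 is looked up like any other index here, so its vector contributes to the average. The same holds for the notebook's models as written, since they apply a plain mean over all four positions.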

In [24]:
for cbow in cbow_models:
    cbow.fit(X_cbow, y_cbow, batch_size=64, epochs=50, verbose=1)
    print("")

In [25]:
for cbow in cbow_models:
    # Save embeddings for vectors of length 50, 150 and 300 using cbow model
    weights = cbow.get_weights()

    # Get the embedding matrix
    embedding = weights[0]

    # Create columns for the words and the values in the matrix, makes it easier to read as dataframe
    columns = ["word"] + [f"value_{i+1}" for i in range(embedding.shape[1])]

    # Get word embeddings for each word in the vocabulary, write to file
    with open(f'vectors_cbow_{len(embedding[0])}.txt', 'w') as f:
        # Start writing to the file, start with the column names
        f.write(" ".join(columns))
        f.write("\n")

        for word, i in tokenizer.word_index.items():
            f.write(word)
            f.write(" ")
            f.write(" ".join(map(str, list(embedding[i, :]))))
            f.write("\n")
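
Since the files start with a header row and use single spaces as separators, they can be loaded back as a DataFrame. The sketch below writes a tiny file in the same layout and reads it with pandas; the file name, words and values are made up, and the choice of `keep_default_na=False` is an assumption meant to keep vocabulary words that look like missing-value markers as plain strings.

```python
import os
import tempfile

import pandas as pd

# Write a tiny file in the same layout: header row, then one word per line
path = os.path.join(tempfile.mkdtemp(), "vectors_demo.txt")
with open(path, "w") as f:
    f.write("word value_1 value_2\n")
    f.write("alice 0.1 -0.2\n")
    f.write("nan 0.3 0.4\n")

# index_col=0 uses the first column (the words) as the row index
vectors = pd.read_csv(path, sep=" ", index_col=0, keep_default_na=False)
print(vectors.loc["alice"].to_numpy())
```

Using `index_col=0` rather than a column name works for both sets of files, whose first column is called `words` in the skip-gram output and `word` in the CBOW output.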