In [45]:
import torch
import torch.nn as nn
import pandas as pd
import spacy
from torch.utils.data import Dataset, DataLoader

# Create a simple dataset

In [47]:
data = {
    'Sentence': ['Hello, how are you?', 'I am learning about AI!', 'Transformers are interesting.']
}

df = pd.DataFrame(data)
df.to_csv("sentences.csv", index=False)

# Tokenization

## Tokenization in Natural Language Processing (NLP)

Tokenization in Natural Language Processing (NLP) is the process of breaking text into smaller units called tokens. A token can be a word, a subword, a character, or a whole sentence, depending on the task. Tokenization turns unstructured text into a sequence of discrete units that algorithms can process.

### Types of Tokenization

1. **Word Tokenization**: This involves splitting text into individual words.
   - Example: "Hello world" becomes ["Hello", "world"]

2. **Subword Tokenization**: Often used in advanced NLP models like BERT, it breaks words into smaller units so that rare or unknown words can be built from known pieces.
   - Example: "smarter" might be tokenized as ["smart", "er"]

3. **Character Tokenization**: This approach splits text into individual characters, useful for certain types of linguistic analysis or languages without clear word boundaries.
   - Example: "cat" becomes ["c", "a", "t"]

4. **Sentence Tokenization**: This method breaks text into individual sentences, often used for tasks that require understanding the context of whole sentences.
   - Example: "Hello world. It's a great day." becomes ["Hello world.", "It's a great day."]
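
The four splitting strategies above can be sketched in plain Python. This is a minimal illustration, not how production tokenizers work: the whitespace split, the regex-based sentence splitter, and the `subword_tokenize` helper with its toy vocabulary `{"smart", "er"}` are all assumptions introduced here.

```python
import re

text = "Hello world. It's a great day."

# Word tokenization: split on whitespace (punctuation stays attached to words).
word_tokens = text.split()

# Character tokenization: every character becomes a token.
char_tokens = list("cat")

# Sentence tokenization: naive split after '.', '!' or '?' followed by whitespace.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)


def subword_tokenize(word, subword_vocab):
    """Greedy longest-match split of a word against a toy, hypothetical vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in subword_vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No known piece matches: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


subword_tokens = subword_tokenize("smarter", {"smart", "er"})

print(word_tokens)
print(char_tokens)
print(sentence_tokens)
print(subword_tokens)
```

Note that the whitespace split leaves punctuation attached (`"world."`), which is one reason a dedicated tokenizer such as spaCy is used later in this notebook.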

### Purpose and Importance

- **Machine Readability**: Tokenization converts raw text into discrete units that can be mapped to numbers and fed to a model.
- **Simplification of Text Analysis**: Working on small, well-defined units simplifies downstream tasks such as parsing and sentiment analysis.
- **Handling Ambiguity and Context**: The choice of token boundaries determines which units a model sees, which matters most in languages with complex grammar and syntax.

### Example in Context

Consider the sentence "NLP stands for Natural Language Processing." When tokenized into words, ignoring the final period, it becomes:

```python
["NLP", "stands", "for", "Natural", "Language", "Processing"]
```

In [59]:
# Load spacy model for tokenization
nlp = spacy.load("en_core_web_sm")

# Tokenization function using spacy -> The function returns a list of lowercase tokens.
def tokenize(text):
    return [token.text.lower() for token in nlp.tokenizer(text)]

# Create the vocabulary

`build_vocab` takes a DataFrame and returns a dictionary, `vocab`, that maps each unique token to an integer index.

The function starts with an empty dictionary, iterates over each sentence in the `Sentence` column, tokenizes it, and visits every token in order.

A token that is not yet in `vocab` is added with the current length of `vocab` as its value, so indices are assigned consecutively from 0 in order of first appearance.

**`vocab` is a dictionary representing the vocabulary (a mapping of tokens to indices).**

In [61]:
def build_vocab(dataframe):
    vocab = {}
    for sentence in dataframe['Sentence']:
        tokens = tokenize(sentence)
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

vocab = build_vocab(df)

In [62]:
vocab

{'hello': 0,
 ',': 1,
 'how': 2,
 'are': 3,
 'you': 4,
 '?': 5,
 'i': 6,
 'am': 7,
 'learning': 8,
 'about': 9,
 'ai': 10,
 '!': 11,
 'transformers': 12,
 'interesting': 13,
 '.': 14}
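
Because each token receives a unique index, the mapping can be inverted to turn indices back into tokens. The sketch below assumes the vocabulary printed above. The names `index_to_token`, `decode`, `encode`, and the reserved `UNK_INDEX` are hypothetical and do not appear in the notebook. The unknown-token fallback is one possible way to avoid a `KeyError` on words outside the vocabulary, not something the notebook itself does.

```python
# Vocabulary copied from the output above.
vocab = {
    'hello': 0, ',': 1, 'how': 2, 'are': 3, 'you': 4, '?': 5,
    'i': 6, 'am': 7, 'learning': 8, 'about': 9, 'ai': 10, '!': 11,
    'transformers': 12, 'interesting': 13, '.': 14,
}

# Invert the mapping: index -> token.
index_to_token = {index: token for token, index in vocab.items()}

# Hypothetical extra index reserved for tokens that are not in the vocabulary.
UNK_INDEX = len(vocab)


def decode(indices):
    """Turn a list of indices back into tokens."""
    return [index_to_token[i] for i in indices]


def encode(tokens):
    """Turn tokens into indices, falling back to UNK_INDEX for unknown tokens."""
    return [vocab.get(token, UNK_INDEX) for token in tokens]


decoded = decode([0, 1, 2, 3, 4, 5])
encoded = encode(["hello", "robots"])
print(decoded)
print(encoded)
```

If an unknown-token index like this were used with an embedding layer, the embedding table would need `len(vocab) + 1` rows.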

In [64]:
# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, dataframe, vocab):
        self.dataframe = dataframe
        self.vocab = vocab

    def __len__(self):
        return len(self.dataframe)

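    def __getitem__(self, idx):
        text = self.dataframe.iloc[idx]['Sentence']
        print(text)  # debug output: shows which sentence was drawn
        # Map each token to its vocabulary index; a token missing from vocab raises KeyError
        return torch.tensor([self.vocab[token] for token in tokenize(text)], dtype=torch.long)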

# Create dataset and dataloader
dataset = CustomDataset(df, vocab)
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

Each item in the dataset is a tensor representing **a sequence of indices corresponding to the tokens of a sentence from the DataFrame**, which is a common format for NLP tasks in deep learning.
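
The sentences have different token counts, so the index tensors have different lengths. With `batch_size=1` this causes no trouble, but a larger batch cannot be stacked by the default collation. A minimal sketch of one common workaround, padding, follows. The `pad_collate` function and the `PAD_INDEX` value are assumptions introduced here, and the index sequences are written out by hand from the vocabulary in this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Index sequences for the three sentences, using the vocabulary from this notebook.
sequences = [
    torch.tensor([0, 1, 2, 3, 4, 5]),     # "Hello, how are you?"
    torch.tensor([6, 7, 8, 9, 10, 11]),   # "I am learning about AI!"
    torch.tensor([12, 3, 13, 14]),        # "Transformers are interesting."
]

# Hypothetical padding index: one past the last real vocabulary index (14).
PAD_INDEX = 15


def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    return pad_sequence(batch, batch_first=True, padding_value=PAD_INDEX)


padded = pad_collate(sequences)
print(padded.shape)
print(padded[2])
```

Such a function could be passed to the loader as `DataLoader(dataset, batch_size=3, collate_fn=pad_collate)`. The embedding layer would then need one extra row for the padding index.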

In [43]:
text = df.iloc[0]['Sentence']
[vocab[token] for token in tokenize(text)]

[0, 1, 2, 3, 4, 5]

In [73]:
# Embedding Layer
class TransformerEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim, max_length):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.position_embeddings = nn.Embedding(max_length, embedding_dim)

    def forward(self, x):
        # x holds token indices with shape (seq_len,) or (batch, seq_len).
        # Positions come from the last dimension (sequence length), not the batch size.
        positions = torch.arange(x.size(-1), device=x.device).unsqueeze(0)
        # Broadcasting adds the same position vectors to every sequence.
        return self.token_embeddings(x) + self.position_embeddings(positions)

In [None]:
# Initialize model
embedding_dim = 64
vocab_size = len(vocab)
# Longest sentence in tokens; sets the number of rows in the position embedding table
max_length = max(len(tokenize(sentence)) for sentence in df['Sentence'])
model = TransformerEmbedding(vocab_size, embedding_dim, max_length)

In [69]:
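# Example input and embedding extraction
for batch in dataloader:
    # batch has shape (1, seq_len) because batch_size=1; batch[0] is the 1-D index sequence
    embeddings = model(batch[0])
    print(embeddings.shape)  # (1, seq_len, embedding_dim)
    break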

Transformers are interesting.
torch.Size([1, 4, 64])
