# ***Embeddings - CSV Data***

## ***1. Importing Libraries***

In [8]:
from transformers import GPT2Tokenizer, GPT2Model
import torch
import pandas as pd

## ***2. Loading the Pre-Trained Model***

#### ***Model Used: GPT-2 from the transformers library***

*In short, GPT-2 uses the Transformer architecture and is trained with backpropagation. The Transformer replaces the recurrence of traditional RNNs with a self-attention mechanism. The model is pre-trained without labeled data, by learning to predict the next token in text, which lets it capture complex patterns and dependencies in human language. This makes it usable for many language processing tasks, including text completion, summarization, and question answering, and made it an influential model in AI and machine learning.*

In [9]:
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2Model.from_pretrained(model_name)

## ***3. Loading the CSV Data***

In [10]:
file_path = "data/d1.csv"

# Read the CSV file
dataset = pd.read_csv(file_path, header=None)
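
What `header=None` does to the column labels can be checked on a small in-memory file. The sketch below is illustrative only: `csv_text` is a made-up stand-in for `data/d1.csv`, and dropping missing cells before embedding is an assumption about how one might want to treat them, not something the notebook does.

```python
import io

import pandas as pd

# Hypothetical two-line, headerless CSV standing in for data/d1.csv.
csv_text = "first sentence,extra\nsecond sentence,more\n"

frame = pd.read_csv(io.StringIO(csv_text), header=None)

# With header=None, pandas labels the columns 0, 1, 2, ...
column_labels = list(frame.columns)

# Column 0 as strings; missing cells are dropped so the literal
# text "nan" is never passed on as a line to embed.
lines = frame[0].dropna().astype(str).tolist()

print(column_labels)
print(lines)
```

Because the labels are integers, `row[0]` in a later loop refers to the first column of each row.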

## ***4. Generate Embeddings***
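
The cell in this section turns GPT-2's per-token hidden states into one vector per line by averaging over the token axis. A minimal sketch of that pooling step, run on a random stand-in tensor rather than real model output, is shown here. The sequence length of 5 is arbitrary, and 768 is the hidden size of the base `gpt2` checkpoint.

```python
import torch

# Stand-in for output.last_hidden_state with shape
# (batch=1, seq_len=5, hidden=768). Values are random, not model output.
torch.manual_seed(0)
last_hidden_state = torch.randn(1, 5, 768)

# Average over the token axis (dim=1), drop the batch axis,
# and convert to a NumPy array, as the notebook cell does.
embedding = last_hidden_state.mean(dim=1).squeeze().numpy()

# The same vector computed by hand: sum the 5 token vectors, divide by 5.
manual = (last_hidden_state[0].sum(dim=0) / 5).numpy()

print(embedding.shape)
```

Every line therefore yields a vector of the same length regardless of how many tokens it contains.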

In [7]:
# Get embeddings for each line in the dataset
for index, row in dataset.iterrows():
    # Take the first column; labels are integers because header=None was used
    line = str(row[0])

    # Tokenize the line
    tokens = tokenizer.encode(line, return_tensors='pt', truncation=True)

    with torch.no_grad():
        output = model(tokens)

    # Mean-pool the token vectors into a single embedding for the line
    embeddings = output.last_hidden_state.mean(dim=1).squeeze().numpy()

    # Print the vector form of the embeddings
    print(f"Line: {line}")
    print("Embeddings:", embeddings)
    print("=" * 50)


Streaming output truncated to the last 5000 lines.
 -2.73400635e-01  1.28525093e-01 -5.35286963e-03 -1.37983244e-02
 -4.62070778e-02 -2.40903512e-01 -1.32398963e-01  1.24140605e-01
 -1.64411962e-01  1.96969770e-02  1.07950740e-01 -1.78226873e-01
 -3.72114778e-02 -8.66092891e-02  2.23531142e-01 -1.65793449e-01
 -7.09848031e-02  2.44542807e-01 -1.29984751e-01 -1.34018674e-01
 -2.79396534e-01 -5.72028346e-02 -6.47597536e-02  2.00075954e-02
  1.91742182e-01  2.05104023e-01 -1.43622085e-01 -4.16761398e-01
  1.44669518e-01  5.34253657e-01 -1.28868714e-01 -2.76069373e-01
 -1.01869240e-01  4.01415415e-02 -2.17964053e-01 -3.23266059e-01
 -3.32383275e-01  1.21508174e-01  7.21959770e-02  1.70222633e-02
 -1.01175003e-01 -1.03179060e-01 -2.02957451e-01  2.36469060e-02
 -3.34074467e-01 -9.86804366e-02 -6.78114891e-02  1.84486099e-02
  7.28823915e-02  1.88938618e-01 -3.43842655e-01 -4.62888539e-01
 -3.41967523e-01 -1.76550057e-02 -2.91866422e-01 -7.38906115e-02
  7.92292058e-02  8.54095