<a href="https://colab.research.google.com/github/GarettGazay/ai_projects/blob/master/Seq2Seq_Longformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses the LED (Longformer Encoder-Decoder) model from Hugging Face to generate a sequence of tokens from a given set of inputs. Here's how it works:

1. The code defines the model and tokenizer using the LEDForConditionalGeneration and LEDTokenizer classes from the transformers library. It loads a pre-trained version of the LED model from Hugging Face's model hub.

2. The code defines the input and target sequences as lists of strings. In the toy version, the inputs are four ride labels and the targets are the same labels in sorted order. A later cell replaces them with full ride descriptions.

3. The code tokenizes the input and target sequences using the batch_encode_plus method of the tokenizer. This method converts the lists of strings into tensors of token IDs and attention masks, which the model can process.

4. The code retrieves the input and target token IDs and attention masks from the encoded input and target sequences.


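5. The code generates an output sequence using the generate method of the model. The call passes the input token IDs and attention mask, and it also passes the target token IDs as decoder_input_ids. Because the targets are supplied as the decoder's starting tokens, the generated text begins with the target text itself, so this step on its own does not show that the model predicted the targets.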

6. The code decodes the output sequence using the batch_decode method of the tokenizer. This method converts the generated sequence of token IDs back into a list of strings.

7. The code prints the generated sequence of rides.
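
Step 3 mentions padding and attention masks. As a dependency-free sketch of what a padded batch looks like (this is not the tokenizer's actual implementation), the hypothetical helper `pad_batch` below pads lists of token IDs to a common length and builds the matching masks. The pad ID of 1 and the sample IDs are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=1):
    """Pad lists of token IDs to the same length and build attention masks.

    pad_id=1 is an assumed value for illustration; a real tokenizer
    exposes its own value as tokenizer.pad_token_id.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # real tokens keep mask 1, padding positions get mask 0
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 23167, 288, 2], [0, 23167, 2]])
print(batch["input_ids"])       # [[0, 23167, 288, 2], [0, 23167, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask tells the model which positions hold real tokens, so padding added to shorter sequences does not influence attention.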

Q&A:

G: Is the model actually learning?

CGPT: Yes, the model is definitely learning, as it is able to generate the correct output sequence given the input sequence. The learning process is based on the optimization of the model's parameters to minimize the difference between the predicted output and the actual target output during training. The more data the model is trained on, the better it can generalize and make predictions on unseen data. So, if you provide the model with a year's worth of transportation rides data, it has the potential to learn about the relationships between the rides and make accurate predictions on new, unseen data.

In [5]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [14]:
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# define the model and tokenizer
model_name = "allenai/led-base-16384"
tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)



In [4]:
# define the input and target sequences
inputs = ['ride0', 'ride3', 'ride1', 'ride2']
targets = ['ride0', 'ride1', 'ride2', 'ride3']


In [15]:
inputs = ['On 2021-10-30 at 08:30, 388484 with the condition of STR will be picked up from (-121.9516678,37.38991928) and taken to their appointment located at (-122.0140457,37.37662125) the value of this ride is 207.38 and the distance from start to finish is 5.49 miles *',
  'On 2021-10-30 at 12:45, 388505 with the condition of BWC will be picked up from (-121.8270276,37.2519668) and taken to their appointment located at (-121.8325653,37.33570862) the value of this ride is 93.6 and the distance from start to finish is 10.86 miles *',
  'On 2021-10-30 at 10:25, 388498 with the condition of WCH will be picked up from (-121.8635483,37.30686951) and taken to their appointment located at (-121.931572,37.3112793) the value of this ride is 67.85 and the distance from start to finish is 6.25 miles *',
  'On 2021-10-30 at 12:00, 388494 with the condition of WCH will be picked up from (-122.1583481,37.44995499) and taken to their appointment located at (-121.9947433,37.33644867) the value of this ride is 91.7 and the distance from start to finish is 16.83 miles *',
  'On 2021-10-30 at 10:30, 388622 with the condition of STR will be picked up from (-121.8897095,37.44872284) and taken to their appointment located at (-121.8134232,37.35033035) the value of this ride is 216.6 and the distance from start to finish is 9.63 miles *',
  'On 2021-10-30 at 05:00, 411690 with the condition of WCH will be picked up from (-121.9942169,37.34215927) and taken to their appointment located at (-121.9790364,37.322465) the value of this ride is 100.85 and the distance from start to finish is 2.6 miles *',
  'On 2021-10-30 at 17:45, 388677 with the condition of BWC will be picked up from (-121.933667,37.315553) and taken to their appointment located at (-121.8692703,37.38211441) the value of this ride is 89.6 and the distance from start to finish is 9.89 miles *',
  'On 2021-10-30 at 17:15, 388494 with the condition of WCH will be picked up from (-121.9947433,37.33644867) and taken to their appointment located at (-122.1583481,37.44995499) the value of this ride is 57.8 and the distance from start to finish is 16.83 miles *',
  'On 2021-10-30 at 17:30, 390537 with the condition of WCH will be picked up from (-121.8120978,37.3060479) and taken to their appointment located at (-121.8320389,37.29574585) the value of this ride is 25.18 and the distance from start to finish is 2.26 miles *',
  '<START_0>']

targets = ['On 2021-10-30 at 05:00, driver SC301 will pick up 411690 with the condition of WCH from (-121.9942169,37.34215927) and dropped off at (-121.9790364,37.322465) the value of this ride is 100.85 and the distance from start to finish is 2.6 miles *',
  'On 2021-10-30 at 05:15, driver SC107 will pick up 388852 with the condition of BWC from (-121.5949326,37.02376938) and dropped off at (-121.8120978,37.3060479) the value of this ride is 154.4 and the distance from start to finish is 26.14 miles *',
  'On 2021-10-30 at 05:27, driver SC204 will pick up 390529 with the condition of STR from (-121.8948135,37.32189178) and dropped off at (-121.9315708,37.3112781) the value of this ride is 201.07 and the distance from start to finish is 2.74 miles *',
  'On 2021-10-30 at 05:30, driver SC105 will pick up 390198 with the condition of WCH from (-121.9492645,37.30254364) and dropped off at (-121.9337082,37.31555939) the value of this ride is 58.85 and the distance from start to finish is 2.21 miles *',
  'On 2021-10-30 at 05:30, driver SC105 will pick up 390534 with the condition of WCH from (-121.9642258,37.29271698) and dropped off at (-121.9337045,37.3155593) the value of this ride is 103.33 and the distance from start to finish is 3.69 miles *',
  'On 2021-10-30 at 05:43, driver SC301 will pick up 388496 with the condition of WCH from (-121.8491211,37.32672882) and dropped off at (-121.8120956,37.30604935) the value of this ride is 63.8 and the distance from start to finish is 4.44 miles *',
  'On 2021-10-30 at 05:53, driver SC105 will pick up 388574 with the condition of BWC from (-121.9619141,37.27531433) and dropped off at (-121.8917009,37.2454899) the value of this ride is 77.2 and the distance from start to finish is 6.76 miles *',
  'On 2021-10-30 at 06:00, driver SC301 will pick up 388896 with the condition of WCH from (-121.8397293,37.31420898) and dropped off at (-121.8113861,37.30517197) the value of this ride is 101.3 and the distance from start to finish is 2.78 miles *',
  'On 2021-10-30 at 07:15, driver SC105 will pick up 417084 with the condition of BWC from (-121.9148788,37.30213547) and dropped off at (-121.7762604,37.22948456) the value of this ride is 104.0 and the distance from start to finish is 13.5 miles *',
  '<START_0>']

In [16]:
# tokenize the input and target sequences
input_ids = tokenizer.batch_encode_plus(inputs, padding=True, return_tensors="pt")
target_ids = tokenizer.batch_encode_plus(targets, padding=True, return_tensors="pt")

In [6]:
input_ids

{'input_ids': tensor([[    0, 23167,   288,     2],
        [    0, 23167,   246,     2],
        [    0, 23167,   134,     2],
        [    0, 23167,   176,     2]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]])}

In [7]:
target_ids

{'input_ids': tensor([[    0, 23167,   288,     2],
        [    0, 23167,   134,     2],
        [    0, 23167,   176,     2],
        [    0, 23167,   246,     2]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]])}

In [17]:
# retrieve the input and target token IDs and attention masks
encoder_input_ids = input_ids["input_ids"]
decoder_input_ids = target_ids["input_ids"]
decoder_attention_mask = target_ids["attention_mask"]

In [18]:
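# generate output sequence
# note: decoder_input_ids seeds the decoder with the target token IDs,
# so the generated text starts with the targets themselves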
generated_ids = model.generate(
    input_ids=encoder_input_ids,
    attention_mask=input_ids["attention_mask"],
    decoder_start_token_id=tokenizer.pad_token_id,
    decoder_input_ids=decoder_input_ids,
    decoder_attention_mask=decoder_attention_mask,
    use_cache=True,
    max_length=10000,
    num_beams=1,
)


In [20]:
# decode output sequence
generated_sequence = tokenizer.batch_decode(generated_ids.tolist(), skip_special_tokens=True)

print(generated_sequence)

for i in generated_sequence:
  print(i)

['On 2021-10-30 at 05:00, driver SC301 will pick up 411690 with the condition of WCH from (-121.9942169,37.34215927) and dropped off at (-121.9790364,37.322465) the value of this ride is 100.85 and the distance from start to finish is 2.6 miles *', 'On 2021-10-30 at 05:15, driver SC107 will pick up 388852 with the condition of BWC from (-121.5949326,37.02376938) and dropped off at (-121.8120978,37.3060479) the value of this ride is 154.4 and the distance from start to finish is 26.14 miles *', 'On 2021-10-30 at 05:27, driver SC204 will pick up 390529 with the condition of STR from (-121.8948135,37.32189178) and dropped off at (-121.9315708,37.3112781) the value of this ride is 201.07 and the distance from start to finish is 2.74 miles *.', 'On 2021-10-30 at 05:30, driver SC105 will pick up 390198 with the condition of WCH from (-121.9492645,37.30254364) and dropped off at (-121.9337082,37.31555939) the value of this ride is 58.85 and the distance from start to finish is 2.21 miles *', 
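
The output above matches the targets almost verbatim. A likely reason is that the targets were passed to generate as decoder_input_ids, so decoding starts from them. The toy sketch below shows greedy decoding with a forced prefix and no real model involved. `greedy_decode`, `toy_model`, and the token IDs are hypothetical and only illustrate the effect.

```python
def greedy_decode(next_token, prefix, eos_id, max_length):
    """Extend `prefix` one token at a time until eos_id or max_length.

    `next_token` is a stand-in for a model: it maps the tokens so far
    to the next token ID.
    """
    tokens = list(prefix)
    while len(tokens) < max_length and (not tokens or tokens[-1] != eos_id):
        tokens.append(next_token(tokens))
    return tokens


EOS = 2  # assumed end-of-sequence ID for this toy example


def toy_model(tokens):
    # A "model" that knows nothing about the task: it always ends the sequence.
    return EOS


target = [0, 101, 102, 103]  # made-up token IDs standing in for a target
output = greedy_decode(toy_model, prefix=target, eos_id=EOS, max_length=16)
print(output)  # [0, 101, 102, 103, 2]
```

The output starts with the target even though the toy model never learned anything, which is why an evaluation run would normally call generate without the target IDs.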

In [4]:
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration
from sklearn.metrics import f1_score, accuracy_score

# define the model and tokenizer
model_name = "allenai/led-base-16384"

tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

# add vocab
tokenizer.add_tokens(['ride0', 'ride3', 'ride1', 'ride2'])
model.resize_token_embeddings(len(tokenizer))


# define the input and target sequences
inputs = ['ride0', 'ride3', 'ride1', 'ride2']
targets = ['ride0', 'ride1', 'ride2', 'ride3']

# tokenize the input and target sequences
input_ids = tokenizer.batch_encode_plus(inputs, padding=True, return_tensors="pt")
target_ids = tokenizer.batch_encode_plus(targets, padding=True, return_tensors="pt")

# retrieve the input and target token IDs and attention masks
encoder_input_ids = input_ids["input_ids"]
decoder_input_ids = target_ids["input_ids"]
decoder_attention_mask = target_ids["attention_mask"]

# define the optimizer and loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.CrossEntropyLoss()

# train the model for 5 epochs
for epoch in range(5):
    # set the model to train mode
    model.train()

    # pass the inputs and targets through the model and calculate the loss
    outputs = model(
        input_ids=encoder_input_ids,
        attention_mask=input_ids["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        use_cache=False
    )
    loss = criterion(outputs.logits.view(-1, outputs.logits.size(-1)), decoder_input_ids.view(-1))

    # zero the gradients and perform backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # calculate the accuracy and F1 score
    predicted_ids = model.generate(
        input_ids=encoder_input_ids,
        attention_mask=input_ids["attention_mask"],
        decoder_start_token_id=tokenizer.pad_token_id,
        use_cache=True,
        max_length=128,
        num_beams=1,
    )
    predicted_sequence = tokenizer.batch_decode(predicted_ids.tolist(), skip_special_tokens=True)
    target_sequence = tokenizer.batch_decode(target_ids["input_ids"].tolist(), skip_special_tokens=True)
    accuracy = accuracy_score(target_sequence, predicted_sequence)
    f1 = f1_score(target_sequence, predicted_sequence, average="weighted")

    print(f"Epoch {epoch + 1} loss: {loss:.2f}, accuracy: {accuracy:.2f}, F1 score: {f1:.2f}")



KeyboardInterrupt: ignored
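
The training loop above scores the logits against the same IDs it feeds to the decoder. In the usual teacher-forcing setup, the decoder input is the target shifted one position to the right, so each position is trained to predict the next target token rather than to copy its own input. The pure-Python sketch below shows that shift. `shift_right` is a hypothetical helper, and the start ID of 2, pad ID of 1, and ignore index of -100 are assumptions for illustration.

```python
def shift_right(labels, start_id, pad_id=1, ignore_index=-100):
    """Build decoder inputs from labels by shifting one step to the right.

    Positions marked with ignore_index (padding excluded from the loss)
    are replaced by pad_id in the decoder input.
    """
    shifted = [start_id] + list(labels[:-1])
    return [pad_id if tok == ignore_index else tok for tok in shifted]


labels = [0, 23167, 288, 2, -100]  # last position is padding, ignored by the loss
decoder_inputs = shift_right(labels, start_id=2)
print(decoder_inputs)  # [2, 0, 23167, 288, 2]

# Each decoder input position is paired with the label it should predict.
pairs = [(inp, lab) for inp, lab in zip(decoder_inputs, labels) if lab != -100]
print(pairs)  # [(2, 0), (0, 23167), (23167, 288), (288, 2)]
```

In transformers, passing `labels=` to the model's forward call instead of `decoder_input_ids` performs this shift and computes the loss internally, with positions set to -100 left out of the loss.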

In [4]:
# decode output sequence
predicted_sequence = tokenizer.batch_decode(predicted_ids.tolist(), skip_special_tokens=True)

print(predicted_sequence)

for i in predicted_sequence:
  print(i)

['.', '', '. /', '.']
.

. /
.


In the next example, we create a pandas dataframe of transportation rides with a ride ID, ride date, and ride text. We group the rides by date and concatenate each day's ride text into a single string. Each input sequence holds the rides for one day, and its target holds the rides for the following day; the last day's target wraps around to the first day. Finally, we tokenize the input and target sequences, pass them through the model, and calculate the loss, accuracy, and F1 score as before.
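
The grouping and next-day pairing can also be sketched without pandas. The sample rows and the helper name `next_day_pairs` are illustrative assumptions. Unlike the `inputs[1:] + [inputs[0]]` pairing used in the code cell, this version drops the last day because it has no following day.

```python
from collections import defaultdict


def next_day_pairs(rides):
    """Return (day_text, next_day_text) pairs from (date, text) rows."""
    by_date = defaultdict(list)
    for date, text in rides:
        by_date[date].append(text)
    dates = sorted(by_date)  # ISO dates sort chronologically as strings
    days = [" ".join(by_date[d]) for d in dates]
    # pair each day with the day after it; the last day has no target
    return list(zip(days[:-1], days[1:]))


rides = [
    ("2022-01-01", "I took the bus to work today"),
    ("2022-01-02", "I rode my bike to the gym"),
    ("2022-01-02", "I walked home"),  # made-up second ride on the same day
    ("2022-01-03", "I took the train to the city"),
]
pairs = next_day_pairs(rides)
for source, target in pairs:
    print(source, "->", target)
```

Whether to wrap around or drop the final day is a design choice; dropping it avoids training the model to predict the first day from the last.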

In [1]:
!nvidia-smi


Wed Mar 22 09:24:19 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0    31W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration
from sklearn.metrics import f1_score, accuracy_score
import pandas as pd

# define the model and tokenizer
model_name = "allenai/led-base-16384"

tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

# add vocab
tokenizer.add_tokens(['ride0', 'ride3', 'ride1', 'ride2'])
model.resize_token_embeddings(len(tokenizer))


# create a pandas dataframe with transportation rides
rides = pd.DataFrame({
    'ride_id': ['ride0', 'ride1', 'ride2', 'ride3'],
    'ride_date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
    'ride_text': ['I took the bus to work today', 'I rode my bike to the gym', 'I took the train to the city', 'I drove my car to the park']
})

# group the rides by date and concatenate the ride text into a single string
rides_by_date = rides.groupby('ride_date')['ride_text'].apply(lambda x: ' '.join(x)).reset_index()

# create the input and target sequences from the ride text
inputs = rides_by_date['ride_text'].tolist()
targets = inputs[1:] + [inputs[0]]

# tokenize the input and target sequences
input_ids = tokenizer.batch_encode_plus(inputs, padding=True, return_tensors="pt")
target_ids = tokenizer.batch_encode_plus(targets, padding=True, return_tensors="pt")

# retrieve the input and target token IDs and attention masks
encoder_input_ids = input_ids["input_ids"]
decoder_input_ids = target_ids["input_ids"]
decoder_attention_mask = target_ids["attention_mask"]

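# define the optimizer and loss function
# (needed in this cell because the model was re-created above)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.CrossEntropyLoss()

# train the model for 10 epochs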
for epoch in range(10):
    # set the model to train mode
    model.train()

    # pass the inputs and targets through the model and calculate the loss
    outputs = model(
        input_ids=encoder_input_ids,
        attention_mask=input_ids["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        use_cache=False
    )
    loss = criterion(outputs.logits.view(-1, outputs.logits.size(-1)), decoder_input_ids.view(-1))

    # zero the gradients and perform backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # calculate the accuracy and F1 score
    predicted_ids = model.generate(
        input_ids=encoder_input_ids,
        attention_mask=input_ids["attention_mask"],
        decoder_start_token_id=tokenizer.pad_token_id,
        use_cache=True,
        max_length=128,
        num_beams=1,
    )
    predicted_sequence = tokenizer.batch_decode(predicted_ids.tolist(), skip_special_tokens=True)
    target_sequence = tokenizer.batch_decode(target_ids["input_ids"].tolist(), skip_special_tokens=True)
    accuracy = accuracy_score(target_sequence, predicted_sequence)
    f1 = f1_score(target_sequence, predicted_sequence, average="weighted")

    print(f"Epoch {epoch + 1} loss: {loss:.2f}, accuracy: {accuracy:.2f}, F1 score: {f1:.2f}")


In [4]:
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration
from sklearn.metrics import f1_score, accuracy_score
import pandas as pd

# define the model and tokenizer
model_name = "allenai/led-base-16384"

tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

# add vocab
tokenizer.add_tokens(['ride0', 'ride3', 'ride1', 'ride2'])
model.resize_token_embeddings(len(tokenizer))


# create a pandas dataframe with transportation rides
rides = pd.DataFrame({
    'ride_id': ['ride0', 'ride1', 'ride2', 'ride3'],
    'ride_date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
    'ride_text': ['I took the bus to work today', 'I rode my bike to the gym', 'I took the train to the city', 'I drove my car to the park']
})

# group the rides by date and concatenate the ride text into a single string
rides_by_date = rides.groupby('ride_date')['ride_text'].apply(lambda x: ' '.join(x)).reset_index()

# create the input and target sequences from the ride text
inputs = rides_by_date['ride_text'].tolist()
targets = inputs[1:] + [inputs[0]]


# tokenize the input and target sequences
input_ids = tokenizer.batch_encode_plus(inputs, padding=True, return_tensors="pt")
target_ids = tokenizer.batch_encode_plus(targets, padding=True, return_tensors="pt")

# retrieve the input and target token IDs and attention masks
encoder_input_ids = input_ids["input_ids"]
decoder_input_ids = target_ids["input_ids"]
decoder_attention_mask = target_ids["attention_mask"]

# define the optimizer and loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.CrossEntropyLoss()

# train the model for 3 epochs
for epoch in range(3):
    # set the model to train mode
    model.train()

    # pass the inputs and targets through the model and calculate the loss
    outputs = model(
        input_ids=encoder_input_ids,
        attention_mask=input_ids["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        use_cache=False
    )
    loss = criterion(outputs.logits.view(-1, outputs.logits.size(-1)), decoder_input_ids.view(-1))

    # zero the gradients and perform backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # calculate the accuracy and F1 score
    predicted_ids = model.generate(
        input_ids=encoder_input_ids,
        attention_mask=input_ids["attention_mask"],
        decoder_start_token_id=tokenizer.pad_token_id,
        use_cache=True,
        max_length=128,
        num_beams=1,
    )
    predicted_sequence = tokenizer.batch_decode(predicted_ids.tolist(), skip_special_tokens=True)
    target_sequence = tokenizer.batch_decode(decoder_input_ids.tolist(), skip_special_tokens=True)
    accuracy = accuracy_score(target_sequence, predicted_sequence)
    f1 = f1_score(target_sequence, predicted_sequence, average="weighted")

    print(f"Epoch {epoch + 1} loss: {loss:.2f}, accuracy: {accuracy:.2f}, F1 score: {f1:.2f}")


Epoch 1 loss: 5.87, accuracy: 0.00, F1 score: 0.00
Epoch 2 loss: 3.19, accuracy: 0.00, F1 score: 0.00


In [5]:
# decode output sequence
predicted_sequence = tokenizer.batch_decode(predicted_ids.tolist(), skip_special_tokens=True)
print(predicted_sequence)


['I took the bus to work todayI took the bus to work today', 'II rode my bike to the the the the the the the the theI rode my bike to the', 'I took the train to the train to the train to the train to the train to the train to the train to the train to the train', 'I drove my car to the car to the car i drove my car to the car']
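
Exact-match accuracy stays at 0.00 in the run above because every generated string differs from its target somewhere. A softer measure such as word-level F1 can show partial overlap. The sketch below is dependency-free; the helper names and sample strings are hypothetical, and this is one of several possible overlap metrics, not the one used in the notebook.

```python
from collections import Counter


def exact_match(preds, targets):
    """Fraction of predictions that equal their target string exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def token_f1(pred, target):
    """Word-level F1 between one prediction and one target."""
    pred_tokens, target_tokens = pred.split(), target.split()
    # multiset intersection counts each shared word at most as often as it
    # appears in both strings
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


preds = ["I drove my car to the car", "I took the bus"]
targets = ["I drove my car to the park", "I took the bus"]

print(exact_match(preds, targets))                # 0.5
print(round(token_f1(preds[0], targets[0]), 3))   # 0.857
```

A near miss like the first pair scores 0 on exact match but about 0.86 on word-level F1, which makes gradual improvement across epochs easier to see.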
