This is the template notebook I use to fine-tune BART, T5, and PEGASUS. It is based on code from the tutorial [Finetuning Transformer for Summary Generation](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb) by [Abhishek Kumar Mishra](https://github.com/abhimishra91). After fine-tuning each model, I save it to my Google Cloud bucket.

#Initial Setup

In [None]:
!pip install transformers -q
!pip install wandb -q


In [None]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration #AutoTokenizer, AutoModel

# WandB – Import the wandb library
import wandb



In [None]:
!nvidia-smi

Wed Jul 29 22:49:01 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
!wandb login

[34m[1mwandb[0m: You can find your API key in your browser here: https://app.wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter: c0c39a95c6893ca35844835d27d282e38eced5e4
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[32mSuccessfully logged in to Weights & Biases![0m


#Data Processing

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
project_id = 'test-281700'
!gcloud config set project {project_id}
!gsutil ls

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud survey

gs://spotify_asr_dataset/
gs://staging.test-281700.appspot.com/
gs://test-281700.appspot.com/


In [None]:
bucket_name = 'spotify_asr_dataset'
#download dataset
!gsutil -m cp -r gs://{bucket_name}/dataset.csv /content/
#download metadata for episodes
!gsutil -m cp -r gs://{bucket_name}/metadata.tsv /content/
#download filtered episodes
!gsutil -m cp -r gs://{bucket_name}/filtered-episode-ids.txt /content/

Copying gs://spotify_asr_dataset/dataset.csv...
/ [1/1 files][  2.9 GiB/  2.9 GiB] 100% Done  16.5 MiB/s ETA 00:00:00           
Operation completed over 1 objects/2.9 GiB.                                      
Copying gs://spotify_asr_dataset/metadata.tsv...
/ [1/1 files][112.2 MiB/112.2 MiB] 100% Done                                    
Operation completed over 1 objects/112.2 MiB.                                    
Copying gs://spotify_asr_dataset/filtered-episode-ids.txt...
/ [1/1 files][  2.5 MiB/  2.5 MiB] 100% Done                                    
Operation completed over 1 objects/2.5 MiB.                                      


In [None]:
dataset = pd.read_csv('dataset.csv')
podcasts_metadata = pd.read_csv('metadata.tsv', sep='\t')

In [None]:
full_dataset = pd.merge(left=podcasts_metadata, right=dataset, how='left', left_on='episode_uri', right_on='episode_id')
del full_dataset['episode_uri']
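
The merge above is a left join keyed on `episode_uri` (metadata) and `episode_id` (transcripts). A minimal sketch of what that does, using made-up toy frames rather than the real files: every metadata row is kept, and a row with no matching transcript ends up with `NaN` in the transcript columns. Whether such rows exist in the real dataset is not shown in this notebook, so treat this as an illustration only.

```python
import pandas as pd

# Toy stand-ins for metadata.tsv and dataset.csv (all values are made up).
metadata = pd.DataFrame({
    "episode_uri": ["ep:a", "ep:b", "ep:c"],
    "episode_description": ["desc a", "desc b", "desc c"],
})
transcripts = pd.DataFrame({
    "episode_id": ["ep:a", "ep:c"],
    "transcript": ["text a", "text c"],
})

# Same call shape as in the notebook: left join on differently named keys.
merged = pd.merge(left=metadata, right=transcripts, how="left",
                  left_on="episode_uri", right_on="episode_id")

# "ep:b" has no transcript, so its episode_id and transcript are NaN.
missing = int(merged["transcript"].isna().sum())
print(merged)
print(f"rows without a transcript: {missing}")
```

If rows like this matter, `merged.dropna(subset=["transcript"])` would remove them before training.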

In [None]:
filter = pd.read_csv('filtered-episode-ids.txt', sep=" ", header=None, names=["episode_id"])
filter.head()

Unnamed: 0,episode_id
0,spotify:episode:000A9sRBYdVh66csG2qEdj
1,spotify:episode:001UfOruzkA3Bn1SPjcdfa
2,spotify:episode:001i89SvIQgDuuyC53hfBm
3,spotify:episode:0025RWNwe2lnp6HcnfzwzG
4,spotify:episode:002NDlaaJN4vUczXHDHqWZ


In [None]:
train_data = full_dataset.loc[full_dataset["episode_id"].isin(filter['episode_id'])]

In [None]:
train_data = train_data[['episode_id', 'transcript', 'episode_description']].copy()  # copy so later column assignments do not touch a view

In [None]:
len(train_data)

66242

In [None]:
train_data['transcript'] = 'summarize: ' + train_data['transcript'] 

In [None]:
train_data.reset_index(drop=True, inplace=True)
train_data.head(5)

Unnamed: 0,episode_id,transcript,episode_description
0,spotify:episode:000A9sRBYdVh66csG2qEdj,summarize: Hello. Hello. Hello everyone. This...,On the first ever episode of Kream in your Kof...
1,spotify:episode:001UfOruzkA3Bn1SPjcdfa,summarize: Welcome to inside the 18. Today's ...,Today’s episode is a sit down Michael and Omar...
2,spotify:episode:001i89SvIQgDuuyC53hfBm,summarize: Hey cheese fans before we get star...,Join us as we take a look at all current Chief...
3,spotify:episode:0025RWNwe2lnp6HcnfzwzG,"summarize: Sorry to interrupt the show, but I...",The modern morality tail of how to stay good f...
4,spotify:episode:002NDlaaJN4vUczXHDHqWZ,summarize: If you haven't heard about anchor ...,Miss Jenn Davis reads the final part of The Si...


In [None]:
validation_data = full_dataset.loc[~full_dataset["episode_id"].isin(filter['episode_id'])]
validation_data = validation_data[['episode_id', 'transcript', 'episode_description']].copy()  # copy so the next assignment does not touch a view
validation_data['transcript'] = 'summarize: ' + validation_data['transcript']
validation_data.reset_index(drop=True, inplace=True)
validation_data.head(5)

Unnamed: 0,episode_id,transcript,episode_description
0,spotify:episode:000HP8n3hNIfglT2wSI2cA,summarize: There were two more murders 15 mil...,"See something, say something. It’s a mantra ma..."
1,spotify:episode:0025w0gdgkl11Nzkmg1wnm,summarize: This is all India radio in the pro...,.
2,spotify:episode:003wT7YPtDMpA8r62joD9M,"summarize: Hey everyone, welcome to another e...",How are relationships made? What is trust buil...
3,spotify:episode:004eNUDOyWSUN3n1UJsToy,summarize: Everybody Anton here from e-commer...,In this interview episode of the eCommerce Lif...
4,spotify:episode:004scar91tc5UcMpthhoCG,summarize: Welcome to Mom Hood everybody. Hap...,Dr. Zelana Montminy or Dr. Z is redefining par...


In [None]:
len(validation_data)

39118
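
The train/validation split above relies on `isin` and its negation `~`. Here is a small sketch of that pattern on toy data (the ids and column values are invented for illustration): rows whose id appears in the filter list go to one frame, all remaining rows go to the other, so the two parts always add up to the full frame.

```python
import pandas as pd

# Toy data; the real frames come from dataset.csv and filtered-episode-ids.txt.
episodes = pd.DataFrame({
    "episode_id": ["ep:a", "ep:b", "ep:c", "ep:d"],
    "transcript": ["t a", "t b", "t c", "t d"],
})
keep_ids = pd.DataFrame({"episode_id": ["ep:a", "ep:c"]})

# Boolean mask: True where the episode id is in the filter list.
in_filter = episodes["episode_id"].isin(keep_ids["episode_id"])

train_part = episodes.loc[in_filter].reset_index(drop=True)
valid_part = episodes.loc[~in_filter].reset_index(drop=True)

print(len(train_part), len(valid_part), len(episodes))
```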

In [None]:
# Use only a subset of the data: on the full dataset Colab runs out of memory or the runtime disconnects
train_data = train_data[:3000]
validation_data = validation_data[:20]
print(len(train_data), len(validation_data))

3000 20


#Training

In [None]:
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.text = self.data.episode_description #target summary
        self.ctext = self.data.transcript #original transcript

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

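        # padding='max_length' replaces the deprecated pad_to_max_length=True;
        # truncation=True makes the truncation to max_length explicit
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt')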

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
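
To make the dataset's output shape concrete, here is a rough pure-Python sketch of the two preprocessing ideas used in `__getitem__`: whitespace normalization and fixed-length encoding with an attention mask. The helper `pad_or_truncate` is hypothetical and only mimics the effect of padding and truncating to a maximum length; the real tokenizer also handles subword splitting and special tokens, which are ignored here. A pad id of 0 is assumed for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: return (ids, mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])      # cut off anything past max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad                 # fill the rest with the pad id
    mask += [0] * n_pad                     # 0 marks padding
    return ids, mask


# Same whitespace normalization as in __getitem__.
raw = "  Hello   everyone,\n welcome  back. "
normalized = " ".join(raw.split())
print(repr(normalized))

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example has the same length, the `DataLoader` can stack them into a batch without a custom collate function.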

In [None]:
def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for step, data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)
        # Teacher forcing: the decoder is fed tokens 0..n-2 and must predict tokens 1..n-1
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        # Padding positions are set to -100 so the loss ignores them
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        # `labels` is the current keyword; `lm_labels` is deprecated in newer transformers releases
        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]
        
        if step % 10 == 0:
            wandb.log({"Training Loss": loss.item()})

        if step % 500 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # xm.optimizer_step(optimizer)
        # xm.mark_step()
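
The slicing at the top of the training loop is easier to follow on a single toy sequence. The sketch below uses plain Python lists instead of tensors and a hypothetical helper, `shift_targets`; the pad id of 0 and the token values are assumptions for illustration. It shows how one padded target becomes the decoder input (everything but the last token) and the labels (everything but the first token, with padding replaced by -100, the index that PyTorch's cross-entropy loss ignores by default).

```python
PAD_ID = 0            # assumed pad token id for this illustration
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss


def shift_targets(target_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Hypothetical helper mirroring y[:, :-1] and y[:, 1:] for one sequence."""
    decoder_input = target_ids[:-1]
    labels = [ignore_index if t == pad_id else t for t in target_ids[1:]]
    return decoder_input, labels


# Toy target: three content tokens, an end-of-sequence token (1), then padding.
target = [21, 22, 23, 1, 0, 0]
dec_in, labels = shift_targets(target)
print(dec_in)
print(labels)
```

At each position the decoder sees `dec_in[i]` and is scored against `labels[i]`, so padded positions contribute nothing to the loss.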

In [None]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for step, data in enumerate(loader, 0):
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
            if step % 100 == 0:
                print(f'Completed {step}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

In [None]:
!mkdir t5-model-3000

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
def main():
    # WandB – Initialize a new run
    wandb.init(project="transformers_summarization")

    # WandB – Config is a variable that holds and saves hyperparameters and inputs
    # Defining some key variables that will be used later on in the training  
    config = wandb.config          # Initialize config
    config.TRAIN_BATCH_SIZE = 2    # batch size for training
    config.VALID_BATCH_SIZE = 2    # batch size for validation
    config.TRAIN_EPOCHS = 2        # number of training epochs
    config.VAL_EPOCHS = 1          # number of passes over the validation set
    config.LEARNING_RATE = 1e-4    # learning rate
    config.SEED = 42               # random seed
    config.MAX_LEN = 512           # maximum number of tokens per input transcript
    config.SUMMARY_LEN = 150       # maximum number of tokens per target summary

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(config.SEED) # pytorch random seed
    np.random.seed(config.SEED) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # Tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained("t5-base") #replace with relevant transformer model
    

    # The domain data was already prepared in the Data Processing cells above:
    # only the needed columns were kept and 'summarize: ' was prepended to each transcript,
    # which formats the dataset similarly to how the T5 model was trained for the summarization task.

    
    # Creation of Dataset and Dataloader
    # train_data and validation_data were split above using the filtered episode ids, so no split happens here.

    train_dataset= train_data
    val_dataset= validation_data

    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("TEST Dataset: {}".format(val_dataset.shape))


    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = CustomDataset(train_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)
    val_set = CustomDataset(val_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)

    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': config.TRAIN_BATCH_SIZE,
        'shuffle': True,
        'num_workers': 0
        }

    val_params = {
        'batch_size': config.VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
        }

    # Creation of Dataloaders for training and validation. These are used below in the training and validation stages.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)


    
    # Defining the model. T5ForConditionalGeneration is the t5-base model with a language-modeling head on top, used to generate the summary.
    # The model is then moved to the device (GPU if available, otherwise CPU).
    model = T5ForConditionalGeneration.from_pretrained("t5-base") #replace with relevant transformer model
    model = model.to(device)

    # Defining the optimizer that will be used to tune the weights of the network in the training session. 
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=config.LEARNING_RATE)

    # Log metrics with wandb
    wandb.watch(model, log="all")
    # Training loop
    print('Initiating Fine-Tuning for the model on our dataset')

    for epoch in range(config.TRAIN_EPOCHS):
        train(epoch, tokenizer, model, device, training_loader, optimizer)
      
    
    #torch.save(model, '/t5_model.pt')

    # model = torch.load(PATH)
    # model.eval()

    


    # Validation loop: collect the predictions and actuals in a dataframe
    # and save the dataframe as predictions.csv
    print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
    for epoch in range(config.VAL_EPOCHS):
        predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
        final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
        final_df.to_csv('predictions.csv', index=False)
        print('Output Files generated for review')
    
    model.save_pretrained('./t5-model-3000')
    tokenizer.save_pretrained('./t5-model-3000')

if __name__ == '__main__':
    main()

TRAIN Dataset: (3000, 3)
TEST Dataset: (20, 3)


Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Initiating Fine-Tuning for the model on our dataset


Epoch: 0, Loss:  9.14786148071289


Epoch: 0, Loss:  3.3052241802215576


Epoch: 0, Loss:  3.009028196334839


Epoch: 1, Loss:  2.5413997173309326


Epoch: 1, Loss:  2.8675293922424316


Epoch: 1, Loss:  2.34614634513855


Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe


Completed 0


Output Files generated for review
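
The number of loss lines in the output follows from simple arithmetic on the settings used in this run: 3000 training examples, 20 validation examples, batch size 2, a console print every 500 training steps, a wandb log every 10 steps, and a progress print every 100 validation steps. The sketch below redoes that calculation; the helper names are made up for illustration, and it assumes the `DataLoader` keeps the final partial batch (its default behavior).

```python
import math


def steps_per_epoch(n_examples, batch_size):
    """Batches per epoch when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)


def logged_steps(n_steps, every):
    """Step indices where `step % every == 0` is true."""
    return list(range(0, n_steps, every))


train_steps = steps_per_epoch(3000, 2)
console_prints = logged_steps(train_steps, 500)
wandb_logs = logged_steps(train_steps, 10)

val_steps = steps_per_epoch(20, 2)
val_prints = logged_steps(val_steps, 100)

print(train_steps, console_prints, len(wandb_logs))
print(val_steps, val_prints)
```

This matches the log above: three loss lines per training epoch and a single `Completed 0` line during validation.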


#Predictions and Model Saving

In [None]:
predictions = pd.read_csv('predictions.csv')

In [None]:
predictions.iloc[12]['Generated Text']

"Episode 61 of Mastering Your Fertility is all about reclaiming health, enhancing fertility and preparing for pregnancy. This week's episode will discuss some of our favorite book recommendations for fertility, preconception, and general women's health. Listen to find out which ones we recommend in which situations to help narrow down your reading list. We also share some great books that we haven't talked about before. Check it out on Amazon.com/masteringyourfertility/ or subscribe to the podcast at https://www.youtube.com/masteringyourfertility/watch?v=1#KristyCornett@gmail.com/kristycor"

In [None]:
predictions.iloc[12]['Actual Text']

'In episode 61, we share the top books we recommend for preconception, fertility, and women’s health. There are so many incredible authors and experts out there, many of whom we’ve hosted as guests on the podcast over the past year. We wanted to take this week to share our top picks for reading material when you’re trying to conceive and/or trying to work out imbalances with your hormones and cycle. You’ll learn which books we recommend, what you can expect to learn from each one, and who can benefit most from the information shared. You can find show notes for the episode, along with links to all the books we talk about and any corresponding podcast episodes, on our'

In [None]:
!sudo apt install zip unzip
!zip -r t5-model-3000 t5-model-3000

Reading package lists... Done
Building dependency tree       
Reading state information... Done
unzip is already the newest version (6.0-21ubuntu1).
zip is already the newest version (3.0-11build1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
  adding: t5-model-3000/ (stored 0%)
  adding: t5-model-3000/pytorch_model.bin (deflated 7%)
  adding: t5-model-3000/tokenizer_config.json (stored 0%)
  adding: t5-model-3000/config.json (deflated 64%)
  adding: t5-model-3000/spiece.model (deflated 48%)
  adding: t5-model-3000/special_tokens_map.json (deflated 83%)
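
As an alternative to the shell `zip` command, the archive could also be built from Python with the standard library. This is only a sketch under assumptions: `zip_directory` is a hypothetical helper, and the demo uses a temporary directory with a dummy `config.json` instead of the real saved model. It aims to produce the same layout as `zip -r t5-model-3000 t5-model-3000`, with paths inside the archive prefixed by the directory name.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(dir_path):
    """Hypothetical helper: create <dir_path>.zip containing the directory itself."""
    dir_path = os.path.abspath(dir_path)
    parent, name = os.path.split(dir_path)
    # root_dir/base_dir keep the directory name as a prefix inside the archive.
    return shutil.make_archive(base_name=dir_path, format="zip",
                               root_dir=parent, base_dir=name)


# Demo on a throwaway directory (stand-in for the saved model folder).
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "t5-model-3000")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as fh:
        fh.write("{}")

    archive = zip_directory(model_dir)
    archive_name = os.path.basename(archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(archive_name, names)
```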


In [None]:
#save model to gcp bucket
!gsutil -m cp /content/t5-model-3000.zip gs://{bucket_name}/

Copying file:///content/t5-model-3000.zip [Content-Type=application/zip]...
/ [0/1 files][    0.0 B/787.9 MiB]   0% Done                                    ==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

|
Operation completed over 1 objects/787.9 MiB.                                    
