
<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## **Architecture Overview**

#### **T5**

- At a high level, T5 is built using a transformer based architecture. It is different from other architectures like BERT, RoBERTa, GPT-2 and XLNET in several ways.

- The building blocks of T5 are similar to those of other transformer based models. It consists of an encoder and a decoder. The encoder takes in the input seq and generates a hidden representation of the input. The decoder takes in the hidden representation and generates the output sequence.

- The encoder and decoder are connected by an attention mechanism that allows the decoder to attend to different parts of the input sequence.

- One of the key differences between T5 and other architectures like BERT,RoBERTa, GPT2 and XLNet is that T5 is a text to text model. This means that It can be trained on a wide range of nlp taks by simply changing the input and output format. 

- For example, It can be trained on machine translation by providing it with a source language sentence and a target language sentence as input-output pair. It can also be trained on summarization by providing it with a long document as input and short summary as output.

- T5 uses a pretraining and fine tuning approach. During pre-training, T5 is trained on a large corpus of text using a masked language modelling objective whereas during the fine tuning stage, The pre-trained model is fine tuned on a specific task by providing it with task-specific input and output pairs.

- T5 also uses a technique called task-specific prompts, which allows it to perform well on a wide range of tasks with minimal task specific training. The prompts are short text strings that are concatenated to the input sequence to provide task specific information to the model.

</div>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from transformers import T5Tokenizer, TFT5Model
import tensorflow as tf

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


In [2]:
# Define the number of GPUs to use
num_gpus = 2


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Load and Prepare Data: 
The following code snippets loads the training and test data from CSV files using pandas and splits the training data into input (question + answer) and target variables.
    
</div>

In [3]:
# Load the data
train_data = pd.read_csv('/kaggle/input/google-quest-challenge/train.csv')
test_data = pd.read_csv('/kaggle/input/google-quest-challenge/test.csv')


In [4]:
# Split the training data into input (question + answer) and target variables
X = train_data['question_title'] + ' ' + train_data['question_body'] + ' ' + train_data['answer']
y = train_data.iloc[:, 11:]


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Tokenize Input Data: 

The input data is tokenized using the T5 tokenizer from the Transformers library. The tokenizer converts the text into a numerical representation suitable for input to the T5 model.</div>

In [5]:
# Tokenize the input data
tokenizer = T5Tokenizer.from_pretrained('t5-base')
X_encoded = tokenizer.batch_encode_plus(
    X.tolist(),
    padding='longest',
    truncation=True,
    return_tensors='tf'
)

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Split Data into Training and Validation Sets: 
The tokenized input data and target variables are split into training and validation sets using the train_test_split function from scikit-learn.
    
</div>

In [6]:
# Convert the tensor array to a numpy array of integers
X_encoded_ids = np.array(X_encoded['input_ids'])

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_encoded_ids, y, test_size=0.2, random_state=42)


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Define T5 Model Architecture: 
This snippet defines the architecture of the T5 model using the TFT5Model class from the Transformers library. The model takes the tokenized input IDs as input and produces a sequence of hidden states. The final hidden state corresponding to the first token is extracted and passed through a dense layer with sigmoid activation to obtain the model's output.
    
</div>

In [7]:
# Define the T5 model architecture
input_ids = Input(shape=(X_encoded['input_ids'].shape[1],), dtype='int32')
decoder_input_ids = Input(shape=(X_encoded['input_ids'].shape[1],), dtype='int32')
t5_model = TFT5Model.from_pretrained('t5-base')
output = t5_model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).last_hidden_state[:, 0, :]
output = Dense(y_train.shape[1], activation='sigmoid')(output)
model = Model(inputs=[input_ids, decoder_input_ids], outputs=output)


Downloading model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFT5Model.

All the weights of TFT5Model were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5Model for predictions without further training.


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Compile and Train the Model: 
The model is compiled with the Adam optimizer and binary cross-entropy loss. It is then trained on the training data using the fit method, with the validation data used for monitoring the model's performance during training.

</div>

In [8]:
# Compile the model
model.compile(optimizer=Adam(learning_rate=1e-5), loss='binary_crossentropy')


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Define Distribution Strategy for Multi-GPU Training: 
A distribution strategy is defined to enable multi-GPU training. The MirroredStrategy from TensorFlow is used, which replicates the model across multiple GPUs and synchronizes their updates.
    
</div>

In [9]:
# Define the distribution strategy for multi-GPU training
strategy = tf.distribute.MirroredStrategy()


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Create and Compile Distributed Model: 
The model is wrapped with the distribution strategy using the strategy.scope() context manager, which creates the distributed model and compiles it. This allows the model to be trained on multiple GPUs.


   

In [10]:
# Create and compile the distributed model
with strategy.scope():
    distributed_model = model


In [None]:
# Compile and train the model
distributed_model.compile(optimizer='adam', loss='binary_crossentropy')
distributed_model.fit(
    [X_train, X_train],
    y_train,
    validation_data=([X_val, X_val], y_val),
    batch_size=1 * strategy.num_replicas_in_sync,
    epochs=10
)

<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Make Predictions on Test Data: 
The test data is tokenized using the same tokenizer used for the training data. The tokenized input IDs are then passed to the trained distributed model to make predictions on the test data.

In [13]:
# Tokenize the test data
X_test = test_data['question_title'] + ' ' + test_data['question_body'] + ' ' + test_data['answer']
X_test_encoded = tokenizer.batch_encode_plus(
    X_test.tolist(),
    padding='longest',
    truncation=True,
    return_tensors='tf'
)

In [14]:
# Convert the tensor array to a numpy array of integers
X_test_encoded_ids = np.array(X_test_encoded['input_ids'])


In [15]:
# Make predictions on the test data
predictions = distributed_model.predict([X_test_encoded_ids, X_test_encoded_ids])




<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Create Submission DataFrame: 
The predicted values are used to create a DataFrame for the submission file. The predictions are organized in columns corresponding to the target variables, and the qa_id column from the test data is included for identification.

In [16]:
# Create a DataFrame for the submission file
submission_df = pd.DataFrame(predictions, columns=y_train.columns)
submission_df.insert(0, 'qa_id', test_data['qa_id'])


<div style="font-family: 'Arial', sans-serif; font-size: 18px; color: #5f5c9c;">

## Save Submission DataFrame to CSV: 
The submission DataFrame is saved as a CSV file named "submission.csv" without including the index column.

In [17]:
# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission.csv', index=False)

In [18]:
submission_df

Unnamed: 0,qa_id,question_asker_intent_understanding,question_body_critical,question_conversational,question_expect_short_answer,question_fact_seeking,question_has_commonly_accepted_answer,question_interestingness_others,question_interestingness_self,question_multi_intent,...,question_well_written,answer_helpful,answer_level_of_information,answer_plausible,answer_relevance,answer_satisfaction,answer_type_instructions,answer_type_procedure,answer_type_reason_explanation,answer_well_written
0,39,0.919736,0.720155,0.172387,0.737997,0.693735,0.758682,0.650969,0.614660,0.217058,...,0.864714,0.911911,0.691923,0.948988,0.941793,0.860930,0.102672,0.096489,0.693883,0.903641
1,46,0.852762,0.457538,0.015420,0.803854,0.731723,0.884745,0.533372,0.440967,0.106015,...,0.709771,0.950297,0.675851,0.970681,0.973296,0.884662,0.776433,0.178853,0.372724,0.891988
2,70,0.895918,0.696801,0.074210,0.788232,0.724790,0.814511,0.608176,0.524826,0.178956,...,0.849852,0.922942,0.677749,0.952771,0.952158,0.863300,0.329377,0.135897,0.531592,0.893110
3,132,0.860401,0.457747,0.011842,0.809626,0.736134,0.894841,0.537488,0.438102,0.096054,...,0.711140,0.956427,0.684720,0.974240,0.977973,0.892506,0.820227,0.187396,0.346095,0.897584
4,200,0.901813,0.569932,0.108243,0.702484,0.703878,0.750246,0.603424,0.578567,0.218860,...,0.770060,0.900358,0.651802,0.945203,0.937745,0.833813,0.194209,0.115866,0.622912,0.896693
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
471,9569,0.881836,0.534002,0.013219,0.815507,0.743243,0.886533,0.563547,0.452329,0.103742,...,0.757519,0.954384,0.679831,0.973254,0.977064,0.891134,0.803854,0.202071,0.312506,0.897262
472,9590,0.874853,0.519922,0.034245,0.739269,0.728903,0.830708,0.546320,0.478859,0.163506,...,0.758962,0.931827,0.668114,0.958713,0.960989,0.861795,0.573305,0.164343,0.476341,0.891656
473,9597,0.858172,0.453844,0.016616,0.804158,0.735860,0.880214,0.542408,0.448858,0.112948,...,0.698306,0.947991,0.668020,0.970566,0.972866,0.881061,0.746435,0.176517,0.367195,0.895275
474,9623,0.910676,0.725312,0.133036,0.763369,0.702743,0.770729,0.639397,0.590993,0.204018,...,0.862555,0.905421,0.672075,0.944085,0.939848,0.849545,0.168045,0.115110,0.599056,0.895735
