## What are TPUs?
The Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google to accelerate machine learning workloads, in particular the training of neural networks.

## TPUs for free at Kaggle
**You can use TPUs for up to 30 hours per week, and for up to 9 hours in a single session.**
**For more information, see the [Kaggle TPU documentation](https://www.kaggle.com/docs/tpu).**

## Why do we need TFRecord format?
TFRecord is TensorFlow's own data format for storing a sequence of binary records. Its advantages include efficient storage, fast sequential I/O, and self-contained files. For TPU training, the main advantage of TFRecords is faster I/O: a TPU consumes batches so quickly that data loading easily becomes the bottleneck, so faster input translates into faster model training.

To understand the basics of TFRecords, see Ryan Holbrook's notebook: [TFRecords Basics](https://www.kaggle.com/ryanholbrook/tfrecords-basics).

**In this notebook you will learn how to convert a text dataset into the TFRecord format.**

## Useful resources which helped me:
- https://www.tensorflow.org/tutorials/load_data/tfrecord
- https://www.kaggle.com/mgornergoogle/five-flowers-with-keras-and-xception-on-tpu
- https://towardsdatascience.com/a-practical-guide-to-tfrecords-584536bc786c
- https://www.kaggle.com/omkargangan/commonlit-readability-competition
- https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
- https://pub.towardsai.net/writing-tfrecord-files-the-right-way-7c3cee3d7b12

# Imports

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# Load the data

In [None]:
df = pd.read_csv('../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
df = df.drop(labels = ['id','severe_toxic','obscene','threat','insult','identity_hate'], axis=1)
df.columns = ['comment','toxic']

# Clean the data

In [None]:
# Build the stop-word set and the lemmatizer once instead of on every call
BAD_TOKENS = set(stopwords.words('english')) | set(punctuation)
LEMMATIZER = WordNetLemmatizer()

def clean_document(doc):
    # lowercase and tokenize
    tokens = word_tokenize(doc.lower())
    # drop stop words and punctuation, then lemmatize what is left
    clean_tokens = [LEMMATIZER.lemmatize(t) for t in tokens if t not in BAD_TOKENS]
    return ' '.join(clean_tokens)

In [None]:
df.comment = df.comment.apply(clean_document)

In [None]:
X = df['comment']
y = df['toxic']

# One-hot encoding the labels

In [None]:
y = to_categorical(y, num_classes=2)
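
To make the effect of one-hot encoding concrete, here is a minimal NumPy sketch of the same idea. The toy labels are made up for illustration, and this is only an equivalent formulation, not how `to_categorical` is implemented.

```python
import numpy as np

# Toy binary labels standing in for the `toxic` column (hypothetical values)
toy_labels = np.array([0, 1, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype=np.float32)[toy_labels]
print(one_hot)
```

Each label becomes a vector of length 2 with a single 1 at the position of its class.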

# Split into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

# Data Processing

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

In [None]:
vocabulary = tokenizer.index_word
vocab_len = len(vocabulary)
vocab_len

In [None]:
train_sequence = tokenizer.texts_to_sequences(X_train)
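
As a rough picture of what fitting the tokenizer and converting texts to sequences does, the sketch below rebuilds the idea in plain Python. It is a simplified stand-in, not the Keras implementation: it ignores lowercasing, punctuation filtering, and the other `Tokenizer` options, and the toy texts are hypothetical.

```python
from collections import Counter

toy_texts = ["bad comment", "good comment here"]

# Count every whitespace-separated word across the corpus
counts = Counter(word for text in toy_texts for word in text.split())

# The most frequent word gets index 1; index 0 stays free for padding
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace each word by its index, skipping words not seen during fitting
toy_sequences = [[word_index[w] for w in text.split() if w in word_index]
                 for text in toy_texts]
print(word_index)
print(toy_sequences)
```

Because the index is built from the training texts only, words that appear only in the test set are dropped when the test texts are converted.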

In [None]:
doc_len = [len(doc) for doc in train_sequence]
max(doc_len)

In [None]:
np.quantile(doc_len, 0.99)


In [None]:
max_len = 347
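
One way to arrive at a cap like `max_len` is to take a high quantile of the document lengths instead of the maximum, so that a few very long comments do not inflate every padded sequence. The sketch below uses made-up lengths; `suggested_len` and `covered` are names introduced here for illustration.

```python
import numpy as np

# Hypothetical token-id sequences of varying length
toy_sequences = [[1, 2], [3], [4, 5, 6, 7], [8, 9, 10], [11, 12, 13, 14, 15, 16]]
doc_len = [len(seq) for seq in toy_sequences]

# Round the 99% quantile up to a whole number of tokens
suggested_len = int(np.ceil(np.quantile(doc_len, 0.99)))

# Fraction of documents that fit without truncation
covered = np.mean(np.array(doc_len) <= suggested_len)
print(suggested_len, covered)
```

Documents longer than the chosen length are truncated during padding, so the quantile is a trade-off between lost tokens and wasted padding.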

In [None]:
train_sequence_matrix = sequence.pad_sequences(train_sequence, maxlen= max_len)
test_sequence = tokenizer.texts_to_sequences(X_test)
test_sequence_matrix = sequence.pad_sequences(test_sequence, maxlen= max_len)
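
It helps to know which side `pad_sequences` pads and truncates on. With its default arguments both happen at the front of the sequence, which the small sketch below shows on toy input (the values are arbitrary).

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# One sequence shorter than maxlen, one longer
toy = [[1, 2], [3, 4, 5, 6, 7]]

# Defaults: padding='pre' and truncating='pre'
padded = pad_sequences(toy, maxlen=4)
print(padded)
```

The short sequence is left-padded with zeros and the long one loses its first token. Passing `padding='post'` or `truncating='post'` changes either behavior.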

In [None]:
len(train_sequence_matrix)

In [None]:
print(train_sequence_matrix.shape)
print(y_train.shape)

In [None]:
train_sequence_matrix = train_sequence_matrix.reshape(-1, max_len, 1)
train_sequence_matrix.shape

In [None]:
# Give the test matrix the same trailing dimension as the training matrix
test_sequence_matrix = test_sequence_matrix.reshape(-1, max_len, 1)
test_sequence_matrix.shape

## Feature Creation functions

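The following functions convert a value to a type compatible with `tf.train.Example`. Each one takes a scalar input value and returns a `tf.train.Feature` holding one of the three supported list types (bytes, int64, or float). Arrays are not scalars, so `serialize_array` first turns an array into a single byte string.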

In [None]:
def _bytes_feature(value):
    if isinstance(value, type(tf.constant(0))): # if value is an eager tensor
        value = value.numpy() # extract the raw bytes from the tensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def serialize_array(array):
    array = tf.io.serialize_tensor(array)
    return array
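
To see what such features look like, the standalone sketch below builds one feature of each kind directly, plus one from a serialized array. The values are arbitrary, and the block does not reuse the helpers above so that it can run on its own.

```python
import tensorflow as tf

# The three list types a tf.train.Feature can hold
bytes_feat = tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"toxic"]))
int_feat = tf.train.Feature(int64_list=tf.train.Int64List(value=[1]))
float_feat = tf.train.Feature(float_list=tf.train.FloatList(value=[0.5]))

# An array is serialized to one byte string and stored as a bytes feature
array_bytes = tf.io.serialize_tensor(tf.constant([[1, 2], [3, 4]])).numpy()
array_feat = tf.train.Feature(bytes_list=tf.train.BytesList(value=[array_bytes]))

# Each feature has exactly one populated list
kinds = [f.WhichOneof("kind") for f in (bytes_feat, int_feat, float_feat, array_feat)]
print(kinds)

# The serialized array can be recovered if the dtype is known
restored_array = tf.io.parse_tensor(array_bytes, out_type=tf.int32).numpy()
```

Storing arrays as serialized tensors keeps their shape, but the reader has to know the dtype in advance because it is passed explicitly to `tf.io.parse_tensor`.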

## Serializing and Writing 

Now we'll create a dictionary that stores the padded token sequence and its one-hot label. Each array is first serialized and then converted to a bytes feature. All these `key:value` mappings make up the features for one `tf.train.Example`.


In [None]:
def parse_text_data(text, label):
    data = {
        'text' : _bytes_feature(serialize_array(text)),
        'label' : _bytes_feature(serialize_array(label))
    }
    out = tf.train.Example(features=tf.train.Features(feature=data))
    return out

In [None]:
def write_text_to_tfr(text_data, label, filename: str = "text"):
    filename = filename + ".tfrecords"
    count = 0
    # the context manager closes the file even if serialization fails
    with tf.io.TFRecordWriter(filename) as writer:
        for current_text, current_label in zip(text_data, label):
            out = parse_text_data(text=current_text, label=current_label)
            writer.write(out.SerializeToString())
            count += 1
    print(f"Wrote {count} elements to TFRecord")
    return count

In [None]:
write_text_to_tfr(text_data=train_sequence_matrix, label=y_train, filename="jigsaw_toxic_comment_train")

In [None]:
write_text_to_tfr(text_data=test_sequence_matrix, label=y_test, filename="jigsaw_toxic_comment_test")
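
A quick way to check files like these is to read them back. The sketch below is a self-contained round trip on toy data: it writes three records with the same `text` and `label` keys and then parses them again. `make_example`, `parse_example`, and the toy arrays are hypothetical names for this illustration, and the dtypes (`int32` for the text, `float32` for the label) are assumptions that must match whatever was actually serialized.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

def make_example(text, label):
    # Same layout as the writer: both arrays stored as serialized tensors
    feature = {
        "text": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(text).numpy()])),
        "label": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(label).numpy()])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def parse_example(record):
    # The schema must mirror the keys used when writing
    schema = {
        "text": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.string),
    }
    parsed = tf.io.parse_single_example(record, schema)
    # out_type must match the dtype that was serialized (assumed here)
    text = tf.io.parse_tensor(parsed["text"], out_type=tf.int32)
    label = tf.io.parse_tensor(parsed["label"], out_type=tf.float32)
    return text, label

# Toy data standing in for the padded sequences and one-hot labels
toy_text = np.arange(12, dtype=np.int32).reshape(3, 4, 1)
toy_label = np.array([[1, 0], [0, 1], [1, 0]], dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "toy.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    for t, l in zip(toy_text, toy_label):
        writer.write(make_example(t, l).SerializeToString())

# Read the file back and decode every record
restored = [(t.numpy(), l.numpy())
            for t, l in tf.data.TFRecordDataset(path).map(parse_example)]
print(len(restored), restored[0][0].shape, restored[0][1])
```

If the parsed arrays come back with the wrong values or raise a decoding error, the usual cause is an `out_type` that differs from the dtype of the array at write time, so it is worth printing `train_sequence_matrix.dtype` and `y_train.dtype` before choosing it.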