# 1. Library imports

Install the `ktrain` package with pip before running the notebook: `!pip install ktrain`

`ktrain` is a lightweight wrapper around TensorFlow Keras. It simplifies common tasks such as text classification and is particularly convenient for working with BERT models.
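
Before running the imports, it can help to confirm that the required packages are actually installed. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of ktrain.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Report which of the packages used in this notebook are missing
required = ["pandas", "numpy", "tensorflow", "ktrain", "certifi"]
missing = [name for name in required if not is_installed(name)]
print("Missing packages:", missing or "none")
```

Any package listed as missing can be installed with pip in the same way as `ktrain`.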

In [1]:
# Import essential libraries
# - pandas and numpy for data manipulation
# - tensorflow for building and training neural network models
# - ktrain for simplifying the use of TensorFlow and Keras
# - os and certifi for pointing SSL at a trusted certificate bundle
# Loading takes a while because tensorflow and ktrain are heavy packages

import pandas as pd
import numpy as np
import tensorflow as tf
import ktrain
import os
import certifi

from ktrain import text



In [2]:
# Point the SSL_CERT_FILE environment variable at certifi's CA bundle
# so that HTTPS downloads (datasets and other resources) can verify certificates

os.environ["SSL_CERT_FILE"] = certifi.where()

# 2. Data preprocessing

Data preprocessing is a crucial step in any machine learning pipeline, especially in text analysis. This section downloads the IMDB movie review dataset, loads the reviews from disk, and converts them into the input format that the BERT model expects. With `preprocess_mode='bert'`, ktrain handles tokenization, so no separate cleaning step is written by hand.

In [3]:
%%time

# Load the IMDB dataset (this also takes a while)
dataset = tf.keras.utils.get_file(
    fname='aclImdb_v1.tar.gz',
    origin='https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
    extract=True
)

data_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

print(dataset)
print(data_dir)

/Users/rodrigofranciozi/.keras/datasets/aclImdb_v1.tar.gz
/Users/rodrigofranciozi/.keras/datasets/aclImdb
CPU times: user 5.13 s, sys: 9.17 s, total: 14.3 s
Wall time: 15.1 s
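
`text.texts_from_folder` expects one subfolder per split and, inside each, one subfolder per class. Before loading, the extracted archive can be sanity-checked with a short standard-library sketch like the one below. The helper name `count_reviews` is hypothetical, and the `.txt` extension is an assumption about how the reviews are stored.

```python
import os


def count_reviews(data_dir, splits=("train", "test"), classes=("pos", "neg")):
    """Count .txt files in each <split>/<class> folder under data_dir."""
    counts = {}
    for split in splits:
        for label in classes:
            folder = os.path.join(data_dir, split, label)
            if os.path.isdir(folder):
                n_files = sum(1 for name in os.listdir(folder) if name.endswith(".txt"))
            else:
                n_files = 0  # folder missing: extraction may have failed
            counts[(split, label)] = n_files
    return counts
```

Calling `count_reviews(data_dir)` with the `data_dir` computed above returns a dictionary keyed by `(split, class)`. For the complete IMDB archive, each of the four folders should hold 12,500 reviews; a count of zero suggests the archive was not extracted where expected.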


In [4]:
# Load the training and test sets
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    datadir=data_dir,
    classes=['pos', 'neg'],
    maxlen=500,
    train_test_names=['train', 'test'],
    preprocess_mode='bert'  # standard preprocessing mode for BERT
)

detected encoding: utf-8
preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


# 3. Building the BERT model

In [5]:
model = text.text_classifier(
    name='bert',
    train_data=(x_train, y_train),
    preproc=preproc
)

Is Multi-Label? False
maxlen is 500




done.


# 4. Training the BERT model

In [6]:
learner = ktrain.get_learner(
    model=model,
    train_data=(x_train, y_train),
    val_data=(x_test, y_test),
    batch_size=6
)

In [7]:
# Uncomment this to train the model (it takes a very long time)

# learner.fit_onecycle(
#     lr=2e-5,
#     epochs=1
# )
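
A rough way to see why even a single epoch is slow is to count the optimizer steps it involves. The sketch below assumes the standard IMDB training split of 25,000 reviews and reuses the `batch_size=6` passed to `ktrain.get_learner`; the variable names are illustrative only.

```python
import math

n_train = 25_000   # assumption: size of the standard IMDB training split
batch_size = 6     # same value passed to ktrain.get_learner
epochs = 1

# The last batch may be smaller than batch_size, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

With these numbers one epoch is 4,167 steps, and each step is a forward and backward pass through BERT on sequences of up to 500 tokens.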