## Importing Required Modules

- pandas    =>  to manipulate DataFrames
- numpy     =>  to manipulate arrays
- ktrain    =>  to train the BERT model
- text      =>  ktrain submodule used to preprocess text and build the classifier

In [None]:
import pandas as pd
import numpy as np

import ktrain
from ktrain import text

## Importing Training and Validation Datasets

In [None]:
data_train = pd.read_csv('data/Train.csv') # Training Dataframe
data_test = pd.read_csv('data/Valid.csv') # Validation Dataframe

### Preprocessing Train and Validation Dataframe
-   relabel every 0 as 'neg' (negative)
-   relabel every 1 as 'pos' (positive)

In [None]:
# Relabel the training set: 0 -> 'neg', 1 -> 'pos'.
# A vectorized map avoids the chained assignment (data_train['label'][i] = ...)
# that triggers pandas' SettingWithCopyWarning.
data_train['label'] = data_train['label'].map({0: 'neg', 1: 'pos'})
data_train.head()


Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,neg
1,"When I put this movie in my DVD player, and sa...",neg
2,Why do people who do not know what a particula...,neg
3,Even though I have great interest in Biblical ...,neg
4,Im a die hard Dads Army fan and nothing will e...,pos


In [None]:
# Relabel the validation set: 0 -> 'neg', 1 -> 'pos'.
# A vectorized map avoids the chained assignment (data_test['label'][i] = ...)
# that triggers pandas' SettingWithCopyWarning.
data_test['label'] = data_test['label'].map({0: 'neg', 1: 'pos'})
data_test.head()


Unnamed: 0,text,label
0,It's been about 14 years since Sharon Stone aw...,neg
1,someone needed to make a car payment... this i...,neg
2,The Guidelines state that a comment must conta...,neg
3,This movie is a muddled mish-mash of clichés f...,neg
4,Before Stan Laurel became the smaller half of ...,neg


### Converting Dataframe Columns into Lists

In [None]:
X_train = data_train.text.tolist()  # training texts (independent variable)
X_test = data_test.text.tolist()    # validation texts (independent variable)

y_train = data_train.label.tolist()  # training labels (dependent variable)
y_test = data_test.label.tolist()    # validation labels (dependent variable)

# Combine training and validation data to inspect the overall class balance.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
data = pd.concat([data_train, data_test], ignore_index=True)

class_names = ['pos', 'neg']

print('size of training set: %s' % (len(data_train['text'])))
print('size of validation set: %s' % (len(data_test['text'])))
print(data.label.value_counts())

data.head(10)

size of training set: 40000
size of validation set: 5000
neg    22505
pos    22495
Name: label, dtype: int64


Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,neg
1,"When I put this movie in my DVD player, and sa...",neg
2,Why do people who do not know what a particula...,neg
3,Even though I have great interest in Biblical ...,neg
4,Im a die hard Dads Army fan and nothing will e...,pos
5,A terrible movie as everyone has said. What ma...,neg
6,Finally watched this shocking movie last night...,pos
7,I caught this film on AZN on cable. It sounded...,neg
8,It may be the remake of 1987 Autumn's Tale aft...,pos
9,My Super Ex Girlfriend turned out to be a plea...,pos
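
The counts printed above show a nearly even split between the two classes. As a rough sketch in plain Python, with the counts copied from that output rather than recomputed from the data, the class proportions can be checked like this:

```python
# Counts copied from the value_counts() output above
counts = {'neg': 22505, 'pos': 22495}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Print each class share as a percentage
for label, share in proportions.items():
    print('{}: {:.2%}'.format(label, share))
```

A split this close to 50/50 suggests that plain accuracy is a reasonable first metric for this dataset.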


In [None]:
encoding = {
    'pos': 0,
    'neg': 1
}

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]
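
The position of each name in `class_names` presumably has to agree with the integer assigned in `encoding`, otherwise the reported class names would be swapped. Below is a minimal sketch of that consistency check, plus a reverse mapping for turning integers back into names. The `decoding` dictionary and the sample labels are illustrative and not part of the notebook.

```python
class_names = ['pos', 'neg']
encoding = {'pos': 0, 'neg': 1}

# Each class name must sit at the index given by its encoded value
assert all(class_names[idx] == name for name, idx in encoding.items())

# Reverse mapping: integer -> class name (illustrative helper)
decoding = {idx: name for name, idx in encoding.items()}

sample_labels = ['neg', 'pos', 'pos']           # made-up labels
encoded = [encoding[x] for x in sample_labels]  # names -> integers
decoded = [decoding[i] for i in encoded]        # integers -> names
```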

#### Preprocess the text lists for BERT and assign the results to their respective variables

In [None]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=X_train, y_train=y_train,
    x_test=X_test, y_test=y_test,
    class_names=class_names,
    preprocess_mode='bert',
    maxlen=350,
    max_features=35000)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


task: text classification
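
With `maxlen=350`, every review is presumably cut or padded to a fixed length of 350 tokens before it reaches BERT. The sketch below illustrates the idea only: plain list items stand in for BERT's WordPiece tokens, and `pad_or_truncate` is a hypothetical helper, not a ktrain function.

```python
def pad_or_truncate(tokens, maxlen, pad_token='[PAD]'):
    """Fit a token list into exactly `maxlen` slots, BERT-style."""
    # Reserve two slots for the special [CLS] and [SEP] markers
    body = tokens[:maxlen - 2]
    seq = ['[CLS]'] + body + ['[SEP]']
    # Fill any remaining slots with padding
    return seq + [pad_token] * (maxlen - len(seq))

short = pad_or_truncate(['a', 'b'], maxlen=6)
long = pad_or_truncate(list('abcdefgh'), maxlen=6)
print(short)
print(long)
```

Reviews longer than the limit lose their tail under this scheme, so a larger `maxlen` keeps more text at the cost of memory and training time.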


#### Defining model and learner for Training the Classifier

In [None]:
# Build the BERT classifier
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

# Wrap the model and data in a learner
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=6)

Is Multi-Label? False
maxlen is 350
done.


In [None]:
learner.fit_onecycle(2e-5, 3)   # train for 3 epochs with the onecycle policy, max learning rate 2e-5



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/3
Epoch 2/3
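
To give a feel for what a one-cycle schedule does, here is a simplified triangular version. It is a sketch under assumptions, not ktrain's actual implementation: the starting rate of `max_lr / 10`, the decay to zero, and the function name `one_cycle_lr` are all illustrative. The step count uses the 40000 training examples and `batch_size=6` from above.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Linear warm-up to max_lr over the first half, linear decay to zero after."""
    base_lr = max_lr / div_factor
    half = total_steps / 2
    if step <= half:
        return base_lr + (max_lr - base_lr) * (step / half)
    return max_lr * (total_steps - step) / half

steps_per_epoch = math.ceil(40000 / 6)  # batches per epoch
total_steps = steps_per_epoch * 3       # 3 epochs

# Sample the schedule at the start, middle, and end of training
for step in (0, total_steps // 2, total_steps):
    print(step, one_cycle_lr(step, total_steps, max_lr=2e-5))
```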

In [None]:
learner.validate(val_data=(x_test, y_test), class_names=class_names) # validate trained model on validation data
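
`learner.validate` reports classification metrics for the validation set. As a reminder of how such figures are derived, this sketch computes accuracy, precision, and recall by hand for a made-up set of labels. The toy data and the helper name `binary_metrics` are assumptions for illustration, with 0 = 'pos' and 1 = 'neg' as in `encoding`.

```python
def binary_metrics(y_true, y_pred, positive=0):
    """Return (accuracy, precision, recall) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels, not results from the trained model
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```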

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc) # get predictor by combining trained model and preproc
predictor.get_classes()

### Predicting a Single Message Using the Trained Model

In [None]:
import time

message = 'I just broke up with my boyfriend'

start_time = time.time()
prediction = predictor.predict(message)

# Print the predicted class and the elapsed prediction time in seconds
print('predicted: {} ({:.2f}s)'.format(prediction, time.time() - start_time))
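
A single wall-clock measurement can be noisy. One option is to time several calls and average them. In this sketch `fake_predict` is a stand-in for `predictor.predict` so the snippet runs without a trained model, and `average_latency` is a hypothetical helper.

```python
import time

def average_latency(predict_fn, message, repeats=5):
    """Return (last prediction, mean seconds per call) over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        prediction = predict_fn(message)
    elapsed = time.perf_counter() - start
    return prediction, elapsed / repeats

def fake_predict(message):
    # Placeholder logic only; a real predictor would run the BERT model
    return 'neg' if 'broke' in message else 'pos'

prediction, seconds = average_latency(fake_predict, 'I just broke up with my boyfriend')
print('predicted: {} ({:.6f}s per call)'.format(prediction, seconds))
```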

In [None]:
predictor.save("bert_model") # Save the predictor (model and preprocessor) to the bert_model folder