<a href="https://colab.research.google.com/github/DSPOWER93/quora-insincere/blob/main/Distilbert_Classifier_Insincere_Questions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Synopsis:

This notebook builds an NLP classifier that identifies insincere questions. The data comes from the Kaggle competition [Quora Insincere Questions Classification](https://www.kaggle.com/c/quora-insincere-questions-classification).

- **Model**: Hugging Face DistilBERT (Hybrid)
- **Params**: 66M
- **Framework**: TensorFlow


### Mounting Drive to Import Data.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


#### Installing necessary Libraries.

In [None]:
%%capture
!pip install transformers
!pip install numpy requests nlpaug
!pip install pyspellchecker
!pip install pandas --upgrade

### Importing necessary Libraries 

In [None]:
# Transformer Libraries
from transformers import DistilBertTokenizer 
from transformers import TFDistilBertForSequenceClassification
from transformers import DistilBertModel
from transformers import TFDistilBertModel, DistilBertConfig
import tensorflow as tf


#  For NLP Text augmentation
import nlpaug.augmenter.word as nlpaw
from tqdm import tqdm


import pandas as pd
import numpy as np
np.set_printoptions(suppress=True) # to negate scientific notations
import json
import gc

### NLP Pre-processing Libraries

In [None]:
# Importing Libraries for NLP 
import re
import string
import nltk
import spacy
spacy.prefer_gpu()
# to make spacy work in pipeline.
nlp_vocab = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])
nlp_vocab.add_pipe(nlp_vocab.create_pipe('sentencizer'))

# Importing spellchecker & NLP
from spellchecker import SpellChecker
from nltk.corpus import stopwords
nltk.download("stopwords")

# setting variable for stop words
stop_words = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Corpus Cleaning

Defining function to clean corpus. 

In [None]:
# lemmatization of words from spacy. 
def spacy_lemmatize(x):
  x = nlp_vocab(x)
  x = [s.lemma_ for s in x]
  x = " ".join(x)
  return x


# Spelling correction
spell = SpellChecker()
def correct_spellings(x, spell=spell):
    """Correct the misspelled words of a given text."""
    x = x.split()
    misspelled = spell.unknown(x)
    # spell.correction() may return None when it finds no candidate (recent pyspellchecker versions),
    # so fall back to the original word in that case
    result = [(spell.correction(word) or word) if word in misspelled else word for word in x]
    return " ".join(result)

# Corpus cleaning. Lemmatization stays off by default because it is slow; it is applied later in parallel.
def corpus_cleaning(x, correct_spelling=True, remove_emojis=True, remove_stop_words=False, lemmatize=False):
    """Clean a single text: lowercase it and strip URLs, HTML tags, and punctuation."""
    x = x.lower().strip()
    # remove urls
    url = re.compile(r'https?://\S+|www\.\S+')
    x = url.sub(r'',x)
    # remove html tags
    html = re.compile(r'<.*?>')
    x = html.sub(r'',x)
    # remove punctuation: a translation table that deletes every character in string.punctuation
    operator = str.maketrans('','',string.punctuation)
    x = x.translate(operator)
    if correct_spelling:
        x = correct_spellings(x)
    if lemmatize:
        x = spacy_lemmatize(x)
    if remove_emojis:
        x = x.encode('ascii', 'ignore').decode('utf8').strip()
    if remove_stop_words:
        x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    return x
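
The rule-based part of the cleaning above can be tried in isolation. The following is a minimal sketch using only the standard library; `basic_clean` and the sample sentence are illustrative names, not part of the notebook, and spelling correction, lemmatization, and stop-word removal are left out.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def basic_clean(text):
    """Lowercase, then strip URLs, HTML tags, punctuation, and non-ASCII characters."""
    text = text.lower().strip()
    text = URL_RE.sub('', text)       # URLs go first, while '://' and '.' are still present
    text = HTML_RE.sub('', text)
    text = text.translate(PUNCT_TABLE)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Collapse the whitespace left behind by the removals
    return ' '.join(text.split())

sample = "Why is <b>Python</b> so popular? See https://example.com \U0001F600"
print(basic_clean(sample))
```

The order of the steps matters: if punctuation were removed before URLs, the URL pattern would no longer match and fragments of the address would stay in the text.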

### Lemmatization using parallel processing

spaCy builds metadata for every document it processes, which is time-consuming. A plain for loop handles the documents sequentially and leaves most of the computing instance idle, so the corpus is split into chunks that are processed in parallel. The code is adapted from a well-written article on parallel processing with spaCy. [Link](https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html)

In [None]:
#%%time
from joblib import Parallel, delayed

def lemmatize_pipe(doc):
    lemma_list = [s.lemma_ for s in doc] 
    return lemma_list
    
def preprocess_pipe(texts):
    preproc_pipe = []
    for doc in nlp_vocab.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    "Flatten a list of lists to a combined list"
    return [item for sublist in list_of_lists for item in sublist]

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp_vocab.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

def preprocess_parallel(texts, chunksize=1000):
    executor = Parallel(n_jobs=7, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(texts, len(texts), chunksize=chunksize))
    result = executor(tasks)
    return flatten(result)
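
The chunk, process, flatten pattern used above can be illustrated without spaCy or joblib. This is only a sketch under stated assumptions: whitespace splitting stands in for the spaCy pipeline, and a thread pool from the standard library replaces the joblib process pool so the example stays self-contained. `run_chunked` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

def chunker(items, chunksize):
    """Yield consecutive slices of `items`, each at most `chunksize` long."""
    for pos in range(0, len(items), chunksize):
        yield items[pos:pos + chunksize]

def process_chunk(texts):
    # Hypothetical stand-in for the spaCy pipeline: whitespace tokenization
    return [text.split() for text in texts]

def run_chunked(texts, chunksize=2, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so row order is preserved
        results = pool.map(process_chunk, chunker(texts, chunksize))
        # Flatten the list of per-chunk lists into one list of documents
        return [doc for chunk in results for doc in chunk]

texts = ["is india safe", "is he still alive", "kiss"]
tokens = run_chunked(texts)
print(tokens)
```

Preserving order is what allows the lemmatized output to be joined back to the original rows by position.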

### Importing Clean Dataset

The text-preprocessing methodology used to clean the corpus is described in the Bi-LSTM training notebook, so the already cleaned dataset is loaded here. Bi-LSTM notebook for reference: [Link](https://github.com/DSPOWER93/quora-insincere/blob/main/Bi_LSTM_insincere_question_Classifier.ipynb)

In [None]:
clean_df = pd.read_csv('/content/drive/MyDrive/Quora_project/final_data.csv', index_col= 'Unnamed: 0').reset_index()

# Drop Null entries
clean_df = clean_df[clean_df['Text'].notnull()]

### Filtering out Sample Data

The base dataset has a sincere-to-insincere ratio of roughly 94:6, which makes the class distribution imbalanced. The training sample is resized to roughly 90:10 to reduce the imbalance.

In [None]:
import random
random.seed(0)

# Separating insincere questions.
# Insincere questions
insincere=clean_df[clean_df['target'] == 1]
# Sincere (normal) questions
sincere=clean_df[clean_df['target'] == 0]


# Using 50% of the insincere questions for the train/test data
top_50 = int(round(len(insincere)*0.5,0))

# Splitting the insincere questions (about 30K rows go to the training sample)
insincere_train = insincere[:top_50]
insincere_prod =  insincere[top_50:]

sincere_train = sincere[:270000]
sincere_prod = sincere[270000:]

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
insincere_train = pd.concat([insincere_train, sincere_train])
insincere_prod = pd.concat([insincere_prod, sincere_prod])


train_shuffle = (list(random.sample(range(len(insincere_train)), len(insincere_train))))
test_shuffle = (list(random.sample(range(len(insincere_prod)), len(insincere_prod))))


insincere_train = insincere_train.iloc[train_shuffle,:]
insincere_prod = insincere_prod.iloc[test_shuffle,:]

del(sincere_train,sincere_prod)


insincere_train.drop(['index','Rows'], axis=1, inplace= True)
insincere_prod.drop(['index','Rows'], axis=1, inplace= True)

print(insincere_train.shape, insincere_prod.shape)

(300894, 2) (699104, 2)
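
As a quick check of the resized class ratio, the share of insincere questions in the training sample can be derived from the printed shape and the 270,000 sincere rows selected in the code. This is plain arithmetic on those two numbers; the variable names are illustrative.

```python
sincere_rows = 270_000
train_rows = 300_894                      # first shape printed above
insincere_rows = train_rows - sincere_rows

insincere_share = insincere_rows / train_rows
print(insincere_rows, round(100 * insincere_share, 1))
```

The training sample therefore holds about 10% insincere questions, which matches the intended 90:10 proportion.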


### Train & Test Split (60:40)

In [None]:
from sklearn.model_selection import train_test_split

# Split Train and Validation data
X_train, X_test = train_test_split( insincere_train, test_size=0.4, random_state=0)

y_train = X_train['target']
y_valid = X_test['target']

print(X_train.shape, X_test.shape)

(180536, 2) (120358, 2)
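
The two shapes follow from the 60:40 split. The sketch below reproduces them, assuming the test size is rounded up and the remainder goes to the training set, which is consistent with the printed output.

```python
import math

n_rows = 300_894
test_size = 0.4

n_test = math.ceil(test_size * n_rows)    # test rows, rounded up
n_train = n_rows - n_test                 # the remainder is used for training
print(n_train, n_test)
```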


#### Optional step: data augmentation
Augmentation is computationally expensive, so this part of the code was not executed. It can be very useful when the dataset is small and imbalanced.

The following article is a good reference:
[Link](https://www.analyticsvidhya.com/blog/2021/08/nlpaug-a-python-library-to-augment-your-text-data/)

In [None]:
# Define nlpaug augmentation object, using word substitute augmentation module.
aug10p = nlpaw.ContextualWordEmbsAug(model_path='bert-base-uncased', aug_min=1, aug_p=0.2, action="substitute")
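
To make the idea of substitution-based augmentation concrete, here is a toy sketch that does not use nlpaug. It swaps words using a small hand-written synonym table, whereas the augmenter defined above asks a BERT model for context-aware replacements. `SYNONYMS` and `substitute_words` are hypothetical, and the `aug_p` and `aug_min` parameters only loosely mirror the nlpaug arguments of the same name.

```python
import random

# Hypothetical synonym table, for illustration only
SYNONYMS = {"good": ["great", "fine"], "essay": ["paper"], "question": ["query"]}

def substitute_words(sentence, aug_p=0.2, aug_min=1, seed=0):
    """Replace a share of the replaceable words with a random synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not candidates:
        return sentence
    # Number of swaps: a share of the sentence length, at least aug_min,
    # and never more than the number of replaceable words
    n_swap = min(len(candidates), max(aug_min, int(len(words) * aug_p)))
    for i in rng.sample(candidates, n_swap):
        words[i] = rng.choice(SYNONYMS[words[i]])
    return " ".join(words)

original = "what is a good title for my essay"
augmented = substitute_words(original)
print(augmented)
```

The augmented sentence keeps its length and label, so it can be appended to the minority class as an additional training row.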

#### Testing out sample Text for augmentation 

In [None]:
# Text Augmentation

def augment_sentence(sentence, aug = aug10p, num_threads = 2):
    """
    Constructs a new sentence via text augmentation.
    
    Input:
        - sentence:     A string of text
        - aug:          An augmentation object defined by the nlpaug library
        - num_threads:  Integer controlling the number of threads to use if
                        augmenting text via CPU
    Output:
        - The augmented text as returned by aug.augment
    """
    return aug.augment(sentence, num_thread=num_threads)


print(X_train[X_train['target']== 1]['Text'][2:3])
print(augment_sentence(X_train[X_train['target']== 1]['Text'][2:3].to_list(), aug10p, num_threads = 2))

#### Function Performing Text Augmentation on Train Data for category of insincere questions.  

In [None]:
def augment_text(df, aug, num_threads, aug_perc): #num_times
    """
    Takes a pandas DataFrame and augments a sample of its insincere rows.
    
    Input:
        - df:            A pandas DataFrame containing the columns:
                                - 'Text' containing strings of text to augment.
                                - 'target' binary target variable containing 0's and 1's.
        - aug:           Augmentation object defined by the nlpaug library.
        - num_threads:   Integer controlling number of threads to use if augmenting
                         text via CPU
        - aug_perc:      Fraction of the target == 1 rows to augment.
    Output:
        - aug_df:        A new pandas DataFrame holding only the augmented rows
                         ('Text' and 'target' == 1).
    """
    
    # Get rows of data to augment

    to_augment = df[df['target']==1]
    sample_rows = int(len(to_augment) * aug_perc )
    sample_rnge = (list(random.sample(range(len(to_augment)), sample_rows)))
    to_augment = to_augment.iloc[sample_rnge,:]
    to_augmentX = to_augment['Text']
    to_augmentY = np.ones( int(len(to_augmentX.index)), dtype=np.int8)
    

    # Build up dictionary containing augmented data
    aug_dict = {'Text':[], 'target':to_augmentY}
    for i in tqdm(range(1)):
        augX = [augment_sentence(x, aug, num_threads) for x in to_augmentX]
        aug_dict['Text'].extend(augX)
    
    # Build DataFrame containing augmented data
    aug_df = pd.DataFrame.from_dict(aug_dict)
    return aug_df

In [None]:

# from google.colab import files
# Augment 10% of the minority class (target == 1) in the training data
# augmented_df = augment_text(X_train, aug10p, num_threads=4, aug_perc= 0.1)

# augmented_df.to_csv('augmented_data.csv')
# files.download('augmented_data.csv') 

### Downloading the DistilBERT tokenizer from Transformers

In [None]:
from transformers import DistilBertTokenizerFast

# Instantiate DistilBERT tokenizer...we use the Fast version to optimize runtime
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


### Creating word tokenizer

In [None]:

# Define the maximum number of tokens per text, keeping it at 40 so the model trains faster (DistilBERT accepts up to 512 tokens)
MAX_LENGTH = 40

# Define function to encode text data in batches
def batch_encode(tokenizer, texts, batch_size=256, max_length=MAX_LENGTH):
    """
    Encodes a list of texts and returns the corresponding token ids and
    attention masks, ready to be fed into a pre-trained transformer model.
    
    Input:
        - tokenizer:   Tokenizer object from the PreTrainedTokenizer Class
        - texts:       List of strings where each string represents a text
        - batch_size:  Integer controlling number of texts in a batch
                       (currently unused: all texts are encoded in one call)
        - max_length:  Integer controlling max number of tokens to keep in a given text
    Output:
        - input_ids:       sequence of texts encoded as a tf.Tensor object
        - attention_mask:  the texts' attention mask encoded as a tf.Tensor object
    """
    # `encoded` avoids shadowing the built-in input()
    encoded = tokenizer.batch_encode_plus(texts,
                                          max_length=max_length,
                                          padding="max_length", 
                                          truncation=True,
                                          return_attention_mask=True,
                                          return_token_type_ids=False)
    input_ids = encoded['input_ids']
    attention_mask = encoded['attention_mask']


    return tf.convert_to_tensor(input_ids) , tf.convert_to_tensor(attention_mask)
    
    
# Encode X_train
X_train_ids , X_train_attention = batch_encode(tokenizer, X_train['Text'].values.tolist()) 

# Encode X_valid
X_valid_ids, X_valid_attention = batch_encode(tokenizer, X_test['Text'].values.tolist())

# Encode production data to test model performance in production.
# X_test_ids, X_test_attention = batch_encode(tokenizer, X_test.tolist())
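
What `padding="max_length"` and `truncation=True` do to a single sequence can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, and the ids 0, 101, and 102 are the `[PAD]`, `[CLS]`, and `[SEP]` ids of the `distilbert-base-uncased` vocabulary.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_or_truncate(token_ids, max_length):
    """Wrap ids in [CLS]/[SEP], cut to max_length, pad, and build the attention mask."""
    body = token_ids[:max_length - 2]          # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                      # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad   # 0 marks padding

short_ids, short_mask = pad_or_truncate([2054, 2022, 1037], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with exactly `max_length` entries, and the attention mask tells the model which positions to ignore.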

### Visualizing the Model's Input Data Structure

In [None]:

# Importing Pretrained Model of Distilbert. 
my_model = DistilBertModel.from_pretrained('distilbert-base-uncased')


# Sample text from the corpus.
sample_t = X_train['Text'][0:1].to_list()


#  About Corpus
print('Sample Corpus: ', sample_t)
print('Length of Corpus: ' , len(sample_t[0].split()) )

#  Tokenizing Sample Sentence
sample_inputs = tokenizer.batch_encode_plus(sample_t,return_tensors="pt",
                                               max_length=MAX_LENGTH,
                                               padding= "max_length",
                                               truncation=True,
                                               return_attention_mask=True,
                                               return_token_type_ids=False)


# Input Ids 
print('Input Ids: ', sample_inputs['input_ids'])
print('length of input Ids: ', len(sample_inputs['input_ids'][0]))

# Attention mask on Corpus
print('Attention Mask: ', sample_inputs['attention_mask'])
print('length of Attention Mask: ', len(sample_inputs['attention_mask'][0]))

# Model output for the sample, which is what the classification head receives as input.
# The model returns a 3D tensor (batch, tokens, 768) holding a 768-dimensional vector for each token.
# Training on the full 3D output could give good results but would take considerably longer.
# Instead we use the 768-dimensional vector of the first token, which also carries information about the complete sentence.

outputs = my_model(**sample_inputs)

last_hidden_states = outputs.last_hidden_state

# 3D structure of the 40-token sequence
print('Structure of  corpus: ', last_hidden_states.shape)

# We only care about DistilBERT's output for the [CLS] token, 
# which is located at index 0 of every encoded sequence.  
# Splicing out the [CLS] tokens gives us 2D data.
# sentence embedding of 768 vector size.
print('Ouput of CLS token(Sentence embedding): ', last_hidden_states[:,0,:])


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Sample Corpus:  ['what be a good attention grabber for my argumentative essay on whether scientology be a cult or religion']
Length of Corpus:  18
Input Ids:  tensor([[  101,  2054,  2022,  1037,  2204,  3086,  6723,  5677,  2005,  2026,
          6685,  8082,  9491,  2006,  3251, 23845,  2022,  1037,  8754,  2030,
          4676,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]])
length of input Ids:  40
Attention Mask:  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
length of Attention Mask:  40
Structure of  corpus:  torch.Size([1, 40, 768])
Ouput of CLS token(Sentence embedding):  tensor([[ 4.2617e-02,  4.7115e-02, -3.1044e-01, -2.4072e-01, -3.0574e-02,
         -1.4295e-01,  2.2153e-01,  3.9752e-01, -4.8928e-02, -9.3441e-02,
          1.2566e-01, -7.9719e-02, -1.3623e-01,  2.3626e-01,  8.3005e-02,
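
The slice `[:, 0, :]` used above reduces the 3D output to one vector per sentence. A toy version with nested lists shows the shape change; the sizes here (2 sentences, 4 tokens, 3 features) are made up for readability, whereas the real output has shape (1, 40, 768).

```python
batch, seq_len, hidden = 2, 4, 3

# Toy stand-in for last_hidden_state: each value encodes (sentence, token, feature)
last_hidden_state = [
    [[100 * b + 10 * t + h for h in range(hidden)] for t in range(seq_len)]
    for b in range(batch)
]

# Equivalent of last_hidden_state[:, 0, :]: keep only the first token of each sentence
cls_vectors = [sentence[0] for sentence in last_hidden_state]
print(cls_vectors)
```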


### Initializing the base model

In [None]:
from transformers import TFDistilBertModel, DistilBertConfig

DISTILBERT_DROPOUT = 0.2
DISTILBERT_ATT_DROPOUT = 0.2
 
# Configure DistilBERT's initialization
config = DistilBertConfig(dropout=DISTILBERT_DROPOUT, 
                          attention_dropout=DISTILBERT_ATT_DROPOUT, 
                          output_hidden_states=True)
                          
# The bare, pre-trained DistilBERT transformer model outputting raw hidden-states 
# and without any specific head on top.
distilBERT = TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


### Model Evaluation Metrics

In [None]:
#https://datascience.stackexchange.com/questions/45165/how-to-get-accuracy-f1-precision-and-recall-for-a-keras-model

from keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
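
The same three formulas can be checked by hand on a small example. The sketch below is a plain-Python equivalent for already binarized predictions; the labels are made up, and `eps` plays the role of `K.epsilon()` by preventing division by zero.

```python
def precision_recall_f1(y_true, y_pred, eps=1e-7):
    """Precision, recall, and F1 for binary labels given as lists of 0/1."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_positives = sum(y_pred)
    possible_positives = sum(y_true)
    precision = true_positives / (predicted_positives + eps)
    recall = true_positives / (possible_positives + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))
```

Keras evaluates such metric functions batch by batch and averages the results, so the values logged during training can differ from metrics computed once over the full dataset.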

In [None]:

MAX_LENGTH = 40
LAYER_DROPOUT = 0.2
LEARNING_RATE = 5e-5
RANDOM_STATE = 42

def build_model(transformer, max_length=MAX_LENGTH, trainable = False):
    """
    Template for building a model off of the BERT or DistilBERT architecture
    for a binary classification task.
    
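    Input:
      - transformer:  a base Hugging Face transformer model object (BERT or DistilBERT)
                      with no added classification head attached.
      - max_length:   integer controlling the maximum number of encoded tokens 
                      in a given sequence.
      - trainable:    whether the transformer's layers are updated during training
                      (False freezes them, so only the output layer is trained).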
    
    Output:
      - model:        a compiled tf.keras.Model with added classification layers 
                      on top of the base pre-trained model architecture.
    """
    
    # Define weight initializer with a random seed to ensure reproducibility
    weight_initializer = tf.keras.initializers.GlorotNormal(seed=RANDOM_STATE) 
    
    # Define input layers
    input_ids_layer = tf.keras.layers.Input(shape=(max_length,), 
                                            name='input_ids', 
                                            dtype='int32')
    input_attention_layer = tf.keras.layers.Input(shape=(max_length,), 
                                                  name='input_attention', 
                                                  dtype='int32')
    
    # DistilBERT outputs a tuple where the first element at index 0
    # represents the hidden-state at the output of the model's last layer.
    # It is a tf.Tensor of shape (batch_size, sequence_length, hidden_size=768).
    last_hidden_state = transformer([input_ids_layer, input_attention_layer])[0]
    
    # We only care about DistilBERT's output for the [CLS] token, 
    # which is located at index 0 of every encoded sequence.  
    # Splicing out the [CLS] tokens gives us 2D data.
    cls_token = last_hidden_state[:, 0, :]
    
    ##                                                 ##
    ## Define additional dropout and dense layers here ##
    ##                                                 ##
    
    # Define a single node that makes up the output layer (for binary classification)
    output = tf.keras.layers.Dense(1, 
                                   activation='sigmoid',
                                   kernel_initializer=weight_initializer,  
                                   kernel_constraint=None,
                                   bias_initializer='zeros'
                                   )(cls_token)
    
    # Define the model
    model = tf.keras.Model([input_ids_layer, input_attention_layer], output)
    
    # Freeze or unfreeze the transformer's layers according to `trainable`
    for layer in transformer.layers:
      layer.trainable = trainable

    # Compile the model
    model.compile(tf.keras.optimizers.Adam(learning_rate =LEARNING_RATE), 
                  loss='binary_crossentropy',
                  metrics=['acc',f1_m,precision_m, recall_m, tf.keras.metrics.AUC()])
    
    return model
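
The classification head added above is a single sigmoid unit on top of the 768-dimensional [CLS] vector. The following sketch shows what that unit computes and how many parameters it adds; `sigmoid_head` and the tiny example vectors are illustrative, while 768 and 66,362,880 are the hidden size and base-model parameter count reported in this notebook.

```python
import math

def sigmoid_head(cls_vector, weights, bias=0.0):
    """Dense(1, activation='sigmoid'): weighted sum of the inputs squashed to (0, 1)."""
    z = sum(w * x for w, x in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# With zero input the unit is undecided; a large positive sum pushes it towards 1
print(sigmoid_head([0.0, 0.0, 0.0], [1.0, 2.0, 3.0]))
print(sigmoid_head([1.0, 1.0], [2.0, 1.0]))

HIDDEN_SIZE = 768
BASE_PARAMS = 66_362_880
head_params = HIDDEN_SIZE + 1            # one weight per feature plus one bias
total_params = BASE_PARAMS + head_params
print(head_params, total_params)
```

When the base model is frozen, only these 769 head parameters are updated during training.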

### Model Settings (Non-Trainable Base)

In [None]:
from keras import backend as K
#  To start new session of model training from start
K.clear_session()

#  Building the Model 
model = build_model(distilBERT, trainable= False)

#adding Hyper Parameter Settings
# early stopping
earlyStopping = tf.keras.callbacks.EarlyStopping(monitor= 'val_auc', 
                                                 patience=3,
                                                 mode='max',
                                                 restore_best_weights=True)
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 40)]         0           []                               
                                                                                                  
 input_attention (InputLayer)   [(None, 40)]         0           []                               
                                                                                                  
 tf_distil_bert_model (TFDistil  TFBaseModelOutput(l  66362880   ['input_ids[0][0]',              
 BertModel)                     ast_hidden_state=(N               'input_attention[0][0]']        
                                one, 40, 768),                                                    
                                 hidden_states=((No                                           
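
The early-stopping rule configured above (`patience=3`, `mode='max'`) can be sketched as a small loop. This is an approximation of the idea, not the Keras implementation, and the validation scores are made-up values.

```python
def early_stop(scores, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based, for a metric to maximise."""
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch   # stop after `patience` epochs without improvement
    return len(scores), best_epoch

val_auc = [0.90, 0.93, 0.92, 0.925, 0.91, 0.94]   # hypothetical validation scores
stop_epoch, best_epoch = early_stop(val_auc)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model is rolled back to the weights of the best epoch rather than keeping those of the last epoch that ran.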

### Fitting the Model 
A good reference for choosing the number of steps per epoch: [Epoch Steps](https://stackoverflow.com/questions/49922252/choosing-number-of-steps-per-epoch).
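
The number of steps per epoch is the number of full batches that fit into the training set. A short sketch with the 180,536 training rows of this notebook and the two batch sizes used in it; `steps_and_leftover` is a hypothetical helper.

```python
def steps_and_leftover(n_rows, batch_size):
    """Number of full batches per epoch and the rows that do not fill a batch."""
    steps = n_rows // batch_size
    return steps, n_rows - steps * batch_size

n_train = 180_536
for batch_size in (32, 64):
    print(batch_size, steps_and_leftover(n_train, batch_size))
```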

In [None]:

EPOCHS = 30
BATCH_SIZE = 32
NUM_STEPS = len(X_train.index) // BATCH_SIZE

# Train the model
model.fit(
    x = [X_train_ids, X_train_attention],
    y = y_train.to_numpy(),
    epochs = EPOCHS,
    batch_size = BATCH_SIZE,
    callbacks=[earlyStopping],
    steps_per_epoch = NUM_STEPS,
    validation_data = ([X_valid_ids, X_valid_attention], y_valid.to_numpy()),
    verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30


<keras.callbacks.History at 0x7fa650116dd0>

### Saving Distilbert Model

In [None]:
#  Saving Model weights
model.save_weights('distilbert/my_model')

### Loading saved model.

In [None]:
# loading the model params on which it was trained
load_model = build_model(distilBERT)

#  loading the training weights back to model.
load_model.load_weights('distilbert/my_model')

### Downloading the Model to a Local File

In [None]:
from google.colab import files
!zip -r /content/distilbert_retrained_final.zip /content/distilbert
files.download('distilbert_retrained_final.zip')

In [None]:
y_pred = load_model.predict([X_train_ids, X_train_attention])
ytest = load_model.predict([X_valid_ids, X_valid_attention])

In [None]:
# Binarize the predicted probabilities with a 0.49 threshold
train_pred_binary = np.where(y_pred> 0.49,1,0)
test_pred_binary = np.where(ytest> 0.49,1,0)

# Flatten the (n, 1) prediction arrays to 1-D so they align with the labels
train_pred_binary = train_pred_binary.reshape(-1)
test_pred_binary = test_pred_binary.reshape(-1)


y_train_np = y_train.to_numpy()
y_val_np = y_valid.to_numpy()

# Accuracy
acc = sum(train_pred_binary == y_train_np)/  len(y_train_np)
test_acc = sum(test_pred_binary == y_val_np)/  len(y_val_np)
print('Accuracy of train model is {}'.format(acc))
print('Accuracy of test model is {}'.format(test_acc))


# Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_val_np,test_pred_binary))


#  Recall  , f1 , precision 
from sklearn.metrics import classification_report
print(classification_report(y_val_np,test_pred_binary))

In [None]:
ques = ['will democrats win 2021', 'is he still alive', 'are democrats dumb', 'do blacks support gay', 'name famous black celebreties',
        'are republican not racists', 'are white people', 'which race is the smartest', 'what are achivements by trump' ,'kiss',
        'are republican racists', 'are muslim extremist', 'is india safe', 'are jews loyal', 'which race is not loyal',
        'why republican cry always', 'does trump hate gay', 'is america under dictatorship', 'are muslim terrorist', 'do you support taliban',
        'are you fool', 'is china facist country' , 'do americans support holocaust']
ques = preprocess_parallel(ques)

' '.join(ques[0])
# ques = ' '.join(preprocess_parallel(ques))

join_corpus =[]
for l in ques:
  join_corpus.append(' '.join(l))


sample_ids, sample_attention = batch_encode(tokenizer, join_corpus)


(load_model.predict([sample_ids , sample_attention]))

### Model Settings (Trainable Base)

In [None]:
from keras import backend as K
#  To start new session of model training from start
K.clear_session()

#  Building the Model 
model1 = build_model(distilBERT, trainable= True)

#adding Hyper Parameter Settings
# early stopping
earlyStopping = tf.keras.callbacks.EarlyStopping(monitor= 'val_recall_m', 
                                                 patience=3,
                                                 mode='max',
                                                 restore_best_weights=True)
model1.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 40)]         0           []                               
                                                                                                  
 input_attention (InputLayer)   [(None, 40)]         0           []                               
                                                                                                  
 tf_distil_bert_model (TFDistil  TFBaseModelOutput(l  66362880   ['input_ids[0][0]',              
 BertModel)                     ast_hidden_state=(N               'input_attention[0][0]']        
                                one, 40, 768),                                                    
                                 hidden_states=((No                                           

### Fitting Model 2

In [None]:
EPOCHS = 30
BATCH_SIZE = 64
NUM_STEPS = len(X_train.index) // BATCH_SIZE

# Train the model
model1.fit(
    x = [X_train_ids, X_train_attention],
    y = y_train.to_numpy(),
    epochs = EPOCHS,
    batch_size = BATCH_SIZE,
    callbacks=[earlyStopping],
    steps_per_epoch = NUM_STEPS,
    validation_data = ([X_valid_ids, X_valid_attention], y_valid.to_numpy()),
    verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30


<keras.callbacks.History at 0x7fa6421fea10>

In [None]:
#  Saving Model weights
model1.save_weights('distilbert_trained_weights/my_model')

# loading the model params on which it was trained
load_model2 = build_model(distilBERT)

#  loading the training weights back to model.
load_model2.load_weights('distilbert_trained_weights/my_model')

### Downloading the Model to  local file.

# from google.colab import files
# !zip -r /content/distilbert_retrained_final_V3.zip /content/distilbert_trained_weights
# files.download('distilbert_retrained_final_V3.zip')

y_pred2 = load_model2.predict([X_train_ids, X_train_attention])
ytest2 = load_model2.predict([X_valid_ids, X_valid_attention])

train_pred_binary = np.where(y_pred2> 0.49,1,0)
test_pred_binary = np.where(ytest2> 0.49,1,0)

# Flatten the (n, 1) prediction arrays to 1-D so they align with the labels
train_pred_binary = train_pred_binary.reshape(-1)
test_pred_binary = test_pred_binary.reshape(-1)


y_train_np = y_train.to_numpy()
y_val_np = y_valid.to_numpy()

# Accuracy
acc = sum(train_pred_binary == y_train_np)/  len(y_train_np)
test_acc = sum(test_pred_binary == y_val_np)/  len(y_val_np)
print('Accuracy of train model is {}'.format(acc))
print('Accuracy of test model is {}'.format(test_acc))


# Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_val_np,test_pred_binary))


#  Recall  , f1 , precision 
from sklearn.metrics import classification_report
print(classification_report(y_val_np,test_pred_binary))

ques = ['will democrats win 2021', 'is he still alive', 'are democrats dumb', 'do blacks support gay', 'name famous black celebreties',
        'are republican not racists', 'are white people', 'which race is the smartest', 'what are achivements by trump' ,'kiss',
        'are republican racists', 'are muslim extremist', 'is india safe', 'are jews loyal', 'which race is not loyal',
        'why republican cry always', 'does trump hate gay', 'is america under dictatorship', 'are muslim terrorist', 'do you support taliban',
        'are you fool', 'is china facist country' , 'do americans support holocaust']
ques = preprocess_parallel(ques)

' '.join(ques[0])
# ques = ' '.join(preprocess_parallel(ques))

join_corpus =[]
for l in ques:
  join_corpus.append(' '.join(l))


sample_ids, sample_attention = batch_encode(tokenizer, join_corpus)


(load_model2.predict([sample_ids , sample_attention]))

Accuracy of train model is 0.9734457393539239
Accuracy of test model is 0.9469416241546055
[[105083   3003]
 [  3383   8889]]
              precision    recall  f1-score   support

           0       0.97      0.97      0.97    108086
           1       0.75      0.72      0.74     12272

    accuracy                           0.95    120358
   macro avg       0.86      0.85      0.85    120358
weighted avg       0.95      0.95      0.95    120358



array([[0.3186586 ],
       [0.01233614],
       [0.9667449 ],
       [0.488376  ],
       [0.2814764 ],
       [0.82733625],
       [0.8401279 ],
       [0.29417583],
       [0.05591594],
       [0.04337768],
       [0.81961596],
       [0.529867  ],
       [0.00629866],
       [0.8962034 ],
       [0.4541255 ],
       [0.8782962 ],
       [0.82919824],
       [0.04869279],
       [0.942593  ],
       [0.00664379],
       [0.65925974],
       [0.09689412],
       [0.68639326]], dtype=float32)
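
The figures in the classification report can be re-derived from the confusion matrix printed above. The sketch assumes the usual layout with true classes in rows and predicted classes in columns, which agrees with the row sums matching the reported supports (108086 and 12272).

```python
# Confusion matrix of the validation set: [[TN, FP], [FN, TP]]
tn, fp = 105_083, 3_003
fn, tp = 3_383, 8_889

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)      # share of flagged questions that are truly insincere
recall = tp / (tp + fn)         # share of insincere questions that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(total, round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

Accuracy is high mainly because of the dominant sincere class; the precision and recall of the insincere class (0.75 and 0.72) give a more informative picture of the classifier.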

### End