<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#da5351;
            font-size:130%;
            font-family:Verdana;
            letter-spacing:0.5px;
            text-align:center">
  <h1 id="News Classification Using Bert Transformer Model" style="padding: 10px; color:white; text-align:center;">
    News Classification Using BERT Transformer
    <a class="anchor-link" href="https://www.kaggle.com/code/amirhoseinsedaghati/news-classification-using-bert-transformer-72-acc#News_Classification_Using_Bert_Transformer">¶</a>
  </h1>
</div>

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
Welcome to this notebook :)
      
</div>

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>Table of Contents:</strong><br />

<p>
    <ul>
        <li>1. Install and Import Dependencies</li>
        <li>2. Loading Data</li>
        <li>3. Exploratory Data Analysis (EDA)</li>
        <li>
            4. Data Preprocessing
            <ul>
                <li>Data Transformation</li>
                <li>Data Splitting</li>
            </ul>
        </li>
        <li>
            5. Data Modeling and Evaluation
            <ul>
                <li>Transfer Learning</li>
                <li>Training Model</li>
                <li>Comparing model metrics using the history attribute</li>
                <li>Inference and Evaluating Model</li>
            </ul>
        </li>
        <li>6. Conclusion</li>
    </ul>
</p>

</div>

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#44479b;
            font-size:110%;
            font-family:Verdana;
            letter-spacing:0.5px;
            text-align:center">
  <h1 id="Install and Import Dependencies" style="padding: 10px; color:white; text-align:center;">
   1. Install and Import Dependencies
    <a class="anchor-link" href="https://www.kaggle.com/code/amirhoseinsedaghati/news-classification-using-bert-transformer-72-acc#Install_and_Import_Dependencies">¶</a>
  </h1>
</div>

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
Before running the following code cells, change the Accelerator setting from None to <code>TPU VM v3-8</code>, because this notebook uses a TPU for computation. You can do this in edit mode by clicking the three dots in the upper right corner.
      
</div>

In [None]:
!pip install seaborn

In [None]:
!pip install nltk

In [None]:
!pip install --upgrade pip

In [None]:
!pip install tensorflow

In [None]:
!pip install tensorflow_addons

In [None]:
!pip install transformers

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import tensorflow as tf
from tensorflow.keras.models import Model, save_model, load_model
from tensorflow.keras.layers import Dropout, Input, Dense
from transformers import DistilBertTokenizer, TFAutoModel, AdamWeightDecay
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from tensorflow_addons.metrics import F1Score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import regex as re
import pickle
import warnings
warnings.filterwarnings('ignore')

nltk.download('punkt')
nltk.download('wordnet')

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>How to use a TPU Cluster and a distribution strategy?</strong><br />

<p>
    <ul>
        <li>1. Initializing a TPU cluster object</li>
        <li>2. Connecting to a TPU cluster using the created TPU cluster object</li>
        <li>3. Initializing a TPU system using the created TPU cluster object</li>
        <li>4. Initializing a distribution strategy object using the created TPU cluster object</li>
    </ul>
</p>
      
</div>

In [None]:
try:
    tpu_cluster = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU', tpu_cluster.master())
except ValueError:
    tpu_cluster = None # It assigns None to the tpu_cluster variable, indicating that no TPU is available

if tpu_cluster: # is not None
    tf.config.experimental_connect_to_cluster(tpu_cluster)
    tf.tpu.experimental.initialize_tpu_system(tpu_cluster)
    dist_strategy = tf.distribute.TPUStrategy(tpu_cluster) # use a distribution strategy related to the presence of TPU 
else: # is None
    dist_strategy = tf.distribute.get_strategy() # use a distribution strategy related to the absence of TPU 

print('Number of replicas in sync:', dist_strategy.num_replicas_in_sync)

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
We'll use the distribution strategy later, inside the <code>with dist_strategy.scope():</code> block, when we create our neural network model. TensorFlow then creates eight replicas of the model, one for each TPU core, and splits every training batch among them. Each replica processes its portion of the batch and computes gradients, which are then aggregated across the replicas and used to update the model weights.
      
</div>
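
The idea of replicas computing gradients on their own portion of a batch can be illustrated without a TPU. The following is a minimal pure-Python sketch, not what TensorFlow runs internally: the one-parameter model, the data, and the helper `mse_gradient` are made up for illustration, and the gradients are simply averaged (the exact reduction TensorFlow applies depends on how the loss is scaled).

```python
# Toy illustration of synchronous data parallelism (pure Python, no TPU needed).
# Model: y_hat = w * x, loss: mean squared error. All numbers are made up.

def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

NUM_REPLICAS = 8
w = 0.5
xs = [float(i) for i in range(1, 17)]  # one global batch of 16 examples
ys = [3.0 * x for x in xs]             # targets generated with a "true" w of 3

# Split the global batch into equal shards, one per replica.
shard_size = len(xs) // NUM_REPLICAS
shards = [
    (xs[i * shard_size:(i + 1) * shard_size], ys[i * shard_size:(i + 1) * shard_size])
    for i in range(NUM_REPLICAS)
]

# Each replica computes a gradient on its own shard ...
replica_grads = [mse_gradient(w, sx, sy) for sx, sy in shards]
# ... and the gradients are averaged before the shared weight is updated.
aggregated_grad = sum(replica_grads) / NUM_REPLICAS

# With equal shard sizes, this matches the gradient of the whole batch.
full_batch_grad = mse_gradient(w, xs, ys)
print(aggregated_grad, full_batch_grad)
```

Because every shard has the same size, the averaged per-replica gradient equals the full-batch gradient, which is why training on eight cores behaves like training on one large batch.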

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#44479b;
            font-size:110%;
            font-family:Verdana;
            letter-spacing:0.5px;
            text-align:center">
  <h1 id="Loading Data" style="padding: 10px; color:white; text-align:center;">
   2. Loading Data
    <a class="anchor-link" href="https://www.kaggle.com/code/amirhoseinsedaghati/news-classification-using-bert-transformer-72-acc#Loading_Data">¶</a>
  </h1>
</div>


In [None]:
news = pd.read_json("/kaggle/input/news-category-dataset/News_Category_Dataset_v3.json", lines=True)
news.head()

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#44479b;
            font-size:110%;
            font-family:Verdana;
            letter-spacing:0.5px;
            text-align:center">
  <h1 id="EDA" style="padding: 10px; color:white; text-align:center;">
   3. EDA
    <a class="anchor-link" href="https://www.kaggle.com/code/amirhoseinsedaghati/news-classification-using-bert-transformer-72-acc#EDA">¶</a>
  </h1>
</div>

In [None]:
news = news[['headline', 'short_description', 'category']] # feature selection
news.head()

In [None]:
news.category.unique()

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>We merge similar categories into a single category as follows:</strong><br />

<p>
    <ul>
        <li>"HEALTHY LIVING" to "WELLNESS"</li>
        <li>"QUEER VOICES" to "GROUPS VOICES"</li>
        <li>"BUSINESS" to "BUSINESS & FINANCES"</li>
        <li>"PARENTS" to "PARENTING"</li>
        <li>"BLACK VOICES" to "GROUPS VOICES"</li>
        <li>"THE WORLDPOST" to "WORLD NEWS"</li>
        <li>"STYLE" to "STYLE & BEAUTY"</li>
        <li>"GREEN" to "ENVIRONMENT"</li>
        <li>"TASTE" to "FOOD & DRINK"</li>
        <li>"WORLDPOST" to "WORLD NEWS"</li>
        <li>"SCIENCE" to "SCIENCE & TECH"</li>
        <li>"TECH" to "SCIENCE & TECH"</li>
        <li>"MONEY" to "BUSINESS & FINANCES"</li>
        <li>"ARTS" to "ARTS & CULTURE"</li>
        <li>"COLLEGE" to "EDUCATION"</li>
        <li>"LATINO VOICES" to "GROUPS VOICES"</li>
        <li>"CULTURE & ARTS" to "ARTS & CULTURE"</li>
        <li>"FIFTY" to "MISCELLANEOUS"</li>
        <li>"GOOD NEWS" to "MISCELLANEOUS"</li>
    </ul>
</p>
      
</div>

In [None]:
len(news.category.unique())

In [None]:
news.category = news.category.replace({"HEALTHY LIVING": "WELLNESS",
              "QUEER VOICES": "GROUPS VOICES",
              "BUSINESS": "BUSINESS & FINANCES",
              "PARENTS": "PARENTING",
              "BLACK VOICES": "GROUPS VOICES",
              "THE WORLDPOST": "WORLD NEWS",
              "STYLE": "STYLE & BEAUTY",
              "GREEN": "ENVIRONMENT",
              "TASTE": "FOOD & DRINK",
              "WORLDPOST": "WORLD NEWS",
              "SCIENCE": "SCIENCE & TECH",
              "TECH": "SCIENCE & TECH",
              "MONEY": "BUSINESS & FINANCES",
              "ARTS": "ARTS & CULTURE",
              "COLLEGE": "EDUCATION",
              "LATINO VOICES": "GROUPS VOICES",
              "CULTURE & ARTS": "ARTS & CULTURE",
              "FIFTY": "MISCELLANEOUS",
              "GOOD NEWS": "MISCELLANEOUS"}
            )

In [None]:
len(news['category'].unique())

In [None]:
plt.figure(figsize=(10, 10))
plt.pie(x=news.category.value_counts(), labels=news.category.value_counts().index, autopct='%1.1f%%', textprops={'fontsize' : 8,
                                                                                                                'alpha' : .7});
plt.title('The percentage of instances belonging to each class', alpha=.7);
plt.tight_layout();

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#44479b;
            font-size:110%;
            font-family:Verdana;
            letter-spacing:0.5px;
            text-align:center">
  <h1 id="Data Preprocessing" style="padding: 10px; color:white; text-align:center;">
   4. Data Preprocessing
    <a class="anchor-link" href="https://www.kaggle.com/code/amirhoseinsedaghati/news-classification-using-bert-transformer-72-acc#Data_Preprocessing">¶</a>
  </h1>
</div>

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>4.x. Data Transformation</strong>

</div>

In [None]:
with open('/kaggle/input/english-stopwords/EN-Stopwords.txt', 'r') as f:
    # one stop word per line; the with block closes the file automatically
    stopwords = set(f.read().splitlines()) # a set makes the membership test below fast

In [None]:
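def text_preprocessing(df:pd.DataFrame):
    """
    Joins the headline and the short description of each row, then performs tokenization,
    removal of stop words, punctuation, numbers, and empty strings, and lemmatization.
    Returns the cleaned texts, the labels, and the length (in tokens) of the longest cleaned text.
    """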
    lem = WordNetLemmatizer()
    new_df = pd.DataFrame(columns=['head_desc', 'category'])
    max_len = 0
    for index, row in df.iterrows():
        head_desc = row.headline + " " + row.short_description
        head_desc_tokenized = word_tokenize(head_desc) # Word Tokenization
        punctuation_stopwords_removed = [re.sub(r'[^\w\s]', '', token) for token in head_desc_tokenized if token not in stopwords] # punctuation and stop words removal
        number_removed = [re.sub(r'\d+', '', token) for token in punctuation_stopwords_removed] # numbers removal
        head_desc_lemmatized = [lem.lemmatize(token) for token in number_removed] # Word Lemmatization
        empty_str_removed = [token for token in head_desc_lemmatized if token != ''] # empty strings removal
        if len(empty_str_removed) > max_len:
            max_len = len(empty_str_removed)
        new_df.loc[index] = {
            'head_desc' : " ".join(empty_str_removed),
            'category' : row['category']
        }
    X, y = new_df['head_desc'], new_df['category']
    return X, y, max_len

In [None]:
X, y, max_len = text_preprocessing(news)
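
To see what the cleaning steps do to a single text, here is a simplified sketch that uses only the standard library. It is an illustration under stated assumptions rather than the function above: `clean_text` and `TOY_STOPWORDS` are hypothetical names, tokenization is a plain whitespace split instead of `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical stand-in for the stop-word file loaded above.
TOY_STOPWORDS = {"the", "a", "of", "in", "is"}

def clean_text(text, stopwords=TOY_STOPWORDS):
    """Simplified cleaning: split on whitespace, drop stop words,
    strip punctuation and digits, drop empty tokens. No lemmatization."""
    tokens = text.split()  # crude tokenization
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens if t not in stopwords]
    tokens = [re.sub(r"\d+", "", t) for t in tokens]
    return [t for t in tokens if t != ""]

example = "Stocks rally in 2023 as the market rebounds!"
print(clean_text(example))
```

The token `2023` becomes an empty string after the digits are stripped, which is why the last step filters empty tokens.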

In [None]:
def save_data(name, data):
    with open(name, 'wb') as f: # the with block closes the file automatically
        pickle.dump(data, f)

In [None]:
save_data('X.h5', X)
save_data('y.h5', y)
save_data('max_len', max_len)

In [None]:
def load_data(name):
    with open(name, 'rb') as f: # the with block closes the file automatically
        return pickle.load(f)

In [None]:
# X = load_data('X.h5')
# y = load_data('y.h5')
# max_len = load_data('max_len')

In [None]:
max_len

In [None]:
X.shape, y.shape

In [None]:
X

In [None]:
y.head()

In [None]:
y = pd.get_dummies(y) # OHE
classes_name = y.columns.tolist()
y.head()

In [None]:
y = y.astype(int).values # cast the boolean indicator columns to 0/1 integers
y.shape
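
As a small illustration of what one-hot encoding produces and how it is undone later with an arg-max, here is a pure-Python sketch. The labels are hypothetical, and the alphabetical column order is meant to mirror the column order of `pd.get_dummies`.

```python
labels = ["POLITICS", "SPORTS", "POLITICS", "TRAVEL"]  # hypothetical labels

# One column per class, in alphabetical order.
classes = sorted(set(labels))
one_hot = [[1 if label == c else 0 for c in classes] for label in labels]

# Recover the class index of each row, which is what np.argmax(..., axis=1) does later on.
indices = [row.index(max(row)) for row in one_hot]
decoded = [classes[i] for i in indices]

print(classes)
print(one_hot)
print(decoded)
```

Keeping the list of class names (here `classes`, in the notebook `classes_name`) is what makes it possible to map predicted indices back to readable categories.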

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>4.xx. Data Splitting</strong>

</div>

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=.3, random_state=42)
X_eval, X_test, y_eval, y_test = train_test_split(X_temp, y_temp, test_size=.5, random_state=42)
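
The two consecutive splits above yield roughly 70% training, 15% evaluation, and 15% test data. The sketch below only does the arithmetic: `split_sizes` is a hypothetical helper, the sample count is made up, and rounding the held-out size up is an assumption about how the splitter rounds.

```python
import math

def split_sizes(n_samples, temp_fraction=0.3, test_fraction_of_temp=0.5):
    """Sizes of the train/eval/test sets produced by two consecutive splits."""
    n_temp = math.ceil(n_samples * temp_fraction)       # held out by the first split
    n_train = n_samples - n_temp
    n_test = math.ceil(n_temp * test_fraction_of_temp)  # the second split halves the held-out part
    n_eval = n_temp - n_test
    return n_train, n_eval, n_test

print(split_sizes(1000))  # hypothetical dataset of 1000 rows
```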

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>4.x. Data Transformation</strong>

</div>

In [None]:
# create a DistilBertTokenizer object
tokenizer = DistilBertTokenizer.from_pretrained(pretrained_model_name_or_path="distilbert-base-uncased") 

In [None]:
save_data('tokenizer.h5', tokenizer)

In [None]:
# tokenizer = load_data('tokenizer.h5')

In [None]:
def tokenizer_preprocessing(texts, tokenizer):
    """
    In text classification tasks, it is common to have input texts of varying lengths.
    To handle this variability, we call the helper method batch_encode_plus on
    a DistilBertTokenizer object so that all encoded inputs have the same length.
    """
    encoded_dict = tokenizer.batch_encode_plus(
        list(texts), # pass a plain list of strings
        return_token_type_ids=False,
        padding='max_length', # pad every text to max_length tokens (replaces the deprecated pad_to_max_length=True)
        truncation=True, # cut encodings that are longer than max_length
        max_length=max_len
    )
    return np.array(encoded_dict['input_ids']) # convert a list to an array

In [None]:
padded_train = tokenizer_preprocessing(X_train, tokenizer)
padded_eval = tokenizer_preprocessing(X_eval, tokenizer)
padded_test = tokenizer_preprocessing(X_test, tokenizer)
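
What padding to a fixed length means can be sketched in a few lines of pure Python. This is an illustration only: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the padding id of 0 is an assumption. A real tokenizer also keeps the special tokens in place when it truncates, which this sketch does not.

```python
PAD_ID = 0  # assumed padding id, for illustration only

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut a sequence to max_length, then right-pad it with pad_id."""
    truncated = token_ids[:max_length]
    return truncated + [pad_id] * (max_length - len(truncated))

# Two encoded texts of different lengths (made-up ids).
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 6251, 102]]
padded = [pad_or_truncate(ids, max_length=5) for ids in batch]
print(padded)
```

After this step every row has the same length, so the batch can be stored as a rectangular array, which is what `np.array(encoded_dict['input_ids'])` relies on.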

In [None]:
save_data('padded_train.h5', padded_train)
save_data('padded_eval.h5', padded_eval)
save_data('padded_test.h5', padded_test)

In [None]:
# padded_train = load_data('padded_train.h5')
# padded_eval = load_data('padded_eval.h5')
# padded_test = load_data('padded_test.h5')

In [None]:
padded_train.shape, padded_eval.shape, padded_test.shape

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#44479b;
            font-size:110%;
            font-family:Verdana;
            letter-spacing:0.5px;
            text-align:center">
  <h1 id="Data Modeling and Evaluation" style="padding: 10px; color:white; text-align:center;">
   5. Data Modeling and Evaluation
    <a class="anchor-link" href="https://www.kaggle.com/code/amirhoseinsedaghati/news-classification-using-bert-transformer-72-acc#Data_Modeling_and_Evaluation">¶</a>
  </h1>
</div>

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>5.x. Transfer Learning</strong>
      
</div>

In [None]:
with dist_strategy.scope():
    pretrained_model = TFAutoModel.from_pretrained(pretrained_model_name_or_path='distilbert-base-uncased')
    
    # Transfer Learning -> using transfer learning, we can use a pre-trained model in a customized way.
    input_layer = Input(shape=(max_len,), dtype=tf.int32)
    # we utilize a pre-trained transformer model to process input sequences or texts
    transformer_layers = pretrained_model(input_layer)[0] 
    # The CLS variable stores the representation of the [CLS] token, which is typically used as an aggregate representation of the entire input sequence.
    CLS = transformer_layers[:, 0, :] 
    drop1 = Dropout(.8)(CLS)
    output = Dense(27, activation='softmax')(drop1)
    
    bert_tf = Model(inputs=input_layer, outputs=output)
    bert_tf.compile(
        loss = 'categorical_crossentropy',
        optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01),
        metrics = [F1Score(num_classes=27, average='macro')]
    )

bert_tf.summary()
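
The slice `transformer_layers[:, 0, :]` keeps, for every text in the batch, only the hidden vector at position 0, which is where the [CLS] token sits. Here is a tiny sketch of the same indexing with nested lists; the shapes and values are made up and far smaller than the real hidden states.

```python
# Hidden states with shape (batch=2, sequence_length=3, hidden_size=4); values are made up.
hidden_states = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Equivalent of hidden_states[:, 0, :] on an array: first position of every sequence.
cls_vectors = [sequence[0] for sequence in hidden_states]

print(cls_vectors)                             # one vector per text
print(len(cls_vectors), len(cls_vectors[0]))   # shape (batch, hidden_size)
```

The result has one vector per input text, which is the shape the dropout and dense layers expect.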

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>5.xx. Training Model</strong>

</div>

In [None]:
EPOCHS = 50
BATCH_SIZE = 32 * dist_strategy.num_replicas_in_sync # The number of input sequences in each global batch
STEPS_PER_EPOCH = X_train.shape[0] // BATCH_SIZE # The number of batches per epoch
early_stopping = EarlyStopping(patience=10, restore_best_weights=True) # monitors val_loss by default
# mode='max' because a higher F1 score is better; the default 'auto' mode may treat this metric as one to minimize
model_checkpoint = ModelCheckpoint('/kaggle/working/model_weights.h5', monitor='val_f1_score', mode='max', save_best_only=True)
lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2)

history = bert_tf.fit(
    padded_train,
    y_train,
    validation_data=(padded_eval, y_eval),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE, # without this argument, fit falls back to its default batch size of 32
    steps_per_epoch=STEPS_PER_EPOCH,
    callbacks=[lr, early_stopping, model_checkpoint]
)
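
The relationship between the per-replica batch size, the global batch size, and the number of steps per epoch is plain integer arithmetic. The sketch below uses a hypothetical training-set size, and `training_steps` is an illustrative helper, not part of Keras.

```python
def training_steps(n_train, per_replica_batch=32, num_replicas=8):
    """Global batch size, full batches per epoch, and samples left over."""
    global_batch = per_replica_batch * num_replicas
    steps_per_epoch = n_train // global_batch
    leftover = n_train - steps_per_epoch * global_batch
    return global_batch, steps_per_epoch, leftover

# Hypothetical training set of 100,000 texts on 8 TPU cores.
print(training_steps(100_000))
```

The integer division drops the last, incomplete batch, so a few samples may not be seen in a given epoch.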

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>5.xxx. Comparing model metrics using the history attribute</strong>

</div>

In [None]:
plt.figure(figsize=(7, 7))
plt.plot(history.history['loss'], label='loss');
plt.plot(history.history['val_loss'], label='val_loss');
plt.legend();
plt.title('Loss vs Validation Loss');
plt.tight_layout();

In [None]:
plt.figure(figsize=(7, 7))
plt.plot(history.history['f1_score'], label='f1_score');
plt.plot(history.history['val_f1_score'], label='val_f1_score');
plt.legend();
plt.title('F1 Score vs Validation F1 Score');
plt.tight_layout();

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
<strong>5.xxxx. Inference and Evaluating Model</strong>

</div>

In [None]:
y_pred = bert_tf.predict(padded_test)
y_pred_class = np.argmax(y_pred, axis=1) # pick the index of the most probable class for each sample
y_pred_class

In [None]:
y_test_class = np.argmax(y_test, axis=1)
y_test_class

In [None]:
print(classification_report(y_true=y_test_class, y_pred=y_pred_class))
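
To make the macro-averaged F1 score in the report more concrete, here is a from-scratch sketch on a made-up, heavily imbalanced toy example. `macro_f1` is a hypothetical helper written for illustration; it is not the implementation used by scikit-learn or TensorFlow Addons.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Nine samples of class 0, one of class 1; the "model" always predicts class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Accuracy looks high here because the majority class dominates, while the macro F1 score is pulled down by the class that is never predicted. This is the reason macro F1 is the more informative metric on imbalanced data.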

In [None]:
conf_matrix = confusion_matrix(y_true=y_test_class, y_pred=y_pred_class)
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Oranges', cbar=False,
            xticklabels=classes_name, yticklabels=classes_name); # label every row and column with its class name
plt.xticks(rotation=90);
plt.yticks(rotation=0);
plt.tight_layout();

<div style="color:white;
            display:fill;
            border-radius:5px;
            background-color:#44479b;
            font-size:110%;
            font-family:Verdana;
            letter-spacing:0.5px;
            text-align:center">
  <h1 id="Conclusion" style="padding: 10px; color:white; text-align:center;">
   6. Conclusion
    <a class="anchor-link" href="https://www.kaggle.com/code/amirhoseinsedaghati/news-classification-using-bert-transformer-72-acc#Conclusion">¶</a>
  </h1>
</div>

<div style="color:black;
           display:fill;
           border-radius:5px;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px;
           text-align:left">
    
Since we are dealing with an imbalanced dataset, accuracy alone would be a misleading evaluation metric. Instead, we used the macro-averaged F1 score available in the <code>tensorflow_addons.metrics</code> module. I believe the class imbalance still limits both our F1 score and our accuracy. However, considering the difficulty of a classification task with 27 classes, we still managed to achieve good results.<br />

<br />You can try running this code with a balanced dataset, using the oversampling or undersampling techniques provided in the <code>imblearn</code> package, to see if we can achieve even better results. 
      
</div>
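
As a rough idea of what random oversampling does, here is a hand-rolled sketch in pure Python. It is not the `imblearn` API: `random_oversample` is a hypothetical helper, and the texts and labels are made up. Minority classes are filled up with randomly repeated samples until every class is as large as the biggest one.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Repeat random samples of the smaller classes until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Made-up, imbalanced toy data.
texts = ["a", "b", "c", "d", "e"]
labels = ["POLITICS", "POLITICS", "POLITICS", "SPORTS", "TRAVEL"]

new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

If you try this, resample only the training split, so that the evaluation and test sets keep the original class distribution.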