<font color="#006400" face="sans-serif" size="7">
<b>Erika Shrestha</b>
</font>
<BR>
<font color="#006400" face="sans-serif" size="4.5">
<b>London Met ID: 23048598</b>
</font>

---


<b>FindCite is a text classification model that automatically classifies citations in scientific research papers according to their purpose. This project compares a baseline model (Logistic Regression) with an advanced model (SciBERT) to assess their adaptability and limitations.</b>

In [None]:
#INSTALL necessary libraries that are not available by default in Colab
#!pip install --upgrade transformers

---
---

# **TEMPORARY DATA LOAD**
This notebook is designed to be fully reproducible in Google Colab.
Please ensure the setup cells (dataset download and folder creation) are run sequentially before executing the rest of the notebook.

---
---

In [None]:
#CREATE an empty folder in colab temporary files
!mkdir -p data

In [None]:
#DOWNLOAD the archive that contains the train, test and validation files
!wget https://ai2-s2-research.s3-us-west-2.amazonaws.com/scicite/scicite.tar.gz

In [None]:
#EXTRACT the downloaded tar.gz archive
!tar -xvzf scicite.tar.gz

In [None]:
#MOVE the required json files into the created data folder
!mv scicite/train.jsonl scicite/dev.jsonl scicite/test.jsonl data/

-------
-------

# **Let's Get Started**

In [None]:
#IMPORT all the necessary libraries used in the project
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import re
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, label_binarize
from sklearn.utils.class_weight import compute_class_weight
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from wordcloud import WordCloud
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
from scipy.special import softmax
from tf_keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, precision_recall_curve, roc_curve, auc
from statsmodels.stats.contingency_tables import mcnemar
import joblib
from google.colab import files
import warnings
warnings.filterwarnings("ignore")

# **HELPER METHODS**

In [None]:
#FUNCTION to plot word cloud
def plot_wordcloud_per_class(X_texts, y_labels, title="Word Cloud"):
  classes = sorted(set(y_labels))
  fig, axes = plt.subplots(1, len(classes), figsize=(18, 6))

  for ax, cls in zip(axes, classes):
      text = " ".join(X_texts[y_labels == cls])

      wc = WordCloud(
          background_color="white",
          max_words=200,
          width=800,
          height=400
      )

      wc.generate(text)
      ax.imshow(wc, interpolation="bilinear")
      ax.set_title(f"{title} - Class {cls}")
      ax.axis("off")

  plt.tight_layout()
  plt.show()

In [None]:
#FUNCTION to plot learning curve to check overfitting and underfitting
def plot_learning_curve(model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring='f1_macro', title="Learning Curve"):

  train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=cv, train_sizes=train_sizes, scoring=scoring, n_jobs=-1
    )

  plt.plot(train_sizes, train_scores.mean(axis=1), marker='o', color='blue', label='Training Score')
  plt.plot(train_sizes, val_scores.mean(axis=1), marker='s', color='purple', label='Validation Score')
  plt.xlabel('Training Size')
  plt.ylabel('F1 Score')
  plt.title(f'{title} Learning Curve')
  plt.legend()
  plt.show()

In [None]:
#FUNCTION to plot confusion matrix to check model predictions accuracy
def plot_confusion_matrix(y_true, y_pred, labels, title="Confusion Matrix", cmap='Blues'):

  class_names = ['Background', 'Method', 'Result']

  cm = confusion_matrix(y_true, y_pred, labels=labels)
  disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)

  disp.plot(cmap=cmap, xticks_rotation=45)
  plt.title(title)
  plt.show()

In [None]:
#FUNCTION to plot PR curve to check model actual performance and not just accuracy
def plot_precision_recall_curve(y_true, y_scores, title="Precision-Recall Curve"):

    #BINARIZES the labels passed in as y_true (one column per class)
    classes = np.unique(y_true)
    y_true_bin = label_binarize(y_true, classes=classes)

    plt.figure(figsize=(8, 6))
    colors = ['purple', 'blue', 'yellow']

    for i, cls in enumerate(classes):
        precision, recall, _ = precision_recall_curve(y_true_bin[:, i], y_scores[:, i])
        auc_score = auc(recall, precision)
        plt.plot(recall, precision, color=colors[i], label=f'Class {cls} (AUC={auc_score:.2f})')

    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.title(f'{title} PR Curve')
    plt.legend()
    plt.grid(False)
    plt.show()

In [None]:
#FUNCTION to plot ROC-AUC curve to check one vs rest
def plot_roc_auc_curve(y_true, y_scores, class_names=None, title="ROC-AUC Curve"):
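    classes = np.unique(y_true)
    y_bin = label_binarize(y_true, classes=classes)
    #FALLS BACK to the raw class values when no display names are given
    if class_names is None:
        class_names = classes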

    plt.figure(figsize=(8,6))
    colors = ['purple', 'blue', 'yellow']

    for i, cls in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_scores[:, i])
        auc_score = auc(fpr, tpr)
        plt.plot(fpr, tpr, color=colors[i], label=f'Class {class_names[i]} (AUC={auc_score:.2f})')

    # Random classifier baseline
    plt.plot([0,1], [0,1], 'k--', label='Random')

    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(title)
    plt.legend()
    plt.grid(False)
    plt.show()


In [None]:
#FUNCTION to plot scibert history behavior based on metrics: loss and accuracy
def plot_history(history, metric='loss', title="Training vs Validation"):

    plt.figure(figsize=(8,6))
    plt.plot(history.history[metric], label=f'Train {metric}', marker='o', color='blue')
    plt.plot(history.history['val_' + metric], label=f'Val {metric}', marker='s', color='purple')
    plt.xlabel('Epochs')
    plt.ylabel(metric)
    plt.title(f'{title} {metric}')
    plt.legend()
    plt.grid(False)
    plt.show()

----
----

# **EXPLORATORY DATA ANALYSIS (EDA)** 📊

**The original SciCite dataset provides separate train, development, and test splits. To understand the characteristics of the data, the train, development and test sets were concatenated into one larger labeled dataset. The test set was examined with particular care because it lacks one column (label_confidence).**

In [None]:
#IMPORT training data from data Folder
train_df = pd.read_json('data/train.jsonl', lines=True)
#head() shows only 5 documents by default
train_df.head()

In [None]:
#IMPORT testing data from data Folder
test_df = pd.read_json('data/test.jsonl', lines=True)
#head() shows only 5 documents by default
test_df.head()

In [None]:
#IMPORT validation data from data Folder
val_df = pd.read_json('data/dev.jsonl', lines=True)
#head() shows only 5 documents by default
val_df.head()

In [None]:
#CONCATENATE train, validation and test sets
explore_df = pd.concat([train_df, val_df, test_df])
explore_df

**The individual dataset splits were first concatenated into a single dataframe with a shape of (11020 x 15) to facilitate unified preprocessing and exploratory data analysis (EDA).**

----
----

# **Data Quality Check** ☑️

1.   Missing Values Check
2.   Data Type Consistency
3.   Duplication Check
4.   Label Distribution
5.   Text Length Distribution


---
---

#####**1. Missing Values Check**

In [None]:
#CHECKS the missing values of all the columns in explore_df
#logarithm scaling is used for y-axis to visualize columns with large gap in missing values
missing_counts = explore_df.isnull().sum()
plt.figure(figsize=(10,6))
plt.bar(missing_counts.index, missing_counts.values, color='#E0AAFF')
plt.yscale('log')
plt.title('Missing Values Per Column', fontweight='bold')
plt.xticks(rotation=90)
plt.ylabel('Missing Values')
plt.xlabel('Columns')
plt.tight_layout()

In [None]:
#CHECKS whether there is presence of high missing values (case = missing_values > 50%)
total_rows = len(explore_df)
missing_percent_per_columns = ((missing_counts / total_rows) * 100).round(1)
missing_percent_per_columns

**Eight columns have no missing values, which means the core data is complete and reliable. A few columns such as source, citeStart, citeEnd, and sectionName have only a small number of missing entries so they are not a major issue.**

**However, the columns label_confidence, label2, and label2_confidence have a very large number of missing values. This means secondary labels and their confidence scores are mostly unavailable and may not be useful to perform multi label classification.**

---
---

#####**2. Data Type Consistency**

In [None]:
#CHECKS data types for each dataframe
print(f"\033[1mTrain sets data types:\033[0m\n{train_df.dtypes}\n")
print(f"\033[1mValidation sets data types:\033[0m\n{val_df.dtypes}\n")
print(f"\033[1mTest sets data types:\033[0m\n{test_df.dtypes}\n")

**According to this output, all the data types are consistent across the three splits.**

----
----

#####**3. Duplication Check**

In [None]:
#SHOWS all the duplicate document in the dataframe through keep=false
duplicates = explore_df[explore_df.duplicated(keep=False)]
print(duplicates.to_string())

Rows **4378** and **503** were found to have identical entries in every column.

In [None]:
#REMOVES the duplicated document but keeps the first occurrence
explore_df = explore_df.drop_duplicates(keep='first')

In [None]:
#CROSS CHECKING the removal of duplicates
explore_df.duplicated().sum()

**After finding duplicated documents in the dataframe, we remove the repeats and keep only the first occurrence.**

----
----

#####**4. Label Distribution**

In [None]:
#CALCULATES total occurrences of each class to check class imbalance
class_counts = explore_df['label'].value_counts()
plt.figure(figsize=(4,4))
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', colors=['#FFD580','#FFFF99','#F08080'])
plt.title('Citation Intent Class Distribution', fontsize=16, pad=20)
plt.show()

The results show that the classes in the label column are unevenly distributed, indicating class imbalance: **background** dominates with **58.7%**, followed by **method** with **27.8%** and **result** with just **13.5%**. This may bias the model toward the majority class (i.e. background).
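
A common remedy for this kind of imbalance is to weight each class inversely to its frequency, which is the heuristic behind scikit-learn's `class_weight='balanced'` option. The sketch below works through that formula with hypothetical counts chosen only to mirror the percentages above (587, 278 and 135 per 1,000 citations); they are not the real dataset counts.

```python
# Balanced class weights: w_c = n_samples / (n_classes * n_c)
# The counts are illustrative, scaled to the percentages quoted above.
counts = {"background": 587, "method": 278, "result": 135}

n_samples = sum(counts.values())
n_classes = len(counts)

balanced_weights = {
    label: n_samples / (n_classes * n)
    for label, n in counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label:<10} weight = {weight:.2f}")
```

The rarest class receives the largest weight, so its errors count for more during training.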

---
---

#####**5. Text Length Distribution**

In [None]:
#FINDS the text length in string column for each row
text_lengths = explore_df['string'].str.len()
#GROUPS the extracted text length by label and calculates the average through mean()
avg_length = text_lengths.groupby(explore_df['label']).mean()

In [None]:
#ROUNDS the values to 1 decimal place
avg_length.round(1)

**This check is done to ensure that differences in text length between classes do not unfairly influence the model performance. This ensures that the model should focus on semantic meaning rather than length.**

----
----

# **Feature Selection** 🔎

**Before training the model, it is essential to identify and retain only the most relevant features and target variables from the dataset. Since the focus of this project is text classification, only the column containing the citation sentences is selected as the feature and the column containing the citation intent labels as the target.**

In [None]:
#PERFORM Feature Selection for explore_df by copying selected features into new dataframe
findcite_df = explore_df[['string', 'label']].copy()
findcite_df

----
----

# **DATA SPLIT** 📑

**The dataset is divided into training and test sets with 80-20 ratio. The training set is further partitioned to perform cross validation resulting in three distinct subsets: Train, Validation, and Test. This split ensures that the model can learn from the training data, tune hyperparameters using the validation set, and be evaluated on an unseen test set to provide an unbiased assessment of its performance. Stratification is applied during splitting to preserve the original class distribution across all subsets which is particularly important for multi-class classification tasks.**
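
To illustrate what stratification does, here is a minimal pure-Python sketch (not scikit-learn's implementation) that samples the test portion separately from each class. The function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 60/30/10 imbalance (illustrative only)
toy_labels = ["background"] * 60 + ["method"] * 30 + ["result"] * 10
train_idx, test_idx = stratified_split(toy_labels)

print(Counter(toy_labels[i] for i in test_idx))
```

Because each class is sampled on its own, the test set keeps the same 60/30/10 proportions as the full toy dataset.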

In [None]:
#DEFINES X (Features) and y (Target)
X = findcite_df['string']
y = findcite_df['label']

In [None]:
#SPLITS train and tests with stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

---
---

# **Text Preprocessing** 🔤

1.   Tokenization
2.   Text Removal (lower casing, punctuation removal, characters removal, space parsing removal, stopwords removal)
3.   Lemmatization

In [None]:
#DOWNLOADS NLTK resources required for text preprocessing
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
#INITIALIZES the list of English stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
#FUNCTION to preprocess string columns into cleaned texts for logistic regression model training
def preprocess_text_for_logreg(text):

    #TOKENIZES
    words = text.split()

    #REMOVES stray whitespace
    cleaned_words = []
    for word in words:
        word = word.strip()
        if word != '':
            cleaned_words.append(word)

    #CONVERTS to lowercase
    lowercase_words = []
    for word in cleaned_words:
        lowercase_words.append(word.lower())

    #REMOVES stopwords
    filtered_words = []
    for word in lowercase_words:
        if word not in stop_words:
            filtered_words.append(word)

    #LEMMATIZES
    lemmatized_words = []
    for word in filtered_words:
        lemma = lemmatizer.lemmatize(word)
        lemmatized_words.append(lemma)

    #STORES back to strings
    cleaned_text = ' '.join(lemmatized_words)

    return cleaned_text

In [None]:
#APPLIES preprocessing function for logistic regression
X_train_lr = X_train.apply(preprocess_text_for_logreg)
X_test_lr  = X_test.apply(preprocess_text_for_logreg)

In [None]:
#FUNCTION to preprocess string columns into cleaned texts for scibert model training
def preprocess_text_for_scibert(text):

    #REMOVES leading/trailing whitespace and collapses repeated whitespace
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    return text

In [None]:
#APPLIES preprocessing function for scibert
X_train_scibert = X_train.apply(preprocess_text_for_scibert)
X_test_scibert  = X_test.apply(preprocess_text_for_scibert)

**The data is preprocessed differently depending on the model. For Logistic Regression, the text is cleaned by removing extra spaces, lowercasing, removing stopwords and lemmatizing. For SciBERT, only leading and trailing spaces are stripped and repeated whitespace is collapsed, since the model can handle raw scientific text on its own.**
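
The contrast between the two paths can be sketched on a single sentence. This is a simplified illustration, assuming a tiny hypothetical stopword set in place of NLTK's English list and leaving out lemmatization; `light_clean` and `heavy_clean` are made-up names, not functions from this notebook.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"the", "of", "in", "we", "a", "to"}

def light_clean(text):
    """SciBERT-style cleaning: trim and collapse whitespace only."""
    return re.sub(r"\s+", " ", text.strip())

def heavy_clean(text):
    """Bag-of-words-style cleaning: lowercase and drop stopwords (no lemmatization here)."""
    words = text.lower().split()
    return " ".join(w for w in words if w not in TOY_STOPWORDS)

sample = "  We used   the method of Smith et al.  "
print(light_clean(sample))
print(heavy_clean(sample))
```

The light version keeps casing and function words for the transformer's own tokenizer, while the heavy version reduces the sentence to content words for TF-IDF.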

## **WORD CLOUDS** ☁️

In [None]:
#SHOWS word cloud for logistic regression
plot_wordcloud_per_class(X_train_lr, y_train, title="Word Cloud for Logistic Regression:")

In [None]:
#SHOWS word cloud for sciBERT
plot_wordcloud_per_class(X_train_scibert, y_train, title="Word Cloud for SciBERT:")

----
----

# **Target Preprocessing** 🎯

In [None]:
#CONVERTS labels into label numberings
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

**It converts the labels (Target) into categorical encoding like 0, 1 and 2.**
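
As a rough illustration of what the encoder does, the mapping can be reproduced in plain Python: each label gets the position it holds in the sorted list of unique labels. The short label list below is made up for the example.

```python
# Map each label to its position in the sorted list of unique labels,
# which mirrors how LabelEncoder orders its classes_.
labels = ["method", "background", "result", "background"]

classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}
encoded = [label_to_id[label] for label in labels]

print(label_to_id)
print(encoded)
```

With the three citation intents this yields background = 0, method = 1 and result = 2.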

---
---

# **Model Training** 🤖

**In this section, we train two different models to classify citation intent and compare their performance. Beyond the performance comparison, the baseline model is also tuned experimentally so that the analysis reflects real-life concerns, addressing the class imbalance identified earlier and the risk of overfitting rather than focusing on accuracy alone.**

---

#####**1. TF-IDF + Logistic Regression**

**A multinomial Logistic Regression model (with Softmax activation) is then trained on the TF-IDF features to predict the probability of each citation intent class.**
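
For intuition, the softmax step can be sketched on its own: it turns one raw score per class into probabilities that sum to 1. The scores below are hypothetical values, not outputs of the trained model.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores (w·x + b) for background, method, result
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)

print([round(p, 3) for p in probs])
```

The predicted class is simply the one with the highest probability.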

In [None]:
#CREATES pipeline to perform TF-IDF to only the preprocessed training set avoiding data leakage
logreg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

In [None]:
#TRAINS the model using the pipeline
logreg_pipeline.fit(X_train_lr, y_train_enc)

In [None]:
#SHOWS training and validation scores for baseline model training behavior
plot_learning_curve(
    logreg_pipeline,
    X_train_lr, y_train_enc,
    cv=5,
    scoring='f1_macro',
    title="Baseline Logistic Regression"
)

In [None]:
#PREDICTIONS of trained baseline model
y_pred = logreg_pipeline.predict(X_test_lr)

In [None]:
#SHOWS classification report
print(classification_report(y_test_enc, y_pred))

In [None]:
#SHOWS confusion matrix
plot_confusion_matrix(
    y_test_enc,
    y_pred,
    labels=logreg_pipeline.classes_,
    title="Baseline Logistic Regression Confusion Matrix"
)

**For the baseline Logistic Regression model, we use a simple train/test split to evaluate performance with default settings. The baseline performs reasonably well, with a macro F1-score of 76% and an accuracy of 80%. However, the learning curve shows a clear sign of overfitting: the training and validation scores start roughly 25% apart. In addition, the overall scores are driven largely by the majority class, while recall for the minority classes is noticeably weaker. So although the baseline gives a reasonable overall performance, overfitting and low minority-class recall need to be addressed.**

---

#####**Hyperparameter tuning**

The hyperparameter tuning techniques used in order to address the issue are:

1. Grid Search CV 🗳️
2. Class weights ⚖️

In [None]:
#ADDS class weight = 'balanced' to address the minority classes and have a balanced result
tuned_logreg_pipeline = Pipeline([
  ('tfidf', TfidfVectorizer()),
  ('clf', LogisticRegression(
      max_iter=1000,
      class_weight = 'balanced',
      random_state=42
  ))
])

In [None]:
#INITIALIZES stratified folds so each fold has roughly the same class proportions as the whole dataset
skf = StratifiedKFold(
  n_splits=5,
  shuffle=True,
  random_state=42
)

In [None]:
#SELECTS appropriate candidates for the best grid search combination
param_grid = {
    'tfidf__max_features': [3000, 4000, 6000],
    'clf__C': [0.005, 0.01, 0.03, 0.05, 0.1],
    'tfidf__ngram_range': [(1, 1),(1,2)],
    'tfidf__min_df': [5, 10]
}

In [None]:
#RETRIEVES scorings of F1-score since accuracy can be misleading
scoring = {
    'f1_macro': 'f1_macro',
    'f1_micro': 'f1_micro',
    'f1_weighted': 'f1_weighted'
}

In [None]:
#PERFORMS hyperparameter tuning using GridSearchCV
grid_search = GridSearchCV(
    tuned_logreg_pipeline,
    param_grid,
    cv=skf,
    scoring=scoring,
    refit='f1_macro',
    verbose=2
)

In [None]:
#TRAINS and DISPLAYS the best hyperparameter combination and the best macro-F1 score achieved
grid_search.fit(X_train_lr, y_train_enc)
print("The best settings are:", grid_search.best_params_)
print("The best F1 Score reached:", grid_search.best_score_)

In [None]:
#PLOTS top TF-IDF features
feat = grid_search.best_estimator_.named_steps['tfidf'].get_feature_names_out()
imp  = grid_search.best_estimator_.named_steps['clf'].coef_

top = np.argsort(np.mean(np.abs(imp), axis=0))[-10:]

plt.barh(feat[top], np.mean(np.abs(imp), axis=0)[top], color='plum')
plt.show()

In [None]:
#SHOWS training and validation scores for model training behavior
plot_learning_curve(
    model=grid_search.best_estimator_,
    X=X_train_lr,
    y=y_train_enc,
    cv=5,
    scoring='f1_macro',
    title="Tuned Logistic Regression"
)

In [None]:
#CHECKS model convergence
grid_search.best_estimator_.named_steps['clf'].n_iter_

**The solver converged after 24 iterations, well within the max_iter=1000 limit, so the model has learned as much as it can from the training data.**

In [None]:
#PREDICTS using the best estimator
y_pred= grid_search.best_estimator_.predict(X_test_lr)
y_proba = grid_search.best_estimator_.predict_proba(X_test_lr)
confidence = np.max(y_proba, axis=1)

In [None]:
#SHOWS classification report
print(classification_report(y_test_enc, y_pred))

In [None]:
#SHOWS confusion matrix
plot_confusion_matrix(
    y_test_enc,
    y_pred,
    labels=grid_search.best_estimator_.classes_,
    title="Tuned Logistic Regression Confusion Matrix"
)

In [None]:
#SHOWS the one vs rest ROC-AUC curve
y_scores_lr = grid_search.best_estimator_.predict_proba(X_test_lr)
plot_roc_auc_curve(y_test_enc, y_scores_lr, class_names=le.classes_, title="Tuned Logistic Regression")

In [None]:
#SHOWS the one vs rest PR curve
y_scores_lr = grid_search.best_estimator_.predict_proba(X_test_lr)
plot_precision_recall_curve(y_test_enc, y_scores_lr, title="Tuned Logistic Regression")

**After tuning, the training and validation scores are much closer together, with a gap of only about 4%. The validation score rises to approximately 0.75-0.79 and converges more tightly with the training curve. Although accuracy drops slightly from 80% to 79%, the F1-scores for each class have increased. Because the data is imbalanced, macro F1-score takes priority over accuracy.**

**The baseline model clearly overfit: its training score reached about 89% while it achieved only 80% on the test set. For the tuned model the two are much closer, at 80% for training and 78% for testing.**
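
To see why macro F1 is preferred over accuracy on imbalanced data, consider a toy case (made-up labels, not results from this project) in which a classifier always predicts the majority class. The helper `f1_per_class` is a hypothetical name used only for this sketch.

```python
def f1_per_class(y_true, y_pred, cls):
    """F1 for one class, treating every other class as negative."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: 8 majority-class items (0), 1 each of classes 1 and 2,
# and a classifier that always predicts the majority class.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1_per_class(y_true, y_pred, c) for c in (0, 1, 2)) / 3

print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1:.2f}")
```

Accuracy looks healthy at 0.80, yet macro F1 falls to about 0.30 because the two minority classes each contribute an F1 of zero.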


---

**ADDITIONAL PERFORMANCE METRICS USING GRID SEARCH**

In [None]:
#DISPLAYS best combination by ranks
full_results = pd.DataFrame(grid_search.cv_results_)
necessary_results = full_results[['params', 'mean_test_f1_macro', 'std_test_f1_macro', 'rank_test_f1_macro']]
necessary_results.sort_values(by='rank_test_f1_macro').head()

In [None]:
#DISPLAYS the selected best combination metrics by each folds (K=5)
best_idx = grid_search.best_index_
for i in range(5):
    print(
        f"Fold {i+1} F1-macro:",
        grid_search.cv_results_[f'split{i}_test_f1_macro'][best_idx],
        " | F1-micro:",
        grid_search.cv_results_[f'split{i}_test_f1_micro'][best_idx],
        " | F1-weighted:",
        grid_search.cv_results_[f'split{i}_test_f1_weighted'][best_idx]
    )

---
---

#####**2. SciBERT**

**SciBERT is a transformer-based language model trained specifically on scientific text. It captures contextual relationships and, unlike traditional models, has its own tokenizer.**

In [None]:
#DEFINES the model in use
scibert_model = "allenai/scibert_scivocab_uncased"
num_labels = 3

In [None]:
#LOADS tokenizer
tokenizer = AutoTokenizer.from_pretrained(scibert_model)

In [None]:
#LOADS pretrained SciBERT with a 3-class classification head (from_pt=True converts the PyTorch weights)
scibert = TFAutoModelForSequenceClassification.from_pretrained(scibert_model, num_labels=num_labels, from_pt=True)

In [None]:
#FREEZES all layers except for the classification
for layer in scibert.layers:
    layer.trainable = False

scibert.classifier.trainable = True

In [None]:
#COMPUTE class weights to handle class imbalance for scibert
classes = np.unique(y_train_enc)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train_enc)
class_weights_dict = dict(zip(classes, class_weights))

---

#####**VALIDATION SPLIT**

In [None]:
#SPLITS the training sets into another subset with validation set as 20% of it
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train_scibert, y_train_enc,
    test_size=0.2,
    random_state=42,
    stratify=y_train_enc
)

In [None]:
#USES scibert's own tokenizer to encode the sets
train_encodings = tokenizer(list(X_train_sub), truncation=True, padding=True, max_length=128, return_tensors='tf')
val_encodings = tokenizer(list(X_val), truncation=True, padding=True, max_length=128, return_tensors='tf')
test_encodings = tokenizer(list(X_test_scibert), truncation=True, padding=True, max_length=128, return_tensors='tf')

In [None]:
#CONVERTS tokenized inputs into TensorFlow datasets and batch them for training, validation, and testing
batch_size=8
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train_sub
)).shuffle(1000).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    y_val
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test_enc
)).batch(batch_size)

In [None]:
#COMPILES the SciBERT model
scibert.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

In [None]:
#CREATES an early-stopping callback to limit overfitting (not used in this run; pass callbacks=[early_stop] to fit() to enable it)
early_stop = EarlyStopping(
    monitor="val_accuracy",
    patience=0,
    restore_best_weights=True
)

In [None]:
#TRAINS the SciBERT model
history = scibert.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=5,
    class_weight=class_weights_dict,
)

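**The early-stopping callback defined above (patience=0) would stop training as soon as validation accuracy fails to improve on the best value seen so far. It is not passed to fit() in this run, so all 5 epochs are trained.**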
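
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the patience idea, not the Keras source, and the validation accuracies are hypothetical.

```python
def early_stop_epoch(val_scores, patience=0):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: count consecutive epochs that fail to beat the best
    score so far, and stop once that count reaches max(patience, 1).
    """
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Hypothetical validation accuracies over five epochs
val_acc = [0.70, 0.74, 0.73, 0.76, 0.77]
print(early_stop_epoch(val_acc, patience=0))  # stops at the first dip
print(early_stop_epoch(val_acc, patience=2))  # rides out the single dip
```

With `patience=0` a single dip ends training early, which is why a small fluctuation in validation accuracy can cut the run short.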

In [None]:
#SHOWS the fine-tuned model training vs validation loss
plot_history(history, title='SciBERT Training vs Validation')

In [None]:
#SHOWS the fine-tuned model training vs validation accuracy
plot_history(history, metric='accuracy', title='SciBERT Training vs Validation')

In [None]:
#CHECKS test loss and accuracy
test_loss, test_acc = scibert.evaluate(test_dataset)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}")

In [None]:
#GETS predicted class labels from model outputs
predictions = scibert.predict(test_dataset)
logits = predictions.logits
pred_labels = tf.argmax(logits, axis=1).numpy()

In [None]:
#SHOWS classification report for scibert
print(classification_report(y_test_enc, pred_labels))

In [None]:
#SHOWS confusion matrix for scibert
plot_confusion_matrix(
    y_true=y_test_enc,
    y_pred=pred_labels,
    labels=[0, 1, 2],
    title="SciBERT Confusion Matrix"
)

In [None]:
#SHOWS the one vs rest PR curve
all_logits = []
#UNPACKS each batch so only the features (not the labels) are passed to the model
for features, _ in test_dataset:
    outputs = scibert(features, training=False)
    all_logits.append(outputs.logits)

logits = tf.concat(all_logits, axis=0).numpy()
y_scores_sci = softmax(logits, axis=1)
plot_precision_recall_curve(y_test_enc, y_scores_sci, title="SciBERT")

In [None]:
#SHOWS the roc-auc curve for scibert
plot_roc_auc_curve(y_test_enc, y_scores_sci, class_names=le.classes_, title="SciBERT")

In [None]:
#GIVES model internal flow
scibert.summary()

**Both training and validation loss and accuracy improve over the epochs, which suggests the model generalizes when every layer except the classification layer is frozen. Although validation accuracy fluctuated slightly midway, the F1-scores for each class have increased and validation loss continues to decrease through the 5th epoch. Because the data is imbalanced, macro F1-score takes priority over accuracy.**

----
----

# **Model Difference Validity Check** 📝

**McNemar's test is used to check whether the difference between the two models is statistically significant. The p-value is used to evaluate the hypotheses.**
</br>
</br>
**H₀: Any performance difference is due to chance.**</br>
**H₁: The performance difference is statistically significant.**
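
For reference, the exact form of McNemar's test depends only on the two discordant counts: cases where one model is right and the other is wrong. The sketch below computes the two-sided p-value from a binomial distribution with p = 0.5; the function name `mcnemar_exact_p` and the counts are hypothetical and are not this project's results.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts: b = only model A correct, c = only model B correct
print(mcnemar_exact_p(10, 2))
print(mcnemar_exact_p(6, 6))
```

Cases where both models are right or both are wrong do not affect the p-value; only the imbalance between the two discordant cells does.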

In [None]:
#COMPUTES p-value to validate the statement
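y_pred_scibert = []
#USES the same preprocessed inputs that each model was trained on
y_pred_logreg = grid_search.best_estimator_.predict(X_test_lr)
test_texts = X_test_scibert.tolist()
for i in range(0, len(test_texts), 8):
    batch = tokenizer(test_texts[i:i+8], padding=True, truncation=True, max_length=128, return_tensors="tf")
    logits = scibert(dict(batch), training=False)['logits']
    y_pred_scibert.extend(tf.argmax(logits, axis=1).numpy())
#CONVERTS to an array so the element-wise comparisons below are explicit
y_pred_scibert = np.array(y_pred_scibert)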

a = ((y_pred_logreg == y_test_enc) & (y_pred_scibert == y_test_enc)).sum()
b = ((y_pred_logreg == y_test_enc) & (y_pred_scibert != y_test_enc)).sum()
c = ((y_pred_logreg != y_test_enc) & (y_pred_scibert == y_test_enc)).sum()
d = ((y_pred_logreg != y_test_enc) & (y_pred_scibert != y_test_enc)).sum()

table = [[a, b],
         [c, d]]

result = mcnemar(table, exact=True)
print("p-value:", result.pvalue)

**Since the p-value is less than 0.05, the null hypothesis is rejected. This means that the difference between the models is statistically significant.**

---
---

# **Model Download** 📩

In [None]:
#SAVES the Logistic regression model
joblib.dump(grid_search.best_estimator_, 'logreg_model.pkl')
files.download('logreg_model.pkl')

In [None]:
#SAVES the sciBERT model
scibert.save_pretrained("scibert_model")
tokenizer.save_pretrained("scibert_model")
!zip -r scibert_model.zip scibert_model
files.download("scibert_model.zip")

----
----

# **Thank you!!!** 😊