<a href="https://colab.research.google.com/github/Navashakthi/Fact-Checking-Complete-MLOps-using-HuggingFace-Models/blob/main/Execution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Model Selection**
To choose the best model, we can compare the candidates on a few critical factors: model architecture, model size and computational requirements, performance metrics, and inference speed.

Here we compare the accuracy, F1 score, and loss of each model and combine them into a weighted composite score to pick the best one.
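
For reference, the metrics compared here can be computed from true and predicted labels as sketched below. This is a minimal pure-Python illustration; the function name `classification_metrics` and the toy labels are assumptions made for the example and are not part of the project scripts.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy, micro-F1, and macro-F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def f1(tp_n, fp_n, fn_n):
        denom = 2 * tp_n + fp_n + fn_n
        return 2 * tp_n / denom if denom else 0.0

    per_class = [f1(tp[c], fp[c], fn[c]) for c in labels]
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "micro_f1": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "macro_f1": sum(per_class) / len(per_class),
    }


# Toy labels, made up for illustration only
metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(metrics)
```

Macro-F1 averages the per-class scores, so it drops when a rare class is predicted poorly, while micro-F1 pools all decisions before computing the score.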

In [None]:
import pandas as pd

# Metrics for three models
data = {
    'Model': ['Model A', 'Model B', 'Model C'],
    'Accuracy': [0.6285, 0.797, 0.933],
    'Loss': [1.1227, 0.5858, 0.3454],
    'F1 Score': [0.6545, 0.9234, 0.9154],
    'Micro F1': [0.0, 0.8122, 0.8130],
    'Macro F1': [0.0, 0.6830, 0.6874]
}

df = pd.DataFrame(data)
weights = {
    'Accuracy': 0.2,
    'Loss': 0.2,
    'F1 Score': 0.3,
    'Micro F1': 0.15,
    'Macro F1': 0.15
}


# Calculate a weighted composite score (higher is better)
def calculate_score(row):
    return (
        row['Accuracy'] * weights['Accuracy'] +
        # Lower loss is better, so invert it (the term is negative when loss > 1)
        (1 - row['Loss']) * weights['Loss'] +
        row['F1 Score'] * weights['F1 Score'] +
        row['Micro F1'] * weights['Micro F1'] +
        row['Macro F1'] * weights['Macro F1']
    )

df['Composite Score'] = df.apply(calculate_score, axis=1)

# Select the model with the highest composite score
best_model = df.loc[df['Composite Score'].idxmax()]
print("Best Model:", best_model['Model'])
print("Composite Score:", best_model['Composite Score'])


Best Model: Model C
Composite Score: 0.8172
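
One thing to note about the score above: the term `1 - Loss` is not bounded, and it is negative for Model A (loss 1.1227). A possible alternative, sketched below in plain Python, is to min-max scale the loss across the candidates so that every term lies in [0, 1]. The metric values and weights are copied from the cell above; the scaling choice and the helper names `loss_score` and `composite` are assumptions, not the method used in this notebook.

```python
models = {
    "Model A": {"Accuracy": 0.6285, "Loss": 1.1227, "F1 Score": 0.6545, "Micro F1": 0.0, "Macro F1": 0.0},
    "Model B": {"Accuracy": 0.797, "Loss": 0.5858, "F1 Score": 0.9234, "Micro F1": 0.8122, "Macro F1": 0.6830},
    "Model C": {"Accuracy": 0.933, "Loss": 0.3454, "F1 Score": 0.9154, "Micro F1": 0.8130, "Macro F1": 0.6874},
}
weights = {"Accuracy": 0.2, "Loss": 0.2, "F1 Score": 0.3, "Micro F1": 0.15, "Macro F1": 0.15}

losses = [m["Loss"] for m in models.values()]
lowest, highest = min(losses), max(losses)


def loss_score(loss):
    """Map a loss to [0, 1]: the lowest loss scores 1, the highest scores 0."""
    return (highest - loss) / (highest - lowest) if highest > lowest else 1.0


def composite(metrics):
    """Weighted sum of the metrics, with the loss replaced by its scaled score."""
    return sum(
        weights[name] * (loss_score(metrics[name]) if name == "Loss" else metrics[name])
        for name in weights
    )


scores = {name: composite(m) for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

With these numbers the ranking does not change: Model C still scores highest.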


In [None]:
!pip install -r /content/requirements.txt

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 18.9 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 10.2 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [None]:
!python /content/ingest.py --save_dir data

Loading the PUBHEALTH dataset...
README.md: 100% 8.61k/8.61k [00:00<00:00, 32.6MB/s]
health_fact.py: 100% 7.08k/7.08k [00:00<00:00, 8.39MB/s]
Downloading data: 100% 24.9M/24.9M [00:00<00:00, 87.2MB/s]
Generating train split: 100% 9832/9832 [00:02<00:00, 4745.51 examples/s]
Generating test split: 100% 1235/1235 [00:00<00:00, 6733.52 examples/s]
Generating validation split: 100% 1225/1225 [00:00<00:00, 7439.37 examples/s]
Saving the dataset to data...
Creating json from Arrow format: 100% 10/10 [00:00<00:00, 14.96ba/s]
Saved train split to data/pubhealth_train.jsonl
Creating json from Arrow format: 100% 2/2 [00:00<00:00, 24.82ba/s]
Saved test split to data/pubhealth_test.jsonl
Creating json from Arrow format: 100% 2/2 [00:00<00:00, 23.02ba/s]
Saved validation split to data/pubhealth_validation.jsonl
Download and save complete.
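
To sanity-check a saved split, the JSONL file can be read back line by line. The sketch below uses only the standard library. The helper name `label_counts` is hypothetical, and the code assumes each line is a JSON object with a `label` field, as the columns shown later in this notebook suggest.

```python
import json
from collections import Counter
from pathlib import Path


def label_counts(jsonl_path):
    """Count how many records carry each label in a JSON Lines file."""
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[record.get("label")] += 1
    return counts


if __name__ == "__main__":
    path = Path("data/pubhealth_train.jsonl")  # path printed by ingest.py above
    if path.exists():
        print(label_counts(path))
```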


In [None]:
!python /content/prepare.py --data_dir data --output_dir processed_data

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Loading the PUBHEALTH dataset...
Generating train split: 9832 examples [00:00, 26306.34 examples/s]
Generating validation split: 1225 examples [00:00, 17909.82 examples/s]
Generating test split: 1235 examples [00:00, 26452.55 examples/s]
Processing train split...
Processing validation split...
Processing test split...
Processed dataset saved to processed_data/pubhealth_train.csv


In [None]:
!python /content/train.py

2024-11-06 18:23:50.333980: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-06 18:23:50.368114: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-06 18:23:50.379059: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Map: 100% 9800/9800 [00:05<00:00, 1692.77 examples/s]
Map: 100% 2451/2451 [00:01<00:00, 2439.02 examples/s]
wandb: Currently logged in as: navashakthi (navashakthi-capgemini). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.5
wandb: Run data is saved locally 

In [None]:
!python /content/evaluate.py

In [None]:
!python /content/serve.py

In [None]:
# main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import gradio as gr
import threading

# Initialize FastAPI app
app = FastAPI()

# Load the model and tokenizer (ensure your model path or model loading code is correct)
model_path = '/content/fine-tuned-model'  # Adjust the model path if needed
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Define the Pydantic model for input validation
class Claim(BaseModel):
    text: str

# FastAPI endpoint to get prediction
@app.post("/claim/v1/predict")
async def predict_claim(claim: Claim):
    try:
        # Tokenize input text
        inputs = tokenizer(claim.text, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
            predicted_label = torch.argmax(logits, dim=1).item()
        return {"claim": claim.text, "veracity": predicted_label}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Define the Gradio interface function
def gradio_predict(text):
    try:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits
            predicted_label = torch.argmax(logits, dim=1).item()
        # The model predicts one of four classes (0-3), so a binary True/False
        # mapping would mislabel them; use the label names stored in the model config
        return model.config.id2label.get(predicted_label, str(predicted_label))
    except Exception as e:
        return f"Error: {e}"

# Set up Gradio interface
gr_interface = gr.Interface(fn=gradio_predict, inputs="text", outputs="text",
                            title="Claim Veracity Predictor",
                            description="Enter a claim to predict its veracity.")

# Run Gradio in a separate thread
def run_gradio():
    gr_interface.launch(server_name="0.0.0.0", server_port=7861)

gradio_thread = threading.Thread(target=run_gradio)
gradio_thread.start()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)


INFO:     Will watch for changes in these directories: ['/content']
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [1998] using StatReload


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://07c210e93405940b2d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
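
Once the server is up, the endpoint can be exercised with a small client. The sketch below builds a POST request for `/claim/v1/predict` using only the standard library. `http://localhost:8000` assumes the client runs on the same machine as Uvicorn, and `build_predict_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: client and server share a host


def build_predict_request(text, base_url=BASE_URL):
    """Build a JSON POST request matching the Claim schema ({"text": ...})."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/claim/v1/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url=BASE_URL, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request("Vaccines are safe and effective for preventing many diseases.")
print(request.get_method(), request.full_url)
# Call predict(...) only while the FastAPI server is running.
```

Per the endpoint definition above, a successful response carries the keys `claim` and `veracity`.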


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Define the path to the fine-tuned model
model_path = '/content/fine-tuned-model'

# Load the fine-tuned model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Sample text for inference (matches the recorded output of this cell)
sample_text = "The coronavirus is “simply the common cold."

# Tokenize the input text
inputs = tokenizer(sample_text, return_tensors="pt", padding=True, truncation=True)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_label = torch.argmax(logits, dim=1).item()

# Print the prediction
print(f"Sample Text: {sample_text}")
print(f"Predicted Label: {predicted_label}")


Sample Text: The coronavirus is “simply the common cold.
Predicted Label: 0
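
The raw class index is hard to read on its own. The sketch below turns a logits vector into probabilities and a label name in plain Python. The label order (`false`, `mixture`, `true`, `unproven`) is an assumption based on the PUBHEALTH dataset card and should be checked against `model.config.id2label`; the logits are made-up numbers, not output from this model.

```python
import math

# Assumed label order; verify against the dataset card or model.config.id2label
ID2LABEL = {0: "false", 1: "mixture", 2: "true", 3: "unproven"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL.get(best, str(best)), probs[best]


example_logits = [2.1, 0.3, -0.5, -1.2]  # illustrative values only
label, prob = describe(example_logits)
print(f"{label} ({prob:.2f})")
```

With the model loaded as in the cell above, the same helper could be called as `describe(outputs.logits[0].tolist())`.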


In [None]:
import pandas as pd
import numpy as np

In [None]:
pubhealth_df = pd.read_csv('/content/pubhealth_train.csv')

In [None]:
pubhealth_df.head()

| | claim | label | explanation |
|---|---|---|---|
| 0 | "The money the Clinton Foundation took from fr... | 0 | "Gingrich said the Clinton Foundation ""took m... |
| 1 | Annual Mammograms May Have More False-Positives | 1 | This article reports on the results of a study... |
| 2 | SBRT Offers Prostate Cancer Patients High Canc... | 1 | This news release describes five-year outcomes... |
| 3 | Study: Vaccine for Breast, Ovarian Cancer Has ... | 2 | While the story does many things well, the ove... |
| 4 | Some appendicitis cases may not require ’emerg... | 2 | We really don’t understand why only a handful ... |


In [None]:
claim_samp = pubhealth_df['claim'].to_list()

In [None]:
claim_samp[23:33]

['Rhode Island will become just the second state to mandate the vaccine … and the only state to do so by regulatory fiat, without public debate, and without consideration from the elected representatives of the people.',
 'I’ll never go through shoulder surgery again, so here’s what I did',
 'CostCo stores will require shoppers to wear masks beginning on May 4, 2020.',
 'Brazil cities lurch to lockdowns amid virus crisis red flags.',
 'Laser Used to Blast Away Cells Causing Irregular Heartbeat',
 "Slovakia's new government to sharply ramp up coronavirus testing.",
 'The coronavirus is “simply the common cold.”',
 'Encouraging news about reversing heart disease',
 'Microwave ovens were banned in the USSR in 1976 to protect its citizens from harmful health effects.',
 'Jimi Hendrix, Jim Morrison, Janis Joplin, and Kurt Cobain all died with white Bic lighters in their pockets.']

In [None]:
pubhealth_df.shape

(12292, 5)

In [None]:
def clean_text(text_list):
    """Replace non-breaking spaces and trim whitespace for each text in the list."""
    cleaned_list = []
    for text in text_list:
        # Skip missing values (pandas represents NaN as a float)
        if isinstance(text, float):
            continue
        # Coerce any other non-string value to a string
        if not isinstance(text, str):
            text = str(text)
        cleaned_text = text.replace('\xa0', ' ').strip()
        cleaned_list.append(cleaned_text)
    return cleaned_list

def apply_clean_text(df):
    # Drop rows with NaN in 'cleaned_claim' first, so clean_text returns exactly one item per row
    df_cleaned = df.dropna(subset=['cleaned_claim']).copy()

    # Apply clean_text to the 'cleaned_claim' column
    df_cleaned['cleaned_claim'] = clean_text(df_cleaned['cleaned_claim'].tolist())

    # Write the cleaned values back into the original dataframe, aligned on index
    # (note: DataFrame.update modifies df in place)
    df.update(df_cleaned)

    # Drop rows where 'cleaned_claim' is still NaN
    df = df.dropna(subset=['cleaned_claim'])

    return df

# Example usage:
# Assuming df is your dataframe with a column 'cleaned_claim'
df = apply_clean_text(pubhealth_df)


In [None]:
df.shape

(12277, 5)

In [None]:
df['label'].unique()

array([ 0,  1,  2,  3, -1])

In [None]:
df.count()

claim                  12277
label                  12277
explanation            12277
cleaned_claim          12277
cleaned_explanation    12275


In [None]:
X = list(df['cleaned_claim'])

In [None]:
y = list(df['label'])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)


In [None]:
len(X_train),len(X_test), len(y_train), len(y_test)


(9821, 2456, 9821, 2456)
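
The split above is purely random, so the class proportions in the two parts can drift. With scikit-learn this is usually handled by passing `stratify=y` to `train_test_split`. To make the idea concrete, here is a dependency-free sketch of a stratified split; `stratified_split` and the toy labels are illustrative assumptions, not code from this project.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly test_size of each class held out."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = [0] * 10 + [1] * 5 + [2] * 5  # made-up labels
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```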

In [None]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
import torch

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



In [None]:
train_df = train_df[train_df['label'] >= 0]
val_df = val_df[val_df['label'] >= 0]

In [None]:
print("Unique labels in training set:", train_df['label'].unique())
print("Unique labels in validation set:", val_df['label'].unique())

Unique labels in training set: [2 0 1 3]
Unique labels in validation set: [1 2 0 3]


In [None]:
# Remove rows with the invalid label -1 (a no-op if they were already filtered out)
train_df = train_df[train_df['label'] >= 0]
val_df = val_df[val_df['label'] >= 0]

# Convert the training and validation dataframes to Hugging Face datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# Load pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("austinmw/distilbert-base-uncased-finetuned-health_facts")
model = AutoModelForSequenceClassification.from_pretrained("austinmw/distilbert-base-uncased-finetuned-health_facts")

# Tokenize the dataset
def tokenize_function(texts):
    return tokenizer(texts['cleaned_claim'], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Set the format to PyTorch tensors
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-model')


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
 ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Epoch,Training Loss,Validation Loss
1,0.898,0.848246


('./fine-tuned-model/tokenizer_config.json',
 './fine-tuned-model/special_tokens_map.json',
 './fine-tuned-model/vocab.txt',
 './fine-tuned-model/added_tokens.json',
 './fine-tuned-model/tokenizer.json')
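
The `Trainer` above reports only the training and validation loss. To log accuracy as well, a `compute_metrics` callback can be passed when constructing it. The sketch below is one possible version, written without NumPy so it also runs on plain lists. It assumes the evaluation object unpacks into `(logits, labels)`, which is how `Trainer` hands predictions to the callback; the sample logits are made up.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels); each logits row holds per-class scores."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        scores = list(row)
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}


# Quick check on made-up logits for a 4-class problem
fake_logits = [[2.0, 0.1, 0.0, -1.0], [0.2, 0.1, 3.0, 0.0], [0.0, 1.5, 0.3, 0.2]]
fake_labels = [0, 2, 3]
result = compute_metrics((fake_logits, fake_labels))
print(result)
```

It would be wired in by adding `compute_metrics=compute_metrics` to the `Trainer(...)` call.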