<a href="https://colab.research.google.com/github/Nimeesh-Patel/NLPProjectAFakeNewDetection/blob/main/Collab%20Files/NLP_Part_3_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

### **Text Classification by Fine-tuning a Language Model**

---

### 1. **Data Loading**
   - Load the dataset (an Excel file in this case; only the first 500 rows are read).
   - Perform exploratory data analysis (EDA) to understand class distributions and data structure.
   - Split the dataset into training and validation sets.

In [None]:
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the first 500 rows of the Excel dataset (replace with your dataset path)
data = pd.read_excel('/content/NLP dataset.xlsx', nrows=500)

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure
print(data['label'].value_counts())  # Class distribution

# Split dataset into train and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

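# Prepare the data for SimpleTransformers. It looks for 'text' and 'labels' columns;
# with other column names it falls back to column 0 as text and column 1 as labels.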
train_df = pd.DataFrame({
    'title': train_data['title'],
    'labels': train_data['label']
})

val_df = pd.DataFrame({
    'title': val_data['title'],
    'labels': val_data['label']
})

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   500 non-null    object
 1   label   500 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 7.9+ KB
None
label
0    263
1    237
Name: count, dtype: int64
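The class counts printed above can be turned into class shares and a majority-class baseline, which gives a reference point for the accuracy figures reported later. This is a minimal sketch in plain Python; the counts are copied from the output above, and the variable names are illustrative rather than part of the original notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 263, 1: 237}

total = sum(class_counts.values())

# Share of each class in the 500 loaded rows
class_shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class
majority_baseline = max(class_counts.values()) / total

for label, share in class_shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any fine-tuned model should clearly beat this baseline of roughly 0.53 on data with the same class mix.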


---

### 2. **Text Processing**
   - Clean the text by converting it to lowercase, removing every character that is not an ASCII letter or whitespace (digits, punctuation, and accented letters), and stripping leading and trailing whitespace.

In [None]:
import re

# Define a function to clean text data
def clean_text(title):
    # Convert to lowercase
    title = title.lower()

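    # Remove every character that is not an ASCII letter or whitespace
    # (digits, punctuation, and accented letters are dropped)
    title = re.sub(r'[^a-zA-Z\s]', '', title)

    # Strip leading and trailing whitespace (internal runs of spaces are kept)
    title = title.strip()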

    return title

# Apply the cleaning function to the dataset
train_df['title'] = train_df['title'].apply(clean_text)
val_df['title'] = val_df['title'].apply(clean_text)

print(train_df.head())

                                                 title  labels
249  pew research centers national survey on the ov...       0
433  we could talk about cooking as a function of c...       0
19   wednesday after   donald trumps press conferen...       0
322  on newsmax tvs the steve malzberg show gov gre...       0
332  las frases ms destacadas del debate de investi...       1
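Note that `clean_text` leaves runs of spaces where digits or punctuation were removed (see `wednesday after   donald` in the output above). If that matters for your use case, a variant could also collapse internal whitespace. The sketch below shows one way to do this; the function name `clean_text_collapsed` is hypothetical and not part of the original notebook.

```python
import re

def clean_text_collapsed(title):
    """Lowercase, keep only ASCII letters and whitespace, and collapse whitespace runs."""
    title = title.lower()
    # Drop everything except lowercase ASCII letters and whitespace
    title = re.sub(r'[^a-z\s]', '', title)
    # Replace any run of whitespace with a single space
    title = re.sub(r'\s+', ' ', title)
    return title.strip()

print(clean_text_collapsed("Wednesday after 10  Donald Trump's press conference"))
# wednesday after donald trumps press conference
```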


---

### 3. **Text Embedding using BERT and RoBERTa**
   - Load pretrained BERT and RoBERTa classification models. SimpleTransformers tokenizes the cleaned text and computes the embeddings internally, so no separate embedding step is needed.

In [None]:
from simpletransformers.classification import ClassificationModel

# Create a BERT model for text classification
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

---

### 4. **Model Training with BERT and RoBERTa**

#### **Basic Model Training**

#### **Train the BERT Model**

In [None]:
from simpletransformers.classification import ClassificationArgs, ClassificationModel

# Update the model arguments to allow overwriting the output directory
model_args = ClassificationArgs()
model_args.overwrite_output_dir = True  # Allow overwriting the 'outputs/' directory

# Create or load the BERT model with updated arguments
bert_model = ClassificationModel("bert", "bert-base-uncased", args=model_args, use_cuda=False)

# Train the BERT model
bert_model.train_model(train_df)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(50, 0.5728035014867783)

#### **Train the RoBERTa Model**

In [None]:
roberta_model.train_model(
    train_df,
    args={"overwrite_output_dir": True}  # Allow overwriting
)



(50, 0.45436446242034434)

In [None]:
from simpletransformers.classification import ClassificationArgs

# Set up model arguments with overwrite enabled
model_args = ClassificationArgs(
    num_train_epochs=3,
    train_batch_size=8,
    eval_batch_size=8,
    learning_rate=3e-5,
    max_seq_length=128,
    weight_decay=0.01,
    warmup_steps=0,
    logging_steps=50,
    save_steps=200,
    overwrite_output_dir=True  # <-- Enable overwriting
)

# Train the BERT model
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, args=model_args, use_cuda=False)
bert_model.train_model(train_df)

# Train the RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, args=model_args, use_cuda=False)
roberta_model.train_model(train_df)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(150, 0.22865236798146119)
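The first element of each tuple returned by `train_model` above (50 and 150) is consistent with the number of optimizer steps: 400 training rows (80% of 500) in batches of 8 give 50 steps per epoch. Below is a small sketch of that arithmetic, assuming no gradient accumulation; `steps_per_epoch` is an illustrative helper, not a SimpleTransformers function.

```python
import math

def steps_per_epoch(num_rows, batch_size):
    """Number of batches needed to cover num_rows once (the last batch may be partial)."""
    return math.ceil(num_rows / batch_size)

train_rows = 400  # 80% of the 500 loaded rows
val_rows = 100    # remaining 20%

print(steps_per_epoch(train_rows, 8))      # 50 steps for one epoch
print(steps_per_epoch(train_rows, 8) * 3)  # 150 steps for three epochs
print(steps_per_epoch(val_rows, 8))        # 13 evaluation batches
```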

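---

### 5. **Evaluation on Validation Set**
   - Evaluate both models on the validation set. `eval_model` reports MCC, accuracy, F1-score, the confusion-matrix counts (`tp`, `tn`, `fp`, `fn`), AUROC, AUPRC, and evaluation loss; precision and recall can be derived from the counts.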

#### **Evaluate BERT Model**

In [None]:
# Evaluate BERT on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

print("BERT Evaluation Results:")
print(result_bert)



BERT Evaluation Results:
{'mcc': np.float64(0.8852275638978949), 'accuracy': 0.94, 'f1_score': 0.9391233766233766, 'tp': np.int64(41), 'tn': np.int64(53), 'fp': np.int64(0), 'fn': np.int64(6), 'auroc': np.float64(0.9835407466880771), 'auprc': np.float64(0.986201687263787), 'eval_loss': 0.15426083718641445}


#### **Evaluate RoBERTa Model**

In [None]:
# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

print("RoBERTa Evaluation Results:")
print(result_roberta)



RoBERTa Evaluation Results:
{'mcc': np.float64(0.8630535721759582), 'accuracy': 0.93, 'f1_score': 0.9291426257718393, 'tp': np.int64(41), 'tn': np.int64(52), 'fp': np.int64(1), 'fn': np.int64(6), 'auroc': np.float64(0.9927739863508631), 'auprc': np.float64(0.9921250318277015), 'eval_loss': 0.36417234553104766}
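Precision and recall are not printed directly, but they follow from the confusion counts in the results above. The sketch below recomputes them in plain Python; the helper name `summarize` is illustrative. Averaging the F1 of the two classes reproduces the reported `f1_score` values, which suggests that figure is a macro average rather than the positive-class F1.

```python
def summarize(tp, tn, fp, fn):
    """Derive accuracy, precision, recall, and F1 scores from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_pos = 2 * tp / (2 * tp + fp + fn)  # F1 for class 1
    f1_neg = 2 * tn / (2 * tn + fn + fp)  # F1 for class 0 (the roles of the counts swap)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1_positive": f1_pos,
        "f1_macro": (f1_pos + f1_neg) / 2,
    }

# Counts copied from the evaluation output above
bert = summarize(tp=41, tn=53, fp=0, fn=6)
roberta = summarize(tp=41, tn=52, fp=1, fn=6)

for name, scores in (("BERT", bert), ("RoBERTa", roberta)):
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

Both models miss the same number of positive examples (6 false negatives); they differ only by RoBERTa's single false positive.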


---

### 6. **Saving the Best Model**
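   - Save both fine-tuned models for later use. On the validation set BERT has the slightly higher accuracy and MCC (0.94 and 0.885 versus 0.93 and 0.863), while RoBERTa has the higher AUROC (0.993 versus 0.984).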

#### **Saving the BERT Model**

In [None]:
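# Pass the model explicitly so that the weights, config, and tokenizer are written
bert_model.save_model('bert_best_model', model=bert_model.model)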

#### **Saving the RoBERTa Model**

In [None]:
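# Pass the model explicitly so that the weights, config, and tokenizer are written
roberta_model.save_model('roberta_best_model', model=roberta_model.model)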

---

### 7. **Prediction on Real-World Input**
   - Test the saved model on real-world input data. Preprocess the input text, use the model to predict the class, and output the results.

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Step 1: Define model arguments
model_args = ClassificationArgs()
model_args.overwrite_output_dir = True  # Allow overwriting the output folder
model_args.output_dir = "bert_best_model"  # Output folder for saved model

# Step 2: Initialize and train the model
bert_model = ClassificationModel("bert", "bert-base-uncased", args=model_args, use_cuda=False)
bert_model.train_model(train_df)

# Step 3: Save the trained model (including tokenizer)
bert_model.save_model("bert_best_model", model=bert_model.model)  # Pass the model so weights and config are written
bert_model.tokenizer.save_pretrained("bert_best_model")  # Save tokenizer files

# Step 4: Load the saved model from the specified directory
bert_model = ClassificationModel('bert', 'bert_best_model', use_cuda=False)

# Step 5: Predict on real-world input
real_world_text = [
    "Ever get the feeling your life circles the roundabout rather than heads in a straight line toward the intended destination? [Hillary Clinton remains the big woman on campus in leafy, liberal Wellesley, Massachusetts. Everywhere else votes her most likely to don her inauguration dress for the remainder of her days the way Miss Havisham forever wore that wedding dress.  Speaking of Great Expectations, Hillary Rodham overflowed with them 48 years ago when she first addressed a Wellesley graduating class. The president of the college informed those gathered in 1969 that the students needed “no debate so far as I could ascertain as to who their spokesman was to be” (kind of the like the Democratic primaries in 2016 minus the   terms unknown then even at a Seven Sisters school). “I am very glad that Miss Adams made it clear that what I am speaking for today is all of us —  the 400 of us,” Miss Rodham told her classmates. After appointing herself Edger Bergen to the Charlie McCarthys and Mortimer Snerds in attendance, the    bespectacled in granny glasses (awarding her matronly wisdom —  or at least John Lennon wisdom) took issue with the previous speaker. Despite becoming the first   to win election to a seat in the U. S. Senate since Reconstruction, Edward Brooke came in for criticism for calling for “empathy” for the goals of protestors as he criticized tactics. Though Clinton in her senior thesis on Saul Alinsky lamented “Black Power demagogues” and “elitist arrogance and repressive intolerance” within the New Left, similar words coming out of a Republican necessitated a brief rebuttal. “Trust,” Rodham ironically observed in 1969, “this is one word that when I asked the class at our rehearsal what it was they wanted me to say for them, everyone came up to me and said ‘Talk about trust, talk about the lack of trust both for us and the way we feel about others. Talk about the trust bust.’ What can you say about it? What can you say about a feeling that permeates a generation and that perhaps is not even understood by those who are distrusted?” The “trust bust” certainly busted Clinton’s 2016 plans. She certainly did not even understand that people distrusted her. After Whitewater, Travelgate, the vast   conspiracy, Benghazi, and the missing emails, Clinton found herself the distrusted voice on Friday. There was a load of compromising on the road to the broadening of her political horizons. And distrust from the American people —  Trump edged her 48 percent to 38 percent on the question immediately prior to November’s election —  stood as a major reason for the closing of those horizons. Clinton described her vanquisher and his supporters as embracing a “lie,” a “con,” “alternative facts,” and “a   assault on truth and reason. ” She failed to explain why the American people chose his lies over her truth. “As the history majors among you here today know all too well, when people in power invent their own facts and attack those who question them, it can mark the beginning of the end of a free society,” she offered. “That is not hyperbole. ” Like so many people to emerge from the 1960s, Hillary Clinton embarked upon a long, strange trip. From high school Goldwater Girl and Wellesley College Republican president to Democratic politician, Clinton drank in the times and the place that gave her a degree. More significantly, she went from idealist to cynic, as a comparison of her two Wellesley commencement addresses show. Way back when, she lamented that “for too long our leaders have viewed politics as the art of the possible, and the challenge now is to practice politics as the art of making what appears to be impossible possible. ” Now, as the big woman on campus but the odd woman out of the White House, she wonders how her current station is even possible. “Why aren’t I 50 points ahead?” she asked in September. In May she asks why she isn’t president. The woman famously dubbed a “congenital liar” by Bill Safire concludes that lies did her in —  theirs, mind you, not hers. Getting stood up on Election Day, like finding yourself the jilted bride on your wedding day, inspires dangerous delusions."
]

# Step 6: Make predictions and map the numeric class to a readable label
predictions_bert, _ = bert_model.predict(real_world_text)
label_map = {0: "Fake", 1: "Real"}
predicted_label = label_map[predictions_bert[0]]  # Extract the single prediction
print(f"BERT Predictions: {predicted_label}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT Predictions: Fake


In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Step 1: Define model arguments
model_args = ClassificationArgs()
model_args.overwrite_output_dir = True  # Allow overwriting the output folder
model_args.output_dir = "roberta_best_model"  # Output folder for saved model

# Step 2: Initialize and train the RoBERTa model
roberta_model = ClassificationModel("roberta", "roberta-base", args=model_args, use_cuda=False)
roberta_model.train_model(train_df)  # Make sure 'train_df' contains the training data

# Step 3: Save the trained model (including tokenizer)
roberta_model.save_model("roberta_best_model", model=roberta_model.model)  # Pass the model so weights and config are written
roberta_model.tokenizer.save_pretrained("roberta_best_model")  # Save tokenizer files

# Step 4: Load the saved RoBERTa model from the specified directory
roberta_model = ClassificationModel('roberta', 'roberta_best_model', use_cuda=False)

# Step 5: Predict on real-world input
# `real_world_text` is the same single-article list defined in the BERT cell above, so it is reused here.



# Step 6: Make predictions with both models and print the results
label_map = {0: "Fake", 1: "Real"}
predictions_roberta, _ = roberta_model.predict(real_world_text)
predictions_bert, _ = bert_model.predict(real_world_text)
print(f"RoBERTa Predictions: {label_map[predictions_roberta[0]]}")
print(f"BERT Predictions: {label_map[predictions_bert[0]]}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT Predictions: Fake
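Both models were trained on text passed through `clean_text`, whereas the cells above feed the raw article to `predict`. Below is a minimal sketch of a helper that applies the same cleaning before prediction and maps class ids to labels. The names `predict_labels` and `StubModel` are hypothetical; the stub only mimics the `(predictions, raw_outputs)` return shape of `predict` so the example runs without downloading a model.

```python
import re

def clean_text(title):
    # Same cleaning steps as in the Text Processing section
    title = title.lower()
    title = re.sub(r'[^a-zA-Z\s]', '', title)
    return title.strip()

def predict_labels(model, texts, label_map):
    """Clean each text, run model.predict, and translate class ids to labels."""
    cleaned = [clean_text(text) for text in texts]
    predictions, _ = model.predict(cleaned)
    return [label_map[int(p)] for p in predictions]

class StubModel:
    """Stand-in for a trained ClassificationModel (assumption, for illustration only)."""
    def predict(self, texts):
        # Pretend every text containing 'breaking' belongs to class 0
        preds = [0 if "breaking" in t else 1 for t in texts]
        return preds, None

label_map = {0: "Fake", 1: "Real"}
print(predict_labels(StubModel(), ["BREAKING: 10 shocking facts!", "Senate passes budget"], label_map))
# ['Fake', 'Real']
```

With the notebook's models, the corresponding call would be `predict_labels(bert_model, real_world_text, label_map)`.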
