## Package Installation and Imports
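We install transformers and datasets, then import the modules used throughout the notebook: gdown and zipfile for fetching and unpacking the data, pandas and NumPy for data handling, transformers, datasets, and PyTorch for tokenization and model training, scikit-learn for evaluation metrics, and re for parsing generated text.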

In [1]:
!pip install -U transformers datasets
import gdown
import zipfile
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
from sklearn.metrics import accuracy_score, f1_score, classification_report
import re

Collecting transformers
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading transformers-4.52.4-py3-none-any.whl (10.5 MB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, transformers, datasets
  Attempting uninstall: fsspec
    Found existing installation: 

## Download Dataset from Google Drive
We download the compressed dataset from Google Drive using gdown by providing the file ID and constructing a direct download URL.

In [2]:
file_id = "1UE1EscK33qLu0bRfTpiq5iQLETsJbi50"
output_path = "data.zip"

url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, output_path, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1UE1EscK33qLu0bRfTpiq5iQLETsJbi50
To: /content/data.zip
100%|██████████| 4.71M/4.71M [00:00<00:00, 93.6MB/s]


'data.zip'

## Extract Dataset Zip File
We extract the contents of the downloaded ZIP file into a local folder named "data" so we can access train and test CSV files.

In [3]:
with zipfile.ZipFile(output_path, 'r') as zip_ref:
    zip_ref.extractall("data")
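
Before relying on hard-coded paths inside the extracted folder, it can help to check what the archive actually contains. The helper below is a minimal sketch using only the standard library; the name list_csv_members is our own and not part of the original notebook.

```python
import zipfile


def list_csv_members(zip_path):
    """Return the sorted paths of all CSV files stored in the archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        # namelist() gives every member path, including nested folders.
        return sorted(name for name in zf.namelist() if name.lower().endswith(".csv"))
```

Calling list_csv_members("data.zip") on the downloaded file would show the folder layout, which tells us which paths to pass to pd.read_csv.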

## Load Train and Test CSV Files
We load the training and test datasets into Pandas DataFrames for further processing and exploration.

In [4]:
train = pd.read_csv("data/nlpData/train.csv")
test = pd.read_csv("data/nlpData/test.csv")

## 🔍 Preview Training and Test Data
We display the first few rows of the training and test datasets to understand their structure and the types of values they contain.

In [5]:
train.head()

Unnamed: 0,text,label
0,"U.S. Strikes Zarqawi Network, Killing 15 BAGHD...",0
1,MGM shares jump 7.5 pct; report suggests deal ...,2
2,Logitech launches laser-tracking mouse SAN FRA...,3
3,Orb Unveils New Service for Digital Media (AP)...,3
4,"Norwegian police hunt for motive, robbers in M...",0


In [6]:
test.head()

Unnamed: 0,text,label
0,Fan v Fan: Manchester City-Tottenham Hotspur T...,1
1,Paris Tourists Search for Key to 'Da Vinci Cod...,0
2,Net firms: Don't tax VoIP The Spanish-American...,3
3,Dependent species risk extinction The global e...,3
4,EDS Is Charter Member of Siebel BPO Alliance (...,3


## Check Column Names and Data Types
We print the column names and their data types to confirm the structure of the dataset and ensure the expected fields are present for processing.

In [7]:
print(train.columns)
print(train.dtypes)

Index(['text', 'label'], dtype='object')
text     object
label     int64
dtype: object


## Check for Missing Values
We check for any missing (null) values in the training dataset to determine if data cleaning is needed before model training.

In [8]:
print(train.isnull().sum())

text     0
label    0
dtype: int64


## Check Class Distribution
We examine the distribution of class labels in the training dataset to check whether the data is balanced across all categories.

In [9]:
print(train['label'].value_counts())

label
0    7500
2    7500
3    7500
1    7500
Name: count, dtype: int64


## Analyze Text Length
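We count the words in each article and summarize the distribution to understand the typical input length, which informs the max_length setting for BERT tokenization. Token counts tend to run higher than these word counts, because WordPiece splits rare words into several subword tokens and the tokenizer adds [CLS] and [SEP].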

In [10]:
train['text_length'] = train['text'].apply(lambda x: len(str(x).split()))
print(train['text_length'].describe())

count    30000.000000
mean        49.099733
std         10.475755
min         15.000000
25%         43.000000
50%         47.000000
75%         51.000000
max        177.000000
Name: text_length, dtype: float64


## Initialize BERT Tokenizer
We load the BERT base uncased tokenizer to convert raw text into token IDs compatible with the BERT model.

In [11]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

## Define Tokenization Function
We define a function to tokenize each text sample, ensuring uniform input length through padding and truncation. This prepares the text for BERT input.

In [12]:
def tokenize_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
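
To make the effect of padding="max_length" and truncation=True concrete, here is a toy re-implementation for a single sequence in plain Python. It is only a sketch of the behaviour, not the tokenizer's actual code: the function name pad_or_truncate is hypothetical, the content IDs in the examples are arbitrary, and the special-token IDs are those of the bert-base-uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased


def pad_or_truncate(content_ids, max_length=128):
    """Mimic padding="max_length" with truncation=True for one sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut the content if needed.
    kept = list(content_ids[: max_length - 2])
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Right-pad to the fixed length; padded positions get mask 0.
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


# A short sequence is padded, a long one is truncated (arbitrary content IDs).
short_example = pad_or_truncate([2001, 2002], max_length=6)
long_example = pad_or_truncate(list(range(1000, 1010)), max_length=6)
```

Every output has exactly max_length entries, which is why all rows of the tokenized dataset can later be stacked into fixed-size tensors.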

## Convert to Hugging Face Dataset and Tokenize
We convert the training DataFrame into a Hugging Face Dataset and apply the tokenization function in batches to prepare it for model training.

In [13]:
hf_dataset = Dataset.from_pandas(train)

tokenized_dataset = hf_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

In [14]:
tokenized_dataset

Dataset({
    features: ['text', 'label', 'text_length', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 30000
})

text : The original raw news article text.

label : The numeric class label for the news article. Each number represents a category (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech).

text_length : The number of words in the article. This is a custom column we added during EDA for statistical analysis.

input_ids : Token IDs representing the input text after BERT tokenization. Each token (word or subword) is mapped to a specific ID.

token_type_ids : Segment IDs used in tasks with sentence pairs (e.g., sentence A vs. sentence B). All values will be 0 in single-sentence tasks.

attention_mask : Binary mask indicating which tokens should be attended to (1 = keep, 0 = ignore). Helps the model ignore padding tokens.

## Rename Label Column for Trainer Compatibility
We rename the "label" column to "labels", the argument name under which the model's forward pass (and therefore Hugging Face's Trainer) expects the targets.

In [15]:
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

## Remove Unnecessary Columns
We remove columns that are not needed for model training to reduce memory usage and avoid potential conflicts during training.

In [16]:
tokenized_dataset = tokenized_dataset.remove_columns(["text", "text_length"])

## Set Dataset Format to PyTorch
We format the tokenized dataset to return PyTorch tensors, the input type the model consumes during training with the Trainer API.

In [17]:
tokenized_dataset.set_format("torch")

## Split Dataset into Train and Validation Sets
We split the tokenized dataset into training and validation subsets to evaluate model performance during training.

In [18]:
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = tokenized_dataset['train']
val_dataset = tokenized_dataset['test']

## Load Pretrained BERT Model for Classification
We load the BERT base uncased model and adapt it for 4-class sequence classification by specifying the number of output labels.

In [19]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Define Evaluation Metrics Function
We define a custom function to compute accuracy and macro-averaged F1 score, which are used to evaluate model performance during training and validation.

In [20]:
def compute_metricsMyFunction(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='macro')
    return {"accuracy": acc, "f1": f1}
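
For intuition about what average='macro' means, the sketch below computes the same quantity by hand in plain Python: an F1 score per class, then their unweighted mean. The function name macro_f1 is our own; in the notebook itself the scikit-learn call remains the one in use.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally regardless of its size, macro F1 stays close to accuracy on a balanced dataset such as this one and diverges from it when classes are imbalanced.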

## Set Training Arguments
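We configure the training parameters, including batch size, learning rate, number of epochs, and evaluation strategy. The Trainer evaluates and saves a checkpoint after every epoch, and with load_best_model_at_end=True it reloads the checkpoint with the highest validation F1 score once training finishes.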

In [21]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="f1",
)

## Initialize Trainer
We create a Trainer object that handles model training, evaluation, and metric calculation using the specified arguments and datasets.

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metricsMyFunction,
)

## Enable Weights & Biases Tracking (WandB)
We install and initialize WandB for experiment tracking and visualization of training metrics. The API key is used to authenticate the session.

In [23]:
!pip install wandb
import wandb
# With no key argument, wandb reads WANDB_API_KEY from the environment or prompts for the key,
# so the secret stays out of the notebook.
wandb.login()



[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mardacikgoz9[0m ([33mardacikgoz9-hacettepe-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Train the BERT Model
We start the training process using the Hugging Face Trainer, which fine-tunes the BERT model on our classification task and tracks performance metrics across epochs.

In [24]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.2037,0.163332,0.947333,0.947354
2,0.1229,0.195868,0.951333,0.95139
3,0.0659,0.211416,0.951,0.951159


TrainOutput(global_step=5064, training_loss=0.1450530842864683, metrics={'train_runtime': 465.7306, 'train_samples_per_second': 173.92, 'train_steps_per_second': 10.873, 'total_flos': 5328094546944000.0, 'train_loss': 0.1450530842864683, 'epoch': 3.0})
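
The global_step value can be reproduced from the settings used so far, which also tells us which checkpoint folder belongs to which epoch. The arithmetic below is a sketch that assumes a single device and no gradient accumulation; the F1 values are copied from the table above, and the checkpoint-<step> folder naming under the output directory is the Trainer's usual convention.

```python
import math

n_rows = 30000                     # rows in train.csv
n_train = n_rows - n_rows // 10    # 90% kept for training after the 10% validation split
batch_size, n_epochs = 16, 3       # per_device_train_batch_size, num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
checkpoint_steps = [steps_per_epoch * e for e in range(1, n_epochs + 1)]

# Validation F1 per epoch, taken from the training log.
val_f1_by_epoch = {1: 0.947354, 2: 0.95139, 3: 0.951159}
best_epoch = max(val_f1_by_epoch, key=val_f1_by_epoch.get)
best_checkpoint = f"results/checkpoint-{steps_per_epoch * best_epoch}"

print(steps_per_epoch, total_steps, checkpoint_steps, best_checkpoint)
```

Under these assumptions each epoch takes 1688 optimizer steps, the run ends at step 5064, and the epoch with the highest validation F1 (epoch 2) corresponds to step 3376.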

## Load Trained Model from Checkpoint
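We load the fine-tuned BERT model from the checkpoint saved at step 5064 (the end of epoch 3) and reinitialize the tokenizer for inference on new samples. Note that this is the final checkpoint, not the best one by validation F1: epoch 2 scored marginally higher (0.95139 versus 0.951159), and since load_best_model_at_end=True was set, trainer.model should already hold those best weights after training.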

In [25]:
model_path = "results/checkpoint-5064"

ourModel = BertForSequenceClassification.from_pretrained(model_path)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

## Reinitialize Trainer with Trained Model
We re-create the Trainer instance using the previously saved model checkpoint to perform evaluation or prediction without retraining.

In [26]:
trainedModelTrainer = Trainer(
    model=ourModel,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metricsMyFunction,
)

## Load Test Dataset
We load the test dataset to evaluate the performance of the trained BERT model on unseen data.

In [27]:
test = pd.read_csv("data/nlpData/test.csv")

## Prepare Test Dataset for Evaluation
We convert the test DataFrame into a Hugging Face Dataset, apply the same tokenization process, and format it as PyTorch tensors for evaluation.

In [28]:
test_dataset = Dataset.from_pandas(test)

tokenized_test = test_dataset.map(tokenize_function, batched=True)
tokenized_test = tokenized_test.remove_columns(["text"])
tokenized_test = tokenized_test.rename_column("label", "labels")
tokenized_test.set_format("torch")

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]

## Evaluate Model on Test Set
We evaluate the trained BERT model on the test dataset and print performance metrics such as accuracy and F1 score to assess generalization.

In [29]:
metrics = trainedModelTrainer.evaluate(tokenized_test)
print(metrics)

{'eval_loss': 0.34017911553382874, 'eval_model_preparation_time': 0.0038, 'eval_accuracy': 0.9282894736842106, 'eval_f1': 0.9283473529014608, 'eval_runtime': 13.201, 'eval_samples_per_second': 575.713, 'eval_steps_per_second': 35.982}


## Display Sample Predictions
We generate predictions on the test dataset and compare the model's output with the true labels for the first ten samples to visually inspect performance.

In [32]:
predictions = trainedModelTrainer.predict(tokenized_test)
preds = predictions.predictions.argmax(-1)

for i in range(10):
    print(f"True label: {test['label'][i]}, Predicted: {preds[i]}")

True label: 1, Predicted: 1
True label: 0, Predicted: 3
True label: 3, Predicted: 2
True label: 3, Predicted: 3
True label: 3, Predicted: 3
True label: 2, Predicted: 2
True label: 0, Predicted: 0
True label: 2, Predicted: 2
True label: 3, Predicted: 3
True label: 2, Predicted: 0


## Classification Report
We compute and display precision, recall, and F1-score for each class to gain a detailed understanding of the model’s performance across all categories.

In [33]:
y_true = test['label']
y_pred = preds

print(classification_report(y_true, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.9671    0.9137    0.9396      1900
           1     0.9670    0.9879    0.9773      1900
           2     0.8892    0.8953    0.8922      1900
           3     0.8924    0.9163    0.9042      1900

    accuracy                         0.9283      7600
   macro avg     0.9289    0.9283    0.9283      7600
weighted avg     0.9289    0.9283    0.9283      7600
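
The per-class scores show that classes 2 and 3 are the weakest, but not which classes they are confused with. A confusion matrix answers that. The sketch below builds one in plain Python (the helper name count_confusions is our own) and, for illustration only, applies it to the ten sample predictions printed in the previous section rather than to the full test set.

```python
def count_confusions(y_true, y_pred, n_classes=4):
    """Return a matrix where entry [t][p] counts samples of true class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# The ten (true, predicted) pairs shown in the sample predictions.
sample_true = [1, 0, 3, 3, 3, 2, 0, 2, 3, 2]
sample_pred = [1, 3, 2, 3, 3, 2, 0, 2, 3, 0]

cm = count_confusions(sample_true, sample_pred)
for row in cm:
    print(row)
```

Passing the full y_true and y_pred instead of the ten samples would give the complete picture; off-diagonal cells with large counts mark the category pairs the model mixes up most often.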



# STAGE 2

## Install and Configure Gemini API
We install the Google Generative AI SDK and configure it with our API key to enable access to Gemini models for text generation tasks.

In [34]:
!pip install -U google-generativeai
import os
import google.generativeai as genai
# Read the key from the GOOGLE_API_KEY environment variable instead of hard-coding it.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])



## Generate Title and Polished Article with Gemini
We use the Gemini 1.5 Pro model to generate a professional title and a rewritten version of a news article using a structured prompt designed for consistent output formatting.

In [35]:
gemini_model = genai.GenerativeModel("gemini-1.5-pro")
raw_article = test['text'][0]

prompt = f"""
You are an AI editor for a news website.

Given the following raw news article:

---
{raw_article}
---

Please extract:
- A professional news title.
- A polished version of the article in a clear journalistic tone.

Return your response in exactly this format:

Title: <title>
Article: <rewritten version>
"""

response = gemini_model.generate_content(prompt)
print(response.text)

Title: Manchester City and Tottenham Hotspur Set for Potential Thriller This Weekend

Article: Manchester City and Tottenham Hotspur face off this weekend in a fixture that promises excitement.  Last season's thrilling seven-goal FA Cup encounter between the two sides is still fresh in the minds of fans, suggesting another entertaining match could be on the cards.
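
Since the output format is what the later parsing step depends on, it may be worth keeping the prompt in one place and reusing it wherever the model is called. The sketch below wraps the prompt above in a small builder; PROMPT_TEMPLATE and build_prompt are hypothetical names, not part of the original notebook.

```python
PROMPT_TEMPLATE = """You are an AI editor for a news website.

Given the following raw news article:

---
{article}
---

Please extract:
- A professional news title.
- A polished version of the article in a clear journalistic tone.

Return your response in exactly this format:

Title: <title>
Article: <rewritten version>
"""


def build_prompt(raw_article):
    """Fill the shared template with one article, trimming stray whitespace."""
    return PROMPT_TEMPLATE.format(article=str(raw_article).strip())
```

With this in place, gemini_model.generate_content(build_prompt(raw_article)) would send the same instructions every time, so any downstream parser only has to handle one response format.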



# STAGE 3

## Extract Title and Article from Gemini Output
We define a function that uses regular expressions to parse and extract the title and polished article text from the Gemini model’s response.

In [36]:
def extract_title_and_article(gemini_output):
    title_match = re.search(r"Title:\s*(.*)", gemini_output)
    article_match = re.search(r"Article:\s*(.*)", gemini_output, re.DOTALL)

    title = title_match.group(1).strip() if title_match else None
    article = article_match.group(1).strip() if article_match else None

    return {"title": title, "article": article}
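
Language models sometimes wrap the field labels in Markdown bold (for example **Title:**), in which case the patterns above leave stray asterisks at the start of the captured text. The variant below is a sketch of a more tolerant parser under that assumption; the function name parse_gemini_output is our own, and the regular expressions only cover the plain and bold label styles.

```python
import re

# Optional asterisks are allowed around the label and around the title text.
_TITLE_RE = re.compile(
    r"^\s*\**\s*Title\s*:\s*\**\s*(.+?)\s*\**\s*$",
    re.MULTILINE | re.IGNORECASE,
)
# The article runs from its label to the end of the response.
_ARTICLE_RE = re.compile(
    r"^\s*\**\s*Article\s*:\s*\**\s*(.+)",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,
)


def parse_gemini_output(text):
    """Extract title and article, returning None for any field that is missing."""
    title_match = _TITLE_RE.search(text)
    article_match = _ARTICLE_RE.search(text)
    return {
        "title": title_match.group(1).strip() if title_match else None,
        "article": article_match.group(1).strip() if article_match else None,
    }
```

Returning None for missing fields keeps the same contract as extract_title_and_article, so the function could be swapped in without changing the calling code.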

## Display Extracted Gemini Output
We extract and print the structured title and article content from the Gemini response using our custom parsing function.

In [37]:
gemini_output = response.text
result = extract_title_and_article(gemini_output)
print(result)

{'title': 'Manchester City and Tottenham Hotspur Set for Potential Thriller This Weekend', 'article': "Manchester City and Tottenham Hotspur face off this weekend in a fixture that promises excitement.  Last season's thrilling seven-goal FA Cup encounter between the two sides is still fresh in the minds of fans, suggesting another entertaining match could be on the cards."}


## Full Pipeline: Classification + Generation
We define a function that performs the complete pipeline: predicting the article's category using the trained BERT model and generating a title and polished version using the Gemini model.

In [38]:
def process_article(text, model, tokenizer, gemini_model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        pred_label = torch.argmax(outputs.logits, dim=1).item()

    label_map = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
    category = label_map[pred_label]

    prompt = f"""
    Given the raw news article below:

    {text}

    Please return:
    Title: <a professional news headline>
    Article: <a polished version of the article>
    """

    gemini_response = gemini_model.generate_content(prompt)
    parsed = extract_title_and_article(gemini_response.text)

    return {
        "category": category,
        "title": parsed['title'],
        "article": parsed['article']
    }
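
The pipeline returns only the top category, yet the sample predictions show that the second test article (true label 0, World) is predicted as Sci/Tech. Reporting a confidence next to the category makes such borderline cases easier to spot. The sketch below converts raw logits into a label and a softmax probability in plain Python; label_with_confidence is a hypothetical helper and the logits in the example are made up.

```python
import math

LABEL_MAP = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def label_with_confidence(logits):
    """Return the predicted category and its softmax probability."""
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


# Made-up logits for illustration; real ones would come from outputs.logits.
category, confidence = label_with_confidence([0.2, 4.0, -1.0, 0.5])
print(category, round(confidence, 3))
```

Inside process_article, the same values could be obtained from outputs.logits[0].tolist(), and a low confidence could be used to flag an article for manual review.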

## Apply Pipeline to Sample Articles
We apply the full classification and generation pipeline to the first five test articles and store the results in a structured list of dictionaries.

In [39]:
results = [process_article(test['text'][i], ourModel, tokenizer, gemini_model) for i in range(5)]

## Display Final Results
We neatly print the predicted category, generated title, and polished article for each of the five processed test samples to present the final output of our pipeline.

In [40]:
for i, item in enumerate(results):
    print(f"\n--- ARTICLE {i+1} ---")
    print(f"Category: {item['category']}")
    print(f"Title: {item['title']}")
    print(f"Article:\n{item['article']}")


--- ARTICLE 1 ---
Category: Sports
Title: ** Man City vs. Spurs: Another Goal Fest on the Cards?
Article:
** This weekend's clash between Manchester City and Tottenham Hotspur promises excitement, with memories of last season's seven-goal FA Cup thriller still vivid.  Fans can only hope for another entertaining encounter reminiscent of that goal-filled spectacle.

--- ARTICLE 2 ---
Category: Sci/Tech
Title: ** Louvre Tourists Seek "Da Vinci Code" Clues
Article:
**  Paris - The blockbuster novel "The Da Vinci Code" has sparked a real-world treasure hunt at the Louvre Museum.  Tourists, inspired by Dan Brown's fictional thriller, are peppering tour guides with questions about locations and symbols mentioned in the book, adding a new dimension to the traditional art appreciation experience at the home of the Mona Lisa.

--- ARTICLE 3 ---
Category: Business
Title: ** Internet Phone Providers Oppose Extending Spanish-American War Tax to VoIP
Article:
**  Internet telephony companies are ur