### Installing Required Libraries

This block installs the essential Python libraries for the project:

1. **Transformers**: For leveraging pre-trained transformer models.
2. **Accelerate**: For optimizing and accelerating model training.
3. **PEFT**: For parameter-efficient fine-tuning.
4. **TRL**: For training language models with reinforcement learning.
5. **BitsAndBytes**: For memory-efficient model quantization.
6. **Wandb**: For experiment tracking and logging.
7. **Requests**: For making HTTP requests.
8. **BeautifulSoup4**: For web scraping.
9. **GoogleSearch-Python**: For programmatic Google searches.
10. **NLTK**: For natural language processing tasks.

In [1]:
%pip install -U transformers 
%pip install -U accelerate 
%pip install -U peft 
%pip install -U trl 
%pip install -U bitsandbytes 
%pip install -U wandb
%pip install requests
%pip install beautifulsoup4
%pip install googlesearch-python
%pip install nltk


### Setting Up the Environment and Loading the Model

#### 1. **Importing Libraries**
Key libraries like Transformers, HuggingFace Hub, and Torch are imported for model loading and configuration.

#### 2. **Authentication**
The HuggingFace token (`HUGGINGFACE_TOKEN`) is retrieved from environment variables for downloading models. Ensure the token is set.

#### 3. **Model Setup**
- **Base Model**: `NousResearch/Llama-2-7b-chat-hf`, a chat-optimized LLaMA-2 variant.
- **Fine-Tuned Model**: Fine tuned model stored at `llama-fine-tuned1/pytorch/default/1`.

### 4. **Torch Configuration**
Configures `torch_dtype` and attention implementation based on GPU capabilities, with support for hardware acceleration via `flash-attn`.

### 5. **QLoRA Setup**
The `BitsAndBytesConfig` enables efficient 4-bit quantization with optimizations like double quantization for reduced memory usage.

### 6. **Model & Tokenizer Loading**
- The base model is loaded with QLoRA and device mapping.
- The tokenizer is configured with EOS and padding tokens for chat-based tasks.

In [2]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

In [3]:
import os
from huggingface_hub import login

# Get the token from environment variables
hf_token = os.getenv("HUGGINGFACE_TOKEN")

if hf_token:
    login(token=hf_token)
else:
    print("HuggingFace token not found. Please set the HUGGINGFACE_TOKEN environment variable.")

In [4]:
# Model from Hugging Face hub
base_model = "NousResearch/Llama-2-7b-chat-hf"

#fine-tuned model
fine_tuned_model = "llama-fine-tuned1/pytorch/default/1"

In [5]:
import torch
# Set torch dtype and attention implementation
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install -qqq flash-attn
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"

In [None]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
    
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

### Defining Core Functions

#### **Imports and Setup**
Key libraries (`Transformers`, `PEFT`, `NLTK`, `BeautifulSoup`, `pandas`) enable:
- Model initialization
- Text analysis
- Web scraping

#### **Model Initialization**

##### `initialise_base_model(base_model_dir)`
- Loads a 4-bit quantized LLaMA model using `BitsAndBytesConfig`.
- Configures for CUDA or CPU, loads tokenizer.
- **Returns:** tokenizer, model, and device.

##### `initialise_fine_tuned_model(base_model, adapter_dir)`
- Loads a fine-tuned model via PEFT adapter.
- Sets model to evaluation mode.
- **Returns:** fine-tuned model.


#### **Phase 1: News Analysis**

##### `analyze_news(headline, tokenizer, model, device)`
- **Input:** News headline.
- **Process:** Generates:
  - Confidence score (0–100)
  - Detailed explanation about truthfulness.
- **Output:** Model's response.

##### `extract_confidence_score(response)`
- **Input:** Generated response.
- **Process:** Extracts confidence score using regex.
- **Validation:** Ensures score is between 0 and 100.
- **Output:** Extracted score or `None`.


#### **Phase 2: Web Scraping and Summarization**

##### `scrape_important_content(url)`
- **Input:** URL.
- **Process:** 
  - Fetches webpage using `BeautifulSoup`.
  - Extracts headings (`h1`, `h2`, `h3`) and up to 8 key paragraphs.
- **Output:** Scraped content or error message.

##### `process_query(query, filename)`
- **Input:** Search query and output filename.
- **Process:** 
  - Extracts keywords from the query (removes stopwords).
  - Performs Google search and retrieves URLs.
  - Scrapes content from URLs and saves it to a CSV file.
- **Output:** Scraped data.

##### `generate_summary_with_llama(file_path, tokenizer, model, device)`
- **Input:** File path to scraped data CSV.
- **Process:** 
  - Reads and combines content from the CSV.
  - Generates a concise summary (up to 100 words) using the LLaMA model.
- **Output:** Generated summary.


In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from transformers import BitsAndBytesConfig
import re
import requests
from bs4 import BeautifulSoup
import csv
from googlesearch import search
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
# NLTK setup
nltk.download('punkt')
nltk.download('stopwords')

def initialise_base_model(base_model_dir):
    bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            llm_int8_enable_fp32_cpu_offload=True,
            bnb_4bit_use_double_quant=True,
    )

    # Load model
    base_model = AutoModelForCausalLM.from_pretrained(
            base_model_dir,
            quantization_config=bnb_config,
            device_map="auto"
        )
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_dir)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    base_model = base_model.to(device)
    base_model.eval()

    return tokenizer, base_model, device

def initialise_fine_tuned_model(base_model, adapter_dir):
    model = PeftModel.from_pretrained(base_model, adapter_dir)
        
    # Move Model to Appropriate Device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()
    return model


# PHASE 1 FUNCTIONS
def analyze_news(headline, tokenizer, model, device):
    # Input Prompt
    input_text = (
        "You are a news analyzer. Given the headline, provide a confidence score (0-100) indicating how likely the news is true, "
        "and give a detailed explanation for your assessment. "
        f"Headline: '{headline}'\n"
    )

    # Tokenize Input
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
    )
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Generate Response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,   # Limits generated tokens only
            num_beams=5,          # Enhance quality with beam search
            temperature=0.7,      # Balance randomness
            top_k=40,             # Limit to top-k tokens
            top_p=0.9,            # Nucleus sampling
            repetition_penalty=1.2  # Reduce repetitive outputs
        )
    
    # Decode and Post-process
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.strip()

def extract_confidence_score(response):
    match = re.search(r"\bconfidence(?:\s+score(?:\s(?:of|is)?)?)?\s*[:=]?\s*(\d{1,3})", response, re.IGNORECASE)
    if match:
        try:
            score = int(match.group(1))
            if 0 <= score <= 100:  # Ensure the score is within the valid range
                return score
        except ValueError:
            return None
    return None  # Return None if no valid score is found



# PHASE 2 FUNCTIONS
def scrape_important_content(url):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            return "Failed to fetch content"
        
        # Parse the webpage content
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Extract the main headings and paragraphs
        headings = soup.find_all(['h1', 'h2', 'h3'])  # Extract headings
        paragraphs = soup.find_all('p')  # Extract paragraphs
        
        # Combine content
        content = ""
        for h in headings:
            content += h.get_text(strip=True) + " | "
        for p in paragraphs[:8]:  # Limit paragraphs to avoid too much text
            content += p.get_text(strip=True) + " "
        
        return content.strip() if content else "No significant content found."
    except Exception as e:
        return f"Error: {e}"
        
def process_query(query, filename):
    # Step 1: Extract keywords from the query
    words = word_tokenize(query)
    stop_words = set(stopwords.words('english'))
    keywords = [word for word in words if word.isalpha() and word.lower() not in stop_words]

    # Step 2: Perform Google search using extracted keywords
    search_query = " ".join(keywords)
    search_results = [url for url in search(search_query, num_results=10)]

    # Step 3: Scrape content from search results
    scraped_data = []
    for url in search_results:
        content = scrape_important_content(url)
        scraped_data.append([url, content])

    # Step 4: Save scraped data to a CSV file
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["URL", "Important Content"])
        writer.writerows(scraped_data)
    return scraped_data

def generate_summary_with_llama(file_path,tokenizer,model,device):

    df = pd.read_csv(file_path)
    col = df['Important Content'].tolist()
    corpus = [i for i in col if i not in ("No significant content found.","Failed to fetch content")]
    
    combined_corpus = "\n".join(corpus)[:4000]  # Limit the input to avoid exceeding model input size

    input_text = (
        f"You are a news summarization expert. analyse the data scrapped from web which is: [{combined_corpus}] and provide an overall summary in maximum 100 words.\n"
    )

    # Tokenize input
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
    )
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,   # Limit generated tokens
            num_beams=5,          # Enhance quality with beam search
            temperature=0.7,      # Balance randomness
            top_k=40,             # Limit to top-k tokens
            top_p=0.9,            # Nucleus sampling
            repetition_penalty=1.2  # Reduce repetitive outputs
        )

    # Decode and post-process
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.strip()
    

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Combined Pipeline
- Integrates **Phase 1 (News Analysis)** and **Phase 2 (Web Scraping and Summarization)** into a unified process.
- Output: Detailed response, confidence score, and summary.

In [13]:
def combinedPipeline(txt):
    adapter_dir = "/kaggle/input/llama-fine-tuned1/pytorch/default/1"
    base_model_dir = "NousResearch/Llama-2-7b-chat-hf" 
    # Initialize Model
    tokenizer, base_model, device = initialise_base_model(base_model_dir)
    fine_tuned_model = initialise_fine_tuned_model(base_model, adapter_dir)
    
    # PHASE 1
    headline = txt
    fine_tune_response = analyze_news(headline, tokenizer, fine_tuned_model, device)
    print(fine_tune_response)
    
    # Extract Confidence Score
    confidence_score_phase1 = extract_confidence_score(fine_tune_response)
    print("\nExtracted Confidence Score:")
    print(confidence_score_phase1)
    
    #phase 2
    filename = "web_content_summary.csv"
    print("scrapping web")
    scraped_data = process_query(txt, filename)   

    print("\nReading scraped content from CSV and Generating summary using Llama model...")
    filepath = filename
    news_summary = generate_summary_with_llama(filepath,tokenizer,base_model,device)
    start_index = news_summary.find("provide an overall summary in maximum 100 words."
    )
    if start_index != -1:
        news_summary = news_summary[start_index:]
    print(news_summary)
        
    return fine_tune_response, confidence_score_phase1, news_summary    

### Testing the model pipeline

In [16]:
txt='Post Office Recruitment 2020: Big vacancy of over 1371 posts for 10th pass'
fine_tune_response, confidence_score_phase1, news_summary = combinedPipeline(txt)
print("@@@@@@@@pipeline working@@@@@@@@@@")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


You are a news analyzer. Given the headline, provide a confidence score (0-100) indicating how likely the news is true, and give a detailed explanation for your assessment. Headline: 'Post Office Recruitment 2020: Big vacancy of over 1371 posts for 10th pass'
Confidence Score: 80
Explanation: The Post Office Recruitment 2020: Big vacancy of over 1371 posts for 10th pass is a false news article. The article claims that there are 1371 vacancies for 10th pass candidates in the post office recruitment 2020. However, there is no such information available on the official website of the India Post. In fact, the official website of the India Post does not mention anything about 1371 vacancies for 10th pass candidates in the post office recruitment 2020. Therefore, this news article is false and should be disregarded.
The news article is created to mislead people who are looking for job opportunities in the post office. It is a common tactic used by fake news creators to spread false informati

### Confidence Score Generation

- **Purpose**: Analyzes the truthfulness of a news headline based on a fine-tuned LLaMA model's response and web-scraped content, returning a confidence score.
- **Input**: Takes the fine-tuned model's response, web-scraped news summary, and headline as input.
- **Output**: Returns the confidence score and overall result on whether the news is true or false.
- **Example Usage** on headline = `Post Office Recruitment 2020: Big vacancy of over 1371 posts for 10th pass`


In [10]:
# confidence score
def generatescore(tokenizer,model,device,fine_tune_response,news_summary,headline):
# fine_tune_response, confidence_score_phase1, news_summary
    input_text = (
        f"You are a news analyser. under the result from a fine tuned LLM which is [{fine_tune_response}] and the data scrapped from web which is: [{news_summary}] and provide an overall resultt that whether the news is true and false and a confidence score to it for the headline [{headline}].\n"
    )

    # Tokenize input
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
    )
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,   # Limit generated tokens
            num_beams=5,          # Enhance quality with beam search
            temperature=0.7,      # Balance randomness
            top_k=40,             # Limit to top-k tokens
            top_p=0.9,            # Nucleus sampling
            repetition_penalty=1.2  # Reduce repetitive outputs
        )

    # Decode and post-process
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.strip()

# adapter_dir = "/kaggle/input/llama-fine-tuned1/pytorch/default/1"
base_model_dir = "NousResearch/Llama-2-7b-chat-hf" 
# Initialize Model
tokenizer, base_model, device = initialise_base_model(base_model_dir)

headline = 'Post Office Recruitment 2020: Big vacancy of over 1371 posts for 10th pass'
fine_tune_response, confidence_score_phase1, news_summary = combinedPipeline(txt)

generatescore(tokenizer,base_model,device,fine_tune_response,news_summary,headline)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


'You are a news analyser. under the result from a fine tuned LLM which is [The Post Office Recruitment 2020: Big vacancy of over 1371 posts for 10th pass is a false news article. The article claims that there are 1371 vacancies for 10th pass candidates in the post office recruitment 2020. However, there is no such information available on the official website of the India Post. In fact, the official website of the India Post does not mention anything about 1371 vacancies for 10th pass candidates in the post office recruitment 2020. Therefore, this news article is false and should be disregarded.The news article is created to mislead people who are looking for job opportunities in the post office. It is a common tactic used by fake news creators to spread false information and mislead people. In this case, the news article is false and has been created to mislead people who are looking for job opportunities in the post office.The confidence score for this news article is 80 as there is 