<a href="https://colab.research.google.com/github/Pavithrakumar19/fake-news-detection/blob/main/fakenewsss.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install numpy pandas requests aiohttp nest_asyncio torch transformers scikit-learn

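The Google Fact Check Tools request made by the detector in this notebook needs an API key. Below is a minimal sketch of keeping that key out of the notebook source, assuming it has been exported as an environment variable. The variable name `GOOGLE_FACT_CHECK_API_KEY` and the helper `load_api_key` are hypothetical names chosen for this example.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 name: str = "GOOGLE_FACT_CHECK_API_KEY") -> str:
    """Return the API key stored under `name`, or raise if it is missing.

    `name` is a hypothetical variable name, not one defined by Google.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the detector.")
    return key
```

Failing early with a clear message is usually easier to debug than an HTTP error from the API later on. In Colab, the Secrets panel is another place to keep the key.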

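The detector in the next cell maps free-text fact-check ratings to `true`, `false`, or `mixed` by keyword matching. Plain substring checks are order-sensitive, because "incorrect" contains "correct" and "inaccurate" contains "accurate". Below is a small sketch of one way to handle this, testing the qualifying and negative keywords before the positive ones. `classify_rating` is a hypothetical helper; the keyword lists are copied from the class.

```python
TRUE_KEYWORDS = ['true', 'correct', 'accurate', 'fact']
FALSE_KEYWORDS = ['false', 'incorrect', 'inaccurate', 'fake']
MIXED_KEYWORDS = ['mixed', 'partial', 'mostly']


def classify_rating(rating: str) -> str:
    """Map one free-text rating to 'mixed', 'false', 'true', or 'unknown'."""
    text = rating.lower()
    # Qualifiers such as "mostly" win over the bare verdict that follows them
    if any(k in text for k in MIXED_KEYWORDS):
        return 'mixed'
    # Negative keywords are tested before positive ones ("incorrect" vs "correct")
    if any(k in text for k in FALSE_KEYWORDS):
        return 'false'
    if any(k in text for k in TRUE_KEYWORDS):
        return 'true'
    return 'unknown'


ratings = ['False', 'Incorrect', 'Mostly true', 'Accurate', 'Pants on Fire']
print([classify_rating(r) for r in ratings])
# ['false', 'false', 'mixed', 'true', 'unknown']
```

Ratings that match no keyword are reported as `unknown` rather than being counted toward any verdict.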

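The final score computed in the next cell is a fixed-weight sum: 0.4 times the Google Fact Check confidence, 0.4 times the model score, and 0.2 times the cross-source confidence, with recommendation thresholds at 0.3 and 0.6. The arithmetic is sketched below as standalone functions. `combine_scores` and `risk_band` are illustrative names, not part of the class. The inputs are the values from the first report in the output, assuming a cross-source confidence of 0.0, which is consistent with the reported 0.22.

```python
def combine_scores(google_confidence: float, ml_score: float, cross_source: float) -> float:
    """Weighted sum with the weights used by comprehensive_verification."""
    return 0.4 * google_confidence + 0.4 * ml_score + 0.2 * cross_source


def risk_band(score: float) -> str:
    """Thresholds taken from _generate_recommendation."""
    if score < 0.3:
        return "high risk"
    if score < 0.6:
        return "mixed"
    return "verified"


# No fact-check matches (confidence 0.0) and a model score of 0.54
score = combine_scores(0.0, 0.54, 0.0)
print(f"{score:.2f}", risk_band(score))  # 0.22 high risk
```

When the fact-check and cross-source terms are both 0, the score is capped at 0.4 times the model score, so a statement with no matches and a model score near 0.5 lands in the high-risk band regardless of its content.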
In [None]:
import asyncio
import os
import re
from typing import Any, Dict, List, Optional

import aiohttp
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class EnhancedMisinformationDetector:
    def __init__(self, google_api_key: Optional[str] = None):
        # Never hard-code API keys in a notebook. The key is taken from the
        # argument or from an environment variable; the variable name
        # GOOGLE_FACT_CHECK_API_KEY is a convention chosen here, not one
        # defined by Google.
        self.google_api_key = google_api_key or os.environ.get("GOOGLE_FACT_CHECK_API_KEY", "")
        self.google_fact_check_url = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

        # Fact-Checking Resources Configuration
        self.fact_check_resources = {
            'apis': [
                {
                    'name': 'Google Fact Check',
                    'url': self.google_fact_check_url,
                    'type': 'primary_fact_check'
                },
                {
                    'name': 'PolitiFact',
                    'url': 'https://www.politifact.com/api/statements/search/',
                    'type': 'secondary_fact_check'
                },
                {
                    'name': 'Snopes',
                    'url': 'https://www.snopes.com/api/search',
                    'type': 'secondary_fact_check'
                }
            ]
        }

        # Advanced NLP and Machine Learning Setup
        self.tokenizer = AutoTokenizer.from_pretrained('roberta-base')
        self.model = AutoModelForSequenceClassification.from_pretrained('roberta-base')

        # Local Cache
        self.verified_claims_cache = pd.DataFrame(columns=[
            'statement', 'verified_status', 'sources', 'confidence_score'
        ])

    async def fetch_google_fact_check(self, statement: str) -> Optional[Dict[str, Any]]:
        """
        Fetch results from the Google Fact Check Tools API

        Args:
            statement (str): Statement to verify

        Returns:
            Dictionary containing fact check results, or None if the
            request fails or returns a non-200 status
        """
        params = {
            'key': self.google_api_key,
            'query': statement,
            'languageCode': 'en'
        }

        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(self.google_fact_check_url, params=params) as response:
                    if response.status == 200:
                        data = await response.json()
                        claims = data.get('claims', [])

                        if claims:
                            # textualRating and publisher live on each entry of a
                            # claim's claimReview list, not on the claim itself
                            reviews = [review for claim in claims
                                       for review in claim.get('claimReview', [])]
                            ratings = [review.get('textualRating', '').lower() for review in reviews]
                            sources = [review.get('publisher', {}).get('name', '') for review in reviews]

                            # Confidence grows with the number of matching claims only;
                            # it does not measure whether the ratings agree
                            confidence = min(len(claims) / 5.0, 1.0)  # Max confidence at 5+ claims

                            return {
                                'source': 'Google Fact Check',
                                'type': 'primary_fact_check',
                                'status': self._aggregate_google_ratings(ratings),
                                'confidence': confidence,
                                'raw_ratings': ratings,
                                'sources': sources
                            }

                        return {
                            'source': 'Google Fact Check',
                            'type': 'primary_fact_check',
                            'status': 'unknown',
                            'confidence': 0.0,
                            'raw_ratings': [],
                            'sources': []
                        }

        except Exception as e:
            print(f"Error with Google Fact Check API: {e}")

        # Reached on a non-200 response or after an exception
        return None

    def _aggregate_google_ratings(self, ratings: List[str]) -> str:
        """
        Aggregate multiple Google Fact Check ratings

        Args:
            ratings (List[str]): List of ratings from different sources

        Returns:
            Aggregated rating status
        """
        true_keywords = ['true', 'correct', 'accurate', 'fact']
        false_keywords = ['false', 'incorrect', 'inaccurate', 'fake']
        mixed_keywords = ['mixed', 'partial', 'mostly']

        counts = {'true': 0, 'false': 0, 'mixed': 0}
        for rating in ratings:
            # Each rating is counted once. Mixed qualifiers ("mostly true") are
            # checked first, then false before true, because "incorrect" and
            # "inaccurate" contain "correct" and "accurate".
            if any(k in rating for k in mixed_keywords):
                counts['mixed'] += 1
            elif any(k in rating for k in false_keywords):
                counts['false'] += 1
            elif any(k in rating for k in true_keywords):
                counts['true'] += 1

        # No rating matched any keyword (this also covers an empty list)
        if not any(counts.values()):
            return 'unknown'

        # Predominant rating; ties resolve in dict order: true, false, mixed
        return max(counts, key=counts.get)

    async def fetch_fact_check_results(self, statement: str) -> List[Dict[str, Any]]:
        """
        Fetch results from all fact-checking sources
        """
        # First get Google Fact Check results
        google_results = await self.fetch_google_fact_check(statement)
        results = [google_results] if google_results else []

        # Query the secondary sources. This is a placeholder: the response body
        # is not parsed, and any HTTP 200 is recorded as 'mixed' with 0.5 confidence.
        async def check_other_source(session, source):
            try:
                async with session.get(source['url'], params={'query': statement}) as response:
                    if response.status == 200:
                        return {
                            'source': source['name'],
                            'type': source['type'],
                            'status': 'mixed',  # Placeholder, not a parsed verdict
                            'confidence': 0.5
                        }
            except Exception as e:
                print(f"Error checking {source['name']}: {e}")
            # Non-200 response or request error
            return None

        async with aiohttp.ClientSession() as session:
            other_sources = [s for s in self.fact_check_resources['apis']
                           if s['name'] != 'Google Fact Check']
            tasks = [check_other_source(session, source) for source in other_sources]
            other_results = await asyncio.gather(*tasks)
            results.extend([r for r in other_results if r is not None])

        return results

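    def machine_learning_verification(self, statement: str) -> float:
        """
        Score the statement with the sequence-classification model.

        Note: 'roberta-base' ships without a trained classification head
        (see the warning in the cell output), so until the model is
        fine-tuned on a labelled dataset this score is effectively random.
        """
        inputs = self.tokenizer(
            statement,
            return_tensors='pt',
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Probability of class index 1, returned as a plain Python float
        probabilities = torch.softmax(outputs.logits, dim=1)
        credibility_score = probabilities[0, 1].item()

        return credibility_score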

    def cross_reference_sources(self, results: List[Dict]) -> float:
        """
        Weighted average of the confidence reported by each source
        """
        # Primary (Google Fact Check) results count more than secondary ones
        source_weights = {
            'primary_fact_check': 0.7,    # Google Fact Check
            'secondary_fact_check': 0.3    # Other fact checkers
        }

        weighted_sum = 0.0
        total_weight = 0.0
        for result in results:
            if result and 'confidence' in result:
                weight = source_weights.get(result.get('type', 'secondary_fact_check'), 0.1)
                weighted_sum += result['confidence'] * weight
                total_weight += weight

        # Divide by the total weight so that full confidence from every source yields 1.0
        return weighted_sum / total_weight if total_weight else 0.0

    def preprocess_text(self, text: str) -> str:
        """
        Text preprocessing
        """
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    async def comprehensive_verification(self, statement: str) -> Dict[str, Any]:
        """
        Comprehensive verification with updated weights
        """
        cleaned_statement = self.preprocess_text(statement)

        # Get fact checking results
        fact_check_results = await self.fetch_fact_check_results(cleaned_statement)

        # Get ML verification
        ml_credibility = self.machine_learning_verification(cleaned_statement)

        # Calculate cross-source confidence
        cross_source_confidence = self.cross_reference_sources(fact_check_results)

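        # Weights: 40% Google Fact Check, 40% ML model, 20% cross-source score.
        # The Google result is looked up by name: it is absent from the list
        # when the API call failed, so index 0 may be a different source.
        google_confidence = next(
            (r['confidence'] for r in fact_check_results
             if r.get('source') == 'Google Fact Check'),
            0.0
        )
        final_credibility = (
            0.4 * google_confidence +
            0.4 * ml_credibility +
            0.2 * cross_source_confidence
        )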

        return {
            'original_statement': statement,
            'cleaned_statement': cleaned_statement,
            'fact_check_results': fact_check_results,
            'ml_credibility': ml_credibility,
            'cross_source_confidence': cross_source_confidence,
            'final_credibility': final_credibility,
            'recommendation': self._generate_recommendation(final_credibility)
        }

    def _generate_recommendation(self, credibility_score: float) -> str:
        """
        Generate recommendation based on credibility score
        """
        if credibility_score < 0.3:
            return "🔴 HIGH MISINFORMATION RISK: Critically Evaluate - Likely False"
        elif credibility_score < 0.6:
            return "🟡 MIXED CREDIBILITY: Proceed with Caution - Verify Further"
        else:
            return "🟢 VERIFIED: High Reliability - Trustworthy Information"

async def main():
    # Initialize the detector
    detector = EnhancedMisinformationDetector()

    while True:
        statement = input("\nEnter a statement to verify (or 'quit'): ")

        if statement.lower() == 'quit':
            break

        result = await detector.comprehensive_verification(statement)

        print("\n--- Comprehensive Verification Report ---")
        print(f"Original Statement: {result['original_statement']}")

        # Find the Google entry by name; it is not guaranteed to be first
        google_result = next(
            (r for r in result['fact_check_results']
             if r.get('source') == 'Google Fact Check'),
            None
        )
        if google_result:
            print("\nGoogle Fact Check Results:")
            print(f"Status: {google_result['status']}")
            print(f"Confidence: {google_result['confidence']:.2f}")
            if 'raw_ratings' in google_result:
                print("Raw Ratings:", google_result['raw_ratings'])
            if 'sources' in google_result:
                print("Sources:", google_result['sources'])

        print(f"\nML Model Credibility: {result['ml_credibility']:.2f}")
        print(f"Final Credibility Score: {result['final_credibility']:.2f}")
        print(f"Recommendation: {result['recommendation']}")

if __name__ == "__main__":
    import nest_asyncio
    nest_asyncio.apply()
    asyncio.run(main())

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- Comprehensive Verification Report ---
Original Statement: Japan's top diplomat in China to address 'challenges'  Read more at: http://timesofindia.indiatimes.com/articleshow/116647220.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst

Google Fact Check Results:
Status: unknown
Confidence: 0.00
Raw Ratings: []
Sources: []

ML Model Credibility: 0.54
Final Credibility Score: 0.22
Recommendation: 🔴 HIGH MISINFORMATION RISK: Critically Evaluate - Likely False

--- Comprehensive Verification Report ---
Original Statement: BEIJING: Japanese foreign minister Takeshi Iwaya

Google Fact Check Results:
Status: unknown
Confidence: 0.00
Raw Ratings: []
Sources: []

ML Model Credibility: 0.53
Final Credibility Score: 0.21
Recommendation: 🔴 HIGH MISINFORMATION RISK: Critically Evaluate - Likely False

--- Comprehensive Verification Report ---
Original Statement: China and Japan are key trading partners, but increased fric ..  Read more at: http://timesofindia.indiatimes.com/ar