<a href="https://colab.research.google.com/github/BrianSward/colab-work/blob/main/fake_news_machine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# imports and such for the learning part

import numpy as np 
import pandas as pd

from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.svm import LinearSVC

In [None]:
# installs for the web scraping

!pip install charset-normalizer
!pip install spacy
!pip install trafilatura

In [3]:
# imports for the web scraping

from bs4 import BeautifulSoup
import json
import numpy as np
import requests
from requests.models import MissingSchema
import spacy
import trafilatura

In [4]:
data = pd.read_csv("https://raw.githubusercontent.com/lutzhamel/fake-news/master/data/fake_or_real_news.csv")

In [5]:
data

| Unnamed: 0 | id | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |
| ... | ... | ... | ... | ... |
| 6330 | 4490 | State Department says it can't find emails fro... | The State Department told the Republican Natio... | REAL |
| 6331 | 8062 | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | FAKE |
| 6332 | 8622 | Anti-Trump Protesters Are Tools of the Oligarc... | Anti-Trump Protesters Are Tools of the Oligar... | FAKE |
| 6333 | 4021 | In Ethiopia, Obama seeks progress on peace, se... | ADDIS ABABA, Ethiopia —President Obama convene... | REAL |


Begin Data Training

In [6]:
# Encode the label as a number: 0 = REAL, 1 = FAKE
data['fake'] = data['label'].apply(lambda x: 0 if x == "REAL" else 1)
data = data.drop("label", axis=1)

In [7]:
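# Features are the article bodies, targets are the 0/1 'fake' flags
X, y = data["text"], data["fake"]

# Hold out 20% of the articles to measure accuracy later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)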

In [8]:
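# Turn each article into a TF-IDF vector: drop English stop words and any term
# that appears in more than 70% of the training articles (max_df=0.7)
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
# Learn the vocabulary from the training split only, then reuse it for the test split
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)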
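To get a feel for what the vectorizer is doing, here is a minimal pure-Python sketch of TF-IDF weighting. It is only an illustration, not scikit-learn's implementation: the `tfidf` helper and the four toy headlines are made up, tokenizing is a plain whitespace split, and stop-word removal is left out. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization, and treats `max_df=0.7` as "drop terms found in more than 70% of the documents".

```python
import math
from collections import Counter

def tfidf(docs, max_df=0.7):
    """Toy TF-IDF: raw term counts times smoothed IDF, L2-normalized per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep only terms that appear in at most max_df (a proportion) of the documents
    vocab = sorted(t for t, c in df.items() if c / n_docs <= max_df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: counts[t] * idf[t] for t in vocab if counts[t]}
        # Scale each document vector to unit length (fall back to 1.0 if it is empty)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows

# Made-up headlines, purely for illustration
docs = [
    "senate passes budget bill",
    "shocking secret the senate hides",
    "senate vote on budget delayed",
    "shocking budget secret",
]
vocab, rows = tfidf(docs)
print(vocab)
print(rows[3])
```

In this toy corpus "senate" and "budget" each show up in 3 of the 4 headlines, so the 70% cutoff removes them and the rarer words carry all the weight.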

In [9]:
# Fit a linear support vector classifier on the TF-IDF features
clf = LinearSVC()
clf.fit(X_train_vectorized, y_train)
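Roughly speaking, a linear classifier like this one scores an article as a weighted sum of its features plus a bias, and the sign of that score picks the class. The sketch below shows the idea in plain Python. The weights, bias, and feature values are made up for illustration and are not taken from the trained `clf`, and `decision_score` is a hypothetical helper.

```python
def decision_score(weights, bias, features):
    """Weighted sum of the features plus a bias; unknown terms contribute nothing."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias

# Made-up weights: positive pushes toward FAKE (1), negative toward REAL (0)
weights = {"shocking": 1.2, "secret": 0.8, "senate": -0.9, "budget": -0.6}
bias = -0.1

# Made-up TF-IDF values for two articles
tabloid_like = {"shocking": 0.7, "secret": 0.7}
newswire_like = {"senate": 0.8, "budget": 0.6}

for name, features in [("tabloid_like", tabloid_like), ("newswire_like", newswire_like)]:
    score = decision_score(weights, bias, features)
    label = 1 if score > 0 else 0
    print(name, round(score, 2), "FAKE" if label == 1 else "REAL")
```

With the real model, `clf.decision_function(...)` returns this kind of signed score, which can be handy when a plain 0 or 1 feels too blunt.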

Let's test the accuracy of our model on the held-out 20% of the data


In [10]:
_score = round(clf.score(X_test_vectorized, y_test), 2)
print(f'{_score*100}%')

93.0%
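A single accuracy number does not say which kind of mistake the model makes: flagging real news as fake, or letting fake news through. The sketch below breaks accuracy down into precision and recall for the FAKE class. The labels are made up for illustration (they are not this model's predictions), and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 (FAKE) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up labels: 1 = FAKE, 0 = REAL
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)  # of the articles flagged FAKE, how many really were
recall = tp / (tp + fn)     # of the truly FAKE articles, how many were caught
print(accuracy, precision, recall)
```

On the real data, the same breakdown can be had by passing `y_test` and `clf.predict(X_test_vectorized)` to `sklearn.metrics.classification_report`.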


The cell below accepts a URL, extracts the article text from that page, and classifies it with the trained model

In [14]:
# Ask for the article URL; the page itself is downloaded further down
user_url = input("Feed me a url!!! ").strip()

def beautifulsoup_extract_text_fallback(response_content):
    """
    Extract text from HTML using BeautifulSoup as a backup.
    """
    soup = BeautifulSoup(response_content, 'html.parser')
    # 'string=True' replaces the deprecated 'text=True' argument
    text = soup.find_all(string=True)
    blacklist = ['[document]', 'noscript', 'header', 'html', 'meta', 'head', 'input', 'script', 'style']
    cleaned_text = ' '.join(item for item in text if item.parent.name not in blacklist).replace('\t', '')
    return cleaned_text.strip()

def extract_text_from_single_web_page(url):
    """
    Extract text from a web page using Trafilatura or BeautifulSoup as a backup.
    """
    downloaded_url = trafilatura.fetch_url(url)
    try:
        extracted_data = trafilatura.extract(downloaded_url, output_format='json', with_metadata=True, include_comments=False,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        extracted_data = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    if extracted_data:
        json_output = json.loads(extracted_data)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # Reached when both Trafilatura and the BeautifulSoup fallback fail:
                return np.nan
        # Handle URLs that are missing a scheme such as https://
        except requests.exceptions.MissingSchema:
            return np.nan
          
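single_url = user_url
_text = extract_text_from_single_web_page(url=single_url)

# The extractor returns np.nan when no text could be pulled from the page
if not isinstance(_text, str) or not _text.strip():
  print("could not extract any text from that url")
else:
  vectorized_text = vectorizer.transform([_text])
  # 0 = REAL, 1 = FAKE (see the 'fake' column above)
  if clf.predict(vectorized_text)[0] == 0:
    print(f"with a model that is {_score*100}% accurate, this is seen as real news")
  else:
    print(f"with a model that is {_score*100}% accurate, this is seen as fake news")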

Feed me a url!!! https://www.mlive.com/life/2023/04/willie-nelson-to-celebrate-his-90th-birthday-on-the-road-in-michigan-in-2023.html
with a model that is 93.0% accurate, this is seen as fake news
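The scraping code above returns `np.nan` when a URL is missing its scheme. One alternative is to tidy up the input before any request is made. This is a minimal sketch using only the standard library. `normalize_url` is a hypothetical helper that is not part of the original notebook, and defaulting to `https` is an assumption.

```python
from urllib.parse import urlparse

def normalize_url(raw, default_scheme="https"):
    """Return a URL with an http(s) scheme, or None if the input cannot be used."""
    candidate = raw.strip()
    if not candidate:
        return None
    # Assumption: input without a scheme is meant to be a web address
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return candidate

print(normalize_url("www.mlive.com/life"))
print(normalize_url("ftp://example.com/file"))
```

Calling it on the `input(...)` result and stopping early on `None` would keep malformed input away from `requests` and `trafilatura` altogether.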


#### References

Standing on the shoulders of these fine giants:
- [This tutorial was used and adapted for the web scraping portion](https://understandingdata.com/posts/how-to-extract-the-text-from-multiple-webpages-in-python/)

- [This is the source of my data](https://github.com/lutzhamel/fake-news/blob/master/data/fake_or_real_news.csv)

- [This code was adapted for my use in the ML portion](https://github.com/NeuralNine/youtube-tutorials/blob/main/Fake%20News%20Detection/Fake%20News%20Detection.ipynb)