# FAKE NEWS DETECTION ALGORITHM

Import required libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import nltk
from nltk.corpus import stopwords

In [2]:
#importing sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Read the training dataset

In [3]:
news = pd.read_csv('news.csv')

In [4]:
news.head()                          #check the head of dataset

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


The only columns we need are "text" and "label".
Create a DataFrame with just these two columns.

In [5]:
data = pd.DataFrame(data = news,columns=['label','text'])        

In [6]:
data.head()

Unnamed: 0,label,text
0,FAKE,"Daniel Greenfield, a Shillman Journalism Fello..."
1,FAKE,Google Pinterest Digg Linkedin Reddit Stumbleu...
2,REAL,U.S. Secretary of State John F. Kerry said Mon...
3,FAKE,"— Kaydee King (@KaydeeKing) November 9, 2016 T..."
4,REAL,It's primary day in New York and front-runners...


In [7]:
data.shape

(6335, 2)

In [8]:
import string

# TEXT PRE-PROCESSING

This section pre-processes the raw text.

The input data cannot be used as it is. Some pre-processing makes the model easier to train and usually improves its accuracy.

The next code cell defines a function called 'text_process()'. The input articles are very long strings, so the function strips all punctuation marks and removes the 'stopwords', i.e. English words that are used very commonly, e.g. 'is', 'are', 'the', 'and'. Such words appear in almost every article and carry little signal, so dropping them leaves the words that actually help the model tell articles apart.

In [9]:
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Keep only the characters that are not ASCII punctuation, then rejoin them
    nopunc = ''.join(char for char in mess if char not in string.punctuation)

    # Build the stopword set once per call; calling stopwords.words('english')
    # inside the comprehension would reload the list for every word
    stop_words = set(stopwords.words('english'))

    # Drop stopwords (case-insensitive comparison)
    return [word for word in nopunc.split() if word.lower() not in stop_words]

In [10]:
data['text'].head(5).apply(text_process)     #This code is just to check if the function works well.

0    [Daniel, Greenfield, Shillman, Journalism, Fel...
1    [Google, Pinterest, Digg, Linkedin, Reddit, St...
2    [US, Secretary, State, John, F, Kerry, said, M...
3    [—, Kaydee, King, KaydeeKing, November, 9, 201...
4    [primary, day, New, York, frontrunners, Hillar...
Name: text, dtype: object
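To make the two cleaning steps easier to follow, here is a minimal standalone sketch of the same idea. It assumes a tiny hand-written stopword set (`TOY_STOPWORDS`, a hypothetical name) instead of NLTK's full English list, so it runs without downloading any corpus; the notebook itself uses `stopwords.words('english')`.

```python
import string

# Hypothetical miniature stopword list, for illustration only.
# The notebook uses NLTK's full English list instead.
TOY_STOPWORDS = {"is", "are", "the", "a", "of", "in", "to", "and"}

def toy_text_process(text):
    """Strip ASCII punctuation, then drop stopwords (case-insensitive)."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.split() if w.lower() not in TOY_STOPWORDS]

tokens = toy_text_process(
    "The U.S. Secretary of State is in Paris, and the talks are on."
)
print(tokens)  # ['US', 'Secretary', 'State', 'Paris', 'talks', 'on']
```

Note that `string.punctuation` only covers ASCII characters, which is why the leading dash token survives in row 3 of the output above.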

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Train - Test - Split

The following code cell splits the data into two parts: 80% is used to train the model and the remaining 20% is used to test it.

In [12]:
msg_train, msg_test, label_train, label_test = \
train_test_split(data['text'], data['label'], test_size=0.2)
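As a rough illustration of what an 80/20 shuffle split does, here is a plain-Python sketch. It is not scikit-learn's implementation; `simple_split` and its `seed` argument are hypothetical names introduced only for this example.

```python
import random

def simple_split(items, labels, test_size=0.2, seed=42):
    """Shuffle the row indices, then cut them into a test part and a train part."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)          # seeded, so the split is repeatable
    n_test = round(len(idx) * test_size)      # number of rows held out for testing
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(seq, ids):
        return [seq[i] for i in ids]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

texts = [f"article {i}" for i in range(10)]
labels = ["FAKE" if i % 2 else "REAL" for i in range(10)]
x_train, x_test, y_train, y_test = simple_split(texts, labels)
print(len(x_train), len(x_test))  # 8 2
```

The call in the notebook does not fix a seed, so each run produces a different split and slightly different scores. `train_test_split` accepts a `random_state` argument for repeatable splits and a `stratify` argument to keep the FAKE/REAL ratio the same in both parts.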

The pipeline below has three steps. The first two are feature-extraction steps and the third is the model that is trained on the data, in this case a Passive Aggressive Classifier. The CountVectorizer step applies the text_process function to each article and turns the resulting tokens into integer word counts.

The TfidfTransformer then rescales those counts into TF-IDF weights, which reduce the influence of words that appear in many articles.
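The weighting can be worked through by hand on a toy corpus. The sketch below assumes the settings shown in the fitted pipeline's output (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), for which scikit-learn documents the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1. The three token lists are made up for illustration.

```python
import math
from collections import Counter

# Made-up token lists standing in for the output of text_process
docs = [
    ["kerry", "paris", "visit"],
    ["paris", "attack", "paris"],
    ["election", "visit"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw count times smoothed idf, followed by L2 normalisation."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(docs[1])
print(vec)
```

In the second document "paris" occurs twice, so it still outweighs "attack" even though "attack" is rarer across the corpus and has the higher idf.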

In [13]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', PassiveAggressiveClassifier(max_iter=200)),  # train a Passive Aggressive classifier on the TF-IDF vectors
])
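For readers unfamiliar with the classifier in the last step, here is a minimal sketch of a single passive-aggressive update for binary labels, following the standard PA-I rule with hinge loss. It is an illustration under simplifying assumptions (no intercept, labels encoded as +1/-1, dense Python lists), not scikit-learn's actual implementation; `pa_update` is a hypothetical helper name.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step. w and x are lists of floats, y is +1 or -1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss on this example
    if loss == 0.0:
        return w                             # passive: margin already large enough
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return w                             # an all-zero vector carries no information
    tau = min(C, loss / sq_norm)             # aggressive: step size capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)   # misclassified -> weights move
w2 = pa_update(w1, [1.0, 0.0], +1)   # now correct with margin 1 -> no change
print(w1, w2)  # [1.0, 0.0] [1.0, 0.0]
```

The name comes from this behaviour: the weights are left alone when an example is classified with enough margin, and corrected just enough to fix the mistake otherwise, with `C` limiting how large a single correction can be.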

In the following cell the pipeline is actually trained (fitted) on the training data

In [14]:
pipeline.fit(msg_train,label_train)

Pipeline(memory=None,
         steps=[('bow',
                 CountVectorizer(analyzer=<function text_process at 0x00000203AC82E798>,
                                 binary=False, decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w...
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('classifier',
                 PassiveAggressiveClassifier(C=1.0, average=False,
                                             class_weight=None,
                                             early_stopping=False,
       

Now that the model has been trained on the training data (80% of the data), it's time to test it on the remaining 20%.

In [15]:
predict = pipeline.predict(msg_test)          #Model tested on test data

# Analyzing the model predictions

In [16]:
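# Note: scikit-learn expects (y_true, y_pred). With the predictions passed first,
# the precision and recall columns below are swapped and support counts predicted labels.
print(classification_report(predict,label_test))    #printing the classification report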

              precision    recall  f1-score   support

        FAKE       0.96      0.95      0.95       655
        REAL       0.94      0.95      0.95       612

    accuracy                           0.95      1267
   macro avg       0.95      0.95      0.95      1267
weighted avg       0.95      0.95      0.95      1267



In [17]:
# With the predictions passed first, rows are predicted labels and columns are true labels.
print(confusion_matrix(predict,label_test))         #printing confusion matrix

[[621  34]
 [ 29 583]]


In [18]:
print(accuracy_score(predict,label_test)*100,' % accuracy')   #printing accuracy-score

95.02762430939227  % accuracy
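The numbers in the report can be recomputed from the confusion matrix by hand. Because the cells above pass the predictions as the first argument, the rows of the printed matrix are predicted labels and the columns are true labels. The sketch below uses that reading; the variable names are chosen here for illustration.

```python
# Rows: predicted FAKE, predicted REAL. Columns: truly FAKE, truly REAL.
cm = [[621, 34],
      [29, 583]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Of the articles predicted FAKE, how many are truly FAKE?
precision_fake = cm[0][0] / (cm[0][0] + cm[0][1])
# Of the articles that are truly FAKE, how many were predicted FAKE?
recall_fake = cm[0][0] / (cm[0][0] + cm[1][0])

print(total)                     # 1267
print(round(accuracy * 100, 2))  # 95.03
print(round(precision_fake, 2))  # 0.95
print(round(recall_fake, 2))     # 0.96
```

Under the usual convention the FAKE class therefore has precision 0.95 and recall 0.96, the reverse of the two columns printed in the report. Accuracy is unaffected by the argument order.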


# The model reaches an accuracy of about 95% on the held-out test data

# Now let's try the model on a different dataset and check whether it still works

Read the test data

In [20]:
test = pd.read_csv('test.csv')

In [21]:
test.head()

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


Our input will be the 'text' column. We don't need to run the model on the whole dataset; instead we pick a single article and look at the prediction.

In [22]:
test_data = pd.Series(data = test['text'][3])

In [23]:
test_data.head()

0    If at first you don’t succeed, try a different...
dtype: object

In [24]:
test_data.shape

(1,)

In [25]:
type(test_data)

pandas.core.series.Series

Let's run the model on test_data

In [26]:
pred = pipeline.predict(test_data)

In [27]:
print(pred)

['FAKE']


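The model labels this article as 'FAKE'. Note that test.csv has no label column, so this prediction cannot be checked against a true label here.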

# Let's go further and test this model on the latest news articles 

The following are recent news articles copied from sources such as The New York Times. Let's test the model on them.

The model was trained on full article text, so we paste the whole article rather than an excerpt.

In [28]:
new_test = pd.Series("""Ben Lupo sat in his basement in Omaha one recent afternoon, trying to kill a brigade of heavily armed Russians before they killed him.

“I’m getting shot at already, dog,” he said into a headset, as the sound of machine guns echoed in the air. “So, this is not cool.”

Moments later, the Russians had cornered and finished him off — also not cool. It was a grisly end to an ill-fated campaign in Call of Duty: Modern Warfare, a first-person shooter video game set in the fictional country of Urzikstan.

Mr. Lupo did not stew over his demise. He didn’t have time. About 13,000 people were watching him live on Twitch, the streaming platform where hordes of fans can pay to follow the best online gamers in the business. Few attract bigger crowds than Mr. Lupo, and since the coronavirus began forcing people to shelter in place, his crowds have only grown. He estimates that his viewership is up 25 to 30 percent.

ADVERTISEMENT. “I feel,” he said in an interview, “like I’ve been preparing for this moment my whole life.”

It’s hard to think of a job title more pandemic-proof than “superstar live streamer.” While the coronavirus has upended the working lives of hundreds of millions of people, Dr. Lupo, as he’s known to acolytes, has a basically unaltered routine. He has the same seven-second commute down a flight of stairs. He sits in the same seat, before the same configuration of lights, cameras and monitors. He keeps the same marathon hours, starting every morning at 8.

Social distancing? He’s been doing that since he went pro, three years ago.

For 11 hours a day, six days a week, he sits alone, hunting and being hunted on games like Call of Duty and Fortnite. With offline spectator sports canceled, he and other well-known gamers currently offer one of the only live contests that meet the standards of the Centers for Disease Control and Prevention. Viewership numbers on Twitch leapt 31 percent from March 8 to March 22, according to Arsenal.gg, a data analytics firm. (By then, one in four Americans was under shelter-in-place orders.) During that two-week span, the numbers of hours a day watched on Twitch rose to 43 million from 33 million.

“Live streaming and online video games are the only sports we can watch, right?” said Doron Nir, the chief executive of StreamElements, a company that provides tools and services to streamers. “This is a huge moment of validation.” Mr. Lupo and his peers were having the best financial year of their lives even before Covid-19 struck. Three of the biggest tech companies in the world — Microsoft, Facebook and Google — have been trying to raise the profile of their online gaming platforms: Mixer, Facebook Gaming and YouTube Gaming, respectively. Their goal is to catch up with Amazon, which owns Twitch and roughly 70 percent of online gaming viewership.

 All four of these giants have embraced the same strategy that keeps LeBron James in Nike sneakers: sign superstars to huge, exclusive contracts.

“You’ve got the biggest tech companies in the world competing for the top talent to stream exclusively on their platform,” said Rod Breslau, who helped start the e-sports section of ESPN’s site. “That gives the talent agency that works for a guy like Lupo a huge amount of leverage to negotiate.”

In December, Mr. Breslau said, Twitch signed Mr. Lupo and two other streaming stars to multiyear deals worth millions. It was a counterattack of sorts. Over the summer, Tyler Blevins, who plays under the name Ninja and is widely considered one of the best Fortnite players in the world, left Twitch for Mixer in a multiyear deal reportedly worth as much as $30 million.

These are sums that may startle the uninitiated. But Mr. Lupo and Mr. Blevins are celebrities in a gaming industry that generates more than $150 billion a year in revenue, according to Newzoo, a gaming analytics company — more than double the global film and music industries combined.

Marquee professional athletes from the worlds of basketball, baseball and football are about to jump into this fray. Josh Swartz, an executive at Popdog, which owns a talent agency for gamers, said he is preparing deals on behalf of stars of Major League Baseball, the National Basketball Association and the National Football League, many of whom are part-time gamers.

“My phone is ringing off the hook with sports agents saying, ‘My guy plays Call of Duty,’ or ‘My guy plays Fortnite.’” he said. “These athletes are just stuck at home. In a lot of cases, they are going to end up with their own streaming channel,” watched by thousands of fans eager to interact with their heroes. Mr. Lupo says he rose to the top of the crowded, highly competitive live-gaming pile through luck. Five years ago, he was an information technology specialist at an insurance company and started live streaming part time on a game called Destiny. At first, eight people watched, but the audience grew quickly. Mr. Lupo has top-notch skills, the warm, authoritative voice of a drive-time radio D.J. and a gift for wry wit, even when mortally wounded.

“Why would I need to practice?” he asked viewers, after losing that Call of Duty game. “I’m a god. I’m insane. Look at my body, dude.”

What truly launched Mr. Lupo was a perfectly tossed virtual grenade. He lobbed it at Mr. Blevins while the two faced off in a first-person shooter called PUBG. A video of the encounter shows a stupefied look on Mr. Blevins’s face, displayed in a corner of the screen, which gradually segues into laughter and delight, as the death of his avatar sinks in.

“We hit it off immediately,” said Mr. Lupo. “We were like brothers, and people liked watching that friendship grow.”

Around this time, Fortnite made its debut and became a cultural phenomenon. Mr. Lupo and Mr. Blevins started teaming up to play against others. (Each game starts with 100 players). Mr. Blevins later asked Mr. Lupo to serve as a play-by-play commentator during a Fortnite event at the Luxor Resort and Casino in Vegas.

“About 300,000 people watched live,” Mr. Lupo said. “And a couple million more watched later.”

Mr. Lupo spends each day with an overhead camera pointed at his hands, another camera pointed at the side of his face and a display of what he sees on the screen. Most of the time, he controls an avatar who is both running for his life and in the midst of a frantic killing spree. He and his online teammates — he usually has a few, whom he talks to through a headset — scramble at breakneck speed, defusing bombs, sniping at enemies and hurtling over landscapes in hijacked trucks. It seems the opposite of relaxing.

 Mr. Lupo’s fan base is riveted. It skews older than the average Twitch gaming channel. He is married — his wife, Samantha, is his manager — and has a 4-year-old son. His biggest supporters tend to hail from a similar demographic.

“He became a dad a couple months after I did,” said Nick Kallner, 34, who lives near Albany and has been watching Mr. Lupo since his Destiny days. “I have the sense watching him that he’s a dad like me, a real-world guy. Plus, he’s funny.”

And while Mr. Lupo is fluent in the language of bro-speak, his devotees include plenty of women.

“What cemented it for me is how he built a respectful community,” said Lindsey Hladik, who lives in Orlando, Fla. “As a woman, you get a lot of harassment, people casually throwing off offensive terms, and he’s always good about shutting down that kind of behavior.”

Ms. Hladik, 34, is a manager at an e-commerce site and has been working from home for the last three weeks. Mr. Lupo’s channel plays in the background, all day, every day.

“It’s like having the TV on,” she said.

The difference is that the star of this show performs for hours on end. Keeping energy levels up is just one of Mr. Lupo’s challenges. On a few occasions over the years, pranksters have sicced the police on him, calling the authorities and claiming that some horrible crime was unfolding at his house — a toxic gag known as “swatting.” That has made it uncomfortable when fans knock on his door to introduce themselves. In the moment, Mr. Lupo can’t help but imagine worst-case scenarios.

Even when there’s just peace and quiet, he spends most of his waking hours in a windowless room. It would be a grim existence, he said, if he didn’t love video games and performing before an audience that keeps growing.

“People are finding ways to distract themselves a bit from what’s going on in the outside world,” he said. “If I’m helping, that’s fantastic.”

""")

Time to predict whether this is 'REAL' or 'FAKE' news

In [29]:
new_pred = pipeline.predict(new_test)
print(new_pred)

['REAL']


The model labels it as 'REAL' news

In [30]:
article = pd.Series(["""WASHINGTON — Coronavirus patients in areas that had high levels of air pollution before the pandemic are more likely to die from the infection than patients in cleaner parts of the country, according to a new nationwide study that offers the first clear link between long-term exposure to pollution and Covid-19 death rates.

In an analysis of 3,080 counties in the United States, researchers at the Harvard University T.H. Chan School of Public Health found that higher levels of the tiny, dangerous particles in air known as PM 2.5 were associated with higher death rates from the disease.

For weeks, public health officials have surmised a link between dirty air and death or serious illness from Covid-19, which is caused by the coronavirus. The Harvard analysis is the first nationwide study to show a statistical link, revealing a “large overlap” between Covid-19 deaths and other diseases associated with long-term exposure to fine particulate matter.

“The results of this paper suggest that long-term exposure to air pollution increases vulnerability to experiencing the most severe Covid-19 outcomes,” the authors wrote. The paper found that if Manhattan had lowered its average particulate matter level by just a single unit, or one microgram per cubic meter, over the past 20 years, the borough would most likely have seen 248 fewer Covid-19 deaths by this point in the outbreak.

Over all, the research could have significant implications for how public health officials choose to allocate resources like ventilators and respirators as the coronavirus spreads. The paper has been fast-tracked for peer review and publication in the New England Journal of Medicine. It found that just a slight increase in long-term pollution exposure could have serious coronavirus-related consequences, even accounting for other factors like smoking rates and population density. For example, it found that a person living for decades in a county with high levels of fine particulate matter is 15 percent more likely to die from the coronavirus than someone in a region with one unit less of the fine particulate pollution. The District of Columbia, for instance, is likely to have a higher death rate than the adjacent Montgomery County, Md. Cook County, Ill., which includes Chicago, should be worse than nearby Lake County, Ill. Fulton County, Ga., which includes Atlanta, is likely to suffer more deaths than the adjacent Douglas County. “This study provides evidence that counties that have more polluted air will experience higher risks of death for Covid-19,” said Francesca Dominici, a professor of biostatistics at Harvard who led the study.

Counties with higher pollution levels, Dr. Dominici said, “will be the ones that will have higher numbers of hospitalizations, higher numbers of deaths and where many of the resources should be concentrated.”

The study is part of a small but growing body of research, mostly still out of Europe, that offers a view into how a lifetime of breathing dirtier air can make people more susceptible to the coronavirus, which has already killed more than 10,000 people in the United States and 74,000 worldwide.

In the short term, Dr. Dominici and other public health experts said the study’s finding meant that places like the Central Valley of California, or Cuyahoga County, Ohio, may need to prepare for more severe cases of Covid-19.

The analysis did not look at individual patient data and did not answer why some parts of the country have been hit harder than others. It also remains unclear whether particulate matter pollution plays any role in the spread of the coronavirus or whether long-term exposure directly leads to a greater risk of falling ill.

Dr. John R. Balmes, a spokesman for the American Lung Association and a professor of medicine at University of California, San Francisco, said the findings were particularly important for hospitals in poor neighborhoods and communities of color, which tend to be exposed to higher levels of air pollution than affluent, white communities. “We need to make sure that hospitals taking care of folks who are more vulnerable and with even greater air pollution exposure have the resources they need,” Dr. Balmes said. As more is learned about the recurrence of Covid-19, the study also could have far-reaching implications for clean-air regulations, which the Trump administration has worked to roll back over the past three years on the grounds that they have been onerous to industry.

“The study results underscore the importance of continuing to enforce existing air pollution regulations to protect human health both during and after the Covid-19 crisis,” the study said.

Last week, the Trump administration announced a plan to weaken Obama-era regulations on automobile tailpipe emissions, asserting the rollback would save lives because Americans would buy newer, safer vehicles. But the administration’s own analysis also found that there would be even more premature deaths from increased air pollution.

In weakening a regulation last year on carbon pollution from coal-fired power plants, the Environmental Protection Agency similarly acknowledged that the measure was likely to result in about 1,400 additional premature deaths a year because of more pollution.

Asked whether the E.P.A. was also studying the link between air pollution and the virus or considering policies to address the link, Andrea Woods, a spokeswoman for the agency, referred the question to the Centers for Disease Control and Prevention, and asserted that the Trump administration rollbacks would lead to some air quality improvements. Beth Gardiner, a journalist and the author of “Choked: Life and Breath in the Age of Air Pollution,” said she was particularly worried about what the coronavirus outbreak would mean for countries with far worse pollution, such as India.

“Most countries don’t take it seriously enough and aren’t doing enough given the scale of the harm that air pollution is doing to all of our health,” she said.

Most fine particulate matter comes from fuel combustion, like automobiles, refineries and power plants, as well as some indoor sources like tobacco smoke. Breathing in such microscopic pollutants, experts said, inflames and damages the lining of the lungs over time, weakening the body’s ability to fend off respiratory infections.

Multiple studies have found that exposure to fine particulate matter puts people at heightened risk for lung cancer, heart attacks, strokes and even premature death. In 2003, Dr. Zuo-Feng Zhang, the associate dean for research at the University of California, Los Angeles, Fielding School of Public Health, found that SARS patients in the most polluted parts of China were twice as likely to die from the disease as those in places with low air pollution.

To conduct the Harvard study, researchers collected particulate matter data for the past 17 years from more than 3,000 counties and Covid-19 death counts for each county through April 4 from the Center for Systems Science and Engineering Coronavirus Resource Center at the Johns Hopkins University. The resulting model, which examines aggregated rather than individual data, suggested what Dr. Dominici called a statistically significant link between pollution and coronavirus deaths.

The researchers also conducted six secondary analyses to adjust for factors they felt might compromise the results. For example, because New York state has experienced the most severe coronavirus outbreak in the country and death rates there are five times higher than anywhere else, the researchers repeated the analysis excluding all of the counties in the state. They also ran the model excluding counties with fewer than 10 confirmed Covid-19 cases. And they adjusted for various other factors that are known to affect health outcomes, like smoking rates, population density and poverty levels.

Dr. Balmes noted that without studying individual characteristics of patients, the study could only suggest a causal connection between air pollution and Covid-19 deaths and would need to be confirmed by more research — a point with which Dr. Dominici agreed. But, Dr. Balmes said, “It’s still a valuable finding.” """])   

In [31]:
article_pred = pipeline.predict(article)
print(article_pred)

['REAL']


WAY TO GOOO...

# The model has been trained and then tested on several different inputs.