# "Fake" vs. "Real" news: NLP

The question of "fake" versus "real" news has become more pressing as social media evolves and information spreads more rapidly. News itself is not new, and it is a fundamental part of modern-day democracies: journalists work to hold powerful entities and figures accountable and are supposed to be an ally of the people. Yet distrust of the media is extremely common, in part because officials constantly call the media into question. Harmful conspiracy theories and propaganda aren't new either, but they now have a platform on social media sites where they can thrive and spread. Companies and individuals are able to pass off "pink slime," or garbage-level information, as quality journalism. This is a problem because most people aren't entirely media literate and can't tell the difference, but a computer can.

#### The purpose of this notebook is to use Natural Language Processing to train a model to tell the difference between "fake" and "real" news.

## Table of Contents

#### 1. Importing libraries and data 
#### 2. Data cleaning
#### 3. Feature Extraction
#### 4. Training the model
#### 5. Analyzing and exploring some more
#### 6. Conclusion + takeaways


## 1. Importing libraries and data

Let's import all the libraries we'll need to load and analyze the data.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Now, let's load the CSV files for both the "fake" and "real" news datasets. The dataset is from: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset?select=True.csv. Then, we'll take a look at the first few rows of each file.

In [None]:
# Load with the default integer index so that 'title' stays a regular column
fake = pd.read_csv('Fake.csv')
real = pd.read_csv('True.csv')

In [None]:
fake.head()

In [None]:
real.head()

Both datasets contain four columns: the article's title, the article's body text, the subject of the news, and the date the article was published. It's a fairly straightforward dataset, which makes it great to work with.

Let's run info(), describe(), and value_counts() on each dataset.

In [None]:
fake.info()

In [None]:
fake.describe()

In [None]:
fake.value_counts()

In [None]:
real.info()

In [None]:
real.describe()

In [None]:
real.value_counts()

## 2. Data cleaning

I'm going to drop the "subject" column to make it easier for our model to figure out what is real and what is fake. I'll also concatenate the two types of news, but first I'll add a label column to each dataset so every article stays clearly identified as real or fake once they're combined.

In [None]:
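# Tag each article with its class before the two sets are combined
real['label'] = 'real'
fake['label'] = 'fake'

In [None]:
# Combine both sets, shuffle the rows, and drop the 'subject' column
news_data = pd.concat([fake, real], axis=0)
news_data = news_data.sample(frac=1).reset_index(drop=True)
# drop() returns a new DataFrame, so the result has to be assigned back
news_data = news_data.drop('subject', axis=1)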
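
Here's a small sketch of the same label, concatenate, shuffle, and drop sequence on two toy frames. The rows and the `toy_` names are made up for illustration; the point is just that `drop` hands back a new DataFrame and leaves the original alone unless the result is assigned.

```python
import pandas as pd

# Toy stand-ins for the real data; the rows are invented for illustration.
toy_fake = pd.DataFrame({"text": ["a", "b"], "subject": ["News", "politics"]})
toy_real = pd.DataFrame({"text": ["c", "d"], "subject": ["worldnews", "politicsNews"]})

# Label each frame before combining so the class survives the shuffle.
toy_fake["label"] = "fake"
toy_real["label"] = "real"

toy = pd.concat([toy_fake, toy_real], axis=0)
toy = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# drop() returns a new DataFrame; `toy` itself still has the column.
dropped = toy.drop("subject", axis=1)

print(list(toy.columns))
print(list(dropped.columns))
```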

Let's split our dataset into training and test sets!

First, I'll import train_test_split.

In [None]:
from sklearn.model_selection import train_test_split

I'll run train_test_split and check the heads of our training data as well as the length of X_train.

In [None]:
X = news_data['text']
y = news_data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
len(X_train)
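
Since the split above isn't seeded, the exact rows in each set (and so the scores later on) will shift a little from run to run. Here's a minimal sketch of one way to make a split repeatable, using the `random_state` and `stratify` parameters of train_test_split on made-up data. The `toy_` names and the `split` helper are hypothetical, not part of this notebook's pipeline.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the article texts and labels.
toy_X = [f"article {i}" for i in range(20)]
toy_y = ["fake"] * 10 + ["real"] * 10

def split(seed):
    # stratify keeps the fake/real ratio similar in both splits;
    # random_state fixes the shuffle so the split can be reproduced.
    return train_test_split(
        toy_X, toy_y, test_size=0.3, random_state=seed, stratify=toy_y
    )

first = split(42)
second = split(42)

# Same seed, same test rows.
print(first[1] == second[1])
print(len(first[0]), len(first[1]))
```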

## 3. Feature Extraction

I will be using tf-idf term weighting as the feature to extract from the texts.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Ignore common English stop words and any term that appears in more than 80% of the documents
my_tfidf = TfidfVectorizer(stop_words='english', max_df=0.8)

Let's fit the vectorizer and then transform X_train into a tf-idf matrix.
Then, we will use that same vectorizer to transform X_test.

In [None]:
tfidf_train = my_tfidf.fit_transform(X_train)
tfidf_test = my_tfidf.transform(X_test)

tfidf_train
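
To get a feel for what the vectorizer is doing, here's a small sketch on a three-sentence toy corpus (the sentences are made up). It shows that a word appearing in every document gets a lower idf than a rare one, and that `max_df=0.8` removes such a word from the vocabulary altogether.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences; "senate" appears in all three.
toy_corpus = [
    "the senate passed the budget on tuesday",
    "the senate rejected the budget",
    "you will not believe what the senate did",
]

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(toy_corpus)
vocab = vec.vocabulary_  # maps each token to its column index

# Rarer terms receive a larger inverse document frequency.
idf_senate = vec.idf_[vocab["senate"]]
idf_tuesday = vec.idf_[vocab["tuesday"]]
print(idf_senate < idf_tuesday)

# Each row is L2-normalised by default, so every row has length 1.
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
print(row_norms.round(3))

# With max_df=0.8, a term found in more than 80% of documents is dropped.
vec_capped = TfidfVectorizer(stop_words="english", max_df=0.8)
vec_capped.fit(toy_corpus)
print("senate" in vec_capped.vocabulary_)
```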

## 4. Training the model

I will be using PassiveAggressiveClassifier.

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier

In [None]:
pa_clf = PassiveAggressiveClassifier(max_iter=50)
pa_clf.fit(tfidf_train, y_train)
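
New articles arrive as raw text, so it can be handy to bundle the vectorizer and classifier into a single object. Here's a minimal sketch using `make_pipeline` on a few made-up sentences. The `toy_` names, the sentences, and the labels are all invented, and a model trained on four sentences says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Made-up training sentences and labels, for illustration only.
toy_texts = [
    "senator says the budget vote is set for tuesday",
    "officials confirmed the trade talks on wednesday",
    "you will not believe this shocking video",
    "watch what happens next in this shocking clip",
]
toy_labels = ["real", "real", "fake", "fake"]

# The pipeline fits the vectorizer and the classifier together, so raw
# strings can be passed straight to fit() and predict().
toy_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    PassiveAggressiveClassifier(max_iter=50, random_state=0),
)
toy_model.fit(toy_texts, toy_labels)

new_articles = ["the senator spoke on tuesday", "shocking video you must watch"]
toy_preds = toy_model.predict(new_articles)
print(list(toy_preds))
```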

We can apply the trained model to the test dataset to see how well it performs.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from mlxtend.plotting import plot_confusion_matrix


y_pred = pa_clf.predict(tfidf_test)

conf_mat = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(conf_mat,
                      show_normed=True, colorbar=True,
                      class_names=['Fake', 'Real'])
accscore = accuracy_score(y_test, y_pred)
f1score = f1_score(y_test,y_pred,pos_label='real')

print('The accuracy of prediction is {:.2f}%.\n'.format(accscore*100))
print('The F1 score is {:.3f}.\n'.format(f1score))

The model does a great job of predicting whether news is fake, with an accuracy of 99.34%. The F1 score is also extremely high, at 0.993.
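
To make the two metrics concrete, here's how accuracy and F1 fall out of confusion-matrix counts. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow; they are not this notebook's results.

```python
# Hypothetical confusion-matrix counts with "real" as the positive class.
tp = 90  # real articles predicted real
fp = 5   # fake articles predicted real
fn = 10  # real articles predicted fake
tn = 95  # fake articles predicted fake

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Precision: of the articles predicted real, how many were real.
precision = tp / (tp + fp)
# Recall: of the real articles, how many were found.
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("accuracy = {:.3f}".format(accuracy))
print("f1       = {:.3f}".format(f1))
```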

## 5. Analyzing and exploring some more

I'm going to explore:

- What criteria the model learned that make it so accurate

- Whether this model can be applied well to other articles that aren't in the dataset, or whether this dataset had a particular characteristic that made it stand out

To begin with, let's see what the model's criteria were.

In [None]:
from sklearn.utils.extmath import density
from sklearn.pipeline import make_pipeline

In [None]:
print("Dimensionality (i.e., number of features): {:d}".format(pa_clf.coef_.shape[1]))
print("Density (i.e., fraction of non-zero elements): {:.3f}".format(density(pa_clf.coef_)))

The density shows that fewer than half of the features ended up with a zero weight, meaning the model found them not useful in determining whether or not an article is real. But let's examine the other features:
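
As a quick illustration of what the density number measures, here's the same fraction of non-zero elements computed by hand with NumPy on a tiny made-up weight vector (the values are invented).

```python
import numpy as np

# Invented weights shaped like coef_ for a binary linear model: (1, n_features).
toy_coef = np.array([[0.0, 1.5, 0.0, -0.2, 0.7, 0.0]])

# Density: non-zero entries divided by total entries.
n_nonzero = np.count_nonzero(toy_coef)
toy_density = n_nonzero / toy_coef.size

print(n_nonzero, "of", toy_coef.size, "weights are non-zero")
print("density = {:.3f}".format(toy_density))
```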

Non-zero weight sorting:

In [None]:
# Keep only the non-zero weights, then sort them in ascending order
weights_nonzero = pa_clf.coef_[pa_clf.coef_ != 0]
feature_sorter_nonzero = np.argsort(weights_nonzero)
weights_nonzero_sorted = weights_nonzero[feature_sorter_nonzero]

Plotting

In [None]:
fig, axs = plt.subplots(1,2, figsize=(9,3))
sns.lineplot(data=weights_nonzero_sorted, ax=axs[0])
axs[0].set_ylabel('Weight')
axs[0].set_xlabel('Feature number \n (Zero-weight omitted)')

axs[1].hist(weights_nonzero_sorted,
            orientation='horizontal', bins=500,)
axs[1].set_xlabel('Count')

fig.suptitle('Weight distribution in features with non-zero weights')

plt.show()

Even among the features with non-zero weights, a lot of them have a value close to zero. This isn't shocking, as there were almost one hundred thousand tokens, so most of them were probably useless for the task at hand.

But which tokens were useful?

### Let's extract "Indicator" tokens

In [None]:
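# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tokens = my_tfidf.get_feature_names_out()
# Keep the tokens with non-zero weight, in the same ascending-weight order as above
tokens_nonzero = np.array(tokens)[pa_clf.coef_[0] != 0]
tokens_nonzero_sorted = np.array(tokens_nonzero)[feature_sorter_nonzero]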

num_tokens = 10
fake_indicator_tokens = tokens_nonzero_sorted[:num_tokens]
real_indicator_tokens = np.flip(tokens_nonzero_sorted[-num_tokens:])

In [None]:
fake_indicator = pd.DataFrame({
    'Token': fake_indicator_tokens,
    'Weight': weights_nonzero_sorted[:num_tokens]
})

real_indicator = pd.DataFrame({
    'Token': real_indicator_tokens,
    'Weight': np.flip(weights_nonzero_sorted[-num_tokens:])
})

In [None]:
print('The top {} tokens likely to appear in fake news were the following: \n'.format(num_tokens))
display(fake_indicator)

print('\n\n...and the top {} tokens likely to appear in real news were the following: \n'.format(num_tokens))
display(real_indicator)

In [None]:
fake_contain_fake = fake.text.loc[[np.any([token in body for token in fake_indicator.Token])
                                for body in fake.text.str.lower()]]
real_contain_real = real.text.loc[[np.any([token in body for token in real_indicator.Token])
                                for body in real.text.str.lower()]]

print('Articles that contained any of the matching indicator tokens:\n')

print('FAKE: {} out of {} ({:.2f}%)'
      .format(len(fake_contain_fake), len(fake), len(fake_contain_fake)/len(fake) * 100))
print(fake_contain_fake)

print('\nREAL: {} out of {} ({:.2f}%)'
      .format(len(real_contain_real), len(real), len(real_contain_real)/len(real) * 100))
print(real_contain_real)
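
One caveat with the check above: `token in body` is a substring test, so a short token like "sen" also matches inside "sense". Here's a hedged sketch of a whole-word alternative using a regular expression. The `contains_any_token` helper is hypothetical and not part of the notebook.

```python
import re

def contains_any_token(body, tokens):
    """Return True if any token occurs in body as a whole word (case-insensitive)."""
    tokens = list(tokens)
    if not tokens:
        return False
    # \b anchors the match at word boundaries; re.escape guards special characters.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in tokens) + r")\b"
    return re.search(pattern, body, flags=re.IGNORECASE) is not None

body = "It makes no sense to me."

# Substring test: matches "sen" inside "sense".
print("sen" in body.lower())

# Whole-word test: no standalone "sen" in this sentence.
print(contains_any_token(body, ["sen"]))

# A standalone occurrence is still found.
print(contains_any_token("Sen. Smith spoke on Tuesday.", ["sen"]))
```

It could be applied to a text column with something like `fake.text.apply(lambda b: contains_any_token(b, fake_indicator.Token))`, which would likely lower the match counts compared with the substring version.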

### Some noticeable points:

- Fake news tends to use Getty Images, most likely because a lot of fake articles aren't necessarily written by actual journalists, which means they need to find photos elsewhere.

- Weekdays such as "Tuesday" and "Wednesday" are often included in real news because AP style prefers that articles state the day an event took place if it happened that week, or the actual date of the event if it was earlier.

- The categories went beyond politics, but many indicator terms seemed relevant to U.S. politics. This includes terms like "gop", "sen", "republican", and more.

- "gop" is used more often in fake news than in real news, while "republican" is used more often in real news. This is likely because AP style tells journalists to refer to the party as "Republicans."

### Other questions:

- Why are "read" and "featured" the top two fake-news indicator tokens? Is it because an author was trying to claim that the story is real because it's been read a lot and featured elsewhere?

- The same question goes for "nov" and "washington", which perhaps suggests that a lot of fake articles came out around election time in November and discussed the month and the capital a lot.

- It is clear that Reuters is reputable, but a lot of articles begin with a "City Name (Reuters)" tag, which the algorithm must have identified as real. I wonder if the algorithm could still tell that an article is real if this identifier were removed.

These are all speculations, and it would be interesting to see how these terms are actually used within the text, but that is beyond the scope of this project.
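
For anyone who wants to pick up the Reuters question, one starting point might be to strip the dateline before vectorizing and then retrain. Here's a minimal sketch of the stripping step only. The `strip_dateline` helper and the regular expression are assumptions based on the "City Name (Reuters)" shape described above, and real articles may not all follow it.

```python
import re

# Assumed dateline shape: up to 80 characters, then "(Reuters)", then an
# optional dash. Real articles may deviate from this pattern.
DATELINE = re.compile(r"^.{0,80}?\(Reuters\)\s*-?\s*")

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' prefix if one is present."""
    return DATELINE.sub("", text, count=1)

sample = "WASHINGTON (Reuters) - The Senate voted on Tuesday."
print(strip_dateline(sample))

# Text without a dateline passes through unchanged.
print(strip_dateline("No dateline here."))
```

Applied to the data, this might look like `real['text'] = real['text'].apply(strip_dateline)` before the concatenation step, followed by rerunning the rest of the notebook to see how much the accuracy changes.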

## 6. Conclusion + takeaways

I used TfidfVectorizer and PassiveAggressiveClassifier to find "fake news" within the dataset. The model was extremely accurate, identifying the "fake news" at a consistently high rate with a high F1 score.