# Fake News Detection with Machine Learning
## Overview

### What You'll Learn
In this section, you'll learn:
1. How to use various scikit-learn machine learning algorithms
2. How to select features for a real-world machine learning problem
3. How to design a neural network that makes predictions based on our selected features

### Prerequisites
Before starting this section, you should have an understanding of:
1. [scikit-learn and TensorFlow](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/)
2. [Basic Python (functions, loops, lists)](https://github.com/HackBinghamton/PythonWorkshop)
3. [Numpy and Pandas](https://github.com/HackBinghamton/DataScienceWorkshop)

### Introduction
We've all heard about fake news over the past few years. This workshop will guide you through designing a relatively primitive fake news detector based on a modified version of the [FakeNewsNet dataset](https://github.com/KaiDMML/FakeNewsNet).

### Setup
#### Package Installations

In [None]:
!pip3 install tensorflow
!pip3 install scikit-learn
!pip3 install python-whois
!pip3 install pandas
!pip3 install textstat
!pip3 install -U textblob
!pip3 install requests

In [None]:
import pandas as pd
import datetime
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
import tensorflow as tf
import textblob
from textstat.textstat import textstat
import requests

## Step 1: Gathering data and selecting features
### Selecting a dataset
For this workshop, we'll be using a modified version of the FakeNewsNet dataset. The data provided for you has been cleaned and a few features have been added: the original dataset did not include ICANN WHOIS information or article text.

When starting a machine learning project, it is very important to select a good dataset. Your dataset should be diverse, well-formed (no missing data), and free of incorrect data. It should also have a lot of data points: the 350 articles used for this exercise are too few to train a reliable detector.
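
A few quick checks can flag these problems before any training happens. The sketch below is illustrative only: `summarize_dataset` is a hypothetical helper (not part of the workshop code), and the tiny DataFrame is made-up sample data that merely reuses the `country` and `is_fake` column names from this dataset.

```python
import pandas as pd


def summarize_dataset(df, label_column):
    """Return basic quality indicators for a labeled dataset (hypothetical helper)."""
    return {
        "rows": len(df),
        # Number of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # How many examples each label has (a rough class-balance check)
        "label_counts": df[label_column].value_counts().to_dict(),
    }


# Made-up sample data, for illustration only
sample = pd.DataFrame({
    "country": ["US", None, "MK", "US"],
    "is_fake": [0, 1, 1, 0],
})

summary = summarize_dataset(sample, "is_fake")
print(summary)
```

Missing values, duplicated rows, or a heavily lopsided label count are all signs that the data needs more cleaning before it is used for training.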

### Loading the data
Functions that load the training and testing data have been provided for you:

In [None]:
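from io import StringIO


def load_fake_news_data(file_name):
    url = "https://raw.githubusercontent.com/HackBinghamton/MachineLearningWorkshopWeek2/master/fake_news_detection/" + file_name
    response = requests.get(url)
    response.raise_for_status()  # Fail early if the download did not succeed

    # Newer pandas versions deprecate passing a raw JSON string, so wrap the text in StringIO
    fake_news_data = pd.read_json(StringIO(response.text))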

    fake_news_features = fake_news_data.drop(columns=["is_fake"])
    fake_news_labels = fake_news_data["is_fake"]

    return fake_news_features, fake_news_labels


def load_fake_news_training_data():
    return load_fake_news_data("fakenewsnet_modified_training_set.json")


def load_fake_news_testing_data():
    return load_fake_news_data("fakenewsnet_modified_testing_set.json")


Let's take a look at what we're working with.

In [None]:
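# Download each dataset once and reuse the result
training_features, training_labels = load_fake_news_training_data()
testing_features, testing_labels = load_fake_news_testing_data()

print(training_features.shape)
print(training_features.columns)

print(testing_features.shape)
print(testing_features.columns)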

### Adding new features
Although there's a good amount of information in this dataset, not all of it is terribly useful (yet). Let's make some functions to create new features from the data we have.

### New Feature: ICANN WHOIS registered country
A lot of fake news comes from websites registered in Macedonia or Panama, or from websites whose owners hide behind domain privacy services. We can add a new column that contains a 1 if the website hosting a given article was registered in Macedonia or Panama, or if its registration location is hidden by a privacy service, and a 0 otherwise.

In [None]:
def is_suspicious_country(country):
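    # Country codes for Macedonia (MK) and Panama (PA)
    sus_countries = ["MK", "PA"]

    # A country field containing "REDACTED" means the location is hidden by a privacy service
    return int(country in sus_countries or "REDACTED" in country)


def add_suspicious_country_column(fake_news_df):
    fake_news_df["is_suspicious_country"] = fake_news_df["country"].apply(is_suspicious_country)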

    return fake_news_df
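
A quick sanity check helps confirm the flag behaves as intended. The sketch below restates `is_suspicious_country` and adds a hypothetical variant, `is_suspicious_country_safe`, that treats a missing country (`None` or NaN) as not suspicious. That default is an assumption, and the cleaned workshop data may never contain missing values. The sample strings are made up.

```python
import math


def is_suspicious_country(country):
    sus_countries = ["MK", "PA"]
    return int(country in sus_countries or "REDACTED" in country)


def is_suspicious_country_safe(country):
    """Hypothetical variant: a missing country counts as not suspicious (an assumption)."""
    if country is None or (isinstance(country, float) and math.isnan(country)):
        return 0
    return is_suspicious_country(country)


# Made-up sample values, including two kinds of missing data
for value in ["MK", "US", "REDACTED FOR PRIVACY", None, float("nan")]:
    print(repr(value), "->", is_suspicious_country_safe(value))
```

Without the guard, a `None` or NaN country would raise a `TypeError` at the `"REDACTED" in country` check.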

### New Features: Text complexity
Professional journalists are generally much better writers than those who create fake news. If an article is easy to read, it may have been written by a professional journalist rather than a propagandist.

Let's start by writing a function that measures an article's Flesch reading ease score. Higher scores mean the text is easier to read.

In [None]:
def add_flesch_reading_ease_column(fake_news_df):
    fake_news_df["flesch_reading_ease"] = fake_news_df["article_text"].apply(textstat.flesch_reading_ease)

    return fake_news_df
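
For reference, the Flesch reading ease score is computed from the average sentence length and the average number of syllables per word. The sketch below applies the standard formula to counts you supply. `flesch_reading_ease_from_counts` is a hypothetical helper, not a `textstat` function, and `textstat` does its own word, sentence, and syllable counting, so its results may differ somewhat.

```python
def flesch_reading_ease_from_counts(total_words, total_sentences, total_syllables):
    """Standard Flesch reading ease formula, applied to precomputed counts."""
    if total_words == 0 or total_sentences == 0:
        raise ValueError("The text must contain at least one word and one sentence")

    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words

    # Longer sentences and longer words both lower the score
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Made-up counts: 100 words, 5 sentences, 150 syllables
print(flesch_reading_ease_from_counts(100, 5, 150))
```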

It might also be helpful to determine how many difficult words the author used. Usage of more difficult words may imply higher proficiency with the English language, which may indicate the writing was done by a professional.

In [None]:
def percent_difficult_words(article):
    # Returns the ratio of difficult words to total words, or 0 for empty text
    word_count = textstat.lexicon_count(article)
    if word_count == 0:
        return 0

    return textstat.difficult_words(article) / word_count


def add_percent_difficult_words_column(fake_news_df):
    fake_news_df["percent_difficult_words"] = fake_news_df["article_text"].apply(percent_difficult_words)

    return fake_news_df

### New Feature: Text sentiment
Professional journalists are expected to be objective and calm in their writing. Fake news, on the other hand, is usually opinion-heavy and designed to provoke anger from its readers. Let's add two columns that store the article's polarity (from -1 for negative to 1 for positive) and subjectivity (from 0 for objective to 1 for subjective).

In [None]:
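def analyze_sentiment(article):
    # Assumed implementation: return TextBlob's polarity and subjectivity for the text
    sentiment = textblob.TextBlob(article).sentiment

    return sentiment.polarity, sentiment.subjectivity


def add_sentiment_columns(fake_news_df):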
    fake_news_df["article_polarity"], fake_news_df["article_subjectivity"] = zip(
        *fake_news_df["article_text"].map(analyze_sentiment)
    )

    return fake_news_df

### Adding the features
Now that we have functionality to create new features, let's create a function that applies all of this functionality to our datasets.

In [None]:
def add_features(fake_news_df):
    # Comment or uncomment these features as you see fit. More features isn't always better -
    # sometimes, features you think are helpful might actually harm your accuracy!
    fake_news_df = add_suspicious_country_column(fake_news_df)
    fake_news_df = add_flesch_reading_ease_column(fake_news_df)
    fake_news_df = add_percent_difficult_words_column(fake_news_df)
    fake_news_df = add_sentiment_columns(fake_news_df)

    return fake_news_df

### Dropping unused features
Raw fields such as the article ID, URL, title, and article text don't mean much to ML algorithms until they've been processed. Let's write a function that drops them from our DataFrame.

In [None]:
def drop_features(fake_news_df):
    # Drop features we're not using for our machine learning algorithm
    fake_news_df = fake_news_df.drop(columns=["id", "article_text", "country", "title", "news_url"])
    fake_news_df = fake_news_df.reset_index(drop=True)

    return fake_news_df

### Scaling existing features
As we learned last week, we want to make sure our features are scaled properly. Let's scale our creation date timestamps to fall between 0 and 1 by dividing each one by the current timestamp.

In [None]:
def scale_creation_dates(fake_news_df):
    now_timestamp = datetime.datetime.now().timestamp()
    fake_news_df["creation_date"] = fake_news_df["creation_date"].apply(lambda x: x / now_timestamp)

    return fake_news_df


def scale_features(fake_news_df):
    fake_news_df = scale_creation_dates(fake_news_df)

    return fake_news_df
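
Dividing by the current timestamp is simple, but recent dates all end up close to 1. One common alternative is min-max scaling, sketched below under the assumption that the bounds are computed from the training data only and then reused for the testing data. `fit_min_max` and `apply_min_max` are hypothetical helpers, and the timestamps are made up.

```python
def fit_min_max(values):
    """Return the (minimum, maximum) of the training values."""
    return min(values), max(values)


def apply_min_max(values, bounds):
    """Scale values to [0, 1] relative to bounds, clipping anything outside them."""
    low, high = bounds
    if high == low:
        # All training values were identical, so there is no range to scale by
        return [0.0 for _ in values]

    return [min(1.0, max(0.0, (value - low) / (high - low))) for value in values]


# Made-up Unix timestamps, for illustration only
training_timestamps = [946684800, 1262304000, 1577836800]
testing_timestamps = [1104537600, 1609459200]

bounds = fit_min_max(training_timestamps)
scaled_training = apply_min_max(training_timestamps, bounds)
scaled_testing = apply_min_max(testing_timestamps, bounds)

print(scaled_training)
print(scaled_testing)
```

Reusing the training bounds keeps the two datasets on the same scale, which matters because the model only ever sees training-scaled values while it learns.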

### Pulling EVERYTHING together
Finally, let's write a function that does all the feature creation, deletion, and scaling for us.

In [None]:
def refine_fake_news_data(fake_news_df):
    fake_news_df = add_features(fake_news_df)
    fake_news_df = drop_features(fake_news_df)
    fake_news_df = scale_features(fake_news_df)

    return fake_news_df

## Step 2: Training our algorithms
### The scikit-learn approach
Let's begin by first testing out some `scikit-learn` algorithms and observing how they perform.

In [None]:
def evaluate_sklearn_models(training_features, training_labels):
    models = [
        ("Logistic Regression", LogisticRegression(solver="lbfgs")),
        ("Linear Discriminant Analysis", LinearDiscriminantAnalysis()),
        ("K-Nearest Neighbors", KNeighborsClassifier()),
        ("Decision Tree", DecisionTreeClassifier()),
        ("Gaussian Naive Bayes", GaussianNB()),
        ("Support Vector Machine", SVC(gamma="scale")),
        ("Bagging Classifier", BaggingClassifier()),
        ("Random Forest Classifier", RandomForestClassifier(n_estimators=100))
    ]

    for name, model in models:
        kfold = model_selection.KFold(n_splits=10)

        cv_results = model_selection.cross_val_score(
            model, training_features, training_labels, cv=kfold, scoring="accuracy"
        )

        msg = "%s: \n\tAverage accuracy: %f \n\tStandard deviation: %f" % (
            name, cv_results.mean() * 100, cv_results.std() * 100
        )

        print(msg)
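
To make the cross-validation step concrete: with `n_splits=10`, the training data is cut into 10 consecutive folds, and each model is trained 10 times, each time holding out a different fold for scoring. The sketch below reproduces that index bookkeeping in plain Python for unshuffled folds. `k_fold_indices` is a hypothetical helper written for illustration; use `model_selection.KFold` in real code.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for consecutive, unshuffled folds."""
    if n_splits < 2 or n_splits > n_samples:
        raise ValueError("n_splits must be between 2 and the number of samples")

    # Spread the samples as evenly as possible; the first folds absorb any remainder
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test_indices = list(range(start, stop))
        train_indices = list(range(0, start)) + list(range(stop, n_samples))
        yield train_indices, test_indices
        start = stop


# 10 samples split into 3 folds
for train_indices, test_indices in k_fold_indices(10, 3):
    print("train:", train_indices, "test:", test_indices)
```

Every sample lands in exactly one test fold, so the averaged accuracy reflects performance on data the model did not see during that round of training.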

### Designing and training a neural network
Let's now try designing a neural network.

In [None]:
def create_neural_network():
    # This is the same design as last week's neural network, with the exception that:
    #     1. There is no input to flatten
    #     2. The dense softmax layer has been reduced from 10 units to 2 units, since our labels 
    #        can either be true or false (2 options) as opposed to a digit between 0 and 9 (10 options)
    dense_relu_layer = tf.keras.layers.Dense(1024, activation="relu")
    dropout_layer = tf.keras.layers.Dropout(0.2)
    dense_softmax_layer = tf.keras.layers.Dense(2, activation="softmax")

    neural_network_model = tf.keras.models.Sequential([
        dense_relu_layer,
        dropout_layer,
        dense_softmax_layer
    ])

    neural_network_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    return neural_network_model


def train_neural_network(neural_network_model, training_features, training_labels):
    neural_network_model.fit(training_features.values, training_labels.values, epochs=400)

    return neural_network_model


def evaluate_neural_network(neural_network_model, testing_features, testing_labels):
    test_loss, test_acc = neural_network_model.evaluate(testing_features.values, testing_labels.values)

    return test_acc
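
To see what the two-unit softmax layer and the `sparse_categorical_crossentropy` loss compute, here is a plain-Python sketch for a single example. The raw scores are made up, and `softmax` and `sparse_categorical_crossentropy_single` are hypothetical helpers, not Keras functions; Keras works on whole batches and averages the per-example losses.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the largest score avoids overflow without changing the result
    largest = max(scores)
    exponentials = [math.exp(score - largest) for score in scores]
    total = sum(exponentials)

    return [value / total for value in exponentials]


def sparse_categorical_crossentropy_single(probabilities, true_label):
    """Loss for one example: the negative log of the probability given to the true label."""
    return -math.log(probabilities[true_label])


# Made-up raw scores for label 0 (is_fake = 0) and label 1 (is_fake = 1)
scores = [0.5, 2.0]
probabilities = softmax(scores)
predicted_label = probabilities.index(max(probabilities))

print(probabilities)
print(predicted_label)
print(sparse_categorical_crossentropy_single(probabilities, 1))
```

The loss is small when the network assigns a high probability to the correct label and grows quickly as that probability approaches 0.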

## Step 3: Evaluating our algorithms
Now that we've designed our approach to the problem, let's execute!

In [None]:
def main():
    fake_news_training_features, fake_news_training_labels = load_fake_news_training_data()
    fake_news_testing_features, fake_news_testing_labels = load_fake_news_testing_data()

    fake_news_training_features = refine_fake_news_data(fake_news_training_features)
    fake_news_testing_features = refine_fake_news_data(fake_news_testing_features)

    evaluate_sklearn_models(fake_news_training_features, fake_news_training_labels)

    neural_network_model = create_neural_network()
    neural_network_model = train_neural_network(
        neural_network_model, fake_news_training_features, fake_news_training_labels
    )

    print(evaluate_neural_network(neural_network_model, fake_news_testing_features, fake_news_testing_labels))


main()
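
Accuracy alone can hide how a model fails, for example by flagging many real articles as fake. The sketch below counts the four possible outcomes from true and predicted labels and derives precision and recall from them. `confusion_counts` and `precision_and_recall` are hypothetical helpers, the labels are made up, and label 1 is assumed to mean "fake".

```python
def confusion_counts(true_labels, predicted_labels):
    """Count each outcome type, treating label 1 as the positive (fake) class."""
    counts = {"true_positive": 0, "false_positive": 0, "true_negative": 0, "false_negative": 0}

    for truth, prediction in zip(true_labels, predicted_labels):
        if prediction == 1 and truth == 1:
            counts["true_positive"] += 1
        elif prediction == 1 and truth == 0:
            counts["false_positive"] += 1
        elif prediction == 0 and truth == 0:
            counts["true_negative"] += 1
        else:
            counts["false_negative"] += 1

    return counts


def precision_and_recall(counts):
    """Precision: share of flagged articles that are fake. Recall: share of fake articles flagged."""
    flagged = counts["true_positive"] + counts["false_positive"]
    actually_fake = counts["true_positive"] + counts["false_negative"]
    precision = counts["true_positive"] / flagged if flagged else 0.0
    recall = counts["true_positive"] / actually_fake if actually_fake else 0.0

    return precision, recall


# Made-up labels, for illustration only
true_labels = [1, 0, 1, 1, 0, 0]
predicted_labels = [1, 1, 0, 1, 0, 0]

counts = confusion_counts(true_labels, predicted_labels)
print(counts)
print(precision_and_recall(counts))
```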

There are many ways we can do better. Try the following on your own:
1. Play around with TensorFlow parameters or add different layers
2. Add new features

Possibly helpful further reading:
1. [Types of Keras layers](https://keras.io/layers/core/)
2. [Types of Keras activations](https://keras.io/activations/)
3. [Fake News Detector from HackBU 2018](https://github.com/cfiutak1/HackBU2018-Fake-News-Detector/)