In [None]:
%matplotlib inline

In [None]:
import re
import itertools

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
import pandas_profiling

from wordcloud import WordCloud

In [None]:
import nltk

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download("stopwords")
nltk.download('omw-1.4')
nltk.download('wordnet')

In [None]:
# from sklearnex import patch_sklearn
# patch_sklearn()

from sklearn.experimental import enable_halving_search_cv

from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split, HalvingRandomSearchCV

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.pipeline import Pipeline

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.utils.extmath import softmax

from joblib import dump

In [None]:
from sklearn import set_config

set_config(display = 'diagram')

In [None]:
english_stopwords = set(stopwords.words('english'))

In [None]:
plt.rcParams["figure.figsize"] = (6, 6)

In [None]:
HASHING_N_FEATURES = 150_000
HASHING_NGRAM_RANGE = (1, 3)
FEATURE_SELECTION_MAX_FEATURES = 75_000
RFE_STEP = 25_000
TEST_DATA_SIZE = 0.2

# Fake news detection

## Abstract

In this notebook we will gather and clean data, create a fake news classifier, compare it with two other implementations, optimize the model performance on new data and create a pipeline for easier model usage.

## Contents:

Introduction: [Introduction](#Introduction)

Terminologies: [Terminologies](#Terminologies)

Project: [Project](#Project)

Testing our model on different data: [Testing our model on different data](#Testing-our-model-on-different-data)

Conclusion: [Conclusion](#Conclusion)

References: [References](#References)

## Introduction

We consume news through several mediums throughout the day in our daily routine, but sometimes it becomes difficult to decide which one is fake and which one is authentic.

Do you trust all the news you consume from online media?

Not all the news we consume is real. Consuming fake news means collecting wrong information about the world, and this can affect society: a person's views can change after reading fake news that they perceive to be true.

Since not all the news we encounter in our day-to-day life is authentic, how do we tell whether a piece of news is fake or real?

In this notebook, we will focus on text-based news and try to build a model that will help us identify whether a given piece of news is fake or real.

Before moving to the practical part, let's go over a few terms.

## Terminologies

### Fake News

Fake news is a kind of sensationalist reporting: it consists of pieces of information that may be false and is, for the most part, spread through social media and other online media.

This is often done to further or impose certain ideas or to promote products dishonestly, and it frequently serves political agendas.

Such news items may contain false or exaggerated claims and may end up being made viral by algorithms, and users may end up in a filter bubble.

### Hashing Vectorizer

A hashing vectorizer uses the hashing trick to map token strings to integer feature indices. It converts a collection of text documents into a sparse matrix holding token occurrence counts. The advantages of the hashing vectorizer are:

* There is no need to store a vocabulary dictionary in memory, so it scales to large datasets with very low memory use.
* There is no state computed during fit, so it can be used in a streaming or parallel pipeline.
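As a minimal sketch of the hashing trick, the cell below vectorizes two made-up documents. The documents and the small `n_features` value are assumptions chosen for readability, not values used in this project.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Two toy documents (hypothetical examples, not taken from the dataset)
docs = ["breaking news about the election", "election results announced today"]

# No vocabulary is stored: each token is hashed straight to a column index
vectorizer = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)

# transform() works without a prior fit() because the vectorizer is stateless
X = vectorizer.transform(docs)

print(X.shape)      # one row per document, n_features columns
print(X.toarray())  # raw token counts per hashed column
```

With so few columns, different tokens may collide in the same column; a larger `n_features` makes collisions less likely.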

### Confusion Matrix

A confusion matrix is used to evaluate the performance of a classification model on a given set of test data. It can only be computed if the true values for the test data are known. The matrix itself is easy to understand, but the related terminology may be confusing. Because it shows the model's errors in matrix form, it is also known as an error matrix. Some features of the confusion matrix are given below:

* The confusion matrix is an $ n \times n $ matrix, where $ n $ is the number of classes. In our case $ n $ is equal to 2 (we have two classes: fake and real news), so the confusion matrix will be a $ 2 \times 2 $ matrix.

* The matrix has two dimensions, predicted values and actual values, and its cells together account for the total number of predictions.

* Predicted values are the values predicted by the model, and actual values are the true values for the given observations.

* A $ 2 \times 2 $ confusion matrix looks like this:

![Confusion matrix](images/confusion-matrix.jpeg)

Source: [https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5](https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5)


Note: Type 1 error is called false positive and type 2 error is called false negative.
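The four cells of a binary confusion matrix can be read off directly with scikit-learn. This is a small sketch on made-up labels, treating fake news (1) as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fake news (positive class), 0 = real news
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, ordered by label (0, then 1)
cm = confusion_matrix(y_true, y_pred)

# For a binary problem ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```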

## Project

To classify a collection of news accurately as real or fake, we have to build a machine learning model.

To detect fake news, we will develop the project in Python with the help of "sklearn". We will apply "HashingVectorizer" to our news data, which we gathered from online sources.

After this first step is done, we will initialize the classifier and fit the model. In the end, we will evaluate the model using appropriate performance metrics. Once we have calculated these metrics, we will be able to see how well our model performs.

The practical implementation of these tools is straightforward and will be explained step by step in this notebook.

Let's start.

### Data preparation

Let's start by reading our train dataset, which is done below.

In [None]:
train_news = pd.read_csv('data/fake-news/train.csv')

Now let's see what our data looks like by getting the first 5 rows.

In [None]:
train_news.head()

The label column tells us whether a piece of news is real or fake: 1 means fake news, 0 means real news.

Now we will see the column types.

In [None]:
train_news.dtypes

After that let's see how many observations (rows) and features (columns) we have.

In [None]:
train_news.shape

We can see that we have 20800 observations on 5 features. One of those features is id which we will set as index.

In [None]:
train_news = train_news.set_index("id")

In [None]:
train_news.shape

We now have 4 features which are title, author, text and label, because we set id as an index.

Next, we will check whether we have missing values.

In [None]:
train_news.isna().any()

We see that we have missing values in every column except label. Since the missing values are only in text columns, we can fill them with empty strings.

In [None]:
train_news = train_news.fillna(value = {"title": "", "author": "", "text": ""})

In [None]:
train_news.shape

After fixing the missing values, we will check for duplicated rows.

In [None]:
train_news.duplicated().any()

We have duplicated rows. Let's drop them.

In [None]:
train_news = train_news.drop_duplicates()

In [None]:
train_news.shape

We now have 20691 observations. The duplicated rows are dropped so that repeated articles do not bias the model. Now that our data is clean, we can plot a bar chart to see the distribution of news in the train dataset.

In [None]:
plt.bar(range(2), [len(train_news[train_news["label"] == 0]),len(train_news[train_news["label"] == 1])])

plt.title("Distribution of news in train dataset")

plt.xticks(range(2), ["Real", "Fake"])

plt.xlabel("Type of news")
plt.ylabel("Count")

plt.show()

Let's check that the dtypes are what we expect.

In [None]:
train_news.dtypes

The types are as expected, but the columns are not ordered well. We will fix that.

In [None]:
train_news = train_news[["title", "author", "text", "label"]]

Now let's clean our test dataset.

In [None]:
test_news = pd.read_csv("data/fake-news/test.csv")

Below we can see what our test dataset looks like.

In [None]:
test_news.head()

We can see we have no label column, which is expected since this is a test dataset. The labels are in "data/fake-news/test-labels.csv".

Let's see how many observations and features we have.

In [None]:
test_news.shape # We have 5200 observations on 4 features

We will now see if we have any missing values.

In [None]:
test_news.isna().any()

Now let's fix the missing values.

In [None]:
test_news = test_news.fillna(value = {"title": "", "author": "", "text": ""})

In [None]:
test_news.shape

Before we read our test labels we have to see if we have any duplicate values.

In [None]:
test_news.duplicated().any()

We have no duplicated values and we can proceed to test labels preparation.

In [None]:
test_labels = pd.read_csv("data/fake-news/test-labels.csv")

In [None]:
test_labels.head()

We dropped no rows from the test dataset, so the labels still line up with the test news row by row. We copy the ids over so that both tables share them.

In [None]:
# Assign by position: this assumes test-labels.csv lists rows in the same order as test.csv
test_labels["id"] = test_news["id"]

In [None]:
test_news["label"] = test_labels["label"]

In [None]:
test_labels = test_labels.dropna()
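The positional assignment above relies on both files being in the same row order. A more defensive sketch, assuming the labels file carries its own `id` column, merges on `id` instead. The frames below are toy stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins for test.csv and test-labels.csv (hypothetical values)
toy_news = pd.DataFrame({"id": [10, 11, 12], "title": ["a", "b", "c"]})
toy_labels = pd.DataFrame({"id": [12, 10, 11], "label": [1, 0, 1]})  # shuffled order

# Merge on the shared key, so row order in the two files no longer matters
toy_merged = toy_news.merge(toy_labels, on="id", how="left", validate="one_to_one")

print(toy_merged)
```

The `validate="one_to_one"` argument makes pandas raise an error if an id appears more than once on either side.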

Before we proceed to next section we have to merge both datasets and clean the merged dataset.

In [None]:
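# ignore_index gives every row a unique label; otherwise train and test row labels overlap
merged_news = pd.concat([train_news, test_news], ignore_index = True)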

Here we will drop the `id` column since it came from test dataset and most values will be `NaN`.

In [None]:
merged_news = merged_news.drop(columns = ["id"])

After this processing we will check if we have missing values.

In [None]:
merged_news.isna().sum()

Let's plot a bar chart of the class distribution.

In [None]:
plt.bar(range(2), [len(merged_news[merged_news["label"] == 0]),len(merged_news[merged_news["label"] == 1])])

plt.title("Distribution of news in merged news dataset")

plt.xticks(range(2), ["Real", "Fake"])

plt.xlabel("Type of news")
plt.ylabel("Count")

plt.show()

We see that there are more fake than real news. Let's check how large the difference is.

In [None]:
merged_news.label.value_counts()

### Data analysis
In this section we will "get to know" our data.

Firstly, we will create a wordcloud plot function.

In [None]:
def plot_wordcloud(data, title):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=english_stopwords,
        random_state = 42,
        min_word_length = 4,
    ).generate(" ".join(data.astype(str)))  # join all rows; str(data) would only give a truncated preview
    
    plt.figure(figsize=(15, 10))
    plt.axis("off")
    plt.title(title, fontsize=15)
    plt.imshow(wordcloud.recolor(colormap= 'viridis', random_state = 42), interpolation = 'bilinear')
    plt.show()
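
A word cloud sizes each word by how often it occurs. As a rough sketch of that underlying count, using only the standard library, a made-up sample of three titles and a hypothetical stop-word set:

```python
from collections import Counter

# Made-up titles and a tiny stop-word set, for illustration only
sample_titles = [
    "The senate votes on the new bill",
    "New bill divides the senate",
    "Markets react to the vote",
]
sample_stopwords = {"the", "on", "to"}

# Lowercase, split on whitespace and drop stop words, then count occurrences
words = [
    w
    for title in sample_titles
    for w in title.lower().split()
    if w not in sample_stopwords
]
word_counts = Counter(words)

print(word_counts.most_common(3))
```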

After that we will get summary statistics for our merged dataset.

In [None]:
merged_news.describe(include = "all")

In the cells below we filter all titles and texts that contain the words `sensation` and `breitbart`.

The first filter is for titles containing breitbart.

In [None]:
news_containing_breitbart_in_title_filter = merged_news.title.str.lower().str.contains("breitbart")

news_containing_breitbart_in_title = merged_news[news_containing_breitbart_in_title_filter]

news_containing_breitbart_in_title.head()

We can see that the first 5 rows are all real. This is interesting, so let's look at the label counts.

In [None]:
news_containing_breitbart_in_title.label.value_counts()

We see that there are more real than fake news. Next, we will look at all news whose titles contain the word sensation.

In [None]:
news_containing_sensation_in_title_filter = merged_news.title.str.lower().str.contains("sensation")

news_containing_sensation_in_title = merged_news[news_containing_sensation_in_title_filter]

news_containing_sensation_in_title.head()

We can see that only one piece of news contains the word sensation in its title.

Next we will see all texts that contain sensation and breitbart. The first filter is for the breitbart word.

In [None]:
news_containing_breitbart_in_text_filter = merged_news.text.str.lower().str.contains("breitbart")

news_containing_breitbart_in_text = merged_news[news_containing_breitbart_in_text_filter]

news_containing_breitbart_in_text.head()

In [None]:
news_containing_breitbart_in_text.label.value_counts()

We can see again that most of the news containing the word breitbart is real.

In [None]:
news_containing_sensation_in_text_filter = merged_news.text.str.lower().str.contains("sensation")

news_containing_sensation_in_text = merged_news[news_containing_sensation_in_text_filter]

news_containing_sensation_in_text.head()

In [None]:
news_containing_sensation_in_text.label.value_counts()

We notice that most of the news containing sensation in the text is real. Now let's see all the news whose author is Breitbart News.

In [None]:
breitbart_news = merged_news[merged_news.author == 'Breitbart News']

In [None]:
breitbart_news.head()

We see that all 5 values shown are real. Let's get the value counts for labels.

In [None]:
breitbart_news.label.value_counts()

We see that all news by Breitbart News is labeled real. According to this dataset's labels, Breitbart News appears as a credible source. Now, let's check Consortiumnews.com.

In [None]:
consortium_news = merged_news[merged_news.author == 'Consortiumnews.com']

In [None]:
consortium_news.head()

In [None]:
consortium_news.label.value_counts()

We see that most of the news from Consortiumnews.com is labeled fake. According to this dataset's labels, Consortiumnews.com does not appear as a credible source.

After that we will plot some wordclouds.

The first wordcloud is for texts of fake news.

In [None]:
fake_news = merged_news[merged_news.label == 1]

In [None]:
plot_wordcloud(fake_news["text"], "Frequent words in fake news texts")

The next wordcloud is for titles of fake news.

In [None]:
plot_wordcloud(fake_news["title"], "Frequent words in fake news titles")

The last two wordclouds are for texts and titles of real news.

In [None]:
real_news = merged_news[merged_news.label == 0]

The first wordcloud for real news is for texts.

In [None]:
plot_wordcloud(real_news["text"], "Frequent words in real news texts")

The final wordcloud is for titles of real news.

In [None]:
plot_wordcloud(real_news["title"], "Frequent words in real news titles")

Now we will try to find factors other than the words that determine whether news is real or fake. Let's start by creating a dataframe which includes the title length and label of each news item.

In [None]:
title_length_and_label = pd.DataFrame({"title_len": merged_news.title.str.len(), "label": merged_news.label})

After that we will see the most frequent labels for titles longer and shorter than 50 characters.

In [None]:
titles_lens_more_than_50 = title_length_and_label[title_length_and_label.title_len > 50]

In [None]:
titles_lens_more_than_50.label.value_counts()

Now that we have the label counts for titles longer than 50 characters, let's see their titles, authors and labels.

In [None]:
merged_news.loc[titles_lens_more_than_50.index][["title", "author", "label"]].head()

In [None]:
titles_lens_less_than_50 = title_length_and_label[title_length_and_label.title_len < 50]

In [None]:
titles_lens_less_than_50.label.value_counts()

Now that we have the label counts for titles shorter than 50 characters, let's see their titles, authors and labels.

In [None]:
merged_news.loc[titles_lens_less_than_50.index][["title", "author", "label"]].head()

In [None]:
merged_news.loc[titles_lens_less_than_50[titles_lens_less_than_50.label == 0].index].head()

We can see that most of the titles longer than 50 characters are real, while most of the titles shorter than 50 characters are fake.

After that, let's see the distribution of title lengths.

In [None]:
grouped_titles_by_len = title_length_and_label.sort_values('title_len', ascending=False)

In [None]:
plt.hist(grouped_titles_by_len.title_len, bins = "fd")

plt.xlabel('Lengths of titles')
plt.ylabel('Frequency')

plt.title('Distribution of title lengths')

plt.show()

After analyzing the information from title lengths, let's analyze text lengths.

In [None]:
text_length_and_label = pd.DataFrame({"text_len": merged_news.text.str.len(), "label": merged_news.label})

In [None]:
grouped_texts_by_len = text_length_and_label.sort_values('text_len', ascending=False)

In [None]:
plt.hist(grouped_texts_by_len.text_len, bins = "fd")

plt.xlabel('Lengths of texts')
plt.ylabel('Frequency')

plt.title('Distribution of text lengths')

plt.show()

Now let's choose the lengths that we want to filter by. In our case we will choose less than 5k characters and between 5k and 10k characters.

Firstly, we will get all texts with lengths less than 5k and the corresponding news.

In [None]:
texts_lens_less_than_5k_filter = text_length_and_label.text_len < 5000

texts_lens_less_than_5k = text_length_and_label[texts_lens_less_than_5k_filter]

news_with_texts_lens_less_than_5k = merged_news.loc[texts_lens_less_than_5k.index]

news_with_texts_lens_less_than_5k.head()

In [None]:
news_with_texts_lens_less_than_5k.label.value_counts()

We can see that the counts of real and fake news differ by about 2k.

In [None]:
texts_lens_between_5k_and_10k_filter = (text_length_and_label.text_len >= 5000) & (text_length_and_label.text_len <= 10000)

text_lens_between_5k_and_10k = text_length_and_label[texts_lens_between_5k_and_10k_filter]

news_with_texts_lens_between_5k_and_10k = merged_news.loc[text_lens_between_5k_and_10k.index]

news_with_texts_lens_between_5k_and_10k.head()

In [None]:
news_with_texts_lens_between_5k_and_10k.label.value_counts()

We can see that we have more real news with lengths between 5k and 10k than fake news.

Before we proceed to the next section we will generate a report with information about each column.

In [None]:
merged_news.profile_report()

### Feature Extraction

Here we will work with a copy of news because we do not want to modify the original dataset.

In [None]:
messages = merged_news.copy()

Now we will get the columns we need.

In [None]:
messages = messages[["title", "text", "label"]]

In [None]:
messages.head()

Here we will lemmatize all words that are not stop words. Stop words are a set of commonly used words in a language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. There are two common ways to reduce a word to its base form. Stemming chops off word endings and is known to be a fairly crude method. Lemmatization, on the other hand, performs a full morphological analysis to find the root, or “lemma”, of a word more accurately. For instance, stemming the word "earthquake" will generate "earthquak", while lemmatizing will generate "earthquake". That is why we will use lemmatization.
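The stemming side of this comparison can be tried with NLTK's `PorterStemmer`, which needs no downloaded corpora. This is only a small sketch: the stemmer is not used elsewhere in this notebook, and the word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few sample words (chosen for illustration) and their Porter stems
sample_words = ["earthquake", "running", "studies"]
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```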

In [None]:
wordnet = WordNetLemmatizer()

def lemmatize(data):
    lemmatized_content = re.sub('[^a-zA-Z]',' ',data)
    lemmatized_content = lemmatized_content.lower()
    lemmatized_content = lemmatized_content.split()
    lemmatized_content = [wordnet.lemmatize(word) for word in lemmatized_content if word not in english_stopwords]
    lemmatized_content = ' '.join(lemmatized_content)
    
    return lemmatized_content

corpus = messages["title"].apply(lemmatize)

### Classifier Implementation

The model will be implemented using the LinearSVC classifier. The Linear Support Vector Classifier (SVC) applies a linear kernel to perform classification and scales well to a large number of samples. Compared with the SVC model, LinearSVC offers additional parameters, such as the penalty (regularization) norm, which can be 'l1' or 'l2', and the loss function. The kernel cannot be changed in LinearSVC, because the classifier is built around the linear kernel.
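To make the `penalty` and `loss` parameters concrete, here is a hedged sketch on synthetic data. The data, the `C` value and the variable names are assumptions for illustration, not the settings used for the model below.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data standing in for vectorized text (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# L2 penalty with the squared hinge loss (the library defaults, spelled out)
svc_l2 = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, random_state=42)
svc_l2.fit(X_demo, y_demo)

# L1 penalty tends to push some coefficients to exactly zero; it requires dual=False
svc_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.05, random_state=42)
svc_l1.fit(X_demo, y_demo)

print("non-zero coefficients with l2:", (svc_l2.coef_ != 0).sum())
print("non-zero coefficients with l1:", (svc_l1.coef_ != 0).sum())
```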

#### Confusion Matrix Plot Function

In this section we will create a confusion matrix plot function.

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    See full source and example: 
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize first, so that the plotted colors and the printed values agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write each cell's value on the plot, in a color that contrasts with the cell
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


#### Model Report Function

In this section we create a function that prints a classification report and plots confusion matrix.

In [None]:
def model_report(model, X_test, y_test):
    pred = model.predict(X_test)
    
    print(accuracy_score(y_test, pred) * 100)

    print(classification_report(y_test, pred))

    cm = confusion_matrix(y_test, pred)

    # confusion_matrix orders classes by label value: 0 (real) first, then 1 (fake)
    plot_confusion_matrix(cm, classes=['Real News', 'Fake News'])

#### The Hashing Vectorizer

Here we will define our Hashing vectorizer. We use a vectorizer to convert our text data to feature vectors. Feature vectors are used widely in machine learning because of the effectiveness and practicality of representing objects in a numerical way to help with many kinds of analyses. They are good for analysis because there are many techniques for comparing feature vectors.

In [None]:
# alternate_sign = False keeps all features non-negative, which MultinomialNB (used later) requires
hashing = HashingVectorizer(n_features = HASHING_N_FEATURES, ngram_range=HASHING_NGRAM_RANGE, binary = True, alternate_sign = False)

Now we will pass it our corpus to transform it. The vectorizer is stateless, so no fitting is needed.

In [None]:
X = hashing.transform(corpus)

#### Gathering Train and Test Data

In this section, we will gather our train and test data.

Let's start by getting our labels, which is done below.

In [None]:
y = messages['label']

After that we have to split our data and labels into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = TEST_DATA_SIZE, stratify = y, random_state = 42)
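
The `stratify` argument keeps the class proportions equal in both parts of the split. A small sketch on made-up, imbalanced labels shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 zeros and 20 ones (made-up data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```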

#### Creating the Model

In this section we will create our model in Python.

In [None]:
linear_svc = LinearSVC(random_state = 42)

Here we fit our model, that is, we train it to classify fake news.

In [None]:
linear_svc.fit(X_train, y_train)

Now let's test it. First we will get its accuracy on test data. After that we will plot the confusion matrix of the model to see how many wrong values there are.

In [None]:
model_report(linear_svc, X_test, y_test)

After we tested it on test data, we will test it on train data.

In [None]:
model_report(linear_svc, X_train, y_train)

### Comparisons

In this section we will compare our fake news classifier with two other implementations.

#### Multinomial Naive Bayes

The first model we will compare with ours is Multinomial NB classifier. Multinomial Naive Bayes algorithm is a probabilistic learning method that is mostly used in Natural Language Processing (NLP). The algorithm is based on the Bayes theorem and predicts the tag of a text such as a piece of email or newspaper article. It calculates the probability of each tag for a given sample and then gives the tag with the highest probability as output.
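Multinomial Naive Bayes expects non-negative feature values such as counts. Here is a minimal sketch on a made-up corpus, assuming the hashed features are kept non-negative with `alternate_sign=False`. The titles, labels and names are hypothetical.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up titles and labels (1 = fake, 0 = real), for illustration only
toy_titles = [
    "shocking secret cure revealed",
    "senate passes budget bill",
    "aliens endorse candidate",
    "central bank holds rates",
]
toy_labels = [1, 0, 1, 0]

# alternate_sign=False keeps every hashed feature non-negative, as MultinomialNB requires
toy_vectorizer = HashingVectorizer(n_features=64, alternate_sign=False)
X_nb = toy_vectorizer.transform(toy_titles)

nb = MultinomialNB().fit(X_nb, toy_labels)

# predict_proba returns one probability per class for each sample
probabilities = nb.predict_proba(toy_vectorizer.transform(["secret cure for rates"]))
print(probabilities)
```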

In [None]:
multinomial_nb = MultinomialNB()

In [None]:
multinomial_nb.fit(X_train, y_train)

In [None]:
model_report(multinomial_nb, X_test, y_test)

#### Logistic Regression

The second comparison will be done with Logistic Regression. Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
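The bound between 0 and 1 comes from the sigmoid function applied to a linear score. The sketch below, on synthetic data with hypothetical variable names, recomputes the probabilities by hand and prints them next to the library's output for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
clf = LogisticRegression(solver="liblinear", random_state=42).fit(X_demo, y_demo)

# Linear score z = w.x + b for the first five samples
z = clf.decision_function(X_demo[:5])

# The sigmoid maps any real score to a value between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-z))
library_proba = clf.predict_proba(X_demo[:5])[:, 1]

print(manual_proba)
print(library_proba)
```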

In [None]:
logistic_regression = LogisticRegression(
    random_state = 42,
    max_iter = 10,
    solver = 'liblinear'
)

In [None]:
logistic_regression.fit(X_train, y_train)

In [None]:
model_report(logistic_regression, X_test, y_test)

## Testing our model on different data

Here we will test our model on different data (find the data source in the References).

### Processing data

In this section our additional data will be processed.

In [None]:
additional_data = pd.read_csv("data/fake_or_real_news.csv")

In [None]:
additional_data.head()

We see that we have a column with a wrong name, so let's rename it.

In [None]:
additional_data = additional_data.rename(columns = {"Unnamed: 0": "id"})


Now we will check for missing values.

In [None]:
additional_data.isna().any()

Let's see how our news is distributed.

In [None]:
plt.bar(range(2), 
        [len(additional_data[additional_data["label"] == "REAL"]),
         len(additional_data[additional_data["label"] == "FAKE"])
])

plt.title("Distribution of news in dataset")

plt.xticks(range(2), ["Real", "Fake"])

plt.xlabel("Type of news")
plt.ylabel("Count")

plt.show()

### Testing the data on the model

In this section we will use our processed additional data to perform tests on our model.

Now we have to set our label column to numbers (0 and 1).

In [None]:
additional_data.label = additional_data.label.replace(["REAL", "FAKE"], [0, 1])

In [None]:
labels = additional_data.label

Now we need to use our text lemmatization function to prepare our additional data for Hashing Vectorizer.

In [None]:
additional_corpus = additional_data["title"].apply(lemmatize)

Here we reuse the Hashing Vectorizer defined earlier. It is stateless, so transforming the new titles with the same instance places them in the same feature space the model was trained on.

In [None]:
X_additional = hashing.transform(additional_corpus)

Now let's split our additional data.

In [None]:
X_additional_train, X_additional_test, y_additional_train, y_additional_test = train_test_split(X_additional,
                                                                                                labels,
                                                                                                random_state = 42,
                                                                                                test_size = TEST_DATA_SIZE,
                                                                                                stratify = labels)

Let's get our prediction values, a classification report and plot the confusion matrix.

In [None]:
predictions = linear_svc.predict(X_additional_test)

model_report(linear_svc, X_additional_test, y_additional_test)

Here we see that, despite its high accuracy on the original train/test data, the model does not perform as well on the additional data.

## Overfitting

The problem we encounter is called `overfitting`. It occurs when a model "learns" the training data too well and can't generalize to new data. `Overfitting` can be detected if there is a significant difference in accuracy between train and test data, or if the model has very low accuracy on data from another dataset. An image will help us understand this better:

![Overfitting](images/overfitting.png)

_While the black line fits the data well, the green line is overfit._
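
The train/test accuracy gap described above can be reproduced on synthetic data. In this sketch a decision tree is used only because it overfits readily. It is not one of this notebook's models, and the noisy data is made up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 30% of the labels flipped at random (illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, flip_y=0.3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# An unconstrained tree can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_accuracy = tree.score(X_tr, y_tr)
test_accuracy = tree.score(X_te, y_te)

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
print(f"gap:            {train_accuracy - test_accuracy:.3f}")
```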

To prevent overfitting there are many techniques, but here we will show three of them: adding more data, tuning model hyperparameters and feature selection.

### Adding more data

Here we will add 5 more datasets; their sources can be found in the References. The first dataset we will read is the ISOT news dataset.

In [None]:
isot_news_fake = pd.read_csv("data/additional_train_test_data/isot_news/Fake.csv")
isot_news_real = pd.read_csv("data/additional_train_test_data/isot_news/True.csv")

Here we will set labels on both parts of the dataset.

In [None]:
isot_news_fake["label"] = 1
isot_news_real["label"] = 0

Now, we will merge both parts.

In [None]:
isot_news = pd.concat([isot_news_real, isot_news_fake])

isot_news.shape

Let's see what the columns are.

In [None]:
isot_news.head()

We do not need the `subject` and `date` columns because they will not give the model useful information.

In [None]:
isot_news = isot_news[["title", "text", "label"]]

The next dataset we will read is the WELFake dataset.

In [None]:
welfake = pd.read_csv("data/additional_train_test_data/WELFake/WELFake_Dataset.csv")

In the WELFake dataset we have 72134 news articles.

In [None]:
welfake.head()

The `Unnamed: 0` is an index that we do not need because the default index is the same.

In [None]:
welfake = welfake.drop(columns = ["Unnamed: 0"])

Now we will read the Source based fake news classification dataset.

In [None]:
source_based_fake_news = pd.read_csv("data/additional_train_test_data/source_based_fake_news_classification/news_articles.csv")

In [None]:
source_based_fake_news.shape

In [None]:
source_based_fake_news.head()

We do not need some of the columns because we classify whether news is fake or real by title.

In [None]:
source_based_fake_news = source_based_fake_news[["title", "text", "label"]]

In [None]:
source_based_fake_news.head()

We need to convert the label column to numeric values because our model can handle only numeric values.

In [None]:
source_based_fake_news["label"] = source_based_fake_news["label"].replace({"Real": 0, "Fake": 1})

Now, we will read the fourth dataset.

Firstly, we will read the part with the fake news.

In [None]:
fake_real_dataset_fake = pd.read_csv("data/additional_train_test_data/fake-and-real/Fake.csv")
fake_real_dataset_fake["label"] = 1

Here, we will see how many records are in the file for fake news.

In [None]:
fake_real_dataset_fake.shape

Now, we will see what it looks like.

In [None]:
fake_real_dataset_fake.head()

Now, we will read the file with real news.

In [None]:
fake_real_dataset_real = pd.read_csv("data/additional_train_test_data/fake-and-real/True.csv")
fake_real_dataset_real["label"] = 0

Here, we will see how many records are in the file for real news.

In [None]:
fake_real_dataset_real.shape

Now, we will see what it looks like.

In [None]:
fake_real_dataset_real.head()

Now, we will merge both parts.

In [None]:
fake_real_dataset = pd.concat([fake_real_dataset_fake, fake_real_dataset_real])

Here we will see how many records in total there are.

In [None]:
fake_real_dataset.shape

Now, we will take a look at the merged dataset.

In [None]:
fake_real_dataset.head()

We do not need the `subject` and `date` columns, so we can remove them.

In [None]:
fake_real_dataset = fake_real_dataset.drop(columns = ["subject", "date"])

Now, we will read the last dataset.

In [None]:
last_data = pd.read_csv("data/additional_train_test_data/data.csv")

In [None]:
last_data.shape

In [None]:
last_data.head()

Here we must drop the `URLs` column and rename the other three.

In [None]:
last_data = last_data.drop(columns = ["URLs"])

In [None]:
last_data.columns = ["title", "text", "label"]

In [None]:
last_data.head()

Now the last dataset looks better. The final step is to merge all of these with the initial train/test dataset.

In [None]:
news_dataset = pd.concat([merged_news, isot_news, welfake, source_based_fake_news, fake_real_dataset, last_data])

In [None]:
news_dataset.head()

Now, let's check for missing values.

In [None]:
news_dataset.isna().sum()

We see that the `author` and `data` columns have a lot of missing values, so we can remove them.

In [None]:
news_dataset = news_dataset[["title", "text", "label"]]

We see that one `label` value is missing, so we can drop the row with the missing label.

In [None]:
news_dataset = news_dataset[~news_dataset.label.isna()]

In [None]:
news_dataset.head()

Now, we must handle the missing values of `title` and `text`. We replace them with empty strings.

In [None]:
news_dataset = news_dataset.fillna("")

In [None]:
news_dataset.isna().sum()

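We have to reset the index because the merged datasets carry overlapping index values. We pass `drop = True` so that the old index is discarded instead of being stored as an extra `index` column, which would otherwise hide duplicate rows in the next step.

In [None]:
news_dataset = news_dataset.reset_index(drop = True)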

We have no missing values and have reset the index, so we can proceed to dropping duplicates.

In [None]:
news_dataset = news_dataset.drop_duplicates()
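
The interaction between resetting the index and dropping duplicates can be sketched on a toy frame. This is only an illustration with made-up rows, not the notebook's data: a plain `reset_index()` keeps the old index as a column, so two otherwise identical rows may no longer count as duplicates.

```python
import pandas as pd

# Two toy frames; the row ("b", 1) appears in both, under different index labels.
first = pd.DataFrame({"title": ["a", "b"], "label": [0, 1]})
second = pd.DataFrame({"title": ["b", "c"], "label": [1, 1]})
merged = pd.concat([first, second])

# Plain reset_index() stores the old index in a new "index" column.
kept_index = merged.reset_index()
# reset_index(drop=True) discards the old index.
dropped_index = merged.reset_index(drop=True)

print(list(kept_index.columns))
print(len(kept_index.drop_duplicates()))     # the old index makes every row unique
print(len(dropped_index.drop_duplicates()))  # the repeated row is removed
```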

Now we have no duplicates or missing values, which means we can proceed to data transformation. Firstly, we will lemmatize all the titles.

In [None]:
news_dataset_corpus = news_dataset["title"].apply(lemmatize)

Secondly, we will get labels.

In [None]:
news_dataset_labels = news_dataset["label"]

Thirdly, we will use `HashingVectorizer` to transform our titles.

In [None]:
vectorizer = HashingVectorizer(n_features = HASHING_N_FEATURES, ngram_range = HASHING_NGRAM_RANGE)

In [None]:
news_dataset_corpus_hashed = vectorizer.transform(news_dataset_corpus)
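
`HashingVectorizer` is stateless, which is why `transform` can be called without a prior `fit`. A minimal sketch with toy titles and a small, arbitrarily chosen `n_features` (both are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

toy_titles = ["breaking news today", "news today"]

# No fit step: the column of each n-gram is computed by a hash function.
toy_vectorizer = HashingVectorizer(n_features=2**10, ngram_range=(1, 2))
hashed = toy_vectorizer.transform(toy_titles)

# A second, independent vectorizer with the same settings gives the same matrix.
hashed_again = HashingVectorizer(n_features=2**10, ngram_range=(1, 2)).transform(toy_titles)

print(hashed.shape)
print(np.allclose(hashed.toarray(), hashed_again.toarray()))
```

Because nothing is learned from the data, the same vectorizer settings can be reused on new titles at prediction time.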

In [None]:
X_train_more, X_test_more, y_train_more, y_test_more = train_test_split(
    news_dataset_corpus_hashed,
    news_dataset_labels,
    test_size = TEST_DATA_SIZE,
    random_state = 42,
    stratify = news_dataset_labels
)
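
The effect of the `stratify` argument can be checked on a small synthetic label vector. The numbers below are made up for illustration; the point is that the class ratio is preserved in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples: 80 of class 0 and 20 of class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the 80/20 class ratio of the full data.
print(y_tr.mean(), y_te.mean())
```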

We are done with adding more data. Now, let's go to hyperparameter tuning.

### Hyperparameter tuning

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.

* They are often used in processes to help estimate model parameters.
* They are often specified by the practitioner.
* They can often be set using heuristics.
* They are often tuned for a given predictive modeling problem.

We cannot know the best value for a model hyperparameter on a given problem. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error.

When a machine learning algorithm is tuned for a specific problem, you are tuning the hyperparameters of the model in order to discover the model parameters that result in the most skillful predictions.

For hyperparameter tuning, we will use `HalvingRandomSearchCV` provided by `scikit-learn`.

`HalvingRandomSearchCV` is a randomized search over hyperparameters that uses successive halving.

The search strategy starts evaluating all the candidates with a small amount of resources and iteratively selects the best candidates, using more and more resources.

The candidates are sampled at random from the parameter space and the number of sampled candidates is determined by `n_candidates`.
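
The strategy described above can be sketched on synthetic data. The dataset, the small grid and `min_resources = 100` are assumptions chosen so the example runs quickly; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401, enables the import below
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.svm import LinearSVC

X_toy, y_toy = make_classification(n_samples=600, n_features=20, random_state=0)

toy_grid = {"C": [0.01, 0.1, 1, 10], "fit_intercept": [True, False]}

toy_search = HalvingRandomSearchCV(
    LinearSVC(random_state=0),
    toy_grid,
    factor=3,            # keep roughly one third of the candidates per iteration
    min_resources=100,   # samples used in the first iteration
    scoring="f1",
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

# Candidates and samples used in each iteration, then the winning combination.
print(toy_search.n_candidates_, toy_search.n_resources_)
print(toy_search.best_params_)
```

Each iteration evaluates fewer candidates on more samples, so most of the compute goes to the promising combinations.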

Firstly, let's explain all the parameters we will tune.

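* The `penalty` parameter determines the norm used for regularization. The `l1` penalty adds the sum of the absolute values of the model weights to the objective, which drives many weights to exactly zero and yields sparse coefficient vectors. The `l2` penalty adds the sum of the squared weights, which shrinks all weights toward zero without making them exactly zero.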


* The `loss` parameter determines the loss function. `hinge` is the standard loss for "maximum-margin" classification, while `squared_hinge` is the square of the hinge loss. The squared version is differentiable, penalizes large margin violations more heavily, and is numerically easier to work with.


* The `fit_intercept` parameter determines whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations.


* The `class_weight` parameter sets the parameter `C` of class `i` to `class_weight[i] * C`. If not given, all classes are supposed to have weight one. The `balanced` mode sets the weights inversely proportional to the class frequencies.


* The `C` parameter is the regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.


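* The `max_iter` parameter determines the maximum number of iterations the solver may run.


* The `dual` parameter determines whether the solver works on the dual or the primal formulation of the optimization problem. Not every combination of `penalty`, `loss` and `dual` is supported.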
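
`LinearSVC` rejects some combinations of `penalty`, `loss` and `dual`. A search over these parameters therefore needs an `error_score`, which is the score given to candidates that fail to fit. The sketch below uses synthetic data (an assumption for illustration) and simply tries every combination:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

supported = []
for penalty, loss, dual in product(["l1", "l2"], ["hinge", "squared_hinge"], [True, False]):
    try:
        LinearSVC(penalty=penalty, loss=loss, dual=dual, random_state=0).fit(X_toy, y_toy)
        supported.append((penalty, loss, dual))
    except ValueError:
        # Unsupported combination: LinearSVC refuses to fit.
        pass

for combination in supported:
    print(combination)
```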

Now, we can start the hyperparameter tuning.

In [None]:
linsvc_grid = {
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'squared_hinge'],
    'fit_intercept': [True, False],
    'class_weight': [None, 'balanced'],
    'C': [1e-4, 1e-2, 1e-1, 1, 10, 1e2, 1e4],
    'max_iter': [1000, 1500, 2000, 2500, 3000],
    'dual': [True, False]
}

linear_svc_tuner = HalvingRandomSearchCV(
    LinearSVC(random_state = 42),
    linsvc_grid,
    verbose = 3,
    random_state = 42,
    error_score = 0,
    scoring = 'f1'
)

linear_svc_tuner.fit(X_train_more, y_train_more)

Now that the tuning has finished, let's see the best combination of hyperparameters.

In [None]:
linear_svc_tuner.best_params_

Now, let's create the model.

In [None]:
linear_svc_tuned = LinearSVC(**linear_svc_tuner.best_params_, random_state = 42)

In [None]:
linear_svc_tuned.fit(X_train_more, y_train_more)

Now, let's get a model report for additional data.

In [None]:
model_report(linear_svc_tuned, X_additional_test, y_additional_test)

### Feature Selection

Feature selection keeps the most informative features and removes redundant or irrelevant ones. This can reduce overfitting and speed up training and prediction, which is why feature selection is important. Here we use recursive feature elimination (`RFE`), which repeatedly fits the model and drops the features with the smallest weights.

Firstly, we initialize the feature selector and pass it the model to select features for, the number of features we want to keep, and the number of features to remove at each step.

In [None]:
feature_selector = RFE(
    LinearSVC(**linear_svc_tuner.best_params_, random_state = 42),
    n_features_to_select = FEATURE_SELECTION_MAX_FEATURES,
    step = RFE_STEP,
    verbose = 3
)

Then we have to fit it on our training data.

In [None]:
feature_selector.fit(X_train_more, y_train_more)

After we are done, let's transform train and test data.

In [None]:
X_train_more_transformed = feature_selector.transform(X_train_more)
X_test_more_transformed = feature_selector.transform(X_test_more)
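
How `n_features_to_select` and `step` work together can be sketched on a small synthetic problem. The sizes here are assumptions for illustration: 20 features are reduced to 10 by removing 5 at a time.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Each round fits the model and removes the 5 features with the smallest weights.
toy_selector = RFE(LinearSVC(random_state=0), n_features_to_select=10, step=5)
toy_selector.fit(X_toy, y_toy)

X_toy_reduced = toy_selector.transform(X_toy)

print(X_toy_reduced.shape)
print(toy_selector.support_.sum())  # number of features that were kept
```

A larger `step` means fewer model fits but a coarser elimination.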

Now, we will train a new model with the same hyperparameters on the reduced feature set.

In [None]:
linear_svc_feature_selection = LinearSVC(**linear_svc_tuner.best_params_, random_state = 42)

In [None]:
linear_svc_feature_selection.fit(X_train_more_transformed, y_train_more)

Now, let's test it on test data and unseen data.

In [None]:
model_report(linear_svc_feature_selection, X_test_more_transformed, y_test_more)

In [None]:
X_additional_test_transformed = feature_selector.transform(X_additional_test)

In [None]:
model_report(linear_svc_feature_selection, X_additional_test_transformed, y_additional_test)

We see that our model performs well on both datasets, which means that we are done with reducing the overfitting of our model.

## Creating pipeline

Now that we have improved our model's accuracy on unseen data, we will make it easier to use. Currently the code for getting a model prediction looks like this:

````python
news_titles = ['some title 1', 'some title 2', 'some title 3']
news_titles_lemmatized = [lemmatize(title) for title in news_titles]

news_titles_hashed = vectorizer.transform(news_titles_lemmatized)

news_titles_feature_selection = feature_selector.transform(news_titles_hashed)

predictions = linear_svc_feature_selection.predict(news_titles_feature_selection)
````

This code seems long, right? By creating a pipeline we can use the model in an easier way. The code below shows what getting predictions will look like with a pipeline:

````python

news_titles = ['some title 1', 'some title 2', 'some title 3']

predictions = fake_news_pipeline.predict(news_titles)
````

I am sure you will agree that using a pipeline makes using the model easier.

But before we create the pipeline, we need to define two custom classes: a text normalization transformer and a `LinearSVC` extension.


### Creating Custom Classes

Firstly, we will create a text normalization class because we can't pass a plain function as a step in a `scikit-learn` pipeline. Every step must be an estimator object. We have two options:


* To wrap the function in the `scikit-learn` class called `FunctionTransformer`. This is the shortest option, but the normalization logic and the stopword list stay outside the pipeline step.


* To create a custom `scikit-learn` class that extends `BaseEstimator` and `TransformerMixin`. Both options run the same Python code, so the custom class is not faster, but it keeps the stopword list and the normalization logic together in one reusable object. This is the option we will use.
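
The two options can be compared side by side. The sketch below uses a simplified, hypothetical normalizer (`simple_normalize`, letters only and lowercase) instead of the real lemmatization, so that it runs without any NLTK data:

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


def simple_normalize(documents):
    # Hypothetical stand-in for the real normalization: keep letters, lowercase.
    return [re.sub(r"[^a-zA-Z]+", " ", doc).lower().strip() for doc in documents]


class SimpleNormalizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn.
        return self

    def transform(self, documents):
        return simple_normalize(documents)


toy_titles = ["Breaking: News, TODAY!", "10 facts you MUST know"]

from_function = FunctionTransformer(simple_normalize).transform(toy_titles)
from_class = SimpleNormalizer().transform(toy_titles)

print(from_function)
print(from_class)
```

Both objects can be used as a pipeline step and give the same output here.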

In [None]:
class TextNormalizer(BaseEstimator, TransformerMixin):
    """
    Does lemmatization and stopwords removal.
    """
    def __init__(self):
        # A set gives fast membership checks, unlike the list returned by NLTK.
        self.stopwords = set(stopwords.words("english"))
    
    def normalize(self, document):
        lemma = WordNetLemmatizer()
        # Keep letters only, lowercase, and split on whitespace.
        words = re.sub('[^a-zA-Z]', ' ', document).lower().split()
        # Drop stopwords and lemmatize the remaining words.
        words = [lemma.lemmatize(word) for word in words if word not in self.stopwords]

        return ' '.join(words)

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, documents):
        return [self.normalize(document) for document in documents]

The second custom class we will create is an extended `LinearSVC`. The extension adds a `predict_proba` method that converts the decision function values into probability-like scores with a softmax. Note that these scores are not calibrated probabilities.

In [None]:
class LinearSVCwithProbabilities(LinearSVC):
    def predict_proba(self, X):
        d = self.decision_function(X)
        d_2d = np.c_[-d, d]
        return softmax(d_2d)
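
To see what this `predict_proba` computes, the same transformation can be applied to a few made-up decision values. For a decision value `d`, the softmax over `[-d, d]` gives the positive class a score of `1 / (1 + exp(-2d))`.

```python
import numpy as np
from sklearn.utils.extmath import softmax

# Made-up decision function values for four samples.
decision = np.array([-2.0, 0.0, 0.5, 3.0])

# Same steps as in predict_proba: stack [-d, d] and apply a row-wise softmax.
scores = softmax(np.c_[-decision, decision])

# The positive-class column equals a sigmoid of 2 * d.
sigmoid_of_2d = 1.0 / (1.0 + np.exp(-2.0 * decision))

print(scores)
print(np.allclose(scores[:, 1], sigmoid_of_2d))
```

A sample on the decision boundary (`d = 0`) gets a score of 0.5 for each class.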

We are done with creating custom classes. Now we can proceed with creating the pipeline.

### Defining Pipeline

Here we will define a list that will contain all steps.

In [None]:
steps = []

Firstly, we will add the `TextNormalizer` class.

In [None]:
steps.append(('lemmatization', TextNormalizer()))

The second step of the pipeline is `HashingVectorizer`.

In [None]:
steps.append(('hashing', HashingVectorizer(n_features = HASHING_N_FEATURES, ngram_range = HASHING_NGRAM_RANGE)))

The third step is feature selection.

In [None]:
steps.append(
    ('feature_selection', 
        RFE(
            LinearSVCwithProbabilities(**linear_svc_tuner.best_params_, random_state = 42),
            n_features_to_select = FEATURE_SELECTION_MAX_FEATURES,
            step = RFE_STEP,
            verbose = 3
        )
    )
)

The fourth and final step is our classifier.

In [None]:
steps.append(('classifier', LinearSVCwithProbabilities(**linear_svc_tuner.best_params_, random_state = 42)))

Now, we will create the pipeline.

In [None]:
fake_news_pipeline = Pipeline(
    steps = steps,
    verbose = True
)

Since the pipeline does all the transformations itself, we must fit it on raw text data.

In [None]:
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(
    news_dataset["title"],
    news_dataset["label"],
    test_size = TEST_DATA_SIZE,
    random_state = 42,
    stratify = news_dataset["label"]
)

In [None]:
X_additional_text_train, X_additional_text_test, y_additional_text_train, y_additional_text_test = train_test_split(
    additional_data["title"],
    additional_data["label"],
    test_size = TEST_DATA_SIZE,
    random_state = 42,
    stratify = additional_data["label"]
)

In [None]:
fake_news_pipeline.fit(X_text_train, y_text_train)
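
Fitting a pipeline directly on raw strings can be sketched with a much smaller pipeline. The titles and labels below are invented toy data, and the pipeline has only a hashing step and a classifier:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy titles: 1 = fake, 0 = real.
toy_titles = [
    "shocking secret doctors hide",
    "you will not believe this miracle cure",
    "parliament passes annual budget",
    "central bank keeps interest rate unchanged",
]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("hashing", HashingVectorizer(n_features=2**12)),
    ("classifier", LinearSVC(random_state=0)),
])

# The pipeline receives raw strings and applies every step in order.
toy_pipeline.fit(toy_titles, toy_labels)
toy_predictions = toy_pipeline.predict(["shocking miracle cure"])

print(toy_predictions)
```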

Now that our pipeline is ready, let's test it on our additional test dataset and generate a model report.

In [None]:
model_report(fake_news_pipeline, X_additional_text_test, y_additional_text_test)

## Creating a fake news detection app

We created our pipeline, which is good, but we need to make our model more accessible. Right now, anyone who wants to use the model has to install Python, install the required packages, launch this notebook and run it. That is why we will create a web app that uses the model to classify fake news. To do this we will need the `streamlit` library.

Firstly, we have to save our pipeline to a file. This is done with the `dump` function from `joblib`.

In [None]:
dump(fake_news_pipeline, "fake-news-detection-pipeline.model")
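
A saved pipeline can later be restored with `joblib.load`. The sketch below shows the round trip with a tiny toy pipeline and a temporary file. The file name and the toy data are assumptions for illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

toy_pipeline = Pipeline([
    ("hashing", HashingVectorizer(n_features=2**12)),
    ("classifier", LinearSVC(random_state=0)),
]).fit(["fake shocking claim", "official report released"], [1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy-pipeline.model")
    dump(toy_pipeline, model_path)       # save to disk
    restored_pipeline = load(model_path)  # load it back

toy_titles = ["shocking claim", "report released"]
same_predictions = list(toy_pipeline.predict(toy_titles)) == list(restored_pipeline.predict(toy_titles))

print(same_predictions)
```

When a pipeline contains custom classes such as `TextNormalizer`, the code that loads the file must be able to import those class definitions.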

After this is done, we have to move the new file to the folder `fake-news-detection-app`, which contains the web app.

## Conclusion

In this notebook we gathered and cleaned our data, implemented a model, compared it with other models, tested it on different data, improved it so that it performs well on new data, created a pipeline for our model and created a fake news detection app. Along the way we learned a lot of new terms and how to implement them in code. I hope you enjoyed this notebook.

## References

Project idea: [https://www.upgrad.com/blog/data-science-project-ideas-topics-beginners/#12_Fake_News_Detection](https://www.upgrad.com/blog/data-science-project-ideas-topics-beginners/#12_Fake_News_Detection)

Initial implementation: [https://www.analyticsvidhya.com/blog/2021/07/detecting-fake-news-with-natural-language-processing/#h2_3?&utm_source=coding-window-blog&source=coding-window-blog](https://www.analyticsvidhya.com/blog/2021/07/detecting-fake-news-with-natural-language-processing/#h2_3?&utm_source=coding-window-blog&source=coding-window-blog)

Train/Test data source: [https://www.kaggle.com/c/fake-news/data](https://www.kaggle.com/c/fake-news/data)

Additional data source: [https://www.kaggle.com/datasets/jillanisofttech/fake-or-real-news](https://www.kaggle.com/datasets/jillanisofttech/fake-or-real-news)

WELFake dataset: [https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification](https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification)

ISOT news dataset: [https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)

Source based fake news classification: [https://www.kaggle.com/datasets/ruchi798/source-based-news-classification](https://www.kaggle.com/datasets/ruchi798/source-based-news-classification)

Last additional data: [https://github.com/NamrithaGirish/FakeSites/blob/main/data.csv](https://github.com/NamrithaGirish/FakeSites/blob/main/data.csv)

Scikit-learn: [https://scikit-learn.org/stable/index.html](https://scikit-learn.org/stable/index.html)

More information on SVMs: [https://scikit-learn.org/stable/modules/svm.html](https://scikit-learn.org/stable/modules/svm.html)

Streamlit: [https://streamlit.io/](https://streamlit.io/)

In [None]:
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartModel.from_pretrained('facebook/bart-base')

In [None]:
inputs = tokenizer(TextNormalizer().transform(["hello i am petar"]), return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state.detach().numpy()
last_hidden_states = last_hidden_states.reshape(last_hidden_states.shape[1], 768)

In [None]:
from sklearn.manifold import TSNE

In [None]:
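# t-SNE requires perplexity < n_samples, and this short sentence yields only a few tokens.
tsne = TSNE(n_components = 2, learning_rate = 'auto', perplexity = min(30, last_hidden_states.shape[0] - 1))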

res = tsne.fit_transform(last_hidden_states)

In [None]:
# fig = plt.figure()
# ax = fig.add_subplot(projection='3d')

# ax.scatter(res[:, 0], res[:, 1], res[:, 2])

# ax.set_xlabel('X')
# ax.set_ylabel('Y')
# ax.set_zlabel('Z')

plt.scatter(res[:, 0], res[:, 1])

plt.show()

In [None]:
import gc

In [None]:
gc.collect()

In [None]:
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

In [None]:
tf.keras.backend.clear_session()

In [None]:
gnews_url = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
nnlm50_url = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1"

embedding = hub.KerasLayer(gnews_url, trainable = True, name = "gnews-swivel-20dim")

In [None]:
early_stop = tf.keras.callbacks.EarlyStopping(patience = 2, restore_best_weights = True)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape = (), dtype = tf.string),
    embedding,
    tf.keras.layers.Dense(512, activation = "relu"),
    tf.keras.layers.Dense(256, activation = "relu"),
    tf.keras.layers.Dense(128, activation = "relu"),
    tf.keras.layers.Dense(64, activation = "relu"),
    tf.keras.layers.Dense(32, activation = "relu"),
    tf.keras.layers.Dense(16, activation = "relu"),
    tf.keras.layers.Dense(1, activation = "sigmoid")
])

In [None]:
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])

In [None]:
model.fit(news["title"], news["label"], batch_size = 64, validation_data = (val_data["title"], val_data["label"]), callbacks = [early_stop], epochs = 50)

In [None]:
model.evaluate(val_data["title"], val_data["label"])

### SKLEARN

In [1]:
import re
import gc
import numpy as np

import pandas as pd

from sklearn.linear_model import RidgeClassifier, SGDClassifier, Perceptron, PassiveAggressiveClassifier, LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.ensemble import StackingClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

import joblib as jl

from nltk.stem import WordNetLemmatizer

In [2]:
news = pd.read_csv("data/news_merged_v4.csv")
news = news.fillna("")

val_data = pd.read_csv("data/fake_or_real_news.csv")
val_data.label = val_data.label.replace({"FAKE": 1, "REAL": 0})

In [4]:
def clean_titles(titles):
    non_letter_removal_regex = re.compile(r"[^a-zA-Z\s]")
    lemmatizer = WordNetLemmatizer()

    # remove non-letter symbols and lower titles
    removed_non_letter_symbols_and_lowered_titles = [non_letter_removal_regex.sub('', title.lower().strip()) for title in titles]

    # do whitespace tokenization
    tokenized_titles = [title.split() for title in removed_non_letter_symbols_and_lowered_titles]

    # remove stopwords from tokenized titles
    titles_with_no_stopwords = [[word for word in title if word not in ENGLISH_STOP_WORDS] for title in tokenized_titles]

    # lemmatize titles
    lemmatized_titles = [[lemmatizer.lemmatize(word) for word in title] for title in titles_with_no_stopwords]

    return [' '.join(title).replace('  ', ' ') for title in lemmatized_titles]

In [19]:
pipe = Pipeline(
    [
        ("clean_titles", FunctionTransformer(clean_titles)),
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True)),
        
#         (
#             "ensemble",
#             StackingClassifier(
#                 [
#                     ("ridge", RidgeClassifier(alpha = .1, random_state = 42)),
#                     ("logreg", LogisticRegression(solver="liblinear", random_state=42)),
#                     ("mnb", MultinomialNB(alpha=1e-3)),
#                     ("perc", Perceptron(random_state=42)),
#                     ("pa", PassiveAggressiveClassifier(C=0.1, random_state=42)),
#                     (
#                         "sgd",
#                         SGDClassifier(
#                             penalty=None,
#                             loss="squared_hinge",
#                             learning_rate="invscaling",
#                             eta0=100,
#                             random_state=42,
#                         ),
#                     ),
#                     ("linear_svc", LinearSVC(random_state = 42))
#                 ],
#                 final_estimator = GaussianNB(var_smoothing = 1)
                
#             ),
#         ),
    ],
    verbose=3,
)


In [20]:
pipe.fit(X_train, y_train)

[Pipeline] ...... (step 1 of 2) Processing clean_titles, total=  20.9s
[Pipeline] ............. (step 2 of 2) Processing tfidf, total=  27.9s


In [17]:
pipe.score(val_data["title"], val_data["label"])

0.8850828729281768

In [18]:
pipe.score(X_test, y_test)

0.8170544816878621

In [None]:
pipe.score(news["title"], news["label"])

In [21]:
x_train_vect = pipe.transform(X_train)

In [22]:
x_test_vect = pipe.transform(X_test)

In [5]:
pipe_saved = jl.load("models/best_pipe_v8.model")

In [6]:
pipe_saved.score(val_data["title"], val_data["label"])

0.9984214680347278

In [7]:
pipe_saved.score(news["title"], news["label"])

0.5158897964433868

In [14]:
from sklearn.experimental import enable_halving_search_cv  # must come before importing HalvingRandomSearchCV
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB
from sklearn.model_selection import train_test_split, cross_val_score, KFold, HalvingRandomSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score

In [None]:
def clean_titles(titles):
    titles = [re.sub(r'[^a-zA-Z]', ' ', title) for title in titles]
    titles = [re.sub(r'\s+', ' ', title.lower().strip()) for title in titles]
    tokenized_titles = [title.split(" ") for title in titles]
    titles_with_no_stopwords = [[word for word in title if not word in ENGLISH_STOP_WORDS] for title in tokenized_titles]
    joined_titles = [" ".join(title) for title in titles_with_no_stopwords]
    
    return joined_titles

In [15]:
X_train, X_test, y_train, y_test = train_test_split(news['title'], news['label'], test_size=0.2, random_state=0, stratify=news['label'])

In [None]:
val_data = pd.read_csv("../../Machine_Learning/Exam/data/fake_or_real_news.csv").dropna().drop_duplicates()

In [None]:
pipe = Pipeline([
    ('cleaning', FunctionTransformer(clean_titles)),
    ('tfidf', TfidfVectorizer(ngram_range=(1,3), sublinear_tf = True)),
#     ('ensemble', VotingClassifier(estimators=[
#         ('mnb', MultinomialNB()),
#         ('bnb', BernoulliNB()),
#         ('cnb', ComplementNB()),
#         ('lsvc', LinearSVC(random_state = 42)),
#         ('lgr', LogisticRegression(solver = "liblinear", random_state = 42)),
#         ('rc', RidgeClassifier(random_state = 42)),
#         ('sgd', SGDClassifier(random_state = 42)),
#         ('perc', Perceptron(random_state = 42)),
#         ('pa', PassiveAggressiveClassifier(C = 1, random_state = 42, average = True))
#     ]))
], verbose = 3)

In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe[-1].estimators_[-1].coef_.shape

In [None]:
pipe.score(X_test, y_test)

In [None]:
pipe.score(val_data["title"], val_data["label"].replace({"FAKE": 1, "REAL": 0}))

In [None]:
train_data = pipe.transform(X_train)

In [None]:
param_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'fit_intercept': [True, False],
              'loss': ['hinge', 'squared_hinge'],
              'average': [True, False],
              'shuffle': [True, False],
#               'random_state': [0, 42],
              'class_weight': [None, 'balanced']
              }

# Create the HalvingRandomSearchCV object
halving_rs = HalvingRandomSearchCV(
    estimator = PassiveAggressiveClassifier(random_state = 42),
    param_distributions = param_grid,
    scoring = "f1_macro",
    random_state = 0,
    verbose = 3
)

In [None]:
halving_rs.fit(train_data, y_train)

In [None]:
halving_rs.best_params_

In [None]:
PassiveAggressiveClassifier(average = True, random_state = 42)\
                            .fit(train_data, y_train)\
                            .score(pipe.transform(X_test), y_test)