The dataset used for this assignment was retrieved from [Kaggle](https://www.kaggle.com/c/fake-news/data).

In [1]:
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [2]:
# Import dataset in a dataframe
# Pandas.read_csv reads a comma-separated values (csv) file into Dataframe and returns a two-dimensional data structure with labeled axes.
Dataframe = pd.read_csv(r'C:\Users\dimde\Documents\University of Piraeus - MSc in Artificial Intelligence\Courses\First semester\Machine learning\Assignments\Machine learning\Fake news\Dataset\train.csv')

`pd.read_csv` reads a comma-separated values (CSV) file and returns a DataFrame, a two-dimensional data structure with labeled axes.

In [3]:
# Get the dataframe shape
# Returns a tuple representing the dimensionality of the Dataframe
Dataframe.shape
print(Dataframe.shape)

(20800, 5)


The shape gives the dimensionality of the dataframe as (rows, columns): the dataset has 20800 feature vectors and 5 features.

In [4]:
# Preview the dataframe
# Dataframe.head() returns the first 5 rows; printing the whole Dataframe shows a truncated view with the first and last 5 rows
Dataframe.head()
print(Dataframe)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2        

The output above shows the first and last 5 feature vectors of the dataset.

The 5 features are: **id, title, author, text, label.**

**id**: indicates the index of the article (from 0 to 20799, in total 20800 feature vectors).

**title**: indicates the title of the article.

**author**: indicates the author of the article.

**text**: indicates the actual main body of the article.

**label**: indicates whether the article is fake. The value is 0 if the article contains real information and 1 if it contains fake information.

Only the **label** and **text** features are examined in this assignment.

In [5]:
# Convert the 0/1 labels to 'REAL'/'FAKE' for readability
# Series.replace maps each value through the dictionary and leaves any other value unchanged
Dataframe['label'] = Dataframe['label'].replace({0: 'REAL', 1: 'FAKE'})

For readability, the label values 0 and 1 are converted to "REAL" and "FAKE". This is done only to make the output easier to understand.

In [6]:
print(Dataframe)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2        

The label feature now shows "REAL" in place of 0 and "FAKE" in place of 1.

In [7]:
# Isolate the feature label from the rest of the dataframe
labels = Dataframe.label
labels.head()
print(labels)

0        FAKE
1        REAL
2        FAKE
3        FAKE
4        FAKE
         ... 
20795    REAL
20796    REAL
20797    REAL
20798    FAKE
20799    FAKE
Name: label, Length: 20800, dtype: object


Isolate the label feature to split the dataset.

In [None]:
# Split the dataset
# Several train/test ratios were tried; uncomment exactly one line to reproduce a test
# Test 1 -> 60% train, 40% test, random_state = 7 -> Accuracy: 95.82%
#x_train,x_test,y_train,y_test = train_test_split(Dataframe['text'].values.astype('str'), labels, test_size = 0.4, random_state = 7)
# Test 2 -> 65% train, 35% test, random_state = 7 -> Accuracy: 96.17%
#x_train,x_test,y_train,y_test = train_test_split(Dataframe['text'].values.astype('str'), labels, test_size = 0.35, random_state = 7)
# Test 3 -> 70% train, 30% test, random_state = 7 -> Accuracy: 95.99%
#x_train,x_test,y_train,y_test = train_test_split(Dataframe['text'].values.astype('str'), labels, test_size = 0.3, random_state = 7)
# Test 4 -> 75% train, 25% test, random_state = 7 -> Accuracy: 96.21%
#x_train,x_test,y_train,y_test = train_test_split(Dataframe['text'].values.astype('str'), labels, test_size = 0.25, random_state = 7)
# Test 5 -> 80% train, 20% test, random_state = 7 -> Accuracy: 96.56%
x_train,x_test,y_train,y_test = train_test_split(Dataframe['text'].values.astype('str'), labels, test_size = 0.2, random_state = 7)
# Test 6 -> 85% train, 15% test, random_state = 7 -> Accuracy: 96.19%
#x_train,x_test,y_train,y_test = train_test_split(Dataframe['text'].values.astype('str'), labels, test_size = 0.15, random_state = 7)

The sklearn **train_test_split** function is used to split the dataset.

The dataset is split because the model cannot be evaluated fairly on the same data it was trained on; doing so would bias the evaluation. Predictions must be evaluated on data the model has not seen.

By default, the dataset is shuffled before the split is applied, and **random_state** controls that randomization so the same split can be reproduced.

As the commented lines above show, six train/test ratios were tried, from 60/40 to 85/15. The 80/20 split gave the highest accuracy (96.56%) and is the one left active.

## train_test_split parameters

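**train_size**: defines the size of the training set. A float between 0.0 and 1.0 is the proportion of the dataset to include in the training set; an int is the absolute number of training samples.

**test_size**: defines the size of the test set, with the same float/int convention as train_size. If both are None, test_size defaults to 0.25.

**random_state**: controls the shuffling applied before the split. It can be either an int or an instance of RandomState. The default value is None; passing an int makes the split reproducible.

**shuffle**: a boolean (**True by default**) that determines whether to shuffle the dataset before applying the split.

**stratify**: an array-like object that, if not None, is used as the class labels for a stratified split, so that each class keeps roughly the same proportion in the training and test sets.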
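As a minimal sketch of how these parameters interact, the toy example below splits ten labeled items with `test_size=0.2`, a fixed `random_state`, and `stratify` set to the labels. The data and the `toy_` variable names are hypothetical and are not part of the assignment's dataset or code.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 texts, 5 labeled FAKE and 5 labeled REAL
toy_texts = [f"article {i}" for i in range(10)]
toy_labels = ["FAKE"] * 5 + ["REAL"] * 5

# 80% train / 20% test; stratify keeps the class ratio in both parts,
# and random_state makes the shuffle reproducible
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=7, stratify=toy_labels
)

print(len(toy_x_train), len(toy_x_test))
print(Counter(toy_y_train), Counter(toy_y_test))
```

With stratification, the two test items are one "FAKE" and one "REAL". The assignment's own split does not pass `stratify`, so its class proportions may differ slightly between the training and test sets.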

In [None]:
# Initialize a TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

## Bag-of-words model
The **bag-of-words** model is a method of representing text data when modeling text with machine learning algorithms. It is easy to comprehend and implement and has seen huge success in applications such as language modeling and document classification.

One difficulty in modeling text is that it is unstructured, whereas machine learning algorithms prefer well-defined, fixed-length inputs and outputs. Since machine learning algorithms cannot operate on raw text directly, the text must be converted into vectors of numbers. This conversion is called feature encoding, and the bag-of-words model is a popular and simple way to do it for text.

The method is simple and flexible and can be used in many ways to extract features from documents. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
* A vocabulary of known words.
* A measure of the presence of known words.

The order of occurrence is not important; we only care about whether known words occur in the document. In other words, the bag-of-words approach looks at the histogram of the words within the text.

The objective is to transform each document of free text into a vector that can be used as input or output for a machine learning model. The simplest scoring method marks the presence of each word with 1 and its absence with 0. New documents that contain words outside the vocabulary can still be encoded: only the occurrences of known words are scored, and unknown words are ignored.

In this assignment each article is a "document" and all of the articles together are the entire corpus of "documents". An entire corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
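The document-by-word matrix described above can be sketched in plain Python. The two-sentence corpus and the helper `binary_vector` are hypothetical and only illustrate binary scoring; they are not what `TfidfVectorizer` does internally.

```python
# Hypothetical toy corpus: each string plays the role of one article
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary of known words, sorted so that the column order is reproducible
vocabulary = sorted({word for doc in corpus for word in doc.split()})


def binary_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    words = set(document.split())
    return [1 if term in words else 0 for term in vocabulary]


# One row per document, one column per vocabulary word
matrix = [binary_vector(doc, vocabulary) for doc in corpus]

# A new document: "bird" is outside the vocabulary and is simply ignored
new_vector = binary_vector("the bird sat", vocabulary)

print(vocabulary)
print(matrix)
print(new_vector)
```

Every document, including the unseen one, is encoded as a vector whose length equals the vocabulary size, regardless of how long the document is.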

For a very large corpus, such as thousands of books, the length of the vector might be thousands or millions of positions. Furthermore, each document may contain very few of the words in the vocabulary. This results in a vector with many zero scores, called a sparse vector or sparse representation. Such high-dimensional vectors require more memory and computational resources when modeling, and the vast number of dimensions can make the modeling process very challenging for traditional algorithms. Therefore, there is a need to reduce the size of the vocabulary when using a bag-of-words model.

There are some techniques that can be used as a first step to reduce the size of the vocabulary. For example:
* ignoring case (treating uppercase and lowercase letters as the same).
* ignoring punctuation.
* ignoring frequent words that don’t contain much information (called stop words) like "a", "of", "him", "the".
* fixing misspelled words.
* reducing words to their stem using stemming algorithms (for example "use" from "using").
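A few of the techniques listed above (ignoring case, ignoring punctuation, and dropping stop words) can be sketched as follows. The stop word set and the function name `simple_tokens` are hypothetical, and spelling correction and stemming are left out because they need dedicated tools.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only
STOP_WORDS = {"a", "of", "him", "the"}


def simple_tokens(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


tokens = simple_tokens("The Truth, of course, surprised him!")
print(tokens)
```

Six raw words are reduced to three tokens, which shows how these steps shrink the vocabulary before any scoring takes place.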

A more sophisticated approach is to create a vocabulary of grouped words. This changes the scope of the vocabulary and allows the bag-of-words to capture somewhat more meaning from the document. In this approach, each word or token is called a **gram**. A vocabulary of two-word sequences is called a **2-gram model** or **bigram model**; only the bigrams that actually appear in the corpus are modeled, not all possible bigrams. A **3-gram** or **trigram** is a sequence of three words. The general approach is called the **n-gram model**, where n is the number of grouped words. A simple bigram approach is often better than a 1-gram bag-of-words model for tasks like document classification.
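To make the n-gram idea concrete, here is a small sketch that extracts word n-grams from a sentence. The helper `ngrams` is a hypothetical name; splitting on whitespace is a simplifying assumption.

```python
def ngrams(text, n):
    """Return the n-word sequences that occur in the text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "this is interesting news"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

Only the sequences that actually occur are produced, which is why a bigram vocabulary stays far smaller than the set of all possible word pairs.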

Once a vocabulary method has been chosen, the occurrence of words in the documents needs to be scored. As mentioned earlier, the simplest scoring method is **binary scoring**, which marks each word as 1 if present and 0 if absent. Another scoring method is **counts**, which counts the number of times each word appears in a document. Finally, the **frequencies** scoring method divides the number of times each word appears in a document by the total number of words in that document.
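The three scoring methods can be compared side by side on a single document. The sentence and the four-word vocabulary below are hypothetical examples.

```python
from collections import Counter

document = "fake news spreads faster than real news"
words = document.split()
vocab = ["fake", "news", "real", "satire"]

counts = Counter(words)

# Binary scoring: 1 if the word is present, 0 otherwise
binary_scores = [1 if term in counts else 0 for term in vocab]

# Counts: number of times each word appears in the document
count_scores = [counts[term] for term in vocab]

# Frequencies: count divided by the total number of words in the document
frequency_scores = [counts[term] / len(words) for term in vocab]

print(binary_scores)
print(count_scores)
print(frequency_scores)
```

The word "news" appears twice among seven words, so it scores 1, 2, and 2/7 under the three methods, while "satire" scores 0 under all of them.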

In feature hashing, known words are represented in the vocabulary by their hash values. This addresses the problem of a very large vocabulary for a large text corpus, because the size of the hash space can be chosen in advance, and that size is also the length of the vector representation of each document. Words are hashed deterministically to the same integer index in the target hash space, and a binary score or count can then be used to score the word. The challenge is to choose a hash space large enough to keep the probability of collisions low, while trading this off against sparsity.
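A rough sketch of feature hashing with count scoring is shown below. The function `hashed_counts`, the choice of MD5, and the bucket count of 8 are assumptions made for illustration; scikit-learn's `HashingVectorizer` implements the same idea with its own hash function and defaults.

```python
import hashlib


def hashed_counts(text, n_buckets=8):
    """Map each word to one of n_buckets indices and count occurrences."""
    vector = [0] * n_buckets
    for word in text.lower().split():
        # hashlib gives the same digest on every run, unlike Python's
        # built-in hash(), which is randomized per process for strings
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_buckets
        vector[index] += 1
    return vector


vec_a = hashed_counts("fake news spreads fast")
vec_b = hashed_counts("fake news spreads fast")

print(vec_a)
print(vec_a == vec_b)
```

The vector length is fixed by `n_buckets` no matter how many distinct words appear. With so few buckets, different words may collide in the same index, which is the trade-off described above.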

## TF-IDF
A problem with scoring raw word frequency is that highly frequent words start to dominate the document (they receive larger scores), even though they may carry less "information" for the model than rarer, perhaps domain-specific, words.

One approach is to rescale the frequency of words by how often they appear across the entire corpus, so that the scores of words such as "the", which are frequent in all documents, are penalized. This approach to scoring is called "Term Frequency–Inverse Document Frequency", or TF-IDF for short, where:
* Term Frequency is a scoring of the frequency of the word in the current document.
* Inverse Document Frequency is a scoring of how rare the word is across documents.

The scores act as weights, since not all words are equally important or interesting. They highlight words that are distinctive (contain useful information) in a given document. Thus, the IDF of a rare term is high, whereas the IDF of a frequent term is likely to be low.
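The two components can be computed by hand on a tiny corpus. This sketch uses the plain textbook definition, tf × log(N / df), on hypothetical sentences; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math

toy_docs = [
    "the election results were announced",
    "the senator denied the report",
    "the report cited anonymous sources",
]
tokenized = [doc.split() for doc in toy_docs]


def tf(term, tokens):
    """Term frequency: share of the document's words equal to the term."""
    return tokens.count(term) / len(tokens)


def idf(term, all_tokens):
    """Inverse document frequency; the term must occur in the corpus."""
    df = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / df)


def tf_idf(term, tokens, all_tokens):
    return tf(term, tokens) * idf(term, all_tokens)


second_doc = tokenized[1]
for term in ("the", "report", "senator"):
    print(term, round(tf_idf(term, second_doc, tokenized), 4))
```

Although "the" is the most frequent word in the second document, it appears in every document, so its IDF and its TF-IDF score are zero. The rarest word, "senator", receives the highest score.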

The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on specific text data. It has been used with great success on prediction problems like language modeling and document classification. Nevertheless, it suffers from some drawbacks, such as:
* Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
* Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.
* Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). If modeled, context and meaning would let the model tell apart the same words arranged differently (“this is interesting” as opposed to “is this interesting”), recognize synonyms (“old bike” vs “used bike”), and much more.


## TfidfVectorizer parameters

Declare a TfidfVectorizer that removes English stop words and ignores terms appearing in more than 70% of the articles (`max_df = 0.7`).

**stop_words**: if a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value. The ‘english’ list has several known issues, described below.

**Using stop words**
Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text, and which may be removed to avoid them being construed as signal for prediction. Sometimes, however, similar words are useful for prediction, such as in classifying writing style or personality.

There are several known issues with the ‘english’ stop word list provided by scikit-learn. It does not aim to be a general, ‘one-size-fits-all’ solution, as some tasks may require a more custom solution.

Take care when choosing a stop word list. Popular stop word lists may include words that are highly informative for some tasks, such as “computer”.

The stop word list should also have had the same preprocessing and tokenization applied as the one used in the vectorizer. For example, the word “we’ve” is split into “we” and “ve” by CountVectorizer’s default tokenizer, so if “we’ve” is in stop_words but “ve” is not, “ve” will be retained from “we’ve” in the transformed text. The scikit-learn vectorizers try to identify and warn about some kinds of inconsistencies.

**max_df**: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). A float in the range [0.0, 1.0] is interpreted as a proportion of documents; an integer is interpreted as an absolute count. This parameter is ignored if vocabulary is not None.

In [None]:
# Fit & transform train set, transform test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train) 
tfidf_test = tfidf_vectorizer.transform(x_test)

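Fit the TfidfVectorizer on the training set and transform it, then transform the test set with the same fitted vectorizer. The vocabulary and IDF weights are learned from the training set only, so no information from the test set leaks into the features.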
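The difference between `fit_transform` and `transform` can be illustrated with a minimal sketch on hypothetical sentences (the `toy_` and `demo_` names are not part of the assignment's code).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["markets rally on earnings", "earnings beat forecasts"]
toy_test = ["earnings surprise investors"]

demo_vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training texts
train_matrix = demo_vectorizer.fit_transform(toy_train)

# transform reuses them unchanged; words unseen during fitting are ignored
test_matrix = demo_vectorizer.transform(toy_test)

print(sorted(demo_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, one per term learned from the training texts. In the test sentence only "earnings" is known, so "surprise" and "investors" contribute nothing to its vector.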

In [None]:
# Initialize the PassiveAggressiveClassifier and fit training sets
pa_classifier = PassiveAggressiveClassifier(max_iter = 50)
pa_classifier.fit(tfidf_train, y_train)

Initialize the PassiveAggressiveClassifier with at most 50 passes over the training data (`max_iter = 50`) and fit it on “tfidf_train” and “y_train”.

In [None]:
# Predict and calculate accuracy
y_pred = pa_classifier.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Use the trained classifier to predict whether each article in the test set is real or fake, and calculate the model’s accuracy.

In [None]:
# Build confusion matrix
Conf_matrix = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
print('Confusion matrix:', Conf_matrix, sep='\n')

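Build the confusion matrix to check the successful and failed predictions.

Confusion matrix layout (rows are true labels, columns are predicted labels, in the order 'FAKE', 'REAL', with 'FAKE' treated as the positive class):
* TruePositives       FalseNegatives
* FalsePositives      TrueNegatives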
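To check how scikit-learn arranges the cells, the sketch below builds a confusion matrix from six hypothetical predictions. In `confusion_matrix`, rows correspond to true labels and columns to predicted labels, in the order given by `labels`; treating "FAKE" as the positive class is an assumption made for this illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
toy_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
toy_pred = ["FAKE", "FAKE", "REAL", "FAKE", "REAL", "REAL"]

toy_matrix = confusion_matrix(toy_true, toy_pred, labels=["FAKE", "REAL"])

# Row 0: articles that are truly FAKE; row 1: articles that are truly REAL
(tp, fn), (fp, tn) = toy_matrix

toy_accuracy = accuracy_score(toy_true, toy_pred)

print(toy_matrix)
print(tp, fn, fp, tn)
print(toy_accuracy, (tp + tn) / toy_matrix.sum())
```

The top row holds the truly fake articles (2 caught, 1 missed) and the bottom row the truly real ones (1 wrongly flagged, 2 correct). The accuracy equals the sum of the diagonal divided by the total number of predictions.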