In [1]:
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [2]:
# Import the dataset into a dataframe
# pd.read_csv reads a comma-separated values (CSV) file and returns a DataFrame, a two-dimensional data structure with labeled axes.
Dataframe = pd.read_csv(r'C:\Users\dimde\Documents\University of Piraeus - MSc in Artificial Intelligence\Courses\First semester\Machine learning\Assignments\Machine learning\Fake news\Dataset\train.csv')

In [3]:
# Get the dataframe shape
# Dataframe.shape is a tuple (number of rows, number of columns)
print(Dataframe.shape)

(20800, 5)


Our dataset has 20800 rows (feature vectors) and 5 columns (features).

In [4]:
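# Inspect the dataframe
# Dataframe.head() returns only the first 5 rows; its result is not displayed here because it is not the last expression in the cell
# print(Dataframe.head) prints the bound method itself, whose repr embeds the truncated dataframe (first and last 5 rows)
Dataframe.head()
print(Dataframe.head)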

<bound method NDFrame.head of           id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2        

Above we can see the first and last 5 rows (feature vectors) of the dataset.
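The output above begins with `<bound method NDFrame.head of ...` because `head` was printed without being called. A minimal sketch of the difference, using a hypothetical toy frame (`toy` and its columns are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({"id": range(8), "title": [f"title {i}" for i in range(8)]})

first_rows = toy.head()   # calling the method returns the first 5 rows
bound_method = toy.head   # without parentheses this is only the method object

print(first_rows.shape)        # (5, 2)
print(callable(bound_method))  # True
```

To print only the first rows of the real dataframe, `print(Dataframe.head())` would be the call to use.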

The 5 features are: **id, title, author, text, label.**

**id**: indicates the index of the article (from 0 to 20799, in total 20800 feature vectors).

**title**: indicates the title of the article.

**author**: indicates the author of the article.

**text**: indicates the actual main body of the article.

**label**: indicates if the article is fake or not. Value is 0 if the article represents real information and value is 1 if the article represents fake information.

Only two of these features are used in what follows: **text** as the model input and **label** as the target.

In [5]:
# Convert the 0, 1 labels to 'REAL' and 'FAKE' for readability
# Dataframe.loc with a boolean mask overwrites the label only in the rows that match the condition
Dataframe.loc[Dataframe['label'] == 1, ['label']] = 'FAKE'
Dataframe.loc[Dataframe['label'] == 0, ['label']] = 'REAL'

For readability, the label value 0 is converted to "REAL" and the label value 1 to "FAKE".
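The same relabeling can also be written as a single dictionary lookup with `Series.map`. The sketch below runs on a hypothetical four-row frame (`toy` and `label_names` are illustrative names, not part of the notebook) and is an alternative, not the approach used in the cell above:

```python
import pandas as pd

# Hypothetical toy frame with the same 0/1 encoding as the dataset
toy = pd.DataFrame({"label": [1, 0, 1, 0]})

# Map each integer code to its readable name in one pass
label_names = {0: "REAL", 1: "FAKE"}
toy["label"] = toy["label"].map(label_names)

print(toy["label"].tolist())  # ['FAKE', 'REAL', 'FAKE', 'REAL']
```

Because `map` builds a new column in one step, the order of the two replacements no longer matters.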

In [6]:
print(Dataframe.head)

<bound method NDFrame.head of           id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2        

Now the label column holds "REAL" in place of 0 and "FAKE" in place of 1.

In [7]:
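# Isolate the label column from the rest of the dataframe
labels = Dataframe.label
labels.head()
# Printing the bound method (no parentheses) shows the whole truncated Series, not just the first 5 values
print(labels.head)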

<bound method NDFrame.head of 0        FAKE
1        REAL
2        FAKE
3        FAKE
4        FAKE
         ... 
20795    REAL
20796    REAL
20797    REAL
20798    FAKE
20799    FAKE
Name: label, Length: 20800, dtype: object>


In [8]:
# Split the dataset
# Several train/test ratios were tried; only one split is active at a time, the others are commented out
# Test 1 -> 60% train, 40% test, random_state=7 -> Accuracy: 95.82%
#x_train,x_test,y_train,y_test=train_test_split(Dataframe['text'].values.astype('str'), labels, test_size=0.4, random_state=7)
# Test 2 -> 65% train, 35% test, random_state=7 -> Accuracy: 96.17%
#x_train,x_test,y_train,y_test=train_test_split(Dataframe['text'].values.astype('str'), labels, test_size=0.35, random_state=7)
# Test 3 -> 70% train, 30% test, random_state=7 -> Accuracy: 95.99%
#x_train,x_test,y_train,y_test=train_test_split(Dataframe['text'].values.astype('str'), labels, test_size=0.3, random_state=7)
# Test 4 -> 75% train, 25% test, random_state=7 -> Accuracy: 96.21%
#x_train,x_test,y_train,y_test=train_test_split(Dataframe['text'].values.astype('str'), labels, test_size=0.25, random_state=7)
# Test 5 -> 80% train, 20% test, random_state=7 -> Accuracy: 96.56%
x_train,x_test,y_train,y_test=train_test_split(Dataframe['text'].values.astype('str'), labels, test_size=0.2, random_state=7)
# Test 6 -> 85% train, 15% test, random_state=7 -> Accuracy: 96.19%
#x_train,x_test,y_train,y_test=train_test_split(Dataframe['text'].values.astype('str'), labels, test_size=0.15, random_state=7)

The sklearn **train_test_split** function is used to split the dataset into a training set and a test set.

We split the dataset because a model cannot be evaluated fairly on the data it was trained on: it has already seen those examples, so the measured accuracy would be biased upward. To get an unbiased estimate, predictions must be evaluated on data the model has not seen during training.

**train_size**: the size of the training set, given either as a fraction of the dataset (float between 0 and 1) or as an absolute number of samples (int).

**test_size**: the size of the test set, given in the same way. If only one of the two sizes is set, the other is its complement.

**random_state**: controls the shuffling applied before the split. It can be an int (for reproducible splits), an instance of RandomState, or None (the default).

**shuffle**: a Boolean (**True by default**) that determines whether to shuffle the dataset before applying the split.

**stratify**: an array-like of class labels. If not None, the split is stratified so that the training and test sets keep the class proportions of that array.

Several train/test ratios were tried, as listed in the comments of the cell above. The 80%/20% split gave the highest accuracy (96.56%) and is the one left active.
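To make the parameters above concrete, here is a minimal sketch on a hypothetical 20-item toy dataset (`toy_texts` and `toy_labels` are made up for illustration). It also passes `stratify`, which the notebook's own split does not use, to show its effect on class balance:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 placeholder articles, half FAKE and half REAL
toy_texts = [f"article {i}" for i in range(20)]
toy_labels = ["FAKE"] * 10 + ["REAL"] * 10

# 80% train / 20% test, reproducible via random_state, class ratio preserved via stratify
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=7, stratify=toy_labels
)

print(len(x_tr), len(x_te))  # 16 4
print(Counter(y_te))         # two FAKE and two REAL items
```

No article appears in both sets, which is what keeps the evaluation on the test set unbiased.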

In [None]:
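# Initialize a TfidfVectorizer
# stop_words='english' drops common English words; max_df=0.7 ignores terms that appear in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)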
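A small sketch of what these two settings do, run on a hypothetical three-sentence corpus (`corpus` is made up for illustration and is not taken from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "senate" occurs in every document, "budget" in two of three
corpus = [
    "the senate passed the budget",
    "the senate rejected the budget amendment",
    "aliens endorsed the senate candidate",
]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
matrix = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

vocabulary = vectorizer.vocabulary_        # dict: term -> column index
print("the" in vocabulary)     # False: removed as an English stop word
print("senate" in vocabulary)  # False: appears in 100% of documents, above max_df
print("budget" in vocabulary)  # True: appears in 2 of 3 documents, below max_df
print(matrix.shape[0])         # 3
```

On the real dataset the same filtering removes words that are too common across articles to help distinguish fake from real ones.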

Introduction to bag-of-words model