Ensemble-based classification performed on document titles, evaluated with 5-fold cross-validation.

*   Accuracy of Stacking Classifier on titles: 93.65
*   Accuracy of Voting Classifier on titles: 93.5

---


*   Accuracy of Multinomial Naive Bayes on titles: 87.37
*   Accuracy of Random Forest classifier on titles: 93.65
*   Accuracy of Logistic Regression on titles: 92.31






# Downloading the Dataset

In [1]:
!gdown --id 1q0ZimHCtMlhftljfVuy0i-wcgPnDUYjl 

Downloading...
From: https://drive.google.com/uc?id=1q0ZimHCtMlhftljfVuy0i-wcgPnDUYjl
To: /content/fake-news.zip
48.7MB [00:01, 47.9MB/s]


# Extracting the Dataset

In [2]:
!unzip /content/fake-news.zip

Archive:  /content/fake-news.zip
  inflating: submit.csv              
  inflating: test.csv                
  inflating: train.csv               


# Importing and Downloading necessary packages and libraries

In [3]:
import csv
import statistics

import nltk
import numpy as np
import pandas as pd
from mlxtend.classifier import StackingClassifier
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# Data from csv files
In this section, `train.csv` and `test.csv` are read with pandas and converted to NumPy arrays in which each row holds all the fields of one record.

In [4]:
train = pd.read_csv('train.csv')
train_data = train.iloc[:,:].values
test = pd.read_csv('test.csv')
test_data = test.iloc[:,:].values

# Data Preprocessing
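In this section, the titles of the training data are preprocessed: punctuation is removed, non-alphabetic tokens and stopwords are dropped, and the remaining words are lower-cased and lemmatized. The labels of the rows that have a usable title are collected in the list `y`.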

In [5]:
files = []
corpus = []
labels = train_data[:,-1]
y = []

# Raw string: the backslash is a literal character, not an (invalid) escape sequence
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer() 
postag = nltk.corpus.wordnet
stop_words = set(stopwords.words('english'))

for row in range(len(train_data)) :
  raw_text = train_data[row][1]
  try:
    for sym in raw_text : 
      # Removing punctuation
      if sym in punc : 
        raw_text = raw_text.replace(sym, "")
    # Removing non-alphabetic text and Converting text to lower-case
    words = [word.lower() for word in raw_text.split() if word.isalpha()]
    # Removing stopwords
    words = [w for w in words if not w in stop_words]
    # Performing lemmatization
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    temp_files = []
    temp_files.append(train_data[row][0])
    temp_files.append(lemmatized)
    # Storing file data to list
    files.append(temp_files)
    doc_text = ""
    for temp_str in lemmatized:
      doc_text = doc_text + temp_str + " "
    corpus.append(doc_text)
    # appending labels of rows that have data
    y.append(labels[row])
  except TypeError:
    # A missing title is read as NaN (a float), which cannot be iterated; skip the row
    continue
  
print("Total Documents: ", len(files))

Total Documents:  20242
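
The training and test cells repeat the same cleaning steps, so they could be factored into a single helper. The following is a minimal sketch under stated assumptions, not the notebook's own code: `clean_title`, `PUNCTUATION`, `DEMO_STOP_WORDS`, and the identity `lemmatize` default are hypothetical names chosen for illustration. In the notebook one would pass `stop_words` and `lemmatizer.lemmatize` instead of the demo defaults.

```python
# Hypothetical helper mirroring the cleaning steps above; names are illustrative.
PUNCTUATION = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# Tiny stand-in for NLTK's English stopword list (assumption, for the demo only).
DEMO_STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to"}


def clean_title(raw_text, stop_words=DEMO_STOP_WORDS, lemmatize=lambda w: w):
    """Return the cleaned tokens of a title, or None if the title is not a string."""
    if not isinstance(raw_text, str):
        # Missing titles are read by pandas as float NaN; report them explicitly.
        return None
    # Remove all punctuation characters in a single pass.
    text = raw_text.translate(str.maketrans("", "", PUNCTUATION))
    # Keep purely alphabetic tokens, lower-cased.
    words = [w.lower() for w in text.split() if w.isalpha()]
    # Drop stopwords, then lemmatize what remains.
    return [lemmatize(w) for w in words if w not in stop_words]


tokens = clean_title("Breaking: The Senate votes in 2016!")
print(tokens)
```

Returning `None` for non-string input makes the "skip rows without a title" rule explicit instead of relying on an exception handler.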


# Creating document vectors of corpus using TF-IDF

In [6]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
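
To make the weighting concrete, the sketch below computes TF-IDF by hand using the formula that scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit length. It is a simplified illustration, not a replacement for `TfidfVectorizer`: it splits on whitespace only, whereas the real vectorizer also lower-cases text and drops single-character tokens. The function name `tfidf_rows` and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows (whitespace tokens only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy corpus, for illustration only.
toy_corpus = ["senate vote delayed", "senate vote passed", "storm hits coast"]
toy_rows = tfidf_rows(toy_corpus)
for row in toy_rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in fewer documents (such as `delayed`) receive a larger weight than terms shared across documents (such as `senate`).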

# Text preprocessing on test data

In [7]:
files2 = []
corpus = []

for row in test_data :
  raw_text = row[1]
  try:
    for sym in raw_text : 
      # Removing punctuation
      if sym in punc : 
        raw_text = raw_text.replace(sym, "")
    # Removing non-alphabetic text and Converting text to lower-case
    words = [word.lower() for word in raw_text.split() if word.isalpha()]
    # Removing stopwords
    words = [w for w in words if not w in stop_words]
    # Performing lemmatization
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    temp_files = []
    temp_files.append(row[0])
    temp_files.append(lemmatized)
    # Storing file data to list
    files2.append(temp_files)

    doc_text = ""
    for temp_str in lemmatized:
      doc_text = doc_text + temp_str + " "
    corpus.append(doc_text)
  except TypeError:
    # A missing title is read as NaN (a float), which cannot be iterated; skip the row
    continue

print("Total Documents: ", len(files2))

Total Documents:  5078


# Creating document vectors of test data

In [8]:
X_test = vectorizer.transform(corpus)

# Implementing individual classifiers
In this section, Multinomial Naive Bayes, Random Forest, and Logistic Regression classifiers are trained on the title vectors. The accuracy of each classifier is estimated with 5-fold cross-validation on the training data.

In [9]:
scores = []

clf1 = MultinomialNB()
clf2 = RandomForestClassifier(random_state=1)
clf3 = LogisticRegression()

clf1.fit(X, y)
scores.append(model_selection.cross_val_score(clf1, X, y, cv=5, scoring='accuracy'))
print("Accuracy of Multinomial Naive Bayes on titles:", round(statistics.mean(scores[0])*100,2))
clf2.fit(X, y)
scores.append(model_selection.cross_val_score(clf2, X, y, cv=5, scoring='accuracy'))
print("Accuracy of Random Forest classifier on titles:", round(statistics.mean(scores[1])*100,2))
clf3.fit(X, y)
scores.append(model_selection.cross_val_score(clf3, X, y, cv=5, scoring='accuracy'))
print("Accuracy of Logistic Regression on titles:", round(statistics.mean(scores[2])*100,2))

Accuracy of Multinomial Naive Bayes on titles: 87.37
Accuracy of Random Forest classifier on titles: 93.65
Accuracy of Logistic Regression on titles: 92.31
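
Each score above is the mean of five accuracies: the data is split into five folds, and every fold is held out once while a fresh copy of the model is trained on the other four. The sketch below shows only the index bookkeeping of such a split. It is a simplification: with an integer `cv` and a classifier, scikit-learn uses stratified folds that preserve class proportions, which this sketch ignores. The function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # The first n_samples % k folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(12, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every sample appears in exactly one validation fold, so each accuracy is measured on data the corresponding model has not seen during training.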


# Implementing Stacking Classifier and Voting Classifier
In the stacking classifier, logistic regression serves as the meta-classifier; in the voting classifier, hard (majority) voting is used. The 5-fold cross-validated accuracy of both ensembles is also computed.

In [10]:
clf4 = LogisticRegression()

sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=clf4)
sclf.fit(X, y)
s_score = (model_selection.cross_val_score(sclf, X, y, cv=5, scoring='accuracy'))

vclf = VotingClassifier(estimators=[('mnb', clf1), ('rf', clf2), ('lr', clf3)], voting='hard')
vclf.fit(X, y)
v_score = (model_selection.cross_val_score(vclf, X, y, cv=5, scoring='accuracy'))
print("Accuracy of Stacking Classifier on titles:", round(statistics.mean(s_score)*100,2))
print("Accuracy of Voting Classifier on titles:", round(statistics.mean(v_score)*100,2))

Accuracy of Stacking Classifier on titles: 93.65
Accuracy of Voting Classifier on titles: 93.5
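
Hard voting means that each base classifier casts one vote per sample and the most frequent label wins. A minimal sketch of that rule follows; the function name `hard_vote` and the prediction lists are hypothetical, and breaking ties in favour of the smallest label is an assumption made here for determinism. With three voters and two classes, as in this notebook, ties cannot occur.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties go to the smallest label (an assumption made for this sketch).
    """
    voted = []
    for sample_preds in zip(*predictions_per_model):
        counts = Counter(sample_preds)
        best = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted


# Hypothetical predictions from three base models on four titles.
mnb_pred = [1, 0, 1, 0]
rf_pred = [1, 1, 0, 0]
lr_pred = [0, 1, 1, 0]
voted_labels = hard_vote([mnb_pred, rf_pred, lr_pred])
print(voted_labels)
```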


# Writing one output file per classifier

In [11]:
final_csv_s = []
final_csv_v = []

sy_pred = sclf.predict(X_test)
vy_pred = vclf.predict(X_test)

for i in range(len(sy_pred)):
    final_csv_s.append([files2[i][0], sy_pred[i]])
    final_csv_v.append([files2[i][0], vy_pred[i]])

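# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('title_stacking_output.csv', 'w', newline='') as f1:
    writer = csv.writer(f1)
    writer.writerows(final_csv_s)

with open('title_voting_output.csv', 'w', newline='') as f2:
    writer = csv.writer(f2)
    writer.writerows(final_csv_v)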
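
The files written above contain no header row. If the consumer of these files expects one, a small helper such as the following could be used. This is a sketch under assumptions: the column names `id` and `label` are a guess (the notebook never shows the layout of `submit.csv`), and `write_predictions`, the file name, and the sample values are hypothetical.

```python
import csv


def write_predictions(path, ids, labels, header=("id", "label")):
    """Write one (id, label) row per prediction, preceded by a header row."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    # newline='' lets the csv module control line endings on every platform.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(zip(ids, labels))


# Hypothetical ids and predictions, for illustration only.
write_predictions("demo_output.csv", [101, 102], [0, 1])
```

The length check guards against silently misaligned rows, which matters here because test rows without a usable title are skipped during preprocessing.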