<a href="https://colab.research.google.com/github/BrianKipngeno/Fake-news-detection-project/blob/main/Fake_news_texts_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's create a classification model that categorizes new texts news as either fake news or not given the following dataset.

Dataset URL = https://bit.ly/319PifQ


### Prerequisites

In [1]:
# Importing the standard libraries
# ---
#
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific compuations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # library for scientific computing

# Library for Stop words
!pip3 install wordninja
!pip3 install textblob
import wordninja
from textblob import TextBlob

nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

# Library for Lemmatization
nltk.download('wordnet')
from textblob import Word

# Library for Noun count
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Library for TD-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m541.6/541.6 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wordninja
  Building wheel for wordninja (setup.py) ... [?25l[?25hdone
  Created wheel for wordninja: filename=wordninja-2.0.0-py3-none-any.whl size=541530 sha256=fe5758cc6289d17d23ae170dc8a90577c2f20498afa5439e0a35d684c04eeeef
  Stored in directory: /root/.cache/pip/wheels/aa/44/3a/f2a5c1859b8b541ded969b4cd12d0a58897f12408f4f51e084
Successfully built wordninja
Installing collected packages: wordninja
Successfully installed wordninja-2.0.0


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# Utility Functions

# Avg. words
def avg_word(sentence):
  words = sentence.split()
  try:
    z = (sum(len(word) for word in words)/len(words))
  except ZeroDivisionError:
    z = 0
  return z

# Noun count
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_dic[flag]:
                cnt += 1
    except:
        pass
    return cnt

# Subjectivity
def get_subjectivity(tweet):
    try:
        textblob = TextBlob(unicode(tweet, 'utf-8'))
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

# Polarity
def get_polarity(tweet):
    try:
        textblob = TextBlob(unicode(tweet, 'utf-8'))
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol

### Data exploration

In [3]:
# Importing our dataset
# ---
#
df = pd.read_csv('https://bit.ly/319PifQ')
df.columns = ['text', 'target']
df.head()

Unnamed: 0,text,target
0,Says the Annies List political group supports ...,False
1,When did the decline of coal start? It started...,True
2,"Hillary Clinton agrees with John McCain ""by vo...",True
3,Health care reform legislation is likely to ma...,False
4,The economic turnaround started at the end of ...,True


In [4]:
# Determining the shape of the datset
# ---
#
df.shape

(10240, 2)

In [5]:
# We will work with 100 sample records because we would
# be required to use high computational resources for a larger dataset
# ---
#
df = df.sample(100)

In [6]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

Unnamed: 0,0
text,object
target,bool


In [7]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([ True, False])

From the unique values, we need to trim the spaces in the values within our target variable.

### Data preparation

#### Basic data cleaning

In [8]:
# Let's check for missing values
# ---
#
df.isnull().sum()

Unnamed: 0,0
text,0
target,0


#### Text processing

In [9]:
# We will create a custom function that will contain all the text cleaning
# techniques. We will then reuse the same function for cleaning new data.
# ---
#
def text_cleaning(text):
  # Removing url/links
  df['text'] = df.text.apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))

  # Removing @ and # characters and replacing them with space
  df['text'] = df.text.str.replace('#',' ')
  df['text'] = df.text.str.replace('@',' ')

  # Conversion to lowercase
  df['text'] = df.text.apply(lambda x: " ".join(x.lower() for x in x.split()))

  # Removing punctuation characters
  df['text'] = df.text.str.replace('[^\w\s]','')

  # Removing stop words
  df['text'] = df.text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))

  # Lemmatization
  df['text'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

In [10]:
# Applying the text_cleaning function to our dataframe.
# ---
#
df.text.apply(text_cleaning)
df.sample(5)

Unnamed: 0,text,target
7151,$360 million tax dollar went straight ... tali...,True
445,"common core federal government, fingerprint th...",True
3021,one illness reported raw milk texas four years...,False
9059,one three american woman abortion time reach a...,True
7328,"year, federal government revenue year history ...",True


#### Feature engineering

In [11]:
# We will create a custom function that will contain all the
# feature engineering techniques. We can then use the function for cleaning new data.
# ---
#
def feature_engineering(text):
  # Length of text
  df['length_of_text'] = df.text.str.len()

  # Word count
  df['word_count'] = df.text.apply(lambda x: len(str(x).split(" ")))

  # Word density (Average no. of words / text)
  df['avg_word_length'] = df.text.apply(lambda x: avg_word(x))

  # Noun Count
  df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))

  # Verb Count
  df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))

  # Adjective Count / Text
  df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))

  # Adverb Count / Text
  df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))

  # Pronoun
  df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))

  # Subjectivity
  df['subjectivity'] = df.text.apply(get_subjectivity)

  # Polarity
  df['polarity'] = df.text.apply(get_polarity)

In [12]:
# Applying the custom feature engineering function to our dataframe.
# ---
# This process may take 2-5 min.
# ---
#
df.text.apply(feature_engineering)
df.sample(5)

Unnamed: 0,text,target,length_of_text,word_count,avg_word_length,noun_count,verb_count,adj_count,adv_count,pron_count,subjectivity,polarity
948,say donald trump doesnt make thing america.,False,43,7,5.285714,4,2,1,0,0,0.0,0.0
677,say business pay roughly 60 percent tax texas.,True,46,8,4.875,4,1,1,1,0,0.0,0.0
4998,"legislator, (marco rubio) flipped key vote mak...",False,100,14,6.214286,8,2,2,0,0,0.0,0.0
5065,michael dukakis created job three time faster ...,True,58,9,5.555556,5,1,1,1,0,0.0,0.0
2126,49th nation graduation rate,True,27,4,6.0,3,0,0,0,0,0.0,0.0


In [13]:
# Performing further feature engineering techniques
# ---
#

# Feature Construction: Word Level N-Gram TF-IDF Feature
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df.text)

# Feature Construction: Character Level N-Gram TF-IDF
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3),  stop_words= 'english')
df_char_vect = tfidf.fit_transform(df.text)



In [14]:
# Label Preparation i.e. replacing categorial values with numerical ones
# ---
#
y = np.array(df['target'].replace([False, True], ['0','1']))
y

array(['1', '0', '1', '0', '1', '0', '0', '0', '1', '1', '0', '1', '1',
       '0', '0', '0', '0', '1', '0', '0', '0', '0', '1', '0', '0', '0',
       '0', '0', '1', '0', '0', '0', '1', '0', '0', '1', '1', '1', '1',
       '1', '0', '0', '1', '0', '0', '1', '1', '1', '1', '0', '0', '0',
       '1', '1', '0', '0', '0', '1', '1', '0', '0', '0', '1', '1', '0',
       '1', '0', '1', '0', '1', '0', '1', '0', '1', '0', '0', '1', '1',
       '1', '1', '1', '1', '0', '1', '1', '0', '1', '1', '1', '1', '0',
       '1', '0', '1', '1', '0', '1', '0', '0', '0'], dtype=object)

In [15]:
# Let's prepare the constructed features for modeling
# ---
# We will select all variables but the target (which is the label) and text variables
# ---
#
X_metadata = np.array(df[df.columns.difference(['target', 'text'])])
X_metadata

array([[  0.        ,   0.        ,   6.        ,  27.        ,
          3.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   4.        ],
       [  0.        ,   1.        ,   6.57142857,  52.        ,
          4.        ,   0.        ,   0.        ,   0.        ,
          2.        ,   7.        ],
       [  2.        ,   0.        ,   6.4       ,  36.        ,
          2.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   5.        ],
       [  3.        ,   0.        ,   7.4       ,  41.        ,
          1.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   5.        ],
       [  2.        ,   0.        ,   6.        ,  69.        ,
          5.        ,   0.        ,   0.        ,   0.        ,
          3.        ,  10.        ],
       [  2.        ,   0.        ,   5.42857143,  44.        ,
          2.        ,   0.        ,   0.        ,   0.        ,
          2.        ,   7.        ],
       [  

In [16]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
X = scipy.sparse.hstack([df_word_vect, df_char_vect, X_metadata])
X

<100x2010 sparse matrix of type '<class 'numpy.float64'>'
	with 14543 stored elements in COOrdinate format>

### Step 3: Data modelling

In [17]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
# Fitting our model
# ---
#

# Importing the algorithms
# ---
#
from sklearn.linear_model import LogisticRegression      # Logistic Regression Classifier
from sklearn.tree import DecisionTreeClassifier          # Decision Tree Classifier
from sklearn.svm import SVC                              # SVM Classifier
from sklearn.naive_bayes import MultinomialNB            # Naive Bayes Classifier
from sklearn.neighbors import KNeighborsClassifier       # KNN Classifier

# Ensemble classifiers
from sklearn.ensemble import BaggingClassifier           # Bagging Meta-Estimator Classifier
from sklearn.ensemble import RandomForestClassifier      # RandomForest Classifier
from sklearn.ensemble import AdaBoostClassifier          # AdaBoost Classifier
from sklearn.ensemble import GradientBoostingClassifier  # AdaBoost GradientBoostingClassifier


# Instantiating our models
# ---
#
logistic_classifier = LogisticRegression(solver='saga', max_iter=800, multi_class='multinomial') # solver works well with a large dataset like ours
decision_classifier = DecisionTreeClassifier(random_state=42)
svm_classifier = SVC()
knn_classifier = KNeighborsClassifier()
naive_classifier = MultinomialNB()

bagging_meta_classifier = BaggingClassifier()
random_forest_classifier = RandomForestClassifier()
ada_boost_classifier = AdaBoostClassifier(random_state=42)
gbm_classifier = GradientBoostingClassifier(random_state=42)

# Training our models
# ---
#
logistic_classifier.fit(X_train, y_train)
decision_classifier.fit(X_train, y_train)
svm_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)
naive_classifier.fit(X_train, y_train)

bagging_meta_classifier.fit(X_train, y_train)
random_forest_classifier.fit(X_train, y_train)
ada_boost_classifier.fit(X_train, y_train)
gbm_classifier.fit(X_train, y_train)



In [19]:
# Making predictions
# ---
#
logistic_y_prediction = logistic_classifier.predict(X_test)
decision_y_prediction = decision_classifier.predict(X_test)
svm_y_prediction = svm_classifier.predict(X_test)
knn_y_prediction = knn_classifier.predict(X_test)
naive_y_prediction = naive_classifier.predict(X_test)

bagging_y_classifier = bagging_meta_classifier.predict(X_test)
random_forest_y_classifier = random_forest_classifier.predict(X_test)
ada_boost_y_classifier = ada_boost_classifier.predict(X_test)
gbm_y_classifier = gbm_classifier.predict(X_test)

In [20]:
# Evaluating the Models
# ---
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
#
print("Logistic Regression Classifier", accuracy_score(logistic_y_prediction, y_test))
print("Decision Trees Classifier", accuracy_score(decision_y_prediction, y_test))
print("SVN Classifier", accuracy_score(svm_y_prediction, y_test))
print("KNN Classifier", accuracy_score(knn_y_prediction, y_test))
print("Naive Bayes Classifier", accuracy_score(naive_y_prediction, y_test))

print("Bagging Classifier", accuracy_score(bagging_y_classifier, y_test))
print("Random Forest Classifier", accuracy_score(random_forest_y_classifier, y_test))
print("Ada Boost Classifier", accuracy_score(ada_boost_y_classifier, y_test))
print("GBM Classifier", accuracy_score(gbm_y_classifier, y_test))

Logistic Regression Classifier 0.5
Decision Trees Classifier 0.45
SVN Classifier 0.4
KNN Classifier 0.45
Naive Bayes Classifier 0.4
Bagging Classifier 0.45
Random Forest Classifier 0.6
Ada Boost Classifier 0.4
GBM Classifier 0.55


In [21]:
# Confusion matrix
# ---
# Regardless of the size of the confusion matrix, the method for intepretation is the same.
# The left-hand side contains the predicted values and the actual class labels run across the top.
# The instances that the classifier has correctly predicted run diagonally from the top-left
# to the bottom-right.
# ---
#
print('Logistic Regression Classifier:')
print(confusion_matrix(logistic_y_prediction, y_test))

print('Decision Trees Classifier:')
print(confusion_matrix(decision_y_prediction, y_test))

print('SVN Classifier:')
print(confusion_matrix(svm_y_prediction, y_test))

print('KNN Classifier:')
print(confusion_matrix(knn_y_prediction, y_test))

print('Naive Bayes Classifier:')
print(confusion_matrix(naive_y_prediction, y_test))

print('Bagging Classifier:')
print(confusion_matrix(bagging_y_classifier, y_test))

print('Random Forest Classifier:')
print(confusion_matrix(random_forest_y_classifier, y_test))

print('Ada Boost Classifier:')
print(confusion_matrix(ada_boost_y_classifier, y_test))

print('GBM Classifier:')
print(confusion_matrix(gbm_y_classifier, y_test))

Logistic Regression Classifier:
[[6 8]
 [2 4]]
Decision Trees Classifier:
[[5 8]
 [3 4]]
SVN Classifier:
[[ 8 12]
 [ 0  0]]
KNN Classifier:
[[5 8]
 [3 4]]
Naive Bayes Classifier:
[[ 8 12]
 [ 0  0]]
Bagging Classifier:
[[3 6]
 [5 6]]
Random Forest Classifier:
[[4 4]
 [4 8]]
Ada Boost Classifier:
[[3 7]
 [5 5]]
GBM Classifier:
[[4 5]
 [4 7]]


In [22]:
# Classification Reports
# ---
#
print("Logistic Regression Classifier", classification_report(logistic_y_prediction, y_test))
print("Decision Trees Classifier", classification_report(decision_y_prediction, y_test))
print("SVN Classifier", classification_report(svm_y_prediction, y_test))
print("KNN Classifier", classification_report(knn_y_prediction, y_test))
print("Naive Bayes Classifier", classification_report(naive_y_prediction, y_test))

print("Bagging Classifier", classification_report(bagging_y_classifier, y_test))
print("Random Forest Classifier", classification_report(random_forest_y_classifier, y_test))
print("Ada Boost Classifier", classification_report(ada_boost_y_classifier, y_test))
print("GBM Classifier", classification_report(gbm_y_classifier, y_test))

Logistic Regression Classifier               precision    recall  f1-score   support

           0       0.75      0.43      0.55        14
           1       0.33      0.67      0.44         6

    accuracy                           0.50        20
   macro avg       0.54      0.55      0.49        20
weighted avg       0.62      0.50      0.52        20

Decision Trees Classifier               precision    recall  f1-score   support

           0       0.62      0.38      0.48        13
           1       0.33      0.57      0.42         7

    accuracy                           0.45        20
   macro avg       0.48      0.48      0.45        20
weighted avg       0.52      0.45      0.46        20

SVN Classifier               precision    recall  f1-score   support

           0       1.00      0.40      0.57        20
           1       0.00      0.00      0.00         0

    accuracy                           0.40        20
   macro avg       0.50      0.20      0.29        20
we

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Evaluation our Models

- Accuracy: the percentage of texts that were assigned the correct topic.
- Precision: the percentage of texts the classifier classified correctly out of the total number of texts it predicted for each topic
- Recall: the percentage of texts the model predicted for each topic out of the total number of texts it should have predicted for that topic.
- F1 Score: the average of both precision and recall.