Textual data is often described as the most abundant and the least organised form of data. Many organisations struggle to work with it in its raw form, so a large source of useful insight goes to waste! In this notebook I explore various preprocessing techniques, ranging from a basic TF-IDF matrix to the tokenizers provided by state-of-the-art NLP models such as Google's BERT (Bidirectional Encoder Representations from Transformers), to generate useful features from the data.

In [None]:
#importing input files
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
fake_df = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/Fake.csv")
real_df = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/True.csv")

In [None]:
fake_df.head()

In [None]:
real_df.head()

In [None]:
#pandas profiling to observe the distributions in the data
import pandas_profiling
fake_report = pandas_profiling.ProfileReport(fake_df)

In [None]:
fake_report

In [None]:
fake_df.drop_duplicates(inplace = True)

In [None]:
real_report = pandas_profiling.ProfileReport(real_df)

In [None]:
real_report

In [None]:
real_df.drop_duplicates(inplace = True)

In [None]:
fake_df['label'] = 1
real_df['label'] = 0

In [None]:
#concat and shuffle real and fake dataset
data = pd.concat([fake_df,real_df],axis = 0)
data = data.sample(frac = 1)
data.reset_index(drop=True,inplace = True)
data.head()

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import regex as re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import tensorflow as tf
# use the public Keras path; tensorflow.python.* is a private module
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing import text

from textblob import TextBlob, Word

In [None]:
subj = []
for sub in data['subject']:
    sub = sub.lower()
    sub = re.sub(" ", '',sub)
    sub = re.sub("-",'',sub)
    sub = re.sub("_",'',sub)
    subj.append(sub)
data['subject'] = subj

# Feature Engineering on News Title:

The title of a news article is very useful for detecting fake news, for reasons such as its style of casing (for example, the first letter of every word capitalised), the polarity of its language, and the use of an informal or obnoxious style of writing. The title can therefore yield many useful insights about a news article, some of which I explore below:

In [None]:
# generate sentiment subjectivity, polarity 
def sentiment_polarity(data,clean_col):
    subjectivity = []
    polarity = []
    sense = []
    for text in data[clean_col]:
        blob_sentiment = TextBlob(text).sentiment  # parse each title once and reuse it
        subjectivity.append(blob_sentiment.subjectivity)
        pol = blob_sentiment.polarity
        polarity.append(pol)
        if pol>0:
            sense.append(1)
        elif pol<0:
            sense.append(-1)
        else:
            sense.append(0)
    return subjectivity, polarity, sense

sub,pol,sentiment = sentiment_polarity(data,'title')

In [None]:
data['subjectivity'] = sub
data['polarity'] = pol
data['sentiment_polarity'] = sentiment

In [None]:
import plotly.express as px
fig = px.box(data, x="sentiment_polarity", y="subjectivity",color="label")
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')
fig.show()

The titles of fake news are more subjective than those of real news, irrespective of their sentiment polarity. This is probably because the writers of fake news include a lot of weird facts to turn the title into clickbait for readers. Real news titles, which are generally neutral reports, try to be as concise as possible!

Stemming and lemmatizing reduce words to their root form, which makes it easier for the machine to recognise the many variants of a word. This is essential for semantic analysis of the data and for generating the tokens and word embeddings that deep learning models rely on. <br>
The raw data should also be cleaned of stop words, punctuation and, if necessary, rare words, so that common pronouns, modal verbs and spelling errors, which contribute little to understanding the subject of a sentence, do not crowd out the informative words.<br>
But note that when generating embeddings for a sequence of words that will be used to train deep learning models, removing stop words is not good practice, because doing so breaks the flow of the language that the model needs to learn.
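
The trade-off in the last point can be seen on a single sentence. The sketch below is only illustrative: `TOY_STOPWORDS` is a small hand-picked set standing in for a real stop-word list, and `clean_text` is a hypothetical helper, not part of this notebook's pipeline.

```python
import re

# Hypothetical, hand-picked stop words; a real list (e.g. NLTK's) is much longer.
TOY_STOPWORDS = {"the", "is", "not", "a", "to", "of", "did"}

def clean_text(text, remove_stopwords):
    """Lower-case, strip punctuation and optionally drop stop words."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    if remove_stopwords:
        words = [w for w in words if w not in TOY_STOPWORDS]
    return " ".join(words)

title = "The senator did NOT resign!"
bag_of_words_input = clean_text(title, remove_stopwords=True)
sequence_input = clean_text(title, remove_stopwords=False)
print(bag_of_words_input)
print(sequence_input)
```

With stop words removed the negation disappears, which matters little for a bag-of-words count but changes the meaning a sequence model would see.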

In [None]:
# removing stopwords, punctuations and obtain stemmed and lemmatized form of data
stopwords_en = set(stopwords.words('english'))
def clean_data(data,col):
    stemmed_col = []
    lemmatized_col = []
    ps = PorterStemmer()  # create the stemmer once instead of once per row
    for text in data[col]:
        sent = text.lower()
        # strip URLs and bare domain names
        sent = re.sub(r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*', 
                    '', sent, flags=re.MULTILINE)
        # strip punctuation
        sent = re.sub(r'[^\w\s]', '', sent) 
        stemmed_text = " ".join([ps.stem(word) for word in sent.split() if word not in stopwords_en and not word.isdigit()])
        stemmed_col.append(stemmed_text)
        
        sent = TextBlob(sent)
        lemmatized_text = " ".join([word.lemmatize() for word in sent.words if word not in stopwords_en and not word.isdigit()])
        lemmatized_col.append(lemmatized_text)
        
    return stemmed_col, lemmatized_col

In [None]:
stemmed_col, lemmatized_col = clean_data(data,'title')

In [None]:
data['stemmed_title'] = stemmed_col
data['lemmatized_title'] = lemmatized_col

In [None]:
data.tail()

In [None]:
# create vocabulary of words used in titles
import wordcloud
import matplotlib.pyplot as plt

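def gen_vocab(data,col,label):
    vocab = {}
    data = data.loc[data['label'] == label]
    # count how often each word occurs in the selected class
    for text in data[col]:
        for word in text.split():
            vocab[word] = vocab.get(word, 0) + 1
    vocab = dict(sorted(vocab.items(), key=lambda item: item[1]))
    # size the words by their counts (joining the unique keys would weight every word equally);
    # wordcloud's built-in stop words are dropped from the plot only, not from the returned vocab
    freqs = {word: n for word, n in vocab.items() if word.lower() not in wordcloud.STOPWORDS}
    cloud = wordcloud.WordCloud().generate_from_frequencies(freqs)
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    return vocab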

In [None]:
vocab_fake_title_unclean = gen_vocab(data,'title',1)

A collection of the most frequent words from the unfiltered titles of the fake news data. There are Trump, Obama and Hillary, and we also see words like Gun, Cop, Black, Muslim, Racist, American, Right and Lie, which suggests that fake news targets sensitive and delicate social and political issues to attract and influence readers!

In [None]:
vocab_real_title_unclean = gen_vocab(data,'title',0)

The collection of words from the unfiltered titles of the real news data. Apart from the common occurrence of the word Trump, the titles mostly focus on countries in relation to the United States, such as Russia, Saudi and UK. We can also see words like Anti, Ex and Post, probably referring to relations with a country or with politicians. Compared to fake news, they do not touch on social issues much.

In [None]:
vocab_fake_title_clean = gen_vocab(data,'lemmatized_title',1)

Lemmatized titles highlight the common verbs occurring in the titles. For the fake titles, we see words like politicize, shatter, berate, incriminate, 6 year old (wonder why!?) etc. This gives some sense of the kind of language fake news writers use.

In [None]:
vocab_real_title_clean = gen_vocab(data,'lemmatized_title',0)

For the real news titles we can spot words like endanger, scrutinize, swear and refute, and nouns like william, chahed (a politician), barca etc.

We can use the vocabularies we generated to filter rare words out of the texts, though this step is completely optional!

In [None]:
def rare_words(vocab, max_count=5):
    # words that occur at most max_count times in the vocabulary
    return [word for word, count in vocab.items() if count <= max_count]

In [None]:
rare_words_fake_title = rare_words(vocab_fake_title_clean)
rare_words_real_title = rare_words(vocab_real_title_clean)

Now we can use these lists of words to remove the rare words from the titles, with the help of regex or replace.
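
A minimal sketch of that removal step is shown here. It filters whitespace-separated tokens rather than using regex or `replace`, which avoids accidentally deleting a rare word that happens to be a substring of a longer word. `remove_rare_words` and the toy inputs are hypothetical and only stand in for `data['lemmatized_title']` and the rare-word lists.

```python
def remove_rare_words(texts, rare):
    """Drop every whitespace-separated token that appears in `rare`."""
    rare = set(rare)  # set membership is much faster than scanning a list
    return [" ".join(w for w in text.split() if w not in rare) for text in texts]

# Toy data standing in for the lemmatized titles and a rare-word list.
toy_titles = ["senator shatter record", "president visit barca"]
toy_rare = ["shatter", "barca"]
filtered_titles = remove_rare_words(toy_titles, toy_rare)
print(filtered_titles)
```

On the real data the same function could be applied to a column and the result assigned to a new column, so the unfiltered text stays available.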

# Feature Engineering on News Body:

The news body is a big chunk of insightful information. It can be used to summarise the content of the article, to run named entity recognition that identifies important information or terms in the article, and to answer questions about the news, of course with the help of new state-of-the-art models.

For statistical analysis, we can count the number of words and sentences and compute approximate average lengths, to see whether real and fake news differ noticeably on these features. We can also obtain n-grams with the help of TF-IDF vectorizers, which can be further used for semantic analysis of the data. I have explored a few of these techniques to obtain useful features!
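
To make the n-gram and TF-IDF idea concrete, here is a small pure-Python sketch on two toy sentences. It uses the smoothed inverse document frequency `ln((1 + N) / (1 + df)) + 1`, which to my understanding is the form scikit-learn applies with `smooth_idf=True`, and it leaves out the row normalisation that `TfidfVectorizer` performs by default. The helper names and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_min=2, n_max=3):
    """Raw term frequency times smoothed idf: ln((1 + N) / (1 + df)) + 1."""
    grams_per_doc = [Counter(ngrams(d.split(), n_min, n_max)) for d in docs]
    # document frequency: in how many documents does each n-gram occur?
    df = Counter(g for grams in grams_per_doc for g in grams)
    n_docs = len(docs)
    return [{g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
             for g, tf in grams.items()}
            for grams in grams_per_doc]

toy_docs = ["white house says no", "white house denies report"]
scores = tfidf(toy_docs)
print(sorted(scores[0].items(), key=lambda kv: kv[1]))
```

The bi-gram shared by both documents gets the lowest weight, while n-grams unique to one document are weighted higher, which is what makes them useful as discriminating features.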

In [None]:
## word count, sentence count, average lengths
data['word_count_body'] = data["text"].apply(lambda x: len(str(x).split(" ")))
data['char_count_body'] = data["text"].apply(lambda x: sum(len(word) for word in str(x).split(" ")))
data['sentence_count_body'] = data["text"].apply(lambda x: len(str(x).split(".")))
data['avg_word_length_body'] = data['char_count_body'] / data['word_count_body']
data['avg_sentence_length_body'] = data['word_count_body'] / data['sentence_count_body']
data.head()

In [None]:
## plot word counts of fake and real news data
fig = px.box(data, x="sentiment_polarity", y="word_count_body",color="label")
fig.update_layout(plot_bgcolor='rgba(0,0,0,0)')
fig.show()

We can see that fake news articles tend to have noticeably more words than real news articles. That's a good discovery, though not a very dependable one!

In [None]:
##stemming and lemmatizing body text
## This is a very time consuming method, any optimized solution is most welcome in the comments below!
stemmed_col , lemmatized_col = clean_data(data,'text')

In [None]:
data['stemmed_body'] = stemmed_col
data['lemmatized_body'] = lemmatized_col

In [None]:
## TF-IDF vectorizer on body text to obtain bi-grams and tri-grams of the dataset.
vectorizer = TfidfVectorizer(ngram_range=(2, 3),max_features=20000,smooth_idf=True)
tfidf_matrix = vectorizer.fit_transform(data['lemmatized_body'])
print(tfidf_matrix.shape)
## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out()[:50])

We can obtain embedding vectors from pre-trained embedding models and use them to build an embedding matrix, which can initialise the embedding layer of a sequential deep learning model. Embeddings place similar words close together in vector space, and the resulting vectors are essential for learning from sequences and for sequence classification with deep learning models.

I have used the 200-dimensional GloVe embeddings to generate vectors for this dataset.
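
Word similarity in an embedding space is commonly measured with cosine similarity. The sketch below shows the computation on made-up 3-dimensional vectors. The numbers are invented for illustration and are not real GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors; the GloVe vectors used here have 200 dimensions.
toy_vectors = {
    "president": [0.9, 0.1, 0.0],
    "senator": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
sim_related = cosine_similarity(toy_vectors["president"], toy_vectors["senator"])
sim_unrelated = cosine_similarity(toy_vectors["president"], toy_vectors["banana"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

The same function could be applied to two entries of a loaded word-vector dictionary, provided both words are present in it.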

In [None]:
## Load word vectors
embeddings_index = dict()
with open('/kaggle/input/glove-global-vectors-for-word-representation/glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # skip malformed lines instead of reusing the previous word's vector
            continue
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

In [None]:
## obtain sequence of tokens from data
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(data['lemmatized_body'])
body_sequence = tokenizer.texts_to_sequences(data['lemmatized_body'])
body_sequence = sequence.pad_sequences(body_sequence,maxlen=200)

In [None]:
## This generates a huge dictionary, hence I have commented it out, you could have a look if you want!
##print(tokenizer.word_counts)

In [None]:
## obtain word vectors for tokens from our dataset
tokens = len(tokenizer.word_index) + 2
embedding_matrix = np.zeros((tokens, 200))
count = 0
unknown = []
for word, i in tokenizer.word_index.items():
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector
  else:
    unknown.append(word)
    count += 1

In [None]:
print(f'Percentage of obtained word vectors for our dataset : {100*(1-(count/len(tokenizer.word_index)))} %')

This embedding matrix can now be used as the weights of an embedding layer to train a deep learning model! :)
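
Conceptually, an embedding layer initialised with such a matrix is a row lookup: every token index in a padded sequence is replaced by the corresponding row. The toy sketch below illustrates only that lookup with plain Python lists. `toy_matrix`, `embed` and the values are made up, and a real model would use the framework's embedding layer instead.

```python
# Toy embedding matrix: row i holds the vector of the token with index i.
# Row 0 is kept for the padding index and stays all zeros.
toy_matrix = [
    [0.0, 0.0],   # index 0: padding
    [0.1, 0.3],   # index 1
    [0.5, 0.2],   # index 2
]

def embed(sequence_ids, matrix):
    """Replace every token index with its row of the embedding matrix."""
    return [matrix[i] for i in sequence_ids]

padded = [0, 0, 2, 1]  # a left-padded sequence of length 4
embedded = embed(padded, toy_matrix)
print(embedded)
```

Words without a pre-trained vector keep an all-zero row in this scheme, so the share of covered words is worth checking before training.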

The HuggingFace🤗 library provides a range of state-of-the-art transformer models and pre-processing methods for natural language processing and understanding, text generation, intent classification, question answering and much more. <br>You can read more about these models on their website: [https://huggingface.co/transformers/index.html](https://huggingface.co/transformers/index.html)<br>In this notebook, I am exploring their tokenizers, transformer pipelines, processors etc. to learn how to tune them for the best results. My work is still in progress, so stay tuned for more techniques!
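
BERT's tokenizer splits words into sub-word pieces with a greedy longest-match-first search, marking continuation pieces with `##`. The sketch below shows only that core idea on a tiny made-up vocabulary. The real tokenizer also normalises and lower-cases the text, splits on punctuation and adds special tokens, and `wordpiece` and `toy_vocab` are hypothetical names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter prefix
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"un", "##believ", "##able", "news", "fake"}
print(wordpiece("unbelievable", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because unseen words are broken into known pieces, aggressive stemming or lemmatizing beforehand is less important for this kind of tokenizer than for a word-level vocabulary.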

In [None]:
#!pip install tokenizers

In [None]:
## Token generation by the BERT tokeniser
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-uncased-vocab.txt")

def bert_tokenizer(data,col):
    encoded_output = []
    for i in data[col]:
        encoded = tokenizer.encode(i)
        encoded_output.append(encoded)
    return encoded_output
data['encoded_body'] = bert_tokenizer(data,'lemmatized_body')

In [None]:
data['encoded_body'][:2]

In [None]:
encoded_sent = data['encoded_body'][0]
encoded_sent.offsets
encoded_sent.tokens

HuggingFace transformers provide a set of pipelines built on pre-trained models that perform complex tasks such as sentiment analysis, named entity recognition, summarisation and even question answering, to name a few. I have explored a few of these pipelines to gain insights and features from our dataset. <br>
Since these pipelines take up a lot of RAM, it is advisable to load them only as and when required!
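
One way to keep memory use and run time under control when many articles have to be processed is to feed the model small batches instead of one text per call. The sketch below shows the batching logic only. It assumes the model is a callable that takes a list of texts and returns one result per text, and `stub_model` is a stand-in so the example runs without downloading anything.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def run_in_batches(model, texts, batch_size=32):
    """Call `model` once per batch and flatten the per-batch results."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(model(batch))
    return results

# Stand-in for a pipeline: any callable mapping a list of texts to a list of results.
def stub_model(batch):
    return [{"length": len(text)} for text in batch]

toy_texts = ["short", "a longer text", "mid size", "x", "another one"]
out = run_in_batches(stub_model, toy_texts, batch_size=2)
print(out)
```

Whether batching actually speeds things up depends on the hardware and the text lengths, so it is worth timing on a small sample first.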

In [None]:
import transformers
from transformers import pipeline

In [None]:
## Obtain sentiment Analysis of the news body
sentimentAnalysis = pipeline("sentiment-analysis")

## The following loop isn't an optimised way to 
## obtain analysis of a large amount of data hence 
## i have commented it out.
# sentiment_analysis = []
# for body in data['lemmatized_title']:
#     sentiment = sentimentAnalysis(body)
#     sentiment_analysis.append(sentiment)
# print (sentiment_analysis[:5])

body = data['lemmatized_body'][0]
print(sentimentAnalysis(body))

In [None]:
## Named entity recognition
namedEntityRecgnition = pipeline("ner")

## The following loop isn't an optimised way to 
## obtain analysis of a large amount of data hence 
## i have commented it out.
# ner_body = []
# for body in data['lemmatized_body']:
#     neR = namedEntityRecgnition(body)
#     ner_body.append(neR)
# print (ner_body[:5])

print(namedEntityRecgnition(body))

In [None]:
## Extract summary of the data
summarizer = pipeline('summarization')
## The following loop isn't an optimised way to 
## obtain analysis of a large amount of data hence 
## i have commented it out.
# summary = []
# for body in data['lemmatized_body']:
#     sum = summarizer(body)
#     summary.append(sum)
# print (summary[:5])

print(summarizer(body))

The features generated above from the news body can be used for analysis and for gaining useful insights. So far we have generated a large number of features, which can be used for modelling and prediction. Stay tuned for more ideas on feature engineering in NLP!

# Work in Progress!

Please leave an upvote if you like my notebook! Any suggestions or corrections are welcome in the comments below, thanks!