# Bag of n-grams
- Fake news is misinformation or disinformation that spreads by word of mouth and, more recently, through digital channels such as WhatsApp messages and social media posts.

- Fake news often spreads faster than real news and creates problems and fear among groups and in society.

- We will address this problem with classical NLP techniques by classifying whether a given message/text is Real or Fake.

- You will use a bag of n-grams to turn the text into features and then apply different classification algorithms.

- Sklearn's CountVectorizer has a built-in implementation of Bag of Words, and its ngram_range parameter extends it to a bag of n-grams.

## About Data: Fake News Detection
- Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

  - After the preparation steps below, the data consists of two columns:
    - text
    - label

  - text contains the statements or messages regarding a particular event/situation.

  - label tells whether the given text is Fake or Real (stored as 'Fake' or 'True' in the code below).

  - As there are only 2 classes, this is a binary classification problem.

In [7]:
# import pandas library
import pandas as pd

In [8]:
# load the fake data
fake_data = pd.read_csv('./Bag-of-n_grams/Fake.csv')

# add a column called label
fake_data['label'] = 'Fake'

# print 5 rows
fake_data.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake


In [9]:
# load true data
true_data = pd.read_csv('./Bag-of-n_grams/True.csv')

# add a column called label
true_data['label'] = 'True'

# print 5 rows
true_data.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",True
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",True
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",True
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",True
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",True


In [10]:
# concatenate fake_data and true_data into one dataframe
data = pd.concat([fake_data, true_data], axis=0)
data = data[['text', 'label']]
data.head()

Unnamed: 0,text,label
0,Donald Trump just couldn t wish all Americans ...,Fake
1,House Intelligence Committee Chairman Devin Nu...,Fake
2,"On Friday, it was revealed that former Milwauk...",Fake
3,"On Christmas day, Donald Trump announced that ...",Fake
4,Pope Francis used his annual Christmas Day mes...,Fake


In [11]:
# check distribution of labels
data['label'].value_counts()

label
Fake    23481
True    21417
Name: count, dtype: int64

In [12]:
# Add the new column "label_num" which gives a unique number to each of these labels
data['label_num'] = data['label'].map({'Fake': 0, 'True': 1})

# check the results with top 5 rows
data.head(5)

Unnamed: 0,text,label,label_num
0,Donald Trump just couldn t wish all Americans ...,Fake,0
1,House Intelligence Committee Chairman Devin Nu...,Fake,0
2,"On Friday, it was revealed that former Milwauk...",Fake,0
3,"On Christmas day, Donald Trump announced that ...",Fake,0
4,Pope Francis used his annual Christmas Day mes...,Fake,0


## Modelling without Pre-processing Text data

In [13]:
# import train-test-split from sklearn
from sklearn.model_selection import train_test_split

# assign feature and target variables
X = data['text']
y = data['label_num']

# Do the 'train-test' split with a test size of 20%, a random state of 42, and stratified sampling on y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [14]:
# print the shapes of X_train and X_test
print('X_train shape', X_train.shape)
print('X_test shape', X_test.shape)

X_train shape (35918,)
X_test shape (8980,)
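To see what `stratify=y` does in the split above, here is a small sketch on made-up labels. The toy series `X_toy` and `y_toy` are assumptions for illustration only; the point is that the class proportions are preserved in both the train and the test part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60% class 0 and 40% class 1
y_toy = pd.Series([0] * 60 + [1] * 40)
X_toy = pd.Series([f"doc {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the 60/40 class ratio of the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without stratification, a random split could leave one class over-represented in the test set, which would distort the per-class scores reported later.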


## Attempt 1 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:

  - using CountVectorizer with unigram, bigram, and trigrams.
  - use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean' distance.
  - print the classification report.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 3))),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.82      0.79      4696
           1       0.78      0.71      0.75      4284

    accuracy                           0.77      8980
   macro avg       0.77      0.77      0.77      8980
weighted avg       0.77      0.77      0.77      8980



## Attempt 2 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:

  - using CountVectorizer with unigram, bigram, and trigrams.
  - use KNN as the classifier with n_neighbors of 10 and metric as 'cosine' distance.
  - print the classification report.

In [16]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 3))),
    ('knnc', KNeighborsClassifier(n_neighbors=10, metric='cosine'))
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.98      0.78      4696
           1       0.95      0.41      0.57      4284

    accuracy                           0.71      8980
   macro avg       0.80      0.69      0.67      8980
weighted avg       0.79      0.71      0.68      8980
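To make the difference between the two metrics used in Attempts 1 and 2 concrete, here is a minimal sketch with hand-made count vectors. The vectors are hypothetical, and the sketch only illustrates how the metrics behave; it does not by itself explain the scores above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical count vectors over a 4-word vocabulary:
# doc_b is doc_a with every count tripled (same wording, longer text),
# doc_c uses completely different words.
doc_a = np.array([[1, 2, 0, 0]])
doc_b = np.array([[3, 6, 0, 0]])
doc_c = np.array([[0, 0, 1, 2]])

eu_ab = euclidean_distances(doc_a, doc_b)[0, 0]
eu_ac = euclidean_distances(doc_a, doc_c)[0, 0]
cos_ab = cosine_distances(doc_a, doc_b)[0, 0]
cos_ac = cosine_distances(doc_a, doc_c)[0, 0]

print("euclidean a-b:", eu_ab, " a-c:", eu_ac)
print("cosine    a-b:", cos_ab, " a-c:", cos_ac)
```

Euclidean distance treats the longer document `doc_b` as farther from `doc_a` than the unrelated `doc_c`, while cosine distance looks only at the direction of the vectors and treats `doc_a` and `doc_b` as identical.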



## Attempt 3 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:

  - using CountVectorizer with only trigrams.
  - use RandomForest as the classifier.
  - print the classification report.

In [18]:
# import random forest
from sklearn.ensemble import RandomForestClassifier


#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(3, 3))),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

## Attempt 4 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:

  - using CountVectorizer with both unigram and bigrams.
  - use Multinomial Naive Bayes as the classifier with an alpha value of 0.75.
  - print the classification report.

In [None]:
# import MultinomialNB
from sklearn.naive_bayes import MultinomialNB


#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('mb', MultinomialNB(alpha=0.75))
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))
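The `alpha` value used in Attempt 4 is the additive smoothing parameter of Multinomial Naive Bayes. The sketch below, on a made-up count matrix, checks that the smoothed word probabilities for one class can be reproduced by hand as (count + alpha) / (total count + alpha * vocabulary size). The matrix and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 3 documents x 3 vocabulary terms
X_counts = np.array([[2, 1, 0],
                     [1, 0, 0],
                     [0, 1, 3]])
y_toy = np.array([0, 0, 1])

alpha = 0.75
nb = MultinomialNB(alpha=alpha).fit(X_counts, y_toy)

# Manual smoothed estimate of P(word | class 0)
class0_counts = X_counts[y_toy == 0].sum(axis=0)   # per-word counts in class 0
vocab_size = X_counts.shape[1]
manual = (class0_counts + alpha) / (class0_counts.sum() + alpha * vocab_size)

# Same quantity as stored by sklearn (in log space)
from_sklearn = np.exp(nb.feature_log_prob_[0])

print(manual)
print(from_sklearn)
```

Because of the smoothing term, a word that never appears in a class still gets a small non-zero probability, so a single unseen n-gram does not zero out the whole class score.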

## Use text pre-processing to remove stop words and punctuation and apply lemmatization

In [None]:
# import spacy
import spacy

# load english language model and create nlp object
nlp = spacy.load('en_core_web_sm')


# define the preprocess function
def preprocess(text):
    # remove stop words and lemmatize
    doc = nlp(text)
    filtered_text = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_text.append(token.lemma_)
    
    return ' '.join(filtered_text)

In [None]:
# create a new column "preprocessed_txt" and use the utility function above to get the clean data
data['preprocessed_txt'] = data['text'].apply(preprocess)
data.head()

## Build a model with pre-processed text

In [None]:
# use the pre-processed column created above as the feature
X = data['preprocessed_txt']
y = data['label_num']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### Let's check the scores with our best model till now

- Random Forest

### Attempt 1 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:
    - using CountVectorizer with only trigrams.
    - use RandomForest as the classifier.
    - print the classification report.

In [None]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(3, 3))),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

### Attempt 2 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:
    - using CountVectorizer with unigram, bigram, and trigrams.
    - use RandomForest as the classifier.
    - print the classification report.

In [None]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 3))),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

### Confusion matrix

In [None]:
# print the confusion matrix for the best model using heatmap

# import confusion_matrix from sklearn
from sklearn.metrics import confusion_matrix

# import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# rows are the true labels, columns are the predicted labels
cm = confusion_matrix(y_test, y_pred)
print(cm)

# plot the matrix as an annotated heatmap with integer counts
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Prediction')
plt.ylabel('Truth')
plt.show()
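To connect the confusion matrix to the precision and recall values in the classification reports, here is a small sketch on made-up labels. The lists `y_true_toy` and `y_pred_toy` are hypothetical; for a binary problem, `ravel()` returns the four cells in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (0 = Fake, 1 = True)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predictions
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision and recall for class 1, computed by hand
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
```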

## Final Observations
- Machine learning algorithms do not work on text data directly, so we need to convert the text into numeric vectors and feed those to the models during training.

- In this process, we convert each text into a very high-dimensional numeric vector using the Bag of Words (here, bag of n-grams) technique, and we use sklearn's CountVectorizer for this.
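To get a feel for how quickly the number of features grows with `ngram_range`, here is a minimal sketch on a three-sentence toy corpus. The sentences are made up for this example; on the real dataset the vocabulary is far larger.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, not from the dataset)
corpus = [
    "the senator said the budget vote was delayed",
    "the president said the vote was a success",
    "officials said the budget was approved",
]

# Count the number of distinct features for several n-gram ranges
vocab_sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3), (3, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(corpus)
    vocab_sizes[ngram_range] = len(vec.vocabulary_)
    print(ngram_range, vocab_sizes[ngram_range])
```

Every extra n-gram order adds new columns on top of the existing ones, which is why a range of (1, 3) always yields at least as many features as (1, 2) on the same corpus.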

### Without Pre-Processing Data

- In most of the cases above, performance tends to degrade when CountVectorizer goes up to trigrams or uses only trigrams. The most likely reason is that as ngram_range increases, the number of dimensions/features (possible combinations of words) grows enormously, so the models risk overfitting and performing poorly.

- For this reason, KNN struggled when used with unigram-to-trigram features and Euclidean distance. K-Nearest Neighbours (KNN) does not work well with high-dimensional data because, in a very high-dimensional and sparse space, distances between points become less informative and more expensive to compute. In the run above it reached 0.77 accuracy, with a recall of 0.82 for class 0 but only 0.71 for class 1.

- Cosine distance uses the angle between two text vectors rather than their length, so it is less sensitive to document length than Euclidean distance. In the run above, however, the same KNN model with cosine distance did not improve the overall scores: recall for class 0 rose to 0.98, but recall for class 1 fell to 0.41 and accuracy dropped from 0.77 to 0.71.

- With respect to the Naive Bayes and Random Forest models, both performed really well, and Random Forest with trigrams has an edge on the recall metric.

- Random Forest uses bootstrapping (row and column sampling) across many decision trees, which helps it overcome the high variance and overfitting that come with high-dimensional data, and it relies on the most informative words to separate the categories.

- Multinomial Naive Bayes is a text-classification-friendly algorithm mainly because the probabilities of the words in the corpus (Bag of Words) are easy to calculate per class and can be stored in a contingency table.

### With Pre-Processing Data

- We trained the best model so far, Random Forest, on the pre-processed data, but Random Forest with only trigrams fails to produce the same results here.

- However, the same Random Forest with unigram-to-trigram features produces very good results and tops the entire list, with very good F1 and recall scores.

