In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer  # Builds a vocabulary, counts terms, and converts the counts to TF-IDF weights in one step.
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## PASSIVE-AGGRESSIVE ALGORITHMS

Passive-Aggressive algorithms are generally used for large-scale learning. They belong to the family of online learning algorithms. In online learning, the input data arrives sequentially and the model is updated one example (or one small batch) at a time, as opposed to batch learning, where the entire training dataset is used at once. This is useful when the dataset is so large that training on all of it at once is computationally infeasible. Put simply, an online learning algorithm receives a training example, updates the classifier, and then discards the example.

A good example is detecting fake news on a social media site such as Twitter, where new data is added every second. Reading from Twitter continuously would produce a huge volume of data, so an online learning algorithm is a natural fit.
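The "receive an example, update, discard" loop described above can be sketched with scikit-learn's `partial_fit`. This is only an illustration: the synthetic stream, the `make_batch` helper, and the values `C=0.5` and 20 batches are assumptions made for the example and are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)

def make_batch(n=50):
    """Hypothetical data source: label is 1 when the two features sum to > 0."""
    X = rng.normal(size=(n, 2))
    y = (X.sum(axis=1) > 0).astype(int)
    return X, y

clf = PassiveAggressiveClassifier(C=0.5, random_state=0)
classes = np.array([0, 1])  # partial_fit must be told every class label up front

for _ in range(20):
    X_batch, y_batch = make_batch()
    clf.partial_fit(X_batch, y_batch, classes=classes)  # update on this batch only
    # The batch is not stored anywhere; the next iteration overwrites it.

X_new, y_new = make_batch(200)
stream_accuracy = clf.score(X_new, y_new)
print(f"Accuracy on an unseen batch: {stream_accuracy:.2f}")
```

The model never sees more than one batch at a time, which is what makes this style of training workable when the full dataset does not fit in memory.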

Passive-Aggressive algorithms are somewhat similar to a Perceptron model, in the sense that they do not require a learning rate. However, they do include a regularization parameter.

Passive: If the prediction is correct (with enough margin), keep the model and make no changes, i.e., the example carries no information that warrants an update.

Aggressive: If the prediction is incorrect, update the model just enough to correct it on that example.
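The passive and aggressive cases can be written out in a few lines of NumPy. The sketch below assumes the commonly described PA-I variant with hinge loss and labels in {-1, +1}; the notebook text does not say which variant is meant, and `pa_update` is a hypothetical helper, not a scikit-learn function. In this formulation the model also updates when a prediction is correct but the margin is smaller than 1.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One PA-I step for a label y in {-1, +1}; returns the new weight vector."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w                             # passive: margin is large enough
    tau = min(C, loss / np.dot(x, x))        # step size, capped by C
    return w + tau * y * x                   # aggressive: move toward the example

w = np.zeros(2)
x = np.array([2.0, 2.0])

# First call: loss = 1, tau = min(1, 1/8) = 0.125, so w becomes [0.25, 0.25].
w_after = pa_update(w, x, y=+1, C=1.0)

# Second call on the same example: the margin is now 1, so nothing changes.
w_again = pa_update(w_after, x, y=+1, C=1.0)

print(w_after, w_again)
```

The cap `min(C, ...)` is where the regularization parameter enters: a small `C` limits how far a single example can move the weights.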


### LINK1 - https://www.geeksforgeeks.org/passive-aggressive-classifiers/
### LINK2 - https://www.youtube.com/watch?v=TJU8NfDdqNQ&ab_channel=VictorLavrenko

Important parameters:

C : The regularization parameter. It caps the step size of each update, so a larger C lets the model correct itself more aggressively after an incorrect prediction.

max_iter : The maximum number of passes (epochs) the model makes over the training data.

tol : The stopping criterion. If it is not None, training stops when (loss > previous_loss - tol). By default, it is set to 1e-3. With tol=None, the model always runs for the full max_iter passes.

### What is a TfidfVectorizer?
TF (Term Frequency): The number of times a word appears in a document is its Term Frequency. A higher value means the term appears more often in that document, so the document is a better match when the term is part of a search query.

IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur in many other documents, may be irrelevant. IDF measures how rare, and therefore how informative, a term is across the entire corpus.

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
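To make the TF and IDF ideas concrete, here is a small sketch on a made-up three-document corpus (the sentences are invented for illustration and do not come from `news.csv`). It uses the vectorizer's default settings, under which a term that appears in every document gets the lowest IDF weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example.
docs = [
    "the election results were announced",
    "the election was rigged claims viral post",
    "the weather was sunny",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_          # maps each term to its column index
idf = vectorizer.idf_                   # one IDF weight per column

print(tfidf.shape)                                   # (documents, unique terms)
print(f"idf('the')   = {idf[vocab['the']]:.3f}")     # appears in all 3 documents
print(f"idf('sunny') = {idf[vocab['sunny']]:.3f}")   # appears in 1 document
```

"the" occurs in every document and so carries little information, while "sunny" occurs in only one and receives a higher weight.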

### What is a PassiveAggressiveClassifier?
Passive-Aggressive algorithms are online learning algorithms. Such an algorithm remains passive when an example is classified correctly, and turns aggressive on a misclassification, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while changing the weight vector as little as possible.

In [2]:
df = pd.read_csv('news.csv')

In [3]:
df = df.drop('Unnamed: 0' ,axis=1)

In [None]:
X = df['text']

In [4]:
y = df['label']

In [5]:
map_dict = {'FAKE':0 , 'REAL':1}

In [6]:
y = y.map(map_dict)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.2, random_state=42)

In [8]:
# Initialize a TfidfVectorizer with English stop words and a maximum document
# frequency of 0.7 (terms that appear in more than 70% of documents are discarded).
# Stop words are the most common words in a language and are filtered out before
# processing natural language data.
# The vectorizer is fit on the training text only; the test text is transformed
# with the vocabulary and IDF weights learned from the training set.

bow_transformer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = bow_transformer.fit_transform(X_train)
tfidf_test = bow_transformer.transform(X_test)
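The cell above calls `fit_transform` on the training text and only `transform` on the test text. A minimal sketch of why that matters, using two invented training sentences and one invented test sentence (none of them from `news.csv`): the vocabulary is fixed at fit time, so words seen only at test time are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for this example.
train_docs = ["breaking news about the economy", "economy grows again"]
test_docs = ["aliens disrupt the economy"]

vec = TfidfVectorizer(stop_words="english")
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(sorted(vec.vocabulary_))                # terms learned from the training docs
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print("aliens" in vec.vocabulary_)            # unseen at fit time, so not a feature
```

Both matrices share the same columns, which is what allows a classifier trained on one to score the other.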

In [20]:
pac = PassiveAggressiveClassifier(max_iter=50, C=0.62, tol=None)
pac.fit(tfidf_train, y_train)

PassiveAggressiveClassifier(C=0.62, max_iter=50, tol=None)

In [21]:
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(score)
# scikit-learn metrics take the true labels first, then the predictions.
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: true label, columns: predicted label
print(f'Accuracy, rounded off: {round(score*100, 2)}%')  # Round to 2 decimal places.


# Accuracy of 93.69%, with precision and recall close to 0.94 for both classes.

0.9368587213891081
              precision    recall  f1-score   support

           0       0.93      0.94      0.94       628
           1       0.94      0.94      0.94       639

    accuracy                           0.94      1267
   macro avg       0.94      0.94      0.94      1267
weighted avg       0.94      0.94      0.94      1267

[[589  39]
 [ 41 598]]
Accuracy, rounded off: 93.69%
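As a quick sanity check, the accuracy can be recomputed by hand from the confusion matrix printed above. The diagonal entries are the correctly classified articles and the off-diagonal entries are the errors, so the result does not depend on which axis holds the predictions.

```python
# Counts taken from the confusion matrix printed above.
correct = 589 + 598   # diagonal: FAKE predicted as FAKE, REAL predicted as REAL
wrong = 41 + 39       # off-diagonal: the two kinds of misclassification

total = correct + wrong
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.4f}")
print(f"Accuracy, rounded off: {round(accuracy * 100, 2)}%")
```

The total matches the 1267 test articles in the classification report, and the ratio reproduces the value returned by `accuracy_score`.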


In [22]:
df['label'].value_counts()  # Balanced dataset, so accuracy is a meaningful metric here.

REAL    3171
FAKE    3164
Name: label, dtype: int64
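One way to see why class balance matters when reading the accuracy figure is to compare it with a trivial baseline. The sketch below uses the counts printed above and computes the accuracy of a hypothetical model that always predicts the most common label; the baseline itself is an illustration and is not part of the original analysis.

```python
# Label counts copied from the value_counts() output above.
counts = {"REAL": 3171, "FAKE": 3164}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting {majority_label}: {baseline_accuracy:.2%}")
```

Because the two classes are almost the same size, the baseline sits at roughly 50%, so an accuracy of about 94% reflects what the classifier learned and not a skewed label distribution.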