# Introduction

This project aims to predict whether a news item is fake or real. It relies on two scikit-learn tools: **TfidfVectorizer** and **PassiveAggressiveClassifier**.

### TFIDF Vectorizer

To understand what **TfidfVectorizer** does, you need to understand two important concepts:
- **Term Frequency (TF)** - Often used in text mining, NLP, and information retrieval, term frequency tells you how frequently a term occurs in a document. In the context of natural language, terms correspond to words or phrases. Since documents differ in length, a term is likely to appear more often in longer documents than in shorter ones. Thus, term frequency is often divided by the total number of terms in the document as a way of normalization.

> TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

- **Inverse Document Frequency (IDF)** - Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the word becomes. For example, the word "the" appears in almost all English texts and would thus have a very low IDF score as it carries very little topic information. In contrast, if you take the word "coffee", while it is common, it’s not used as widely as the word "the". Thus, "coffee" would have a higher IDF score than "the".

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
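
To make the two definitions concrete, here is a minimal sketch that computes TF and IDF by hand on a tiny made-up corpus. The corpus and the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are assumptions for illustration only. It uses the textbook formula IDF(t) = ln(N / number of documents containing t); scikit-learn's `TfidfVectorizer` applies a smoothed variant and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the coffee is hot",
    "the tea is cold",
    "the weather is hot today",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(term, tokens):
    """TF(t) = occurrences of t in the document / total terms in the document."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """Textbook IDF(t) = ln(N / number of documents containing t).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """TF-IDF is the product of the two weights."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


tf_the = term_frequency("the", tokenized[0])                # 1 of 4 terms
idf_the = inverse_document_frequency("the", tokenized)      # appears in every document
idf_coffee = inverse_document_frequency("coffee", tokenized)  # appears in one document
score_coffee = tf_idf("coffee", tokenized[0], tokenized)

print(tf_the, idf_the, idf_coffee, score_coffee)
```

In this toy corpus "the" appears in every document, so its IDF is zero and it contributes nothing to the TF-IDF score, while "coffee" appears in only one document and gets a positive weight.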

### Passive Aggressive Classifier

Passive-Aggressive algorithms get their name from how they react to each training example:

- Passive: If the prediction is correct, keep the model and do not make any changes, i.e., the data in the example is not enough to cause any changes in the model.
- Aggressive: If the prediction is incorrect, make changes to the model, i.e., adjust the model just enough to correct the mistake.
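
The two behaviours above can be sketched as a single update step. This is a simplified illustration of the PA-I rule for binary labels in {-1, +1}, not scikit-learn's actual implementation; the function names `dot` and `pa_update` and the toy vectors are assumptions. Note that in this formulation the model stays passive only when the example is classified correctly with a margin of at least 1 (zero hinge loss), which is slightly stricter than "the prediction is correct".

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def pa_update(weights, x, y, C=1.0):
    """One PA-I step for a label y in {-1, +1}.

    Assumes x is a non-zero feature vector. Returns a new weight list.
    """
    loss = max(0.0, 1.0 - y * dot(weights, x))  # hinge loss
    if loss == 0.0:
        return list(weights)  # passive: the example is already handled well
    tau = min(C, loss / dot(x, x))  # aggressive: step size, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]


w0 = [0.0, 0.0]
w1 = pa_update(w0, x=[1.0, 1.0], y=+1)  # loss is 1, so the weights move
w2 = pa_update(w1, x=[1.0, 1.0], y=+1)  # loss is now 0, so nothing changes
w3 = pa_update(w1, x=[1.0, 0.0], y=-1)  # misclassified, step capped at C

print(w1, w2, w3)
```

The parameter `C` limits how far a single example can move the weights, which makes the update less sensitive to noisy or mislabeled examples.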

### Important

For more information on the topics covered above, see the "references" section of this notebook, where you will find several useful links.

# Import libs

In [None]:
import numpy as np
import pandas as pd
import itertools

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Read the data

In [None]:
fake_news_data = pd.read_csv('../input/textdb3/fake_or_real_news.csv')

print(fake_news_data.shape)
fake_news_data.head()

The dataset columns are very intuitive:
- **title**: the title of the news item
- **text**: the body text of the news item
- **label**: whether the news item is real or fake

The column **Unnamed: 0** doesn't interest us.

In [None]:
fake_news_data['label'].value_counts()

It's a small but well-balanced dataset, with almost the same number of fake and real news items.

# Modeling

### Select labels

In [None]:
labels = fake_news_data.label

### Separation between training and testing

In [None]:
# The "text" column is used as X and the labels are used as y
X_train, X_test, y_train, y_test = train_test_split(fake_news_data['text'], labels, test_size=0.2, random_state=42)

### Tfidf vectorizer

In [None]:
# For more information about TfidfVectorizer parameters, check the "references" section
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit and transform train data and transform test data
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
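
To illustrate what `max_df=0.7` does in the cell above, here is a rough sketch of the filtering idea on a made-up tokenized corpus: terms that appear in more than 70% of the documents are dropped from the vocabulary. The corpus and the helper `document_frequency` are assumptions for illustration; the real vectorizer also handles tokenization and stop-word removal.

```python
# Toy tokenized corpus (hypothetical, for illustration only)
corpus = [
    ["election", "results", "announced", "today"],
    ["election", "fraud", "claims", "today"],
    ["election", "coverage", "continues"],
    ["weather", "warning", "today"],
]
max_df = 0.7


def document_frequency(term, corpus):
    """Fraction of documents that contain the term at least once."""
    return sum(1 for tokens in corpus if term in tokens) / len(corpus)


vocabulary = sorted({term for tokens in corpus for term in tokens})

# Keep terms whose document frequency does not exceed the threshold
kept = [t for t in vocabulary if document_frequency(t, corpus) <= max_df]
dropped = [t for t in vocabulary if document_frequency(t, corpus) > max_df]

print("kept:", kept)
print("dropped:", dropped)
```

Here "election" and "today" each appear in 3 of 4 documents (75%), so they are treated as too common to be informative and are removed.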

### Create Passive Aggressive model

In [None]:
# Passive Aggressive Classifier
pa_clf = PassiveAggressiveClassifier(max_iter=50)

# Fit the classifier
pa_clf.fit(tfidf_train, y_train)

# Make predictions with test data
preds = pa_clf.predict(tfidf_test)

# Evaluate

### Score

In [None]:
# Accuracy
score = accuracy_score(y_test, preds)

# Print accuracy
print('Accuracy = {}%'.format(round(score * 100, 2)))

Wow, 93.69% accuracy is a good result for such a simple model!

### Confusion Matrix

In [None]:
confusion_matrix(y_test, preds)
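
As a rough guide to reading the matrix above, the sketch below rebuilds a confusion matrix by hand from a handful of made-up labels and predictions. The label strings `"FAKE"` and `"REAL"` and the sample values are assumptions for illustration. The layout follows the convention scikit-learn uses: rows are true labels, columns are predicted labels, and the diagonal holds the correct predictions.

```python
from collections import Counter

# Hypothetical true labels and predictions (not taken from the dataset)
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]
labels = ["FAKE", "REAL"]

# Count each (true, predicted) pair
counts = Counter(zip(y_true, y_pred))

# Rows: true label, columns: predicted label
matrix = [[counts[(t, p)] for p in labels] for t in labels]

# Accuracy is the sum of the diagonal divided by the total number of examples
correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)
```

The off-diagonal cells show the two kinds of mistakes separately: fake news predicted as real, and real news predicted as fake.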

I hope you enjoyed this simple tutorial.

# References

- [What is Term-Frequency?](https://kavita-ganesan.com/what-is-term-frequency/#.Xb2W0pNKjm0)
- [What is Inverse Document Frequency (IDF)?](https://kavita-ganesan.com/what-is-inverse-document-frequency/)
- [Passive Aggressive Classifiers](https://www.geeksforgeeks.org/passive-aggressive-classifiers/)