### Loading the libraries

In [1]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

### Reading data

In [2]:
df=pd.read_csv("news.csv")

### First 5 Rows of data

In [3]:
df.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


### Shape of the data

In [4]:
df.shape

(6335, 4)

### Checking for null values

In [5]:
df.isnull().sum()

Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

In [6]:
labels=df.label

In [7]:
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

### Building a Machine Learning Model

Let's build a machine learning model from the existing data that predicts whether a news article supplied by a user is **Real** or **Fake**.

Before relying on the model, we need to check whether it works on news it has not seen before and whether its predictions can be trusted.

We cannot use the training data to evaluate the model, because the model can memorize the whole training set and then predict the correct label for every point in it. This "remembering" tells us nothing about whether the model generalizes, that is, whether it performs well on **new data**.

**Therefore, we split our data into two parts.**    
**- Training Set:** One part is used for **training the model**.    
**- Test Set:** The second part is used to **test the model**.

**scikit-learn** comes with a function, `train_test_split`, that **shuffles the dataset** and **splits it for you**.     
**As a Rule of Thumb:**   
- Use 75% of the data for the training set
- Use 25% of the data for the test set    

How much data to use for training and testing is somewhat **arbitrary**; a 75/25 split is a **good rule of thumb**. The split in this notebook (cell `In [8]`) uses `test_size=0.2`, which gives an 80/20 split instead: 5068 training rows and 1267 test rows.
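As a minimal sketch of how the split proportion is controlled, the snippet below applies the 75/25 rule of thumb to a toy dataset. The toy documents, labels, and variable names are made up for illustration; `test_size` and `random_state` are the same `train_test_split` parameters used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 "documents" with alternating labels.
texts = [f"document {i}" for i in range(20)]
toy_labels = ["FAKE" if i % 2 == 0 else "REAL" for i in range(20)]

# test_size=0.25 holds out 25% of the rows; random_state fixes the shuffle
# so the same rows land in each part on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, toy_labels, test_size=0.25, random_state=0
)

print(len(X_tr), len(X_te))  # 15 training rows, 5 test rows
```

Every row ends up in exactly one of the two parts, so the test set stays unseen during training.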

In [8]:
X_train,X_test,Y_train,Y_test= train_test_split(df["text"],labels,test_size=0.2,random_state=20)

### Checking shape of Training and Test data

In [9]:
print("Shape of Training Data X_train: ", X_train.shape)
print("Shape of Training Data y_train: ", Y_train.shape)

Shape of Training Data X_train:  (5068,)
Shape of Training Data y_train:  (5068,)


In [10]:
print("Shape of Test Data X_test: ", X_test.shape)
print("Shape of Test Data y_test: ", Y_test.shape)

Shape of Test Data X_test:  (1267,)
Shape of Test Data y_test:  (1267,)


In [11]:
X_train.head()

4741    NAIROBI, Kenya — President Obama spoke out Sun...
2089    Killing Obama administration rules, dismantlin...
4074    Dean Obeidallah, a former attorney, is the hos...
5376      WashingtonsBlog \nCNN’s Jake Tapper hit the ...
6028    Some of the biggest issues facing America this...
Name: text, dtype: object

### TfidfVectorizer

**TF (Term Frequency):** The number of times a word appears in a document is its term frequency. A higher value means the term appears more often in that document, so the document is a better match when the term is one of the search terms.

**IDF (Inverse Document Frequency):** Words that occur many times in a document, but also occur in many other documents, may be irrelevant. IDF measures how significant a term is across the entire corpus: the rarer the term, the higher its IDF.

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.

**Example:** When you enter a query into a search engine, TF-IDF values can help it rank the documents most relevant to that query.
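To make the IDF idea concrete, here is a small sketch on a made-up three-document corpus (the corpus and variable names are assumptions for illustration, not part of the dataset). It uses the vectorizer's default settings, so the weights differ from those produced with `stop_words='english'` and `max_df=0.7`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "news" appears in every document, "aliens" in one.
corpus = [
    "news about the election",
    "news about the economy",
    "news about aliens",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: documents x terms

# Map each vocabulary term to its learned IDF weight.
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

print(tfidf.shape)                   # one row per document, one column per term
print(idf["news"] < idf["aliens"])   # the common term gets the smaller weight
```

A term that appears in every document carries little information about any one of them, which is why its IDF is the lowest.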

In [12]:
vector=TfidfVectorizer(stop_words='english',max_df=0.7)

In [13]:
tf_train=vector.fit_transform(X_train)
tf_test=vector.transform(X_test)

### Passive Aggressive Classifier

Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive when an example is classified correctly and turns aggressive on a misclassification, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while changing the weight vector as little as possible.
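The passive and aggressive cases can be sketched in a few lines of NumPy. This is a simplified illustration of the basic update rule for labels in {-1, +1}, not scikit-learn's implementation (which also applies a regularization parameter `C`); the function name `pa_update` is hypothetical.

```python
import numpy as np

def pa_update(w, x, y):
    """One basic Passive-Aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w                    # passive: margin already satisfied
    tau = loss / np.dot(x, x)       # smallest step that brings the margin to 1
    return w + tau * y * x          # aggressive: correct the mistake

w = np.zeros(2)
x = np.array([1.0, 2.0])

w = pa_update(w, x, +1)        # margin was 0, so the weights are updated
w_again = pa_update(w, x, +1)  # margin is now 1, so nothing changes
print(w, w_again)
```

The step size `tau` grows with the loss, so a badly misclassified example triggers a larger correction than one that is only slightly off.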

In [14]:
fnd=PassiveAggressiveClassifier(max_iter=50)
fnd.fit(tf_train,Y_train)

PassiveAggressiveClassifier(max_iter=50)

In [15]:
y_pred=fnd.predict(tf_test)

### Confusion matrix

A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. It can be used to evaluate the performance of a classification model through the calculation of performance metrics like accuracy, precision, recall, and F1-score.

In [16]:
confusion_matrix(Y_test,y_pred,labels=['FAKE','REAL'])

array([[624,  24],
       [ 40, 579]], dtype=int64)
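The metrics mentioned above can be derived directly from these four counts. The sketch below copies the matrix from the output and computes them for the `FAKE` class; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the output above. Rows are true labels and columns are
# predictions, both in the order ['FAKE', 'REAL'].
cm = np.array([[624, 24],
               [40, 579]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision_fake = cm[0, 0] / cm[:, 0].sum()  # of predicted FAKE, share truly FAKE
recall_fake = cm[0, 0] / cm[0, :].sum()     # of true FAKE, share that was found
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)

print(f"accuracy={accuracy:.4f} precision={precision_fake:.4f} "
      f"recall={recall_fake:.4f} f1={f1_fake:.4f}")
```

The accuracy computed this way, 1203 correct out of 1267, agrees with the `accuracy_score` value reported in this notebook.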

### Saving model

In [17]:
import pickle

In [18]:
filename='finalized_model.pkl'
with open(filename,'wb') as f:  # the file is closed once the dump finishes
    pickle.dump(fnd,f)

### Checking accuracy score


In [20]:
score=accuracy_score(Y_test,y_pred)
score

0.9494869771112865

### Saving vectorizer

In [19]:
filename='vectorizer.pkl'
with open(filename,'wb') as f:  # the file is closed once the dump finishes
    pickle.dump(vector,f)
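Both objects are saved because a reloaded classifier can only score text that has been transformed by the same fitted vectorizer. The sketch below shows the round trip on toy stand-ins: the documents, labels, and names `vec`, `clf`, `vec2`, and `clf2` are assumptions for illustration, and `pickle.dumps`/`pickle.loads` stand in for writing and reading the `.pkl` files.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Toy stand-ins (hypothetical) for the fitted `vector` and `fnd` objects.
docs = ["shocking secret they hide", "senate passes budget bill",
        "miracle cure they hide", "court rules on budget bill"]
y = ["FAKE", "REAL", "FAKE", "REAL"]
vec = TfidfVectorizer().fit(docs)
clf = PassiveAggressiveClassifier(max_iter=50, random_state=0)
clf.fit(vec.transform(docs), y)

# Round-trip through pickle, as happens when the saved files are reloaded.
vec2 = pickle.loads(pickle.dumps(vec))
clf2 = pickle.loads(pickle.dumps(clf))

# New text must go through the reloaded vectorizer before prediction.
new_text = ["budget bill passes"]
print(clf2.predict(vec2.transform(new_text)))
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.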