In [1]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix


In [4]:
df = pd.read_csv('E:/projects/news/news.csv')
df.shape

(6335, 4)

In [5]:
labels=df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [6]:
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.25, random_state=7)

In [7]:
x_test

3534    A day after the candidates squared off in a fi...
6265    VIDEO : FBI SOURCES SAY INDICTMENT LIKELY FOR ...
3123    It's debate season, where social media has bro...
3940    Mitch McConnell has decided to wager the Repub...
2856    Donald Trump, the actual Republican candidate ...
                              ...                        
678     There is an path for Democrats to regain the p...
6175                                                     
126     The Republican presidential contest is not, re...
1759    By Amanda Froelich at trueactivist.com\nThe ne...
2457    David Franzoni, the writer of Gladiator, annou...
Name: text, Length: 1584, dtype: object

#### Initialize a TfidfVectorizer with English stop words and a maximum document frequency of 0.7 (terms that appear in more than 70% of the documents are discarded).
#### TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.
#### The tf-idf of a term t in a document d is tf-idf(t, d) = tf(t, d) * idf(t). With smooth_idf=False, idf(t) = log [ n / df(t) ] + 1, where n is the total number of documents in the document set and df(t) is the document frequency of t, that is, the number of documents that contain t.
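To make the formula concrete, here is a minimal pure-Python sketch that applies it to a toy corpus. The three documents, the whitespace tokenizer, and the helper names `idf` and `tfidf` are assumptions made for illustration; they are not part of the news dataset or of scikit-learn. Note also that TfidfVectorizer L2-normalizes each row by default, so the values it returns will differ from these unnormalized ones.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "senate passes budget",
    "house passes bill",
    "senate senate hearing",
]

# Naive whitespace tokenization; TfidfVectorizer uses its own tokenizer.
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# df(t): number of documents that contain term t at least once.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """idf(t) = ln(n / df(t)) + 1, the smooth_idf=False variant."""
    return math.log(n_docs / doc_freq[term]) + 1


def tfidf(term, tokens):
    """tf-idf(t, d) = tf(t, d) * idf(t), with tf as the raw term count."""
    return tokens.count(term) * idf(term)


idf_senate = idf("senate")    # appears in 2 of 3 documents
idf_hearing = idf("hearing")  # appears in 1 of 3 documents
tfidf_senate_doc3 = tfidf("senate", tokenized[2])  # tf = 2 in the third document

print(f"idf(senate)  = {idf_senate:.4f}")
print(f"idf(hearing) = {idf_hearing:.4f}")
print(f"tf-idf(senate, doc 3) = {tfidf_senate_doc3:.4f}")
```

The rarer term ("hearing") gets the larger idf, which is the point of the weighting: words that occur in few documents are treated as more informative than words that occur in most of them.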

In [8]:

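# Drop English stop words and terms that occur in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7, smooth_idf=False)

# Learn the vocabulary and idf weights on the training set only, then reuse them on the test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)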

#### Next, we'll initialize a PassiveAggressiveClassifier, an online linear classifier, and fit it on tfidf_train and y_train.

In [9]:
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

y_pred=pac.predict(tfidf_test)
y_pred

array(['REAL', 'FAKE', 'REAL', ..., 'REAL', 'FAKE', 'FAKE'], dtype='<U4')
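For intuition about what the classifier does during fitting, here is a rough sketch of a single passive-aggressive update (the PA-I variant, which scikit-learn's documentation associates with the default hinge loss). The function `pa_update`, the dense list representation, and the +1/-1 label encoding are assumptions for illustration; the library implementation works on sparse matrices, handles an intercept, and makes several shuffled passes over the data.

```python
def pa_update(weights, x, y, C=1.0):
    """Apply one PA-I update and return the new weight vector.

    weights, x: equal-length lists of floats; y: +1 or -1.
    C caps the step size (hypothetical default, mirroring C=1.0).
    """
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return list(weights)  # passive: margin is satisfied, no change
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return list(weights)  # an all-zero sample carries no information
    tau = min(C, loss / sq_norm)  # aggressive: step just large enough, capped at C
    return [w + tau * y * xi for w, xi in zip(weights, x)]


w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)  # misclassified (margin 0): weights move
w2 = pa_update(w1, [1.0, 0.0], +1)  # now margin is 1: weights stay put
w3 = pa_update(w2, [1.0, 0.0], -1)  # same features, opposite label: weights move back
print(w1, w2, w3)
```

The name reflects the two branches: the model stays passive when a sample is classified with enough margin, and corrects aggressively when it is not.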

#### Accuracy score

In [10]:
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.49%


In [11]:
## confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[749,  59],
       [ 60, 716]], dtype=int64)

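Rows are the true labels and columns are the predicted labels, both in the order FAKE, REAL. Treating FAKE as the positive class, this model gives 749 true positives, 716 true negatives, 60 false positives (real articles flagged as fake), and 59 false negatives (fake articles predicted as real).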
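As a quick check, the counts in the confusion matrix can be turned into accuracy, precision, and recall by hand. The sketch below hard-codes the four counts from the output above and treats FAKE as the positive class; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (FAKE = positive class).
tp, fn = 749, 59   # true FAKE: predicted FAKE / predicted REAL
fp, tn = 60, 716   # true REAL: predicted FAKE / predicted REAL

total = tp + fn + fp + tn

accuracy = (tp + tn) / total   # share of all test articles classified correctly
precision = tp / (tp + fp)     # of articles flagged FAKE, how many really are
recall = tp / (tp + fn)        # of truly FAKE articles, how many were caught

print(f"Test samples: {total}")
print(f"Accuracy:  {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall:    {recall * 100:.2f}%")
```

The total matches the 1584 test rows, and the accuracy agrees with the 92.49% reported by accuracy_score.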

We detected fake news with Python. We took a political news dataset, vectorized the text with a TfidfVectorizer, initialized a PassiveAggressiveClassifier, and fit our model. The model reached an accuracy of 92.49% on the test set.