In [1]:
import numpy as np # type: ignore
import pandas as pd # type: ignore
import itertools
from sklearn.model_selection import train_test_split # type: ignore
from sklearn.feature_extraction.text import TfidfVectorizer # type: ignore
from sklearn.linear_model import PassiveAggressiveClassifier # type: ignore
from sklearn.metrics import accuracy_score, confusion_matrix # type: ignore

In [2]:
df = pd.read_csv('news.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [4]:
df.shape

(6335, 4)

In [5]:
# Get the labels
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [6]:
# Split the data: 80% for training, 20% held out for testing
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=7)

1. Random State (random_state):
Setting random_state=7 fixes the seed of the random number generator that train_test_split uses to shuffle the rows before splitting them. With the same seed and the same data, the split is identical on every run, so anyone who reruns the code gets the same training and test sets. The number 7 has no special meaning; any fixed integer would work. What matters is that the seed stays the same, which keeps the results reproducible.
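The effect of a fixed seed can be checked directly. The sketch below uses a small made-up list of documents (`docs` and `tags` are illustrative names, not part of the news dataset) and splits it twice with the same random_state.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data; these names and values are hypothetical.
docs = [f"doc {i}" for i in range(10)]
tags = ["FAKE" if i % 2 else "REAL" for i in range(10)]

# Split twice with the same seed.
a_train, a_test, _, _ = train_test_split(docs, tags, test_size=0.2, random_state=7)
b_train, b_test, _, _ = train_test_split(docs, tags, test_size=0.2, random_state=7)

# Same seed and same data give the same split.
same_split = (a_train == b_train) and (a_test == b_test)
print("Identical splits:", same_split)
print("Test set size:", len(a_test))
```

Changing the seed, or leaving random_state unset, would generally produce a different shuffle on each run.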

2. Stop Words:
Stop words are common words like “and,” “the,” or “is” that usually don’t carry much meaning on their own. Removing these words helps focus on the more important words in the documents.

3. Maximum Document Frequency (max_df):
This parameter controls which words to include based on how common they are across all documents. Setting max_df=0.7 means you only want to include words that appear in 70% or fewer of the documents. Words that appear in more than 70% of the documents are considered too common and are discarded.
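A tiny example can make both filters visible. This is a sketch on a hypothetical four-sentence corpus, not the news data: "the" is dropped as an English stop word, and "election" is dropped by max_df because it appears in 3 of the 4 documents (75%, which is above the 70% cutoff).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus for illustration only.
corpus = [
    "the election results shocked voters",
    "the election coverage was misleading",
    "the senator praised the election turnout",
    "the storm damaged the coastal town",
]

vec = TfidfVectorizer(stop_words="english", max_df=0.7)
vec.fit(corpus)

# vocabulary_ maps each kept term to its column index.
vocab = set(vec.vocabulary_)
print("'the' kept?     ", "the" in vocab)
print("'election' kept?", "election" in vocab)
print("'storm' kept?   ", "storm" in vocab)
```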

In [7]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

In [8]:
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)

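tfidf_vectorizer.fit_transform(x_train) learns the vocabulary and the IDF weights from the training data only, then converts the training texts into a TF-IDF feature matrix.

tfidf_vectorizer.transform(x_test) converts the test data using the vocabulary and weights already learned from the training data. The vectorizer is never fit on the test set, so no information from the test data leaks into training, and words that occur only in the test set are ignored.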
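The difference between the two calls shows up clearly on a toy example. In this sketch (the sentences and variable names are made up for illustration), the test sentence contains words that the vectorizer never saw during fitting, and they simply get no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test texts.
train_docs = ["markets rallied after the vote", "vote count delayed in two states"]
test_docs = ["aliens rigged the vote"]

vec = TfidfVectorizer(stop_words="english")
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary + IDF, then transforms
test_matrix = vec.transform(test_docs)        # reuses the learned vocabulary

# Both matrices share the same columns (one per training-vocabulary term).
print("Train shape:", train_matrix.shape)
print("Test shape: ", test_matrix.shape)

# 'aliens' was never seen in training, so it has no column.
print("'aliens' in vocabulary?", "aliens" in vec.vocabulary_)
print("Non-zero entries in test row:", test_matrix.nnz)
```

Only "vote" from the test sentence matches a learned term, so the test row has a single non-zero entry.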

In [9]:
pac = PassiveAggressiveClassifier(max_iter=50)

max_iter sets the maximum number of passes (epochs) the classifier makes over the training data. Imagine you’re teaching a robot to recognize different types of fruit. You show the robot the same fruit samples several times, and on each pass it adjusts what it has learned. max_iter is the limit on how many times the robot may review the samples. Here the model makes at most 50 passes over the training set, and it may stop earlier if its stopping criterion is met.
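To see the limit in action, the fitted classifier exposes how many passes it actually used through its n_iter_ attribute. The sketch below runs on synthetic data from make_classification (an assumption made purely for illustration, not the news dataset) and compares the passes used with the limit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

# Synthetic stand-in data; not the news dataset.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = PassiveAggressiveClassifier(max_iter=50, random_state=0)
clf.fit(X, y)

# n_iter_ is the number of passes actually made; it never exceeds max_iter.
epochs_used = clf.n_iter_
print(f"Passes used: {epochs_used} (limit: {clf.max_iter})")
```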

In [10]:
pac.fit(tfidf_train, y_train)

#predict:
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100, 2)}%')

Accuracy: 92.42%


In [11]:
confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])

array([[582,  56],
       [ 40, 589]], dtype=int64)

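Rows are the true labels and columns are the predicted labels, both in the order FAKE, REAL. Treating REAL as the positive class, this model gives 589 true positives, 582 true negatives, 56 false positives (FAKE articles predicted as REAL), and 40 false negatives (REAL articles predicted as FAKE).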
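The counts can be read off the printed matrix programmatically. This sketch copies the matrix values from the output above and assumes REAL is treated as the positive class; with labels=['FAKE', 'REAL'], flattening the matrix row by row then yields TN, FP, FN, TP in that order.

```python
import numpy as np

# Values copied from the confusion matrix output above.
# Rows: true FAKE, true REAL. Columns: predicted FAKE, predicted REAL.
cm = np.array([[582, 56],
               [40, 589]])

# Assumption: REAL is the positive class.
tn, fp, fn, tp = (int(v) for v in cm.ravel())

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted REAL that is truly REAL
recall = tp / (tp + fn)     # share of true REAL that was found

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy: {round(accuracy * 100, 2)}%")
```

The accuracy computed this way matches the 92.42% reported by accuracy_score.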