# TfidfVectorizer Explanation
Convert a collection of raw documents to a matrix of TF-IDF features.

TF-IDF combines two measures: TF is term frequency, and IDF is inverse document frequency.

In [2]:
# The 'sklearn' PyPI package is deprecated; install 'scikit-learn' instead
%pip install scikit-learn

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ['Hello Arnav here, I love machine learning','Welcome to the Machine learning hub' ]

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()

In [5]:
vect.fit(text)

In [6]:
# TF counts how often each word occurs in a document; IDF down-weights words that occur in many documents.
print(vect.idf_)

[1.40546511 1.40546511 1.40546511 1.40546511 1.         1.40546511
 1.         1.40546511 1.40546511 1.40546511]


In [7]:
print(vect.vocabulary_)

{'hello': 1, 'arnav': 0, 'here': 2, 'love': 5, 'machine': 6, 'learning': 4, 'welcome': 9, 'to': 8, 'the': 7, 'hub': 3}


### A word that is present in all the documents has a low IDF value. This way, rarer words are highlighted by their higher IDF values.
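
The two distinct values printed by `vect.idf_` can be reproduced by hand. The following is a minimal sketch, assuming scikit-learn's default smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` (the `smooth_idf=True` default). The helper name `smooth_idf` is illustrative and not part of the library.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF for a term that occurs in doc_freq of n_docs documents."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# The example corpus has 2 documents.
idf_rare = smooth_idf(2, 1)    # term found in only one document, e.g. 'hello'
idf_common = smooth_idf(2, 2)  # term found in both documents, e.g. 'machine'

print(round(idf_rare, 8))    # matches the 1.40546511 entries
print(round(idf_common, 8))  # matches the 1. entries
```

Under this formula, 'machine' and 'learning' get the minimum value of 1 because they appear in both sentences, while every other word gets the higher value.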

In [8]:
example = text[0]
example

'Hello Arnav here, I love machine learning'

In [9]:
example = vect.transform([example])
print(example.toarray())

[[0.44665616 0.44665616 0.44665616 0.         0.31779954 0.44665616
  0.31779954 0.         0.         0.        ]]


### Here, a 0 appears at the index of every vocabulary word that is not present in the given sentence.
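
The non-zero entries of the transformed row can also be rebuilt step by step. This sketch assumes the scikit-learn defaults (raw term counts, smoothed IDF, and L2 normalisation of each row) and uses only the standard library. The variable names are illustrative.

```python
import math

# Vocabulary in index order, as given by vect.vocabulary_ (index 0..9).
vocab = ["arnav", "hello", "here", "hub", "learning",
         "love", "machine", "the", "to", "welcome"]
n_docs = 2
# 'learning' and 'machine' occur in both sentences; every other term occurs in one.
doc_freq = {"learning": 2, "machine": 2}

# The default token pattern keeps tokens of two or more characters, so "I" is dropped.
tokens = ["hello", "arnav", "here", "love", "machine", "learning"]

def idf(term):
    return math.log((1 + n_docs) / (1 + doc_freq.get(term, 1))) + 1

# Term count times IDF, then divide by the Euclidean length of the row.
raw = [tokens.count(term) * idf(term) for term in vocab]
norm = math.sqrt(sum(v * v for v in raw))
row = [v / norm for v in raw]

print([round(v, 8) for v in row])
```

The words 'hub', 'the', 'to' and 'welcome' have a count of zero in the first sentence, which is why their positions stay at 0.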

## PassiveAggressiveClassifier

### Passive: if the classification is correct, keep the model unchanged. Aggressive: if the classification is incorrect, update the model to correct for the misclassified example.

Passive-Aggressive algorithms are generally used for large-scale learning. They are among the few "online-learning" algorithms. In online machine learning, the input data arrives in sequential order and the model is updated step by step, as opposed to batch learning, where the entire training dataset is used at once. This is very useful when there is a huge amount of data and training on the entire dataset at once is computationally infeasible because of its sheer size. Put simply, an online-learning algorithm gets a training example, updates the classifier, and then throws the example away.
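
To make the passive and aggressive steps concrete, here is a minimal sketch of one common variant of the update (often called PA-I) for labels in {-1, +1}. It is an illustration only: the function name `pa_update`, the parameter `C`, and the toy stream are assumptions, and scikit-learn's `PassiveAggressiveClassifier` may differ in its details.

```python
def pa_update(w, x, y, C=1.0):
    """Apply one Passive-Aggressive (PA-I style) step for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)      # hinge loss on this single example
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w                       # passive: margin is large enough, keep the model
    tau = min(C, loss / sq_norm)       # aggressive: step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Examples arrive one at a time; each is used once and then discarded.
w = [0.0, 0.0]
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
for x, y in stream:
    w = pa_update(w, x, y)

print(w)
```

The first two examples trigger aggressive updates. The third is already classified with enough margin, so the weights are left unchanged.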

## Let's start the work

In [10]:
import os
project_dir = "D:/Fake News Detection"  # author's local path; adjust for your machine
if os.path.isdir(project_dir):  # avoids FileNotFoundError when the folder is absent
    os.chdir(project_dir)

In [11]:
import pandas as pd

In [48]:
# Load the dataset; `dataframe` keeps every row and is used for the train/test split below
dataframe = pd.read_csv('news_dataset.csv')
dataframe.head()

# Work on a cleaned copy for a first look at the TF-IDF features
df = dataframe.copy()

# Step 1: Drop rows where the 'text' column is NaN
df.dropna(subset=['text'], inplace=True)

# Step 2: (Optional) Convert text to string, just in case
df['text'] = df['text'].astype(str)

# Step 3: Vectorize using TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])  # Feature matrix
y = df['label']  # Labels

# Now X and y are ready to use
print(X.shape)
print(y.value_counts())


(3721, 38587)
label
FAKE    1871
REAL    1850
Name: count, dtype: int64


In [49]:
x = dataframe['text']
y = dataframe['label']

In [50]:
x

0       Payal has accused filmmaker Anurag Kashyap of ...
1       A four-minute-long video of a woman criticisin...
2       Republic Poll, a fake Twitter account imitatin...
3       Delhi teen finds place on UN green list, turns...
4       Delhi: A high-level meeting underway at reside...
                              ...                        
3724    19:17 (IST) Sep 20\n\nThe second round of coun...
3725    19:17 (IST) Sep 20\n\nThe second round of coun...
3726    The Bengaluru City Police’s official Twitter h...
3727    Sep 20, 2020, 08:00AM IST\n\nSource: TOI.in\n\...
3728    Read Also\n\nRead Also\n\nAdvocate Ishkaran Bh...
Name: text, Length: 3729, dtype: object

In [51]:
y

0       REAL
1       FAKE
2       FAKE
3       REAL
4       REAL
        ... 
3724    REAL
3725    REAL
3726    FAKE
3727    REAL
3728    REAL
Name: label, Length: 3729, dtype: object

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [53]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
y_train

1475    FAKE
3127    REAL
2657    REAL
1069    REAL
2781    FAKE
        ... 
835     FAKE
3264    REAL
1653    FAKE
2607    REAL
2732    FAKE
Name: label, Length: 2983, dtype: object

In [64]:

# STEP 1: Check for any remaining NaNs
print("Before cleanup:", x_train.isna().sum())

# STEP 2: Fully sanitize x_train and x_test (replace NaN with '' and cast everything to str)
x_train = x_train.apply(lambda x: '' if pd.isna(x) else str(x)).reset_index(drop=True)
x_test = x_test.apply(lambda x: '' if pd.isna(x) else str(x)).reset_index(drop=True)

# STEP 3: Confirm fix
print("After cleanup:", x_train.isna().sum())
print("Types in x_train:", x_train.apply(type).unique())

# STEP 4: Fit TF-IDF on the training text only, then reuse the fitted vocabulary on the test text
tfvect = TfidfVectorizer(stop_words='english', max_df=0.7)
tfid_x_train = tfvect.fit_transform(x_train)
tfid_x_test = tfvect.transform(x_test)

Before cleanup: 0
After cleanup: 0
Types in x_train: [<class 'str'>]


* max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
* max_df = 25 means "ignore terms that appear in more than 25 documents".
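
The difference between a float and an integer `max_df` can be sketched without scikit-learn. The helper `terms_kept` below is hypothetical and uses a plain whitespace tokenizer. It only mirrors the rule described in the two bullets: a float is a proportion of documents, an integer is an absolute document count.

```python
def terms_kept(docs, max_df):
    """Return the terms whose document frequency does not exceed max_df."""
    n_docs = len(docs)
    # A float is a proportion of the corpus; an int is an absolute document count.
    limit = max_df * n_docs if isinstance(max_df, float) else max_df
    doc_freq = {}
    for doc in docs:
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, count in doc_freq.items() if count <= limit)

docs = ["the cat sat", "the dog ran", "the bird flew", "a cat ran"]

kept_half = terms_kept(docs, 0.5)  # drop terms found in more than 50% of documents
kept_one = terms_kept(docs, 1)     # drop terms found in more than 1 document

print(kept_half)
print(kept_one)
```

With the proportion 0.5, only 'the' (3 of 4 documents) is dropped. With the absolute count 1, 'cat' and 'ran' are dropped as well.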

In [65]:
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(tfid_x_train,y_train)

In [66]:
y_pred = classifier.predict(tfid_x_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 99.46%


In [67]:
cf = confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
print(cf)

[[359   2]
 [  2 383]]
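
The accuracy reported earlier can be cross-checked against this matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order given by `labels=['FAKE','REAL']`. The variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
cf = [[359, 2],
      [2, 383]]

total = sum(sum(row) for row in cf)
accuracy = (cf[0][0] + cf[1][1]) / total             # correct predictions / all predictions
fake_precision = cf[0][0] / (cf[0][0] + cf[1][0])    # predicted FAKE that really is FAKE
fake_recall = cf[0][0] / (cf[0][0] + cf[0][1])       # true FAKE that was found

print(f"Test samples: {total}")
print(f"Accuracy: {round(accuracy * 100, 2)}%")
print(f"FAKE precision: {fake_precision:.4f}, FAKE recall: {fake_recall:.4f}")
```

The diagonal holds the correct predictions, and the two off-diagonal cells are the 4 misclassified articles.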


In [68]:
def fake_news_det(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)

In [77]:
fake_news_det('U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.')

['REAL']


In [79]:
fake_news_det("""A set of two old and unrelated images are viral on social media with a false claim that a man in Bangladesh burnt to death while celebrating the Indian cricket team's defeat to New Zealand at the World Cup semi finals, last week.

The posts claim that a man named Shaikh Mujibur, who had arranged an elaborate feast to celebrate the forgettable performance of the Indian cricket team, accidentally burnt to death when a cooking gas cylinder exploded.

The post has been captioned as, 'Bangladesh's Shaikh Mujibur arranged an elaborate feast to enjoy India's loss in the semi-finals. But Allah did not agree and he died when the gas cylinder burst and fire engulfed him.'

(Original caption: सेमीफाइनल में भारत के हारने की जबरदस्त खुशी में #बांग्लादेश के #शेख_मुजीबुर ने जश्न मनाने के लिये दावत का इंतजाम किया। लेकिन अल्लाह को ये मंजूर नहीं था और गैस सिलेंडर फटने से #मुजीबुर तंदूरी मुर्गा बनकर जलकर खाक हो गया। )

The archive of the post can be viewed here.

The same narrative is viral in Bengali where netizens have linked a blog to the Facebook post.

The post can be viewed here.

Fact Check

BOOM found that the blog mentioned in the Bengali Facebook posts credits the news item to bdnews.com. However, no such website could be found. The blog mentioned that the incident happened at night, around 11 pm, in Bangladesh's Khulna, where 57-year-old Shaikh Mujibur lost his life.

We then ran a keyword search with a custom ranged time both in Bengali and English, however, no reports of cooking gas tragedy that occurred in the recent past could be found.

Misleading Images

BOOM ran a reverse image search on both the images and found that they are unrelated and misleading. The photo of the man, identified as deceased Shaikh Mujibur, has been taken from photographer Chantal Aimée Ehrhardt's 2017 collection of 'Orange beards of Bangladesh'.

Below is an excerpt of how the photographer described the collection.

"In March/April 2017 I travelled to Bangladesh to make several photo reportages. One of the things that amazed me were the man with orange beards that I saw on the street. I decided to walk through Old Dhaka and ask every single man with an orange beard if I could take a picture of him. When I asked why they dyed their hair, most said because of Prophet Muhammed. Who is believed to have dyed his hair with henna as well. And some said as a first reason because they think it’s beautiful."

The second image was found in photographer Bill Townsend's Flickr account. The image was taken in January, 2007. BOOM also found that the exif data of the image matches with the metadata of the photo provided on Flickr.

BOOM has reached out to both the photographers. The article will be updated upon receiving a response.""")

['FAKE']


In [80]:
import pickle
with open('model.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(classifier, f)

In [81]:
# load the model from disk
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [82]:
def fake_news_det1(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = loaded_model.predict(vectorized_input_data)
    print(prediction)

In [83]:
fake_news_det1("""Go to Article 
President Barack Obama has been campaigning hard for the woman who is supposedly going to extend his legacy four more years. The only problem with stumping for Hillary Clinton, however, is she’s not exactly a candidate easy to get too enthused about.  """)

['REAL']


In [84]:
fake_news_det1("""U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.""")

['REAL']


In [85]:
fake_news_det('''U.S. Secretary of State John F. Kerry said Monday that he will stop in Paris later this week, amid criticism that no top American officials attended Sunday’s unity march against terrorism.''')

['REAL']
