# 📊 Fake News Detection

Dataset link: https://www.kaggle.com/datasets/vishakhdapat/fake-news-detection


### Step 1: Import Libraries and Load Dataset

* pandas is a powerful Python library for data analysis and manipulation.

* read_csv() loads the CSV file into a DataFrame.

* head() displays the first 5 rows of the dataset, and info() summarizes the columns, non-null counts, and dtypes.

In [3]:
import pandas as pd

# Read the CSV file into a DataFrame
data = pd.read_csv("/content/fake_and_real_news.csv")

# Print some info to check the data
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9900 entries, 0 to 9899
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    9900 non-null   object
 1   label   9900 non-null   object
dtypes: object(2)
memory usage: 154.8+ KB


In [9]:
# Work on a copy so the raw DataFrame stays unchanged when columns are added
df = data.copy()

#### ❓Why it's needed:
Before doing anything with the data, we need to load and inspect it. This helps us:

* Understand the structure of the dataset

* Identify which columns are important

* Plan the next steps in cleaning and processing
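
A few extra checks are often worth running at this stage: class balance, missing values, and duplicate rows. The sketch below uses a tiny stand-in DataFrame whose rows are invented for illustration (they are not taken from the dataset); only the `Text` and `label` column names match the notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, run these checks on `data` instead.
sample = pd.DataFrame({
    "Text": [
        "Senate passes tax bill",
        "Aliens endorse candidate",
        "Senate passes tax bill",
    ],
    "label": ["Real", "Fake", "Real"],
})

label_counts = sample["label"].value_counts()     # class balance
missing_per_column = sample.isna().sum()          # missing values per column
duplicate_rows = int(sample.duplicated().sum())   # exact duplicate rows

print(label_counts)
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

If duplicates show up in the real data, it may be worth dropping them before the train-test split so the same article cannot land in both sets.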

### Step 2: Data Cleaning and Preprocessing

In [10]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                              # Convert to lowercase
    text = re.sub(r'\d+', '', text)                  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)              # Remove punctuation
    text = text.split()                              # Tokenization
    text = [word for word in text if word not in stop_words]  # Remove stopwords
    text = [lemmatizer.lemmatize(word) for word in text]      # Lemmatization
    return ' '.join(text)

df['clean_text'] = df['Text'].apply(preprocess)
df[['Text', 'clean_text']].head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


|   | Text | clean_text |
|---|------|------------|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | top trump surrogate brutally stab back he path... |
| 1 | U.S. conservative leader optimistic of common ... | u conservative leader optimistic common ground... |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | trump proposes u tax overhaul stir concern def... |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | court force ohio allow million illegally purge... |
| 4 | Democrats say Trump agrees to work on immigrat... | democrat say trump agrees work immigration bil... |


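#### ✅ Why This is Needed?
* Cleans and normalizes text for better model performance.

* Reduces noise and standardizes vocabulary. Note that `lemmatize()` treats every word as a noun by default, so it mainly strips plurals (ex: stabs → stab, forces → force); verb forms such as running or ran are only reduced to run when `pos='v'` is passed.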
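
To see what the regex steps do to a single sentence, here is a minimal sketch that mirrors the lowercase, digit, punctuation, and stop-word steps of `preprocess`. The sentence, the `normalize` name, and the small `DEMO_STOP_WORDS` set are assumptions made for illustration; the notebook uses NLTK's full English stop-word list and also lemmatizes.

```python
import re

# Illustrative subset only; the notebook uses NLTK's full English stop-word list.
DEMO_STOP_WORDS = {"the", "in", "of", "to", "a", "on"}

def normalize(text):
    """Apply the same regex cleaning as preprocess(), without lemmatization."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)        # drop digits
    text = re.sub(r"[^\w\s]", "", text)    # drop punctuation
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in DEMO_STOP_WORDS]

tokens = normalize("The Senate voted 52-48 on the U.S. tax bill!")
print(tokens)  # ['senate', 'voted', 'us', 'tax', 'bill']
```

One side effect to be aware of: removing punctuation collapses "U.S." into "us", which then appears as `u` in the notebook output above, apparently after lemmatization.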

### Step 3: Feature Extraction using TF-IDF

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_text'])


#### ✅ Why This is Needed?
* Converts text to numerical format for machine learning models.

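* TF-IDF weights each word by how often it appears in a document and scales that weight down for words that appear in many documents, so common words contribute less. `max_features=5000` keeps only the 5,000 most frequent terms in the corpus.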
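
The weighting can be seen on a toy corpus. This is a small sketch with three invented documents (not from the dataset): "senate" occurs in all of them, so in the third document it should receive a lower weight than "alien", which occurs only there.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy documents for illustration only
docs = [
    "senate passes tax bill",
    "senate debates tax bill",
    "alien baby shocks senate",
]

demo_vectorizer = TfidfVectorizer()
tfidf = demo_vectorizer.fit_transform(docs)   # sparse matrix: documents x terms
vocab = demo_vectorizer.vocabulary_           # term -> column index

row = tfidf[2].toarray().ravel()              # weights for the third document
senate_weight = row[vocab["senate"]]
alien_weight = row[vocab["alien"]]

print(tfidf.shape)
print("senate:", senate_weight, "alien:", alien_weight)
```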

### Step 4: Label Encoding


In [12]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['label'])  # 'Fake'=0, 'Real'=1


#### ✅ Why This is Needed?
Converts categorical labels (Fake, Real) to numerical form required by ML models.
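
`LabelEncoder` assigns integers in sorted order of the class names, which is why `Fake` maps to 0 and `Real` to 1. The sketch below, using a few made-up labels, shows how to confirm the mapping from `classes_` and how to decode predictions back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration; the notebook encodes df['label']
demo_le = LabelEncoder()
encoded = demo_le.fit_transform(["Real", "Fake", "Fake", "Real"])

classes = list(demo_le.classes_)                     # position i is the name for integer i
decoded = list(demo_le.inverse_transform([0, 1]))    # integers back to class names

print(classes)
print(list(encoded))
print(decoded)
```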

### Step 5: Train-Test Split

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#### ✅ Why This is Needed?
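Holds out 20% of the data (`test_size=0.2`) so the model is evaluated on articles it did not see during training (generalization); `random_state=42` makes the split reproducible.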
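
In the notebook the vectorizer is fitted on the full dataset before the split, so the test documents influence the vocabulary and IDF weights. One possible alternative, sketched below on invented texts, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that TF-IDF is fitted on the training portion only. The `stratify` argument is also an addition here; it keeps the Fake/Real ratio the same in both parts. `MultinomialNB` is the classifier used in the next step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented texts for illustration; in the notebook use df['clean_text'] and y
texts = [
    "shocking image goes viral", "watch this shocking video",
    "you will not believe this image", "viral video shocks everyone",
    "senate committee passes bill", "house votes on tax bill",
    "official said committee will meet", "senate debates tax reform",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]   # 0 = Fake, 1 = Real

# Split the raw text first, keeping class proportions equal in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer inside the pipeline is fitted on the training texts only
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
pipeline.fit(train_texts, train_labels)

test_accuracy = pipeline.score(test_texts, test_labels)
print("Test accuracy:", test_accuracy)
```

A pipeline also simplifies deployment, since a single object carries both the vocabulary and the model.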



### Step 6: Model Building with Naive Bayes

In [14]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)


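#### Extract the top keywords (features) with the highest class-conditional probability for each class (Fake or Real). These are the most typical words of each class, so a word such as "trump" can rank high in both lists.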

In [19]:
import numpy as np

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Per-class log probability of each feature, log P(word | class)
class_labels = model.classes_  # [0 = Fake, 1 = Real]
log_probs = model.feature_log_prob_

# Top keywords for Fake News
top_n = 20
top_fake_indices = np.argsort(log_probs[0])[-top_n:]
top_fake_keywords = feature_names[top_fake_indices]

# Top keywords for Real News
top_real_indices = np.argsort(log_probs[1])[-top_n:]
top_real_keywords = feature_names[top_real_indices]

print("🛑 Top 20 Keywords for FAKE News:")
print(top_fake_keywords)

print("\n✅ Top 20 Keywords for REAL News:")
print(top_real_keywords)


🛑 Top 20 Keywords for FAKE News:
['get' 'woman' 'said' 'know' 'time' 'white' 'would' 'clinton'
 'realdonaldtrump' 'even' 'republican' 'obama' 'via' 'president' 'like'
 'one' 'donald' 'people' 'image' 'trump']

✅ Top 20 Keywords for REAL News:
['united' 'administration' 'democrat' 'senator' 'official' 'committee'
 'russia' 'white' 'washington' 'bill' 'would' 'reuters' 'president'
 'state' 'senate' 'republican' 'house' 'tax' 'trump' 'said']


#### ✅ Why This is Needed?
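Naive Bayes assumes that words are conditionally independent given the class. That assumption pairs naturally with bag-of-words features such as TF-IDF, and it makes the model fast to train and a strong baseline for text classification.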

### Step 7: Model Evaluation

In [15]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9641414141414142
Confusion Matrix:
 [[931  42]
 [ 29 978]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.96      0.96       973
           1       0.96      0.97      0.96      1007

    accuracy                           0.96      1980
   macro avg       0.96      0.96      0.96      1980
weighted avg       0.96      0.96      0.96      1980



#### ✅ Why This is Needed?
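* Accuracy gives the overall share of correct predictions: about 96.4% of the 1,980 test articles.

* The confusion matrix and classification report show how well the model distinguishes between Fake and Real news: 42 Fake articles were labeled Real, and 29 Real articles were labeled Fake.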
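
The figures in the classification report can be recomputed from the confusion matrix printed above, which helps when reading it. In scikit-learn's layout, rows are true classes and columns are predicted classes. A short worked sketch using those four counts:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[931, 42],
               [29, 978]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / actually that class

print("Accuracy:", accuracy)
print("Precision (Fake, Real):", precision.round(2))
print("Recall (Fake, Real):", recall.round(2))
```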



### Step 8: Save Model for Deployment

In [16]:
import joblib

joblib.dump(model, 'fake_news_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')


['tfidf_vectorizer.pkl']

In [18]:

# 📝 Input from user
user_input = input("Enter a news headline or article: ")

# Preprocess & predict
processed_input = preprocess(user_input)
vectorized_input = vectorizer.transform([processed_input])
prediction = model.predict(vectorized_input)

# Output
print("Prediction:", "Real News 📰" if prediction[0] == 1 else "Fake News 🚨")

Enter a news headline or article: "Hillary Clinton Adopts Alien Baby In Shocking Secret Ceremony"
Prediction: Fake News 🚨
