
Goal: build a logistic regression model that classifies news articles as "real" or "fake" (binary classification).


In [1]:
#importing necessary dependencies

import pandas as pd
import numpy as np
import nltk

Step 1: Load and Combine Data

In [2]:
true_news = pd.read_csv('/content/True.csv')
fake_news = pd.read_csv('/content/Fake.csv')

print('Files are Loaded successfully')

Files are Loaded successfully


Add a Label column to each DataFrame, assigning 0 to true news and 1 to fake news.

In [3]:
true_news['Label'] = 0
fake_news['Label'] = 1

#combining dataframes
df = pd.concat([true_news, fake_news],ignore_index=True)

print(df.head())

                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much mor...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3  WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   
4  SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews   

                 date  Label  
0  December 31, 2017       0  
1  December 29, 2017       0  
2  December 31, 2017       0  
3  December 30, 2017       0  
4  December 29, 2017       0  
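
The labelling and concatenation step can be tried on toy data first. This is a minimal sketch, assuming two small made-up frames; `toy_true`, `toy_fake` and `toy_df` are hypothetical stand-ins for the real CSV contents.

```python
import pandas as pd

# Hypothetical stand-ins for True.csv and Fake.csv (made-up rows, illustration only)
toy_true = pd.DataFrame({"title": ["Real headline A", "Real headline B"]})
toy_fake = pd.DataFrame({"title": ["Fake headline C"]})

# Same convention as the notebook: 0 = true news, 1 = fake news
toy_true["Label"] = 0
toy_fake["Label"] = 1

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame
toy_df = pd.concat([toy_true, toy_fake], ignore_index=True)
print(toy_df)
```

Without `ignore_index=True` the combined frame would keep each source frame's own index, so index values would repeat.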


In [4]:
#checking the labels
print(df['Label'].value_counts())

Label
1    23481
0    21417
Name: count, dtype: int64
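
The counts above translate into class shares and a simple reference point for later accuracy figures. The arithmetic below is a sketch that uses only the two printed counts; `majority_baseline` is a hypothetical name for the accuracy of a classifier that always predicts the larger class.

```python
# Counts taken from the value_counts() output above
fake_count = 23481   # Label 1
true_count = 21417   # Label 0
total = fake_count + true_count

fake_share = fake_count / total
true_share = true_count / total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(fake_share, true_share)

print(f"Fake share: {fake_share:.3f}")
print(f"True share: {true_share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

The two classes are close to balanced, so plain accuracy is a reasonable first metric here.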


In [5]:
#checking for missing values before preprocessing
print(df.isnull().sum())

title      0
text       0
subject    0
date       0
Label      0
dtype: int64


Step 2: Data Preprocessing & Exploration (EDA)

In [6]:
# Fill any potential NaN values in 'title' or 'text' with an empty string
df['title'] = df['title'].fillna('')
df['text'] = df['text'].fillna('')
print("Missing values after filling with empty string:")
print(df.isnull().sum())

Missing values after filling with empty string:
title      0
text       0
subject    0
date       0
Label      0
dtype: int64


In [7]:
df['full_text'] = df['title'] + " " + df['text']

In [8]:
#Initialize the lemmatizer and stop words
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')


lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
#Defining a text cleaning function
def preprocess_text(text):
    #convert to lowercase
    text = text.lower()
    #Remove punctuation and numbers , keep only alphabets and spaces
    text = re.sub(r'[^a-z\s]','',text)
    #tokenize and remove stop words , then lemmatize
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    #join the words back into a single string
    return ' '.join(words)

#Apply the cleaning function to the full_text column
print("\nApplying text preprocessing (this might take a moment, especially for large datasets)...")
df['clean_text'] = df['full_text'].apply(preprocess_text)
print("Text preprocessing complete!")



Applying text preprocessing (this might take a moment, especially for large datasets)...
Text preprocessing complete!
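
To see what the cleaning steps do to a single sentence, here is a dependency-free sketch. It assumes a tiny hand-written stop-word set (`DEMO_STOP_WORDS`, not NLTK's English list) and skips lemmatization, so its output will differ from `preprocess_text` on real articles.

```python
import re

# Assumption: a tiny hand-written stop-word set, NOT NLTK's English list
DEMO_STOP_WORDS = {"the", "a", "to", "of", "in", "for", "and", "is"}

def demo_preprocess(text):
    """Lowercase, strip non-letters, and drop stop words (no lemmatization)."""
    text = text.lower()
    # Same pattern as the notebook: everything except a-z and whitespace is deleted
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)

sample = "The Senate voted 52-48 to pass the bill in 2017!"
print(demo_preprocess(sample))          # senate voted pass bill
print(demo_preprocess("U.S. budget"))   # us budget
```

Because punctuation is deleted rather than replaced with a space, abbreviations fuse into one token: "U.S." becomes "us".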


In [10]:
#display the size
print(df.shape)

(44898, 7)


In [11]:
#display the first few rows of the selected columns
print(df[['title' , 'text', 'full_text', 'clean_text','Label']].head())

                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much mor...   

                                                text  \
0  WASHINGTON (Reuters) - The head of a conservat...   
1  WASHINGTON (Reuters) - Transgender people will...   
2  WASHINGTON (Reuters) - The special counsel inv...   
3  WASHINGTON (Reuters) - Trump campaign adviser ...   
4  SEATTLE/WASHINGTON (Reuters) - President Donal...   

                                           full_text  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much 

In [12]:
#we also check for length of clean_text to ensure content is present
df['clean_text_length'] = df['clean_text'].apply(len)
print("\nCleaned Text Length Distribution (first 5):")
print(df['clean_text_length'].head())
print("\nOverall Cleaned Text Length Statistics:")
print(df['clean_text_length'].describe())


Cleaned Text Length Distribution (first 5):
0    3318
1    3030
2    1986
3    1822
4    3620
Name: clean_text_length, dtype: int64

Overall Cleaned Text Length Statistics:
count    44898.000000
mean      1765.604659
std       1512.065830
min         22.000000
25%        919.000000
50%       1554.000000
75%       2192.000000
max      37930.000000
Name: clean_text_length, dtype: float64


Step 3: Text Vectorization

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

#Defining features and target
X = df['clean_text']
y = df['Label']

print("Features and Target values are defined")

Features and Target values are defined


In [15]:
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
print("Vectorizing text data...")
X_vectorized = tfidf_vectorizer.fit_transform(X)
print(f"Shape of vectorized data: {X_vectorized.shape}")

Vectorizing text data...
Shape of vectorized data: (44898, 10000)
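
Each of the 10,000 columns holds a TF-IDF weight. The sketch below computes such weights by hand on a made-up three-document corpus, following the smoothed IDF and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenization and all names (`docs`, `smoothed_idf`, `tfidf_vector`) are assumptions for illustration.

```python
import math
from collections import Counter

# Made-up corpus (illustration only)
docs = [
    "senate passes budget bill",
    "senate debates budget",
    "aliens land in times square",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_vector(tokens):
    # Raw term count times IDF, then L2-normalise the document vector
    counts = Counter(tokens)
    weights = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in fewer documents ("passes", "bill") receive a higher weight than terms shared across documents ("senate", "budget").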


In [16]:
print("\nFirst 10 features (in alphabetical order) from the TF-IDF vocabulary:")
feature_names = tfidf_vectorizer.get_feature_names_out()
print(feature_names[:10])


First 10 features (in alphabetical order) from the TF-IDF vocabulary:
['aaron' 'abadi' 'abandon' 'abandoned' 'abandoning' 'abbas' 'abbott' 'abc'
 'abdel' 'abducted']


Step 4: Model Building

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix

print("Splitting data into training and testing sets...")
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

Splitting data into training and testing sets...
Training data shape: (35918, 10000)
Testing data shape: (8980, 10000)
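
Note that the vectorizer in Step 3 was fitted on all 44,898 rows before this split, so the vocabulary and IDF weights have already seen the test rows. One possible alternative is to split the raw text first and let a pipeline fit TF-IDF on the training part only. The sketch below runs on a made-up toy corpus; the variable names are hypothetical, and the effect on the reported accuracy is not measured here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (illustration only); 0 = true, 1 = fake
texts = [
    "senate passes budget bill", "court upholds ruling on trade",
    "minister announces new policy", "central bank holds interest rate",
    "aliens endorse candidate shock", "miracle pill cures everything overnight",
    "secret plot exposed by insider", "celebrity clone spotted again",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the RAW text first; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training part only, then the classifier
clf = make_pipeline(
    TfidfVectorizer(max_features=10000),
    LogisticRegression(max_iter=1000, solver="liblinear"),
)
clf.fit(X_tr, y_tr)
print(clf.predict(X_te))
```

With this layout, `clf.predict` accepts raw (cleaned) strings directly, because vectorization happens inside the pipeline.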


In [18]:
#Initializing Logistic Regression model
model = LogisticRegression(max_iter=1000,solver='liblinear')
print("Training the model...")
model.fit(X_train, y_train)
print("Model training complete!")

Training the model...
Model training complete!


Step 5: Model Evaluation

In [19]:
print("Making predictions on the test set...")
y_pred = model.predict(X_test)

#calculating accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy:.4f}")

Making predictions on the test set...
Accuracy on the test set: 0.9892


In [20]:
#Display confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Confusion Matrix:
[[4299   31]
 [  66 4584]]
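
The headline metrics can be recomputed by hand from this matrix. The sketch below uses only the four printed counts and assumes scikit-learn's layout: rows are actual classes, columns are predicted classes, labels in ascending order (0 = true news, 1 = fake news).

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = [[4299, 31],
      [66, 4584]]

tn, fp = cm[0]   # actual true news: predicted true / predicted fake
fn, tp = cm[1]   # actual fake news: predicted true / predicted fake

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_fake = tp / (tp + fp)   # of articles flagged fake, how many were fake
recall_fake = tp / (tp + fn)      # of fake articles, how many were flagged
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)

print(f"accuracy       = {accuracy:.4f}")
print(f"precision fake = {precision_fake:.4f}")
print(f"recall fake    = {recall_fake:.4f}")
print(f"f1 fake        = {f1_fake:.4f}")
```

In these terms, 31 true articles were flagged as fake and 66 fake articles were classified as true.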


In [21]:
#Displaying classification report
# This report provides Precision, Recall, and F1-Score for each class (0: True, 1: Fake)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['True News', 'Fake News']))


Classification Report:
              precision    recall  f1-score   support

   True News       0.98      0.99      0.99      4330
   Fake News       0.99      0.99      0.99      4650

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980



In [22]:

print("\nModel evaluation complete!")


Model evaluation complete!


Step 6: Making Predictions on New Data

In [24]:
print("\n--- Making Predictions on New Data ---")
new_articles = ["BREAKING NEWS: Scientists discover cure for all cancers, available next month!", # Likely Fake
    "President issues executive order to increase national park funding by 20 percent. Details to follow.", # Likely True
    "Aliens land in Times Square, declare peace and offer free energy to all humanity." # Definitely Fake
                ]

for i, article in enumerate(new_articles):
    print(f"-- Article {i + 1} --")
    print(f"Original Text: {article}")

    # 1. Preprocess the new article using the same function
    clean_new_article = preprocess_text(article)
    print(f"Cleaned Text: {clean_new_article}")

    # 2. Vectorize the cleaned text
    vec_new_article = tfidf_vectorizer.transform([clean_new_article])

    # 3. Make Prediction
    prediction = model.predict(vec_new_article)
    prediction_proba = model.predict_proba(vec_new_article) #to get the probability of prediction

    # Interpret the prediction
    if prediction[0] == 0:
        print(f"Prediction: REAL News (Confidence: {prediction_proba[0][0]:.2f})")
    else:
        print(f"Prediction: FAKE News (Confidence: {prediction_proba[0][1]:.2f})")

print("\nPrediction demonstration complete!")



--- Making Predictions on New Data ---
-- Article 1 --
Original Text: BREAKING NEWS: Scientists discover cure for all cancers, available next month!
Cleaned Text: breaking news scientist discover cure cancer available next month
Prediction: FAKE News (Confidence: 0.87)
-- Article 2 --
Original Text: President issues executive order to increase national park funding by 20 percent. Details to follow.
Cleaned Text: president issue executive order increase national park funding percent detail follow
Prediction: FAKE News (Confidence: 0.67)
-- Article 3 --
Original Text: Aliens land in Times Square, declare peace and offer free energy to all humanity.
Cleaned Text: alien land time square declare peace offer free energy humanity
Prediction: FAKE News (Confidence: 0.84)

Prediction demonstration complete!
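
The per-article steps in the loop can be folded into a small helper. This is a minimal sketch, assuming a scikit-learn pipeline trained on a made-up toy corpus; `classify`, `label_names` and `toy_clf` are hypothetical names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def classify(article, pipeline, label_names=("REAL", "FAKE")):
    """Return (label, confidence) for one already-cleaned article string."""
    proba = pipeline.predict_proba([article])[0]
    idx = int(proba.argmax())
    # classes_ maps the probability column back to the original label (0 or 1)
    return label_names[pipeline.classes_[idx]], float(proba[idx])

# Made-up toy corpus (illustration only); 0 = true, 1 = fake
texts = [
    "senate passes budget bill", "court upholds ruling on trade",
    "minister announces new policy", "central bank holds interest rate",
    "aliens endorse candidate shock", "miracle pill cures everything overnight",
    "secret plot exposed by insider", "celebrity clone spotted again",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, solver="liblinear"),
)
toy_clf.fit(texts, labels)

label, confidence = classify("aliens spotted again", toy_clf)
print(f"Prediction: {label} News (Confidence: {confidence:.2f})")
```

Article 2 above was marked "Likely True" but is predicted FAKE at 0.67 confidence. The training articles are full-length stories while the demo inputs are single sentences, which may be one reason for the mismatch; treating low-confidence predictions as "uncertain" is one possible way to handle such cases, with the cutoff being an assumption to tune.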
