# <center>FAKE NEWS DETECTION</center>
<center>BY: SHERLIEN MOLLY D</center>

1. Import Libraries
In this step, we import all the necessary libraries for data manipulation, text processing, and machine learning model building. This includes essential libraries like `pandas`, `nltk` for natural language processing, `TfidfVectorizer` for converting text into numerical features, and `LogisticRegression` for model training.


Explanation:
 - pandas: For handling data in tabular format.
 - nltk: For text preprocessing, including stopwords and tokenization.
 - re: Regular expressions for text cleaning.
 - sklearn: Machine learning functions for splitting data, vectorizing text, training models, and evaluating performance.
 - seaborn and matplotlib: For data visualization and analysis.


In [2]:
!pip install nltk

In [3]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
from collections import Counter

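#Download the NLTK resources used for stopword removal and tokenization
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')   #Needed by word_tokenize in newer NLTK releases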


2. Load the Data
We load two datasets: one containing fake news and another with true news. After loading the datasets, we label the data as fake (0) or true (1) and combine the two into a single DataFrame for ease of processing.

In [4]:
fake_df = pd.read_csv('Fake.csv')
true_df = pd.read_csv('True.csv')

#Add a label column to both datasets: 0 for fake, 1 for true
fake_df['label'] = 0
true_df['label'] = 1

#Combine both datasets
combined_df = pd.concat([fake_df, true_df], ignore_index=True)

#View basic info
print(combined_df.info())
print(combined_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB
None
                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk

Explanation:
 - We add a `label` column to distinguish between fake (0) and true (1) news articles.
 - The `concat` function combines the two datasets vertically (row-wise).
 - Displaying the first few rows and data info ensures that the data is correctly loaded.

3. Data Preprocessing

A. Text Cleaning
This step is critical as raw text can be noisy and contain irrelevant information such as punctuation, numbers, or stopwords. We clean the text by:
- Lowercasing the text for uniformity.
- Removing punctuation and numbers.
- Tokenizing the text (breaking it into individual words).
- Removing common stopwords (e.g., "the", "and") which do not contribute meaningful information.


In [5]:
#Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

#Function to clean the text
def clean_text(text):
    text = str(text).lower()   #Lowercase (str() guards against non-string values)
    text = re.sub(r'\d+', '', text)   #Remove numbers
    text = re.sub(r'[^\w\s]', '', text)   #Remove punctuation and special characters
    tokens = word_tokenize(text)   #Tokenize
    cleaned_tokens = [word for word in tokens if word not in stop_words]   #Remove stopwords
    return ' '.join(cleaned_tokens)

#Apply the cleaning function to the 'text' column
combined_df['cleaned_text'] = combined_df['text'].apply(clean_text)

#Check cleaned data
print(combined_df[['title', 'cleaned_text', 'label']].head())


Explanation:
 - The `clean_text` function removes noise from the text data, helping the machine learning model focus on the most relevant words.
 - Applying the function to the 'text' column creates a new 'cleaned_text' column.
 - Cleaned text is displayed for verification.
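
The cleaning steps can be tried on a single sentence without any NLTK downloads. The sketch below is a simplified stand-in for `clean_text`, not the function itself: `DEMO_STOPWORDS` is a tiny hypothetical stopword set and `str.split` replaces `word_tokenize`, so its output may differ from the real function on some inputs.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"the", "and", "a", "is", "of", "to", "in"}

def clean_text_demo(text):
    """Lowercase, strip digits and punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)        # remove numbers
    text = re.sub(r"[^\w\s]", "", text)    # remove punctuation and special characters
    tokens = text.split()                  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)

sample = "The 3 Senators voted, and the bill passed in 2017!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Printing the result for a few hand-written sentences is a quick way to confirm that each step behaves as intended before applying the function to the full dataset.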

4. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) helps us understand the structure of the data and the distribution of features. Here, we examine the distribution of fake vs true articles and common word frequencies in both categories.

A. Distribution of Fake vs True Articles
We visualize the number of fake and true news articles using a bar chart to get an overview of the class distribution.


In [None]:
#Countplot for labels
sns.countplot(x='label', data=combined_df)
plt.title('Distribution of Fake vs True News')
plt.show()

Explanation:
 - This count plot shows how balanced the dataset is. With roughly balanced classes, the model is less likely to favor one class over the other.
 - The ratio of fake to true articles also affects how evaluation metrics such as accuracy should be interpreted.
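
The same check can be done numerically. The sketch below uses a hypothetical list of labels (not the real dataset counts) to compute each class's share and the accuracy of a model that always predicts the most frequent class, which is a useful floor when reading accuracy later.

```python
from collections import Counter

# Hypothetical label list: 0 = fake, 1 = true (not the real dataset counts)
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

counts = Counter(labels)
total = len(labels)
shares = {label: count / total for label, count in counts.items()}

# Accuracy of a model that always predicts the most frequent class
majority_baseline = max(shares.values())

print(shares)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the notebook's data, the equivalent numbers could be obtained from `combined_df['label']`; a trained model is only informative if it clearly beats this baseline.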

B. Common Words in Fake vs True News
We compute and display the most common words in fake and true news to gain insights into which terms are frequent in each category.

In [None]:
#Function to get common words
def get_most_common_words(df, label, n=20):
    all_words = ' '.join(df[df['label'] == label]['cleaned_text']).split()
    word_freq = Counter(all_words)
    return word_freq.most_common(n)

#Get common words for fake and true news
common_fake_words = get_most_common_words(combined_df, label=0)
common_true_words = get_most_common_words(combined_df, label=1)

print('Most common words in fake news:', common_fake_words)
print('Most common words in true news:', common_true_words)

Explanation:
 - The `get_most_common_words` function counts word frequencies for one class at a time; we call it once for fake news and once for true news.
 - This gives us an idea of the different vocabularies used in fake and true news articles.
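
Raw top-word lists often overlap, because some words are frequent in both classes. One possible follow-up, sketched below on hypothetical toy texts, is to keep only the words that rank highly in one class but not in the other. The helper `top_words` is illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical cleaned texts for each class
fake_texts = ["trump video shocking", "video shocking hoax", "trump video"]
true_texts = ["trump said senate", "senate vote said", "trump said"]

def top_words(texts, n=3):
    """Return the n most frequent words across a list of cleaned texts."""
    return [word for word, _ in Counter(" ".join(texts).split()).most_common(n)]

top_fake = top_words(fake_texts)
top_true = top_words(true_texts)

# Words that rank highly in one class but not in the other
only_fake = [w for w in top_fake if w not in top_true]
only_true = [w for w in top_true if w not in top_fake]
print(only_fake, only_true)
```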


C. Length of News Articles
We visualize the distribution of article lengths for both fake and true news to check if there’s a difference in the size of articles between these two categories.


In [None]:
#Compute text length
combined_df['text_length'] = combined_df['cleaned_text'].apply(lambda x: len(x.split()))

#Boxplot for text lengths
sns.boxplot(x='label', y='text_length', data=combined_df)
plt.title('Distribution of Article Lengths by Fake vs True')
plt.show()


Explanation:
 - The boxplot shows the distribution of word counts (measured on the cleaned text) in fake and true articles.
 - This helps in understanding whether fake or true news articles are typically longer or shorter.
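
The line inside each box of a boxplot is the median, so the comparison can also be expressed as one number per class. A minimal sketch on hypothetical rows, using the same word-count definition as `text_length`:

```python
from statistics import median

# Hypothetical (label, cleaned_text) pairs; 0 = fake, 1 = true
rows = [
    (0, "shocking video shows senator"),
    (0, "hoax spreads online quickly after rally speech"),
    (1, "senate passes budget bill"),
    (1, "court rules case"),
]

lengths = {0: [], 1: []}
for label, text in rows:
    lengths[label].append(len(text.split()))   # word count, as in text_length

medians = {label: median(values) for label, values in lengths.items()}
print(medians)
```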


5. Text Vectorization (TF-IDF)
We convert the cleaned text into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). This method weights each term by how often it occurs in a document while reducing the weight of terms that appear in many documents across the corpus.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Use TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000)

#Fit and transform the text data (kept as a sparse matrix to save memory)
X = tfidf.fit_transform(combined_df['cleaned_text'])
y = combined_df['label']

#Check shape of feature matrix
print(X.shape)


Explanation:
 - TF-IDF converts the text into a numerical feature matrix.
 - `max_features=5000` limits the vocabulary to the 5000 terms with the highest frequency across the corpus.
 - The feature matrix `X` is ready for model training.
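
To make the weighting concrete, the sketch below computes TF-IDF by hand for a hypothetical three-document corpus. It assumes the smoothed IDF formula that scikit-learn documents for its default settings, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector; it is an illustration rather than a replacement for `TfidfVectorizer`.

```python
import math

# Hypothetical tokenized corpus of three tiny documents
docs = [["trump", "video", "video"], ["trump", "senate"], ["trump", "vote"]]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF, the formula scikit-learn documents for smooth_idf=True
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(doc):
    """Raw term count times IDF, then L2-normalized."""
    weights = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(doc) for doc in docs]
for term, weight in zip(vocab, vectors[1]):
    print(f"{term}: {weight:.3f}")
```

In the second document, "senate" receives a higher weight than "trump" even though both occur once, because "trump" appears in every document and therefore has the lowest IDF.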


6. Train-Test Split
We split the data into training and testing sets. This ensures that we train the model on one part of the data and test its performance on unseen data.


In [None]:
from sklearn.model_selection import train_test_split

#Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)

Explanation:
 - The data is split into 80% for training and 20% for testing.
 - `random_state=42` ensures that the split is reproducible.
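
The idea behind a seeded split can be sketched in a few lines of plain Python. The helper `simple_split` below is hypothetical and only illustrates the principle (shuffle with a fixed seed, then cut); it is not how `train_test_split` is implemented, and the same seed will not reproduce scikit-learn's split.

```python
import random

def simple_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and use the first test_size share as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

samples = [f"article_{i}" for i in range(10)]   # hypothetical article ids
train, test = simple_split(samples)
train_again, test_again = simple_split(samples)

print(len(train), len(test))
print(test == test_again)
```

Because the seed is fixed, repeated calls return the same partition, which is what makes results comparable between runs.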

7. Model Training: Logistic Regression
We use Logistic Regression, a commonly used algorithm for binary classification tasks. It models the probability that a given input belongs to a certain class.


In [None]:
from sklearn.linear_model import LogisticRegression

#Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000)

#Train the model
model.fit(X_train, y_train)

#Make predictions
y_pred = model.predict(X_test)


Explanation:
 - We train the logistic regression model on the training data.
 - Once trained, the model is used to predict the class (fake or true) for the test data.
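
Under the hood, a fitted logistic regression turns a weighted sum of the features into a probability with the sigmoid function. The sketch below uses hypothetical weights and feature values (not values learned from this dataset) to show that calculation for a single article.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_true(features, weights, bias):
    """Probability of class 1 (true news) for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights for three TF-IDF features and one article's feature values
weights = [1.5, -2.0, 0.5]
bias = 0.0
features = [0.6, 0.1, 0.8]

p_true = predict_proba_true(features, weights, bias)
label = 1 if p_true > 0.5 else 0   # class 1 when the probability exceeds 0.5
print(f"P(true) = {p_true:.3f}, predicted label = {label}")
```

Positive weights push an article toward the "true" class and negative weights toward "fake", which is why the learned coefficients can later be inspected to see which terms drive the predictions.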

8. Model Evaluation

A. Accuracy and Classification Report
The accuracy and classification report give a detailed breakdown of how well the model performs. The classification report shows precision, recall, and F1-score for both classes.


In [None]:
from sklearn.metrics import accuracy_score, classification_report

#Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

#Classification report
print(classification_report(y_test, y_pred))

Explanation:
 - Accuracy is the percentage of correctly classified news articles.
 - The classification report provides precision, recall, and F1-score for both fake and true news.
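
The metrics in the report follow directly from four counts. As a worked example with hypothetical counts (not results from this notebook), treating "true news" (label 1) as the positive class:

```python
# Hypothetical counts, treating "true news" (label 1) as the positive class
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of the articles predicted true, how many are true
recall = tp / (tp + fn)             # of the actual true articles, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The classification report repeats the same calculation with each class taken in turn as the positive class.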

B. Confusion Matrix
We visualize the confusion matrix to see how well the model distinguishes between fake and true articles.

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

#Confusion matrix
cm = confusion_matrix(y_test, y_pred)

#Plot the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Fake', 'True'], yticklabels=['Fake', 'True'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


Explanation:
 - The confusion matrix helps visualize the number of correct and incorrect predictions made by the model.
 - It shows how many fake news articles were misclassified as true and vice versa.
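
To read the heatmap correctly, it helps to know how the cells are filled. The sketch below builds a 2x2 matrix by hand from hypothetical label lists, using the same convention as `confusion_matrix`: rows are actual labels and columns are predicted labels.

```python
# Hypothetical actual and predicted labels: 0 = fake, 1 = true
y_actual    = [0, 0, 0, 1, 1, 1, 1, 0]
y_predicted = [0, 1, 0, 1, 0, 1, 1, 0]

# matrix[i][j] counts samples with actual label i and predicted label j,
# the same row/column convention used by sklearn's confusion_matrix
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_actual, y_predicted):
    matrix[actual][predicted] += 1

fake_called_true = matrix[0][1]   # top-right cell
true_called_fake = matrix[1][0]   # bottom-left cell
print(matrix, fake_called_true, true_called_fake)
```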


9. Experiment with Other Models (Optional)
You can experiment with other machine learning models, like Random Forest, to see if a more complex algorithm performs better than Logistic Regression.


In [None]:
from sklearn.ensemble import RandomForestClassifier

#Initialize Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)   #Fixed seed for reproducible results

#Train the model
rf_model.fit(X_train, y_train)

#Make predictions
y_pred_rf = rf_model.predict(X_test)

#Evaluate Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {accuracy_rf * 100:.2f}%')


Explanation:
 - Random Forest is an ensemble learning method that combines multiple decision trees to improve accuracy and robustness.
 - Training and testing this model allows for comparison with Logistic Regression.
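
The idea of combining trees can be illustrated with a simple majority vote over hypothetical per-tree predictions. This is a simplification: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, but the intuition is similar.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by most trees for a single article."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for three articles (0 = fake, 1 = true)
per_article_votes = [
    [0, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 1, 1, 0, 0],
]

ensemble_predictions = [majority_vote(votes) for votes in per_article_votes]
print(ensemble_predictions)
```

Individual trees can disagree on an article, and the ensemble is more robust because a few wrong trees are outvoted by the rest.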




Conclusion
- The project demonstrates the full pipeline for a fake news detection system using machine learning.
- We started with raw text data, cleaned and preprocessed it, and then applied machine learning algorithms to classify the articles.
- Evaluation metrics such as accuracy, precision, recall, and the confusion matrix help assess the model's performance.
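
The vectorization and training steps summarized above can also be chained in a single scikit-learn `Pipeline`. The sketch below is one possible arrangement on hypothetical, already-cleaned toy articles. A side benefit of this design is that the vectorizer is fitted only on the text passed to `fit`, so held-out articles cannot influence the vocabulary or IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy articles: 0 = fake, 1 = true
train_texts = [
    "shocking secret exposed", "shocking hoax exposed", "secret hoax rumor",
    "senate passes budget", "senate votes budget", "passes votes bill",
]
train_labels = [0, 0, 0, 1, 1, 1]

# The vectorizer is fitted inside the pipeline, so it only sees training text
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(train_texts, train_labels)

new_texts = ["shocking rumor exposed", "senate votes bill"]
predictions = pipeline.predict(new_texts)
print(list(predictions))
```

With the real data, `train_texts` and `new_texts` would be the training and test portions of `combined_df['cleaned_text']`, split before any vectorization takes place.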

