## Abstract:
This project detects fake news with machine learning. We compare the accuracy of six approaches: Support Vector Machines (SVM), Random Forest (RF), XGBoost, Logistic Regression, Multinomial Naive Bayes (MNB), and MNB combined with an NLP preprocessing pipeline (stop-word removal and stemming). The dataset has around 50,000 rows, each with a news title, author, text, and a binary label (0/1). MNB achieved the highest accuracy. It is also fast to train and works well on word-count features, which makes it a suitable choice for this task.

In [1]:
import pandas as pd


# Load Dataset

In [2]:
label_df = pd.read_csv('Datasets/submit.csv')
content_df = pd.read_csv('Datasets/test.csv')


In [3]:
# Merging the two DataFrames based on the 'id' column
merged_df = pd.merge(content_df, label_df, on='id')
merged_df

Unnamed: 0,id,title,author,text,label
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning...",0
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...,1
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...,0
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different...",1
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...,1
...,...,...,...,...,...
5195,25995,The Bangladeshi Traffic Jam That Never Ends - ...,Jody Rosen,Of all the dysfunctions that plague the world’...,0
5196,25996,John Kasich Signs One Abortion Bill in Ohio bu...,Sheryl Gay Stolberg,WASHINGTON — Gov. John Kasich of Ohio on Tu...,1
5197,25997,"California Today: What, Exactly, Is in Your Su...",Mike McPhate,Good morning. (Want to get California Today by...,0
5198,25998,300 US Marines To Be Deployed To Russian Borde...,,« Previous - Next » 300 US Marines To Be Deplo...,1
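An inner merge like the one above silently drops any `id` that appears in only one file. A minimal sketch of how that could be checked, using two small made-up frames in place of `content_df` and `label_df` (the toy values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for content_df and label_df
content = pd.DataFrame({"id": [1, 2, 3], "title": ["a", "b", "c"]})
labels = pd.DataFrame({"id": [1, 2, 4], "label": [0, 1, 1]})

# how="outer" keeps unmatched rows; indicator=True adds a "_merge" column
# that says whether each row came from the left frame, the right frame, or both
check = pd.merge(content, labels, on="id", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["id", "_merge"]])

# validate="one_to_one" raises an error if an id is duplicated on either side
merged = pd.merge(content, labels, on="id", validate="one_to_one")
print(len(merged))
```

If `unmatched` is empty on the real data, the inner merge loses no rows.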


In [4]:
# Dropping rows with null values
df = merged_df.dropna()

df

Unnamed: 0,id,title,author,text,label
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning...",0
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...,0
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different...",1
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...,1
6,20806,Pelosi Calls for FBI Investigation to Find Out...,Pam Key,"Sunday on NBC’s “Meet the Press,” House Minori...",1
...,...,...,...,...,...
5194,25994,Trump on If ’Tapes’ Exist of Comey Conversatio...,Pam Key,Pres. Trump on if “tapes” exist of his convers...,1
5195,25995,The Bangladeshi Traffic Jam That Never Ends - ...,Jody Rosen,Of all the dysfunctions that plague the world’...,0
5196,25996,John Kasich Signs One Abortion Bill in Ohio bu...,Sheryl Gay Stolberg,WASHINGTON — Gov. John Kasich of Ohio on Tu...,1
5197,25997,"California Today: What, Exactly, Is in Your Su...",Mike McPhate,Good morning. (Want to get California Today by...,0
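Before dropping rows it can help to see which columns actually contain the missing values, since `dropna()` removes a row when any column is empty. The sketch below uses a made-up frame; filling missing authors with a placeholder is an alternative shown for comparison, not what this notebook does:

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "author": ["A", None, "C"],
    "text": ["x", "y", None],
    "label": [0, 1, 1],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has any missing value
dropped = toy.dropna()

# Option 2 (an assumption): keep rows with a missing author,
# drop only rows that have no text
kept = toy.dropna(subset=["text"]).fillna({"author": "unknown"})
print(len(dropped), len(kept))
```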


# Naive Bayes

In [5]:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [6]:
# Step 1: Select the input features and the target
X = df[['title', 'author', 'text']]  # Input features (title, author, article text)
y = df['label']  # Target: the 0/1 label to be predicted


In [7]:
# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [8]:
# Step 3: Feature extraction (vectorization)
vectorizer = CountVectorizer()  # Initialize the vectorizer
# Join the text columns into one string per article, then count words
X_train_vectorized = vectorizer.fit_transform(X_train['title'] + ' ' + X_train['author'] + ' ' + X_train['text'])
X_test_vectorized = vectorizer.transform(X_test['title'] + ' ' + X_test['author'] + ' ' + X_test['text'])


In [9]:
# Step 4: Train the Naive Bayes classifier
classifier = MultinomialNB()  # Initialize the Naive Bayes classifier
classifier.fit(X_train_vectorized, y_train)  # Train the classifier


In [10]:
# Step 5: Make predictions on the test set
y_pred = classifier.predict(X_test_vectorized)


In [11]:
# # Step 6: Per-class report (needs `from sklearn.metrics import classification_report`)
# report = classification_report(y_test, y_pred)
# print(report)

In [12]:
# Calculate the accuracy
mnb_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", mnb_accuracy)
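The vectorize-then-fit steps can also be bundled into a single scikit-learn pipeline, so the vectorizer and classifier are always fitted and applied together. A minimal sketch on made-up sentences (the toy texts and labels are assumptions for illustration, not rows from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 1 and 0 are simply two classes here
toy_texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the government",
    "city council approves new budget",
    "local team wins championship game",
]
toy_labels = [1, 1, 0, 0]

# The pipeline counts words first, then fits Multinomial Naive Bayes on the counts
nb_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
nb_pipeline.fit(toy_texts, toy_labels)

# Raw strings go in; the pipeline applies the fitted vectorizer internally
toy_prediction = nb_pipeline.predict(["miracle cure shocking"])
print(toy_prediction)
```

All three query words occur only in class-1 training sentences, so the toy model assigns class 1.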


# LogisticRegression

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [14]:
# Step 1: Select the input features and the target
X = df[['title', 'author', 'text']]  # Input features (title, author, article text)
y = df['label']  # Target: the 0/1 label to be predicted


In [15]:
# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [16]:
# Step 3: Feature extraction (vectorization)
vectorizer = CountVectorizer()  # Initialize the vectorizer
# Join the text columns into one string per article, then count words
X_train_vectorized = vectorizer.fit_transform(X_train['title'] + ' ' + X_train['author'] + ' ' + X_train['text'])
X_test_vectorized = vectorizer.transform(X_test['title'] + ' ' + X_test['author'] + ' ' + X_test['text'])


In [17]:
# Step 4: Train the logistic regression classifier
# max_iter is raised from the default of 100 so the solver has room to converge on sparse word counts
classifier = LogisticRegression(max_iter=1000)  # Initialize the logistic regression classifier
classifier.fit(X_train_vectorized, y_train)  # Train the classifier


In [18]:
# Step 5: Make predictions on the test set
y_pred = classifier.predict(X_test_vectorized)


In [19]:

# Calculate the accuracy
lr_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", lr_accuracy)


# Support Vector Machines (SVM)

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score



In [21]:
# Step 1: Select the input features and the target
X = df[['title', 'author', 'text']]  # Input features (title, author, article text)
y = df['label']  # Target: the 0/1 label to be predicted


In [22]:
# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [23]:
# Step 3: Feature extraction (vectorization)
vectorizer = CountVectorizer()  # Initialize the vectorizer
# Join the text columns into one string per article, then count words
X_train_vectorized = vectorizer.fit_transform(X_train['title'] + ' ' + X_train['author'] + ' ' + X_train['text'])
X_test_vectorized = vectorizer.transform(X_test['title'] + ' ' + X_test['author'] + ' ' + X_test['text'])


In [24]:
# Step 4: Train the SVM classifier
classifier = SVC()  # Initialize the SVM classifier (default RBF kernel)
classifier.fit(X_train_vectorized, y_train)  # Train the classifier


In [25]:
# Step 5: Make predictions on the test set
y_pred = classifier.predict(X_test_vectorized)


In [26]:
# Calculate the accuracy
svm_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", svm_accuracy)
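`SVC()` uses an RBF kernel by default. For high-dimensional sparse text features, a linear SVM is a commonly used alternative. The sketch below swaps in `LinearSVC` with TF-IDF features on made-up sentences. Both the swap and the toy data are assumptions for illustration, not the setup used in this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; 1 and 0 are simply two classes here
toy_texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the government",
    "city council approves new budget",
    "local team wins championship game",
]
toy_labels = [1, 1, 0, 0]

# TF-IDF down-weights words that occur in many documents; LinearSVC fits a linear boundary
svm_pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_pipeline.fit(toy_texts, toy_labels)

svm_prediction = svm_pipeline.predict(["council approves budget"])
print(svm_prediction)
```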


# Random Forest

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score



In [28]:
# Step 1: Select the input features and the target
X = df[['title', 'author', 'text']]  # Input features (title, author, article text)
y = df['label']  # Target: the 0/1 label to be predicted


In [29]:
# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [30]:
# Step 3: Feature extraction (vectorization)
vectorizer = CountVectorizer()  # Initialize the vectorizer
# Join the text columns into one string per article, then count words
X_train_vectorized = vectorizer.fit_transform(X_train['title'] + ' ' + X_train['author'] + ' ' + X_train['text'])
X_test_vectorized = vectorizer.transform(X_test['title'] + ' ' + X_test['author'] + ' ' + X_test['text'])


In [31]:
# Step 4: Train the Random Forest classifier
classifier = RandomForestClassifier(random_state=42)  # Fixed seed so repeated runs give the same forest
classifier.fit(X_train_vectorized, y_train)  # Train the classifier


In [32]:
# Step 5: Make predictions on the test set
y_pred = classifier.predict(X_test_vectorized)


In [33]:
# Calculate the accuracy
rf_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", rf_accuracy)
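A random forest draws random bootstrap samples and feature subsets, so its accuracy can change from run to run unless the seed is fixed. The sketch below, on made-up sentences (toy data, assumed for illustration), shows that two forests built with the same `random_state` give identical probabilities:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; 1 and 0 are simply two classes here
toy_texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the government",
    "city council approves new budget",
    "local team wins championship game",
]
toy_labels = [1, 1, 0, 0]

toy_X = CountVectorizer().fit_transform(toy_texts)

# Same seed, same data: both forests grow the same trees
forest_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(toy_X, toy_labels)
forest_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(toy_X, toy_labels)

same = bool((forest_a.predict_proba(toy_X) == forest_b.predict_proba(toy_X)).all())
print(same)
```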


# Gradient Boosting [XGBoost]

In [34]:
import nltk
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier  # Separate package; install it first, e.g. `pip install xgboost`
from sklearn.metrics import accuracy_score


ModuleNotFoundError: No module named 'xgboost'

In [None]:
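# Copy so new columns are not written into merged_df; rows without text cannot be tokenized
df = merged_df.dropna(subset=['text']).copy()

In [None]:
# Step 2: Preprocess the text data
nltk.download('punkt')  # Tokenizer data needed by nltk.word_tokenize
stemmer = PorterStemmer()
df['processed_text'] = df['text'].apply(lambda x: ' '.join(stemmer.stem(word) for word in nltk.word_tokenize(x.lower())))
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['processed_text'])
y = df['label']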

In [None]:
# Step 3: Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Step 4: Train the XGBoost model
model = XGBClassifier()
model.fit(X_train, y_train)


In [None]:
# Step 5: Make predictions on the test set
y_pred = model.predict(X_test)


In [None]:
# Step 6: Evaluate the model
gb_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", gb_accuracy)
print("Accuracy %: {:.2f}%".format(gb_accuracy * 100))
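Accuracy is a single number and does not show which kind of mistake a classifier makes. A confusion matrix, together with precision and recall, splits the errors by class. The sketch below uses made-up true and predicted labels (hypothetical values, not output from any model in this notebook):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels for eight articles
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)
print(cm)

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_precision = precision_score(toy_true, toy_pred)  # of the articles predicted 1, how many are 1
toy_recall = recall_score(toy_true, toy_pred)        # of the articles that are 1, how many were found
print(toy_accuracy, toy_precision, toy_recall)
```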

# Natural Language Processing (NLP)

In [None]:
import nltk
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


In [None]:
# ## if path is available for stopwords


# # Step 2: Data Cleaning
# def clean_text(text, stop_words):
#     # Convert to lowercase
#     text = text.lower()
    
#     # Remove special characters and punctuation
#     text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
#     # Remove extra whitespaces
#     text = re.sub(r'\s+', ' ', text).strip()
    
#     # Remove stop words
#     tokens = nltk.word_tokenize(text)
#     filtered_tokens = [word for word in tokens if word not in stop_words]
#     text = ' '.join(filtered_tokens)
    
#     return text

# stop_words_path = r'C:/Users/NayeemIslam/AppData/Roaming/nltk_data/corpora/stopwords'  # Custom stop words file path

# with open(stop_words_path, 'r') as file:
#     stop_words = set(file.read().splitlines())

# df['clean_text'] = df['text'].apply(lambda x: clean_text(x, stop_words))


In [None]:
# using custom list of stop words
stop_words_list = {"0o", "0s", "3a", "3b", "3d", "6b", "6o", "a", "a1", "a2", "a3", "a4", "ab", "able", "about", "above", "abst", "ac", "accordance", "according", "accordingly", "across", "act", "actually", "ad", "added", "adj", "ae", "af", "affected", "affecting", "affects", "after", "afterwards", "ag", "again", "against", "ah", "ain", "ain't", "aj", "al", "all", "allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "ao", "ap", "apart", "apparently", "appear", "appreciate", "appropriate", "approximately", "ar", "are", "aren", "arent", "aren't", "arise", "around", "as", "a's", "aside", "ask", "asking", "associated", "at", "au", "auth", "av", "available", "aw", "away", "awfully", "ax", "ay", "az", "b", "b1", "b2", "b3", "ba", "back", "bc", "bd", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "bi", "bill", "biol", "bj", "bk", "bl", "bn", "both", "bottom", "bp", "br", "brief", "briefly", "bs", "bt", "bu", "but", "bx", "by", "c", "c1", "c2", "c3", "ca", "call", "came", "can", "cannot", "cant", "can't", "cause", "causes", "cc", "cd", "ce", "certain", "certainly", "cf", "cg", "ch", "changes", "ci", "cit", "cj", "cl", "clearly", "cm", "c'mon", "cn", "co", "com", "come", "comes", "con", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn", "couldnt", "couldn't", "course", "cp", "cq", "cr", "cry", "cs", "c's", "ct", "cu", "currently", "cv", "cx", "cy", "cz", "d", "d2", "da", "date", "dc", "dd", "de", "definitely", "describe", "described", "despite", "detail", "df", "di", "did", "didn", "didn't", 
"different", "dj", "dk", "dl", "do", "does", "doesn", "doesn't", "doing", "don", "done", "don't", "down", "downwards", "dp", "dr", "ds", "dt", "du", "due", "during", "dx", "dy", "e", "e2", "e3", "ea", "each", "ec", "ed", "edu", "ee", "ef", "effect", "eg", "ei", "eight", "eighty", "either", "ej", "el", "eleven", "else", "elsewhere", "em", "empty", "en", "end", "ending", "enough", "entirely", "eo", "ep", "eq", "er", "es", "especially", "est", "et", "et-al", "etc", "eu", "ev", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "ey", "f", "f2", "fa", "far", "fc", "few", "ff", "fi", "fifteen", "fifth", "fify", "fill", "find", "fire", "first", "five", "fix", "fj", "fl", "fn", "fo", "followed", "following", "follows", "for", "former", "formerly", "forth", "forty", "found", "four", "fr", "from", "front", "fs", "ft", "fu", "full", "further", "furthermore", "fy", "g", "ga", "gave", "ge", "get", "gets", "getting", "gi", "give", "given", "gives", "giving", "gj", "gl", "go", "goes", "going", "gone", "got", "gotten", "gr", "greetings", "gs", "gy", "h", "h2", "h3", "had", "hadn", "hadn't", "happens", "hardly", "has", "hasn", "hasnt", "hasn't", "have", "haven", "haven't", "having", "he", "hed", "he'd", "he'll", "hello", "help", "hence", "her", "here", "hereafter", "hereby", "herein", "heres", "here's", "hereupon", "hers", "herself", "hes", "he's", "hh", "hi", "hid", "him", "himself", "his", "hither", "hj", "ho", "home", "hopefully", "how", "howbeit", "however", "how's", "hr", "hs", "http", "hu", "hundred", "hy", "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ibid", "ic", "id", "i'd", "ie", "if", "ig", "ignored", "ih", "ii", "ij", "il", "i'll", "im", "i'm", "immediate", "immediately", "importance", "important", "in", "inasmuch", "inc", "indeed", "index", "indicate", "indicated", "indicates", "information", "inner", "insofar", "instead", "interest", "into", "invention", "inward", "io", "ip", "iq", "ir", "is", 
"isn", "isn't", "it", "itd", "it'd", "it'll", "its", "it's", "itself", "iv", "i've", "ix", "iy", "iz", "j", "jj", "jr", "js", "jt", "ju", "just", "k", "ke", "keep", "keeps", "kept", "kg", "kj", "km", "know", "known", "knows", "ko", "l", "l2", "la", "largely", "last", "lately", "later", "latter", "latterly", "lb", "lc", "le", "least", "les", "less", "lest", "let", "lets", "let's", "lf", "like", "liked", "likely", "line", "little", "lj", "ll", "ll", "ln", "lo", "look", "looking", "looks", "los", "lr", "ls", "lt", "ltd", "m", "m2", "ma", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "mightn", "mightn't", "mill", "million", "mine", "miss", "ml", "mn", "mo", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "ms", "mt", "mu", "much", "mug", "must", "mustn", "mustn't", "my", "myself", "n", "n2", "na", "name", "namely", "nay", "nc", "nd", "ne", "near", "nearly", "necessarily", "necessary", "need", "needn", "needn't", "needs", "neither", "never", "nevertheless", "new", "next", "ng", "ni", "nine", "ninety", "nj", "nl", "nn", "no", "nobody", "non", "none", "nonetheless", "noone", "nor", "normally", "nos", "not", "noted", "nothing", "novel", "now", "nowhere", "nr", "ns", "nt", "ny", "o", "oa", "ob", "obtain", "obtained", "obviously", "oc", "od", "of", "off", "often", "og", "oh", "oi", "oj", "ok", "okay", "ol", "old", "om", "omitted", "on", "once", "one", "ones", "only", "onto", "oo", "op", "oq", "or", "ord", "os", "ot", "other", "others", "otherwise", "ou", "ought", "our", "ours", "ourselves", "out", "outside", "over", "overall", "ow", "owing", "own", "ox", "oz", "p", "p1", "p2", "p3", "page", "pagecount", "pages", "par", "part", "particular", "particularly", "pas", "past", "pc", "pd", "pe", "per", "perhaps", "pf", "ph", "pi", "pj", "pk", "pl", "placed", "please", "plus", "pm", "pn", "po", "poorly", "possible", "possibly", "potentially", "pp", "pq", "pr", "predominantly", "present", 
"presumably", "previously", "primarily", "probably", "promptly", "proud", "provides", "ps", "pt", "pu", "put", "py", "q", "qj", "qu", "que", "quickly", "quite", "qv", "r", "r2", "ra", "ran", "rather", "rc", "rd", "re", "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "research-articl", "respectively", "resulted", "resulting", "results", "rf", "rh", "ri", "right", "rj", "rl", "rm", "rn", "ro", "rq", "rr", "rs", "rt", "ru", "run", "rv", "ry", "s", "s2", "sa", "said", "same", "saw", "say", "saying", "says", "sc", "sd", "se", "sec", "second", "secondly", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "sf", "shall", "shan", "shan't", "she", "shed", "she'd", "she'll", "shes", "she's", "should", "shouldn", "shouldn't", "should've", "show", "showed", "shown", "showns", "shows", "si", "side", "significant", "significantly", "similar", "similarly", "since", "sincere", "six", "sixty", "sj", "sl", "slightly", "sm", "sn", "so", "some", "somebody", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "sp", "specifically", "specified", "specify", "specifying", "sq", "sr", "ss", "st", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", "sufficiently", "suggest", "sup", "sure", "sy", "system", "sz", "t", "t1", "t2", "t3", "take", "taken", "taking", "tb", "tc", "td", "te", "tell", "ten", "tends", "tf", "th", "than", "thank", "thanks", "thanx", "that", "that'll", "thats", "that's", "that've", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof", "therere", "theres", "there's", "thereto", "thereupon", "there've", "these", "they", "theyd", "they'd", "they'll", "theyre", "they're", "they've", 
"thickv", "thin", "think", "third", "this", "thorough", "thoroughly", "those", "thou", "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "ti", "til", "tip", "tj", "tl", "tm", "tn", "to", "together", "too", "took", "top", "toward", "towards", "tp", "tq", "tr", "tried", "tries", "truly", "try", "trying", "ts", "t's", "tt", "tv", "twelve", "twenty", "twice", "two", "tx", "u", "u201d", "ue", "ui", "uj", "uk", "um", "un", "under", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "uo", "up", "upon", "ups", "ur", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ut", "v", "va", "value", "various", "vd", "ve", "ve", "very", "via", "viz", "vj", "vo", "vol", "vols", "volumtype", "vq", "vs", "vt", "vu", "w", "wa", "want", "wants", "was", "wasn", "wasnt", "wasn't", "way", "we", "wed", "we'd", "welcome", "well", "we'll", "well-b", "went", "were", "we're", "weren", "werent", "weren't", "we've", "what", "whatever", "what'll", "whats", "what's", "when", "whence", "whenever", "when's", "where", "whereafter", "whereas", "whereby", "wherein", "wheres", "where's", "whereupon", "wherever", "whether", "which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "who'll", "whom", "whomever", "whos", "who's", "whose", "why", "why's", "wi", "widely", "will", "willing", "wish", "with", "within", "without", "wo", "won", "wonder", "wont", "won't", "words", "world", "would", "wouldn", "wouldnt", "wouldn't", "www", "x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y", "y2", "yes", "yet", "yj", "yl", "you", "youd", "you'd", "you'll", "your", "youre", "you're", "yours", "yourself", "yourselves", "you've", "yr", "ys", "yt", "z", "zero", "zi", "zz",}

In [None]:
import nltk
nltk.download('punkt')
# Step 2: Data Cleaning
def clean_text(text, stop_words):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and punctuation
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove stop words
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    text = ' '.join(filtered_tokens)
    
    return text



df['clean_text'] = df['text'].apply(lambda x: clean_text(x, stop_words_list))
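To see what the cleaning step does to a single sentence, the sketch below repeats the same logic on a made-up example. It uses `str.split` in place of `nltk.word_tokenize` so that it runs without NLTK data, and a four-word stop list (both are assumptions for illustration):

```python
import re

def clean_text_demo(text, stop_words):
    # Same steps as clean_text: lowercase, strip punctuation, collapse whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # str.split stands in for nltk.word_tokenize in this sketch
    tokens = text.split()
    return ' '.join(word for word in tokens if word not in stop_words)

sample_stop_words = {"the", "is", "a", "don't"}
cleaned = clean_text_demo("The  Report is NOT a hoax -- don't panic!", sample_stop_words)
print(cleaned)  # report not hoax dont panic
```

Punctuation is removed before the stop-word filter runs, so "don't" becomes "dont" and no longer matches the stop-list entry "don't". Entries with apostrophes therefore have no effect in this order of operations.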


In [None]:
# Step 3: Tokenization and Stemming
stemmer = PorterStemmer()
df['stemmed_tokens'] = df['clean_text'].apply(lambda x: [stemmer.stem(word) for word in x.split()])



In [None]:
# Step 4: Feature Extraction (Bag-of-Words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['stemmed_tokens'].apply(' '.join))
y = df['label']



In [None]:
# Step 5: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
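A plain random split can leave the test set with a different class balance than the training set. Passing `stratify=y` keeps the proportions the same in both parts. The notebook does not use this option. The sketch below shows it on made-up data as an alternative:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten samples, five per class
toy_X = list(range(10))
toy_y = [0] * 5 + [1] * 5

# stratify=toy_y keeps the 50/50 class ratio in both the train and the test part
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
print(sorted(toy_y_test))  # one sample from each class
```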


In [None]:
# Step 6: Train and evaluate the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)



In [None]:
# Step 7: Make predictions on the test set
y_pred = model.predict(X_test)



In [None]:
# Step 8: Evaluate the model
nlp_mnb_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", nlp_mnb_accuracy)


# Compare Accuracy 

In [None]:
import matplotlib.pyplot as plt

# List of accuracy values for each method
accuracy_values = [svm_accuracy, rf_accuracy, nlp_mnb_accuracy, gb_accuracy, lr_accuracy, mnb_accuracy]

# List of method names
method_names = ['SVM', 'Random Forest', 'NLP-MNB', 'XGBoost', 'Logistic Regression', 'MNB']

# Create a bar chart to compare accuracies
plt.figure(figsize=(8, 6))
plt.bar(method_names, accuracy_values)
plt.xlabel('Machine Learning Methods')
plt.ylabel('Accuracy')
plt.title('Comparison of Accuracy for Fake News Detection')
plt.ylim(0, 1)  # Set the y-axis limit to show accuracy from 0 to 1
plt.show()


# Example using MultinomialNB Model

In [None]:
import nltk
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: Refit the vectorizer and the model on the full dataset
# (assumes df['stemmed_tokens'] already exists from the NLP section above)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['stemmed_tokens'].apply(' '.join))
y = df['label']
model = MultinomialNB()
model.fit(X, y)


In [None]:
# Step 2: Function to clean and preprocess the given text (same steps as used for training)
def preprocess_text(text):
    # Lowercase, strip punctuation and extra whitespace, and remove stop words
    text = clean_text(text, stop_words_list)

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in text.split()]
    processed_text = ' '.join(stemmed_tokens)

    return processed_text

In [None]:
# Step 3: Function to predict the label (true/false) for the given text
def predict_label(text, author, title):
    # Preprocess the text
    processed_text = preprocess_text(text)
    
    # Create the feature vector using the same vectorizer used for training
    feature_vector = vectorizer.transform([processed_text + ' ' + author + ' ' + title])
    
    # Make the prediction using the trained model
    prediction = model.predict(feature_vector)[0]
    
    # Return the predicted label (0 for false, 1 for true)
    return prediction


In [None]:
# Step 4: Example usage
text = (
    "If at first you don’t succeed, try a different sport. Tim Tebow, who was a Heisman quarterback at the University of Florida but was unable to hold an N. F. L. job, is pursuing a career in Major League Baseball. He will hold a workout for M. L. B. teams this month, his agents told ESPN and other news outlets. “This may sound like a publicity stunt, but nothing could be further from the truth,” said Brodie Van Wagenen,   of CAA Baseball, part of the sports agency CAA Sports, in the statement. “I have seen Tim’s workouts, and people inside and outside the industry  —   scouts, executives, players and fans  —   will be impressed by his talent. ” It’s been over a decade since Tebow, 28, has played baseball full time, which means a comeback would be no easy task. But the former major league catcher Chad Moeller, who said in the statement that he had been training Tebow in Arizona, said he was “beyond impressed with Tim’s athleticism and swing. ” “I see bat speed and power and real baseball talent,” Moeller said. “I truly believe Tim has the skill set and potential to achieve his goal of playing in the major leagues and based on what I have seen over the past two months, it could happen relatively quickly. ” Or, take it from Gary Sheffield, the former   outfielder. News of Tebow’s attempted comeback in baseball was greeted with skepticism on Twitter. As a junior at Nease High in Ponte Vedra, Fla. Tebow drew the attention of major league scouts, batting . 494 with four home runs as a left fielder. But he ditched the bat and glove in favor of pigskin, leading Florida to two national championships, in 2007 and 2009. Two former scouts for the Los Angeles Angels told WEEI, a Boston radio station, that Tebow had been under consideration as a high school junior. "
    "“We wanted to draft him, but he never sent back his information card,” said one of the scouts, Tom Kotchman, referring to a questionnaire the team had sent him. “He had a strong arm and had a lot of power,” said the other scout, Stephen Hargett. “If he would have been there his senior year he definitely would have had a good chance to be drafted. ” “It was just easy for him,” Hargett added. “You thought, If this guy dedicated everything to baseball like he did to football how good could he be?” Tebow’s high school baseball coach, Greg Mullins, told The Sporting News in 2013 that he believed Tebow could have made the major leagues. “He was the leader of the team with his passion, his fire and his energy,” Mullins said. “He loved to play baseball, too. He just had a bigger fire for football. ” Tebow wouldn’t be the first athlete to switch from the N. F. L. to M. L. B. Bo Jackson had one   season as a Kansas City Royal, and Deion Sanders played several years for the Atlanta Braves with mixed success. Though Michael Jordan tried to cross over to baseball from basketball as a    in 1994, he did not fare as well playing one year for a Chicago White Sox minor league team. As a football player, Tebow was unable to match his college success in the pros. The Denver Broncos drafted him in the first round of the 2010 N. F. L. Draft, and he quickly developed a reputation for clutch performances, including a memorable   pass against the Pittsburgh Steelers in the 2011 Wild Card round. But his stats and his passing form weren’t pretty, and he spent just two years in Denver before moving to the Jets in 2012, where he spent his last season on an N. F. L. roster. He was cut during preseason from the New England Patriots in 2013 and from the Philadelphia Eagles in 2015."
)
author = "Daniel Victor"
title = "Tim Tebow Will Attempt Another Comeback, This Time in Baseball - The New York Times"


prediction = predict_label(text, author, title)


if prediction == 0:
    print("The given text is predicted to be FALSE.")
else:
    print("The given text is predicted to be TRUE.")

print("prediction=",prediction)
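A hard 0/1 prediction hides how sure the model is. Multinomial Naive Bayes also exposes `predict_proba`, which returns one probability per class. The sketch below shows the call on a small pipeline fitted to made-up sentences (toy data and names, assumed for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 1 and 0 are simply two classes here
toy_texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the government",
    "city council approves new budget",
    "local team wins championship game",
]
toy_labels = [1, 1, 0, 0]

toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

# predict_proba returns one row per input; columns follow toy_pipeline.classes_
probabilities = toy_pipeline.predict_proba(["miracle cure shocking"])[0]
for cls, p in zip(toy_pipeline.classes_, probabilities):
    print(f"class {cls}: {p:.3f}")
```

Naive Bayes probabilities tend to be pushed towards 0 or 1 on long texts, so they are better read as a ranking signal than as calibrated confidence.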