# Fake News Detection

## Steps for building a Fake News Detector

- Preprocessing: Clean the text by removing irrelevant content such as HTML tags, punctuation, and stopwords. Then convert it into a numerical representation that machine learning algorithms can process, for example TF-IDF (Term Frequency-Inverse Document Frequency) vectors or word embeddings such as Word2Vec or GloVe.

- Feature Extraction: Extract relevant features from the preprocessed text data. These features can include word frequencies, n-grams, or other linguistic features that may help distinguish between real and fake news.

- Model Training: Split the labeled dataset into training and testing sets, choose a machine learning algorithm, and train it on the training set. The held-out testing set is kept aside to evaluate the model's performance.

- Model Evaluation: Evaluate the trained model using appropriate evaluation metrics like accuracy, precision, recall, and F1 score. Adjust the model parameters or try different algorithms if the results are not satisfactory.
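
To make the "numerical representation" step concrete, here is a minimal sketch of TF-IDF with unigrams and bigrams using scikit-learn's `TfidfVectorizer`. The three short documents are made up for illustration and are not taken from the dataset; `ngram_range=(1, 2)` is an assumed setting, not one used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "senate passes budget bill",
    "shocking budget secret revealed",
    "senate vote on budget delayed",
]

# ngram_range=(1, 2) keeps single words and two-word phrases as features
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_      # maps each term to its column index
print("matrix shape:", X.shape)
print("is 'budget bill' a feature?", "budget bill" in vocab)

# A term that appears in every document gets the lowest IDF weight
idf_budget = vectorizer.idf_[vocab["budget"]]
idf_senate = vectorizer.idf_[vocab["senate"]]
print("idf(budget):", idf_budget, "idf(senate):", idf_senate)
```

Terms that occur in every document, such as "budget" here, carry little information for telling documents apart, so they receive the smallest inverse-document-frequency weight.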

In [67]:
import numpy as np
import pandas as pd
import seaborn as sns
import re 
import string 

df_true = pd.read_csv("True.csv")
df_fake = pd.read_csv("Fake.csv")

In [68]:
df_true.columns

Index(['title', 'text', 'subject', 'date'], dtype='object')

In [69]:
df_fake.columns

Index(['title', 'text', 'subject', 'date'], dtype='object')

In [70]:
df_true.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [71]:
df_fake.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


### Text Preprocessing

In [72]:
df_true = df_true.drop(["title", "date", "subject"], axis = 1)
df_fake = df_fake.drop(["title", "date", "subject"], axis = 1)

In [73]:
df_true['label'] = 1
df_fake['label'] = 0

In [74]:
df_true.head()

| | text | label |
|---|---|---|
| 0 | WASHINGTON (Reuters) - The head of a conservat... | 1 |
| 1 | WASHINGTON (Reuters) - Transgender people will... | 1 |
| 2 | WASHINGTON (Reuters) - The special counsel inv... | 1 |
| 3 | WASHINGTON (Reuters) - Trump campaign adviser ... | 1 |
| 4 | SEATTLE/WASHINGTON (Reuters) - President Donal... | 1 |


In [75]:
df_fake.head()

| | text | label |
|---|---|---|
| 0 | Donald Trump just couldn t wish all Americans ... | 0 |
| 1 | House Intelligence Committee Chairman Devin Nu... | 0 |
| 2 | On Friday, it was revealed that former Milwauk... | 0 |
| 3 | On Christmas day, Donald Trump announced that ... | 0 |
| 4 | Pope Francis used his annual Christmas Day mes... | 0 |


In [76]:
df = pd.concat([df_fake, df_true], axis =0 )

In [77]:
df = df.sample(frac = 1)
df.head()

| | text | label |
|---|---|---|
| 1247 | (Reuters) - U.S. Senator Susan Collins of Main... | 1 |
| 19635 | (Reuters) - The British government has told Ge... | 1 |
| 13124 | HELSINKI/MOSCOW (Reuters) - Finland s defense ... | 1 |
| 10082 | WASHINGTON (Reuters) - The White House said on... | 1 |
| 2553 | (Reuters) - The Trump administration on Tuesda... | 1 |
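
The combined frame keeps the row labels of the two source files, so index values repeat after `pd.concat`, and an unseeded `sample` gives a different order on every run. A minimal sketch of a reproducible shuffle with a clean index, using hypothetical toy frames in place of `df_fake` and `df_true` (the `random_state` value is arbitrary):

```python
import pandas as pd

# Hypothetical stand-ins for df_fake and df_true
fake = pd.DataFrame({"text": ["f1", "f2", "f3"], "label": 0})
true = pd.DataFrame({"text": ["t1", "t2"], "label": 1})

combined = pd.concat([fake, true], axis=0)
# Each source frame contributes its own 0-based index, so labels repeat
print(combined.index.tolist())

# Shuffle reproducibly, then rebuild a unique 0..n-1 index
shuffled = combined.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```

A unique index is not required for the steps that follow, but it avoids ambiguity if rows are later selected by label.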


In [78]:
# import nltk 
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
# from nltk.stem import WordNetLemmatizer
# from nltk.tokenize import RegexpTokenizer

# # Download necessary resources
# nltk.download('omw-1.4')
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')


In [79]:
# def preprocess_text(text):
#     # Lowercasing
#     text = text.lower()

#     # Remove URLs, email addresses, or other specific patterns
#     text = re.sub(r'\S+@\S+', '', text)  # Remove email addresses
#     text = re.sub(r'http\S+', '', text)  # Remove URLs

#     # Remove punctuation and special characters
#     text = re.sub(r'[^\w\s]', '', text)

#     # Tokenization
#     tokenizer = RegexpTokenizer(r'\w+')
#     tokens = tokenizer.tokenize(text)

#     # Stopword Removal
#     stop_words = set(stopwords.words('english'))
#     tokens = [token for token in tokens if token not in stop_words]

#     # Lemmatization
#     lemmatizer = WordNetLemmatizer()
#     tokens = [lemmatizer.lemmatize(token) for token in tokens]

#     # Remove numerical data
#     tokens = [token for token in tokens if not token.isnumeric()]

#     return tokens

In [80]:
def wordprocessing(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    # URLs and HTML tags must be removed before non-word characters are stripped,
    # otherwise "://", "." and "<>" are already gone and these patterns never match
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

- text = text.lower(): Converts the text to lowercase.
- text = re.sub(r'\[.*?\]', '', text): Removes any text enclosed in square brackets, including the brackets themselves.
- text = re.sub(r'https?://\S+|www\.\S+', '', text): Removes any URLs or website addresses starting with "http://", "https://", or "www.". This runs before the non-word replacement, which would otherwise break up "://" and "." so that the pattern never matches.
- text = re.sub(r'<.*?>+', '', text): Removes any HTML tags enclosed in angle brackets, including the brackets themselves. This also has to run while the angle brackets are still present.
- text = re.sub(r'\W', ' ', text): Replaces every non-word character (punctuation, whitespace, newlines) with a space.
- text = re.sub('[%s]' % re.escape(string.punctuation), '', text): Removes any remaining punctuation marks. After the previous step only the underscore is left, because \W treats it as a word character.
- text = re.sub(r'\n', '', text): Removes newline characters. This is a safeguard, since the \W step already turns them into spaces.
- text = re.sub(r'\w*\d\w*', '', text): Removes any word that contains a digit.
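
The order of these substitutions matters. The sketch below, using a made-up sample string and two hypothetical helper functions, compares stripping non-word characters before the URL pattern with stripping them afterwards:

```python
import re

# Hypothetical sample sentence containing a URL and an HTML tag
sample = "Read more at https://example.com/story?id=1 <b>now</b>!"

def strip_then_urls(text):
    # Non-word characters are replaced first, so "://" no longer exists
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    return " ".join(text.split())

def urls_then_strip(text):
    # URL and tag patterns run while their delimiters are still present
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>+", "", text)
    text = re.sub(r"\W", " ", text)
    return " ".join(text.split())

print(strip_then_urls(sample))  # Read more at https example com story id 1 b now b
print(urls_then_strip(sample))  # Read more at now
```

In the first variant the URL fragments and the tag name `b` survive as ordinary words and would end up in the vocabulary; in the second they are removed.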

In [92]:
# Apply preprocessing to 'text' column
df['text'] = df['text'].apply(lambda x: wordprocessing(str(x)))

In [93]:
df.head()

| | text | label |
|---|---|---|
| 1247 | reuters u s senator susan collins of main... | 1 |
| 19635 | reuters the british government has told ge... | 1 |
| 13124 | helsinki moscow reuters finland s defense ... | 1 |
| 10082 | washington reuters the white house said on... | 1 |
| 2553 | reuters the trump administration on tuesda... | 1 |


## Machine Learning

In [82]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [83]:
x, y = df['text'], df['label']

In [84]:
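# Splitting train and test sets (test_size=0.25 is the default; random_state makes the split reproducible)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)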

In [85]:
print('x train shape:', x_train.shape)

x train shape: (33673,)


In [86]:
print('x test shape:', x_test.shape)

x test shape: (11225,)


In [87]:
print('y train shape:', y_train.shape)

y train shape: (33673,)


In [88]:
print('y test shape:', y_test.shape)

y test shape: (11225,)
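
These shapes follow from a 25% test fraction, which is the default in `train_test_split`: 33673 + 11225 = 44898 rows in total. A minimal sketch that reproduces the sizes on placeholder arrays; the arrays, the `random_state` value, and the use of `stratify` are assumptions for illustration, not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 44898                 # 33673 + 11225, from the shapes printed above
x = np.arange(n_samples)          # placeholder for the text column
y = np.arange(n_samples) % 2      # placeholder alternating 0/1 labels

# stratify=y keeps the class proportions the same in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y
)
print(len(x_tr), len(x_te))

# The same random_state yields exactly the same split again
x_tr2, x_te2, _, _ = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y
)
print(np.array_equal(x_te, x_te2))
```

Fixing `random_state` makes later accuracy figures comparable between runs, since every run evaluates on the same held-out rows.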


In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()

# Convert the split data to lists of strings. Use the x_train / x_test
# returned by train_test_split, not the full df['text'] column: rebuilding
# both from df['text'] gives 44898 rows against 33673 labels, and LR.fit
# then raises "Found input variables with inconsistent numbers of samples".

x_train = [str(text) for text in x_train]
x_test = [str(text) for text in x_test]

In [90]:
# Fit the vocabulary on the training set only, then reuse it for the test set
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)

In [91]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(xv_train, y_train)
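
Once `LR.fit` succeeds, the `accuracy_score` and `classification_report` functions imported earlier can be used for the evaluation step on the held-out test set. A minimal sketch on a tiny made-up corpus; the texts, labels, and variable names below are hypothetical and not taken from True.csv or Fake.csv, so the scores it prints say nothing about the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical toy data: 1 = real, 0 = fake (same label convention as above)
train_texts = [
    "senate passes budget bill after long debate",
    "central bank holds interest rates steady",
    "court rules on state election law",
    "you will not believe what this senator just did",
    "shocking secret the media does not want you to see",
    "this one trick destroys the president instantly",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = [
    "bank raises interest rates after debate",
    "shocking trick the senator does not want you to see",
]
test_labels = [1, 0]

vectorizer = TfidfVectorizer()
xv_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on training data only
xv_test = vectorizer.transform(test_texts)        # reuse that vocabulary for the test data

model = LogisticRegression()
model.fit(xv_train, train_labels)
pred = model.predict(xv_test)

acc = accuracy_score(test_labels, pred)
report = classification_report(test_labels, pred, zero_division=0)
print("accuracy:", acc)
print(report)
```

In the notebook itself the equivalent calls would be `LR.predict(xv_test)` followed by `accuracy_score(y_test, pred)` and `classification_report(y_test, pred)`.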