# 📰 Fake News Generator and Detector using NLP and GPT-2

This project combines Natural Language Processing (NLP) and Machine Learning to generate and detect fake news.

- The **generator** uses GPT-2 to write a short fake news passage (up to 50 tokens by default) from a user-provided prompt.
- The **detector** uses TF-IDF and Logistic Regression to classify news as either **FAKE** or **REAL**.
- A Gradio interface allows interactive use of both components.

Built entirely in Python with `pandas`, `nltk`, `scikit-learn`, `transformers`, and `gradio`.

In [26]:
import pandas as pd
df=pd.read_csv("/content/fake_and_real_news_dataset.csv")

In [27]:
df.head()

| | idd | title | text | label |
|---|---|---|---|---|
| 0 | Fq+C96tcx+ | ‘A target on Roe v. Wade ’: Oklahoma bill maki... | UPDATE: Gov. Fallin vetoed the bill on Friday.... | REAL |
| 1 | bHUqK!pgmv | Study: women had to drive 4 times farther afte... | Ever since Texas laws closed about half of the... | REAL |
| 2 | 4Y4Ubf%aTi | Trump, Clinton clash in dueling DC speeches | Donald Trump and Hillary Clinton, now at the s... | REAL |
| 3 | _CoY89SJ@K | Grand jury in Texas indicts activists behind P... | A Houston grand jury investigating criminal al... | REAL |
| 4 | +rJHoRQVLe | As Reproductive Rights Hang In The Balance, De... | WASHINGTON -- Forty-three years after the Supr... | REAL |


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4594 entries, 0 to 4593
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   idd     4594 non-null   object
 1   title   4593 non-null   object
 2   text    4594 non-null   object
 3   label   4594 non-null   object
dtypes: object(4)
memory usage: 143.7+ KB


In [29]:
df.isnull().sum()

| column | missing values |
|---|---|
| idd | 0 |
| title | 1 |
| text | 0 |
| label | 0 |


In [30]:
df=df.fillna('')

In [31]:
df.isnull().sum()

| column | missing values |
|---|---|
| idd | 0 |
| title | 0 |
| text | 0 |
| label | 0 |


In [32]:
df.columns

Index(['idd', 'title', 'text', 'label'], dtype='object')

In [33]:
df=df.drop(['idd', 'title'], axis=1)
df.head()

| | text | label |
|---|---|---|
| 0 | UPDATE: Gov. Fallin vetoed the bill on Friday.... | REAL |
| 1 | Ever since Texas laws closed about half of the... | REAL |
| 2 | Donald Trump and Hillary Clinton, now at the s... | REAL |
| 3 | A Houston grand jury investigating criminal al... | REAL |
| 4 | WASHINGTON -- Forty-three years after the Supr... | REAL |


In [34]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
port_stem=PorterStemmer()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [35]:
stop_words = set(stopwords.words('english'))
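def stemming(content):
  """Clean raw text: keep letters only, lowercase, drop stop words, stem."""
  con=re.sub(r'[^a-zA-Z]', ' ', content)  # replace every non-letter with a space
  con=con.lower()
  con=con.split()
  # Stem each remaining token with the Porter stemmer, skipping English stop words.
  con=[port_stem.stem(word) for word in con if word not in stop_words]
  con=' '.join(con)
  return con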
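
To see what this preprocessing does to a single sentence, here is a minimal standalone sketch. It assumes a small hand-picked stop-word set (`STOP_WORDS`, hypothetical) so that it runs without the NLTK stop-word download; the notebook itself uses the full `stopwords.words('english')` list. The helper name `clean_and_stem` and the sample sentence are illustrative only.

```python
import re
from nltk.stem.porter import PorterStemmer

# Hand-picked stop words (an assumption for this sketch only);
# the notebook uses the full NLTK English stop-word list.
STOP_WORDS = {"the", "on", "a", "of", "in", "and", "has"}
stemmer = PorterStemmer()

def clean_and_stem(text):
    """Same steps as stemming(): letters only, lowercase, drop stop words, stem."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return " ".join(stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS)

sample = "UPDATE: Gov. Fallin vetoed the bill on Friday, May 20."
result = clean_and_stem(sample)
print(result)
```

The output is a lowercase, space-separated string of stems with no digits, punctuation, or stop words, which is the form the TF-IDF vectorizer receives later in the notebook.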

In [36]:
df['text'] = df['text'].apply(stemming)

In [37]:
x=df['text']
y=df['label']
y.shape

(4594,)

In [38]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
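
The split above is random on every run, so the reported scores will vary slightly between executions. A minimal sketch of a reproducible, class-balanced split follows; the seed value `42` and the toy `texts` / `labels` lists are assumptions standing in for the notebook's `x` and `y`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's x (text) and y (label) columns.
texts = [f"article {i}" for i in range(40)]
labels = ["FAKE"] * 20 + ["REAL"] * 20

x_train, x_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.25,
    random_state=42,  # assumed seed; any fixed integer makes the split repeatable
    stratify=labels,  # keep the FAKE/REAL ratio the same in both splits
)
print(Counter(y_train), Counter(y_test))
```

Fixing `random_state` would change the exact numbers reported below, since those came from an unseeded split.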


In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect=TfidfVectorizer()

In [40]:
x_train_vect = vect.fit_transform(x_train)
x_test_vect = vect.transform(x_test)
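
As a rough illustration of what `TfidfVectorizer` computes with its default settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows), here is a standard-library sketch. It simplifies tokenization to whitespace splitting, and the three toy documents are hypothetical; it is not the library's implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy documents written as already-stemmed tokens (hypothetical).
docs = ["govern veto bill", "govern confirm restrict", "bill pass"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s}{weight:.3f}")
```

In the first document, `veto` receives a higher weight than `govern` or `bill` because it appears in only one document, which is the effect that lets rare, distinctive terms drive the classifier.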

In [41]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train_vect, y_train)

In [42]:
y_pred = model.predict(x_test_vect)

In [43]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.9260226283724978

Classification Report:
               precision    recall  f1-score   support

        FAKE       0.91      0.95      0.93       592
        REAL       0.95      0.90      0.92       557

    accuracy                           0.93      1149
   macro avg       0.93      0.93      0.93      1149
weighted avg       0.93      0.93      0.93      1149


Confusion Matrix:
 [[563  29]
 [ 56 501]]
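
The figures in the classification report can be recovered directly from the confusion matrix. The sketch below does that arithmetic with plain Python, assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels, both in the order FAKE, REAL).

```python
# Confusion matrix copied from the output above.
# Rows = true label, columns = predicted label, order: FAKE, REAL.
cm = [[563, 29],
      [56, 501]]
labels = ["FAKE", "REAL"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actual = sum(cm[i])                                    # row sum (support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)

print(f"accuracy: {accuracy:.4f}")
for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values agree with the report: 56 REAL articles were labelled FAKE versus 29 FAKE articles labelled REAL, which is why REAL has the lower recall.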


In [44]:
def predict_news(news_text):
  stemmed = stemming(news_text)
  vect_text=vect.transform([stemmed])
  prediction = model.predict(vect_text)
  return prediction[0]

In [45]:
print(predict_news("Breaking: The government has confirmed new Covid-19 restrictions."))

FAKE


In [46]:
predict_news("In india in 2014 , bjp won the elections")

'FAKE'
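
Both sample sentences above come back as FAKE, and a bare label does not show how confident the model was. One way to inspect that is `predict_proba`. The sketch below is self-contained, so it trains a tiny pipeline on a hypothetical four-document corpus; the helper name `predict_with_confidence` is also an assumption. In the notebook, the equivalent call would be `model.predict_proba(vect.transform([stemming(text)]))`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus of already-stemmed text, only so the sketch runs on its own.
train_texts = [
    "alien secret shock cure doctor hate",
    "shock secret celebr clone reveal",
    "senat pass budget bill vote",
    "court rule appeal budget case",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Bundling vectorizer and classifier keeps fit/transform steps in sync.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

def predict_with_confidence(text):
    """Return the predicted label and the probability assigned to each class."""
    probs = clf.predict_proba([text])[0]
    by_class = {cls: float(p) for cls, p in zip(clf.classes_, probs)}
    return max(by_class, key=by_class.get), by_class

label, by_class = predict_with_confidence("senat vote budget")
print(label, {cls: round(p, 3) for cls, p in by_class.items()})
```

A probability close to 0.5 would suggest the input is too short or too far from the training vocabulary for the label to be trusted.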

In [47]:
# Generator: load the pretrained GPT-2 tokenizer and language model
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
generator_model = GPT2LMHeadModel.from_pretrained(model_name)
generator_model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [48]:
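def generate_fake_news(prompt, max_length=50, num_return_sequences=1):
  # Encode the prompt as a tensor of GPT-2 token IDs.
  input_ids = tokenizer.encode(prompt, return_tensors="pt")
  outputs = generator_model.generate(
      input_ids,
      max_length=max_length,  # total length in tokens, prompt included
      num_return_sequences=num_return_sequences,
      no_repeat_ngram_size=2,  # never repeat the same bigram
      do_sample=True,  # sample instead of greedy decoding
      top_k=50,
      top_p=0.9,
      temperature=0.7,
      pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse EOS
  )
  # Only the first sequence is decoded, even if num_return_sequences > 1.
  return tokenizer.decode(outputs[0], skip_special_tokens=True)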
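
The `temperature`, `top_k`, and `top_p` arguments shape which next token can be sampled at each step. The following standard-library sketch illustrates the idea on a toy distribution; it is a simplified illustration, not the `transformers` implementation, and the function name `filter_distribution` and the toy logits are hypothetical.

```python
import math
import random

def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token: prob} after temperature scaling, top-k, then top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the peak for numerical stability).
    peak = kept[0][1]
    exps = [(tok, math.exp(score - peak)) for tok, score in kept]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(p for _, p in nucleus)
    return {tok: p / mass for tok, p in nucleus}

toy_logits = {"police": 2.0, "officials": 1.5, "sources": 1.0, "the": 0.2, "aliens": -1.0}
dist = filter_distribution(toy_logits, temperature=0.7, top_k=4, top_p=0.9)
rng = random.Random(0)  # fixed seed so the draw is repeatable
next_token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(dist, next_token)
```

With these toy values, `aliens` is removed by the top-k cut and `the` by the top-p cut, so only the three most likely tokens remain candidates. Lower settings make the generated text more predictable; higher ones make it more varied.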

In [49]:
prompt = "Breaking News:"
print(generate_fake_news(prompt))

Breaking News:

An Ohio man was arrested in connection with a murder, according to a report from the Cleveland Plain Dealer.
.@clevelandreds are investigating. They are looking for a man who shot and killed a woman.


In [50]:
# Integrate the detector and the generator behind one Gradio interface
import gradio as gr
def gradio_pipeline(prompt):
  generated_text = generate_fake_news(prompt)
  prediction = predict_news(generated_text)
  return generated_text, prediction
interface = gr.Interface(
    fn = gradio_pipeline,
    inputs = gr.Textbox(lines=2, placeholder="Enter a prompt..."),
    outputs = [
        gr.Textbox(label="Generated Fake News"),
        gr.Textbox(label="Prediction (FAKE or REAL)")
    ],
    title = "Fake News Generator and Detector",
    description = "Enter a prompt to generate fake news and classify it as FAKE or REAL."
)
interface.launch(debug=True)

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://67f60c9f90b445e183.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://67f60c9f90b445e183.gradio.live


