# Fake Review Detection Using Machine Learning

## Problem Statement
Online review platforms play a critical role in influencing customer decisions. However, many reviews are intentionally fabricated to manipulate product ratings and consumer trust. These fake reviews reduce platform credibility and negatively affect both customers and businesses.

Manual detection of fake reviews is impractical due to the large volume of user-generated content. Therefore, automated detection methods using Natural Language Processing (NLP) and Machine Learning are required.

## Objective
The objective of this project is to build a supervised machine learning model that classifies reviews as **fake** or **genuine** based on their textual content.


## Dataset Description
The dataset contains 40,432 labeled product reviews with four columns: `category`, `rating`, `label`, and `text_`. The `label` column is binary: `CG` marks computer-generated (fake) reviews and `OR` marks original (genuine) reviews.


In [1]:
import pandas as pd
!pip install gradio


df = pd.read_csv("fake reviews dataset.csv")
df.head()
df.info()
df['label'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40432 entries, 0 to 40431
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   category  40432 non-null  object 
 1   rating    40432 non-null  float64
 2   label     40432 non-null  object 
 3   text_     40432 non-null  object 
dtypes: float64(1), object(3)
memory usage: 1.2+ MB


label  count
CG     20216
OR     20216


In [22]:
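import re

def clean_text(text):
    """Lowercase the text, strip URLs, and keep only letters and whitespace."""
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)    # drop URLs
    text = re.sub(r"[^a-z\s]", "", text)   # drop digits and punctuation
    return text

# Stored for inspection; the models below are trained on the raw `text_` column.
df["clean_review"] = df["text_"].apply(clean_text)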


The label counts show a perfectly balanced dataset: 20,216 reviews labeled `CG` and 20,216 labeled `OR`. Balance matters because an imbalanced dataset can bias a model toward the majority class and make accuracy a misleading metric.

## Text Preprocessing
Text preprocessing is necessary to reduce noise and standardize input data before feature extraction.


In [2]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

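def preprocess_text(text):
    """Lowercase, strip URLs and non-letters, drop stopwords, lemmatize."""
    text = str(text).lower()                      # str() guards against non-string input
    text = re.sub(r'http\S+|www\S+', '', text)    # remove URLs
    text = re.sub(r'[^a-z\s]', '', text)          # keep letters and whitespace only
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)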


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
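def predict_review(review):
    # The vectorizer is fit on the raw `text_` column, so the raw review is
    # passed here as well; applying preprocess_text only at inference would
    # create a train/inference mismatch. TfidfVectorizer lowercases and tokenizes itself.
    vect_review = tfidf.transform([str(review)])
    pred = model.predict(vect_review)[0]
    # Assumption: CG = computer-generated (fake), OR = original (genuine).
    return "Fake Review" if pred == "CG" else "Genuine Review"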

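Lowercasing reduces vocabulary size by merging case variants. Removing punctuation and digits discards tokens that carry little lexical meaning, and removing stopwords drops very frequent words that rarely help separate the classes. Lemmatization maps inflected forms to a common dictionary form (for example, "bowls" to "bowl"), which normalizes word forms while keeping the word's meaning. Note that the models below are fit on the raw `text_` column, where `TfidfVectorizer` applies its own lowercasing and tokenization.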
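To make the effect of each step concrete, here is a minimal standard-library sketch. It mirrors the regex steps of `preprocess_text` but swaps NLTK's stopword list for a tiny hand-picked set (`TOY_STOPWORDS`, an assumption for illustration only) and skips lemmatization, so it runs without any downloads. The sample sentence is made up.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (illustration only).
TOY_STOPWORDS = {"i", "the", "a", "is", "it", "and", "at", "this"}

def toy_preprocess(text):
    """Lowercase, strip URLs and non-letters, then drop toy stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)          # keep letters and whitespace only
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

sample = "I LOVE this blender!!! Bought it at http://example.com for $25."
print(toy_preprocess(sample))   # prints: love blender bought for
```

The price and the exclamation marks disappear along with the URL, which shows how much surface signal this kind of cleaning discards.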


## Train–Test Split
The dataset is split 80/20 into training and test sets with a fixed random seed, leaving 32,345 reviews for training and 8,087 held out to evaluate how well the model generalizes.


In [23]:
from sklearn.model_selection import train_test_split

X = df['text_']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
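The split above is random but not stratified, which is why the test set later shows 4,016 `CG` and 4,071 `OR` reviews rather than an exact 50/50 split. One possible variation, sketched here on made-up data rather than the real DataFrame, passes `stratify` to `train_test_split` so both splits keep the same class ratio. The names `texts` and `labels` are placeholders.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for df['text_'] and df['label'] (made-up data).
texts = [f"review {i}" for i in range(20)]
labels = ["CG"] * 10 + ["OR"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Each split keeps the 50/50 class ratio of the full toy dataset.
print(Counter(y_tr), Counter(y_te))
```

With an already balanced dataset of this size the difference is small, so this is a refinement rather than a correction.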

## Feature Extraction
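TF-IDF converts each review into a numeric vector: a term's weight grows with how often it appears in the review and shrinks with how many reviews contain it. The vectorizer below keeps the 5,000 most frequent unigrams and bigrams and is fit on the training split only, so no test-set vocabulary leaks into training.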


In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))  # include bigrams
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

X_train_tfidf.shape

(32345, 5000)
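To show where the TF-IDF weights come from, the following sketch computes them by hand on a made-up three-document corpus. It uses raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which scikit-learn documents as the `TfidfVectorizer` defaults. Tokenization is simplified to whitespace splitting, so the numbers illustrate the weighting scheme rather than reproduce the notebook's features.

```python
import math
from collections import Counter

docs = [
    "great product great price",
    "terrible product",
    "great service",
]  # made-up toy corpus

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Raw term count times smoothed IDF, then L2-normalised."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", {t: round(w, 3) for t, w in tfidf_vector(tokens).items()})
```

In the second document, "terrible" occurs in one document and "product" in two, so "terrible" receives the larger weight even though both appear once.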

## Logistic Regression Model
Logistic Regression is a strong baseline model for binary text classification due to its simplicity and efficiency.

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

y_train = y_train.values if hasattr(y_train, "values") else y_train
y_test = y_test.values if hasattr(y_test, "values") else y_test

lr = LogisticRegression(
    max_iter=1000,
    solver="liblinear",
    random_state=42
)

params = {
    "C": [0.1, 1, 10]
}

grid = GridSearchCV(
    estimator=lr,
    param_grid=params,
    cv=3,
    scoring="accuracy",
    n_jobs=-1
)

grid.fit(X_train_tfidf, y_train)
best_model = grid.best_estimator_
print("Best params:", grid.best_params_)
y_pred = best_model.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)


Best params: {'C': 10}
              precision    recall  f1-score   support

          CG       0.93      0.93      0.93      4016
          OR       0.93      0.93      0.93      4071

    accuracy                           0.93      8087
   macro avg       0.93      0.93      0.93      8087
weighted avg       0.93      0.93      0.93      8087

Confusion Matrix:
 [[3741  275]
 [ 295 3776]]
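The figures in the classification report can be recomputed from the confusion matrix above, which helps when reading the two outputs together. This sketch assumes scikit-learn's convention of rows for true labels and columns for predicted labels, in sorted label order (`CG`, then `OR`). The row sums match the reported supports of 4,016 and 4,071.

```python
# Confusion matrix reported above: rows = true label, columns = predicted,
# in the order [CG, OR].
cm = [[3741, 275],
      [295, 3776]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision_recall(cm, k):
    """Precision and recall for class index k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum
    actual_k = sum(cm[k])                     # row sum
    return tp / predicted_k, tp / actual_k

for k, name in enumerate(["CG", "OR"]):
    p, r = precision_recall(cm, k)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Of the 570 errors, 275 are `CG` reviews predicted as `OR` and 295 are `OR` reviews predicted as `CG`, so the model's mistakes are close to symmetric.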


In [17]:
import joblib

joblib.dump(best_model, "fake_review_lr.pkl")
joblib.dump(tfidf, "tfidf_vectorizer.pkl")
model = joblib.load("fake_review_lr.pkl")
tfidf = joblib.load("tfidf_vectorizer.pkl")
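Saving the model and the vectorizer as two files works, but the two must always be loaded and applied together. An alternative worth considering, sketched below on made-up data, wraps both steps in a scikit-learn `Pipeline` so that a single artifact carries the whole text-to-prediction path. The file name `toy_review_pipeline.pkl` and the toy reviews are hypothetical, and the toy labels are arbitrary.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for X_train / y_train.
toy_texts = ["love it works great", "great great great buy buy",
             "arrived late but works", "buy buy love love great"]
toy_labels = ["OR", "CG", "OR", "CG"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(solver="liblinear", random_state=42)),
])
pipe.fit(toy_texts, toy_labels)

joblib.dump(pipe, "toy_review_pipeline.pkl")   # hypothetical file name
restored = joblib.load("toy_review_pipeline.pkl")

# The restored pipeline accepts raw strings directly.
print(restored.predict(["works great"]))
```

Because the pipeline vectorizes internally, the prediction code cannot accidentally apply different preprocessing from the one used in training.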

## Naive Bayes Model
Naive Bayes is commonly used in text classification due to its probabilistic nature and efficiency with high-dimensional data.

In [26]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)

y_pred_nb = nb.predict(X_test_tfidf)

print(classification_report(y_test, y_pred_nb))
confusion_matrix(y_test, y_pred_nb)

              precision    recall  f1-score   support

          CG       0.89      0.88      0.89      4016
          OR       0.89      0.89      0.89      4071

    accuracy                           0.89      8087
   macro avg       0.89      0.89      0.89      8087
weighted avg       0.89      0.89      0.89      8087



array([[3551,  465],
       [ 448, 3623]])

## Model Comparison
Both models were evaluated using Accuracy, Precision, Recall, and F1-score.

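Logistic Regression generally performs better when the classes are close to linearly separable in feature space, while Naive Bayes relies on the assumption that features are conditionally independent given the class, an assumption that overlapping unigrams and bigrams violate.

On the held-out test set, Logistic Regression reached 0.93 accuracy and F1-score versus 0.89 for Naive Bayes, so Logistic Regression was selected as the final model.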
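The two modeling assumptions contrasted above can be written out. This is the standard textbook form, not something taken from the notebook, and the symbols are introduced here for illustration: $x_i$ is the TF-IDF weight of feature $i$, $\theta_{y,i}$ is the per-class feature probability estimated by Multinomial Naive Bayes, and $w$, $b$ are the Logistic Regression weights and bias.

$$
\hat{y}_{\text{NB}} = \arg\max_{y \in \{\text{CG},\, \text{OR}\}} \Big[ \log P(y) + \sum_{i} x_i \log \theta_{y,i} \Big]
$$

$$
P(y = \text{OR} \mid x) = \sigma\big(w^\top x + b\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Naive Bayes adds up per-feature contributions that are estimated separately for each class, whereas Logistic Regression fits all weights jointly, which lets it down-weight redundant features.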


In [27]:
%%writefile app.py
import streamlit as st
import joblib
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Load saved model & vectorizer
model = joblib.load("fake_review_lr.pkl")
tfidf = joblib.load("tfidf_vectorizer.pkl")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Streamlit Interface
st.title("Fake Review Detector")

review = st.text_area("Enter a review:")

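if st.button("Predict"):
    # The vectorizer was fit on raw review text, so the raw input is passed
    # here too; preprocessing only at inference would mismatch training.
    vect_review = tfidf.transform([review])
    prediction = model.predict(vect_review)[0]
    # Assumption: CG = computer-generated (fake), OR = original (genuine).
    st.success("Fake Review" if prediction == "CG" else "Genuine Review")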


Overwriting app.py


## Error Analysis
Misclassified reviews were analyzed manually. Some fake reviews resemble genuine opinions, while some real reviews contain exaggerated language, confusing the classifier.

In [34]:
# Boolean mask of misclassified samples
mask = y_test != y_pred

# Extract the misclassified reviews
error_reviews = X_test[mask].reset_index(drop=True)

true_labels = y_test[mask]
pred_labels = y_pred[mask]

# Safe number of examples to print
n = min(5, len(error_reviews))

for i in range(n):
    print("Review:")
    print(error_reviews.iloc[i])
    print("True Label:", true_labels[i])
    print("Predicted Label:", pred_labels[i])
    print("-" * 80)

print(type(X_test))
print(type(y_test))
print(type(y_pred))


Review:
This is a strong production filled with beautiful and dramatic singing, and Wixell's playing a dual role tightens the story. Some artistic lapses in the directing are made up for by the exceptional final act.
True Label: OR
Predicted Label: CG
--------------------------------------------------------------------------------
Review:
Our 3 & 1/2 year old have been using them for a month now and have been using them to make homemade Biscuit Chicken, Biscuit Rolls, and other kinds of sandwiches. They are great for baking, which is always a challenge. They are very sturdy and the rubber seal is great for holding the eggs and nuts. I am very happy with this purchase. I would buy them again. I am very pleased with this purchase. I love these bowls! They are so easy to clean, no need for a sponge for it to stay dry. I use them to pour coffee beans or tea into bowls, it doesn't matter if you have a hot water pitcher or a hot water pitcher. I love the color, the color is not as bright as 

## Discussion
The models performed well on structured textual data. However, they struggle with sarcasm, short reviews, and ambiguous language. Dataset size and label quality also affect performance.
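One way to check a claim such as "short reviews are harder" is to bucket test reviews by length and compare error rates per bucket. The sketch below does this in plain Python. The `(review, true_label, predicted_label)` triples are invented for illustration and say nothing about the actual model, and the three-word threshold in `length_bucket` is an arbitrary assumption.

```python
from collections import defaultdict

# Made-up (review, true_label, predicted_label) triples for illustration.
samples = [
    ("Great!", "OR", "CG"),
    ("Works.", "OR", "OR"),
    ("Nice", "CG", "OR"),
    ("The seal holds well and the lids fit after a month of use", "OR", "OR"),
    ("I love these bowls and I would buy them again for my family", "CG", "CG"),
    ("Sturdy pan that heats evenly but the handle gets hot quickly", "OR", "CG"),
]

def length_bucket(text, short_max=3):
    """Label a review 'short' if it has at most short_max words, else 'long'."""
    return "short" if len(text.split()) <= short_max else "long"

stats = defaultdict(lambda: [0, 0])   # bucket -> [errors, total]
for text, true, pred in samples:
    bucket = length_bucket(text)
    stats[bucket][0] += int(true != pred)
    stats[bucket][1] += 1

for bucket, (errors, total) in sorted(stats.items()):
    print(f"{bucket}: {errors}/{total} misclassified")
```

On the real data, the same loop could run over `X_test`, `y_test`, and `y_pred` from the cells above.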

## Conclusion
This project demonstrated how NLP and machine learning can be used to detect fake reviews. Logistic Regression on TF-IDF features provided strong baseline performance (0.93 accuracy on the held-out test set) with interpretable results, ahead of Naive Bayes at 0.89.

## Future Work
Future improvements may include:
- Deep learning models (LSTM, BERT)
- Larger and multilingual datasets
- Incorporating reviewer behavior features


In [35]:
import gradio as gr

iface = gr.Interface(
    fn=predict_review,       # Function to call
    inputs=gr.Textbox(label="Enter a review"),  # Input type
    outputs=gr.Textbox(label="Prediction"),    # Output type
    title="Fake Review Detector",
    description="Enter a review and the model will classify it as Fake or Genuine."
)

iface.launch()




