In [5]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/deceptive-opinion-spam-corpus/deceptive-opinion.csv


## Load the Dataset (Deceptive Opinion Spam Corpus)

In this step, we load the Deceptive Opinion Spam dataset, which Kaggle provides as a CSV file.  
Each review carries one of two labels:

- **deceptive**: fake reviews  
- **truthful**: real reviews  

We load the CSV, print its shape and column names, and preview the first few rows.


In [7]:
import pandas as pd

df = pd.read_csv("/kaggle/input/deceptive-opinion-spam-corpus/deceptive-opinion.csv")

print("Dataset Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

df.head()


Dataset Shape: (1600, 5)

Columns: ['deceptive', 'hotel', 'polarity', 'source', 'text']


| | deceptive | hotel | polarity | source | text |
|---|---|---|---|---|---|
| 0 | truthful | conrad | positive | TripAdvisor | We stayed for a one night getaway with family ... |
| 1 | truthful | hyatt | positive | TripAdvisor | Triple A rate with upgrade to view room was le... |
| 2 | truthful | hyatt | positive | TripAdvisor | This comes a little late as I'm finally catchi... |
| 3 | truthful | omni | positive | TripAdvisor | The Omni Chicago really delivers on all fronts... |
| 4 | truthful | hyatt | positive | TripAdvisor | I asked for a high floor away from the elevato... |
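
The preview alone does not reveal whether any cells are empty or whether rows are repeated. A minimal sketch of such a check follows. It runs on a tiny in-memory stand-in for the CSV (the three rows are made up for illustration), and the same two calls could be applied to `df` in the notebook.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for deceptive-opinion.csv (illustrative data only)
csv_text = """deceptive,hotel,polarity,source,text
truthful,conrad,positive,TripAdvisor,Great stay
deceptive,hyatt,negative,MTurk,
truthful,conrad,positive,TripAdvisor,Great stay
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column (an empty CSV field is read as NaN)
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count exact duplicate rows (every column identical to an earlier row)
n_duplicates = int(sample.duplicated().sum())
print("Duplicate rows:", n_duplicates)
```

If either count is non-zero on the real data, the affected rows would need to be dropped or inspected before the text is cleaned.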


## Clean and Prepare the Dataset

The original dataset also contains the hotel name, review polarity, and review 
source. For fake review detection, we only need:

- `text`  → the review content  
- `label` → deceptive (fake) or truthful (real), taken from the `deceptive` column

We will:
1. Extract the `text` and `deceptive` columns  
2. Rename `deceptive` to `label`  
3. Convert labels to binary (1 = deceptive, 0 = truthful)  
4. Verify the class distribution  


In [8]:
import pandas as pd

df = pd.read_csv("/kaggle/input/deceptive-opinion-spam-corpus/deceptive-opinion.csv")

# Keep only necessary columns
df = df[['text', 'deceptive']].copy()

# Rename label column
df.rename(columns={'deceptive': 'label'}, inplace=True)

# Convert labels to binary
df['label'] = df['label'].map({'deceptive': 1, 'truthful': 0})

print("Cleaned Dataset Shape:", df.shape)
print("\nClass Distribution:")
print(df['label'].value_counts())

print("\nPreview:")
df.head()


Cleaned Dataset Shape: (1600, 2)

Class Distribution:
label
0    800
1    800
Name: count, dtype: int64

Preview:


| | text | label |
|---|---|---|
| 0 | We stayed for a one night getaway with family ... | 0 |
| 1 | Triple A rate with upgrade to view room was le... | 0 |
| 2 | This comes a little late as I'm finally catchi... | 0 |
| 3 | The Omni Chicago really delivers on all fronts... | 0 |
| 4 | I asked for a high floor away from the elevato... | 0 |
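
`Series.map` returns `NaN` for any value that is missing from the mapping dictionary, so an unexpected label spelling would silently become a missing label. In this notebook the two class counts sum to 1600, so no label was lost. The sketch below shows how such a guard could look on made-up label values. The names `labels`, `label_map`, and `unmapped` are illustrative.

```python
import pandas as pd

# Illustrative labels; "Deceptive" (capitalised) is a deliberate mismatch
labels = pd.Series(["deceptive", "truthful", "Deceptive"])
label_map = {"deceptive": 1, "truthful": 0}

mapped = labels.map(label_map)

# Values not present in label_map become NaN after map()
unmapped = labels[mapped.isna()].unique().tolist()
print("Unmapped labels:", unmapped)

# Normalising case and whitespace before mapping resolves this particular mismatch
mapped_clean = labels.str.strip().str.lower().map(label_map)
assert mapped_clean.notna().all()
print(mapped_clean.astype(int).tolist())
```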


## Text Cleaning & Preprocessing

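To prepare the reviews for machine learning, we apply standard NLP preprocessing:

### Cleaning Steps:
- Convert text to lowercase  
- Remove URLs  
- Remove punctuation and numbers (every character other than `a`-`z` and whitespace becomes a space)  
- Remove stopwords (e.g., "the", "and", "is")  
- Apply lemmatization (reduce words to their dictionary form)  

`WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so plural nouns are reduced while most verb forms, such as "stayed" and "asked", are left unchanged, as the sample output shows.

This reduces noise and vocabulary size, which often helps a bag-of-words model, although the effect on accuracy is not measured in this notebook.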


In [9]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK resources (skipped automatically if already present)
nltk.download('stopwords')
nltk.download('punkt')    # not used below: clean_text tokenizes with str.split()
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Lowercase
    text = text.lower()
    
    # Remove URLs (http\S+ already covers https)
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Replace punctuation, digits, and any other non-letter character with a space
    text = re.sub(r'[^a-z\s]', ' ', text)
    
    # Tokenize on whitespace
    tokens = text.split()
    
    # Remove stopwords, then lemmatize (default part of speech: noun)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    
    return " ".join(tokens)

# Apply cleaning
df['clean_text'] = df['text'].apply(clean_text)

print("Sample cleaned text:\n")
df[['text', 'clean_text']].head()


[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Sample cleaned text:



| | text | clean_text |
|---|---|---|
| 0 | We stayed for a one night getaway with family ... | stayed one night getaway family thursday tripl... |
| 1 | Triple A rate with upgrade to view room was le... | triple rate upgrade view room less also includ... |
| 2 | This comes a little late as I'm finally catchi... | come little late finally catching review past ... |
| 3 | The Omni Chicago really delivers on all fronts... | omni chicago really delivers front spaciousnes... |
| 4 | I asked for a high floor away from the elevato... | asked high floor away elevator got room pleasa... |
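
To see what each cleaning step does to a single review, the sketch below traces the intermediate results. It is a simplified stand-in for `clean_text`, not the notebook's function: the stopword set is a tiny hand-picked sample rather than NLTK's English list, lemmatization is omitted so the example has no NLTK dependency, and the name `clean_text_steps` is hypothetical.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
STOP_WORDS = {"i", "m", "at", "the", "was", "and"}

def clean_text_steps(text):
    """Return the intermediate result of each cleaning step (lemmatization omitted)."""
    lowered = text.lower()
    no_urls = re.sub(r"http\S+|www\S+", "", lowered)
    letters_only = re.sub(r"[^a-z\s]", " ", no_urls)
    tokens = letters_only.split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return {
        "lowered": lowered,
        "no_urls": no_urls,
        "tokens": tokens,
        "cleaned": " ".join(kept),
    }

steps = clean_text_steps("I'm at http://example.com and Room 305 was GREAT!")
for name, value in steps.items():
    print(f"{name:>8}: {value}")
```

Because the apostrophe is replaced by a space, the contraction "I'm" is split into the tokens `i` and `m`, which are then removed as stopwords.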


## Train/Test Split & TF-IDF Vectorization

Now that the dataset is cleaned, we convert the text into numerical features using 
**TF-IDF (Term Frequency – Inverse Document Frequency)**.

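Steps performed:
1. Split the dataset into training and test sets (80% train, 20% test), stratified by label.
2. Fit the TF-IDF vectorizer on the training split only, then transform both splits, so that no test-set vocabulary or document frequencies leak into training.
3. Limit the vocabulary to 5,000 unigrams and bigrams to keep the feature space small and speed up training.
4. Keep the fitted vectorizer for reuse at prediction time (it is saved to disk in the final step).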


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'], 
    df['label'], 
    test_size=0.2, 
    random_state=42,
    stratify=df['label']
)

# TF-IDF Vectorizer
tfidf = TfidfVectorizer(
    max_features=5000,   # keep the 5,000 most frequent terms
    ngram_range=(1,2),   # unigrams + bigrams
    min_df=2             # drop terms found in fewer than 2 training documents
)

# Fit on training data and transform both
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print("TF-IDF train shape:", X_train_tfidf.shape)
print("TF-IDF test shape:", X_test_tfidf.shape)

X_train_tfidf[:5]

TF-IDF train shape: (1280, 5000)
TF-IDF test shape: (320, 5000)


<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 702 stored elements and shape (5, 5000)>
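
To make the weighting concrete, the sketch below computes TF-IDF by hand on a toy corpus. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts) and simplifies the rest: whitespace tokenization, unigrams only, and no `min_df` or `max_features` filtering. The corpus and the helper name `tfidf_vector` are illustrative.

```python
import math
from collections import Counter

# Toy training corpus (illustrative); whitespace tokenization, unigrams only
train_docs = [
    "room clean staff friendly",
    "room dirty staff rude",
    "great location great view",
]
tokenized = [doc.split() for doc in train_docs]
n_docs = len(tokenized)

# Document frequency: number of training documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(tokens):
    """Raw term count times IDF, then L2-normalised; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# "pool" never appeared in training, so it contributes nothing to the vector
vec = tfidf_vector("room clean pool".split())
print(vec)
```

The rarer term ("clean", found in one training document) receives a larger weight than the more common one ("room", found in two), and a term unseen during fitting is ignored at transform time.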

## Model Training & Evaluation

In this step, we train three machine learning models:

- **Logistic Regression**
- **Support Vector Machine (SVM)** with a linear kernel (`LinearSVC`)
- **XGBoost Classifier**

For each model, we compute the following on the held-out test set, treating `deceptive` (label 1) as the positive class:

- Accuracy
- Precision
- Recall
- F1-score

This lets us compare the models on the same 320 test reviews.
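
All four metrics reduce to ratios of confusion-matrix counts. The sketch below works through them with counts inferred from the Logistic Regression scores this notebook reports, assuming the stratified test split holds 160 reviews per class. The counts are a reconstruction, not a printed output of the notebook.

```python
# Confusion-matrix counts for the positive class (1 = deceptive).
# Inferred from the reported Logistic Regression scores, assuming 160 test reviews per class.
tp, fp, fn, tn = 142, 22, 18, 138

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all reviews classified correctly
precision = tp / (tp + fp)                   # share of predicted-deceptive that are deceptive
recall = tp / (tp + fn)                      # share of deceptive reviews that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")
```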


In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings("ignore")  # note: this also hides convergence warnings

# ------------------
# Logistic Regression
# ------------------
lr = LogisticRegression(max_iter=2000)
lr.fit(X_train_tfidf, y_train)
pred_lr = lr.predict(X_test_tfidf)

# ------------------
# SVM (Linear kernel)
# ------------------
svm = LinearSVC()
svm.fit(X_train_tfidf, y_train)
pred_svm = svm.predict(X_test_tfidf)

# ---------------
# XGBoost
# ---------------
from xgboost import XGBClassifier

xgb = XGBClassifier(
    eval_metric='logloss',
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.7,
    colsample_bytree=0.7
)

xgb.fit(X_train_tfidf, y_train)
pred_xgb = xgb.predict(X_test_tfidf)

# ----------------------
# Function to evaluate
# ----------------------
def evaluate_model(name, y_true, y_pred):
    print(f"\n{name} Performance:")
    print("-" * 40)
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))

# Print results
evaluate_model("Logistic Regression", y_test, pred_lr)
evaluate_model("SVM", y_test, pred_svm)
evaluate_model("XGBoost", y_test, pred_xgb)



Logistic Regression Performance:
----------------------------------------
Accuracy : 0.875
Precision: 0.8658536585365854
Recall   : 0.8875
F1-score : 0.8765432098765432

SVM Performance:
----------------------------------------
Accuracy : 0.878125
Precision: 0.8711656441717791
Recall   : 0.8875
F1-score : 0.8792569659442724

XGBoost Performance:
----------------------------------------
Accuracy : 0.825
Precision: 0.7988505747126436
Recall   : 0.86875
F1-score : 0.8323353293413174
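
A single 80/20 split gives one estimate per model, so small differences between models can be due to chance. One common way to get a steadier comparison is stratified k-fold cross-validation with the vectorizer placed inside a pipeline. The sketch below is one possible setup, not the procedure used in this notebook: the corpus is synthetic, and the fold count and the choice of F1 as the scoring metric are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny synthetic corpus (illustrative only); in the notebook this would be
# df['clean_text'] and df['label']
texts = ["amazing luxury stay wonderful husband"] * 10 + ["room small elevator noisy floor"] * 10
labels = [1] * 10 + [0] * 10

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Linear SVM": LinearSVC(),
}

cv_scores = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refit on each training fold
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    cv_scores[name] = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {cv_scores[name].mean():.3f} (std {cv_scores[name].std():.3f})")
```

Comparing the mean and spread across folds gives a better sense of whether one model is reliably ahead of another.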


## Save Final Model and TF-IDF Vectorizer

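SVM scored marginally higher than Logistic Regression on the test set (accuracy 0.878 vs. 0.875, a difference of one review out of 320), and both outperformed XGBoost (0.825). The gap between the two linear models is too small to be conclusive on a single split, but we proceed with the SVM and save:

- The trained SVM model  
- The fitted TF-IDF vectorizer  

Both are saved using pickle so they can be loaded inside a Flask API for real-time
fake review detection. At prediction time, incoming reviews must pass through the same `clean_text` function before being vectorized.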

In [12]:
import pickle

# Save SVM model
with open("final_svm_model.pkl", "wb") as f:
    pickle.dump(svm, f)

# Save TF-IDF vectorizer
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)

print("Model and vectorizer saved successfully!")


Model and vectorizer saved successfully!
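
For completeness, here is a hedged sketch of the loading side. To stay self-contained it trains stand-in artifacts on a toy corpus and writes them to a temporary directory. In a real service the two files saved above would be opened instead, and the input would first go through `clean_text`. The helper name `predict_label` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Stand-in artifacts trained on a toy corpus (illustrative only)
train_texts = [
    "amazing luxury stay wonderful",
    "room small elevator noisy",
    "wonderful luxury experience amazing",
    "noisy floor small bathroom",
]
train_labels = [1, 0, 1, 0]
vectorizer = TfidfVectorizer().fit(train_texts)
model = LinearSVC().fit(vectorizer.transform(train_texts), train_labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "final_svm_model.pkl"
    vec_path = Path(tmp) / "tfidf_vectorizer.pkl"
    model_path.write_bytes(pickle.dumps(model))
    vec_path.write_bytes(pickle.dumps(vectorizer))

    # Inference side: load both artifacts from disk
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_vec = pickle.loads(vec_path.read_bytes())

def predict_label(review_text):
    """Return 1 (deceptive) or 0 (truthful) for text that has already been cleaned."""
    # Use transform, never fit_transform, so the training vocabulary is preserved
    features = loaded_vec.transform([review_text])
    return int(loaded_model.predict(features)[0])

print(predict_label("amazing wonderful luxury"))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and they are best loaded with the same scikit-learn version that produced them.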
