Detect whether a customer review is Fake (CG) or Genuine (OR) to help e-commerce companies prevent losses due to fake reviews.

In [1]:
import pandas as pd 
import numpy as np

# EDA

In [2]:
df = pd.read_csv("fake reviews dataset.csv")
df.head()

,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40432 entries, 0 to 40431
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   category  40432 non-null  object 
 1   rating    40432 non-null  float64
 2   label     40432 non-null  object 
 3   text_     40432 non-null  object 
dtypes: float64(1), object(3)
memory usage: 1.2+ MB


In [4]:
df.value_counts("label")

label
CG    20216
OR    20216
Name: count, dtype: int64

In [5]:
df.value_counts("category")

category
Kindle_Store_5                  4730
Books_5                         4370
Pet_Supplies_5                  4254
Home_and_Kitchen_5              4056
Electronics_5                   3988
Sports_and_Outdoors_5           3946
Tools_and_Home_Improvement_5    3858
Clothing_Shoes_and_Jewelry_5    3848
Toys_and_Games_5                3794
Movies_and_TV_5                 3588
Name: count, dtype: int64

In [6]:
df.shape

(40432, 4)

# Data Cleaning

# Lowercase

In [7]:
df['text_'] = df['text_'].str.lower()

Converts all text to lowercase.

In [8]:
df.head()

,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"love this! well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. i..."
2,Home_and_Kitchen_5,5.0,CG,this pillow saved my back. i love the look and...
3,Home_and_Kitchen_5,1.0,CG,"missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,very nice set. good quality. we have had the s...


# Remove punctuation marks


In [9]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
pun = string.punctuation

Removes characters like . , ! ?

In [11]:
def remove_punc(text):
    for char in pun:
        text = text.replace(char,'')
    return text
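
The loop above calls `str.replace` once per punctuation character. As a hedged alternative, the same deletion can be done in a single pass with `str.translate`. The names `PUNCT_TABLE` and `remove_punc_translate` are illustrative and not part of the original notebook. Either approach deletes punctuation without inserting a space, so words separated only by punctuation get joined.

```python
import string

# Translation table that maps every punctuation character to None (deleted).
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punc_translate(text):
    # One pass over the string instead of one pass per punctuation character.
    return text.translate(PUNCT_TABLE)

sample = 'string. with . punctuation ?'
cleaned = remove_punc_translate(sample)
print(repr(cleaned))

# Words separated only by punctuation are merged, e.g. "it!Very" -> "itVery".
print(remove_punc_translate("it!Very"))
```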

In [12]:
text = 'string. with . punctuation ?'

In [13]:
df['text_']= df['text_'].apply(remove_punc)

In [14]:
df['text_']

0        love this  well made sturdy and very comfortab...
1        love it a great upgrade from the original  ive...
2        this pillow saved my back i love the look and ...
3        missing information on how to use it but it is...
4        very nice set good quality we have had the set...
                               ...                        
40427    i had read some reviews saying that this bra r...
40428    i wasnt sure exactly what it would be it is a ...
40429    you can wear the hood by itself wear it with t...
40430    i liked nothing about this dress the only reas...
40431    i work in the wedding industry and have to wor...
Name: text_, Length: 40432, dtype: object

# Remove stopwords (a, the, of, are, my)

In [15]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [16]:
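def remove_stopwords(text):
    # Replace each stop word with a single space and keep every other word.
    # This leaves runs of spaces behind; the whitespace split used for
    # word_count and the TF-IDF tokenizer below both ignore them.
    words = [' ' if word in ENGLISH_STOP_WORDS else word
             for word in text.split(' ')]
    return " ".join(words)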

Removes common words like a, the, of that do not add meaning.
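
The function above substitutes a space for every stop word, which is why the cleaned text contains long runs of spaces. A minimal sketch of a variant that drops stop words and collapses the whitespace at the same time is shown here. The name `remove_stopwords_compact` is hypothetical; the stop-word list is the same `ENGLISH_STOP_WORDS` imported above.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stopwords_compact(text):
    # split() with no argument also discards the empty tokens that
    # repeated spaces would otherwise produce.
    kept = [word for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return " ".join(kept)

sample = "this pillow saved my back i love the look and"
compact = remove_stopwords_compact(sample)
print(compact)
```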

In [17]:
df['text_'] = df['text_'].apply(remove_stopwords)

In [18]:
df['text_']

0        love        sturdy     comfortable    love itv...
1        love     great upgrade     original  ive      ...
2          pillow saved       love   look   feel     pi...
3        missing information       use           great ...
4          nice set good quality         set     months...
                               ...                        
40427        read   reviews saying     bra ran small   ...
40428      wasnt sure exactly               little larg...
40429        wear   hood     wear       hood   wear jus...
40430      liked       dress     reason   gave   4 star...
40431      work     wedding industry       work long da...
Name: text_, Length: 40432, dtype: object

# Feature Engineering

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Label encoding and TF-IDF vectorization

In [20]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['label_encoded'] = le.fit_transform(df['label'])
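
`LabelEncoder` assigns integer codes to the sorted class names, so here CG becomes 0 and OR becomes 1. The short check below is a sketch on a toy label list (`le_demo` and `mapping` are illustrative names) showing how the mapping can be inspected rather than assumed.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['label']; the order of appearance does not matter.
le_demo = LabelEncoder()
le_demo.fit(["OR", "CG", "OR", "CG"])

# classes_ holds the sorted class names; their position is the encoded value.
mapping = {str(label): int(code)
           for label, code in zip(le_demo.classes_,
                                  le_demo.transform(le_demo.classes_))}
print(mapping)
```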

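ML models cannot work with raw text directly, so the reviews must be converted to numbers.  
TF-IDF gives higher weight to words that are distinctive for a review and lower weight to words that appear in most reviews.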

In [21]:
tfidf = TfidfVectorizer( max_features=5000, stop_words='english')
X = tfidf.fit_transform(df['text_'])
y = df['label_encoded']

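Converts the text into numerical TF-IDF vectors.  
max_features=5000 keeps only the 5,000 most frequent terms in the corpus as the vocabulary.  
stop_words='english' applies scikit-learn's built-in English stop-word list, the same list used in the manual stop-word step above.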
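
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on three toy documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the documents and helper names are made up for illustration.

```python
import math
from collections import Counter

# Three tiny tokenized "reviews" (hypothetical data).
docs = [
    ["great", "pillow"],
    ["great", "book"],
    ["great", "great", "cable"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
       for term, df in doc_freq.items()}

def tfidf_vector(doc):
    # Raw term count times IDF, then scaled to unit Euclidean length.
    counts = Counter(doc)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec0 = tfidf_vector(docs[0])
print(idf)
print(vec0)
```

"great" occurs in every document, so its IDF is the minimum value of 1.0, and in the first document the rarer word "pillow" ends up with the larger weight.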

# Word count

In [22]:
df['word_count'] = df['text_'].apply(lambda x:len(str(x).split()))
df['word_count']

0          6
1          7
2          6
3          6
4          6
        ... 
40427    134
40428    102
40429    190
40430    114
40431    142
Name: word_count, Length: 40432, dtype: int64

# Sentiment analysis

Sentiment can help identify tone differences between fake and genuine reviews.

In [23]:
from textblob import TextBlob

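Calculates the polarity (-1 to 1) of each review. Neither word_count nor sentiment is added to X, so the models below are trained on the TF-IDF features only.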

In [24]:
df['sentiment'] = df['text_'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)

# Train/test split

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
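
In the cells above the vectorizer is fitted on all reviews before the split, so its vocabulary and IDF weights also reflect the test rows. A minimal sketch of an alternative, assuming you prefer to fit TF-IDF on the training rows only, is to split the raw text first and wrap both steps in a scikit-learn pipeline. The toy texts and labels are hypothetical stand-ins for `df['text_']` and `df['label_encoded']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical data: 0 = fake (CG), 1 = genuine (OR).
texts = [
    "love this product great quality",
    "battery died after two weeks of daily use",
    "great product love it great price",
    "fits my desk but the cable is short",
    "very nice set good quality love it",
    "returned it because the zipper broke",
    "great great love this very nice",
    "works fine with my older laptop",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text first, then let the pipeline fit TF-IDF on the training part.
X_train_txt, X_test_txt, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = make_pipeline(
    TfidfVectorizer(max_features=5000, stop_words='english'),
    MultinomialNB(),
)
pipe.fit(X_train_txt, y_tr)
preds = pipe.predict(X_test_txt)
print(preds)
```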

# Naive Bayes

Trains a Multinomial Naive Bayes classifier on the TF-IDF features.    
predict_proba returns the class probabilities, which are used below as a confidence score.

In [26]:
from sklearn.naive_bayes import MultinomialNB
model_nb = MultinomialNB()
model_nb.fit(X_train,y_train)

MultinomialNB
parameter,value
alpha,1.0
force_alpha,True
fit_prior,True
class_prior,None


Simple, fast baseline model for text classification.   
Works well for sparse TF-IDF features.

In [27]:
y_pred = model_nb.predict(X_test)

In [28]:
y_proba = model_nb.predict_proba(X_test)

In [29]:
# Column 1 is the predicted probability of class 1 (OR, genuine) for each test review
confidence = y_proba[:, 1]

# Model evaluation

In [30]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8458019042908371


In [31]:
from sklearn.metrics import precision_score

# precision_score defaults to pos_label=1, so this is the precision for the Genuine (OR) class
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

Precision: 0.8639175257731959


In [32]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

[[3488  528]
 [ 719 3352]]
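
In scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted classes, with class 0 (CG, fake) first. The sketch below recomputes accuracy and the per-class precision and recall directly from the matrix printed above. Only the four counts come from the notebook; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the Naive Bayes run above (rows = true, columns = predicted).
cm = np.array([[3488, 528],
               [719, 3352]])

correct = np.diag(cm)                  # correctly classified reviews per class
recall = correct / cm.sum(axis=1)      # share of each true class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(["Fake (CG)", "Genuine (OR)"], precision, recall):
    print(f"{name}: precision={p:.4f} recall={r:.4f}")
print(f"accuracy={accuracy:.4f}")
```

The precision for class 1 and the accuracy reproduce the values printed by `precision_score` and `accuracy_score` above.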


Predicts Fake/Genuine for a new review with confidence score.

In [33]:
def decode_prediction(pred):
    if pred == 0:
        return "Fake Review (CG)"
    else:
        return "Genuine Review (OR)"

results = [decode_prediction(p) for p in y_pred]


In [34]:
new_review = ["I bought this for my mother. She uses it mainly for video calls and watching YouTube. The screen is clear and the"
" sound is loud enough, but the camera quality is average in low light."]

# Vectorize once and reuse the result for both calls
new_X = tfidf.transform(new_review)
new_pred = model_nb.predict(new_X)
new_proba = model_nb.predict_proba(new_X)

print("Prediction:", decode_prediction(new_pred[0]))
# Report the probability of the predicted class, not always class 1 (OR)
print("Confidence:", round(new_proba[0][new_pred[0]]*100, 2), "%")


Prediction: Genuine Review (OR)
Confidence: 81.16 %
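
The training text was lowercased and stripped of punctuation before vectorizing, while the new review above is passed to the vectorizer as raw text. A hedged sketch of a helper that applies the same cleaning to new input and returns the probability of the predicted class is shown below. `clean_text`, `predict_review`, and the toy training data are hypothetical; in the notebook the fitted `tfidf` and `model_nb` would be passed in instead.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    # Same steps as the cleaning section: lowercase, then delete punctuation.
    return text.lower().translate(PUNCT_TABLE)

def predict_review(text, vectorizer, model):
    features = vectorizer.transform([clean_text(text)])
    proba = model.predict_proba(features)[0]
    best = proba.argmax()  # index of the most probable class
    label = "Fake Review (CG)" if model.classes_[best] == 0 else "Genuine Review (OR)"
    return label, float(proba[best])

# Toy model so the sketch runs on its own (0 = CG, 1 = OR).
train_texts = ["Love it! Great, great product.", "Very nice. Love this!",
               "The strap broke after two days.", "Battery lasts a week on my phone."]
train_labels = [0, 0, 1, 1]
toy_vec = TfidfVectorizer()
toy_nb = MultinomialNB().fit(
    toy_vec.fit_transform([clean_text(t) for t in train_texts]), train_labels
)

label, confidence = predict_review("Battery lasts two days, but the strap broke.",
                                   toy_vec, toy_nb)
print("Prediction:", label)
print("Confidence:", round(confidence * 100, 2), "%")
```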


# XGBoost

In [35]:
from xgboost import XGBClassifier

Trains XGBoost, a powerful gradient boosting classifier.   
Evaluates with accuracy and confusion matrix.

In [36]:
model_xg = XGBClassifier()
model_xg.fit(X_train,y_train)

XGBClassifier (parameter table, truncated)
parameter,value
objective,'binary:logistic'
base_score,None
booster,None
callbacks,None
colsample_bylevel,None
colsample_bynode,None
colsample_bytree,None
device,None
early_stopping_rounds,None
enable_categorical,False


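In this run XGBoost has higher recall on genuine (OR) reviews than Naive Bayes (3544 vs. 3352 of 4071 correctly identified), so fewer real reviews are misclassified as fake. The trade-off is that more fake reviews are labelled genuine (849 vs. 528) and overall accuracy is lower (0.830 vs. 0.846).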

In [37]:
xg_pred = model_xg.predict(X_test)

In [38]:
accuracy = accuracy_score(y_test, xg_pred)
print("Accuracy:", accuracy)

Accuracy: 0.82985037714851


In [39]:
cm = confusion_matrix(y_test, xg_pred)
print(cm)

[[3167  849]
 [ 527 3544]]


In [None]:
import pickle

with open('xgb_fake_review_model.pkl', 'wb') as f:
    pickle.dump(model_xg, f)


In [43]:
import pickle

with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
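
To use the saved files later, the vectorizer and the model must be loaded together, because the model expects features produced by the vocabulary it was trained with. The round trip below is a minimal sketch that runs on its own: it uses a toy `MultinomialNB` as a stand-in for the XGBoost model and writes to a temporary directory instead of the notebook's working directory. Only unpickle files from a source you trust.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy vectorizer and classifier (hypothetical stand-ins for tfidf and model_xg).
texts = ["love it great product", "very nice love this",
         "strap broke after two days", "battery lasts a week"]
labels = [0, 0, 1, 1]
vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, 'tfidf_vectorizer.pkl')
    model_path = os.path.join(tmp, 'model.pkl')

    # Save both artifacts, as in the cells above.
    with open(vec_path, 'wb') as f:
        pickle.dump(vec, f)
    with open(model_path, 'wb') as f:
        pickle.dump(clf, f)

    # Load them back the way a separate prediction script would.
    with open(vec_path, 'rb') as f:
        loaded_vec = pickle.load(f)
    with open(model_path, 'rb') as f:
        loaded_clf = pickle.load(f)

sample = ["works great so far"]
before = clf.predict(vec.transform(sample))
after = loaded_clf.predict(loaded_vec.transform(sample))
print(before, after)
```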

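# Conclusion

On the held-out test set (20% of the data), Multinomial Naive Bayes reaches about 84.6% accuracy and XGBoost about 83.0%. Both models separate fake (CG) from genuine (OR) reviews well enough to help e-commerce platforms flag likely fraudulent reviews.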