# **Hotel Fake Review Detection – Week 1**

## **1. Setup**

The cells below install the required libraries, import the modules used in this notebook, and download the NLTK resources (`stopwords`, `wordnet`, `punkt`).

**Before running this cell, please follow Step 2 in the README to create and activate your Python virtual environment.**

**Note for macOS users:**  
If you encounter SSL errors (e.g., certificate verify failed) when downloading NLTK data, please refer to Step 3 in the README for instructions on how to resolve this.

In [1]:
%pip install -q torch transformers datasets scikit-learn pandas nltk ipywidgets beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jasminehuang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jasminehuang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jasminehuang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## **2. Load Datasets**

In [3]:
# Deceptive Opinion Spam Corpus
hotel_reviews_df = pd.read_csv('../data/raw/dosc_hotel_reviews.csv')
# Kaggle product reviews dataset
product_reviews_df = pd.read_csv('../data/raw/kaggle_fake_reviews.csv')

# Basic check
print("Product Reviews:", product_reviews_df.shape)
print("Hotel Reviews:", hotel_reviews_df.shape)

Product Reviews: (40432, 4)
Hotel Reviews: (1600, 5)
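
Later cells rely on specific column names (`text_` and `label` for the product data, `text` and `deceptive` for the hotel data), so a quick schema check right after loading can surface a wrong or outdated CSV early. The following is a minimal sketch, not part of the original notebook; the helper name `missing_columns` is hypothetical, and the hard-coded lists stand in for `product_reviews_df.columns` and `hotel_reviews_df.columns` as reported in Section 3.1.

```python
def missing_columns(actual, expected):
    """Return the expected column names that are absent from `actual`, sorted."""
    return sorted(set(expected) - set(actual))


# Column lists copied from the notebook output; in practice pass df.columns instead.
product_cols = ["category", "rating", "label", "text_"]
hotel_cols = ["deceptive", "hotel", "polarity", "source", "text"]

# Nothing is missing for the product data.
print(missing_columns(product_cols, ["text_", "label"]))
# The hotel data has no 'label' column (its label column is 'deceptive').
print(missing_columns(hotel_cols, ["text", "label"]))
```

An empty list means every required column is present; anything else names the columns to investigate before continuing.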


## **3. Data Exploration**

### 3.1 Examine the structure of both datasets

In [4]:
print("=== EXPLORING DATASETS ===")
print("\n1. Product Reviews Dataset (Kaggle):")
print(f"Shape: {product_reviews_df.shape}")
print(f"Columns: {product_reviews_df.columns.tolist()}")
print("\nFirst few rows:")
print(product_reviews_df.head())

print("\n2. Hotel Reviews Dataset (DOSC):")
print(f"Shape: {hotel_reviews_df.shape}")
print(f"Columns: {hotel_reviews_df.columns.tolist()}")
print("\nFirst few rows:")
print(hotel_reviews_df.head())

=== EXPLORING DATASETS ===

1. Product Reviews Dataset (Kaggle):
Shape: (40432, 4)
Columns: ['category', 'rating', 'label', 'text_']

First few rows:
             category  rating label  \
0  Home_and_Kitchen_5     5.0    CG   
1  Home_and_Kitchen_5     5.0    CG   
2  Home_and_Kitchen_5     5.0    CG   
3  Home_and_Kitchen_5     1.0    CG   
4  Home_and_Kitchen_5     5.0    CG   

                                               text_  
0  Love this!  Well made, sturdy, and very comfor...  
1  love it, a great upgrade from the original.  I...  
2  This pillow saved my back. I love the look and...  
3  Missing information on how to use it, but it i...  
4  Very nice set. Good quality. We have had the s...  

2. Hotel Reviews Dataset (DOSC):
Shape: (1600, 5)
Columns: ['deceptive', 'hotel', 'polarity', 'source', 'text']

First few rows:
  deceptive   hotel  polarity       source  \
0  truthful  conrad  positive  TripAdvisor   
1  truthful   hyatt  positive  TripAdvisor   
2  truthful   hya

### 3.2 Check for missing values

In [5]:
print("\n=== MISSING VALUES CHECK ===")
print("Product Reviews missing values:")
print(product_reviews_df.isnull().sum())
print("\nHotel Reviews missing values:")
print(hotel_reviews_df.isnull().sum())


=== MISSING VALUES CHECK ===
Product Reviews missing values:
category    0
rating      0
label       0
text_       0
dtype: int64

Hotel Reviews missing values:
deceptive    0
hotel        0
polarity     0
source       0
text         0
dtype: int64
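
`isnull()` only counts true missing values (NaN or None). A review that consists of an empty string or only whitespace would still be reported as 0 missing here. A small sketch of an additional check follows; `count_blank` is a hypothetical helper that accepts any iterable of texts (for example `product_reviews_df['text_']`), and the sample list is toy data, not taken from either dataset.

```python
def count_blank(texts):
    """Count entries that are not strings or that contain only whitespace."""
    return sum(1 for t in texts if not isinstance(t, str) or not t.strip())


# Toy data for illustration only.
sample = ["Great stay!", "", "   ", None, "ok"]
print(count_blank(sample))
```

Here the empty string, the whitespace-only string, and `None` are all counted as blank.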


### 3.3 Check label distributions

In [6]:
print("\n=== LABEL DISTRIBUTIONS ===")
print("Product Reviews labels:")
print(product_reviews_df['label'].value_counts())
print("\nHotel Reviews labels:")
print(hotel_reviews_df['deceptive'].value_counts())


=== LABEL DISTRIBUTIONS ===
Product Reviews labels:
label
CG    20216
OR    20216
Name: count, dtype: int64

Hotel Reviews labels:
deceptive
truthful     800
deceptive    800
Name: count, dtype: int64
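
Both datasets are perfectly balanced (20216/20216 and 800/800). A sketch of how that balance could be checked programmatically is shown below; `class_shares` is a hypothetical helper, and the label lists are rebuilt from the counts printed above rather than read from the CSV files.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the total as a dict of floats."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Rebuilt from the printed counts; in the notebook, pass df['label'] instead.
product_labels = ["CG"] * 20216 + ["OR"] * 20216
hotel_labels = ["truthful"] * 800 + ["deceptive"] * 800

print(class_shares(product_labels))
print(class_shares(hotel_labels))
```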


### 3.4 Check text column names and sample texts

In [7]:
print("\n=== TEXT SAMPLES ===")
print("Product Reviews text samples:")
text_col_product = [col for col in product_reviews_df.columns if 'text' in col.lower()][0]
print(f"Text column: {text_col_product}")
print(product_reviews_df[text_col_product].head(2).tolist())

print("\nHotel Reviews text samples:")
text_col_hotel = [col for col in hotel_reviews_df.columns if 'text' in col.lower()][0]
print(f"Text column: {text_col_hotel}")
print(hotel_reviews_df[text_col_hotel].head(2).tolist())


=== TEXT SAMPLES ===
Product Reviews text samples:
Text column: text_
['Love this!  Well made, sturdy, and very comfortable.  I love it!Very pretty', "love it, a great upgrade from the original.  I've had mine for a couple of years"]

Hotel Reviews text samples:
Text column: text
['We stayed for a one night getaway with family on a thursday. Triple AAA rate of 173 was a steal. 7th floor room complete with 44in plasma TV bose stereo, voss and evian water, and gorgeous bathroom(no tub but was fine for us) Concierge was very helpful. You cannot beat this location... Only flaw was breakfast was pricey and service was very very slow(2hours for four kids and four adults on a friday morning) even though there were only two other tables in the restaurant. Food was very good so it was worth the wait. I would return in a heartbeat. A gem in chicago... \n', 'Triple A rate with upgrade to view room was less than $200 which also included breakfast vouchers. Had a great view of river, lake, Wrigley

## **4. Data Preprocessing and Standardization**

### 4.1 Clean and standardize datasets

In [8]:
import sys

# Make the project's src/ directory importable from the notebook
sys.path.append('../src')
from data_processing import standardize_dataset, convert_labels_to_binary

product_clean_df = standardize_dataset(product_reviews_df, 'text_', 'label')
hotel_clean_df = standardize_dataset(hotel_reviews_df, 'text', 'deceptive')

print("\n=== CLEANED DATASETS ===")
print("\n1. Product Reviews Dataset (Kaggle):")
print(product_clean_df.head())
print("\n2. Hotel Reviews Dataset (DOSC):")
print(hotel_clean_df.head())


=== CLEANED DATASETS ===

1. Product Reviews Dataset (Kaggle):
                                                text label
0  love well made sturdy comfortable love itvery ...    CG
1   love great upgrade original ive mine couple year    CG
2            pillow saved back love look feel pillow    CG
3        missing information use great product price    CG
4                nice set good quality set two month    CG

2. Hotel Reviews Dataset (DOSC):
                                                text     label
0  stayed one night getaway family thursday tripl...  truthful
1  triple rate upgrade view room less 200 also in...  truthful
2  come little late im finally catching review pa...  truthful
3  omni chicago really delivers front spaciousnes...  truthful
4  asked high floor away elevator got room pleasa...  truthful
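
The implementation of `standardize_dataset` lives in `../src/data_processing` and is not shown here. Judging only from the output, it appears to lowercase the text, strip punctuation, remove stopwords, reduce words to a base form (`years` becomes `year`), and rename the columns to `text` and `label`. The sketch below illustrates the first three steps under those assumptions. It is not the project's actual code: `clean_text_sketch` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set is a stand-in for a full stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
DEMO_STOPWORDS = {"this", "and", "very", "i", "it", "a", "the", "from"}


def clean_text_sketch(text: str) -> str:
    """Lowercase, delete non-alphanumeric characters, and drop stopwords."""
    text = text.lower()
    # Punctuation is deleted outright, not replaced by a space.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = text.split()
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)


print(clean_text_sketch("Love this!  Well made, sturdy, and very comfortable."))
print(clean_text_sketch("I love it!Very pretty"))
```

The second call shows how a token such as `itvery` (visible in row 0 of the product output) can arise: when the raw text has no space after a punctuation mark, deleting the mark fuses the neighboring words. Replacing punctuation with a space instead of an empty string would keep them separate.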


### 4.2 Convert labels to binary

In [9]:
from data_processing import convert_labels_to_binary
product_final_df = convert_labels_to_binary(product_clean_df)
hotel_final_df = convert_labels_to_binary(hotel_clean_df)

print("\n=== FINAL DATASETS ===")
print("\n1. Product Reviews Dataset (Kaggle):")
print(product_final_df.head())
print("\n2. Hotel Reviews Dataset (DOSC):")
print(hotel_final_df.head())


=== FINAL DATASETS ===

1. Product Reviews Dataset (Kaggle):
                                                text  label
0  love well made sturdy comfortable love itvery ...      1
1   love great upgrade original ive mine couple year      1
2            pillow saved back love look feel pillow      1
3        missing information use great product price      1
4                nice set good quality set two month      1

2. Hotel Reviews Dataset (DOSC):
                                                text  label
0  stayed one night getaway family thursday tripl...      0
1  triple rate upgrade view room less 200 also in...      0
2  come little late im finally catching review pa...      0
3  omni chicago really delivers front spaciousnes...      0
4  asked high floor away elevator got room pleasa...      0
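
The output shows `CG` mapped to 1 and `truthful` mapped to 0, which suggests the convention 1 = fake, 0 = genuine. The sketch below illustrates such a mapping. It is not the code of `convert_labels_to_binary`; the names `LABEL_TO_BINARY` and `to_binary_sketch` are hypothetical, and the entries for `OR` and `deceptive` are assumptions inferred from that convention, since neither appears in the rows printed above.

```python
# Assumed convention: 1 = fake review, 0 = genuine review.
LABEL_TO_BINARY = {
    "CG": 1,         # shown in the notebook output
    "OR": 0,         # assumption, not shown in the output
    "deceptive": 1,  # assumption, not shown in the output
    "truthful": 0,   # shown in the notebook output
}


def to_binary_sketch(label: str) -> int:
    """Map a raw label to 0/1, failing loudly on anything unexpected."""
    if label not in LABEL_TO_BINARY:
        raise ValueError(f"Unexpected label: {label!r}")
    return LABEL_TO_BINARY[label]


print([to_binary_sketch(x) for x in ["CG", "OR", "truthful", "deceptive"]])
```

Raising on unknown labels, rather than silently returning a default, makes a mislabeled or renamed category visible immediately.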


## **5. Save Final Cleaned Data**

In [10]:
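import os

# Make sure the output directory exists before writing
os.makedirs('../data/processed', exist_ok=True)

# Save the final cleaned datasets
product_final_df.to_csv('../data/processed/product_final_clean.csv', index=False)
hotel_final_df.to_csv('../data/processed/hotel_final_clean.csv', index=False)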

print("✅ Final cleaned datasets saved to data/processed/")
print(f"✅ Product reviews: {product_final_df.shape}")
print(f"✅ Hotel reviews: {hotel_final_df.shape}")

✅ Final cleaned datasets saved to data/processed/
✅ Product reviews: (40431, 2)
✅ Hotel reviews: (1600, 2)
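
The saved product dataset has 40431 rows, one fewer than the 40432 rows loaded in Section 2, so preprocessing removes one row; the notebook does not show which one or why. When these files are reloaded later, a related pandas behavior is worth keeping in mind: by default `read_csv` parses empty fields as NaN, so a text that was cleaned down to an empty string comes back as a missing value. The following is a minimal round-trip sketch on toy data; the file name and rows are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame: the second text is empty, as could happen if cleaning removed every token.
df = pd.DataFrame({"text": ["stayed one night", ""], "label": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_demo.csv")
    df.to_csv(path, index=False)
    # Default parsing: the empty field becomes NaN.
    reloaded = pd.read_csv(path)
    # With keep_default_na=False the empty string is preserved.
    reloaded_keep = pd.read_csv(path, keep_default_na=False)

print(reloaded.shape)
print(reloaded["text"].isna().sum())
print((reloaded_keep["text"] == "").sum())
```

Checking `isna().sum()` on the reloaded `text` column is a cheap way to confirm that no empty reviews slipped into the processed files.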
