# Case 11 - Detecting Fake Reviews for Hotels Using Text Analytics

## Situation
Hardly a day passes without fake news or fake content being mentioned in the popular press. Platforms such as Facebook and TripAdvisor host vast amounts of user-generated content, and when that content is fake, it undermines users' trust in the platform. The problem is especially acute for sites like TripAdvisor, which act as gatekeepers to the hotel industry: many travelers choose a hotel based on the reviews they find there.

---

## Complication
It is widely suspected that many reviews on TripAdvisor are fake. Management understands that machine learning works well for tasks like spam detection (e.g., Gmail’s spam filter), but is unsure whether the same technology can be used to automatically detect or flag fake reviews.

---

## Key Question
Can we build a classifier to distinguish between truthful and deceptive (fake) reviews for hotels?

---

## Data
A key challenge is obtaining labeled training data, that is, reviews already known to be genuine or fake. The problem is circular: if we could reliably label fake reviews in the wild, we would not need a detection system.

### **Key Innovation**
To overcome this challenge, researchers hired workers on **Amazon Mechanical Turk (AMT)** to create a dataset of “genuine fake reviews”:
- Workers were paid to write fake reviews of Chicago hotels.
- They were instructed to make the reviews believable and to pose as real guests who were either satisfied or dissatisfied with their stay.

### **Dataset Overview**
- **400 Truthful Positive Reviews**
- **400 Truthful Negative Reviews**
- **400 Deceptive Positive Reviews**
- **400 Deceptive Negative Reviews**
- In total, 1,600 reviews covering the 20 most popular hotels in Chicago.
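
The 2 × 2 design above can be checked once the file is loaded. A minimal sketch, assuming the CSV has a `polarity` column (positive/negative) and a `hotel` column alongside `deceptive`; those two column names, the helper `design_summary`, and the tiny stand-in frame are assumptions for illustration, not taken from the case text:

```python
import pandas as pd


def design_summary(df, label_col="deceptive", polarity_col="polarity", hotel_col="hotel"):
    """Cross-tabulate label by polarity and count the distinct hotels."""
    table = pd.crosstab(df[label_col], df[polarity_col])
    n_hotels = df[hotel_col].nunique()
    return table, n_hotels


# Made-up stand-in rows; on the real data use pd.read_csv("deceptive-opinion.csv").
sample = pd.DataFrame({
    "deceptive": ["truthful", "truthful", "deceptive", "deceptive"],
    "polarity": ["positive", "negative", "positive", "negative"],
    "hotel": ["hotel_a", "hotel_b", "hotel_a", "hotel_b"],
})

table, n_hotels = design_summary(sample)
print(table)
print("hotels:", n_hotels)
```

If the file matches the overview, every cell of the table should read 400 on the full dataset and the hotel count should be 20.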

### **Gold-Standard Deceptive Opinion Spam**
- 400 Human Intelligence Tasks (HITs) were created, evenly distributed across 20 hotels.
- **Requirements for Turkers:**
  - Located in the United States.
  - Approval rating of at least 90%.
  - Maximum of one submission per Turker.
  - Allowed up to 30 minutes to complete the HIT.
  - Paid $1 per accepted submission.

**Instructions for Turkers:**  
Turkers were told to assume the role of a hotel marketing employee asked to write a fake review as if they were a customer. The review had to sound realistic and portray the hotel in a positive light.

---

## Target Variable
- **Dependent Variable:** The `deceptive` column, which labels each review as truthful or deceptive.
- **Task:** Predict whether a review is truthful or deceptive.
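
Most classifiers expect a numeric target, so the label is usually encoded as 0/1 first. A minimal sketch, assuming the `deceptive` column holds the strings `truthful` and `deceptive`; the string values, the helper name `encode_target`, and the sample rows are assumptions for illustration:

```python
import pandas as pd


def encode_target(df, label_col="deceptive"):
    """Return 1 for deceptive reviews and 0 for truthful ones."""
    labels = df[label_col].str.strip().str.lower()
    # Fail loudly if the column holds anything other than the two expected labels.
    unknown = set(labels.unique()) - {"deceptive", "truthful"}
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return (labels == "deceptive").astype(int)


# Made-up stand-in rows; on the real data use pd.read_csv("deceptive-opinion.csv").
sample = pd.DataFrame({
    "deceptive": ["truthful", "deceptive", "Deceptive ", "truthful"],
    "text": ["Clean room.", "Best stay ever!", "Simply perfect.", "Slow check-in."],
})

y = encode_target(sample)
print(y.tolist())
```

Coding deceptive as 1 makes "deceptive" the positive class, so metrics such as precision and recall describe how well fake reviews are caught.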

---

## Links
- **Dataset:** [deceptive-opinion.csv]

---

## Notes and Points for Discussion

1. **Binary Classification Workflow:**
   - Once the review text is converted into a numerical representation (for example, word counts per review), the binary classification techniques used in earlier examples can be applied here.

2. **Random Forest Model:**
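   - This example introduces the random forest algorithm, an ensemble method that combines the predictions of many decision trees, each trained on a random sample of the data.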

3. **Feature Importance:**
   - The random forest model can compute which words in the reviews are most significant for classifying deceptive versus truthful reviews.
   - **Method:** Permutation importance is used: the values of one feature (word) at a time are randomly shuffled across reviews, and the resulting change in the model’s error is measured. The larger the increase in error, the more important the feature.

4. **Applications:**
   - What other industries or companies could benefit from this technology?
   - Can you identify similar applications beyond hotel reviews?

---