# COGS 108 - Project Proposal

# Names

- Caleb Galdston
- Liam Manatt
- Nilay Menon
- Vikram Venkatesh
- Vrisan Dubey

# Research Question

Given a product review’s content, **how well can we classify if the review was written by a human or not?**

## Background and Prior Work

In today’s internet-based world, online product reviews play an important role in consumer purchasing decisions and have a large impact on the reputation of the companies selling those products. Because reviews matter so much, people may try to game the system by writing fake ones. The internet hosts millions, if not billions, of product reviews, and that number grows every day. According to Scott Clark of CMSWire, “with the advent of generative AI, fake reviews are becoming more advanced and difficult to detect”.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) As artificial intelligence continues to advance, so does the possibility of fraudulent reviews generated by bots. Large language models make it easier than ever to produce fake reviews at scale, to the point where fake reviews could outnumber real ones. Given the importance of this problem, we want to see how well we can classify whether a review was written by a human.

Because the problem is so pressing, there have been many attempts to combat such reviews. For example, according to a study by Arjun Mukherjee and colleagues, “supervised learning was used with a set of review centric features (e.g., unigrams and review length) and reviewer and product centric features (e.g., average rating, sales rank, etc.) to detect fake reviews”. Features such as n-grams are important when predicting whether a review is fake or real. The same study notes that “An AUC (Area Under the ROC Curve) of 0.78 was reported using logistic regression. The assumption, however, is too restricted for detecting generic fake reviews”.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This suggests that detecting fake reviews may be harder than we initially thought.
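For context, AUC can be read as the probability that a randomly chosen fake review receives a higher score than a randomly chosen real one. The snippet below is a minimal sketch of that pairwise definition. The function name `roc_auc` and the toy labels and scores are made up for illustration and do not come from the cited study.

```python
from itertools import product


def roc_auc(labels, scores):
    """Pairwise AUC: share of (fake, real) pairs where the fake review scores higher.

    labels: 1 = fake (positive class), 0 = real. Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy, made-up model scores (higher = "more likely fake").
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which gives a sense of where the reported 0.78 sits.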

Another study on fake review detection using machine learning methods states that “fake reviews are differentiated from genuine reviews using four linguistic clues like level of detail, understandability, cognition indicators and writing style”.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Using features such as letter case, whether a word expresses a feeling, and each word’s part of speech, the researchers were able to use machine learning algorithms like logistic regression to classify whether a review was genuine. However, even these researchers found it difficult to reach a high level of accuracy, in part because a fabricated review can be very close to what is considered a genuine review.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Clark, Scott. "How to Spot and Combat Fake Reviews and Bots." *CMSWIRE*, (18 Oct 2023). https://www.cmswire.com/customer-experience/how-to-spot-and-combat-fake-reviews-and-bots/
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Mukherjee, Arjun et al. “Fake Review Detection : Classification and Analysis of Real and Pseudo Reviews.” (2013). https://www2.cs.uh.edu/~arjun/papers/UIC-CS-TR-yelp-spam.pdf
3. <a name="cite_note-3"></a> [^](#cite_ref-3) N. A. Patel and R. Patel, "A Survey on Fake Review Detection using Machine Learning Techniques," *2018 4th International Conference on Computing Communication and Automation (ICCCA)*, Greater Noida, India, (2018). https://ieeexplore.ieee.org/abstract/document/8777594



# Hypothesis


Given the text of a product review, our model will be able to accurately predict whether the review is real or fake. We believe that certain review features (length, unigrams, etc.) will be associated with fake reviews, and that our model, after training on thousands of sample reviews, will be able to use these features to predict whether a given review is real.

# Data

Our ideal dataset would be fairly simple. Each observation would be a review, with attributes containing its title and content. Metadata about each review would be helpful but not strictly required (such metadata could include the product associated with the review, how helpful it was, when it was published, etc.). Ideally, we would like to find a dataset where the content of each review is stored as a string and prelabeled as fake or real. This would make obtaining the data much easier, as we wouldn’t have to collect reviews from websites ourselves and then figure out how to tell whether they are real or fake. Even if we found a good way to label the data ourselves, scraping reviews raises a separate concern about website policies.

We would mostly focus on the text attributes and create new attributes derived from the text. Since we won’t have many explicit features (i.e., features that are directly present in the dataset), much of our time would be spent engineering new text-related features that reflect or summarize the data we already have. These will be paramount to the performance of our model.
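As a rough illustration of what this feature engineering could look like, here is a minimal sketch that turns one review string into a few hand-crafted features. The function name `extract_features`, the specific features chosen, and the example review are our own assumptions for illustration, not a final feature set.

```python
import re
from collections import Counter


def extract_features(review_text):
    """Turn one review string into a small dict of hand-crafted text features."""
    # Lowercased word tokens; apostrophes are kept so "don't" stays one token.
    tokens = re.findall(r"[a-z']+", review_text.lower())
    letters = [c for c in review_text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "n_chars": len(review_text),           # review length in characters
        "n_tokens": len(tokens),               # review length in words
        "upper_ratio": upper_ratio,            # share of letters that are uppercase
        "n_exclaim": review_text.count("!"),   # exclamation marks
        "unigrams": Counter(tokens),           # unigram (single-word) counts
    }


# Made-up example review.
example = "GREAT product! Works great, would buy again!"
features = extract_features(example)
print(features["n_tokens"], features["n_exclaim"], features["unigrams"]["great"])
```

Applying a function like this to every review would give us one row of numeric features per observation, which is the form most classifiers expect.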

We would like most of our reviews to be fairly recent, meaning from around 2015 to the present. Since AI-generated reviews have become increasingly common with the recent rise of software like ChatGPT and other large language models, we feel that our results will be more applicable and interpretable if we use newer data.

We would like to have 10,000+ observations, with at least 5,000 examples each of fake and real reviews. These numbers are a rough estimate, but we think that if we are able to collect this amount of data, our model would be able to learn reasonably well. We want to make sure that no single class dominates the data, since we fear this would make our model inherently biased toward the majority class, which we want to avoid.
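One way we could check for class imbalance, and correct it if needed, is sketched below. Downsampling the larger class is only one possible remedy and is an assumption on our part. The helper names `class_balance` and `downsample_to_balance`, the `label` key, and the toy rows are all hypothetical.

```python
import random
from collections import Counter


def class_balance(labels):
    """Return the share of observations that fall in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def downsample_to_balance(rows, label_key="label", seed=0):
    """Randomly drop rows from larger classes so every class matches the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: 5 fake and 15 real reviews.
rows = [
    {"text": f"review {i}", "label": "fake" if i % 4 == 0 else "real"}
    for i in range(20)
]
before = class_balance([r["label"] for r in rows])
balanced = downsample_to_balance(rows)
after = class_balance([r["label"] for r in balanced])
print(before, after)
```

Downsampling throws away data, so if the imbalance is mild we may prefer to keep every review and rely on class-aware metrics instead.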

# Ethics & Privacy

Classifying fake reviews can be ethically challenging. A false positive is particularly damaging. For example, if our model errs and marks a genuine review as fake, most people would simply discard that review, thereby invalidating the poster’s speech. Furthermore, websites might delete the review, completely preventing someone from sharing their opinion. Treating “fake” as the positive class, we can account for this in our model metrics by valuing precision more than recall. However, we cannot fully eliminate this possibility, so we would address the issue directly in our results analysis.
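To make the precision and recall trade-off concrete, here is a small sketch with “fake” as the positive class. Using an F-beta score with beta below 1 is one standard way to weight precision more heavily than recall. The choice of beta = 0.5, the function names, and the toy predictions are assumptions for illustration only.

```python
def precision_recall(y_true, y_pred, positive="fake"):
    """Precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)  # real marked fake
    fn = sum(t == positive and p != positive for t, p in pairs)  # fake that slipped by
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


# Toy predictions from a cautious model that rarely flags a review as fake.
y_true = ["fake", "fake", "fake", "real", "real", "real"]
y_pred = ["fake", "real", "real", "real", "real", "real"]
precision, recall = precision_recall(y_true, y_pred)
score = f_beta(precision, recall)
print(f"precision={precision:.2f} recall={recall:.2f} F0.5={score:.2f}")
```

In this toy case the model never marks a genuine review as fake, so precision is perfect even though it misses two of the three fake reviews.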

As for ethical concerns about our data source, biases in the data could greatly impact our results. For example, if the curator was biased in their data collection, drawing fake reviews from some products more often than others, the producers of those products could be adversely affected by our model’s bias. We would try to detect this during our EDA stage and address it thoroughly before making any statements about our results. Moreover, certain word choices may be penalized more heavily than others, which could unduly target geographic or cultural groups. To determine whether this is happening, we would have to audit our model. Additionally, there is a privacy concern with the data collection process: because our data will likely have been scraped, consent may be an issue. This is of the utmost importance, so we must ensure our data is ethically sourced before any model construction.

# Team Expectations 

As a group, our main focus is for everyone to contribute their fair share of what is expected each time we meet. It is important that each team member is held accountable for their responsibilities so that the group can progress towards our project goals and deadlines. All group members should be included in all communication regarding the project so that nobody feels left out or lost. It is also important to have respect for and understanding of any extenuating circumstances that may prevent someone from fulfilling their duties. In all, it boils down to everyone doing honest work to the best of their ability, staying up to date on project updates, being involved in discussion, showing up to meetings, and being respectful of one another.

Throughout the quarter, we will communicate through several channels: text messaging, Discord, and email. Text messaging is primarily used for deciding when to meet, keeping each other updated, and sharing general information about the project. Discord and email are used to share project materials, links, and more technical planning information. In all communication, it is important to be open-minded and respectful of other team members' input and ideas. Disagreements should be handled respectfully and not in a rash, harsh manner. Should a disagreement become heated, group members will need to meet in person and find a solution together. It is also important to tell the group if you need help with something instead of struggling by yourself. Finally, going above and beyond on project items is up to each member’s choice, ability, and effort.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/7  |  2 PM | Brainstorm where we can find the data that we need. Come up with at least one dataset that you think could be helpful (Everyone)  | Finalize dataset and distribute work for EDA/feature engineering | 
| 5/14  |  2 PM | Initial EDA/feature engineering; it does not have to be 100% done, but there needs to be some progress (Vrisan, Caleb, Nilay)  | Work on finishing the data checkpoint and figure out ways to continue EDA/feature engineering |
| 5/21  | 2 PM  | EDA and feature engineering are 90% finished (Vikram, Liam)  | Discuss what types of models we would like to use. Figure out a baseline model   |
| 5/28  | 2 PM  | EDA and feature engineering is 100% done. Baseline model is done with some type of results to show. Accuracy does not have to be great at all but just needs to be established as a baseline (Nilay, Caleb) | Work on finishing the EDA checkpoint and brainstorm and then finalize options for final model   |
| 6/7  | 2 PM  | Final Model and any hyperparameter tuning is done. Final results have been obtained (Everyone) | Start working on the video presentation and final notebook  |
| 6/14  | 2 PM  | Final notebook is 95% done and almost ready to be submitted (Everyone) | Record final version of the video and finalize the submitted notebook |