# COGS 108 - Data Checkpoint

# Names

- Vrisan Dubey
- Vikram Venkatesh
- Nilay Menon
- Caleb Galdston
- Liam Manatt

# Research Question

Given a product review’s content, are features generated from sentiment analysis, tf-idf, and bag-of-words models good predictors of whether the review was written by a human?


## Background and Prior Work

In today’s internet-based world, online product reviews play an important role in consumer purchasing decisions and have a large impact on the reputation of the companies selling those products. Because reviews matter so much, people have an incentive to game the system by writing fake ones. The internet already hosts millions, if not billions, of product reviews, and that number grows every day. According to Scott Clark of CMSWire, “with the advent of generative AI, fake reviews are becoming more advanced and difficult to detect”.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) As artificial intelligence continues to improve, so does the possibility of fraudulent reviews generated by bots, and large language models make it more plausible than ever that fake reviews could outnumber real ones. Given the importance of this problem, we wanted to see whether we could classify a review as written by a human or not.

Because the problem is so pressing, there have been many attempts to combat such reviews. For example, according to a study by Arjun Mukherjee and colleagues, “supervised learning was used with a set of review centric features (e.g., unigrams and review length) and reviewer and product centric features (e.g., average rating, sales rank, etc.) to detect fake reviews”.[<sup>2</sup>](#cite_note-2) Features such as n-grams are important when predicting whether a review is fake or real. “An AUC (Area Under the ROC Curve) of 0.78 was reported using logistic regression. The assumption, however, is too restricted for detecting generic fake reviews”.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This suggests that detecting fake reviews may be harder than we initially thought.

Another study of fake review detection using machine learning methods states that “fake reviews are differentiated from genuine reviews using four linguistic clues like level of detail, understandability, cognition indicators and writing style”.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Using features such as letter case, whether a word is a feeling word, and each word’s part of speech, the researchers were able to use machine learning algorithms like logistic regression to classify whether a review was genuine. However, even these researchers found it difficult to reach a high level of accuracy, in part because a fabricated review can be very close to what is considered a genuine review.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Clark, Scott. "How to Spot and Combat Fake Reviews and Bots." *CMSWIRE*, (18 Oct 2023). https://www.cmswire.com/customer-experience/how-to-spot-and-combat-fake-reviews-and-bots/
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Mukherjee, Arjun et al. “Fake Review Detection : Classification and Analysis of Real and Pseudo Reviews.” (2013). https://www2.cs.uh.edu/~arjun/papers/UIC-CS-TR-yelp-spam.pdf
3. <a name="cite_note-3"></a> [^](#cite_ref-3) N. A. Patel and R. Patel, "A Survey on Fake Review Detection using Machine Learning Techniques," *2018 4th International Conference on Computing Communication and Automation (ICCCA)*, Greater Noida, India, (2018). https://ieeexplore.ieee.org/abstract/document/8777594

# Hypothesis


Given the text of a product review, we hypothesize that our model can accurately predict whether the review was written by a human. We believe that certain features extracted from the review content, such as sentiment scores, term frequency-inverse document frequency (tf-idf) values, and bag-of-words representations, will reveal distinct, identifiable patterns in human-written and computer-generated reviews. By training on a large dataset, we expect our model to learn and use these features to differentiate between the two.
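To make two of the features named above concrete, here is a minimal sketch of bag-of-words counts and one common, unsmoothed tf-idf variant, written with the standard library only. The three toy reviews are invented for illustration and are not drawn from our dataset; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values would differ from the ones computed here.

```python
import math
from collections import Counter

# Hypothetical toy reviews, invented for illustration (not from the dataset).
reviews = [
    "love this pillow love the feel",
    "great pillow great price",
    "missing instructions but great product",
]

# Whitespace tokenization is a simplifying assumption for this sketch.
tokenized = [review.split() for review in reviews]

# Bag of words: raw token counts per review.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of reviews that contain each term at least once.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(reviews)


def tfidf(bag):
    """Relative term frequency times log(N / df), an unsmoothed tf-idf variant."""
    total = sum(bag.values())
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in bag.items()
    }


weights = [tfidf(bag) for bag in bags]

print(bags[0])     # token counts for the first review
print(weights[0])  # tf-idf weights for the first review
```

In the first toy review, "love" appears twice and in no other review, so it receives a larger weight than "pillow", which appears once and is shared with the second review.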

# Data

## Data overview
- Dataset #1
  - Dataset Name: Fake Reviews Dataset
  - Link to the dataset: https://www.kaggle.com/datasets/mexwell/fake-reviews-dataset
  - Number of observations: 40432
  - Number of variables: 4

Our dataset is relatively simple, containing only a few features: the text of the review (stored as a string), the review's rating (an ordinal variable between 1 and 5, stored as a float), and the review's category (stored as a string). The target variable labels each review as either Computer Generated ('CG') or Original Review ('OR'). The dataset has no missing values, so most of our data cleaning consists of small tasks such as making the column names and category values more readable. We also preprocess the text column to make it usable in our analysis, which includes lowercasing the text and creating one version without stopwords and another without punctuation. However, we keep the original reviews (with stopwords and punctuation) in case they carry meaning later in our analysis.

## Fake Reviews Dataset

In [1]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

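# Note: this rebinds the name `stopwords` from the nltk corpus module
# to a plain set of English stop words, which is what later cells use.
stopwords = set(stopwords.words('english'))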

[nltk_data] Downloading package punkt to /home/vdubey/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/vdubey/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Load the raw reviews dataset and preview the first few rows
reviews_df = pd.read_csv('data/fake_reviews.csv')
reviews_df.head()

Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5.0,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5.0,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5.0,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1.0,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5.0,CG,Very nice set. Good quality. We have had the s...


In [3]:
reviews_df.label.unique()

array(['CG', 'OR'], dtype=object)

In [4]:
reviews_df.dtypes

category     object
rating      float64
label        object
text_        object
dtype: object

In [5]:
reviews_df.isna().sum()

category    0
rating      0
label       0
text_       0
dtype: int64

In [6]:
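# Rename the text column to drop the trailing underscore
reviews_df = reviews_df.rename(columns = {'text_': 'text'})
# Strip the trailing '_5' suffix from each category and replace underscores with spaces
reviews_df['category'] = reviews_df['category'].apply(lambda s: s[:-2].replace('_', ' '))
# Ratings are whole numbers from 1 to 5, so store them as integers
reviews_df['rating'] = reviews_df['rating'].astype(int)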

In [7]:
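# Lowercased text with English stop words removed (punctuation tokens are kept)
reviews_df['text_no_stop'] = reviews_df['text'].apply(lambda s: ' '.join([token for token in word_tokenize(s.lower()) if token not in stopwords]))
# Lowercased text with punctuation removed (stop words are kept)
reviews_df['text_no_punct'] = reviews_df['text'].apply(lambda s: s.lower().translate(str.maketrans('', '', string.punctuation)))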

In [8]:
reviews_df.head()

Unnamed: 0,category,rating,label,text,text_no_stop,text_no_punct
0,Home and Kitchen,5,CG,"Love this! Well made, sturdy, and very comfor...","love ! well made , sturdy , comfortable . love...",love this well made sturdy and very comfortab...
1,Home and Kitchen,5,CG,"love it, a great upgrade from the original. I...","love , great upgrade original . 've mine coupl...",love it a great upgrade from the original ive...
2,Home and Kitchen,5,CG,This pillow saved my back. I love the look and...,pillow saved back . love look feel pillow .,this pillow saved my back i love the look and ...
3,Home and Kitchen,1,CG,"Missing information on how to use it, but it i...","missing information use , great product price !",missing information on how to use it but it is...
4,Home and Kitchen,5,CG,Very nice set. Good quality. We have had the s...,nice set . good quality . set two months,very nice set good quality we have had the set...


# Ethics & Privacy

Classifying fake reviews can be ethically challenging, and false positives are particularly damaging. For example, if our model errs and marks a genuine review as fake, most people would simply disregard it, thereby invalidating the poster’s speech. Furthermore, websites might delete the review, completely preventing someone from sharing their opinion. We can account for this in our model metrics by valuing precision (treating "fake" as the positive class) more than recall. However, we cannot fully eliminate this possibility, so we will address it directly in our analysis of results.
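As a rough illustration of the metric choice described above, the sketch below computes precision and recall by hand with 'CG' as the positive class, plus an F-beta score with beta = 0.5, which is one standard way to weight precision more heavily than recall. The labels and predictions are invented for illustration, and the choice of beta is an assumption rather than a decision we have made.

```python
# Hypothetical true labels and model predictions, invented for illustration.
y_true = ["CG", "CG", "CG", "CG", "OR", "OR", "OR", "OR", "OR", "OR"]
y_pred = ["CG", "CG", "OR", "OR", "CG", "OR", "OR", "OR", "OR", "OR"]

POSITIVE = "CG"  # assumption: "fake" (computer generated) is the positive class

pairs = list(zip(y_true, y_pred))
tp = sum(t == POSITIVE and p == POSITIVE for t, p in pairs)  # fake, flagged as fake
fp = sum(t != POSITIVE and p == POSITIVE for t, p in pairs)  # genuine, flagged as fake
fn = sum(t == POSITIVE and p != POSITIVE for t, p in pairs)  # fake, missed

precision = tp / (tp + fp)  # of the reviews flagged as fake, how many really were
recall = tp / (tp + fn)     # of the truly fake reviews, how many were flagged

# F-beta with beta < 1 puts more weight on precision than on recall.
beta = 0.5
f_half = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} F0.5={f_half:.3f}")
```

Here the single false positive is the genuine review wrongly flagged as fake, which is the error this section is most concerned about.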

Biases in our data source could also greatly impact our results. For example, if the curator drew fake reviews from a subset of products more often than others, the producers of those products could be adversely affected by our model’s bias. We will try to check for this in our EDA stage and address it thoroughly before making any statements about our results. Moreover, certain word choices may be penalized more heavily than others, which could unduly target geographic or cultural groups; determining this would require auditing our model. Additionally, there is a privacy concern with the data collection process: because our data is likely scraped, consent may be an issue. This is of utmost importance, so we must ensure the data is ethically sourced before constructing any model.
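One way the EDA check for sampling bias might look is a per-category breakdown of labels. The sketch below uses a small, invented DataFrame with the same `category` and `label` column names as our cleaned data; the rows themselves are hypothetical and say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical rows shaped like the cleaned reviews_df (values invented for illustration).
toy_df = pd.DataFrame({
    "category": ["Home and Kitchen", "Home and Kitchen", "Books", "Books", "Books", "Books"],
    "label": ["CG", "OR", "CG", "CG", "CG", "OR"],
})

# Overall share of each label across all reviews.
label_share = toy_df["label"].value_counts(normalize=True)

# Share of each label within each category (each row sums to 1).
by_category = pd.crosstab(toy_df["category"], toy_df["label"], normalize="index")

print(label_share)
print(by_category)
```

A category whose 'CG' share is far from the overall share would be a sign that fake reviews were drawn unevenly across products and would deserve a closer look.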

# Team Expectations 

As a group, our main focus is for everyone to contribute their fair share of what is expected each time we meet. It is important that each team member is held accountable for their responsibilities so that the group can progress toward our project goals and deadlines. All group members should be included in all communication about the project so that nobody feels left out or lost. It is also important to respect and understand any extenuating circumstances that may prevent someone from fulfilling their duties. Overall, it comes down to everyone doing honest work to the best of their ability, staying up to date with project communication, being involved in discussion, showing up to meetings, and being respectful of one another.

Throughout the quarter, we communicate through several channels: text messaging, Discord, and email. Text messaging is primarily used for deciding when to meet, keeping each other updated, and sharing general information about the project. Discord and email are used to share project materials, links, and more technical planning information. In all communication, it is important to be open-minded and respectful of other team members' input and ideas. Disagreements should be handled respectfully, not rashly or harshly. If a disagreement does become heated, group members will meet in person and find a solution together. It is also important to tell the group when you need help with something instead of struggling alone. Finally, going above and beyond on project items is up to each member’s choice, ability, and hard work.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/7  |  2 PM | Brainstorm where we can find the data that we need. Come up with at least one dataset that you think could be helpful (Everyone)  | Finalize dataset and distribute work for EDA/feature engineering | 
| 5/14  |  2 PM | Some sort of EDA/feature engineering, does not have to be 100% done but need to have some progress (Vrisan, Caleb, Nilay)  | Work on finishing the data checkpoint and figure out ways to continue EDA/feature engineering. Discuss potential models like Naive Bayes and features like bag of words and sentiment analysis to help us. | 
| 5/21  | 2 PM  | EDA and feature engineering is 90% finished. Go over most important EDA to put on project. (Vikram, Liam)  | Discuss what types of models we would like to use. Figure out a baseline model. |
| 5/28  | 2 PM  | EDA and feature engineering is 100% done. Baseline model is done with some type of results to show. Accuracy does not have to be great at all but just needs to be established as a baseline (Nilay, Caleb) | Work on finishing the EDA checkpoint and brainstorm and then finalize options for final model   |
| 6/7  | 2 PM  | Final Model and any hyperparameter tuning is done. Final results have been obtained (Everyone) | Start working on the video presentation and final notebook  |
| 6/14  | 2 PM  | Final notebook is 95% done and almost ready to be submitted (Everyone) | Record final version of the video and finalize the submitted notebook |