**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Vrisan Dubey
- Vikram Venkatesh
- Nilay Menon
- Caleb Galdston
- Liam Manatt

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects that have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="#..."> </a>` is a tag that allows you to produce a named reference for a given location.  Markdown has the construction `[text with hyperlink](#named reference)` that will produce a clickable link that transports you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


In today’s internet-based world, online product reviews play an important role in consumer purchasing decisions. Such reviews also strongly affect the reputation of the companies selling the products. Because reviews matter so much, people may try to game the system by writing fake reviews. The internet hosts millions, if not billions, of product reviews, and that number grows every day. According to Scott Clark of CMSWire, “with the advent of generative AI, fake reviews are becoming more advanced and difficult to detect.”<sup>1</sup> As artificial intelligence continues to advance, so does the possibility of fraudulent reviews generated by bots. Large language models make it more plausible than ever that fake reviews could come to outnumber real ones. Given the importance of this problem, we wanted to see whether we can classify a review as written by a human or not.

Because the topic is so pressing, there have been many attempts to combat such reviews. For example, according to a study by Arjun Mukherjee and colleagues, “supervised learning was used with a set of review centric features (e.g., unigrams and review length) and reviewer and product centric features (e.g., average rating, sales rank, etc.) to detect fake reviews.”<sup>2</sup> Features such as n-grams are important when predicting whether a review is fake or real. The same source reports that “An AUC (Area Under the ROC Curve) of 0.78 was reported using logistic regression. The assumption, however, is too restricted for detecting generic fake reviews.”<sup>2</sup> This suggests that detecting fake reviews may be harder than we initially thought.

Another study on fake review detection using machine learning methods states that “fake reviews are differentiated from genuine reviews using four linguistic clues like level of detail, understandability, cognition indicators and writing style.”<sup>3</sup> Using features such as letter case, whether a word expresses a feeling, and each word's part of speech, the researchers applied machine learning algorithms such as logistic regression to classify whether a review was genuine. However, even these researchers found it difficult to reach high accuracy, in part because a fabricated review can be very close to what is considered a genuine review.
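
To make the kinds of features mentioned above more concrete, here is a minimal sketch of how a few review-centric features (unigram counts, review length, and letter case) could be computed. The function name `review_features` and the exact feature set are our own illustrative assumptions, not the method used in the cited studies.

```python
from collections import Counter
import string


def review_features(text):
    """Compute a few simple review-centric features (illustrative only)."""
    # Split on whitespace, then strip punctuation from the edges of each token
    tokens = [t.strip(string.punctuation) for t in text.split()]
    tokens = [t for t in tokens if t]
    letters = [c for c in text if c.isalpha()]
    return {
        # Review length in words
        'length': len(tokens),
        # Fraction of letters that are uppercase
        'upper_ratio': sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        # Unigram counts over lowercased tokens
        'unigrams': Counter(t.lower() for t in tokens),
    }


features = review_features("Love this! Love the look.")
print(features['length'], features['unigrams']['love'])
```

Features like these would still need to be turned into a numeric matrix before being passed to a classifier such as logistic regression.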

1. Clark, S. (18 Oct 2023) How to Spot and Combat Fake Reviews and Bots. *CMSWire*. https://www.cmswire.com/customer-experience/how-to-spot-and-combat-fake-reviews-and-bots/
2. Mukherjee, A. et al. (2013) Fake Review Detection: Classification and Analysis of Real and Pseudo Reviews. https://www2.cs.uh.edu/~arjun/papers/UIC-CS-TR-yelp-spam.pdf
3. Patel, N. A. and Patel, R. (2018) A Survey on Fake Review Detection using Machine Learning Techniques. *2018 4th International Conference on Computing Communication and Automation (ICCCA)*, Greater Noida, India. https://ieeexplore.ieee.org/abstract/document/8777594

# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Fake Reviews Dataset
  - Link to the dataset: https://www.kaggle.com/datasets/mexwell/fake-reviews-dataset
  - Number of observations: 40432
  - Number of variables: 4

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

Our dataset is relatively simple, containing only a few features: the text of the review (stored as a string), the review's rating (an ordinal variable between 1 and 5, stored as a float), and the review's category (stored as a string). The target variable labels each review as either Computer Generated ('CG') or Original Review ('OR'). The dataset has no missing values, so most of our data cleaning consists of small tasks such as making the column names and category values more intuitive for readability. We also preprocess the text column to make it usable in our analysis, which includes lowercasing the text and removing stopwords and punctuation. However, we keep the original reviews (with stopwords and punctuation) in case they carry meaning later in our analysis.

## Fake Reviews Dataset

In [1]:
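import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer model and the stopword lists (skipped if already present)
nltk.download('punkt')
nltk.download('stopwords')

# Note: this rebinds the name `stopwords` from the NLTK corpus reader
# to a plain set of English stopwords, which is what later cells use
stopwords = set(stopwords.words('english'))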

[nltk_data] Downloading package punkt to /home/vdubey/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/vdubey/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Load the Fake Reviews Dataset from the local CSV file and preview the first rows
reviews_df = pd.read_csv('data/fake_reviews.csv')
reviews_df.head()

|   | category | rating | label | text_ |
|---|---|---|---|---|
| 0 | Home_and_Kitchen_5 | 5.0 | CG | Love this! Well made, sturdy, and very comfor... |
| 1 | Home_and_Kitchen_5 | 5.0 | CG | love it, a great upgrade from the original. I... |
| 2 | Home_and_Kitchen_5 | 5.0 | CG | This pillow saved my back. I love the look and... |
| 3 | Home_and_Kitchen_5 | 1.0 | CG | Missing information on how to use it, but it i... |
| 4 | Home_and_Kitchen_5 | 5.0 | CG | Very nice set. Good quality. We have had the s... |


In [3]:
reviews_df.label.unique()

array(['CG', 'OR'], dtype=object)

In [4]:
reviews_df.dtypes

category     object
rating      float64
label        object
text_        object
dtype: object

In [5]:
reviews_df.isna().sum()

category    0
rating      0
label       0
text_       0
dtype: int64

In [6]:
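# Rename the text column to drop its trailing underscore
reviews_df = reviews_df.rename(columns = {'text_': 'text'})
# Strip the trailing '_5' suffix from category names and replace underscores with spaces
reviews_df['category'] = reviews_df['category'].apply(lambda s: s[:-2].replace('_', ' '))
# Ratings are whole numbers from 1 to 5, so store them as integers
reviews_df['rating'] = reviews_df['rating'].astype(int)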

In [7]:
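# Lowercase and tokenize each review, then drop English stopwords (punctuation tokens are kept)
reviews_df['text_no_stop'] = reviews_df['text'].apply(lambda s: ' '.join([token for token in word_tokenize(s.lower()) if token not in stopwords]))
# Lowercase each review and delete all ASCII punctuation characters (stopwords are kept)
reviews_df['text_no_punct'] = reviews_df['text'].apply(lambda s: s.lower().translate(str.maketrans('', '', string.punctuation)))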

In [8]:
reviews_df.head()

|   | category | rating | label | text | text_no_stop | text_no_punct |
|---|---|---|---|---|---|---|
| 0 | Home and Kitchen | 5 | CG | Love this! Well made, sturdy, and very comfor... | love ! well made , sturdy , comfortable . love... | love this well made sturdy and very comfortab... |
| 1 | Home and Kitchen | 5 | CG | love it, a great upgrade from the original. I... | love , great upgrade original . 've mine coupl... | love it a great upgrade from the original ive... |
| 2 | Home and Kitchen | 5 | CG | This pillow saved my back. I love the look and... | pillow saved back . love look feel pillow . | this pillow saved my back i love the look and ... |
| 3 | Home and Kitchen | 1 | CG | Missing information on how to use it, but it i... | missing information use , great product price ! | missing information on how to use it but it is... |
| 4 | Home and Kitchen | 5 | CG | Very nice set. Good quality. We have had the s... | nice set . good quality . set two months | very nice set good quality we have had the set... |
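
The dataset description mentions removing both stopwords and punctuation, while each of the two derived columns applies only one of those steps. If a fully cleaned column turns out to be useful later, a minimal sketch could look like the following. The helper name `clean_text` and the tiny `DEMO_STOPWORDS` set are hypothetical and exist only to keep the example self-contained; in the notebook, the NLTK English stopword set would be passed in instead.

```python
import string

# Hypothetical, deliberately tiny stopword set so the example runs without NLTK data
DEMO_STOPWORDS = {'this', 'and', 'very', 'the', 'a', 'it', 'i', 'my'}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, then drop stopwords."""
    no_punct = text.lower().translate(PUNCT_TABLE)
    return ' '.join(tok for tok in no_punct.split() if tok not in stop_words)


cleaned = clean_text("This pillow saved my back. I love the look!")
print(cleaned)
```

In the notebook this could be applied with something like `reviews_df['text'].apply(lambda s: clean_text(s, stopwords))`. Note that it splits on whitespace rather than using `word_tokenize`, so contractions are handled differently than in `text_no_stop`.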


# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

Classifying fake reviews can be ethically challenging. False positives are particularly damaging. For example, if our model errs and marks a genuine review as fake, most readers would simply discard it, effectively invalidating the poster’s speech. Furthermore, websites might delete the review, preventing someone from sharing their opinion at all. We can account for this in our model metrics by valuing precision more than recall. However, we cannot fully eliminate this possibility, so we will address the issue directly in our results analysis.
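
To illustrate why precision is the metric tied to false positives, here is a minimal sketch that computes precision and recall by hand, treating 'CG' (flagged as fake) as the positive class. The function name `precision_recall` and the label lists are made up for illustration and are not results from our data.

```python
def precision_recall(y_true, y_pred, positive='CG'):
    """Return (precision, recall), treating `positive` as the 'flagged as fake' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # fake, flagged as fake
    fp = sum(t != positive and p == positive for t, p in pairs)  # genuine, flagged as fake
    fn = sum(t == positive and p != positive for t, p in pairs)  # fake, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only
y_true = ['CG', 'CG', 'CG', 'CG', 'OR', 'OR', 'OR', 'OR']
y_pred = ['CG', 'CG', 'OR', 'OR', 'CG', 'OR', 'OR', 'OR']
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Each genuine review wrongly flagged as fake lowers precision, so a model tuned for high precision flags fewer genuine reviews at the cost of missing more fake ones.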

Regarding our data source, biases in the data could greatly affect our results. For example, if the curator's data collection was biased, drawing fake reviews from some products more often than others, the producers of those products could be adversely affected by our model’s bias. We will try to assess this during our EDA stage and address it thoroughly before making any statements about our results. Moreover, certain word choices may be penalized more heavily than others, which could unduly target particular geographic or cultural groups. To determine whether this happens, we will have to audit our model. Additionally, there is a privacy concern with the data collection process: because our data is likely scraped, consent may be an issue. This is of utmost importance, so we must ensure the data is ethically sourced before constructing any model.
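
One simple EDA check for the sampling concern above is to compare the share of computer-generated reviews across product categories. The sketch below uses a small made-up DataFrame named `demo_df` (a hypothetical stand-in for `reviews_df`) with the same `category` and `label` columns; the category names and counts are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `reviews_df`
demo_df = pd.DataFrame({
    'category': ['Books', 'Books', 'Books', 'Books', 'Toys', 'Toys'],
    'label':    ['CG',    'CG',    'CG',    'OR',    'CG',   'OR'],
})

# Share of computer-generated ('CG') reviews within each category
cg_share = (demo_df['label'] == 'CG').groupby(demo_df['category']).mean()
print(cg_share)
```

Large differences in this share between categories would suggest that the model could learn category-specific cues rather than general signs of generated text. An even split would not rule out other forms of bias.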

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

As a group, our main focus is for everyone to contribute their fair share of what is expected each time we meet. Each team member must be held accountable for their responsibilities so that the group can progress toward our project goals and deadlines. All group members should be included in all communication about the project so that nobody feels left out or lost. It is also important to respect and understand any extenuating circumstances that may prevent someone from fulfilling their duties. In all, it boils down to everyone doing honest work to the best of their ability, staying up to date on project communication, being involved in discussion, showing up to meetings, and being respectful of one another.

Throughout the quarter, we communicate through several channels: text messaging, Discord, and email. Text messaging is primarily used to arrange meetings, keep each other updated, and share general information about the project. Discord and email are used to share project materials, links, and more technical planning information. In all communication, it is important to be open-minded and respectful of other team members' input and ideas. Disagreements should be handled respectfully, not in a rash or harsh manner. Should a conflict arise, group members will meet in person and find a solution together. It is also important to tell the group when you need help with something instead of struggling by yourself. Finally, going above and beyond on project items is up to each member’s choice, ability, and effort.

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |