# COGS 108 - Project Proposal

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Scott Huang: Analysis, Conceptualization, Writing – review & editing
- Ye Teng: Background research, Visualization, Writing – review & editing
- Cathy: Project administration, Software, Writing – original draft
- Fei Liang: Experimental investigation, Writing – original draft
- Rich: Data curation, Methodology, Writing – review & editing

## Research Question

Which product attributes (such as price, customer ratings, number of reviews, and brand) are most strongly associated with dietary supplement sales on e-commerce platforms?

## Background and Prior Work


Dietary supplements are a growing part of the health and wellness market, and many consumers now buy products such as vitamins, protein powders, and probiotics through e-commerce platforms. Unlike prescription medications, dietary supplements do not need FDA approval before they are marketed, which means consumers often rely on visible product information when making purchasing decisions.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Because of this, attributes such as price, brand reputation, and especially online reviews play an important role in shaping supplement sales online.

Previous research on online shopping behavior has consistently shown that customer ratings and reviews have a strong influence on product sales. Chevalier and Mayzlin found that products with higher ratings and more positive reviews tend to sell better, particularly when consumers cannot easily evaluate product quality before purchase.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This highlights how social proof becomes more important when buyers face uncertainty, which is often the case for health-related products like dietary supplements.

Beyond reviews, brand reputation also plays a key role in consumer decision-making. Prior research suggests that in online marketplaces with many similar products, consumers often rely on brand identity and reputation as signals of quality and reliability.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) This project builds on prior work by examining how price, customer ratings, number of reviews, and brand are associated with actual dietary supplement sales on e-commerce platforms, using observed sales data rather than self-reported preferences.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) U.S. Food & Drug Administration. Dietary Supplements. https://www.fda.gov/food/dietary-supplements

2. <a name="cite_note-2"></a> [^](#cite_ref-2) Chevalier, J. A., & Mayzlin, D. (2006). The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research, 43(3), 345–354. https://www.jstor.org/stable/30162409?seq=1

3. <a name="cite_note-3"></a> [^](#cite_ref-3) Clemons, E. K., Gao, G. G., & Hitt, L. M. (2006). When online reviews meet hyperdifferentiation. Journal of Management Information Systems, 23(2), 149–171. https://www.tandfonline.com/doi/abs/10.2753/MIS0742-1222230207


## Hypothesis


We hypothesize that price, customer ratings, review count, and brand are significantly associated with dietary supplement sales. Specifically, we expect review count and average rating to be positively associated with sales, price to be negatively associated with sales, and sales to differ across brands. Among these attributes, we hypothesize that review count will exhibit the strongest positive association with sales. We define “sales” as either total units sold, revenue generated, or sales rank, depending on data availability.

We believe that consumer purchasing decisions on e-commerce platforms are strongly influenced by social proof and credibility. In particular, the number of reviews may serve as an indicator of product popularity and trustworthiness, making consumers more likely to purchase items that appear widely used or endorsed by others.
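
If sales rank ends up being our sales proxy, the direction of the hypothesis flips on paper: a positive association between review count and sales would show up as a negative correlation between review count and rank, because lower rank means better sales. The sketch below illustrates this with a hand-written Spearman rank correlation. The numbers are made-up toy values, and the choice of Spearman correlation is an assumption for illustration, not a finalized analysis plan.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy values, not real data.
review_count = [12, 150, 980, 4300, 25000]
sales_rank = [90000, 41000, 8700, 1200, 35]  # lower rank = better sales

rho = spearman(review_count, sales_rank)
print(round(rho, 3))  # negative: more reviews go with a better (lower) rank
```

If units sold or revenue is available instead, the same function applies and the expected sign for review count becomes positive.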

## Data

Ideal dataset to answer our question

To rigorously study which product attributes are most strongly associated with dietary supplement sales on e-commerce platforms, the ideal dataset would be a product-level panel dataset (i.e., repeated measurements over time) covering a large set of dietary supplement items on a single platform (e.g., Amazon). The dataset would include a clear measure of sales performance (ideally true units sold or revenue; otherwise a reliable proxy such as Best Sellers Rank / Sales Rank), along with product attributes such as price, customer rating, number of reviews, and brand.

What variables are needed?

Outcome (sales / sales proxy)

- units_sold (preferred) or revenue (GMV), measured daily/weekly
- If not available: sales_rank / best_sellers_rank (BSR) as a sales proxy (lower rank typically indicates better sales performance)

Key predictors (product attributes)

- price (preferably the “buy box” price), plus list_price and discount if available
- rating_avg (average star rating)
- review_count (number of reviews)
- brand (canonical brand name)

Important controls (to reduce confounding)

- category / sub_category within dietary supplements (e.g., vitamins, protein, fish oil)
- product_age (days since listing / first observed date)
- pack_size / servings / net_weight (to compare products fairly; enables price-per-serving)
- Optional: prime_eligible, seller_type (Amazon vs third-party), badges (e.g., “Amazon’s Choice”), subscription availability
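
To make the variable list above more concrete, here is a minimal sketch of how one product observation could be represented, including two derived measures (price per serving and discount as a fraction of list price). The class name, field names, and example values are our own illustrative assumptions and do not reflect the schema of any specific dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductSnapshot:
    """One product observed on one date; field names are illustrative."""
    product_id: str
    brand: str
    sub_category: str
    price: float                  # buy-box price
    list_price: Optional[float]   # None when not reported
    servings: Optional[int]       # None when the listing omits it
    rating_avg: float
    review_count: int
    sales_rank: Optional[int]     # lower rank = better sales

    @property
    def price_per_serving(self) -> Optional[float]:
        """Price divided by servings, or None if servings is missing or zero."""
        if not self.servings:
            return None
        return self.price / self.servings

    @property
    def discount_fraction(self) -> Optional[float]:
        """Share of the list price taken off, or None without a valid list price."""
        if not self.list_price or self.list_price <= 0:
            return None
        return (self.list_price - self.price) / self.list_price


# Hypothetical example row, not real data.
snapshot = ProductSnapshot(
    product_id="EXAMPLE-001", brand="ExampleBrand", sub_category="protein",
    price=30.0, list_price=40.0, servings=60,
    rating_avg=4.5, review_count=1200, sales_rank=350,
)
print(snapshot.price_per_serving)   # 0.5
print(snapshot.discount_fraction)   # 0.25
```

Returning None for missing inputs keeps incomplete listings in the dataset while making it explicit that the derived value is unavailable.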


How many observations are needed?
At minimum, we would want ~1,000 products to fit a baseline multivariable model, but this can be unstable once we include brand and sub-category controls. A more realistic target is 5,000–20,000+ unique products, which supports: (1) brand effects (e.g., top brands vs “Other”), (2) category stratification, and (3) non-linear price relationships. If using time-series/panel data, having multiple time points per product (e.g., 30–180 days) would substantially improve reliability and allow analyses of how changes in price relate to changes in sales proxy.
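
One possible way to set up the "top brands vs “Other”" comparison is to keep only brands with enough products and group the rest under a single label. The sketch below assumes a simple count threshold; the function name, the `min_count` parameter, and the toy brand list are hypothetical.

```python
from collections import Counter


def collapse_rare_brands(brands, min_count, other_label="Other"):
    """Replace brands seen fewer than min_count times with other_label."""
    counts = Counter(brands)
    return [b if counts[b] >= min_count else other_label for b in brands]


# Hypothetical brand column, not real data.
brands = ["A", "A", "A", "B", "B", "C", "D"]
print(collapse_rare_brands(brands, min_count=2))
# ['A', 'A', 'A', 'B', 'B', 'Other', 'Other']
```

Grouping rare brands this way limits the number of brand indicators in a model, which is one reason a larger sample makes brand effects easier to estimate.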

Who/what/how would these data be collected?
Ideally, the data would be collected either (1) directly from the platform through an approved data source or third-party API that tracks products over time (e.g., sales rank and price history), or (2) from a reputable public dataset that provides product metadata plus review/rating information. If we build the dataset ourselves via an API, we would define a sampling frame of dietary supplement products and collect daily/weekly snapshots of price, rating, review count, and sales rank.

How would these data be stored/organized?
We would store the data in a normalized structure:
- Products table (products.csv): one row per product (ASIN/product_id) containing stable attributes (brand, category, pack size, etc.).
- Daily metrics table (daily_metrics.csv): one row per product per date containing time-varying measures (price, rating, review count, sales rank).
- Mapping/cleaning table (brand_category_map.csv): standardized brand names and cleaned sub-categories.

This format supports both cross-sectional modeling and panel/time-series analyses, and makes data cleaning reproducible.
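
As a rough sketch of how the products and daily metrics tables could be combined into one analysis table, the example below joins them on `product_id` using only the Python standard library. The column names and rows are made-up stand-ins for products.csv and daily_metrics.csv, and in practice we would likely do the same join with a data-frame library.

```python
import csv
import io

# Tiny in-memory stand-ins for products.csv and daily_metrics.csv (made-up rows).
PRODUCTS_CSV = """product_id,brand,sub_category
P1,ExampleBrand,vitamins
P2,OtherBrand,protein
"""

DAILY_METRICS_CSV = """product_id,date,price,rating_avg,review_count,sales_rank
P1,2025-01-01,19.99,4.6,1200,350
P1,2025-01-02,17.99,4.6,1204,310
P2,2025-01-01,34.50,4.2,85,9100
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_metrics_to_products(products, metrics):
    """Attach each product's stable attributes to its per-date metric rows."""
    by_id = {row["product_id"]: row for row in products}
    joined = []
    for metric in metrics:
        product = by_id.get(metric["product_id"])
        if product is None:
            continue  # skip metrics whose product is not in the products table
        joined.append({**product, **metric})
    return joined


panel = join_metrics_to_products(read_rows(PRODUCTS_CSV), read_rows(DAILY_METRICS_CSV))
for row in panel:
    # Values read by csv are strings; numeric columns still need converting.
    print(row["product_id"], row["date"], row["brand"], row["price"])
```

Each output row is one product on one date, which is the shape needed for panel analyses; keeping only one date per product gives the cross-sectional version.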

Potential Real Datasets
1) Kaggle — “Amazon Products Sales Dataset (42K+ items, 2025)”
https://www.kaggle.com/datasets/ikramshah512/amazon-products-sales-dataset-42k-items-2025
This Kaggle dataset provides a large collection of Amazon product listings and is suitable for building a product-level dataset for analysis. The data is available via Kaggle download as CSV and can typically be used immediately after importing into Python. The dataset often contains key predictors such as price, rating, review count, brand, and category, though the presence of a direct sales measure (or sales-rank proxy) depends on the specific column schema in the downloaded version. If it includes a sales proxy (e.g., rank) or a “sales” field, it can be used as the primary dataset for our regression/feature-importance analyses; otherwise it still serves as a strong source for product attributes.

2) Amazon Reviews ’23 (McAuley Lab) — product + review data at scale
https://amazon-reviews-2023.github.io/
Amazon Reviews ’23 is a research-grade dataset that contains large-scale Amazon review data and associated product metadata. It can be downloaded from the project website and related hosting (often HuggingFace/GitHub), but it is large and may require filtering to the Health/Dietary Supplements categories and possibly sampling for compute constraints. This dataset is particularly useful for constructing rating averages, review counts, and potentially review-derived features (e.g., sentiment), with timestamps that support time-based aggregation. However, it typically does not provide true sales, so we would use it mainly to study attribute relationships with a sales proxy (if available elsewhere) or as a complementary dataset to enrich our main product table.

3) Keepa API — price history + sales rank history (best for panel data)
https://keepaapi.readthedocs.io/en/latest/api_methods.html
Keepa is a specialized service that tracks Amazon product history, including price time series and often sales rank (BSR) time series, which aligns closely with our ideal panel dataset structure. Using Keepa generally requires an API key and is subject to usage limits/quotas, so we would need to plan a manageable sampling strategy (e.g., focusing on a subset of dietary supplement products). The most important variables from Keepa for our project would be daily/weekly price, sales rank, and product identifiers and metadata needed to join with brand/category information. This source would allow stronger analyses of how price changes correlate with sales-rank changes over time.


4) Kaggle — “Amazon US Customer Reviews Dataset” (review-focused, best as supplement)
https://www.kaggle.com/datasets/cynthiarempel/amazon-us-customer-reviews-dataset
This Kaggle dataset contains U.S. customer review data and is useful for extracting ratings, review counts, and potentially review-derived features. It is straightforward to access via Kaggle download, but it typically lacks detailed product pricing history and does not include direct sales measures. Therefore, it is best used as a supplementary dataset to enrich product-level reputation variables or to validate patterns seen in a main product-attribute dataset. If the dataset includes product identifiers that can be linked to a product table, we can merge it to compute aggregated reputation measures.

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Our data will consist of product reviews and ratings that are publicly available on the platforms. We are not working with individual human subjects or asking for their consent. Reviewers understood when they submitted a review that it would be publicly visible, so informed consent is not necessary for this project.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Several biases could be introduced during data collection. Products with more sales are likely to receive more reviews. People are more likely to leave a review if their experience was strongly positive or negative. There could also be fake reviews from sellers trying to boost sales. We will look for a way to normalize reviews and minimize the biases mentioned above.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> We will collect all data anonymously and exclude all PII (e.g., account names, email addresses). We will collect only the reviews and ratings of each product and will not use any other irrelevant information.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We will ensure the results don't favor any groups over others unfairly. It will be clearly stated that this project studies the nature of consumer behaviors based on ratings, reviews, and sales of dietary supplements rather than their actual medical effectiveness.


### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Since the data are already publicly available and we are not collecting any new data ourselves, we do not plan any additional measures to protect or secure the data.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> There are no data directly collected from any individual in the data collection process, and there will not be any personal information included.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> There is currently no plan to delete the local copy of the data we obtain. However, it is unlikely that we will ever use the data for any other purpose. We may adopt a data retention plan once the project is complete and the data are no longer needed.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Due to the nature of this project, the analysis is more likely to reflect trends in customers' purchasing behavior than the true health benefits of the supplements. We will clearly state the limitations of the analysis in the report so that readers are aware of them.
 
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Besides the biases already mentioned under data collection, the dataset has other potential biases. Sales and ratings/reviews might differ significantly across categories of supplements. There could also be temporal bias, where older products receive more reviews over time. We will address these biases by comparing supplements within categories and normalizing for the effects of temporal bias.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Our visualizations and summary statistics will be presented for the sole purpose of helping readers understand the underlying data more intuitively. They will be closely tied to the focus of the project and will not contain irrelevant or distracting information.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We will ensure that no data with PII will be used in the project. We do not need any PII for the purpose of the project so any data related to PII would be removed in the data cleaning process.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> The process of generating the analysis will be well documented throughout the project and presented in our final report. It will be documented step by step in detail such that the analysis is reproducible if we ever need to conduct the same analysis again. 

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> We will ensure that the analysis does not rely on variables or proxies that are unfairly discriminatory. The data will only describe characteristics of the products (ratings, reviews, sales, etc.), not specific individuals or groups. The results will not be used for discriminatory purposes.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will separate the supplements into groups to test the results. Performance will be evaluated across different groupings such as categories and price tiers.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We will use more metrics than sales prediction accuracy alone. We will consider multiple evaluation metrics (RMSE, correlation, etc.) and compare the results across them.
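
As a small illustration of the evaluation described in D.2 and D.3, the sketch below computes RMSE overall and separately for each product group, so that a model that fits one category much worse than another is easy to spot. The function names and all values are hypothetical; they are not real data or real model output.

```python
from collections import defaultdict
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def rmse_by_group(groups, actual, predicted):
    """Compute RMSE separately for each group label."""
    pairs = defaultdict(lambda: ([], []))
    for g, a, p in zip(groups, actual, predicted):
        pairs[g][0].append(a)
        pairs[g][1].append(p)
    return {g: rmse(a_vals, p_vals) for g, (a_vals, p_vals) in pairs.items()}


# Hypothetical values, not real data or real model output.
categories = ["vitamins", "vitamins", "protein", "protein"]
actual = [10.0, 12.0, 20.0, 24.0]
predicted = [11.0, 11.0, 20.0, 20.0]

print(rmse(actual, predicted))
print(rmse_by_group(categories, actual, predicted))
```

The same grouping idea works for price tiers or any other label we decide to check.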

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> We will choose a model that balances performance and interpretability. All features used in the model will be explained, and the decision-making process will be clear.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> The limitations of the project will be clearly communicated in the report. We will elaborate on the shortcomings and biases briefly mentioned in this section so that readers can clearly understand the limitations and apply the findings as intended. We will state that the project investigates consumer behavior rather than supplement effectiveness.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> We will monitor the model through sample predictions randomly selected from the platform.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> If any users are harmed by the model, we will go back and analyze the reason for the unintended harm, and revise (modifying features or model) to prevent similar problems in the future.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> If at any point we find a rollback is needed, we will remove the results of the project immediately to minimize their effects.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> As stated earlier, we will clearly state the limitations of this project so users understand its purpose and use the results accordingly. We currently do not have a plan to monitor unintended uses.


## Team Expectations 

  Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* Our team will primarily communicate through WeChat for updates, questions, and storing documents and datasets.
* *Team Expectation 2*
* We will assume that all feedback is well-intentioned and aimed at improving the project. Team members are encouraged to speak up if they disagree or have concerns.
* *Team Expectation 3*
* All team members are expected to contribute equally across the entire project in terms of effort. If a member is struggling to complete a task, they are expected to notify the group as soon as possible. The team will work together to redistribute tasks if needed.
* *Team Expectation 4*
* If issues arise regarding participation or communication, we will first address them respectfully within the group through written or verbal communication. If a member continues to fail to meet expectations after being notified, we will follow course policy and escalate the issue to the professor.
* *Team Expectation 5*
* If the team cannot reach an agreement on a minor decision after discussion (e.g., formatting or visualization style), we will first attempt a majority vote. If a quick resolution is needed, we may use a neutral random decision method (such as a coin flip or random spinner) to ensure fairness and keep the project moving forward. Major decisions related to the dataset selection or analysis methods will require group agreement.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| January 29 | 6:30 PM | Reviewed an example project on fast food access and disease rates. | Summarized the project; Discussed strengths, weaknesses, and limitations (multicollinearity, confounding, small sample size); Identified lessons to apply to our own project, including being transparent about limitations. |
| February 4 | 6:30 PM | Review final project checklist and course requirements; Review and finalize the research question and variables. | Confirm project scope and research question; Align on outcome variable (sales or sales rank) and predictor variables (price, ratings, number of reviews, brand); Assign initial roles. |
| February 9 | 6:30 PM | Conduct background research on dietary supplements and e-commerce consumer behavior; Identify 1–2 relevant prior studies or articles. | Discuss background context and prior work; Decide which sources to cite; Refine and finalize the hypothesis. |
| February 12 | 6:30 PM | Identify and explore potential datasets; Review dataset features, size, and limitations. | Finalize dataset selection; Discuss data limitations; Draft the Dataset(s) section. |
| February 16 | 6:30 PM | Import dataset; Perform initial data cleaning (handle missing values, convert variable types). | Review data cleaning decisions; Decide how to handle outliers; Confirm the dataset is clean and usable. |
| February 20 | 6:30 PM | Conduct exploratory data analysis (EDA); Create preliminary visualizations showing relationships between product attributes and sales. | Interpret EDA results; Decide which visualizations to include in the final project. |
| February 25 | 6:30 PM | Perform correlation and regression analyses; Save analysis outputs. | Interpret analysis results; Discuss associations between variables; Emphasize correlation versus causation. |
| March 2 | 6:30 PM | Draft the Ethics and Bias section; Identify potential ethical concerns and sources of bias in the data. | Review ethical considerations; Discuss how bias and limitations are addressed in the analysis. |
| March 6 | 6:30 PM | Draft the Conclusion and Discussion sections. | Edit and refine conclusions; Ensure the research question is clearly answered. |
| March 10 | 6:30 PM | Combine all sections into a complete project notebook draft. | Full project review; Edit for clarity, organization, and narrative flow. |
| March 16 | 6:30 PM | Prepare the project video script and select key visualizations. | Practice and refine video explanation. |
| March 18 | 6:30 PM | Final proofreading. | Submit Final Project and Group Project Survey. |