# COGS 108 - Project Proposal

## Authors

- Thomas Cung: Project Administration, Analysis, Data curation
- Devam Parekh: Data curation, Analysis, Software
- Jennifer Nguyen: Project Administration, Background Research, Conceptualization
- Russell Ng: Visualization, Methodology, Writing – Review & Editing
- Joshua Narciso: Experimental Investigation, Writing – Original Draft, Analysis

## Research Question

What browsing behaviors best predict whether a user completes a purchase or abandons their cart? To answer this question, we will examine whether a consumer makes a purchase during their session (purchase = 1, no purchase = 0). Candidate predictors include product-related page views, page values, bounce rate, exit rate, total session duration, return visits, traffic source, time period (weekend/holiday), and cart-related interactions (adding or removing items). This is a prediction task: we will model the probability that a user makes a purchase based on their actions during the session. Model performance could be evaluated with metrics such as accuracy, precision, and F1 score.
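As a rough illustration of the evaluation metrics named above, the sketch below computes accuracy, precision, recall, and F1 from binary labels using only the Python standard library. The function name `classification_metrics` and the ten example labels are hypothetical and are not drawn from any dataset. Recall is included only because F1 is defined from precision and recall.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = purchase)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard each ratio against a zero denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for ten sessions: two purchases, eight non-purchases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A trivial baseline that always predicts "no purchase".
always_no = [0] * 10

metrics = classification_metrics(y_true, always_no)
print(metrics)
```

In this made-up example the always-"no purchase" baseline is right on most sessions yet never identifies a buyer, which is one reason to report precision and F1 alongside accuracy when purchases are the minority class.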

## Background and Prior Work

Online shopping has become a central part of everyday life, with billions of visits to e‑commerce sites generating trillions of dollars in global sales each year. As more people browse and buy online, each visit leaves a record of clicks, page views, and time spent on each page, which together make up behavioral data. Instead of only looking at final sales numbers, companies now rely on web analytics and machine learning to understand what people actually do on their sites. This shift from aggregate sales metrics to behavioral analysis makes it possible to model which types of browsing sessions are most likely to end in a purchase versus an abandoned cart. In this project, we will use a dataset to model which patterns of on-site browsing behavior are most predictive of whether a session results in a completed purchase or an abandoned cart.

Prior work related to this topic includes Gkikas and Theodoridis (2024),<a name="fnref1"></a>[<sup>1</sup>](#fn1) who apply machine learning to a real fashion e‑commerce site using Google Analytics data. They look at a small set of familiar metrics such as event count, sessions, purchase revenue, transactions, and bounce rate, then use models like decision trees, Naive Bayes, and k‑nearest neighbors to classify users into different engagement levels. In their results, decision trees stand out because they are both accurate and easy to interpret, which is useful for marketers who need simple rules rather than black‑box predictions.

Another relevant source is the Online Shoppers Purchasing Intention dataset, introduced by Sakar et al.,<a name="fnref2"></a>[<sup>2</sup>](#fn2) which summarizes 12,330 user sessions with features describing page categories, time spent, and analytics metrics such as bounce rate, exit rate, and page value, along with a binary Revenue label indicating whether the user made a purchase. In their paper, Sakar et al. frame this as a binary classification problem and experiment with multilayer perceptrons and LSTM recurrent neural networks to predict purchasing intention in real time.

It is also important to note that behavioral logs only show what users are doing on the surface. They do not directly reveal why users make certain choices. Research on online purchase intention points out that factors such as perceived risk, perceived reputation, and trust in the website also shape whether someone feels comfortable buying online, as shown in recent work by Ikhlash and Linda<a name="fnref3"></a>[<sup>3</sup>](#fn3) on how risk perception and online trust influence e‑commerce purchase intention. For example, a user may view many product pages and spend a long time on the site but still decide not to buy because they feel they cannot currently afford the purchase.

<a name="fn1">1.</a> Gkikas, D. C., & Theodoridis, P. K. (2024). Predicting online shopping behavior: Using machine learning and Google Analytics to classify user engagement. Applied Sciences, 14(23), 11403. https://doi.org/10.3390/app142311403 <a href="#fnref1">^</a>

<a name="fn2">2.</a> Sakar, C. O., Polat, S. O., Katircioglu, M., & Kastro, Y. (2019). Real‑time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications, 31(10), 6893–6908. https://link.springer.com/article/10.1007/s00521-018-3523-0 <a href="#fnref2">^</a>

<a name="fn3">3.</a> Ikhlash, M., & Linda, K. R. (2024). The effect of risk perception and online trust on purchase intention in e‑commerce. International Review of Management and Marketing, 14(6), 109–118. https://www.econjournals.com/index.php/irmm/article/download/17111/8280/39919 <a href="#fnref3">^</a>

## Hypothesis

Online shopping sessions with higher engagement (e.g., more product page views and longer time spent on pages) will have a higher probability of ending in a purchase than sessions with low engagement and higher bounce or exit rates. We also predict a strong positive association between returning-visitor status and purchase. These predictions are based on consumer behavior theory, which suggests that greater engagement reflects a higher intent to purchase.
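One simple way to look at this hypothesis descriptively is to compare purchase rates across groups of sessions. The sketch below is a minimal illustration under stated assumptions: the six sessions are invented, the field names (`product_views`, `returning`, `purchase`) are hypothetical, and the cutoff of 10 product views for "high engagement" is arbitrary. It is a descriptive comparison only, not a significance test.

```python
from collections import defaultdict


def purchase_rate_by_group(sessions, group_key):
    """Return {group value: share of sessions in that group with purchase == 1}."""
    totals = defaultdict(int)
    purchases = defaultdict(int)
    for s in sessions:
        totals[s[group_key]] += 1
        purchases[s[group_key]] += s["purchase"]
    return {group: purchases[group] / totals[group] for group in totals}


# Hypothetical sessions; none of these values come from a real dataset.
sessions = [
    {"product_views": 12, "returning": True, "purchase": 1},
    {"product_views": 3, "returning": False, "purchase": 0},
    {"product_views": 20, "returning": True, "purchase": 1},
    {"product_views": 1, "returning": False, "purchase": 0},
    {"product_views": 15, "returning": False, "purchase": 0},
    {"product_views": 2, "returning": True, "purchase": 0},
]

# Arbitrary cutoff (an assumption) for labelling a session as high engagement.
HIGH_ENGAGEMENT_CUTOFF = 10
for s in sessions:
    s["engagement"] = ("high" if s["product_views"] >= HIGH_ENGAGEMENT_CUTOFF
                       else "low")

by_engagement = purchase_rate_by_group(sessions, "engagement")
by_visitor = purchase_rate_by_group(sessions, "returning")
print(by_engagement)
print(by_visitor)
```

If the hypothesis holds on real data, the purchase rate for the high-engagement group and for returning visitors should be noticeably higher than for their counterparts.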

## Data

The ideal dataset would have the following requirements fulfilled:
1. Session and interaction data
2. Whether the product was purchased or not
3. User/demographic data

Some more specific variables could include total session duration, number of pages visited, types of pages visited, product page views, time of visit, and bounce and exit rates. The dataset should be sufficiently large, with at least 50,000 sessions. The users in the data should represent all demographics. The data would be collected using web analytics tools and would ideally be in a structured format such as a .csv file. Ideally, the dataset would include two tables, one recording individual interactions and another recording session-level data, linked by unique identifiers so that rows can be matched across both tables.
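To make the two-table idea concrete, the sketch below aggregates interaction-level rows into per-session features and attaches them to the session table through a shared identifier. Everything here is hypothetical: the column names (`session_id`, `page_type`, `seconds`, `purchase`), the rows, and the helper `build_session_features` are assumptions for illustration, written with the standard library only.

```python
from collections import defaultdict

# Hypothetical interaction-level table: one row per page view.
interactions = [
    {"session_id": "s1", "page_type": "product", "seconds": 40},
    {"session_id": "s1", "page_type": "product", "seconds": 25},
    {"session_id": "s1", "page_type": "cart", "seconds": 10},
    {"session_id": "s2", "page_type": "home", "seconds": 5},
]

# Hypothetical session-level table: one row per session.
sessions = [
    {"session_id": "s1", "purchase": 1},
    {"session_id": "s2", "purchase": 0},
    {"session_id": "s3", "purchase": 0},  # session with no recorded interactions
]


def build_session_features(sessions, interactions):
    """Aggregate interactions per session_id and merge them onto the session rows."""
    agg = defaultdict(lambda: {"pages_visited": 0, "product_views": 0,
                               "total_seconds": 0})
    for row in interactions:
        feats = agg[row["session_id"]]
        feats["pages_visited"] += 1
        feats["product_views"] += 1 if row["page_type"] == "product" else 0
        feats["total_seconds"] += row["seconds"]

    # Sessions without any interactions get zero-valued features.
    empty = {"pages_visited": 0, "product_views": 0, "total_seconds": 0}
    return [{**s, **agg.get(s["session_id"], empty)} for s in sessions]


features = build_session_features(sessions, interactions)
for row in features:
    print(row)
```

Keeping every session row, including sessions with no interactions, mirrors a left join on the identifier, so the purchase label is never dropped during feature construction.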

https://www.kaggle.com/datasets/imakash3011/online-shoppers-purchasing-intention-dataset

This dataset offers information on which types of pages a user visits, how long they spend on each page, how frequently they visit it, the number of times a user visits a product-related page and exits without any further interaction, how close the visit is to a holiday, and the number of times a specific page was exited on. All of these variables seem important to the project and are candidates for analysis. The types of pages a user visits could be important because they provide information on the browsing behavior of a customer, as do the duration and frequency of visits. Knowing which specific pages are usually exited on or rarely interacted with provides insight into which pages tend to turn customers away.

https://www.kaggle.com/datasets/paulsamuelwe/e-commerce-customer-behaviour-dataset/code

The second dataset uses unique identifiers to attach timestamps to customers. It records how long a user visits a site, which could be used to classify customer behavior. It also provides product reviews, which may indicate the quality of a product and how that affects browsing behavior. In addition, it includes annual income, which allows further classification of users.

## Ethics 

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

   > The data in this dataset comes from the UCI Machine Learning Repository, which hosts datasets donated by researchers and institutions. This means that the data collected is fully consensual.
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

   > Bias cannot be fully eliminated in any dataset, but key mitigation steps were taken: each session in the dataset belongs to a different user over a 1-year period, which avoids any tendency toward a specific campaign, special day, user profile, or period.
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

   > Although the dataset is completely anonymous, there is still a risk of behavioral fingerprinting, in which a user's unique shopping habits are combined across different sites to identify them.
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

   > While this dataset is public, we can protect and secure the data by listing the raw data in .gitignore so that it is not committed to our repo and cannot be downloaded from it.
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

   > We plan to delete our local copies of the raw data once the class is over.

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

   > We will ensure our model does not rely on variables that could act as proxies for protected or sensitive attributes, such as region or visitor type. We will check for potential biases in predictions. 
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

   > We recognize that the predictions from our model could be misused, such as pressuring users to complete purchases or unfairly favoring certain groups. We plan to document these risks and emphasize that our model is for research purposes only. 


## Team Expectations 

1. Communicate regularly through Discord to stay updated on progress and deadlines.
2. Hold meetings to review tasks, discuss challenges, and plan next steps.
3. Arrive on time for meetings, with no more than a 10-minute delay unless communicated in advance.
4. Project decisions will be made by a majority vote to ensure fairness and efficiency.
5. Address conflicts or concerns respectfully through early and open discussion.

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| W2: Tues, 1/13   |  4-5PM | Read COGS 108 team expectations and brainstorm topics/questions  | Discuss best form of communication, team expectations, assign roles, and research topics | 
| W4: Fri, 1/30  |  3-4PM |  Complete project review | Finalize research question | 
| W5: Wed, 2/4  | 5-6PM  | Come to meeting with research question idea and do background research on topics. | Finalize research question, and discuss hypothesis, ideal datasets, research background, and complete project proposal   |
| W6: Mon, 2/9 | 5-6PM  | Search and review potential datasets | Identify key features needed for analysis   |
| W7: Mon, 2/16  | 5-6PM  | Begin data cleaning and preprocessing | Assign specific EDA tasks (exploratory analysis, visualizations) |
| W8: Mon, 2/23  | 5-6PM  | Conduct EDA and visualizations. Review patterns in browsing behavior, bounce rates, and product engagement | Discuss preliminary insights and feature selection for modeling |
| W9: Mon, 3/2  | 5-6PM  | Perform hypothesis testing, build predictive model | Evaluate model performance and refine features |
| W10: Mon, 3/9  | 5-6PM  | Interpret final results, draft and finalize | Complete ethics discussion and final analysis summary. Turn in Final Project and Group Project surveys |