# STAT 301 Project — Online Shoppers Purchasing Intention

**Name:** Ivy Yan   
**Group:**  group-25   
**Student number:** 39512314 

---


## Section 1: Data Description


### 1. Descriptive Summary

#### Data description

This dataset contains information about user behavior during online shopping. Each row represents one online shopping session, and the variables describe user behavior, technical attributes, and timing information.  
The dataset includes both quantitative and categorical variables, such as the number and duration of pages visited, bounce and exit rates, month of visit, type of operating system, and whether the session occurred on a weekend.  

- **Number of observations:** 12330  
- **Number of variables:** 18

#### Variables description

| Variable Name | Type | Description |
|-----------------------------|------------------|-------------------------------------------------------------|
| Administrative | Numeric (Integer) | Number of administrative pages visited by the user. |
| Administrative_Duration | Numeric (Continuous) | Total time spent on administrative pages (in seconds). |
| Informational | Numeric (Integer) | Number of informational pages visited by the user. |
| Informational_Duration | Numeric (Continuous) | Total time spent on informational pages (in seconds). |
| ProductRelated | Numeric (Integer) | Number of product-related pages visited by the user. |
| ProductRelated_Duration | Numeric (Continuous) | Total time spent on product-related pages (in seconds). |
| BounceRates | Numeric (Continuous) | The bounce rate of a page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. |
| ExitRates | Numeric (Continuous) | The exit rate of a page is the percentage of all views of that page that were the last in the session. |
| PageValues | Numeric (Continuous) | Average value for a web page that a user visited before completing an e-commerce transaction. |
| SpecialDay | Numeric (Continuous) | Closeness of the site visit date to a special day (e.g., Mother’s Day, Christmas). Ranges from 0 to 1. |
| Month | Categorical | Month of the visit (Feb–Dec). |
| OperatingSystems | Categorical (integer code) | Type of operating system used by the visitor (e.g., Windows, Mac). |
| Browser | Categorical (integer code) | Type of browser used (e.g., Chrome, Firefox, Safari). |
| Region | Categorical (integer code) | Geographic region of the visitor. |
| TrafficType | Categorical (integer code) | Type of traffic source (e.g., direct, referral, search). |
| VisitorType | Categorical | Indicates if the visitor is a Returning Visitor or New Visitor. |
| Weekend | Binary (TRUE/FALSE) | Indicates whether the visit occurred on a weekend. |
| Revenue | Binary (TRUE/FALSE) | Indicates whether the session ended with a purchase (TRUE = purchase, FALSE = no purchase). |

---



### 2. Source and Information 
This dataset is the **Online Shoppers Purchasing Intention** dataset published by the UCI Machine Learning Repository.
It contains session-level features for e-commerce website visits and a binary outcome `Revenue` indicating whether a purchase occurred. 



### 3. Pre-selection of Variables 
I will first drop `Administrative_Duration`, `Informational_Duration`, and `ProductRelated_Duration` because they are largely **redundant**: each is expected to be strongly associated with its page-count counterpart (`Administrative`, `Informational`, `ProductRelated`), and keeping both could cause multicollinearity. `OperatingSystems`, `Browser`, and `Region` will also be dropped because their integer codes do not map to clearly defined categories.

The remaining variables will be kept for further analysis.
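
The redundancy argument above can be checked before any column is dropped. The sketch below is illustrative only: `toy` is a made-up data frame standing in for the real data, and the helper names (`count_cols`, `duration_cols`, `drop_cols`) are hypothetical. It computes the Spearman correlation between each page count and its duration column, then removes the columns marked for deletion.

```r
# Toy data standing in for a few sessions (values are made up for illustration).
toy <- data.frame(
  Administrative          = c(0, 1, 3, 5, 8, 2),
  Administrative_Duration = c(0, 20, 75, 140, 260, 48),
  ProductRelated          = c(1, 10, 22, 35, 60, 14),
  ProductRelated_Duration = c(0, 310, 800, 1500, 2400, 520),
  Region                  = c(9L, 2L, 3L, 2L, 3L, 4L)
)

# Pair each page-count column with its matching duration column.
count_cols    <- c("Administrative", "ProductRelated")
duration_cols <- paste0(count_cols, "_Duration")

# Spearman correlation for each (count, duration) pair.
pair_cor <- mapply(
  function(cnt, dur) cor(toy[[cnt]], toy[[dur]], method = "spearman"),
  count_cols, duration_cols
)
print(pair_cor)

# Drop the duration columns and the integer-coded categorical column.
drop_cols    <- c(duration_cols, "Region")
toy_selected <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_selected))
```

On the real data, `toy` would be replaced by the data frame read from `online_shoppers_intention.csv`, with `Informational`, `OperatingSystems`, and `Browser` added to the corresponding vectors.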


## Section 2: Scientific Question


### 1. Scientific Question 
**Question:** _Is session behavior (measured by `BounceRates`, `ExitRates`, and the number of `Administrative` pages visited) associated with the purchase outcome (`Revenue`)?_

### 2. Name the Response 
**Response:** `Revenue` (binary: 1 if the session resulted in a purchase, 0 otherwise).

### 3. Aim: Prediction, Inference, or Both? 
The aim is **inference**: we want to understand whether session behavior (`BounceRates`, `ExitRates`, and `Administrative` activity) is associated with the purchase outcome (`Revenue`), rather than to predict `Revenue` for new sessions.
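
One way to write this aim down, assuming the three predictors enter linearly on the log-odds scale (the same form as the `glm` call in Section 3), is the logistic regression model below. The symbols $p_i$ and $\beta_j$ are notation introduced here for illustration.

$$
\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1\,\mathrm{BounceRates}_i + \beta_2\,\mathrm{ExitRates}_i + \beta_3\,\mathrm{Administrative}_i,
\qquad p_i = P(\mathrm{Revenue}_i = 1)
$$

Under this sketch, inference means estimating each $\beta_j$ with a confidence interval and testing $H_0\colon \beta_j = 0$. The quantity $e^{\beta_j}$ is the odds ratio for a one-unit increase in predictor $j$ with the other predictors held fixed.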


## Section 3: Exploratory Data Analysis and Visualization (EDA)

In [2]:
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)


In [30]:
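# Load the raw session-level data.
online_shopping_raw <- read.csv("online_shoppers_intention.csv")

# Keep sessions whose Region is missing or not equal to 1 (Region 1 is excluded).
online_shopping <- online_shopping_raw %>%
  filter(is.na(Region) | Region != 1L)

# Preview the first rows and count the sessions that remain.
head(online_shopping)
nrow(online_shopping)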

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<lgl>` | `<lgl>` |
| 1 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.2 | 0.2 | 0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | FALSE | FALSE |
| 2 | 0 | 0 | 0 | 0 | 2 | 2.666667 | 0.05 | 0.14 | 0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | FALSE | FALSE |
| 3 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.2 | 0.2 | 0 | 0.4 | Feb | 2 | 4 | 3 | 3 | Returning_Visitor | FALSE | FALSE |
| 4 | 0 | 0 | 0 | 0 | 2 | 37.0 | 0.0 | 0.1 | 0 | 0.8 | Feb | 2 | 2 | 2 | 3 | Returning_Visitor | FALSE | FALSE |
| 5 | 0 | 0 | 0 | 0 | 3 | 395.0 | 0.0 | 0.06666667 | 0 | 0.0 | Feb | 1 | 1 | 3 | 3 | Returning_Visitor | FALSE | FALSE |
| 6 | 0 | 0 | 0 | 0 | 16 | 407.75 | 0.01875 | 0.02583333 | 0 | 0.4 | Feb | 1 | 1 | 4 | 3 | Returning_Visitor | FALSE | FALSE |


In [34]:
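# Logistic regression of the purchase outcome on the three session-behavior predictors.
online_shopping_glm <- glm(Revenue ~ BounceRates + ExitRates + Administrative, data = online_shopping, family = 'binomial')

# Coefficients on the log-odds scale with 95% confidence intervals.
online_shopping_model <- tidy(online_shopping_glm, conf.int = TRUE, conf.level = 0.95)
online_shopping_model

# Same fit with estimates and intervals exponentiated to odds ratios.
online_shopping_model_odds <- tidy(online_shopping_glm, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE)
online_shopping_model_odds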


| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -0.91785745 | 0.067646399 | -13.5684598 | 6.160289e-42 | -1.05070557 | -0.7855056 |
| BounceRates | -0.55884576 | 4.198105111 | -0.1331186 | 0.8940996 | -8.92803854 | 7.52559072 |
| ExitRates | -32.87538298 | 2.722065685 | -12.0773658 | 1.391051e-33 | -38.31108169 | -27.64158881 |
| Administrative | 0.03808407 | 0.009040226 | 4.2127345 | 2.522976e-05 | 0.02023437 | 0.05568946 |


| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 0.3993738 | 0.067646399 | -13.5684598 | 6.160289e-42 | 0.3496909 | 0.4558891 |
| BounceRates | 0.5718688 | 4.198105111 | -0.1331186 | 0.8940996 | 0.0001326179 | 1854.909 |
| ExitRates | 5.277188e-15 | 2.722065685 | -12.0773658 | 1.391051e-33 | 2.299898e-17 | 9.894879e-13 |
| Administrative | 1.038819 | 0.009040226 | 4.2127345 | 2.522976e-05 | 1.02044 | 1.057269 |
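
The second table is the first one exponentiated, so each `estimate` is an odds ratio for a one-unit increase in that predictor. A one-unit change is hard to read for `BounceRates` and `ExitRates` if they are recorded as proportions (an assumption based on the data preview, where no value exceeds 0.2). The sketch below re-expresses those two effects per 0.01 increase. The estimates are copied from the log-odds table, and the names `log_odds`, `odds_ratio_unit`, and `odds_ratio_001` are hypothetical.

```r
# Log-odds estimates copied from the first coefficient table.
log_odds <- c(
  BounceRates    = -0.55884576,
  ExitRates      = -32.87538298,
  Administrative =  0.03808407
)

# Odds ratio for a one-unit increase in each predictor (matches the second table).
odds_ratio_unit <- exp(log_odds)

# Odds ratio for a 0.01 increase in the two rate variables (assumed to be proportions).
odds_ratio_001 <- exp(0.01 * log_odds[c("BounceRates", "ExitRates")])

print(signif(odds_ratio_unit, 4))
print(round(odds_ratio_001, 3))
```

Read this way, a 0.01 increase in `ExitRates` multiplies the estimated odds of a purchase by roughly 0.72, holding the other predictors fixed.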



### Interpretations (2–3 sentences for each point)
- **Why this plot?** It addresses the scientific question by relating the binary response (`Revenue`, shown as conversion rate) to at least **three variables** simultaneously: `Month` (x‑axis), `VisitorType` (line grouping), and `Weekend` (line grouping). This helps reveal temporal seasonality and behavioral differences across visitor types and weekend vs. weekday sessions.
- **Brief results:** _Edit after running the plot_. Note any visible gaps between returning and new visitors, months with notably high/low conversion, and whether weekend behavior differs from weekdays.
- **What we learn / potential issues:** _Edit after inspection_. Consider whether some groups have small sample sizes (check `n` in the summary table), whether seasonality or special events (e.g., proximity to holidays) may confound results, and which variables look most promising for modeling.
