### Understanding the dataset

There are two types of data:

- Numerical: PageValues, ProductRelated_Duration, BounceRates
- Categorical: Month, VisitorType, Region, OperatingSystems

> This means that we must use **One-Hot encoding** or a similar scheme to make all data numerical.

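There is a long tail: most users spend little time on the site, while a few spend hours.

> **Standardisation** is needed so that features measured on large scales (such as durations) do not dominate the distance calculations, and so that the clusters are less likely to be skewed by the few users who spend a long time on the site. Standardisation rescales a feature but does not remove its skew, so a log transform before scaling may also be worth considering.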
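
A minimal sketch of the two preparation steps above, assuming scikit-learn and pandas are available. The rows are invented toy values; only the column names come from the dataset. `ColumnTransformer` applies `StandardScaler` to the numerical columns and `OneHotEncoder` to the categorical ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with invented values; only the column names match the dataset.
df = pd.DataFrame({
    "PageValues": [0.0, 12.5, 0.0, 40.2],
    "ProductRelated_Duration": [30.0, 640.0, 15.0, 7200.0],
    "BounceRates": [0.2, 0.01, 0.15, 0.0],
    "Month": ["Feb", "Nov", "Feb", "May"],
    "VisitorType": ["New_Visitor", "Returning_Visitor",
                    "Returning_Visitor", "New_Visitor"],
    "Region": [1, 3, 1, 2],
    "OperatingSystems": [2, 2, 1, 3],
})

numerical = ["PageValues", "ProductRelated_Duration", "BounceRates"]
categorical = ["Month", "VisitorType", "Region", "OperatingSystems"]

preprocess = ColumnTransformer(
    transformers=[
        # Zero mean, unit variance for each numerical column.
        ("num", StandardScaler(), numerical),
        # One 0/1 column per category value.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

X = preprocess.fit_transform(df)
print(X.shape)  # rows x (3 scaled columns + one column per category value)
```

Note that `Region` and `OperatingSystems` are treated as categories here even though they are stored as numbers, because their numeric order carries no meaning.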

### Convex vs concave

The Assumption (Convex): K-Means (for which the Elbow method is commonly used to choose the number of clusters) assumes clusters are spherical blobs (like a ball or an orange). It tries to find the center of each ball. This is "convex" geometry.
- The standard clustering algorithm.
- It is simple, fast, and easy to explain.

The Reality (Concave/Non-Convex): Real human behavior often forms weird shapes. Imagine a "donut" shape or a "banana" shape of data points.
- E.g. DBSCAN - Density-Based Spatial Clustering of Applications with Noise
- It does not assume clusters are spheres. It groups points that are packed closely together. If your shoppers form a "banana" shape, DBSCAN will find the banana. It also handles outliers (noise) well.

If your data is shaped like a banana (concave), K-Means will fail. It tries to force a circle around the banana, often cutting it in half and putting the two halves in different clusters. This is mathematically "correct" for the algorithm, but logically wrong for your analysis.
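
To see the difference on a small example, here is a hedged sketch using scikit-learn's synthetic two-moons data (two interleaved "banana" shapes). It is not the shopper dataset, and `eps=0.2` and `min_samples=5` are illustrative values that would need tuning on real data. The adjusted Rand index measures agreement with the true moon membership.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a standard non-convex toy dataset.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-Means assigns each point to the nearest of two centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups densely packed points; the label -1 marks noise.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the true moons (1.0 means a perfect match).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On real data there are no true labels to compare against, so this kind of check only works on synthetic examples.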

> Mention both methods as options that we will compare. Use K-Means as the baseline, since it is the standard clustering method, and acknowledge that shoppers might form irregular (concave) groups, which we will check for.

The Winning Narrative:
- "We applied K-Means (Baseline)."
- "We observed that K-Means failed to capture the complex structure (evaluated via Silhouette score/Visuals)."
- "Therefore, we advanced to DBSCAN, which successfully captured the non-convex 'concave' clusters."
- "This improved our prediction accuracy by X%."

### Alternative text about clustering

"We will primarily apply K-Means clustering to identify behavioral segments. To prepare the mixed dataset (numerical and categorical), we will apply One-Hot Encoding and standardize the skewed numerical features (e.g., duration metrics).

Addressing Cluster Geometry: We acknowledge that K-Means assumes convex (spherical) clusters, which may not align with complex user behaviors (potentially concave or irregular distributions). Therefore, we will not rely solely on the Elbow Method. We will also assess cluster quality using the Silhouette Coefficient and visualize the data using PCA or t-SNE to inspect separation. If K-Means fails to identify distinct groups, we will explore density-based algorithms like DBSCAN, which can handle non-convex clusters and noise effectively."

> NOTE: We do not know much about these yet, so research is needed before adding "Silhouette Coefficient and visualize the data using PCA or t-SNE".
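
As a starting point for that research, here is a minimal sketch of both ideas on synthetic data (three well-separated blobs whose centers are made up for illustration). The Silhouette Coefficient ranges from -1 to 1, with higher values meaning tighter, better-separated clusters. PCA here only projects the data to two dimensions so it can be plotted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Three synthetic blobs in 4 dimensions (centers chosen for illustration).
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 0.0, 0.0],
    [-10.0, 10.0, 0.0, 0.0],
])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0,
                  random_state=0)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)

# 2-D projection for a scatter plot colored by cluster label.
projected = PCA(n_components=2).fit_transform(X)
print(projected.shape)
```

The `projected` array can be passed to a scatter plot (for example with matplotlib) to inspect visually whether the clusters separate.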

### Other things that might need to be mentioned

#### The "Multicollinearity" Trap (Double Counting)

Your dataset contains features that are almost identical.

The Culprits: BounceRates and ExitRates.

Bounce Rate: You enter the site and leave immediately without triggering any other request.

Exit Rate: You leave the site from this page (regardless of how many pages you visited before).

The Problem: These two numbers are highly correlated (if Bounce goes up, Exit usually goes up).


Why it breaks models: You list Logistic Regression as a model. When two input variables are highly correlated (multicollinearity), the coefficient estimates of a Logistic Regression become unstable and hard to interpret.

> Add a line in 1. Data Preparation: "We will check for multicollinearity (using a Correlation Matrix or VIF) and remove redundant features, specifically examining the high correlation between Bounce Rates and Exit Rates."

> NOTE: Lecture 6 (GitLab) covered correlation matrices etc.
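
The multicollinearity check can be sketched with pandas and NumPy alone. The data below is synthetic (an `ExitRates` column built to track `BounceRates` closely), so the numbers say nothing about the real dataset. The VIF is computed from the diagonal of the inverse correlation matrix, which equals `1 / (1 - R²)` for each feature regressed on the others.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: ExitRates is BounceRates plus a little noise.
rng = np.random.default_rng(0)
n = 1000
bounce = rng.uniform(0.0, 0.2, n)
df = pd.DataFrame({
    "BounceRates": bounce,
    "ExitRates": bounce + rng.normal(0.0, 0.01, n),
    "PageValues": rng.exponential(5.0, n),
})

# Pairwise Pearson correlations between the features.
corr = df.corr()
print(corr.round(2))

# VIF per feature: diagonal of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=corr.columns)
print(vif.round(1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cut-off is a judgment call.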


#### The "Data Leakage" Trap (The Cluster-Predict Paradox)

This is one of the most common errors in "Cluster-then-Predict" projects.

The Scenario:
- You take your entire dataset (12,000 rows).
- You run Clustering on all of it.
- You assign a cluster label (e.g., "Impulsive Shopper") to every row.
- Then you split the data into Training (80%) and Test (20%) sets for your Supervised prediction.

The Error: You just cheated! Your "Test Set" rows were used to calculate the cluster centers. Information from the test set has "leaked" into the training process. Your results will look artificially good, but the model will fail in real life.

> "To prevent data leakage, the Unsupervised Clustering will be fit only on the training set. The resulting cluster centroids will then be used to assign labels to the test set instances."

NOTE: In the context of K-Means clustering, a **Centroid** is simply the geometric center of a cluster. Think of it as the "average behavior" of all the people in that specific group.
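
A minimal sketch of the leakage-safe order of operations, on synthetic blobs rather than the real dataset: split first, fit the scaler and K-Means on the training rows only, then assign the test rows to the nearest learned centroid with `predict`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# 1. Split BEFORE any fitting, so the test rows stay unseen.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# 2. Fit the scaler and the clustering on the training rows only.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(scaler.transform(X_train))

# 3. Training labels come from the fit; test labels come from the
#    nearest centroid learned on the training rows.
train_clusters = kmeans.labels_
test_clusters = kmeans.predict(scaler.transform(X_test))
print(kmeans.cluster_centers_.shape, len(train_clusters), len(test_clusters))
```

The same rule applies to the scaler: fitting it on all rows would leak the test set's mean and variance into training.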

### DBSCAN does not have `.predict()`

Can we use k-NN (k-Nearest Neighbors) with DBSCAN? The two are easy to confuse, because they play different roles:
- k-NN is Supervised: It needs labels (e.g., "Buyer" vs "Non-Buyer") to work. You use it to predict new data.
- Clustering is Unsupervised: It creates the labels.

If you want to keep the "irregular shapes" argument (DBSCAN) but make it work for your exam, where clusters must also be assigned to unseen test data, you can use k-NN as a helper:

1. Cluster the training data using DBSCAN (Unsupervised).
2. Train a k-NN model using those new cluster labels.
3. Predict the cluster for the test set using the k-NN model.
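
The three steps above can be sketched as follows, again on the synthetic two-moons data. Dropping DBSCAN's noise points (label `-1`) before training the k-NN model is an assumption of this sketch, not a requirement; keeping them as their own class is another option. The values of `eps`, `min_samples`, and `n_neighbors` are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic non-convex data, split before any fitting.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# 1. Cluster the training data with DBSCAN (unsupervised).
train_clusters = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_train)

# 2. Train k-NN on the cluster labels, leaving out noise points (-1).
clustered = train_clusters != -1
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train[clustered], train_clusters[clustered])

# 3. Predict a cluster for each test row.
test_clusters = knn.predict(X_test)
print(sorted(set(test_clusters)))
```

One consequence of dropping the noise points is that every test row is forced into some cluster, even if DBSCAN would have treated it as an outlier.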

