# Classification using a Support Vector Classifier (SVC)

| Key              | Value                                                                                                                                                                                                                                                                                                |
|:-----------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Course Codes** | BBT 4106, BCM 3104, and BFS 4102                                                                                                                                                                                                                                                                     |
| **Course Names** | BBT 4106: Business Intelligence I (Week 10-12 of 13),<br/>BCM 3104: Business Intelligence and Data Analytics (Week 10-12 of 13) and<br/>BFS 4102: Advanced Business Data Analytics (Week 4-6 of 13)                                                                                                  |
| **Semester**     | January to April 2026                                                                                                                                                                                                                                                                                |
| **Lecturer**     | Allan Omondi                                                                                                                                                                                                                                                                                         |
| **Contact**      | aomondi@strathmore.edu                                                                                                                                                                                                                                                                               |
| **Note**         | The lecture contains both theory and practice.<br/>This notebook forms part of the practice.<br/>It is intended for educational purpose only.<br/>Recommended citation: [BibTex](https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/RecommendedCitation.bib) |

**Business context**: A business has a strategic objective to *increase the number of purchases made by customers by 20% by the end of the current financial year*. The lagging KPI in the financial perspective of the business' performance is the number of purchases whereas its leading KPI is the number of visits to the eCommerce website. The business would like to predict whether a customer will make a purchase so that the marketing and sales teams can intervene early and increase the number of purchases.

**Dataset**: The dataset used in this notebook is based on the **"Online Shoppers Purchasing Intention"** dataset. It contains 12,330 observations where each observation represents a user session on an eCommerce website. The dataset includes various features such as the number of pages viewed, time spent on the website, whether the user made a purchase (the target variable), etc. as described [here](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset) and listed below.

| **Feature Type** | **Feature Name**          | **Description**                                                                                      |
|:-----------------|:--------------------------|:-----------------------------------------------------------------------------------------------------|
| **Feature**      | `Administrative`          | Number of administrative pages viewed by the user                                                    |
| **Feature**      | `Administrative_Duration` | Total time spent on administrative pages in seconds                                                  |
| **Feature**      | `Informational`           | Number of informational pages viewed by the user                                                     |
| **Feature**      | `Informational_Duration`  | Total time spent on informational pages in seconds                                                   |
| **Feature**      | `ProductRelated`          | Number of product-related pages viewed by the user                                                   |
| **Feature**      | `ProductRelated_Duration` | Total time spent on product-related pages in seconds                                                 |
| **Feature**      | `BounceRates`             | Bounce rate of the session (percentage of single-page visits)                                        |
| **Feature**      | `ExitRates`               | Exit rate of the session (percentage of exits from the site)                                         |
| **Feature**      | `PageValues`              | Average value of the pages viewed in the session (monetary value)                                    |
| **Feature**      | `SpecialDay`              | Closeness of the session date to a special day (e.g., a holiday), on a scale from 0 to 1             |
| **Feature**      | `Month`                   | Month of the session (text such as `Nov`; label-encoded in Step 4)                                   |
| **Feature**      | `OperatingSystems`        | Operating system used by the user (encoded as a numeric value)                                       |
| **Feature**      | `Browser`                 | Browser used by the user (encoded as a numeric value)                                                |
| **Feature**      | `Region`                  | Region of the user (encoded as a numeric value)                                                      |
| **Feature**      | `TrafficType`             | Type of traffic that brought the user to the site (encoded as a numeric value)                       |
| **Feature**      | `VisitorType`             | Type of visitor (e.g., `Returning_Visitor`, `New_Visitor`); text, label-encoded in Step 4            |
| **Feature**      | `Weekend`                 | Indicates if the session occurred on a weekend (boolean; label-encoded in Step 4 as 0 or 1)          |
| **Target**       | `Revenue`                 | Indicates if the user made a purchase (boolean; mapped in Step 4 to 1 for True, 0 for False)         |

**Remote Environments:**

Do your best to set up your local environment as guided during the lab. If you have challenges setting it up, you can temporarily use the following remote environments for the lab:<br/>

[![Colab](https://img.shields.io/badge/Open-Colab-orange?logo=googlecolab)](
https://colab.research.google.com/github/course-files/RegressionAndClassification/blob/main/4_svm.ipynb) (preferred option)

[![Codespaces](https://img.shields.io/badge/Open-Codespaces-blue?logo=github)](
https://github.com/codespaces/new/course-files/RegressionAndClassification) (alternative)

## Step 1: Import the necessary libraries

**Purpose**: This chunk imports all the necessary libraries for data analysis, machine learning, and visualization.

1. **For file and system operations - [urllib.request](https://docs.python.org/3/library/urllib.request.html), [os](https://docs.python.org/3/library/os.html), and [sys](https://docs.python.org/3/library/sys.html)**
    - `urllib.request` (part of the Python standard library, not the third-party `urllib3` package) is used for opening and downloading data from URLs.
    - `os` provides functions for interacting with the operating system, such as file and directory management.
    - `sys` gives access to Python's system-specific parameters and functions, such as command-line arguments and the interpreter environment.
    - Here, `sys` is used to report which interpreter is running (`sys.executable`) and to check whether the code is running in Google Colab, which determines which requirements file is installed.

2. **For data manipulation - [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html) and [numpy](https://numpy.org/doc/stable/):**
    - `pandas as pd`: For loading the dataset, creating and managing DataFrames, data manipulation and analysis using DataFrames
    - `numpy as np`: For numerical operations and array manipulations

3. **For statistical data analysis - [scipy.stats](https://docs.scipy.org/doc/scipy/tutorial/stats.html)**
    - `kurtosis`: Measures the "tailedness" of data distribution
    - `skew`: Measures the asymmetry of data distribution

4. **For data preprocessing and transformation - [scikit-learn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)**
    - `LabelEncoder`: Converts categorical text labels (e.g., cat, dog, mouse) into integer codes (e.g., 0, 1, 2). It is used here to prepare categorical columns for machine learning algorithms that require numeric inputs.
    - `resample`: A function from scikit-learn’s utils module, used here to upsample the minority class by sampling with replacement.
    - `StandardScaler`: Standardizes each feature to a mean of 0 and a variance of 1.

5. **For Machine Learning - [scikit-learn](https://scikit-learn.org/stable/supervised_learning.html)**
    - `SVC`: A class from scikit-learn’s svm module that implements the Support Vector Classifier used in this notebook, with kernels such as `linear` and `rbf`.
    - `train_test_split`: A function from scikit-learn’s model_selection module that splits the dataset into training and testing sets.
    - `GridSearchCV`: A class from scikit-learn’s model_selection module for hyperparameter tuning using cross-validation.
    - `accuracy_score`: A function from scikit-learn’s metrics module that computes the overall fraction of correct predictions.
    - `classification_report`: A function from scikit-learn’s metrics module used to evaluate the performance of the classifier. It gives detailed metrics such as precision, recall, f1-score, and support for each class.
    - `confusion_matrix`: A function from scikit-learn’s metrics module that counts the correct and incorrect predictions for each class.

6. **For data visualization - [matplotlib](https://matplotlib.org/stable/gallery/index.html) and [seaborn](https://seaborn.pydata.org/examples/index.html)**
    - `matplotlib.pyplot as plt`: For basic plotting functionality
    - `seaborn as sns`: For enhanced statistical visualizations

7. **For model persistence - [joblib](https://joblib.readthedocs.io/en/stable/)**
    - `joblib` is used for saving and loading Python objects, such as machine learning models, to and from disk.

8. **For suppressing warnings - [warnings](https://docs.python.org/3/library/warnings.html)**
    - `warnings`: Controls warning messages
    - `warnings.filterwarnings('ignore')`: Suppresses warning messages for cleaner output
    - Used to suppress warnings that may arise during the execution of the code. Even though it is not necessary for the code to run, it helps in keeping the output clean and focused on the results.

Confirm the following:
1. Which Python interpreter will be used to execute new code and where it is located
2. The Python version

Then install all the packages into the Jupyter notebook's virtual environment before importing them.

In [1]:
import sys
sys.executable

'c:\\Users\\aomondi\\Documents\\GitHub\\Teaching\\RegressionAndClassification\\.venv\\Scripts\\python.exe'

In [None]:
!python --version

Python 3.14.2


In [None]:
if "google.colab" in sys.modules:
    print("Installing in Google Colab")
    %pip install -r https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/requirements/colab.txt
else:
    print("Installing in dev environment")
    %pip install -r https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/requirements/dev.txt -c https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/requirements/constraints.txt

Installing in dev environment
Note: you may need to restart the kernel to use updated packages.


In [31]:
# For file and system operations
import urllib.request
import os

# For data manipulation
import pandas as pd
import numpy as np

# For data preprocessing and transformation
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

# For Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# For model persistence
import joblib

# For suppressing warnings
import warnings
warnings.filterwarnings('ignore')

## Step 2: Load the data

**Purpose**: This chunk loads the dataset by checking if the dataset exists locally; if not, then it downloads it from the specified URL and saves it locally before loading it into a Pandas DataFrame.

- **Data Loading Parameters**
    - Uses `pd.read_csv()` with specific parameters:
        - `usecols`: Loads only the columns specified in `use_cols` for memory efficiency
        - `encoding='utf-8'`: Handles special characters in the dataset. UTF-8 can represent every Unicode character, including ñ, €, and ®. If the file was saved in a different encoding, alternatives include:
            - `encoding='utf-16'`: Covers all of Unicode using 2 or 4 bytes per character.
            - `encoding='utf-32'`: Covers all of Unicode using a fixed 4 bytes per character.
            - `encoding='latin-1'`: Covers Western European characters such as ñ and ß (but not €). It maps every possible byte to a character, so it never raises a decode error, although text saved in another encoding will be read as the wrong characters.
            - `encoding='big5'`: Traditional Chinese encoding used in Taiwan and Hong Kong.
            - `encoding='shift_jis'`: Japanese character encoding used on Windows.
        - You can try different encodings if you encounter a `UnicodeDecodeError` while reading a file. This is useful in cases where the business has branches across different countries and the dataset contains characters from multiple languages.
        - `nrows=200000`: Limits the number of rows loaded to 200,000. This can be reduced or increased based on the available memory and the size of the dataset.
    - The data is then stored in a `Pandas` DataFrame for further analysis
    - This selective loading approach helps manage memory usage and focuses the analysis on the relevant features for the design of the model.
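
The advice above about trying another encoding after a `UnicodeDecodeError` can be sketched as a small helper. This is a minimal sketch and not part of the lab code: the function name `read_csv_with_fallback`, the list of encodings, and the temporary demo file are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1"), **kwargs):
    """Try each encoding in turn and return (DataFrame, encoding_used)."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding, **kwargs), encoding
        except UnicodeDecodeError as error:
            last_error = error  # Remember the failure and try the next encoding
    raise last_error


# Hypothetical demo file saved as latin-1, which is not valid UTF-8
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_latin1.csv")
with open(demo_path, "w", encoding="latin-1") as f:
    f.write("city,visits\nMálaga,10\nKöln,7\n")

demo_df, used_encoding = read_csv_with_fallback(demo_path)
print("Encoding that worked:", used_encoding)
print(demo_df)
```

Because `latin-1` accepts any byte sequence, it is placed last in the list; putting it first would hide the fact that a file was saved in some other encoding.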

In [None]:
dataset_path = './data/online_shoppers_intention.csv'
url = 'https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/data/online_shoppers_intention.csv'

if not os.path.exists(dataset_path):
    print("Downloading dataset...")
    if not os.path.exists('./data'):
        os.makedirs('./data')
    urllib.request.urlretrieve(url, dataset_path)
    print("✅ Dataset downloaded")
else:
    print("✅ Dataset already exists locally")

use_cols = ['Administrative', 'Administrative_Duration', 'Informational',
            'Informational_Duration', 'ProductRelated',
            'ProductRelated_Duration', 'BounceRates', 'ExitRates',
            'PageValues', 'SpecialDay', 'Month', 'OperatingSystems',
            'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend',
            'Revenue']
online_shoppers_intention_data = pd.read_csv(dataset_path, usecols=use_cols, encoding='utf-8', nrows=200000)

### Identify the numeric and categorical columns

**Selection of numeric columns**
- The code identifies columns with numeric data types (`int64` and `float64`) that can be subjected to mathematical or statistical functions.
- The code also identifies non-numeric columns (e.g., text columns and boolean columns) by excluding the numeric (`int64`, `float64`) and `datetime64[ns]` data types.
- This is done using the `select_dtypes()` method of the DataFrame, which filters columns based on their data types.
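
The following toy example illustrates how the two `select_dtypes()` calls behave. The DataFrame `toy` is hypothetical; it only borrows four column names from the dataset. Note that a boolean column is not `int64` or `float64`, so it lands in the categorical group.

```python
import pandas as pd

# Hypothetical miniature of the dataset with one column per data type
toy = pd.DataFrame({
    "ProductRelated": [1, 20, 5],        # integers
    "BounceRates": [0.2, 0.0, 0.05],     # floats
    "Month": ["Feb", "Nov", "Jul"],      # text
    "Weekend": [False, True, False],     # booleans
})

toy_numeric = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
toy_categorical = toy.select_dtypes(
    exclude=["int64", "float64", "datetime64[ns]"]).columns.tolist()

print("Numeric:", toy_numeric)
print("Categorical:", toy_categorical)
```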

In [None]:
numeric_cols = online_shoppers_intention_data.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = online_shoppers_intention_data.select_dtypes(exclude=['int64', 'float64', 'datetime64[ns]']).columns

print("\nThe identified numeric columns are:")
print(numeric_cols.tolist())

print("\nThe identified categorical columns are:")
print(categorical_cols.tolist())

## Step 3: Initial Exploratory Data Analysis (EDA)

In [None]:
print("\n*1* The number of observations and variables")
display(online_shoppers_intention_data.shape)

print("\n*2* The data types:")
online_shoppers_intention_data.info()  # info() prints its report directly and returns None

print("\n*3* The summary of the numeric columns:")
display(online_shoppers_intention_data.describe())

print("\n*4* The whole dataset:")
display(online_shoppers_intention_data)

print("\n*5* The first 5 rows in the dataset:")
display(online_shoppers_intention_data.head())

print("\n*6* Percentage distribution for each category")
print("\nNumber of observations per class:")
print("Frequency counts:\n", online_shoppers_intention_data['Revenue'].value_counts())
print("\nPercentages:\n", online_shoppers_intention_data['Revenue'].value_counts(normalize=True) * 100, "%")

## Step 4: Data preprocessing and transformation

### Resample with replacement to balance the dataset

- **Purpose**: The purpose of this step is to balance the dataset by upsampling the minority class (where `Revenue` is `True`, i.e., 1) to match the number of observations in the majority class (where `Revenue` is `False`, i.e., 0). This helps to mitigate the class imbalance problem, which can lead to biased model predictions.
- **Resampling**: The `resample` function from `sklearn.utils` is used to create a balanced dataset by:
    - Separating the majority and minority classes based on the `Revenue` column.
    - Upsampling the minority class by sampling with replacement to match the number of observations in the majority class, using these parameters:
        - `replace=True`: This allows sampling with replacement, meaning the same observation can be selected multiple times.
        - `n_samples=len(df_majority)`: This ensures that the number of samples in the upsampled minority class matches the number of samples in the majority class.
        - `random_state=53`: This sets a random seed for reproducibility, ensuring that the same random samples are selected each time the code is run.
- **Combining**: The upsampled minority class is combined with the majority class using `pd.concat()`, resulting in a balanced dataset.
- **Final Dataset**: The balanced dataset is stored in `df_balanced` and then copied back into `online_shoppers_intention_data`, so that the later steps work on the balanced data.
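
One point worth keeping in mind: when upsampling is done before the train-test split, copies of the same minority observation can end up in both the training and the test set, which can make the test scores look better than they would be on truly unseen data. A common alternative is to split first and upsample only the training portion. The sketch below illustrates that alternative on hypothetical toy data; it is not the approach used in this notebook, and names such as `toy` and `toy_train_balanced` are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 90 non-purchases and 10 purchases
toy = pd.DataFrame({
    "PageValues": range(100),
    "Revenue": [False] * 90 + [True] * 10,
})

# 1. Split first so that the test set keeps the original class distribution
toy_train, toy_test = train_test_split(
    toy, test_size=0.30, stratify=toy["Revenue"], random_state=53)

# 2. Upsample the minority class inside the training split only
train_majority = toy_train[~toy_train["Revenue"]]
train_minority = toy_train[toy_train["Revenue"]]
train_minority_upsampled = resample(
    train_minority,
    replace=True,
    n_samples=len(train_majority),
    random_state=53)
toy_train_balanced = pd.concat([train_majority, train_minority_upsampled])

print(toy_train_balanced["Revenue"].value_counts())

# No original row appears in both the balanced training set and the test set
shared_rows = set(toy_train_balanced.index) & set(toy_test.index)
print("Rows shared between train and test:", len(shared_rows))
```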

In [None]:
# Separate majority and minority classes
df_majority = online_shoppers_intention_data[online_shoppers_intention_data['Revenue']==0]
df_minority = online_shoppers_intention_data[online_shoppers_intention_data['Revenue']==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                               replace=True,     # Sample with replacement
                               n_samples=len(df_majority),    # To match the majority class
                               random_state=53)  # To ensure the results are reproducible

# Combine majority class with upsampled minority class
df_balanced = pd.concat([df_majority, df_minority_upsampled])

online_shoppers_intention_data = df_balanced.copy()

In [None]:
print("\nNumber of observations per class:")
print("Frequency counts:\n", online_shoppers_intention_data['Revenue'].value_counts())
print("\nPercentages:\n", online_shoppers_intention_data['Revenue'].value_counts(normalize=True) * 100, "%")

### Represent the non-numeric, categorical columns as numeric using label encoding

- We need to convert the data into a numeric format suitable for the model
- First, we map the boolean `Revenue` target to integers (0 and 1)
- Then we encode any categorical variables into numeric form. In this dataset, columns like `VisitorType`, `Weekend`, and `Month` are categorical. We can use label encoding for simplicity.
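
As a minimal illustration of what label encoding does, the sketch below fits a `LabelEncoder` on a few hypothetical month labels. The variable names are assumptions. It also shows that a fitted encoder rejects labels it has not seen, which matters later when the same encoders are applied to new data.

```python
from sklearn.preprocessing import LabelEncoder

month_encoder = LabelEncoder()
codes = month_encoder.fit_transform(["Nov", "Jul", "Nov", "Feb"])

print(month_encoder.classes_)  # The learned labels, stored in sorted order
print(codes)                   # Each label replaced by its index in classes_
print(month_encoder.inverse_transform([0, 2]))  # Map codes back to labels

# A label that was not present during fit cannot be transformed
try:
    month_encoder.transform(["Dec"])
except ValueError as error:
    print("Unseen label:", error)
```

The integer codes follow alphabetical order, not calendar order, so the model sees an ordering among months that carries no real meaning. This is the price of the simplicity mentioned above.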

In [None]:
# Map the target 'Revenue' from False/True to 0/1
online_shoppers_intention_data['Revenue'] = online_shoppers_intention_data['Revenue'].map({False: 0, True: 1})

# Create a dictionary to store the label encoders for each column
label_encoders = {}

# Encode the categorical columns: 'VisitorType', 'Weekend', and 'Month'
for col in ['VisitorType', 'Weekend', 'Month']:
    label_encoders[col] = LabelEncoder()
    online_shoppers_intention_data[col] = label_encoders[col].fit_transform(online_shoppers_intention_data[col])

In [None]:
online_shoppers_intention_data.info()

In [None]:
online_shoppers_intention_data.head()

### Create X and y datasets for the features and target variable respectively

- `X = ...`: Separates the data such that the *DataFrame* called `X` contains only the features (independent variables or predictors)
    - `axis=0` drops the concerned rows.
    - `axis=1` drops the concerned columns.

- `y = ...`: Separates the data such that the *Series* called `y` contains only the target (dependent variable or outcome)

In [None]:
X = online_shoppers_intention_data.drop('Revenue', axis=1)
y = online_shoppers_intention_data['Revenue']

print("\nThe number of observations and variables in the features dataset")
print(X.shape)
print("\nThe columns in the features dataset")
print(X.columns)

print("\nThe number of observations and variables in the target dataset")
print(y.shape)

### Train‑test split

- This step splits the dataset into training and testing sets to evaluate the model's performance on unseen data. The `train_test_split()` function is used to randomly split the data, ensuring that the target variable's distribution is preserved in both sets.
- `stratify=y` in train_test_split ensures that the train and test sets have the same proportion of each class label as the original dataset. This is important for classification tasks, especially when classes are imbalanced, as it preserves the class distribution in both splits.
- `test_size=0.3` indicates that 30% of the data will be used for testing, while 70% will be used for training.
- `random_state=53` ensures reproducibility of the split, meaning that every time you run the code, you will get the same split of data.
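- `StandardScaler()` is used to standardize the features so that each has mean = 0 and variance = 1. This is important for a Support Vector Classifier because its kernels (e.g., RBF) are computed from distances or dot products between observations and are therefore sensitive to the scale of the features. Standardization prevents features with large ranges from dominating these calculations.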
- `fit_transform()` is applied to the training data to compute the mean and standard deviation, and then transform the data accordingly.
- `transform()` is applied to the test data using the same scaler fitted on the training data. This ensures that the test data is scaled in the same way as the training data, preventing data leakage.

- The `train_test_split` function returns four objects:
  - `X_train`: features for training
  - `X_test`: features for testing
  - `y_train`: labels for training
  - `y_test`: labels for testing

**Why:** Splitting the data this way allows you to train your model on one part of the data and evaluate its performance on unseen data, which helps prevent overfitting and gives an objective measure of the model's accuracy.

*Analogy:* This is similar to how a student learning a subject is not exposed to only one past paper that they can memorize. If they memorize the past paper and the exam assesses them on a different set of questions, then their performance in the exam will not be the same as their performance in the memorized past paper.
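
The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical imbalanced target (`toy_y`, 80 zeros and 20 ones) and compares the share of class 1 before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.30, stratify=toy_y, random_state=53)

print("Share of class 1 overall: ", toy_y.mean())
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

Without `stratify`, the two shares could drift apart by chance, especially for a small minority class.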

In [None]:
# Split into a training set and a test set (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=53)

### Apply scaling to the features

`StandardScaler` is a preprocessing technique from scikit-learn that standardizes each feature by removing its mean and scaling it to unit variance. It does this by applying the standardization formula to each feature:
- Standardization formula: `z = (x - μ) / σ`
- Where:
   - `x` is the original value of the feature.
   - `μ` is the mean of the feature values in the training data.
   - `σ` is the standard deviation of the feature values in the training data.
- The result is:
    - The transformed training data will have a mean of 0
    - A standard deviation of 1
    - If a feature is approximately normally distributed, roughly 68% of its values will lie between -1 and 1
    - In the same case, roughly 95% of its values will lie between -2 and 2

- Advantages:
    - Makes features comparable when their original versions are on different scales
    - Many machine learning algorithms perform better when features are on similar scales
    - Particularly important for algorithms that use distance calculations or assume normally distributed data


In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Step 5: Design a baseline Support Vector Classifier

In [None]:
# Initialize the classifier
support_vector_classifier_baseline = SVC(random_state=53)

# Fit the model on the training data
support_vector_classifier_baseline.fit(X_train_scaled, y_train)

In [None]:
# Make predictions on the test set
y_pred = support_vector_classifier_baseline.predict(X_test_scaled)

# Compute and display the accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))  # Overall fraction of correct predictions

# Show precision, recall, F1-score for each class
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Compute and display the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

## Step 6: Perform hyperparameter tuning

**Hyperparameters:**

- `C`: Regularization parameter. Controls the trade-off between achieving a low training error and a low testing error (generalization). Smaller values specify stronger regularization.
- `kernel`: Specifies the kernel type to be used in the algorithm. Common options are 'linear' (linear decision boundary) and 'rbf' (nonlinear, radial basis function).
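- `gamma`: Kernel coefficient for 'rbf', 'poly', and 'sigmoid' (it is ignored by the 'linear' kernel). It defines how far the influence of a single training example reaches: a small value means a far reach and a large value means a close reach. 'scale' and 'auto' are automatic settings: 'scale' uses `1 / (n_features * X.var())` and 'auto' uses `1 / n_features`.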
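
To make the role of `gamma` concrete, the sketch below computes both automatic settings for hypothetical random data (`toy_X`) and evaluates the RBF kernel, `K(a, b) = exp(-gamma * ||a - b||^2)`, for one pair of observations. It uses scikit-learn's `rbf_kernel` helper only as a cross-check of the manual calculation.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(53)
toy_X = rng.normal(size=(50, 4))  # hypothetical data: 50 rows, 4 features

n_features = toy_X.shape[1]
gamma_scale = 1.0 / (n_features * toy_X.var())  # what gamma='scale' computes
gamma_auto = 1.0 / n_features                   # what gamma='auto' computes

# Similarity between the first two observations under the RBF kernel
a, b = toy_X[:1], toy_X[1:2]
squared_distance = np.sum((a - b) ** 2)
manual = np.exp(-gamma_scale * squared_distance)
from_sklearn = rbf_kernel(a, b, gamma=gamma_scale)[0, 0]

print("gamma='scale':", gamma_scale, " gamma='auto':", gamma_auto)
print("Kernel value (manual):      ", manual)
print("Kernel value (scikit-learn):", from_sklearn)
```

A larger `gamma` makes the kernel value fall towards 0 more quickly as two observations move apart, so each training example influences a smaller neighbourhood.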

In [None]:
# Define the parameter grid for SVC
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Set up GridSearchCV for SVC
grid_search = GridSearchCV(
    support_vector_classifier_baseline,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit GridSearchCV on the scaled training data
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and the best score
print("Best parameters found:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)

# Use the best estimator as your tuned model
support_vector_classifier_optimum = grid_search.best_estimator_
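
Because the linear kernel ignores `gamma`, the grid above fits some linear candidates twice with identical results. One possible refinement, shown here only as a sketch, is to pass `GridSearchCV` a list of dictionaries so that `gamma` is searched for the RBF kernel only. The names `full_grid` and `split_grid` are assumptions; `ParameterGrid` is used just to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

# The single-dictionary grid used above
full_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Alternative: one dictionary per kernel, with gamma searched only for 'rbf'
split_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print("Candidates in the single grid:", n_full)
print("Candidates in the split grid: ", n_split)
```

With 5-fold cross-validation, each candidate removed saves five model fits, and `split_grid` can be passed to `GridSearchCV` through `param_grid` in the same way as the original dictionary.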

## Step 7: Evaluate the Model

`y_pred = support_vector_classifier_optimum.predict(X_test_scaled)`

- This uses the tuned Support Vector Classifier (`support_vector_classifier_optimum`) to predict the labels for the scaled test set features (`X_test_scaled`). This gives you the model’s predictions on data it was not trained on, which is necessary for evaluating its performance.

`print("Classification Report:\n", classification_report(y_test, y_pred))`
- This prints a detailed classification report comparing the true labels (`y_test`) to the predicted labels (`y_pred`). The report includes precision, recall, F1-score, and support for each class, enabling you to understand how well the model performs for each category.
- It shows the performance metrics for a model that predicts two classes:
    - Class 0 - A case where the user's interaction with the eCommerce website does not lead to a purchase.
    - Class 1 - A case where the user's interaction with the eCommerce website leads to a purchase.

| Term             | Meaning                                                                                                                             |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **Precision**    | Out of all items the model said are class X, how many are actually class X?                                                         |
| **Recall**       | Out of all actual items in class X, how many did the model correctly find?                                                          |
| **F1-score**     | A balance between precision and recall such  that a higher value means better balance.                                              |
| **Support**      | The number of actual items in that class.                                                                                           |
| **Macro avg**    | The average of precision, recall, and F1-score for both classes, treating them equally.                                             |
| **Weighted avg** | The average of precision, recall, and F1-score, weighted by the number of samples in each class (larger classes have more influence). |

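- When reading the report, compare the precision and recall of class 1 with those of class 0 instead of relying on the overall accuracy alone. Because the minority class was upsampled before the split, both classes have roughly equal support in the test set, so a gap between the two classes reflects the model's behaviour and not class imbalance.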
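
The definitions in the table can be traced back to the four cells of a confusion matrix. The sketch below uses ten hypothetical labels (`toy_true` and `toy_pred`) and computes precision, recall, and F1-score for class 1 by hand, then cross-checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels for ten sessions
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 1])

# Rows are the actual classes and columns are the predicted classes
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)  # Of everything predicted as class 1, how much is class 1?
recall = tp / (tp + fn)     # Of everything that is class 1, how much was found?
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Manual:      ", precision, recall, f1)
print("scikit-learn:", precision_score(toy_true, toy_pred),
      recall_score(toy_true, toy_pred), f1_score(toy_true, toy_pred))
```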

In [None]:
# Make predictions on the test set
y_pred = support_vector_classifier_optimum.predict(X_test_scaled)

# Compute and display the accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))  # Overall fraction of correct predictions

# Show precision, recall, F1-score for each class
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Compute and display the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


## Step 8: Use the model to make a prediction on a new sample

### Predictions using new data provided in a DataFrame

In [None]:
# Example new data as a DataFrame
new_data_original = pd.DataFrame({
    'Administrative': [2, 5],
    'Administrative_Duration': [50.0, 60.0],
    'Informational': [0, 4],
    'Informational_Duration': [0.0, 116],
    'ProductRelated': [20, 50],
    'ProductRelated_Duration': [400.0, 600.0],
    'BounceRates': [0.02, 0],
    'ExitRates': [0.05, 0.006593407],
    'PageValues': [0.0, 6.281494505],
    'SpecialDay': [0.0, 0.0],
    'Month': ['Nov', 'Jul'],
    'OperatingSystems': [2, 1],
    'Browser': [1, 3],
    'Region': [1, 9],
    'TrafficType': [2, 1],
    'VisitorType': ['Returning_Visitor', 'New_Visitor'],
    'Weekend': [False, True]  # Booleans, to match the values the label encoder was fitted on
})

new_data = new_data_original.copy()

# Encode categorical columns using the same label encoders
for col in ['VisitorType', 'Weekend', 'Month']:
    new_data[col] = label_encoders[col].transform(new_data[col])

# Scale the features using the same scaler as training
new_data_scaled = scaler.transform(new_data)

# Predict
prediction = support_vector_classifier_optimum.predict(new_data_scaled)
print(prediction)

# Add predictions as a new column
new_data_original['Predicted_Revenue'] = prediction
display(new_data_original)

### Predictions using new data provided in a CSV file

In [None]:
dataset_path = './data/online_shoppers_intention_new_data.csv'
url = 'https://raw.githubusercontent.com/course-files/RegressionAndClassification/refs/heads/main/data/online_shoppers_intention_new_data.csv'

if not os.path.exists(dataset_path):
    print("Downloading dataset...")
    if not os.path.exists('./data'):
        os.makedirs('./data')
    urllib.request.urlretrieve(url, dataset_path)
    print("✅ Dataset downloaded")
else:
    print("✅ Dataset already exists locally")

use_cols = ['Administrative', 'Administrative_Duration', 'Informational',
            'Informational_Duration', 'ProductRelated',
            'ProductRelated_Duration', 'BounceRates', 'ExitRates',
            'PageValues', 'SpecialDay', 'Month', 'OperatingSystems',
            'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend',
            'Revenue']
new_data_original = pd.read_csv(dataset_path, usecols=use_cols, encoding='utf-8', nrows=200000)
new_data = new_data_original.drop(['Revenue'], axis=1).copy()

# Encode categorical columns using the same label encoders
for col in ['VisitorType', 'Weekend', 'Month']:
    new_data[col] = label_encoders[col].transform(new_data[col])

# Scale the features using the same scaler as training
new_data_scaled = scaler.transform(new_data)

# Predict
predictions = support_vector_classifier_optimum.predict(new_data_scaled)
new_data_original['Predicted_Revenue'] = predictions
display(new_data_original.head())

## Step 9: Export the results for further analysis and reporting using a tool like Power BI

In [None]:
# Save the results as a CSV file for further analysis and reporting
output_path = './data/online_shoppers_intention_predicted_data_svc.csv'
# Ensure the data directory exists
if not os.path.exists('./data'):
    os.makedirs('./data')
# Save the CSV file regardless of environment (Google Colab or local)
new_data_original.to_csv(output_path, index=False)
print(f"\n✅ Results saved to {output_path}")

# Provide a download link if running in Google Colab
try:
    from google.colab import files
    files.download(output_path)
except ImportError:
    print("❌ Not running in Google Colab, skipped dataset download link.")

# Save the label encoders
label_encoders_path = './model/label_encoders_4.pkl'
# Ensure the model directory exists
if not os.path.exists('./model'):
    os.makedirs('./model')
joblib.dump(label_encoders, label_encoders_path)
print(f"✅ Label encoders saved to {label_encoders_path}")

# Provide a download link if running in Google Colab
try:
    from google.colab import files
    files.download(label_encoders_path)
except ImportError:
    print("❌ Not running in Google Colab, skipped label encoder download link.")

# Save the scaler
scaler_path = './model/scaler_4.pkl'
# Ensure the model directory exists
if not os.path.exists('./model'):
    os.makedirs('./model')
joblib.dump(scaler, scaler_path)
print(f"✅ Scaler saved to {scaler_path}")

# Provide a download link if running in Google Colab
try:
    from google.colab import files
    files.download(scaler_path)
except ImportError:
    print("❌ Not running in Google Colab, skipped scaler download link.")

# Save the model
model_path = './model/support_vector_classifier_optimum.pkl'
# Ensure the model directory exists
if not os.path.exists('./model'):
    os.makedirs('./model')
# Save the model regardless of environment (Google Colab or local)
joblib.dump(support_vector_classifier_optimum, model_path)
print(f"✅ Model saved to {model_path}")

# Provide a download link if running in Google Colab
try:
    from google.colab import files
    files.download(model_path)
except ImportError:
    print("❌ Not running in Google Colab, skipped model download link.")
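
The saved artifacts are only useful if they can be loaded again, for example in a separate script or dashboard back end. The sketch below shows the round trip with `joblib.load` on a hypothetical toy scaler and model saved to a temporary folder; the file names and the toy data are assumptions and do not refer to the files saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data and model standing in for the notebook's artifacts
rng = np.random.default_rng(53)
toy_X = rng.normal(size=(40, 3))
toy_y = (toy_X[:, 0] > 0).astype(int)

toy_scaler = StandardScaler().fit(toy_X)
toy_model = SVC(random_state=53).fit(toy_scaler.transform(toy_X), toy_y)

# Save both objects, then load them back as a separate program would
tmp_dir = tempfile.mkdtemp()
scaler_file = os.path.join(tmp_dir, "scaler.pkl")
model_file = os.path.join(tmp_dir, "model.pkl")
joblib.dump(toy_scaler, scaler_file)
joblib.dump(toy_model, model_file)

loaded_scaler = joblib.load(scaler_file)
loaded_model = joblib.load(model_file)

# The reloaded objects reproduce the predictions of the originals
toy_new = rng.normal(size=(5, 3))
original = toy_model.predict(toy_scaler.transform(toy_new))
reloaded = loaded_model.predict(loaded_scaler.transform(toy_new))
print("Predictions match:", bool((original == reloaded).all()))
```

The scaler and the encoders must be loaded together with the model, because new data has to pass through exactly the same transformations that were applied to the training data.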

# References
Sakar, C. & Kastro, Y. (2018). Online Shoppers Purchasing Intention Dataset [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5F88Q.