# **Analyzing Server Connection Patterns Using Clustering and Regression**

### Data Science Final Project Report

**Names**: David Liu, JunHyun Kim, Layni Janzen, Sydney Lee

**Group**: 10

---

In [95]:
### Run this cell before continuing.

# Import libraries for data manipulation and analysis
import pandas as pd
import numpy as np

# Import libraries for data visualization
import altair as alt
import plotly.graph_objects as go

# Import libraries for machine learning and preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

### Required Libraries

To run this notebook, ensure the following Python libraries are installed:

- `pandas`: For data manipulation and analysis
- `numpy`: For numerical operations and array manipulation
- `altair`: For data visualization
- `plotly`: For creating interactive visualizations
- `scikit-learn`: For machine learning, clustering, regression, and preprocessing

#### Installation:
Run the following command in your terminal or Jupyter Notebook to install all required libraries:
```bash
pip install pandas numpy altair plotly scikit-learn
```

---
## **Introduction**

A research group in UBC Computer Science is studying how individuals play video games. Through their Minecraft web server, they have collected two datasets with information about players and their session history. Hosting a web server requires resource management, and the research group wants to determine how best to allocate those resources.

In this project, we use their two datasets to answer the question: "How can we predict and optimize server resource allocation based on the number of concurrent players in each session?"

As a reminder, these datasets are:

1. **`players.csv`** – Contains information about the players, including their experience level, subscription status, and hours played.
2. **`sessions.csv`** – Contains session details, such as the start and end times of each player's session.

### **1. `players.csv` Dataset**

The `players.csv` dataset provides detailed information about the players, which includes their **experience level**, **subscription status**, **gameplay hours**, and basic demographic information.

#### Main columns in `players.csv`:
- **`experience`**: The experience level of the player (e.g., "Pro", "Veteran", "Regular", "Amateur").
- **`subscribe`**: A boolean value indicating whether the player has a subscription (`True` or `False`).
- **`hashedEmail`**: A unique identifier for the player, stored as a hashed email address.
- **`played_hours`**: The total number of hours the player has spent playing the game.
- **`name`**: The player's name.
- **`gender`**: The player's gender.
- **`age`**: The player's age.
- **`individualId`**: An identifier for the player, possibly an internal ID for tracking purposes.
- **`organizationName`**: The name of the player's organization (if applicable).

#### Key Information from `players.csv`:
- **Number of Rows**: 196 players.
- **Number of Columns**: 9 attributes per player.
- **Data Types**: The dataset contains both categorical (e.g., `experience`, `gender`) and numerical (e.g., `played_hours`, `age`) data.

This dataset is important for understanding **player behavior** and how factors like **experience** and **age** might influence the number of connections (active players) at different times of the day.

### **2. `sessions.csv` Dataset**

The `sessions.csv` dataset contains information about the sessions in which players were actively playing the game. It records **session start and end times**, and is crucial for determining how long players are active and when their sessions occur.

#### Main columns in `sessions.csv`:
- **`hashedEmail`**: The player's unique identifier (same as in `players.csv`).
- **`start_time`**: The start time of the player's session, recorded as a string in the format `DD/MM/YYYY HH:MM`.
- **`end_time`**: The end time of the player's session, also recorded as a string in the same format as `start_time`.
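- **`original_start_time`**: The session's start time as a Unix timestamp. The magnitudes shown in Table 2 (about 1.7 × 10^12) are consistent with milliseconds since the epoch.
- **`original_end_time`**: The session's end time as a Unix timestamp, in the same unit as `original_start_time`.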
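
The Unix-timestamp columns can be decoded with `pd.to_datetime`. The following is a minimal sketch that assumes the values are milliseconds since the epoch in UTC; the unit and time zone are inferred from the magnitudes in Table 2, not stated by the dataset. The sample values are copied from that table and look rounded, so the decoded times only approximate `start_time`.

```python
import pandas as pd

# Assumption: values are milliseconds since 1970-01-01 UTC (inferred, not documented).
raw = pd.Series([1719770000000.0, 1718670000000.0])

# unit="ms" tells pandas how to interpret the numbers; utc=True makes the result timezone-aware.
decoded = pd.to_datetime(raw, unit="ms", utc=True)
print(decoded)
```

Because the string columns `start_time` and `end_time` carry minute-level precision, they remain the more convenient source for the analysis in this report.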

#### Key Information from `sessions.csv`:
- **Number of Rows**: 1535 session entries.
- **Number of Columns**: 5 attributes per session.
- **Data Types**: The dataset includes both categorical data (e.g., `hashedEmail`) and datetime data (e.g., `start_time`, `end_time`).

This dataset is critical for determining **when players are active**, and analyzing **peak hours** or **days of the week** for player activity. The session duration (calculated from `start_time` and `end_time`) will also be important for identifying **intense activity periods** and **server load**.
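
Since the research question refers to concurrent players, it may help to see how overlapping sessions translate into a concurrency count. The sketch below is one possible approach, not the method used in this report: it treats each session as a `+1` event at its start and a `-1` event at its end, then takes a running sum. The toy sessions and the names `toy`, `events`, and `peak_concurrent` are hypothetical.

```python
import pandas as pd

# Hypothetical sessions; real ones would come from sessions.csv after datetime parsing.
toy = pd.DataFrame({
    "start_time": pd.to_datetime(["2024-06-30 18:00", "2024-06-30 18:05", "2024-06-30 18:30"]),
    "end_time": pd.to_datetime(["2024-06-30 18:20", "2024-06-30 18:40", "2024-06-30 18:45"]),
})

# Each session contributes +1 when it starts and -1 when it ends.
events = pd.concat([
    pd.DataFrame({"time": toy["start_time"], "delta": 1}),
    pd.DataFrame({"time": toy["end_time"], "delta": -1}),
])

# Sort by time; at identical timestamps, ends (-1) are processed before starts (+1).
events = events.sort_values(["time", "delta"]).reset_index(drop=True)

# The running sum is the number of players online after each event.
events["concurrent"] = events["delta"].cumsum()
peak_concurrent = int(events["concurrent"].max())
print(events)
print("Peak concurrent players:", peak_concurrent)
```

In this toy data at most two sessions overlap at any time, so the peak is 2 even though three sessions occur.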

---


## **Methods & Results**

### **Loading the Data**

Before we can use the data, we must first load it:

In [96]:
# Load the players dataset
players_df = pd.read_csv("players.csv")

# Display the first few rows of the dataset
print("Table 1: First few rows of players.csv:")
display(players_df.head())

Table 1: First few rows of players.csv:


| | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |


In [97]:
# Load the sessions dataset
sessions_df = pd.read_csv("sessions.csv")

# Display the first few rows of the dataset
print("Table 2: First few rows of sessions.csv:")
display(sessions_df.head())

Table 2: First few rows of sessions.csv:


| | hashedEmail | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|---|
| 0 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 1 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| 2 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3... | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| 3 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 4 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |


### **Data Cleaning and Wrangling**

Now that the data has been loaded, we can clean and wrangle it. We identified five preprocessing tasks:
- **Datetime Conversion**: Convert the `start_time` and `end_time` columns in the `sessions.csv` dataset to proper `datetime` format to allow time-based analysis.
- **Session Duration**: Calculate the **session duration** by subtracting `start_time` from `end_time` to get the total duration of each session in minutes.
- **Handling Missing Data**: Drop rows with missing `end_time` values, since the session duration cannot be calculated without this information.
- **Feature Extraction**: Extract the **hour** and **day of the week** from the `start_time` column to enable analysis of player activity patterns by time.
- **Mapping to Numerical Values**: The day of the week is extracted as a string, but numeric values are easier to work with in the analysis, so each day of the week is mapped to a number.

#### Converting `start_time` and `end_time` to `datetime`

In [98]:
# Convert "start_time" and "end_time" to datetime format
sessions_df["start_time"] = pd.to_datetime(sessions_df["start_time"], format="%d/%m/%Y %H:%M")
sessions_df["end_time"] = pd.to_datetime(sessions_df["end_time"], format="%d/%m/%Y %H:%M")

#### Dropping rows with missing data

In [99]:
# Drop rows with missing end times
sessions_df = sessions_df.dropna(subset=["end_time"])

#### Calculating the session duration in minutes

In [100]:
# Calculate session duration in minutes
sessions_df["session_duration"] = (sessions_df["end_time"] - sessions_df["start_time"]).dt.total_seconds() / 60

#### Extracting relevant features

In [101]:
# Filter out sessions with zero duration
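# .copy() makes the filtered result an independent DataFrame, so the column
# assignments below do not trigger pandas' SettingWithCopyWarning
sessions_df = sessions_df[sessions_df["session_duration"] > 0].copy()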

# Extract day of the week and hour from start time
sessions_df["day_of_week"] = sessions_df["start_time"].dt.day_name()
sessions_df["hour"] = sessions_df["start_time"].dt.hour

#### Mapping day to numeric value

In [102]:
# Map each day name to a numeric value, starting with Monday: 0
sessions_df["day_of_week_numeric"] = sessions_df["day_of_week"].map({
    "Monday": 0,
    "Tuesday": 1,
    "Wednesday": 2,
    "Thursday": 3,
    "Friday": 4,
    "Saturday": 5,
    "Sunday": 6
})
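
As an alternative to a hand-written dictionary, pandas exposes the same Monday-is-0 numbering directly through the `.dt.dayofweek` accessor. The sketch below uses two start times copied from Table 2; the variable names `start` and `day_numeric` are illustrative only.

```python
import pandas as pd

# Two start times from Table 2, parsed with the same day-first format as the notebook.
start = pd.Series(
    pd.to_datetime(["30/06/2024 18:12", "17/06/2024 23:33"], format="%d/%m/%Y %H:%M")
)

# dayofweek numbers the days Monday=0 through Sunday=6.
day_numeric = start.dt.dayofweek
print(day_numeric.tolist())  # [6, 0]
```

The result agrees with the mapping above: 30 June 2024 is a Sunday (6) and 17 June 2024 is a Monday (0).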

#### Displaying the preprocessed data

In [103]:
# Reorder columns for consistency
sessions_df = sessions_df[["hashedEmail", "start_time", "end_time", "day_of_week", "day_of_week_numeric", "hour", "session_duration"]]

# Preview the cleaned and tidied data
print("Table 3: Preprocessed data") 
sessions_df.head()

Table 3: Preprocessed data


| | hashedEmail | start_time | end_time | day_of_week | day_of_week_numeric | hour | session_duration |
|---|---|---|---|---|---|---|---|
| 0 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 2024-06-30 18:12:00 | 2024-06-30 18:24:00 | Sunday | 6 | 18 | 12.0 |
| 1 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 2024-06-17 23:33:00 | 2024-06-17 23:46:00 | Monday | 0 | 23 | 13.0 |
| 2 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3... | 2024-07-25 17:34:00 | 2024-07-25 17:57:00 | Thursday | 3 | 17 | 23.0 |
| 3 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 2024-07-25 03:22:00 | 2024-07-25 03:58:00 | Thursday | 3 | 3 | 36.0 |
| 4 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 2024-05-25 16:01:00 | 2024-05-25 16:12:00 | Saturday | 5 | 16 | 11.0 |


With the data preprocessed we can aggregate the data by **hour**, **day of the week**, and both **hour and day of the week** to identify trends in player activity, which is essential for understanding when to allocate server resources more efficiently.


### **Data Aggregation**
- Aggregation is used to summarize data by grouping it based on specific columns and applying an operation (like counting, summing, averaging, etc.).

- The `groupby()` function defines the groups, `.size()` computes the counts, and `.reset_index()` organizes the results into a readable table.

- These steps help in analyzing patterns, such as identifying peak hours or busy days for connections.

In [104]:
# Aggregate connections by hour
hourly_connections = sessions_df.groupby("hour").size().reset_index(name="connections")

# Aggregate connections by day of the week
daily_connections = sessions_df.groupby("day_of_week_numeric").size().reset_index(name="connections")

# Aggregate connections by both day of the week and hour
day_hour_connections = sessions_df.groupby(["day_of_week_numeric", "hour"]).size().reset_index(name="connections")

# Display aggregated data
print("Table 4: Hourly Connections")
print(hourly_connections.head())

print("\nTable 5: Daily Connections:")
print(daily_connections.head())

print("\nTable 6: Day-Hour Connections:")
print(day_hour_connections.head())

day_hour_connections

Table 4: Hourly Connections
   hour  connections
0     0          128
1     1           79
2     2          152
3     3          131
4     4          150

Table 5: Daily Connections:
   day_of_week_numeric  connections
0                    0          207
1                    1          203
2                    2          210
3                    3          223
4                    4          181

Table 6: Day-Hour Connections:
   day_of_week_numeric  hour  connections
0                    0     0           22
1                    0     1           11
2                    0     2           14
3                    0     3           19
4                    0     4           26


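| | day_of_week_numeric | hour | connections |
|---|---|---|---|
| 0 | 0 | 0 | 22 |
| 1 | 0 | 1 | 11 |
| 2 | 0 | 2 | 14 |
| 3 | 0 | 3 | 19 |
| 4 | 0 | 4 | 26 |
| ... | ... | ... | ... |
| 133 | 6 | 19 | 18 |
| 134 | 6 | 20 | 19 |
| 135 | 6 | 21 | 20 |
| 136 | 6 | 22 | 20 |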
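
Note that `groupby(...).size()` only returns combinations that actually occur, so the day-hour table can contain fewer than 7 × 24 = 168 rows; time slots with no sessions are simply absent. If those empty slots should count as zero connections, one option is to reindex against the full day-hour grid. This is a sketch on hypothetical counts, and whether zero-connection slots belong in the analysis is a modeling choice this report does not make.

```python
import pandas as pd

# Hypothetical aggregated counts in which only three day-hour slots were observed.
observed = pd.DataFrame({
    "day_of_week_numeric": [0, 0, 6],
    "hour": [0, 1, 22],
    "connections": [22, 11, 20],
})

# Build every (day, hour) combination: 7 days x 24 hours.
full_index = pd.MultiIndex.from_product(
    [range(7), range(24)], names=["day_of_week_numeric", "hour"]
)

# Reindex onto the full grid, filling unobserved slots with 0 connections.
complete = (
    observed.set_index(["day_of_week_numeric", "hour"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(len(complete))  # 168
```

Including the zero rows would change both the clustering and the regression inputs, so it should be applied consistently or not at all.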


### **Data Analysis**
We can use a line chart to visualize the number of connections by hour of the day using data from the `hourly_connections` dataset. By plotting the hours as discrete categories on the x-axis and the corresponding number of connections on the y-axis we can use the chart to highlight the hourly trends in connection activity, offering insights into peak usage times.

In [119]:
hour_chart = alt.Chart(hourly_connections).mark_line().encode(
    x=alt.X("hour:O", title="Hour of the Day", axis=alt.Axis(labelAngle=0)),
    y=alt.Y("connections:Q", title="Number of Connections"),
    color=alt.value("steelblue")
).properties(
    title="Figure 1: Number of Connections by Hour of the Day"
)

hour_chart

Furthermore, we can use a bar chart to illustrate the number of connections by day of the week. The bars are color-coded with the Viridis color scale, so that the color reflects the number of connections for each day. This provides a clear comparison of connection activity across the week and helps identify patterns or peak days of usage.

In [120]:
# Replace day_of_week_numeric with day names
daily_connections["day_of_week"] = daily_connections["day_of_week_numeric"].replace({
    0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday", 
    4: "Friday", 5: "Saturday", 6: "Sunday"
})

day_chart = alt.Chart(daily_connections).mark_bar().encode(
    x=alt.X("day_of_week:O", title="Day of the Week", sort=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]),
    y=alt.Y("connections:Q", title="Number of Connections"),
    color=alt.Color("connections:Q", scale=alt.Scale(scheme="viridis"))
).properties(
    title="Figure 2: Number of Connections by Day of the Week"
)

day_chart

Additionally, we can use a bubble chart to visualize the number of connections by day of the week and hour of the day. The x-axis represents the hours, while the y-axis represents the days in chronological order, with custom labels for day names. The size and color intensity of the bubbles indicate the number of connections. This visualization shows peak activity across both the daily and hourly dimensions.

In [107]:
day_hour_chart = alt.Chart(day_hour_connections).mark_circle(size=100).encode(
    x=alt.X("hour:O", title="Hour of the Day", axis=alt.Axis(labelAngle=0)),
    y=alt.Y("day_of_week_numeric:O", 
            title="Day of the Week", 
            sort=[0, 1, 2, 3, 4, 5, 6],
            axis=alt.Axis(
                labelExpr="datum.value == 0 ? 'Monday' : "
                          "datum.value == 1 ? 'Tuesday' : "
                          "datum.value == 2 ? 'Wednesday' : "
                          "datum.value == 3 ? 'Thursday' : "
                          "datum.value == 4 ? 'Friday' : "
                          "datum.value == 5 ? 'Saturday' : 'Sunday'"
            )),
    size=alt.Size("connections:Q", title="Connections"),
    color=alt.Color("connections:Q", scale=alt.Scale(scheme="blues")),
).properties(
    title="Figure 3: Number of Connections by Day and Hour",
    width=800,
    height=400
)

day_hour_chart


Using an elbow curve, we can visualize the total within-cluster sum of squared distances (WSSD) for varying numbers of clusters ($k$) in a K-Means model. We first standardize the `connections` column so that the clustering operates on scaled values. The chart shows how WSSD decreases as $k$ increases and highlights the "elbow point" beyond which additional clusters yield diminishing returns. This analysis helps determine a suitable number of clusters for effective segmentation.
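
To make the WSSD quantity concrete, the following sketch computes it by hand on a small hypothetical dataset and compares it with scikit-learn's `inertia_` attribute, which is the value the elbow plot uses. The data in `X` are made up for illustration and stand in for standardized connection counts.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical one-dimensional data with two obvious groups.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# WSSD: squared distance from each point to the centre of its assigned cluster, summed.
assigned_centres = km.cluster_centers_[km.labels_]
wssd_manual = float(np.sum((X - assigned_centres) ** 2))

print(wssd_manual, km.inertia_)
```

Both numbers should agree (about 0.26 for this toy data), which confirms that `inertia_` is the within-cluster sum of squared distances.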

In [108]:
# Define the preprocessor
preprocessor = make_column_transformer(
    (StandardScaler(), ["connections"]),
    verbose_feature_names_out=False,
)

# Set the range for possible cluster numbers
ks = range(1, 10)

# Compute WSSD for each value of K using list comprehension
wssds = [
    make_pipeline(
        preprocessor,
        KMeans(n_clusters=k, random_state=42, n_init=10)  # n_init is explicitly set
    ).fit(day_hour_connections)[1].inertia_
    for k in ks
]

# Create a dataframe to store K and WSSD values
elbow_df = pd.DataFrame({
    "k": ks,
    "wssd": wssds,
})

# Plot the Elbow Curve
elbow_plot = alt.Chart(elbow_df).mark_line(point=True).encode(
    x=alt.X("k:N", title="Number of Clusters"),
    y=alt.Y("wssd:Q", title="Total Within-Cluster Sum of Squared Distances (WSSD)"),
    tooltip=["k", "wssd"]
).properties(
    title="Figure 4: Elbow Method for Optimal K"
)
elbow_plot


Using the elbow method, we determined that the optimal cluster count is $k = 3$, so the `day_hour_connections` dataset is segmented into three clusters using K-Means. Once again, the clustering pipeline standardizes the data before applying the K-Means algorithm to assign cluster labels. These labels are then mapped to descriptive categories ("Low," "Medium," "High") for better interpretability. The resulting dataset highlights connection density patterns across different hours and days, enabling more actionable insights.

In [109]:
# Set optimal number of clusters
optimal_k = 3

# Create and fit the pipeline
clustering_pipeline = make_pipeline(
    preprocessor,
    KMeans(n_clusters=optimal_k, random_state=42)
)
clustering_pipeline.fit(day_hour_connections)

# Add cluster labels to the original dataframe
day_hour_connections["cluster"] = clustering_pipeline[1].labels_

# Define a mapping for descriptive cluster labels
# (K-Means cluster ids are arbitrary; this mapping matches the model fitted in this run)
cluster_mapping = {0: "Low", 2: "Medium", 1: "High"}
day_hour_connections["density_label"] = day_hour_connections["cluster"].map(cluster_mapping)

# Display sample of labeled data
day_hour_connections[["hour", "day_of_week_numeric", "connections", "density_label"]].head()

print("Table 7: Clustered Connection Density Table")
day_hour_connections

Table 7: Clustered Connection Density Table






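| | day_of_week_numeric | hour | connections | cluster | density_label |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 22 | 1 | High |
| 1 | 0 | 1 | 11 | 2 | Medium |
| 2 | 0 | 2 | 14 | 2 | Medium |
| 3 | 0 | 3 | 19 | 1 | High |
| 4 | 0 | 4 | 26 | 1 | High |
| ... | ... | ... | ... | ... | ... |
| 133 | 6 | 19 | 18 | 1 | High |
| 134 | 6 | 20 | 19 | 1 | High |
| 135 | 6 | 21 | 20 | 1 | High |
| 136 | 6 | 22 | 20 | 1 | High |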
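
K-Means assigns cluster ids arbitrarily, so a hard-coded mapping such as `{0: "Low", 2: "Medium", 1: "High"}` is only valid for one particular fitted model. A more robust option, sketched below on hypothetical data, is to rank the clusters by their centroid and attach the names in that order. Standardizing the feature first, as the pipeline does, would not change the ranking because scaling preserves order.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical connection counts forming three well-separated groups.
connections = np.array([[2.0], [3.0], [4.0], [11.0], [12.0], [14.0], [22.0], [24.0], [26.0]])

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(connections)

# Rank cluster ids by centroid value, smallest first, then attach names in that order.
order = np.argsort(km.cluster_centers_.ravel())
cluster_mapping = {
    int(cluster_id): name for cluster_id, name in zip(order, ["Low", "Medium", "High"])
}

labels = [cluster_mapping[int(c)] for c in km.labels_]
print(labels)
```

With this approach the descriptive labels stay correct even if a different random seed or library version permutes the cluster ids.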


We can highlight temporal patterns and density variations across different time periods with a scatter plot of the number of connections by hour of the day, with points color-coded by the cluster labels ("Low," "Medium," "High") determined by K-Means clustering. The x-axis represents hours, while the y-axis shows the number of connections.

In [110]:
# Updated visualization with descriptive labels
cluster_plot = alt.Chart(day_hour_connections).mark_circle(size=60).encode(
    x=alt.X("hour:O", title="Hour of the Day"),
    y=alt.Y("connections:Q", title="Number of Connections"),
    color=alt.Color("density_label:N", title="Session Density"),
    tooltip=["hour", "day_of_week_numeric", "connections", "density_label"]
).properties(
    title="Figure 5: K-Means Clustering of Connection Density (Descriptive Labels)",
    width=800,
    height=400
)

cluster_plot


To gain a better understanding of the data, another bubble chart can be used to visualize the connection density across different times of the day and days of the week. The `density_label` ("Low," "Medium," "High") is mapped to specific point sizes, highlighting density variations, while colors differentiate clusters determined by K-Means clustering. The x-axis represents hours of the day and the y-axis represents days of the week.

In [111]:
# Map point sizes to the density labels for visualization
point_size_mapping = {"Low": 40, "Medium": 80, "High": 120}
day_hour_connections["size"] = day_hour_connections["density_label"].map(point_size_mapping)

# Updated visualization
day_hour_cluster_chart = alt.Chart(day_hour_connections).mark_circle().encode(
    x=alt.X("hour:O", title="Hour of the Day", axis=alt.Axis(labelAngle=0)),
    y=alt.Y("day_of_week_numeric:O", 
            title="Day of the Week", 
            sort=[0, 1, 2, 3, 4, 5, 6],
            axis=alt.Axis(labelExpr="datum.value == 0 ? 'Monday' : datum.value == 1 ? 'Tuesday' : datum.value == 2 ? 'Wednesday' : datum.value == 3 ? 'Thursday' : datum.value == 4 ? 'Friday' : datum.value == 5 ? 'Saturday' : 'Sunday'")
           ),
    color=alt.Color("density_label:N", title="Cluster"),
    size=alt.Size("size:Q", title="Point Size", legend=None),
    tooltip=["day_of_week_numeric", "hour", "density_label", "connections"]
).properties(
    title="Figure 6: Clustering Results: Day and Hour",
    width=800,
    height=400
)

day_hour_cluster_chart

We can then calculate the distribution of connection density clusters and provide insights into the frequency of each cluster across the data. The `value_counts` method is used to count the number of occurrences for each `density_label` ("Low," "Medium," "High") in the dataset. These counts are stored in a new DataFrame, `cluster_distribution`, with renamed columns ("Cluster" and "Count") for clarity. 

In [112]:
# Cluster distribution summary
cluster_distribution = day_hour_connections["density_label"].value_counts().reset_index()
cluster_distribution.columns = ["Cluster", "Count"]

print("\nTable 8: Cluster Distribution:")
print(cluster_distribution)


Table 8: Cluster Distribution:
  Cluster  Count
0  Medium     58
1     Low     48
2    High     32


### Modeling Approaches after Clustering

After clustering the data, three modeling approaches were explored:

1. **Full Dataset**: A single model trained on the entire dataset without incorporating density information.
2. **With Density Labels**: A model trained on the entire dataset with density labels (`Low`, `Medium`, `High`) as an additional feature.
3. **Cluster-Specific Models**: Separate models trained for each cluster (`Low`, `Medium`, `High`).

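Each approach was evaluated using **K-Nearest Neighbors (KNN) regression**, with the optimal number of neighbors (`k`) selected by **GridSearchCV** (5-fold cross-validation over `k` from 1 to 18). Performance on the held-out test set is reported as **Root Mean Squared Percentage Error (RMSPE)**, computed here as the root mean squared error divided by the root mean square of the observed values, $\sqrt{\operatorname{mean}((y - \hat{y})^2) / \operatorname{mean}(y^2)}$.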
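
The metric computed in the cells below divides the mean squared error by the mean of the squared observed values before taking the square root. This is a normalized RMSE and differs slightly from the textbook percentage error, which averages per-observation relative errors. The sketch below contrasts the two on made-up numbers; the function names `relative_rmse` and `percentage_rmse` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def relative_rmse(y_true, y_pred):
    """RMSE divided by the root mean square of the observed values (as in this notebook)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2) / np.mean(y_true ** 2)))

def percentage_rmse(y_true, y_pred):
    """Textbook variant: root mean square of the per-observation relative errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

# Made-up observations and predictions.
y_true = [10, 20, 30]
y_pred = [12, 18, 33]

print(round(relative_rmse(y_true, y_pred), 4))    # 0.1102
print(round(percentage_rmse(y_true, y_pred), 4))  # 0.1414
```

The normalized form is less sensitive to small observed values, since no individual error is divided by a value close to zero.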


In [113]:
# Convert 'density_label' to numeric values
day_hour_connections["density_label_numeric"] = day_hour_connections["density_label"].map({
    "Low": 0,
    "Medium": 1,
    "High": 2
})

# Evaluate Full Dataset
print("============= Full Dataset =============")
features_full = day_hour_connections[["hour", "day_of_week_numeric"]]
target_full = day_hour_connections["connections"]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_full, target_full, train_size=0.75, random_state=42)

# Define the KNN pipeline
pipeline_full = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Perform GridSearchCV to find the optimal K
param_grid_full = {"kneighborsregressor__n_neighbors": range(1, 19)}
grid_search_full = GridSearchCV(pipeline_full, param_grid_full, cv=5, scoring="neg_root_mean_squared_error")
grid_search_full.fit(X_train, y_train)

# Retrieve the optimal K and evaluate the model
optimal_k_full = grid_search_full.best_params_["kneighborsregressor__n_neighbors"]
print(f"Optimal K (Full Dataset): {optimal_k_full}")

best_knn_full = grid_search_full.best_estimator_
y_pred_full = best_knn_full.predict(X_test)
rmspe_full = np.sqrt(np.mean((y_test - y_pred_full) ** 2) / np.mean(y_test ** 2))
print(f"RMSPE (Full Dataset): {rmspe_full:.4f}")


# Evaluate Full Dataset with Density Labels
print("\n=== Full Dataset with Density Labels ===")
features_with_density = day_hour_connections[["hour", "day_of_week_numeric", "density_label_numeric"]]
target_with_density = day_hour_connections["connections"]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_with_density, target_with_density, train_size=0.75, random_state=42)

# Define the KNN pipeline
pipeline_density = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Perform GridSearchCV to find the optimal K
param_grid_density = {"kneighborsregressor__n_neighbors": range(1, 19)}
grid_search_density = GridSearchCV(pipeline_density, param_grid_density, cv=5, scoring="neg_root_mean_squared_error")
grid_search_density.fit(X_train, y_train)

# Retrieve the optimal K and evaluate the model
optimal_k_density = grid_search_density.best_params_["kneighborsregressor__n_neighbors"]
print(f"Optimal K (With Density): {optimal_k_density}")

best_knn_density = grid_search_density.best_estimator_
y_pred_density = best_knn_density.predict(X_test)
rmspe_density = np.sqrt(np.mean((y_test - y_pred_density) ** 2) / np.mean(y_test ** 2))
print(f"RMSPE (With Density): {rmspe_density:.4f}")


# Evaluate Cluster-wise
print("\n======= Cluster-wise Regression =======")
clusters = {
    "Low": day_hour_connections[day_hour_connections["density_label"] == "Low"],
    "Medium": day_hour_connections[day_hour_connections["density_label"] == "Medium"],
    "High": day_hour_connections[day_hour_connections["density_label"] == "High"]
}

# Initialize overall RMSPE calculation
overall_rmspe_numerator = 0
overall_rmspe_denominator = 0

for label, cluster_data in clusters.items():
    # Prepare features and target for the current cluster
    X = cluster_data[["hour", "day_of_week_numeric"]]
    y = cluster_data["connections"]

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

    # Define the KNN pipeline
    pipeline_cluster = make_pipeline(StandardScaler(), KNeighborsRegressor())

    # Perform GridSearchCV to find the optimal K
    param_grid_cluster = {"kneighborsregressor__n_neighbors": range(1, 19)}
    grid_search_cluster = GridSearchCV(pipeline_cluster, param_grid_cluster, cv=5, scoring="neg_root_mean_squared_error")
    grid_search_cluster.fit(X_train, y_train)

    # Retrieve the optimal K and evaluate the model
    optimal_k_cluster = grid_search_cluster.best_params_["kneighborsregressor__n_neighbors"]
    print(f"Optimal K (Cluster '{label}'): {optimal_k_cluster}")

    best_knn_cluster = grid_search_cluster.best_estimator_
    y_pred_cluster = best_knn_cluster.predict(X_test)
    rmspe_cluster = np.sqrt(np.mean((y_test - y_pred_cluster) ** 2) / np.mean(y_test ** 2))
    print(f"RMSPE (Cluster '{label}'): {rmspe_cluster:.4f}")

    # Accumulate for overall RMSPE calculation
    cluster_size = len(cluster_data)
    overall_rmspe_numerator += cluster_size * (rmspe_cluster ** 2)
    overall_rmspe_denominator += cluster_size

# Calculate and print overall RMSPE for Cluster-wise Regression
overall_rmspe_cluster = np.sqrt(overall_rmspe_numerator / overall_rmspe_denominator)
print(f"\nOverall RMSPE (Cluster-wise): {overall_rmspe_cluster:.4f}")

============= Full Dataset =============
Optimal K (Full Dataset): 4
RMSPE (Full Dataset): 0.5167

=== Full Dataset with Density Labels ===
Optimal K (With Density): 5
RMSPE (With Density): 0.2459

======= Cluster-wise Regression =======
Optimal K (Cluster 'Low'): 9
RMSPE (Cluster 'Low'): 0.3604
Optimal K (Cluster 'Medium'): 11
RMSPE (Cluster 'Medium'): 0.2004
Optimal K (Cluster 'High'): 4
RMSPE (Cluster 'High'): 0.1397

Overall RMSPE (Cluster-wise): 0.2581
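
The overall cluster-wise score is a size-weighted combination of the per-cluster scores: each squared RMSPE is weighted by its cluster size, the weighted values are averaged, and the square root is taken. The sketch below reproduces that arithmetic from the rounded values printed above and the cluster sizes in Table 8; the variable names are illustrative.

```python
import numpy as np

# Cluster sizes from Table 8 and per-cluster RMSPE values from the output above.
sizes = {"Low": 48, "Medium": 58, "High": 32}
scores = {"Low": 0.3604, "Medium": 0.2004, "High": 0.1397}

# Weight each squared score by its cluster size, then normalize and take the root.
weighted_sq = sum(sizes[c] * scores[c] ** 2 for c in sizes)
overall = float(np.sqrt(weighted_sq / sum(sizes.values())))
print(round(overall, 4))
```

The result is close to the reported 0.2581; any small difference comes from using the rounded per-cluster values as inputs.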


### K-Nearest Neighbors (KNN) Regression Evaluation

This project evaluates KNN regression models that predict the number of connections from `hour` and `day_of_week_numeric`, with `density_label_numeric` as an optional additional feature. To ensure a robust comparison, models are evaluated over multiple iterations, each with a fresh random train/test split, and repetitive code has been modularized for clarity.

---

### **Three Modeling Approaches**

1. **Full Dataset**  
   - Trains a single model using all data without density labels.
   - RMSPE is calculated to evaluate predictive accuracy.

2. **Full Dataset with Density Labels**  
   - Includes density labels (`Low`, `Medium`, `High`) as an additional feature.
   - Expected to improve accuracy, because the density labels are derived by clustering `connections`, the same quantity the model predicts.

3. **Cluster-Specific Models**  
   - Separate models are trained for each density cluster.
   - A weighted RMSPE is calculated for overall performance.

---

### **Modularized Functions**

1. **`evaluate_knn_once(features, target)`**  
   - Trains and evaluates a single KNN regression model.  
   - Splits data into training and testing sets, performs GridSearchCV to find the optimal `k`, and calculates RMSPE.

2. **`evaluate_knn_once_with_k(features, target)`**  
   - Similar to `evaluate_knn_once`, but additionally returns the optimal `k` value.  
   - Useful for identifying the most frequently selected `k` across iterations.

3. **`evaluate_knn_multiple(features, target, iterations)`**  
   - Runs multiple iterations of KNN regression evaluation.  
   - Calculates the average and standard deviation of RMSPE to ensure robust comparisons.
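
Once the optimal `k` from each iteration has been collected, the most frequently selected value can be found with `collections.Counter`. The sketch below uses a hypothetical list of values; in the notebook these would be the `optimal_k` results returned by `evaluate_knn_once_with_k`.

```python
from collections import Counter

# Hypothetical optimal-k values collected from repeated evaluations.
optimal_k_values = [5, 4, 5, 7, 5, 4, 5]

# Count how often each k was selected and pick the most common one.
k_counts = Counter(optimal_k_values)
most_common_k, frequency = k_counts.most_common(1)[0]
print(most_common_k, frequency)  # 5 4
```

Reporting the frequency alongside the winning `k` shows how stable the choice is across random splits.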

In [114]:
print("Figure 7: Average RMSPE and Std RMSPE")

# Function to train and evaluate KNN with a single iteration
def evaluate_knn_once(features, target, train_size=0.75, k_range=range(1, 19), cv_folds=5):
    """
    Trains and evaluates a single KNN regression model.

    Args:
        features (DataFrame): Input features (X).
        target (Series): Target variable (y).
        train_size (float): Training set size proportion.
        k_range (range): Range of K values for GridSearchCV.
        cv_folds (int): Number of cross-validation folds.

    Returns:
        float: RMSPE for the tuned model.
    """
    # Delegate to evaluate_knn_once_with_k (defined below), which performs the
    # split, grid search, and scoring, and discard the optimal K it also returns
    rmspe, _ = evaluate_knn_once_with_k(features, target, train_size, k_range, cv_folds)
    return rmspe

# Function to train and evaluate KNN with a single iteration (modified to return optimal k)
def evaluate_knn_once_with_k(features, target, train_size=0.75, k_range=range(1, 19), cv_folds=5):
    """
    Trains and evaluates a single KNN regression model and returns RMSPE and optimal K.

    Args:
        features (DataFrame): Input features (X).
        target (Series): Target variable (y).
        train_size (float): Training set size proportion.
        k_range (range): Range of K values for GridSearchCV.
        cv_folds (int): Number of cross-validation folds.

    Returns:
        tuple: RMSPE for the tuned model, optimal K.
    """
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=train_size)

    # Define the KNN pipeline
    knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

    # Perform GridSearchCV to find optimal K
    param_grid = {"kneighborsregressor__n_neighbors": k_range}
    grid_search = GridSearchCV(
        estimator=knn_pipeline,
        param_grid=param_grid,
        cv=cv_folds,
        scoring="neg_root_mean_squared_error"
    )
    grid_search.fit(X_train, y_train)

    # Evaluate tuned model
    best_knn_model = grid_search.best_estimator_
    y_pred = best_knn_model.predict(X_test)
    rmspe = np.sqrt(np.mean((y_test - y_pred) ** 2) / np.mean(y_test ** 2))
    
    # Retrieve the optimal K value
    optimal_k = grid_search.best_params_["kneighborsregressor__n_neighbors"]

    return rmspe, optimal_k

# Function to run multiple iterations and calculate average RMSPE
def evaluate_knn_multiple(features, target, iterations=50, train_size=0.75, k_range=range(1, 19), cv_folds=5):
    """
    Runs multiple iterations of KNN evaluation and calculates average RMSPE.

    Args:
        features (DataFrame): Input features (X).
        target (Series): Target variable (y).
        iterations (int): Number of iterations.
        train_size (float): Training set size proportion.
        k_range (range): Range of K values for GridSearchCV.
        cv_folds (int): Number of cross-validation folds.

    Returns:
        tuple: (list of RMSPEs, average RMSPE, std RMSPE)
    """
    rmspe_list = []
    for _ in range(iterations):
        rmspe = evaluate_knn_once(features, target, train_size, k_range, cv_folds)
        rmspe_list.append(rmspe)
    avg_rmspe = np.mean(rmspe_list)
    std_rmspe = np.std(rmspe_list)
    return rmspe_list, avg_rmspe, std_rmspe

# Convert 'density_label' to numeric values
day_hour_connections["density_label_numeric"] = day_hour_connections["density_label"].map({
    "Low": 0,
    "Medium": 1,
    "High": 2
})

# Evaluate Full Dataset
print("============= Full Dataset =============")
features_full = day_hour_connections[["hour", "day_of_week_numeric"]]
target_full = day_hour_connections["connections"]
rmspe_list_full, avg_rmspe_full, std_rmspe_full = evaluate_knn_multiple(features_full, target_full)
print(f"Average RMSPE (Full Dataset): {avg_rmspe_full:.4f}")
print(f"Std RMSPE (Full Dataset): {std_rmspe_full:.4f}")


# Evaluate Full Dataset with Density Labels and track optimal K
print("\n=== Full Dataset with Density Labels ===")
features_with_density = day_hour_connections[["hour", "day_of_week_numeric", "density_label_numeric"]]
target_with_density = day_hour_connections["connections"]

# Initialize list to store optimal K values
optimal_k_list_density = []

# Perform multiple iterations and store optimal K for each iteration
rmspe_list_density = []
for _ in range(50):  # Number of iterations
    rmspe, optimal_k = evaluate_knn_once_with_k(
        features=features_with_density,
        target=target_with_density,
        train_size=0.75,
        k_range=range(1, 19)
    )
    rmspe_list_density.append(rmspe)
    optimal_k_list_density.append(optimal_k)

# Calculate average and standard deviation for RMSPE
avg_rmspe_density = np.mean(rmspe_list_density)
std_rmspe_density = np.std(rmspe_list_density)

# Determine the most frequent optimal K value
optimal_k_density = max(set(optimal_k_list_density), key=optimal_k_list_density.count)

print(f"Average RMSPE (With Density): {avg_rmspe_density:.4f}")
print(f"Std RMSPE (With Density): {std_rmspe_density:.4f}")
print(f"Optimal K (With Density): {optimal_k_density}")


# Evaluate Cluster-wise
print("\n======= Cluster-wise Regression =======")
clusters = {
    "Low": day_hour_connections[day_hour_connections["density_label"] == "Low"],
    "Medium": day_hour_connections[day_hour_connections["density_label"] == "Medium"],
    "High": day_hour_connections[day_hour_connections["density_label"] == "High"]
}

# Initialize overall RMSPE list for Cluster-wise
overall_rmspe_list_cluster = []

# Loop over the number of iterations (matching the number of iterations in rmspe_list_full)
for i in range(len(rmspe_list_full)):  # Assuming all RMSPE lists have the same number of iterations
    overall_rmspe_numerator = 0
    overall_rmspe_denominator = 0

    # Calculate RMSPE for each cluster
    for label, cluster_data in clusters.items():
        cluster_features = cluster_data[["hour", "day_of_week_numeric"]]
        cluster_target = cluster_data["connections"]

        # Perform a single iteration of RMSPE calculation (corresponding to iteration i)
        rmspe_for_iteration = evaluate_knn_once(
            cluster_features,
            cluster_target,
            train_size=0.75,
            k_range=range(1, 19)
        )

        # Weighted RMSPE contribution for current iteration
        cluster_size = len(cluster_data)
        overall_rmspe_numerator += cluster_size * (rmspe_for_iteration ** 2)
        overall_rmspe_denominator += cluster_size

    # Append the overall RMSPE for this iteration
    overall_rmspe_list_cluster.append(np.sqrt(overall_rmspe_numerator / overall_rmspe_denominator))

# Calculate the average and standard deviation of overall RMSPE
avg_overall_rmspe_cluster = np.mean(overall_rmspe_list_cluster)
std_overall_rmspe_cluster = np.std(overall_rmspe_list_cluster)

# Print the results
print(f"\nOverall RMSPE (Cluster-wise): {avg_overall_rmspe_cluster:.4f}")
print(f"Std RMSPE (Cluster-wise): {std_overall_rmspe_cluster:.4f}")


Figure 7: Average RMSPE and Std RMSPE
Average RMSPE (Full Dataset): 0.4259
Std RMSPE (Full Dataset): 0.0520

=== Full Dataset with Density Labels ===
Average RMSPE (With Density): 0.2350
Std RMSPE (With Density): 0.0420
Optimal K (With Density): 5


Overall RMSPE (Cluster-wise): 0.3148
Std RMSPE (Cluster-wise): 0.0437


### Results Analysis

The evaluation of the K-Nearest Neighbors (KNN) models produced the following insights:

1. **Full Dataset**: The model trained on `hour` and `day_of_week_numeric` alone, without additional features or clustering, had the highest error of the three approaches (average RMSPE 0.4259).

2. **Full Dataset with Density Labels**: Adding the `density_label_numeric` feature gave the lowest average RMSPE (0.2350), roughly 45% below the full dataset baseline. This suggests that cluster-derived information improves predictive accuracy.

3. **Cluster-wise Regression**: Training and evaluating separate models for the "Low," "Medium," and "High" density clusters gave an overall RMSPE of 0.3148, lower than the full dataset approach but clearly higher than the model using density labels. Splitting by cluster appears to capture some density-specific patterns, though less effectively than giving a single model the density label as a feature.

4. **Random Variability**: Because each iteration draws a new random train/test split, the exact RMSPE values vary across runs (standard deviations of roughly 0.04 to 0.05 here). The trends and relative differences between the models are therefore more important than the exact numerical values.

This analysis highlights the benefits of including additional contextual features or leveraging clustering to improve model performance.
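
The RMSPE reported above is computed in the evaluation functions as the square root of the mean squared error divided by the mean squared target, which is a normalized RMSE rather than an average of per-observation percentage errors. The following is a minimal sketch on made-up numbers; the arrays and the helper names `normalized_rmse` and `per_point_rmspe` are illustrative and not part of the notebook. It shows how the two definitions can differ:

```python
import numpy as np

def normalized_rmse(y_true, y_pred):
    """Metric used in this notebook: sqrt(mean(err^2) / mean(y^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2) / np.mean(y_true ** 2))

def per_point_rmspe(y_true, y_pred):
    """Alternative definition: sqrt(mean((err / y)^2)); needs non-zero targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

# Hypothetical values for illustration only
y_true = [10, 20, 30]
y_pred = [12, 18, 33]

metric_notebook = normalized_rmse(y_true, y_pred)
metric_per_point = per_point_rmspe(y_true, y_pred)
print(f"Normalized RMSE (notebook definition): {metric_notebook:.4f}")
print(f"Per-observation RMSPE: {metric_per_point:.4f}")
```

The per-observation form weights errors on small targets more heavily, which would matter for hours with very few connections.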

### Altair Visualization of RMSPE

This script generates a line chart to visualize the RMSPE values across iterations for the three different models:

1. **Full Dataset**
2. **Full Dataset with Density**
3. **Cluster-wise**

By plotting RMSPE against iterations, the chart highlights the performance trends and differences between the models, enabling an intuitive comparison of their predictive accuracy.

The chart uses **Altair** for a clean and interactive representation, with tooltips for detailed inspection of data points.

In [115]:
# Prepare data for Altair
def prepare_rmspe_plot_data(rmspe_list, dataset_name):
    """
    Prepares RMSPE data for Altair visualization.
    
    Args:
        rmspe_list (list): List of RMSPE values.
        dataset_name (str): Name of the dataset.

    Returns:
        DataFrame: Prepared DataFrame for Altair.
    """
    return pd.DataFrame({
        "Iteration": list(range(1, len(rmspe_list) + 1)),
        "RMSPE": rmspe_list,
        "Dataset": dataset_name
    })

# Combine data from all RMSPE calculations
rmspe_plot_data = pd.concat([
    prepare_rmspe_plot_data(rmspe_list_full, "Full Dataset"),
    prepare_rmspe_plot_data(rmspe_list_density, "Full Dataset with Density"),
    prepare_rmspe_plot_data(overall_rmspe_list_cluster, "Cluster-wise")
])

# Create line chart using Altair
rmspe_chart = alt.Chart(rmspe_plot_data).mark_line(point=True).encode(
    x=alt.X("Iteration:Q", title="Iteration"),
    y=alt.Y("RMSPE:Q", title="RMSPE (Root Mean Squared Percentage Error)"),
    color=alt.Color("Dataset:N", title="Dataset"),
    tooltip=["Iteration", "RMSPE", "Dataset"]
).properties(
    title="Figure 8: RMSPE over Iterations for Different Datasets",
    width=800,
    height=400
)

# Display the chart
rmspe_chart


### Results Analysis

- The model with density labels performs best, both numerically and visually.
- Clustering and adding density labels significantly improve prediction accuracy.
- Results vary due to random splits but consistently favor the density label approach.
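
One way to make the run-to-run variation reproducible is to pass a fixed `random_state` to `train_test_split`. The sketch below uses toy arrays rather than the project data, and `split_with_seed` is a hypothetical helper; reusing the same list of seeds for each model would make a repeated comparison repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only (10 rows, 2 features)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

def split_with_seed(seed, train_size=0.75):
    """Return a train/test split that is fully determined by `seed`."""
    return train_test_split(X, y, train_size=train_size, random_state=seed)

# Two calls with the same seed give the same test rows
_, X_test_a, _, y_test_a = split_with_seed(0)
_, X_test_b, _, y_test_b = split_with_seed(0)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Same seed, same split:", same_split)

# One seed per iteration keeps 50 repeated evaluations reproducible
seeds = list(range(50))
```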

---
#### Interactive 3D Visualization of Connection Patterns Using KNN Regression

The relationship between the hour of the day, day of the week, and number of connections is easiest to see in a 3D surface plot. The cell below fits a KNN regression model on `hour` and `day_of_week_numeric`, using the most frequently selected number of neighbors (`optimal_k_density`), and predicts connections over a fine-grained grid. The resulting surface represents **predicted connection values**, while the overlaid scatter points show the actual data for comparison.

In [116]:
# Prepare the data
x = day_hour_connections['hour']  # X-axis: Hour of the day
y = day_hour_connections['day_of_week_numeric']  # Y-axis: Day of the week (numeric)
z = day_hour_connections['connections']  # Z-axis: Number of connections

# Create a grid for surface plot
hour_grid = np.linspace(x.min(), x.max(), 50)  # Fine-grained hours
day_grid = np.linspace(y.min(), y.max(), 50)  # Fine-grained days
hour_mesh, day_mesh = np.meshgrid(hour_grid, day_grid)

# Fit a surface using a KNN model
knn_surface_model = KNeighborsRegressor(n_neighbors=optimal_k_density)
knn_surface_model.fit(np.column_stack((x, y)), z)
z_mesh = knn_surface_model.predict(np.column_stack((hour_mesh.ravel(), day_mesh.ravel()))).reshape(hour_mesh.shape)

# Create the surface plot
surface = go.Surface(
    x=hour_mesh,
    y=day_mesh,
    z=z_mesh,
    colorscale="Viridis",
    showscale=True,
    colorbar=dict(title="Connections")
)

# Create scatter points for actual data
scatter = go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=5,
        color=z,
        colorscale='Viridis',
        opacity=0.8
    ),
    name="Actual Data Points"
)

# Define day labels for better readability
day_labels = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 
              4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

# Customize the layout with readable Y-axis labels
fig = go.Figure(data=[surface, scatter])
fig.update_layout(
    title="Figure 9: Interactive 3D Visualization of Connections",
    scene=dict(
        xaxis=dict(title="Hour of the Day"),
        yaxis=dict(
            title="Day of the Week",
            tickvals=list(day_labels.keys()),  # Numeric tick values
            ticktext=list(day_labels.values())  # Corresponding day names
        ),
        zaxis=dict(title="Number of Connections"),
    ),
    width=900,
    height=700
)

# Show the interactive plot
fig.show()


---
#### Predicting and Visualizing Connection Patterns with KNN Regression and Density Levels
We train a KNN regression model to predict connection counts from the hour of the day, day of the week, and density level (Low, Medium, High). We then evaluate its performance using **RMSPE** and generate predictions for specific scenarios. A 3D surface plot visualizes how predicted connections vary by **time** and **density**, with a separate surface for **each density level**, offering an interactive view of **temporal and density-based patterns**.

In [117]:
# Define features and target
features_with_density = day_hour_connections[["hour", "day_of_week_numeric", "density_label_numeric"]]
target_with_density = day_hour_connections["connections"]

# Split data into training (75%) and testing (25%) sets
X_train, X_test, y_train, y_test = train_test_split(features_with_density, target_with_density, train_size=0.75, random_state=42)

# Create and train the KNN pipeline
knn_pipeline_with_density = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=optimal_k_density))
knn_pipeline_with_density.fit(X_train, y_train)

# Check performance on the test data
y_pred_test = knn_pipeline_with_density.predict(X_test)
rmspe_test = np.sqrt(np.mean((y_test - y_pred_test) ** 2) / np.mean(y_test ** 2))
print(f"Test RMSPE (Full Dataset with Density): {rmspe_test:.4f}")

# Map day numeric values to names
day_labels = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday", 4: "Friday", 5: "Saturday", 6: "Sunday"}

# Map density numeric values to descriptive labels
density_labels = {0: "Low", 1: "Medium", 2: "High"}

# Generate new data for predictions with descriptive density labels
new_data = pd.DataFrame({
    "hour": [8, 12, 18],  # Example hours: 8 AM, 12 PM (noon), 6 PM
    "day_of_week_numeric": [0, 3, 5],  # Example days: Monday, Thursday, Saturday
    "density_label_numeric": [1, 2, 0]  # Example densities: Medium, High, Low
})

# Replace numeric density values with labels
new_data["density_label"] = new_data["density_label_numeric"].map(density_labels)

# Perform predictions
predicted_connections = knn_pipeline_with_density.predict(new_data[["hour", "day_of_week_numeric", "density_label_numeric"]])

# Output prediction results with descriptive density labels
for i, pred in enumerate(predicted_connections):
    print(f"Hour: {new_data['hour'][i]}, Day: {day_labels[new_data['day_of_week_numeric'][i]]}, "
          f"Density: {new_data['density_label'][i]} -> Predicted Connections: {pred:.2f}")

# Generate grid for visualization
hour_grid = np.linspace(0, 23, 50)  # Hours from 0 to 23 (50 points)
day_grid = np.linspace(0, 6, 7)  # Days of the week (Monday=0 to Sunday=6)
density_grid = np.array([0, 1, 2])  # Density levels: Low, Medium, High

# Create mesh grid (Hour, Day, Density)
hour_mesh, day_mesh, density_mesh = np.meshgrid(hour_grid, day_grid, density_grid, indexing='ij')

# Create a DataFrame for grid predictions
grid_data = pd.DataFrame({
    "hour": hour_mesh.ravel(),
    "day_of_week_numeric": day_mesh.ravel(),
    "density_label_numeric": density_mesh.ravel()
})

# Replace numeric density values with labels in grid data
grid_data["density_label"] = grid_data["density_label_numeric"].map(density_labels)

# Perform predictions for the grid
predicted_connections = knn_pipeline_with_density.predict(grid_data[["hour", "day_of_week_numeric", "density_label_numeric"]])
predicted_grid = predicted_connections.reshape(hour_mesh.shape)

# Create 3D surface plot
fig = go.Figure()

for i, density_label in enumerate(["Low", "Medium", "High"]):
    fig.add_trace(go.Surface(
        x=hour_mesh[:, :, i],  # Hour values for this density level
        y=day_mesh[:, :, i],  # Day values for this density level
        z=predicted_grid[:, :, i],  # Predicted connections for this density level
        colorscale="Viridis",  # Color scheme for visualization
        showscale=(i == 2),  # Show the color scale only once, on the last (High) surface
        colorbar=dict(title="Connections", len=0.7, x=1.1),  # Adjust color bar size and position
        name=f"Density: {density_label}"  # Label for the surface
    ))

# Configure layout settings
fig.update_layout(
    title="Figure 10: Predicted Connections by Hour, Day, and Density",
    scene=dict(
        xaxis=dict(title="Hour of the Day", range=[0, 23]),  # X-axis title and range
        yaxis=dict(
            title="Day of the Week",
            tickvals=list(day_labels.keys()),  # Numeric tick values
            ticktext=list(day_labels.values())  # Corresponding day names
        ),
        zaxis=dict(title="Connections")  # Z-axis title
    ),
    width=900,  # Plot width
    height=700  # Plot height
)

# Display the interactive 3D plot
fig.show()


Test RMSPE (Full Dataset with Density): 0.2459
Hour: 8, Day: Monday, Density: Medium -> Predicted Connections: 12.80
Hour: 12, Day: Thursday, Density: High -> Predicted Connections: 20.80
Hour: 18, Day: Saturday, Density: Low -> Predicted Connections: 4.40


---
#### Identifying Conditions for Highest and Lowest Predicted Connections

To determine the conditions under which the highest and lowest connections are expected, we map numeric density values to descriptive labels (Low, Medium, High) and use `np.argmax` and `np.argmin` to find the indices of the maximum and minimum predictions in the `predicted_connections` array. Using these indices, we retrieve the associated hour, day, and density from the prediction grid and format the results into a readable summary.

In [118]:
# Map numeric density values to descriptive labels
density_labels = {0: "Low", 1: "Medium", 2: "High"}

# Find the maximum and minimum predicted connections
max_idx = np.argmax(predicted_connections)  # Index of the maximum predicted value
min_idx = np.argmin(predicted_connections)  # Index of the minimum predicted value

max_value = predicted_connections[max_idx]  # Maximum predicted value
min_value = predicted_connections[min_idx]  # Minimum predicted value

# Retrieve the corresponding hour, day, and density for the max and min predictions
max_hour = grid_data.iloc[max_idx]  # Data corresponding to the maximum value
min_hour = grid_data.iloc[min_idx]  # Data corresponding to the minimum value

# Print the results in a readable format
print(f"Table 9: Highest Predicted Connections: {max_value:.2f} at Hour {max_hour['hour']:.2f}, "
      f"Day {day_labels[int(max_hour['day_of_week_numeric'])]}, "
      f"Density {density_labels[int(max_hour['density_label_numeric'])]}")

print(f"Table 10: Lowest Predicted Connections: {min_value:.2f} at Hour {min_hour['hour']:.2f}, "
      f"Day {day_labels[int(min_hour['day_of_week_numeric'])]}, "
      f"Density {density_labels[int(min_hour['density_label_numeric'])]}")

Table 9: Highest Predicted Connections: 24.00 at Hour 1.88, Day Tuesday, Density High
Table 10: Lowest Predicted Connections: 1.40 at Hour 7.04, Day Thursday, Density Low
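
As a cross-check on the flat-index lookup above, `np.unravel_index` recovers the per-axis grid positions of an extreme value directly. The sketch uses a small made-up grid and a synthetic prediction array, not the model's output; only the `indexing="ij"` meshgrid layout mirrors the notebook.

```python
import numpy as np

# Small hypothetical axes (the notebook uses 50 hours x 7 days x 3 densities)
hours = np.array([0, 6, 12, 18])
days = np.array([0, 1, 2])
densities = np.array([0, 1])

hour_mesh, day_mesh, density_mesh = np.meshgrid(hours, days, densities, indexing="ij")

# Synthetic "predictions" that grow along every axis, flattened like model output
flat_predictions = (0.1 * hour_mesh + 1.0 * day_mesh + 5.0 * density_mesh).ravel()

# Convert the flat argmax/argmin back into (hour, day, density) positions
max_pos = np.unravel_index(np.argmax(flat_predictions), hour_mesh.shape)
min_pos = np.unravel_index(np.argmin(flat_predictions), hour_mesh.shape)

max_condition = (hours[max_pos[0]], days[max_pos[1]], densities[max_pos[2]])
min_condition = (hours[min_pos[0]], days[min_pos[1]], densities[min_pos[2]])
print("Max at (hour, day, density):", max_condition)
print("Min at (hour, day, density):", min_condition)
```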


### **Discussion**

#### **Summarize What You Found**

This project analyzed server connection patterns and optimized resource allocation using three distinct modeling approaches: **Full Dataset Model**, **Full Dataset with Density Labels Model**, and **Cluster-Wise Models**. Each model offered unique insights into user behavior and server usage trends, providing valuable data for actionable strategies.

---

#### **Connection Patterns from EDA**  

1. **Time-Based Trends**:  
   - Peak activity occurred during **late-night hours (11:30 PM - 4:30 AM)**, with the highest total connections recorded on **Saturday at 2:00 AM**.  
   - Low activity was observed during **weekday mornings (6:00 AM - 3:00 PM)**, particularly on Thursdays.  

2. **Day-Based Trends**:  
   - **Saturday** exhibited the highest server usage, reflecting increased leisure activity on weekends.  
   - Unexpectedly, **Friday** showed the lowest connections, potentially influenced by external factors like fatigue or social obligations.  

---

#### **Predictive Modeling**

##### **1. Full Dataset Model**
- This model used the original dataset without any clustering or `density_label`.  
- It treated all data uniformly, failing to account for the variability in user density.  
- **Performance**:  
  - **RMSPE: 0.43** (average over 50 iterations), the lowest accuracy among all models, highlighting the importance of including density information.

##### **2. Full Dataset with Density Labels Model**  
- This approach introduced `density_label` (`Low`, `Medium`, `High`) as an additional feature created via clustering.  
- The single KNN model could leverage density as contextual information, improving prediction accuracy.  
- **Performance**:  
  - **RMSPE: 0.23**, the highest accuracy achieved.  
  - **Optimal K: 5**, balancing model complexity and prediction quality.

##### **3. Cluster-Wise Models**  
- The dataset was divided into three clusters (`Low`, `Medium`, `High`) using clustering. Separate KNN models were then trained for each cluster.  
- This allowed the model to specialize in different density levels but introduced potential data imbalance and fragmentation.  
- **Performance**:  
  - **RMSPE: 0.31**, less accurate than the `density_label` model but more accurate than the full dataset model.  
  - Imbalanced cluster sizes and potential information loss during clustering likely impacted results.
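
The overall cluster-wise figure is a size-weighted combination of the per-cluster errors: each cluster's squared RMSPE is weighted by its number of rows before taking the square root. A small sketch with made-up cluster sizes and errors (not the project's values; `pooled_rmspe` is an illustrative helper name) shows how the largest cluster dominates the pooled value:

```python
import numpy as np

def pooled_rmspe(sizes, rmspes):
    """Combine per-cluster RMSPEs as sqrt(sum(n_i * r_i^2) / sum(n_i))."""
    sizes = np.asarray(sizes, dtype=float)
    rmspes = np.asarray(rmspes, dtype=float)
    return np.sqrt(np.sum(sizes * rmspes ** 2) / np.sum(sizes))

# Hypothetical cluster sizes and per-cluster RMSPEs (Low, Medium, High)
cluster_sizes = [100, 50, 18]
cluster_rmspes = [0.30, 0.20, 0.40]

pooled = pooled_rmspe(cluster_sizes, cluster_rmspes)
print(f"Pooled RMSPE: {pooled:.4f}")
```

With these numbers the pooled value lands close to the error of the largest cluster, so a small cluster with a poor fit has limited influence on the headline figure.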

---

#### **Discuss Whether This Is What You Expected to Find**

##### **1. Expected Results**
- **High Activity During Late-Night Hours**:  
  - Late-night peaks and weekend activity aligned with expectations, reflecting user leisure patterns.  
  - The Full Dataset with Density Labels Model performed as expected, with the inclusion of density enhancing prediction accuracy.  

- **Low Activity During Daytime**:  
  - The drop in activity during 6:00 AM to 3:00 PM matched predictions, aligning with typical work and school schedules.

##### **2. Unexpected Results**
- **Friday’s Low Activity**:  
  - Contrary to expectations, Friday showed the lowest total connections, despite being close to the weekend.  
  - Possible causes include social commitments or fatigue at the end of the workweek.

- **Cluster-Wise Model Performance**:  
  - While clustering was expected to improve accuracy by reducing noise, it underperformed compared to the `density_label` model.  
  - Imbalanced cluster sizes and information loss during clustering likely contributed to this discrepancy.

---

#### **Discuss What Impact Could Such Findings Have**

##### **1. Server Resource Optimization**
- **Dynamic Resource Allocation**:  
  - Leveraging predictions, server resources can be dynamically adjusted:  
    - **High-Density Times (e.g., 1:00 AM - 4:00 AM)**: Allocate resources for up to **30 users**, ensuring smooth performance during peak hours.  
    - **Medium-Density Times (e.g., 6:00 AM - 3:00 PM)**: Scale down to handle up to **15 users**, optimizing costs without compromising service.  
    - **Event-Based Adjustments**: Preemptively provision for one density level higher during special promotions or events.
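
A rough sketch of how a predicted connection count could be turned into one of the capacity tiers above. The 15- and 30-user tiers come from this section; the 25% headroom factor and the helper name `plan_capacity` are assumptions made for illustration.

```python
import math

# Capacity tiers taken from the text above; the headroom factor is an assumption
CAPACITY_TIERS = (15, 30)

def plan_capacity(predicted_connections, headroom=0.25, tiers=CAPACITY_TIERS):
    """Return the smallest tier covering the prediction plus headroom.

    Falls back to the largest tier when demand exceeds every tier.
    """
    required = math.ceil(predicted_connections * (1 + headroom))
    for tier in sorted(tiers):
        if required <= tier:
            return tier
    return max(tiers)

# Predicted connections reported earlier in this notebook
for predicted in (4.40, 12.80, 20.80):
    print(f"Predicted {predicted:.2f} connections -> provision for {plan_capacity(predicted)} users")
```

A production version would likely signal when demand exceeds the largest tier instead of silently capping it.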

##### **2. Enhanced User Experience**
- **Reliable Performance**:  
  - Prevents server crashes or lag during high-demand periods, ensuring seamless gameplay.  
- **Off-Peak Engagement**:  
  - Incentives for off-peak activity (e.g., weekday mornings) can attract users and balance server load.

##### **3. Broader Applications**
- **Game Design**:  
  - Timing updates or tournaments during peak periods can maximize user engagement.  
- **Cost Efficiency**:  
  - Dynamic scaling reduces operational costs while maintaining user satisfaction.

---

#### **Discuss What Future Questions Could This Lead To**

##### **1. Expanding the Dataset**
- **How do connection patterns vary across multiple weeks or seasons?**  
  - Longer-term data collection could reveal recurring patterns or anomalies.  

- **What additional features could enhance predictions?**  
  - Variables like holidays, in-game events, and exam periods could provide richer context for prediction.

##### **2. Improving Predictive Models**
- **Can advanced machine learning models improve predictions?**  
  - Models like Random Forests or Neural Networks could better handle non-linear patterns.  

- **How can temporal trends be incorporated?**  
  - Time-series analysis could model recurring daily or weekly cycles more effectively.
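
One simple way to expose daily and weekly cycles to a distance-based model such as KNN, short of a full time-series model, is to encode `hour` and `day_of_week_numeric` as sine/cosine pairs so that 23:00 and 00:00 end up close together. This is a sketch of an idea the project did not test; the helper `add_cyclical_features` and the toy frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df, column, period):
    """Append sine/cosine encodings of a periodic numeric column."""
    angle = 2 * np.pi * df[column] / period
    out = df.copy()
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out

# Toy frame using the notebook's column names
toy = pd.DataFrame({"hour": [0, 12, 23], "day_of_week_numeric": [0, 3, 6]})
encoded = add_cyclical_features(toy, "hour", period=24)
encoded = add_cyclical_features(encoded, "day_of_week_numeric", period=7)

def hour_distance(row_a, row_b):
    """Euclidean distance between two rows in the encoded hour space."""
    cols = ["hour_sin", "hour_cos"]
    a = encoded.loc[row_a, cols].to_numpy(dtype=float)
    b = encoded.loc[row_b, cols].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

print("Distance 23:00 vs 00:00:", round(hour_distance(2, 0), 3))
print("Distance 12:00 vs 00:00:", round(hour_distance(1, 0), 3))
```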

##### **3. Understanding User Behavior**
- **What motivates connection patterns for different player groups?**  
  - Analysis by demographics (e.g., age, skill level) could uncover distinct usage behaviors.  

- **How can engagement during off-peak times be increased?**  
  - Strategies like tailored rewards or time-specific events could attract more users.

---

### **Conclusion**
This project demonstrated the value of combining EDA, clustering, and KNN regression to predict server connection patterns.  
By leveraging these insights, server administrators can dynamically manage resources, enhance user satisfaction, and reduce operational costs.  
Future research should focus on expanding datasets, exploring advanced models, and understanding user behavior to refine these strategies further.

### **References**

- Timbers, T., Campbell, T., Lee, M., Ostblom, J., & Heagy, L. (2022). *Data Science: A First Introduction with Python*. Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Retrieved from [https://datasciencebook.ca](https://datasciencebook.ca)

- Plotly Technologies Inc. (2015). *Collaborative data science*. Plotly Technologies Inc. Retrieved from [https://plot.ly](https://plot.ly)