# Data Science Final Project Report
**Names**: David Liu, JunHyun Kim, Layni Janzen, Sydney Lee

**Group**: 10

---

In [9]:
### Run this cell before continuing.
import pandas as pd
import altair as alt
from IPython.display import display  # lets display() work outside a notebook too

---
## **Introduction**

A research group in UBC Computer Science is studying how individuals play video games. Through their Minecraft web server, they have collected two datasets containing information about players and their session history. Hosting a web server requires resource management, and the research group wants to determine how best to allocate those resources.

In this project, we use these two datasets to answer the question: "How can we predict and optimize server resource allocation based on the number of concurrent players in each session?"

The two datasets are:

1. **`players.csv`** – Contains information about the players, including their experience level, subscription status, and hours played.
2. **`sessions.csv`** – Contains session details, such as the start and end times of each player's session.

### **1. `players.csv` Dataset**

The `players.csv` dataset provides detailed information about the players, which includes their **experience level**, **subscription status**, **gameplay hours**, and basic demographic information.

#### Main columns in `players.csv`:
- **`experience`**: The experience level of the player (e.g., "Pro", "Veteran", "Regular", "Amateur").
- **`subscribe`**: A boolean value indicating whether the player has a subscription (`True` or `False`).
- **`hashedEmail`**: A unique identifier for the player, stored as a hashed email address.
- **`played_hours`**: The total number of hours the player has spent playing the game.
- **`name`**: The player's name.
- **`gender`**: The player's gender.
- **`age`**: The player's age.
- **`individualId`**: An identifier for the player, possibly an internal ID for tracking purposes.
- **`organizationName`**: The name of the player's organization (if applicable).

#### Key Information from `players.csv`:
- **Number of Rows**: 196 players.
- **Number of Columns**: 9 attributes per player.
- **Data Types**: The dataset contains categorical (e.g., `experience`, `gender`), boolean (`subscribe`), and numerical (e.g., `played_hours`, `age`) data.
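
The row count, column count, and data types listed above can be checked directly in pandas. The following is a minimal sketch on a hypothetical three-row stand-in for `players.csv` (the name `toy_players` and the reduced set of columns are assumptions made for illustration, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for players.csv; values are illustrative only.
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "subscribe": [True, False, True],
    "played_hours": [30.3, 0.0, 0.7],
    "age": [9, 17, 21],
})

# shape gives (number of rows, number of columns).
n_rows, n_cols = toy_players.shape
print(f"{n_rows} rows, {n_cols} columns")

# dtypes lists the inferred type of every column.
print(toy_players.dtypes)

# Separate numerical columns from categorical/boolean ones.
numeric_cols = toy_players.select_dtypes(include="number").columns.tolist()
other_cols = toy_players.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numeric_cols)
print("other:", other_cols)
```

Running the same three calls on `players_df` would confirm the figures reported for the full dataset.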

This dataset is important for understanding **player behavior** and how factors like **experience** and **age** might influence the number of connections (active players) at different times of the day.

### **2. `sessions.csv` Dataset**

The `sessions.csv` dataset contains information about the sessions in which players were actively playing the game. It records **session start and end times**, and is crucial for determining how long players are active and when their sessions occur.

#### Main columns in `sessions.csv`:
- **`hashedEmail`**: The player's unique identifier (same as in `players.csv`).
- **`start_time`**: The start time of the player's session, recorded as a string in the format `DD/MM/YYYY HH:MM`.
- **`end_time`**: The end time of the player's session, also recorded as a string in the same format as `start_time`.
- **`original_start_time`**: The session's start time as a Unix timestamp. The magnitude of the values (about 1.7 × 10^12) suggests milliseconds since the epoch.
- **`original_end_time`**: The session's end time as a Unix timestamp, in the same units as `original_start_time`.
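
To illustrate the two time representations described above, here is a small sketch that parses one day-first string and one Unix value. The values are copied from the first row of Table 2. Treating the numeric column as milliseconds is an assumption based only on the size of the numbers, and the variable names are hypothetical:

```python
import pandas as pd

# Illustrative values copied from the first row shown in Table 2.
start_text = "30/06/2024 18:12"
original_start = 1719770000000.0

# Parse the day-first string with an explicit format to avoid
# any day/month ambiguity.
parsed_start = pd.to_datetime(start_text, format="%d/%m/%Y %H:%M")

# ASSUMPTION: the numeric column counts milliseconds since 1970-01-01 UTC.
from_unix = pd.to_datetime(original_start, unit="ms")

print(parsed_start)
print(from_unix)
```

The two results need not match to the minute: the Unix values shown in the data appear to be rounded, and the source does not state the time zone of the string columns.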

#### Key Information from `sessions.csv`:
- **Number of Rows**: 1535 session entries.
- **Number of Columns**: 5 attributes per session.
- **Data Types**: The dataset includes categorical data (e.g., `hashedEmail`), datetime data stored as strings (`start_time`, `end_time`), and numerical timestamps (`original_start_time`, `original_end_time`).

This dataset is critical for determining **when players are active** and for identifying **peak hours** and **days of the week** for player activity. The session duration (calculated from `start_time` and `end_time`) also helps identify **intense activity periods** and estimate **server load**.
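
Because the research question concerns concurrent players, it may help to see one way of turning start and end times into a concurrency count. The sketch below uses a simple event sweep (+1 at each start, -1 at each end, then a running sum) on hypothetical sessions. It is an illustration under those assumptions, not necessarily the method used in this report:

```python
import pandas as pd

# Hypothetical sessions; times are illustrative only.
toy = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2024-06-30 18:00", "2024-06-30 18:10", "2024-06-30 18:30"]),
    "end_time": pd.to_datetime(
        ["2024-06-30 18:20", "2024-06-30 18:40", "2024-06-30 18:50"]),
})

# Each session contributes an arrival (+1) and a departure (-1).
starts = pd.DataFrame({"time": toy["start_time"], "change": 1})
ends = pd.DataFrame({"time": toy["end_time"], "change": -1})
events = pd.concat([starts, ends], ignore_index=True)

# Sort by time; on ties, departures (-1) are processed before arrivals (+1).
events = events.sort_values(["time", "change"]).reset_index(drop=True)

# The running sum is the number of players online after each event.
events["concurrent"] = events["change"].cumsum()
peak_concurrent = int(events["concurrent"].max())
print(events)
print("peak concurrent players:", peak_concurrent)
```

The peak of the running sum is a natural candidate for sizing server capacity, since it is the largest number of simultaneous connections observed.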

---


## **Methods & Results**

### **Data Cleaning and Wrangling**

The datasets were cleaned and preprocessed as follows:

- **Datetime Conversion**: The `start_time` and `end_time` columns in the `sessions.csv` dataset were converted to proper `datetime` format to allow time-based analysis.
- **Session Duration**: We calculated the **session duration** by subtracting the `start_time` from the `end_time` to get the total duration of each session in minutes.
- **Handling Missing Data**: Rows with missing `end_time` values were dropped, as the session duration could not be calculated without this information.
- **Feature Extraction**: We extracted the **hour** and **day of the week** from the `start_time` column to enable analysis of player activity patterns by time.
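
The four steps above can be sketched as follows. This is a minimal example on hypothetical rows in the same `DD/MM/YYYY HH:MM` format. The names `toy_sessions`, `cleaned`, and `duration_min` are assumptions introduced for illustration:

```python
import pandas as pd

# Hypothetical rows in the same string format as sessions.csv.
toy_sessions = pd.DataFrame({
    "hashedEmail": ["a", "b", "c"],
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
    "end_time": ["30/06/2024 18:24", None, "25/07/2024 17:57"],
})

# Handling missing data: drop sessions without an end time.
cleaned = toy_sessions.dropna(subset=["end_time"]).copy()

# Datetime conversion: parse the day-first strings explicitly.
fmt = "%d/%m/%Y %H:%M"
cleaned["start_time"] = pd.to_datetime(cleaned["start_time"], format=fmt)
cleaned["end_time"] = pd.to_datetime(cleaned["end_time"], format=fmt)

# Session duration in minutes.
elapsed = cleaned["end_time"] - cleaned["start_time"]
cleaned["duration_min"] = elapsed.dt.total_seconds() / 60

# Feature extraction: hour of day and day of the week of the start time.
cleaned["hour"] = cleaned["start_time"].dt.hour
cleaned["day_of_week"] = cleaned["start_time"].dt.day_name()

print(cleaned[["start_time", "duration_min", "hour", "day_of_week"]])
```

Passing an explicit `format` matters here: without it, a date such as `05/06/2024` could be read as either 5 June or 6 May.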

We then aggregated the data by **hour**, by **day of the week**, and by **hour and day of the week** together to identify trends in player activity, which is essential for deciding when additional server resources should be allocated.
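
As a rough illustration of these three aggregations, the sketch below groups a hypothetical cleaned table by hour, by day, and by both. The column names `hour`, `day_of_week`, and `duration_min` are assumed to come from an earlier feature-extraction step, and the values are invented:

```python
import pandas as pd

# Hypothetical cleaned sessions with extracted features.
toy = pd.DataFrame({
    "hour": [18, 18, 23, 17],
    "day_of_week": ["Sunday", "Sunday", "Monday", "Thursday"],
    "duration_min": [12.0, 36.0, 13.0, 23.0],
})

# Sessions and mean duration per hour of day.
by_hour = (
    toy.groupby("hour")
    .agg(sessions=("duration_min", "count"),
         mean_duration=("duration_min", "mean"))
    .reset_index()
)

# Sessions per day of the week.
by_day = toy.groupby("day_of_week").size().rename("sessions").reset_index()

# Sessions per (day, hour) pair, suitable for a heatmap.
by_day_hour = (
    toy.groupby(["day_of_week", "hour"]).size().rename("sessions").reset_index()
)

print(by_hour)
print(by_day)
print(by_day_hour)
```

The long-format result of the last grouping has one row per day-hour combination, which is the shape that Altair expects for a heatmap encoding.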


In [16]:
# Load the players dataset
players_df = pd.read_csv('https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz')

# Display the first few rows of the dataset
print("Table 1: First few rows of players.csv:")
display(players_df.head())

Table 1: First few rows of players.csv:


| Unnamed: 0 | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |


In [17]:
# Load the sessions dataset
sessions_df = pd.read_csv('https://drive.google.com/uc?id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB')

# Display the first few rows of the dataset
print("Table 2: First few rows of sessions.csv:")
display(sessions_df.head())

Table 2: First few rows of sessions.csv:


| Unnamed: 0 | hashedEmail | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|---|
| 0 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 1 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| 2 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3... | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| 3 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 4 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |


## Discussion

## References