# DSCI 100 Group 37: Final Project

Raymond Lan, Varun Raval, Tianna Wong, Brendon Yih

# Predicting Experience Level Using Age and Hours Played

## Introduction

In this project, we work with game data collected from PLAIcraft, a virtual world run by a Computer Science research group at UBC led by Frank Wood. PLAIcraft’s primary goal is “to create an advanced artificial intelligence (AI) that can act in a human-like way in a simulated setting” (PLAI, 2025). The data consists of two files, `players.csv` and `sessions.csv`; our analysis requires only `players.csv`. We analyze, model, and visualize the data to explore relationships between the variables used. The researchers want to determine what kinds of players contribute a large amount of data so that they can target these players in their recruiting efforts. To explore this goal, we investigate whether age and playing hours can be used to predict a player's experience level. The relationship between these three variables (experience level, age, and playing hours) may help identify highly engaged players, who are more likely to contribute the most data. Analyzing highly engaged players can in turn help the researchers understand player behaviour, which will be used to “train and develop an advanced AI” (PLAI, 2025) that can interact in PLAIcraft more naturally.


### Dataset Description 

The player dataset contains one row per player. There are 196 observations (one for each player) and 7 variables that describe each player's characteristics and in-game behaviour.


|Variable        |Type       |Description of Variable                       |
|:---------------|:----------|:---------------------------------------------|
|experience      |Character  |Experience level of a player                  |
|subscribe       |Character  |If the player is subscribed to the newsletter |
|hashedEmail     |Character  |Player's unique hashed email                  |
|played_hours    |Double     |Number of hours played                        |
|name            |Character  |Name of player                                |
|gender          |Character  |Gender of player                              |
|Age             |Double     |Age of player                                 |

# Methods and Results

## Loading the Data into Jupyter

Only `players.csv` is loaded below, as it is the only file we need for our analysis.

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
# Initial data loading

player_url <- "https://raw.githubusercontent.com/tiannawong/dsci100-individual-project-/refs/heads/main/players.csv"

player_data <- read_csv(player_url)
head(player_data)

The first six rows of `players.csv` are shown above.

## Wrangling the Data

Next, we keep only the columns needed for our question: whether a player's experience level can be predicted from their age and playing time. We simplify the data by selecting `experience`, `played_hours`, and `Age`, and dropping rows with missing values. Because `experience` is a categorical variable, we convert it to a factor using the `as_factor` function.

In [None]:
# Wrangling data to use only columns that are needed

select_player_data <- player_data |>
    mutate(experience = as_factor(experience)) |>
    select(experience, Age, played_hours) |>
    drop_na()
head(select_player_data)
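
One detail of `as_factor` is worth knowing: for character input it orders the factor levels by first appearance, whereas base R's `factor` sorts them alphabetically. The sketch below illustrates the difference on a made-up vector of labels (`toy_experience` is hypothetical and not taken from `players.csv`).

```r
library(forcats)

# Hypothetical labels for illustration only; not taken from players.csv
toy_experience <- c("Pro", "Veteran", "Amateur", "Pro")

# as_factor() keeps the levels in order of first appearance
levels_forcats <- levels(as_factor(toy_experience))

# Base factor() sorts the levels alphabetically
levels_base <- levels(factor(toy_experience))

print(levels_forcats)
print(levels_base)
```

The level order determines the order of bars and legend entries in the plots; it does not change which players belong to which level.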

### Summary Statistics

In [None]:
# Compute the mean of played_hours and Age across all players
mean_values <- select_player_data |>
    select(played_hours, Age) |>
    map_df(mean, na.rm = TRUE)
mean_values
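
Overall means can hide differences between experience levels. As a possible complement, the same means can be computed per level with `group_by` and `summarise`. The sketch below runs on a small invented tibble (`toy_players` is a hypothetical stand-in for `select_player_data`) so that it is self-contained.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for select_player_data; all values are invented
toy_players <- tibble(
    experience   = factor(c("Pro", "Pro", "Amateur", "Amateur", "Veteran")),
    Age          = c(20, 30, 18, 22, 25),
    played_hours = c(1.0, 3.0, 0.5, 1.5, 0.0)
)

# Number of players, mean age, and mean playing time within each experience level
toy_means <- toy_players |>
    group_by(experience) |>
    summarise(
        n                 = n(),
        mean_age          = mean(Age),
        mean_played_hours = mean(played_hours)
    )
toy_means
```

Replacing `toy_players` with `select_player_data` would give the per-level values for the real data.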

## Exploratory Data Analysis and Visualizations

In this project, we want to use K-nearest neighbors (KNN) classification to predict a new player's experience level from their age and playing hours. Before training a model, we visualize the data to get a better understanding of what we are working with. The graphs below show different aspects of the data.

In [None]:
# Create bar graph with player count of each experience level

select_player_data_bar <- select_player_data |>
    ggplot(aes(x = experience, fill = experience)) +
    geom_bar(stat = "count") +
    labs(x = "Experience Level", y = "Number of Players", title = "Fig 1: The Distribution of Different Experience Levels", fill = "Experience Level")
select_player_data_bar

Explanation of Visualization: Figure 1 is a bar graph that shows the number of players at each experience level. Amateur players form the largest group, followed by veterans. Regular and beginner players are similar in number, and pro players make up the smallest group.

In [None]:
# Create scatterplot of Age vs. playing time (in minutes), coloured by experience level

select_player_data_plot <- select_player_data |>
    mutate(played_mins = played_hours * 60) |>
    ggplot(aes(x = Age, y = played_mins)) +
    ylim(0, 360) + # players with more than 360 minutes are dropped from the plot
    geom_point(aes(color = experience)) +
    labs(x = "Age (years)", y = "Time played (minutes)", color = "Experience Level", title = "Fig 2: The Relationship Between Age and Time Played")
select_player_data_plot

Explanation of Visualization: Figure 2 is a scatter plot of playing time (in minutes) against age, with points coloured by experience level. To keep the graph readable, we limited the y-axis to 6 hours (360 minutes), so players with more playing time are not shown. From the graph, we are not able to pick out a clear pattern between age and playing time.
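
Because `ylim(0, 360)` removes observations outside the range rather than simply zooming in, it can be useful to know how many players the cap hides. A minimal sketch of that check follows, using invented values (`toy_players` and `cap_minutes` are hypothetical names introduced for illustration).

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for select_player_data; all values are invented
toy_players <- tibble(
    Age          = c(17, 21, 19, 24),
    played_hours = c(0.2, 1.5, 7.0, 30.0)
)

# The y-axis cap used in the scatter plot, in minutes
cap_minutes <- 360

# Count how many players fall above the cap and would be dropped from the plot
hidden <- toy_players |>
    mutate(played_mins = played_hours * 60) |>
    summarise(
        n_total  = n(),
        n_hidden = sum(played_mins > cap_minutes)
    )
hidden
```

If the hidden share turns out to be large, `coord_cartesian(ylim = c(0, 360))` zooms without discarding data, and `scale_y_log10()` is another option, although it cannot display players with zero playing time.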

## Training and Modeling the Data

To ensure that our analysis is reproducible, we set the seed to 123. In this section, we begin training and modelling the player data so that we can predict the experience level of a new data point.

In [None]:
# Split the data into 75% training and 25% testing, stratified by experience level
set.seed(123)
player_split <- select_player_data |>
    drop_na()|>
    initial_split(prop = 0.75, strata = experience)
player_training <- training(player_split)
player_testing <- testing(player_split)

head(player_training)
head(player_testing)
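
The `strata = experience` argument asks `initial_split` to sample within each experience level, so the class proportions stay roughly the same in the training and testing sets. The sketch below checks this on an invented, imbalanced data frame (`toy_players` and its class counts are assumptions made for illustration).

```r
library(rsample)
library(dplyr)
library(tibble)

set.seed(123)

# Hypothetical imbalanced data: 60 Amateur, 24 Veteran, 16 Pro (invented counts)
toy_players <- tibble(
    experience = factor(rep(c("Amateur", "Veteran", "Pro"), times = c(60, 24, 16))),
    Age        = sample(15:40, 100, replace = TRUE)
)

# Same split settings as the project: 75% training, stratified by experience
toy_split <- initial_split(toy_players, prop = 0.75, strata = experience)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Class proportions in each set; these should be close to 0.60 / 0.24 / 0.16
train_props <- count(toy_training, experience) |> mutate(prop = n / sum(n))
test_props <- count(toy_testing, experience) |> mutate(prop = n / sum(n))
train_props
test_props
```

Running the same two `count` calls on `player_training` and `player_testing` would confirm that the real split preserves the distribution shown in Figure 1.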

In [None]:
# Create the standardization recipe (scale and center both predictors, Age and played_hours)
players_recipe <- recipe(experience ~ Age + played_hours, data = player_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
players_recipe
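
Together, `step_scale` and `step_center` standardize each predictor to mean 0 and standard deviation 1. This matters for KNN because distances would otherwise be dominated by whichever variable has the larger numeric range. Scaling first and centering second gives the same result as the usual `(x - mean(x)) / sd(x)`, which the following base R check illustrates on invented numbers (`age` is a hypothetical vector).

```r
# Invented ages, for illustration only
age <- c(17, 21, 19, 24, 30)

# Scale first (divide by the standard deviation), then center (subtract the
# mean of the scaled values), mirroring the order of the two recipe steps
scaled <- age / sd(age)
scaled_then_centered <- scaled - mean(scaled)

# Conventional standardization: subtract the mean, then divide by the sd
standardized <- (age - mean(age)) / sd(age)

# Both routes agree, and the result has mean 0 and standard deviation 1
all.equal(scaled_then_centered, standardized)
mean(scaled_then_centered)
sd(scaled_then_centered)
```

In the recipe, the means and standard deviations are estimated from the training data only. `step_normalize(all_predictors())` performs both operations in a single step.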

In [None]:
# Creating the specification with tune() as neighbors to find the best k value.
players_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
players_tune
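
The specification above leaves `neighbors` as `tune()`, so a value of k still has to be chosen. One common way to do this in tidymodels is cross-validation on the training set with `vfold_cv` and `tune_grid`. The sketch below is one possible continuation, not necessarily the approach this project takes. It runs on simulated data, and the names `toy_training`, `toy_recipe`, `toy_tune`, `toy_vfold`, and `k_vals`, the choice of 5 folds, and the k range of 1 to 10 are all assumptions made for illustration.

```r
library(tidymodels)

set.seed(123)

# Simulated stand-in for player_training; all values are invented
toy_training <- tibble(
    experience   = factor(sample(c("Amateur", "Pro", "Veteran"), 90, replace = TRUE)),
    Age          = runif(90, 15, 40),
    played_hours = rexp(90, rate = 1)
)

# Same recipe and model specification as in the project, built on the toy data
toy_recipe <- recipe(experience ~ Age + played_hours, data = toy_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

toy_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation, stratified by the class label (assumed settings)
toy_vfold <- vfold_cv(toy_training, v = 5, strata = experience)

# Candidate k values to try (assumed range)
k_vals <- tibble(neighbors = 1:10)

# Fit and evaluate the model for every candidate k on every fold
knn_results <- workflow() |>
    add_recipe(toy_recipe) |>
    add_model(toy_tune) |>
    tune_grid(resamples = toy_vfold, grid = k_vals) |>
    collect_metrics()

# Keep the cross-validated accuracy estimates, highest first
accuracies <- knn_results |>
    filter(.metric == "accuracy") |>
    arrange(desc(mean))
accuracies
```

Since the simulated labels are random, the accuracies here carry no meaning; the point is the shape of the workflow. On the real data, the k with the highest cross-validated accuracy would be a natural candidate for the final model.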

# Discussion

