## DSCI 100 Group 11 Project ## 

Authors: Ishita Sharma (66518101), Eric Haung, Yunfeng Bu (70556543), Andrew Liu 


### Introduction ###

The objective of this project is to understand user behaviour in video games such as Minecraft. By understanding how users interact with the game, researchers can make adjustments that better serve players, improve their recruitment strategies, and create better experiences, which may in turn keep users subscribed to the game for longer. A Computer Science research team at the University of British Columbia is using a Minecraft server to collect real-world data from users playing the game. For this project, we ask which kinds of players are most likely to contribute a large amount of data. The goal is to develop a model that can identify which players are most likely to engage heavily with the Minecraft server.

We will use two datasets, which we combine into one. The first is the Players dataset, which provides the following information: name, gender, age, experience level, subscription status, hashed email, number of played hours, player ID, and an organization name. The second is the Sessions dataset, which provides the user's hashed email, the session start and end times, and the original start and end times. To answer our question, we use total played hours as the response variable. Candidate explanatory variables include gender, experience, age, and subscription status; the models below use age, experience, and session frequency. The two datasets are joined on their common variable, the hashed email.

##### Question: #####
We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.


In [3]:
library(tidyverse)
library(tidymodels) # provides initial_split(), recipe(), nearest_neighbor(), tune_grid(), etc.
library(scales)
library(ggplot2)
library(readr)


### Methods ### 

To start answering our question, we loaded two datasets: Players and Sessions. We then wrangled them by cleaning them into tidy format, keeping only the necessary variables, and combining them into a single dataset.

##### Sessions Dataset: 
We converted start_time and end_time into datetime format and calculated each session's duration in minutes. We then pooled the data to compute the total play duration and the number of sessions (frequency) for each user.
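
As a rough illustration of these steps, the sketch below runs the same conversion and pooling on a small made-up table. The rows, the emails, and the names `toy_sessions` and `toy_summary` are hypothetical; the day/month/year timestamp format is an assumption based on the notebook's use of `dmy_hm()`.

```r
library(dplyr)
library(lubridate)

# Hypothetical sessions: two for player "a", one for player "b"
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time    = c("30/06/2024 18:42", "01/07/2024 10:15", "02/07/2024 20:45")
)

toy_summary <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time   = dmy_hm(end_time),
    # Session length as a plain number of minutes
    duration   = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  # One row per player: number of sessions and total minutes played
  summarise(frequency = n(), total_duration = sum(duration))

toy_summary
```

This sketch uses `summarise()`, which returns one row per player. The notebook instead uses `mutate()` inside `group_by()`, which keeps one row per session and repeats the per-player totals on each row.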

##### Players Dataset: 
We removed unnecessary columns such as individualId and organizationName, then converted the categorical variables (experience, subscribe, and gender) into factors. Lastly, we filtered out any rows with missing values for experience, played_hours, or age.
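
The cleaning steps for the Players data can be sketched on a made-up table. All values below, and the names `toy_players` and `toy_clean`, are hypothetical; the column names follow the notebook.

```r
library(dplyr)

# Hypothetical players; "b" has no age and "c" has no experience level
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  experience   = c("Pro", "Amateur", NA),
  subscribe    = c(TRUE, FALSE, TRUE),
  gender       = c("Female", "Male", "Male"),
  played_hours = c(3.2, 0, 1.5),
  age          = c(17, NA, 21)
)

toy_clean <- toy_players |>
  # Treat the categorical columns as factors
  mutate(across(c(experience, subscribe, gender), as.factor)) |>
  # Keep only rows where all three fields are present
  filter(!is.na(experience), !is.na(played_hours), !is.na(age))

toy_clean
```

Only player "a" survives the filter, which shows how quickly rows can be lost when several columns contain missing values.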

##### Combined Dataset: 
We combined the two datasets using their common variable, hashedEmail, to examine how demographic factors relate to player engagement.
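
A minimal sketch of this join on made-up data is shown below. The tables and their values are hypothetical. It illustrates one consequence of putting the sessions table on the left of `left_join()`: players with no recorded sessions do not appear in the result, and sessions without a matching player get `NA` for the player columns.

```r
library(dplyr)

# Hypothetical per-session rows; "c" has no matching player record
toy_sessions <- tibble(
  hashedEmail    = c("a", "a", "c"),
  total_duration = c(105, 105, 15)
)

# Hypothetical players; "b" never started a session
toy_players <- tibble(
  hashedEmail = c("a", "b"),
  age         = c(17, 21)
)

toy_combined <- left_join(toy_sessions, toy_players, by = "hashedEmail")
toy_combined
```

If players with zero sessions should also be kept, a join with the players table on the left (or a `full_join()`) would be one alternative to consider.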


In [None]:
# Data wrangling
url1 <- "https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"
url2 <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"

session_data <- read_csv(url1)
players_data <- read_csv(url2)

# Parse timestamps, drop rows that fail to parse, and compute session length in minutes
session_data_wrangle <- session_data |>
  mutate(start_time = dmy_hm(start_time),
         end_time = dmy_hm(end_time)) |>
  filter(!is.na(start_time) & !is.na(end_time)) |>
  mutate(Duration = as.numeric(end_time - start_time, units = "mins")) |>
  # Per-player session count and total minutes, repeated on every session row
  group_by(hashedEmail) |>
  mutate(frequency = n(),
         total_duration = sum(Duration, na.rm = TRUE)) |>
  ungroup()

# Drop unused ID columns, convert categorical variables to factors,
# and remove rows with missing experience, played_hours, or age
players_data_wrangle <- players_data |>
  select(-individualId, -organizationName) |>
  mutate(experience = as.factor(experience),
         subscribe = as.factor(subscribe),
         gender = as.factor(gender)) |>
  filter(!is.na(experience) & !is.na(played_hours) & !is.na(age))

# Attach player attributes to every session row
combined_data <- session_data_wrangle |>
  left_join(players_data_wrangle, by = "hashedEmail")

# Reduce to one row per player and keep players aged 15 to 30
combined_data_plot <- combined_data |>
  distinct(hashedEmail, start_time, .keep_all = TRUE) |>
  select(played_hours, name, gender, age, Duration, frequency, total_duration, experience, subscribe) |>
  distinct(name, .keep_all = TRUE) |>
  select(played_hours, gender, age, experience, frequency, total_duration) |>
  filter(age >= 15 & age <= 30)

# head(session_data, 10)
# head(players_data, 10)
# head(combined_data, 10)


In [None]:
# Visualizations

# Total session time vs experience
analysis_plot_1 <- combined_data_plot |>
  ggplot(aes(x = total_duration, y = experience, fill = experience)) +
  geom_col() +
  labs(
    title = "Total Play Time by Player Experience",
    y = "Experience",
    x = "Total play time (minutes)",
    fill = "Experience"
  ) +
  theme(text = element_text(size = 12))
analysis_plot_1

# Played hours vs session frequency (log scales; players with 0 hours are dropped with a warning)
analysis_plot_2 <- combined_data_plot |>
  ggplot(aes(x = frequency, y = played_hours, colour = experience)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Session Frequency vs. Total Played Hours",
    x = "Number of sessions",
    y = "Total time (hours)",
    colour = "Experience"
  ) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma) +
  theme(text = element_text(size = 12))
analysis_plot_2

# Played hours vs age
analysis_plot_3 <- combined_data_plot |>
  ggplot(aes(x = age, y = played_hours, fill = experience)) +
  geom_col() +
  labs(
    title = "Total Played Hours by Age",
    x = "Age (years)",
    y = "Total time (hours)",
    fill = "Experience"
  ) +
  theme(text = element_text(size = 12))
analysis_plot_3

In [None]:
# Splitting data
set.seed(1004)

combined_data_model <- combined_data_plot |>
  select(played_hours, frequency, age, experience)
# Optional: add a filter here to drop players whose played_hours is 0
#   filter(played_hours > ...)

# Bin played_hours into three engagement levels: [0, 1], (1, 4], (4, Inf)
mc_data <- combined_data_model |>
  mutate(played_degree = cut(
    played_hours,
    breaks = c(0, 1, 4, Inf),
    labels = c("Try", "Play", "Enjoy"),
    include.lowest = TRUE))

# 80/20 train/test split, stratified by engagement level
mc_split <- initial_split(mc_data, prop = 0.80, strata = played_degree)
mc_training <- training(mc_split)
mc_testing <- testing(mc_split)
# mc_training
# mc_testing

#### Classification Model: 
The K-nearest neighbors (KNN) algorithm places data points in a multidimensional space based on their features and classifies a new point according to the points closest to it (Murel & Kavlakoglu, 2024). We used a KNN classification model to categorize players by their engagement level, measured in played hours. For this model we transformed the played_hours variable into a categorical variable with three levels:
- Try (0 to 1 hrs): Minimal engagement
- Play (more than 1, up to 4 hrs): Moderate engagement
- Enjoy (more than 4 hrs): High engagement
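
To make the category boundaries concrete, the sketch below applies the notebook's `cut()` call to a few made-up values. The `hours` vector is hypothetical; the breaks, labels, and `include.lowest = TRUE` are the ones used in the notebook.

```r
# Hypothetical played_hours values, chosen to sit on and around the break points
hours <- c(0, 0.5, 1, 1.5, 4, 4.1, 20)

played_degree <- cut(
  hours,
  breaks = c(0, 1, 4, Inf),
  labels = c("Try", "Play", "Enjoy"),
  include.lowest = TRUE   # without this, 0 would fall outside the first bin and become NA
)

data.frame(hours, played_degree)
```

Because `cut()` builds right-closed intervals by default, exactly 1 hour is labelled "Try" and exactly 4 hours is labelled "Play".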
  
The classification model is intended to show which player characteristics are associated with each engagement category. By predicting the category a player is likely to fall into, the model can help pinpoint whom to target in recruitment efforts.
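
One way to judge whether such predictions are useful is to compare them with the true categories on held-out data. The sketch below does this in base R on made-up labels; the `truth` and `predicted` vectors are hypothetical and are not results from this project. With the notebook's objects, the same comparison would presumably use `mc_testing$played_degree` and the `.pred_class` column returned by `predict()`.

```r
levels_used <- c("Try", "Play", "Enjoy")

# Hypothetical true and predicted engagement categories for six players
truth     <- factor(c("Try", "Try", "Play", "Enjoy", "Try", "Play"), levels = levels_used)
predicted <- factor(c("Try", "Play", "Play", "Enjoy", "Try", "Try"), levels = levels_used)

# Confusion matrix: rows are true classes, columns are predicted classes
confusion <- table(truth = truth, predicted = predicted)

# Overall accuracy: share of players on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

# Majority-class baseline: accuracy of always predicting the most common class
baseline <- max(table(truth)) / length(truth)

confusion
c(accuracy = accuracy, baseline = baseline)
```

Comparing accuracy against the majority-class baseline matters when one category dominates, since a model that always predicts that category can look accurate without being useful for recruitment.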


In [None]:
#Classification
set.seed(1004)
mc_recipe <- recipe(played_degree ~ ., data = mc_training) |>
  step_rm(played_hours) |>
  step_integer(all_predictors()) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

mc_vfold <- vfold_cv(mc_training, v = 5, strata = played_degree)#cross-validation

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

knn_results <- workflow() |>
  add_recipe(mc_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = mc_vfold, grid = k_vals) |>
  collect_metrics()|> 
  filter(.metric == "accuracy")

cross_val_plot <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate", title = "Classification Model Performance")
cross_val_plot # plot used to choose the best k
# k is picked by eye from the plot above; k = 6 is the value used below

mc_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 6) |>
  set_engine("kknn") |>
  set_mode("classification")

mc_classification_results <- workflow() |>
  add_recipe(mc_recipe) |>
  add_model(mc_spec) |>
  fit(data = mc_training) |>
  predict(new_data = mc_testing)

# mc_classification_results 

#### Regression Model: 

Regression is a method used to analyze the relationship between a dependent variable and one or more independent variables. It can help determine whether changes in the dependent variable are linked to changes in the independent variables (Beers, 2024). We also used a KNN regression model to predict the number of played hours a user might contribute, based on the same player characteristics. This model outputs a continuous prediction of played hours, giving a finer-grained measure of engagement than the three categories used for classification. It can also help gauge how the explanatory variables relate to played hours.
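
The quality of a continuous prediction is commonly summarized by the root mean squared prediction error (RMSPE) on held-out data. The sketch below computes it by hand on made-up numbers; the `truth` and `pred` vectors are hypothetical and are not results from this project.

```r
# Hypothetical true and predicted played hours for four test-set players
truth <- c(0, 0.5, 2, 10)
pred  <- c(0.5, 0.5, 1, 6)

# RMSPE: square the errors, average them, take the square root
rmspe <- sqrt(mean((truth - pred)^2))
rmspe
```

The result is in the same units as the response (hours), so it reads as a typical size of the prediction error. Large errors are penalized more heavily, which means a few very heavy players can dominate the value.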



In [None]:
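#Regression
set.seed(104)
# played_degree is derived from played_hours, so it is removed to avoid leaking the response
mc_recipe_regression <- recipe(played_hours ~ ., data = mc_training) |>
    step_rm(played_degree) |>
    step_integer(all_predictors()) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())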

mc_vfold_regression <- vfold_cv(mc_training, v = 5, strata = played_hours)

knn_tune_regression <- nearest_neighbor(weight_func = "rectangular" ,neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("regression")

k_vals_regression <- tibble(neighbors = seq(from = 1, to = 5, by = 1))

knn_results_regression <- workflow() |>
      add_recipe(mc_recipe_regression) |>
      add_model(knn_tune_regression) |>
      tune_grid(resamples =  mc_vfold_regression, grid = k_vals_regression) |>
      collect_metrics()|> 
      filter(.metric == "rmse")

cross_val_plot <- ggplot(knn_results_regression, aes(x = neighbors, y = mean)) +
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "RMSE Estimate", title = "Regression Model Performance")
cross_val_plot # plot used to choose the best k (lowest RMSE)
# we may choose k = 3

mc_spec_regression <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("regression")

mc_regression_results <- workflow() |>
  add_recipe(mc_recipe_regression) |>
  add_model(mc_spec_regression) |>
  fit(data = mc_training) |>
  predict(new_data = mc_testing)

# Test-set error (RMSPE) for the regression model:
# knn_rmspe <- workflow() |>
#          add_recipe(mc_recipe_regression) |>
#          add_model(mc_spec_regression) |>
#          fit(data = mc_training) |>
#          predict(new_data = mc_testing) |>
#          bind_cols(mc_testing) |>
#          metrics(truth = played_hours, estimate = .pred)
# knn_rmspe

# mc_regression_results# the result of regression

#### References 

Beers, B. (n.d.). Regression: Definition, analysis, calculation, and example. Investopedia. https://www.investopedia.com/terms/r/regression.asp 

Murel, J., & Kavlakoglu, E. (2024, August 29). What are classification models? IBM. https://www.ibm.com/topics/classification-models
