
# Individual Project Planning Stage - Simran Heir - 69469427



## Loading Datasets and Performing Summary


In [None]:
# load tidyverse into the R notebook.
library(tidyverse)


### Player Dataset


In [None]:
# loading player data
url <- "https://raw.githubusercontent.com/SimranHeir/DSCI-100-Project-Planning-Stage-Individual-Simran-Heir/refs/heads/main/players.csv"
player_data <- read_csv (url)


In [None]:
# printing out 6 rows of the player_data
head (player_data, 6)

In [None]:
#How many observations are there for player data?
nrow (player_data)


In [None]:
#Player summary statistics
summary_player <- summary (player_data)
summary_player

In [None]:
#experience categories
experience_categories <- player_data |>
    distinct (experience)
experience_categories


### Session Dataset


In [None]:
#loading session data
url <- "https://raw.githubusercontent.com/SimranHeir/DSCI-100-Project-Planning-Stage-Individual-Simran-Heir/refs/heads/main/sessions.csv"
sessiondata <- read_csv (url)


In [None]:
# printing out 6 rows of the sessiondata
head (sessiondata, 6)

In [None]:
#How many observations are there for session data?
nrow(sessiondata)

In [None]:
#Session summary statistics
summary_session <- summary (sessiondata)
summary_session


# (1) Data Description:



## Where is data from?

The data was collected by the Pacific Laboratory for Artificial Intelligence (PLAI) at UBC. It comes from players who played Minecraft on plaicraft.ai and consented to the study.


## Player data:

- The data is about specific individual player information.

- There are **7 variables**

- There are **196 observations**

|Variable|Type|Meaning|
|---------------|--------|--------|
|experience|factor|Experience level of the player|
|subscribe|logical|Whether a player is subscribed to the newsletter (True or False)|
|hashEmail|character|An encoded email address to have privacy for the players|
|played_hours|double|Total amount of time that Minecraft was played (in hours)|
|name|character|Name of the players|
|gender|factor|Gender of the players|
|Age|integer|Age of the player (in years)|


When loaded, experience and gender are read as character, but they should be factors. Age is read as a double (decimal), but it should be an integer. Age also has two NA (missing) values.


### Additional details from summary:

##### Subscribe:

|Variable|True|False|
|---|---|---|
|Subscribe|144|52|

##### played_hours and Age summary values:

|Variable|Minimum|Median|Mean|Maximum|
|---|---|---|---|---|
|played_hours (hours)|0.00|0.10|5.85|233.10|
|Age (years)|9.00|19.00|21.14|58.00|

##### Experience levels:

- Pro
- Veteran
- Amateur
- Regular
- Beginner



## Session data:

- The data is about each session playing Minecraft.

- There are **5 variables**

- There are **1535 observations**


|Variable|Type|Meaning|
|----|----|---|
|hashEmail|character|An encoded email address to have privacy for the players|
|start_time|date + time|When the gameplay session started|
|end_time|date + time|When the gameplay session ended|
|original_start_time|double|Start time (in Unix Timestamp form)|
|original_end_time|double|End time (in Unix Timestamp form)|




start_time and end_time are read in as character, but each holds a combined date and time, so they should be converted to a date + time type. As loaded, these two columns are not tidy, because each cell holds two pieces of information (a date and a time); they could be separated into one column for the date and one column for the time.
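The conversion described above could look roughly like the sketch below. It uses a small made-up tibble instead of `sessiondata`, and it assumes the timestamps are written as day/month/year hour:minute; if the real file uses a different layout, another lubridate parser (for example `ymd_hms`) would be needed. The names `toy_sessions`, `start_date` and `session_minutes` are only for illustration.

```r
library(tidyverse)
library(lubridate)

# Made-up rows in an ASSUMED "day/month/year hour:minute" layout
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

toy_sessions_parsed <- toy_sessions |>
    mutate(start_time = dmy_hm(start_time),   # character -> date + time
           end_time = dmy_hm(end_time),
           start_date = as_date(start_time),  # date part in its own column
           session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")))
toy_sessions_parsed
```

Once the columns are a date + time type, quantities such as the length of each session can be computed directly, as `session_minutes` does here.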

### Additional details from summary:

|Variable|Minimum Value|Median|Mean|Max|
|---|---|---|---|---|
|original_start_time|1.71e+12|1.72e+12|1.72e+12|1.73e+12|
|original_end_time|1.71e+12|1.72e+12|1.72e+12|1.73e+12|




## (2) Questions:



#### **Broad Question:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

#### **Specific Question:** Can played_hours and Age predict subscription to a game-related newsletter in the player dataset?

For my questions, it makes the most sense to use the player dataset: we are trying to determine which variables are most predictive of subscribing to a game-related newsletter, and only the player dataset has the subscribe variable.

To wrangle the data, I will first make sure each variable has the correct data type. The player data is already tidy, so I do not need to tidy it. Then I will select the columns needed for my specific question.



## Wrangling Data - Player data



### Give variables correct data type

In [None]:
# Give the variables the correct data type
player_data_wrangled <- player_data |>
    mutate(experience = as.factor (experience), gender = as.factor (gender), Age = as.integer (Age))


In [None]:
#print out the first 6 rows to see some of the dataset
head (player_data_wrangled, 6)


### Calculate the mean for each numerical value in player dataset


In [None]:
#calculating the mean for played_hours and Age
mean_player_data <- player_data_wrangled |>
    summarise(across(c(played_hours, Age), \(x) mean(x, na.rm = TRUE)))
mean_player_data


### Wrangling Data to have only the columns needed for the Specific Question


In [None]:
#wrangling data for specific question by selecting certain columns
specific_player_data <- player_data_wrangled |>
    select (subscribe, played_hours, Age)

In [None]:
#showing only top 6 rows to make less cluttered
head(specific_player_data, 6)


## Create Visualizations



### Broad Question/looking at graphs for the different variables



#### Number of people subscribed or not subscribed:


In [None]:
#Creating a graph for number of subscribed vs not subscribed
subscribe_graph <- player_data_wrangled |>
    ggplot (aes (x= subscribe, fill = subscribe)) +
    geom_bar (color = "black") +
    xlab ("Subscribed (True or False)") +
    ylab ("Count") +
    ggtitle ("Number of people subscribed or not subscribed in barplot") +
    scale_fill_manual(values = c("steelblue", "darkorange")) +
    theme (text = element_text (size = 15))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)
subscribe_graph


### Experience Visualization:


In [None]:
#Number of subscribed and not subscribed vs experience
subscribe_graph <- player_data_wrangled |>
    ggplot (aes (x= experience, fill = subscribe)) +
    geom_bar (position = "dodge", color = "black") +
    xlab ("Experience") +
    ylab ("Number of subscribed and not subscribed") +
    ggtitle ("Number of subscribed and not subscribed vs experience") +
    scale_fill_manual(values = c("steelblue", "darkorange")) +
    theme (text = element_text (size = 15))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)
subscribe_graph


### Gender Visualization


In [None]:
#Number of subscribed and not subscribed vs gender
subscribe_graph <- player_data_wrangled |>
    ggplot (aes (x= gender, fill = subscribe)) +
    geom_bar (color = "black") +
    xlab ("Gender") +
    ylab ("Number of subscribed and not subscribed") +
    ggtitle ("Number of subscribed and not subscribed vs gender") +
    scale_fill_manual(values = c("steelblue", "darkorange")) +
    theme (text = element_text (size = 15))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)
subscribe_graph


## Visualizations Related to Specific Question



### Total playing hours vs subscription visualizations



#### Mean total time played vs subscription


In [None]:
# wrangle data for mean played_hours and subscribe

played_hours_wrangled <- specific_player_data |>
    group_by(subscribe) |>
    summarise (mean_played_hours = mean (played_hours))
               
played_hours_wrangled

In [None]:
#graph for mean played hours
graph_played_hours <- played_hours_wrangled  |>
    ggplot (aes (x = subscribe, y = mean_played_hours, fill = subscribe)) +
    geom_bar(stat = "identity", color = "black") +
    xlab ("subscribed (True or False)") +
    ylab ("Mean total played hours") +
    ggtitle ("Mean total played hours vs subscribed") +
    scale_fill_manual(values = c("steelblue", "darkorange")) +
    theme (text = element_text (size = 15))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)
graph_played_hours



#### Total time played vs subscription


In [None]:
#total time played vs subscribed and not subscribed
histogram_played_hours <- specific_player_data|>
    ggplot (aes (x = played_hours, fill = subscribe)) +
    geom_histogram (alpha = 0.5, position = "identity", binwidth = 10) +
    xlab ("Total time played (hours)") +
    ylab ("Number of players subscribed or not subscribed") +
    ggtitle ("Total time played vs subscribed or not subscribed") +
    facet_grid(rows = vars(subscribe)) +
    scale_fill_manual(values = c("steelblue", "darkorange")) +
    theme (text = element_text (size = 20))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)
histogram_played_hours


### Age vs subscriptions visualizations



#### Mean age vs subscriptions


In [None]:
# wrangled for mean age and subscribed
mean_wrangled_age <- player_data_wrangled |>
    group_by(subscribe) |>
    summarize (mean_Age = mean(Age, na.rm = TRUE))           
mean_wrangled_age

In [None]:
#graph for mean age and subscribed
graph_mean_age <- mean_wrangled_age |>
    ggplot (aes (x = subscribe, y = mean_Age , fill = subscribe)) +
    geom_bar(stat = "identity", color = "black") +
    xlab ("subscribed (True or false)") +
    ylab ("Mean age") +
    ggtitle ("Mean age vs subscribed") +
    scale_fill_manual(values = c("steelblue", "darkorange")) +
    theme (text = element_text (size = 20))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)

graph_mean_age


#### Age vs subscription


In [None]:
#age vs subscribed and not subscribed
histogram_age <- specific_player_data |>
    ggplot (aes (x = Age, fill = subscribe)) +
    geom_histogram (alpha = 0.5, position = "identity", binwidth = 0.6) +
    xlab ("Age of players") +
    ylab ("Count of how many people subscribed or did not subscribe") +
    ggtitle ("Age vs number of people subscribed or not subscribed") +
    scale_fill_manual(values = c("steelblue", "darkorange")) +
    facet_grid(rows = vars(subscribe)) +
    theme (text = element_text (size = 20))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)
histogram_age


### Total time spent playing and age visualizations


#### Total time spent playing vs age 

In [None]:
#total time spent playing vs age, coloured by subscription
graph_hours_age <- specific_player_data |>
    ggplot (aes (x = Age, y = played_hours, color = subscribe)) +
    geom_point () +
    xlab ("Age (years)") +
    ylab ("Total time spent playing (hours)") +
    ggtitle ("Total played hours vs age") +
    scale_color_manual(values = c("steelblue", "darkorange")) +
    theme (text = element_text (size = 20))
# set the plot size for the notebook (separate from the ggplot chain above)
options (repr.plot.width = 10, repr.plot.height = 8)
graph_hours_age
            


## (3) Exploratory Data Analysis and Visualization:



### Minimal wrangling to get a tidy format

The player data is tidy since it meets the three requirements: each row is a single observation, each column is a single variable, and each cell holds a single value. I changed the variable types to the correct types to make it easier to perform calculations. There are some missing values, but I plan to deal with them during the analysis by using na.rm = TRUE.
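As a rough illustration of checking for the missing values, the sketch below counts the NA entries in each column of a small made-up tibble (`toy_players` is hypothetical, not the real player data). Note that `na.rm = TRUE` only applies to summary functions such as `mean`; for a classifier, the rows with a missing predictor would probably have to be dropped or imputed instead, and `drop_na()` is shown as one possible option.

```r
library(tidyverse)

# Made-up data with two missing ages, mirroring the two NA values reported for Age
toy_players <- tibble(
    played_hours = c(0.0, 1.5, 30.3, 0.1),
    Age = c(17L, NA, 21L, NA)
)

# Count the missing values in every column
na_counts <- toy_players |>
    summarise(across(everything(), \(x) sum(is.na(x))))
na_counts

# One possible way to remove the incomplete rows before modelling
toy_players_complete <- toy_players |>
    drop_na(Age)
toy_players_complete
```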




### Computing mean value for each quantitative variable in player dataset:

|Variable|Mean|
|---|---|
|played_hours|5.85|
|Age|21.14|



###  Overall important insights gained from these plots:

#### Broad Questions insights

- There are far more subscribed players than not subscribed players. Since the classes are uneven, a knn classifier may be biased towards predicting subscribed over not subscribed.

- Players with a middle level of experience account for most of the players and most of the subscriptions, compared to pros and beginners.

- Gender representation is uneven: the male gender has many more data points and subscriptions than the other genders.
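To put a number on the imbalance in the first insight, the sketch below turns the counts from the Subscribe table (144 True, 52 False) into proportions. The name `subscribe_props` is only for illustration.

```r
library(tidyverse)

# Counts copied from the Subscribe table in the data description
subscribe_counts <- tibble(subscribe = c(TRUE, FALSE), n = c(144, 52))

# Proportion of players in each class
subscribe_props <- subscribe_counts |>
    mutate(proportion = n / sum(n))
subscribe_props
```

About 73% of the players are subscribed, so a classifier that always predicts "subscribed" would already be right about 73% of the time. That is a useful baseline to keep in mind when judging the accuracy of a model.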

#### Specific Question insights

- The mean total played hours is higher for subscribed players than for not subscribed players. This suggests that subscribers usually have more total played hours than not subscribed players.

- The histogram shows that most players, both subscribed and not subscribed, have around 0 hours played, but there are still roughly twice as many subscribed players as not subscribed players there. As total time played increases, almost all of the remaining players are subscribed.

- The mean ages of subscribed and not subscribed players are very close, with the mean age being slightly higher for not subscribed players.

- Ages around 15-30 have a larger number of both subscribed and not subscribed players compared to other ages. Before age 15 there are mostly subscribed players, and after age 30 there is a mix, but with more not subscribed players.

- There is no clear relationship between age and total time played.

  


## (4) Methods and Plan:



### Prediction method:

The prediction method that should be used is **knn classification**.

Knn classification is used when the outcome we are trying to predict is a **categorical (qualitative) variable**, while regression methods (knn regression and linear regression) are used for numerical outcomes. We are trying to predict whether a person is going to subscribe (categorical) based on Age and time played.


#### The assumption for selecting this method: 

- The model predicts from the labels of the nearest known observations, so points that are close together must tend to have the same outcome; otherwise the method will not work.
- The data should be balanced and the predictors put on the same scale; otherwise the variable with the larger range counts for more than the other in the distance calculation, and the larger class can dominate the predictions.
- The model does not assume linearity or any particular trend, which works well in this case since there is no clear trend between the variables.

#### Potential limitations: 

- model does not work well with many variables
- slow with large data sets
- sensitive to noise
- choosing the correct k-value is very important.

#### Comparing:

I can compare with other models by comparing the accuracy of the predictions.


#### To process the data to apply the model:

1. Split the data into training and testing data (70% training data and 30% testing data).
2. Set up a 5-fold cross-validation with the training data.
3. Create a recipe to specify the predictors and the preprocessing steps for all predictors (center and scale them, using the training data).
4. Make a knn model specification (knn spec) with neighbors = tune ().
5. Create a workflow by adding the recipe and the model specification.
6. Use the tune_grid function on the validation splits to pick the k-value with the highest accuracy.
7. Make a new model specification with the best k-value found.
8. Fit the final model on the training data.
9. Use the predict function on the testing data and compute the accuracy of the predictions.
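The steps above can be sketched with tidymodels as follows. This is only a rough outline under stated assumptions: it runs on simulated data (`toy_data` is a made-up stand-in for `specific_player_data`), the seed and the grid of k-values are arbitrary choices, and it assumes the tidymodels and kknn packages are installed. In the real analysis, `subscribe` would first need to be converted to a factor and the rows with a missing Age handled.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed, only so the sketch is reproducible

# Simulated stand-in for specific_player_data (NOT the real data)
toy_data <- tibble(
    played_hours = rexp(200, rate = 0.2),
    Age = sample(9:58, 200, replace = TRUE)
) |>
    mutate(subscribe = as_factor(if_else(played_hours + rnorm(200) > 3, "TRUE", "FALSE")))

# 1. 70/30 split, stratified on the outcome
data_split <- initial_split(toy_data, prop = 0.70, strata = subscribe)
train_data <- training(data_split)
test_data <- testing(data_split)

# 2. 5-fold cross-validation on the training data
folds <- vfold_cv(train_data, v = 5, strata = subscribe)

# 3. recipe: outcome, predictors, and scaling/centering of the predictors
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = train_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# 4. model specification with the number of neighbors left to tune
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-6. workflow plus tuning over an (arbitrary) grid of odd k-values
k_grid <- tibble(neighbors = seq(1, 15, by = 2))
tune_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
best_k <- tune_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# 7-8. new specification with the best k, fitted on the training data
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")
knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_best_spec) |>
    fit(data = train_data)

# 9. predict on the testing data and compute the accuracy
test_accuracy <- predict(knn_fit, test_data) |>
    bind_cols(test_data) |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy")
test_accuracy
```

Stratifying the split and the folds on `subscribe` is a choice made here so that both classes appear in every subset; it is not required by the steps listed above.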




Github link: https://github.com/SimranHeir/DSCI-100-Project-Planning-Stage-Individual-Simran-Heir.git