# DSCI 100 Project Individual Planning Report

Group Information:
- Name: Jessie Chen
- Section: 009
- Group: 6

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

## (1) Data Description

In [None]:
# LOAD PLAYERS AND SESSIONS INTO R
# read_csv states the comma delimiter explicitly instead of relying on read_delim to guess it
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")
players
sessions

In [None]:
# PLAYERS SUMMARY STATISTICS
summary_players <- summary(players)
categories_experience <- unique(players$experience)
categories_gender <- unique(players$gender)

summary_players
categories_experience
categories_gender

# SESSIONS SUMMARY STATISTICS
summary_sessions <- summary(sessions)

summary_sessions

The data consist of two files: `players.csv` and `sessions.csv`.

`players.csv` is a list of all unique players, including data about each player. There are
- 196 rows (observations)
- 7 variables
  | variable  | type | meaning |
  |-----------|------|---------|
  | experience | character | experience level of a player in Minecraft, which has 5 categories: 'Pro', 'Veteran', 'Amateur', 'Regular', and 'Beginner' |
  | subscribe | logical | true if a player subscribes to a game-related newsletter |
  | hashedEmail | character | hashed or encrypted email address of a player |
  | played_hours | double | the time (in hours) a player has played Minecraft |
  | name | character | name of a player |
  | gender | character | gender of a player, which has 7 categories: 'Male', 'Female', 'Non-binary', 'Prefer not to say', 'Agender', 'Two-Spirited', and 'Other' |
  | Age | double | the age of a player in years |

  | double variable | units | minimum | maximum | mean | first quartile | third quartile |
  |----------------|-----|-------|----------|--------|--------------|--------|
  | played_hours | hours | 0 | 223.10 | 5.85 | 0 | 0.60 |
  | Age | years | 9.00 | 58.00 | 21.14 | 17.00 | 22.75 |
- Issues: 2 NA's in `Age`
- Potential issues: `experience` appears to be self-reported, which might weaken its correlation with other variables; we also do not know what qualifies a player for each level.

`sessions.csv` is a list of individual play sessions by each player, including data about the session. There are
- 1535 rows (observations)
- 5 variables
  | variable  | type | meaning |
  |-----------|------|---------|
  | hashedEmail | character | hashed or encrypted email address of a player of the session |
  | start_time | character | the start time and date of a session |
  | end_time | character | the end time and date of a session |
  | original_start_time | double | the start time of a session recorded in UNIX time (milliseconds) |
  | original_end_time | double | the end time of a session recorded in UNIX time (milliseconds) |
- Issues: 2 NA's in `original_end_time`
- Potential issues: `start_time` and `end_time` each store 2 pieces of information (date and time) in a single character column
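
One possible way around the combined date and time strings is to build proper date-time values from the UNIX millisecond columns instead. The sketch below is only an illustration: `toy_sessions` and its values are made up, and `start_dt`, `end_dt`, and `duration_mins` are hypothetical column names, not part of `sessions.csv`.

```r
library(tidyverse)
library(lubridate)

# Hypothetical stand-in for a few rows of sessions.csv (values are invented)
toy_sessions <- tibble(
  hashedEmail = c("a1", "b2"),
  original_start_time = c(1.71977e12, 1.71880e12),
  original_end_time = c(1.71978e12, NA)
)

toy_sessions_parsed <- toy_sessions |>
  mutate(
    # UNIX time is in milliseconds, so divide by 1000 to get seconds
    start_dt = as_datetime(original_start_time / 1000),
    end_dt = as_datetime(original_end_time / 1000),
    # Session length in minutes; stays NA when the end time is missing
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

toy_sessions_parsed
```

Rows with a missing `original_end_time` keep an `NA` duration, so they would still need to be handled before any analysis of session length.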

The data in `players.csv` and `sessions.csv` were collected in the PLAICraft Minecraft browser window, where the research team records players' gameplay, speech, and key presses.

## (2) Questions

I will address the first broad question, *Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?*

The specific question that I have formulated: Can `played_hours` and `Age` predict `subscribe` in `players.csv`?

The data will help me address the specific question because it has information about each player. I only need `players.csv`, and it is tidy because it fulfills the 3 requirements of tidy data: each row is a single observation, each column is a single variable, and each cell holds a single value. I can apply the predictive method of K-nearest neighbors classification.

## (3) Exploratory Data Analysis and Visualization

As shown in (1) Data Description, the dataset can be loaded into R. I only need `players.csv`, and because it is already in a tidy format, there is no more wrangling I need to perform at this step.
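
Although no reshaping is needed, a later modelling step would probably need a little preparation: the 2 missing `Age` values have to be handled, and `tidymodels` classification expects the response to be a factor. The sketch below shows one way this might look. `toy_players` is a made-up stand-in for `players.csv`, and dropping the rows with missing `Age` is an assumption, not a decision already taken in this report.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (values are invented)
toy_players <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(0.0, 1.2, 30.3, 0.1),
  Age = c(17, 22, NA, 19)
)

players_model <- toy_players |>
  select(subscribe, played_hours, Age) |>   # keep only the response and the two predictors
  drop_na(Age) |>                           # assumption: remove rows with a missing Age
  mutate(subscribe = as.factor(subscribe))  # classification needs a factor response

players_model
```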

In [None]:
# CALCULATE MEAN VALUE FOR EACH QUANTITATIVE VARIABLE IN THE PLAYERS.CSV
players_mean <- players |>
    summarize(played_hours = mean(played_hours, na.rm = TRUE), 
              age = mean(Age, na.rm = TRUE))
players_mean

The mean value for each quantitative variable in the `players.csv` data set (i.e., `played_hours` and `Age`) is shown below.

| mean played hours (in hours) | mean age (in years) |
|-------------------|----------|
| 5.85 | 21.14 |

In [None]:
# EXPLORATORY VISUALIZATIONS OF THE DATA
options(repr.plot.width = 15, repr.plot.height = 8)

subscribe_experience_plot <- players |>
    ggplot(aes(x = experience, fill = subscribe)) +
    geom_bar() +
    labs(x = "Experience level in Minecraft",
         y = "Number of players",
         fill = "Subscribe to a game-related newsletter?",
         title = "Fig.1.1: Subscription status by experience level") +
    theme(text = element_text(size = 18))

subscribe_experience_proportion_plot <- players |>
    ggplot(aes(x = experience, fill = subscribe)) +
    geom_bar(position = 'fill') +
    labs(x = "Experience level in Minecraft",
         y = "Proportion of players",
         fill = "Subscribe to a game-related newsletter?",
         title = "Fig.1.2: Proportion of subscribed players by experience level") +
    theme(text = element_text(size = 18))

subscribe_age_plot <- players |>
    ggplot(aes(x = Age, fill = subscribe)) +
    geom_histogram(bins = 30) +
    # dashed line marks the mean age
    geom_vline(xintercept = 21.14, linetype = "dashed") +
    labs(x = "Age of players (in years)",
         y = "Number of players",
         fill = "Subscribe to a game-related newsletter?",
         title = "Fig.2: Subscription status by player age") +
    theme(text = element_text(size = 18))

subscribe_played_hours_plot <- players |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(bins = 30) +
    # dashed line marks the mean played hours
    geom_vline(xintercept = 5.85, linetype = "dashed") +
    labs(x = "Cumulative played time (in hours)",
         y = "Number of players",
         fill = "Subscribe to a game-related newsletter?",
         title = "Fig.3: Subscription status by played hours") +
    theme(text = element_text(size = 18))

subscribe_played_hours_age_plot <- players |>
    ggplot(aes(x = Age, 
               y = played_hours,
               color = subscribe,
               shape = subscribe)) +
    geom_point(size = 3.5) +
    labs(x = "Age of players (in years)",
         y = "Cumulative played time (in hours)",
         color = "Subscribe to a game-related newsletter?",
         shape = "Subscribe to a game-related newsletter?",
         title = "Fig.4: Subscription status by age and played hours") +
    theme(text = element_text(size = 18))

subscribe_experience_plot
subscribe_experience_proportion_plot
subscribe_age_plot
subscribe_played_hours_plot
subscribe_played_hours_age_plot

From Fig.1.1, we see that the two largest experience groups are Amateur and Veteran, and they account for the most subscriptions to game-related newsletters.

However, from Fig.1.2, we learn that each experience level has a similar proportion of subscribed and non-subscribed players; in all 5 levels, around 75% of players subscribe to a game-related newsletter. There seems to be no association between `experience` and `subscribe`. Thus, we can suggest that `experience` may be less influential than other variables in predicting `subscribe`.
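
The proportions read off Fig.1.2 could also be checked numerically. The following is a minimal sketch with made-up data; `toy_players` is a hypothetical stand-in for `players.csv`, and `n_players` and `prop_subscribed` are illustrative column names.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (values are invented)
toy_players <- tibble(
  experience = c("Amateur", "Amateur", "Veteran", "Veteran",
                 "Pro", "Pro", "Pro", "Pro"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

subscribe_by_experience <- toy_players |>
  group_by(experience) |>
  summarize(n_players = n(),
            # the mean of a logical column is the proportion of TRUE values
            prop_subscribed = mean(subscribe))

subscribe_by_experience
```

Reporting `n_players` next to the proportion helps show which levels have too few players for the proportion to be reliable.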

From Fig.2, we see that more younger players (i.e., younger than the mean age of 21.14 years; left of the dashed line) subscribe to a game-related newsletter than older players do. This suggests that `Age` may be an influential factor in predicting `subscribe`.

From Fig.3, we see that most players have played close to 0 hours in total, so the figure may be insufficient to establish a pattern between `played_hours` and `subscribe`. Still, it suggests that most players with more than 0 played hours are subscribed to a game-related newsletter.
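
Because the histogram is dominated by the bar near 0, a small summary table may make this comparison easier to see. The sketch below uses made-up data; `toy_players` is a hypothetical stand-in for `players.csv`, and `has_played` is an illustrative helper column.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (values are invented)
toy_players <- tibble(
  played_hours = c(0, 0, 0, 0.5, 12.3, 0),
  subscribe = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

hours_summary <- toy_players |>
  mutate(has_played = played_hours > 0) |>  # TRUE when a player has any recorded play time
  group_by(has_played) |>
  summarize(n_players = n(),
            prop_subscribed = mean(subscribe))

hours_summary
```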

It seems possible to perform K-nearest neighbors classification on the data shown in Fig.4 to predict `subscribe` using `Age` and `played_hours`.

## (4) Methods and Plan

I propose using the K-nearest neighbors (KNN) classification method to predict `subscribe` based on a player's `Age` and `played_hours`.

This method is appropriate because we are predicting a categorical value. There are only two options: true if the player subscribes to a game-related newsletter, and false otherwise.

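KNN classification requires few assumptions about what the data must look like. However, because it relies on distances between observations, the predictors should be standardized (centered and scaled) so that `played_hours` and `Age` contribute comparably.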
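
Since KNN works with distances between observations, predictors on different scales are usually centered and scaled before fitting. A minimal sketch of such a preprocessing recipe is shown below. `toy_players` is made-up data standing in for `players.csv`, and `knn_recipe` and `scaled_players` are hypothetical names.

```r
library(tidymodels)

# Hypothetical stand-in for players.csv (values are invented)
toy_players <- tibble(
  subscribe = factor(c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE")),
  played_hours = c(0, 1.5, 40, 0.2, 0, 3),
  Age = c(17, 22, 19, 30, 25, 16)
)

# Scale each predictor to standard deviation 1, then center it at mean 0
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = toy_players) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the means and standard deviations; bake() applies them
scaled_players <- knn_recipe |>
  prep() |>
  bake(new_data = toy_players)

scaled_players
```

In an actual analysis the recipe would be estimated on the training set only, so that the test set does not influence the scaling.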

Potential limitations or weaknesses of KNN classification: it may become computationally heavy as the data grow; as the number of predictors increases, the predictions may become less accurate; with few data points, the predictions may not be accurate; and the model does not describe an overall trend, so it is hard to interpret.

To compare and select the model, I will use the `metrics` function from the `tidymodels` package, which reports statistics about the quality of a model once the `truth` and `estimate` arguments are specified. To improve the classifier, I can change its parameter, the number of neighbours $K$. Since cross-validation helps evaluate the accuracy of a classifier, I will use cross-validation to calculate an accuracy for each value of $K$ in a reasonable range, and then pick the value of $K$ that gives the best accuracy. The `tidymodels` package provides a simple syntax for tuning models: with `tune()`, a parameter in the model is marked to be tuned rather than given a specific value.
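
The tuning step described here could be sketched roughly as follows. Everything in this block is illustrative: `toy_train` is randomly generated data rather than `players.csv`, the seed and the grid of $K$ values (1 to 15 in steps of 2) are arbitrary, and stratifying the folds by `subscribe` is an assumption. The `kknn` package must be installed for the `"kknn"` engine.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical training data (randomly generated, not the real players.csv)
toy_train <- tibble(
  played_hours = runif(120, 0, 10),
  Age = runif(120, 9, 58)
) |>
  mutate(subscribe = factor(if_else(played_hours + rnorm(120) > 4, "TRUE", "FALSE")))

knn_recipe <- recipe(subscribe ~ played_hours + Age, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = tune() marks K as a parameter to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

toy_vfold <- vfold_cv(toy_train, v = 5, strata = subscribe)
k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Cross-validated accuracy for each candidate K
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_vals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest mean cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Plotting `mean` against `neighbors` from `knn_results` is a common way to check that the chosen $K$ sits in a stable region rather than on an isolated peak.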

I am going to split the data so that 70% is used for training and 30% is used for testing. I will also use $5$-fold cross-validation on the training set.
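
A minimal sketch of this split is given below. `toy_players` is randomly generated data standing in for `players.csv`, the seed is arbitrary, and stratifying by `subscribe` (so that both sets keep a similar share of subscribers) is an assumption rather than part of the plan above.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for players.csv (randomly generated values)
toy_players <- tibble(
  subscribe = factor(rep(c("TRUE", "TRUE", "TRUE", "FALSE"), times = 25)),
  played_hours = runif(100, 0, 10),
  Age = round(runif(100, 9, 58))
)

# 70% training, 30% testing
players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5-fold cross-validation, built from the training set only
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

players_vfold
```

The test set is set aside here and would only be used once, to evaluate the final model after $K$ has been chosen.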

## (5) GitHub Repository