In [None]:
library(tidyverse)

In [None]:
players_data <- read_csv("players.csv")
players_data

In [None]:
players_mean <- players_data |>
    summarise(mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
              min_played_hours = min(played_hours, na.rm = TRUE),
              max_played_hours = max(played_hours, na.rm = TRUE),
              missing_played_hours = round(mean(is.na(played_hours)) * 100, 2),
              mean_age = round(mean(Age, na.rm = TRUE), 2),
              min_age = min(Age, na.rm = TRUE),
              max_age = max(Age, na.rm = TRUE),
              missing_age = round(mean(is.na(Age)) * 100, 2))
players_mean

<h2> (1) Data Description: </h2>

<h3>Summary</h3>

This project analyzes data from the players.csv dataset, which contains information about 196 individual players on a Minecraft server, including their experience level, newsletter subscription status, hashed email address, playtime, name, gender, and age.

The dataset includes seven variables. The table below summarizes each variable and its type.

<h3>Variables</h3>

| Variable Name | Type | Description | Example Value |
|----------------|------|--------------|----------------|
| `experience` | Categorical (`chr`) | Player’s skill level or rank. | `Pro` |
| `subscribe` | Boolean (`lgl`) | Indicates whether the player is subscribed to the game-related newsletter (TRUE) or not (FALSE). | `TRUE` |
| `hashedEmail` | String (`chr`) | Unique anonymized identifier for each player. | `f6daba4...` |
| `played_hours` | Numeric (`dbl`) | Total number of hours the player has spent playing. | `30.3` |
| `name` | String (`chr`) | Player’s first name. | `Morgan` |
| `gender` | Categorical (`chr`) | Player’s gender identity. | `male` |
| `Age` | Numeric (`dbl`) | Player’s age in years. Contains some missing values (`NA`). | `17` |

---

<h3>Summary Statistics</h3>

| Variable | Mean | Min | Max | Missing (%) |
|-----------|------|-----------|------|--------------|
| `played_hours` | *5.85* | *0* | *223.1* | 0% |
| `Age` | *21.14* | *9* | *58* | 1.02% |

---

<h3>Direct Observations and Problems</h3>

- The **experience** variable may represent skill progression and could be useful in predicting playtime.

- The **hashedEmail** variable appears to be the unique player identifier, but is not relevant for analysis.

- The **played_hours** variable contains many zeros, possibly representing new players who have not yet begun playing; however, these zeros may skew later predictions when answering the question.

- The **gender** variable has many different responses, such as “Other”, “Two-Spirited”, and “Prefer not to say”. Some of these categories may contain very few players, which makes the variable hard to group or summarize.

- The **Age** variable has some missing values, which must be handled before modelling.
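One possible way to handle the sparse gender categories noted above is to lump rare responses into a single level before summarizing. The sketch below is only an illustration: the helper name `lump_rare_genders`, the threshold `min_count = 5`, and the label `"Other (grouped)"` are assumptions, not part of the original analysis.

```r
library(dplyr)
library(forcats)

# Hypothetical helper: merge gender responses given by fewer than
# `min_count` players into one level, then count players per level.
lump_rare_genders <- function(players, min_count = 5) {
  players |>
    mutate(gender = fct_lump_min(as.factor(gender),
                                 min = min_count,
                                 other_level = "Other (grouped)")) |>
    count(gender, sort = TRUE)
}

# Usage (assumes players_data was loaded in an earlier cell):
# lump_rare_genders(players_data)
```

Whether lumping is acceptable depends on the question being asked; it trades detail for groups large enough to summarize.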

<h3>Other Potential Issues</h3>

- The data may not represent all types of players (for example, older players or casual players may be missing).
  
- Self-reported data (such as age) may contain errors.

<h3>How the Data Were Collected</h3>

<p> A research group in Computer Science at UBC, led by Frank Wood, is collecting data about how people play video games. They have set up a Minecraft server, and players' actions are recorded as they navigate through the world. </p >

<h2>(2) Questions:</h2>

<h3>Addressing Question 1</h3>

**What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

<h3>The Specific Question</h3>

Can a player’s total playtime and age predict whether they subscribe to the newsletter?

<h3> How the Data Helps Address the Specific Question</h3>

This dataset provides the three pieces of information needed for this question: **total playtime**, **age**, and **subscription status** for each player. By focusing on these variables and removing rows with missing values in Age, I can examine how playtime and age differ between subscribers and non-subscribers.
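A minimal sketch of that comparison, assuming the `players_data` tibble loaded at the top of this notebook. The helper name `summarise_by_subscription` is hypothetical and used only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: compare playtime and age between
# subscribers and non-subscribers.
summarise_by_subscription <- function(players) {
  players |>
    select(subscribe, played_hours, Age) |>
    drop_na(Age) |>                      # drop rows with a missing Age
    group_by(subscribe) |>
    summarise(n_players = n(),
              mean_played_hours = mean(played_hours, na.rm = TRUE),
              median_played_hours = median(played_hours, na.rm = TRUE),
              mean_age = mean(Age),
              .groups = "drop")
}

# Usage (assumes players_data exists from the earlier cell):
# summarise_by_subscription(players_data)
```

The median is reported next to the mean because playtime is strongly right-skewed (many zeros and a maximum above 200 hours), so the mean alone can be misleading.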

<h2>(3) Exploratory Data Analysis and Visualization</h2>

In [None]:
# Overlaid histograms of age, coloured by newsletter subscription status
visualization_age_subscribe <- players_data |>
            ggplot(aes(x = Age, fill = subscribe)) +
            geom_histogram(position = "identity", alpha = 0.3, binwidth = 5) +
            labs(title = "Age distribution by subscription status",
                 x = "Age (years)",
                 y = "Number of players",
                 fill = "Subscribed")
visualization_age_subscribe

# Overlaid histograms of total playtime, coloured by newsletter subscription status
visualization_playtime_subscribe <- players_data |>
            ggplot(aes(x = played_hours, fill = subscribe)) +
            geom_histogram(position = "identity", alpha = 0.3, binwidth = 5) +
            labs(title = "Played hours distribution by subscription status",
                 x = "Played hours (h)",
                 y = "Number of players",
                 fill = "Subscribed")
visualization_playtime_subscribe

The histograms show the distributions of age and playtime, split by whether players have subscribed. Most subscribers are between 13 and 17 years old and have 0–5 total played hours. Both patterns are clearly visible in the graphs.

<h2>(4) Methods and Plan</h2>

<h3>Method</h3>

<p> To address the question "Can a player's total playtime and age predict whether they subscribe to the newsletter?", I plan to use a KNN classification model, because the outcome (subscribe) is categorical. This will allow me to classify players as "subscribers" or "non-subscribers" based on their playtime and age.</p >

<h3>Why is this method appropriate?</h3>

<p> This method is appropriate because KNN classification works well with numeric predictors. In addition, KNN does not assume any specific form for the relationship between the predictors and the outcome, which is helpful here since age and playtime might have a non-linear relationship with subscription. </p >

<h3>Which assumptions are required, if any, to apply the method selected?</h3>

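 - The predictors must be standardized (centred and scaled) before fitting, because KNN relies on distances between observations and a predictor with a larger range would otherwise dominate.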

<h3>What are the potential limitations or weaknesses of the method selected?</h3>

 - The model is sensitive to the choice of k.
 - Outliers can distort the distance calculations, especially in the playtime data, which ranges from 0 to over 200 hours.

<h3>How are you going to compare and select the model?</h3>

<p> I’ll use k-fold cross-validation (around 5 folds) on the training data to choose a good value for k. Model performance will be compared mainly through accuracy and the misclassification rate. </p >
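The tuning step described above could be sketched with tidymodels as follows. This is an illustration under several assumptions: the helper name `tune_knn_k` and the candidate range `1:15` are hypothetical, the `kknn` package is installed as the engine, and the training data already has missing values removed.

```r
library(tidymodels)  # the "kknn" engine also requires the kknn package

# Hypothetical helper: estimate cross-validated accuracy for each candidate k.
tune_knn_k <- function(train_data, k_values = 1:15, folds = 5) {
  # The outcome must be a factor for a classification model
  train_data <- train_data |> mutate(subscribe = as.factor(subscribe))

  # Standardize predictors inside the recipe so scaling is
  # re-estimated within each fold
  knn_recipe <- recipe(subscribe ~ played_hours + Age, data = train_data) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular",
                               neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  cv_folds <- vfold_cv(train_data, v = folds, strata = subscribe)

  workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = cv_folds, grid = tibble(neighbors = k_values)) |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    arrange(desc(mean))   # best-performing k first
}

# Usage (players_train is a hypothetical name for the training set):
# tune_knn_k(players_train)
```

The misclassification rate is simply one minus the accuracy reported in the `mean` column.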

<h3>How are you going to process the data to apply the model? </h3>

<p>I’ll keep only the variables I need (played_hours, Age, and subscribe), remove missing values, and standardize the numeric predictors. I’ll also take a quick look at any extreme values in playtime since they might affect the distance calculations.</p >

<p>After cleaning, I will split the data into 70% for training and 30% for testing. The training set will be used for all model development, including scaling, tuning the number of neighbours (k), and running cross-validation, so that no information from the test set leaks into the model. I plan to use 5-fold cross-validation within the training set to select the best k and check that the model generalizes well. The test set will remain untouched until the final evaluation to provide an estimate of model performance on unseen data.</p >
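The cleaning and splitting steps might look like the sketch below. The helper name `prepare_and_split` and the seed value are assumptions made for illustration; stratifying on `subscribe` is also an added choice, made so both sets keep a similar share of subscribers.

```r
library(dplyr)
library(tidyr)
library(rsample)

# Hypothetical helper: keep the needed columns, drop incomplete rows,
# and create a 70/30 train/test split.
prepare_and_split <- function(players, train_prop = 0.7, seed = 1) {
  set.seed(seed)  # make the random split reproducible
  players |>
    select(subscribe, played_hours, Age) |>
    drop_na() |>
    mutate(subscribe = as.factor(subscribe)) |>
    initial_split(prop = train_prop, strata = subscribe)
}

# Usage (assumes players_data was loaded in an earlier cell):
# players_split <- prepare_and_split(players_data)
# players_train <- training(players_split)
# players_test  <- testing(players_split)
```

Scaling is deliberately left out of this step so that it can be estimated from the training set only.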