In [None]:
library(tidyverse)

In [None]:
players_data <- read_csv("players.csv")
players_data

In [None]:
# Summary statistics for played_hours and Age: mean, min, max, and percentage of missing values
players_mean <- players_data |>
    summarise(mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
              min_played_hours = min(played_hours, na.rm = TRUE),
              max_played_hours = max(played_hours, na.rm = TRUE),
              missing_played_hours = round(mean(is.na(played_hours)) * 100, 2),
              mean_age = round(mean(Age, na.rm = TRUE), 2),
              min_age = min(Age, na.rm = TRUE),
              max_age = max(Age, na.rm = TRUE),
              missing_age = round(mean(is.na(Age)) * 100, 2))
players_mean

<h2> (1) Data Description: </h2>

**Summary:** This project analyzes the players.csv dataset, which contains information about 196 individual players on a Minecraft server, including their experience level, newsletter subscription status, hashed email address, playtime, name, gender, and age.

The dataset includes seven variables. The table below summarizes each variable and its type.

<h3>Variables</h3>

| Variable Name | Type | Description | Example Value |
|----------------|------|--------------|----------------|
| `experience` | Categorical (`chr`) | Player’s skill level or rank. | `Pro` |
| `subscribe` | Boolean (`lgl`) | Indicates whether the player is subscribed to the newsletter (TRUE) or not (FALSE). | `TRUE` |
| `hashedEmail` | String (`chr`) | Unique anonymized identifier for each player. | `f6daba4...` |
| `played_hours` | Numeric (`dbl`) | Total number of hours the player has spent playing. | `30.3` |
| `name` | String (`chr`) | Player’s first name. | `Morgan` |
| `gender` | Categorical (`chr`) | Player’s gender identity. | `male` |
| `Age` | Numeric (`dbl`) | Player’s age in years. Contains some missing values (`NA`). | `17` |

---

<h3>Summary Statistics</h3>

| Variable | Mean | Min | Max | Missing (%) |
|-----------|------|-----------|------|--------------|
| `played_hours` | *5.85* | *0* | *223.1* | 0% |
| `Age` | *21.14* | *9* | *58* | 1.02% |

---

<h3>Direct Observations and Problems</h3>

- The **experience** variable may represent skill progression and could be useful in predicting playtime.

- The **hashedEmail** variable appears to be the unique player identifier, but is not relevant for analysis.

- The **played_hours** variable contains many zeros, possibly representing new players who have not yet begun playing; however, so many zeros might affect later predictions when answering the question with this data.

- The **gender** variable has many different responses, such as “Other”, “Two-Spirited”, and “Prefer not to say”. This might make it hard to group or summarize.

- The **Age** variable has some missing values, which must be handled before modelling.
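
The issues listed above can be checked and handled in a few lines. The following is only a minimal sketch on a made-up toy tibble (`toy_players` is hypothetical and its values are not taken from players.csv). Lumping rare gender responses with `fct_lump_min()` and dropping rows with missing `Age` are assumptions about one reasonable way to proceed, not the only option.

```r
library(tidyverse)

# Made-up toy data standing in for players_data (illustrative values only)
toy_players <- tibble(
  played_hours = c(0, 0, 30.3, 1.2, 0.5),
  gender       = c("Male", "Female", "Two-Spirited", "Prefer not to say", "Male"),
  Age          = c(17, NA, 21, 35, 19)
)

# Proportion of players with zero recorded playtime
zero_share <- mean(toy_players$played_hours == 0)

toy_clean <- toy_players |>
  # Assumption: responses seen fewer than 2 times are pooled into "Other"
  mutate(gender_grouped = fct_lump_min(gender, min = 2, other_level = "Other")) |>
  # Assumption: rows with a missing Age are dropped rather than imputed
  drop_na(Age)

zero_share
toy_clean
```

On the real data, the `min` threshold would need to be chosen after looking at the actual counts per response (for example with `count(players_data, gender)`).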

<h3>Other Potential Issues</h3>

- The data may not represent all types of players (for example, older players or casual players may be missing).
  
- If some of the data are self-reported (such as age), results based on this dataset may be less accurate.

<h3>How the Data Were Collected</h3>
<p>A research group in Computer Science at UBC, led by Frank Wood, is collecting data about how people play video games. They have set up a Minecraft server, and players' actions are recorded as they navigate through the world.</p>

<h2>(2) Questions:</h2>

<h3>The Question I Will Be Addressing</h3>

**Question 1:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

<h3>The Specific Question</h3>

Can a player's total playtime and age predict whether they subscribe to the newsletter?

<h3>How the Data Will Help Me Address the Question of Interest</h3>

<p>This dataset contains total playtime, age, and subscription status for each player. I will focus on these three variables and remove rows with missing values (<code>NA</code>). Then I can fit a predictive model (logistic regression) to test whether playtime and age help predict which players are more likely to subscribe to the newsletter, as asked in the broad question.</p>
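
As a rough illustration of this plan, the sketch below fits a logistic regression with base R's `glm()` on a made-up toy tibble. The name `toy_players` and all of its values are hypothetical and are not taken from players.csv. Using `glm()` with `family = binomial` is an assumption about how the model could be fit; the final analysis may use a different modelling workflow.

```r
library(tidyverse)

# Made-up toy data standing in for players_data (illustrative values only)
toy_players <- tibble(
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0, 1.5, 0.2, 0.1, 12.0, 3.0, 0),
  Age          = c(17, 25, NA, 19, 40, 22, 18, 21)
)

# Keep only the three variables of interest and drop rows with any NA
model_data <- toy_players |>
  select(subscribe, played_hours, Age) |>
  drop_na()

# Logistic regression: log-odds of subscribing as a linear function of playtime and age
subscribe_fit <- glm(subscribe ~ played_hours + Age,
                     data = model_data,
                     family = binomial)

# Fitted probability of subscribing for each player in model_data
predicted_prob <- predict(subscribe_fit, type = "response")

summary(subscribe_fit)
```

With only a handful of made-up rows the coefficients are meaningless; the point is the shape of the workflow (select, drop `NA`, fit, predict).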

In [None]:
# Overlaid histograms of age, coloured by newsletter subscription status
visualization_age_subscribe <- players_data |>
    ggplot(aes(x = Age, fill = subscribe)) +
    geom_histogram(position = "identity", alpha = 0.3, binwidth = 5) +
    labs(title = "Age by Subscription Status",
         x = "Age (years)",
         y = "Number of players",
         fill = "Subscribed")

# Overlaid histograms of total playtime, coloured by newsletter subscription status
visualization_playtime_subscribe <- players_data |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(position = "identity", alpha = 0.3, binwidth = 5) +
    labs(title = "Played Hours by Subscription Status",
         x = "Played hours (h)",
         y = "Number of players",
         fill = "Subscribed")

visualization_playtime_subscribe
visualization_age_subscribe