## Individual Project Portion

In [None]:
library(tidyverse)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/Modas101/ds100-individual-project/refs/heads/main/data/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/Modas101/ds100-individual-project/refs/heads/main/data/sessions.csv")

# head(players)
# head(sessions)

# dim(players)
# dim(sessions)

# summary(players)
# summary(sessions)

# Data Description

### Tibble Data

### players.csv
**Observations:** 196\
**Number of Variables:** 7

### sessions.csv
**Observations:** 1535\
**Number of Variables:** 5

## Variable Description (name and type)

### players.csv
**experience (string type):** A self-rated assessment of their own experience.\
**subscribe (boolean type):** Whether they are subscribed to a game-related newsletter or not.\
**hashedEmail (string type):** Their email, hashed.\
**played_hours (double type):** Time played in hours.\
**name (string type):** Their name.\
**gender (string type):** Their gender.\
**Age (double type):** Their real life age in years.

### sessions.csv
**hashedEmail (string type):** Their email, hashed.\
**start_time (string type):** The session start time, formatted as dd/mm/yy.\
**end_time (string type):** The session end time, formatted as dd/mm/yy.\
**original_start_time (double type):** The session start time as a UNIX timestamp.\
**original_end_time (double type):** The session end time as a UNIX timestamp.

## Summary Statistics (players.csv)

| Variable | Mean | Min | Max |
| :--- | :--- | :--- | :--- |
| `played_hours` | 5.85 | 0.00 | 223.10 |
| `Age` | 21.14 | 9.00 | 58.00 |
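
As a rough sketch of how a table like this could be produced, the cell below computes the mean, minimum, and maximum with `summarise()`. The tibble `toy_players` and its values are made up for illustration; only the column names come from `players.csv`.

```r
library(dplyr)

# Toy stand-in for players.csv; these values are invented for illustration.
toy_players <- tibble(
  played_hours = c(0, 0.5, 2, 10),
  Age = c(17, 21, NA, 30)
)

toy_summary <- toy_players |>
  summarise(
    mean_hours = mean(played_hours),
    min_hours = min(played_hours),
    max_hours = max(played_hours),
    # na.rm = TRUE skips missing ages instead of returning NA
    mean_age = mean(Age, na.rm = TRUE),
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE)
  )

toy_summary
```

Without `na.rm = TRUE`, any statistic on a column containing `NA` would itself be `NA`.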

## Issues

The `experience` variable in `players.csv` is very subjective, and is more likely an indicator of how confident the player is, rather than an actual measure of their skill level.

`Age` could easily be fabricated.

The `start_time` and `end_time` variables appear to be formatted as `dd/mm/yy` strings rather than UNIX time, which makes them hard to work with. Fortunately, `original_start_time` and `original_end_time` hold the same times as UNIX timestamps, which are easier to use.
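
A minimal sketch of how either time format could be turned into an R date-time, using base R only. The example values are hypothetical, and two things are assumptions rather than facts from the data: that the UNIX timestamps are in milliseconds, and that the strings look like `dd/mm/yyyy hh:mm`. The format string and the divisor would need to be adjusted to match the real columns.

```r
# Hypothetical values, not taken from sessions.csv
unix_ms <- 1719770000000             # assumed: milliseconds since 1970-01-01 UTC
start_string <- "30/06/2024 18:12"   # assumed: day-first date with a time
end_string <- "30/06/2024 18:24"

# Convert milliseconds to seconds, then to a date-time in UTC
from_unix <- as.POSIXct(unix_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Parse the day-first strings with an explicit format
start <- as.POSIXct(start_string, format = "%d/%m/%Y %H:%M", tz = "UTC")
end <- as.POSIXct(end_string, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes
session_minutes <- as.numeric(difftime(end, start, units = "mins"))
```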

The summary output shows 2 `NA` values in the `Age` variable of `players.csv`.\
The summary output shows 2 `NA` values in the `original_end_time` variable of `sessions.csv`.
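
Missing values can also be counted directly per column instead of reading them off `summary()`. The sketch below uses a made-up data frame, `toy_players`, as a stand-in for the real data.

```r
# Toy stand-in for players.csv; these values are invented for illustration.
toy_players <- data.frame(
  played_hours = c(0, 1.5, 3.2),
  Age = c(17, NA, 22)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy_players))
na_counts
```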

# Questions

### Broad Question
"We would like to know which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts."

### Specific Question
"Can player characteristics (specifically `experience`, `subscribe`, `gender`, and `Age`) predict whether a player is a high data contributor (defined as having `played_hours` above the 75th percentile, i.e. in the top 25%) in the `players` dataset?"

### How the data will address the question of interest
The `players.csv` dataset contains all the necessary variables described in the specific question.

**Explanatory Variables (X):** The predictors will be the player characteristics `experience`, `subscribe`, `gender`, and `Age`.\
**Response Variable (Y):** This will be a new boolean variable I create called `high_contributor`.

**Wrangling**:
1.  Remove observations with `NA` values in the `Age` column.
2.  Calculate the 75th percentile of the `played_hours` column.
3.  Create the new response variable `high_contributor`. `TRUE` if `played_hours` > 75th percentile value, `FALSE` otherwise.
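
To make the threshold step concrete, here is a small worked example on a made-up vector of hours (not the real data). It uses R's default `quantile()` method, which interpolates between neighbouring values.

```r
# Invented played-hours values for illustration
hours <- c(0, 0, 0.1, 0.5, 1, 2, 10, 50)

# 75th percentile with R's default (type 7) interpolation
cutoff <- quantile(hours, 0.75)

# Strictly greater than the cutoff counts as a high contributor
high <- hours > cutoff

cutoff
sum(high)
```

Because the comparison is strict, players exactly at the cutoff are labelled `FALSE`, so the `TRUE` class holds at most about a quarter of the observations.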

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)

clean_players <- players |>
    filter(!is.na(Age))

played_hours_75th_percentile <- clean_players |>
    pull(played_hours) |>
    quantile(0.75)
#played_hours_75th_percentile

clean_players <- clean_players |>
    mutate(high_contributor = played_hours > played_hours_75th_percentile)




# Plots
# Count of players by age, split by contributor status
clean_players |> ggplot(aes(x = Age, fill = high_contributor)) +
    geom_histogram(binwidth = 2, position = "stack", alpha = 0.8) +
    labs(title = "Number of High Contributors by Age",
        x = "Age (bins of 2 years)",
        y = "Number of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))
# Share of players in each age bin who are high contributors
clean_players |> ggplot(aes(x = Age, fill = high_contributor)) +
    geom_histogram(binwidth = 2, position = "fill", alpha = 0.8) +
    scale_y_continuous(labels = scales::percent) +
    labs(title = "Proportion of High Contributors by Age",
        x = "Age (bins of 2 years)",
        y = "Percent of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))

# Count of players by experience level (raw counts, so no percent scale)
clean_players |> ggplot(aes(x = experience, fill = high_contributor)) +
    geom_bar(position = "stack") +
    labs(title = "Number of High Contributors by Experience Level",
        x = "Experience Level",
        y = "Number of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))

clean_players |> ggplot(aes(x = experience, fill = high_contributor)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    labs(title = "Proportion of High Contributors by Experience Level",
        x = "Experience Level",
        y = "Percent of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))

# Count of players by subscription status (raw counts, so no percent scale)
clean_players |> ggplot(aes(x = subscribe, fill = high_contributor)) +
    geom_bar(position = "stack") +
    labs(title = "Number of High Contributors by Subscription",
        x = "Subscribed to Newsletter",
        y = "Number of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))

# Pipe the data into ggplot(); assigning here would overwrite clean_players
clean_players |> ggplot(aes(x = subscribe, fill = high_contributor)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    labs(title = "Proportion of High Contributors by Subscription",
        x = "Subscribed to Newsletter",
        y = "Percent of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))

mean_hours <- clean_players |>
  pull(played_hours) |>
  mean()
mean_age <- clean_players |>
  pull(Age) |>
  mean()

#mean_hours
#mean_age




# Exploratory Data Analysis and Visualization

## Mean values of quantitative variables (players.csv, cleaned)

| Variable | Mean |
| :--- | :--- |
| `played_hours` | 5.90 |
| `Age` | 21.14 |