In [None]:
# Load packages quietly; tidyverse already attaches tidyr, so it is not loaded separately
suppressPackageStartupMessages(suppressWarnings({
    library(tidyverse)
    library(repr)
    library(tidymodels)
    library(GGally)
    library(gridExtra)
    library(grid)
}))
    

In [None]:
# Notebook display settings: plot size, rows shown per table, and quiet read_csv()
options(repr.plot.width = 6,
        repr.plot.height = 4,
        repr.matrix.max.rows = 7,
        readr.show_col_types = FALSE)

player_data <- read_csv("data/players.csv")
session_data <- read_csv("data/sessions.csv")



# **Data Science Project: Planning Stage (Individual)**

## **(1) Data Description:**
Our sample was collected through voluntary sign-up on the plaicraft.ai website, where participants submitted their email and phone number to join the research project. After signing up, each participant is granted access to the server to play, and further data is then collected by recording activity on the server itself.
 

In [None]:
# Numeric summary of the player data; Age contains NA values, hence na.rm = TRUE
ply_summarised_num <- player_data |>
    summarise(subscribed_percent_decimal = mean(subscribe),
              played_hours_avg = mean(played_hours),
              played_hours_median = median(played_hours),
              played_hours_max = max(played_hours),
              played_hours_min = min(played_hours),
              age_avg = mean(Age, na.rm = TRUE),
              age_median = median(Age, na.rm = TRUE),
              age_max = max(Age, na.rm = TRUE),
              age_min = min(Age, na.rm = TRUE)) |>
    # Round every summary column to two decimal places
    mutate(across(everything(), \(x) round(x, 2)))

# Count and proportion of players in each gender category
ply_summarised_gender <- player_data |>
    count(gender) |>
    mutate(percent_decimal = round(n / sum(n), 2)) |>
    rename(count = n)

# Count and proportion of players at each experience level
ply_summarised_experience <- player_data |>
    count(experience) |>
    mutate(percent_decimal = round(n / sum(n), 2)) |>
    rename(count = n)


#### **Player Data:**
Player info from the survey and total play time on the server.

Observations: 196

Variables: 7 

Our **variables** from the player dataset:
> - `experience` \<char> the player's experience level: `Beginner`, `Amateur`, `Regular`, `Veteran`, or `Pro`.
> - `subscribe` \<Boolean> whether the player subscribed to the newsletter (`TRUE` or `FALSE`).
> - `hashedEmail` \<char> a unique hash used as the player's ID.
> - `played_hours` \<double> total hours spent playing on the research project server.
> - `name` \<char> the player's first name.
> - `gender` \<char> the player's gender: `Male`, `Female`, `Non-binary`, `Prefer not to say`, `Agender`, `Two-Spirited`, or `Other`.
> - `Age` \<double> the player's age in years.





#### Summary across numeric values:

In [None]:
ply_summarised_num

#### Summary of participants' gender:

In [None]:
ply_summarised_gender

#### Summary of participants' experience:

In [None]:
ply_summarised_experience

#### Potential Issues:
- Voluntarily provided information, such as age and experience, could be misrepresented or omitted entirely. `Age`, for example, contains some NA values.
- Play hours vary widely, and many participants have yet to play on the server, which can complicate our interpretation of the data.
- Our sample is dominated by male-identifying participants, which may affect results as we proceed.
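
The issues listed above can be quantified before any modelling. The following is a minimal sketch on a small made-up table (`toy_players` and its values are hypothetical stand-ins, not the real survey data), and it assumes that zero recorded `played_hours` means the participant has not yet played:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for player_data; values are invented for illustration
toy_players <- tibble(
  played_hours = c(0, 0, 1.5, 30.3, 0.2),
  Age = c(17, NA, 21, 9, NA),
  gender = c("Male", "Male", "Female", "Male", "Non-binary")
)

issue_check <- toy_players |>
  summarise(
    n_missing_age = sum(is.na(Age)),           # players with no reported age
    n_zero_hours = sum(played_hours == 0),     # players with no recorded play time
    prop_zero_hours = mean(played_hours == 0), # share of players who have not played
    prop_male = mean(gender == "Male")         # share of male-identifying players
  )

issue_check
```

Running the same `summarise()` call on `player_data` would show how large each issue is in the actual sample.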

In [None]:
# Count the distinct players that appear in the session data
distinct_players <- session_data |>
    summarise(n_players = n_distinct(hashedEmail))
distinct_players

#### **Session Data:**
User session info: login times and dates as recorded by the server.

Observations: 1535

Variables: 5


Our variables from the session dataset:
> - `hashedEmail` \<char> is a unique hash used as a data ID for the player.
> - `start_time` \<char> session start time in `DD/MM/YYYY HH:MM` format (24-hour clock).
> - `end_time` \<char> session end time in `DD/MM/YYYY HH:MM` format (24-hour clock).
> - `original_start_time` \<char> precise start time, to the millisecond.
> - `original_end_time` \<char> precise end time, to the millisecond.


> Summary Statistic:
> - mean ...



#### Summary of precise sessions:

#### Potential Issues:
- The data is not tidy: the date and the time of day are stored in a single variable, which will cause issues when analyzing the data.
- Using `hashedEmail` as the sole identifier makes it hard to see at a glance which players have logged in multiple times.
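
Both session issues could be handled during wrangling. The sketch below is one possible approach, not a settled plan: it uses a made-up table (`toy_sessions`, with invented IDs and timestamps) and assumes the `DD/MM/YYYY HH:MM` layout described above, parsing it with lubridate's `dmy_hm()`. It then splits out the date and hour, and counts sessions per `hashedEmail` so repeat players are visible without reading the hashes.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical stand-in for session_data; IDs and times are invented
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

tidy_sessions <- toy_sessions |>
  mutate(
    start_dt = dmy_hm(start_time),      # parse day/month/year hour:minute
    end_dt = dmy_hm(end_time),
    start_date = as_date(start_dt),     # date in its own column
    start_hour = hour(start_dt),        # hour of day in its own column
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

# One row per player with the number of sessions they logged
sessions_per_player <- tidy_sessions |>
  count(hashedEmail, name = "n_sessions")

tidy_sessions
sessions_per_player
```

Keeping the parsed date-time alongside the separate date and hour columns leaves room to compute session lengths later without re-parsing.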

## **(2) Questions:**

## **(3) Exploratory Data Analysis and Visualization:**

## **(4) Methods and Plan:**