# Planning Stage Akbar Ismatullayev 17376021 #

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
# source("cleanup.R")

### Reading the data for analysis

In [None]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

sessions
players

## (1) Data Description ##

In [None]:
glimpse(players)
glimpse(sessions)


This project uses two related datasets, players.csv and sessions.csv. Together they describe player characteristics and player activity on a Minecraft server hosted by UBC.

players.csv contains demographic and other information about each unique player. Each row represents one player, so there are 196 unique players.
For the variables:
* experience is a character variable with categories such as Pro and Veteran. It is an ordinal feature, which means the categories have a natural ordering.
* subscribe is a boolean indicating whether the player is subscribed to the newsletter.
* hashed_email is a character variable that is unique for each player.
* played_hours is a double variable recording how many hours each player has played. Some players have zero played hours: they created an account but never played, so we have to be careful with these rows.
* name is a character variable. It is not unique and is not useful for the analysis.
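
The zero-hour players mentioned above can be counted directly. The following is a minimal sketch on a small invented table, not the real data: `toy_players` and its values are hypothetical stand-ins for players.csv, and only the column names `hashed_email` and `played_hours` come from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for players.csv; the values are invented for illustration.
toy_players <- tibble(
  hashed_email = c("a1", "b2", "c3", "d4"),
  played_hours = c(0, 1.5, 0, 30.2)
)

# Count how many players never played and what share of all players that is.
zero_hours_summary <- toy_players |>
  summarise(
    n_players    = n(),
    n_zero_hours = sum(played_hours == 0),
    prop_zero    = mean(played_hours == 0)
  )

zero_hours_summary
```

Running the same `summarise()` call on the real `players` tibble would show how large this group is before deciding whether to keep or filter it.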


sessions.csv contains a record of every gameplay session. Each session can be linked to a player through 'hashed_email', and the start and end timestamps let us calculate the session duration. Each row corresponds to one gameplay session. Since there are 1535 rows, which is more than the number of players, hashed_email values repeat in this dataset: players play more than once on the server. We have to be careful about that when linking the two datasets.
For the variables:
* hashed_email is a character variable that links the session to a player.
* start_time is a character variable recording when the session started.
* end_time is a character variable recording when the session ended.
* original_start_time and original_end_time are the same or almost the same, so I will remove them.
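
Because start_time and end_time are stored as text, they have to be parsed before a duration can be calculated. The sketch below shows one possible way to do this with lubridate. It assumes the timestamps follow a "dd/mm/yyyy HH:MM" layout, which should be checked against the real file, and `toy_sessions` with all of its values is invented for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows; the "dd/mm/yyyy HH:MM" timestamp layout is an assumption.
toy_sessions <- tibble(
  hashed_email        = c("a1", "a1", "b2"),
  start_time          = c("30/06/2024 18:12", "01/07/2024 09:00", "15/07/2024 23:50"),
  end_time            = c("30/06/2024 18:24", "01/07/2024 10:30", "16/07/2024 00:10"),
  original_start_time = c(1.71977e12, 1.71982e12, 1.72109e12),
  original_end_time   = c(1.71977e12, 1.71983e12, 1.72109e12)
)

sessions_clean <- toy_sessions |>
  # Drop the two columns that will not be used.
  select(-original_start_time, -original_end_time) |>
  mutate(
    # Parse the character timestamps into date-times.
    start_time   = dmy_hm(start_time),
    end_time     = dmy_hm(end_time),
    # Session length in minutes.
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

sessions_clean
```

If the real timestamps use a different layout, `dmy_hm()` would need to be swapped for the matching lubridate parser.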

### Summary statistics ###

In [None]:
summary(players)
summary(sessions)


Looking at the summary, we notice missing values (NA) in age (players) and original_end_time (sessions), which we need to handle carefully. The subscribe variable is very imbalanced: most players are subscribed. If we want to classify whether a player is subscribed, this imbalance can make the task tricky.

#### Potential issues in short ####
* age and original_end_time have NA values.
* experience can be encoded as an ordinal feature, for example Beginner = 1 up to Pro = 5.
* some players have no play time, so they do not have any game sessions.
* original_start_time and original_end_time are the same or almost the same.
* subscribe is very imbalanced: there are more true values than false values, which can mislead a classification model.
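
Several of these issues can be checked with a few lines of dplyr. The sketch below counts NA values per column, computes the share of each subscribe class, and shows one way to turn experience into an integer code. `toy_players` and its values are hypothetical, and the level order is taken from the wrangling step later in this notebook.

```r
library(dplyr)

# Hypothetical stand-in for players.csv; the values are invented for illustration.
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Regular", "Amateur"),
  subscribe  = c(TRUE, TRUE, TRUE, FALSE),
  age        = c(17, NA, 21, 25)
)

# Number of missing values in every column.
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Class balance of the response variable.
class_balance <- toy_players |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

# Integer code for experience: Beginner = 1 up to Pro = 5.
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
toy_players <- toy_players |>
  mutate(experience_code = as.integer(factor(experience, levels = experience_levels)))

na_counts
class_balance
```

A strongly lopsided `prop` column is a sign that plain accuracy will be a misleading metric for a classifier on this response.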

## (2) Project Statement ##

I want to know: which player characteristics and behaviours are most predictive of subscribing to the UBC Minecraft research newsletter? To answer this question, we look at variables such as experience, age, gender and played hours, with "subscribe" as the response variable. The specific question is: can a player's experience level, age, gender, or played_hours predict whether they subscribe to the Minecraft research newsletter in the players dataset?

#### How do these variables support this question? ####
* The players.csv dataset already links player demographics, playtime, and subscription status, allowing direct exploration of the relationships between them.

* The sessions.csv dataset can enrich the analysis by quantifying activity, such as the number or duration of sessions, and linking it to players via hashed_email.

* Together these variables allow us to explore whether active or experienced players tend to subscribe more often.
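
Linking the two datasets could look roughly like the sketch below: sessions are summarised to one row per player and then joined onto the players table. All data here are invented, and `duration_min` is a hypothetical column that would first have to be derived from the session timestamps.

```r
library(dplyr)

# Hypothetical stand-ins for the two datasets; the values are invented.
toy_players <- tibble(
  hashed_email = c("a1", "b2", "c3"),
  subscribe    = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(
  hashed_email = c("a1", "a1", "b2"),
  duration_min = c(12, 90, 20)  # assumed to be computed from start_time and end_time
)

# One row per player: number of sessions and total minutes played.
session_summary <- toy_sessions |>
  group_by(hashed_email) |>
  summarise(
    n_sessions = n(),
    total_min  = sum(duration_min),
    .groups    = "drop"
  )

# Keep every player; players without sessions get zeros instead of NA.
players_activity <- toy_players |>
  left_join(session_summary, by = "hashed_email") |>
  mutate(
    n_sessions = coalesce(n_sessions, 0L),
    total_min  = coalesce(total_min, 0)
  )

players_activity
```

A `left_join()` is used here so that players who never started a session stay in the table, which matters because some players have zero played hours.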

## (3) Exploratory Data Analysis and Visualization ##

### Minimal wrangling ###

In [None]:
# Convert experience, subscribe and gender to factors. experience is ordinal, so its levels are listed in order from Beginner to Pro.

players <- players |>
  mutate(
    experience = factor(experience, levels = c("Beginner", "Amateur","Regular","Veteran","Pro")),
    subscribe  = as.factor(subscribe),
    gender     = as.factor(gender)
  )

### Mean of the numeric variables in the players dataset

In [None]:
# Calculate the mean of every numeric variable in the players dataset

players_means <- players |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
players_means


### Exploratory visualizations ###

In [None]:
ggplot(players, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Subscription Status by Experience Level",
    x = "Experience Level",
    y = "Number of Players",
    fill = "Subscribed"
  )