# DSCI 100 Project Planning Stage (Individual) 

**Ajay Antonio (90110792)**

# Introduction 

In this project, we analyze a data set in collaboration with a research group at UBC. Led by Frank Wood and his team, the project aims to provide insight into the relationship between an online game and its players, drawing on seemingly unrelated variables such as newsletter subscription, time logged in, and player type. By applying classification techniques to data recorded from a Minecraft server, we can draw conclusions that support more efficient and targeted outreach to players.

The broad question that I will be aiming to tackle in my project is 

# Data Description

We will be using the `players.csv` file provided by DSCI 100, which lists every unique player recorded on the Minecraft server along with information about each one. The data set has **196** observations and **7** variables.


#### Table 1 - Variable Names and Types in `players.csv`

| Variable     | Type      | Meaning                                                        |
|---------------|-----------|----------------------------------------------------------------|
| experience    | factor    | Experience level of the player (Beginner, Amateur, Regular, Pro, Veteran) |
| subscribe     | logical   | Whether the player is subscribed to the newsletter (TRUE or FALSE) |
| hashedEmail   | character | Email address of the player (hashed)                            |
| played_hours  | double    | Total hours of Minecraft played                                 |
| name          | character | Name of the player                                              |
| gender        | factor    | Gender of the player                                            |
| Age           | integer   | Age of the player in years                                      |

We were also given a second data set, `sessions.csv`, which is not used in this project. It has **1535** observations and **5** variables.

#### Table 2 - Variable Names and Types in `sessions.csv`

| Variable           | Type      | Meaning                                                |
|--------------------|-----------|--------------------------------------------------------|
| hashedEmail        | character | Email address of the player (hashed)                   |
| start_time         | character | Start time of the play session                         |
| end_time           | character | End time of the play session                           |
| original_start_time| double    | Start time in Unix epoch milliseconds                  |
| original_end_time  | double    | End time in Unix epoch milliseconds                    |

In [None]:
# Import the libraries used in this project
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
# Set the default plot size for all graphs
options(repr.plot.width = 12, repr.plot.height = 6)

In [None]:
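# Load the player data
players <- read_csv("players.csv")

# Compute summary statistics for hours played and age, ignoring missing values
summary_stats <- players %>%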
  summarise(
    n_obs = n(),
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    sd_played_hours = round(sd(played_hours, na.rm = TRUE), 2),
    min_played_hours = round(min(played_hours, na.rm = TRUE), 2),
    max_played_hours = round(max(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2),
    sd_age = round(sd(Age, na.rm = TRUE), 2),
    min_age = round(min(Age, na.rm = TRUE), 2),
    max_age = round(max(Age, na.rm = TRUE), 2)
  )

summary_stats

**The code above imports the `players.csv` file and computes the summary statistics reported below (rounded to 2 decimal places).**

| Statistic          | Value     |
|--------------------|-----------|
| mean_played_hours  | 5.85      | 
| sd_played_hours    | 28.36     |
| min_played_hours   | 0         | 
| max_played_hours   | 223.1     | 
| mean_age           | 21.14     | 
| sd_age             | 7.39      | 
| min_age            | 9         | 
| max_age            | 58        | 

#### Issues and Notes on the Data Sets

Some potential issues that arise when dealing with the data are: 

**1:** The data is not tidy, particularly the `sessions.csv` data set, and must be wrangled before it is easy to read and work with. For example, `original_start_time` and `original_end_time` are stored as Unix epoch timestamps in milliseconds, which are not human-readable. Furthermore, the `start_time` and `end_time` columns each hold more than one measurement per value (a date and a time of day), which violates the tidy data rule that each column must contain a single variable.
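The wrangling described in point 1 could look roughly like the sketch below. It uses a small made-up tibble in place of `sessions.csv`. The sample values and the `day/month/year hour:minute` layout of `start_time` are assumptions for illustration, as are the new column names `start_datetime`, `start_date`, and `start_clock`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for sessions.csv (values are made up for illustration)
sessions_example <- tibble(
  hashedEmail = c("abc123", "def456"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  original_start_time = c(1719771120000, 1718667180000)
)

sessions_tidy <- sessions_example %>%
  # Epoch milliseconds -> seconds -> date-time, interpreted in UTC
  mutate(start_datetime = as.POSIXct(original_start_time / 1000,
                                     origin = "1970-01-01", tz = "UTC")) %>%
  # Split the combined date/time string so each column holds one measurement
  separate(start_time, into = c("start_date", "start_clock"), sep = " ")

sessions_tidy
```

The same two steps would apply to `end_time` and `original_end_time`. Whether the real timestamps should be read as UTC or as local time is not stated in the data description, so the `tz` argument here is only a placeholder choice.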

**2:** Since my research question mainly deals with the `players.csv` data set and focuses on experience and hours played, I will not be using the `name` or `gender` variables.
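As a minimal sketch of point 2, the unused columns can be dropped with `select()`. The tibble below is a made-up stand-in for `players.csv`, and the name `players_selected` is hypothetical. Converting `experience` to a factor is an added step, included only so that the column matches the type listed in Table 1.

```r
library(dplyr)

# Hypothetical stand-in for players.csv (values are made up for illustration)
players_example <- tibble(
  experience = c("Pro", "Veteran", "Amateur"),
  subscribe = c(TRUE, TRUE, FALSE),
  hashedEmail = c("abc123", "def456", "ghi789"),
  played_hours = c(30.3, 3.8, 0.0),
  name = c("Player A", "Player B", "Player C"),
  gender = c("Male", "Female", "Male"),
  Age = c(9, 17, 21)
)

players_selected <- players_example %>%
  # Drop the two columns that the analysis does not use
  select(-name, -gender) %>%
  # Store experience as a factor so it is treated as a categorical variable
  mutate(experience = factor(experience))

players_selected
```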