# Individual Plan - Stephanie Ye

In [None]:
# tidyverse already attaches dplyr, ggplot2, tidyr and readr
library(tidyverse)

# Download link for the players table
player_url <- "http://drive.google.com/uc?export=download&id=1gfwfCu-YNRc_NDVSoNmKdemC9t4lYXJs"
players_data <- read_csv(player_url)

head(players_data)

In [None]:
sessions_url <- "http://drive.google.com/uc?export=download&id=1GHKAF_hpFRGvvXghIDMrERxU2GI_33N5"

sessions_data <- read_csv(sessions_url)

head(sessions_data)

## Data description

There are 196 observations and 9 variables in the players table. The variables include:
1. `experience`: categorical, describes the experience level of the player.
2. `subscribe`: logical, describes whether the player has subscribed or not.
3. `hashedEmail`: categorical, the hashed (encoded) email address of the player.
4. `played_hours`: numeric, describes how long the player has played the game.

There are 1535 observations and 5 variables in the sessions table. The variables are:
1. `hashedEmail`: categorical, the hashed (encoded) email address of the player.
2. `start_time` and `end_time`: categorical, describe when the player started and stopped playing the game.
3. `original_start_time` and `original_end_time`: numeric, the UNIX timestamps recorded by the system.
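Since `original_start_time` and `original_end_time` are described as UNIX timestamps, they could in principle be turned into date-times directly. The sketch below uses made-up values in a hypothetical `toy_raw` table and assumes the timestamps are in milliseconds since 1970-01-01 UTC; that unit is an assumption and should be checked against the real data before use.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Made-up timestamps; treating them as milliseconds is an assumption
toy_raw <- tibble(
  original_start_time = c(1719770000000, 1719773600000),
  original_end_time   = c(1719770720000, 1719773600000)
)

toy_converted <- toy_raw |>
  mutate(
    # Divide by 1000 to go from (assumed) milliseconds to seconds
    start_dt = as_datetime(original_start_time / 1000, tz = "UTC"),
    end_dt   = as_datetime(original_end_time / 1000, tz = "UTC"),
    # Session length in minutes, computed from the raw timestamps
    duration_mins = (original_end_time - original_start_time) / 1000 / 60
  )

toy_converted
```

If the real timestamps turn out to be in seconds, the division by 1000 would simply be dropped.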

## Potential issues

### Players

In [None]:
players_data |>
    group_by(gender) |>
    summarize(count = n())

1. The values in the `hashedEmail` column are hashed and unreadable, so the column may not be useful as a predictor.
2. There may be some extreme values in `played_hours`, which could distort the final results.
3. Some categories in `gender`, such as `Other` and `Two-Spirited`, may be too small. A model may not be able to capture patterns for these groups.
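One possible way to handle issues 2 and 3 is sketched below on a small made-up table (`toy_players` is hypothetical, not the real data). It flags extreme `played_hours` values with the common 1.5 × IQR rule and lumps rare `gender` levels together with `forcats::fct_lump_min()`. The threshold `min_group_size` and the label `"Other/small group"` are illustrative assumptions, not choices made in this plan.

```r
library(dplyr)
library(tibble)
library(forcats)

# Made-up players; values are for illustration only
toy_players <- tibble(
  gender = c(rep("Male", 6), rep("Female", 4), "Other", "Two-Spirited"),
  played_hours = c(0, 0.1, 0.5, 1, 2, 3, 0.2, 0.4, 1.5, 150, 0.3, 0)
)

min_group_size <- 3  # hypothetical cut-off for "too small" groups

toy_players_clean <- toy_players |>
  mutate(
    # Levels with fewer than min_group_size rows are merged into one level
    gender_lumped = fct_lump_min(gender, min = min_group_size,
                                 other_level = "Other/small group"),
    # Upper fence of the 1.5 * IQR rule for flagging extreme values
    upper_fence = unname(quantile(played_hours, 0.75)) + 1.5 * IQR(played_hours),
    is_extreme = played_hours > upper_fence
  )

count(toy_players_clean, gender_lumped)
filter(toy_players_clean, is_extreme)
```

Flagging rather than deleting keeps the decision about extreme values open until the modelling step.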

### Sessions

In [None]:
session_date <- sessions_data |>
    separate(start_time, into = c("start_date", "start_hour"), sep = " ") |>
    separate(end_time,   into = c("end_date", "end_hour"),   sep = " ")
head(session_date)

In [None]:
session_time <- session_date |>
    select(hashedEmail, start_date, start_hour, end_date, end_hour)
head(session_time)

In [None]:
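duration_time <- session_time |>
    # Keep only sessions that start and end on the same date;
    # sessions that cross midnight are dropped by this filter
    filter(start_date == end_date) |>
    # Extract hours and minutes from the "HH:MM" strings
    mutate(start_hour_num = as.numeric(substr(start_hour, 1, 2)),
        start_min_num  = as.numeric(substr(start_hour, 4, 5)),
        end_hour_num   = as.numeric(substr(end_hour, 1, 2)),
        end_min_num    = as.numeric(substr(end_hour, 4, 5)),
        # Convert both times to minutes since midnight and take the difference
        start_total = start_hour_num * 60 + start_min_num,
        end_total   = end_hour_num * 60 + end_min_num,
        duration_mins = end_total - start_total) |>
    select(start_date, start_hour, end_hour, duration_mins)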
head(duration_time)

duration_mean <- duration_time |>
    summarise(mean_duration = mean(duration_mins, na.rm = TRUE))
duration_mean

In [None]:
players_hours <- players_data |>
    select(hashedEmail, played_hours)
head(players_hours)

In [None]:
merged_data <- full_join(session_time, players_hours, by = "hashedEmail")
head(merged_data)
tail(merged_data)

1. The `hashedEmail` values are unreadable, so the column may not be useful on its own.
2. There are sessions with a duration of 0 minutes and sessions with a very long duration. Such cases might indicate logging errors or players disconnecting immediately. These values could affect the results of the prediction model.
3. `start_time` and `end_time` may need to be converted into another form so that durations are easier to calculate.
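As a rough illustration of issue 3, the text columns could be parsed into date-times with `lubridate`, so that durations come from a single subtraction and sessions crossing midnight are handled as well. The sketch uses a made-up `toy_sessions` table and assumes the strings follow a `dd/mm/yyyy HH:MM` layout; that format is an assumption and the parser (`dmy_hm()` here) would need to match the real data.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Made-up sessions; the "dd/mm/yyyy HH:MM" format is an assumption
toy_sessions <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "18/06/2024 00:05", "25/07/2024 17:34")
)

toy_durations <- toy_sessions |>
  mutate(
    # Parse the text columns into date-time values
    start_dt = dmy_hm(start_time),
    end_dt   = dmy_hm(end_time),
    # Elapsed minutes, valid even when the session crosses midnight
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Mark zero-length sessions so they can be inspected separately
    is_zero_length = duration_mins == 0
  )

toy_durations
```

With this approach the `filter(start_date == end_date)` step would no longer be needed, since the date is part of the parsed value.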

## Data visualization

### Players

In [None]:
players_plot1 <- ggplot(players_data, aes(x = subscribe)) +
    geom_bar(aes(fill = experience), position = "dodge")
players_plot1

In [None]:
players_lessthan50 <- players_data |>
    filter(played_hours < 50) |>
    ggplot(aes(x = played_hours)) +
    geom_histogram() +
    facet_grid(cols = vars(subscribe))
players_lessthan50

In [None]:
players_experience <- players_data |>
    filter(played_hours < 10) |>
    ggplot(aes(x = experience, y = played_hours, color = experience)) +
    geom_point(alpha = 0.5)
players_experience

### Sessions

In [None]:
session_plot1 <- duration_time |>
    ggplot(aes(x = duration_mins)) +
    geom_histogram()
session_plot1

In [None]:
session_date <- 