# Individual Planning Report: Predicting Newsletter Subscription #

#### Narek Wartanian - 84186642 ####

In [None]:
# Libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)

In [None]:
# Reading the data
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

head(players)
summary(players)
head(sessions)
summary(sessions)

## Data description ##
### players.csv ###
* It has 7 features and 196 rows.
* **experience**, **hashedEmail**, **name** and **gender** are characters, **subscribe** is a logical (boolean), and **played_hours** and **age** are doubles.
* Looking at the data shows that the **age** column has two NA values.
* **experience** is stored as a character; an ordinal encoding of some kind (0 = beginner, 1 = amateur, etc.) would have been easier to work with.
* The **name** column isn't useful, since a player's name is very unlikely to correlate with anything. (The hashed e-mail is still needed to find the matching entries in sessions.csv.)
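
The ordinal encoding suggested above could be sketched as follows. This is a minimal example on a toy tibble, not the real data: the level names and their order in `experience_levels` are assumptions and should be checked against `unique(players$experience)` before use.

```r
library(dplyr)
library(tibble)

# Toy stand-in for players.csv; values are illustrative, not taken from the real data
toy_players <- tibble(
  experience = c("Amateur", "Beginner", "Pro", "Beginner"),
  name = c("A", "B", "C", "D")
)

# Assumed ordering of the experience levels (hypothetical)
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_players_tidy <- toy_players %>%
  select(-name) %>%  # drop the column that carries no predictive information
  mutate(
    # ordered factor keeps the labels but records their rank
    experience = factor(experience, levels = experience_levels, ordered = TRUE),
    # integer rank starting at 0, as in "0 = beginner, 1 = amateur"
    experience_rank = as.integer(experience) - 1L
  )

toy_players_tidy
```

An ordered factor keeps the readable labels for plotting, while `experience_rank` gives a numeric version if a model needs one.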

### sessions.csv ###
* It has 5 features and 1535 rows.
* **hashedEmail**, **start_time** and **end_time** are characters, the rest are all doubles.
* Looking at the data shows that the **original_end_time** column has two NA values.
* The **original_start_time** and **original_end_time** variables are scaled in such a way that they are unusable as they stand (the values are practically identical).
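
The NA counts quoted in these descriptions were read off `summary()`. A more direct check is to count missing values per column, sketched here on a toy tibble. The values are illustrative only, and the column names follow the spelling used in the code cells of this notebook.

```r
library(dplyr)
library(tibble)

# Toy stand-in for sessions.csv; values are illustrative, not taken from the real data
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  end_time = c("session 1 end", NA, "session 3 end"),
  original_end_time = c(1.7e12, NA, 1.7e12)
)

# One row holding the number of NA values in every column
na_counts <- toy_sessions %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The same `summarise(across(...))` call can be applied unchanged to `players` and `sessions`.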

## Project statement ##
This section briefly covers the question and methods of the individual planning stage. The broad question researched in this report is: *What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter?* The specific question is: *Can player demographic and in-game behaviour variables in players.csv and aggregated session statistics from sessions.csv predict whether a player subscribes to the newsletter?* The response variable is **subscribe**, which is either TRUE or FALSE. The predictor variables will be **played_hours**, **gender**, **age** and average session length, the last of which will be calculated from sessions.csv.
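
As a rough sketch of what the modelling table implied by this statement could look like, the example below puts the response and the four predictors into one data frame. Everything here is toy data. `toy_session_stats` and `mean_session_mins` are hypothetical names standing in for the aggregated session statistics, and the key column spelling is an assumption.

```r
library(dplyr)
library(tibble)

# Toy stand-ins; values are illustrative, not taken from the real data
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe = c(TRUE, FALSE, TRUE),
  played_hours = c(12.5, 0.3, 4.0),
  gender = c("Male", "Female", "Non-binary"),
  age = c(21, 17, NA)
)
toy_session_stats <- tibble(
  hashedEmail = c("a1", "b2"),
  mean_session_mins = c(45, 8)
)

model_data <- toy_players %>%
  # players without any recorded session get NA for the session statistic
  left_join(toy_session_stats, by = "hashedEmail") %>%
  mutate(
    subscribe = factor(subscribe, levels = c(FALSE, TRUE)),  # classification response
    gender = factor(gender)
  ) %>%
  select(subscribe, played_hours, gender, age, mean_session_mins)

model_data
```

Converting **subscribe** to a factor matters because tidymodels classification models expect a factor response rather than a logical.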

## Exploratory Data Analysis and Visualization ##
In this section we'll explore the data and perform the minimal wrangling needed to turn it into a tidy dataset.

In [None]:
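# --- per-player session statistics
library(lubridate)  # provides ymd_hms(); only attached automatically by tidyverse >= 2.0.0

# ymd_hms() assumes year-first timestamps with seconds; if head(sessions) shows
# another layout, swap in the matching parser (e.g. dmy_hm())
player_agg <- sessions %>%
  mutate(
    start_time = ymd_hms(start_time),
    end_time = ymd_hms(end_time),
    session_length = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) %>%
  group_by(hashedEmail) %>%
  summarise(
    num_sessions = n(),
    total_play_mins = sum(session_length, na.rm = TRUE),
    mean_session_mins = mean(session_length, na.rm = TRUE),
    median_session_mins = median(session_length, na.rm = TRUE)
  )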


# --- join with players (hashedEmail is the key shared by both tables)
df <- players %>% left_join(player_agg, by = "hashedEmail")


# --- means for each quantitative variable in players.csv (rounded to 2 decimals)
means_table <- players %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2)))
means_table


# --- example plots
# 1) Distribution of total play minutes
# (players with no recorded sessions have NA here and are dropped with a warning)
ggplot(df, aes(x = total_play_mins)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of total play time (mins)", x = "Total play minutes", y = "Count")


# 2) Newsletter subscription by experience level
ggplot(df, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion subscribed by experience level",
       x = "Experience", y = "Proportion", fill = "Subscribed")


# 3) Scatter: num_sessions vs. total_play_mins colored by subscription
ggplot(df, aes(x = num_sessions, y = total_play_mins, color = subscribe)) +
  geom_point(alpha = 0.6) +
  labs(title = "Sessions vs playtime by subscription",
       x = "Number of sessions", y = "Total play minutes", color = "Subscribed")