# Predicting Newsletter Subscription in Minecraft Players

## Introduction:

### Background
The study of player engagement and behavior in games such as Minecraft is a growing area within data science. In this project, I will be working with a research group at the University of British Columbia, led by Dr. Frank Wood, that is investigating the behaviours and activities of players on a custom Minecraft game server. The group collected each player's activity over time, which allows them to study player behavior, preferences, and play patterns.

### Question
The aim of this project is to answer the following question:

"Can characteristics of a player (e.g., behavior, play frequency, time-of-day usage) predict whether they will subscribe to the game-related newsletter?"

This question is interesting because it can help identify the types of players who are most likely to show sustained engagement and interest in the project. Identifying players with long-term interest will, in turn, help the research group prioritize players for future studies.

### Data Description
To answer this question, this project will use two datasets:

* `players.csv`: Contains one row per player (196 rows, 7 columns) with demographic and behavioral features: experience level, newsletter subscription status, hashed email, total played hours, name, gender, and age.
* `sessions.csv`: Contains one row per game session (1535 rows, 5 columns) with the player's hashed email and the start and end timestamps of the session.

In this project, these datasets will be wrangled and used to construct meaningful visuals and relationships related to player behavior and characteristics, in order to identify which features are most predictive of newsletter subscription.



## Methods and Results:

### Reading and Loading the Data
First, let's load the `tidyverse` library, which bundles the data-wrangling and plotting functions used below. Then, let's read both datasets with the function `read_csv` and assign each one to an object. Since the datasets are too long to print in full, we can use the function `head` to look at the first few rows and see what we are working with.

In [26]:
library(tidyverse)
library(ggplot2)

In [27]:
players_data <- read_csv("data/players.csv")
head(players_data)
sessions_data <- read_csv("data/sessions.csv")
head(sessions_data)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| hashedEmail | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 23/06/2024 15:08 | 23/06/2024 17:10 | 1719160000000.0 | 1719160000000.0 |
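The preview shows that `start_time` and `end_time` were read as character strings in day/month/year format, so they cannot be used directly to compute session length or time-of-day features. The following is a minimal sketch of one way to parse them, assuming the `lubridate` package is available. The object `sessions_example` and the columns `session_minutes` and `start_hour` are hypothetical names introduced here for illustration; the two rows are copied from the preview above.

```r
library(tidyverse)
library(lubridate)

# Two rows copied from the sessions preview, for illustration only
sessions_example <- tibble(
  start_time = c("30/06/2024 18:12", "25/07/2024 03:22"),
  end_time   = c("30/06/2024 18:24", "25/07/2024 03:58")
)

sessions_parsed <- sessions_example |>
  mutate(
    # dmy_hm() parses "day/month/year hour:minute" strings into date-times
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes (hypothetical derived column)
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Hour of day at which the session started (hypothetical derived column)
    start_hour = hour(start_time)
  )

sessions_parsed
```

Because `dmy_hm()` defaults to UTC, the `start_hour` values would only reflect local play times if the recorded timestamps are already in local time, which the data does not state.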


### Wrangling and Cleaning

Now that we have read our data, it is time to clean and wrangle it to make it easier to analyze. First, I joined the two datasets on `hashedEmail` so I can work with a single combined table. Because this is an inner join, players with no recorded sessions are dropped. Then, I grouped the data by player, carrying along characteristics like gender, age, experience level, subscription status, and played hours. Lastly, I used the `n()` function inside `summarise` to count how many sessions each player had, which gives a summary table with one row per player, and removed the identifying columns `hashedEmail` and `name`. This made the data much simpler.

In [25]:
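# Attach player attributes to every session; players with no sessions are dropped
combined_data <- inner_join(players_data, sessions_data, by = "hashedEmail")

# Count sessions per player (hashedEmail identifies a player)
combined_summary <- combined_data |>
  group_by(name, gender, Age, hashedEmail, experience, subscribe, played_hours) |>
  summarise(session_count = n(), .groups = "drop")

# Drop identifying columns; keep every player row for the analysis
final_data <- combined_summary |>
    select(!hashedEmail & !name)

# Preview only the first rows
head(final_data)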

| gender | Age | experience | subscribe | played_hours | session_count |
|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<chr>` | `<lgl>` | `<dbl>` | `<int>` |
| Prefer not to say | 17.0 | Beginner | True | 0.2 | 1 |
| Non-binary | 17.0 | Amateur | True | 1.2 | 2 |
| Other | NA | Pro | True | 0.2 | 1 |
| Prefer not to say | 25.0 | Veteran | False | 1.4 | 6 |
| Non-binary | 20.0 | Regular | True | 218.1 | 95 |
| Male | 17.0 | Amateur | True | 53.9 | 130 |
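An inner join keeps only players who appear in both tables, so anyone without a recorded session is missing from the summary. As a hedged alternative (not the approach used in this analysis), the sketch below counts sessions first and then left-joins the counts onto the players, so players with no sessions are kept with a count of zero. The tibbles `players_toy` and `sessions_toy` and their values are made up purely for illustration.

```r
library(tidyverse)

# Toy data: player "c" has no sessions (hypothetical values)
players_toy <- tibble(
  hashedEmail = c("a", "b", "c"),
  subscribe = c(TRUE, FALSE, TRUE)
)
sessions_toy <- tibble(hashedEmail = c("a", "a", "b"))

# One row per player who has at least one session
session_counts <- sessions_toy |>
  count(hashedEmail, name = "session_count")

# left_join() keeps every player; missing counts become 0
players_with_counts <- players_toy |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(session_count = replace_na(session_count, 0L))

players_with_counts
```

Whether zero-session players should be included depends on the question being asked: keeping them avoids restricting the classifier to players who joined the server at least once.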


### Planned Visualizations

This data is now much cleaner to work with and ready for further analysis. With this cleaned dataset, I can create visualizations using the `ggplot` function from the `ggplot2` library to explore relationships between variables and see which ones would make good predictors for classifying newsletter subscription status. Here are the plots I will be making:
* Gender VS Played Hours (With Subscription Status)
* Gender VS Session Count (With Subscription Status)
* Age VS Played Hours (With Subscription Status)
* Age VS Session Count (With Subscription Status)
* Experience VS Played Hours (With Subscription Status)
* Experience VS Session Count (With Subscription Status)
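As a rough sketch of what one of these plots could look like, the code below draws Age against Played Hours and colours the points by subscription status. It uses a small hypothetical tibble, `plot_example`, built from rows of the `final_data` preview rather than the full dataset. The log-scaled y-axis is an assumption made here because played hours are heavily skewed; it would not be able to show players with 0 played hours.

```r
library(tidyverse)

# A few rows taken from the final_data preview, for illustration only
plot_example <- tibble(
  Age = c(17, 17, 25, 20, 17),
  played_hours = c(0.2, 1.2, 1.4, 218.1, 53.9),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

# Scatter plot of age against played hours, coloured by subscription status
age_hours_plot <- ggplot(plot_example,
                         aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_y_log10() +  # assumption: log scale to spread out skewed hours
  labs(x = "Age (years)",
       y = "Played hours (log scale)",
       color = "Subscribed")

age_hours_plot
```

The same pattern should carry over to the other plots by swapping the `x` and `y` aesthetics, for example using `session_count` on the y-axis or a categorical variable such as `experience` on the x-axis.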