# Individual Planning Report

## Introduction

To answer the question "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?", I analyze the "players_data" dataset and examine how subscription status relates to factors including gender, age, and experience. The dataset was collected by a research group in Computer Science at UBC, led by Frank Wood, and describes players of Minecraft.

In [None]:
library(tidyverse)

In [None]:
players_data <- read_csv("players_data.csv")

## Data Description and Exploratory Data Analysis and Visualization

In [None]:
players_data_statistics <- players_data |>
    summary()
players_data_statistics

The dataset contains 7 variables, with "subscribe" as the response variable. The goal of the analysis is to use the remaining variables, other than "name" and "hashedEmail", to predict the value of "subscribe". The explanatory variables are therefore "experience", "played_hours", "gender", and "age".

The "hashedEmail" variable contains each player's email address in hashed form. A hash cannot be decoded, so this variable is not relevant to the research question.

The "name" variable contains each player's name and is not relevant to the research question.

The "subscribe" variable describes whether each player is subscribed to a game-related newsletter.

The "experience" variable describes the experience level of each player, ranging from "Beginner" to "Veteran": beginners are new to the game and veterans are the most experienced players.

The "played_hours" variable describes the total played hours of each player.

The "gender" variable describes the gender of each player.

The "age" variable describes the age of each player.
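Since "name" and "hashedEmail" are excluded from the analysis, the modelling data can be reduced to the response and the four explanatory variables. The sketch below illustrates one way to do this. It uses a small made-up table (`toy_players`, a hypothetical stand-in whose values are not taken from the real dataset) and converts "subscribe" to a factor, since classification treats the response as categorical.

```r
library(tidyverse)

# Made-up rows for illustration only; column names follow the descriptions above.
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  name = c("P1", "P2", "P3"),
  subscribe = c(TRUE, FALSE, TRUE),
  experience = c("Beginner", "Veteran", "Beginner"),
  played_hours = c(0.5, 12, 0),
  gender = c("Female", "Male", "Female"),
  age = c(17, 21, 19)
)

players_model_data <- toy_players |>
  select(-hashedEmail, -name) |>            # drop the two identifier columns
  mutate(subscribe = as.factor(subscribe))  # treat the response as categorical

players_model_data
```

The same two steps could be applied to `players_data` in place of `toy_players`.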

Note: played_hours is strongly right-skewed, since its mean is much greater than its median; age is also right-skewed, but slightly less so.
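The mean-versus-median comparison behind this note can be made explicit. The sketch below is a rough heuristic rather than a formal skewness test: a mean well above the median suggests a long right tail. The values in `toy_hours` are made up for illustration and are not taken from the real dataset.

```r
library(tidyverse)

# Made-up values for illustration only (NOT taken from players_data).
toy_hours <- tibble(played_hours = c(0, 0, 0.1, 0.5, 1, 2, 30))

skew_check <- toy_hours |>
  summarise(
    mean_hours = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  ) |>
  # Heuristic flag: a mean above the median hints at right skew
  mutate(right_skewed = mean_hours > median_hours)

skew_check
```

Replacing `toy_hours` with `players_data` would run the same check on the real played_hours column, and the same pattern works for age.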

In [None]:
box_plot <- players_data |>
    ggplot(aes(x = played_hours, y = subscribe)) +
    geom_boxplot() +
    labs(title = "Played Hours by Subscription Status",
         x = "Played Hours",
         y = "Subscription Status") +
    theme_minimal() +
    theme(plot.title = element_text(size = 15, hjust = 0.5))

box_plot

In [None]:
hist_plot <- players_data |>
    ggplot(aes(x = played_hours)) +
    geom_histogram(bins = 50) +
    labs(title = "Distribution of Total Played Hours of Each Player",
         x = "Played Hours",
         y = "Number of Players")

hist_plot

At a glance, total hours played is so heavily right-skewed that it may not be a good predictor of subscription status.

## Questions

Broad Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Individual Question 1: Can the age of a player predict whether or not they will subscribe to a game-related newsletter?

Individual Question 2: Can the gender of a player predict whether or not they will subscribe to a game-related newsletter?

Individual Question 3: Can the experience level of a player predict whether or not they will subscribe to a game-related newsletter?

## Methods and Plan

I will use a classification model to answer Questions 1 to 3. A classification model uses specified predictors to assign new observations to a category of the response variable. For each question, I will build a model with the corresponding predictor (age, gender, or experience) and evaluate it with a confusion matrix, which compares how often the model's predictions are correct with how often they are wrong. The resulting accuracy will indicate whether each variable is a good predictor of the subscription status of Minecraft players.
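The workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the report does not commit to a specific classifier, so logistic regression via base R's `glm()` is used here as a stand-in; `toy_players` is simulated data in which "subscribe" is drawn independently of "age"; and `evaluate_predictor()` is a hypothetical helper name, not part of any package.

```r
# Simulated stand-in data (NOT the real players_data); subscribe is random
# and unrelated to age, so no real relationship is implied.
set.seed(2024)
n <- 200
toy_players <- data.frame(
  age = sample(10:50, n, replace = TRUE),
  subscribe = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))
)

# Hypothetical helper: fit a one-predictor classifier on a training split
# and evaluate it on the held-out rows.
evaluate_predictor <- function(data, predictor, train_prop = 0.75) {
  train_rows <- sample(nrow(data), size = floor(train_prop * nrow(data)))
  train <- data[train_rows, ]
  test <- data[-train_rows, ]

  # Logistic regression of subscribe on the chosen predictor (a stand-in model)
  model <- glm(reformulate(predictor, response = "subscribe"),
               data = train, family = binomial)

  # Predicted probability of subscribing, thresholded at 0.5
  prob <- predict(model, newdata = test, type = "response")
  predicted <- factor(prob > 0.5, levels = c(FALSE, TRUE))
  actual <- factor(test$subscribe, levels = c(FALSE, TRUE))

  # Confusion matrix: rows are predictions, columns are true values
  confusion <- table(Predicted = predicted, Actual = actual)
  accuracy <- sum(diag(confusion)) / sum(confusion)
  list(confusion = confusion, accuracy = accuracy)
}

age_result <- evaluate_predictor(toy_players, "age")
age_result$confusion
age_result$accuracy
```

When reading the accuracy, it helps to compare it with the share of the majority class in the test rows, since a model that always predicts the more common outcome can look accurate without using the predictor at all. For categorical predictors such as gender or experience, the split would also need to ensure that every level seen in the test rows appears in the training rows.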

## GitHub Repository

https://github.com/Jessica9521/Individual-Planning-Report/blob/main/Individual%20Planning%20Report.ipynb