# Predicting Subscription to a Video Game Newsletter Based on Age and Time Played on the Game

### By: Lila, Lauren, and Khush

## Introduction

Researchers have been collecting data on how gamers play video games through a 'Minecraft' server set up by the data scientists. This data can be used to explore a number of different questions.

The question we aim to answer through this report is *'Can a player's age and time played predict whether they subscribe to a newsletter, using the data in players.csv?'*

The dataset titled "players.csv" will be used to answer this question. It contains information about each player observed. The columns are described in the table below.

| Column Name           | Data Type        | Description                                                       |
|-----------------------|------------------|-------------------------------------------------------------------|
| experience | Categorical (string) | The player's experience (Beginner, Amateur, Veteran, Pro) |
| subscribe | Boolean | Whether or not the player has subscribed to a game-related newsletter (true or false) |
| hashedEmail | String | The hashed email of the player |
| played_hours | Numerical | The hours they have spent playing |
| name | String | The name of the player |
| gender | Categorical | The gender of the player |
| Age | Numerical | The age of the player |

This dataset has 7 columns and 196 rows. 

For this report, we focus on just the age of the player, the time they have spent playing, and whether or not they have subscribed to a game-related newsletter.

## Methods & Results

This section covers loading, wrangling, summarizing, and visualizing the data.


In this analysis, we investigated whether a player's age and the number of hours they play the game can predict their subscription to a game-related newsletter using K-nearest neighbours (K-NN) classification.

First, we cleaned the dataset by dropping the columns "name", "hashedEmail", "gender", and "experience", as they are not relevant to the question being investigated. We also dropped rows containing "NA", because missing values would cause errors in the K-NN classification.
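The cleaning step described above can be sketched as follows. This is a minimal illustration, not the report's actual code: `players_toy` is a made-up stand-in for "players.csv" with the same column names, and its values are invented purely so that the example runs on its own.

```r
library(tidyverse)

# Toy stand-in for players.csv (values are invented for illustration only)
players_toy <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Beginner"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  name         = c("P1", "P2", "P3", "P4"),
  gender       = c("Male", "Female", "Male", "Female"),
  Age          = c(9, 17, NA, 21)
)

players_clean <- players_toy |>
  # Drop the columns that are not used to answer the question
  select(-name, -hashedEmail, -gender, -experience) |>
  # Drop any row that still contains a missing value
  drop_na()
```

With the real data, the toy tibble would presumably be replaced by a call such as `read_csv("players.csv")`; the file location is an assumption here.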

We then relabelled the subscription variable as "Yes"/"No" instead of "TRUE"/"FALSE", as this is more easily understood and reduces ambiguity. We also scaled both predictor variables so that the K-NN distance metric weights them equally.
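One way the relabelling and scaling could look in tidymodels is sketched below. The toy values and the object names (`players_labelled`, `scale_recipe`, `players_scaled`) are assumptions for illustration; centering in addition to scaling is also an assumption, since the text only mentions scaling.

```r
library(tidyverse)
library(tidymodels)

# Toy stand-in for the cleaned data (values are invented for illustration only)
players_toy <- tibble(
  subscribe    = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0.0, 3.8, 0.1, 1.5),
  Age          = c(9, 17, 21, 22, 19)
)

# Relabel TRUE/FALSE as Yes/No and store the result as a factor,
# which classification models in tidymodels expect for the outcome
players_labelled <- players_toy |>
  mutate(subscribe = factor(if_else(subscribe, "Yes", "No"),
                            levels = c("No", "Yes")))

# Put both predictors on a common scale (mean 0, standard deviation 1)
scale_recipe <- recipe(subscribe ~ Age + played_hours, data = players_labelled) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the means and standard deviations; bake() applies them
players_scaled <- scale_recipe |>
  prep() |>
  bake(new_data = NULL)
```

In a full analysis the recipe would normally be estimated on the training data only, so that the test set does not influence the scaling.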

We then made exploratory summaries and visualizations to understand patterns in the data before modeling and classification. This gave us a rough idea of what outcome to expect.
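As a rough sketch of what such a summary and visualization might look like, the example below computes per-class counts and means and draws a scatter plot of the two predictors coloured by subscription status. The data are invented toy values, and the choice of summary statistics and plot type is an assumption rather than the report's actual output.

```r
library(tidyverse)

# Toy stand-in for the cleaned, relabelled data (invented values)
players_toy <- tibble(
  subscribe    = factor(c("Yes", "No", "Yes", "No", "Yes", "Yes")),
  played_hours = c(30.3, 0.0, 3.8, 0.1, 1.5, 0.7),
  Age          = c(9, 17, 21, 22, 19, 17)
)

# Number of players and mean of each predictor within each class
players_summary <- players_toy |>
  group_by(subscribe) |>
  summarise(
    n          = n(),
    mean_age   = mean(Age),
    mean_hours = mean(played_hours)
  )

# Scatter plot of the two predictors, coloured by subscription status
players_scatter <- ggplot(players_toy,
                          aes(x = Age, y = played_hours, colour = subscribe)) +
  geom_point() +
  labs(x = "Age (years)", y = "Time played (hours)", colour = "Subscribed")
```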

Next, we split the data into training (75%) and testing (25%) sets. Using the training set, we determined the most effective *K* value by fitting the classifier over a range of *K* values, recording the accuracy of each, and plotting accuracy against *K*. We chose the *K* with the highest accuracy and then used it to classify the observations in the testing set.
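The split and the search over *K* could be sketched along these lines. Several details are assumptions not stated in the text: the data are simulated, the split is stratified by `subscribe`, accuracy is estimated with 5-fold cross-validation on the training set, the candidate *K* values are the odd numbers from 1 to 15, and the `kknn` engine is assumed to be installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)  # arbitrary seed so that the sketch is reproducible

# Simulated stand-in for the cleaned players data (not the real values)
players_sim <- tibble(
  Age          = round(runif(120, 10, 50)),
  played_hours = rexp(120, rate = 0.2),
  subscribe    = factor(sample(c("Yes", "No"), 120, replace = TRUE,
                               prob = c(0.7, 0.3)))
)

# 75% training / 25% testing split
players_split <- initial_split(players_sim, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Scale and center the predictors using the training data only
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN specification with the number of neighbours left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation folds and candidate K values (both are assumptions)
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Estimated accuracy for each candidate K
k_accuracies <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest estimated accuracy
best_k <- k_accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Accuracy versus K plot
k_plot <- ggplot(k_accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbours (K)", y = "Estimated accuracy")
```

Because the simulated labels are random, the accuracies here carry no meaning; the sketch only shows the mechanics of choosing *K*.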

Finally, we analysed performance by generating a confusion matrix, from which we evaluated the model's accuracy and how it classified each class. We summarized the confusion matrix as a heatmap showing how well the model differentiated subscribers from non-subscribers. These results were then used to answer the original question.
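A possible shape for the evaluation step is shown below, again on simulated data. The value `neighbors = 5` is an arbitrary placeholder, not the *K* selected in this report, and the `kknn` engine is assumed to be installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2)  # arbitrary seed so that the sketch is reproducible

# Simulated stand-in for the cleaned players data (not the real values)
players_sim <- tibble(
  Age          = round(runif(120, 10, 50)),
  played_hours = rexp(120, rate = 0.2),
  subscribe    = factor(sample(c("Yes", "No"), 120, replace = TRUE,
                               prob = c(0.7, 0.3)))
)

players_split <- initial_split(players_sim, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = 5 is a placeholder for the K chosen during tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit on the training set, then predict the held-out testing set
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

players_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

# Overall accuracy on the testing set
test_accuracy <- players_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

# Confusion matrix and its heatmap
confusion <- players_predictions |>
  conf_mat(truth = subscribe, estimate = .pred_class)

confusion_heatmap <- autoplot(confusion, type = "heatmap")
```

The diagonal cells of the heatmap count correct predictions and the off-diagonal cells count the two kinds of misclassification.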

In [1]:
library(tidyverse)
library(tidymodels)
       

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

[32m✔[39m [34mbroom       [39m 1.0.6     [32m✔[39m [34mrsample     [39