**DSCI 100: Project Final Report**
===
---

Introduction:
---

A research group led by Frank Wood has set up a Minecraft server to examine how people play video games. The data collected involves player characteristics and play times. Here, we ask a predictive question based on this data, to explore potential trends and relationships between players with similar traits.

Specifically, we ask *"<u>Can player age and hours played predict whether a player will subscribe to the video game newsletter?</u>"*

To do this, we use the *players.csv* dataset, which contains seven variables and 196 observations. The variables are:
* experience (chr): Player's level of in-game experience.
* subscribe (lgl): Whether player is subscribed to the newsletter or not.    
* hashedEmail (chr): Player's hashed email.  
* played_hours (dbl): Player's number of hours played.  
* name (chr): Player's name.  
* gender (chr): Player's gender.  
* Age (dbl): Player's age.

Methods & Results:
---

The data was read from a .csv file hosted in a GitHub repository, using its URL. To prepare the data for answering the predictive question, the two predictors, *Age* and *played_hours*, and the response variable, *subscribe*, were selected, and the response was converted to a factor *(fct)* so that it could be used in a classification setting.

The specific question was answered with K-NN classification, as the response variable is not numeric but rather composed of two distinct values, *TRUE* and *FALSE*, meaning it is categorical.

The data was split into 75% training data and 25% testing data, so that the model could be trained on a large portion of the data and still be evaluated on a considerable held-out portion. 5-fold cross-validation was also used, which gives a more robust estimate of accuracy than a single validation split. To improve the model, K was tuned by choosing the value that maximised the cross-validated accuracy.

With this value of K, chosen through cross-validation, the K-NN classification was performed. The quality of the model was then assessed by its metrics on the test data, and its effectiveness was visualised with plots of accuracy.

The variables *Age* and *played_hours* were selected because they are numeric, which K-NN needs in order to compute distances between players, and because it is interesting to explore whether these characteristics of a player relate to whether or not they subscribe to a game-related newsletter. This raises questions such as whether players of different ages tend to subscribe to newsletters about games they like, or how much a person might play before becoming involved enough in a gaming community to subscribe.

---

Loading the tidyverse library for wrangling and reading in the *players.csv* dataset.

In [17]:
library(tidyverse)
players_url <- "https://raw.githubusercontent.com/Rafee1012/dsci-100-group-project-10/refs/heads/main/players%20(3).csv"
players <- read_csv(players_url)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Wrangling the data into the format needed for answering the predictive question.

In [18]:
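# Keep only the two predictors and the response,
# then convert the logical response to a factor for classification.
players <- players |>
    select(Age, played_hours, subscribe) |>
    mutate(subscribe = as_factor(subscribe))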

Summarising data relevant for planned analysis. (EXPLAIN THE PREDICTORS OF CHOICE AND THE RESPONSE VARIABLE + PERFORM SUMMARY STATISTICS).
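One possible way to produce these summary statistics is sketched below. The helper name `summarise_by_subscribe` and the choice of statistics (counts, means, medians, and a count of missing ages) are assumptions made for illustration, not part of the original analysis.

```r
library(tidyverse)

# Hypothetical helper: summarise the two predictors within each class
# of the response, so class balance and typical values can be compared.
summarise_by_subscribe <- function(data) {
  data |>
    group_by(subscribe) |>
    summarise(
      n = n(),                                         # players per class
      mean_age = mean(Age, na.rm = TRUE),              # ignore missing ages
      mean_played_hours = mean(played_hours, na.rm = TRUE),
      median_played_hours = median(played_hours, na.rm = TRUE),
      missing_age = sum(is.na(Age))                    # rows with no age recorded
    )
}

# Usage in this notebook, assuming the wrangled `players` data frame:
# summarise_by_subscribe(players)
```

Reporting the number of players per class alongside the means is useful here because an imbalanced response affects how the accuracy of a classifier should be interpreted.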

Visualising data relevant for planned analysis.
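A minimal sketch of an exploratory plot follows, assuming a scatter plot of the two predictors coloured by the response is the visualisation of interest. The helper name `plot_players`, the figure number, and the label wording are placeholders chosen for illustration.

```r
library(tidyverse)

# Hypothetical helper: scatter plot of the two predictors,
# coloured by whether the player subscribed.
plot_players <- function(data) {
  data |>
    filter(!is.na(Age)) |>   # drop rows that cannot be placed on the x-axis
    ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
    geom_point(alpha = 0.6) +
    labs(
      title = "Figure 1: Hours played versus age, by subscription status",
      x = "Age (years)",
      y = "Hours played",
      colour = "Subscribed to newsletter"
    )
}

# Usage in this notebook, assuming the wrangled `players` data frame:
# plot_players(players)
```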

Performing K-NN classification to answer the predictive question. (SPLIT DATA, PERFORM TUNING AND VISUALISE BEST K VALUE, 5-FOLD CROSS VALIDATION, EVALUATING USE METRICS AND VISUALISE EVALUATION).
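The steps listed above could be sketched with tidymodels as follows. This is one possible implementation under stated assumptions, not the final analysis: the helper name `tune_knn`, the seed, the candidate range of K, the stratified split, dropping rows with missing values, and standardising the predictors are all choices made for this sketch. It also assumes the `kknn` package is installed as the K-NN engine.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: split, tune K by cross-validation, refit, evaluate.
tune_knn <- function(data, k_values = 1:10, folds = 5, seed = 1) {
  set.seed(seed)                      # make the split and folds reproducible
  data <- drop_na(data)               # assumption: discard incomplete rows

  # 75% training / 25% testing, keeping class proportions similar
  data_split <- initial_split(data, prop = 0.75, strata = subscribe)
  train <- training(data_split)
  test <- testing(data_split)

  # Assumption: standardise predictors so both contribute to the distance
  knn_recipe <- recipe(subscribe ~ Age + played_hours, data = train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                    neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  # 5-fold cross-validation on the training set only
  cv_folds <- vfold_cv(train, v = folds, strata = subscribe)

  cv_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = cv_folds,
              grid = tibble(neighbors = k_values),
              metrics = metric_set(accuracy)) |>
    collect_metrics()

  # K with the highest mean cross-validated accuracy (first one on ties)
  best_k <- cv_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

  final_spec <- nearest_neighbor(weight_func = "rectangular",
                                 neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

  final_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(final_spec) |>
    fit(data = train)

  # Evaluate once on the held-out test set
  test_metrics <- predict(final_fit, test) |>
    bind_cols(test) |>
    metrics(truth = subscribe, estimate = .pred_class)

  list(cv_results = cv_results, best_k = best_k, test_metrics = test_metrics)
}

# Usage in this notebook, assuming the wrangled `players` data frame:
# knn_results <- tune_knn(players)
# knn_results$best_k
# knn_results$test_metrics
```

Keeping the test set out of both the recipe and the tuning step means the reported test accuracy reflects data the model has not seen.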

Visualising data analysis of K-NN classification.
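A plot of accuracy against K might be drawn as sketched below. The sketch assumes a tibble of cross-validation results with a `neighbors` column and a `mean` accuracy column, which is the shape `collect_metrics()` returns after tuning. The helper name `plot_accuracy_vs_k` and the figure number are placeholders.

```r
library(tidyverse)

# Hypothetical helper: line plot of mean cross-validated accuracy per K.
# `cv_results` is assumed to have columns `neighbors` and `mean`.
plot_accuracy_vs_k <- function(cv_results) {
  cv_results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(
      title = "Figure 2: Cross-validated accuracy for each value of K",
      x = "Number of neighbours (K)",
      y = "Mean cross-validated accuracy"
    )
}

# Usage, assuming `cv_results` holds the accuracy rows from tuning:
# plot_accuracy_vs_k(cv_results)
```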

Discussion:
---
* summarize what you found
* discuss whether this is what you expected to find?
* discuss what impact could such findings have?
* discuss what future questions could this lead to?

References:
---
* You may include references if necessary, as long as they all have a consistent citation style.