# DSCI 100 Group 37: Final Project

## Description and usage of the game data:

In this project, we work with game data from PLAIcraft, a project run by a group in the Department of Computer Science at UBC (University of British Columbia). The project covers analysis and modelling of the data, using visualizations to explore relationships between the variables. The data consists of two files: Project_Planning_Players.csv and Sessions. This project only requires the Players data.

## Predicting Experience Level Using Age and Hours Played

### Introduction



The player dataset contains one row per player. There are 196 observations, one for each player in the dataset, and 7 variables.

|Variable|Type|Description|
|--------|----|-----------|
|experience|Character|Experience level of a player|
|subscribe|Character|Whether the player is subscribed to the newsletter|
|hashedEmail|Character|Player's unique hashed email|
|played_hours|Double|Number of hours played|
|name|Character|Name of player|
|gender|Character|Gender of player|
|Age|Double|Age of player|

## Loading the data into Jupyter

Only Project_Planning_Players.csv is loaded below, as it is the only file we need to complete the project.

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
player_url <- "https://raw.githubusercontent.com/tiannawong/dsci100-individual-project-/refs/heads/main/players.csv"

player_data <- read_csv(player_url)
head(player_data)
tail(player_data)

The first and last rows of *Project_Planning_Players.csv* are shown above.

## Wrangling

Below, we select the columns we need. We want to predict a player's experience level from their age and playing time, so we keep only experience, Age, and played_hours.

In [None]:
select_player_data <- player_data |>
    mutate(experience = as_factor(experience)) |>
    select(experience, Age, played_hours)
head(select_player_data)
tail(select_player_data)
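
As a small illustration of what this wrangling step does, the sketch below applies the same `mutate()` and `select()` calls to a hypothetical toy table (`toy_players` and its values are made up for illustration, not taken from the project data). It shows that `as_factor()` creates factor levels in order of first appearance and that only the three selected columns remain.

```r
library(tidyverse)

# Hypothetical toy data, not the real players file
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Veteran", "Amateur"),
  Age = c(21, 17, 30, 19),
  played_hours = c(1.5, 0.2, 12.0, 0.0),
  name = c("A", "B", "C", "D")
)

# Same steps as above: convert the label to a factor, keep three columns
toy_selected <- toy_players |>
  mutate(experience = as_factor(experience)) |>
  select(experience, Age, played_hours)

levels(toy_selected$experience)
colnames(toy_selected)
```

A classifier needs the label column to be a factor, which is why the conversion happens before any modelling.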

## Methods and Results

In this project, we use K-nearest neighbors (k-nn) classification to predict a new user's experience level. Before modelling and training, we make some simple visualizations to get a better understanding of the data. The graphs below show different aspects of the data.
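
To give a rough idea of how k-nn classification works, the sketch below does the calculation by hand on a hypothetical toy table: it standardizes the two predictors, measures the straight-line distance from a new player to every existing player, and takes a majority vote among the 3 closest. The data, the new player, and the choice of 3 neighbors are all assumptions made for illustration; this is not the model fitted in this project.

```r
library(tidyverse)

# Hypothetical toy players and a hypothetical new player
toy_players <- tibble(
  experience = c("Amateur", "Amateur", "Amateur", "Veteran", "Veteran", "Pro"),
  Age = c(17, 18, 19, 30, 32, 25),
  played_hours = c(0.5, 1.0, 0.2, 10, 12, 30)
)
new_player <- tibble(Age = 18, played_hours = 0.7)

# Standardize both predictors, then compute the distance to the new player
nearest <- toy_players |>
  mutate(
    age_z = (Age - mean(Age)) / sd(Age),
    hours_z = (played_hours - mean(played_hours)) / sd(played_hours),
    new_age_z = (new_player$Age - mean(Age)) / sd(Age),
    new_hours_z = (new_player$played_hours - mean(played_hours)) / sd(played_hours),
    distance = sqrt((age_z - new_age_z)^2 + (hours_z - new_hours_z)^2)
  ) |>
  slice_min(distance, n = 3)

# Majority vote among the 3 nearest neighbors
predicted <- nearest |>
  count(experience, sort = TRUE) |>
  slice(1) |>
  pull(experience)
predicted
```

Standardizing matters here because Age and played_hours are on different scales, and the larger-scaled variable would otherwise dominate the distance.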

In [None]:
select_player_data_bar <- select_player_data |>
    ggplot(aes(x = experience)) +
    geom_bar(stat = "count") +
    labs(x = "Experience level", y = "Number of players", title = "Fig 1: The distribution of experience levels")
select_player_data_bar

Explanation of graph: The bar graph shows the number of players per experience level. There are more amateur players than players of any other level, followed by veterans.
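
The counts behind a bar graph like this can also be checked as a table. The sketch below uses `count()` on a hypothetical toy column (the values are made up, not the project's counts) to list each experience level with its number of players, largest first.

```r
library(tidyverse)

# Hypothetical toy labels, not the real distribution
toy_experience <- tibble(
  experience = c("Amateur", "Veteran", "Amateur", "Pro", "Amateur", "Veteran")
)

# One row per level, sorted from most to fewest players
experience_counts <- toy_experience |>
  count(experience, sort = TRUE)
experience_counts
```

Running the same `count()` call on `select_player_data` would give the exact bar heights for Fig 1.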

In [None]:
select_player_data_plot <- select_player_data |>
    mutate(played_mins = (played_hours * 60)) |>
    ggplot(aes(x = Age, y = played_mins)) +
    ylim(0, 360) +
    geom_point(aes(color = experience)) +
    labs(x = "Age (years)", y = "Playing time (minutes)", color = "Experience level", title = "Fig 2: The relationship between age and playing time")
select_player_data_plot

Explanation of graph: The scatter plot shows playing time in minutes against age, coloured by experience level. To keep the graph readable, we limited the y-axis to 6 hours (360 minutes), so players with more than 6 hours of playing time are not shown. From the graph, we cannot see a clear pattern between age and playing time.
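
Because `ylim()` drops points outside its range rather than just zooming, it can be useful to know how many players are left out of the plot. The sketch below counts them on a hypothetical toy table (the ages and hours are made up for illustration).

```r
library(tidyverse)

# Hypothetical toy data, not the real players file
toy_players <- tibble(
  Age = c(17, 21, 25, 30),
  played_hours = c(0.5, 2, 7.5, 48)
)

# Players whose playing time exceeds the 360-minute axis limit
hidden_players <- toy_players |>
  mutate(played_mins = played_hours * 60) |>
  filter(played_mins > 360)

nrow(hidden_players)
```

Applying the same filter to `select_player_data` would show how many real observations fall above the limit.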

## Training and modeling data

In this section, we train and model the player data so that we can predict which experience level fits a new data point.
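
As a minimal sketch of what a training/testing split looks like in tidymodels, the example below splits a hypothetical toy table with `initial_split()`. The toy data, the seed, the 75% proportion, and stratifying by experience are all assumptions chosen for illustration, not necessarily the settings used in this project.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)  # hypothetical seed so the split is reproducible

# Hypothetical toy data with two balanced experience levels
toy_players <- tibble(
  experience = factor(rep(c("Amateur", "Veteran"), each = 20)),
  Age = rep(17:36, times = 2),
  played_hours = seq(0, 39, by = 1)
)

# Assumed 75/25 split, keeping the class balance similar in both parts
toy_split <- initial_split(toy_players, prop = 0.75, strata = experience)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

The testing rows are set aside and only used at the end to estimate how well the classifier predicts experience for players it has not seen.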

In [None]:
## splitting the data into training and testing data

player_split <- select_player_data |>
    