**Name** : Adrian Joseph Santoso

**Student Number** : 10427052

**Section** : 006

## **1. Data Description**

The dataset comprises two files: **players.csv** and **sessions.csv**, which contain information on PLAICRAFT players and their game sessions. This data helps analyze player activity, experience levels, and engagement trends. However, for this particular project, I will be using **players.csv** exclusively.  

#### **Players Dataset (`players.csv`)**  
This dataset provides player-specific details, including experience level, subscription status, and demographic attributes.  

- **`experience` (String)**: Categorized into **Pro, Veteran, Amateur, or Regular**, reflecting a player's skill level.  
- **`subscribe` (Boolean)**: TRUE/FALSE value indicating whether the player is a PLAICRAFT subscriber.  
- **`hashedEmail` (String)**: Unique hashed identifier linking players to session data.  
- **`played_hours` (Float)**: Number representing total hours played.  
- **`name` (String)**: The player’s first name.  
- **`gender` (String)**: Player's gender categorized as **Male, Female, and Non-binary**.  
- **`Age` (Integer)**: Player’s age in years. *Two values are missing.*

#### **Data Summary and Issues**  
The dataset consists of **196 player records**. The **`Age`** column has two missing values, which would affect calculations such as means and model fitting, so these rows need to be handled (for example, by dropping them) before analysis.
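
The missing-value count can be checked directly before any filtering. The following is a minimal sketch on a small made-up data frame: `toy_players` and all of its values are hypothetical and only mirror some of the column names listed above. On the real data, the same calls would be run on the data frame read from **players.csv**.

```r
# Hypothetical stand-in for players.csv; values are invented for illustration.
toy_players <- data.frame(
  experience   = c("Pro", "Regular", "Amateur", "Regular", "Veteran"),
  played_hours = c(30.3, 0.1, 0.0, 1.5, 3.8),
  gender       = c("Male", "Female", "Male", "Non-binary", "Male"),
  Age          = c(9, 17, NA, 21, NA)
)

# Count missing values in each column.
missing_per_column <- colSums(is.na(toy_players))
missing_per_column

# Keep only rows with a recorded Age.
toy_complete <- toy_players[!is.na(toy_players$Age), ]
nrow(toy_complete)
```

Counting per column first shows whether dropping rows removes a handful of records or a large share of the data.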

## **2. Questions**

In this project, I will be answering the broad question of:  
**We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.** Identifying the player characteristics associated with more playtime, and therefore more data, would let recruiting focus on those groups.  

#### **Specific Question**  
Can a **Regular** player's `Age` and `gender` be used to predict their total accumulated playing hours (`played_hours`)?  

#### **Hypothesis**  
Players with higher **played hours** are expected to have greater experience levels, while older players may have different gaming habits that affect skill progression. **Subscription status** and **gender** could influence engagement and playstyle, potentially impacting skill development. By analyzing key attributes like `played_hours`, `age`, `gender`, and `subscribe`, we can explore patterns in player behavior and how they relate to experience levels.  

#### **Plan on Data Wrangling** 
- Load and read **players.csv**.  
- Remove rows with missing values to ensure data completeness.  
- Select relevant variables: `experience`, `Age`, `gender`, and `played_hours`.  
- Filter the data to include only players with `experience` classified as **Regular**.

## **3. Exploratory Data Analysis and Visualization**

In [None]:
library(repr)
library(tidyverse)

In [None]:
players <- read_csv("data/players.csv")

players_filtered <- players |>
                    select(experience, played_hours, Age, gender) |>
                    filter(experience == "Regular") |>
                    drop_na(Age)

players_filtered 

quantitative_summary <- players |>
                    select(played_hours, Age) |>
                    summarise(
                        mean_age = mean(Age, na.rm = TRUE),
                        mean_played_hours = mean(played_hours, na.rm = TRUE)
                    )

quantitative_summary

**Result:** The average player age is 20.52 years, and the average total playtime is 5.85 hours.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)

players_plot <- players_filtered |>
    ggplot(aes(x = Age, y = played_hours, colour = gender)) + 
    geom_point(alpha = 0.8) + # alpha < 1 makes points slightly transparent so overlapping points stay visible
    labs(x = "Age (years)", y = "Total hours played", colour = "Gender") +
    ggtitle("Relationship between age, gender, and hours played")

players_plot

#### **Observations from the Scatter Plot**  
The scatter plot shows that most players have low playtime, with the majority clustered below 10 hours of gameplay. A few players, particularly those around ages 15-25, have very high playtime (upwards of 150 to 200 hours), which could indicate highly engaged players or potential data entry errors. There is no strong correlation between age and playtime, which gives little support to the idea that age alone predicts playtime. Additionally, male players appear more frequently in the dataset, but extreme playtime outliers come from multiple gender categories, suggesting high engagement is not limited to one group. Overall, further analysis is needed to confirm these trends, and potential outliers should be examined to ensure data accuracy.
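
One way to put numbers behind these visual impressions is to compute the correlation between age and playtime and to list the players whose playtime sits far above the rest. The sketch below runs on simulated data: `sim_players`, the seed, and the 1.5 × IQR cutoff are all assumptions chosen for illustration, not results from **players.csv**.

```r
set.seed(1)

# Simulated stand-in: most players have low playtime, a few have very high playtime.
sim_players <- data.frame(
  Age          = sample(10:40, 50, replace = TRUE),
  played_hours = c(round(rexp(47, rate = 1), 1), 150, 180, 220)
)

# Pearson correlation between age and playtime
# (a value near 0 means a weak linear relationship).
age_hours_cor <- cor(sim_players$Age, sim_players$played_hours)
age_hours_cor

# Flag high outliers with the common 1.5 * IQR rule (an assumed cutoff).
q <- quantile(sim_players$played_hours, probs = c(0.25, 0.75))
upper_cutoff <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
high_outliers <- sim_players[sim_players$played_hours > upper_cutoff, ]
high_outliers
```

Listing the flagged rows, rather than deleting them automatically, leaves the decision about whether they are errors or genuinely engaged players to a manual check.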

## **4. Methods and Plan**  

To address our research question, we will use **linear regression** to analyze the relationship between `Age`, `gender`, and `played_hours` for players classified as **Regular**. Linear regression is appropriate because the response, `played_hours`, is numeric and the fitted coefficients directly quantify how each predictor relates to total playtime. The method assumes that playtime changes linearly with `Age` and shifts by a constant amount between gender categories, which lets us check whether younger players generally play more and whether gender is associated with playtime.  
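
As a rough illustration of what fitting such a model looks like, the sketch below uses base R's `lm()` on simulated data. The data frame `sim_regular`, its values, and the seed are assumptions made for the example; the actual analysis would use the filtered Regular players instead.

```r
set.seed(2024)

# Simulated stand-in for the Regular players (values are invented).
n <- 60
sim_regular <- data.frame(
  Age    = sample(12:45, n, replace = TRUE),
  gender = factor(rep(c("Male", "Female", "Non-binary"), length.out = n))
)
sim_regular$played_hours <- pmax(0, 5 - 0.05 * sim_regular$Age + rnorm(n, sd = 2))

# Fit played_hours as a linear function of Age and gender.
# lm() turns the gender factor into indicator (dummy) variables automatically.
fit <- lm(played_hours ~ Age + gender, data = sim_regular)

# The Age coefficient is the estimated change in hours per extra year of age,
# holding gender fixed; the gender coefficients are offsets from the baseline level.
coef(fit)
```

With a three-level factor, the model has four coefficients: an intercept, one slope for `Age`, and one offset for each non-baseline gender category.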

However, there are some limitations to consider. Outliers, such as players with exceptionally high playtime, could skew the fitted line and make the model less accurate. Additionally, `Age` and `gender` may not be the only factors influencing playtime; other elements like gaming preferences, skill level, or external commitments could also play a role, which our model does not account for.  

To ensure the reliability of our findings, we will **split the dataset into 70% training data and 30% test data**, allowing us to test how well our model generalizes to new players. We will also apply **k-fold cross-validation** on the training set to get a more stable estimate of model error than a single split provides and to detect overfitting. The performance of the model will be evaluated using **Root Mean Squared Error (RMSE)** on the training and validation data and **Root Mean Squared Prediction Error (RMSPE)** on the test data, both of which measure how close the predicted playtime values are to the actual values. These steps will help determine whether `Age` and `gender` are meaningful predictors of total playtime among Regular players.
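
The evaluation plan above can be sketched as follows. This is a minimal base R version on simulated data, not the final analysis code: `sim_regular`, the seed, the choice of 5 folds, and the helper `rmse()` are all assumptions for illustration, and RMSPE is taken here to mean RMSE computed on the held-out test set.

```r
set.seed(123)

# Simulated stand-in for the Regular players (values are invented).
n <- 100
sim_regular <- data.frame(
  Age    = sample(12:45, n, replace = TRUE),
  gender = factor(rep(c("Male", "Female"), length.out = n))
)
sim_regular$played_hours <- pmax(0, 5 - 0.05 * sim_regular$Age + rnorm(n, sd = 2))

# Root mean squared error between observed and predicted values.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# 70/30 train/test split.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- sim_regular[train_idx, ]
test  <- sim_regular[-train_idx, ]

# 5-fold cross-validation on the training set only.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_rmse <- sapply(seq_len(k), function(i) {
  fold_fit <- lm(played_hours ~ Age + gender, data = train[fold_id != i, ])
  held_out <- train[fold_id == i, ]
  rmse(held_out$played_hours, predict(fold_fit, newdata = held_out))
})
mean(cv_rmse)

# Refit on the full training set and compute RMSPE on the untouched test set.
final_fit <- lm(played_hours ~ Age + gender, data = train)
rmspe <- rmse(test$played_hours, predict(final_fit, newdata = test))
rmspe
```

Keeping the test set out of both the cross-validation and the final fit means the RMSPE reflects performance on players the model has never seen.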