# DSCI 100 Individual Project

In [4]:
# Load the libraries needed to read the datasets, wrangle the data, and summarize and visualize it
library(tidyverse)
library(repr)
library(tidymodels)
library(RColorBrewer)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

Load and view the datasets. Both files appear to be comma-delimited, so I will use read_csv.

In [5]:
players_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")

players_data
sessions_data

ERROR: Error: 'data/players.csv' does not exist in current working directory ('/home/jovyan/work/dsci-100-project-final').
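The error above means the relative path did not resolve from the notebook's working directory. As a minimal sketch (not part of the original notebook), a small wrapper can check that the file exists before reading it and report the working directory if it does not. The helper name read_csv_checked and the toy file it is demonstrated on are hypothetical.

```r
library(readr)

# Hypothetical helper (an assumption, not from the original notebook):
# stop with a clear message, including the working directory, when the
# file is missing; otherwise read it with read_csv.
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  read_csv(path, show_col_types = FALSE)
}

# Demonstration on a throwaway file with made-up values, so the sketch
# runs without the project data.
toy_path <- tempfile(fileext = ".csv")
write_csv(data.frame(Age = c(17, 25), played_hours = c(1.5, 0.2)), toy_path)
toy_players <- read_csv_checked(toy_path)
toy_players
```

In the notebook itself, the same check could be applied to "data/players.csv" and "data/sessions.csv" once the files are in place.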


I can compute the mean values of quantitative data in the players dataset. 

In [None]:
# Mean played hours and mean age, ignoring missing values, rounded to 2 decimals
mean_quantitative <- summarize(players_data,
                               mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
                               mean_age = round(mean(Age, na.rm = TRUE), 2))

mean_quantitative

## Data Description

(1) The two datasets, players.csv and sessions.csv, consist of information from survey results. The data could be inaccurate since it is based on survey results and self-ratings. There is also a lot of missing data. 

players.csv contains player demographics, information, and engagement and has 196 observations and 7 variables. The mean age of players is 21.14 years and the mean played time is 5.85 hours.

| Variable Name | Data Type | Meaning | Issues |
|:--------:|:--------:|:--------:|:--------:|
| experience | Character | Player's self-reported skill level | Users rate their own skill: bias may be present |
| subscribe | Logical | Player's subscription status | N/A |
| hashedEmail | Character | Hashed (anonymized) player email | N/A |
| played_hours | Double | Total hours spent playing on server | A few players account for most of the total playtime |
| name | Character | Player's name | N/A |
| gender | Character | Player's gender | Missing data |
| Age | Double | Player's age | Missing data |

sessions.csv contains information on individual play sessions for each player; each row represents one session, identifying the player and the duration of play. It has 1535 observations and 5 variables:

| Variable Name | Data Type | Meaning | Issues |
|:--------:|:--------:|:--------:|:--------:|
|  hashedEmail  |  Character   |  Email linking session to player  |  N/A  |
|  start_time   |  Character   |  Session start time  |  N/A |
|  end_time    |  Character   |  Session end time |  Missing data  |
|  original_start_time | Double | Beginning of session timestamp | N/A|
|  original_end_time | Double | Ending of session timestamp| Missing data |

## Questions

(2) My predictive question is based on the broader question of what "kinds" of players are most likely to contribute a large amount of data. The question is: "Does age negatively affect the overall number of hours played?" To address this, I will use players.csv, specifically the variables Age and played_hours. Before creating the visualization, I will remove players with no recorded playtime (played_hours of 0) and rows with a missing Age.

In [None]:
players_data <- players_data |>
  filter(played_hours > 0) |>
  filter(!is.na(Age))
players_data

## Exploratory Data Analysis and Visualization

In [None]:
options(repr.plot.width = 11, repr.plot.height = 8)

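age_experience_effect <- players_data |>
  # Map experience to color so that the title and the color scale below apply
  ggplot(aes(x = Age, y = played_hours, color = experience)) +
  geom_point() +
  labs(x = "Player Age (years)",
       y = "Total Play Time (hours)",
       color = "Experience",
       title = "Effect of Age and Experience on Total Hours Played") +
  # scales is attached by tidymodels; the prefix makes the dependency explicit
  scale_y_log10(labels = scales::comma) +
  scale_color_brewer(palette = "Set2") +
  theme(text = element_text(size = 20))
age_experience_effect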



(3) This plot shows the relationship between player age and total hours played, with hours on a log scale. There appears to be a weak negative relationship between age and total hours played; younger players tend to spend more time than older players. Most players are under the age of 30. Although there is a downward trend, the points are quite spread out. On the log-scaled axis, the trend is approximately linear.
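One way to put a number on a "weak negative relationship" is a rank correlation. The sketch below is illustrative only: it uses made-up toy values rather than the project data, and the choice of Spearman's correlation is an assumption, not something the original analysis computes. Spearman's coefficient depends only on ranks, so it is unchanged by the log transform used on the y-axis.

```r
library(dplyr)

# Toy values invented for illustration (NOT taken from players.csv)
toy_players <- tibble(
  Age          = c(15, 17, 19, 21, 24, 30, 45),
  played_hours = c(30, 2, 12, 5, 0.5, 1, 0.2)
)

# Spearman's rank correlation between age and hours played;
# a negative value indicates that hours tend to fall as age rises
toy_cor <- toy_players |>
  summarize(spearman = cor(Age, played_hours, method = "spearman"))

toy_cor
```

A value close to 0 would match a weak relationship, while a value close to -1 would indicate a strong one. The same call could be run on the filtered players_data to check the visual impression.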

In [None]:
player_playtime_distrib <- players_data |> 
  ggplot(aes(x = played_hours)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    scale_x_log10() +
    labs(title = "Distribution of Player Play Time",
         x = "Play Time (hours)",
         y = "Number of Players") +
    theme(text = element_text(size = 20))
player_playtime_distrib

(3) Most players spend very little time on the game, and the number of players with high playtimes is small. This means a few "super-players" dominate the total, which makes averages such as the mean misleading.
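To illustrate why a few very high playtimes make the mean misleading, here is a small sketch comparing the mean with the median. The values are made up for illustration and are not from players.csv.

```r
library(dplyr)

# Toy playtimes (hours): five casual players and one "super-player"
toy_hours <- tibble(played_hours = c(0.1, 0.2, 0.3, 0.5, 1.0, 150))

# The mean is pulled up by the single large value; the median is not
toy_summary <- toy_hours |>
  summarize(mean_hours   = mean(played_hours),
            median_hours = median(played_hours))

toy_summary
```

Here the mean is 25.35 hours while the median is 0.4 hours, so the mean describes none of the six toy players well.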

In [None]:
player_age_distrib <- players_data |> 
  ggplot(aes(x = Age)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    scale_x_log10() +
    labs(title = "Distribution of Player Age",
         x = "Age (years)",
         y = "Number of Players") +
    theme(text = element_text(size = 20))
player_age_distrib

(3) The distribution is concentrated among younger ages. Outside this range, each age appears only a few times, which is the case for older players. This supports the idea that younger players are more numerous, contribute more data, and as a result account for more total playtime.

In [None]:
playtime_age_distrib <- players_data |>
  ggplot(aes(x = Age, y = played_hours)) +
    geom_bar(stat = "summary", fun = "sum", fill = "skyblue", color = "black") +
    labs(title = "Distribution of Play Time by Age",
        x = "Age (years)",
        y = "Total Play Time (hours)") +
    theme(text = element_text(size = 20))

playtime_age_distrib

Most of the total hours come from younger players. Players over 30 contribute very little total playtime. This supports the trend that age is negatively related to total hours played.

## Methods and Plan

(4) To address my question, I will use a bar graph that sums total playtime for each age, and I will look for trends across ages. This method is appropriate because the question is exploratory: summing playtime by age makes patterns easy to spot, especially since many players share the same age. It assumes that the recorded playtime is accurate. However, this graph has limitations. There are far more younger players than older ones, so the trend may partly reflect how many players there are at each age rather than how much each one plays. The graph also does not account for other factors or for individual variation (a single user's high playtime could produce a high total for that age). Because the method relies on a summary, the most important decision is choosing the best summary statistic. Before creating the visualization, I will clean the data by removing rows with missing values.
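The limitation described above, where one heavy player inflates the total for an age, can be seen by reporting the number of players and the median alongside the sum. This is a sketch on made-up toy values, not the project data, and the extra summary columns are an assumption about what might be useful rather than part of the original plan.

```r
library(dplyr)

# Toy data invented for illustration: age 17 includes one very heavy player
toy_players <- tibble(
  Age          = c(17, 17, 17, 22, 22, 40),
  played_hours = c(0.5, 1, 200, 2, 3, 1)
)

# For each age: how many players, their total hours, and their median hours
age_summary <- toy_players |>
  group_by(Age) |>
  summarize(n_players    = n(),
            total_hours  = sum(played_hours),
            median_hours = median(played_hours))

age_summary
```

In this toy example, age 17 has by far the largest total (201.5 hours) yet a median of only 1 hour, lower than the median for age 22, so the choice of summary statistic changes the conclusion.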

(5) GitHub repository link: https://github.com/elisdale/videogame_ds_project