For this project, I focus on Question 1: which "kinds" of players are most likely to contribute a large amount of data? I use the data from players.csv and sessions.csv to answer it.


In [None]:
set.seed(2020)  # fix the random seed for reproducibility
library(readr)
library(lubridate)
library(ggplot2)
library(dplyr)

# Load both datasets
players <- read.csv("players.csv")
sessions <- read.csv("sessions.csv")

# Report the dimensions of each dataset
cat("Players dataset:", nrow(players), "rows,", ncol(players), "columns\n")
cat("Sessions dataset:", nrow(sessions), "rows,", ncol(sessions), "columns\n")

# Preview the first few rows
head(players)
head(sessions)

# Inspect variable types
str(players)
str(sessions)


In this project, I use two datasets, players.csv and sessions.csv, which together describe player demographics and in-game activity logs. The datasets are linked through the variable hashedEmail, which acts as a unique player identifier.

The tables below give an overview of both datasets, followed by the variables in players.csv and then the variables in sessions.csv.

| Dataset        | Observations | Variables | Description                                                     |
| -------------- | ------------ | --------- | --------------------------------------------------------------- |
| `players.csv`  | 196          | 7         | Contains demographic and experience information for each player |
| `sessions.csv` | 1535         | 5         | Contains detailed time records of each player’s gaming sessions |

| Variable       | Type      | Description                                                                  |
| -------------- | --------- | ---------------------------------------------------------------------------- |
| `experience`   | Character | The player’s experience level (e.g., *Amateur*, *Regular*, *Pro*, *Veteran*) |
| `subscribe`    | Logical   | Whether the player has an active subscription (TRUE/FALSE)                   |
| `hashedEmail`  | Character | Anonymized unique identifier for each player, used to link across datasets   |
| `played_hours` | Numeric   | Total hours the player has played so far                                     |
| `name`         | Character | Player’s in-game name                                                        |
| `gender`       | Character | Player’s gender (*Male* or *Female*)                                         |
| `Age`          | Integer   | Player’s age (in years)                                                      |

| Variable              | Type      | Description                                                     |
| --------------------- | --------- | --------------------------------------------------------------- |
| `hashedEmail`         | Character | Anonymized player identifier shared with `players.csv`          |
| `start_time`          | Character | The start time of the game session                              |
| `end_time`            | Character | The end time of the game session                                |
| `original_start_time` | Numeric   | Unix-style timestamp representing the start of the session      |
| `original_end_time`   | Numeric   | Unix-style timestamp representing the end of the session        |


Both datasets can be merged using the hashedEmail field.

The players dataset provides demographic context, while sessions captures temporal play behavior.

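The timestamps in sessions require conversion before analysis: start_time and end_time are stored as character strings, and the Unix-style original_start_time and original_end_time values are not easy for a human to read.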
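As a minimal sketch of that conversion, the example below parses two day/month/year strings with lubridate and computes a session length in hours. The example values are hypothetical and not taken from sessions.csv, and treating the Unix-style value as milliseconds is an assumption that should be checked against the real data.

```r
library(lubridate)

# Hypothetical example values, not taken from sessions.csv
start_chr <- "30/06/2024 18:12"
end_chr   <- "30/06/2024 18:24"

# Parse day/month/year hour:minute strings (UTC by default)
start_dt <- dmy_hm(start_chr)
end_dt   <- dmy_hm(end_chr)

# Session length in hours
session_length <- as.numeric(difftime(end_dt, start_dt, units = "hours"))

# Assumption: the Unix-style value is in milliseconds, so divide by 1000
unix_ms <- 1719770000000
unix_dt <- as_datetime(unix_ms / 1000)
```

If the Unix-style columns turn out to be in seconds, the division by 1000 should be dropped.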

The goal of subsequent steps is to combine and summarize these data to explore relationships between player characteristics and playing patterns.
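The combine-and-summarize step can be sketched on toy data. The data frames `toy_players` and `toy_sessions` below are hypothetical stand-ins for the real files; the sketch only illustrates how a per-player summary joined with `left_join` keeps players without sessions and gives them NA values.

```r
library(dplyr)

# Hypothetical toy data standing in for players.csv and sessions.csv
toy_players <- data.frame(
  hashedEmail = c("a1", "b2", "c3"),
  Age = c(17L, 21L, 25L)
)
toy_sessions <- data.frame(
  hashedEmail = c("a1", "a1", "b2"),
  session_length = c(0.5, 1.5, 0.25)
)

# One row per player: total hours and number of sessions
toy_summary <- toy_sessions |>
  group_by(hashedEmail) |>
  summarise(
    total_play_time = sum(session_length, na.rm = TRUE),
    total_sessions = n()
  )

# Keep every player; "c3" has no sessions, so its summary columns are NA
toy_merged <- left_join(toy_players, toy_summary, by = "hashedEmail")
toy_merged
```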

After loading, I checked:
- The number of observations and variables.
- The types of variables (numeric, categorical, etc.).
- A quick preview of the first few rows.
- The missing data.
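The checks listed above can be sketched with base R on a small data frame. The data frame `toy` is hypothetical; the same calls could be applied to `players` and `sessions`.

```r
# Hypothetical toy data frame with one missing value
toy <- data.frame(
  Age = c(17L, NA, 25L),
  played_hours = c(0.0, 3.8, 1.2)
)

dim(toy)            # number of observations and variables
sapply(toy, class)  # type of each variable
head(toy, 2)        # preview of the first rows

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```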


The research question I want to investigate is: can a player’s age and total playtime predict whether they will subscribe to the game-related newsletter?
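One possible way to frame this question, shown only as an illustration, is a logistic regression of `subscribe` on `Age` and `played_hours`. This notebook does not commit to a specific model, so the choice of `glm` here is an assumption, and the data frame `toy` contains made-up values rather than real players.

```r
# Hypothetical toy data, not taken from players.csv
toy <- data.frame(
  Age = c(17, 21, 25, 19, 30, 22, 18, 40, 27, 23),
  played_hours = c(0.1, 5.2, 0.0, 12.5, 3.0, 0.0, 30.0, 0.0, 2.4, 0.7),
  subscribe = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
)

# Logistic regression: models the probability that subscribe is TRUE
fit <- glm(subscribe ~ Age + played_hours, data = toy, family = binomial)

# Predicted subscription probability for a hypothetical new player
new_player <- data.frame(Age = 20, played_hours = 4)
pred_prob <- predict(fit, newdata = new_player, type = "response")
```

With only ten made-up rows the fitted coefficients mean nothing; the sketch only shows the shape of the modelling step.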

In [None]:

glimpse(players)
glimpse(sessions)

# Parse the day/month/year timestamps and compute session length in hours
sessions <- sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_length = as.numeric(difftime(end_time, start_time, units = "hours"))
  )

# Summarise sessions to one row per player
players_summary <- sessions |>
  group_by(hashedEmail) |>
  summarise(
    total_play_time = sum(session_length, na.rm = TRUE),
    total_sessions = n()
  )

# Keep every player; players without sessions get NA in the summary columns
merged_data <- left_join(players, players_summary, by = "hashedEmail")

glimpse(merged_data)

colSums(is.na(merged_data))

sum(duplicated(merged_data$hashedEmail))

summary(merged_data)

Here is the data after cleaning:

Each player now has demographic information and their total play time and number of sessions.

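Time variables are parsed: start_time and end_time are converted to date-times, and session_length is computed in hours.

Missing values are counted per column with colSums(is.na(merged_data)). Players with no recorded sessions have NA for total_play_time and total_sessions after the left join, and they are filtered out before plotting.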

This cleaned dataset (merged_data) will be used in Exploratory Data Analysis (Step 4).
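An alternative to filtering out players without sessions is to treat their missing totals as zero. The sketch below uses `coalesce` from dplyr on a hypothetical data frame `toy_merged`; whether zero is the right value for these players is a judgment call, and this notebook does not apply it.

```r
library(dplyr)

# Hypothetical merged data: player "c3" has no recorded sessions
toy_merged <- data.frame(
  hashedEmail = c("a1", "b2", "c3"),
  total_play_time = c(2.0, 0.25, NA),
  total_sessions = c(2L, 1L, NA)
)

# Count missing values per column before filling
colSums(is.na(toy_merged))

# Replace NA with zero in the session summary columns
toy_filled <- toy_merged |>
  mutate(
    total_play_time = coalesce(total_play_time, 0),
    total_sessions = coalesce(total_sessions, 0L)
  )
```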

In [None]:
summary(merged_data[, c("Age", "played_hours", "total_play_time", "total_sessions")])

cor(merged_data[, c("Age", "played_hours", "total_play_time", "total_sessions")], use = "complete.obs")

merged_data |>
  filter(!is.na(total_play_time)) |>
  ggplot(aes(x = total_play_time)) +
  geom_histogram(fill = "violet", color = "black", bins = 30) +
  labs(title = "Distribution of Total Play Time",
       x = "Total Play Time (hours)",
       y = "Number of Players")


This histogram shows how total play time is distributed across players.

In [None]:
merged_data |>
  filter(!is.na(total_play_time)) |>
  ggplot(aes(x = Age, y = total_play_time)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "green") +
  labs(title = "Age vs Total Play Time",
       x = "Age", y = "Total Play Time (hours)")