# Data Science 100 Project

## Introduction
### Background
Video games are a popular way for people to play and connect with others. Game makers and researchers often use newsletters to share updates, events, or news with players. But not every player signs up for these newsletters. If we can find out which players are more likely to subscribe, we can better understand what kinds of players are more interested and involved.

In this project, we look at real data from a Minecraft research server. The data includes player information and how they behave in the game. We want to find out which player features and behaviors are most useful in predicting whether someone will subscribe to the newsletter. This can help game teams and researchers plan better ways to reach the right players.
### Link to GitHub
https://github.com/90419359/data-science-project
### Question 1 (the selection of the project)
This project explores whether a player’s gender can predict their likelihood of subscribing to a game-related newsletter, and whether this pattern differs across experience levels. Using data from players.csv, we focus on two variables, gender and experience, to compare subscription behavior among different player groups.

The response variable is subscribe (TRUE or FALSE), and the explanatory variable is gender, with experience level used as a secondary grouping variable. The goal is to visualize and describe any patterns that may suggest a relationship between these characteristics and subscription behavior.
### Data Description
#### 1. players.csv
Each row in this dataset represents an individual player. The columns are:

- `experience`: Self-reported gaming experience, categorized as Beginner, Amateur, Regular, Veteran, or Pro.
- `subscribe`: Whether the player subscribed to the server’s newsletter (TRUE or FALSE).
- `hashedEmail`: A pseudonymized identifier for each player.
- `played_hours`: Total number of hours the player has played on the server.
- `name`: The first name of the player.
- `gender`: Gender identity (for example Male, Female, or Non-binary).
- `age`: The player’s self-reported age (integer).

#### 2. sessions.csv
Each row represents one gameplay session. The columns are:

- `hashedEmail`: The pseudonymized identifier of the player who played the session; not used in this project.
- `start_time`: The human-readable start time of the session.
- `end_time`: The human-readable end time of the session.
- `original_start_time`: Start time in Unix timestamp format.
- `original_end_time`: End time in Unix timestamp format.

These fields allow for the analysis of session length, activity patterns, and player engagement over time.
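As a rough illustration of how session length could be derived from these fields, the sketch below works on a small made-up table rather than the real sessions.csv. The names `toy_sessions`, `session_lengths`, and `player_engagement` are hypothetical, and the code assumes the Unix timestamps are stored in milliseconds, which should be checked against the actual file before use.

```r
library(tidyverse)

# Toy stand-in for sessions.csv; the values are invented for illustration.
toy_sessions <- tibble(
  hashedEmail         = c("a1", "a1", "b2"),
  original_start_time = c(1.0e12, 1.0e12 + 3.6e6, 1.0e12),
  original_end_time   = c(1.0e12 + 1.8e6, 1.0e12 + 5.4e6, 1.0e12 + 6.0e5)
)

# ASSUMPTION: timestamps are in milliseconds, so dividing by 60,000 gives minutes.
session_lengths <- toy_sessions |>
  mutate(duration_min = (original_end_time - original_start_time) / 60000)

# Number of sessions, total minutes, and mean minutes per player.
player_engagement <- session_lengths |>
  group_by(hashedEmail) |>
  summarize(
    n_sessions = n(),
    total_min  = sum(duration_min),
    mean_min   = mean(duration_min)
  )
player_engagement
```

A per-player summary like this could later be joined to players.csv through `hashedEmail` if session behavior were added to the analysis.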

In [None]:
library(tidyverse)

In [None]:
# load the data

# store the URLs of the raw CSV files
player_url <- "https://raw.githubusercontent.com/90419359/data-science-project/refs/heads/main/players.csv"
session_url <- "https://raw.githubusercontent.com/90419359/data-science-project/refs/heads/main/sessions.csv"
# download the files into the working directory
download.file(player_url, destfile = "players.csv")
download.file(session_url, destfile = "sessions.csv")
# read the files into data frames
Player_data <- read_csv("players.csv")
Sessions_data <- read_csv("sessions.csv")

In [None]:
Player_data

In [None]:
Sessions_data

In [None]:
# clean the data and compute summaries

In [None]:
# collapse gender into three groups: Male, Female, and Other (every remaining category)
Player_data <- Player_data |>
  mutate(gender_simple = ifelse(
    gender == "Male", "Male",
    ifelse(gender == "Female", "Female", "Other")
  ))
Player_data

In [None]:
# count players in each combination of gender group and subscription status
gender_subscribe <- Player_data |>
  group_by(gender_simple, subscribe) |>
  summarize(count = n(), .groups = "drop")
gender_subscribe

In [None]:
# create a bar chart of subscription counts for female players
gender_subscribe_female_bar <- gender_subscribe |>
  filter(gender_simple == "Female") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "Subscription Status",
       y = "Number of Female Players",
       title = "Female User Subscription Overview")
gender_subscribe_female_bar

In [None]:
gender_subscribe_male_bar <- gender_subscribe |>
  filter(gender_simple == "Male") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "Subscription Status",
       y = "Number of Male Players",
       title = "Male User Subscription Overview")
gender_subscribe_male_bar

In [None]:
gender_subscribe_gender_minorities_bar <- gender_subscribe |>
  filter(gender_simple == "Other") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "Subscription Status",
       y = "Number of Players (Other Genders)",
       title = "Other-Gender User Subscription Overview")
gender_subscribe_gender_minorities_bar

In [None]:
# count players in each combination of experience level and subscription status
experience_subscribe <- Player_data |>
  group_by(experience, subscribe) |>
  summarize(count = n(), .groups = "drop")
experience_subscribe 

In [None]:
experience_subscribe_pro_bar <- experience_subscribe |> 
  filter(experience == "Pro") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity",fill="purple",color="black") +
  labs(x = "Subscription Status",
       y = "Number of Pro Players",
       title = "Pro User Subscription Overview")

experience_subscribe_pro_bar

In [None]:
experience_subscribe_beginner_bar <- experience_subscribe |>
  filter(experience == "Beginner") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity", fill = "light blue", color = "black") +
  labs(x = "Subscription Status",
       y = "Number of Beginner Players",
       title = "Beginner User Subscription Overview")

experience_subscribe_beginner_bar

In [None]:
experience_subscribe_regular_bar <- experience_subscribe |>
  filter(experience == "Regular") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity", fill = "orange", color = "black") +
  labs(
    x = "Subscription Status",
    y = "Number of Regular Players",
    title = "Regular User Subscription Overview"
  )

experience_subscribe_regular_bar

In [None]:
experience_subscribe_amateur_bar <- experience_subscribe |>
  filter(experience == "Amateur") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity", fill = "light green", color = "black") +
  labs(x = "Subscription Status",
       y = "Number of Amateur Players",
       title = "Amateur User Subscription Overview")

experience_subscribe_amateur_bar

In [None]:
experience_subscribe_veteran_bar <- experience_subscribe |>
  filter(experience == "Veteran") |>
  ggplot(aes(x = subscribe, y = count)) +
  geom_bar(stat = "identity", fill = "light yellow", color = "black") +
  labs(x = "Subscription Status",
       y = "Number of Veteran Players",
       title = "Veteran User Subscription Overview")

experience_subscribe_veteran_bar

## Explain any insights you gain from these plots that are relevant to address your question:
### Image-Based Interpretation
The bar charts for gender and experience level show general trends in subscription behavior. In every gender group (Male, Female, and Other), the bar for subscribed players (TRUE) is taller than the bar for players who did not subscribe (FALSE). The difference between male and female players looks small, however, and the plots alone cannot show whether it is meaningful; that would require a statistical test or model. Players in the “Other” group also appear more likely to subscribe than not, but the small sample size makes it hard to draw firm conclusions. Gender may therefore have some influence, but the visual evidence for it is weak.

In contrast, experience level shows a much clearer relationship. Beginner and Regular players have noticeably taller “Subscribed” bars, suggesting stronger interest in newsletter features. Amateur players show a moderate difference, while Pro and Veteran players are more evenly split. These visual differences across experience levels are more apparent and consistent.

Overall, while both gender and experience level may relate to subscription behavior, the player’s experience level shows stronger and clearer visual differences, suggesting it may be the more influential factor in this analysis.
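Because the groups differ in size, bar heights are easier to compare when expressed as subscription proportions. The following is a minimal sketch of that calculation on made-up data; `toy_players` and `subscribe_rate` are hypothetical names, and the numbers say nothing about the real players.csv.

```r
library(tidyverse)

# Toy stand-in for Player_data; the values are invented for illustration only.
toy_players <- tibble(
  experience = c("Beginner", "Beginner", "Beginner", "Beginner", "Pro", "Pro"),
  subscribe  = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Group size, number subscribed, and proportion subscribed per experience level.
# mean() of a logical column gives the share of TRUE values.
subscribe_rate <- toy_players |>
  group_by(experience) |>
  summarize(
    n_players       = n(),
    n_subscribed    = sum(subscribe),
    prop_subscribed = mean(subscribe)
  )
subscribe_rate
```

The same pattern would apply to `gender_simple` by swapping the grouping column.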

## Data Analysis
### 1. Why is this method appropriate?
This method is appropriate because the variables of interest (subscribe, gender_simple, and experience) are all categorical. Bar charts allow clear visual comparisons between groups and help identify patterns in player behavior without requiring advanced statistical modeling, which makes them well suited to exploratory analysis.
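One possible way to put all groups on a single comparable scale is a stacked bar chart with `position = "fill"`, which draws each group as proportions rather than counts. The sketch below uses invented data; `toy_players` and `experience_prop_bar` are hypothetical names and the chart is not one of the plots produced in this project.

```r
library(tidyverse)

# Toy stand-in for Player_data; the values are invented for illustration only.
toy_players <- tibble(
  experience = c("Beginner", "Beginner", "Beginner", "Amateur", "Amateur", "Pro", "Pro"),
  subscribe  = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# One bar per experience level; position = "fill" rescales each bar to height 1,
# so the coloured segments show the share of subscribers within each group.
experience_prop_bar <- toy_players |>
  ggplot(aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(x = "Experience Level",
       y = "Proportion of Players",
       fill = "Subscribed",
       title = "Subscription Share by Experience Level (toy data)")
experience_prop_bar
```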
### 2. Which assumptions are required, if any, to apply the method selected?
Bar charts used in exploratory data analysis do not require strong statistical assumptions such as normality or linearity. The only assumptions are that the data are correctly grouped and categorized, and that each observation is independent.
### 3. What are the potential limitations or weaknesses of the method selected?
One limitation is that this method is descriptive only. It shows trends but does not provide statistical evidence or prediction. Also, differences in subscription rates may be influenced by other variables (such as age or playtime), which are not accounted for in this visual analysis. The sample size for certain groups, like “Other” gender or “Pro” players, may also be too small to generalize.
### 4. How did you compare and select the model?
No statistical model was applied in this analysis. The focus was on visual exploration through bar plots rather than prediction or classification, so model selection and comparison were not applicable.
### 5. How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
Although no predictive model was applied, the dataset was pre-processed to support effective visualization. Specifically, the data were grouped and summarized by gender_simple and experience using group_by() and summarize() functions. This helped generate count-based tables that clearly show the number of subscribed and non-subscribed players in each category. These pre-aggregated tables made it easier to directly construct bar plots for visual analysis. No data splitting or cross-validation was performed, as no modeling was involved.

# Discussion

This analysis suggests that both gender and experience level may be related to a player’s likelihood of subscribing to a game-related newsletter, although the evidence is visual only. In particular, beginner and regular players showed stronger subscription tendencies, while pro and veteran players were more evenly split. Any difference between female and male players was small based on visual inspection. Players identifying as other genders also showed a tendency to subscribe, but the small sample size makes this harder to interpret.

These results were somewhat in line with expectations. It is reasonable to expect that newer or more casually engaged players might be more receptive to outreach like newsletters, while more experienced players may be less influenced. The subtle gender differences were expected but not especially pronounced.

The findings suggest that targeting users based on experience level may be more effective for increasing newsletter engagement. This could be useful for the game research team when deciding how to focus communication strategies or future recruitment efforts.

Future questions could include: (1) Are there interaction effects between gender and experience? (2) Could a predictive model (e.g., logistic regression) help confirm these relationships? These directions could extend the current findings and lead to a deeper understanding.
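As a starting point for those future questions, a logistic regression with an interaction term could be fitted roughly as follows. This is only a sketch on simulated data: `toy_players` and `interaction_fit` are hypothetical names, the seed and probabilities are arbitrary, and base R `glm()` is just one of several ways to fit such a model.

```r
library(tidyverse)

# Simulated stand-in for Player_data; these values are random, not real results.
set.seed(100)  # arbitrary seed so the toy data are reproducible
toy_players <- tibble(
  gender_simple = sample(c("Female", "Male"), 200, replace = TRUE),
  experience    = sample(c("Beginner", "Pro"), 200, replace = TRUE),
  subscribe     = rbinom(200, size = 1, prob = 0.7) == 1
)

# gender_simple * experience expands to both main effects plus their interaction.
interaction_fit <- glm(
  subscribe ~ gender_simple * experience,
  data   = toy_players,
  family = binomial
)

# Coefficient table (log-odds scale); the last row is the interaction term.
summary(interaction_fit)$coefficients
```

On the real data, a small or noisy interaction coefficient would indicate that the gender pattern does not clearly differ across experience levels.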