# Individual Planning Report

## 1. Data Description

In [None]:
library(tidyverse)
library(repr)
library(readxl)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/Chenwen-Zhang/individual-planning-report/refs/heads/main/players.csv")

In [None]:
sessions <- read_csv("https://raw.githubusercontent.com/Chenwen-Zhang/individual-planning-report/refs/heads/main/sessions.csv")

### Data description of players dataset

1. Number of observations: 196
2. Number of variables: 7
3. Name and type of variables (the five described in this report):
   
- experience (chr): the player's degree of familiarity with and skill in the game

- gender (chr): the gender of the player

- Age (dbl): the age of the player

- played_hours (dbl): the total number of hours the player has played the game

- subscribe (lgl): whether the player subscribed to the game-related newsletter

### Data description of sessions dataset

1. Number of observations: 1535
2. Number of variables: 5
3. Name and type of variables:
   
- hashedEmail (chr): the hashed email address of the player

- start_time (chr): the start time of the play session

- end_time (chr): the end time of the play session

- original_start_time (dbl): the original start time of the session, stored as a number

- original_end_time (dbl): the original end time of the session, stored as a number
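The counts and types listed above can be checked directly in R. The sketch below uses a hypothetical helper, `describe_columns()`, and a tiny made-up tibble (`toy_players`) in place of the real data; neither is part of the original analysis, and the toy values are purely illustrative.

```r
library(tidyverse)

# Hypothetical helper: one row per column, giving its name and first class
describe_columns <- function(df) {
  tibble(
    variable = names(df),
    type = map_chr(df, ~ class(.x)[1])
  )
}

# Made-up stand-in for the players data (values are illustrative only)
toy_players <- tibble(
  experience = c("Pro", "Amateur"),
  Age = c(17, 21),
  played_hours = c(1.5, 0),
  subscribe = c(TRUE, FALSE)
)

dim(toy_players)               # number of observations, number of variables
describe_columns(toy_players)  # name and type of each variable
```

Calling `dim(players)` and `describe_columns(players)` on the loaded data would give the figures reported in this section.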

## 2. Questions

1. Broad question: Which player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between player types?
   
2. Specific question:
- Can a player's age predict whether they will subscribe to a game-related newsletter?

3. How will the data help address the question of interest?
- subscribe is the response variable; experience, gender, Age and played_hours are candidate predictors

4. Data wrangling:
- handle the missing values (NA)
- convert character variables into factors

5. Prediction model: classification, because the response variable is categorical. The numeric predictors need to be standardized before modelling.
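Standardization matters for distance-based classifiers such as K-nearest neighbors, because a predictor measured on a larger scale would otherwise dominate the distance calculation. The sketch below centers and scales two numeric predictors by hand on a made-up tibble (`toy_players`, illustrative values only), so that each ends up with mean 0 and standard deviation 1.

```r
library(tidyverse)

# Made-up stand-in for the numeric predictors (values are illustrative only)
toy_players <- tibble(
  Age = c(17, 21, 35, 48),
  played_hours = c(0.5, 30.2, 1.0, 0.0)
)

# Subtract each column's mean and divide by its standard deviation
standardized <- toy_players |>
  mutate(across(c(Age, played_hours), ~ (.x - mean(.x)) / sd(.x)))

# Each standardized column should now have mean 0 and sd 1
standardized |>
  summarize(across(everything(), list(mean = mean, sd = sd)))
```

In a full modelling workflow the means and standard deviations would be estimated from the training data only and then applied to the testing data.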

## 3. Exploratory Data Analysis and Visualization

### Summarize the variables

In [None]:
# Maximum, minimum and mean age, ignoring missing values
players_age_max <- players |>
  select(Age) |>
  summarize(players_age_max = max(Age, na.rm = TRUE))

players_age_max

players_age_min <- players |>
  select(Age) |>
  summarize(players_age_min = min(Age, na.rm = TRUE))

players_age_min

players_age_mean <- players |>
  select(Age) |>
  summarize(players_age_avg = mean(Age, na.rm = TRUE))

players_age_mean

In [None]:
# Mean, minimum and maximum hours played, ignoring missing values
played_hours_mean <- players |>
  select(played_hours) |>
  summarize(played_hours_avg = mean(played_hours, na.rm = TRUE))

played_hours_mean

played_hours_min <- players |>
  select(played_hours) |>
  summarize(played_hours_min = min(played_hours, na.rm = TRUE))

played_hours_min

played_hours_max <- players |>
  select(played_hours) |>
  summarize(played_hours_max = max(played_hours, na.rm = TRUE))

played_hours_max

In [None]:
# Number of players in each category of experience, gender and subscribe
player_experience_count <- players |>
  select(experience) |>
  group_by(experience) |>
  summarize(count = n())

player_experience_count

player_gender_count <- players |>
  select(gender) |>
  group_by(gender) |>
  summarize(count = n())

player_gender_count

player_subscribe_count <- players |>
  select(subscribe) |>
  group_by(subscribe) |>
  summarize(count = n())

player_subscribe_count

### Wrangling

In [None]:
# Session length from the numeric timestamps (in the same units as the original_* columns)
sessions_mutated <- sessions |>
  select(start_time:original_end_time) |>
  mutate(duration = original_end_time - original_start_time)

sessions_mutated
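The units of `duration` depend on how the numeric timestamps are encoded, which this report does not state. An alternative is to parse the text columns and compute the session length in minutes. The sketch below assumes the text timestamps follow a day/month/year hour:minute layout; that format, the made-up tibble `toy_sessions` and the column name `duration_mins` are all assumptions for illustration and should be checked against the real data.

```r
library(tidyverse)
library(lubridate)

# Made-up stand-in for the sessions data; the "dd/mm/yyyy HH:MM" format is an assumption
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse the text into date-times, then take the difference in minutes
toy_sessions_timed <- toy_sessions |>
  mutate(start_time = dmy_hm(start_time),
         end_time = dmy_hm(end_time),
         duration_mins = as.numeric(difftime(end_time, start_time, units = "mins")))

toy_sessions_timed
```

If the real columns use a different layout, another lubridate parser such as `ymd_hms()` would replace `dmy_hm()`.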

In [None]:
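players_tidy <- players |>
  # convert the categorical variables to factors
  mutate(experience = as.factor(experience),
         subscribe = as.factor(subscribe),
         gender = as.factor(gender)) |>
  # drop rows with a missing value in any column
  drop_na()

players_tidy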

### Visualization

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)

# Number of players at each age, split by subscription status
players_plot_1 <- players_tidy |>
  select(Age, subscribe) |>
  ggplot(aes(x = Age, fill = subscribe)) +
  geom_bar(position = "dodge") +
  scale_x_continuous(breaks = seq(0, 60, by = 10)) +
  xlab("Age of player (years)") +
  ylab("Number of players") +
  labs(fill = "Subscribed") +
  theme(text = element_text(size = 15))

players_plot_1

This bar plot compares the number of subscribed and non-subscribed players at each age. Players aged 17 have by far the highest counts in both groups.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)

# Proportion of subscribed and non-subscribed players at each age
players_plot_2 <- players_tidy |>
  select(Age, subscribe) |>
  ggplot(aes(x = Age, fill = subscribe)) +
  geom_bar(position = "fill") +
  scale_x_continuous(breaks = seq(0, 60, by = 10)) +
  xlab("Age of player (years)") +
  ylab("Proportion of players") +
  labs(fill = "Subscribed") +
  theme(text = element_text(size = 15))

players_plot_2

This bar chart compares the proportion of players who subscribed at each age. All players aged around 15 or younger subscribed. Players under 30 are generally more likely to subscribe than not, but many of them did not; among 18-year-olds in particular, only about 30% subscribed.

By contrast, players above 30 show a fairly balanced pattern, with roughly equal numbers of subscribed and non-subscribed players.

Overall, although the younger age groups include many non-subscribed players, young players make up the majority of the data set. The young player group therefore has great potential for game engagement.
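The proportions read off the chart can also be computed as a table. The sketch below groups players into age bands and reports the share who subscribed in each; the tibble `toy_players`, its values and the band boundaries (15 and 30) are assumptions chosen to mirror the discussion, not results from the real data.

```r
library(tidyverse)

# Made-up stand-in for players_tidy (values are illustrative only)
toy_players <- tibble(
  Age = c(14, 17, 17, 18, 18, 25, 33, 41),
  subscribe = factor(c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE))
)

# Band the ages, then count players and compute the subscribed share per band
subscribe_by_age_group <- toy_players |>
  mutate(age_group = cut(Age,
                         breaks = c(0, 15, 30, Inf),
                         labels = c("15 and under", "16 to 30", "over 30"))) |>
  group_by(age_group) |>
  summarize(n_players = n(),
            prop_subscribed = mean(subscribe == "TRUE"))

subscribe_by_age_group
```

Reporting `n_players` next to each proportion shows how many players a given percentage rests on, which matters for the sparsely populated older ages.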

## 4. Methods and Plan

1. Classification is appropriate because the response variable (subscribe) is categorical.
2. The K-nearest neighbors (KNN) algorithm can be used here because it works directly from the data: it assigns a new sample to a class by measuring its proximity to neighboring points.
3. The KNN algorithm requires few assumptions about what the data must look like. It can capture essentially any shape of data for a class.
4. Potential limitations or weaknesses of the selected method:
- becomes very slow as the training data gets larger
- may not perform well with a large number of predictors
- may not perform well when classes are imbalanced
  
5. I will compare and select the model using cross-validation.
6. Having decided on a predictive question and done some preliminary exploration, the next step is to split the data into training and testing sets. I will use 75% of the data for training and 25% for testing. Within the training data, cross-validation repeatedly holds out one subset as the validation set for evaluation while the model is fit on the rest. The testing data is set aside until the final evaluation.
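The plan above could be sketched with the tidymodels framework roughly as follows. This is a minimal illustration under several assumptions rather than the final analysis: it assumes the tidymodels and kknn packages are installed, uses simulated data (`toy_players`) instead of the real players, and uses only the numeric predictors Age and played_hours. The seed, the number of folds and the grid of K values are arbitrary choices.

```r
library(tidyverse)
library(tidymodels)  # assumption: tidymodels and the kknn engine package are installed

set.seed(2024)  # arbitrary seed so the random split is reproducible

# Simulated stand-in for players_tidy (not real data)
toy_players <- tibble(
  Age = round(runif(120, 10, 50)),
  played_hours = rexp(120, rate = 0.2),
  subscribe = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE, prob = c(0.7, 0.3)))
)

# 75% training / 25% testing, keeping the class balance similar in both sets
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize the predictors using statistics from the training data only
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training data over a grid of K values
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

The testing set is not touched during tuning; it would only be used once, to evaluate the model refit with the chosen K.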

## 5. GitHub Repository

https://github.com/Chenwen-Zhang/individual-planning-report/tree/main