<div style="font-size:18px;">
Data Science 100 Final Project

Introduction:
A research group at UBC is collecting data on how people play video games, using a Minecraft server to record player activity. The group has posed several broad questions for us to explore. The broad question chosen for this project is: which "kinds" of players are most likely to contribute a large amount of data, so that those players can be targeted in recruiting efforts? From this we formulated a predictive question: can we predict the number of sessions a player takes part in on the Minecraft server based on their age?

This project uses two datasets taken from the Minecraft server, stored as two CSV files:
players.csv (information on the players)
sessions.csv (information on each player's individual sessions)

These two datasets were cleaned and joined by hashedEmail to create the dataset we used to answer our question.

| **Variable Name** | **Type**         | **Description**                                                               |
|-------------------|------------------|-------------------------------------------------------------------------------|
| hashedEmail       | Character        | An identifier for each player                                                 |
| Age               | Numeric          | The age of the player in years                                                |
| total_duration    | Numeric (Double) | The total amount of time that a player spent on the Minecraft server in seconds |
| gender            | Character        | The self-reported gender of the player                                        |
| num_session       | Integer          | The total number of sessions played by the player                             |
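
The cleaning and joining step described above can be sketched on a tiny, made-up example. The tables `toy_players` and `toy_sessions` and all of their values are hypothetical stand-ins for players.csv and sessions.csv; only the column names follow the project. This is a minimal sketch of the idea, not the exact pipeline used below.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only
toy_players <- tibble(
    hashedEmail = c("a1", "b2", "c3"),
    Age = c(17, 21, NA)
)
toy_sessions <- tibble(
    hashedEmail = c("a1", "a1", "b2", "d4"),
    original_start_time = c(0, 100, 50, 10),
    original_end_time = c(30, 160, 90, 40)
)

# Collapse the sessions to one row per player
toy_summary <- toy_sessions |>
    mutate(session_duration = original_end_time - original_start_time) |>
    group_by(hashedEmail) |>
    summarize(num_session = n(), total_duration = sum(session_duration))

# Keep players with a known age, then attach their session summary;
# inner_join drops players without sessions and sessions without a player
toy_joined <- toy_players |>
    filter(!is.na(Age)) |>
    inner_join(toy_summary, by = "hashedEmail")

toy_joined
```

In this toy example, player `c3` is dropped for a missing age and `d4` is dropped because it has no matching row in `toy_players`, so two players remain.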

**Summary of the dataset used**

Number of observations: 123

The variables used in the analysis: Age, num_session

The response variable: num_session

The explanatory variable: Age
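
To illustrate how the response and explanatory variables enter a regression, here is a minimal sketch using base R's `lm()` on made-up numbers. The data frame `toy` and its values are assumptions chosen so the arithmetic is easy to follow; they say nothing about the relationship in the real data.

```r
# Made-up values for illustration only; not taken from the project data
toy <- data.frame(
    Age = c(10, 20, 30, 40),
    num_session = c(9, 7, 5, 3)
)

# The response goes on the left of ~ and the explanatory variable on the right
toy_fit <- lm(num_session ~ Age, data = toy)
coef(toy_fit)

# Predicted number of sessions for a hypothetical 25-year-old player
toy_pred <- predict(toy_fit, newdata = data.frame(Age = 25))
toy_pred
```

The same formula, `num_session ~ Age`, is what the recipe in the analysis below passes to the model.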

In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
players <- read_csv("players.csv")
players

In [None]:
sessions <- read_csv("sessions.csv")
sessions

In [None]:
# Summarize sessions to one row per player:
# number of sessions and total time played
tidy_sessions <- sessions |>
    mutate(session_duration = original_end_time - original_start_time) |>
    filter(!is.na(hashedEmail), !is.na(session_duration)) |>
    group_by(hashedEmail) |>
    summarize(num_session = n(), total_duration = sum(session_duration, na.rm = TRUE))
tidy_sessions

In [None]:
# Keep players with a known age who have at least one recorded session
tidy_players <- players |>
    select(hashedEmail, gender, Age) |>
    filter(!is.na(Age)) |>
    semi_join(tidy_sessions, by = "hashedEmail")
tidy_players

In [None]:
player_session <- tidy_players |>
    inner_join(tidy_sessions, by = "hashedEmail")
player_session

In [None]:
# Average number of sessions for each age
avg_player_session <- player_session |>
    group_by(Age) |>
    summarize(avg_session = mean(num_session))
avg_player_session

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)

game_plot_1 <- avg_player_session |>
    ggplot(aes(x = Age, y = avg_session)) +
        geom_point(alpha = 1) +
        labs(title = "Average Sessions Played Based on Age",
             x = "Age of Players",
             y = "Average Sessions Played") +
        theme_minimal()
game_plot_1

In [None]:
player_session <- player_session |>
    mutate(gender = as_factor(gender))
player_session

In [None]:
set.seed(2024) # arbitrary seed so the random split is reproducible
game_split <- initial_split(player_session, prop = 0.75, strata = num_session)
game_training <- training(game_split)
game_testing <- testing(game_split)

In [None]:
lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")
lm_spec

In [None]:
# num_session is the response; Age is the only predictor
lm_recipe <- recipe(num_session ~ Age, data = game_training)

lm_fit <- workflow() |>
    add_recipe(lm_recipe) |>
    add_model(lm_spec) |>
    fit(data = game_training)
lm_fit

In [None]:
options(repr.plot.width = 14, repr.plot.height = 10)

# Add the model's predictions to the training data
game_preds <- lm_fit |>
    predict(game_training) |>
    bind_cols(game_training)

# Training data with the fitted regression line overlaid
lm_predictions <- game_preds |>
    ggplot(aes(x = Age, y = num_session)) +
        geom_point(alpha = 0.4) +
        geom_line(
            mapping = aes(x = Age, y = .pred),
            color = "blue") +
        xlab("Age of Players") +
        ylab("Number of Sessions Played by Players") +
        theme(text = element_text(size = 20))
lm_predictions

In [None]:
# Evaluate the fitted model on the held-out test set
lm_test_results <- lm_fit |>
    predict(game_testing) |>
    bind_cols(game_testing) |>
    metrics(truth = num_session, estimate = .pred)

# The RMSE computed on test data is the RMSPE
lm_rmspe <- lm_test_results |>
    filter(.metric == "rmse") |>
    select(.estimate) |>
    pull()
lm_rmspe

In [None]:
options(repr.plot.width = 8, repr.plot.height = 7)


test_preds <- lm_fit |>
    predict(game_testing) |>
    bind_cols(game_testing)

lm_predictions_test <- test_preds |>
    ggplot(aes(x = Age, y = num_session)) +
        geom_point(alpha = 0.4) +
        geom_line(
            mapping = aes(x = Age, y = .pred), 
            color = "blue") +
        xlab("Age of Players") +
        ylab("Number of Sessions Played by Players") +
        theme(text = element_text(size = 20))
lm_predictions_test