# Title

## Introduction
* hook :
* research problem :
* research question :
* thesis :
* overview : 

## Methods 

**Reading Data**

To read in the data, we first loaded the `tidyverse` library. `tidyverse` is a collection of R packages used in data science for reading, wrangling, analyzing, and visualizing data.

The data comes from two files: `players.csv`, which describes participant characteristics, and `sessions.csv`, which records every individual play session. To maintain reproducibility, we published these files in our GitHub repository and read them in from a URL. The resulting datasets were named `players` and `sessions`, respectively.

The data was then tidied so that it works consistently with `tidyverse` functions before it was displayed. In our case, this meant renaming the `hashedEmail` column to `hashed_email` (and `Age` to `age` in `players`) so that column names are lowercase with underscores (`_`) between words, and using `as.POSIXct()` to convert the start and end times in `sessions` to `dttm` variables, which represent dates and times.

Finally, `players` and `sessions` were displayed using the `head()` function, which shows the first six rows of a dataset.

In [None]:
library(tidyverse)

In [None]:
players_url <- "https://raw.githubusercontent.com/20under20/dcsi100-project-group-002-13/refs/heads/main/players.csv"


players <- read_csv(players_url, show_col_types = FALSE) |>        # reading the 'players' dataset
           rename(hashed_email = hashedEmail,                      # tidying the 'players' dataset
                  age = Age) 
head(players)                                                      # displaying the 'players' dataset

sessions_url <- "https://raw.githubusercontent.com/20under20/dcsi100-project-group-002-13/refs/heads/main/sessions.csv"

sessions <- read_csv(sessions_url, show_col_types = FALSE) |>       # reading the 'sessions' dataset
            rename(hashed_email = hashedEmail) |>                   # tidying the 'sessions' dataset 
            mutate(start_time = as.POSIXct(start_time, format = "%d/%m/%Y %H:%M"),   # parse dd/mm/yyyy hh:mm text
                   end_time = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M")) 
head(sessions)                                                      # displaying the 'sessions' dataset
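
The tidying step can be checked on a couple of made-up rows. This is only a minimal sketch: the rows below are invented rather than taken from `sessions.csv`, the format string assumes the dd/mm/yyyy hh:mm layout described in Table 2, and `tz = "UTC"` is an assumption chosen so that the result does not depend on the local time zone.

```r
library(tidyverse)

# Made-up rows (not from sessions.csv) in the assumed dd/mm/yyyy hh:mm layout.
toy_sessions <- tibble(
  hashedEmail = c("abc123", "def456"),
  start_time  = c("30/06/2024 18:12", "05/01/2024 09:30")
)

toy_tidy <- toy_sessions |>
  rename(hashed_email = hashedEmail) |>                        # snake_case column name
  mutate(start_time = as.POSIXct(start_time,
                                 format = "%d/%m/%Y %H:%M",    # day/month/year hour:minute
                                 tz = "UTC"))                  # assumption, for a stable result

toy_tidy
```

Stating the format explicitly avoids relying on the default parsing of `as.POSIXct()`, which expects year-first text.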

The `players.csv` dataset contains characteristic information for each player. As seen in Table 1, each player is identified by their hashed email, a code generated by an algorithm to anonymize email addresses. Player characteristics include demographic data such as age and gender, as well as sign-up choices such as subscribing to the newsletter and choosing an anonymous username. The data also includes the number of gameplay hours attributed to each player, matched to their hashed email using server logs.

Table 1 : Description of the `players.csv` dataset

| Variable name | Type | Meaning | Possible values | Collection method |
|----------------|------------|----------------------------------------|----------------------------------------|-------------|
| `experience` | `<chr>` | Their level of experience. | Amateur, Regular, Veteran, or Pro | Sign-up form |
| `subscribe` | `<lgl>` | Whether they are subscribed to receive emails when other contributors are on the server. | TRUE or FALSE | Sign-up form |
| `hashed_email` | `<chr>` | Their anonymized email, used as an identifying token. | A string of 64 numbers and letters | Sign-up form |
| `played_hours` | `<dbl>` | The total number of hours they spent on PLAIcraft. | A number rounded to one decimal place | Server records |
| `name` | `<chr>` | The anonymous name they chose when they signed up for PLAIcraft. | A word or series of words connected by an `_` | Sign-up form |
| `gender` | `<chr>` | The gender they entered when they signed up for PLAIcraft. | One of the gender options offered on the sign-up form | Sign-up form |
| `age` | `<dbl>` | The age they entered when they signed up for PLAIcraft (in years). | A whole number from 9 to 99 | Sign-up form |
 
 
The `sessions.csv` dataset contains information for every session played, specified by a start time and an end time. As we can see in Table 2, each time is recorded both in dd/mm/yyyy hh:mm format, representing a PST date and time, and as a five-decimal number in scientific notation, representing a Unix timestamp. Each session is also associated with a hashed email.

Table 2 : Description of the `sessions.csv` dataset

| Variable name | Type | Meaning | Possible values | Collection method |
|----------------|------------|----------------------------------------|----------------------------------------|------------|
| `hashed_email` | `<chr>` | The player's anonymized email, used as an identifying token. | A string of 64 numbers and letters | Sign-up form |
| `start_time` | `<dttm>` | The date and time the session started (in PST). | A value in dd/mm/yyyy hh:mm format | Server records |
| `end_time` | `<dttm>` | The date and time the session ended (in PST). | A value in dd/mm/yyyy hh:mm format | Server records |
| `original_start_time` | `<dbl>` | The date and time the session started (as a Unix timestamp). | A five-decimal number expressed in scientific notation | Server records |
| `original_end_time` | `<dbl>` | The date and time the session ended (as a Unix timestamp). | A five-decimal number expressed in scientific notation | Server records |
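
To illustrate how a Unix timestamp relates to a readable date and time, here is a small sketch. The value is made up, and reading it as milliseconds since 1970-01-01 UTC is an assumption that the dataset description does not confirm; the result is shown in UTC, so it would not line up directly with a PST column.

```r
# Made-up value in the style of the Unix columns. Treating it as
# milliseconds since 1970-01-01 UTC is an assumption.
toy_unix_ms <- 1.71977e+12

# as.POSIXct() expects seconds, so divide the assumed milliseconds by 1000.
toy_datetime <- as.POSIXct(toy_unix_ms / 1000, origin = "1970-01-01", tz = "UTC")

format(toy_datetime, "%d/%m/%Y %H:%M")
```

Because scientific notation with five decimals keeps only six significant digits, a timestamp displayed this way identifies a time only approximately.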



**Wrangling data**

Our goal was to determine which player characteristics lead to the most played hours so that these kinds of players can be recruited as participants. We first needed to load the `lubridate` library to ... We then merged all values into one dataset, from which we selected the relevant variables.

For player characteristics, we used demographic data because it allows researchers to target different groups. This includes the `gender` and `age` columns. We also kept `hashed_email` to identify each player.

For played hours, we kept the `played_hours` column. We also wanted to see how our predictions change throughout the year, so we extracted `month` from the `<dttm>`-type start time variable, `start_time`.

**Wrangling data**

Our goal is to determine whether the demographic variables `age` and `gender` can predict whether a player will subscribe to a game-related newsletter, or the value of

In [None]:
library(lubridate)

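data <- left_join(sessions, players, by = "hashed_email") |>          # attach player characteristics to each session
        select(hashed_email, gender, age, played_hours, start_time) |>
        mutate(month = month.name[month(start_time)]) |>               # month name of each session's start time
        group_by(month, hashed_email) |>
        summarize(                                                      # one row per player per month
            monthly_sessions = n(), 
            gender = first(gender), 
            played_hours = first(played_hours),
            age = first(age),
            .groups = "drop")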
head(data)
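
The join and per-month session count can also be sketched on made-up tables. All rows and values below are invented for illustration, and `count()` is used here as a compact stand-in for the grouping step, not as the method used in this report.

```r
library(tidyverse)
library(lubridate)

# Made-up player and session tables (hypothetical values).
toy_players <- tibble(
  hashed_email = c("abc123", "def456"),
  gender       = c("Female", "Male"),
  age          = c(17, 21),
  played_hours = c(3.5, 0.2)
)

toy_sessions <- tibble(
  hashed_email = c("abc123", "abc123", "abc123", "def456"),
  start_time   = as.POSIXct(c("2024-06-30 18:12", "2024-06-30 23:33",
                              "2024-07-02 10:00", "2024-07-05 08:15"),
                            tz = "UTC")
)

# Join on the shared key, derive the month name, then count sessions
# so that each player appears once per month.
toy_monthly <- left_join(toy_sessions, toy_players, by = "hashed_email") |>
  mutate(month = month.name[month(start_time)]) |>
  count(month, hashed_email, gender, age, played_hours, name = "monthly_sessions")

toy_monthly
```

Naming the join key with `by` avoids relying on the automatic column matching that `left_join()` otherwise reports in a message.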

**Exploratory analysis**

**Clustering**

**Regression**

## Results

## Discussion