# Individual Plan - Stephanie Ye

In [77]:
library(tidyverse)

players_data <- read_csv("IPS/data/players.csv")

head(players_data)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


In [78]:
sessions_data <- read_csv("IPS/data/sessions.csv")

head(sessions_data)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


## Data description

There are 196 observations and 7 variables in the players table. The variables are:
1. `experience`: categorical, describes the experience level of the player.
2. `subscribe`: logical, describes whether the player has subscribed or not.
3. `hashedEmail`: categorical, the hashed email address of the player.
4. `played_hours`: numeric, describes how long the player has played the game, in hours.
5. `name`: categorical, the name of the player.
6. `gender`: categorical, the gender of the player.
7. `Age`: numeric, the age of the player.

There are 1535 observations and 5 variables in the sessions table. The variables are:
1. `hashedEmail`: categorical, the hashed email address of the player.
2. `start_time` and `end_time`: stored as character strings in `dd/mm/yyyy hh:mm` format, describe when the player started and stopped playing.
3. `original_start_time` and `original_end_time`: numeric, the UNIX timestamps recorded by the system for the start and end of the session.
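
To make the UNIX timestamp columns easier to interpret, a value can be converted to a date-time. The following is a minimal sketch, assuming the timestamps are in milliseconds (their 13-digit magnitude suggests this, but the source does not confirm it) and that UTC is the intended time zone. The variable names `original_start_ms` and `start_utc` are illustrative only.

```r
library(lubridate)

# Value copied from the first row of the sessions preview (assumed milliseconds)
original_start_ms <- 1719770000000

# as_datetime() expects seconds since 1970-01-01 UTC, so divide by 1000
start_utc <- as_datetime(original_start_ms / 1000, tz = "UTC")
start_utc
```

The previewed values appear rounded, so start and end timestamps in the same row look identical. If the stored values are rounded in the same way, these columns may be too coarse for computing session lengths.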

## Potential issues

### Players

In [79]:
# Count the number of players in each gender category
players_data |>
  group_by(gender) |>
  summarize(count = n())

gender,count
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6


1. The `hashedEmail` values are hashed and not human-readable, so the column is unlikely to be useful as a predictor.
2. There may be some extreme values in `played_hours`, which could affect the final results.
3. Some categories in `gender` are very small, such as `Other` (1 player) and `Agender` (2 players). A model may not be able to capture patterns for these groups.
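
Points 2 and 3 could be checked along the following lines. This is only a sketch on a small toy table shaped like `players_data`: the rows, the `min_count` threshold, and the `"Other (lumped)"` label are hypothetical choices for illustration, not part of the original analysis.

```r
library(dplyr)
library(tibble)
library(forcats)

# Toy rows with the same column names as players_data (illustrative values)
players_toy <- tibble(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  gender = c("Male", "Male", "Male", "Female", "Other", "Female")
)

# Summary statistics help spot extreme played_hours values
summary(players_toy$played_hours)

# Combine gender categories with fewer than min_count players into one level
min_count <- 2  # hypothetical threshold
players_lumped <- players_toy |>
  mutate(gender_lumped = fct_lump_min(gender, min = min_count,
                                      other_level = "Other (lumped)"))

count(players_lumped, gender_lumped)
```

Whether lumping is appropriate depends on the question being asked. It trades detail about small groups for more stable estimates.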

### Sessions

In [76]:
# start_time and end_time are dd/mm/yyyy text, so parse them before differencing
sessions_data |>
  mutate(duration_mins = as.numeric(difftime(lubridate::dmy_hm(end_time), lubridate::dmy_hm(start_time), units = "mins"))) |>
  select(start_time, end_time, duration_mins)

start_time,end_time,duration_mins
<chr>,<chr>,<dbl>
30/06/2024 18:12,30/06/2024 18:24,12
17/06/2024 23:33,17/06/2024 23:46,13
25/07/2024 17:34,25/07/2024 17:57,23
25/07/2024 03:22,25/07/2024 03:58,36
25/05/2024 16:01,25/05/2024 16:12,11
23/06/2024 15:08,23/06/2024 17:10,122
15/04/2024 07:12,15/04/2024 07:21,9
21/09/2024 02:13,21/09/2024 02:30,17
21/06/2024 02:31,21/06/2024 02:49,18
16/05/2024 05:13,16/05/2024 05:52,39


1. The `hashedEmail` values are hashed and not human-readable, so the column is unlikely to be useful as a predictor.
2. Some sessions may have a duration of 0 minutes or a very long duration. Such cases might indicate logging errors or players disconnecting immediately, and they could affect the final results of the prediction model.
3. The `start_time` and `end_time` columns are stored as character strings in `dd/mm/yyyy hh:mm` format, so they need to be converted to date-time values before durations can be calculated correctly.
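
The conversion in point 3 and the screening in point 2 could be sketched as follows. The toy rows mimic the format of `sessions_data` (the zero-length row is invented for illustration), and the `max_mins` cutoff and the `suspicious` column are hypothetical. `dmy_hm()` assumes UTC unless a `tz` argument is given.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy rows in the same text format as sessions_data; the second row is
# invented to show a zero-length session
sessions_toy <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "23/06/2024 15:08"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:33", "23/06/2024 17:10")
)

max_mins <- 120  # hypothetical cutoff for an unusually long session

sessions_flagged <- sessions_toy |>
  mutate(
    start_dt = dmy_hm(start_time),  # parse day/month/year hour:minute text
    end_dt = dmy_hm(end_time),
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    suspicious = duration_mins <= 0 | duration_mins > max_mins
  )

sessions_flagged |>
  select(start_time, end_time, duration_mins, suspicious)
```

Flagging rows rather than dropping them keeps the decision about how to handle unusual sessions separate from the parsing step.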