## Methods & Results

All analyses were performed in R in a Jupyter notebook. We loaded the `tidyverse`, `lubridate`, and `tidymodels` libraries, set a seed for reproducibility, and read `players.csv` and `sessions.csv` with `read_csv`.

We started by converting the session-level log into per-player engagement measures. We used `dmy_hm` from `lubridate` to parse `start_time` and `end_time` in `sessions` into date-time objects (`start_dt` and `end_dt`). We calculated the length of each session in minutes (`session_minutes`) as the difference between `end_dt` and `start_dt` using `difftime` with `units = "mins"`, converted the result to numeric, and dropped rows where `session_minutes` was `NA`. We then grouped by `hashedEmail` and summarised three player engagement variables: `total_playtime_min` (sum of `session_minutes`), `avg_session_min` (mean of `session_minutes`), and `session_count` (number of sessions).
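The session-processing steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the toy `sessions` tibble and its values are invented stand-ins for `sessions.csv`, and only the column names come from the text.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for sessions.csv (values are invented for illustration).
sessions <- tibble(
  hashedEmail = c("a", "a", "b", "c"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00",
                  "02/07/2024 20:30", "03/07/2024 10:00"),
  end_time    = c("30/06/2024 18:24", "01/07/2024 09:45",
                  "02/07/2024 21:00", NA)
)

session_summary <- sessions %>%
  mutate(
    # Parse day/month/year hour:minute strings into date-times.
    start_dt = dmy_hm(start_time),
    end_dt   = dmy_hm(end_time),
    # Session length in minutes as a plain numeric.
    session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  ) %>%
  # Drop sessions whose length could not be computed (missing end time here).
  filter(!is.na(session_minutes)) %>%
  group_by(hashedEmail) %>%
  summarise(
    total_playtime_min = sum(session_minutes),
    avg_session_min    = mean(session_minutes),
    session_count      = n(),
    .groups = "drop"
  )

print(session_summary)
```

In this toy data, player `a` has two sessions (12 and 45 minutes), player `b` has one 30-minute session, and player `c` drops out because their only session has no end time.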

Next, we joined these per-player summaries back to the player information. Using a left join on `hashedEmail`, we merged the per-player session summary with the original player data to create `players_full`. In this new dataset, we created a new variable, `played_hours = total_playtime_min / 60`. We also recoded `subscribe` from a logical variable to a factor with levels `"No"` and `"Yes"` so that it could be used as a binary classification outcome.
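A minimal sketch of the join and recoding is shown below. It assumes `players` is the left-hand table, so players with no recorded sessions are kept with `NA` engagement values; the text does not state the join direction, and the toy tibbles are invented.

```r
library(dplyr)

# Toy stand-ins for the player table and the per-player session summary.
players <- tibble(
  hashedEmail = c("a", "b", "c"),
  subscribe   = c(TRUE, FALSE, TRUE),
  Age         = c(17, 21, 34)
)
session_summary <- tibble(
  hashedEmail        = c("a", "b"),
  total_playtime_min = c(57, 30),
  avg_session_min    = c(28.5, 30),
  session_count      = c(2L, 1L)
)

players_full <- players %>%
  # Assumption: players is the left table, so player "c" is kept with NAs.
  left_join(session_summary, by = "hashedEmail") %>%
  mutate(
    played_hours = total_playtime_min / 60,
    # Logical -> factor with explicit level order No, Yes.
    subscribe = factor(subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  )

print(players_full)
```

Setting the level order explicitly matters later, because `yardstick` metrics treat the first factor level as the event of interest unless told otherwise.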

To create our modelling dataframe (`players_model`), we kept only rows with non-missing `subscribe` and selected the variables `subscribe`, `Age`, `total_playtime_min`, `avg_session_min`, and `session_count`. To reduce the right skew in playtime, we created a log-transformed variable, `log_playtime = log1p(total_playtime_min)`. Using `initial_split`, we split the data into training (70 percent) and test (30 percent) sets. We then created 5-fold cross-validation splits from the training set using `vfold_cv`.
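The construction of the modelling data and the resampling objects might look like the sketch below. The seed value, the object names `players_split`, `players_train`, `players_test`, and `players_folds`, and the simulated data are all assumptions made for illustration.

```r
library(dplyr)
library(rsample)

set.seed(2024)  # hypothetical seed; the text does not give the value used

# Simulated stand-in for players_full (100 players, invented values).
players_full <- tibble(
  subscribe = factor(rep(c("No", "Yes"), times = c(40, 60)), levels = c("No", "Yes")),
  Age = sample(10:50, 100, replace = TRUE),
  total_playtime_min = rexp(100, rate = 1 / 120),  # right-skewed by construction
  avg_session_min = runif(100, 5, 60),
  session_count = rpois(100, 4) + 1L
)

players_model <- players_full %>%
  filter(!is.na(subscribe)) %>%
  select(subscribe, Age, total_playtime_min, avg_session_min, session_count) %>%
  # log1p(x) = log(1 + x), so zero playtime maps to 0 instead of -Inf.
  mutate(log_playtime = log1p(total_playtime_min))

# 70/30 train/test split, then 5 folds built from the training set only.
players_split <- initial_split(players_model, prop = 0.7)
players_train <- training(players_split)
players_test  <- testing(players_split)
players_folds <- vfold_cv(players_train, v = 5)

print(players_folds)
```

Building the folds from `players_train` alone keeps the test set untouched until the final evaluation.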

We specified two logistic regression models with `tidymodels`. For the baseline model, we defined a recipe with `subscribe` as the outcome and `Age`, `log_playtime`, `avg_session_min`, and `session_count` as predictors. We added `step_zv` to remove any zero-variance predictors and `step_normalize` to standardize all numeric predictors. The interaction model used the same predictors and preprocessing steps, plus `step_interact` with the term `Age:log_playtime` to include an interaction between age and log-transformed playtime. We defined the logistic regression model using `logistic_reg` with `set_engine("glm")`, then combined the model with each recipe using `workflow` to produce a baseline workflow and an interaction workflow.
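A minimal sketch of the two workflows follows. The object names (`base_recipe`, `interaction_recipe`, `log_spec`, `base_wf`, `interaction_wf`) and the simulated training data are assumptions, and placing `step_interact` after `step_normalize` is one reasonable ordering rather than the confirmed one.

```r
library(tidymodels)

set.seed(2024)  # hypothetical seed

# Simulated stand-in for the training set (invented values).
players_train <- tibble(
  subscribe = factor(sample(c("No", "Yes"), 80, replace = TRUE), levels = c("No", "Yes")),
  Age = sample(10:50, 80, replace = TRUE),
  log_playtime = log1p(rexp(80, rate = 1 / 120)),
  avg_session_min = runif(80, 5, 60),
  session_count = rpois(80, 4) + 1L
)

# Baseline recipe: drop zero-variance predictors, then standardize numerics.
base_recipe <- recipe(
  subscribe ~ Age + log_playtime + avg_session_min + session_count,
  data = players_train
) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Interaction recipe: same steps plus an Age x log_playtime product term.
# Assumption: the interaction is formed from the already-normalized columns.
interaction_recipe <- base_recipe %>%
  step_interact(terms = ~ Age:log_playtime)

log_spec <- logistic_reg() %>%
  set_engine("glm")

base_wf <- workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(log_spec)

interaction_wf <- workflow() %>%
  add_recipe(interaction_recipe) %>%
  add_model(log_spec)

# Inspect the processed training columns for the interaction recipe.
interaction_cols <- names(bake(prep(interaction_recipe), new_data = NULL))
print(interaction_cols)
```

With the default separator, `step_interact` names the new column `Age_x_log_playtime`.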

We compared the two models with cross-validation. For each workflow, we ran `fit_resamples` on the folds created by `vfold_cv` and evaluated performance with ROC AUC and accuracy using `metric_set(roc_auc, accuracy)`. We collected the cross-validation results from each workflow using `collect_metrics`, added a column for the model name, and combined them into one dataframe. Finally, we produced two plots comparing the baseline logistic model to the interaction model: mean cross-validated ROC AUC by model (`"Cross-Validation ROC AUC by Model"`) and mean cross-validated accuracy by model (`"Cross-Validation Accuracy by Model"`). Based on these results, we chose the simpler baseline model as our final model.
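The comparison could be run along the lines below. This is a sketch on simulated data, so the metric values it prints say nothing about the real results; object names such as `cv_metrics` and `cv_results` are hypothetical.

```r
library(tidymodels)

set.seed(2024)  # hypothetical seed
n_players <- 200

# Simulated training data in which subscription depends weakly on playtime.
players_train <- tibble(
  Age = sample(10:50, n_players, replace = TRUE),
  log_playtime = log1p(rexp(n_players, rate = 1 / 120)),
  avg_session_min = runif(n_players, 5, 60),
  session_count = rpois(n_players, 4) + 1L
) %>%
  mutate(subscribe = factor(
    ifelse(runif(n_players) < plogis(0.4 * (log_playtime - 4)), "Yes", "No"),
    levels = c("No", "Yes")
  ))

players_folds <- vfold_cv(players_train, v = 5)

base_recipe <- recipe(
  subscribe ~ Age + log_playtime + avg_session_min + session_count,
  data = players_train
) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
interaction_recipe <- base_recipe %>%
  step_interact(terms = ~ Age:log_playtime)

log_spec <- logistic_reg() %>% set_engine("glm")
base_wf <- workflow() %>% add_recipe(base_recipe) %>% add_model(log_spec)
interaction_wf <- workflow() %>% add_recipe(interaction_recipe) %>% add_model(log_spec)

cv_metrics <- metric_set(roc_auc, accuracy)

# Fit each workflow on every fold and score the held-out part.
base_res <- fit_resamples(base_wf, resamples = players_folds, metrics = cv_metrics)
inter_res <- fit_resamples(interaction_wf, resamples = players_folds, metrics = cv_metrics)

# One row per model and metric, with the mean across folds in `mean`.
cv_results <- bind_rows(
  collect_metrics(base_res) %>% mutate(model = "Baseline"),
  collect_metrics(inter_res) %>% mutate(model = "Interaction")
)

auc_plot <- cv_results %>%
  filter(.metric == "roc_auc") %>%
  ggplot(aes(x = model, y = mean)) +
  geom_col() +
  labs(title = "Cross-Validation ROC AUC by Model", x = "Model", y = "Mean ROC AUC")

print(cv_results)
```

The `std_err` column returned by `collect_metrics` is worth reading alongside `mean`, since a small gap between models can sit well within fold-to-fold variation.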

Finally, we fit our final model to the entire training set using `fit`. From this fitted workflow, we predicted subscription probabilities on the test set using `predict` with `type = "prob"` and merged these predictions with the original test set. We computed the ROC curve on the test set using `roc_curve`, with `subscribe` as the truth and `.pred_Yes` as the predicted probability, and plotted it using `autoplot` in a figure titled `"ROC Curve on Test Set"` to see how well the model ranks Yes versus No across classification thresholds. We converted predicted probabilities into classes by labelling players with `.pred_Yes > 0.5` as `"Yes"` and all others as `"No"`, then calculated overall test set accuracy with `accuracy`. This was plotted as a single bar titled `"Test Set Accuracy of Final Model"`, with the y-axis restricted to 0 to 1.
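A sketch of the final fit and test-set evaluation is given below, again on simulated data with hypothetical object names. One detail is an assumption worth flagging: `yardstick` treats the first factor level (`"No"` here) as the event by default, so this sketch passes `event_level = "second"` to score `.pred_Yes` against `"Yes"`. The text does not say whether the original analysis did the same.

```r
library(tidymodels)

set.seed(2024)  # hypothetical seed
n_players <- 300

# Simulated modelling data (invented values).
players_model <- tibble(
  Age = sample(10:50, n_players, replace = TRUE),
  log_playtime = log1p(rexp(n_players, rate = 1 / 120)),
  avg_session_min = runif(n_players, 5, 60),
  session_count = rpois(n_players, 4) + 1L
) %>%
  mutate(subscribe = factor(
    ifelse(runif(n_players) < plogis(0.4 * (log_playtime - 4)), "Yes", "No"),
    levels = c("No", "Yes")
  ))

players_split <- initial_split(players_model, prop = 0.7)
players_train <- training(players_split)
players_test  <- testing(players_split)

base_wf <- workflow() %>%
  add_recipe(
    recipe(subscribe ~ Age + log_playtime + avg_session_min + session_count,
           data = players_train) %>%
      step_zv(all_predictors()) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Fit on the full training set, then predict class probabilities on the test set.
final_fit <- fit(base_wf, data = players_train)
test_preds <- predict(final_fit, new_data = players_test, type = "prob") %>%
  bind_cols(players_test) %>%
  # Threshold at 0.5, keeping the same level order as the truth column.
  mutate(.pred_class = factor(if_else(.pred_Yes > 0.5, "Yes", "No"),
                              levels = c("No", "Yes")))

# Assumption: "Yes" is the event, so the second level is named explicitly.
roc_data <- roc_curve(test_preds, truth = subscribe, .pred_Yes, event_level = "second")
roc_plot <- autoplot(roc_data) + ggtitle("ROC Curve on Test Set")

test_acc <- accuracy(test_preds, truth = subscribe, estimate = .pred_class)
print(test_acc)
```

Without the `event_level` argument, passing `.pred_Yes` with `"No"` as the first level would produce a curve mirrored about the diagonal.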

We also generated four plots to help understand the data and support our modelling decisions. We plotted a histogram of `Age` from `players` with a bin width of 2 years to show the distribution of player ages. We plotted a histogram of `played_hours` from `players` with a bin width of 5 hours to show the distribution of total recorded playtime (in hours) and its strong right skew (many low-playtime players and a few very high-playtime players). We plotted a boxplot of `total_playtime_min` by `experience`, with the y-axis restricted to the 95th percentile using `coord_cartesian`, to compare total playtime across experience levels. Finally, we grouped `players_full` by `subscribe`, calculated mean `total_playtime_min` for each group, and displayed these values in a bar chart (`"Mean total playtime by newsletter subscription"`) to compare average total playtime between subscribers and non-subscribers. Taken together, these plots and the logistic regression results (cross-validated ROC AUC and accuracy, test ROC curve, and test accuracy) suggest that engagement measures are strongly related to whether a player subscribes to the newsletter.
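Three of the exploratory plots described above can be sketched as follows. The data are simulated, the `experience` labels are placeholders rather than the dataset's actual categories, and axis labels not quoted in the text are invented.

```r
library(dplyr)
library(ggplot2)

set.seed(2024)  # hypothetical seed

# Simulated stand-in for players_full (invented values and experience labels).
players_full <- tibble(
  experience = sample(c("Level A", "Level B", "Level C"), 150, replace = TRUE),
  subscribe = factor(sample(c("No", "Yes"), 150, replace = TRUE), levels = c("No", "Yes")),
  Age = sample(10:50, 150, replace = TRUE),
  total_playtime_min = rexp(150, rate = 1 / 120)
)

# Histogram of age with 2-year bins.
age_hist <- ggplot(players_full, aes(x = Age)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Age (years)", y = "Number of players")

# Boxplot zoomed to the 95th percentile. coord_cartesian only changes the view,
# so the quartiles are still computed from all rows, including the outliers.
y_cap <- unname(quantile(players_full$total_playtime_min, 0.95, na.rm = TRUE))
playtime_box <- ggplot(players_full, aes(x = experience, y = total_playtime_min)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, y_cap)) +
  labs(x = "Experience", y = "Total playtime (minutes)")

# Mean total playtime per subscription group, shown as bars.
mean_playtime <- players_full %>%
  group_by(subscribe) %>%
  summarise(mean_playtime_min = mean(total_playtime_min, na.rm = TRUE), .groups = "drop")

mean_bar <- ggplot(mean_playtime, aes(x = subscribe, y = mean_playtime_min)) +
  geom_col() +
  labs(title = "Mean total playtime by newsletter subscription",
       x = "Subscribed", y = "Mean total playtime (minutes)")

print(mean_playtime)
```

Using `coord_cartesian` rather than `ylim()` is the safer choice for the boxplot, because `ylim()` would drop points above the cap before the box statistics are calculated.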