## Methods & Results

All analyses were performed in R in a Jupyter notebook. We loaded the tidyverse, lubridate and tidymodels libraries, set a seed for reproducibility, and read in players.csv and sessions.csv with read_csv.

We started by converting the session-level log into per-player engagement measures. We used dmy_hm from lubridate to parse start_time and end_time in sessions into date-time objects (start_dt and end_dt). We calculated session length in minutes (session_minutes) as the difference between end_dt and start_dt using difftime with units = "mins", converted to numeric, and excluded rows where session_minutes was NA. We then grouped by hashedEmail and summarised three player engagement variables: total_playtime_min (sum of session_minutes), avg_session_min (mean of session_minutes) and session_count (number of sessions).
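The session-to-player aggregation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy rows stand in for sessions.csv, whose real contents are not shown here, and the object names sessions_clean and player_summary are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rows standing in for sessions.csv (day/month/year hour:minute strings).
sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "15/05/2024 22:30", "16/05/2024 10:00"),
  end_time    = c("30/06/2024 18:24", "01/07/2024 09:45", "15/05/2024 23:00", NA)
)

# Parse the timestamps, compute session length in minutes, drop sessions with no length.
sessions_clean <- sessions %>%
  mutate(
    start_dt = dmy_hm(start_time),
    end_dt   = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  ) %>%
  filter(!is.na(session_minutes))

# One row per player with the three engagement measures.
player_summary <- sessions_clean %>%
  group_by(hashedEmail) %>%
  summarise(
    total_playtime_min = sum(session_minutes),
    avg_session_min    = mean(session_minutes),
    session_count      = n(),
    .groups = "drop"
  )
```

In this toy data the session with a missing end_time is dropped before aggregation, so player "b2" ends up with a single counted session.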

Next, we joined these per-player summaries back to the player information. Using a left join on hashedEmail, we merged the per-player summary with the original player data to create players_full. In this new dataset, we created the variable played_hours = total_playtime_min / 60 and recoded subscribe from logical to a factor (levels "No", "Yes") so that it could serve as a binary classification outcome.
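A minimal sketch of the join and recoding step, assuming toy stand-ins for the player table and the per-player summary (the rows and the summary object name player_summary are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for players.csv; only the columns needed here are shown.
players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE),
  Age         = c(17, 21, 30)
)

# Hypothetical per-player summary; player "c3" has no recorded sessions.
player_summary <- tibble(
  hashedEmail        = c("a1", "b2"),
  total_playtime_min = c(57, 30),
  avg_session_min    = c(28.5, 30),
  session_count      = c(2L, 1L)
)

players_full <- players %>%
  left_join(player_summary, by = "hashedEmail") %>%
  mutate(
    played_hours = total_playtime_min / 60,
    # Logical -> factor with "No" as the first level and "Yes" as the second.
    subscribe = factor(if_else(subscribe, "Yes", "No"), levels = c("No", "Yes"))
  )
```

Because this is a left join, a player with no matching sessions (here "c3") is kept with NA in the engagement columns; how such rows are handled downstream depends on the later filtering steps.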

To create our modelling data frame (players_model), we kept only rows with non-missing subscribe and selected the variables subscribe, Age, total_playtime_min, avg_session_min and session_count. To correct for the skew in play time, we created a log-transformed variable, log_playtime = log1p(total_playtime_min). Using initial_split, we split the data into training (70 percent) and test (30 percent) sets, then created 5-fold cross-validation splits from the training set using vfold_cv.
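The transformation and resampling setup might look like the sketch below. The data are simulated purely for illustration, the seed value is arbitrary, and the names data_split, train_data, test_data and cv_folds are hypothetical. The text does not say whether the split was stratified, so no strata argument is used here.

```r
library(dplyr)
library(rsample)

set.seed(123)  # arbitrary seed, for illustration only

# Simulated stand-in for players_model; values are not from the real data.
players_model <- tibble(
  subscribe = factor(rep(c("No", "Yes"), times = c(40, 60)), levels = c("No", "Yes")),
  Age = sample(10:50, 100, replace = TRUE),
  total_playtime_min = rexp(100, rate = 1 / 120),
  avg_session_min = runif(100, 5, 90),
  session_count = rpois(100, 4) + 1
) %>%
  # log1p(x) = log(1 + x), so a playtime of zero maps to zero instead of -Inf.
  mutate(log_playtime = log1p(total_playtime_min))

# 70/30 train/test split, then 5 folds built from the training portion only.
data_split <- initial_split(players_model, prop = 0.7)
train_data <- training(data_split)
test_data  <- testing(data_split)
cv_folds   <- vfold_cv(train_data, v = 5)
```

Building the folds from train_data alone keeps the test set untouched until the final evaluation.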

We specified two logistic regression models with tidymodels. For the baseline model, we specified a recipe with subscribe as the outcome and Age, log_playtime, avg_session_min and session_count as predictors. Preprocessing included step_zv, which removes zero-variance predictors, and step_normalize, which standardizes all numeric predictors. The interaction model used the same predictors and preprocessing steps plus step_interact with the term Age:log_playtime, adding an interaction between age and log-transformed playtime. We defined the model specification using logistic_reg with set_engine("glm") and combined it with each recipe via workflow to create the baseline and interaction workflows.
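The two workflows can be sketched as below, using simulated training data. The object names (base_recipe, interaction_recipe, log_spec, base_wf, interaction_wf) are hypothetical, and placing step_interact after step_normalize is an assumption, since the order of the steps is not stated in the text.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, for illustration only

# Simulated stand-in for the training set.
train_data <- tibble(
  subscribe = factor(sample(c("No", "Yes"), 100, replace = TRUE), levels = c("No", "Yes")),
  Age = sample(10:50, 100, replace = TRUE),
  log_playtime = log1p(rexp(100, rate = 1 / 120)),
  avg_session_min = runif(100, 5, 90),
  session_count = rpois(100, 4) + 1
)

# Baseline recipe: drop zero-variance predictors, then center and scale the numeric ones.
base_recipe <- recipe(
  subscribe ~ Age + log_playtime + avg_session_min + session_count,
  data = train_data
) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Interaction recipe: same steps plus an Age x log_playtime term
# (assumed here to be added after normalization).
interaction_recipe <- base_recipe %>%
  step_interact(terms = ~ Age:log_playtime)

# One model specification shared by both workflows.
log_spec <- logistic_reg() %>%
  set_engine("glm")

base_wf <- workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(log_spec)

interaction_wf <- workflow() %>%
  add_recipe(interaction_recipe) %>%
  add_model(log_spec)
```

Sharing one model specification means the two workflows differ only in their recipes, so any difference in performance can be attributed to the interaction term.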

We compared the two models using cross-validation. For each workflow, we called fit_resamples on the vfold_cv object and assessed performance with ROC AUC and accuracy using metric_set(roc_auc, accuracy). We collected the cross-validation results from each workflow using collect_metrics, added a column for the model name and combined them into one data frame. Finally, we produced two plots comparing the baseline logistic model with the interaction model: mean ROC AUC by model ("Cross-Validation ROC AUC by Model") and mean cross-validated accuracy by model ("Cross-Validation Accuracy by Model"). Based on these results, we chose the simpler baseline as our final model.
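A compact sketch of the cross-validated comparison, again on simulated data. All object names, the seed and the simulated relationship between playtime and subscription are assumptions made for illustration; the resulting numbers say nothing about the real dataset.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, for illustration only
n <- 200

# Simulated training data in which subscription depends loosely on log_playtime.
train_data <- tibble(
  Age = sample(10:50, n, replace = TRUE),
  log_playtime = log1p(rexp(n, rate = 1 / 120)),
  avg_session_min = runif(n, 5, 90),
  session_count = rpois(n, 4) + 1
) %>%
  mutate(subscribe = factor(
    if_else(runif(n) < plogis(-2 + 0.6 * log_playtime), "Yes", "No"),
    levels = c("No", "Yes")
  ))

base_recipe <- recipe(
  subscribe ~ Age + log_playtime + avg_session_min + session_count,
  data = train_data
) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

interaction_recipe <- base_recipe %>%
  step_interact(terms = ~ Age:log_playtime)

log_spec <- logistic_reg() %>% set_engine("glm")
base_wf <- workflow() %>% add_recipe(base_recipe) %>% add_model(log_spec)
interaction_wf <- workflow() %>% add_recipe(interaction_recipe) %>% add_model(log_spec)

# Same folds and same metrics for both workflows, so the comparison is like for like.
cv_folds   <- vfold_cv(train_data, v = 5)
cv_metrics <- metric_set(roc_auc, accuracy)

base_res <- fit_resamples(base_wf, resamples = cv_folds, metrics = cv_metrics)
interaction_res <- fit_resamples(interaction_wf, resamples = cv_folds, metrics = cv_metrics)

# One row per model and metric, with the mean and standard error across folds.
cv_results <- bind_rows(
  collect_metrics(base_res) %>% mutate(model = "baseline"),
  collect_metrics(interaction_res) %>% mutate(model = "interaction")
)
```

The std_err column returned by collect_metrics is worth inspecting alongside the means, since a small gap between the two models can sit well within fold-to-fold variation.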

Finally, we fit our final model to the entire training set using fit. From this fitted workflow, we predicted subscription probabilities on the test set using predict with type = "prob" and merged these predictions with the original test set. We computed the test-set ROC curve using roc_curve with subscribe as truth and .pred_Yes as the predicted probability, and plotted it with autoplot ("ROC Curve on Test Set") to see how well the model ranks Yes vs No across thresholds. Predicted probabilities were converted to classes by labelling players with .pred_Yes > 0.5 as "Yes" and all others as "No", and overall test-set accuracy was calculated with accuracy. This was plotted as a single bar titled "Test Set Accuracy of Final Model", with the y-axis restricted to 0-1.
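The test-set evaluation can be sketched as follows on simulated data; all object names and the seed are hypothetical. One detail is an assumption rather than something stated in the text: yardstick treats the first factor level as the event by default, so with levels "No", "Yes" and .pred_Yes as the estimate, the sketch passes event_level = "second" to keep the ROC curve oriented toward "Yes".

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, for illustration only
n <- 300

# Simulated stand-in for players_model.
players_model <- tibble(
  Age = sample(10:50, n, replace = TRUE),
  log_playtime = log1p(rexp(n, rate = 1 / 120)),
  avg_session_min = runif(n, 5, 90),
  session_count = rpois(n, 4) + 1
) %>%
  mutate(subscribe = factor(
    if_else(runif(n) < plogis(-2 + 0.6 * log_playtime), "Yes", "No"),
    levels = c("No", "Yes")
  ))

data_split <- initial_split(players_model, prop = 0.7)
train_data <- training(data_split)
test_data  <- testing(data_split)

base_wf <- workflow() %>%
  add_recipe(
    recipe(subscribe ~ Age + log_playtime + avg_session_min + session_count,
           data = train_data) %>%
      step_zv(all_predictors()) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Fit on the full training set, then predict class probabilities for the test set.
final_fit <- fit(base_wf, data = train_data)
test_preds <- predict(final_fit, new_data = test_data, type = "prob") %>%
  bind_cols(test_data)

# ROC curve with "Yes" (the second level) as the event of interest.
roc_df <- roc_curve(test_preds, truth = subscribe, .pred_Yes, event_level = "second")
roc_plot <- autoplot(roc_df) + ggtitle("ROC Curve on Test Set")

# Threshold at 0.5 to get hard class labels, then compute accuracy.
test_preds <- test_preds %>%
  mutate(.pred_class = factor(if_else(.pred_Yes > 0.5, "Yes", "No"),
                              levels = c("No", "Yes")))
test_acc <- accuracy(test_preds, truth = subscribe, estimate = .pred_class)
```

If event_level were left at its default here, the curve would be computed as if .pred_Yes were the probability of "No" and would fall below the diagonal for a model that ranks well.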

We also generated four exploratory plots to clarify our understanding of the data and support our modelling decisions. We plotted a histogram of Age from players with a bin width of 2 years (Figure 1), which shows the distribution of player ages. We plotted a histogram of played_hours from players with a bin width of 5 hours (Figure 2), which shows the distribution of total recorded playtime in hours and its strong right skew (many low-playtime players and a few very high-playtime players). We plotted a boxplot of total_playtime_min by experience (Figure 3), with the y-axis restricted to the 95th percentile via coord_cartesian, to compare total play time across experience levels. Finally, we grouped players_full by subscribe, calculated mean total_playtime_min for each group and displayed these values in a bar chart titled "Mean total playtime by newsletter subscription" (Figure 4), which compares average total play time between subscribers and non-subscribers. Together, these exploratory plots and the logistic regression evaluation (cross-validated ROC AUC and accuracy, test-set ROC curve, test-set accuracy) support the conclusion that engagement measures, especially total play time, are strongly associated with subscription status.
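The four exploratory plots could be produced along the lines of the sketch below. A single simulated tibble is used for all four plots for brevity, the experience labels are placeholders rather than the real category names, and the plot object names are hypothetical.

```r
library(dplyr)
library(ggplot2)

set.seed(123)  # arbitrary seed, for illustration only

# Simulated stand-in for the player data; experience labels are placeholders.
players_full <- tibble(
  Age = sample(10:50, 100, replace = TRUE),
  experience = sample(c("Level A", "Level B", "Level C"), 100, replace = TRUE),
  subscribe = factor(sample(c("No", "Yes"), 100, replace = TRUE), levels = c("No", "Yes")),
  total_playtime_min = rexp(100, rate = 1 / 120)
) %>%
  mutate(played_hours = total_playtime_min / 60)

# Figure 1: age distribution, 2-year bins.
p_age <- ggplot(players_full, aes(x = Age)) +
  geom_histogram(binwidth = 2)

# Figure 2: playtime in hours, 5-hour bins.
p_hours <- ggplot(players_full, aes(x = played_hours)) +
  geom_histogram(binwidth = 5)

# Figure 3: boxplot zoomed to the 95th percentile. coord_cartesian only changes
# the visible range; the box statistics are still computed from all rows.
y_cap <- as.numeric(quantile(players_full$total_playtime_min, 0.95, na.rm = TRUE))
p_box <- ggplot(players_full, aes(x = experience, y = total_playtime_min)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, y_cap))

# Figure 4: mean total playtime by subscription status.
mean_by_sub <- players_full %>%
  group_by(subscribe) %>%
  summarise(mean_playtime = mean(total_playtime_min, na.rm = TRUE), .groups = "drop")

p_bar <- ggplot(mean_by_sub, aes(x = subscribe, y = mean_playtime)) +
  geom_col() +
  labs(title = "Mean total playtime by newsletter subscription")
```

Using coord_cartesian rather than setting scale limits matters for Figure 3: scale limits would discard the high-playtime rows before the quartiles are computed, which would change the boxes themselves.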