PLAYTIME - an analysis of who the target player is when marketing a game
-

Introduction:
-

&nbsp;&nbsp;&nbsp;&nbsp; In this report, the prompt being answered is **"We would like to know which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts."** 

&nbsp;&nbsp;&nbsp;&nbsp; To do this, we must analyse user behavior, the study of how people interact with a service, product, or interface. In this report, the aim is to determine how individuals interact with a particular game and how this data can inform a strategy for targeting these players for recruitment, whether for a different game, a research study, or another campaign. The dataset used is players.csv. We narrow the prompt to the question: **"Can we predict the hours a player played using age, gender, and experience?"**

&nbsp;&nbsp;&nbsp;&nbsp; Firstly, players.csv is a flat file containing 197 observations (rows) and 9 variables, with one record per player. The data describes each player's personal information, including name, gender, age, played_hours, experience, subscription status, and an identifier referred to as a hashed email. We will only use the variables 'played_hours', 'age', 'gender', and 'experience' to answer our question; they are described as follows:

| Variable Name | Data type | Variable type | Description |
| --- | --- | --- | --- |
| experience | object | ordinal | player's experience with Minecraft with 5 categories: 'Pro', 'Veteran', 'Regular', 'Amateur', 'Beginner' |
| played_hours | float | continuous | player's total playtime in hours |
| gender | object | nominal | player's gender with 7 categories: 'Female', 'Male', 'Non-binary', 'Two-Spirited', 'Prefer not to say', 'Agender', 'Other'|
| age | integer | discrete | player's age |

&nbsp;&nbsp;&nbsp;&nbsp; Furthermore, the analysis of players.csv must account for several potential data issues:
* The columns 'individualId' and 'organizationName' contain missing data, which makes them unusable for segmentation, so they are dropped.
* The 'gender' categorical variable has seven possible categories. There are very few observations for genders other than 'Male', 'Female', and 'Non-binary', so our model will not have enough data to accurately predict 'played_hours' for these genders. We therefore restrict the analysis to the gender categories 'Male', 'Female', and 'Non-binary'.
* The data collection methodology is unknown, and self-reported fields may be inaccurate. In particular, 'experience' is self-reported, so players may be biased or inconsistent in how they interpret each category.
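&nbsp;&nbsp;&nbsp;&nbsp; The first two issues above can be checked directly before modelling. The following is a minimal sketch, assuming a small hypothetical data frame `toy` in place of players.csv and a hypothetical cutoff `min_count`; the report itself selects the three kept gender categories by name rather than by a threshold.

```python
import pandas as pd

# Hypothetical toy rows standing in for players.csv (not the real data).
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Non-binary", "Agender", "Male"],
    "played_hours": [30.3, 3.8, 0.7, 0.1, 0.0, 2.3],
    "individualId": [None] * 6,
    "organizationName": [None] * 6,
})

# Columns with no recorded values at all are candidates for dropping.
empty_columns = [col for col in toy.columns if toy[col].isna().all()]

# Count observations per gender to see which categories are too sparse.
gender_counts = toy["gender"].value_counts()

# Hypothetical cutoff: keep categories with at least min_count observations.
min_count = 2
kept_genders = gender_counts[gender_counts >= min_count].index.tolist()

print(empty_columns)
print(gender_counts)
print(kept_genders)
```

On the real data, `value_counts()` on the 'gender' column would show how few observations the sparse categories have.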

Methods and Results
-


In [10]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config
from sklearn.neighbors import KNeighborsRegressor


np.random.seed(10)

url1 = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
url2 = "https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"

players = pd.read_csv(url1)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [11]:
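new_players = players.drop(columns=["name", "hashedEmail", "individualId", "organizationName", "subscribe"])

# Keep only the gender categories with enough observations to model.
to_drop = ["Agender", "Other", "Prefer not to say", "Two-Spirited"]
new_players = new_players[~new_players["gender"].isin(to_drop)]

# Encode experience on an ordinal 1-5 scale (kept as strings so the charts treat
# the codes as categories) and gender as integer codes.
new_players = new_players.replace({
    "Beginner": "1", "Amateur": "2", "Regular": "3", "Veteran": "4", "Pro": "5",
    "Male": 1, "Female": 2, "Non-binary": 3,
})

# 80/20 train/test split; the shuffle uses the global NumPy seed set earlier.
players_train, players_test = train_test_split(
    new_players, train_size=0.8
)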

perform the summary
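One way to produce this summary is with pandas aggregations. The sketch below is illustrative only: `toy_train` is a hypothetical stand-in for `players_train`, built from six rows of the preview above using the notebook's encodings, so the numbers are not results for the full dataset.

```python
import pandas as pd

# Six rows taken from the data preview, with the notebook's encodings
# (experience '1'-'5', gender 1 = Male, 2 = Female). Not the full data.
toy_train = pd.DataFrame({
    "experience": ["5", "4", "4", "2", "3", "2"],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 2.3],
    "gender": [1, 1, 1, 2, 1, 1],
    "age": [9, 17, 17, 21, 21, 17],
})

# Overall summary of the numeric columns.
overall = toy_train[["played_hours", "age"]].agg(["mean", "median", "min", "max"])

# Mean played hours and group size per experience level.
by_experience = (
    toy_train.groupby("experience")["played_hours"]
    .agg(mean_hours="mean", n="count")
    .reset_index()
)

print(overall)
print(by_experience)
```

Replacing `toy_train` with `players_train` would give the same tables for the training split.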


Visualization of the dataset:

In [12]:
age_chart = alt.Chart(players_train).mark_bar().encode(
    x = alt.X('age', bin=True, title="Age"),
    y = alt.Y('played_hours', title="Total amount of hours played"),
    color = alt.Color(
        "gender:N",
        legend=alt.Legend(
            title="Gender",
            labelExpr="{'1':'Male','2':'Female','3':'Non-Binary'}[datum.label]"
        )
    )
).properties(
    title="Figure 1"
)


age_chart_exp = alt.Chart(players_train).mark_bar().encode(
    x = alt.X('age', bin=True).title("Age"),
    y = alt.Y('played_hours').title("Total amount of hours played"),
    color = alt.Color(
        "experience:N",
        legend=alt.Legend(
            title="Experience",
            labelExpr="{'1':'Beginner','2':'Amateur','3':'Regular','4':'Veteran','5':'Pro'}[datum.label]"
        )
    )
).properties(
    title="Figure 2"
)
(age_chart | age_chart_exp).resolve_scale(
    color='independent'
)

Data analysis:

In [13]:
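# Standardize all three predictors so none dominates the KNN distance.
# Note: gender is nominal, so its integer codes impose an arbitrary ordering.
preprocessor = make_column_transformer((StandardScaler(), ["age", "gender", "experience"]))
pipeline = make_pipeline(preprocessor, KNeighborsRegressor())

X_train = players_train[["age", "gender", "experience"]]  # Three-column data frame of predictors
y_train = players_train["played_hours"]  # Series holding the response

X_test = players_test[["age", "gender", "experience"]]  # Three-column data frame of predictors
y_test = players_test["played_hours"]  # Series holding the response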

param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 21, 1),
}
gridsearch = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

gridsearch.fit(X_train, y_train)

results = pd.DataFrame(gridsearch.cv_results_)
results["sem_test_score"] = results["std_test_score"] / 5**(1/2)
results = (
    results[[
        "param_kneighborsregressor__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsregressor__n_neighbors": "n_neighbors"})
)


results["mean_test_score"] = -results["mean_test_score"]

results


gridsearch.best_params_



{'kneighborsregressor__n_neighbors': 11}

Visualization of the analysis:

In [16]:
analysis_vis = alt.Chart(results).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Cross-validation RMSPE estimate")
).properties(
    title="Figure 3: RMSPE Estimate"
)

analysis_vis
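The grid search reports cross-validation error only; the held-out `players_test` split could be scored with the same metric. The following is a minimal sketch of the RMSPE calculation. The helper `rmspe` and the `observed` and `predicted` values are hypothetical; in the notebook they would correspond to `y_test` and `gridsearch.predict(X_test)`.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared prediction error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical observed and predicted hours (not results from this analysis).
observed = [0.0, 2.0, 4.0, 10.0]
predicted = [1.0, 2.0, 2.0, 6.0]

test_rmspe = rmspe(observed, predicted)
print(round(test_rmspe, 3))
```

Because RMSPE is in the same units as `played_hours`, it can be compared directly with the cross-validation estimates in Figure 3.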

Discussion
-

summarize what you found

discuss whether this is what you expected to find?

discuss what impact could such findings have?
> By using the aforementioned variables, we can identify the players most likely to contribute data, based on age, experience, gender, and time played. Plotting these variables shows which part of the player population to target in future campaigns.

discuss what future questions could this lead to?