# 🎮Predicting Player Engagement: Unlocking Gaming Behavior Insights

#### Introduction

The online gaming industry has seen significant growth over the years, making it crucial for game developers to understand player behavior and engagement. This report outlines the process and findings of a data science project aimed at predicting player engagement levels based on various demographic and in-game metrics.

#### Dataset Overview

The dataset used in this analysis contains 40,034 records and 13 features capturing comprehensive metrics and demographics related to player behavior. Key features include:
- **PlayerID:** Unique identifier for each player.
- **Age:** Age of the player.
- **Gender:** Gender of the player.
- **Location:** Geographic location of the player.
- **GameGenre:** Genre of the game the player is engaged in.
- **PlayTimeHours:** Average hours spent playing per session.
- **InGamePurchases:** Indicator of in-game purchases (0 = No, 1 = Yes).
- **GameDifficulty:** Difficulty level of the game.
- **SessionsPerWeek:** Number of gaming sessions per week.
- **AvgSessionDurationMinutes:** Average duration of each gaming session in minutes.
- **PlayerLevel:** Current level of the player in the game.
- **AchievementsUnlocked:** Number of achievements unlocked by the player.
- **EngagementLevel:** Target variable indicating the level of player engagement categorized as 'High', 'Medium', or 'Low'.

#### Data Preprocessing

Data preprocessing included handling missing values, encoding categorical variables, and normalizing numerical features. The categorical variables such as Gender, Location, GameGenre, and GameDifficulty were label encoded. Numerical features were scaled using standardization.
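The encoding and scaling steps described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `LabelEncoder` and `StandardScaler` were the tools used (the report does not name them); the `sample` frame and its values are made up and only reuse the dataset's column names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows that reuse the dataset's column names (values are illustrative only)
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Location": ["USA", "Europe", "Asia", "Other"],
    "GameGenre": ["Action", "Sports", "RPG", "Strategy"],
    "GameDifficulty": ["Easy", "Hard", "Medium", "Easy"],
    "Age": [21, 35, 28, 42],
    "PlayTimeHours": [5.5, 12.0, 3.2, 8.1],
})

categorical_cols = ["Gender", "Location", "GameGenre", "GameDifficulty"]
numerical_cols = ["Age", "PlayTimeHours"]

encoded = sample.copy()

# Label encode each categorical column, keeping the encoder so codes can be mapped back
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(encoded[col])

# Standardize numerical columns to zero mean and unit variance
scaler = StandardScaler()
encoded[numerical_cols] = scaler.fit_transform(encoded[numerical_cols])

print(encoded)
```

In a train/test workflow, fitting the scaler on the training split only and reusing it on the test split avoids leaking test statistics into training.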

#### Feature Engineering

A new feature, `TotalPlayTimePerWeek`, was created by multiplying `PlayTimeHours` by `SessionsPerWeek` to enhance the model's predictive power. This feature represents the total playtime per week for each player.
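As a small sketch of this step, the feature is a row-wise product of the two columns. The `players` frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows; only the two source columns matter here
players = pd.DataFrame({
    "PlayTimeHours": [2.0, 5.5, 1.0],
    "SessionsPerWeek": [3, 10, 0],
})

# Total weekly playtime = hours per session * sessions per week
players["TotalPlayTimePerWeek"] = players["PlayTimeHours"] * players["SessionsPerWeek"]

print(players)
```

If the numerical columns are standardized, it is probably safer to compute this product on the raw values first, since a product of standardized values no longer represents hours per week.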

#### Model Building and Evaluation

A Random Forest Classifier was used to predict the `EngagementLevel` of players. The dataset was split into training (70%) and testing (30%) sets. The model was trained on the training set and evaluated on the testing set.
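The split and training steps could look roughly like the sketch below. It uses synthetic data, and `random_state` and `n_estimators` are assumptions, since the report does not state the hyperparameters that were used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (illustrative only)
rng = np.random.default_rng(42)
n = 600
sessions = rng.integers(0, 20, n)
duration = rng.integers(10, 180, n)
X = np.column_stack([sessions, duration])
weekly_minutes = sessions * duration
y = np.where(weekly_minutes > 1200, "High",
             np.where(weekly_minutes > 400, "Medium", "Low"))

# 70% training, 30% testing, as described in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hyperparameters here are assumptions, not the report's settings
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```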

##### Performance Metrics

The model's performance was evaluated using a confusion matrix and a classification report, which includes accuracy, precision, recall, and F1-score.

- **Accuracy:** 82%
- **Precision:** 81% (weighted average)
- **Recall:** 82% (weighted average)
- **F1 Score:** 81% (weighted average)

These metrics indicate that the model performs well in predicting player engagement levels.
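For reference, weighted-average metrics like these can be computed with scikit-learn as sketched below. The labels are made up and do not reproduce the reported numbers. One detail worth knowing: weighted-average recall always equals accuracy, which is consistent with both being reported as 82%.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Made-up true and predicted labels (illustrative only)
y_true = ["High", "High", "Low", "Low", "Medium", "Medium", "Medium", "Medium"]
y_pred = ["High", "Medium", "Low", "Low", "Medium", "Medium", "Medium", "High"]
labels = ["High", "Low", "Medium"]

# Rows are true classes, columns are predicted classes, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)

accuracy = accuracy_score(y_true, y_pred)

# "weighted" averages the per-class scores, weighting each class by its support
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```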

#### Feature Importance

The feature importance analysis revealed that `TotalPlayTimePerWeek`, `AvgSessionDurationMinutes`, and `PlayerLevel` were among the most influential features in predicting player engagement. Understanding these key drivers can help game developers and marketers tailor their strategies to improve player retention and satisfaction.
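A ranking like this is typically read from the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which the target depends only on `TotalPlayTimePerWeek`; the `Noise` column is a hypothetical distractor added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target is driven by TotalPlayTimePerWeek alone (illustrative only)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "TotalPlayTimePerWeek": rng.uniform(0, 100, n),
    "PlayerLevel": rng.integers(1, 100, n),
    "Noise": rng.normal(size=n),  # hypothetical distractor column
})
y = np.where(X["TotalPlayTimePerWeek"] > 50, "High", "Low")

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1; higher means the feature was used for more useful splits
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so permutation importance may be worth checking as a complement.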

#### Conclusion

This data science project successfully predicted player engagement levels using a comprehensive dataset of player demographics and in-game metrics. The Random Forest Classifier provided strong predictive performance, and the feature importance analysis highlighted critical factors influencing player engagement. Future work could involve hyperparameter tuning, exploring other machine learning models, and further feature engineering to enhance predictive accuracy.

#### Recommendations

1. **Game Design Optimization:** Focus on optimizing playtime and session duration to enhance player engagement.
2. **Targeted Marketing:** Use demographic data to tailor marketing campaigns aimed at players with high engagement potential.
3. **Personalized Player Experience:** Customize game difficulty and in-game purchases based on player profiles to increase retention.

By leveraging these insights, game developers can enhance the overall player experience, leading to higher engagement and retention rates.

In [None]:
import ipywidgets as widgets
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of Features

# Define a function to plot the distribution of a feature
def plot_feature(feature):
    plt.figure(figsize=(10,6))
    df[feature].hist(bins=30)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()

# Create a dropdown widget with the dataframe's column names
dropdown = widgets.Dropdown(options=df.columns, description='Feature:')

# Use the interact function to create the widget and the plot
widgets.interact(plot_feature, feature=dropdown);

In [None]:
# Group data by 'GameGenre' and calculate the average playtime hours
average_playtime = df.groupby('GameGenre')['PlayTimeHours'].mean()

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x=average_playtime.index, y=average_playtime.values, palette='coolwarm')
plt.xlabel('Game Genre')
plt.ylabel('Average PlayTimeHours')
plt.title('Average PlayTimeHours per Game Genre')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Create a countplot for InGamePurchases based on Location
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Location', hue='InGamePurchases', palette='coolwarm')
plt.xlabel('Location')
plt.ylabel('Count')
plt.title('In-Game Purchases by Location')
plt.legend(title='In-Game Purchases', labels=['No Purchase', 'Purchase'])
plt.xticks(rotation=45)
plt.show()

In [None]:

# Engagement Level by Gender
# Create a countplot
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Gender', hue='EngagementLevel', palette='coolwarm')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Engagement Level by Gender')
plt.legend(title='Engagement Level')
plt.show()

In [None]:
# Game Genre distribution
plt.figure(figsize=(12, 6))
sns.countplot(x='GameGenre', data=df, order=df['GameGenre'].value_counts().index)
plt.title('Game Genre Distribution')
plt.xlabel('Game Genre')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:

Future Analysis Ideas

- **Clustering Analysis:** Group players based on their gaming behavior and demographics.
- **Time Series Analysis:** If we had time-based data, we could analyze trends in gaming behavior over time.
- **Advanced Predictive Modeling:** Use more sophisticated models like Random Forest, Gradient Boosting, or Neural Networks to improve prediction accuracy.
- **Feature Engineering:** Create new features that might capture more complex relationships in the data.
- **Sentiment Analysis:** If we had text data (e.g., player reviews), we could perform sentiment analysis to understand player satisfaction.



In [None]:
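# Assumes the columns were renamed to snake_case and the target was encoded numerically in an earlier cell
corr = df.corr(numeric_only=True)

# Keep only each feature's correlation with the target
target_corr = corr['engagement_level'].drop('engagement_level')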

sns.set(font_scale=1.2)
sns.set_style("white")
sns.set_palette("PuBuGn_d")
sns.heatmap(target_corr.to_frame(), cmap="BrBG", annot=True, fmt='.2f')
plt.title('Correlation with Engagement Column')
plt.show()

In [None]:
#Boxplot to see outliers

df1 = df.copy()

fig, axs = plt.subplots(len(num_col) // 2 + len(num_col) % 2, 2, figsize=(12, 6))
axs = axs.flatten()

for i, col in enumerate(num_col):
    axs[i].boxplot(df1[col])
    axs[i].set_title(col, fontsize=10)
    axs[i].set_ylabel('Value')

for j in range(i+1, len(axs)):
    fig.delaxes(axs[j])

plt.suptitle('Boxplot of Numerical Variables', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()


Predict Online Gaming Behavior

The world of video games is always evolving, with millions of players diving into different genres and experiencing varying levels of satisfaction and enjoyment. Understanding what drives player engagement is key for game developers who want to improve user experience and retention. This project dives into a rich dataset of video game players to discover patterns and insights that can inform game development and marketing strategies.

Gender
The pie chart shows that more players are male than female: 59.8% of players are male and 40.2% are female. Based on this, I can say that male players are the larger target audience, but that doesn't mean we should ignore the 40.2% of players who are female.

Game Genre
This shows the share of players for each genre. The two most popular genres, Sports and Action, are neck and neck: both sit at about 20.1%, with Sports ahead by 9 players. I also see that all five genres are very close to each other in number of players.


In [None]:
#Grouping the Gender and Genre together
genre_gender_count = data.groupby(['GameGenre', 'Gender']).size().unstack()

gender_list = data['Gender'].unique().tolist()
sorted_gender_list = sorted(gender_list)
legend_labels = {gender: gender for gender in sorted_gender_list}

print(genre_gender_count)
#Plotting the data into a bar Graph
ax = genre_gender_count.plot(kind='bar', figsize=(14,8), width=0.6)
plt.title('Game Genre Preference by Gender')
plt.xlabel('Game Genre')
plt.ylabel('Number of Players')
plt.xticks(rotation=45)
plt.legend(legend_labels.values()) 
plt.tight_layout()  
plt.show()

Gender      Female  Male
GameGenre               
Action        3149  4890
RPG           3235  4717
Simulation    3218  4765
Sports        3243  4805
Strategy      3230  4782



In this data, I discovered that the most popular genre among male players is Action, while the most popular genre among female players is Sports. Within each gender, the genres are not far from each other: although there are more male players than female players, genre popularity is nearly even for both. Looking at the bar graph, the two genders differ in their top genre, Action for male players and Sports for female players. According to the earlier pie chart, Sports has the most players overall. The reason is that although Action has the most male players, it has the fewest female players.
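These comparisons can be checked directly from the printed table. The sketch below re-enters the counts shown above by hand and derives the totals per genre, the top genre per gender, and the overall gender share.

```python
import pandas as pd

# Counts copied from the printed genre_gender_count table above
genre_gender_count = pd.DataFrame(
    {
        "Female": [3149, 3235, 3218, 3243, 3230],
        "Male": [4890, 4717, 4765, 4805, 4782],
    },
    index=["Action", "RPG", "Simulation", "Sports", "Strategy"],
)

# Total players per genre, largest first
genre_totals = genre_gender_count.sum(axis=1).sort_values(ascending=False)

# Most popular genre within each gender
top_by_gender = genre_gender_count.idxmax()

# Share of all players by gender, in percent
gender_share = genre_gender_count.sum() / genre_gender_count.to_numpy().sum() * 100

print(genre_totals)   # Sports 8048, Action 8039, Strategy 8012, Simulation 7983, RPG 7952
print(top_by_gender)  # Female -> Sports, Male -> Action
print(gender_share.round(1))
```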

Location, Game Genre
The chart above showed me that the USA has the most players in every genre, followed by Europe and then Asia, while the Other group has the fewest players in every genre. As in the previous charts, the five genres stay close to each other: within any one region, the differences between genres are small. This shows that whatever the region or gender, genre popularity is nearly even, and the counts differ mainly because of how many players live in each region or belong to each gender. The most played genre is Sports in the USA and in Europe, Action in Asia, and Simulation in the Other group.

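# genre_country_count is not defined in this excerpt; it is assumed to be built like genre_gender_count above
genre_country_count = data.groupby(['GameGenre', 'Location']).size().unstack()
total_players_count = data.shape[0]

# Express each genre/location count as a percentage of all players
genre_country_percentage = genre_country_count / total_players_count * 100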
print(genre_country_percentage)


Location        Asia    Europe     Other       USA
GameGenre                                         
Action      4.136484  6.064845  1.978318  7.900784
RPG         4.046560  5.924964  1.965829  7.925763
Simulation  4.054054  6.029875  1.985812  7.870810
Sports      3.951641  6.072339  1.935855  8.143078
Strategy    4.031573  5.892491  1.963331  8.125593

After calculating each genre and region's share of all players, I found that Sports is the most popular genre in both the USA (8.14% of all players) and Europe (6.07%). Asia's most popular genre is Action at 4.14%. Lastly, in the Other group, which has the fewest players, Simulation is the most popular at 1.99%.
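The percentages in the table are shares of all players, so they are not directly comparable across regions of different size. As a rough sketch, dividing each column by its own total gives each genre's share within a region. The values below are copied by hand from the printed table.

```python
import pandas as pd

# Percentages of all players, copied from the printed table above
genre_country_percentage = pd.DataFrame(
    {
        "Asia": [4.136484, 4.046560, 4.054054, 3.951641, 4.031573],
        "Europe": [6.064845, 5.924964, 6.029875, 6.072339, 5.892491],
        "Other": [1.978318, 1.965829, 1.985812, 1.935855, 1.963331],
        "USA": [7.900784, 7.925763, 7.870810, 8.143078, 8.125593],
    },
    index=["Action", "RPG", "Simulation", "Sports", "Strategy"],
)

# Most popular genre in each region
top_genre_per_region = genre_country_percentage.idxmax()

# Share of each genre within its own region (each column sums to 100)
within_region_share = genre_country_percentage / genre_country_percentage.sum() * 100

print(top_genre_per_region)
print(within_region_share.round(2))
```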


In [None]:

7. Conclusion

By investigating the dataset of player characteristics and engagement metrics through exploration, analysis, calculation, and predictive modeling with logistic regression, I discovered several significant insights that shed light on player behavior and the factors influencing engagement levels in the gaming industry.
Exploration Findings:

    Gender and Regional Analysis:

            Gender shows no significant relationship with EngagementLevel, even though male players outnumber female players.
            According to the bar chart, the USA is the dominant region for video game players across all genres (Action, Simulation, Strategy, RPG, and Sports), followed by Europe, then Asia, and other regions. This regional distribution underscores the global appeal and diverse player demographics in gaming.

    Effect of Sessions on EngagementLevel: While doing the correlation analysis, I found a strong relationship between session duration and engagement level.

            High Engagement Players: High engagement correlated positively with session duration. The longer the sessions, the more engaged the player is with the game.
            Low Engagement Players: Low engagement correlated negatively with session duration. The longer a player's sessions, the less likely the player is to remain at the low engagement level.
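One way to obtain per-level correlations like these is to one-hot encode the engagement level and correlate each indicator with session duration. The sketch below is my own illustration on synthetic data where engagement is derived directly from duration, so it only demonstrates the technique and the expected signs, not the dataset's actual values.

```python
import numpy as np
import pandas as pd

# Synthetic data: engagement is derived from session duration (illustrative only)
rng = np.random.default_rng(0)
duration = rng.integers(10, 180, size=300)
level = np.where(duration > 120, "High", np.where(duration > 60, "Medium", "Low"))

# One indicator column per engagement level (1 if the player is at that level)
indicators = pd.get_dummies(pd.Series(level)).astype(int)

# Correlation of each indicator with session duration
corr_with_duration = indicators.corrwith(pd.Series(duration))
print(corr_with_duration)
```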


Predictive Model Insights:

Logistic Regression Model Performance

        The logistic regression model trained on the dataset showed robust predictive capability, with an overall accuracy of approximately 82%.


Classification Report Analysis

        The classification report provided detailed metrics for each engagement level category (High, Medium, Low), highlighting precision, recall, and F1-scores:

        Class 0 (High Engagement): Achieved high precision and recall, indicating accurate identification of highly engaged players.

        Class 1 (Low Engagement): Showed moderate performance with lower precision and recall compared to high engagement, suggesting challenges in predicting low engagement levels.

        Class 2 (Medium Engagement): Demonstrated balanced performance with good precision and recall, indicating reliable predictions for medium engagement levels.
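The class numbering above (0 = High, 1 = Low, 2 = Medium) matches what scikit-learn's `LabelEncoder` would produce, since it sorts class names alphabetically. Assuming that is how the target was encoded, the sketch below shows the mapping and how `target_names` makes the classification report easier to read. The labels and predictions are made up.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Made-up target values (illustrative only)
encoder = LabelEncoder()
y_true = encoder.fit_transform(["High", "Low", "Medium", "Medium", "High", "Low"])

# LabelEncoder sorts classes alphabetically: High -> 0, Low -> 1, Medium -> 2
mapping = {code: str(name) for code, name in enumerate(encoder.classes_)}
print(mapping)

# Made-up predictions using the same integer codes
y_pred = [0, 2, 2, 2, 0, 1]

# target_names replaces the integer codes with readable labels in the report
report = classification_report(
    y_true, y_pred, target_names=list(encoder.classes_), zero_division=0
)
print(report)
```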

