# Mobile Game Analysis

### An analysis of user progress data

The data is available on Kaggle at [this link](https://www.kaggle.com/datasets/manchvictor/prediction-of-user-loss-in-mobile-games).

level_seq.csv

Each record is an attempt to play a level. The meaning of each column is as follows:
- 'user_id': user ID, which can be matched with the IDs in the training, validation, and test sets;
- 'level_id': level ID;
- 'f_success': whether the attempt cleared the level (1: cleared, 0: failed);
- 'f_duration': the duration of the attempt (unit: s);
- 'f_reststep': the ratio of remaining steps to the step limit (0 for failed attempts);
- 'f_help': whether extra help, such as props and hints, was used (1: used, 0: not used);
- 'time': the timestamp of the attempt.

level_meta.csv

Each level is described by a set of summary statistics. The meaning of each column is as follows:

- 'f_avg_duration': average time spent per attempt (unit: s, including both successful and failed attempts);
- 'f_avg_passrate': average pass rate;
- 'f_avg_win_duration': average time spent per successful attempt (unit: s, counting only attempts that cleared the level);
- 'f_avg_retrytimes': average number of retries (the second attempt at the same level counts as the first retry);
- 'level_id': level ID, which can be matched with 'level_id' in level_seq.csv.
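Because both files share 'level_id', the per-level statistics can be attached to each attempt with a join. The following is a minimal sketch on made-up toy data, not the notebook's own code; the frame names `attempts`, `level_meta`, and `enriched` and the derived column `duration_vs_avg` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy stand-ins for level_seq.csv and level_meta.csv (all values are made up).
attempts = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "level_id": [1, 1, 2, 1, 2],
    "f_success": [0, 1, 1, 1, 0],
    "f_duration": [120.0, 90.0, 60.0, 100.0, 80.0],
})
level_meta = pd.DataFrame({
    "level_id": [1, 2],
    "f_avg_passrate": [0.70, 0.45],
    "f_avg_duration": [105.0, 75.0],
})

# A left join keeps every attempt and attaches the statistics of its level.
# validate="many_to_one" raises an error if level_id is not unique in level_meta.
enriched = attempts.merge(level_meta, on="level_id", how="left", validate="many_to_one")

# Hypothetical derived feature: seconds above or below the level's average duration.
enriched["duration_vs_avg"] = enriched["f_duration"] - enriched["f_avg_duration"]
print(enriched)
```

A left join is used here so that attempts on a level missing from the metadata would still be kept, with NaN in the statistics columns.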

In [2]:
#Import libraries
import pandas as pd
import glob
import os
import datetime as dt
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter, AutoMinorLocator)
import matplotlib.dates as mdates
import seaborn as sns

In [3]:
#Import csv files
path = "/Users/raws/Downloads/mobile_game_data"
csv_files = glob.glob(path + "/*.csv")

csv_files

['/Users/raws/Downloads/mobile_game_data/test.csv',
 '/Users/raws/Downloads/mobile_game_data/level_seq.csv',
 '/Users/raws/Downloads/mobile_game_data/dev.csv',
 '/Users/raws/Downloads/mobile_game_data/train.csv',
 '/Users/raws/Downloads/mobile_game_data/level_meta.csv']

In [4]:
#Read each file (tab-separated despite the .csv extension) into a list of DataFrames
df_list = [pd.read_csv(filename, delimiter='\t', index_col=None, header=0) for filename in csv_files]
df_list

[      user_id
 0           1
 1           2
 2           3
 3           4
 4           5
 ...       ...
 2768     2769
 2769     2770
 2770     2771
 2771     2772
 2772     2773
 
 [2773 rows x 1 columns],
          user_id  level_id  f_success  f_duration  f_reststep  f_help  \
 0          10932         1          1       127.0    0.500000       0   
 1          10932         2          1        69.0    0.703704       0   
 2          10932         3          1        67.0    0.560000       0   
 3          10932         4          1        58.0    0.700000       0   
 4          10932         5          1        83.0    0.666667       0   
 ...          ...       ...        ...         ...         ...     ...   
 2194346    10931        40          1       111.0    0.250000       1   
 2194347    10931        41          1        76.0    0.277778       0   
 2194348    10931        42          0       121.0    0.000000       1   
 2194349    10931        42          0       115.0  

In [5]:
#Positions follow the file listing above: index 1 is level_seq.csv, index 4 is level_meta.csv
users = df_list[1]
meta = df_list[4]
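Picking frames by position works only as long as `glob.glob` returns the files in the same order, and Python documents that order as arbitrary. One possible alternative is to key each frame by its file name. The sketch below assumes tab-separated files as in the cell above; the helper `load_tables` and the names `users_demo` and `meta_demo` are hypothetical, and the demonstration runs on a throwaway folder rather than the real data.

```python
import os
import tempfile

import pandas as pd


def load_tables(folder, sep="\t"):
    """Return {file stem: DataFrame} for every .csv file in `folder`."""
    tables = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(".csv"):
            stem = os.path.splitext(name)[0]
            tables[stem] = pd.read_csv(os.path.join(folder, name), sep=sep)
    return tables


# Demonstration on a temporary folder holding two tiny tab-separated files.
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame({"user_id": [1], "level_id": [1]}).to_csv(
        os.path.join(folder, "level_seq.csv"), sep="\t", index=False)
    pd.DataFrame({"level_id": [1], "f_avg_passrate": [0.7]}).to_csv(
        os.path.join(folder, "level_meta.csv"), sep="\t", index=False)
    tables = load_tables(folder)

# Lookup by name does not depend on the order in which files were listed.
users_demo = tables["level_seq"]
meta_demo = tables["level_meta"]
print(sorted(tables))
```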

In [6]:
users

| | user_id | level_id | f_success | f_duration | f_reststep | f_help | time |
|---|---|---|---|---|---|---|---|
| 0 | 10932 | 1 | 1 | 127.0 | 0.500000 | 0 | 2020-02-01 00:05:51 |
| 1 | 10932 | 2 | 1 | 69.0 | 0.703704 | 0 | 2020-02-01 00:08:01 |
| 2 | 10932 | 3 | 1 | 67.0 | 0.560000 | 0 | 2020-02-01 00:09:50 |
| 3 | 10932 | 4 | 1 | 58.0 | 0.700000 | 0 | 2020-02-01 00:11:16 |
| 4 | 10932 | 5 | 1 | 83.0 | 0.666667 | 0 | 2020-02-01 00:13:12 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2194346 | 10931 | 40 | 1 | 111.0 | 0.250000 | 1 | 2020-02-03 16:26:37 |
| 2194347 | 10931 | 41 | 1 | 76.0 | 0.277778 | 0 | 2020-02-03 16:28:06 |
| 2194348 | 10931 | 42 | 0 | 121.0 | 0.000000 | 1 | 2020-02-03 16:30:17 |
| 2194349 | 10931 | 42 | 0 | 115.0 | 0.000000 | 0 | 2020-02-03 16:33:40 |


In [7]:
#Suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)
users.describe()

| | user_id | level_id | f_success | f_duration | f_reststep | f_help |
|---|---|---|---|---|---|---|
| count | 2194351.0 | 2194351.0 | 2194351.0 | 2194351.0 | 2194351.0 | 2194351.0 |
| mean | 6745.03 | 96.84 | 0.53 | 108.12 | 0.17 | 0.04 |
| std | 3942.09 | 84.11 | 0.5 | 53.61 | 0.23 | 0.21 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 3287.0 | 41.0 | 0.0 | 77.0 | 0.0 | 0.0 |
| 50% | 6688.0 | 80.0 | 1.0 | 100.0 | 0.05 | 0.0 |
| 75% | 10163.0 | 142.0 | 1.0 | 127.0 | 0.29 | 0.0 |
| max | 13589.0 | 1509.0 | 1.0 | 600.0 | 1.0 | 1.0 |


In [10]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2194351 entries, 0 to 2194350
Data columns (total 7 columns):
 #   Column      Dtype  
---  ------      -----  
 0   user_id     int64  
 1   level_id    int64  
 2   f_success   int64  
 3   f_duration  float64
 4   f_reststep  float64
 5   f_help      int64  
 6   time        object 
dtypes: float64(2), int64(4), object(1)
memory usage: 117.2+ MB


In [8]:
users.duplicated().sum()

69322

In [9]:
dupes = users[users.duplicated()]
dupes

| | user_id | level_id | f_success | f_duration | f_reststep | f_help | time |
|---|---|---|---|---|---|---|---|
| 64 | 10932 | 50 | 0 | 153.00 | 0.00 | 0 | 2020-02-01 17:05:43 |
| 65 | 10932 | 50 | 0 | 153.00 | 0.00 | 0 | 2020-02-01 17:05:43 |
| 665 | 2774 | 42 | 1 | 130.00 | 0.05 | 0 | 2020-02-01 10:56:34 |
| 667 | 2774 | 44 | 1 | 116.00 | 0.21 | 0 | 2020-02-01 11:02:32 |
| 707 | 2774 | 62 | 1 | 89.00 | 0.43 | 0 | 2020-02-02 09:10:50 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2192592 | 10924 | 36 | 1 | 110.00 | 0.50 | 0 | 2020-02-02 00:45:41 |
| 2193091 | 13586 | 104 | 0 | 155.00 | 0.00 | 0 | 2020-02-03 21:13:35 |
| 2193569 | 10927 | 137 | 1 | 73.00 | 0.29 | 0 | 2020-02-02 22:36:23 |
| 2193666 | 10927 | 207 | 0 | 166.00 | 0.00 | 0 | 2020-02-04 20:59:36 |
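By default, `duplicated()` flags every repeat of a row except its first occurrence, so the count above is the number of rows that `drop_duplicates()` would remove, not the number of rows involved in any duplication. The small sketch below illustrates the difference on made-up data; the frame `demo` and the variable names are hypothetical.

```python
import pandas as pd

# Three identical rows followed by one unique row (toy data).
demo = pd.DataFrame({
    "user_id": [10932, 10932, 10932, 2774],
    "level_id": [50, 50, 50, 42],
    "time": ["2020-02-01 17:05:43"] * 3 + ["2020-02-01 10:56:34"],
})

# keep="first" is the default: the first copy is not flagged, later copies are.
later_copies = demo.duplicated()

# keep=False flags every row that has at least one identical twin.
all_copies = demo.duplicated(keep=False)

# drop_duplicates() removes exactly the rows flagged by the default duplicated().
deduped = demo.drop_duplicates()

print(later_copies.tolist())
print(all_copies.tolist())
print(len(deduped))
```

Inspecting `duplicated(keep=False)` can be useful before dropping anything, since it shows the retained original next to its copies.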


<div class="alert alert-warning">
  <strong>Summary of Findings</strong>
    <li>Of the 2,194,351 rows, 69,322 (about 3.2%) are exact duplicates of an earlier row.</li>
    <li>user_id is stored as int64 but is an identifier, so it should be converted to string.</li>
    <li>time is stored as object (text) and should be converted to datetime.</li>
</div>
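The two type conversions listed in the findings could look roughly like the sketch below. It runs on a small made-up frame named `sample` (a hypothetical stand-in for `users`) and assumes the timestamp layout shown in the output above, `YYYY-MM-DD HH:MM:SS`.

```python
import pandas as pd

# Toy rows mimicking the user_id and time columns of level_seq.csv.
sample = pd.DataFrame({
    "user_id": [10932, 10932, 10931],
    "time": ["2020-02-01 00:05:51", "2020-02-01 00:08:01", "2020-02-03 16:33:40"],
})

# Treat the identifier as a label rather than a number.
sample["user_id"] = sample["user_id"].astype(str)

# Parse the text timestamps; an explicit format avoids per-row format guessing.
sample["time"] = pd.to_datetime(sample["time"], format="%Y-%m-%d %H:%M:%S")

print(sample.dtypes)
# With real datetimes, time arithmetic works directly.
print(sample["time"].max() - sample["time"].min())
```

Converting the identifier to string keeps it out of numeric summaries such as `describe()`, where a mean or standard deviation of user IDs carries no meaning.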

In [None]:
#Drop exact duplicate rows, keeping the first occurrence of each
users = users.drop_duplicates()

<div class="alert alert-success">
  <strong>Summary of Actions</strong>
    <li>Dropped the 69,322 duplicate rows.</li>
</div>