In [18]:
import pandas as pd
import altair as alt

### Data Description <br>
players (players.csv) is a dataset containing one row per unique player, along with data about each player. There are 196 observations and 9 variables. One issue I can see in the data is that 2 variables, 'individualId' and 'organizationName', contain only missing values. The data in this dataset was collected from a registration form that each individual who signed up to play Plaicraft filled out. <br> <br>
    *Description of Each Variable:* <br>
>experience (object): the experience that the individual has had on Minecraft prior to signing up for Plaicraft, according to their own judgement<br>
    subscribe (boolean): Whether or not the individual agreed to receive notifications<br>
    hashedEmail (object): hashed email address<br>
    played_hours (numerical -> float64): Total hours played by the individual<br>
    name (object): selected name in the registration portion<br>
    gender (object): self-reported gender of the individual<br>
    age (numerical -> int64): age of the individual<br>
    individualId (numerical -> float64): ID of the individual (all values missing)<br>
    organizationName (numerical -> float64): organization the individual belongs to (all values missing, which is why pandas reads it as float64) <br>

sessions (sessions.csv) is a dataset containing a list of individual play sessions by each player, including data about the session. There are 1535 observations and 5 variables. One issue I can see in the data is that the values in 'original_start_time' and 'original_end_time' are identical in the displayed rows. The data in this dataset was collected by tracking the time and activity of each play session. <br> <br>
    *Description of Each Variable:* <br>
>hashedEmail (object): hashed email address<br>
    start_time (object): date and time of start of play session<br>
    end_time (object): date and time of end of play session<br>
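    original_start_time (numerical -> float64): start time as a numeric timestamp; the magnitude (about 1.7e12) suggests milliseconds since January 1st, 1970<br>
    original_end_time (numerical -> float64): end time as a numeric timestamp, in the same units as original_start_time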
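
The size of the `original_*` values can be checked against `start_time` directly. The following is a minimal sketch, not part of the original analysis: `toy_sessions` is a hypothetical stand-in holding two rows copied from the displayed output, and the millisecond unit is an assumption based on the magnitude of the numbers.

```python
import pandas as pd

# Hypothetical stand-in for sessions.csv; values copied from the first displayed rows.
toy_sessions = pd.DataFrame({
    'start_time': ['30/06/2024 18:12', '17/06/2024 23:33'],
    'original_start_time': [1.719770e+12, 1.718670e+12],
    'original_end_time': [1.719770e+12, 1.718670e+12],
})

# Assumption: the original_* columns count milliseconds since 1970-01-01.
decoded_start = pd.to_datetime(toy_sessions['original_start_time'], unit='ms')

# Fraction of rows where the two original_* columns are exactly equal.
share_identical = (
    toy_sessions['original_start_time'] == toy_sessions['original_end_time']
).mean()

print(decoded_start)
print(share_identical)
```

If the decoded timestamps land close to `start_time`, the unit assumption is plausible; running the same comparison on the full `sessions` table would show whether the two columns are identical in every row or only in the rows displayed.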

### Question <br>


We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts. <br>

*Response Variable:* Played Hours  <br>
*Explanatory Variables:* Experience, Age, Subscribe, Gender <br>

The data that will be used to answer this question is contained in the dataset 'players.csv'. The data from this dataset will be used to determine what type of individual contributes the most total hours to Minecraft. The wrangling required is to group the values of each explanatory variable with the pandas `groupby()` method, to find which, if any, of the different values in each explanatory variable are associated with differences in the variable 'played_hours'.
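
The grouping step described above could look roughly like this. It is a sketch on made-up data: `toy_players` and the helper `summarize_by` are hypothetical names, and only the column names are taken from players.csv.

```python
import pandas as pd

# Hypothetical toy data that reuses the column names of players.csv.
toy_players = pd.DataFrame({
    'experience': ['Pro', 'Veteran', 'Veteran', 'Amateur', 'Amateur', 'Amateur'],
    'subscribe': [True, True, False, True, False, False],
    'played_hours': [30.3, 3.8, 0.0, 0.7, 0.0, 2.3],
})

def summarize_by(df, column):
    """Count, total, and mean played_hours for each value of `column`."""
    return (
        df.groupby(column)['played_hours']
        .agg(n_players='count', total_hours='sum', mean_hours='mean')
        .reset_index()
    )

by_experience = summarize_by(toy_players, 'experience')
by_subscribe = summarize_by(toy_players, 'subscribe')
print(by_experience)
print(by_subscribe)
```

Computing the count and the mean inside the same `groupby` keeps each statistic attached to its own category label, so nothing has to be matched up by position afterwards.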

### Exploratory Data Analysis and Visualization <br>

In [2]:
players = pd.read_csv('data/players.csv')
sessions = pd.read_csv('data/sessions.csv')
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
players_tidy = players.drop(columns = ['name', 'individualId', 'organizationName'])
players_tidy

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Male,21
...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Female,17
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Male,22
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Prefer not to say,17
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Male,17


Removed the variables *'individualId'*, *'organizationName'*, and *'name'* from the dataset. *'individualId'* and *'organizationName'* contain only missing values, and *'name'* is of no use in answering the chosen question.
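
Which columns are entirely missing can be confirmed in code rather than by eye. A small sketch on a hypothetical frame (`toy_players` is invented for illustration and only mirrors a few column names from players.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of players.csv with two all-missing columns.
toy_players = pd.DataFrame({
    'name': ['Morgan', 'Christian', 'Blake'],
    'played_hours': [30.3, 3.8, 0.0],
    'individualId': [np.nan, np.nan, np.nan],
    'organizationName': [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
all_missing = toy_players.columns[toy_players.isna().all()].tolist()

# dropna with how='all' removes exactly those columns and keeps the rest.
toy_tidy = toy_players.dropna(axis='columns', how='all')

print(all_missing)
print(list(toy_tidy.columns))
```

In this toy frame `name` survives the `dropna` call because it has values, so a column that is complete but irrelevant still has to be dropped explicitly.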

In [4]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


In [5]:
players_sort2a = players_tidy['experience']\
    .value_counts()
players_sort2a

experience
Amateur     63
Veteran     48
Regular     36
Beginner    35
Pro         14
Name: count, dtype: int64

In [6]:
# Total, count, and mean of played_hours per experience level.
# Computing all three in one groupby keeps each count matched to its own label
# (value_counts() orders by frequency, groupby() orders alphabetically).
players_sortb = players_tidy.groupby('experience')['played_hours']\
    .agg(played_hours='sum', num_exp='count', avg_h_exp='mean')\
    .reset_index()
avg_hours_experience = alt.Chart(players_sortb).mark_bar().encode(
    x = alt.X('experience').title('Experience Playing Minecraft'),
    y = alt.Y('avg_h_exp').title('Average Time Played (hours)'),
    color = alt.Color('experience')
)
avg_hours_experience

**Reflection on Graph (1):** <br>
From analyzing graph 1, the variable 'experience' appears useful for determining the type of individual who will contribute the most hours to playing Minecraft, and therefore contribute the most data. The average hours played are not uniform across the experience categories, which is what makes the variable useful.

In [7]:
players_age = players_tidy[['age']].value_counts().reset_index()
players_age.head()

Unnamed: 0,age,count
0,17,75
1,21,18
2,22,15
3,20,14
4,23,13


In [8]:
# Total played hours for each age
players_age_hrs = players_tidy[['age', 'played_hours']]\
    .groupby('age')\
    .sum()\
    .reset_index()

In [17]:
# Join per-age counts with per-age totals, then divide within the merged table
players_age_table = pd.merge(players_age, players_age_hrs, on='age')
players_age_table1 = players_age_table.assign(avg_hrs = players_age_table['played_hours'] / players_age_table['count'])
age_hrs_plot = alt.Chart(players_age_table1).mark_bar().encode(
    x = alt.X('age').title('Age'),
    y = alt.Y('avg_hrs').title('Average Played Hours'),
)
age_hrs_plot

**Reflection on Graph (2):** <br>
From analyzing graph 2, I can see that the variable, 'age', can be used to determine the type of individual who will contribute the most hours to playing Minecraft. Different age values give varying amounts of hours played, so we can use the age of an individual to give a rough range of the number of played hours they will contribute to the dataset.

In [10]:
# Number of players and total played hours for each subscribe value
players_sub = players_tidy[['subscribe']].value_counts().reset_index()
players_sub_hrs = players_tidy[['subscribe', 'played_hours']]\
    .groupby('subscribe')\
    .sum()\
    .reset_index()
# Divide within the merged table so counts and totals share the same rows
players_sub_table = pd.merge(players_sub, players_sub_hrs, on='subscribe')
players_sub_table1 = players_sub_table.assign(avg_hrs = players_sub_table['played_hours'] / players_sub_table['count'])
sub_hrs_plot = alt.Chart(players_sub_table1).mark_bar().encode(
    x = alt.X('subscribe').title('Accepted Notifications'),
    y = alt.Y('avg_hrs').title('Average Played Hours'),
)
sub_hrs_plot

**Reflection on Graph (3):** <br>
From analyzing graph 3, I can see that the variable 'subscribe' can be used to determine the type of individual who will contribute the most hours to playing Minecraft. Whether or not an individual signs up to receive notifications is strongly associated with the number of hours, and therefore the amount of data, they are expected to contribute to the dataset.

In [11]:
# Number of players and total played hours for each gender
players_sex = players_tidy[['gender']].value_counts().reset_index()
players_sex_hrs = players_tidy[['gender', 'played_hours']]\
    .groupby('gender')\
    .sum()\
    .reset_index()
# Divide within the merged table so counts and totals share the same rows
players_sex_table = pd.merge(players_sex, players_sex_hrs, on='gender')
players_sex_table1 = players_sex_table.assign(avg_hrs = players_sex_table['played_hours'] / players_sex_table['count'])
sex_hrs_plot = alt.Chart(players_sex_table1).mark_bar().encode(
    x = alt.X('gender').title('Gender'),
    y = alt.Y('avg_hrs').title('Average Played Hours'),
    color = alt.Color('gender')
)
sex_hrs_plot

**Reflection on Graph (4):** <br>
From analyzing graph 4, I can see that the variable 'gender' can be used to determine the type of individual who will contribute the most hours to playing Minecraft. Each gender has a different average number of played hours, so some genders are expected to contribute more data than others, judging by their average player.

In [12]:
sessions 

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1.719770e+12,1.719770e+12
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1.718670e+12,1.718670e+12
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1.721930e+12,1.721930e+12
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1.721880e+12,1.721880e+12
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1.716650e+12,1.716650e+12
...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07,1.715380e+12,1.715380e+12
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19,1.719810e+12,1.719810e+12
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57,1.722180e+12,1.722180e+12
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22,1.721890e+12,1.721890e+12


In [13]:
sessions_tidy = sessions.loc[:,:'end_time']
sessions_tidy

Unnamed: 0,hashedEmail,start_time,end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12
...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22


Removed the variables *'original_start_time'* and *'original_end_time'* from the dataset, because these variables contained data that could not be used: the values in *'original_start_time'* and *'original_end_time'* are identical.
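
The remaining `start_time` and `end_time` columns are stored as text, so they need parsing before any session length can be computed. A minimal sketch, assuming the day/month/year format seen in the displayed rows; `toy_sessions`, the placeholder IDs, and the `duration_min` column are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same format as sessions_tidy ('a' and 'b' are placeholder IDs).
toy_sessions = pd.DataFrame({
    'hashedEmail': ['a', 'b', 'a'],
    'start_time': ['30/06/2024 18:12', '17/06/2024 23:33', '25/07/2024 03:22'],
    'end_time': ['30/06/2024 18:24', '17/06/2024 23:46', '25/07/2024 03:58'],
})

# Assumption: timestamps are day/month/year with 24-hour times, as displayed.
fmt = '%d/%m/%Y %H:%M'
start = pd.to_datetime(toy_sessions['start_time'], format=fmt)
end = pd.to_datetime(toy_sessions['end_time'], format=fmt)

# Session length in minutes, then total minutes per player.
toy_sessions = toy_sessions.assign(duration_min=(end - start).dt.total_seconds() / 60)
minutes_per_player = toy_sessions.groupby('hashedEmail')['duration_min'].sum()

print(minutes_per_player)
```

Passing an explicit `format` avoids pandas guessing month-first for ambiguous dates such as 10/05/2024.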

In [14]:
players.dtypes , sessions.dtypes #finding out the type each variable is

(experience           object
 subscribe              bool
 hashedEmail          object
 played_hours        float64
 name                 object
 gender               object
 age                   int64
 individualId        float64
 organizationName    float64
 dtype: object,
 hashedEmail             object
 start_time              object
 end_time                object
 original_start_time    float64
 original_end_time      float64
 dtype: object)

### Methods and Plan

One method for answering the question is to look at each explanatory variable and find which value (category/label) is associated with the most hours played. This assumes that more hours played is equivalent to more data contributed to the dataset. To find which value provides the most hours played in each explanatory variable, I can look either at the typical number of hours played for each value (the distribution) or at the average number of hours played for each value (the mean). In summary, the goal is to identify what type of people are most likely to contribute the most data. A weakness of this method is that the range and variation in total hours played per person is massive: some people play 40 hours and others play 0. Therefore, saying that the average player plays 20 hours would be an incorrect interpretation of the data. I will first split the data by each explanatory variable, and later look at the relationships between the variables and the hours played that is expected from them collectively.
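
The weakness described above can be made concrete by putting the mean next to the median. This is a sketch on invented numbers (`toy_players` is hypothetical), chosen so that one heavy player dominates a group:

```python
import pandas as pd

# Hypothetical data: most players log almost nothing, one Amateur logs 40 hours.
toy_players = pd.DataFrame({
    'experience': ['Amateur'] * 4 + ['Veteran'] * 4,
    'played_hours': [0.0, 0.0, 0.5, 40.0, 0.0, 0.1, 0.3, 3.8],
})

# Mean, median, and maximum of played_hours within each experience level.
summary = (
    toy_players.groupby('experience')['played_hours']
    .agg(mean_hours='mean', median_hours='median', max_hours='max')
)

print(summary)
```

When the mean sits far above the median, as it does for the toy Amateur group, a single player is driving the average, and the median is the better description of a typical player in that group.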


In [15]:
list(players)

['experience',
 'subscribe',
 'hashedEmail',
 'played_hours',
 'name',
 'gender',
 'age',
 'individualId',
 'organizationName']