### **Individual Planning Report**

<mark>Each student is expected to prepare a 1 page (max 500 words) written proposal that describes the data they are working on, demonstrates an understanding of all variables and potential issues in the data, and identifies the question they would like to answer using that dataset for their project. The proposal should be done in a Jupyter notebook, and then submitted in two formats:</mark>

+ <mark>as an .html file (File -> Download As -> HTML)</mark>
+ <mark>as an .ipynb file. This file must be fully reproducible. It must run completely from top to bottom without any additional files.</mark>

<mark>It's important to note that this first step in the project will be completed individually. Every student needs to write and submit their own assignment. We aim to ensure that all students in the group are well-prepared and able to contribute effectively to the final report.</mark>


**1) Data Description:**
The data consist of two files:

+ *<font color='violet'>'players.csv'</font>*: A list of 196 unique players, including the following 9 data fields for each player.
    + **experience** <font color='green'>(**type**:selected text)</font>
    Defines whether a player is experienced or just starting out. &#10140; This user input (perceived data) might need checking.
    + **subscribe** <font color='green'>(**type**:boolean)</font>
    States if a player has subscribed to the mailing list. &#10140; A subscriber might be a more persistent player.
    + **hashedEmail** <font color='green'>(**type**:hashed text)</font>
    Gives the hashed email, a unique user ID. &#10140; This will help us link the data of each player with the other dataset.
    + **played_hours** <font color='green'>(**type**:float)</font>
    Total amount of time each person has played, summed over the play sessions. &#10140; This field might be redundant with the sessions' logs contained in the other dataset but useful for preliminary plotting. Some players have never played (min=0.0).
    + **name** <font color='green'>(**type**:free text)</font> Username allowing players to communicate between themselves. &#10140; This field will probably not be useful for the analysis.
    + **gender** <font color='green'>(**type**:selected text)</font>
    &#10140; This question was optional and might therefore be incomplete.
    + **age** <font color='green'>(**type**:integer)</font>
    &#10140; It would be interesting to see if age influences the analysis. Note that players are mostly between 17 and 22 (25th and 75th percentile), which is narrow.
    + **individualId** <font color='green'>(**type**:float)</font>
    Could be a unique identifier, i.e. an ID number read in as a float. &#10140; This field is not useful, as it is always empty.
    + **organizationName** <font color='green'>(**type**:float)</font>
    &#10140; This field contains no relevant data.

+ *<font color='violet'>'sessions.csv'</font>*: A list of 1535 individual play sessions, regardless of the player, including the following 5 data fields.
    + **hashedEmail** <font color='green'>(**type**:selected text)</font>
    Same as in the other dataset.
    &#10140; This will help us link the data of each player between both datasets.
    + **start_time** <font color='green'>(**type**:selected text)</font>
    Gives the time each player connected to play.&#10140; This will be useful in our analysis on users' persistence.
    + **end_time** <font color='green'>(**type**:selected text)</font>
    Gives the time each player disconnected.&#10140; This will be useful in our analysis on users' persistence.
    + **original_start_time** <font color='green'>(**type**:selected text)</font>
    &#10140; A large number that is not precise enough to be meaningful. Will be discarded.
    + **original_end_time** <font color='green'>(**type**:selected text)</font>
    &#10140; Same comment: will be discarded.

There are 1535 sessions but only 196 unique emails: a player can connect multiple times over the survey period.

**2) Challenge question:**

My group and I have decided to answer Question 3: "We would like to know something about our populations of users, in particular, we would like to have a good model of whether or not a player will continue contributing given past participation."

For this question, we will need both datasets, but not all the variables (see previous comments).

We anticipate that sessions.csv will be the most useful dataset. The start and end time/date tell us not only how many times each user played but also for how long, and when during the day and week. They therefore play a big role in determining the frequency of play, which we need to build a good model. Each start and end time/date is stored in a single cell, with the date as "/"-separated "day/month/year" and the time as ":"-separated "hour:minute". These will have to be parsed into an appropriate datetime format (ideally when reading the csv file).
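
As a minimal sketch of this parsing step, assuming the cells look like `30/06/2024 18:12` (the sample strings below are invented for illustration and are not taken from sessions.csv), an explicit format string makes the day-first convention unambiguous:

```python
import pandas as pd

# Invented sample strings in the assumed "day/month/year hour:minute" layout.
raw = pd.Series(["30/06/2024 18:12", "01/07/2024 09:05"])

# An explicit format avoids any day/month ambiguity (01/07 is read as 1 July).
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(parsed.dt.month.tolist())
print(parsed.dt.hour.tolist())
```

Passing `dayfirst=True` to `pd.read_csv` should have the same effect when the layout is consistent; the explicit format simply fails loudly if a cell does not match.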

In players.csv, we are going to focus on 'experience', 'age', 'subscribe' in case a relation appears with the time and frequency played. These could help build a good model. We will also need the emails in order to link these explanatory variables with the information in sessions.csv.

**3) Exploratory Data Analysis and Visualization**

Here is how we could load each dataset, dropping the unneeded variables and parsing the date/time at the same time.


In [1]:
import pandas as pd
players = pd.read_csv('players.csv', usecols=["experience", "subscribe", "hashedEmail", "played_hours", "age"])
sessions = pd.read_csv('sessions.csv', parse_dates=['start_time','end_time'], dayfirst=True, usecols=["hashedEmail", "start_time", "end_time"])

Here is how we can merge the two datasets into a tidy dataset, starting with sessions.csv so we can use method chaining and using "hashedEmail" as the common key, while also adding extra variables:
+  the played_minutes of each session
+  the mean session time (the midpoint between start and end)
+  and ordering the experience category

&#10140; Not all players played, and only those that have an entry in sessions.csv are kept (how="inner"): the new dataframe has the same number of entries as sessions. <br>&#10140; We can later check which variable (age, experience, subscribe) plays a role in frequency of play.

In [2]:
tidywhole = (
    pd.merge(sessions, players, how="inner", on="hashedEmail")
    # Compute from the merged rows (lambda), not from the original sessions frame
    .assign(played_minutes=lambda df: (df["end_time"] - df["start_time"]) / pd.Timedelta(minutes=1))
    # Midpoint of each session
    .assign(mean_time=lambda df: df["start_time"] + (df["end_time"] - df["start_time"]) / 2)
    # Number the levels so that they sort in order of experience
    .replace({"experience": {'Beginner': '1-Beginner', 'Amateur': '2-Amateur', 'Regular': '3-Regular', 'Pro': '4-Pro', 'Veteran': '5-Veteran'}})
)
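
The claim that the merge keeps every session can be checked rather than assumed. The sketch below uses invented toy tables (not the real files) and a left merge with `indicator=True`, so that any session without a matching player would become visible instead of being dropped silently:

```python
import pandas as pd

# Toy stand-ins for the two files (values are invented for illustration).
toy_players = pd.DataFrame({"hashedEmail": ["a", "b", "c"], "age": [17, 21, 19]})
toy_sessions = pd.DataFrame({"hashedEmail": ["a", "a", "b"], "session": [1, 2, 3]})

# validate="many_to_one" raises if a player appears twice in toy_players;
# indicator=True adds a _merge column saying where each row came from.
check = pd.merge(toy_sessions, toy_players, how="left", on="hashedEmail",
                 validate="many_to_one", indicator=True)

# Sessions whose player is missing from toy_players are marked "left_only".
unmatched = (check["_merge"] == "left_only").sum()

# Players with no recorded session (these are dropped by an inner merge).
never_played = set(toy_players["hashedEmail"]) - set(toy_sessions["hashedEmail"])

print(unmatched, never_played)
```

If `unmatched` is zero, the inner merge and the left merge give the same rows, which is what the statement about the number of entries relies on.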

Now, let's make a few plots to better understand the dataset and check some of the assumptions.

+ Let's first compare **played_hours** extracted from *<font color='violet'>'players.csv'</font>* and the cumulative played time **played_minutes** calculated from *<font color='violet'>'sessions.csv'</font>*, using a scatter plot.

In [12]:
import altair as alt
tidysum = pd.merge(tidywhole.groupby(['hashedEmail'])['played_minutes'].sum().reset_index(), players[['hashedEmail','played_hours']], how="inner", on=["hashedEmail"])
tidysum['played_hours'] = tidysum['played_hours'] * 60
scat1 = alt.Chart(tidysum).mark_circle().encode(
    x=alt.X("played_hours").scale(domain=(0, 16000)).title('Played Hours (Given and in Min)'),
    y=alt.Y("played_minutes").scale(domain=(0, 16000)).title('Played Minutes (Calculated)')
)
xyline = alt.Chart( pd.DataFrame({'x': [0, 16000]})).mark_line(color='lightgrey').encode(x=alt.X('x'),y=alt.Y('x') )
scat1+xyline

&#10140; For some players the calculated played minutes exceed the given played hours (not all points are on the 1:1 line). It looks like the **played_hours** in *<font color='violet'>'players.csv'</font>* were calculated from sessions of 15 minutes or more, which is not helpful for us. We will prefer **played_minutes** to **played_hours** in the rest of the assignment.
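
One way the 15-minute hypothesis could be probed is to total only the sessions of at least 15 minutes per player and compare that with the total over all sessions. This is a sketch on invented data: the players "a" and "b" and their session lengths are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

# Invented sessions for two hypothetical players "a" and "b".
toy = pd.DataFrame({
    "hashedEmail": ["a", "a", "b", "b"],
    "played_minutes": [5.0, 60.0, 10.0, 12.0],
})

# Total over every session, as in tidysum.
all_total = toy.groupby("hashedEmail")["played_minutes"].sum()

# Total over sessions of at least 15 minutes only; players left with no
# qualifying session get 0 instead of disappearing from the result.
long_total = (
    toy[toy["played_minutes"] >= 15]
    .groupby("hashedEmail")["played_minutes"].sum()
    .reindex(all_total.index, fill_value=0.0)
)

comparison = pd.DataFrame({"all_sessions": all_total, "long_sessions": long_total})
print(comparison)
```

If the hypothesis holds, the `long_sessions` totals should sit closer to **played_hours** (converted to minutes) than the `all_sessions` totals do.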

+ Let us now do a simple plot of the times played during the day.

In [15]:
scat2 = alt.Chart(tidywhole.reset_index()).mark_circle().encode(
    x=alt.X("index").title('Every session of each player'),
    y=alt.Y("start_time:T", timeUnit='hoursminutes').title('Time when played')
)
scat2

&#10140; Virtually no one played between 9:00AM and 2:00PM, which is odd. We assume that the timestamps are in GMT rather than PST; shifting them by 8 hours would put the unplayed time between roughly 1:00 and 6:00 in the morning, which is more sensible.

In [5]:
tidywhole_pst = tidywhole.copy(deep=True)
tidywhole_pst["start_time"] = tidywhole_pst["start_time"] + pd.Timedelta(hours=-8)
tidywhole_pst["end_time"] = tidywhole_pst["end_time"] + pd.Timedelta(hours=-8)
tidywhole_pst["mean_time"] = tidywhole_pst["mean_time"] + pd.Timedelta(hours=-8)
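
A fixed shift of 8 hours ignores daylight saving time. As an alternative sketch, assuming the timestamps really are UTC and that the players are in the `America/Vancouver` time zone (both are assumptions, and the timestamps below are invented), pandas can do a time-zone-aware conversion:

```python
import pandas as pd

# Invented timestamps, assumed to be recorded in UTC.
utc_times = pd.Series(pd.to_datetime(["2024-01-15 12:00", "2024-07-15 12:00"]))

# "America/Vancouver" is an assumed location for the players.
local_times = utc_times.dt.tz_localize("UTC").dt.tz_convert("America/Vancouver")

# Winter uses UTC-8 and summer UTC-7, so the same UTC hour maps differently.
print(local_times.dt.hour.tolist())
```

For a first look at the time of day the fixed offset is probably good enough; the aware conversion matters mostly if sessions span both winter and summer.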


In [17]:
scat3 = alt.Chart(tidywhole_pst.reset_index()).mark_circle().encode(
    x=alt.X("index").title('Activity per player'),
    y=alt.Y("start_time:N", timeUnit='day').title('Days connected')
)
bar0 = alt.Chart(tidywhole_pst).mark_bar().encode(
    x="count()",
    y=alt.Y("start_time:N", timeUnit='day').title('Days connected')
)
scat3 | bar0 

&#10140; This shows relatively little influence of the day of the week (even if Sunday has the most sessions).

+ Bar plot of experience vs played minutes and number of sessions.

In [27]:
bar1 = alt.Chart(tidywhole).mark_bar().encode(
    x=alt.X("sum(played_minutes)").title('Total minutes played'),
    y=alt.Y("experience").title('Level')
)
bar2 = alt.Chart(tidywhole).mark_bar().encode(
    x="count()",
    y=alt.Y("experience").title('Level')
)
bar1 | bar2

&#10140; This shows that the perceived experience is not correlated with playtime (Amateurs and Regulars contribute more than the more experienced players).

+ Bar plot of age vs played minutes and number of sessions.<br>


In [28]:
bar1 = alt.Chart(tidywhole[(tidywhole["age"]<25) & (tidywhole["age"]>15)]).mark_bar().encode(
    x=alt.X("sum(played_minutes)").title('Total minutes played'),
    y=alt.Y("age").title('Age')
)
bar2 = alt.Chart(tidywhole[(tidywhole["age"]<25) & (tidywhole["age"]>15)]).mark_bar().encode(
    x="count()",
    y=alt.Y("age").title('Age')
)
bar1 | bar2

&#10140; There is no strong correlation between age and played minutes or number of connections.

+ Here is a plot of all sessions over time vs. **played_minutes**, together with the info provided by the 'describe' function.


In [54]:
scat4 = alt.Chart(tidywhole.reset_index()).mark_circle().encode(
    x=alt.X("start_time").title('Start Time'),
    y=alt.Y("played_minutes").title('Played Minutes')
).properties(
    width='container',
    height=200
)
line_mean = alt.Chart(tidywhole).mark_rule(strokeDash=[10], size=3, color='darkorange').encode( y=alt.datum(50.858447))
top5000 = tidysum["hashedEmail"][(tidysum["played_minutes"]>14500)]
print(top5000)
scat5 = alt.Chart(tidywhole[tidywhole["hashedEmail"].isin(top5000)], title= 'Overall time played per player').mark_point(filled=False, color='red').encode(
    x=alt.X("start_time").title('Start Time'),
    y=alt.Y("played_minutes").title('Played Minutes')
)
final_chart = scat4+scat5+line_mean
final_chart

84    bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...
Name: hashedEmail, dtype: object


On average, a session lasted about 51 minutes (mean of **played_minutes**: 50.86).
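
The mean can be read from `describe()` rather than typed in by hand. The sketch below uses invented session lengths (not the real data) and also prints the median, since a mean can be pulled up by a few very long sessions:

```python
import pandas as pd

# Invented session lengths in minutes, with one long session.
toy_minutes = pd.Series([10.0, 20.0, 30.0, 140.0], name="played_minutes")

# describe() returns count, mean, std, min, quartiles and max in one Series.
summary = toy_minutes.describe()
mean_minutes = float(summary["mean"])
median_minutes = float(toy_minutes.median())

print(mean_minutes, median_minutes)
```

In the notebook, the same value could be passed to the rule mark (for example `alt.datum(mean_minutes)`), so that the line stays correct if the data change.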

**4) Methods and Plan**

To answer our question, we will need to split the dataset (tidywhole, the data behind final_chart) into two parts: a training set, on which we fit our model, and a testing set, on which we check that the model works correctly. This will be done by setting aside a portion of the players' sessions and keeping the rest to experiment on. The split should be the first thing we do, because the model must not be built using the future testing set; otherwise we cannot trust the test result. We will split the dataset into 65% training and 35% testing.

We will first fit both a KNN regression and a linear regression and compare them to see which predicts best, selecting the model with the lower RMSPE (Root Mean Squared Prediction Error). KNN regression may be too sensitive to noise to predict correctly, while linear regression may not capture enough of the structure in the data, but I think the linear regression model will be the most effective. To compare and tune the models we will use cross-validation on the training set, with the RMSPE as our metric; only then will we predict on the testing set with the selected model.

Throughout this process we need to assume that all players were playing for fun and not to fulfil a required play time like us. We also need to remember that **played_hours** in the players dataset does not seem to take into account the sessions under 15 minutes.
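
The split and the comparison metric can be sketched without fitting any model. The RMSPE is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ are the observed values and $\hat{y}_i$ the predictions. In the sketch below, the table `toy`, its columns, the seed and the helper `rmspe` are all hypothetical placeholders; only the 65%/35% proportions come from the plan.

```python
import numpy as np
import pandas as pd

# Invented per-player table; the real one would be built from tidywhole.
toy = pd.DataFrame({"player": range(20), "past_minutes": np.arange(20) * 10.0})

# 65% training / 35% testing, with a fixed seed so the split is reproducible.
train = toy.sample(frac=0.65, random_state=1)
test = toy.drop(train.index)


def rmspe(observed, predicted):
    """Root mean squared prediction error between two equal-length sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


print(len(train), len(test))
print(rmspe([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Splitting a per-player table rather than individual sessions keeps all sessions of one player on the same side of the split; whether that is the right unit is a design choice still to be settled by the group.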

___
<mark>In your explanation, respond to the following questions: **Question (4)**</mark>
+ Why is this method appropriate?
+ Which assumptions are required, if any, to apply the method selected?
+ What are the potential limitations or weaknesses of the method selected?
+ How are you going to compare and select the model?
+ How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?

<mark>**Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.**</mark>
___