### Exploring the Sequences of Technical Difficulties
The aim here is to discover identifiers of technical difficulties: are there sequences of behaviour that lead to a user having technical difficulties?

*Why do this?*
This would provide insight into the sequence patterns that users of interactive-media experiences exhibit when they run into some form of technical difficulty with the experience itself. In this case, the focus is on video buffering issues, as this is the main reported technical problem with CAKE.

*What do we currently know about the technical difficulties?*
Apart from two users, the technical difficulties are video-related. Each of the participants reported that once they began to experience issues with the videos, they switched to the written card view (where the cooking instructions are in a written format).

*Points to consider*
While each of the users in this example persevered through the video issues (they switched views), it's reasonable to presume that this is due to where the data was collected (the user experience study). In a real setting most users probably wouldn't do this: if they had issues, it's unlikely that they would persevere (maybe a small percentage would). This in turn highlights the need to detect these issues and provide some form of fix for them, to increase the retention rate of the experience(s).

In [4]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

As we know from the participants' reports, each participant who had video-related issues switched to the card view to work around the problem and complete the user experience study. So the first thing to establish, for each of these users, is when this switch of views first occurs.

Note that switching to the card view is, in this case, seen as a solution to the problem, as the participants report a positive overall experience with CAKE while using the card view.

In [7]:
# First read in the two datasets; statistical and raw.
td_raw = pd.read_csv('./data/cake_td_raw_data.csv')
td_stats = pd.read_csv('./data/cake_td_stats_data.csv')
td_raw.head()

Unnamed: 0,id,participant_id,timestamp,pagetime,item,action,message,combined,combined_num
0,2251,109,2017-08-20 19:21:27,0,play_pause_button,play,,play_pause_button play,9
1,2258,109,2017-08-20 19:26:34,0,full_screen_button,fullscreen,,full_screen_button fullscreen,27
2,3218,407,2017-08-22 22:00:41,0,play_pause_button,play,,play_pause_button play,9
3,3219,407,2017-08-22 22:01:46,0,play_pause_button,play,,play_pause_button play,9
4,3220,407,2017-08-23 08:06:17,0,play_pause_button,play,,play_pause_button play,9


Let's first drop participant 407: they only have three events logged, and their issue was with the login form not working.

In [18]:
td_raw.drop(td_raw[td_raw['participant_id'] == 407].index, inplace=True)
td_stats.drop(td_stats[td_stats['userid'] == 407].index, inplace=True)

In [19]:
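# Now, let's find the first occurrence of the card view toggle for each participant.
td_pids = td_raw['participant_id'].unique()
first_occurrence = {}
for pid in td_pids:
    temp_df = td_raw[td_raw['participant_id'] == pid]
    # idxmax returns the index label of the first True value, i.e. the first cardView event.
    # Caveat: if a participant has no cardView event, it returns their first row's label instead.
    loc_index = temp_df['item'].eq('cardView').idxmax()
    first_occurrence[pid] = loc_index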

first_occurrence

{109: 29, 112: 234, 113: 553, 121: 165, 205: 68, 217: 489, 220: 309}
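The `idxmax` approach relies on every remaining participant having at least one `cardView` event, which holds here according to the participants' reports. As a hedged sketch of a variant that simply omits participants who never toggled, the first occurrence can also be found by filtering and grouping. The `events` frame below is synthetic and only mimics the shape of `td_raw`; its values are assumptions for illustration.

```python
import pandas as pd

# Synthetic stand-in for td_raw (illustrative values only, not the real data).
events = pd.DataFrame({
    'participant_id': [1, 1, 1, 2, 2, 3, 3],
    'item': ['play_pause_button', 'cardView', 'cardView',
             'play_pause_button', 'play_pause_button',
             'cardView', 'play_pause_button'],
})

# Keep only the card view rows, then take the first index label per participant.
card_rows = events[events['item'] == 'cardView']
first_toggle = (
    card_rows.reset_index()          # moves the row labels into an 'index' column
    .groupby('participant_id')['index']
    .first()
    .to_dict()
)

# Participant 2 never toggled, so they are absent from first_toggle.
# With idxmax, the same participant would silently get their first row label.
p2 = events[events['participant_id'] == 2]
naive = p2['item'].eq('cardView').idxmax()

print(first_toggle)
print(naive)
```

Participants missing from `first_toggle` can then be handled explicitly rather than being assigned a misleading index.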

In [28]:
# So first off, let's count the events up to and including the first card view toggle
# (label-based .loc slicing is inclusive of the end label).
number_of_events = {}
for key, value in first_occurrence.items():
    temp_df = td_raw[td_raw['participant_id'] == key]
    number_of_events[key] = len(temp_df.loc[:value])

number_of_events

{109: 27, 112: 25, 113: 11, 121: 20, 205: 13, 217: 21, 220: 17}
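The same count can be obtained without an explicit loop. The following is a minimal sketch on synthetic data (the `events` frame and its values are assumptions, not the CAKE logs): it marks, per participant, the rows that occur before or at the first `cardView` event and counts them.

```python
import pandas as pd

# Synthetic, interleaved event log (illustrative values only).
events = pd.DataFrame({
    'participant_id': [1, 2, 1, 2, 1, 2, 3],
    'item': ['play_pause_button', 'play_pause_button', 'cardView',
             'play_pause_button', 'play_pause_button', 'cardView',
             'play_pause_button'],
})

is_card = events['item'].eq('cardView').astype(int)
by_pid = is_card.groupby(events['participant_id'])

# Number of card view toggles strictly before each row, within each participant.
toggles_before = by_pid.cumsum() - is_card

# Only consider participants who toggled at least once.
toggled = by_pid.transform('max').eq(1)

# Rows with no earlier toggle are the events up to and including the first toggle.
up_to_first = toggles_before.eq(0) & toggled
counts = events[up_to_first].groupby('participant_id').size().to_dict()

print(counts)
```

Like the `.loc[:value]` slice above, this count includes the toggle event itself.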

Now that we have some basic data, let's combine the two dictionaries into a single dataframe.

In [31]:
td_event_df = pd.DataFrame.from_dict(first_occurrence, orient='index')
td_event_df.columns = ['first_occ']
td_event_df['num_events_up_to'] = pd.Series(number_of_events)
td_event_df

Unnamed: 0,first_occ,num_events_up_to
109,29,27
205,68,13
121,165,20
112,234,25
220,309,17
217,489,21
113,553,11


In [32]:
# Even though these are basic bits of information, let's just print out some descriptive statistics.
td_event_df.describe()

Unnamed: 0,first_occ,num_events_up_to
count,7.0,7.0
mean,263.857143,19.142857
std,200.213576,5.89996
min,29.0,11.0
25%,116.5,15.0
50%,234.0,20.0
75%,399.0,23.0
max,553.0,27.0


Let's look at some time-based data.

In [None]:
# Convert the timestamp column to a datetime object type.
td_raw['timestamp'] = pd.to_datetime(td_raw['timestamp'])

In [42]:
def diff_in_mins(t1, t2):
    td = t1 - t2
    return td.total_seconds() / 60

In [47]:
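for row in td_event_df.itertuples():
    # Events for this participant, up to and including the first card view toggle.
    temp_df = td_raw[td_raw['participant_id'] == row.Index]
    timestamps = temp_df.loc[:row.first_occ, 'timestamp']
    first_ts = timestamps.iloc[0]
    last_ts = timestamps.iloc[-1]
    # Print the first logged timestamp for each participant.
    print(first_ts)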

2017-08-20 19:21:27
2017-08-23 18:10:20
2017-08-25 16:08:16
2017-08-25 18:06:24
2017-08-25 21:14:47
2017-08-27 18:37:20
2017-08-30 07:44:50
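The `diff_in_mins` helper defined earlier can be applied to the first and last timestamps of such a slice to get the time elapsed before the first card view toggle. The sketch below shows one way this might look; the `events` frame and its timestamps are made up for illustration, so the result says nothing about the actual participants.

```python
import pandas as pd

def diff_in_mins(t1, t2):
    """Return the difference t1 - t2 in minutes."""
    td = t1 - t2
    return td.total_seconds() / 60

# Synthetic events for a single hypothetical participant.
events = pd.DataFrame({
    'participant_id': [1, 1, 1],
    'timestamp': pd.to_datetime(['2017-08-20 19:00:00',
                                 '2017-08-20 19:05:30',
                                 '2017-08-20 19:12:00']),
    'item': ['play_pause_button', 'full_screen_button', 'cardView'],
})

# Index label of the first card view toggle, then the timestamps up to it.
first_occ = events['item'].eq('cardView').idxmax()
timestamps = events.loc[:first_occ, 'timestamp']

# Minutes between the first logged event and the first card view toggle.
elapsed = diff_in_mins(timestamps.iloc[-1], timestamps.iloc[0])
print(elapsed)
```

Note the argument order: the later timestamp goes first so that the result is positive.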
