# Hero.Coli Data Analysis Summary

Interactive list of noteworthy results from the Hero.Coli data analysis.

## Table of Contents

[Preparation](#preparation)
1. [Google form analysis](#sampledForm)
2. [Game sessions](#sessions)
3. [Per session and per user analysis](#peruser)
4. [User comparison](#usercomp)
5. [Game map](#map)

# Preparation
<a id=preparation />

In [None]:
%run "../Functions/6. Time analysis.ipynb"
%run "../Functions/Plot.ipynb"

In [None]:
sampledForm = samplePlaytestPretestPosttestUniqueProfilesVolunteersPhase1.copy()

In [None]:
rmdf = rmdfPlaytestPretestPosttestUniqueProfilesVolunteersPhase1.copy()

# 1. Google form analysis
<a id=sampledForm />

## Survey counts

In [None]:
print("sample:               gform")
print("surveys:              %s" % len(gform))
print("unique users:         %s" % getUniqueUserCount(gform))
print("RM before:            %s" % len(gform[gform[QTemporality] == answerTemporalities[0]]))
print("GF before:            %s" % len(getGFormBefores(gform)))
print("RM after:             %s" % len(gform[gform[QTemporality] == answerTemporalities[1]]))
print("GF after:             %s" % len(getGFormAfters(gform)))
print("unique biologists:    %s" % getUniqueUserCount(getSurveysOfBiologists(gform)))
print("unique gamers:        %s" % getUniqueUserCount(getSurveysOfGamers(gform)))
print("users answered both:  %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(gform)))
print("perfect pairs:        %s" % getPerfectPretestPostestPairsCount(gform))

In [None]:
print("sample:               sampledForm")
print("surveys:              %s" % len(sampledForm))
print("unique users:         %s" % getUniqueUserCount(sampledForm))
print("RM before:            %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[0]]))
print("GF before:            %s" % len(getGFormBefores(sampledForm)))
print("RM after:             %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[1]]))
print("GF after:             %s" % len(getGFormAfters(sampledForm)))
print("unique biologists:    %s" % getUniqueUserCount(getSurveysOfBiologists(sampledForm)))
print("unique gamers:        %s" % getUniqueUserCount(getSurveysOfGamers(sampledForm)))
print("users answered both:  %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(sampledForm)))
print("perfect pairs:        %s" % getPerfectPretestPostestPairsCount(sampledForm))

#### formatted version for nice display

In [None]:
print("category | count")
print("--- | ---")
print("sample | gform")
print("surveys | %s" % len(gform))
print("unique users | %s" % getUniqueUserCount(gform))
print("RM before | %s" % len(gform[gform[QTemporality] == answerTemporalities[0]]))
print("GF before | %s" % len(getGFormBefores(gform)))
print("RM after | %s" % len(gform[gform[QTemporality] == answerTemporalities[1]]))
print("GF after | %s" % len(getGFormAfters(gform)))
print("unique biologists | %s" % getUniqueUserCount(getSurveysOfBiologists(gform)))
print("unique gamers | %s" % getUniqueUserCount(getSurveysOfGamers(gform)))
print("users who answered both | %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(gform)))
print("perfect pretest/posttest pairs | %s" % getPerfectPretestPostestPairsCount(gform))
print()
#print("(" + str(pd.to_datetime('today').date()) + ")")
print("("+dataFilesNamesStem+")")

In [None]:
print("category | count")
print("--- | ---")
print("sample | sampledForm")
print("surveys | %s" % len(sampledForm))
print("unique users | %s" % getUniqueUserCount(sampledForm))
print("RM before | %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[0]]))
print("GF before | %s" % len(getGFormBefores(sampledForm)))
print("RM after | %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[1]]))
print("GF after | %s" % len(getGFormAfters(sampledForm)))
print("unique biologists | %s" % getUniqueUserCount(getSurveysOfBiologists(sampledForm)))
print("unique gamers | %s" % getUniqueUserCount(getSurveysOfGamers(sampledForm)))
print("users who answered both | %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(sampledForm)))
print("perfect pretest/posttest pairs | %s" % getPerfectPretestPostestPairsCount(sampledForm))
print()
#print("(" + str(pd.to_datetime('today').date()) + ")")
print("("+dataFilesNamesStem+")")
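
The four cells above repeat the same print statements for each sample. A minimal sketch of a helper that builds the same two-column markdown table from a mapping is given here; `format_markdown_table` and its parameters are hypothetical names, not part of the notebook's function library, and the numbers are toy values.

```python
def format_markdown_table(sample_name, counts):
    """Return a two-column markdown table as a single string.

    sample_name: label shown in the 'sample' row (hypothetical parameter).
    counts: mapping from category label to count, in display order.
    """
    lines = ["category | count", "--- | ---", "sample | %s" % sample_name]
    for category, count in counts.items():
        lines.append("%s | %s" % (category, count))
    return "\n".join(lines)


# Toy numbers for illustration only (not results from the data set).
example_counts = {"surveys": 12, "unique users": 9}
table = format_markdown_table("toySample", example_counts)
print(table)
```

In the notebook, the mapping could presumably be filled with the same calls used above, for instance `len(sampledForm)` and `getUniqueUserCount(sampledForm)`.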

### 1.1 complete sample

In [None]:
plotSamples(getDemographicSamples(sampledForm))

In [None]:
#plotSamples(getTemporalitySamples(sampledForm))

### 1.2 Per temporality

#### 1.2.1 answered only before

In [None]:
gf_befores = getGFormBefores(sampledForm)
rm_befores = getRMBefores(sampledForm)
gfrm_befores = getRMBefores(getGFormBefores(sampledForm))

In [None]:
(gf_befores[QUserId] == rm_befores[QUserId]).all()

In [None]:
#plotSamples(getDemographicSamples(gf_befores))

#### 1.2.2 answered only after

In [None]:
gf_afters = getGFormAfters(sampledForm)
rm_afters = getRMAfters(sampledForm)
gfrm_afters = getRMAfters(getGFormAfters(sampledForm))

In [None]:
(gf_afters[QUserId] == rm_afters[QUserId]).all()

In [None]:
#plotSamples(getDemographicSamples(gf_afters))

#### 1.2.3 answered both before and after

In [None]:
gf_both = getSurveysOfUsersWhoAnsweredBoth(sampledForm, gfMode = True, rmMode = False)
rm_both = getSurveysOfUsersWhoAnsweredBoth(sampledForm, gfMode = False, rmMode = True)
gfrm_both = getSurveysOfUsersWhoAnsweredBoth(sampledForm, gfMode = True, rmMode = True)
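
The idea behind "answered both" can be sketched on a toy table. This is only an illustration under assumptions: the column names `userId` and `temporality` and the function `surveys_of_users_who_answered_both` are hypothetical, and the real `getSurveysOfUsersWhoAnsweredBoth` may handle the `gfMode` / `rmMode` flags differently.

```python
import pandas as pd

# Toy survey table: one row per submitted survey (hypothetical columns).
toy_form = pd.DataFrame({
    "userId": ["a", "a", "b", "c", "c", "d"],
    "temporality": ["before", "after", "before", "before", "after", "after"],
})


def surveys_of_users_who_answered_both(form):
    """Keep only the surveys of users who have a 'before' and an 'after' row."""
    before_ids = set(form.loc[form["temporality"] == "before", "userId"])
    after_ids = set(form.loc[form["temporality"] == "after", "userId"])
    both_ids = before_ids & after_ids  # users present in both groups
    return form[form["userId"].isin(both_ids)]


both = surveys_of_users_who_answered_both(toy_form)
print(sorted(both["userId"].unique()))
```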

In [None]:
#plotSamples(getDemographicSamples(gf_both))

In [None]:
#plotSamples(getDemographicSamples(rm_both))

In [None]:
#plotSamples(getDemographicSamples(gfrm_both))

#### 1.2.4 pretest vs posttest

##### 1.2.4.1 phase1

In [None]:
plotBasicStats(sampledForm, horizontalPlot=True, sortedAlong="progression", figsize=(20,4));

### 1.3 Per demography

#### 1.3.1 English speakers

In [None]:
cohortEN = sampledForm[sampledForm[QLanguage] == enLanguageID]

In [None]:
#plotSamples(getTemporalitySamples(cohortEN))

#### 1.3.2 French speakers

In [None]:
cohortFR = sampledForm[sampledForm[QLanguage] == frLanguageID]

In [None]:
#plotSamples(getTemporalitySamples(cohortFR))

#### 1.3.3 Female

In [None]:
cohortF = sampledForm[sampledForm[QGender] == 'Female']

In [None]:
#plotSamples(getTemporalitySamples(cohortF))

#### 1.3.4 Male

In [None]:
cohortM = sampledForm[sampledForm[QGender] == 'Male']

In [None]:
#plotSamples(getTemporalitySamples(cohortM))

#### 1.3.5 biologists

##### strict

In [None]:
cohortBioS = getSurveysOfBiologists(sampledForm)

In [None]:
#plotSamples(getTemporalitySamples(cohortBioS))

##### broad

In [None]:
cohortBioB = getSurveysOfBiologists(sampledForm, False)

In [None]:
#plotSamples(getTemporalitySamples(cohortBioB))

#### 1.3.6 gamers

##### strict

In [None]:
cohortGamS = getSurveysOfGamers(sampledForm)

In [None]:
#plotSamples(getTemporalitySamples(cohortGamS))

##### broad

In [None]:
cohortGamB = getSurveysOfGamers(sampledForm, False)

In [None]:
#plotSamples(getTemporalitySamples(cohortGamB))

### 1.1 answers to scientific questions

In [None]:
sciBinarizedBefore = getAllBinarized(_form = getRMBefores(sampledForm))
#sciBinarizedBefore = getAllBinarized(getGFBefores())

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        sciBinarizedBefore,
                        _abs=True,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlations on survey questions before',
                    )

#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
thisClustermap, overlay = plotCorrelationMatrix(
                        sciBinarizedBefore,
                        _abs=True,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )
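
For reference, a minimal sketch of what an absolute correlation matrix over binarized answers looks like. It assumes that `plotCorrelationMatrix` is based on pairwise Pearson correlations between question columns, which is not confirmed here; the data below is a toy example.

```python
import pandas as pd

# Toy binarized answers: rows are surveys, columns are questions, 1 = correct.
toy_binarized = pd.DataFrame({
    "Q1": [1, 0, 1, 1, 0, 1],
    "Q2": [1, 0, 1, 1, 0, 1],  # identical to Q1
    "Q3": [0, 1, 0, 0, 1, 0],  # exact opposite of Q1
    "Q4": [1, 1, 0, 1, 0, 0],
})

corr = toy_binarized.corr()  # pairwise Pearson correlation between columns
abs_corr = corr.abs()        # comparable to _abs=True: sign is discarded
print(abs_corr.round(2))
```

With absolute values, perfectly opposed questions (Q1 and Q3) look as strongly linked as identical ones (Q1 and Q2). A question that every survey answers the same way has zero variance, so its correlations come out as NaN.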

In [None]:
sciBinarizedAfter = getAllBinarized(_form = getRMAfters(sampledForm))

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        sciBinarizedAfter,
                        _abs=True,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlations on survey questions after',
                    )

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
thisClustermap, overlay = plotCorrelationMatrix(
                        sciBinarizedAfter,
                        _abs=True,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )

thisClustermap.ax_heatmap.annotate(overlay)

dir(thisClustermap)

dir(thisClustermap.ax_heatmap)

vars(thisClustermap)

vars(thisClustermap.ax_heatmap)

### 1.2 answers to all questions

In [None]:
allQuestions = correctAnswers + demographicAnswers

allBinarized = getAllBinarized(_source = allQuestions, _form = sampledForm)
allBinarizedBefore = getAllBinarized(_source = allQuestions, _form = getRMBefores(sampledForm))
allBinarizedAfter = getAllBinarized(_source = allQuestions, _form = getRMAfters(sampledForm))

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        allBinarized,
                        _abs=True,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlation of all answers',
                    )

thisClustermap, overlay = plotCorrelationMatrix(
                        allBinarized,
                        _abs=True,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )

### 1.3 answers to all questions, only before having played

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        allBinarizedBefore,
                        _abs=True,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlations on all questions before',
                    )

thisClustermap, overlay = plotCorrelationMatrix(
                        allBinarizedBefore,
                        _abs=True,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )

### 1.4 answers to all questions, only after having played

In [None]:
plotCorrelationMatrix(
                        allBinarizedAfter,
                        _abs=True,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlation of all answers after',
                    )

# 2. Game sessions
<a id=sessions />

In [None]:
#startDate = minimum152Date
#endDate = maximum152Date

startDate = rmdf['userTime'].min().date() - datetime.timedelta(days=1)
endDate = rmdf['userTime'].max().date() + datetime.timedelta(days=1)

In [None]:
valuesPerDay = rmdf['userTime'].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='RedMetrics events', startDate=startDate, endDate=endDate)
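
The per-day counting used in this section can be sketched on toy timestamps. `plotPerDay` is not reproduced here; the reindexing step is an assumption about how days without events could be shown as zero.

```python
import pandas as pd

# Toy events: two on 2017-09-01, none on 09-02, one on 09-03.
toy_events = pd.DataFrame({
    "userTime": pd.to_datetime([
        "2017-09-01 10:00", "2017-09-01 18:30",
        "2017-09-03 09:15",
    ], utc=True),
})

# Count events per calendar day, ordered by date.
per_day = toy_events["userTime"].dt.date.value_counts().sort_index()

# Reindex over the full date range so that empty days appear with a count of 0.
all_days = pd.date_range("2017-09-01", "2017-09-04").date
per_day_filled = per_day.reindex(all_days, fill_value=0)
print(per_day_filled)
```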

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
valuesPerDay = rmdf[rmdf['type'] == 'start']['userTime'].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='sessions', startDate=startDate, endDate=endDate)

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
valuesPerDay = rmdf.groupby('userId').agg({ "userTime": np.min })['userTime'].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='game users', startDate=startDate, endDate=endDate)

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
valuesPerDay = sampledForm.groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='survey answers', startDate=startDate, endDate=endDate)

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
beforesPerDay = sampledForm[sampledForm[QTemporality] == answerTemporalities[0]].groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()
aftersPerDay = sampledForm[sampledForm[QTemporality] == answerTemporalities[1]].groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()
undefinedPerDay = sampledForm[sampledForm[QTemporality] == answerTemporalities[2]].groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()

plotPerDay(beforesPerDay, title='survey befores', startDate=startDate, endDate=endDate)
plotPerDay(aftersPerDay, title='survey afters', startDate=startDate, endDate=endDate)
plotPerDay(undefinedPerDay, title='survey undefined', startDate=startDate, endDate=endDate)

# 3. Per session and per user analysis
<a id=peruser />

# 4. User comparison
<a id=usercomp />

to do: transfer part of 1.3's "'Google form analysis' functions tinkering" code here

## percentagesCrossCorrect

In [None]:
#pretests = gform[gform[QTemporality] == answerTemporalities[0]]
#pretests[pretests[QBBFunctionPLASMID] == ]

In [None]:
binarized = sciBinarizedBefore
intermediaryNumerator = getCrossCorrectAnswers(binarized).round().astype(int)*100
percentagesCrossCorrect = (intermediaryNumerator / binarized.shape[0]).round().astype(int)
totalPerQuestion = np.dot(np.ones(binarized.shape[0]), binarized)
sciBinarizedBefore.columns[totalPerQuestion == 0]

In [None]:
def getPercentageCrossCorrect(binarized, figsize=(40,100)):
    
    cbar_kws = dict(orientation= "horizontal")
    #cbar_kws = dict(orientation= "horizontal",location="top")
    #cbar_kws = dict(orientation= "horizontal", position="top")
    
    intermediaryNumerator = getCrossCorrectAnswers(binarized).round().astype(int)*100
    percentagesCrossCorrect = (intermediaryNumerator / binarized.shape[0]).round().astype(int)
    _fig = plt.figure(figsize=figsize)
    _ax = plt.subplot(121)
    _ax.set_title('percentage correct')
    sns.heatmap(percentagesCrossCorrect,ax=_ax,cmap=plt.cm.jet,square=True,annot=True,fmt='d',cbar_kws=cbar_kws)
    
    totalPerQuestion = np.dot(np.ones(binarized.shape[0]), binarized)
    totalPerQuestion[totalPerQuestion == 0] = 1
    percentagesConditionalCrossCorrect = (intermediaryNumerator / totalPerQuestion).fillna(0).round().astype(int)
    _ax = plt.subplot(122)
    _ax.set_title('percentage correct, conditionally: p(y | x)')
    sns.heatmap(percentagesConditionalCrossCorrect,ax=_ax,cmap=plt.cm.jet,square=True,annot=True,fmt='d',cbar_kws=cbar_kws)
    
    plt.tight_layout()
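
A small worked example of the two percentages, without the plotting. It assumes that `getCrossCorrectAnswers(binarized)` returns the co-occurrence counts `binarized.T.dot(binarized)`; that is a guess from how the result is used, not something confirmed by the function library.

```python
import pandas as pd

# Toy binarized answers: 4 surveys, 3 questions, 1 = correct.
toy_binarized = pd.DataFrame({
    "Q1": [1, 1, 1, 0],
    "Q2": [1, 0, 0, 0],
    "Q3": [0, 0, 0, 0],
})

# Entry (row, column) = number of surveys with both questions correct (assumed).
cross_counts = toy_binarized.T.dot(toy_binarized)

# Joint percentage: share of all surveys with both questions correct.
n_surveys = toy_binarized.shape[0]
pct_joint = (100 * cross_counts / n_surveys).round().astype(int)

# Conditional percentage: each column is divided by the number of surveys
# that answered the column's question correctly, i.e. p(row | column).
totals = toy_binarized.sum(axis=0).to_numpy().astype(float)
totals[totals == 0] = 1  # avoid division by zero for never-correct questions
pct_conditional = (100 * cross_counts / totals).round().astype(int)

print(pct_joint)
print(pct_conditional)
```

On a heatmap the columns lie along the x axis and the rows along the y axis, which matches the `p(y | x)` reading in the plot title.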

In [None]:
getPercentageCrossCorrect(sciBinarizedBefore, figsize=(40,40))

In [None]:
getPercentageCrossCorrect(sciBinarizedAfter, figsize=(40,40))

In [None]:
# small sample
#allData = getAllUserVectorData( getAllUsers( rmdf )[:10] )

# complete set
#allData = getAllUserVectorData( getAllUsers( rmdf ) )

# subjects who answered the sampledForm
allData = getAllUserVectorData( getAllResponders(sampledForm), _source = correctAnswers )

# 10 subjects who answered the sampledForm
#allData = getAllUserVectorData( getAllResponders(sampledForm)[:10] )

In [None]:
#pretestData = getAllUserVectorData( sampledForm[sampledForm[QTemporality] == answerTemporalities[0]], _source = correctAnswers )
#posttestData = getAllUserVectorData( sampledForm[sampledForm[QTemporality] == answerTemporalities[1]], _source = correctAnswers )

In [None]:
plotAllUserVectorDataCorrelationMatrix(allData.T, _abs=True, _figsize = (40,40))

In [None]:
allData.shape

In [None]:
#allBinarized

# 5. Game map
<a id=map />

# Player filtering

In [None]:
#players = rmdf.loc[:, playerFilteringColumns]
players = safeGetNormalizedRedMetricsCSV( rmdf )
players.shape

In [None]:
#players = players.dropna(how='any')
#players.head(1)
#rmdf.head(1)

In [None]:
players.shape[0]

In [None]:
#players = players[~players['userId'].isin(excludedIDs)];
#players.shape[0]

## Sessions (filtered)

In [None]:
sessionscount = players["sessionId"].nunique()
sessionscount

## Sessions of dev IDs

## Unique players

In [None]:
uniqueplayers = players['userId']
uniqueplayers = uniqueplayers.unique()
uniqueplayers.shape[0]

In [None]:
#uniqueplayers

## Unique platforms

In [None]:
uniqueplatforms = players['customData.platform'].unique()
uniqueplatforms

## Checkpoints passed / furthest checkpoint (unfiltered)

In [None]:
checkpoints = rmdf.loc[:, ['type', 'section', 'sessionId']]
checkpoints = checkpoints[checkpoints['type']=='reach'].loc[:,['section','sessionId']]
checkpoints = checkpoints[checkpoints['section'].str.startswith('tutorial', na=False)]
checkpoints = checkpoints.groupby("sessionId")
checkpoints = checkpoints.max()
#len(checkpoints)
checkpoints.head()

In [None]:
maxCheckpointTable = pd.DataFrame({"maxCheckpoint" : checkpoints.values.flatten()})
maxCheckpointCounts = maxCheckpointTable["maxCheckpoint"].value_counts()
maxCheckpointCounts['Start'] = None
maxCheckpointCounts = maxCheckpointCounts.sort_index()
print('\nmaxCheckpointCounts=\n{0}'.format(str(maxCheckpointCounts)))
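
The furthest-checkpoint computation can be followed on a toy event log. The section names below are hypothetical placeholders, not the actual RedMetrics values.

```python
import pandas as pd

# Toy event log: s1 reaches two checkpoints, s2 reaches one, s3 none.
toy_events = pd.DataFrame({
    "type": ["start", "reach", "reach", "start", "reach", "start"],
    "section": [None, "tutorial1.Checkpoint00", "tutorial1.Checkpoint01",
                None, "tutorial1.Checkpoint00", None],
    "sessionId": ["s1", "s1", "s1", "s2", "s2", "s3"],
})

# Keep only 'reach' events on tutorial sections.
reached = toy_events[toy_events["type"] == "reach"]
reached = reached[reached["section"].str.startswith("tutorial", na=False)]

# Furthest checkpoint per session: the maximum of the section strings.
furthest = reached.groupby("sessionId")["section"].max()

# Number of sessions per furthest checkpoint.
counts = furthest.value_counts().sort_index()

# Sessions with no checkpoint at all are absent from 'furthest'.
never_reached = toy_events["sessionId"].nunique() - len(furthest)
print(counts)
print(never_reached)
```

Note that `max()` on strings is lexicographic, so this ordering is only meaningful if checkpoint names sort alphabetically in game order (zero-padded numbers, for example).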

In [None]:
maxCheckpointCountsTable = pd.DataFrame({"maxCheckpoint" : maxCheckpointCounts.values})
maxCheckpointCountsTableCount = maxCheckpointCountsTable.sum(0).iloc[0]
maxCheckpointCountsTableCount

In [None]:
checkpoints.count()

In [None]:
maxCheckpointCountsTable.head()

In [None]:
maxCheckpointCountsTable.describe()

In [None]:
genericTreatment( maxCheckpointCountsTable, "best checkpoint reached", "game sessions", 0, maxCheckpointCountsTableCount, False, True )

## Session starts

In [None]:
#starts = rmdf.loc[:, checkpointsRelevantColumns]
#starts = checkpoints[checkpoints['type']=='start'].loc[:,['playerId']]
#starts = checkpoints[checkpoints['section'].str.startswith('tutorial', na=False)]
#starts = checkpoints.groupby("playerId")
#starts = checkpoints.max()
#starts.head()

In [None]:
startTutorial1Count = sessionscount
neverReachedGameSessionCount = startTutorial1Count - maxCheckpointCountsTableCount
# copy so that setting 'Start' below does not modify maxCheckpointCounts
fullMaxCheckpointCounts = maxCheckpointCounts.copy()
fullMaxCheckpointCounts['Start'] = neverReachedGameSessionCount
fullMaxCheckpointCountsTable = pd.DataFrame({"fullMaxCheckpoint" : fullMaxCheckpointCounts.values})

genericTreatment( fullMaxCheckpointCountsTable, "best checkpoint reached", "game sessions", 0, startTutorial1Count, False, True )

print('\nfullMaxCheckpointCountsTable=\n{0}'.format(fullMaxCheckpointCountsTable))
fullMaxCheckpointCountsTable.describe()

## Duration

Duration of playing sessions

In [None]:
# string aggregation names give stable "min"/"max" column labels across pandas versions
durations = players.groupby("sessionId").agg({ "serverTime": [ "min", "max" ] })
durations["duration"] = pd.to_datetime(durations["serverTime"]["max"]) - pd.to_datetime(durations["serverTime"]["min"])
durations["duration"] = durations["duration"].map(lambda x: np.timedelta64(x, 's'))
durations = durations.sort_values(by=['duration'], ascending=[False])
durations.head()

Duration plot

In [None]:
#durations.loc[:,'duration']
#durations = durations[4:]
# total_seconds() covers the whole duration; .seconds would drop whole days
durations["duration_seconds"] = durations["duration"].map(lambda x: pd.Timedelta(x).total_seconds())
maxDuration = np.max(durations["duration_seconds"])
durations["duration_rank"] = durations["duration_seconds"].rank(ascending=False)
durations.plot(x="duration_rank", y="duration_seconds")
plt.xlabel("game session")
plt.ylabel("time played (s)")
plt.legend('')
plt.xlim(0, sessionscount)
plt.ylim(0, maxDuration)
durations["duration_seconds"].describe()
#durations.head()
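
A minimal sketch of the per-session duration computation on toy data, which also shows how the `seconds` component of a timedelta differs from `total_seconds()` once a session spans more than a day. The timestamps are invented for illustration.

```python
import pandas as pd

# Toy events: s1 lasts 5 min 30 s, s2 lasts 1 day, 1 hour and 10 seconds.
toy_players = pd.DataFrame({
    "sessionId": ["s1", "s1", "s2", "s2"],
    "serverTime": pd.to_datetime([
        "2017-09-01 10:00:00", "2017-09-01 10:05:30",
        "2017-09-01 23:00:00", "2017-09-03 00:00:10",
    ]),
})

# Duration = last event time minus first event time, per session.
grouped = toy_players.groupby("sessionId")["serverTime"]
durations = (grouped.max() - grouped.min()).to_frame("duration")

# .dt.seconds is only the seconds component (whole days are excluded);
# .dt.total_seconds() is the full duration expressed in seconds.
durations["component_seconds"] = durations["duration"].dt.seconds
durations["total_seconds"] = durations["duration"].dt.total_seconds()
print(durations)
```

For sessions shorter than a day both columns agree, so the difference only matters for sessions left open for a long time.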