# Hero.Coli Data Analysis Summary

Interactive list of noteworthy results from the Hero.Coli data analysis.

## Table of Contents

[Preparation](#preparation)
1. [Google form analysis](#sampledForm)
2. [Game sessions](#sessions)
3. [Per session and per user analysis](#peruser)
4. [User comparison](#usercomp)
5. [Game map](#map)

# Preparation
<a id=preparation />

In [None]:
%run "../Functions/6. Time analysis.ipynb"
%run "../Functions/Plot.ipynb"

## Select the sample here

In [None]:
#sampledForm = samplePlaytestPretestPosttestUniqueProfilesVolunteers.copy()
sampledForm = samplePlaytestPretestPosttestUniqueProfilesVolunteersPhase1.copy()
#sampledForm = samplePlaytestPretestPosttestUniqueProfilesVolunteersPhase2.copy()

In [None]:
#rmdf = rmdfPlaytestPretestPosttestUniqueProfilesVolunteers.copy()
rmdf = rmdfPlaytestPretestPosttestUniqueProfilesVolunteersPhase1.copy()
#rmdf = rmdfPlaytestPretestPosttestUniqueProfilesVolunteersPhase2.copy()

In [None]:
# small sample
#allData = getAllUserVectorData( getAllUsers( rmdf )[:10] )

# complete set
#allData = getAllUserVectorData( getAllUsers( rmdf ) )

# subjects who answered the sampledForm
allData = getAllUserVectorData( getAllResponders(sampledForm), _rmDF = rmdf, _gfDF = sampledForm, _source = correctAnswers + demographicAnswers )

# 10 subjects who answered the sampledForm
#allData = getAllUserVectorData( getAllResponders(sampledForm)[:10] )

# 1. Google form analysis
<a id=sampledForm />

## Survey counts

In [None]:
print("sample:               gform")
print("surveys:              %s" % len(gform))
print("unique users:         %s" % getUniqueUserCount(gform))
print("RM before:            %s" % len(gform[gform[QTemporality] == answerTemporalities[0]]))
print("GF before:            %s" % len(getGFormBefores(gform)))
print("RM after:             %s" % len(gform[gform[QTemporality] == answerTemporalities[1]]))
print("GF after:             %s" % len(getGFormAfters(gform)))
print("unique biologists:    %s" % getUniqueUserCount(getSurveysOfBiologists(gform)))
print("unique gamers:        %s" % getUniqueUserCount(getSurveysOfGamers(gform)))
print("unique perfect users: %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(gform)))
print("perfect pairs:        %s" % getPerfectPretestPostestPairsCount(gform))

In [None]:
print("sample:               sampledForm")
print("surveys:              %s" % len(sampledForm))
print("unique users:         %s" % getUniqueUserCount(sampledForm))
print("RM before:            %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[0]]))
print("GF before:            %s" % len(getGFormBefores(sampledForm)))
print("RM after:             %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[1]]))
print("GF after:             %s" % len(getGFormAfters(sampledForm)))
print("unique biologists:    %s" % getUniqueUserCount(getSurveysOfBiologists(sampledForm)))
print("unique gamers:        %s" % getUniqueUserCount(getSurveysOfGamers(sampledForm)))
print("unique perfect users: %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(sampledForm)))
print("perfect pairs:        %s" % getPerfectPretestPostestPairsCount(sampledForm))

#### formatted version for nice display

In [None]:
print("category | count")
print("--- | ---")
print("sample | gform")
print("surveys | %s" % len(gform))
print("unique users | %s" % getUniqueUserCount(gform))
print("RM before | %s" % len(gform[gform[QTemporality] == answerTemporalities[0]]))
print("GF before | %s" % len(getGFormBefores(gform)))
print("RM after | %s" % len(gform[gform[QTemporality] == answerTemporalities[1]]))
print("GF after | %s" % len(getGFormAfters(gform)))
print("unique biologists | %s" % getUniqueUserCount(getSurveysOfBiologists(gform)))
print("unique gamers | %s" % getUniqueUserCount(getSurveysOfGamers(gform)))
print("unique perfect users | %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(gform)))
print("perfect pretest/posttest pairs | %s" % getPerfectPretestPostestPairsCount(gform))
print()
#print("(" + str(pd.to_datetime('today').date()) + ")")
print("("+dataFilesNamesStem+")")

In [None]:
print("category | count")
print("--- | ---")
print("sample | sampledForm")
print("surveys | %s" % len(sampledForm))
print("unique users | %s" % getUniqueUserCount(sampledForm))
print("RM before | %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[0]]))
print("GF before | %s" % len(getGFormBefores(sampledForm)))
print("RM after | %s" % len(sampledForm[sampledForm[QTemporality] == answerTemporalities[1]]))
print("GF after | %s" % len(getGFormAfters(sampledForm)))
print("unique biologists | %s" % getUniqueUserCount(getSurveysOfBiologists(sampledForm)))
print("unique gamers | %s" % getUniqueUserCount(getSurveysOfGamers(sampledForm)))
print("unique perfect users | %s" % getUniqueUserCount(getSurveysOfUsersWhoAnsweredBoth(sampledForm)))
print("perfect pretest/posttest pairs | %s" % getPerfectPretestPostestPairsCount(sampledForm))
print()
#print("(" + str(pd.to_datetime('today').date()) + ")")
print("("+dataFilesNamesStem+")")
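
The count cells above repeat the same rows for `gform` and `sampledForm`. A minimal sketch of a helper that builds the same two-column Markdown table from (label, value) pairs could look like this. The function name `format_count_table` and the numbers in the example are hypothetical placeholders, not values from the data set.

```python
def format_count_table(sample_name, rows):
    """Return a two-column Markdown table as a single string.

    `rows` is a list of (label, value) pairs; `sample_name` fills the 'sample' row.
    """
    lines = ["category | count", "--- | ---", "sample | %s" % sample_name]
    lines += ["%s | %s" % (label, value) for label, value in rows]
    return "\n".join(lines)


# Placeholder counts, for illustration only
example_table = format_count_table(
    "sampledForm",
    [("surveys", 120), ("unique users", 60)],
)
print(example_table)
```

In the notebook, the pairs would be filled with the existing calls, for instance `("unique users", getUniqueUserCount(sampledForm))`.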

### 1.1 complete sample

In [None]:
#plotSamples(getDemographicSamples(sampledForm))

In [None]:
#plotSamples(getTemporalitySamples(sampledForm))

### 1.2 Per temporality

#### 1.2.1 answered only before

In [None]:
gf_befores = getGFormBefores(sampledForm)
rm_befores = getRMBefores(sampledForm)
gfrm_befores = getRMBefores(getGFormBefores(sampledForm))

In [None]:
(gf_befores[QUserId] == rm_befores[QUserId]).all()

In [None]:
#plotSamples(getDemographicSamples(gf_befores))

#### 1.2.2 answered only after

In [None]:
gf_afters = getGFormAfters(sampledForm)
rm_afters = getRMAfters(sampledForm)
gfrm_afters = getRMAfters(getGFormAfters(sampledForm))

In [None]:
(gf_afters[QUserId] == rm_afters[QUserId]).all()

In [None]:
#plotSamples(getDemographicSamples(gf_afters))

#### 1.2.3 answered both before and after

In [None]:
gf_both = getSurveysOfUsersWhoAnsweredBoth(sampledForm, gfMode = True, rmMode = False)
rm_both = getSurveysOfUsersWhoAnsweredBoth(sampledForm, gfMode = False, rmMode = True)
gfrm_both = getSurveysOfUsersWhoAnsweredBoth(sampledForm, gfMode = True, rmMode = True)

In [None]:
#plotSamples(getDemographicSamples(gf_both))

In [None]:
#plotSamples(getDemographicSamples(rm_both))

In [None]:
#plotSamples(getDemographicSamples(gfrm_both))

#### 1.2.4 pretest vs posttest

##### 1.2.4.1 phase1

In [None]:
matrixToDisplay = plotBasicStats(sampledForm, horizontalPlot=True, sortedAlong="progression", figsize=(20,4));

In [None]:
#matrixToDisplay.to_csv("../../data/sortedPrePostProgression.csv")

In [None]:
#matrixToDisplay.T

### 1.3 Per demography

#### 1.3.1 English speakers

In [None]:
cohortEN = sampledForm[sampledForm[QLanguage] == enLanguageID]

In [None]:
#plotSamples(getTemporalitySamples(cohortEN))

#### 1.3.2 French speakers

In [None]:
cohortFR = sampledForm[sampledForm[QLanguage] == frLanguageID]

In [None]:
#plotSamples(getTemporalitySamples(cohortFR))

#### 1.3.3 Female

In [None]:
cohortF = sampledForm[sampledForm[QGender] == 'Female']

In [None]:
#plotSamples(getTemporalitySamples(cohortF))

#### 1.3.4 Male

In [None]:
cohortM = sampledForm[sampledForm[QGender] == 'Male']

In [None]:
#plotSamples(getTemporalitySamples(cohortM))

#### 1.3.5 biologists

##### strict

In [None]:
cohortBioS = getSurveysOfBiologists(sampledForm)

In [None]:
#plotSamples(getTemporalitySamples(cohortBioS))

##### broad

In [None]:
cohortBioB = getSurveysOfBiologists(sampledForm, False)

In [None]:
#plotSamples(getTemporalitySamples(cohortBioB))

#### 1.3.6 gamers

##### strict

In [None]:
cohortGamS = getSurveysOfGamers(sampledForm)

In [None]:
#plotSamples(getTemporalitySamples(cohortGamS))

##### broad

In [None]:
cohortGamB = getSurveysOfGamers(sampledForm, False)

In [None]:
#plotSamples(getTemporalitySamples(cohortGamB))

### 1.4 answers to scientific questions

In [None]:
sciBinarizedBefore = getAllBinarized(getRMBefores(sampledForm))
#sciBinarizedBefore = getAllBinarized(getGFBefores())

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        sciBinarizedBefore,
                        _abs=False,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlations on survey questions before',
                    )

#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
thisClustermap, overlay = plotCorrelationMatrix(
                        sciBinarizedBefore,
                        _abs=True,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )

In [None]:
sciBinarizedAfter = getAllBinarized(getRMAfters(sampledForm))

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        sciBinarizedAfter,
                        _abs=False,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlations on survey questions after',
                    )

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
thisClustermap, overlay = plotCorrelationMatrix(
                        sciBinarizedAfter,
                        _abs=False,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )

thisClustermap.ax_heatmap.annotate(overlay)

dir(thisClustermap)

dir(thisClustermap.ax_heatmap)

vars(thisClustermap)

vars(thisClustermap.ax_heatmap)

### 1.5 answers to all questions

In [None]:
allQuestions = correctAnswers + demographicAnswers

allBinarized = getAllBinarized(sampledForm, _source = allQuestions)
allBinarizedBefore = getAllBinarized(getRMBefores(sampledForm), _source = allQuestions)
allBinarizedAfter = getAllBinarized(getRMAfters(sampledForm), _source = allQuestions)

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        allBinarized,
                        _abs=True,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlation of all answers',
                    )

thisClustermap, overlay = plotCorrelationMatrix(
                        allBinarized,
                        _abs=True,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )

### 1.6 answers to all questions, only before having played

In [None]:
#plotCorrelationMatrix( _binarizedMatrix, _title='Questions\' Correlations', _abs=False, _clustered=False, _questionNumbers=False ):
plotCorrelationMatrix(
                        allBinarizedBefore,
                        _abs=False,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlations on all questions before',
                    )

thisClustermap, overlay = plotCorrelationMatrix(
                        allBinarizedBefore,
                        _abs=True,
                        _clustered=True,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _metric='correlation'
                    )

### 1.7 answers to all questions, only after having played

In [None]:
plotCorrelationMatrix(
                        allBinarizedAfter,
                        _abs=False,
                        _clustered=False,
                        _questionNumbers=True,
                        _annot = True,
                        _figsize = (20,20),
                        _title='Correlation of all answers after',
                    )

# 2. Game sessions
<a id=sessions />

In [None]:
#startDate = minimum152Date
#endDate = maximum152Date

startDate = rmdf['userTime'].min().date() - datetime.timedelta(days=1)
endDate = rmdf['userTime'].max().date() + datetime.timedelta(days=1)

In [None]:
valuesPerDay = rmdf['userTime'].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='RedMetrics events', startDate=startDate, endDate=endDate)

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
valuesPerDay = rmdf[rmdf['type'] == 'start']['userTime'].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='sessions', startDate=startDate, endDate=endDate)

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
valuesPerDay = rmdf.groupby('userId').agg({ "userTime": np.min })['userTime'].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='game users', startDate=startDate, endDate=endDate)

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
valuesPerDay = sampledForm.groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()
plotPerDay(valuesPerDay, title='survey answers', startDate=startDate, endDate=endDate)

In [None]:
valuesPerDay[pd.to_datetime('2017-09-01', utc=True).date():pd.to_datetime('2017-09-30', utc=True).date()]

In [None]:
beforesPerDay = sampledForm[sampledForm[QTemporality] == answerTemporalities[0]].groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()
aftersPerDay = sampledForm[sampledForm[QTemporality] == answerTemporalities[1]].groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()
undefinedPerDay = sampledForm[sampledForm[QTemporality] == answerTemporalities[2]].groupby(localplayerguidkey).agg({ QTimestamp: np.min })[QTimestamp].map(lambda t: t.date()).value_counts().sort_index()

plotPerDay(beforesPerDay, title='survey befores', startDate=startDate, endDate=endDate)
plotPerDay(aftersPerDay, title='survey afters', startDate=startDate, endDate=endDate)
plotPerDay(undefinedPerDay, title='survey undefined', startDate=startDate, endDate=endDate)
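
The per-day counts in this section all follow the same pattern: map each timestamp to its calendar date, count occurrences, and sort by date. For "new users per day", each user is first reduced to their earliest timestamp. A minimal sketch on made-up events follows. The toy frame and the names `events_per_day` and `new_users_per_day` are illustrative assumptions; only the column names `userId` and `userTime` come from the cells above.

```python
import pandas as pd

# Made-up events: three users over three days
events = pd.DataFrame({
    "userId": ["a", "a", "b", "c", "c"],
    "userTime": pd.to_datetime([
        "2017-09-01 10:00", "2017-09-01 10:05", "2017-09-01 18:30",
        "2017-09-02 09:00", "2017-09-03 11:00",
    ], utc=True),
})

# Events per calendar day
events_per_day = events["userTime"].dt.date.value_counts().sort_index()

# New users per day: keep only each user's first event, then count per day
first_seen = events.groupby("userId")["userTime"].min()
new_users_per_day = first_seen.dt.date.value_counts().sort_index()

print(events_per_day)
print(new_users_per_day)
```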

# 3. Per session and per user analysis
<a id=peruser />

# 4. User comparison
<a id=usercomp />

To do: move part of the "'Google form analysis' functions tinkering" code from section 1.3 here.

## percentagesCrossCorrect

In [None]:
#pretests = gform[gform[QTemporality] == answerTemporalities[0]]
#pretests[pretests[QBBFunctionPlasmid] == ]

In [None]:
binarized = sciBinarizedBefore
intermediaryNumerator = getCrossCorrectAnswers(binarized).round().astype(int)*100
percentagesCrossCorrect = (intermediaryNumerator / binarized.shape[0]).round().astype(int)
totalPerQuestion = np.dot(np.ones(binarized.shape[0]), binarized)
sciBinarizedBefore.columns[totalPerQuestion == 0]

In [None]:
def getPercentageCrossCorrect(binarized, figsize=(40,100)):
    
    cbar_kws = dict(orientation= "horizontal")
    #cbar_kws = dict(orientation= "horizontal",location="top")
    #cbar_kws = dict(orientation= "horizontal", position="top")
    
    intermediaryNumerator = getCrossCorrectAnswers(binarized).round().astype(int)*100
    percentagesCrossCorrect = (intermediaryNumerator / binarized.shape[0]).round().astype(int)
    _fig = plt.figure(figsize=figsize)
    _ax = plt.subplot(121)
    _ax.set_title('percentage correct')
    sns.heatmap(
        percentagesCrossCorrect,
        ax=_ax,
        cmap=plt.cm.jet,
        square=True,
        annot=True,
        fmt='d',
        cbar_kws=cbar_kws,
        vmin=0,
        vmax=100,
    )
    
    totalPerQuestion = np.dot(np.ones(binarized.shape[0]), binarized)
    totalPerQuestion[totalPerQuestion == 0] = 1
    percentagesConditionalCrossCorrect = (intermediaryNumerator / totalPerQuestion).round().astype(int).fillna(0)
    _ax = plt.subplot(122)
    _ax.set_title('percentage correct, conditionally: p(y | x)')
    sns.heatmap(
        percentagesConditionalCrossCorrect,
        ax=_ax,
        cmap=plt.cm.jet,
        square=True,
        annot=True,
        fmt='d',
        cbar_kws=cbar_kws,
        vmin=0,
        vmax=100,
    )
    
    plt.tight_layout()

In [None]:
getPercentageCrossCorrect(sciBinarizedBefore, figsize=(40,40))

In [None]:
getPercentageCrossCorrect(sciBinarizedAfter, figsize=(40,40))
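
To make the two heatmaps easier to read, here is a small worked example on a made-up 0/1 answer matrix. It assumes that `getCrossCorrectAnswers` returns the co-occurrence counts, that is, the product of the transposed binarized matrix with itself. That is an assumption about a helper defined elsewhere, and `binarized_demo` is purely illustrative.

```python
import pandas as pd

# Made-up binarized answers: 4 respondents (rows), 3 questions (columns)
binarized_demo = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [0, 0, 1]],
    columns=["Q1", "Q2", "Q3"],
)

# Assumed behaviour of getCrossCorrectAnswers: entry (i, j) counts the
# respondents who answered both question i and question j correctly
cross_correct = binarized_demo.T.dot(binarized_demo)

# Left heatmap: share of all respondents who got both questions right
joint_pct = (100 * cross_correct / len(binarized_demo)).round().astype(int)

# Right heatmap: divide each column j by the number of respondents who got
# question j right, giving p(row correct | column correct)
correct_per_question = binarized_demo.sum(axis=0).replace(0, 1)
conditional_pct = (100 * cross_correct / correct_per_question).round().astype(int)

print(joint_pct)
print(conditional_pct)
```

The conditional matrix is not symmetric: here everyone who got Q2 right also got Q1 right (100), while only two of the three respondents who got Q1 right also got Q2 right (67).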

In [None]:
# small sample
#allData = getAllUserVectorData( getAllUsers( rmdf )[:10] )

# complete set
#allData = getAllUserVectorData( getAllUsers( rmdf ) )

# subjects who answered the sampledForm
allData = getAllUserVectorData( getAllResponders(sampledForm), _source = correctAnswers + demographicAnswers, _rmDF = rmdf, _gfDF = sampledForm )

# 10 subjects who answered the sampledForm
#allData = getAllUserVectorData( getAllResponders(sampledForm)[:10] )

In [None]:
len(sampledForm), len(getAllResponders(sampledForm))

In [None]:
matrixToDisplay = plotBasicStats(sampledForm, horizontalPlot=True, sortedAlong="progression", figsize=(20,4));

In [None]:
subjectCount = allData.shape[1]
measuredPretest = 100*allData.loc[pretestScientificQuestions,:].sum(axis='columns')/subjectCount
measuredPretest.index = scientificQuestions
measuredPosttest = 100*allData.loc[posttestScientificQuestions,:].sum(axis='columns')/subjectCount
measuredPosttest.index = scientificQuestions
measuredDelta2 = (measuredPosttest - measuredPretest)
measuredDelta2 = pd.DataFrame(measuredDelta2.round().astype(int))
measuredDelta2.columns = ["measuredDelta2"]
measuredDelta2 = measuredDelta2.sort_values(by = "measuredDelta2", ascending = True).T
_fig = plt.figure(figsize=(20,2))
_ax1 = plt.subplot(111)
_ax1.set_title("measuredDelta2")
sns.heatmap(
            measuredDelta2,
            ax=_ax1,
            cmap=plt.cm.jet,
            square=True,
            annot=True,
            fmt='d',
            vmin=-100,  # deltas can be negative
            vmax=100,
        )

In [None]:
(matrixToDisplay.loc['progression',scientificQuestions] - measuredDelta2.loc['measuredDelta2',scientificQuestions])

In [None]:
testDF = pd.DataFrame(columns=[
    'pretest1', 'posttest1', 'measuredDelta',
    'pretest2', 'posttest2', 'matrixToDisplay'], data = 0, index= scientificQuestions)
testDF['pretest1'] = measuredPretest
testDF['posttest1'] = measuredPosttest
testDF['measuredDelta'] = measuredDelta2.T['measuredDelta2']
testDF['pretest2'] = matrixToDisplay.T['pretest'][scientificQuestions]
testDF['posttest2'] = matrixToDisplay.T['posttest'][scientificQuestions]
testDF['matrixToDisplay'] = matrixToDisplay.T['progression'][scientificQuestions]
testDF = testDF.round().astype(int)
testDF

In [None]:
measuredDelta = allData.loc[deltaScientificQuestions,:].sum(axis='columns')
print(measuredDelta.mean(), measuredDelta.median())
measuredDelta.sort_values()

In [None]:
#pretestData = getAllUserVectorData( sampledForm[sampledForm[QTemporality] == answerTemporalities[0]], _source = correctAnswers )
#posttestData = getAllUserVectorData( sampledForm[sampledForm[QTemporality] == answerTemporalities[1]], _source = correctAnswers )

In [None]:
plotAllUserVectorDataCorrelationMatrix(
    allData.T,
    _abs=False,
    _figsize = (40,40),
    _clustered=False
)

In [None]:
demographicCriteria = demographicQuestions.copy()

overallScoreCriteria = ["scorepretest", "scoreposttest", "scoredelta",]

stemTimesCriteria = ["ch" + "{0:0=2d}".format(i) for i in range(0,15)]
completionTimesCriteria = [st + "completion" for st in stemTimesCriteria] + ["completionTime"]
totalTimesCriteria = [st + "total" for st in stemTimesCriteria] + ["totalTime"]

plotAllUserVectorDataCorrelationMatrix(
    allData.T,
    _abs=False,
    _figsize = (20,20),
    _clustered=False,
    columnSubset=[]\
        + completionTimesCriteria
        + totalTimesCriteria
        + pretestScientificQuestions
        #+ posttestScientificQuestions
        #+ deltaScientificQuestions
        + overallScoreCriteria
        #+ demographicCriteria
)

In [None]:
#completers = rmdf[rmdf['type'] == 'complete'][QUserId]
#nonCompleter = rmdf[~rmdf[QUserId].isin(completers)][QUserId].iloc[0]

In [None]:
#getUserDataVector(nonCompleter)#.loc[14,:]

In [None]:
#allData.shape

In [None]:
#allData.index

# completed vs played time

In [None]:
data = pd.DataFrame(index=allData.columns, columns=["time", "posttestScore", "deltaScore","completed"])
for userId in data.index:
    data.loc[userId, "time"] = getPlayedTimeUser(userId, _rmDF = rmdf)['tutorial']['totalSpentTime'].total_seconds()
    data.loc[userId, "posttestScore"] = allData.loc['scoreposttest', userId]
    data.loc[userId, "deltaScore"] = allData.loc['scoredelta', userId]
    data.loc[userId, "completed"] = allData.loc['complete', userId]
data.shape

x = allScores.copy()
x2 = completedScores.copy()
y = allPlayedTimes.copy()
y2 = completedPlayedTimes.copy()

plotDF = pd.DataFrame(index = x.index, data = x)
plotDF['times'] = y
#plotDF
#(plotDF['times'] == y).all()

In [None]:
x = data["posttestScore"]
x2 = data[data["completed"]==1]["posttestScore"]
y = data["time"]
y2 = data[data["completed"]==1]["time"]

plt.figure(figsize=(12, 4))
ax1 = plt.subplot(121)
plt.scatter(x, y)#, c='blue', alpha=0.5)
plt.scatter(x2, y2)#, c='red', alpha=0.5)
plt.xlabel('score')
plt.ylabel('time')
plt.title("time against score, n=" + str(len(x)))
#ax1.legend(loc='center left', bbox_to_anchor=(1, 0.5))

ax2 = plt.subplot(122)
plt.scatter(y, x)
plt.scatter(y2, x2)
plt.xlabel('time')
plt.ylabel('score')
plt.title("score against time, n=" + str(len(x)))
ax2.legend(loc='center left', bbox_to_anchor=(-1.2, 0.9), labels =["unfinished games","completed games"])

plt.show()

## linear regression

In [None]:
x = data["posttestScore"].astype(float)
x2 = data[data["completed"]==1]["posttestScore"].astype(float)
y = data["time"].astype(float)
y2 = data[data["completed"]==1]["time"].astype(float)

# Fit a degree-1 polynomial (least-squares line): time = slope * score + intercept
lm_original = np.polyfit(x, y, 1)

# Compute the fitted y value for each observed x from the model coefficients
r_x, r_y = zip(*((i, i*lm_original[0] + lm_original[1]) for i in x))

# Put them into a data frame to keep everything tidy
lm_original_plot = pd.DataFrame({
'scores' : r_x,
'times' : r_y
})

lm_original_plot = lm_original_plot.drop_duplicates()
lm_original_plot = lm_original_plot.sort_values(by="scores")
lm_original_plot = lm_original_plot.drop(lm_original_plot.index[1:-1])

In [None]:
plt.figure(figsize=(6, 4))
ax = plt.subplot(111)
plt.scatter(x, y)
plt.scatter(x2, y2)
# Plot the original data and model
#lm_original_plot.plot(kind='line', color='Red', x='scores', y='times', ax=ax)
plt.plot('scores', 'times', data=lm_original_plot, color='Red')
plt.xlabel('score')
plt.ylabel('time') 
plt.show()

## linear regression 2

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

x = data["posttestScore"].astype(float)
x2 = data[data["completed"]==1]["posttestScore"].astype(float)
y = data["time"].astype(float)
y2 = data[data["completed"]==1]["time"].astype(float)

xReshaped = x.values.reshape(-1, 1)

# Create linear regression object
regr = linear_model.LinearRegression()

# Fit the model on the whole sample (there is no train/test split here)
regr.fit(xReshaped, y)

# Predict on the same sample the model was fitted on
pred = regr.predict(xReshaped)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y, pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(y, pred))

# Plot outputs
plt.scatter(x, y, color='black')
plt.plot(x, pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

In [None]:
regr.intercept_,regr.coef_

## linear regression 3

In [None]:
sns.regplot(x=x, y=y, color="b")
plt.scatter(x2, y2, color='red')
plt.xlabel("score")
plt.ylabel("time played")

data = pd.DataFrame(index = range(0, len(xReshaped)), data = xReshaped, columns = ['score'])

data['time'] = y.values

data

Source: https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/

data2 = data.loc[:, ["time", "posttestScore"]]
data2.index = range(0, data.shape[0])

data2

## linear regression 4

In [None]:
#import patsy
import statsmodels.formula.api as smf

In [None]:
data2 = data.astype(float)

In [None]:
### STATSMODELS ###

timeScoreformula = 'time ~ posttestScore'

# create a fitted model
lm1 = smf.ols(formula=timeScoreformula, data=data2).fit()

# print the coefficients
#lm1.params


#lm1.summary()

In [None]:
# print the confidence intervals for the model coefficients
lm1.conf_int()

In [None]:
# print the p-values for the model coefficients
# Each p-value is the probability of observing an estimate at least this extreme if the true coefficient were zero
lm1.pvalues

In [None]:
# print the R-squared value for the model
lm1.rsquared
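
The four regression variants above (`np.polyfit`, scikit-learn, `sns.regplot`, statsmodels) all fit the same least-squares line of time against score. As a sanity check on what R-squared means, here is a minimal sketch that computes it by hand on made-up numbers. The arrays `score` and `time_played` are illustrative, not taken from the data set.

```python
import numpy as np

# Made-up sample: posttest score and played time in seconds
score = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
time_played = np.array([310.0, 390.0, 520.0, 580.0, 700.0])

# Least-squares line: time_played = slope * score + intercept
slope, intercept = np.polyfit(score, time_played, 1)
predicted = slope * score + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((time_played - predicted) ** 2)
ss_tot = np.sum((time_played - time_played.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# With a single predictor, R^2 equals the squared Pearson correlation
pearson_r = np.corrcoef(score, time_played)[0, 1]

print(slope, intercept, r_squared, pearson_r ** 2)
```

Because there is only one predictor, comparing R-squared between the completed and non-completed subsets amounts to comparing the squared score/time correlations within each subset.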

### Completed vs non-completed

In [None]:
### STATSMODELS ###
timeScoreformula = 'time ~ posttestScore'
lm1 = smf.ols(formula=timeScoreformula, data=data2).fit()
lm2 = smf.ols(formula=timeScoreformula, data=data2[data2["completed"] == 0]).fit()
lm3 = smf.ols(formula=timeScoreformula, data=data2[data2["completed"] == 1]).fit()
lm1.rsquared,lm2.rsquared,lm3.rsquared

# Correlations between durations and score on questions

## correlations between completion time of checkpoint n and score on question Q

In [None]:
overallScoreCriteria = ["scorepretest", "scoreposttest", "scoredelta",]

In [None]:
stemTimesCriteria = ["ch" + "{0:0=2d}".format(i) for i in range(0,15)]
completionTimesCriteria = [st + "completion" for st in stemTimesCriteria] + ["completionTime"]
totalTimesCriteria = [st + "total" for st in stemTimesCriteria] + ["totalTime"]

In [None]:
#chosenPrefix = answerTemporalities[0]
chosenPrefix = answerTemporalities[1]
#chosenPrefix = "delta"

chosenCriteria = [chosenPrefix + " " + q for q in scientificQuestions]

durationsScoresCorrelations = pd.DataFrame(index=completionTimesCriteria+totalTimesCriteria, columns=chosenCriteria, data=np.nan)
durationsScoresCorrelations = durationsScoresCorrelations.rename(str, axis='rows')
annotationMatrix = np.empty(shape=[durationsScoresCorrelations.shape[0], 1], dtype=int)
#annotationMatrix2D = np.empty(durationsScoresCorrelations.shape, dtype=str)

allData2 = allData.T.rename(str,axis="columns")
for i in range(len(durationsScoresCorrelations.index)):
    checkpoint = durationsScoresCorrelations.index[i]
    allData3 = allData2[allData2[checkpoint] < pd.Timedelta.max.total_seconds()]
    annotationMatrix[i] = len(allData3)
    for q in durationsScoresCorrelations.columns:
        corr = np.corrcoef(allData3[checkpoint], allData3[q])
        # corr[0,0] is the self-correlation (always 1); the cross-correlation is corr[0,1]
        if corr[0,1] < 0:
            print("[" + checkpoint + ";" + q + "]:" + str(corr[0,1]))
        durationsScoresCorrelations.loc[checkpoint, q] = corr[0,1]
        
_fig, (_a0, _a1) = plt.subplots(1,2, gridspec_kw = {'width_ratios':[50, 1]}, figsize=(15,10))

_a0.set_title("correlations between times and " + chosenPrefix + " scores")
sns.heatmap(durationsScoresCorrelations, ax=_a0, cmap=plt.cm.jet, square=True, vmin=-1, vmax=1,)
            # annot=annotationMatrix2D
            #cbar_kws= {'panchor':(0.0, 0.0)}

_a1.set_title("")
sns.heatmap(annotationMatrix, ax=_a1, annot=annotationMatrix)

_fig.tight_layout()
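
The loop above drops, for each checkpoint, the users whose duration equals the maximum `Timedelta`, which presumably marks a checkpoint that was never reached, and only then correlates durations with scores. A minimal sketch of that step on made-up data follows. The frame `demo` and its values are hypothetical; only the sentinel comparison and the column naming pattern mirror the cell above.

```python
import numpy as np
import pandas as pd

# Sentinel used in the cell above for durations that were never recorded
SENTINEL = pd.Timedelta.max.total_seconds()

# Made-up user vectors: completion time of one checkpoint and one question score
demo = pd.DataFrame({
    "ch01completion": [30.0, 45.0, 60.0, SENTINEL, 90.0],
    "posttest Q1":    [1, 1, 0, 0, 0],
})

# Keep only users who actually reached the checkpoint
reached = demo[demo["ch01completion"] < SENTINEL]
n_used = len(reached)

# np.corrcoef returns a 2x2 matrix: the diagonal is each series against
# itself, the off-diagonal entry [0, 1] is the correlation of interest
corr_matrix = np.corrcoef(reached["ch01completion"], reached["posttest Q1"])
r = corr_matrix[0, 1]

print(n_used, r)
```

The number of users kept per checkpoint is what the narrow right-hand heatmap displays; a correlation computed on very few users should be read with caution.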

In [None]:
#getAllResponders(sampledForm), _source = correctAnswers, _rmDF = rmdf
#testUserId = "4731525f-62dd-4128-ab56-3991b403e17e"
#getUserDataVector(testUserId,_source = correctAnswers, _rmDF = rmdf)

# 5. Game map
<a id=map />

# Player filtering

In [None]:
#players = rmdf.loc[:, playerFilteringColumns]
players = safeGetNormalizedRedMetricsCSV( rmdf )
players.shape

In [None]:
#players = players.dropna(how='any')
#players.head(1)
#rmdf.head(1)

In [None]:
players.shape[0]

In [None]:
#players = players[~players['userId'].isin(excludedIDs)];
#players.shape[0]

## Sessions (filtered)

In [None]:
sessionscount = players["sessionId"].nunique()
sessionscount

## Sessions of dev IDs

## Unique players

In [None]:
uniqueplayers = players['userId']
uniqueplayers = uniqueplayers.unique()
uniqueplayers.shape[0]

In [None]:
#uniqueplayers

## Unique platforms

In [None]:
uniqueplatforms = players['customData.platform'].unique()
uniqueplatforms

## Checkpoints passed / furthest checkpoint (unfiltered)

In [None]:
checkpoints = rmdf.loc[:, ['type', 'section', 'sessionId']]
checkpoints = checkpoints[checkpoints['type']=='reach'].loc[:,['section','sessionId']]
checkpoints = checkpoints[checkpoints['section'].str.startswith('tutorial', na=False)]
checkpoints = checkpoints.groupby("sessionId")
checkpoints = checkpoints.max()
#len(checkpoints)
checkpoints.head()

In [None]:
maxCheckpointTable = pd.DataFrame({"maxCheckpoint" : checkpoints.values.flatten()})
maxCheckpointCounts = maxCheckpointTable["maxCheckpoint"].value_counts()
maxCheckpointCounts['Start'] = None
maxCheckpointCounts = maxCheckpointCounts.sort_index()
print('\nmaxCheckpointCounts=\n{0}'.format(str(maxCheckpointCounts)))

In [None]:
maxCheckpointCountsTable = pd.DataFrame({"maxCheckpoint" : maxCheckpointCounts.values})
maxCheckpointCountsTableCount = maxCheckpointCountsTable["maxCheckpoint"].sum()
maxCheckpointCountsTableCount

In [None]:
checkpoints.count()

In [None]:
maxCheckpointCountsTable.head()

In [None]:
maxCheckpointCountsTable.describe()

In [None]:
genericTreatment( maxCheckpointCountsTable, "best checkpoint reached", "game sessions", 0, maxCheckpointCountsTableCount, False, True )

## Session starts

In [None]:
#starts = rmdf.loc[:, checkpointsRelevantColumns]
#starts = checkpoints[checkpoints['type']=='start'].loc[:,['playerId']]
#starts = checkpoints[checkpoints['section'].str.startswith('tutorial', na=False)]
#starts = checkpoints.groupby("playerId")
#starts = checkpoints.max()
#starts.head()

In [None]:
startTutorial1Count = sessionscount
neverReachedGameSessionCount = startTutorial1Count - maxCheckpointCountsTableCount
fullMaxCheckpointCounts = maxCheckpointCounts.copy()
fullMaxCheckpointCounts['Start'] = neverReachedGameSessionCount
fullMaxCheckpointCountsTable = pd.DataFrame({"fullMaxCheckpoint" : fullMaxCheckpointCounts.values})

genericTreatment( fullMaxCheckpointCountsTable, "best checkpoint reached", "game sessions", 0, startTutorial1Count, False, True )

print('\nfullMaxCheckpointCountsTable=\n{0}'.format(fullMaxCheckpointCountsTable))
fullMaxCheckpointCountsTable.describe()
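
The furthest-checkpoint logic can be summarized as: keep the `reach` events of tutorial sections, take the maximum section name per session, and count the sessions that never reached any checkpoint as 'Start'. A minimal sketch on made-up events follows. The section names are hypothetical; taking the maximum of strings only yields the furthest checkpoint if the names sort in game order, for instance with zero-padded numbers.

```python
import pandas as pd

# Made-up events for three sessions; s3 never reaches a checkpoint
events = pd.DataFrame({
    "sessionId": ["s1", "s1", "s1", "s2", "s2", "s3"],
    "type":      ["start", "reach", "reach", "start", "reach", "start"],
    "section":   [None, "tutorial.Checkpoint01", "tutorial.Checkpoint02",
                  None, "tutorial.Checkpoint01", None],
})

# Keep only 'reach' events whose section belongs to the tutorial
is_tutorial = events["section"].str.startswith("tutorial", na=False)
reached = events[(events["type"] == "reach") & is_tutorial]

# Furthest checkpoint per session (lexicographic maximum of the names)
furthest = reached.groupby("sessionId")["section"].max()

# Sessions without any 'reach' event
never_reached = events["sessionId"].nunique() - len(furthest)

print(furthest)
print(never_reached)
```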

## Duration

Duration of playing sessions

In [None]:
durations = players.groupby("sessionId").agg({ "serverTime": [ "min", "max" ] })
durations["duration"] = pd.to_datetime(durations["serverTime"]["max"]) - pd.to_datetime(durations["serverTime"]["min"])
durations["duration"] = durations["duration"].map(lambda x: np.timedelta64(x, 's'))
durations = durations.sort_values(by=['duration'], ascending=[False])
durations.head()

Duration plot

In [None]:
type(durations)

In [None]:
#durations.loc[:,'duration']
#durations = durations[4:]
durations["duration_seconds"] = durations["duration"].map(lambda x: pd.Timedelta(x).total_seconds())
maxDuration = np.max(durations["duration_seconds"])
durations["duration_rank"] = durations["duration_seconds"].rank(ascending=False)
ax = durations.plot(x="duration_rank", y="duration_seconds")
plt.xlabel("game session")
plt.ylabel("time played (s)")
#plt.legend('')
ax.legend_.remove()
plt.xlim(0, sessionscount)
plt.ylim(0, maxDuration)
durations["duration_seconds"].describe()
#durations.head()
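
A minimal sketch of the per-session duration computation on made-up timestamps: the duration is the difference between the last and the first `serverTime` of each session. It also shows the difference between the `seconds` component of a `Timedelta`, which ignores whole days, and `total_seconds()`. The toy frame is an illustrative assumption.

```python
import pandas as pd

# Made-up events: s1 lasts 12.5 minutes, s2 spans more than one day
events = pd.DataFrame({
    "sessionId": ["s1", "s1", "s2", "s2"],
    "serverTime": pd.to_datetime([
        "2017-09-01 10:00:00", "2017-09-01 10:12:30",
        "2017-09-01 23:00:00", "2017-09-03 01:00:00",
    ]),
})

# First and last event per session, then their difference
bounds = events.groupby("sessionId")["serverTime"].agg(["min", "max"])
duration = bounds["max"] - bounds["min"]

# .dt.seconds is only the seconds component (days are dropped);
# .dt.total_seconds() is the full length of the interval
component_seconds = duration.dt.seconds
total_seconds = duration.dt.total_seconds()

print(component_seconds)
print(total_seconds)
```

For sessions shorter than a day both values agree, so the distinction only matters for sessions left open overnight.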

## Phase 1 vs Phase 2 comparison

### Completion rate

In [None]:
def getCompletedRate(_rmdf):
    players = _rmdf[QUserId].nunique()
    completers = _rmdf[_rmdf['type'] == 'complete'][QUserId].nunique()
    return float(completers)/float(players)

In [None]:
(
    getCompletedRate(rmdfPlaytestPretestPosttestUniqueProfilesVolunteers),
    getCompletedRate(rmdfPlaytestPretestPosttestUniqueProfilesVolunteersPhase1),
    getCompletedRate(rmdfPlaytestPretestPosttestUniqueProfilesVolunteersPhase2),
)
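
The completion rate is the number of distinct users with at least one `complete` event divided by the number of distinct users. A self-contained sketch on made-up events follows. The name `completed_rate`, the `user_col` parameter, and the guard against an empty frame are illustrative additions, not part of `getCompletedRate`.

```python
import pandas as pd


def completed_rate(events, user_col="userId"):
    """Share of distinct users who have at least one 'complete' event."""
    players = events[user_col].nunique()
    if players == 0:
        # Avoid a division by zero on an empty sample
        return float("nan")
    completers = events.loc[events["type"] == "complete", user_col].nunique()
    return completers / players


# Made-up events: users a and c complete the game, b and d do not
demo = pd.DataFrame({
    "userId": ["a", "a", "b", "c", "c", "d"],
    "type":   ["start", "complete", "start", "start", "complete", "start"],
})

rate = completed_rate(demo)
print(rate)
```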

### Played time on critical checkpoints

# Best players

In [None]:
getRecordPlayer(rmdf1522, gform)

In [None]:
getRecordPlayer(rmdf160, gform)