# Google form analysis tests

Purpose: determine to what extent the current data can accurately describe correlations and underlying factors that affect the score.
In particular, for the 'before' groups: are there underlying subgroups that explain the discrepancies in score? Are those subgroups tied to certain questions?

## Table of Contents


[Cross-samples t-tests](#crossttests)

   - [biologists vs non-biologists](#biologistsvsnonbiologists)
   
   - [biologists vs non-biologists *before*](#biologistsvsnonbiologistsbefore)
   
[PCAs](#PCAs)

In [None]:
%run "../Functions/1. Google form analysis.ipynb"

## Cross-samples t-tests
<a id=crossttests />


Purpose: find out whether a question can be used to discriminate different groups.
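
The cells below compare total scores, whereas the stated purpose is about individual questions. A minimal sketch of a per-question comparison follows. It uses synthetic 0/1 answers rather than the survey data, and the names `group_a` and `group_b` are hypothetical stand-ins for the binarized answers of two groups.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-in for binarized answers: rows are respondents,
# columns are questions, 1 = correct answer, 0 = incorrect answer.
rng = np.random.default_rng(0)
group_a = rng.binomial(1, [0.9, 0.5, 0.5], size=(200, 3))
group_b = rng.binomial(1, [0.2, 0.5, 0.5], size=(200, 3))

# One Welch t-test per question: axis=0 runs the test column by column.
result = ttest_ind(group_a, group_b, axis=0, equal_var=False)

# Questions with the smallest p-values separate the two groups best.
ranking = np.argsort(result.pvalue)
print(ranking)
print(result.pvalue)
```

When many questions are tested at once, some low p-values are expected by chance, so a multiple-comparison correction is probably worth considering before drawing conclusions.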

### biologists vs non-biologists
<a id=biologistsvsnonbiologists />

In [None]:
biologists = getSurveysOfBiologists(gform)
nonBiologists = gform.drop(biologists.index)
biologistsScores = biologists.apply(getGFormRowScore, axis=1)
nonBiologistsScores = nonBiologists.apply(getGFormRowScore, axis=1)
#print(len(gform), len(biologists), len(nonBiologists))
#print(len(gform), len(biologistsScores), len(nonBiologistsScores))
#print(type(biologistsScores), len(biologistsScores),\
#type(nonBiologistsScores), len(nonBiologistsScores))
ttest = ttest_ind(biologistsScores, nonBiologistsScores)
ttest

In [None]:
biologistsScores.values

In [None]:
np.std(biologistsScores)

In [None]:
np.std(nonBiologistsScores)

Conclusion: the two groups have distinct scores.
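
The conclusion above rests on `ttest_ind`, which by default assumes that both groups share the same variance. Since the two standard deviations are inspected separately, comparing with Welch's variant (`equal_var=False`) may be a useful check. The sketch below uses synthetic scores, not the survey data; `scores_a` and `scores_b` are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic scores with unequal sizes and unequal spreads.
rng = np.random.default_rng(1)
scores_a = rng.normal(loc=20.0, scale=2.0, size=30)
scores_b = rng.normal(loc=15.0, scale=6.0, size=120)

student = ttest_ind(scores_a, scores_b)                 # assumes equal variances
welch = ttest_ind(scores_a, scores_b, equal_var=False)  # no such assumption

# np.std defaults to ddof=0 (population standard deviation);
# ddof=1 gives the sample standard deviation that the t-test relies on.
print(np.std(scores_a, ddof=1), np.std(scores_b, ddof=1))
print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
```

If both variants lead to the same decision, the equal-variance assumption is unlikely to be driving the result.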

### biologists vs non-biologists *before*
<a id=biologistsvsnonbiologistsbefore />

In [None]:
gfBefores = getGFormBefores(gform)
biologistsBefores = getSurveysOfBiologists(gfBefores, hardPolicy = False)
nonBiologistsBefores = gfBefores.drop(biologistsBefores.index)
biologistsBeforesScores = biologistsBefores.apply(getGFormRowScore, axis=1)
nonBiologistsBeforesScores = nonBiologistsBefores.apply(getGFormRowScore, axis=1)
#print(len(gfBefores), len(biologistsBefores), len(nonBiologistsBefores))
#print(len(gfBefores), len(biologistsBeforesScores), len(nonBiologistsBeforesScores))
#print(type(biologistsBeforesScores), len(biologistsBeforesScores),\
#type(nonBiologistsBeforesScores), len(nonBiologistsBeforesScores))
ttest = ttest_ind(biologistsBeforesScores, nonBiologistsBeforesScores)
ttest

In [None]:
np.std(biologistsBeforesScores)

In [None]:
nonBiologistsBeforesScores

In [None]:
np.std(nonBiologistsBeforesScores)

## PCAs
<a id=PCAs />


Purpose: find out which questions have the most weight in the computation of the score.

Other leads: LDA, ANOVA.

In [None]:
binarized = getAllBinarized()

In [None]:
score = np.dot(binarized,np.ones(len(binarized.columns)))
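
The dot product with a vector of ones simply adds up each row, so every question contributes equally to the score. A small sketch on a hypothetical 3x3 answer matrix (not the survey data) shows the equivalence with a row-wise sum:

```python
import numpy as np

# Hypothetical binarized answers: 3 respondents, 3 questions.
answers = np.array([[1, 0, 1],
                    [0, 0, 1],
                    [1, 1, 1]])

# Same construction as in the cell above: dot product with a vector of ones.
score_dot = np.dot(answers, np.ones(answers.shape[1]))

# Equivalent and more direct: sum over the question axis.
score_sum = answers.sum(axis=1)

print(score_dot, score_sum)
```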

### Standardizing

In [None]:
from sklearn.preprocessing import StandardScaler
Q_std = StandardScaler().fit_transform(binarized)
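
For reference, standardizing centers each column on its mean and divides it by its standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`). The following NumPy sketch reproduces that computation on a small hypothetical matrix `X`; it assumes that no column is constant, since a zero standard deviation would cause a division by zero here.

```python
import numpy as np

# Hypothetical binarized answers: 4 respondents, 2 questions.
X = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0],
              [1.0, 1.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)        # ddof=0, the population standard deviation
X_std = (X - mean) / std   # each column now has mean 0 and std 1

print(X_std.mean(axis=0), X_std.std(axis=0))
```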

### Covariance Matrix

In [None]:
mean_vec = np.mean(Q_std, axis=0)
cov_mat = (Q_std - mean_vec).T.dot((Q_std - mean_vec)) / (Q_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)

#### Eigendecomposition of the covariance matrix:

In [None]:
cov_mat = np.cov(Q_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

#print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
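
Raw eigenvalues are easier to read once they are sorted and expressed as shares of the total variance. The sketch below does this on synthetic data, not on `binarized`. It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order; this is a swapped-in alternative to the `np.linalg.eig` call above.

```python
import numpy as np

# Synthetic binarized answers: 100 respondents, 4 questions.
rng = np.random.default_rng(2)
X = rng.binomial(1, 0.5, size=(100, 4)).astype(float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# Reorder from the largest to the smallest eigenvalue.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Share of the total variance carried by each principal component.
explained = eig_vals / eig_vals.sum()
print(explained)
print(np.cumsum(explained))

# Loadings of the first component: one coefficient per question.
print(eig_vecs[:, 0])
```

The loadings in `eig_vecs[:, 0]` indicate how strongly each question contributes to the first component, which is one way to approach the question of weight raised in the purpose statement.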

### Correlation Matrix

#### Eigendecomposition of the standardized data based on the correlation matrix:

In [None]:
cor_mat1 = np.corrcoef(Q_std.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat1)

#print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

#### Eigendecomposition of the raw data based on the correlation matrix:

In [None]:
cor_mat2 = np.corrcoef(binarized.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat2)

#print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
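
The three eigendecompositions above are closely related. Correlation is unaffected by standardization, so the raw and standardized correlation matrices coincide. The covariance matrix of data standardized with `ddof=0` equals the correlation matrix multiplied by `n / (n - 1)`, so its eigenvalues should differ only by that factor. A sketch that checks this on synthetic data (not the survey data):

```python
import numpy as np

# Synthetic binarized answers: 50 respondents, 3 questions.
rng = np.random.default_rng(3)
X = rng.binomial(1, 0.4, size=(50, 3)).astype(float)
n = X.shape[0]

# Standardize with ddof=0, as StandardScaler does.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov_std = np.cov(X_std.T)        # divides by n - 1
cor_std = np.corrcoef(X_std.T)
cor_raw = np.corrcoef(X.T)

same_corr = np.allclose(cor_std, cor_raw)
scaled_cov = np.allclose(cov_std, cor_raw * n / (n - 1))
print(same_corr, scaled_cov)
```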