# Predict Correctness of Withheld Question

A major goal of this project has been to show that placing questions in topic space provides us with a richer picture of a person's knowledge than simply reporting the fraction of questions the person answered correctly. One way of showing this is to demonstrate that, if we take into account the questions' positions in topic space, we're able to predict the correctness of the participant's response to a held-out question with greater accuracy than if we make our prediction considering only the percentage of the other (non-held-out) questions the participant answered correctly.

Below is my attempt to make such a comparison.

-Will Baxley

### Imports

In [1]:
import numpy as np
import pandas as pd
import os
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

### Read in the data

In [2]:
# the locations of some relevant directories
vid_transc_dir = '../video transcript analysis'
answers_dir = '../graded_answers/'

In [3]:
# the graded answers
results = pd.read_csv(os.path.join(answers_dir, "Graded_results_19f_49.csv"))
results.head()

Unnamed: 0.1,Unnamed: 0,correct?,participantID,qID,set,video
0,0,0,40,29,0,2
1,1,0,40,31,0,0
2,2,0,40,17,0,2
3,3,1,40,36,0,0
4,4,0,40,7,0,1


### Determine our accuracy at predicting held-out questions using only each individual's percentage correct
For each participant, hold out each question, one at a time, and try to predict whether the held-out question was answered correctly or incorrectly by considering the participants' average correctness across the other (non-held-out) questions. Report the total accuracy of all such preductions.

In [4]:
def predict_using_own_accuracy(data):
    # the numerator and demonimator of the accuracy term we'll return
    correct = total = 0
    
    # iterate across each participant
    for participant in data["participantID"].unique():
        
        # filter the data to include only our given participant
        participant_responses = data[data["participantID"] == participant]
        
        # iterate across each question the participant answered (in the given section)
        for question in participant_responses["qID"]:
            
            # the participant's percentage correct across all questions except the held-out one 
            avg_correct = np.mean(participant_responses[participant_responses["qID"] != question]["correct?"])
            
            # if avg_correct is at least 50%, guess correct; otherwise, guess incorrect
            predicted_response = 1 if (avg_correct >= 0.5) else 0
            
            # the true correctness of the held-out question
            true_response = int(participant_responses[participant_responses["qID"] == question]["correct?"])
            
            # update correct and total
            if predicted_response == true_response:
                correct += 1
            
            total += 1
    
    # return the total accuracy
    return correct / total 

#### Break the data into three sections: (0) questions before video 1, (1) questions between video 1 and 2, and (2) questions after video 2
We expect participants' knowledge to change after watching each video since, presumably, they're learning. We see, after all, that across all participants, the accuracy in answering questions increases from ~46% to ~78% from before video 1 (section 0) to after video 2 (section 2). As a result, I think it makes sense to consider each of these sections separately; in other words, to say that each participant has a unique "knowledge" - whatever that is - that differs across the three sections.

In [5]:
set0_results = results[results["set"] == 0]
set1_results = results[results["set"] == 1]
set2_results = results[results["set"] == 2]

Below is the average percencentage correct in each section. Any model that doesn't predict the correctness of held-out questions with at least this accuracy is worse than the naive always-guess-correct model.

In [6]:
set0_results["correct?"].mean()

0.4553846153846154

In [7]:
set1_results["correct?"].mean()

0.6030769230769231

In [8]:
set2_results["correct?"].mean()

0.7846153846153846

#### Within each section, across all participants, report the accuracy of predicting success at held-out question

By "accuracy", I mean (number of correct predictions) / (number of total predictions). Other performance metrics might be more appropriate, but I chose this one as a first attempt because it's simple.

In [9]:
predict_using_own_accuracy(set0_results)

0.5707692307692308

In [10]:
predict_using_own_accuracy(set1_results)

0.5707692307692308

In [11]:
predict_using_own_accuracy(set2_results)

0.7892307692307692

### Determine our accuracy at predicting held-out questions using all other participants' success at those questions
This calculation is a bit tangential to the main point of this notebook, but it's something I was curious about. For each participant, consider each question and predict whether or not it was answered correctly by considering how successful the *other* participants were, on average, at answering that question. Again, report the total accuracy of all such predictions.

In [12]:
def predict_using_others_accuracy(data):
    # for determining accuracy
    correct = total = 0
    
    # iterate across each participant
    for participant in data["participantID"].unique():
        
        # 1. compute the percentage correct on each question, excluding the responses of the given participant
        avg_response = data[data["participantID"] != participant].groupby("qID").mean()["correct?"]
        
        # 2. for each question the participant answered, try to guess if it will be correct or not
        participant_responses = data[data["participantID"] == participant]  # the responses of our given participant
        
        for question in participant_responses["qID"]:
            guess = 1 if (avg_response[question] >= 0.5) else 0
            
            true_response = int(participant_responses[participant_responses["qID"] == question]["correct?"])
            
            if guess == true_response:
                correct += 1
            
            total += 1
    
    return correct / total 

#### Within each section, across all participants, report the accuracy of predicting success at held-out question

In [13]:
predict_using_others_accuracy(set0_results)

0.6507692307692308

In [14]:
predict_using_others_accuracy(set1_results)

0.7215384615384616

In [15]:
predict_using_others_accuracy(set2_results)

0.8046153846153846

## Now fit some topic models
I took much of the code in this section from predict-knowledge-topic-space.inpyb. Refer to it for questions about the choices I've made (e.g. what text I used to fit the model, what stopwords I used, etc.).

#### Load in the video transcripts

In [16]:
# Four Forces transcript, diced via a sliding window
forces_video_df = pd.read_csv(os.path.join(vid_transc_dir,'fourforcesdiced.tsv'), 
                            error_bad_lines=False, header=None, sep='\t', usecols=[0])
forces_video_samples = forces_video_df[0].tolist()

# Birth of Stars transcript, diced via a sliding window
bos_video_df = pd.read_csv(os.path.join(vid_transc_dir, 'birthofstarsdiced.tsv'), 
                            error_bad_lines=False, header=None, sep='\t', usecols=[0])
bos_video_samples = pd.Series(bos_video_df[0]).tolist()

#### Load in the questions

In [17]:
# read in the questions as a dataframe
questions_df = pd.read_csv('../data analysis/astronomyquestions.tsv', sep='\t', 
            names=['index', 'video', 'question', 'ans_A', 'ans_B', 'ans_C', 'ans_D'], index_col='index') 

# organize question by type (FFs, BoS, general)
forces_questions_samples = questions_df.loc[questions_df.video == 1].question.tolist()
bos_questions_samples = questions_df.loc[questions_df.video == 2].question.tolist()
general_question_samples = questions_df.loc[questions_df.video == 0].question.tolist()

#### Remove stopwords and punctuation, convert to lowercase, etc.

In [18]:
all_stopwords = stopwords.words('english') + ["let's", "they'd", "they're", "they've", "they'll", "that's", 
                                              "I'll", "I'm"]

def format_text(text):
    """
    Function to format documents for tokenization and modeling
    """
    
    clean_text = []
    
    for sentence in text:
        no_punc = re.sub("[^a-zA-Z\s']+", '', sentence.lower())
        no_stop = ' '.join([word for word in no_punc.split() if word not in all_stopwords])
        clean = re.sub("'+", '', no_stop)
        clean_text.append(clean)
    
    return clean_text

In [19]:
# format lecture and question text
fvs_formatted = format_text(forces_video_samples)
bvs_formatted = format_text(bos_video_samples)
fqs_formatted = format_text(forces_questions_samples)
bqs_formatted = format_text(bos_questions_samples)
gqs_formatted = format_text(general_question_samples)

#### Initialize some topic modelling parameters

In [20]:
vec_params = {
    'max_df': 0.95,
    'min_df': 2,
    'max_features': 500,
    'stop_words': 'english'
}

lda_params = {
    'n_components': 12,
    'max_iter': 10,
    'learning_method': 'online',
    'learning_offset':50.,
    'random_state': 0
}

#### Count vectorizer

In [21]:
# initialize count vectorizer
tf_vectorizer = CountVectorizer(**vec_params)

# fit to both lectures and all questions
tf_vectorizer.fit(fvs_formatted + fqs_formatted
                  + bvs_formatted + bqs_formatted)

# transform question samples
forces_questions_tf = tf_vectorizer.transform(fqs_formatted)
bos_questions_tf = tf_vectorizer.transform(bqs_formatted)
general_questions_tf = tf_vectorizer.transform(gqs_formatted)

# vectorize the entire corpus (both video transcripts and the related questions)
all_tf = tf_vectorizer.transform(fvs_formatted + fqs_formatted + bvs_formatted + bqs_formatted)

#### LDA

In [22]:
# initialize LDA model, fit to both lectures and lecture-related questions
lda = LatentDirichletAllocation(**lda_params)
lda.fit(all_tf)

# transform questions
forces_q_traj = lda.transform(forces_questions_tf)
bos_q_traj = lda.transform(bos_questions_tf)
general_q_traj = lda.transform(general_questions_tf)

#### Reorganize the questions

In [23]:
# combine the question trajectories into a single list
all_q_traj = list(forces_q_traj) + list(bos_q_traj) + list(general_q_traj)

# insert an empty list at the 0th position so that trajectories can be indexed from 1 to 39 (i.e. by qID)
all_q_traj.insert(0, np.ndarray([]))

#### A method for determining how close two vectors are in space
I was looking for some metric that's higher when two vectors are closer and lower when they're further appart. I chose this method (inverse of euclidean distance) because it's simple, but I'm sure there are many other possible ways of doing this, and I'd be open to suggestions.

In [24]:
# given two equi-dimensional vectors a and b (in np.ndarray form), return 1 / (euclidean_dist(a, b))
def inverse_euclidean_distance(a, b):
    # compute the euclidean distance between the two vectors
    dist = np.linalg.norm(a-b)
    
    # if the distance is 0, return an arbitrary large number
    if dist == 0:
        print("bet you didn't expect that to happen, did you?")
        return 10^6
    
    # otherwise, return 1 / dist
    return 1 / dist

### Determine our accuracy at predicting held-out questions using their proximity in topic space to other questions answered by the participant
For each participant, for each held-out question, compute the participant's "knowledge" at the point of the held-out question by summing the "closeness" to each of the correctly answered questions and subtracting the "closeness" to each incorrectly answered question. Predict correct when knowledge >= 0, otherwise incorrect, and report the total accuracy of all such predictions.

In [25]:
def predict_using_own_knowledge(data, trajs, closeness_metric):
    # for determining accuracy
    correct = total = 0
    
    # iterate across each participant
    for participant in data["participantID"].unique():
        
        # get the data relating only to that one participant
        participant_responses = data[data["participantID"] == participant]
        
        # withhold each question, one at a time, and see how well we can predict it
        for qID in participant_responses["qID"]:
            
            # all the questions answered by the participant except the withheld one
            other_qIDs = [q for q in participant_responses["qID"] if q is not qID]
            
            # for each question in other_qIDs, a record of whether is was answered correctly
            q_correctness = [int(participant_responses[participant_responses["qID"] == q]["correct?"]) for q in other_qIDs]
            
            # tranform terms in q_correcness from the set {0,1} to {-1,1}
            q_correctness = (np.array(q_correctness) * 2) - 1
            
            # determine the magnitude of the individual's knowledge at the location of the withheld question
            knowledge = np.sum([c * closeness_metric(trajs[qID], trajs[q]) for q,c in zip(other_qIDs, q_correctness)])
            
            # our prediction of how the participant answered the given question
            predicted_response = 1 if (knowledge >= 0) else 0
            
            # how the participant actually answered the withheld question
            true_response = int(participant_responses[participant_responses["qID"] == qID]["correct?"])
            
            # update correct and total
            if predicted_response == true_response:
                correct += 1
            total += 1
            
    return correct / total

#### Within each section, across all participants, report the accuracy of predicting success at held-out question

In [26]:
predict_using_own_knowledge(set0_results, all_q_traj, inverse_euclidean_distance)

0.5784615384615385

In [27]:
predict_using_own_knowledge(set1_results, all_q_traj, inverse_euclidean_distance)

0.6

In [28]:
predict_using_own_knowledge(set2_results, all_q_traj, inverse_euclidean_distance)

0.7415384615384616

#### Repeat, but this time use correlation as a measure of closeness

In [29]:
predict_using_own_knowledge(set0_results, all_q_traj, np.correlate)

0.563076923076923

In [30]:
predict_using_own_knowledge(set1_results, all_q_traj, np.correlate)

0.6230769230769231

In [31]:
predict_using_own_knowledge(set2_results, all_q_traj, np.correlate)

0.7492307692307693

### Summary
These results show that our accuracy at predicting the correctness of a held-out question is similar when we (1) consider the participant's accuracy on the other questions and (2) when we instead consider the relashionship between the position of the held-out question and the non-held-out questions in topic space. This is a bit disappointing, but of course there are many ways this procedure could be modified, and some of those might lead to better results.