# Public Editor Convergence Tool 

In [1]:
import numpy as np
import pandas as pd
import re
import os
from collections import Counter

## Goal: ## 
The goal of this tool is to find the minimum number of users it takes to reach a consensus on each question. We treat the consensus (the most common answer across all users) as the "correct answer". We then use a bootstrapping method: for each candidate group size, we repeatedly draw random samples of users' answers with replacement and check how often the sample's consensus matches the correct answer for that question.

## Reading In The Data and Organizing It ##
We need to read in our data files and their schema. For now we handle them file by file, since each DataHuntAnswers.csv and Schema.csv covers a different set of questions and answers. In the future, we hope to build a tool that reads in all the data easily.

In [2]:
# These are all the tasks we have for now. Update this list if tasks are added or if the way
# tasks are represented in the data file names changes.
tasks = ['Evidence', 'Language', 'Probability', 'Reasoning'] 
# tasks = ['Evidence']
data_dir = '../testing-format/' # trailing / is necessary
data_files = os.listdir(data_dir)
data_files
dfs = {}

for t in tasks:
    data_files_t = [path for path in data_files if re.match('.*{}.*'.format(t), path) ]
    dfs[t] = [pd.read_csv(data_dir + data_file_t) for data_file_t in data_files_t]
# df_master = pd.read_csv('../newDataFormat/BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv')
# Schema-related functions are deprecated
# schema = pd.read_csv("../urap/Demo1Evi-2019-06-22T0023-Schema.csv")
# backup_schema = schema
# df_master.head()
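The file matching above can also be sketched without regular expressions or manual path concatenation. This is a minimal illustration, not the notebook's actual loader: `find_task_files` is a hypothetical helper, and filtering on a `.csv` extension is an assumption about how the data directory is laid out.

```python
import os

def find_task_files(data_dir, task):
    """Return sorted paths of CSV files in data_dir whose name contains the task name.

    Hypothetical helper: the '.csv' filter is an assumption, not part of the original code.
    """
    return sorted(
        os.path.join(data_dir, name)   # os.path.join makes the trailing '/' optional
        for name in os.listdir(data_dir)
        if task in name and name.endswith(".csv")
    )
```

With this helper, `dfs[t]` could be built as `[pd.read_csv(p) for p in find_task_files(data_dir, t)]`.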

In [3]:
# Hard-coded schema
# For simplicity, later code may currently mutate this dict
multi_choice = {"Evidence": ["T1.Q2"], 
                "Language": ["T1.Q1", "T1.Q6"], 
                "Probability": ["T1.Q12"], 
                "Reasoning": ["T1.Q1", "T1.Q2", "T1.Q3", "T1.Q6"]}

The following code takes in the data and organizes it by contributor ID. It shows which questions each user reached and how they answered each of those questions.

In [4]:
'''
s: a series whose elements are lists of answer labels (or NaN)

Return:
    A dataframe, one-hot encoded, with one column per answer label and the same index as s
'''
def hotcode_multiple_choices(s):
    keys = set()
    for elem in s:
        if type(elem) == list:
            for key in elem:
                keys.add(key)
    
    encode_data = np.zeros((len(s), len(keys)))
    i = 0
    for elem in s:
        if type(elem) == list:
            j = 0
            for key in keys:
                if key in elem:
                    encode_data[i][j] = 1
                j += 1
        i += 1
    encode_data = pd.DataFrame(encode_data, dtype=np.uint8)
    encode_data.columns = keys
    encode_data.index = s.index
    return encode_data

'''
This function gets an article from the 2020 feeds and converts it into the format of the 2019 feeds so the code below can be reused

df: the dataframe in 2020 format, e.g. BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv
article_id: article_number in df
task: the task name, used as a key into multi_choice

Return:
    A pandas dataframe similar to the format of the 2019 feeds
'''
def getArticle(df, article_id, task):
    article = df[df['article_number'] == article_id]
    # We use quiz_taskrun_uuid as a dummy index, since it is the primary index in the 2019 feed
    tbl = pd.pivot_table(article, values='answer_label', index='quiz_taskrun_uuid', columns='question_label', aggfunc=list)
    for col in tbl.columns:
        if col in multi_choice[task]:
            encoded = hotcode_multiple_choices(tbl[col])
            for col_ in encoded.columns:
                tbl[col_] = encoded[col_]
            tbl = tbl.drop(labels=col, axis=1)
    return tbl
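The one-hot step can also be expressed with built-in pandas operations. The sketch below is an alternative formulation, not the code used above: `one_hot_lists` is a hypothetical name, and it assumes the series has a unique index (as `quiz_taskrun_uuid` does after pivoting). Unlike the set-based version, its columns come out in sorted order.

```python
import pandas as pd

def one_hot_lists(s):
    """One-hot encode a Series whose elements are lists of labels (or NaN).

    Hypothetical alternative to hotcode_multiple_choices; assumes a unique index.
    """
    exploded = s.explode()                        # one row per label, index repeated
    dummies = pd.get_dummies(exploded).astype(int)  # NaN rows become all zeros
    # Collapse the repeated index back to one row per original entry.
    encoded = dummies.groupby(level=0).max()
    return encoded.reindex(s.index).astype("uint8")

# Toy input with made-up contributor IDs.
toy = pd.Series([["A1", "A2"], ["A2"], float("nan")], index=["u1", "u2", "u3"])
encoded_toy = one_hot_lists(toy)
```

Here `encoded_toy` has columns `A1` and `A2`, with a row of zeros for the contributor who gave no answer.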

## Bootstrapping ##
We used a bootstrapping strategy to find the number of users it takes to match the consensus.

We chose a cutoff of 0.01 (used here like a p-value): the majority of a sample must reach the "correct" consensus more than 99% of the time.

In the future, we hope to integrate the user reputation tool and put weights on the users. As of right now, we do not have that tool, so every user's weight/reputation score is just 1.

The following function finds the consensus of each question by taking the most frequent answer.

In [5]:
def getConsensus(answers):
    qcounts = Counter(answers)
    maxKey = max(qcounts.keys(), key=lambda key: qcounts[key])
    return maxKey
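To illustrate how user weights might later enter the consensus, here is a minimal sketch. `weighted_consensus` and its `weights` parameter are hypothetical and not part of the existing tool; with all weights equal to 1 it picks the same answer as `getConsensus`. Ties go to the answer seen first, because `max` returns the first maximal key.

```python
from collections import defaultdict

def weighted_consensus(answers, weights=None):
    """Return the answer with the largest total weight (hypothetical sketch)."""
    if weights is None:
        weights = [1] * len(answers)   # current situation: every user counts equally
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    # max returns the first key with the largest total, so ties favor the earliest answer.
    return max(totals, key=totals.get)
```

For example, `weighted_consensus(["a", "b", "b"])` selects `"b"`, while giving the first user a weight of 3 (`[3, 1, 1]`) selects `"a"`.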

To help improve the efficiency of our code, we converted all of the answers from strings to integers.

In [6]:
def strToInt(lst):
    vals = lst.unique()
    mapper = {vals[i]:i for i in range(len(vals))}
    return lst.replace(mapper)
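pandas also offers `pd.factorize`, which assigns integer codes in order of first appearance, the same ordering `strToInt` produces. The following sketch uses a few labels in the style of the data; note that `pd.factorize` encodes missing values as -1 by default, so it is assumed here that NaNs were dropped beforehand.

```python
import pandas as pd

answers = pd.Series(["T1.Q4.A3", "T1.Q4.A4", "T1.Q4.A3"])

# codes: integer array; labels: the distinct values, in order of first appearance
codes, labels = pd.factorize(answers)
as_int = pd.Series(codes, index=answers.index)
```

Keeping `labels` around makes it possible to map the integer consensus back to the original answer string.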

The following function bootstraps to find the minimum number of users needed for each question to converge. If no group size up to the maximum reaches more than 99% agreement with the consensus, we report that the question did not converge. The maximum group size can be changed in the future.

In [7]:
# n = number of distinct answers observed for the question
# c = consensus of the entire dataset for the question
# df = the dataframe
# answers = answer column
# max_group_size is read as a global; it is set in the driver loop below

def getN(df, questionName):
    answers = df[questionName].dropna().explode()
#     answers = df[questionName].dropna()
    answers = strToInt(answers)
    n = len(pd.unique(answers))
    c = getConsensus(answers)
    
    for i in range(n + 1, max_group_size):
        count = 0
        for s in range(0, 1000):
            sample = np.random.choice(answers, i, replace=True)
            consensus = getConsensus(sample)
            if consensus == c:
                count += 1
        
        if count/1000 > .99:
            return i
    
    return -1 # -1 marks a question that did not converge; we are probably just ignoring such questions
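The same procedure can be written as a self-contained function that does not depend on a global. This is a sketch under stated assumptions rather than a replacement for `getN`: the name `min_group_size`, the parameters `n_trials`, `threshold`, and `seed`, and the use of a seeded NumPy `Generator` for reproducibility are all introduced here for illustration.

```python
from collections import Counter
import numpy as np

def consensus(values):
    """Most frequent value; ties go to the value seen first."""
    counts = Counter(values)
    return max(counts, key=counts.get)

def min_group_size(answers, max_group_size=100, n_trials=1000, threshold=0.99, seed=0):
    """Smallest sample size whose bootstrap consensus matches the full-data
    consensus in more than `threshold` of trials, or -1 if none is found."""
    rng = np.random.default_rng(seed)         # seeded so repeated runs agree
    answers = np.asarray(answers)
    target = consensus(answers.tolist())
    start = len(set(answers.tolist())) + 1    # same starting size as getN
    for size in range(start, max_group_size):
        hits = 0
        for _ in range(n_trials):
            sample = rng.choice(answers, size=size, replace=True)
            if consensus(sample.tolist()) == target:
                hits += 1
        if hits / n_trials > threshold:
            return size
    return -1
```

When every user gives the same answer, the first size tried already converges; when answers are split evenly, no small group size is expected to clear the 99% bar.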

We ran the function with a maximum group size of 100. We used a print statement to monitor how efficiently it runs, and organized the final results in a pandas DataFrame.

In [8]:
need_group_for_tasks = {}
for t in tasks:
    needed_groups = []
    for df_master in dfs[t]:
        needed_group = 0

        for article_id in df_master['article_number'].unique():
            df = getArticle(df_master, article_id, t)
            cols = ['Question', 'Min']
            lst=[]
            max_group_size = 100 # FEEL FREE TO CHANGE THIS
            for q in df.columns:
                if len(df[q].dropna().explode().unique()) < 20:
                    n = getN(df, q)
                    lst.append(n)
                    print('Article Number:', article_id, "Question:", q, "Converge:", n)
            max_needed = max(lst)
            if max_needed > needed_group:
                needed_group = max_needed
        needed_groups.append(needed_group)
    need_group_for_tasks[t] = max(needed_groups)

Article Number: 1712 Question: T1.Q1 Converge: -1
Article Number: 1712 Question: T1.Q10 Converge: -1
Article Number: 1712 Question: T1.Q11 Converge: -1
Article Number: 1712 Question: T1.Q12 Converge: -1
Article Number: 1712 Question: T1.Q13 Converge: -1
Article Number: 1712 Question: T1.Q14 Converge: -1
Article Number: 1712 Question: T1.Q3 Converge: 2
Article Number: 1712 Question: T1.Q4 Converge: 66
Article Number: 1712 Question: T1.Q5 Converge: -1
Article Number: 1712 Question: T1.Q6 Converge: 11
Article Number: 1712 Question: T1.Q7 Converge: 2
Article Number: 1712 Question: T1.Q8 Converge: -1
Article Number: 1712 Question: T1.Q9 Converge: 13
Article Number: 1712 Question: T1.Q2.A1 Converge: 43
Article Number: 1712 Question: T1.Q2.A3 Converge: 5
Article Number: 1712 Question: T1.Q2.A6 Converge: 7
Article Number: 1712 Question: T1.Q2.A8 Converge: 3
Article Number: 1712 Question: T1.Q2.A4 Converge: 14
Article Number: 1712 Question: T1.Q2.A5 Converge: 65
Article Number: 1712 Question: T

Article Number: 100002 Question: T1.Q6.A3 Converge: 8
Article Number: 100002 Question: T1.Q6.A8 Converge: 5
Article Number: 100002 Question: T1.Q6.A2 Converge: 13
Article Number: 100002 Question: T1.Q6.A1 Converge: 6
Article Number: 100003 Question: T1.Q10 Converge: 15
Article Number: 100003 Question: T1.Q11 Converge: -1
Article Number: 100003 Question: T1.Q12 Converge: -1
Article Number: 100003 Question: T1.Q13 Converge: -1
Article Number: 100003 Question: T1.Q14 Converge: -1
Article Number: 100003 Question: T1.Q15 Converge: -1
Article Number: 100003 Question: T1.Q2 Converge: -1
Article Number: 100003 Question: T1.Q3 Converge: 72
Article Number: 100003 Question: T1.Q5 Converge: -1
Article Number: 100003 Question: T1.Q7 Converge: 2
Article Number: 100003 Question: T1.Q9 Converge: -1
Article Number: 100003 Question: T1.Q1.A2 Converge: 15
Article Number: 100003 Question: T1.Q1.A4 Converge: 29
Article Number: 100003 Question: T1.Q1.A6 Converge: 5
Article Number: 100003 Question: T1.Q1.A3 

In [9]:
getArticle(dfs["Evidence"][0], 1712, "Evidence")

quiz_taskrun_uuid,T1.Q1,T1.Q10,T1.Q11,T1.Q12,T1.Q13,T1.Q14,T1.Q3,T1.Q4,T1.Q5,T1.Q6,...,T1.Q8,T1.Q9,T1.Q2.A1,T1.Q2.A3,T1.Q2.A6,T1.Q2.A8,T1.Q2.A4,T1.Q2.A5,T1.Q2.A7,T1.Q2.A2
01252259-8678-4e9c-9c17-74f0ca564554,"[T1.Q1.A1, T1.Q1.A1]",[T1.Q10.A4],[T1.Q11.A4],[T1.Q12.A3],[T1.Q13.A5],[T1.Q14.A5],,[T1.Q4.A3],[T1.Q5.A2],[T1.Q6.A3],...,[T1.Q8.A5],[T1.Q9.A1],0,0,0,0,0,1,0,0
01c8cf56-e445-44a4-b899-d23293a40137,"[T1.Q1.A3, T1.Q1.A3, T1.Q1.A3, T1.Q1.A2]",[T1.Q10.A3],[T1.Q11.A3],[T1.Q12.A3],[T1.Q13.A7],[T1.Q14.A3],,[T1.Q4.A4],[T1.Q5.A2],[T1.Q6.A3],...,[T1.Q8.A5],[T1.Q9.A1],0,0,0,0,0,1,0,0
022bae38-109b-4399-b8d3-b2101a10154b,[T1.Q1.A1],,,[T1.Q12.A3],[T1.Q13.A5],[T1.Q14.A6],,[T1.Q4.A3],[T1.Q5.A4],,...,[T1.Q8.A3],[T1.Q9.A3],0,0,0,0,0,1,0,0
09a08a19-5829-4bcd-b8af-3fafb7a5f101,"[T1.Q1.A1, T1.Q1.A1]",[T1.Q10.A4],[T1.Q11.A4],[T1.Q12.A4],[T1.Q13.A5],[T1.Q14.A5],,[T1.Q4.A3],,,...,[T1.Q8.A4],[T1.Q9.A1],1,0,0,0,0,0,0,0
09d9d2cf-988c-4dbd-8674-c73b0b79c43c,"[T1.Q1.A2, T1.Q1.A1]",[T1.Q10.A3],[T1.Q11.A3],[T1.Q12.A3],[T1.Q13.A4],[T1.Q14.A7],,[T1.Q4.A3],,,...,[T1.Q8.A4],[T1.Q9.A1],1,0,1,0,0,0,0,0
0ed20626-62af-4289-a986-eeef66272a33,[T1.Q1.A1],[T1.Q10.A5],[T1.Q11.A2],[T1.Q12.A2],[T1.Q13.A7],[T1.Q14.A8],,[T1.Q4.A4],,,...,[T1.Q8.A4],[T1.Q9.A1],1,0,0,0,1,0,0,1
1c798217-c825-426a-90a3-80056ed22f49,[T1.Q1.A2],[T1.Q10.A5],[T1.Q11.A2],[T1.Q12.A2],[T1.Q13.A2],[T1.Q14.A8],,[T1.Q4.A4],,,...,[T1.Q8.A2],[T1.Q9.A1],1,0,0,0,0,0,0,0
20bbec91-66f2-479d-93db-e5d39213207a,[T1.Q1.A1],[T1.Q10.A5],[T1.Q11.A3],[T1.Q12.A3],[T1.Q13.A6],[T1.Q14.A7],,[T1.Q4.A6],,,...,[T1.Q8.A3],[T1.Q9.A1],1,0,0,0,1,0,0,1
2bf71c4f-70e8-435a-8609-3cacaa7e066f,[T1.Q1.A2],[T1.Q10.A5],[T1.Q11.A1],[T1.Q12.A3],[T1.Q13.A6],[T1.Q14.A7],,,,,...,[T1.Q8.A2],[T1.Q9.A1],0,0,0,0,1,0,0,0
31779311-4328-4eec-93d1-195aa3549b27,[T1.Q1.A1],[T1.Q10.A5],[T1.Q11.A2],[T1.Q12.A1],[T1.Q13.A3],[T1.Q14.A9],,[T1.Q4.A4],,,...,[T1.Q8.A2],[T1.Q9.A1],1,0,0,0,0,0,0,0


In [10]:
for key in need_group_for_tasks.keys():
    need_group_for_tasks[key] = [need_group_for_tasks[key]]

In [11]:
pd.DataFrame.from_dict(need_group_for_tasks).to_csv('needed_people_for_tasks.csv')
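The last two cells wrap each value in a list so that `from_dict` can build a one-row frame. A possible shortcut, sketched below, is to pass a list containing the dict itself. The numbers in `needed` are made up purely for illustration and are not results from this notebook.

```python
import pandas as pd

# Hypothetical values, for illustration only.
needed = {"Evidence": 66, "Language": 15}

# A list with one dict yields a one-row DataFrame, with no need to wrap each value.
summary = pd.DataFrame([needed])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = summary.to_csv(index=False)
```

Passing a file name such as `'needed_people_for_tasks.csv'` to `to_csv` would write the same content to disk.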