# Answer Consensus

In [11]:
import numpy as np
import pandas as pd
import re
import os
from collections import Counter

## Goal: ## 
This notebook is meant to work hand in hand with our **Convergence Tool**. It simply writes the general consensus for each question to a tidy CSV file.

## Reading In The Data and Organizing It ##
**This step is the same as in our Convergence Tool.**
We need to read in our data files. The code below uses a regex on the file names to pick up every data file for each task. We no longer need the schema, but it may be useful in the future for adding more detail.

In [12]:
# These are all the tasks we have for now. Update this list if we add tasks or if the
# way tasks are represented in the data file names changes.
tasks = ['Evidence', 'Language', 'Probability', 'Reasoning']
tasks = ['Evidence']  # overrides the full list above, so only Evidence is processed
data_dir = '../newDataFormat/'  # trailing / is necessary
data_files = os.listdir(data_dir)
data_files
dfs = {}

for t in tasks:
    data_files_t = [path for path in data_files if re.match('.*{}.*'.format(t), path) ]
    dfs[t] = [pd.read_csv(data_dir + data_file_t) for data_file_t in data_files_t]

# df_master = pd.read_csv('../newDataFormat/BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv')
# The schema-related functions are deprecated
# schema = pd.read_csv("../urap/Demo1Evi-2019-06-22T0023-Schema.csv")
# backup_schema = schema
# df_master.head()
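
As a small illustration of the file selection above, the sketch below applies the same kind of regex to a list of file names. The helper `files_for_task` and the second and third file names are hypothetical, and the pattern is slightly stricter than the one in the cell above (it escapes the task name and requires a `.csv` suffix), so treat it as one possible variant rather than the notebook's exact behavior.

```python
import re

# Example file names; only the first one appears in this notebook,
# the other two are made up for illustration.
example_files = [
    'BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv',
    'BETA_Evidence-2020-01-18T0225-DataHuntSubmitted.csv',
    'notes.txt',
]

def files_for_task(task, file_names):
    """Return the CSV file names that contain the task name (hypothetical helper)."""
    pattern = re.compile(r'.*{}.*\.csv$'.format(re.escape(task)))
    return [name for name in file_names if pattern.match(name)]

evidence_files = files_for_task('Evidence', example_files)
language_files = files_for_task('Language', example_files)
print(evidence_files)
print(language_files)
```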

Again, we aren't using the schemas anymore.

In [13]:
# Deprecated, waiting to be fixed
# textQ = list(schema[schema['question_type'] == 'TEXT']["question_label"])
# textQ

The following code takes in the data and organizes it based on contributor ID. It shows which questions each user had reached as well as their answers to those respective questions.

In [14]:
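'''
s: a series whose elements are lists of selected answers (non-list elements count as no answer)

Return:
    A dataframe, one-hot encoded: one 0/1 column per distinct answer, indexed like s
'''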
def hotcode_multiple_choices(s):
    keys = set()
    for elem in s:
        if type(elem) == list:
            for key in elem:
                keys.add(key)
    
    encode_data = np.zeros((len(s), len(keys)))
    i = 0
    for elem in s:
        if type(elem) == list:
            j = 0
            for key in keys:
                if key in elem:
                    encode_data[i][j] = 1
                j += 1
        i += 1
    encode_data = pd.DataFrame(encode_data, dtype=np.uint8)
    encode_data.columns = keys
    encode_data.index = s.index
    return encode_data

'''
This function gets an article from the 2020 feeds and converts it into the format of the 2019 feeds so the code below can be reused.

df: the dataframe in 2020 format, e.g. BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv
article_id: article_number in df

Return:
    A pandas dataframe similar in format to the 2019 feeds
'''
def getArticle(df, article_id):
    article = df[df['article_number'] == article_id]
    # We use quiz_taskrun_uuid as a dummy index, since it is the primary index in the 2019 feed
    tbl = pd.pivot_table(article, values='answer_label', index='quiz_taskrun_uuid', columns='question_label', aggfunc=list)
    for col in tbl.columns:
        helper = tbl[col].apply(lambda x: len(x) if type(x) == list else 0)
        if max(helper) <= 1:  # just unwrap from the list if everyone chose only one answer
            tbl[col] = tbl[col].apply(lambda x: x[0] if type(x) == list else 0)
        else:
            encoded = hotcode_multiple_choices(tbl[col])
            for col_ in encoded.columns:
                tbl[col_] = encoded[col_]
#             tbl = pd.concat([tbl, encoded], axis=0, join='inner', sort=False)
            tbl = tbl.drop(labels=col, axis=1)
    return tbl
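
To make the reshaping step concrete, here is a minimal sketch on made-up data. It assumes the same column names as the notebook (`quiz_taskrun_uuid`, `question_label`, `answer_label`) but uses `groupby` plus `unstack` and `get_dummies` instead of the functions above, so it shows the idea rather than the exact implementation. Every value in `toy` is invented for illustration.

```python
import pandas as pd

# Toy long-format data: one row per selected answer (values are made up).
toy = pd.DataFrame({
    'quiz_taskrun_uuid': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'question_label': ['T1.Q1', 'T1.Q1', 'T1.Q4', 'T1.Q1', 'T1.Q4'],
    'answer_label': ['T1.Q1.A1', 'T1.Q1.A2', 'T1.Q4.A4', 'T1.Q1.A1', 'T1.Q4.A3'],
})

# One row per task run, one column per question, each cell a list of answers.
wide = (toy.groupby(['quiz_taskrun_uuid', 'question_label'])['answer_label']
           .agg(list)
           .unstack())

# T1.Q1 allows several answers, so expand it into one 0/1 column per answer.
exploded = wide['T1.Q1'].explode()
one_hot = pd.get_dummies(exploded).astype(int).groupby(level=0).max()

print(wide)
print(one_hot)
```

A task run that selected two answers for a question ends up with a 1 in both indicator columns, which is why multi-answer questions show up later as separate labels such as `T1.Q1.A1` and `T1.Q1.A2`.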

## Consensus Function ##
The following function finds the consensus for a question by returning the most frequent answer.

In [15]:
def getConsensus(answers):
    qcounts = Counter(answers)
    maxKey = max(qcounts.keys(), key=lambda key: qcounts[key])
    return maxKey
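
As a quick sanity check of the majority-vote idea, the sketch below uses `Counter.most_common` and also reports what share of the votes the winning answer received. The helper name `majority_answer` and the sample answers are hypothetical. In Python 3.7 and later, ties go to the answer that was seen first.

```python
from collections import Counter

def majority_answer(answers):
    """Return (most frequent answer, its share of all votes). Hypothetical helper."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / sum(counts.values())

answer, share = majority_answer(['A2', 'A2', 'A1', 'A3'])
print(answer, share)

# With a tie, the answer encountered first is returned.
tie_answer, tie_share = majority_answer(['A1', 'A2'])
print(tie_answer, tie_share)
```

Reporting the share alongside the answer makes it easier to spot questions where the "consensus" is only a narrow plurality.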

To help improve the efficiency of our code, the helper below converts answers from strings to integers.

In [16]:
def strToInt(lst):
    vals = lst.unique()
    mapper = {vals[i]:i for i in range(len(vals))}
    return lst.replace(mapper)
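
For comparison, pandas ships a built-in that performs the same kind of first-seen integer coding. The sketch below uses `pd.factorize` on made-up answers. It is offered as a possible alternative, not as what the notebook does.

```python
import pandas as pd

# Made-up answers for illustration.
answers = pd.Series(['T1.Q4.A4', 'T1.Q4.A1', 'T1.Q4.A4'])

# codes[i] is the integer assigned to answers[i]; uniques maps codes back to labels.
codes, uniques = pd.factorize(answers)
print(codes)
print(list(uniques))
```

Keeping `uniques` around lets the integer codes be translated back to the original answer labels after the consensus has been computed.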

Below we put everything together to get the Answer Consensus and collect it in a dataframe, which is then written to a CSV file.

In [17]:
cols = ['Article Number', 'Question Label', 'Answer Label']
lst = []

for t in tasks:
    for df_master in dfs[t]:
        for article_id in df_master['article_number'].unique():
            df_q = getArticle(df_master, article_id)

            # Note: columns[1:] skips the first column of the pivoted table
            for q in df_q.columns[1:]:
                answers = df_q[q].dropna()
                # Only handle questions with fewer than 20 distinct answers
                if len(answers.unique()) < 20:
                    consensus = getConsensus(answers)
                    print('Article Number:', article_id, 'Question:', q, 'Answer:', consensus)
                    lst.append([article_id, q, consensus])

# Build the dataframe once, after all rows have been collected
consensus_df = pd.DataFrame(lst, columns=cols)

Article Number: 1712 Question: T1.Q11 Answer: T1.Q11.A2
Article Number: 1712 Question: T1.Q13 Answer: T1.Q13.A5
Article Number: 1712 Question: T1.Q14 Answer: T1.Q14.A7
Article Number: 1712 Question: T1.Q3 Answer: 0
Article Number: 1712 Question: T1.Q4 Answer: T1.Q4.A4
Article Number: 1712 Question: T1.Q5 Answer: 0
Article Number: 1712 Question: T1.Q6 Answer: 0
Article Number: 1712 Question: T1.Q7 Answer: T1.Q7.A1
Article Number: 1712 Question: T1.Q8 Answer: T1.Q8.A4
Article Number: 1712 Question: T1.Q1.A1 Answer: 1
Article Number: 1712 Question: T1.Q1.A2 Answer: 0
Article Number: 1712 Question: T1.Q1.A3 Answer: 0
Article Number: 1712 Question: T1.Q12.A4 Answer: 0
Article Number: 1712 Question: T1.Q12.A1 Answer: 0
Article Number: 1712 Question: T1.Q12.A3 Answer: 0
Article Number: 1712 Question: T1.Q12.A2 Answer: 0
Article Number: 1712 Question: T1.Q2.A6 Answer: 0
Article Number: 1712 Question: T1.Q2.A8 Answer: 0
Article Number: 1712 Question: T1.Q2.A7 Answer: 0
Article Number: 1712 Ques

In [18]:
consensus_df

| | Article Number | Question Label | Answer Label |
|---|---|---|---|
| 0 | 1712 | T1.Q11 | T1.Q11.A2 |
| 1 | 1712 | T1.Q13 | T1.Q13.A5 |
| 2 | 1712 | T1.Q14 | T1.Q14.A7 |
| 3 | 1712 | T1.Q3 | 0 |
| 4 | 1712 | T1.Q4 | T1.Q4.A4 |
| 5 | 1712 | T1.Q5 | 0 |
| 6 | 1712 | T1.Q6 | 0 |
| 7 | 1712 | T1.Q7 | T1.Q7.A1 |
| 8 | 1712 | T1.Q8 | T1.Q8.A4 |
| 9 | 1712 | T1.Q1.A1 | 1 |


In [23]:
consensus_df.to_csv(r'Answer_Consensus.csv')
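
One detail worth knowing when reading the file back: `to_csv` writes the dataframe index as an unnamed first column by default, and `read_csv` then labels it `Unnamed: 0` unless told otherwise. The sketch below demonstrates this round trip in a temporary directory, using two rows taken from the output above. The variable names are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the consensus output above.
demo = pd.DataFrame(
    [[1712, 'T1.Q11', 'T1.Q11.A2'], [1712, 'T1.Q13', 'T1.Q13.A5']],
    columns=['Article Number', 'Question Label', 'Answer Label'],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Answer_Consensus.csv')
    demo.to_csv(path)                          # index is written as the first column
    plain = pd.read_csv(path)                  # index comes back as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)  # first column is used as the index again

print(list(plain.columns))
print(list(restored.columns))
```

If the index carries no meaning, passing `index=False` to `to_csv` is another way to avoid the extra column.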