# Using the gqa code

Tests and example code showing how to use the code from the [gqa repo](https://github.com/dorarad/gqa).

The use cases are:
* create questions
* downsample the questions

In [43]:
from collections import Counter, defaultdict
import random
import pandas as pd
import plotly.graph_objects as go

In [44]:
show_big_figures = True

## Load VG data


A few files need to be loaded before `createQuestions()` can run: x2imageFiles, topFiles, vocabFiles, ..., trainkeys.txt, vg_spatial_imgsInfo.json, ...

I extended `download.sh` to also download all the VG data from version 1.4.

## Create questions

With `args.create` the question generation code is executed: `gen()` -> `genQuestionRep()` -> `generateQuestion()`

## Downsample the questions

With `args.normalize` the subsampling is executed: `select()`, `unbias()`, `toRatio()`, 2x `downsampleQuestions()`, 2x `typesampleQuestions()`, and finally the removal of identical questions. The downsampling is always done on the image (!) level.

* coin: `random.random() < prob`
* select: takes the `goodIDs` for each image and increases an `outCounter` (question["group"]) and a `pretypeCounter` (question["type"]) for each qID
* unbias: takes the `ratios`
  * ratios: from `unbiasRatios()`:
    * if there is only one answer, it gets removed
    * boolean questions: the predominant answer gets a weight below 1, e.g. `ratios["boolean"][cond]["yes"] = min(1, (pn / py) )` if "yes" is the more frequent answer
    * open questions: each answer gets weighted by `smoothed / count` -> infrequent answers get a ratio close to 1, frequent ones less than 1
    * unclear which parameter values are used for the smoothing
    * the ratios are differentiated per `cond`!
* `customSmoother(b, gamma, gup, maxGamma)`
  * are the commented `final` values the ones actually used in the end (final: b = 1, gamma = 1.05, gup = 0.02, maxGamma = 1.38)? How were they chosen?
* toRatio: `ratio = tnum / fnum`, where both are themselves sums of counters
  * unclear what `questionSubtypes` does
  * returns this ratio as probability if > 1, else its inverse
* downsampleQuestions: accept if `(question["group"] not in gProb) or coin(gProb[question["group"]])`, i.e. outProb = gProb of `["categoryRelS", "categoryRelO"]` etc.
  * why is it done on the group level and not on the answer level?
* typesampleQuestions: accept if `(question["type"] not in typeSamples) or coin(typeSamples[question["type"]])`, i.e. accept all which are not 'verify', 'choose', 'logical' and those with the specified probability
  * what???
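
The group-level acceptance rule from the list above can be tried out in isolation. The following is a minimal sketch, not the repo's `downsampleQuestions()`: the helper name `downsample`, the toy questions, the group `otherGroup`, and the probabilities in `gProb` are all made up for illustration.

```python
import random

def coin(prob):
    """Return True with probability `prob`."""
    return random.random() < prob

def downsample(questions, gProb):
    """Keep a question if its group has no entry in gProb or the coin accepts it."""
    return [q for q in questions
            if q["group"] not in gProb or coin(gProb[q["group"]])]

random.seed(0)
# Hypothetical probabilities; only the group names are taken from the notes above.
gProb = {"categoryRelS": 0.5, "categoryRelO": 0.0}
questions = ([{"group": "categoryRelS"}] * 1000
             + [{"group": "categoryRelO"}] * 10
             + [{"group": "otherGroup"}] * 10)

kept = downsample(questions, gProb)
kept_by_group = {g: sum(q["group"] == g for q in kept)
                 for g in ("categoryRelS", "categoryRelO", "otherGroup")}
print(kept_by_group)
```

Groups without an entry in `gProb` are never touched, a probability of 0 removes the whole group, and everything else is thinned out to roughly the given fraction.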

In [27]:
# understanding unbiasRatios():

opencounter = defaultdict(lambda: defaultdict(int))
booleancounter = defaultdict(lambda: defaultdict(int))

quest_inst = [("10c-snowboards_vposition", "yes"), ("10c-snowboards_vposition", "yes"), ("10c-snowboards_vposition", "yes"), ("10c-snowboards_vposition", "yes"),
              ("10c-snowboards_vposition", "no"),
              ("14-basket_contain,o", "red"), ("14-basket_contain,o", "red"), ("14-basket_contain,o", "red"), ("14-basket_contain,o", "white"), ("14-basket_contain,o", "white"),
              ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"),]
for cond, ans in quest_inst:
    if ans in ["yes", "no"]:
        booleancounter[cond][ans] += 1
    else:
        opencounter[cond][ans] += 1

counters = {'open':opencounter,
            'boolean':booleancounter}
ratios = {"open": defaultdict(dict), "boolean": defaultdict(dict)}

smallThr = 1.
def uniformSmoother(thr):
    def usmoother(counts):
        s = sum(counts)
        # ps = [float(sum(counts[:(i+1)])) for i in range(len(counts))]
        k = 1 
        for ck in range(1,len(counts))[::-1]:
            if ((ck + 1) * counts[ck]) / s > thr:
                k = ck
                break
        newCounts = [(counts[k] if i <= k else 0) for i in range(len(counts))]
        return newCounts
    return usmoother

lsmoother = [2, 1.25, 0.03, 1.4]
lsmoother_final = [1, 1.05, 0.02, 1.38]
def customSmoother(b, gamma, gup, maxGamma): # gamma = 1.3, gup = 0.05  nts, b = 2, gamma = 1.3)
    def csmoother(counts):
    # gamma = 1.2 #1.3
        s = sum(counts)
        probs = [c/s for c in counts] 
        for i in range(len(counts)):
            # print(counts)
            if i == 0:
                continue
            s = sum(counts)
            tail = sum([c for c in counts[i:]])
            head = s - tail
            p = (min(0.1*(i+b),0.85))
            newHead = (p * tail / (1-p)) if i > 1 else 0 # (i-1) * tail # (1 - 1/i) * s
            # print((min(0.1*(i+1),0.85)), newHead, tail)
            if (sum(probs[i:]) > 0.099 or i == 1) and (head > newHead): # tail / s > 0.1
                newGamma = min(gamma + (i-1) * gup, maxGamma) # 1.38 1.5
                newProbs = [(probs[j]/sum(probs[:i])) for j in range(i)]
                for j in range(i-1)[::-1]:
                    newProbs[j] = min(newProbs[j], newGamma * newProbs[j+1])
                n = sum(newProbs)
                newProbs = [p/n for p in newProbs] 
                agamma = (1.1 if i > 1 else gamma)
                if newProbs[i-1] * newHead < agamma * counts[i]: # gamma * 
                    newHead = (agamma * counts[i]) / newProbs[i-1]
                    # print(newHead)
                for j in range(i): #[::-1]:
                    counts[j] = min(newProbs[j] * newHead, counts[j]) # max((probs[j]/sum(probs[:i])) * newHead, newGamma * counts[j+1])
        return counts
    return csmoother
smtr_uni = uniformSmoother(smallThr) 
smtr_cust = customSmoother(b = lsmoother_final[0], gamma = lsmoother_final[1], gup = lsmoother_final[2], maxGamma = lsmoother_final[3])
counters

{'open': defaultdict(<function __main__.<lambda>()>,
             {'14-basket_contain,o': defaultdict(int,
                          {'red': 3, 'white': 2, 'green': 5})}),
 'boolean': defaultdict(<function __main__.<lambda>()>,
             {'10c-snowboards_vposition': defaultdict(int,
                          {'yes': 4, 'no': 1})})}

In [29]:
smoother = smtr_cust

for cond in counters["open"]:
    ansDist = counters["open"][cond]
    answers = list(ansDist.keys()) 
    if len(answers) == 1: # and cond not in ["age", ]
        print("removedOpen", cond)            
        ratios["open"][cond][answers[0]] = min(1, float(1) / ansDist[answers[0]]) # 0.0
    else:
        sansDist = list(ansDist.items())
        sansDist = sorted(sansDist, key = lambda x: x[1], reverse = True)
        counts = [float(c) for _,c in sansDist]
        newCounts = smoother(counts)
        for i,c in enumerate(newCounts):
            print(f"{cond} at {sansDist[i][0]}: {c} / {sansDist[i][1]} = {float(c) / sansDist[i][1]}")
            ratios["open"][cond][sansDist[i][0]] = float(c) / sansDist[i][1]

ratios

14-basket_contain,o at green: 2.354 / 5 = 0.4708
14-basket_contain,o at red: 2.2 / 3 = 0.7333333333333334
14-basket_contain,o at white: 2.0 / 2 = 1.0


{'open': defaultdict(dict,
             {'14-basket_contain,o': {'green': 0.4708,
               'red': 0.7333333333333334,
               'white': 1.0}}),
 'boolean': defaultdict(dict, {})}
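
The `boolean` entry stays empty because the cell above only ports the open-question branch. A minimal sketch of the boolean weighting as described in the notes (`min(1, pn / py)` for a predominant "yes") could look as follows. The helper `boolean_ratios` is hypothetical, the symmetric formula for "no" is an assumption, and both counts are assumed to be non-zero.

```python
def boolean_ratios(yes_count, no_count):
    """Down-weight the more frequent answer; the rarer one keeps a ratio of 1.

    Assumption: the ratio for "no" mirrors the documented one for "yes".
    """
    return {
        "yes": min(1.0, no_count / yes_count),
        "no": min(1.0, yes_count / no_count),
    }

# Counts of the toy condition "10c-snowboards_vposition" used above.
bool_ratio = boolean_ratios(yes_count=4, no_count=1)
print(bool_ratio)  # {'yes': 0.25, 'no': 1.0}
```

If these ratios are used as acceptance probabilities, the expected number of kept "yes" questions is 4 * 0.25 = 1, which matches the single "no" question.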

In [19]:
# understanding typesampleQuestions():
def coin(prob):
    if prob == 1:
        return True
    if prob == 0:
        return False
    return random.random() < prob

typeSamples = {
    "verify": 0.75,   #newAll * 0.17, #0.75, 
    "choose": 0.865,  #newAll * , #0.865, 
    "logical": 0.95,  #newAll, #0.95
}

questions = [{"type": 'verify'}, {"type": 'choose'}, {"type": 'logical'}, {'type': 'verify'}, {"type": 'verify'}, {"type": 'choose'},
             {"type": 'compare'}, {"type": 'compare'}, {"type": 'compare'}, {'type': 'query'}, {"type": 'query'}, {"type": 'query'},]

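for question in questions:
    # NOTE: this eager lookup raises a KeyError for types that are not in typeSamples
    # (see the traceback below); the original condition avoids it by short-circuiting.
    coin_throw = coin(typeSamples[question["type"]])
    if (question["type"] not in typeSamples) or coin_throw:
        print("added", question["type"], coin_throw)
    else:
        print("not added", question["type"], coin_throw)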

added verify True
added choose True
added logical True
not added verify False
added verify True
added choose True


KeyError: 'compare'
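
The `KeyError` comes from this test cell, not from the sampling rule itself: the cell looks up `typeSamples[question["type"]]` before the membership test, whereas the condition quoted in the notes short-circuits. A minimal sketch that keeps the short-circuit (the helper name `accept` is made up for illustration):

```python
import random

def coin(prob):
    if prob == 1:
        return True
    if prob == 0:
        return False
    return random.random() < prob

typeSamples = {"verify": 0.75, "choose": 0.865, "logical": 0.95}

def accept(question):
    """Types without an entry are always kept; the others are kept with their probability."""
    qtype = question["type"]
    # `or` short-circuits, so the dict is only indexed when the key exists.
    return (qtype not in typeSamples) or coin(typeSamples[qtype])

random.seed(0)
type_questions = [{"type": t} for t in
                  ["verify", "choose", "logical", "compare", "query"]]
for q in type_questions:
    print("added" if accept(q) else "not added", q["type"])
```

With this version `compare` and `query` questions are always added, while `verify`, `choose` and `logical` are dropped at random.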

In [16]:
# understanding customSmoother():

def customSmoother(b, gamma, gup, maxGamma):
# b >= 1, \in (1,2)
# gamma > 1, \in (1.05, 1.3)
# gup > 0, \in (0.02, 0.05)
# maxGamma > 1, \in (1.38, 1.5)

    def csmoother(counts):
        s = sum(counts)
        probs = [c/s for c in counts] 

        for i in range(len(counts)):
            if i == 0:
                continue
            s = sum(counts)  # unnecessary

            tail = sum([c for c in counts[i:]])
            head = s - tail

            p = (min(0.1*(i+b),0.85))  # i>=1, b>=1 and \in (1,2) -> (0.2, 0.85)
            newHead = (p * tail / (1-p)) if i > 1 else 0  # the tail weighted by p/(1-p), i.e. by a factor in (0.25, 5.67)

            if (sum(probs[i:]) > 0.099 or i == 1) and (head > newHead): 
            # if there is enough probability mass left or if i == 1, and if the head is greater than the weighted tail
                newGamma = min(gamma + (i-1) * gup, maxGamma)  # increase gamma by a small step until maxGamma
                newProbs = [(probs[j]/sum(probs[:i])) for j in range(i)]  # normalize the head probabilities

                # the probability of each entry may be at most newGamma times the next (smaller) entry
                for j in range(i-1)[::-1]:
                    newProbs[j] = min(newProbs[j], newGamma * newProbs[j+1])

                # normalize the new probabilities
                n = sum(newProbs)
                newProbs = [p/n for p in newProbs]

                # make sure counts[i-1] stays at least agamma * counts[i]
                agamma = (1.1 if i > 1 else gamma)
                if newProbs[i-1] * newHead < agamma * counts[i]:
                    newHead = (agamma * counts[i]) / newProbs[i-1]

                # cap the counts in the head so that they decrease monotonically and never grow
                for j in range(i):
                    counts[j] = min(newProbs[j] * newHead, counts[j])

        return counts
    return csmoother
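
To see what the annotated steps do, the toy distribution from above (`green: 5, red: 3, white: 2`) can be traced by hand with the `final` parameters. This is a hand-unrolled sketch of the two loop iterations, not the repo code; the variable names are chosen for readability only.

```python
# "final" parameters as used in this notebook
b, gamma, gup, max_gamma = 1, 1.05, 0.02, 1.38
counts = [5.0, 3.0, 2.0]  # green, red, white (sorted descending)

# i = 1: the head is only `green`; newHead starts at 0 and is raised to gamma * counts[1].
green = min(counts[0], gamma * counts[1])            # 3.15

# i = 2: the head is (green, red); the probabilities come from the ORIGINAL counts.
new_gamma = min(gamma + (2 - 1) * gup, max_gamma)    # 1.07
p_green, p_red = 0.5 / 0.8, 0.3 / 0.8
p_green = min(p_green, new_gamma * p_red)            # cap green at 1.07 * red
norm = p_green + p_red
p_green, p_red = p_green / norm, p_red / norm

# p * tail / (1 - p) with p = 0.1 * (2 + b) = 0.3 is too small here,
# so newHead is raised until red keeps at least 1.1 * white.
new_head = max(0.3 * counts[2] / 0.7, 1.1 * counts[2] / p_red)

green = min(p_green * new_head, green)
red = min(p_red * new_head, counts[1])
smoothed = [green, red, counts[2]]
print([round(c, 3) for c in smoothed])  # [2.354, 2.2, 2.0]
```

The result matches the output of the `unbiasRatios()` cell: `red` ends up at 1.1 times `white` and `green` at 1.07 times `red`, so for this input the smoother effectively bounds the ratio between neighbouring answers.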

In [7]:
# load the case (02q-place) where the balancing works badly
all_filtered_by_column = pd.read_csv('results/BAL_02q-place_all.csv')
balanced_filtered_by_column = pd.read_csv('results/BAL_02q-place_balanced.csv')

In [23]:
# try to find some counter-example for the csmoother()
opencounter = defaultdict(lambda: defaultdict(int))
booleancounter = defaultdict(lambda: defaultdict(int))

# quest_inst = [("10c-snowboards_vposition", "yes"), ("10c-snowboards_vposition", "yes"), ("10c-snowboards_vposition", "yes"), ("10c-snowboards_vposition", "yes"),
#               ("10c-snowboards_vposition", "no"),
#               ("14-basket_contain,o", "red"), ("14-basket_contain,o", "red"), ("14-basket_contain,o", "red"), ("14-basket_contain,o", "white"), ("14-basket_contain,o", "white"),
#               ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"), ("14-basket_contain,o", "green"),]
# for cond, ans in quest_inst:
#     if ans in ["yes", "no"]:
#         booleancounter[cond][ans] += 1
#     else:
#         opencounter[cond][ans] += 1

for i, row in all_filtered_by_column.iterrows():
    opencounter[row['local_question_group']][row['answer']] += 1

counters = {'open':opencounter,
            'boolean':booleancounter}
ratios = {"open": defaultdict(dict), "boolean": defaultdict(dict)}


In [24]:
# smoother with the "final" values: b = 1, gamma = 1.05, gup = 0.02, maxGamma = 1.38
smtr_cust = customSmoother(b = 1, gamma = 1.05, gup = 0.02, maxGamma = 1.38)

In [25]:
smoother = smtr_cust

# relevant code from unbiasRatios()
for cond in counters["open"]:
    ansDist = counters["open"][cond]
    answers = list(ansDist.keys()) 
    if len(answers) == 1: # and cond not in ["age", ]
        print("removedOpen", cond)            
        ratios["open"][cond][answers[0]] = min(1, float(1) / ansDist[answers[0]]) # 0.0
    else:
        sansDist = list(ansDist.items())
        sansDist = sorted(sansDist, key = lambda x: x[1], reverse = True)
        counts = [float(c) for _,c in sansDist]
        newCounts = smoother(counts)
        for i,c in enumerate(newCounts):
            print(f"{cond} at {sansDist[i][0]}: {c} / {sansDist[i][1]} = {float(c) / sansDist[i][1]}")
            ratios["open"][cond][sansDist[i][0]] = float(c) / sansDist[i][1]

# relevant code from toRatio(): cannot be reproduced here because questionSubtypes is missing

ratios

02q-place at field: 6264.3 / 9196 = 0.681198347107438
02q-place at road: 5817.654419135448 / 5966 = 0.9751348339147583
02q-place at street: 4215.691608069165 / 5130 = 0.8217722432883363
02q-place at sidewalk: 3054.8489913544677 / 3069 = 0.9953890489913547
02q-place at beach: 2872.692795389049 / 2886 = 0.9953890489913545
02q-place at ocean: 2306.9682072622477 / 2671 = 0.8637095497050722
02q-place at park: 1671.7160922190203 / 2000 = 0.8358580461095102
02q-place at pavement: 1211.3884726224785 / 1217 = 0.9953890489913546
02q-place at city: 1191.4806916426514 / 1197 = 0.9953890489913546
02q-place at forest: 1022.2645533141213 / 1027 = 0.9953890489913547
02q-place at airport: 861.0115273775217 / 865 = 0.9953890489913546
02q-place at station: 777.398847262248 / 781 = 0.9953890489913546
02q-place at zoo: 771.4265129682999 / 775 = 0.9953890489913547
02q-place at parking lot: 750.5233429394815 / 754 = 0.9953890489913547
02q-place at river: 736.5878962536025 / 740 = 0.9953890489913547
02q-place

{'open': defaultdict(dict,
             {'02q-place': {'field': 0.681198347107438,
               'road': 0.9751348339147583,
               'street': 0.8217722432883363,
               'sidewalk': 0.9953890489913547,
               'beach': 0.9953890489913545,
               'ocean': 0.8637095497050722,
               'park': 0.8358580461095102,
               'pavement': 0.9953890489913546,
               'city': 0.9953890489913546,
               'forest': 0.9953890489913547,
               'airport': 0.9953890489913546,
               'station': 0.9953890489913546,
               'zoo': 0.9953890489913547,
               'parking lot': 0.9953890489913547,
               'river': 0.9953890489913547,
               'restaurant': 0.9953890489913546,
               'store': 0.9953890489913546,
               'lake': 0.9953890489913546,
               'train station': 0.9953890489913547,
               'skate park': 0.9953890489913547,
               'yard': 0.9953890489913546,
        

In [37]:
# iterate through ratios['open']['02q-place'] and take the keys as the index and the value as the value
# then format it as a dataframe
smoothed_by_column = pd.DataFrame([(k, v) for k, v in ratios['open']['02q-place'].items()], columns=['index', 'value'])
smoothed_by_column.index = smoothed_by_column['index']
smoothed_by_column.drop(columns=['index'], inplace=True)
smoothed_by_column

index,value
field,0.681198
road,0.975135
street,0.821772
sidewalk,0.995389
beach,0.995389
...,...
coffee shop,1.000000
cemetery,1.000000
pub,1.000000
pizza shop,1.000000


In [54]:
# count how often each answer occurs in the full and in the balanced question set
all_filtered_by_column_vc = all_filtered_by_column['answer'].value_counts()
balanced_filtered_by_column_vc = balanced_filtered_by_column['answer'].value_counts()
all_filtered_by_column_vc

answer
field          9196
road           5966
street         5130
sidewalk       3069
beach          2886
               ... 
coffee shop       9
cemetery          6
pub               6
pizza shop        4
lounge            4
Name: count, Length: 90, dtype: int64

In [41]:
smoothed_by_column_vc = smoothed_by_column['value'] * all_filtered_by_column_vc
smoothed_by_column_vc

airport           861.011527
amusement park     12.000000
auditorium         28.000000
backyard           66.000000
bakery             36.000000
                     ...    
tunnel             69.000000
village            22.000000
walkway           395.169452
yard              419.058790
zoo               771.426513
Length: 90, dtype: float64
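
The product above is sorted alphabetically because pandas aligns two Series by index label before multiplying, and the two indexes do not have the same order. A small sketch with made-up numbers shows the alignment and how `reindex` restores the count-sorted order:

```python
import pandas as pd

# Hypothetical toy values; only the answer names are taken from the data above.
toy_counts = pd.Series([9, 5, 3], index=["field", "road", "street"])       # sorted by count
toy_ratios = pd.Series([0.8, 0.7, 1.0], index=["road", "field", "street"])  # different label order

toy_smoothed = toy_ratios * toy_counts                  # aligned by label, not by position
toy_smoothed = toy_smoothed.reindex(toy_counts.index)   # restore the count-sorted order
print(toy_smoothed)
```

For the plot below the order does not matter, since the x-axis order is fixed explicitly with `categoryarray`.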

In [55]:
# plot how the answer distribution changes with an overlaid plotly bar chart: answer counts in all_filtered_by_column, the smoothed counts, and the counts in balanced_filtered_by_column
fig = go.Figure()
fig.add_trace(go.Bar(y=all_filtered_by_column_vc, x=all_filtered_by_column_vc.index, name='all'))
fig.add_trace(go.Bar(y=smoothed_by_column_vc, x=smoothed_by_column_vc.index, name='smoothed'))
fig.add_trace(go.Bar(y=balanced_filtered_by_column_vc, x=balanced_filtered_by_column_vc.index, name='balanced'))

order = list(all_filtered_by_column_vc.index)  # answers sorted by their original frequency
fig.update_xaxes(categoryorder='array', categoryarray=order)
fig.update_layout(barmode='overlay', title='Smoothed count distribution for 02q-place questions')

fig.write_image('results/BAL_smoothed_02q-place_answer_distribution.png', width=1500, height=600)
if show_big_figures:
    fig.show()