## Computation

Run the Python script `compute_weights.py` on the file from the previous step. The `numpy` and `pandas` libraries must be installed, although no in-depth knowledge of either is required. For each combination of q2 and q4, the script prints the universe size computed from the weighting, followed by the true universe size given in `QUOTAS`. The two numbers should be the same for each group. The script can read HDF or CSV files. It relies heavily on `pandas`, and all the necessary information can be found in the pandas documentation.

On the first run, you will probably see something like:

$ python compute_weights.py tmp.hdf
(q2, q4): (0, 0)	 1643.0 	 1000
(q2, q4): (0, 1)	 21698.0 	 2000
(q2, q4): (0, 2)	 26548.0 	 3000
(q2, q4): (1, 0)	 1743.0 	 5000
(q2, q4): (1, 1)	 21658.0 	 8000
(q2, q4): (1, 2)	 26809.0 	 3000

The two numbers on each line clearly do not match, so you need to edit the code until they do. There is a bug in the computation of `factor` in the function `get_factor_weights`. `factor` is the value that should be assigned to every respondent with a given combination of q2 and q4.

## Task
Locate and fix this bug to complete this part. Once it is fixed, the two numbers on each line match, so the output of a correct solution is:

$ python compute_weights.py tmp.hdf
(q2, q4): (0, 0)	 1000.0 	 1000
(q2, q4): (0, 1)	 2000.0 	 2000
(q2, q4): (0, 2)	 3000.0 	 3000
(q2, q4): (1, 0)	 5000.0 	 5000
(q2, q4): (1, 1)	 8000.0 	 8000
(q2, q4): (1, 2)	 3000.0 	 3000
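
Rather than comparing the two columns by eye, the match can also be checked programmatically. The following is a minimal sketch, not part of `compute_weights.py`: the helper name `weights_match_quotas` and the toy data are assumptions made for illustration, and only the column names `q2`, `q4` and `weighting` and the layout of the quota tuples are taken from the script.

```python
import numpy as np
import pandas as pd


def weights_match_quotas(df, quotas):
    """Return True if the summed weights of every (q2, q4) group equal its quota.

    Hypothetical helper: it mirrors the comparison that the script prints,
    but returns a boolean instead of printing.
    """
    sums = df.groupby(["q2", "q4"])["weighting"].sum()
    return all(
        np.isclose(sums.loc[(q2, q4)], quota_size)
        for (q2, q4), quota_size in quotas
    )


# Toy quotas in the same layout as QUOTAS: ([q2, q4], universe size).
toy_quotas = (([0, 0], 10), ([0, 1], 30))

# Toy respondents: two in group (0, 0), three in group (0, 1), all with weight 1.
toy_df = pd.DataFrame({
    "q2": [0, 0, 0, 0, 0],
    "q4": [0, 0, 1, 1, 1],
    "weighting": [1.0] * 5,
})
before = weights_match_quotas(toy_df, toy_quotas)

# Weights that scale each group up to its quota: 10 / 2 and 30 / 3.
toy_df["weighting"] = [5.0, 5.0, 10.0, 10.0, 10.0]
after = weights_match_quotas(toy_df, toy_quotas)

print(before, after)
```

`np.isclose` is used instead of `==` because the weights are floating-point numbers and their sums may differ from the quota by rounding error.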


import pandas as pd
import numpy as np
import os

PATH = r'C:\Jian temp\GWIndex'
data_filename = r'result_data.csv'


QUOTAS = (
    ([0, 0], 1000),
    ([0, 1], 2000),
    ([0, 2], 3000),
    ([1, 0], 5000),
    ([1, 1], 8000),
    ([1, 2], 3000),
)

def get_factor_weights(quotas, counts):
    factors = []
    for (q2, q4), quota_size in quotas: 
        count = counts.loc[(q2, q4)]
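        # The original script had `factor = 1`, which only works when the
        # sample size equals the universe size. Scale the group to its quota.
        factor = quota_size / count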
        factors.append(((q2, q4), factor))
    return factors

def distribute_factors(df, factors):
    for (q2, q4), factor in factors:
        df.loc[(df['q2'] == q2) & (df['q4'] == q4), 'weighting'] = factor
    return df

def load_data(filename):
    if filename.endswith('.csv'):
        return pd.read_csv(os.path.join(PATH, filename), index_col=0)
    else:
        return pd.read_hdf(filename, key='df')

def groupby_factors(df):
    return df.groupby(['q2', 'q4']).q2.count()
    

def assign_weights(df):
    counts = groupby_factors(df)
    factors = get_factor_weights(QUOTAS, counts)
    dfp = distribute_factors(df, factors)
    return dfp

def validate_weights(df):
    sums = df.groupby(['q2', 'q4']).weighting.sum()
    for (q2, q4), quota_size in QUOTAS:
        print('(q2, q4): ({}, {})\t'.format(q2, q4),
              sums.loc[(q2, q4)], '\t', quota_size
             )

    
def main():
    # Command-line parsing is disabled here; the input file is taken
    # from data_filename instead.
    #    import argparse
    #    parser = argparse.ArgumentParser()
    #    parser.add_argument('filename', help= r'C:\Jian temp\GWIndex')
    #    args = parser.parse_args()
    df = load_data(data_filename)
    dfp = assign_weights(df)
    validate_weights(dfp)


if __name__ == '__main__':
    main()

(q2, q4): (0, 0)	 1000.0 	 1000
(q2, q4): (0, 1)	 2000.0 	 2000
(q2, q4): (0, 2)	 3000.0 	 3000
(q2, q4): (1, 0)	 5000.0 	 5000
(q2, q4): (1, 1)	 8000.0 	 8000
(q2, q4): (1, 2)	 3000.0 	 3000


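In the original `compute_weights.py`, the function `get_factor_weights` sets `factor = 1`. This is correct only if, for every combination of q2 and q4, the sample size is exactly equal to the population size, which is rarely the case in practice. With `factor = 1` the summed weights of a group are simply its number of respondents, which is why the first-run output shows 1643.0 for group (0, 0) instead of 1000.
The right way to calculate the factor is to divide the population size of each group (`quota_size` in the code) by its number of respondents (`count` in the code):

factor = population_counts/respondent_counts

For group (0, 0) in the first-run output, for example, this gives 1000 / 1643 ≈ 0.609 per respondent.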
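
The formula can be tried out on a small example. This is a sketch on made-up data, not the author's script: the DataFrame `toy_df` and the dictionary `toy_quotas` are assumptions for illustration, and only the column names `q2`, `q4` and `weighting` come from `compute_weights.py`.

```python
import pandas as pd

# Toy respondents (hypothetical): two in group (0, 0), four in group (1, 1).
toy_df = pd.DataFrame({
    "q2": [0, 0, 1, 1, 1, 1],
    "q4": [0, 0, 1, 1, 1, 1],
})

# Hypothetical universe sizes per (q2, q4) group.
toy_quotas = {(0, 0): 1000, (1, 1): 8000}

# Number of respondents in each group.
counts = toy_df.groupby(["q2", "q4"]).size()

# factor = population count / respondent count, one value per group.
factors = {
    group: quota_size / counts.loc[group]
    for group, quota_size in toy_quotas.items()
}

# Give every respondent the factor of the group it belongs to.
groups = zip(toy_df["q2"].tolist(), toy_df["q4"].tolist())
toy_df["weighting"] = [factors[group] for group in groups]

# The summed weights of each group now reproduce its universe size.
sums = toy_df.groupby(["q2", "q4"])["weighting"].sum()
print(factors)
print(sums)
```

Each of the two respondents in group (0, 0) receives a weight of 1000 / 2 = 500, and each of the four respondents in group (1, 1) receives 8000 / 4 = 2000, so the weights in each group add up to its quota.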