In [1]:
import pandas as pd

In [5]:
df = pd.read_csv('../../data/napire_truth.csv', sep=';')

We have 22 individually listed free text variables and three ranges of consecutively numbered free text variables, grouped into four categories:
* Short Free Text - List-Supplementing: `sft_paired`
* Short Free Text - List-Supplanting: `sft_independent`
* Long Free Text - Filtered: `lft_some`
* Long Free Text - Not Filtered (i.e., lots of answers): `lft_all`

Note that this categorization is rough, so each file name also carries the total number of answers that are not evidently NULL (see below).
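
As a minimal sketch of what "not evidently NULL" means here, the snippet below counts answers per variable after dropping the three placeholder values that the export function further down filters on. The helper name `count_answers` and the toy data are made up for illustration; they are not part of the notebook or of `napire_truth.csv`.

```python
import pandas as pd

# Placeholder values treated as "evidently NULL"
PLACEHOLDERS = {'NotAnswered', 'NotApplicable', 'NotShown'}

def count_answers(df, columns):
    """Hypothetical helper: number of non-placeholder answers per column."""
    # Genuinely missing values (NaN) are not in the placeholder set,
    # so they would still be counted as answers here.
    return {col: int((~df[col].isin(PLACEHOLDERS)).sum()) for col in columns}

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'lfdn': [1, 2, 3, 4],
    'v_2': ['NotAnswered', 'some text', 'NotShown', 'more text'],
    'v_3': ['NotApplicable', 'NotAnswered', 'NotShown', 'an answer'],
})
answer_counts = count_answers(toy, ['v_2', 'v_3'])
print(answer_counts)
```

Anything that survives this filter (including junk like a lone `-`) still counts as an answer, which is why the numbers are only a rough guide.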

In [46]:
sft_paired = [
    'v_2', 'v_5', 'v_15', 'v_18', 'v_35', 'v_46', 'v_52', 'v_60',
    'v_67', 'v_81', 'v_96', 'v_105', 'v_113', 'v_164', 'v_166', 'v_194',
]
sft_independent = ['v_3', 'v_19', 'v_20']
lft_some = ['v_26', 'v_27'] + [f'v_{x}' for x in range(343, 352)]
lft_all = (
    [f'v_{x}' for x in range(168, 174)]
    + [f'v_{x}' for x in range(277, 287)]
    + ['v_297']
)

In [47]:
all_vars = sft_paired + sft_independent + lft_some + lft_all
len(all_vars) # Yay...not.

47

In [50]:
var_dict = {
    'short_few': sft_paired,
    'short_all': sft_independent,
    'long_few':  lft_some,
    'long_all':  lft_all
}

In [60]:
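def write_var_files(df, var_dict, basedir='../../data/freetext'):
    # Placeholder values that mark a response as evidently NULL
    dffilter = {'NotAnswered', 'NotApplicable', 'NotShown'}
    for k, v in var_dict.items():
        for var in v:
            # Keep the respondent id and the answer, dropping placeholder rows
            fdf = df.loc[~df[var].isin(dffilter), ['lfdn', var]]
            # File name: <variable>_<short|long>_<number of answers>.csv
            fdf.to_csv(f'{basedir}/{var}_{k[:-4]}_{len(fdf)}.csv', index=False)
        print(f'Wrote {len(v)} files of type "{k}" to directory: {basedir}')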

In [61]:
write_var_files(df, var_dict)

Wrote 16 files of type "short_few" to directory: ../../data/freetext
Wrote 3 files of type "short_all" to directory: ../../data/freetext
Wrote 11 files of type "long_few" to directory: ../../data/freetext
Wrote 17 files of type "long_all" to directory: ../../data/freetext


### About the Translations...

Manual inspection showed that the following variables still contain untranslated foreign-language responses (i.e., these columns were wrongly excluded from translation):
* v_3
* v_15
* v_19
* v_105

Note that only a tiny number of responses are affected in each case, so we might as well hand the foreign-language responses over to our beloved coders.
The responses are understandable to anyone with some knowledge of Romance languages.
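
To give the coders a head start, something like the following could surface candidate rows. This is a crude heuristic and an assumption on my part, not how the affected variables were found: it flags answers containing non-ASCII characters (accents, `ñ`, etc.), so it will miss foreign-language answers written without such characters and may flag English answers with stray symbols. The helper name `flag_non_ascii` and the toy data are invented for illustration.

```python
import pandas as pd

PLACEHOLDERS = {'NotAnswered', 'NotApplicable', 'NotShown'}

def flag_non_ascii(df, column, id_column='lfdn'):
    """Hypothetical helper: rows whose answer contains non-ASCII characters."""
    # Drop placeholder rows first, keeping the respondent id and the answer
    answers = df.loc[~df[column].isin(PLACEHOLDERS), [id_column, column]]
    # str.isascii() is False as soon as one accented or special character appears
    mask = answers[column].astype(str).map(lambda s: not s.isascii()).astype(bool)
    return answers[mask]

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'lfdn': [1, 2, 3],
    'v_3': ['Gestión de requisitos', 'Requirements workshop', 'NotAnswered'],
})
flagged = flag_non_ascii(toy, 'v_3')
print(flagged)
```

Running this over `v_3`, `v_15`, `v_19` and `v_105` would give a shortlist to check by eye, nothing more.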

### About Excel...
Look at our nice Excel-produced errors (look out for the `#VALUE!`) - I'm not going to fix them. (**TOLD YOU NOT TO WORK WITH EXCEL**) (*SIGH*)

In [55]:
df[df.lfdn == 1484].values

array([[1484, 'Español', 'Ecuador', '6', 'Education', 'NotAnswered',
        'Business information systems', 'NotAnswered', 'quoted',
        'quoted', 'not quoted', 'quoted', 'not quoted', 'quoted',
        'not quoted', 'quoted', 'not quoted', 'NotAnswered', 'Yes',
        'Test Manager / Tester', 'NotAnswered', '4', 'DO NOT',
        'Main contractor (main responsible for the development project)',
        'NotAnswered', 'Plan-driven', 'Good', 'NotShown', 'NotShown',
        'not quoted', 'quoted', 'not quoted', 'NotAnswered', 'quoted',
        'not quoted', 'quoted', 'not quoted', 'quoted', 'not quoted',
        'not quoted', 'not quoted', 'not quoted', 'not quoted',
        'not quoted', 'NotAnswered', 'Project Lead / Project Manager',
        'NotAnswered',
        'We document high-level requirements at beginning of the project and refine them to detailed requirements when needed (for instance, we document epics and refine them to user stories for the sprints).',
        'NotAns
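
Rather than eyeballing whole rows, the `#VALUE!` cells can be located with a small scan. This is only a sketch for finding them, not for repairing them; the helper name `find_excel_errors` and the toy data are made up for illustration.

```python
import pandas as pd

def find_excel_errors(df, marker='#VALUE!', id_column='lfdn'):
    """Hypothetical helper: map each affected column to the ids of affected rows."""
    hits = {}
    for col in df.columns:
        # Plain substring match (regex=False), after casting everything to str
        mask = df[col].astype(str).str.contains(marker, regex=False)
        if mask.any():
            hits[col] = df.loc[mask, id_column].tolist()
    return hits

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'lfdn': [1, 2],
    'v_1': ['ok', '#VALUE!'],
    'v_2': ['#VALUE!', '#VALUE!'],
})
excel_hits = find_excel_errors(toy)
print(excel_hits)
```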

### How many answers do we have to 'code' (somewhat) in total?

In [62]:
import os

In [70]:
sum(int(x.split('_')[-1][:-4])
    for x in os.listdir('../../data/freetext')
    if x.endswith('.csv'))

7200

# OOPS.
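
To see where that total comes from, the counts embedded in the file names can be split by type. A minimal sketch, assuming the `<variable>_<short|long>_<count>.csv` naming used by `write_var_files`; the helper name `answers_per_type` and the example file names are invented for illustration.

```python
from collections import Counter

def answers_per_type(filenames):
    """Hypothetical helper: sum the answer counts in the file names per type."""
    totals = Counter()
    for name in filenames:
        if not name.endswith('.csv'):
            continue
        # 'v_2_short_120.csv' -> 'v_2', 'short', '120'
        _var, kind, count = name[:-4].rsplit('_', 2)
        totals[kind] += int(count)
    return dict(totals)

# Example file names, invented for illustration only
example = ['v_2_short_120.csv', 'v_26_long_45.csv', 'v_3_short_7.csv', 'notes.txt']
breakdown = answers_per_type(example)
print(breakdown)
```

On the real directory this would be called as `answers_per_type(os.listdir('../../data/freetext'))`.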