In this step we correct the discrepancies in the meaning of the indicators that we found in the previous step.

This step should be done manually; however, we use pandas to display the `.csv` files that we will work on.

In [1]:
from IPython.display import display, Markdown
import pandas as pd
import os

CHECK_DIR = '/path/to/check'
HELEVEL_data = pd.read_csv(os.path.join(CHECK_DIR, 'hh', 'HELEVEL.csv'), index_col=0)

In [2]:
display(HELEVEL_data)

Unnamed: 0,label,used_indicator,1.0,2.0,3.0,4.0,5.0,9.0
Bangladesh,Education of household head,HELEVEL,,Primary incomplete,Primary complete,Secondary incomplete,Secondary complete or higher,Missing/DK
Pakistan (Punjab),Education of household head,HELEVEL,None/pre-school,Primary,Middle,Secondary,Higher,Missing/DK
Nigeria,Education of household head,HELEVEL,,Primary,Secondary / Secondary-technical,Higher,Non-formal,Missing/DK


### Inconsistencies in the meaning of the acronyms

If the description of the item for a given country shows that the item does not store the information you expected, look for the correct acronym in the original `.sav` file of the questionnaire of the "inconsistent" country.

`.sav` files can be opened with [PSPP](https://www.gnu.org/software/pspp/), a free and open-source alternative to SPSS, the well-known statistical analysis software.


After you have identified the correct acronym, repeat the "04_check_indicators" step, creating a dictionary to be passed as the `swap_indicators` parameter. This causes the `check_indicators` function to use the correct acronym instead of the wrong one.
The row corresponding to the "inconsistent" country should now show a consistent description and the correct acronym as `used_indicator`.


In this tutorial, the items that we selected are consistent across the three countries considered.
In fact, for all indicators, the descriptions in the `label` column are the same.

Please note that minor differences should be expected, for instance due to different languages or grammar.
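As a complement to the visual inspection, the comparison of the `label` column can be sketched with pandas. This is only an illustration: the small table below is a stand-in for the content of `HELEVEL.csv`, and the normalisation (lower case, stripped whitespace) is an assumption about which differences count as "minor".

```python
import pandas as pd

# Stand-in for the table loaded from HELEVEL.csv (same layout as above).
helevel_check = pd.DataFrame(
    {
        "label": ["Education of household head"] * 3,
        "used_indicator": ["HELEVEL"] * 3,
    },
    index=["Bangladesh", "Pakistan (Punjab)", "Nigeria"],
)

# Ignore trivial differences in case and surrounding whitespace (assumption).
normalised_labels = helevel_check["label"].str.strip().str.lower()

# A single distinct label means every country describes the same item.
labels_are_consistent = normalised_labels.nunique() == 1
print(labels_are_consistent)
```

Differences due to language or wording would still need to be judged by eye.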


### Inconsistencies in the numerical representation

Considering the `HELEVEL` item again, we note that the meaning of some numerical values differs between countries. For instance, consider the value 4.0 (note that the column name is actually the string `'4.0'`, not the number `4`):

In [3]:
display(HELEVEL_data.loc[:,'4.0'])

Bangladesh           Secondary incomplete
Pakistan (Punjab)               Secondary
Nigeria                            Higher
Name: 4.0, dtype: object

To analyse the data coherently, we need the numerical representation used by the different countries to be the same.
To this end, we need to "recode" the numerical values using a common encoding.
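The same comparison can be made for several values at once. The sketch below is not part of `mics_library`; it uses a reduced copy of the `HELEVEL` table shown earlier and counts, for each value, how many distinct descriptions appear across the countries.

```python
import pandas as pd

# Reduced copy of the HELEVEL table: three of the value columns.
values = pd.DataFrame(
    {
        "3.0": ["Primary complete", "Middle", "Secondary / Secondary-technical"],
        "4.0": ["Secondary incomplete", "Secondary", "Higher"],
        "9.0": ["Missing/DK", "Missing/DK", "Missing/DK"],
    },
    index=["Bangladesh", "Pakistan (Punjab)", "Nigeria"],
)

# Number of distinct descriptions per value (one count per column).
distinct_per_code = values.nunique()

# More than one description means the value does not mean the same thing everywhere.
codes_to_recode = distinct_per_code[distinct_per_code > 1].index.tolist()
print(codes_to_recode)  # ['3.0', '4.0']
```

An exact string comparison like this one also flags harmless wording differences, so the result is a list of candidates to inspect rather than a final verdict.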

First, you should plan how to recode the numerical values.
In this tutorial, we decide to focus only on households whose head has completed the secondary level of education.
For this reason, we need to recode as "1" all the numerical values that indicate a secondary level of education, and as "0" all other values.
This way, when we analyse the data, we will retain only the households where the (recoded) `HELEVEL` equals 1.
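To make the plan concrete, here is a minimal sketch for a single country, written with plain pandas rather than with `mics_library`. The mapping is read off the Nigerian row of the table above; the raw values in `raw_helevel` are invented for illustration only.

```python
import pandas as pd

# Planned recoding for Nigeria: only 3.0 (Secondary / Secondary-technical) becomes 1.
nigeria_plan = {1.0: 0, 2.0: 0, 3.0: 1, 4.0: 0, 5.0: 0}

# Hypothetical raw HELEVEL values for five households (not real survey data).
raw_helevel = pd.Series([2.0, 3.0, 4.0, 3.0, 9.0], name="HELEVEL")

# Values missing from the plan (here 9.0, Missing/DK) become NaN.
recoded = raw_helevel.map(nigeria_plan)

# Keep only the households whose recoded value equals 1.
selected = recoded[recoded == 1]
print(selected.index.tolist())  # [1, 3]
```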

`mics_library` provides a function to define a new encoding, starting from information stored in formatted `.csv` files.

These are the (suggested) steps:
- Create a copy of the folder where you saved the results of the `check_indicators` function (we will call the new folder `RECODE_DIR`);
- Delete the files of the indicators that do not need to be recoded;
- Manually edit the `.csv` file of each indicator that needs to be recoded, replacing the description of the meaning of each numerical value with the new numerical value.
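The first two steps above can also be scripted. The following sketch uses only the Python standard library and temporary folders in place of the real `CHECK_DIR` and `RECODE_DIR`; the file `OTHER.csv` is a hypothetical indicator that does not need recoding.

```python
import shutil
import tempfile
from pathlib import Path

# Temporary stand-ins for CHECK_DIR and RECODE_DIR, so the sketch runs anywhere.
workspace = Path(tempfile.mkdtemp())
check_dir = workspace / "check"
recode_dir = workspace / "recode"

# Fake output of the check step: two indicator files (OTHER.csv is hypothetical).
(check_dir / "hh").mkdir(parents=True)
(check_dir / "hh" / "HELEVEL.csv").write_text(",label,used_indicator\n")
(check_dir / "hh" / "OTHER.csv").write_text(",label,used_indicator\n")

# Step 1: copy the whole folder.
shutil.copytree(check_dir, recode_dir)

# Step 2: delete the files of the indicators that do not need to be recoded.
indicators_to_recode = {"HELEVEL"}
for csv_file in list(recode_dir.rglob("*.csv")):
    if csv_file.stem not in indicators_to_recode:
        csv_file.unlink()

remaining = sorted(p.name for p in recode_dir.rglob("*.csv"))
print(remaining)  # ['HELEVEL.csv']
```

The original folder is left untouched, so the check results remain available for reference while editing the copies.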

This is an example of the final result.

In [4]:
RECODE_DIR = '/path/to/recode'
HELEVEL_recoding_data = pd.read_csv(os.path.join(RECODE_DIR, 'hh', 'HELEVEL.csv'), index_col=0)

`HELEVEL.csv` file manually edited to recode the numerical indicators (stored in the `{RECODE_DIR}/hh` folder):

In [5]:
display(HELEVEL_recoding_data)

Unnamed: 0,label,used_indicator,1,2,3,4,5,9
Bangladesh,Education of household head,HELEVEL,0,0,0,0,1,
Pakistan (Punjab),Education of household head,HELEVEL,0,0,0,1,0,
Nigeria,Education of household head,HELEVEL,0,0,1,0,0,
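A manually edited table like this one can be cross-checked with a short pandas sketch. The code below is an illustration under stated assumptions, not the procedure used by `mics_library`: it works on a reduced copy of the original `HELEVEL` table, and the set `secondary_descriptions` lists the descriptions that we chose to recode as 1.

```python
import pandas as pd

# Reduced copy of the original HELEVEL table (values 3.0 to 5.0 only).
check_table = pd.DataFrame(
    {
        "label": ["Education of household head"] * 3,
        "used_indicator": ["HELEVEL"] * 3,
        "3.0": ["Primary complete", "Middle", "Secondary / Secondary-technical"],
        "4.0": ["Secondary incomplete", "Secondary", "Higher"],
        "5.0": ["Secondary complete or higher", "Higher", "Non-formal"],
    },
    index=["Bangladesh", "Pakistan (Punjab)", "Nigeria"],
)

# Descriptions recoded as 1; every other description becomes 0 (our choice).
secondary_descriptions = {
    "Secondary complete or higher",
    "Secondary",
    "Secondary / Secondary-technical",
}

value_columns = [c for c in check_table.columns if c not in ("label", "used_indicator")]

# Replace each description with its new numerical value.
recode_table = check_table.copy()
for column in value_columns:
    recode_table[column] = check_table[column].isin(secondary_descriptions).astype(int)

print(recode_table[value_columns])
```

For these three values the result matches the manually edited rows above: 1 in column 5 for Bangladesh, in column 4 for Pakistan (Punjab), and in column 3 for Nigeria.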


This procedure should be repeated for each item that we need to recode.

Note that the `HL3` indicator also has incoherent numerical representations (for instance, the value 14.0):

In [6]:
HL3_data = pd.read_csv(os.path.join(CHECK_DIR, 'hl', 'HL3.csv'), index_col=0)
display(HL3_data)

Unnamed: 0,label,used_indicator,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,97.0,98.0,99.0,96.0
Bangladesh,Relationship to the head,HL3,Head,Wife / Husband,Son / Daughter,Son-In-Law / Daughter-In-Law,Grandchild,Parent,Parent-In-Law,Brother / Sister,Brother-In-Law / Sister-In-Law,Uncle / Aunt,Niece / Nephew,Other relative,Adopted / Foster / Stepchild,Not related,Inconsistent,Don't know,Missing,
Pakistan (Punjab),Relationship to the head,HL3,Head,Wife / Husband,Son / Daughter,Son-In-Law / Daughter-In-Law,Grandchild,Parent,Parent-In-Law,Brother / Sister,Brother-In-Law / Sister-In-Law,Uncle / Aunt,Niece / Nephew,Other relative,Adopted / Foster / Stepchild,Servant (Live-in),Inconsistent,DK,Missing,Other (Not related)
Nigeria,Relationship to the head,HL3,Head,Spouse/Partner,Son / Daughter,Son-In-Law / Daughter-In-Law,Grandchild,Parent,Parent-In-Law,Brother / Sister,Brother-In-Law / Sister-In-Law,Uncle / Aunt,Niece / Nephew,Other relative,Adopted / Foster / Stepchild,Servant (Live-in),Inconsistent,DK,Missing,Other (Not related)


However, in our analysis we will only focus on children and grandchildren (values 3.0 and 5.0), which are encoded coherently across the different countries.
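A minimal sketch of this selection is given below. It first confirms that the two values of interest have a single description across countries, then filters a small table of household members. The `members` table is invented for illustration, and its column names are assumptions.

```python
import pandas as pd

# Descriptions of the two values of interest, copied from the HL3 table above.
hl3_check = pd.DataFrame(
    {
        "3.0": ["Son / Daughter"] * 3,
        "5.0": ["Grandchild"] * 3,
    },
    index=["Bangladesh", "Pakistan (Punjab)", "Nigeria"],
)

# Each value must have exactly one description across the countries.
values_are_coherent = bool((hl3_check.nunique() == 1).all())

# Hypothetical household members (not real survey data).
members = pd.DataFrame(
    {
        "country": ["Bangladesh", "Bangladesh", "Nigeria", "Nigeria"],
        "HL3": [1.0, 3.0, 14.0, 5.0],
    }
)

# Keep only children (3.0) and grandchildren (5.0).
children = members[members["HL3"].isin([3.0, 5.0])]
print(children)
```

Because the incoherent value 14.0 is filtered out, it does not need to be recoded for this analysis.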