First things first: set the ROOTDIR and import the necessary packages:

In [1]:
from IPython.display import display, Markdown

import mics_library
import os

ROOTDIR = '/path/to/original' 

mics_library.set_rootdir(ROOTDIR)

## Check indicators
In this step, we need to make sure that the information provided by each country is consistent and coherent.

Specifically:

### 1. That the same item indicates the same question for all countries
The same acronym may refer to different questions in different countries, either because the wording of the question varies or because the same question has different acronyms in different countries.
We therefore need to make sure that we are using the correct items/acronyms for each country.

### 2. That the encoded answers have the same meaning for all countries
MICS items record the answers to the questions that are asked of participants.
Most MICS questions are *multiple choice* questions: the answers are categorical variables, encoded as numerical values.
For instance, a MICS item/question with Yes/No answers (e.g. "Is the natural father alive") uses the following encoding:
- 1 : "Yes"
- 2 : "No"
- 9 : Missing answer or "Don't know"

However, different countries may have different categories, with different numerical encoding.

When we analyse MICS data we rely on the numerical representation of the answers, so we must make sure that the same number (i.e. *numerical representation*) corresponds to the same meaning across different countries.
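
To make the problem concrete, the sketch below uses two made-up encodings of the same Yes/No question and recodes one onto the other by matching their labels. The encodings, the sample answers, and all variable names are hypothetical; they are not taken from MICS data or from `mics_library`.

```python
import pandas as pd

# Hypothetical encodings of the same Yes/No question in two countries
labels_a = {1: "Yes", 2: "No", 9: "Missing"}
labels_b = {0: "No", 1: "Yes", 9: "Missing"}

# Made-up raw answers as they would appear in country B's dataset
answers_b = pd.Series([1, 0, 0, 9])

# Build a recoding table from B's codes to A's codes by matching the labels
code_of_label_a = {label: code for code, label in labels_a.items()}
recode_b_to_a = {code: code_of_label_a[label] for code, label in labels_b.items()}

# After recoding, the same number means the same answer in both countries
harmonised_b = answers_b.map(recode_b_to_a)
```

Matching on labels only works when the labels are spelled identically; in practice the labels themselves often differ (e.g. "DK" vs "Don't know"), which is why the manual check described here is needed.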

Therefore, for each indicator, we need to check the numerical representations that are used by each country.
We use the `mics_library.preview.check_values` function:

In [2]:
from mics_library.preview import check_values

In the previous step, we selected the indicators of interest for the analysis:

In [3]:
ROUND = 5

select_indicators = {'hh': ['HELEVEL'], #education level of the household head
                     'hl': ['HL3'],     #relation to the household head
                     'ch': ['EC1',      #number of books
                            'EC5',      #attend early education programme
                            'AG2']      #age of child
                    }

We use the `check_values` function to obtain information about the meaning and numerical encoding of each indicator for each country.

In [4]:
dataframes = check_values(micsround=ROUND, indicators=select_indicators, swap_indicators={}, ignorecase=True)

The result is a dictionary -- of dictionaries -- of dataframes:
                    
`{'questionnaire1' : {'indicator' : dataframe, ...}, ...}`

Let's see:

In [5]:
for quest, dataframes_quest in dataframes.items():
    display(Markdown('---'))
    display(Markdown(f'## {quest}'))

    for ind, dataframe in dataframes_quest.items():
        display(Markdown(f'### {ind}'))
        display(dataframe)

---

## hh

### HELEVEL

Unnamed: 0,label,used_indicator,1.0,2.0,3.0,4.0,5.0,9.0
Bangladesh,Education of household head,HELEVEL,,Primary incomplete,Primary complete,Secondary incomplete,Secondary complete or higher,Missing/DK
Pakistan (Punjab),Education of household head,HELEVEL,None/pre-school,Primary,Middle,Secondary,Higher,Missing/DK
Nigeria,Education of household head,HELEVEL,,Primary,Secondary / Secondary-technical,Higher,Non-formal,Missing/DK


---

## hl

### HL3

Unnamed: 0,label,used_indicator,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,97.0,98.0,99.0,96.0
Bangladesh,Relationship to the head,HL3,Head,Wife / Husband,Son / Daughter,Son-In-Law / Daughter-In-Law,Grandchild,Parent,Parent-In-Law,Brother / Sister,Brother-In-Law / Sister-In-Law,Uncle / Aunt,Niece / Nephew,Other relative,Adopted / Foster / Stepchild,Not related,Inconsistent,Don't know,Missing,
Pakistan (Punjab),Relationship to the head,HL3,Head,Wife / Husband,Son / Daughter,Son-In-Law / Daughter-In-Law,Grandchild,Parent,Parent-In-Law,Brother / Sister,Brother-In-Law / Sister-In-Law,Uncle / Aunt,Niece / Nephew,Other relative,Adopted / Foster / Stepchild,Servant (Live-in),Inconsistent,DK,Missing,Other (Not related)
Nigeria,Relationship to the head,HL3,Head,Spouse/Partner,Son / Daughter,Son-In-Law / Daughter-In-Law,Grandchild,Parent,Parent-In-Law,Brother / Sister,Brother-In-Law / Sister-In-Law,Uncle / Aunt,Niece / Nephew,Other relative,Adopted / Foster / Stepchild,Servant (Live-in),Inconsistent,DK,Missing,Other (Not related)


---

## ch

### AG2

Unnamed: 0,label,used_indicator
Bangladesh,Age of child,AG2
Pakistan (Punjab),Age of child,AG2
Nigeria,Age of child,AG2


### EC1

Unnamed: 0,label,used_indicator,10.0,99.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
Bangladesh,Number of children's books or picture books fo...,EC1,Ten or more books,Missing,,,,,,,,,,
Pakistan (Punjab),Number of children's books or picture books fo...,EC1,Ten or more books,Missing,,,,,,,,,,
Nigeria,Number of children's books or picture books fo...,EC1,Ten or more books,Missing,,One,Two,Three,Four,Five,Six,Seven,Eight,Nine


### EC5

Unnamed: 0,label,used_indicator,1.0,2.0,8.0,9.0
Bangladesh,Attends early childhood education programme,EC5,Yes,No,DK,Missing
Pakistan (Punjab),Attends early childhood education programme,EC5,Yes,No,DK,Missing
Nigeria,Attends early childhood education programme,EC5,Yes,No,DK,Missing


Each dataframe reports, for each country (one per row):
- `label`: Textual description of the meaning of the item for the given country.
- `used_indicator`: The indicator that has been used for the given country. See the note on `swap_indicators` below.
- `1..99..`: If the answer to the question is categorical, each numbered column is a numerical representation, and the cell (country, number) gives the meaning of that representation for the given country.

Note that:
- The `AG2` item (Age of child) requires a numerical answer (not a categorical answer), so there are no numerical columns;
- The numerical representations of the answers to the `HELEVEL` item (Education of the Household Head) correspond to different education levels in different countries (and we are only considering 3 countries!). The same holds for the `HL3` item (Relation to the Household Head);
- The `EC1` item (Number of children's books) reports the number of books (up to 10) for Nigeria, but it is treated as a categorical variable (">=10", "<10") for the other two countries.
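
Discrepancies like these can also be flagged programmatically. The sketch below is a minimal illustration on a toy dataframe shaped like the `HELEVEL` output above (only two code columns are reproduced); the helper `inconsistent_codes` is hypothetical and not part of `mics_library`, and the code columns are assumed to be named with strings.

```python
import pandas as pd

# Toy dataframe shaped like the HELEVEL output above (labels copied from it)
df = pd.DataFrame(
    {
        "label": ["Education of household head"] * 3,
        "used_indicator": ["HELEVEL"] * 3,
        "2.0": ["Primary incomplete", "Primary", "Primary"],
        "9.0": ["Missing/DK", "Missing/DK", "Missing/DK"],
    },
    index=["Bangladesh", "Pakistan (Punjab)", "Nigeria"],
)


def inconsistent_codes(frame):
    """Return the code columns whose labels differ between countries."""
    code_columns = frame.columns.drop(["label", "used_indicator"])
    # dropna=False: a missing label in one country also counts as a difference
    return [col for col in code_columns if frame[col].nunique(dropna=False) > 1]


flagged = inconsistent_codes(df)
```

This only detects literal differences between label strings, so labels that differ in spelling but not in meaning are flagged too; the flagged columns still need to be inspected by hand.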

We suggest saving the dataframes to `.csv` files to make the available information easier to inspect.
We will use the saved files to correct the discrepancies between the numerical representations.


In [6]:
CHECKDIR = '/path/to/check'

#this will create a folder for each questionnaire
#and a csv file (inside the folder) for each item (of the questionnaire)
for quest in dataframes.keys():
    dataframes_quest = dataframes[quest]
    os.makedirs(os.path.join(CHECKDIR, quest), exist_ok=True)
    
    for ind in dataframes_quest.keys():
        dataframes_quest[ind].to_csv(os.path.join(CHECKDIR, quest, f'{ind}.csv'))
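
When the saved files are read back for inspection, the country names end up in an `Unnamed: 0` column unless the first column is declared as the index. The sketch below shows the round trip on a stand-in dataframe written to a temporary directory; the directory and the dataframe are placeholders for `CHECKDIR` and for the real output of `check_values`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for one of the saved dataframes (values copied from the EC5 table)
original = pd.DataFrame(
    {
        "label": ["Attends early childhood education programme"],
        "used_indicator": ["EC5"],
    },
    index=["Nigeria"],
)

with tempfile.TemporaryDirectory() as checkdir:  # stand-in for CHECKDIR
    path = os.path.join(checkdir, "ch", "EC5.csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    original.to_csv(path)

    # index_col=0 restores the country names as the index
    # instead of loading them as an "Unnamed: 0" column
    reloaded = pd.read_csv(path, index_col=0)
```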

### `swap_indicators`

The `swap_indicators` parameter in the `mics_library` handles the different acronyms that countries use to indicate the SAME question.
The `swap_indicators` parameter should be a dictionary:

`{'questionnaire': 
    {'COUNTRY': {'TARGET_INDICATOR' : 'USED_INDICATOR', 
                 ...}        
     ...}
 }`
 
Here `TARGET_INDICATOR` is the acronym of the item used by the other countries, and `USED_INDICATOR` is the acronym of the same item in the specific `COUNTRY`.

The `mics_library` already includes a database of the indicators that should be swapped for each MICS round; it will be updated as we find more inconsistencies.

In [7]:
from mics_library.swap_indicators import swap_indicators

display(swap_indicators[3])

{'hl': {'Ukraine': {'HL1': 'LN'},
  'Georgia': {'HL1': 'LN'},
  'Palestinians in Lebanon': {'HL10': '', 'HL12': ''}},
 'wm': {'Burundi': {'LN': 'WM4'},
  'Mongolia': {'LN': 'HL1'},
  'Albania': {'LM': 'WMID'},
  'Palestinians in Lebanon': {'LN': 'WM4'}},
 'ch': {'Burundi': {'LN': 'UF4'},
  'Mongolia': {'LN': 'HL1'},
  'Albania': {'LN': 'UFID'},
  'Palestinians in Lebanon': {'LN': 'UF4', 'UF6': ''}},
 'hh': {'Nigeria': {'HC9F': 'HC9I'}, 'Cameroon': {'HC9F': 'HC9H'}}}

To report additional indicators that should be swapped, please open an Issue on the dedicated [gitlab]() page.

When a different indicator is used for a specific country, the `used_indicator` column will show the actual `USED_INDICATOR` (instead of the `TARGET_INDICATOR`).
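
As an illustration of how such a dictionary can be consulted, the sketch below resolves the acronym to look up for a given country, falling back to the target acronym when no swap is registered. The helper `resolve_indicator` is hypothetical and is not how `mics_library` is known to implement the lookup; the two swap entries are copied from the round 3 output above.

```python
# Two entries copied from the round 3 swap table shown above
swaps = {
    "hl": {"Ukraine": {"HL1": "LN"}},
    "hh": {"Nigeria": {"HC9F": "HC9I"}},
}


def resolve_indicator(swaps, questionnaire, country, target):
    """Return the acronym to look up for `country`, defaulting to `target`."""
    return swaps.get(questionnaire, {}).get(country, {}).get(target, target)


# A country with a registered swap gets its own acronym ...
used_ukraine = resolve_indicator(swaps, "hl", "Ukraine", "HL1")
# ... while any other country keeps the target acronym
used_bangladesh = resolve_indicator(swaps, "hl", "Bangladesh", "HL1")
```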

### `ignorecase`
The `ignorecase` parameter forces `mics_library` to treat acronyms that consist of the same characters but differ in case as the same acronym.
If `ignorecase=True`, `AG2`, `ag2`, `Ag2`, etc. are considered the same item.
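
The idea can be sketched as a case-insensitive search over the column names of a dataset. The function `find_column` and the column list below are hypothetical and only illustrate the behaviour described above, not the library's actual implementation.

```python
def find_column(columns, acronym, ignorecase=True):
    """Return the first column matching `acronym`, or None if there is none."""
    for column in columns:
        if column == acronym:
            return column
        if ignorecase and column.lower() == acronym.lower():
            return column
    return None


columns = ["HH1", "ag2", "EC5"]  # made-up column names

match_ignoring_case = find_column(columns, "AG2")
match_exact_only = find_column(columns, "AG2", ignorecase=False)
```

With `ignorecase=True` the lowercase `ag2` column is found for the `AG2` item, while the exact comparison finds nothing.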