First things first: set the ROOTDIR and import the necessary packages:

In [1]:
from IPython.display import display, Markdown

import mics_library
import os

ROOTDIR = '/path/to/original' 
mics_library.set_rootdir(ROOTDIR)

## Import MICS items
This is the last step of the procedure to extract MICS items.

To do so we will use the `mics_library.loaders.import_dataset` function:

In [2]:
from mics_library.loaders import import_dataset

As usual, we need to set the MICS round we want to analyse and define the target MICS items we want to extract.

In [3]:
ROUND = 5

select_indicators = {'hh': ['HELEVEL'], #education level of the household head
                     'hl': ['HL3'],     #relation to the household head
                     'ch': ['EC1',      #number of books
                            'EC5',      #attend early education programme
                            'AG2']      #age of child
                    }

We will also need to correct the inconsistencies in the MICS data (different acronyms or different numerical representations used by different countries).

First, we could define a `swap_indicators` dictionary to correct the errors in the acronyms.
Although we do not need it for this tutorial, we create one just to show what it would look like:

In [4]:
swap_indicators = {'ch': {'Yemen': {'BR8AF': 'BR8AM', 
                                    'BR8AM': 'BR8AF',
                                    'BR8BF': 'BR8BM', 
                                    'BR8BM': 'BR8BF',
                                    'BR8CF': 'BR8CM', 
                                    'BR8CM': 'BR8CF',
                                    'BR8DF': 'BR8DM', 
                                    'BR8DM': 'BR8DF',
                                    'BR8EF': 'BR8EM', 
                                    'BR8EM': 'BR8EF',
                                    'BR8FF': 'BR8FM', 
                                    'BR8FM': 'BR8FF'}}}
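
To illustrate what swapping two acronyms amounts to, here is a minimal pandas sketch on toy data. It is not how `mics_library` applies `swap_indicators` internally (that is not shown in this tutorial); the `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Toy data (invented): two columns whose labels are assumed to be exchanged.
toy = pd.DataFrame({'BR8AF': [1, 2], 'BR8AM': [3, 4]})

# Mapping in the same shape as the innermost dictionary above.
swap = {'BR8AF': 'BR8AM', 'BR8AM': 'BR8AF'}

# rename looks up every label in the mapping independently, so the two
# labels are exchanged rather than one overwriting the other.
swapped = toy.rename(columns=swap)
print(swapped)
```

After the rename, the values originally stored under `BR8AF` are found under `BR8AM`, and vice versa.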

Then we need to load the recoding `.csv` files we created in the previous step.

Since we created the `.csv` files starting from the results of the `check_indicators` step,
the files are already properly organized, and we can use the `mics_library.recode.create_recoding_dict` function:

In [5]:
from mics_library.recode import create_recoding_dict

We need to point the function to the `RECODE_DIR` folder, which contains the questionnaire folders and `.csv` files for each questionnaire with the recodings.

In [6]:
RECODE_DIR = '/path/to/recode'

recoding_dictionary = create_recoding_dict(RECODE_DIR)

This creates the following dictionary:

In [7]:
display(recoding_dictionary)

{'ch': {'EC1': {'Bangladesh': {10.0: 1,
    99.0: nan,
    0.0: 0,
    1.0: 0.0,
    2.0: 0.0,
    3.0: 0.0,
    4.0: 0.0,
    5.0: 0.0,
    6.0: 0.0,
    7.0: 0.0,
    8.0: 0.0,
    9.0: 0.0},
   'Pakistan (Punjab)': {10.0: 1,
    99.0: nan,
    0.0: 0,
    1.0: nan,
    2.0: nan,
    3.0: nan,
    4.0: nan,
    5.0: nan,
    6.0: nan,
    7.0: nan,
    8.0: nan,
    9.0: nan},
   'Nigeria': {10.0: 1,
    99.0: nan,
    0.0: 0,
    1.0: 0.0,
    2.0: 0.0,
    3.0: 0.0,
    4.0: 0.0,
    5.0: 0.0,
    6.0: 0.0,
    7.0: 0.0,
    8.0: 0.0,
    9.0: 0.0}}},
 'hh': {'HELEVEL': {'Bangladesh': {1.0: 0,
    2.0: 0,
    3.0: 0,
    4.0: 0,
    5.0: 1,
    9.0: nan},
   'Pakistan (Punjab)': {1.0: 0, 2.0: 0, 3.0: 0, 4.0: 1, 5.0: 0, 9.0: nan},
   'Nigeria': {1.0: 0, 2.0: 0, 3.0: 1, 4.0: 0, 5.0: 0, 9.0: nan}}}}
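
To make the structure of this dictionary concrete, the sketch below applies one of its innermost mappings to a toy column with `pandas.Series.map`. This is only an illustration of what a recoding does; it is an assumption, not a description of how `import_dataset` uses the dictionary internally. The `raw` values are invented.

```python
import numpy as np
import pandas as pd

# Mapping copied from the 'HELEVEL' entry for Bangladesh shown above.
helevel_bangladesh = {1.0: 0, 2.0: 0, 3.0: 0, 4.0: 0, 5.0: 1, 9.0: np.nan}

# Toy raw answers (invented values).
raw = pd.Series([1.0, 5.0, 9.0, 3.0])

# Series.map replaces each value with its mapped value;
# values that have no entry in the mapping become NaN.
recoded = raw.map(helevel_bangladesh)
print(recoded)
```

Here the code `5.0` becomes `1`, the code `9.0` becomes missing, and every other listed code becomes `0`.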

Then we extract the data:

In [8]:
dataset = import_dataset(ROUND, select_indicators, swap_indicators={},
                         recoding_dictionary=recoding_dictionary, ignorecase=True)    

## Questionnaires and keys

The result of the `import_dataset` function is a dictionary:

`{QUESTIONNAIRE : [data, keys],
  ...}`
 
Here `QUESTIONNAIRE` is the questionnaire acronym (e.g. `hh`) and each value is a list with two elements, `[data, keys]`:
- `data`: a pandas.DataFrame with the items extracted from the questionnaire;
- `keys`: a pandas.DataFrame with the keys that allow linking information between the questionnaires.
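
As a quick illustration of this layout, the sketch below iterates over a toy dictionary with the same `{QUESTIONNAIRE: [data, keys]}` shape. The `toy_dataset` contents are invented, and indexing both frames by the household identifier is an assumption made for the example.

```python
import pandas as pd

# Toy stand-in for the dictionary returned by import_dataset (invented values).
hhid = '5_Bangladesh_1000_10'
toy_dataset = {
    'hh': [pd.DataFrame({'HELEVEL': [0.0]}, index=[hhid]),   # data
           pd.DataFrame({'HHID': [hhid]}, index=[hhid])],    # keys
}

# Unpack the two-element list for each questionnaire.
shapes = {}
for questionnaire, (data_q, keys_q) in toy_dataset.items():
    shapes[questionnaire] = (data_q.shape, keys_q.shape)
    print(questionnaire, data_q.shape, keys_q.shape)
```

With the real result, `dataset['hh']` would be unpacked in the same way.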


### `keys`
A unique participant may respond to multiple MICS questionnaires.
For instance, a woman answers the _Household Listings_ module (_Household Questionnaire_) and the _Women Questionnaire_.
Since each questionnaire focuses on specific aspects, information about a participant is spread across multiple questionnaires.

Further, we might be interested in linking information from related participants.
For instance, we might want to obtain information about the health of mothers of children with disabilities.
In this case, we need to (a) obtain information about the disability status of children; (b) select the children with a disability; and (c) obtain information about the health of their mothers.

`keys` are used to allow linking the same participant across different questionnaires or to associate the participants to other participants (e.g. relatives).
`keys` are created by joining together MICS items that are used to identify households and participants.
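
The three steps (a), (b) and (c) described above can be sketched with plain pandas on toy data. The frames, the country name `Toyland`, and the `has_disability` and `health_score` columns are all hypothetical; only the idea of following a `mother_HLID` key from a child to the mother's row comes from this tutorial.

```python
import pandas as pd

# (a) Toy child-level table, indexed by the child's HLID (invented values).
children = pd.DataFrame(
    {'has_disability': [1, 0],
     'mother_HLID': ['5_Toyland_1_1_2', '5_Toyland_1_2_2']},
    index=pd.Index(['5_Toyland_1_1_3', '5_Toyland_1_2_3'], name='HLID'))

# Toy woman-level table, indexed by the woman's HLID (invented values).
women = pd.DataFrame(
    {'health_score': [7, 9]},
    index=pd.Index(['5_Toyland_1_1_2', '5_Toyland_1_2_2'], name='HLID'))

# (b) Select the children with a disability.
selected = children[children['has_disability'] == 1]

# (c) Follow the mother_HLID key to the mother's row in the women table.
linked = selected.join(women, on='mother_HLID')
print(linked)
```

`DataFrame.join` with `on=` keeps the children's index, so each row of `linked` is still identified by the child's `HLID`.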

#### `HHID`
The basic key is `HHID`, which identifies a household in the dataset.
`HHID` has the following format: `X_COUNTRY_YY_ZZ`, where:
- `X` indicates the MICS round
- `COUNTRY` indicates the country
- `YY` indicates the cluster (within the country)
- `ZZ` indicates the household (within the cluster)
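
The format above can be sketched as a pair of small helpers. `make_hhid` and `split_hhid` are hypothetical names written for this illustration; they are not functions of `mics_library`.

```python
def make_hhid(mics_round, country, cluster, household):
    """Join the four components into an X_COUNTRY_YY_ZZ identifier."""
    return f"{mics_round}_{country}_{cluster}_{household}"


def split_hhid(hhid):
    """Split an identifier back into (round, country, cluster, household).

    Cluster and household are taken from the right and the round from the
    left, so a country name containing '_' would still be kept whole.
    """
    round_country, cluster, household = hhid.rsplit('_', 2)
    mics_round, country = round_country.split('_', 1)
    return mics_round, country, cluster, household


hhid = make_hhid(5, 'Bangladesh', 1000, 10)
print(hhid)
print(split_hhid(hhid))
```

Note that `split_hhid` returns all components as strings.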

#### `HLID`
Another fundamental key is `HLID`, which identifies a participant.
`HLID` is created starting from the `HHID`: `HHID_N`, where `N` is the identifier of the member of the household (often called _Line Number_).
`N` is set to `-1` when the _Line Number_ of the referred member (e.g. mother of child) is not provided/present/available.
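
A minimal sketch of this rule follows. `make_hlid` is a hypothetical helper, not part of `mics_library`, and treating `None` or `NaN` as "not available" is an assumption about how a missing _Line Number_ is represented.

```python
import math


def make_hlid(hhid, line_number):
    """Append the line number to the HHID, using -1 when it is missing."""
    missing = line_number is None or (
        isinstance(line_number, float) and math.isnan(line_number))
    n = -1 if missing else int(line_number)
    return f"{hhid}_{n}"


print(make_hlid('5_Bangladesh_1000_10', 3))
print(make_hlid('5_Bangladesh_1000_10', None))
```

The second call yields an identifier ending in `_-1`, which is the pattern visible in the `mother_HLID` and `father_HLID` columns when the parent is not listed.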

#### `keys` and questionnaires
A default list of `keys` is computed for each questionnaire.
The key in [brackets] is used as the identifier of the elements in the questionnaire:

**`hh`**:
- [`HHID`]
- `child_HLID`: `HLID` of the child chosen as target of _Household Questionnaire's Modules_ (e.g. _Child Labor_, _Discipline_)


**`hl`**:
- [`HLID`]
- `HHID`
- `mother_HLID`: `HLID` of the mother, if provided/present/available *
- `father_HLID`: `HLID` of the father, if provided/present/available *

**`ch`**:
- [`HLID`]
- `HHID`
- `caretaker_HLID`: `HLID` of the reported primary caretaker

**`bh`**:
- [`HLID`]
- `HHID`
- `mother_HLID`: `HLID` of the mother, if provided/present/available *

**`wm`**:
- [`HLID`]
- `HHID`

**`mn`**:
- [`HLID`]
- `HHID`

*_Depending on the MICS round, `mother_HLID` and `father_HLID` might refer to the biological parent or to the household member who assumed that role._

The advanced user will be able to create new keys, using functions provided in the `mics_library`.

## Merging the datasets

Using the computed `keys` we can easily merge data from different questionnaires.

We can do this _by hand_ if we need to comply with special requirements, or use the default function `mics_library.loaders.merge_questionnaires` for general use cases:

In [9]:
from mics_library.loaders import merge_questionnaires

In [10]:
data, keys = merge_questionnaires(dataset)

In [11]:
display(data)

| | HL3 | HELEVEL | AG2 | EC1 | EC5 |
|---|---|---|---|---|---|
| 5_Bangladesh_1000_10_1 | 1.0 | 0.0 | | | |
| 5_Bangladesh_1000_10_2 | 2.0 | 0.0 | | | |
| 5_Bangladesh_1000_10_3 | 3.0 | 0.0 | 1.0 | 0.0 | |
| 5_Bangladesh_1000_11_1 | 1.0 | 0.0 | | | |
| 5_Bangladesh_1000_11_10 | 11.0 | 0.0 | | | |
| ... | ... | ... | ... | ... | ... |
| 5_Pakistan (Punjab)_9_9_5 | 3.0 | 0.0 | | | |
| 5_Pakistan (Punjab)_9_9_6 | 3.0 | 0.0 | | | |
| 5_Pakistan (Punjab)_9_9_7 | 3.0 | 0.0 | | | |
| 5_Pakistan (Punjab)_9_9_8 | 3.0 | 0.0 | | | |


In [12]:
display(keys)

| HLID (index) | HHID | HLID | mother_HLID | father_HLID | country | child_HLID | caretaker_HLID |
|---|---|---|---|---|---|---|---|
| 5_Bangladesh_1000_10_1 | 5_Bangladesh_1000_10 | 5_Bangladesh_1000_10_1 | 5_Bangladesh_1000_10_-1 | 5_Bangladesh_1000_10_-1 | Bangladesh | 5_Bangladesh_1000_10_3 | |
| 5_Bangladesh_1000_10_2 | 5_Bangladesh_1000_10 | 5_Bangladesh_1000_10_2 | 5_Bangladesh_1000_10_-1 | 5_Bangladesh_1000_10_-1 | Bangladesh | 5_Bangladesh_1000_10_3 | |
| 5_Bangladesh_1000_10_3 | 5_Bangladesh_1000_10 | 5_Bangladesh_1000_10_3 | 5_Bangladesh_1000_10_2 | 5_Bangladesh_1000_10_1 | Bangladesh | 5_Bangladesh_1000_10_3 | 5_Bangladesh_1000_10_2 |
| 5_Bangladesh_1000_11_1 | 5_Bangladesh_1000_11 | 5_Bangladesh_1000_11_1 | 5_Bangladesh_1000_11_-1 | 5_Bangladesh_1000_11_-1 | Bangladesh | 5_Bangladesh_1000_11_6 | |
| 5_Bangladesh_1000_11_10 | 5_Bangladesh_1000_11 | 5_Bangladesh_1000_11_10 | 5_Bangladesh_1000_11_0 | 5_Bangladesh_1000_11_0 | Bangladesh | 5_Bangladesh_1000_11_6 | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5_Pakistan (Punjab)_9_9_5 | 5_Pakistan (Punjab)_9_9 | 5_Pakistan (Punjab)_9_9_5 | 5_Pakistan (Punjab)_9_9_2 | 5_Pakistan (Punjab)_9_9_1 | Pakistan (Punjab) | 5_Pakistan (Punjab)_9_9_6 | |
| 5_Pakistan (Punjab)_9_9_6 | 5_Pakistan (Punjab)_9_9 | 5_Pakistan (Punjab)_9_9_6 | 5_Pakistan (Punjab)_9_9_2 | 5_Pakistan (Punjab)_9_9_1 | Pakistan (Punjab) | 5_Pakistan (Punjab)_9_9_6 | |
| 5_Pakistan (Punjab)_9_9_7 | 5_Pakistan (Punjab)_9_9 | 5_Pakistan (Punjab)_9_9_7 | 5_Pakistan (Punjab)_9_9_2 | 5_Pakistan (Punjab)_9_9_1 | Pakistan (Punjab) | 5_Pakistan (Punjab)_9_9_6 | |
| 5_Pakistan (Punjab)_9_9_8 | 5_Pakistan (Punjab)_9_9 | 5_Pakistan (Punjab)_9_9_8 | 5_Pakistan (Punjab)_9_9_2 | 5_Pakistan (Punjab)_9_9_1 | Pakistan (Punjab) | 5_Pakistan (Punjab)_9_9_6 | |


We can now save the resulting dataframes.
Saving all the `keys` might result in (very) big `.csv` files; consider saving only the keys needed for the subsequent analysis.

In [13]:
OUT_DIR = '/path/to/datasets'
data.to_csv(os.path.join(OUT_DIR, 'data.csv'))
keys.to_csv(os.path.join(OUT_DIR, 'keys.csv'))
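
If only some keys are needed, one option is to select those columns before saving and to read the file back with the identifier as index. The sketch below uses a toy `toy_keys` frame and a temporary directory so that it is self-contained; the choice of columns is just an example.

```python
import os
import tempfile

import pandas as pd

# Toy keys table (invented rows), indexed by HLID.
toy_keys = pd.DataFrame(
    {'HHID': ['5_Bangladesh_1000_10', '5_Bangladesh_1000_10'],
     'mother_HLID': ['5_Bangladesh_1000_10_-1', '5_Bangladesh_1000_10_2'],
     'father_HLID': ['5_Bangladesh_1000_10_-1', '5_Bangladesh_1000_10_1']},
    index=pd.Index(['5_Bangladesh_1000_10_2', '5_Bangladesh_1000_10_3'],
                   name='HLID'))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'keys_subset.csv')

    # Save only the key columns needed later.
    toy_keys[['HHID', 'mother_HLID']].to_csv(path)

    # index_col=0 restores the identifiers as the index when reading back.
    restored = pd.read_csv(path, index_col=0)

print(restored)
```

With the real data, `toy_keys` would be replaced by `keys` and the temporary directory by `OUT_DIR`.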

## Congratulations!

You successfully extracted a dataset of MICS items using `mics_library`!

The next steps involve data processing and analysis, which can be carried out using standard methods (e.g. using [pandas](https://pandas.pydata.org/)), starting from the (coherent and consistent) information stored in the `.csv` files we have just created.
