# 10-MERGE

In [None]:
from IPython.display import display
%load_ext autoreload
%autoreload 2

In [None]:
import logging
import pandas as pd
import os, webbrowser, pathlib
from fddc.annex_a.merger import configuration, find_sources, read_sources, merge_dataframes
from fddc.annex_a.merger.file_scanner import ScanSource

In [None]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.debug("This is just to get logging to work - it seems to refuse to log unless you log something!")
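
The `logging.debug` call above has an effect because the module-level logging functions call `logging.basicConfig()` when the root logger has no handlers yet. A more explicit alternative is sketched below, assuming a plain console handler is all that is needed. The `force=True` argument requires Python 3.8 or later.

```python
import logging

# Attach a console handler to the root logger and set its level in one step.
# force=True replaces any handlers that are already attached (Python 3.8+).
logging.basicConfig(level=logging.INFO, force=True)

logger = logging.getLogger()
logger.info("Logging is configured")
```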

With the runtime environment set up, we need to configure the application itself.
First we will configure it with the standard Annex A tables and columns.

We can load these from [config/annex-a-merge.yml](./config/annex-a-merge.yml).

In [None]:
# Configure standard settings
data_sources = configuration.parse_datasources("config/annex-a-merge.yml")

Next we need to find some files to include. We can provide individual file names:

In [None]:
sources = find_sources('examples/example-A-2005.xls', 'examples/example-B-2004.xlsx', data_sources=data_sources)

Or we can provide 'glob' patterns:

In [None]:
sources = find_sources('examples/example-*.*', data_sources=data_sources)

pd.DataFrame([{
    'filename':s.sheet.sheet_detail.filename, 
    'sort_key':s.sheet.sheet_detail.sort_key, 
    'sheetname':s.sheet.sheet_detail.sheetname
} for s in sources])
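
To see which names a pattern such as `examples/example-*.*` would pick up, the matching can be approximated with the standard library. This is only a sketch: it uses `fnmatch` on a hypothetical list of names, and `find_sources` may apply its own rules when scanning.

```python
import fnmatch

# Hypothetical directory listing; only the first two names come from the examples.
names = ["example-A-2005.xls", "example-B-2004.xlsx", "example-readme", "notes.txt"]

# '*' matches any run of characters, so 'example-*.*' requires the
# 'example-' prefix and at least one dot later in the name.
matched = sorted(fnmatch.filter(names, "example-*.*"))
print(matched)
```

Note that alphabetical ordering places the 2005 file ahead of the 2004 file.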


When we come to merge these later, the results are deduplicated by keeping the most recent values (which are assumed to be cleaner). Unfortunately, our sample files do not sort correctly because of the A and B in front of the year.

We can provide a 'sort_key' - a regular expression for extracting these values. We're going to use a simple one that matches the year part of the filename. See the [docs](./docs/merger%20-%20components.ipynb) for more details.

In [None]:
file_pattern = ScanSource('examples/example-*.*', [r'/.*?(\d+).*/\1/'])
sources = find_sources(file_pattern, data_sources=data_sources)

pd.DataFrame([{
    'filename':s.sheet.sheet_detail.filename, 
    'sort_key':s.sheet.sheet_detail.sort_key, 
    'sheetname':s.sheet.sheet_detail.sheetname
} for s in sources])
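
The sort key used above can be read as a substitution in the form `/pattern/replacement/`. The sketch below reproduces that idea with `re.sub`, assuming the merger applies the pattern as an ordinary regular-expression replacement. The `sort_key` helper is illustrative and not part of the library.

```python
import re

filenames = ["example-A-2005.xls", "example-B-2004.xlsx"]

def sort_key(filename):
    # Assumption: '/pattern/replacement/' means "replace pattern with replacement".
    # '.*?' skips lazily to the first run of digits, '(\d+)' captures it,
    # and '.*' consumes the rest, so only the captured digits remain.
    return re.sub(r".*?(\d+).*", r"\1", filename)

keys = {name: sort_key(name) for name in filenames}
ordered = sorted(filenames, key=sort_key)
print(keys)
print(ordered)
```

With the year as the key, the 2004 file sorts before the 2005 file regardless of the A and B prefixes.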


We can also save the output of the scan so we can see if we discovered all the
sheets and columns. This report uses some Excel formulae so it's best viewed in 
Excel.

In [None]:
sources = find_sources(file_pattern, data_sources=data_sources, column_report_filename="tmp_report.xlsx")
webbrowser.open(pathlib.Path(os.path.abspath("tmp_report.xlsx")).as_uri())
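
The path is made absolute before being converted because `Path.as_uri()` raises a `ValueError` for relative paths. A small sketch of just the conversion step is shown here. How `webbrowser.open` handles the resulting `file://` URI depends on the platform and its default applications.

```python
import os
import pathlib

# as_uri() only accepts absolute paths, so resolve against the working directory first.
report_path = pathlib.Path(os.path.abspath("tmp_report.xlsx"))
report_uri = report_path.as_uri()
print(report_uri)
```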

You can edit the mappings in this file to correct any columns or sheets that were matched incorrectly. See this 
[example](./examples/matcher-report/report.xlsx) for details.

These can then be loaded back in for rematching.

In [None]:
sources = read_sources("tmp_report.xlsx", data_sources=data_sources)

You can adjust the mappings until you are happy with the result, and then merge the data:

In [None]:
merge_dataframes(sources, data_sources=data_sources, output_file="merged.xlsx")
webbrowser.open(pathlib.Path(os.path.abspath("merged.xlsx")).as_uri())
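
The merge deduplicates by keeping the most recent values, which can be pictured as "later sort keys win". The sketch below is not the merger's implementation: the record layout, the `child_id` key and the values are all hypothetical, and plain dictionaries stand in for the real tables.

```python
# Hypothetical records as (sort_key, row). Rows sharing a child_id are duplicates.
records = [
    ("2005", {"child_id": 1, "status": "closed"}),
    ("2004", {"child_id": 1, "status": "open"}),
    ("2004", {"child_id": 2, "status": "open"}),
]

latest = {}
# Process in ascending sort_key order so rows from later files overwrite earlier ones.
for _, row in sorted(records, key=lambda item: item[0]):
    latest[row["child_id"]] = row

deduplicated = list(latest.values())
print(deduplicated)
```

This is why a correct sort key matters: the row that survives is the one from the file that sorts last.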

For more information on the merger, see the [component documentation](./docs/merger%20-%20components.ipynb).
