# FDMBuilder - The Extras

If you've arrived here, hopefully you're already comfortable with the basics of the `FDMBuilder` API and have played around with the `FDMTable` and `FDMDataset` classes a bit. If not, take a look at `fdm_builder_basics_tutorial` and then come back here when you're ready.

This tutorial will cover the extra methods and helpers that can be found in the FDMBuilder library. Hopefully, they can help you speed up your FDM building workflow even more. We'll start by taking a look at each of the tools/methods/helpers first, followed by a (hopefully) motivating example that will demonstrate how these tools/methods/helpers can be used "in the wild".

As with the basics tutorial, we'll start by loading the `FDMBuilder` libraries and creating a `DATASET_ID` variable for your test dataset, to make the example code below a little easier to run. Same as before, don't specify a dataset with any tables in it that you'll miss, because we're about to delete them! Replace the `YOUR DATASET HERE` text with the id of your test dataset, and then run the cell below:

In [2]:
from FDMBuilder.FDMTable import *
from FDMBuilder.FDMDataset import *
from FDMBuilder.testing_helpers import *

### !!REPLACE THIS TEXT!! ###

DATASET_ID = "CY_SAM_TEST"

###

# Leave this bit alone!
if check_dataset_exists(DATASET_ID):
    clear_dataset(DATASET_ID)
    print("Good to go!")
else:
    print("#" * 33 + " PROBLEM!! " + 33 * "#" + "\n")
    print("Something doesn't look right. Check you spelled everything correctly,\n" 
          "your dataset has been created in GCP, and you have the right permissions\n")
    print("#" * 80)

Good to go!


## BigQuery Cell Magics

The first tool isn't anything the FDM pipeline can take credit for. Packaged in the Python bigquery library is a "cell magic" that allows you to run pure SQL queries directly from a Jupyter notebook cell. A cell magic is a little bit of syntax that changes or adds extra functionality to a notebook cell - the general syntax of a cell magic is `%%magic-name`, so the bigquery cell magic is `%%bigquery`. Add that to the top of a cell, write your SQL below, and Jupyter/Python will do the rest. Give the below a try:

In [9]:
%%bigquery
SELECT *
FROM `CY_FDM_MASTER.person`
LIMIT 10

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 504.67query/s] 
Downloading: 100%|██████████| 10/10 [00:00<00:00, 12.02rows/s]


Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,death_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,10871865,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10871865,Male,45454912,British,0,,0
1,10877333,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10877333,Female,45454912,British,0,,0
2,10861223,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10861223,Female,45454912,British,0,,0
3,10855850,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855850,Female,45454912,British,0,,0
4,10874629,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874629,Male,45454912,British,0,,0
5,10855432,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855432,Female,45454912,British,0,,0
6,10861024,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10861024,Male,45454912,British,0,,0
7,10874854,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874854,Female,45454912,British,0,,0
8,10869527,45454912,2016,2,15,2016-02-15,NaT,0,0,,,,10869527,Female,45454912,British,0,,0
9,10856693,45454912,2010,2,15,2010-02-15,NaT,0,0,,,,10856693,Female,45454912,British,0,,0


Easy. 

For those familiar with the pandas library, you can store the results of your query as a `pandas.DataFrame` by naming it immediately after the `%%bigquery` magic. So the following cell runs the same query as above, and stores the result in `eg_df`:

In [10]:
%%bigquery eg_df
SELECT *
FROM `CY_FDM_MASTER.person`
LIMIT 10

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 355.15query/s] 
Downloading: 100%|██████████| 10/10 [00:00<00:00, 15.11rows/s]


You can then run the following to take a look at the contents of `eg_df`:

In [11]:
eg_df

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,death_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,10871865,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10871865,Male,45454912,British,0,,0
1,10877333,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10877333,Female,45454912,British,0,,0
2,10861223,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10861223,Female,45454912,British,0,,0
3,10855850,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855850,Female,45454912,British,0,,0
4,10874629,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874629,Male,45454912,British,0,,0
5,10855432,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855432,Female,45454912,British,0,,0
6,10861024,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10861024,Male,45454912,British,0,,0
7,10874854,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874854,Female,45454912,British,0,,0
8,10869527,45454912,2016,2,15,2016-02-15,NaT,0,0,,,,10869527,Female,45454912,British,0,,0
9,10856693,45454912,2010,2,15,2010-02-15,NaT,0,0,,,,10856693,Female,45454912,British,0,,0


If so inclined, you can run and document your SQL pipelines in a notebook by using the above cell magics, and then documenting your work in markdown text cells (like this). Be sure to only stick SQL in cells marked with the `%%bigquery` magic - everything inside these cells is interpreted as SQL, so you'll get some pretty colourful errors if you try sticking Python in there too.

Now on to the extra bits of the FDM pipeline. 

## FDMTable Helpers

The `FDMTable` comes with a bunch of extra "methods" that speed up some of the more "fiddly" bits of the BigQuery SQL environment. We'll be using the last of the test tables, `test_table_3`, to try out these new helpers. By now, it hopefully won't be a shock to see us begin by initialising an `FDMTable` with the location of `test_table_3` and our test dataset:

In [10]:
test_table_3 = FDMTable(
    source_table_id = "CY_FDM_BUILDER_TESTS.test_table_3",
    dataset_id=DATASET_ID
)

### copy_table_to_dataset

In your adventures with the `.build()` process, you may have noticed that the first stage is copying the source table into the FDM dataset. This is a requirement before you start manipulating a table - you don't want to be messing about with the original copy of the data in GCP. Oh no. We need a pristine copy of the original source data, in case anything goes wrong.

Based on this, none of the methods below will work if you don't first make a copy of your source table in your FDM dataset - otherwise there's no table in GCP for your `FDMTable` to manipulate. You can quickly create a copy of the source table using the `copy_table_to_dataset` method like so:

In [11]:
test_table_3.copy_table_to_dataset()

As you'd expect, if you pop over to your GCP SQL workspace, you'll find a fresh copy of `test_table_3` in your test dataset. Pretty simple. Now you can begin playing around with some of the helper functions.

### head

If you want a quick reminder of the contents of your `FDMTable` you can call the `head` function - by default it will return a dataframe with the first 10 rows of your table, like so:

In [12]:
test_table_3.head()

Unnamed: 0,education_reference,examination_period,some_data
0,9E7D67BAFDD3B62FF48BBAD3C228D79,Apr/1991-Apr/1991,98115
1,B6E58AD17BB26AE7029FAEF9A4F956E,Apr/2001-May/2001,98708
2,7193839AB41440C858192B6DC92D39,Apr/2020-Mar/2020,40912
3,D160A728A4625DBE073A3A5C91A68F5,Aug/1990-Jul/2017,34109
4,C25932BF4C5F1219B7B1B22297E5573,Aug/1991-Aug/1991,91875
5,F0A0E6DE856841DC8128A43A111CB,Aug/1991-Sep/1991,99482
6,68BC8D3DA5608BDFF87CCB578EC23E63,Aug/1995-Sep/1995,93362
7,AFC6241DB99BD1C8B82E13F725A9C5E,Aug/2001-Aug/2001,53449
8,5234F059535C97AF66605F7F6C59C0AC,Aug/2014-Aug/2014,4005
9,30D399349CA45E52075EC2879A581DC,Aug/2014-Sep/2014,51604


If, for some reason, you'd like to see a different number of rows, you can use the `n` argument like so:

In [13]:
test_table_3.head(n=5)

Unnamed: 0,education_reference,examination_period,some_data
0,9E7D67BAFDD3B62FF48BBAD3C228D79,Apr/1991-Apr/1991,98115
1,B6E58AD17BB26AE7029FAEF9A4F956E,Apr/2001-May/2001,98708
2,7193839AB41440C858192B6DC92D39,Apr/2020-Mar/2020,40912
3,D160A728A4625DBE073A3A5C91A68F5,Aug/1990-Jul/2017,34109
4,C25932BF4C5F1219B7B1B22297E5573,Aug/1991-Aug/1991,91875


Simple.

### rename_columns

Next is `rename_columns` - you'll be shocked to hear this renames columns in your table. This is surprisingly awkward to do in BigQuery SQL syntax. Hopefully you'll find this a little easier. The logic is as follows:

`rename_columns` takes one argument - a Python "dictionary". Dictionaries are a data type in Python that combine a set of "keys" with a set of "values". They look like this:

```
    example_dict = {
        "key_1": "value_1",
        "key_2": "value_2",
        "key_3": "value_3",
        ...
    }
```

Dictionaries are defined inside curly braces - `{}` - inside which are "keys" and "values" separated by a colon - `:` - and each key-value pair is separated by a comma - `,`. The input to `rename_columns` is a dictionary where each key is an existing column name that you want to change, and each value is the new name, like so:

```
    example_rename_columns_input = {
        "old_name_1": "new_name_1",
        "old_name_2": "new_name_2",
        "old_name_3": "new_name_3",
        ...
    }
```
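
If dictionaries are new to you, the short sketch below may help. It builds a small mapping in the same old-name-to-new-name shape and shows two common things you can do with one: look up a single value by its key, and loop over every key-value pair. The `exam_period` name is made up purely for illustration.

```python
# A rename mapping: existing column name -> new column name.
# "exam_period" is a hypothetical new name, used only for illustration.
rename_map = {
    "some_data": "some_new_data",
    "examination_period": "exam_period",
}

# Look up the new name for one existing column
print(rename_map["some_data"])

# Loop over every key-value pair in the dictionary
for old_name, new_name in rename_map.items():
    print(f"{old_name} -> {new_name}")
```

Each key can only appear once in a dictionary, which suits a rename mapping nicely: every existing column gets exactly one new name.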

So we can rename the `some_data` column in our `test_table_3` to `some_new_data` by running the following code cell:

In [15]:
test_table_3.rename_columns({"some_data": "some_new_data"})

	Renaming Columns:
	some_data -> some_new_data
	Renaming Complete



And we can check that it worked using our newfound `head` method:

In [16]:
test_table_3.head()

Unnamed: 0,some_new_data,education_reference,examination_period
0,98115,9E7D67BAFDD3B62FF48BBAD3C228D79,Apr/1991-Apr/1991
1,98708,B6E58AD17BB26AE7029FAEF9A4F956E,Apr/2001-May/2001
2,40912,7193839AB41440C858192B6DC92D39,Apr/2020-Mar/2020
3,34109,D160A728A4625DBE073A3A5C91A68F5,Aug/1990-Jul/2017
4,91875,C25932BF4C5F1219B7B1B22297E5573,Aug/1991-Aug/1991
5,99482,F0A0E6DE856841DC8128A43A111CB,Aug/1991-Sep/1991
6,93362,68BC8D3DA5608BDFF87CCB578EC23E63,Aug/1995-Sep/1995
7,53449,AFC6241DB99BD1C8B82E13F725A9C5E,Aug/2001-Aug/2001
8,4005,5234F059535C97AF66605F7F6C59C0AC,Aug/2014-Aug/2014
9,51604,30D399349CA45E52075EC2879A581DC,Aug/2014-Sep/2014


### add_column

Next on the list is `add_column`. Shockingly, this adds a new column to our table. It takes one argument, a string that should look like one column of your standard `SELECT` statement. So, for example:

    "some_new_data * 100 AS some_new_data_x_100"
    "LENGTH(education_reference) AS ed_ref_length"
    "LOWER(education_reference) AS lower_case_ed_ref"

If you could stick it at the start of a select statement, it'll work in `add_column`. If you're wondering what any of the above will do, give them a try in the next cell:

In [18]:
test_table_3.add_column("some_new_data * 100 AS new_data_x_100")

test_table_3.head()

Unnamed: 0,some_new_data,education_reference,examination_period,new_data_x_10,new_data_x_100
0,98115,9E7D67BAFDD3B62FF48BBAD3C228D79,Apr/1991-Apr/1991,981150,9811500
1,98708,B6E58AD17BB26AE7029FAEF9A4F956E,Apr/2001-May/2001,987080,9870800
2,40912,7193839AB41440C858192B6DC92D39,Apr/2020-Mar/2020,409120,4091200
3,34109,D160A728A4625DBE073A3A5C91A68F5,Aug/1990-Jul/2017,341090,3410900
4,91875,C25932BF4C5F1219B7B1B22297E5573,Aug/1991-Aug/1991,918750,9187500
5,99482,F0A0E6DE856841DC8128A43A111CB,Aug/1991-Sep/1991,994820,9948200
6,93362,68BC8D3DA5608BDFF87CCB578EC23E63,Aug/1995-Sep/1995,933620,9336200
7,53449,AFC6241DB99BD1C8B82E13F725A9C5E,Aug/2001-Aug/2001,534490,5344900
8,4005,5234F059535C97AF66605F7F6C59C0AC,Aug/2014-Aug/2014,40050,400500
9,51604,30D399349CA45E52075EC2879A581DC,Aug/2014-Sep/2014,516040,5160400
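
To get a feel for why the argument has to look like a piece of a `SELECT` statement, here is a minimal sketch of how such a string could be slotted into a query. The `build_add_column_sql` function is hypothetical and written only for illustration; it isn't part of `FDMBuilder` and may not reflect how `add_column` works internally.

```python
def build_add_column_sql(table_id, column_expression):
    """Return a query selecting every existing column plus one new one.

    Hypothetical helper for illustration only; not part of FDMBuilder.
    """
    return f"SELECT *, {column_expression} FROM `{table_id}`"


sql = build_add_column_sql(
    "CY_SAM_TEST.test_table_3",
    "some_new_data * 100 AS new_data_x_100",
)
print(sql)
```

Anything that reads sensibly after `SELECT *,` in the printed query is the kind of string to aim for.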


### drop_column

This one couldn't be easier - `drop_column` takes one argument, a column name, and drops/deletes the named column. Give it a try:

In [20]:
test_table_3.drop_column("some_new_data")

test_table_3.head()

Unnamed: 0,education_reference,examination_period,new_data_x_10,new_data_x_100
0,9E7D67BAFDD3B62FF48BBAD3C228D79,Apr/1991-Apr/1991,981150,9811500
1,B6E58AD17BB26AE7029FAEF9A4F956E,Apr/2001-May/2001,987080,9870800
2,7193839AB41440C858192B6DC92D39,Apr/2020-Mar/2020,409120,4091200
3,D160A728A4625DBE073A3A5C91A68F5,Aug/1990-Jul/2017,341090,3410900
4,C25932BF4C5F1219B7B1B22297E5573,Aug/1991-Aug/1991,918750,9187500
5,F0A0E6DE856841DC8128A43A111CB,Aug/1991-Sep/1991,994820,9948200
6,68BC8D3DA5608BDFF87CCB578EC23E63,Aug/1995-Sep/1995,933620,9336200
7,AFC6241DB99BD1C8B82E13F725A9C5E,Aug/2001-Aug/2001,534490,5344900
8,5234F059535C97AF66605F7F6C59C0AC,Aug/2014-Aug/2014,40050,400500
9,30D399349CA45E52075EC2879A581DC,Aug/2014-Sep/2014,516040,5160400
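
If you're curious what the equivalent looks like in plain SQL, BigQuery lets you select everything apart from named columns with `SELECT * EXCEPT (...)`. The sketch below builds such a query as a string. The `build_drop_column_sql` function is hypothetical, and this is not necessarily what `drop_column` does behind the scenes.

```python
def build_drop_column_sql(table_id, column_name):
    """Return a query selecting every column except the named one.

    Hypothetical helper for illustration only; not part of FDMBuilder.
    """
    return f"SELECT * EXCEPT ({column_name}) FROM `{table_id}`"


sql = build_drop_column_sql("CY_SAM_TEST.test_table_3", "some_new_data")
print(sql)
```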


### quick_build

When you were running through the basics of the FDM API, you may have thought the `build` process was a bit lengthy - what with all the long-winded explanations and requests for input. That's where `quick_build` comes in. It's basically a more "programmatic" way of completing the build process. It takes the training wheels off, so to speak, and in doing so makes the process a lot "snappier".

`quick_build` takes up to five arguments:

* `fdm_start_date_cols`: Either a string or a list, detailing the columns that contain the start date information. If the start date is found in a single column with a datetime or a string that can easily be parsed, then the input would be a string with the column name. If the start date is in multiple columns with individual year, month and day, the input would be a list with the column names - or, a static value for one or more of the year/month/day. If we think back to the test examples, for `test_table_1` we would use `"start_date"` and for `test_table_2` we would input `["start_year", "start_month", "15"]`
* `fdm_start_date_format`: This is one of `"YMD"`/`"YDM"`/`"DMY"`/`"MDY"`. Hopefully fairly self-explanatory: it's simply the date format of the start date data. This is required whether you input a single column or multiple columns - so it doesn't matter what order you input your `fdm_start_date_cols` in, provided you correctly specify the date format.
* `fdm_end_date_cols`: This is an optional argument, depending on the need for an end date in the source data you're FDMing. It takes exactly the same input format as the `fdm_start_date_cols` argument, so no need to go over that again.
* `fdm_end_date_format`: Again, an optional argument depending on the presence of an end date, with the same input specification as the `fdm_start_date_format`
* `verbose`: By default this is set to `True`, and controls the console output while the `quick_build` process is running. When set to `True`, the console will output text telling you what stage the script has reached. If set to `False`, the console output is suppressed.
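
To make the relationship between the date columns and the date format a little more concrete, here is a small pure-Python sketch. It assumes that each letter of the format describes the entry in the same position of the column list, and that any entry that isn't a column name is treated as a static value. The `assemble_date` function and the example row are hypothetical, and month values are assumed to be numeric to keep things short; `quick_build` itself may well behave differently under the hood.

```python
from datetime import date

VALID_FORMATS = {"YMD", "YDM", "DMY", "MDY"}


def assemble_date(row, cols, date_format):
    """Build a date from three entries (column names or static values).

    Hypothetical helper for illustration only; not part of FDMBuilder.
    `row` is a dictionary of column name -> value for a single entry.
    """
    if date_format not in VALID_FORMATS:
        raise ValueError(f"date_format must be one of {sorted(VALID_FORMATS)}")
    if len(cols) != 3:
        raise ValueError("expected three entries: one each for year, month and day")

    parts = {}
    for letter, entry in zip(date_format, cols):
        # Use the row's value when the entry names a column,
        # otherwise treat the entry itself as a static value
        raw_value = row[entry] if entry in row else entry
        parts[letter] = int(raw_value)
    return date(parts["Y"], parts["M"], parts["D"])


row = {"start_year": 2010, "start_month": 5}

# The same date, described in two different orders
print(assemble_date(row, ["start_year", "start_month", "15"], "YMD"))
print(assemble_date(row, ["15", "start_month", "start_year"], "DMY"))
```

Both calls describe the same date. The order of the entries is free, as long as the format string describes that order.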

That may seem a little daunting to take in all at once, but the process is really pretty simple once you get started. We'll save working examples for the moment, as we'll see plenty when we take a look at an example workflow below.

### recombine

You'll hopefully recall that, as part of the `FDMDataset`'s `build` process, the tool splits any entries that are found to have "problems" into separate tables. This is an important part of completing an FDM dataset, but presents an issue if you then want to start manipulating the source data after the build - you have two separate tables that contain the source data! This could quickly lead to errors, particularly if you want to correct any of the problems identified during the FDM build.

`recombine` is a method designed to resolve this issue. If you want to start manipulating a table that has been split from its problematic entries, you must first `recombine` it - stitch the two tables back together. The method itself couldn't be easier to use - simply call `your_table.recombine()` on a table that has an associated "fdm_problems" table, and the script does the rest.

You'll find that if you try to use any of the above helpers on a table that has a separate "fdm_problems" table, or you try to build an FDM from a dataset containing "fdm_problems" tables, the method will return an error and ask you to first recombine the problem entries before continuing. To help in any efforts to correct problem entries, the "problem" column from the associated "fdm_problems" table is kept after using `recombine`, and is `NULL` for any entries that don't have an associated problem.
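
To picture what a recombined table looks like, here is a toy sketch using plain Python lists of dictionaries in place of BigQuery tables. The `recombine_rows` function and the problem text are made up for illustration, and `None` stands in for SQL's `NULL`. This isn't how `recombine` is implemented, just the shape of the result you should expect.

```python
def recombine_rows(clean_rows, problem_rows):
    """Stitch clean and problem rows back into one list of rows.

    Hypothetical helper for illustration only; not part of FDMBuilder.
    Clean rows gain a "problem" entry of None; problem rows keep theirs.
    """
    combined = [{**row, "problem": None} for row in clean_rows]
    combined.extend(dict(row) for row in problem_rows)
    return combined


clean_rows = [{"person_id": 10855043, "some_data": 37427}]
problem_rows = [
    # The problem text here is a made-up placeholder
    {"person_id": 99999992, "some_data": 77732, "problem": "hypothetical problem text"},
]

recombined = recombine_rows(clean_rows, problem_rows)
for row in recombined:
    print(row)
```

After recombining, filtering on the "problem" column being empty or not is enough to tell the clean entries from the ones that still need attention.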

It would be a bit of a faff setting up an example here, but there will be examples of `recombine` in the workflow below.

## Other Helpers

Before we take a look at a more involved example workflow, a quick mention of a couple of functions we've already seen but have glossed over until now. These aren't methods attached to either the `FDMTable` or `FDMDataset` classes, but stand-alone functions in their own right. They're all very simple to use:

### check_dataset_exists / check_table_exists

Both do what they say on the tin - either checking a table or a dataset exists! Simply stick the id of a table or dataset in the function, and it will return either `True` or `False` depending on the existence of the named table/dataset:


In [25]:
check_dataset_exists("CY_FDM_MASTER")

True

In [26]:
check_table_exists("this_table.doesn't.exist")

False

### clear_dataset

Again, does pretty much what it says on the tin. `clear_dataset` will remove every table from the dataset you point it at. Obviously, to be used with caution! We'll need to clear out our test dataset before we can start with the example workflow below, so let's do just that:

In [27]:
clear_dataset(DATASET_ID)

## Example FDM Building Workflow

Righty, we've reached the now much talked about example workflow. In this example, we'll assume we're building an FDM from scratch, using the tables in `CY_FDM_BUILDER_TESTS`. The process will be similar to that of the basics tutorial, but a little more "programmatic" and with less navigating backwards and forwards between Jupyter and GCP.

A quick note - you should expect to see some errors when going through this workflow. Indeed, errors are a normal part of programming in any language! The FDM tools have been designed to be (relatively) robust to various missteps, so you should see an informative error message if you do something one of the tools doesn't like. At any rate, if you see an error here, don't panic - it's supposed to be there. Just give the error message a quick read (the main message will be at the bottom of the error readout) and continue with the tutorial - it will (hopefully) keep you abreast of what's happening.

We'll start, as we did before, by building each of the individual tables in our test dataset. Let's start with `test_table_1`:

In [29]:
test_table_1 = FDMTable(
    source_table_id = "CY_FDM_BUILDER_TESTS.test_table_1",
    dataset_id = DATASET_ID
)

Let's take a quick look at the data in `test_table_1`:

In [30]:
test_table_1.head()

ValueError: 
    A copy of yhcr-prd-phm-bia-core.CY_SAM_TEST.test_table_1 doesn't yet exist in f"CY_SAM_TEST.
    Try running .copy_table_to_dataset() and then try again 

And there's our first error! You'll remember that we need a copy of our source table in our working dataset, otherwise we can't do anything with it. If you don't, the error message above should hopefully serve as a reminder. Let's make a copy and try that again:

In [31]:
test_table_1.copy_table_to_dataset()
test_table_1.head()

Unnamed: 0,person_id,start_date,some_data
0,10868457,,36393
1,10863750,,74032
2,10870103,,79308
3,10870825,,26188
4,10865857,,93727
5,10855043,9-May-2010,37427
6,10863830,11-May-2013,55338
7,10532576,14-May-1934,87276
8,10581816,19-May-2028,74068
9,99999992,2-July-2010,77732


Great. `test_table_1` looks like it's already in a perfect state to complete the FDM build process. We'll have our first go with the `quick_build` method. Remember, we just need to point the method to the column(s) that contain the start dates and, if necessary, the end dates:

In [32]:
test_table_1.quick_build(
    fdm_start_date_cols="start_date",
    fdm_start_date_format="DMY"
)

Building test_table_1:
    existing copy of test_table_1 in CY_SAM_TEST

    test_table_1 already contains person_id column


    fdm_start_date column added
    no fdm_end_date info provided
Done.
