## Quick Jupyter notebooks primer

A Jupyter notebook allows you to write markdown and execute Python/R scripts in one document.

This text has been written in a markdown cell (double-click right here and you'll be able to edit the markdown). Running a markdown cell renders it (so displaying the markdown in its formatted, non-scripted form).

Immediately below this is the first code cell - it contains script that imports all of the required python libraries to run the FDMBuilder. Any output from a code cell will be displayed immediately below the cell.

Be sure to write text and documentation in a markdown cell, and script in a code cell - otherwise you'll get some pretty colourful errors!

There are a bunch of controls to manage each cell in the notebook: the UI has buttons above that can run a code cell, change a code cell to a markdown cell or vice versa, stop execution of a code cell, execute every cell in the notebook, and so on. Hover over each of the buttons above to see what they do. You can also perform all cell-related activities by selecting the `Cell` menu in the toolbar and choosing the relevant option.

However, hotkeys are usually the easiest way to quickly run code cells (and render markdown). Simply select a code cell and:

* press `ctrl+enter` to run the cell 
* press `shift+enter` to run the cell and move focus to the cell below
* press `alt+enter` to run the cell and create a new code cell below

That should be enough to get started - plenty of other online guides exist if you want to get better acquainted with the jupyter notebook environment.

Get started by running the below code cell, which imports all the required python libraries for the FDMBuilder:

In [1]:
from FDMBuilder.FDMTable import *
from FDMBuilder.FDMDataset import *
from FDMBuilder.testing_helpers import *

## Tutorial Prerequisites:

You'll need to have an empty dataset in GCP to play around with the FDMBuilder tools - so either create one now in preparation, or ask someone with the relevant privileges to make one for you if you can't.

Once you have a dataset, replace the dataset id in the cell below (`CY_SAM_TEST` in this example) with your own dataset's id (be sure to keep the quotes or python will get upset with you):

In [2]:
DATASET_ID = "CY_SAM_TEST"
if check_dataset_exists(DATASET_ID):
    print("Good to go!")
else:
    print("#" * 33 + " PROBLEM!! " + 33 * "#" + "\n")
    print("Something doesn't look right. Check you spelled everything correctly,\n"
          "your dataset has been created in GCP, and you have the right permissions\n")
    print("#" * 80)

Good to go!


As you may have figured out, the above code stores the name of your new dataset in a variable called `DATASET_ID`, and does a quick check to make sure the dataset exists - this just makes the following examples a little easier to run without any silly errors occurring. Hopefully you should see a message saying `Good to go!` in the cell output above.

We'll also be using a couple of pre-prepared source datasets that can be found in `CY_FDM_BUILDER_TESTS`. Take a quick look in the BigQuery SQL workspace to check you've been given access to the test dataset and that you can see the two test tables, `test_table_1` & `test_table_2`, in there (if not give me - Sam - a shout). 

## FDM Builder - The basics

Note: This guide assumes you're familiar with the term FDM and associated concepts.

The FDMBuilder library has been designed so that a non-python user shouldn't (hopefully) have too much difficulty using the FDM tools to build a dataset from scratch. The workflow is split into two major steps:

1. Prepare the source tables
2. Build the FDM

Each step comes with its own tool or helper that walks through the process of preparing and building an FDM dataset. Source tables are "built" or prepared for the FDM process with the `FDMTable` tool - this is a python "class" that contains all the bits and pieces needed to clean and prep a table for FDMing. Once all the source tables are ready, the FDM dataset itself is "built" using the `FDMDataset` tool - another python class responsible for drawing all the source tables together and building the standard FDM tables (person and observation_period).

We'll begin with the basics of using the FDMTable and FDMDataset tools to build an FDM dataset. Once we're more comfortable with the python workflow, we can then move on to the more "advanced" functions that can streamline many of the more common cleaning/manipulation activities that pop up during the FDM process.

## FDMTable

To begin the FDM process, we need to prep each source table. This process ensures that:

1. The source table is copied to the FDM dataset location
2. person_ids are added to each entry
3. An event_start_date is added to each entry in a cleaned `DATETIME` format
4. If needed, an event_end_date is added to each entry in a cleaned `DATETIME` format

To do this using the python FDMBuilder, you first need to define an individual FDMTable object for each of the source tables in your FDM dataset. We can do just that by running the code cell below:


In [3]:
test_table_1 = FDMTable(
    source_table_id="CY_FDM_BUILDER_TESTS.test_table_1",
    dataset_id=DATASET_ID
)

The above code cell creates a new FDMTable object and stores it as a python variable `test_table_1` - the arguments when creating or initialising an FDMTable are:

* `source_table_id`: the id of the source table (hopefully that wasn't a surprise!). This can be in "project.dataset_id.table_id" form or just "dataset_id.table_id" form
* `dataset_id`: the id of the dataset in which you'll be copying/building your FDM dataset - in this tutorial all such `dataset_id` parameter inputs will be replaced with the `DATASET_ID` variable we created at the beginning of the notebook

Initialising the `FDMTable` class doesn't actually do anything particularly substantive - it just creates and stores an object in python. To start working with the tool, you need to run or "call" one of the FDMTable's "methods". Methods are functions attached to a specific class that update/manipulate/otherwise mess about with the related class object. So the FDMTable class has methods that manipulate the associated FDM table in GCP, doing things like adding/deleting/renaming columns and so on.

To start, we'll look at the most important of these methods, `build` - fortunately it's also the easiest to get to grips with. Methods are called by specifying the class object, followed by a `.` and then the name of the method. So we call the `build` method on the FDMTable we just defined by running:

```
test_table_1.build()
```
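If the difference between initialising an object and calling one of its methods is new to you, it can be sketched with a toy class. This is purely illustrative and is not how `FDMTable` is implemented: the `ToyTable` class, its attributes and its `rename_column` method are all made up for this example.

```python
class ToyTable:
    """A hypothetical stand-in for a table object (not the real FDMTable)."""

    def __init__(self, table_id, columns):
        # Initialising only stores details on the object; no other work happens yet.
        self.table_id = table_id
        self.columns = list(columns)

    def rename_column(self, old_name, new_name):
        # A method is a function attached to the class that updates the object.
        self.columns = [new_name if c == old_name else c for c in self.columns]
        return self.columns


toy = ToyTable("my_dataset.my_table", ["digest_with_wrong_name", "start_year"])
print(toy.columns)  # initialising alone has not changed any column names

# Call the method with the object name, a `.`, then the method name.
toy.rename_column("digest_with_wrong_name", "digest")
print(toy.columns)
```

The pattern is the same for the real tools: create the object once, then call methods on it to get work done.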

The `build` method is designed to walk the user through the process of preparing an FDM table, stopping each time user input is required. Each time the script stops, it will give a short explanation why and will ask for input with a bit of guidance on the input required.

Give it a try! Run the below cell to build your first FDM table - simply read what the build script says and enter the required input when asked:

(Note: You'll need the source table available in a separate SQL workspace, so you can preview the data in `test_table_1`)

In [5]:
clear_dataset(DATASET_ID)

In [6]:
test_table_1.build()

	 ##### BUILDING FDM TABLE COMPONENTS FOR test_table_1 #####
________________________________________________________________________________

1. Copying test_table_1 to CY_SAM_TEST:

    Table test_table_1 copied to CY_SAM_TEST!

2. Adding person_id column:
    test_table_1 already contains person_id column

3. Adding fdm_start_date column:



    An event start date is required to build the observation_period table. This 
    information should be contained within one or more columns of your table. 
    If unsure a quick look at the table data in BigQuery should clarify.
    
    To start, is the event start data found in one column that can be easily 
    parsed with a day, month and year?
    > Type y or n  y

    Which column contains the event start date?
    > Type the name (case sensitive):  start_date

    What format does the date appear in YMD/YDM/DMY/MDY?
    > Type one:  YMD


    fdm_start_date column added

4. Adding fdm_end_date column:



    An event end date may or may not be relevant to this source data. For example, 
    hospital visits or academic school years have an end date as well as a start 
    date.
    If you're unsure weather or not the source data should include an event end 
    date, seek help from the CYP data team."
    Does this data have an event end date?"
    > Type y or n:  n


________________________________________________________________________________

	 ##### BUILD PROCESS FOR test_table_1 COMPLETE! #####



Hopefully that went without a hitch! If not give me (Sam) a shout... 

If you quickly take a look over at GCP and give your tab a quick refresh, you should notice that your test dataset now contains a copy of `test_table_1` and the table has a shiny new `fdm_start_date` column. You'll also notice the parser has taken a string with text and digits and has converted it into a SQL datetime - this should hopefully save a lot of manual faff in the long run!
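To give a feel for why the build script asks which order the date parts appear in, here is a rough sketch of parsing a date string using only the Python standard library. It is not the FDMBuilder parser (which isn't shown in this notebook): the `PATTERNS` table, the `parse_date` function and the example strings are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical mapping from the order typed at the prompt to strptime patterns.
# %b matches an abbreviated month name (e.g. "Jan"), %m a month number.
PATTERNS = {
    "YMD": ["%Y-%b-%d", "%Y-%m-%d"],
    "YDM": ["%Y-%d-%b", "%Y-%d-%m"],
    "DMY": ["%d-%b-%Y", "%d-%m-%Y"],
    "MDY": ["%b-%d-%Y", "%m-%d-%Y"],
}


def parse_date(text, order):
    """Try each pattern for the given order and return the first match."""
    for pattern in PATTERNS[order]:
        try:
            return datetime.strptime(text, pattern)
        except ValueError:
            continue  # this pattern did not fit, so try the next one
    raise ValueError(f"could not parse {text!r} as {order}")


# A string mixing text and digits, parsed year-first.
parsed = parse_date("2021-Jan-05", "YMD")
print(parsed.isoformat())

# The same digits mean different dates depending on the order you give.
print(parse_date("03-04-2021", "DMY").date())
print(parse_date("03-04-2021", "MDY").date())
```

The last two lines are the reason the prompt can't be skipped: a string like `03-04-2021` is a valid date under more than one ordering.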

This was a pretty simple example that didn't ask much of the FDMBuilder - the second example throws a couple more curveballs into the mix, but hopefully the build script should still guide you through. When you're ready, run the below code cell:


In [7]:
test_table_2 = FDMTable(
    source_table_id="CY_FDM_BUILDER_TESTS.test_table_2",
    dataset_id=DATASET_ID
)
    
test_table_2.build()

	 ##### BUILDING FDM TABLE COMPONENTS FOR test_table_2 #####
________________________________________________________________________________

1. Copying test_table_2 to CY_SAM_TEST:

    Table test_table_2 copied to CY_SAM_TEST!

2. Adding person_id column:



    No identifier columns found! FDM process requires a person_id column 
    in each table -  or  a digest/EDRN column to be able to link  person_ids.
    person_id/digest/EDRN columns may be present under a different name - do any 
    of the following colums contain digests or EDRNs? 
    (Note: identifiers are case sensitive)
    
    
		digest_with_wrong_name
		start_month
		start_year
		end_month
		end_year
    
    If so, type the column in question. If not, type n.
    > Response:  digest_with_wrong_name

    Does digest_with_wrong_name contain person_ids, digests or EDRNs?
    > Type either person_id, digest or EDRN:  dige

    Response needs to match one of person_id, digest or EDRN and is 
    case-sensitive.
    > Response:
                     digest


	Renaming Columns:
	digest_with_wrong_name -> digest
	Renaming Complete


    person_id column added

3. Adding fdm_start_date column:



    An event start date is required to build the observation_period table. This 
    information should be contained within one or more columns of your table. 
    If unsure a quick look at the table data in BigQuery should clarify.
    
    To start, is the event start data found in one column that can be easily 
    parsed with a day, month and year?
    > Type y or n  n

    We'll build the event start date beginning with the year. Where can the
    year information be found?
    Your response can be the name of a column that contains the year (year only,
    other formats can't be parsed) or a static value (e.g. 2022).
    If the year information isn't contained in one column, type quit as your 
    response, add a column with the year information and then re-run .build(). 
    You may find the .add_column() method useful for this.
    > Response:  start_year

    And now we'll move onto the month. The same guidance as above applies.
    Remebmer, a static value like 02, or Feb, o


    adding fdm_start_date_column...
    fdm_start_date column added

4. Adding fdm_end_date column:



    An event end date may or may not be relevant to this source data. For example, 
    hospital visits or academic school years have an end date as well as a start 
    date.
    If you're unsure weather or not the source data should include an event end 
    date, seek help from the CYP data team."
    Does this data have an event end date?"
    > Type y or n:  y

    The process will now proceed in exactly the same way as with the event start 
    date. Refer to the guidance above if at all unsure about the responses to any
    of the following questions.
    Is the event end data found in one column that can be easily parsed with a 
    day, month and year?
    > Type y or n:  n

    Where can the event end year be found?
    > Response:  end_year

    Where can the event end month be found?
    > Response:  end_month

    Where can the event end day be found?
    > Response:  15


    fdm_end_date column added
________________________________________________________________________________

	 ##### BUILD PROCESS FOR test_table_2 COMPLETE! #####






Hopefully you made it through that without too much issue. Like before, if you hop over to GCP and refresh your SQL workspace, you should see a `test_table_2` in your dataset. This time the FDMTable tool has done a little more work - it renamed the (somewhat meta) `digest_with_wrong_name` column, added `person_id`s from the digest column, and has parsed the `event_start_date`s and `event_end_date`s. 
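The idea of assembling a date from separate year and month columns plus a static day can be sketched in plain Python. This is only an illustration under assumptions, not the FDMBuilder's own logic: the example rows, the `month_number` and `build_date` helpers, and the way month names are handled are all hypothetical.

```python
from datetime import datetime

# Made-up rows using the column names seen in test_table_2.
rows = [
    {"start_year": "2019", "start_month": "Feb"},
    {"start_year": "2020", "start_month": "11"},
]
STATIC_DAY = 15  # a static value used when the table has no day column


def month_number(value):
    """Accept a month as digits ("02") or as a name ("Feb", "February")."""
    value = str(value).strip()
    if value.isdigit():
        return int(value)
    # Use the first three letters so full month names also work.
    return datetime.strptime(value[:3], "%b").month


def build_date(row, year_col, month_col, day):
    """Combine a year column, a month column and a static day into a datetime."""
    return datetime(int(row[year_col]), month_number(row[month_col]), int(day))


dates = [build_date(r, "start_year", "start_month", STATIC_DAY) for r in rows]
for d in dates:
    print(d.isoformat())
```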

That's the basics of the table prep stage done. Now we can move on to actually building the FDM Dataset.

## FDMDataset

You'll probably note that, thus far, the FDMTable tool doesn't seem to have done anything all that dramatic. It's just copied a couple of tables into our FDM dataset, made sure the person_id is in good order and added a couple of dates. But it's important that these boxes are ticked off properly before we try to build the rest of the FDM dataset - that is, building the `person` and `observation_period` tables and removing any problematic entries in our source data.

The job of building the FDM dataset is given to the aptly named `FDMDataset` class. It works in a very similar way to the `FDMTable` class - you initialise it with some simple details, and then run a `.build()` method to have it work its magic. Unlike the `FDMTable` however, the `FDMDataset` can work said magic without the need for any user input. All that's required is:

1. A dataset for your FDM
2. Source tables that have already been built using the `FDMTable` class/tool
3. No other tables that aren't FDM source tables in the dataset (or tables that the `FDMDataset` class has built itself - but more on that later)

But, if any of these requirements aren't in order, the `FDMDataset` build process will get upset and tell you about it.
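The kind of pre-flight check described above can be sketched as follows. This is a guess at the general shape only, not the real `FDMDataset` code: the `check_tables` function is hypothetical, and the required column names are assumed from the `FDMTable` build output earlier in this notebook.

```python
# Assumed column names, taken from the FDMTable build output above.
REQUIRED_COLUMNS = {"person_id", "fdm_start_date"}
# Tables the dataset build creates itself, which are not source tables.
BUILT_BY_DATASET = {"person", "observation_period"}


def check_tables(tables):
    """Return {table_name: sorted list of missing required columns}."""
    report = {}
    for name, columns in tables.items():
        # Skip the standard FDM tables and any "_problems" tables.
        if name in BUILT_BY_DATASET or name.endswith("_problems"):
            continue
        report[name] = sorted(REQUIRED_COLUMNS - set(columns))
    return report


# A made-up dataset listing: two prepared tables and one stray table.
tables = {
    "test_table_1": ["person_id", "fdm_start_date", "some_value"],
    "test_table_2": ["person_id", "fdm_start_date", "fdm_end_date"],
    "stray_table": ["id", "value"],
    "person": ["person_id"],
}

report = check_tables(tables)
not_ready = [name for name, missing in report.items() if missing]
print(report)
print("Tables that would block the build:", not_ready)
```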

The dataset you've built for this tutorial *should* tick all those boxes - provided you haven't deviated off the path this notebook has been walking. If so, you're ready to build your dataset. First, initialise your `FDMDataset` instance by running the following cell:

In [6]:
test_dataset = FDMDataset(
    dataset_id=DATASET_ID
)

and then, as with the `FDMTable`, you just run the `.build()` method, and it takes care of the rest:

In [7]:
test_dataset.build()

		 ##### BUILDING FDM DATASET CY_SAM_TEST #####
________________________________________________________________________________

1. Checking dataset for source tables:

    * test_table_1 contains:  - person_id - fdm_event_start_date 
	-> Table ready
    * test_table_2 contains:  - person_id - fdm_event_start_date  * fdm_event_end_date
	-> Table ready

2. Building person table

    * Person table built with 186 entries

3. Separating out problem entries from source tables

    test_table_1:
	* 45 problem entries identified and removed to test_table_1_problems
	* 55 entries remain in test_table_1
    test_table_2:
	* 37 problem entries identified and removed to test_table_2_problems
	* 63 entries remain in test_table_2

4. Rebuilding person table

    * Person table built with 118 entries

5. Building observation_period table

    * observation_period table built with 118 entries

________________________________________________________________________________

	 ##### BUILD PROCESS FO

And there you have it - your FDM. Magic.

If you head over to your GCP SQL workspace, you should see some new tables that form your FDM: a `person` table, an `observation_period` table and 2 "problem" tables that correspond with each of the source tables. 

The problem tables contain the entries that have been removed for one of several possible issues or errors. If you take a look at the contents of one of these problem tables, you'll find the logic behind them pretty self explanatory - both tables contain a "problem" column that contains a description of the reason they were removed from the source data, for example:

    "event_start_date is after death_datetime (+42 days)"
    
    "event_start_date is before person birth_datetime - Note: Within pre-natal period"
    
Note: the "pre-natal period" message signifies that the event starts within the 9 months (or so) of the mother's pregnancy - worth paying attention to for certain datasets, e.g. maternity care/social care and so on. There are ways to have the `.build()` process include these entries, discussed further down.
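The logic behind these two example messages can be sketched roughly as below. This is an interpretation of the messages rather than the FDMBuilder's actual rules: the `find_problem` function is hypothetical, the 42-day allowance is read off the "(+42 days)" text, and the 280-day pre-natal window is an assumed stand-in for "9 months or so".

```python
from datetime import datetime, timedelta

DEATH_GRACE = timedelta(days=42)       # read off the "(+42 days)" message
PRENATAL_WINDOW = timedelta(days=280)  # assumption: roughly nine months


def find_problem(event_start, birth, death=None):
    """Return a problem description for an entry, or None if it looks fine."""
    if death is not None and event_start > death + DEATH_GRACE:
        return "event_start_date is after death_datetime (+42 days)"
    if event_start < birth:
        message = "event_start_date is before person birth_datetime"
        # Flag events that fall shortly before birth separately.
        if birth - event_start <= PRENATAL_WINDOW:
            message += " - Note: Within pre-natal period"
        return message
    return None


birth = datetime(2016, 1, 15)
print(find_problem(datetime(2015, 10, 1), birth))   # a few months before birth
print(find_problem(datetime(2010, 1, 1), birth))    # years before birth
print(find_problem(datetime(2020, 1, 1), birth, death=datetime(2019, 1, 1)))
print(find_problem(datetime(2017, 1, 1), birth))    # no problem found
```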

That about does it for the basics. There are a few extra helpers and functions that might be interesting once you're comfortable with the pipeline and the python environment, detailed below:

## BigQuery Cell Magics

The first tool isn't anything the FDM pipeline can take credit for. Packaged in the python bigquery library is a "cell magic" that allows you to run pure SQL queries directly from a Jupyter notebook cell. A cell magic is a little bit of syntax that adds some extra functionality to a notebook cell - the general syntax of a cell magic is `%%name-here`, so the bigquery cell magic is `%%bigquery`. Add that to the top of a cell, write your SQL below, and Jupyter/Python will do the rest. Give the below a try:

In [7]:
%%bigquery
SELECT *
FROM `CY_FDM_MASTER.person`
LIMIT 10

Query complete after 0.01s: 100%|██████████| 2/2 [00:00<00:00, 575.07query/s]                         
Downloading: 100%|██████████| 10/10 [00:00<00:00, 10.29rows/s]


Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,death_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,10871865,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10871865,Male,45454912,British,0,,0
1,10877333,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10877333,Female,45454912,British,0,,0
2,10861223,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10861223,Female,45454912,British,0,,0
3,10855850,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855850,Female,45454912,British,0,,0
4,10874629,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874629,Male,45454912,British,0,,0
5,10855432,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855432,Female,45454912,British,0,,0
6,10861024,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10861024,Male,45454912,British,0,,0
7,10874854,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874854,Female,45454912,British,0,,0
8,10869527,45454912,2016,2,15,2016-02-15,NaT,0,0,,,,10869527,Female,45454912,British,0,,0
9,10856693,45454912,2010,2,15,2010-02-15,NaT,0,0,,,,10856693,Female,45454912,British,0,,0


Easy. 

For those familiar with the pandas library, you can store the results of your query as a `DataFrame` by naming it immediately after the `%%bigquery` magic. So the following cell runs the same query as above, and stores the result in `eg_df`:

In [8]:
%%bigquery eg_df
SELECT *
FROM `CY_FDM_MASTER.person`
LIMIT 10

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 492.35query/s] 
Downloading: 100%|██████████| 10/10 [00:00<00:00, 10.20rows/s]


In [9]:
eg_df

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,death_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,10871865,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10871865,Male,45454912,British,0,,0
1,10877333,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10877333,Female,45454912,British,0,,0
2,10861223,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10861223,Female,45454912,British,0,,0
3,10855850,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855850,Female,45454912,British,0,,0
4,10874629,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874629,Male,45454912,British,0,,0
5,10855432,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10855432,Female,45454912,British,0,,0
6,10861024,45454912,2016,1,15,2016-01-15,NaT,0,0,,,,10861024,Male,45454912,British,0,,0
7,10874854,45454912,2010,1,15,2010-01-15,NaT,0,0,,,,10874854,Female,45454912,British,0,,0
8,10869527,45454912,2016,2,15,2016-02-15,NaT,0,0,,,,10869527,Female,45454912,British,0,,0
9,10856693,45454912,2010,2,15,2010-02-15,NaT,0,0,,,,10856693,Female,45454912,British,0,,0


If so inclined, you can run and document your SQL pipelines in a notebook by using the above cell magics, and then documenting your work in markdown text cells (like this). 

Now on to the extra bits of the FDM pipeline. 

## FDMTable Helpers

### copy_table_to_dataset

You may have noticed the first stage of the table `.build()` process copying the source table into the FDM dataset. This doesn't happen automatically when you initialise a new `FDMTable`, so you'll need to add a copy to the new FDM dataset before you can use any of the below helper functions. It's quickly done by calling the table's `copy_table_to_dataset()` method, for example `test_table_1.copy_table_to_dataset()`.

### add_column

### drop_column

### rename_columns

### head

### quick_build

## FDMDataset Helpers

### create_dataset

## Other Helpers

These helpers are standalone functions that aren't attached to the table/dataset objects.

### check_dataset_exists / check_table_exists

### clear_dataset

## Example Workflow

Some examples of using helper functions more fluidly

In [51]:
test_table_3.copy_table_to_dataset()

The above didn't actually do anything, as the test_table_1 already exists in 

In [11]:
blah = CLIENT.get_table("CY_FDM_MASTER.person")

In [18]:
for field in blah.schema:
    print(field.name)
    print(field.field_type)

person_id
INTEGER
gender_concept_id
INTEGER
year_of_birth
INTEGER
month_of_birth
INTEGER
day_of_birth
INTEGER
birth_datetime
DATETIME
death_datetime
DATETIME
race_concept_id
INTEGER
ethnicity_concept_id
INTEGER
location_id
INTEGER
provider_id
INTEGER
care_site_id
INTEGER
person_source_value
STRING
gender_source_value
STRING
gender_source_concept_id
INTEGER
race_source_value
STRING
race_source_concept_id
INTEGER
ethnicity_source_value
STRING
ethnicity_source_concept_id
INTEGER


In [33]:
%%bigquery blah
SELECT ARRAY_AGG(DISTINCT gender_source_concept_id) AS result
FROM `yhcr-prd-phm-bia-core.CY_FDM_MASTER.person`

Query complete after 0.00s: 100%|██████████| 3/3 [00:00<00:00, 1132.88query/s]                        
Downloading: 100%|██████████| 1/1 [00:00<00:00,  1.13rows/s]


In [31]:
%%bigquery blah
SELECT COUNT(DISTINCT race_concept_id) AS n
FROM `yhcr-prd-phm-bia-core.CY_FDM_MASTER.person`

Query complete after 0.01s: 100%|██████████| 3/3 [00:00<00:00, 707.74query/s]                         
Downloading: 100%|██████████| 1/1 [00:01<00:00,  1.07s/rows]


In [32]:
blah

Unnamed: 0,n
0,33
