if (!("devtools" %in% rownames(installed.packages()))) install.packages("devtools", repos = "https://cloud.r-project.org")
library(devtools)
if (!("dbautojoinr" %in% rownames(installed.packages()))) install_github("N1h1l1sT/dbautojoinr", upgrade = FALSE)
library(dbautojoinr)
There's a working demo playing out a use-case, complete with the SQL Database and R code used to auto-join the SQL Database. You can either skip to the written demo below, or click this link to watch the demo on YouTube, or continue reading below for the how-to-use general tutorial
- On R 3.5.x the list of tbl() (usually main_joint_tables) seems to produce an error showing some SQL Code and does not return the SQL Tables as Dataframes. However, if you go through the tables using
explain()
ordbplyr_to_sql()
you'll see the SQL code actually works and returns the SQL Table (on SSMS for instance, or usingtbl(db$con, sql(main_joint_tables$DIM_Employee %>% dbplyr_to_sql()))
.) I'm unsure as to why this happens, but it's not an issue on R versions below 3.5.x
Before one can use the automatic join functions, initialisation has to occur so that:
- A connection to the SQL Database is established
- The "db_fields" Dataframe is returned so that the user can select which columns they want to include. The whole process revolves around the db_fields DF, which can be configured by the user in a user-friendly way using mouse clicks to select SQL fields.
library(dbautojoinr)
#If the file exists, it's read and the db_fields object is created by it.
#If it doesn't, then it's created with every field included, with potentially IDs and FKs excluded, depending on what you select
db_fields_path <- paste0(getwd(), "/db_fields.csv")
db_fields <- initialise_return_db_fields(csv_path = db_fields_path,
Driver = "{SQL Server};",
Database = "DB_Name",
Server = "123.456.78.9",
UID = NULL,
PWD = NULL,
Trusted_Connection = TRUE,
Port = 1433,
ForceCreate_csv = FALSE, #If TRUE then even if the db_fields exist, it will be deleted and overwriten by a newly created default db_fields
ExcludeIdentities = FALSE,
ExcludeForeignKeys = TRUE,
Update_DBEnv_DBFields = TRUE, #If TRUE and creating db_fields file, you can have an internal main db_fields accessible via db$db_fields usually acting as the main db_fields. Default is FALSE as having local db_fields variables is the default behaviour.
ExcludeAuditingFields = FALSE, #If TRUE and creating db_fields file, any SQL Columns ending with "_OrigEntryOn", "_OrigEntryBy", "_EntryOn", "_EntryBy", "_CompName", "_Remote" or "_Username" will have INCLUDE == FALSE by default
ExcludeSYSDIAGRAMS = TRUE, #If TRUE and creating db_fields file, any SQL Columns on the table "sysdiagrams" will have INCLUDE == FALSE by default
RegexToSelectTables = "^(DIM_|FACT_|TBL_)"#, #A regex that will get tables which you want to be deselected by default. As it is it matches all SQL Tables whose name begins with DIM_, FACT_, or TBL_
#Table1$T1_ForeignKey1 == Table2$T2_ID, #Please notice that Foreign Keys are always
#Table2$T2_ForeignKey1 == Table3$T3_ID, #on the left hand side, whilst IDs are always
#Table4$T4_ForeignKey1 == Table2$T2_ID, #on the right hand side.
#Table4$T4_ForeignKey2 == Table1$T1_ID
#If you haven't set the SQL Relationships on the Database, you can impose them here by
#uncommenting the lines above and replacing the text with the actual tables and columns
If the db_fields .csv file doesn't exist and it's created or if you set ForceCreate_csv = TRUE
, then you may also initialise the db_fields with IDs and FKs selected or excluded using ExcludeIdentities = FALSE
and ExcludeForeignKeys = TRUE
.
db_fields <- edit_db_fields(db_fields, Update_DBEnv_DBFields = TRUE) #What the user selected is now saved on the db_fields variable (AND on db$db_fields because Update_DBEnv_DBFields == TRUE).
write_db_fields_csv(db_fields, db_fields_path) #For any run on the current session, the user preferences are assumed, but we need to save the file for future runs.
Depending on what you want to achieve, there are different levels of joining that you might want to do.
- Level 1 (main_joint_tables): All tables with foreign keys are joined with all the tables they have a relationship with (unless said tables have all their fields as INCLUDE == FALSE by the user)
- Level 2 (joint_table_Without_extended_joins): Join the tables of main_joint_tables into 1 complete table. This will bring the row level of the 1-Joint-Table to the row level of the 1st table of the LHS declared on db_forced_rel
- Level 3 (extended_main_joint_tables): Some of the Main joint tables might need to be joined by other Main joint tables because a table may hold information that has different meaning depending on which table it's joint with (see explanation in more detail below at the comments on 'Getting the extended_main_joint_tables' section)
- Level 4 (joint_table_With_extended_joins) Join the tables of extended_main_joint_tables into 1 complete table. This will bring the row level of the 1-Joint-Table to the row level of the 1st table of the LHS declared on db_forced_rel
main_joint_tables <-
create_main_joint_tables(db_fields = db_fields,
db_forced_rel = NULL, #We don't want to Force any relationships to create a 1-JointTable, so db_forced_rel is NULL
con = db$con,
DeselectKeysIfIncludeFalse = TRUE
)
New Arguments:
- db_fields: A DF with columns: "Include, KeyType, Table, Column, Type, RelationshipWithTable, RelationshipWithColumn, Transformation, Comment" about the User Selected fields and Relationships
- db_forced_rel: A Named String Vector. The vector names MUST point to the main table to be used for the 1-Joint-Table as its LHS
- con: A dbConnect {DBI} connection object to a SQL Database
- DeselectKeysIfIncludeFalse: A Boolean. Must be FALSE if we need to continue to 1-Joint-Table, otherwise needed Identity and Foreign keys might be missing
- Verbose: A Boolean. Verbose = TRUE will output the consecutive joins as they happen
- get_sql_query: A Boolean. get_sql_query = TRUE will create/edit the db$sql_main_joint_tables that output the SQL Code for the tables
From hereinafter we need to have configured the db_forced_rel variable with the forced relationships that we want to impose in order to join the Main tables into 1 table
#Assumptions: Database is in Canonical Form, No two columns have the same name (Usual good practice in Databases)
db_forced_rel <-
c( #The LHS of the Relationships MUST be Columns from the main table to be used for the 1-Joint-Table
Hours_SiteID = "Site_ID",
Hours_EmployeeID = "Employee_ID"
)
joint_table_Without_extended_joins <-
create_joint_table(db_fields = db_fields,
db_forced_rel = db_forced_rel)
From hereinafter we need to have configured the db_ColumnsOldNamesToNewNames variable with the renaming schema so that when the same table is joined with different Main tables, the column names change to reflect the different meaning
#DIM_Site will be joined with DIM_Employee, but also with FACT_Hours.
#An employee will work on a certain site each day, which might be different from day to day,
#but the original site he is assigned to will always remain the same - his Main Site.
#DIM_Site holds the Site information, so when it's joined with DIM_Employee, its meaning is the employees Main Site
#However, when it's joined with FACT_Hours, its meaning is the site in which the employee has worked on at that particular day.
#If we are to create a 1-Joint-Table, then the SQL columns cannot have the same name. So we're renaming the columns that
#that come from the DIM_Site table and joined with DIM_Employee into MainSite_[SomeName] instead of Site_[SomeName]
#On the final table (1-Joint-Table) MainSite_ID column will refer to the Site that the employee is assigned to, and Site_ID will refer to the one which he worked that particular day
db_ColumnsOldNamesToNewNames <-
list(
DIM_Employee = c(
c("Site_", "MainSite_")
)
)
extended_main_joint_tables <-
create_extended_main_joint_tables(db_fields = db_fields,
db_forced_rel = db_forced_rel,
db_ColumnsOldNamesToNewNames = db_ColumnsOldNamesToNewNames
)
New Arguments:
- db_ColumnsOldNamesToNewNames: A named List. Names correspond to the Table names, and the vectors inside will be used to renamed SQL Columns starting with
db_ColumnsOldNamesToNewNames[i][j]
todb_ColumnsOldNamesToNewNames[i][j+1]
with j going from 1 to length ofdb_ColumnsOldNamesToNewNames[i]
by 2
joint_table_With_extended_joins <-
create_extended_joint_table(db_fields = db_fields,
db_forced_rel = db_forced_rel,
db_ColumnsOldNamesToNewNames = db_ColumnsOldNamesToNewNames
)
Copy the SQL code under the "SQL folder" of the package into SSMS and execute the Query. First execute Database Creation.sql and then execute Data Population.sql. Now you have a "dbautojoinr" SQL Database on your SQL Server with 4 tables (DIM_Employee, DIM_Region, DIM_Site, and FACT_Hours) and with data populated on those tables. The SQL Relationships also already exist on your SQL Database, so you won't need to explicitly impose them on the Initialisation code.
if (!("devtools" %in% rownames(installed.packages()))) install.packages("devtools", repos = "https://cloud.r-project.org")
library(devtools)
if (!("dbautojoinr" %in% rownames(installed.packages()))) install_github("N1h1l1sT/dbautojoinr", upgrade = FALSE)
library(dbautojoinr)
db_fields_path <- paste0(getwd(), "/db_fields.csv")
db_fields <- initialise_return_db_fields(csv_path = db_fields_path,
ForceCreate_csv = FALSE,
ExcludeIdentities = FALSE,
ExcludeForeignKeys = TRUE,
Driver = "{SQL Server};",
Database = "dbautojoinr",
Server = "Put your own Server IP/Name here",
UID = NULL,
PWD = NULL,
Trusted_Connection = TRUE,
Port = 1433
)
show_ER_diagramme(db$dm_f) #Shows the SQL Database ER Diagramme
db_fields <- edit_db_fields(db_fields)
write_db_fields_csv(db_fields, db_fields_path)
- db_forced_rel is only needed if we don't just want the Main Tables joined, but we want to end up with just 1 table with everything else joined.
Otherwise this parameter can be null
db_forced_rel <- NULL
.
db_forced_rel <-
c(
Hours_SiteID = "Site_ID",
Hours_EmployeeID = "Employee_ID"
)
- db_ColumnsOldNamesToNewNames is only needed if a certain table is to be joined to more than 1 table, as it is the case with DIM_Site which will be joined with DIM_Employee & FACT_Hours.
db_ColumnsOldNamesToNewNames <-
list(
DIM_Employee = c(
c("Site_", "MainSite_")
)
)
main_joint_tables <-
create_main_joint_tables(db_fields,
db_forced_rel,
db$con,
DeselectKeysIfIncludeFalse = TRUE, #No need to make any other joins, so let's only get what the User selected
Verbose = TRUE,
get_sql_query = FALSE
)
extended_main_joint_tables <-
create_main_joint_tables(db_fields,
db_forced_rel,
db$con,
DeselectKeysIfIncludeFalse = FALSE,
Verbose = TRUE,
get_sql_query = FALSE
) %>%
zinternal_CreateExtendedMainJointTables(db_fields,
db_forced_rel,
db_ColumnsOldNamesToNewNames,
db$con,
DeselectKeysIfIncludeFalse = TRUE,
Verbose = TRUE,
get_sql_query = FALSE
)
You've probably noticed that instead of using create_extended_main_joint_tables
to get the result, we're now using 2 different functions that each performs 1 step.
The 1st one (create_main_joint_tables
) will retrieve the Main Tables, whilst the 2nd (zinternal_CreateExtendedMainJointTables
) does the extended joins (in our case, it joins main_joint_tables[[DIM_Employee]].[MainSite_ID]
with [DIM_Site].[Site_ID]
)
Now, you might want to proceed with this long way if you want to also make custom transformations to some table before the next joining level occurs.
Be careful with DeselectKeysIfIncludeFalse which must always be FALSE prior to the last level and always TRUE at the last one.
joint_table_Without_extended_joins <-
create_main_joint_tables(db_fields,
db_forced_rel,
db$con,
DeselectKeysIfIncludeFalse = FALSE,
Verbose = TRUE,
get_sql_query = FALSE
) %>%
zinternal_CreateOneJointTable(db_fields,
db_forced_rel,
db$con,
Verbose = TRUE,
get_sql_query = FALSE
)
joint_table_With_extended_joins <-
create_main_joint_tables(db_fields,
db_forced_rel,
db$con,
DeselectKeysIfIncludeFalse = FALSE,
Verbose = TRUE,
get_sql_query = FALSE
) %>%
zinternal_CreateExtendedMainJointTables(db_fields,
db_forced_rel,
db_ColumnsOldNamesToNewNames,
db$con,
DeselectKeysIfIncludeFalse = FALSE,
Verbose = TRUE,
get_sql_query = FALSE
) %>%
zinternal_CreateOneJointTable(db_fields,
db_forced_rel,
db$con,
Verbose = TRUE,
get_sql_query = FALSE
)
sapply(names(main_joint_tables), function(x) NCOL(main_joint_tables[[x]]))
# Number of Columns on the tables of main_joint_tables
# DIM_Site DIM_Employee FACT_Hours
# 2 4 6
sapply(names(extended_main_joint_tables), function(x) NCOL(extended_main_joint_tables[[x]]))
# Number of Columns on the tables of extended_main_joint_tables
# DIM_Site DIM_Employee FACT_Hours
# 2 5 6
print(c(joint_table = NCOL(joint_table_Without_extended_joins), joint_table_extended = NCOL(joint_table_With_extended_joins)))
# Number of Columns on the 2 Fully Joint Tables, withand without extended joins
# joint_table joint_table_extended
# 7 9