#### Install Packages

##### R

As previously mentioned, R packages are installed from within a session. The first time you install a package, you may be asked to select a "mirror"; just use the cloud mirror. Any installed package can then be loaded with the library command. Note that library accepts the package name without quotes, while install.packages requires them.

In [3]:
# Install packages in R using install.packages()
# install.packages("jsonlite")

# Load packages with the "library" command
library(jsonlite)
library(tidyverse)
library(odbc)
library(DBI)
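
Since a package only needs to be installed once, a common pattern is to install it only when it is missing. The following is a minimal sketch of that idea, not the only way to do it. The helper name missing_packages is made up for illustration, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper (not part of any package): return the entries of `pkgs`
# that cannot be loaded, i.e. that are not installed yet.
# Note: requireNamespace() loads a package's namespace as a side effect.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("jsonlite", "tidyverse", "odbc", "DBI"))
print(to_install)

# Install only what is missing; uncomment to actually run the installation.
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = "https://cloud.r-project.org")
# }
```

Passing repos explicitly selects the cloud mirror up front, which avoids the interactive mirror prompt.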

Make note of the "conflicts" section; we will address that shortly.

The tidyverse is a bit of a special case: it is a set of packages designed to modernize base R, both for performance and to be more intuitive. So when you load the tidyverse package, you're actually loading a universe of packages (hence the name).

After a package is loaded, any function from that package can be called. R is structured a little differently than Python here. In the Python example above, we had to reference the pandas package (pd) to call pd.DataFrame. In R, loading the library is sufficient:

In [None]:
# Use mutate from one of the packages in the tidyverse
mutate(data.frame(nums = 1:10), new = cumsum(nums))

If you DID want to be explicit, you could use the package::function() notation. It is also useful when multiple libraries use the same function name. By default, when two or more functions share a name, R chooses the one from the package that was loaded last. Many times, the package author INTENDED to overwrite a base R function, so in practice function masking is not a big problem. Python handles the same problem by forcing you to always be explicit. I must admit, I like the Python philosophy more here.

In [4]:
# As long as an R package is installed, you can access its functions
# like this:
dplyr::mutate(data.frame(nums = 1:5), new = cumsum(nums))

nums,new
<int>,<int>
1,1
2,3
3,6
4,10
5,15
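
Masking can be demonstrated without installing anything. The sketch below is an illustration only: instead of loading a conflicting package, it masks base R's mean with a toy function defined in the global environment, then uses the package::function() notation to reach the original.

```r
# Define a function in the global environment that shares a name with
# base::mean. The global environment is searched first, so this masks it.
mean <- function(x, ...) "masked version"

masked_result <- mean(c(1, 2, 3))          # Calls the toy definition
explicit_result <- base::mean(c(1, 2, 3))  # Calls the base R function explicitly

print(masked_result)
print(explicit_result)

# Remove the toy definition so that mean() resolves to base::mean again
rm(mean)
restored_result <- mean(c(1, 2, 3))
print(restored_result)
```

The same lookup order applies to packages: the one attached most recently is searched first, and the explicit notation bypasses the search entirely.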


#### Explore the Filesystem and Access System Commands

##### R

In R, the syntax is remarkably similar:

In [9]:
# Note that, if you do not assign an object, it is printed (not saved)
getwd() # Print working directory
list.files(".") # List files in working directory
list.files("..") # List files in parent of working directory

# Look for specific files
list.files(".", pattern = "\\.ipynb$") # pattern is a regex (not a glob): names ending in .ipynb

# Access environment variables
Sys.getenv("HOME")
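
The pattern argument of list.files is interpreted as a regular expression, not as a shell-style wildcard. The sketch below uses a throwaway directory with made-up file names to show an anchored regular expression, and glob2rx (from the utils package) as one way to translate a wildcard into a regular expression.

```r
# Create a scratch directory with a few made-up files
demo_dir <- file.path(tempdir(), "listfiles-demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir, c("analysis.ipynb", "notes.txt", "ipynb_backup.csv"))))

# Anchored regex: a literal dot followed by "ipynb" at the end of the name
by_regex <- list.files(demo_dir, pattern = "\\.ipynb$")

# Same search, written as a wildcard and converted to a regex
by_glob <- list.files(demo_dir, pattern = utils::glob2rx("*.ipynb"))

print(by_regex)
print(by_glob)

# Clean up the scratch directory
unlink(demo_dir, recursive = TRUE)
```

Anchoring with $ matters here: without it, a name such as ipynb_backup.csv could match a looser pattern.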

#### Work with Databases

##### R

Again, the solutions are quite similar. In fact, they could be even MORE similar: I could write the connection as a single string, as we did in Python. However, common practice in R is to use the arguments provided by the dbConnect function.

In [9]:
cnxn <- DBI::dbConnect(
    drv = odbc::odbc(),
    Driver = "{ODBC Driver 18 for SQL Server}",
    server = "jondowns.database.windows.net,1433",
    database = "adventureworks",
    uid = "readAdvWorks",
    pwd = "Plznohackme!123") # Reminder: don't do this

# In addition to the SQL information schema tables, DBI has some convenience functions
dbListFields(cnxn, "Customer")
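
One way to avoid a hard-coded password is to read credentials from environment variables. The following is a sketch under stated assumptions: the helper get_setting and the variable names ADVWORKS_UID and ADVWORKS_PWD are made up for illustration, and the connection call is commented out because it needs a reachable database.

```r
# Hypothetical helper: read a required setting from an environment variable
# and fail loudly if it is missing or empty.
get_setting <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set", call. = FALSE)
  }
  value
}

# ADVWORKS_UID and ADVWORKS_PWD are assumed names; set them outside the
# script (for example in ~/.Renviron) so they never appear in the notebook.
# cnxn <- DBI::dbConnect(
#     drv = odbc::odbc(),
#     Driver = "{ODBC Driver 18 for SQL Server}",
#     server = "jondowns.database.windows.net,1433",
#     database = "adventureworks",
#     uid = get_setting("ADVWORKS_UID"),
#     pwd = get_setting("ADVWORKS_PWD"))
```

Failing early on a missing variable tends to produce a clearer error than letting the driver reject an empty password.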

We will close by running the same two SQL queries as in the Python section. The main difference between Python and R here is that Python takes the connection as the SECOND argument, and R takes it as the first. Radical, I know.

In [11]:
# Query database
cust <- DBI::dbGetQuery(cnxn, "SELECT * FROM SalesLT.Customer")
custAdd <- DBI::dbGetQuery(
    cnxn,
    "SELECT a.CustomerID
    , a.AddressType
    , b.AddressLine1
    , b.AddressLine2
    , b.City
    , b.StateProvince
    , b.CountryRegion
    , b.PostalCode
    , b.ModifiedDate
    FROM SalesLT.CustomerAddress AS a
    LEFT JOIN SalesLT.Address AS b ON a.AddressID = b.AddressID")

#### Initial Data Exploration

##### R

It is time to note another place where R and Python differ slightly in syntax. R tends to prefer functions, while Python prefers attributes. In R, any object can be passed to a function, and the function decides whether it has the proper method(s) to handle that object. In Python, the preference is to store things as attributes of the object itself. I think this reflects a difference in philosophy: Python is a general-purpose language, so it prioritizes predictability by forcing users to be explicit. Since R has a narrower focus, it places a higher priority on general-purpose functions that can be used across a variety of objects. For example, most functions that work on data frames also work on matrices.

Above, when we listed the columns, we used the DataFrame.columns attribute. In R, we use the colnames function that can be called on basically anything. It's the function's job to decide whether it can actually work on that object.
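
R's S3 system is one common mechanism by which a function "decides" how to handle an object. The sketch below first calls colnames on both a data frame and a matrix, then defines a toy generic to show the dispatch itself. The generic describe and its methods are made up for this illustration and are not part of base R.

```r
# colnames() is an ordinary function that works on more than one kind of object
df <- data.frame(a = 1:2, b = 3:4)
m <- as.matrix(df)
colnames(df)
colnames(m)

# A toy S3 generic: UseMethod() picks the method matching the class of `x`
describe <- function(x, ...) UseMethod("describe")
describe.data.frame <- function(x, ...) paste("data frame with", ncol(x), "columns")
describe.matrix <- function(x, ...) paste("matrix with", ncol(x), "columns")
# Fallback used when no class-specific method exists
describe.default <- function(x, ...) paste("object of class", class(x)[1])

describe(df)
describe(m)
describe("hello")
```

The caller always writes describe(x); supporting a new kind of object means adding a method, not changing the call.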

All that being said, it still seems pretty intuitive that this R code is doing the same thing as the Python code above:

In [15]:
# Use dim to get both dimensions, or nrow and ncol for one at a time
dim(cust)  # Rows x columns
nrow(cust) # Rows only
ncol(cust) # Columns only

# Print out row and column names
colnames(cust) # Most common
rownames(cust)[1:5] # Less common

print("Data type of Customer ID:")
class(cust$CustomerID)

print("Data type of Last Name:")
class(cust$LastName)


[1] "Data type of Customer ID:"


[1] "Data type of Last Name:"
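
To check the type of every column at once rather than one at a time, the same class function can be applied across the data frame. This is a sketch on a toy stand-in for the query result; the data frame toy_cust and its values are made up, so it runs without a database connection.

```r
# Toy stand-in for the query result; the values are made up
toy_cust <- data.frame(
  CustomerID = 1:3,
  LastName = c("Smith", "Jones", "Lee"),
  stringsAsFactors = FALSE
)

dim(toy_cust) # Rows x columns

# Apply class() to each column; keep the first class if a column has several
col_classes <- vapply(toy_cust, function(col) class(col)[1], character(1))
print(col_classes)

# str() gives a compact overview of dimensions, types, and leading values
str(toy_cust)
```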


Finally, a frequency table is available with the very aptly named "table" function.

In [16]:
table(cust$SalesPerson)


  adventure-works\\david8 adventure-works\\garrett1     adventure-works\\jae0 
                       73                        78                        78 
adventure-works\\jillian0    adventure-works\\josé1   adventure-works\\linda3 
                      148                       142                        71 
adventure-works\\michael9  adventure-works\\pamela0     adventure-works\\shu0 
                       32                        74                       151 
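
A frequency table is usually easier to read when sorted or expressed as proportions. The sketch below uses a small made-up vector instead of the query result; sort and prop.table are base R functions that accept the output of table.

```r
# Made-up values standing in for a column such as cust$SalesPerson
sales_person <- c("ann", "bob", "ann", "cy", "ann", "bob")

counts <- table(sales_person)                    # Frequency of each value
sorted_counts <- sort(counts, decreasing = TRUE) # Most frequent first
shares <- prop.table(counts)                     # Counts as proportions

print(sorted_counts)
print(round(shares, 2))
```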