## An R User's Guide to Python, Part I

In this series, I will be demonstrating a few common tasks in both R and Python. The goal is to demonstrate just how similar these languages can be while showing some of the basics.

### Introduction

Coding has come a long way in my field in a short amount of time. As recently as five years ago, most analysis was done in SAS, and R was reserved for the most advanced analysts. Later, R began growing rapidly as the prospect of job automation took hold. Now, more Python users are coming into the fold, and the great "R vs. Python" debate has begun. [Click-bait hot takes](https://www.linkedin.com/posts/alex-freberg_python-is-better-than-r-activity-6752637240578600960-a36V/?trk=public_profile_like_view&originalSubdomain=ke) abound on the web, which is to be expected, I suppose. Data scientists have a reputation for having opinions (and the close relatives '...opinions' and 'opinions...').

![](https://cs410032000ad584321.blob.core.windows.net/mydata/dsPrimaDonna.png)

#### A summary of the R vs. Python Debate

There is no shortage of posts about this topic, so I will keep it short. The Python folks argue that [Python is the most popular coding language](https://pypl.github.io/PYPL.html). The sheer size of the community means that it will always be able to do more. R is fussy and esoteric, they say.

The R side points out that R packages, especially those used in modeling, are often maintained by the top experts in their respective fields ([survival analysis](https://cran.r-project.org/web/packages/survival/index.html), [Generalized Additive Models](https://www.taylorfrancis.com/books/mono/10.1201/9781315370279/generalized-additive-models-simon-wood), and [Epidemiological Network models](https://cran.r-project.org/web/packages/EpiModel/index.html), to name a few). Additionally, the [Journal of Statistical Software](https://www.jstatsoft.org/index) is an academic publication that introduces and explains software implementations of new statistical methods, and R is the most common choice.

#### Why the Debate Misses the Point

The R vs. Python debate is largely counterproductive because it places the focus on the language instead of the analyst. The biggest risk to your data science career is not "choosing the wrong language"; it is stubbornly choosing to work with only one language. Some of the oldest programming languages, COBOL and FORTRAN, are [highly desired in the financial sector](https://www.nytimes.com/2022/07/06/technology/cobol-jobs.html) precisely because they perform critical roles in the financial code base while being known by very few people. Speaking of FORTRAN, it is still used in both [R](https://www.r-bloggers.com/2014/04/fortran-and-r-speed-things-up/) and [Python](https://fortranwiki.org/fortran/show/Python). And let's not forget about [Julia](https://julialang.org/), a relatively new language that runs blazingly fast compared to both R and Python. Stick to one language at your own peril.

I am reminded of a story from my own life. Early in the COVID-19 pandemic, my wife and I got really into cycling. After consistent improvement early on, I hit a plateau and wanted to trade my mountain bike for a $3,000 road bike. The road bike was perfectly engineered for the task: 11 pounds lighter, with thinner tires to reduce friction. But I soon realized something: if I rode my bike more and stopped drinking beers after the weekend rides, I would lose that same 11 pounds and then some. The bike would have provided a mild, one-time boost, at best.

### Common Data Science Tasks in R and Python

I will let you in on a little secret: R and Python look pretty darn similar for many data science tasks. To show you just how similar they can be, this blog post will walk through how to do the following in both R and Python:

1. Install and load special modules and packages developed by the community
2. Work with the filesystem and system commands
3. Connect to SQL databases and extract data
4. Explore data: understand what kinds of data you have and how many records there are
5. Subset data: keep only what you need and get rid of everything else

Future posts will cover other topics, such as data aggregation and data visualization. For some of the database tasks, I will connect to a SQL Server instance I stood up on a personal cloud account. If you would like to stand up your own version, Microsoft has a tutorial [here](https://learn.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16&tabs=ssms#deploy-to-azure-sql-database).

#### Modules and Packages: Install, Load, Use

Really, there are three notable differences here:

1. Python calls them "Modules", and R calls them "Packages"
2. R packages are installed in an R session while Python modules are installed from the shell
3. As long as an R package has been installed, it can be called using the pkg::function() syntax. Python modules must always be loaded before use.

##### Python

In Python, pip is the base package installer, but other options (like [Anaconda](https://www.anaconda.com/)) exist. In either case, the syntax is similar:

In [1]:
pip install numpy

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Modules need to be explicitly loaded
import sqlalchemy as sa
import pandas as pd # Note the "pd" ref here
import os
import re
import glob

# You can also import specific subsets of a package
from urllib import request

Python packages often contain several sub-modules. For example, urllib has a handful of them, but in data science the "request" module is the most commonly used. In such cases, you may want to import only the specific part of the package that you need for the task at hand.
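As a rough illustration of the import styles described above, here is a minimal sketch that uses only standard-library modules. The choice of math, statistics, and urllib.parse is mine for the sake of the example; it is not taken from the cells above.

```python
# Three common import styles, shown with standard-library modules.
import math                   # whole module: names are reached as math.<name>
import statistics as st       # whole module under a shorter alias
from urllib import parse      # a single sub-module of a larger package

values = [2, 4, 4, 5]

root = math.sqrt(16)
avg = st.mean(values)
query = parse.urlencode({"q": "pandas", "page": 1})

print(root, avg, query)
```

The alias form is why pandas calls look like pd.DataFrame(...): the alias, not the full module name, is what exists in your session.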

In [3]:
# Calling DataFrame without the "pd." prefix raises a NameError
try:
    DataFrame({"mycol": range(0,9)}).head()
except NameError:
    print("Well, that didn't work...\n\n")


print("This works, though!")
pd.DataFrame({"mycol": range(0,9)})

Well, that didn't work...


This works, though!


|  | mycol |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |


#### Explore the filesystem and access system commands

An underrated skill in data science is folder searching and navigating the filesystem. For example, I frequently have lots of CSV files with identical formats for multiple days listed in the same folder. Luckily, both R and Python have simple tools for this.

##### Python

In [4]:
print(os.getcwd()) # Prints working directory path
print(os.listdir(".")) # List files in working directory
print(os.listdir("..")) # List files in parent of the working directory

print("\nUse glob module to look for files that match a certain pattern")
glob.glob("./*.ipynb") # Jupyter notebooks in the working directory

/home/jondowns/Documents/blog_markdowns/RUserPyGuide
['basics.py', 'basics.r', '.ipynb_checkpoints', 'TextAndPy.ipynb', 'R.ipynb', 'dsPrimaDonna.png']
['.git', '.gitignore', 'blog_markdowns.Rproj', 'gamesToWin', 'getPRISMData', 'isPalindrome', 'makeNbaDb', 'RUserPyGuide', 'addToWebsite.sql']

Use glob module to look for files that match a certain pattern


['./TextAndPy.ipynb', './R.ipynb']
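The scenario mentioned earlier (a folder of identically formatted daily CSV files) can be sketched as follows. This is only an illustration under assumptions: the file names, columns, and temporary folder are invented so the example can run anywhere, and pandas is assumed to be installed.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with two small, identically formatted CSV files.
# The file names and columns are made up for this example.
tmp_dir = tempfile.mkdtemp()
for day, sales in [("2023-01-01", 10), ("2023-01-02", 12)]:
    path = os.path.join(tmp_dir, f"sales_{day}.csv")
    pd.DataFrame({"day": [day], "sales": [sales]}).to_csv(path, index=False)

# Find every matching file, read each one, and stack the results.
csv_paths = sorted(glob.glob(os.path.join(tmp_dir, "sales_*.csv")))
daily = pd.concat([pd.read_csv(p) for p in csv_paths], ignore_index=True)

print(daily)
```

Sorting the paths keeps the row order predictable, since glob makes no promise about the order in which files come back.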

You can also interact with the operating system itself (Windows, Linux, macOS, etc.). I often use environment variables to store, say, a folder I commonly access for my job:

In [5]:
os.getenv("HOME")

'/home/jondowns'
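Two related patterns are worth a quick sketch: falling back to a default when an environment variable is missing, and running an external command. The variable name MY_DATA_DIR is hypothetical, and the command being run is just the Python interpreter itself so that the example behaves the same on any operating system.

```python
import os
import subprocess
import sys

# Fall back to the working directory when the variable is not set.
# "MY_DATA_DIR" is a hypothetical variable name used only for illustration.
data_dir = os.environ.get("MY_DATA_DIR", os.getcwd())
print(data_dir)

# Run an external command and capture what it prints.
# sys.executable is the running Python interpreter, so this works on
# Windows, Linux, and macOS alike.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Passing the command as a list, rather than one long string, avoids most quoting problems with paths that contain spaces.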

#### Work with Databases

Assuming you are using pandas for your Python work, database connections are eerily similar between the two languages. An important note: below, I connect to the database by writing most of the connection string directly into the code. You should not do that! It exposes your server address, user name, and password to anyone reading your code. I've stripped out the most sensitive parts.

Reading credentials from environment variables, prompting the user to enter them, and similar practices are much safer ways to handle connection strings.
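One way the environment-variable approach could look is sketched below. The variable names DB_USER and DB_PASSWORD, the helper function, and the example server name are all assumptions made for illustration; only the overall shape of the connection string follows the one used in this post.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(server, database):
    """Assemble a SQLAlchemy-style URL from environment variables.

    DB_USER and DB_PASSWORD are hypothetical variable names; use whatever
    your team has standardized on.
    """
    user = os.environ["DB_USER"]          # raises KeyError if missing
    password = os.environ["DB_PASSWORD"]
    # quote_plus escapes characters such as "@" or "!" that would
    # otherwise break the URL.
    return (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{server}/{database}?driver=ODBC+Driver+18+for+SQL+Server"
    )


# Demo only: in real use these would be set outside the script.
os.environ.setdefault("DB_USER", "example_user")
os.environ.setdefault("DB_PASSWORD", "p@ss!word")
conn_str = build_connection_string("example.database.windows.net", "adventureworks")

# Print only the part after the credentials.
print(conn_str.split("@")[-1])
```

The resulting string can be handed to sa.create_engine() exactly like a hard-coded one, but the secrets never appear in the script or in version control.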

##### Python

In [6]:
# Create connection string, connect to server
connection_string = "mssql+pyodbc://readAdvWorks:Plznohackme!123"\
    "@jondowns.database.windows.net,1433/adventureworks?"\
        "driver=ODBC+Driver+18+for+SQL+Server"
cnxn = sa.create_engine(connection_string)

# Use the SQL information schema to check out what data are available
pd.read_sql(    
    """
    SELECT TOP 5 *
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'Customer' """, cnxn)

|  | TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION | COLUMN_DEFAULT | IS_NULLABLE | DATA_TYPE | CHARACTER_MAXIMUM_LENGTH | CHARACTER_OCTET_LENGTH | ... | DATETIME_PRECISION | CHARACTER_SET_CATALOG | CHARACTER_SET_SCHEMA | CHARACTER_SET_NAME | COLLATION_CATALOG | COLLATION_SCHEMA | COLLATION_NAME | DOMAIN_CATALOG | DOMAIN_SCHEMA | DOMAIN_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | adventureworks | SalesLT | Customer | CustomerID | 1 |  | NO | int |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | adventureworks | SalesLT | Customer | NameStyle | 2 |  | NO | bit |  |  | ... |  |  |  |  |  |  |  | adventureworks |  |  |
| 2 | adventureworks | SalesLT | Customer | Title | 3 |  | YES | nvarchar | 8.0 | 16.0 | ... |  |  |  | UNICODE |  |  | SQL_Latin1_General_CP1_CI_AS |  |  |  |
| 3 | adventureworks | SalesLT | Customer | FirstName | 4 |  | NO | nvarchar | 50.0 | 100.0 | ... |  |  |  | UNICODE |  |  | SQL_Latin1_General_CP1_CI_AS | adventureworks |  |  |
| 4 | adventureworks | SalesLT | Customer | MiddleName | 5 |  | YES | nvarchar | 50.0 | 100.0 | ... |  |  |  | UNICODE |  |  | SQL_Latin1_General_CP1_CI_AS | adventureworks |  |  |
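If you do not have a SQL Server instance handy, the same pd.read_sql pattern can be practiced against an in-memory SQLite database. This is a stand-in sketch, not the setup used in this post: the table, its columns, and its rows are invented, and SQLite uses "?" placeholders and has no TOP keyword.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the SQL Server instance.
# The table and its rows are invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, last_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Gee"), (2, "Delaney"), (3, "Desai")],
)
conn.commit()

# Same read_sql call as before; the "?" placeholder keeps the search
# value out of the SQL text itself.
d_names = pd.read_sql(
    """
    SELECT customer_id, last_name
    FROM customer
    WHERE last_name LIKE ?
    ORDER BY customer_id
    """,
    conn,
    params=("D%",),
)
print(d_names)
conn.close()
```

Passing values through params instead of pasting them into the query string is a good habit on any database, since it sidesteps quoting mistakes and SQL injection.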


In [7]:
# Use our connection to query two tables from the database
cust = pd.read_sql("SELECT * FROM SalesLt.Customer", cnxn)
custAdd = pd.read_sql(
    """
    SELECT a.CustomerID
    , a.AddressType
    , b.AddressLine1
    , b.AddressLine2
    , b.City
    , b.StateProvince
    , b.CountryRegion
    , b.PostalCode
    , b.ModifiedDate
    FROM SalesLt.CustomerAddress AS a
    LEFT JOIN SalesLT.Address AS b ON a.AddressID = b.AddressID
    """, cnxn)


#### Initial Data Exploration

Now that the data have been pulled in, we may want to explore them a bit: check the number of rows and columns, see whether a column holds strings or numbers, and so on. Let's go through some examples.

##### Python

DataFrames are two-dimensional objects: they have rows (identified by an index) and columns. Think spreadsheet. Below, you'll see a printout of the columns in our dataframe and the index of its rows. The row index is expressed as a range of values (a RangeIndex).

The dtype attribute can be used to check the type of a column. Note that any column containing strings is given the "object" type.


In [8]:
# Print out the dimensions, column names, row index, and two column types
print("Number of rows and columns")
print(cust.shape)

print("Columns: ")
print(cust.columns)

print("Indices:")
print(cust.index)

print("\nData type of Customer ID:")
print(cust["CustomerID"].dtype)

print("\nData type of Last Name:")
print(cust["LastName"].dtype)

Number of rows and columns
(847, 15)
Columns: 
Index(['CustomerID', 'NameStyle', 'Title', 'FirstName', 'MiddleName',
       'LastName', 'Suffix', 'CompanyName', 'SalesPerson', 'EmailAddress',
       'Phone', 'PasswordHash', 'PasswordSalt', 'rowguid', 'ModifiedDate'],
      dtype='object')
Indices:
RangeIndex(start=0, stop=847, step=1)

Data type of Customer ID:
int64

Data type of Last Name:
object
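To check every column at once rather than one at a time, something like the following sketch may help. The toy dataframe is made up so the example runs without a database connection.

```python
import pandas as pd

# A tiny stand-in for the customer table; values are made up.
toy = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3],
        "LastName": ["Gee", "Delaney", "Desai"],
        "Balance": [10.5, 0.0, 3.25],
    }
)

n_rows, n_cols = toy.shape      # shape is a (rows, columns) tuple
print(n_rows, n_cols)

# dtypes (plural) reports the type of every column at once.
print(toy.dtypes)

# select_dtypes keeps only the columns of a given kind.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

R users may find it helpful to think of dtypes as roughly playing the role of str() or sapply(df, class).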


A frequency table for a column can be produced using value_counts(). This is often useful for categorical variables (say, an 'Age Group' category with 5 levels).

In [9]:
cust["SalesPerson"].value_counts()

adventure-works\shu0        151
adventure-works\jillian0    148
adventure-works\josé1       142
adventure-works\garrett1     78
adventure-works\jae0         78
adventure-works\pamela0      74
adventure-works\david8       73
adventure-works\linda3       71
adventure-works\michael9     32
Name: SalesPerson, dtype: int64
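value_counts() has a couple of options that come up often: proportions instead of counts, and keeping missing values. A small sketch with invented age groups:

```python
import pandas as pd

# Made-up age groups, including one missing value.
age_group = pd.Series(["18-29", "30-44", "18-29", None, "45-64", "18-29"])

counts = age_group.value_counts()                 # missing values are skipped
shares = age_group.value_counts(normalize=True)   # proportions instead of counts
with_na = age_group.value_counts(dropna=False)    # keep the missing value as a row

print(counts)
print(shares)
print(with_na)
```

By default the result is sorted from most to least frequent, which is why the busiest salesperson appears first in the output above.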

#### Subsetting Data

Okay, so you've explored the data a bit and now you are ready to start cutting it down. And I encourage you to cut it down! A lean, mean script that only pulls the data it needs is both easier to understand and faster to run.

Note: ideally, most subsetting is done in the SQL query, not in the code. If the data are not needed, they shouldn't be pulled in the first place! We are being inefficient for the sake of demonstration.

##### Python

When subsetting columns, it is often useful to store the column names, in the order you want them, in their own list. Then, that list can be referred to down the line as needed.

In [10]:
# Pick the columns you want to keep/reorder columns
keepCols = ["CustomerID", "FirstName", "LastName",
    "CompanyName", "SalesPerson", "Phone",
    "EmailAddress"]
cust2 = cust[keepCols]

Subsetting by row is also incredibly useful. There are two main options:

1. Subset by row position (0-846 in our case, since there are 847 rows). The iloc indexer is used.
2. Subset by logic: write a statement that evaluates to True/False for each row, and only the rows that pass are kept.

In [11]:
print("Subset to first row only")
print(cust2.iloc[0])

print("\nGet first 5 rows matching condition")
cust2[cust2["LastName"].str.startswith("D")].head()

Subset to first row only
CustomerID                                 1
FirstName                            Orlando
LastName                                 Gee
CompanyName                     A Bike Store
SalesPerson          adventure-works\pamela0
Phone                           245-555-0173
EmailAddress    orlando0@adventure-works.com
Name: 0, dtype: object

Get first 5 rows matching condition


|  | CustomerID | FirstName | LastName | CompanyName | SalesPerson | Phone | EmailAddress |
|---|---|---|---|---|---|---|---|
| 43 | 66 | Alexander | Deborde | Neighborhood Store | adventure-works\garrett1 | 394-555-0176 | alexander1@adventure-works.com |
| 47 | 75 | Aidan | Delaney | Paint Supply | adventure-works\jillian0 | 358-555-0188 | aidan0@adventure-works.com |
| 50 | 78 | Stefan | Delmarco | Preferred Bikes | adventure-works\linda3 | 819-555-0186 | stefan0@adventure-works.com |
| 54 | 84 | Della | Demott Jr | Rewarding Activities Company | adventure-works\garrett1 | 752-555-0185 | della0@adventure-works.com |
| 58 | 93 | Prashanth | Desai | Stationary Bikes and Stands | adventure-works\jillian0 | 138-555-0156 | prashanth0@adventure-works.com |
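Notice that the filtered rows keep their original labels (43, 47, 50, and so on) rather than being renumbered from 0. That is where the difference between position and label starts to matter. A minimal sketch with made-up names:

```python
import pandas as pd

# Toy data with made-up names.
people = pd.DataFrame(
    {"LastName": ["Gee", "Delaney", "Harris", "Desai"]}
)

# Filtering keeps the original row labels (1 and 3 here), not 0 and 1.
d_people = people[people["LastName"].str.startswith("D")]
print(d_people.index.tolist())

first_by_position = d_people.iloc[0]   # first row of the filtered frame
same_row_by_label = d_people.loc[1]    # the row whose label is 1

# reset_index(drop=True) renumbers the rows from 0 if the old labels
# are no longer needed.
renumbered = d_people.reset_index(drop=True)
print(renumbered.index.tolist())
```

In short, iloc counts rows from the top of whatever frame you hand it, while loc looks rows up by the label they carried from the start.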


Both columns and rows can be subset at once using the loc indexer of the dataframe. This works like matrix notation in R: the first argument is used to reference rows, and the second is used to reference columns. Below, I look for any rows where the last name starts with "D" and pull both the salesperson and the last name.

In [12]:
# Subset by both rows and columns: use DataFrame.loc[rows, columns]
cust2.loc[cust2["LastName"].str.startswith("D"),
          ["SalesPerson", "LastName"]].head()

|  | SalesPerson | LastName |
|---|---|---|
| 43 | adventure-works\garrett1 | Deborde |
| 47 | adventure-works\jillian0 | Delaney |
| 50 | adventure-works\linda3 | Delmarco |
| 54 | adventure-works\garrett1 | Demott Jr |
| 58 | adventure-works\jillian0 | Desai |
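Row conditions can also be combined inside loc. The sketch below uses invented names and salespeople; the pattern, not the data, is the point.

```python
import pandas as pd

# Toy data; names and salespeople are invented.
toy = pd.DataFrame(
    {
        "LastName": ["Gee", "Delaney", "Desai", "Dodd"],
        "SalesPerson": ["pamela0", "jillian0", "jillian0", "linda3"],
    }
)

# Name each condition, then combine them: & means "and", | means "or".
# (When writing conditions inline, wrap each one in parentheses.)
starts_with_d = toy["LastName"].str.startswith("D")
sold_by_jillian = toy["SalesPerson"] == "jillian0"

both = toy.loc[starts_with_d & sold_by_jillian, ["SalesPerson", "LastName"]]
either = toy.loc[starts_with_d | sold_by_jillian, ["LastName"]]

# isin() is a compact way to match against several values.
two_reps = toy.loc[toy["SalesPerson"].isin(["pamela0", "linda3"]), "LastName"]

print(both)
print(either)
print(two_reps.tolist())
```

R users will recognize & and | from vectorized logic in R; isin() plays roughly the role of %in%.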
