# Python and SQL: Better Together

#### Python and SQL are complementary - we should focus on how best to integrate them rather than try to replace them!
**By Alex Monahan**   
**2021-08-15** (Yes, this is the one date format to rule them all)  
    [LinkedIn](https://www.linkedin.com/in/alex-monahan-64814292/)   
    [Twitter @\__alexmonahan\__](https://twitter.com/__alexmonahan__?lang=en)   
    The views I express are my own and not my employer's.  


There has been some spirited debate over SQL on Tech Twitter in the last few weeks. I am a huge fan of open discussions like this, and I consistently learn something when reading multiple viewpoints. It's my hope to contribute to a friendly and productive dialog with fellow data folks! Let's focus on a few positives from each perspective.

[Here, Jamie Brandon makes some excellent points about SQL's weaknesses.](https://scattered-thoughts.net/writing/against-sql) The points I agree with the most are related to SQL's incompressibility. I encounter this most often when trying to run queries on a dynamic list of columns, which is difficult to do purely in SQL. I also wish that it were easier to modularize code into functions in a more powerful way, although there are some ways of doing so. Imagine a repository of SQL helper functions you could import! If only it were possible. If it is, please let me know on Twitter!
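To make the dynamic-column pain point concrete, here is a minimal sketch of letting Python write the SQL when the column list is only known at runtime. The `trips` table, its columns, and the `quote_identifier` and `build_sum_query` helpers are hypothetical names invented for this illustration, and the standard library's sqlite3 module stands in for whichever database you actually use.

```python
import sqlite3

def quote_identifier(name):
    """Wrap a table or column name in double quotes, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'

def build_sum_query(table, group_col, value_cols):
    """Build a GROUP BY query that sums an arbitrary list of columns."""
    sums = ", ".join(
        f"sum({quote_identifier(c)}) as {quote_identifier('total_' + c)}"
        for c in value_cols
    )
    return (
        f"select {quote_identifier(group_col)}, {sums} "
        f"from {quote_identifier(table)} "
        f"group by {quote_identifier(group_col)} "
        f"order by {quote_identifier(group_col)}"
    )

conn = sqlite3.connect(":memory:")
conn.execute("create table trips (vendor text, fare real, tip real, tolls real)")
conn.executemany(
    "insert into trips values (?, ?, ?, ?)",
    [("A", 10.0, 2.0, 0.0), ("A", 5.0, 1.0, 1.5), ("B", 8.0, 0.0, 0.0)],
)

# The column list is plain Python data, so it can come from anywhere at runtime
numeric_cols = ["fare", "tip", "tolls"]
query = build_sum_query("trips", "vendor", numeric_cols)
rows = conn.execute(query).fetchall()
conn.close()

print(query)
print(rows)
```

Quoting the identifiers keeps odd column names from breaking the generated statement. Values, as opposed to identifiers, should still go through `?` placeholders as in the insert above.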

[Pedram Navid responded to those points in an indirect way that I also found to be very impactful and thought-provoking.](https://pedram.substack.com/p/for-sql) Pedram focused on the organizational impacts of choosing to move away from SQL. I agree with Pedram that SQL is a tremendous data democratization tool and that it is important that SQL folks and other programming language folks work as a team. He also makes the case that SQL is often good enough, and I would go a step further and say there are many cases where SQL is a very expressive way to request and manipulate data! The cases where SQL is most useful are very accessible and can really empower people for whom data is just a portion of their job. **SQL is the easiest-to-learn superpower as a data person!**

Pedram also cites dbt as a powerful way to address some of SQL's rough edges. I would like to take that line of thinking a step further here: **How can we mix and match Python and SQL to get the best of both?**

## Why use SQL?

Before mixing and matching, why would we want to use SQL in the first place? While I agree with Jamie that it is imperfect, it has many redeeming qualities! Toss any others I've forgotten on Twitter!


1. SQL is very widely used  
    [It ranks 3rd in the Stack Overflow survey](https://insights.stackoverflow.com/survey/2020#most-popular-technologies), and [its first commercial implementation was released all the way back in 1979](https://docs.oracle.com/cd/B19306_01/server.102/b14200/intro001.htm#:~:text=In%201979%2C%20Relational%20Software%2C%20Inc,as%20the%20standard%20RDBMS%20language.)!  
    Excel and nearly every Business Intelligence tool provide a SQL interface. Plus Python's standard library includes SQLite, which is in the [top 5 most widely deployed pieces of software in existence!](https://www.sqlite.org/mostdeployed.html)  
    I also agree with Erik Bernhardsson's blog post [I don't want to learn your garbage query language](https://erikbern.com/2018/08/30/i-dont-want-to-learn-your-garbage-query-language.html). Let's align on SQL so we only have to learn or develop things once!
    

2. The use of SQL is expanding  
    The growth of cloud data warehouses is a huge indication of the power of SQL. [Stream processing is adding SQL support](https://www.confluent.io/product/ksql/), and the [SQL language itself continues to grow](https://modern-sql.com/) in power and flexibility.   
    
    
3. SQL is easy to get started with  
    While I don't necessarily have sources to cite here, I have led multiple SQL training courses that can take domain experts from 0 to introductory SQL in 8 hours. While I have not done the same for other languages, I feel like it would be difficult to be productive that quickly!  
    
    
4. However, it's hard to outgrow the need for SQL  
    Even the majority of data scientists use SQL "Sometimes" or more - placing it as the number 2 language in [Anaconda's State of Data Science 2021](https://www.anaconda.com/state-of-data-science-2021).  
    **This also makes SQL great for your career!**  
    
    
5. SQL's declarative nature removes the need to understand database internals  
    The deliberate separation between the user's request and the specific algorithms used by the database is an excellent abstraction layer 99%+ of the time. You can write years of productive SQL queries before learning the difference between a hash join and a sort-merge join! And even then, the database will usually choose correctly on your behalf.
    
    
6. Chances are good you need to use SQL to pull your data initially anyway  
    If you already need to know some SQL to access your company's valuable info, why not maximize your effectiveness with it?
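The declarative point in item 5 can be sketched in a few lines: the query text stays the same while the database changes how it answers. This is only an illustration, assuming a hypothetical `orders` table and index name, and it uses SQLite because it ships with Python. Other databases expose their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer, customer_id integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(i, i % 100) for i in range(10_000)],
)

query = "select count(*) from orders where customer_id = 42"

def plan_for(sql):
    """Return SQLite's human-readable plan steps for a query."""
    # The last column of each EXPLAIN QUERY PLAN row describes one step
    return [row[-1] for row in conn.execute("explain query plan " + sql)]

result_before = conn.execute(query).fetchone()[0]
plan_before = plan_for(query)

# Adding an index changes how the database answers, not what we ask
conn.execute("create index idx_orders_customer on orders (customer_id)")

result_after = conn.execute(query).fetchone()[0]
plan_after = plan_for(query)
conn.close()

print(result_before, plan_before)
print(result_after, plan_after)
```

The answer is identical both times. Only the plan changes, from a full table scan to an index search, and the query never had to mention either.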

## Tools to use when combining SQL and Python

#### Asterisks indicate libraries I have not used yet, but that I am excited to try!

* [DuckDB](https://duckdb.org/)
    * Think of this as SQLite for analytics! I list the many pros of DuckDB below! It is my favorite way to mix and match SQL and Pandas.
* [SQLite](https://www.sqlite.org/index.html)
    * SQLite is an easy way to process larger than memory data in Python. It comes bundled in the standard library.
    * There are a few drawbacks that DuckDB addresses:
        * Slow performance for analytics (row-based instead of columnar like Pandas and DuckDB)
        * Requires data to be inserted into SQLite before executing a query on it
        * Overly flexible data types (This is debatable, but it makes it harder to interact with Pandas in my experience)
* [SQLAlchemy](https://www.sqlalchemy.org/)
    * SQLAlchemy can connect to tons of different databases. It's a huge advantage for the Python ecosystem.
    * When used in combination with pandas.read_sql, SQLAlchemy can pull from a SQL DB and load a Pandas DataFrame.
    * SQLAlchemy is traditionally known for its ORM (Object Relational Mapper) capabilities, which allow you to avoid SQL. However, it also has powerful features for SQL fans, like safe parameter escaping.
    * There are many other tools for querying SQL DBs
        * [pyodbc](https://github.com/mkleehammer/pyodbc) - Uses ODBC drivers for querying which is flexible, but can be slow
        * [turbodbc](https://github.com/blue-yonder/turbodbc)* - Uses Apache Arrow to speed up ODBC connections
        * A variety of DB-specific native connectors (Ex: [cx_Oracle](https://github.com/oracle/python-cx_Oracle), [psycopg2](https://github.com/psycopg/psycopg2)) - These are fast, but not universal
* [ipython-sql](https://github.com/catherinedevlin/ipython-sql)
    * Convert a Jupyter cell into a SQL language cell! This enables syntax highlighting for SQL in Jupyter.
    * I find this much nicer than working with SQL in multi-line strings, and you get syntax highlighting without having separate SQL files.
    * This builds on top of SQLAlchemy
    * See an example below!
* [PostgreSQL's PL/Python](https://www.postgresql.org/docs/10/plpython.html)*
    * You can write functions and procedures on Postgres using Python!
    * I think this addresses some of the concerns highlighted by Jamie Brandon, but as popular as Postgres is, it's not a standard feature of all DBs
    * This is a more DB-centric approach. I've found DB stored procedures to be more challenging to use with version control, but they certainly have value
* [Dask-sql](https://github.com/dask-contrib/dask-sql)*
    * Process SQL statements on a Dask cluster (on your local machine or 1000's of servers!)
    * The first bullet in the readme advertises easy Python and SQL interoperability, which sounds great!
    * This comes with some extra complexity over DuckDB - you'll need Java for the Apache Calcite query parser and a few lines of code to set up a Dask cluster. 
    * I'd like to benchmark this a bit to see how well it performs in my single machine use case
* [dbt](https://www.getdbt.com/)*
    * dbt is a way to build SQL pipelines and execute jinja-templated SQL 
    * The templating seems like a powerful way to avoid some of SQL's pain points, like a rigid set of columns and the requirement that columns be specified in both the SELECT and GROUP BY clauses.
    * dbt also works well with version control systems
    * I am very excited to try it out, but I am not a cloud data warehouse user so I am slightly outside their target audience
    * There are [adapters](https://docs.getdbt.com/docs/available-adapters) for both [DuckDB](https://github.com/jwills/dbt-duckdb) and [ClickHouse](https://github.com/silentsokolov/dbt-clickhouse) (a fast, massively popular open source columnar data warehouse), so this should perform well. These are both community maintained, so I'll just need to test them out a bit!  
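Two of the SQLite drawbacks listed above are easy to see for yourself: data has to be inserted before it can be queried, and column types are more of a suggestion than a rule. Here is a minimal sketch using only the standard library. The `readings` table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table readings (sensor_id integer, reading integer)")

# The data must be loaded into SQLite before any query can see it
conn.executemany(
    "insert into readings values (?, ?)",
    [(1, 10), (2, "20"), (3, "not a number")],
)

# typeof() reports the storage class SQLite actually used for each value
rows = conn.execute(
    "select sensor_id, reading, typeof(reading) from readings order by sensor_id"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The text `'20'` is quietly converted to an integer, while `'not a number'` is stored as text in a column declared as integer. A mixed column like this is the kind of thing that can surprise you when the result lands in a Pandas DataFrame.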

#### I plan to explore more of these in upcoming posts!
    

## DuckDB - One powerful way to mix Python and SQL

[DuckDB](https://duckdb.org/) is best summarized as the SQLite of analytics. In under 10ms, you can spin up your own in-process database that is 20x faster than SQLite, and [in most cases faster than Pandas](https://duckdb.org/2021/05/14/sql-on-pandas.html)! 

Besides the speed, why do I love DuckDB?
1. It works seamlessly with Pandas  
    You can query a DataFrame without needing to insert it into the DB, and you can return results as DF's as well. This is both simple and very fast since it is in the same process as Python. 
    
    
2. Setup is easy  
    Just pip install duckdb and you're all set. 
    
    
3. DuckDB supports almost all of PostgreSQL's syntax, but also smooths rough edges  
    When I first tested out DuckDB, it could handle everything I threw at it: recursive CTEs, window functions, arbitrary subqueries, and more. Even lateral joins are supported! Since then, it has only improved by adding regex, statistical functions, and more!  
    As an example of smoothing rough edges, Postgres is notoriously picky about capitalization, but DuckDB is not case sensitive. While most function names come from Postgres, many equivalent function names from other DBs can also be used.  
    

4. MIT licensed  


5. Parquet, CSV, and Apache Arrow structures can also be queried by DuckDB  
    Interoperability with Parquet in particular expands the ecosystem that DuckDB can interact with.
    

6. DuckDB continues to dramatically improve, and the developers are fantastic!  
    Truthfully, this should be item 0! I've had multiple (sometimes tricky) bugs be squashed in a matter of days, and several of my feature requests have been added. The entire team is excellent!
    

7. Persistence comes for free, but is optional! This allows DuckDB to work on larger-than-memory data.  
    If you are building a data pipeline, it can be super useful to see all of the intermediate steps. Plus, I believe that DuckDB databases are going to become a key multi-table data storage structure, just as SQLite is today. The developers are in the midst of adding some powerful compression to DuckDB's storage engine, so I see significant potential here.
    

8. Did I mention it's fast?  
    DuckDB is multi-threaded, so you can utilize all your CPU cores without any work on your end - no need to partition your data or anything!
    

9. DuckDB has a Relational API that is targeting Pandas compatibility  
    While I am admittedly a SQL fan, having a relational API can be very helpful to add in some of the dynamism and flexibility of Pandas. 

## An example workflow with DuckDB

In [None]:
#I use Anaconda, so NumPy, Pandas, and SQLAlchemy are already installed. Otherwise, pip install those to start with
# !pip install numpy
# !pip install pandas
# !pip install sqlalchemy

!pip install duckdb==0.2.8

#This is a SQLAlchemy driver for DuckDB. It powers the ipython-sql library below
#Thank you to the core developer of duckdb_engine, Elliana May! She rapidly squashed a bug so that it works with ipython-sql!
!pip install duckdb_engine==0.1.8rc3 

#This allows for the %%sql magic in Jupyter to do SQL syntax highlighting and execution
!pip install ipython-sql

In [1]:
import duckdb
import pandas as pd
import sqlalchemy

In [2]:
%load_ext sql
%config SqlMagic.autopandas=True

In [None]:
import inspect
from IPython import get_ipython
ip = get_ipython()
inspect.getmembers(ip.extension_manager)

In [None]:
for e in ip.extension_manager.loaded:
    print(e)
    print(type(e))

In [2]:
import pprint
pprint.pprint(locals())

{'In': ['',
        "get_ipython().run_line_magic('load_ext', 'sql')\n"
        "get_ipython().run_line_magic('config', 'SqlMagic.autopandas=True')",
        'import pprint\npprint.pprint(locals())'],
 'NamespaceMagics': <class 'IPython.core.magics.namespace.NamespaceMagics'>,
 'Out': {},
 '_': '',
 '_Jupyter': <ipykernel.zmqshell.ZMQInteractiveShell object at 0x000002480CC92208>,
 '__': '',
 '___': '',
 '__builtin__': <module 'builtins' (built-in)>,
 '__builtins__': <module 'builtins' (built-in)>,
 '__doc__': 'Automatically created module for IPython interactive environment',
 '__loader__': None,
 '__name__': '__main__',
 '__package__': None,
 '__spec__': None,
 '_dh': ['C:\\Users\\Alex\\Documents\\Python '
         'Scripts\\Alex-Monahan.github.io\\post_creation_tools'],
 '_getshapeof': <function _getshapeof at 0x000002480CD2F8B8>,
 '_getsizeof': <function _getsizeof at 0x000002480CD2F168>,
 '_i': '%load_ext sql\n%config SqlMagic.autopandas=True',
 '_i1': '%load_ext sql\n%config SqlM

In [4]:
import inspect
inspect.getmembers(_nms)

[('__class__', IPython.core.magics.namespace.NamespaceMagics),
 ('__delattr__',
  <method-wrapper '__delattr__' of NamespaceMagics object at 0x000002480C0AF0C8>),
 ('__dict__',
  {'_trait_values': {'config': {}, 'parent': None},
   '_trait_notifiers': {'config': {'change': [<traitlets.traitlets.ObserveHandler at 0x2480933a2c8>]},
    traitlets.All: {'change': []}},
   '_trait_validators': {},
   '_cross_validation_lock': False,
   'shell': <ipykernel.zmqshell.ZMQInteractiveShell at 0x2480cc92208>,
   'options_table': {},
   'magics': {'line': {'pinfo': <bound method NamespaceMagics.pinfo of <IPython.core.magics.namespace.NamespaceMagics object at 0x000002480C0AF0C8>>,
     'pinfo2': <bound method NamespaceMagics.pinfo2 of <IPython.core.magics.namespace.NamespaceMagics object at 0x000002480C0AF0C8>>,
     'pdef': <bound method NamespaceMagics.pdef of <IPython.core.magics.namespace.NamespaceMagics object at 0x000002480C0AF0C8>>,
     'pdoc': <bound method NamespaceMagics.pdoc of <IPython

In [None]:
%sql duckdb:///another_test_db.db

In [None]:
%sql select '42' as test_column

In [None]:
%%sql 
    select '42' as test_column
    union all
    select 'woot' as test_column

In [None]:
result = _
result

### Let's play with a moderately sized dataset: a 1.6 GB CSV
This file comes from this [DuckDB intro article by Uwe Korn](https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html) and can be downloaded [here.](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

DuckDB has its own CSV reader, but let's use pandas.read_csv to show how DuckDB can work in existing Pandas workflows.

In [None]:
taxi_df = pd.read_csv('')

Now that we have our data in Pandas, let's set up the ipython-sql extension for Jupyter to use an in-memory DuckDB database.

In [14]:
%load_ext sql

In [15]:
%config SqlMagic.autopandas=True

In [None]:
%sql duckdb:///:memory:

In [None]:
print(type(result))
result.info()

In [None]:
%%sql
select * from result

In [None]:
test_df2 = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
test_df2

In [None]:
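test_df3 = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
#Connect before the try block so conn2 is always defined when the finally clause runs
conn2 = duckdb.connect(database='my_example_duckdb3.db', read_only=False)
try:
    #DuckDB looks up the test_df3 DataFrame by name, so no insert is needed first
    query_output = conn2.execute('select * from test_df3').fetchdf()
finally:
    #Always release the database file, even if the query fails
    conn2.close()
query_output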

In [None]:
%sql --connection_arguments "{\"enable_external_access\":\"true\"}" duckdb:///my_example_duckdb.db select 42 as test_column

In [None]:
%sql --connection_arguments "{\"enable_external_access\":true}" duckdb:///my_example_duckdb.db select * from test_df2

In [None]:
%%sql
select * from test_df2

In [None]:
%sql -l

In [None]:
connections = _
connections['duckdb:///my_example_duckdb.db']

In [3]:
#Declaring the DataFrame as a global does not appear to help: a query sent through a SQLAlchemy engine still cannot see it
global test_df2
test_df2 = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
engine = sqlalchemy.create_engine('duckdb:///a_test_duckdb.db')
df = pd.read_sql('select * from test_df2', engine) 
df

RuntimeError: Catalog Error: Table with name test_df2 does not exist!
Did you mean "sqlite_schema"?
LINE 1: select * from test_df2
                      ^

In [8]:
engine = sqlalchemy.create_engine('duckdb:///another_test_duckdb.db')

In [11]:
%sql [{'new_engine': engine}]

Environment variable $DATABASE_URL not set, and no connect string given.
Connection info needed in SQLAlchemy format, example:
               postgresql://username:password@hostname/dbname
               or an existing connection: dict_keys([])


In [9]:
#New register feature of the driver
test_df3 = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
engine = sqlalchemy.create_engine("duckdb:///dataframe_view_test.db")
engine.execute("register", ("test_df3",test_df3))
df = pd.read_sql('select * from test_df3', engine) 
df


|   | i | j |
|---|---|---|
| 0 | 1 | one |
| 1 | 2 | two |
| 2 | 3 | three |
| 3 | 4 | four |


In [10]:
engine.execute('create table t_test_df3 as select * from test_df3')
df = pd.read_sql('select * from t_test_df3', engine) 
df

|   | i | j |
|---|---|---|
| 0 | 1 | one |
| 1 | 2 | two |
| 2 | 3 | three |
| 3 | 4 | four |


In [11]:
engine.execute('create view v_test_df3 as select * from test_df3')
df = pd.read_sql('select * from v_test_df3', engine) 
df

|   | i | j |
|---|---|---|
| 0 | 1 | one |
| 1 | 2 | two |
| 2 | 3 | three |
| 3 | 4 | four |


In [12]:
engine.dispose()

In [16]:
%sql duckdb:///dataframe_view_test.db

In [17]:
%%sql
    select
        *
    from v_test_df3

 * duckdb:///dataframe_view_test.db


RuntimeError: Catalog Error: Table with name test_df3 does not exist!
Did you mean "t_test_df3"?