# SQL in Python - Connecting to and retrieving data from PostgreSQL
We want to bring our tools together and use Python and SQL in one place. Up to now, you have connected to a SQL database using a SQL client such as DBeaver. Let's change that and get Python connected and running queries. 
 
In most organisations, data is stored in SQL databases, so to work with it locally you run SQL queries to retrieve it. You can then use Python to manipulate the data further and perhaps even send the transformed data back to the database.  

Loading data from a source, then cleaning and augmenting it before saving it, can be considered a data pipeline: a process you can run over and over again to handle data in a repeatable, structured way. Once you have built such a pipeline, you may want to reuse it in other notebooks or projects. You do this by saving your functions in your own module, from which they can be imported into other notebooks. This is the power of working with SQL directly inside Python scripts and saving those scripts for use in further projects.

**This notebook will teach you how to connect to a database and query its data in Python directly, as well as how to store the functions that enable this as a module.**

## How to work with this notebook
This notebook works a little differently from the ones we have seen before. As the topic is quite complex, we want to make it straightforward to figure out what to do without leaving you too much on your own.
Therefore we will give you hints about what code to write. You will not need to create any new code blocks. The existing code blocks come in 3 different cases:  
1. The code is complete and will run.  
2. The code block contains only a comment that tells you what code to write. We try to make it pretty clear what you need to type.  
3. There is code in the code block with a blank space ```_____``` in the code that needs to be replaced with something to make the code work.


For example:

In [None]:
# case 1: Working code
print('this code works')

In [None]:
# case 2: Write code to print "this code works"

In [None]:
#SOLUTION case 2: Write code to print "this code works"  
print('this code works')

In [None]:
# case 3: fill in the blank to make the code print the text
print(_____)

In [None]:
#SOLUTION case 3: fill in the blank to make the code print the text  
print('this code works')

## Creating a connection to a PostgreSQL database with Python
Now that we know how the notebook works, let's get started with making the connection to the SQL server.  
The go-to package in Python for connecting to a SQL database is called <ins>SQLAlchemy</ins>. Check out the official documentation here: https://www.sqlalchemy.org/.  
Complete the code below to import the package.

In [1]:
import sqlalchemy

In order to create a connection to our PostgreSQL database we need the following information:
- host = the address of the machine the database is hosted on
- port = the port number on which the database server listens for connections (PostgreSQL's default is 5432)
- database = the name of the database
- user = the name of the user
- password = the password of the user

This information was provided to you prior to the "Introduction to Databases" lecture, and you used it to create the connection in DBeaver.  
The function from the sqlalchemy package that creates a connection to the database is called ```create_engine()```. It needs the parameters listed above in order to connect to the database, so the next step is to load those values from a file. 
  
In this repository's ```/notebooks``` folder, you can find a file called env.md. This is a template that you can use to save the data in the format we need.
We ask you to add your credentials to the template and rename it to ```.env```. This is a hidden file, and storing your credentials there keeps your password from being uploaded to GitHub. The connection parameters in this file can be read by Python as key-value pairs and saved as environment variables.  
Recap: the file name is really just a dot ```.``` followed by the letters ```env```. It will be visible in your terminal when you run ```ls -la``` in the notebooks folder. 
The empty ```''``` in the file should be filled with your details. You can do this in your text editor (VS Code).

Also in this folder, you will find ```sql_functions.py```. This is a place to store functions that you can reuse in your notebooks without having to write out the code each time.
This may sound like a new concept, but we have actually been using it all along: libraries such as ```pandas``` work the same way.
We want to make this the script that helps us import the connection credentials and other useful things into our SQL notebooks.  

What's important is that the ```.env``` file is listed in ```.gitignore```, which prevents it from being accidentally pushed to the remote repository. This way we don't have to worry about the credentials becoming exposed online (we prepared the ```.gitignore``` accordingly for you).  

The whole idea is that your credentials are stored only in the ```.env``` file and not in your notebook, so that your notebook can be shared with your colleagues without giving away your secrets.
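
To make the idea concrete, here is a minimal sketch of how the key-value pairs in a `.env` file can end up in a Python dictionary. It uses only the standard library and a throwaway file with made-up values. The `read_env_file()` helper is an assumption for illustration; the real `get_sql_config()` in `sql_functions.py` may work differently (for instance via the `python-dotenv` package).

```python
import tempfile
from pathlib import Path

# Hypothetical file content; your real .env holds your own credentials.
EXAMPLE_ENV = """\
host='localhost'
port='5432'
database='exampledb'
user='alice'
password='not-a-real-password'
"""

def read_env_file(path):
    """Parse KEY='value' lines into a dict, skipping blanks and comments."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and quotes from the value.
        config[key.strip()] = value.strip().strip("'\"")
    return config

# Write the example to a temporary folder so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".env"
    env_path.write_text(EXAMPLE_ENV)
    example_config = read_env_file(env_path)

print(sorted(example_config))
```

The resulting dictionary has the same five keys that the connection needs: host, port, database, user and password.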

So, let's make the parameters host, port, database, user and password accessible from the ```.env``` file via the ```sql_functions.py``` file. To do so, run the code block below. This might seem a little complex right now, as we introduce two extra steps to get the credentials, but it will give us room to do more exciting stuff in the future, and it should become clearer as we use it more.

In [2]:
# Import the get_sql_config function from sql_functions script to make the parameters accessible: host, port, database, user, password
# if you make changes to the file and want to reimport it, you need to restart the kernel and rerun everything
from sql_functions import get_sql_config

In [3]:
# call the function we imported and save the results to a variable
sql_config = get_sql_config()

In [4]:
type(sql_config)

dict

In [5]:
sql_config.keys()

dict_keys(['host', 'port', 'database', 'user', 'password'])

We have now made the values of the environment variables accessible. Let's print the user variable to check that it worked.

In [6]:
# Print user variable
sql_config['user']

'sabrinapaulus'

Next, we're going to pass the imported variables that hold the credential information to the aforementioned create_engine() function and create a connection object called engine.  
Complete the code below and assign the imported variables to the right parameter of the create_engine() function.

In [7]:
# Create connection object engine
# The URL is only a placeholder; the real values are passed in via connect_args
engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                        connect_args=sql_config # use dictionary with config details
                        )
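
The cell above passes a placeholder URL and relies on `connect_args` to supply the real connection values. Another common pattern is to assemble the URL itself from the configuration dictionary. The sketch below shows that idea with made-up credentials and only the standard library; `build_postgres_url()` is a hypothetical helper and not part of `sql_functions.py`.

```python
from urllib.parse import quote_plus

def build_postgres_url(config):
    """Return a postgresql:// URL string built from a config dict."""
    # Percent-encode user and password so characters such as '@' or '/'
    # cannot be mistaken for URL separators.
    user = quote_plus(config["user"])
    password = quote_plus(config["password"])
    return (
        f"postgresql://{user}:{password}"
        f"@{config['host']}:{config['port']}/{config['database']}"
    )

# Made-up values for illustration only.
example_config = {
    "host": "localhost",
    "port": "5432",
    "database": "exampledb",
    "user": "alice",
    "password": "p@ss/word",
}

example_url = build_postgres_url(example_config)
print(example_url)
```

A string like this could be passed as the first argument to `create_engine()` in place of the placeholder. Avoid printing a real URL, though, since it contains the password.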


Next, let's have a look at the engine variable to see what we're working with.

In [8]:
# Print the connection object 'engine', and the type of the object
print(engine)
type(engine)

Engine(postgresql://user:***@host/database)


sqlalchemy.engine.base.Engine

The engine variable is a connection object that can create a database session. This means we can now use it to connect and then run queries.  
A connection stays open until we close it manually, and forgetting to do so can cause issues. Instead, we put the connection inside a ```with``` statement, which closes the connection when we are done.
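
If the ```with``` statement is new to you, the sketch below shows the mechanism with a stand-in class rather than a real database connection. `FakeConnection` is invented for this illustration; the point is only that the clean-up code runs when the block ends, even if an error occurs inside it.

```python
class FakeConnection:
    """Stand-in for a database connection (not a SQLAlchemy class)."""

    def __init__(self):
        self.is_open = False

    def __enter__(self):
        self.is_open = True       # runs when the with block starts
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.is_open = False      # runs when the with block ends, error or not
        return False              # do not swallow exceptions

conn = FakeConnection()
with conn:
    open_inside_block = conn.is_open

# Even when the block fails, the connection is closed afterwards.
try:
    with conn:
        raise RuntimeError("something went wrong mid-query")
except RuntimeError:
    pass

print(open_inside_block, conn.is_open)
```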

## Retrieving data from the database

Before we can use our connection to get data, we have to begin a connection session. We can then execute code in this session.
A session is created with the ```begin()``` method inside the ```with``` statement, and it ends at the end of the ```with``` block.
We use the ```execute()``` method to run our query. 
But first we need to set our schema so that we query the right tables.

In [9]:
# enter the schema name for your course
schema = 'hh_analytics_24_1'
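
Because the schema name is pasted into the query text with an f-string, it can be worth checking that it really is a plain identifier before using it. The helper below is a hypothetical safeguard, not something the course material provides; it accepts only letters, digits and underscores.

```python
import re

def check_identifier(name):
    """Return name unchanged if it looks like a simple SQL identifier."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name

schema = check_identifier("hh_analytics_24_1")
sql_query = f"select count(*) from {schema}.flights;"
print(sql_query)

# A value that tries to smuggle in extra SQL is rejected.
try:
    check_identifier("flights; drop table flights")
    rejected = False
except ValueError:
    rejected = True
```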

In [10]:
# Specify the query and pass it to the execute function
sql_query = f'select count(*) from {schema}.flights;'

with engine.begin() as conn: 
    results = conn.execute(sql_query)
    print(results)



<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x11b01a370>


If it didn't give an error then it worked! The output should look something like this  
```
<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x1057ed5e0>
```

This is not what we want yet, because the results are wrapped inside an object. We need one more step to extract them: calling the ```fetchall()``` method.

In [12]:
# call the fetchall method on the results object to print the results of the query
sql_query = f'select count(*) from {schema}.flights;'

with engine.begin() as conn: 
    results = conn.execute(sql_query)
    print(results.fetchall())



[(361428,)]
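
To experiment with this pattern without touching the course database, the sketch below runs the same steps against a throwaway in-memory SQLite database. The table and its three rows are made up. The queries are wrapped in `sqlalchemy.text()`, which is an addition to the cells above: it is accepted by SQLAlchemy 1.4 and is required by SQLAlchemy 2.x, where `execute()` no longer takes a plain string.

```python
import sqlalchemy
from sqlalchemy import text

# "sqlite://" creates a temporary in-memory database; no credentials needed.
demo_engine = sqlalchemy.create_engine("sqlite://")

with demo_engine.begin() as conn:
    # Made-up table standing in for the real flights table.
    conn.execute(text("CREATE TABLE flights (origin TEXT, dest TEXT)"))
    conn.execute(
        text("INSERT INTO flights (origin, dest) VALUES (:origin, :dest)"),
        [
            {"origin": "CVG", "dest": "BOS"},
            {"origin": "BOS", "dest": "CVG"},
            {"origin": "CVG", "dest": "JFK"},
        ],
    )
    results = conn.execute(text("SELECT count(*) FROM flights"))
    # Fetch the rows while the connection is still open.
    rows = results.fetchall()

print(rows)
```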


Now we have our results: a count of the number of rows in our flights table. Great!
Let's try making a function that we can reuse to get more data.

In [16]:
# Create a function called 'run_query' using the above with statement that takes a query in the form of a string and
# returns the output of the query

def run_query(sql_query):
    with engine.begin() as conn:
        results = conn.execute(sql_query)
        # fetch the rows while the connection is still open
        return results.fetchall()

In [17]:
# execute the query to get the first 7 rows from the flights table
sql_query_1 = f'select * from {schema}.flights limit 7'
run_query(sql_query_1)

[(datetime.datetime(2021, 1, 3, 0, 0), 727.0, 730, -3.0, 924.0, 939, -15.0, '9E', 'N607LR', 4628, 'CVG', 'BOS', 97.0, 117.0, 752, 0, 0),
 (datetime.datetime(2021, 1, 4, 0, 0), 737.0, 730, 7.0, 938.0, 939, -1.0, '9E', 'N602LR', 4628, 'CVG', 'BOS', 103.0, 121.0, 752, 0, 0),
 (datetime.datetime(2021, 1, 7, 0, 0), 1710.0, 1715, -5.0, 1911.0, 1912, -1.0, '9E', 'N295PQ', 4628, 'CVG', 'BOS', 104.0, 121.0, 752, 0, 0),
 (datetime.datetime(2021, 1, 8, 0, 0), 1711.0, 1715, -4.0, 1926.0, 1912, 14.0, '9E', 'N324PQ', 4628, 'CVG', 'BOS', 106.0, 135.0, 752, 0, 0),
 (datetime.datetime(2021, 1, 10, 0, 0), 1709.0, 1715, -6.0, 1900.0, 1912, -12.0, '9E', 'N297PQ', 4628, 'CVG', 'BOS', 94.0, 111.0, 752, 0, 0),
 (datetime.datetime(2021, 1, 11, 0, 0), 1713.0, 1715, -2.0, 1859.0, 1912, -13.0, '9E', 'N902XJ', 4628, 'CVG', 'BOS', 94.0, 106.0, 752, 0, 0),
 (datetime.datetime(2021, 1, 14, 0, 0), 1725.0, 1715, 10.0, 1921.0, 1912, 9.0, '9E', 'N337PQ', 4628, 'CVG', 'BOS', 100.0, 116.0, 752, 0, 0)]

There we go! Finally we have the output of our SQL query as a list of tuples, one tuple per row.  
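
Each row in that list is a tuple, so values are picked out by position. The sketch below uses two shortened rows (only four of the columns, typed in by hand) to show positional access and a conversion to dictionaries, which can be easier to read. The `column_names` list is an assumption for this illustration.

```python
# Shortened rows: only airline, origin, dest and distance are kept here.
column_names = ["airline", "origin", "dest", "distance"]
rows = [
    ("9E", "CVG", "BOS", 752),
    ("9E", "CVG", "BOS", 752),
]

first_row = rows[0]            # first tuple in the list
first_origin = first_row[1]    # second value in that tuple

# Pair each value with its column name to get one dict per row.
records = [dict(zip(column_names, row)) for row in rows]
print(first_origin, records[0]["distance"])
```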

Perfect, let's summarise the steps we have performed above:
1. (Install and) import the sqlalchemy package
2. Create a database engine with the create_engine() function
3. Open a connection to the database using a *with* statement
4. Use the execute() method of the connection to run a SQL query
5. Use the fetchall() method of the results object to retrieve the output of the SQL query
6. Put this into a simple function so we can reuse it

In total we needed 6 steps to connect to the database and retrieve data.  
Even though we won't have to go through all of these steps every time we query data, this is still a somewhat tedious process. On top of that, there is another inconvenience when it comes to the output: the data we retrieve from the database arrives as a list. That is useful when we want specific values, but for analysis we would rather have the results of our SQL queries in dataframes.

Let's fix these problems and do the following:
1. Write an expanded custom function that performs all of the steps above
2. Change the code so the SQL output is in a dataframe

## Using a custom function for data retrieval
Instead of having to write multiple lines of code whenever we want to query data from Python, we're going to make our lives easier by writing a custom function that will execute all of the necessary steps automatically. For this, we're going to define a custom function called ```get_data()``` below. This function should only expect one argument: query. The function should be able to take any query as a string, create a connection to the database, execute the query, output the data and close the connection.

Complete the code below so that the get_data() function creates the engine and returns the output of any SQL query we pass it.

In [18]:
from sql_functions import get_sql_config

In [19]:
# Write the get data function
def get_data(sql_query):
    ''' Connect to the PostgreSQL database server, run query and return data'''
    # get the connection configuration dictionary using the get_sql_config function
    sql_config = get_sql_config()

    # create a connection engine to the PostgreSQL server
    engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                        connect_args=sql_config
                        )

    # open a conn session using 'with', execute the query, and return the results
    with engine.begin() as conn:
        results = conn.execute(sql_query)
        return results.fetchall()

In [20]:
schema = 'hh_analytics_24_1'

Now it's time to check if your function works. Use the get_data() function below to return the top two rows of the flights table.

In [21]:
# Print top 2 rows from flights
get_data(f'select * from {schema}.flights limit 2')

[(datetime.datetime(2021, 1, 3, 0, 0), 727.0, 730, -3.0, 924.0, 939, -15.0, '9E', 'N607LR', 4628, 'CVG', 'BOS', 97.0, 117.0, 752, 0, 0),
 (datetime.datetime(2021, 1, 4, 0, 0), 737.0, 730, 7.0, 938.0, 939, -1.0, '9E', 'N602LR', 4628, 'CVG', 'BOS', 103.0, 121.0, 752, 0, 0)]

It works, awesome! Now whenever we want to connect to our database and retrieve data, we can simply use the get_data() function. How convenient! We can't call ourselves Python hackers yet, though, because we would also like to have our data returned as a dataframe.
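
One detail of `get_data()` as written is that it builds a new engine on every call. A possible refinement, sketched below with a stand-in factory instead of a real `create_engine()` call, is to cache the engine so it is created once and then reused. `make_engine()`, `get_data_demo()` and the creation log are invented for this illustration.

```python
from functools import lru_cache

creation_log = []   # records every time the factory body actually runs

@lru_cache(maxsize=1)
def make_engine():
    """Stand-in for the create_engine() call; returns a placeholder object."""
    creation_log.append("engine created")
    return object()

def get_data_demo(sql_query):
    engine = make_engine()   # reuses the cached object after the first call
    return engine, sql_query

first_engine, _ = get_data_demo("select 1")
second_engine, _ = get_data_demo("select 2")
print(len(creation_log), first_engine is second_engine)
```

Whether this is worth doing depends on how often the function is called; for a handful of queries in a notebook, creating the engine each time is usually fine.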

## Using pandas methods for data retrieval
We already know that the output of the ```fetchall()``` method is a list, which is inconvenient to work with. Luckily, there is a function that lets us read the result of a SQL query directly into a dataframe. It's called ```read_sql_query()``` and can be found in the pandas package.
The function expects a SQL query as the first argument and a connection (object) to a database as the second argument.
Complete the code below and return the output in a dataframe:

In [22]:
# Import pandas package
import pandas as pd

# Print top 7 rows from flights table using pandas method
pd.read_sql_query(sql=f'select * from {schema}.flights limit 7', con=engine)

| | flight_date | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | airline | tail_number | flight_number | origin | dest | air_time | actual_elapsed_time | distance | cancelled | diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-03 | 727.0 | 730 | -3.0 | 924.0 | 939 | -15.0 | 9E | N607LR | 4628 | CVG | BOS | 97.0 | 117.0 | 752 | 0 | 0 |
| 1 | 2021-01-04 | 737.0 | 730 | 7.0 | 938.0 | 939 | -1.0 | 9E | N602LR | 4628 | CVG | BOS | 103.0 | 121.0 | 752 | 0 | 0 |
| 2 | 2021-01-07 | 1710.0 | 1715 | -5.0 | 1911.0 | 1912 | -1.0 | 9E | N295PQ | 4628 | CVG | BOS | 104.0 | 121.0 | 752 | 0 | 0 |
| 3 | 2021-01-08 | 1711.0 | 1715 | -4.0 | 1926.0 | 1912 | 14.0 | 9E | N324PQ | 4628 | CVG | BOS | 106.0 | 135.0 | 752 | 0 | 0 |
| 4 | 2021-01-10 | 1709.0 | 1715 | -6.0 | 1900.0 | 1912 | -12.0 | 9E | N297PQ | 4628 | CVG | BOS | 94.0 | 111.0 | 752 | 0 | 0 |
| 5 | 2021-01-11 | 1713.0 | 1715 | -2.0 | 1859.0 | 1912 | -13.0 | 9E | N902XJ | 4628 | CVG | BOS | 94.0 | 106.0 | 752 | 0 | 0 |
| 6 | 2021-01-14 | 1725.0 | 1715 | 10.0 | 1921.0 | 1912 | 9.0 | 9E | N337PQ | 4628 | CVG | BOS | 100.0 | 116.0 | 752 | 0 | 0 |


This output looks like a dataframe. Was it really that easy? Let's check if the output really is a dataframe.  
Complete the code below and check if the output is of type dataframe.

In [23]:
# Print the type of the read_sql_query() output
type(pd.read_sql_query(sql=f'select * from {schema}.flights limit 7', con=engine))

pandas.core.frame.DataFrame

It really is a dataframe! Wow, that was easy, and what's even better: we don't need a 'with' statement, and we don't need to call the execute() and fetchall() methods anymore, although we still need the engine.  
The only thing left is to write a new function that works like our get_data() function: it creates the engine, runs the query, and returns the dataframe in one step.  
Let's build a new custom function that takes a SQL query and outputs the dataframe.  
You can do this by copying the content of the code block where we defined the get_data() function and pasting it into the code block below. Then, adjust the code by replacing the redundant steps with our new read_sql_query() function.

In [24]:
# define a new function get_dataframe() based on format of get_data() but using read_sql_query()
def get_dataframe(sql_query):
    sql_config = get_sql_config()
    engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                        connect_args=sql_config
                        )
    return pd.read_sql_query(sql=sql_query, con=engine)

Let's make sure the function works by returning the first 7 rows of the flights table below.

In [25]:
# Display a dataframe containing the top 7 rows from flights
get_dataframe(f'select * from {schema}.flights limit 7')

| | flight_date | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | airline | tail_number | flight_number | origin | dest | air_time | actual_elapsed_time | distance | cancelled | diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-03 | 727.0 | 730 | -3.0 | 924.0 | 939 | -15.0 | 9E | N607LR | 4628 | CVG | BOS | 97.0 | 117.0 | 752 | 0 | 0 |
| 1 | 2021-01-04 | 737.0 | 730 | 7.0 | 938.0 | 939 | -1.0 | 9E | N602LR | 4628 | CVG | BOS | 103.0 | 121.0 | 752 | 0 | 0 |
| 2 | 2021-01-07 | 1710.0 | 1715 | -5.0 | 1911.0 | 1912 | -1.0 | 9E | N295PQ | 4628 | CVG | BOS | 104.0 | 121.0 | 752 | 0 | 0 |
| 3 | 2021-01-08 | 1711.0 | 1715 | -4.0 | 1926.0 | 1912 | 14.0 | 9E | N324PQ | 4628 | CVG | BOS | 106.0 | 135.0 | 752 | 0 | 0 |
| 4 | 2021-01-10 | 1709.0 | 1715 | -6.0 | 1900.0 | 1912 | -12.0 | 9E | N297PQ | 4628 | CVG | BOS | 94.0 | 111.0 | 752 | 0 | 0 |
| 5 | 2021-01-11 | 1713.0 | 1715 | -2.0 | 1859.0 | 1912 | -13.0 | 9E | N902XJ | 4628 | CVG | BOS | 94.0 | 106.0 | 752 | 0 | 0 |
| 6 | 2021-01-14 | 1725.0 | 1715 | 10.0 | 1921.0 | 1912 | 9.0 | 9E | N337PQ | 4628 | CVG | BOS | 100.0 | 116.0 | 752 | 0 | 0 |


Now that we have the data in a dataframe, it's easy to apply all the different data exploration and cleaning techniques we have learned already. We don't have to do that right now, but it will become very useful in the future!  
Congratulations, from now on you can call yourself a Python hacker!  
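
As a small illustration of that workflow, the sketch below loads a query result into a dataframe and then filters it with ordinary pandas code. It uses an in-memory SQLite database with made-up rows so that it runs without credentials; with the course database you would pass the `engine` as `con` instead.

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database with a made-up miniature flights table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, dep_delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("CVG", "BOS", -3.0), ("CVG", "BOS", 7.0), ("BOS", "CVG", 10.0)],
)

# pandas accepts a sqlite3 connection directly as `con`.
demo_df = pd.read_sql_query("SELECT * FROM flights", con=conn)
conn.close()

# From here on it is regular pandas: keep only the delayed departures.
delayed = demo_df[demo_df["dep_delay"] > 0]
print(delayed)
```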

But wait..., what if I told you there is a way of making your functions even more powerful?  

## Using a custom Python module for data retrieval
We will now look into how to turn these helpful functions into reusable tools that you can use in additional notebooks without having to type out the code each time.
First, let's think about what we have done so far. In this notebook we created the functions ```get_data()``` and ```get_dataframe()```, which create an engine and run a query to return the results. We defined the functions once and were then able to call them as we liked, with different SQL queries as inputs, to get the results we wanted. But we have the limitation that these functions only live in this notebook. What if we have multiple Jupyter notebooks in our repository and we want to use the functions in those notebooks as well? How could we do that?  

The naive approach is to simply copy and paste your code over and over, which should always be avoided. One reason is that it creates a lot of maintenance work whenever the connection details or anything else in the function needs to change. You would have to find and change each function definition in each file, which not only increases the likelihood of errors but also becomes a tedious, unnecessarily time-consuming task.  

The more elegant approach is to create your own module (or package). You already have lots of experience importing modules like pandas or matplotlib, which give you access to new functions that you need for your programs. This is what we want to build: a module that contains functions that we can use in our scripts. The only difference is that instead of using conda to install these packages, you write them yourself in a Python script file. We will now create a Python module and use it to see how this works.

Let's get started!

Please perform the following four steps:
1. Open ```sql_functions.py``` in VS Code (you can use the split screen tool for this)
2. Write the code to import all Python packages used in the ```get_data()``` and ```get_dataframe()``` functions
3. Copy and paste the code you wrote in this notebook that defines the ```get_data()``` and ```get_dataframe()``` functions into the ```sql_functions.py``` file
4. Save the ```sql_functions.py``` file



In the next code block we will work through the following steps to import ```sql_functions.py```:  
1. We will, in this notebook, write the code to import the Python file into this Jupyter notebook, giving it an alias  
2. We will then call the get_dataframe() function to retrieve data, proving it has worked  

**Important: To make sure we're using the functions from our sql_functions.py module and not the ones we defined earlier in this Jupyter notebook, we need to give the imported module an alias and use it to reference the functions.**

In [27]:
schema = 'hh_analytics_24_1'

In [28]:
# Import the sql_functions module with an alias sf
import sql_functions as sf

# Load the top 7 rows from flights using the get_dataframe function from the sql_functions module
df = sf.get_dataframe(f'SELECT * FROM {schema}.flights LIMIT 7')

In [29]:
df.head()

| | flight_date | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | airline | tail_number | flight_number | origin | dest | air_time | actual_elapsed_time | distance | cancelled | diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-03 | 727.0 | 730 | -3.0 | 924.0 | 939 | -15.0 | 9E | N607LR | 4628 | CVG | BOS | 97.0 | 117.0 | 752 | 0 | 0 |
| 1 | 2021-01-04 | 737.0 | 730 | 7.0 | 938.0 | 939 | -1.0 | 9E | N602LR | 4628 | CVG | BOS | 103.0 | 121.0 | 752 | 0 | 0 |
| 2 | 2021-01-07 | 1710.0 | 1715 | -5.0 | 1911.0 | 1912 | -1.0 | 9E | N295PQ | 4628 | CVG | BOS | 104.0 | 121.0 | 752 | 0 | 0 |
| 3 | 2021-01-08 | 1711.0 | 1715 | -4.0 | 1926.0 | 1912 | 14.0 | 9E | N324PQ | 4628 | CVG | BOS | 106.0 | 135.0 | 752 | 0 | 0 |
| 4 | 2021-01-10 | 1709.0 | 1715 | -6.0 | 1900.0 | 1912 | -12.0 | 9E | N297PQ | 4628 | CVG | BOS | 94.0 | 111.0 | 752 | 0 | 0 |


Worked like a charm!  
Great, let's summarise: Whenever we want to create and use a custom function in any other Jupyter notebook in our repository, we  

1. create a new Python module (which is a .py file),
2. import the necessary packages and write the function definitions into this script,
3. import it into whatever Jupyter notebook we want using the standard import syntax.

One important note is that the module file needs to be in the same directory you are working in (or you can use some other tricks that you can research).
Another important note to remember is that Python caches module files on import. So if you make changes to your .py file, you need to ```restart``` your interpreter!  
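
The caching behaviour can be seen in a small experiment. The sketch below writes a throwaway module called `demo_helpers` (a made-up name) to a temporary folder, imports it, edits the file, and shows that the change only becomes visible after reloading. It also demonstrates one of the "other tricks" for importing from a different folder, namely adding that folder to `sys.path`. `importlib.reload()` from the standard library is an alternative to restarting that works for simple cases like this one; names imported with `from module import name` keep pointing at the old objects until you import them again.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True   # keep this quick demo free of cached .pyc files

# Create a throwaway module in a temporary folder.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "demo_helpers.py"
module_file.write_text("def greet():\n    return 'version 1'\n")

# Make the folder importable, then import the module.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import demo_helpers

first = demo_helpers.greet()

# Edit the file on disk; the already imported module does not notice.
module_file.write_text("def greet():\n    return 'version 2 (edited)'\n")
second_without_reload = demo_helpers.greet()

# Reloading re-reads the file and updates the module in place.
importlib.reload(demo_helpers)
third = demo_helpers.greet()

print(first, "|", second_without_reload, "|", third)
```

In a notebook, restarting the kernel and rerunning the cells remains the most reliable way to pick up changes.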

If you want to go deeper into what modules are and why we did what we did, check out the official [documentation](https://docs.python.org/3/tutorial/modules.html) about modules or reach out to us. It's important that you understand these concepts, since we will be working with Python modules in future Jupyter notebooks.

Congratulations for making it through this notebook, you deserve to call yourself a badass Python hacker!

concept  |  description
---|---|
`sqlalchemy`      | high-level python library for managing all kinds of relational databases
`.env`      |   hidden file to store your connection details and secret information like passwords
`dotenv_values(".env")` | function that loads the connection variables from the .env file
`sql_functions.py` | python file that contains functions that can be imported into your notebook
`create_engine()`      |   creates an `engine` that manages a connection to a DB
`with engine.begin() as conn` | opens a database connection to read or write data
`conn.execute(sql)` | submit arbitrary SQL statements to a DB
`pd.read_sql_query(sql, engine)` | runs a query and returns data as a DataFrame
`get_data(sql)` | a function we wrote that combines getting the connection details, creating the engine and running a query