# SQL in Python - Connecting to and retrieving data from PostgreSQL
We want to bring our tools together and use Python and SQL in one place. Up to now, you have connected to a SQL database using a SQL client such as DBeaver.  
DBeaver lets you connect to a database to run SQL queries, create new tables, populate them with data and retrieve that data. It is only one of many tools that can connect to a database; Python is another.  
Often you have data on your local machine that you have manipulated in Python, for example by running data cleaning procedures to bring it into shape. This process can be considered a data pipeline.  
In many cases, the final step in the workflow is saving the output of your data pipeline to a database for future use by yourself or others.
Fortunately, we can connect to the database from Python directly, eliminating the need for a separate SQL client. 
  
**This notebook will teach you how to connect to a database and query its data in Python directly, as well as how to automate the process using custom functions and modules.**

## How to work with this notebook
This notebook works a little differently from the ones we have seen before. As the topic is quite complex, we want to make it straightforward to figure out what to do without leaving you too much on your own.
Therefore we will give you hints about what code to write. You will not need to create any new code blocks. In the existing code blocks you will encounter 3 different situations:  
1. The code is complete and will run.  
2. The code block contains only a comment that tells you what code to write. It will be pretty obvious what you need to type.  
3. There is code in the code block with a blank space ```_____``` in the code that needs to be replaced with something to make the code work


For example:

In [None]:
# case 1: Working code
print('this code works')

In [None]:
# case 2: Write code to print "this code works"

In [None]:
#case 2: Write code to print "this code works" SOLUTION 
print('this code works')

In [None]:
# case 3: fill in the blank to make the code print the text
____('this code works')

In [None]:
# case 3: fill in the blank to make the code print the text SOLUTION 
print('this code works')

## Creating a connection to a PostgreSQL database with Python
The go-to package in Python for connecting to a SQL database is called <ins>SQLAlchemy</ins>. Check out the official documentation here: https://www.sqlalchemy.org/.  
Complete the code below to import the package.

In [None]:
# Import sqlalchemy package

In [4]:
# SOLUTION - delete me for final version
import sqlalchemy 

In order to create a connection to our PostgreSQL database we need the following information:
- host = the address of the machine the database is hosted on
- port = the virtual gate number through which communication will be allowed
- database = the name of the database
- user = the name of the user
- password = the password of the user

This information was provided to you prior to the "Introduction to Databases" lecture, and you used it to create the connection in DBeaver.  
The function from the sqlalchemy package that creates a connection to the database is called ```create_engine()```. It needs the parameters listed above in order to connect to the database, so the next step is to load those values from a file. 
  
In the ```/notebooks``` folder of this repository, you can find a file called env.md. This is a template that you can use to save the credentials in the format we need.
We ask you to add your credentials to the template and save it as a file called ```.env```. This is a hidden file that keeps your password from being uploaded to GitHub. The connection parameters in this file can be read by Python as key-value pairs and saved as environment variables.  
Recap: the file name is really just a dot ```.``` followed by the letters ```env```. The file will be visible in your terminal when you run ```ls -la``` in the notebooks folder. 
Fill in the empty ```''``` in the file with your details. You can do this in your text editor (VS Code).

Also in this folder, you will find a file called ```sql_functions.py```. This is a place to store functions that you can reuse in your notebook without having to write out the code each time.
The concept may sound new, but we have been using it all along whenever we imported a library, for example ```pandas```.
We want to make this the script that helps us import the connection credentials and other useful things into our SQL notebooks.  

What's important is that the ```.env``` file is listed in ```.gitignore```, which prevents it from being accidentally pushed to the remote repository. Now we don't have to worry about the credentials becoming exposed online (we prepared the ```.gitignore``` accordingly for you).  

The whole idea is that your credentials are stored only in the ```.env``` file and not in your notebook, so that you can share the notebook with your colleagues without giving away your secrets.
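
To make the idea more concrete, here is a minimal sketch of how key-value pairs in a ```.env```-style file can be turned into a Python dictionary. This is not the code in ```sql_functions.py```: the helper name ```read_env_file```, the key names and the sample values are assumptions made up for illustration, and the sketch writes a throwaway file to a temporary folder instead of touching your real ```.env```.

```python
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse simple key='value' lines into a dictionary (hypothetical helper)."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        # remove surrounding whitespace and quotes from the value
        config[key.strip()] = value.strip().strip('\'"')
    return config


# made-up credentials, written to a temporary file for demonstration only
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / '.env'
    env_path.write_text(
        "host='db.example.com'\n"
        "port='5432'\n"
        "database='example_db'\n"
        "user='example_user'\n"
        "password='not-a-real-password'\n"
    )
    example_config = read_env_file(env_path)

print(example_config['user'])
```

Note that every value read this way is a string, including the port.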

So, let's make the parameters host, port, database, user and password accessible from the ```.env``` via the ```sql_functions.py``` file. To do so, run the code block below. This might seem a little complex right now as we introduce two extra steps to get the credentials but this will give us room to do more exciting stuff in the future, and it should become more clear as we use it more...

In [5]:
# Import the get_sql_config function from sql_functions script to make the parameters accessible: host, port, database, user, password
# if you make changes to the file and want to reimport it, you need to restart the kernel and rerun everything
from sql_functions import get_sql_config

In [None]:
# call the function we imported and save the results to a variable
sql_config = ______()

In [6]:
#SOLUTION
sql_config = get_sql_config()

We have now made the values of the environment variables accessible. Let's print the user variable to check if it worked.

In [None]:
# Print user variable
sql_config['____']

In [20]:
# SOLUTION
sql_config['user']

'andrewemmett'

In [26]:
sql_config['port'] = 5432

Next, we're going to pass the imported variables that hold the credential information to the aforementioned create_engine() function and create a connection object called engine.  
Complete the code below and assign the imported variables to the right parameter of the create_engine() function.

In [None]:
# Create connection object engine
engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                        connect_args=_____ # use dictionary with config details
                        )


In [27]:
# SOLUTION Create connection object engine
engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                        connect_args=sql_config # dictionary with config details
                        )


Next, let's have a look at the engine variable to see what we're working with.

In [None]:
# Print the connection object 'engine', and the type of the object


In [22]:
# SOLUTION
print(engine, type(engine))

Engine(postgresql://user:***@host/database) <class 'sqlalchemy.engine.base.Engine'>


The engine variable is a connection object that can create a database session. This means that we can now use it to connect and then run queries.  
A connection stays open until we close it manually, which can cause issues. Instead, we put the connection inside a ```with``` statement that closes it when we are done.
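
If the ```with``` statement is new to you, the following sketch shows the mechanism in isolation. ```DemoConnection``` is a hypothetical stand-in class written for this example, not part of SQLAlchemy; it only records what happens to it so that we can see the clean-up step run automatically.

```python
class DemoConnection:
    """Stand-in for a database connection that records what happens to it."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        # runs when the with block starts
        self.log.append('open')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs when the with block ends, even if the block raised an error
        self.log.append('close')
        return False  # do not swallow exceptions

    def execute(self, query):
        self.log.append(f'run: {query}')


events = []
with DemoConnection(events) as conn:
    conn.execute('select 1')

print(events)
```

The printed list shows that the connection was opened, used and closed, although we never called a close method ourselves.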

## Retrieving data from the database

Before we can use our connection to get data, we have to begin a connection session. We can then execute code in this session.
A session is created with the ```begin()``` method inside the ```with``` statement. The session ends when the ```with``` block ends.
We use the ```execute()``` method to run our query.
But first we need to set our schema so that we query the right tables.

In [None]:
# enter the schema name for your course
schema = '_______'

In [11]:
# SOLUTION enter the schema name for your course
schema = 'cgn_analytics_22_1'

In [11]:
# Specify the query and pass it to the execute function
sql_query = f'select count(*) from {schema}.flights;'

with engine.begin() as conn: 
    results = conn.execute(_____)
    print(results)



NameError: name '_____' is not defined

In [23]:
# SOLUTION Specify the query and pass it to the execute function
sql_query = f'select count(*) from {schema}.flights;'

with engine.begin() as conn: 
    results = conn.execute(sql_query)
    print(results)



<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x10556bfd0>


If it didn't give an error then it worked! The output should look something like this  
```
<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x1057ed5e0>
```

This is not what we expect, because the results are inside an object. We need to run one more command to extract them, namely the ```fetchall()``` method.

In [None]:
# call the fetchall method on the results object to print the results of the query
sql_query = f'select count(*) from {schema}.flights;'

with engine.begin() as conn: 
    results = conn.execute(sql_query)
    print(results._____())



In [28]:
# SOLUTION call the fetchall method on the results object to print the results of the query
sql_query = f'select count(*) from {schema}.flights;'

with engine.begin() as conn: 
    results = conn.execute(sql_query)
    print(results.fetchall())



[(361428,)]


Now we have our results! A count of the number of rows in our flights table. Great!
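
The result is a list that contains one tuple per row, so the number is wrapped twice. A small sketch of how to pull out the bare value, using the example output from above rather than a live query:

```python
# example value copied from the output above
rows = [(361428,)]

first_row = rows[0]        # a tuple with one element: (361428,)
row_count = first_row[0]   # the number itself

print(row_count)
```
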
Let's try making a function that we can reuse to get more data.

In [None]:
# Create a function called 'run_query' using the above with statement that takes a query in the form of a string and
# returns the output of the query


In [8]:
# SOLUTION
# Create a function called 'run_query' using the above with statement that takes any query in the form of a string and
# returns the output of the query
def run_query(sql_query):
    with engine.begin() as conn: 
        results = conn.execute(sql_query)
        return results.fetchall()

In [None]:
# execute the query to get the first 5 rows from the flights table
sql_query = f'__________________________'
run_query(_______)

In [9]:
# SOLUTION execute the query to get the first 5 rows from the flights table
sql_query = f'select * from {schema}.flights limit 5;'
run_query(sql_query)

[(datetime.datetime(2021, 1, 9, 0, 0), 714.0, 720, -6.0, 858.0, 911, -13.0, 'OO', 'N170SY', 3303, 'PDX', 'SJC', 89.0, 104.0, 569, 0, 0),
 (datetime.datetime(2021, 1, 9, 0, 0), 1034.0, 1040, -6.0, 1317.0, 1327, -10.0, 'OO', 'N198SY', 3306, 'PDX', 'SLC', 88.0, 103.0, 630, 0, 0),
 (datetime.datetime(2021, 1, 9, 0, 0), 855.0, 900, -5.0, 1016.0, 1027, -11.0, 'OO', 'N174SY', 3307, 'SAN', 'SJC', 67.0, 81.0, 417, 0, 0),
 (datetime.datetime(2021, 1, 9, 0, 0), 2111.0, 2115, -4.0, 2326.0, 2334, -8.0, 'OO', 'N182SY', 3308, 'SEA', 'BOI', 60.0, 75.0, 399, 0, 0),
 (datetime.datetime(2021, 1, 9, 0, 0), 1823.0, 1830, -7.0, 2008.0, 1955, 13.0, 'OO', 'N197SY', 3309, 'BOI', 'LAX', 99.0, 165.0, 674, 0, 0)]

In [None]:
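# list the tables in our schema; PostgreSQL has no sqlite_master, so we query information_schema.tables
run_query(f"""SELECT table_name
  FROM information_schema.tables
 WHERE table_schema = '{schema}'""")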

There we go! Finally we have the output of our SQL query as a list.  

Perfect, let's summarise the steps we have performed above:
1. (Install and) Import the sqlalchemy package
2. Create a database connection object with the create_engine() method
3. Open a connection to the database using a *with* statement
4. Use the execute() method of the connection to execute a SQL query
5. Use the fetchall() method of the results object to retrieve the output of the SQL query
6. Put this into a simple function to allow us to reuse it

In total we needed 6 steps to connect to the database and retrieve data.  
Even though we probably won't have to go through all the steps every time we query data, this is still a somewhat tedious process. On top of that, there is another inconvenience when it comes to the output: the data we retrieve from the database comes back as a list. This is useful when we want specific values, but less convenient for analysis. In a later step we will explore how to get the results of SQL queries into dataframes.
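
To see why a plain list is awkward for tabular work, the sketch below uses Python's built-in ```sqlite3``` module with a small in-memory table as a stand-in for our PostgreSQL flights table. The table name ```demo_flights```, its columns and its values are made up for this example. The rows come back without column names, so we have to attach them ourselves.

```python
import sqlite3
from contextlib import closing

# in-memory SQLite database as a stand-in; this is not our PostgreSQL server
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute('create table demo_flights (origin text, dest text, distance integer)')
    conn.executemany(
        'insert into demo_flights values (?, ?, ?)',
        [('PDX', 'SJC', 569), ('PDX', 'SLC', 630)],
    )
    cursor = conn.execute('select * from demo_flights order by distance')
    rows = cursor.fetchall()
    # the column names live on the cursor, not in the rows
    columns = [col[0] for col in cursor.description]

# pair each value with its column name
records = [dict(zip(columns, row)) for row in rows]

print(rows)
print(records[0]['dest'])
```

A dataframe does this bookkeeping for us, which is one reason it is the more comfortable format for analysis.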

Let's fix these problems and do the following:
1. Write an expanded custom function that performs all of the steps above
2. Change the code so the SQL output is in a dataframe

## Automating data retrieval with a custom function: get_data()
Instead of having to write multiple lines of code whenever we want to query data from Python, we're going to make our lives easier by writing a custom function that will execute all of the necessary steps automatically. For this, we're going to define a custom function called <ins>get_data()</ins> below. This function should only expect one argument: query. The function should be able to take any query as a string, create a connection to the database, execute the query, output the data and close the connection.

Complete the code below so that the get_data() function creates the engine and returns the output of any SQL query we pass it.

In [None]:
# Write the get data function
def get_data(_____):
    ''' Connect to the PostgreSQL database server, run query and return data'''
    # get the connection configuration dictionary using the get_sql_config function
    
    # create a connection engine to the PostgreSQL server
    
    # open a conn session using 'with', execute the query, and return the results
    

In [None]:
# SOLUTION
def get_data(sql_query):
    '''Connect to the PostgreSQL database server, run query and return data'''

    # get the connection configuration using the get_sql_config function
    sql_config = get_sql_config()
    # create a connection engine to the PostgreSQL server
    engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                    connect_args=sql_config # dictionary with config details
                    )
    
    # open a conn session using with, run the query, and return the results
    with engine.begin() as conn: 
        results = conn.execute(sql_query)
        return results.fetchall()

Now it's time to check if your function works. Use the get_data() function below to return the top five rows of the flights table.

In [None]:
# Print top 5 rows from flights
get_data('____________')

It works, awesome! Now whenever we want to connect to our database and retrieve data we can simply use the get_data() function, how convenient! Although, we can't call ourselves Python hackers yet, because we would also like to have our data outputted to a dataframe.

## Using pandas methods to extract data from sql database into a dataframe
We already know that the output format of the fetchall() method is a list, which is inconvenient to work with. Luckily there is a function that lets us read the result of a SQL query directly into a dataframe. It's called <ins>read_sql_query()</ins> and can be found in the pandas package.
The function expects a SQL query as the first argument and a connection (object) to a database as the second argument.
Complete the code below and return the output in a dataframe.

In [None]:
# Import pandas package
import pandas as pd

# Print top 5 rows from flights table using read_sql_query()
__.read_sql_query(sql='_________________', con=____)

In [None]:
# SOLUTION
# Import pandas package
import pandas as pd

# Print top 5 rows from flights table using read_sql_query()
pd.read_sql_query(sql=sql_query, con=engine)

This output looks like a dataframe. Was it really that easy? Let's check if the output really is a dataframe.  
Complete the code below and check if the output is of type dataframe.

In [None]:
# Print type of read_sql_query() output


In [None]:
# SOLUTION # Print type of read_sql_query() output
print(type(pd.read_sql_query(sql=sql_query, con=engine)))

It really is a dataframe! Wow, that was easy, and what's even better: we don't need a with statement and we don't need to call the execute() and fetchall() methods anymore, although we still need the engine.  
The only thing left is to write a new function that works like our get_data() function: it creates the engine, runs the query and returns the dataframe in one step.
Copy the content of the code block where we define the get_data() function and paste it into the code block below. Then, adjust the code by replacing the redundant steps with our new read_sql_query() function.

In [None]:
# define a new function get_dataframe() based on format of get_data() but using read_sql_query()
def get_dataframe(sql_query):
    ''' 
    Connect to the PostgreSQL database server, 
    run query and return data as a pandas dataframe
    '''
    

In [None]:
# SOLUTION define a new function get_dataframe() based on get_data() and using read_sql_query()
def get_dataframe(sql_query):
    ''' 
    Connect to the PostgreSQL database server, 
    run query and return data as a pandas dataframe
    '''

    # get the connection configuration using the get_sql_config function
    sql_config = get_sql_config()
    # create a connection engine to the PostgreSQL server
    engine = sqlalchemy.create_engine('postgresql://user:pass@host/database',
                    connect_args=sql_config # dictionary with config details
                    )
    
    # run the query and return the results as a dataframe
    return pd.read_sql_query(sql=sql_query, con=engine)

Let's make sure the function works by returning the first 5 rows of the flights table below.

In [None]:
# Display a dataframe containing the top 5 rows from flights


In [None]:
# SOLUTION # Display a dataframe containing the top 5 rows from flights
get_dataframe(sql_query)

Now that we have the data in a dataframe, it's easy to apply all the different data exploration and cleaning techniques we have learned already. We don't have to do that right now, but it will become very useful in the future!  
Congratulations, from now on you can call yourself a Python hacker!  

But wait..., what if I told you there is a way of making your functions even more powerful?  

## Automating data retrieval using a Python module
Before I tell you how to make your function even more powerful, let's take a step back and look at the prerequisites for using the get_data() and get_dataframe() functions and what limitations apply.  
In order to use our get_data() function anywhere in our jupyter notebook, we need to define it once with the correct connection details.  
But what if we have multiple jupyter notebooks in our repository and we want to use this in other notebooks as well? How could we do that?  
As with all things in life: <ins>it depends</ins>.  

Remember that when you start a jupyter notebook, you simultaneously start a kernel that will then load your virtual environment and execute any Python code you feed it.  
Every package you load and every function you define will be available in that jupyter notebook only.  
Now imagine you have defined the get_data() and get_dataframe() functions in only one of your jupyter notebooks, but you have multiple notebooks in your repository where you would like to use this function.
To be able to do that, you would have to add the function definition to each notebook. In effect, you would duplicate your code over and over, which should always be avoided. While this wouldn't mean the end of the world, the real problem occurs if we ever want to change the connection details inside the function. In that case, we would have to find and change each function definition in each file, which not only increases the likelihood of errors but also becomes a tedious and unnecessarily time-consuming task.  

Now that we know about the downsides, can you think of another, better solution? If you don't have any idea, think about Python packages or more specifically about why and how we use them.  
The answer is, we import Python packages because they include predefined functions that we can then use in any of our Python scripts, which is exactly what we want to do with our function. You're probably thinking: "But we don't have a package, and we also don't want to create a package for just one function and publish it to the Python Package Index (PyPI)". That's true but fortunately, we don't have to. Instead we can use a built-in Python feature called <ins>modules</ins>.    
Instead of giving you a detailed explanation of what modules are and what they do, let's create a Python module and use it.

Let's get started!

Please perform the following steps:
1. Open sql_functions.py and import all Python packages necessary for the get_data() and get_dataframe() functions to work
2. Insert the definition of the get_data() and get_dataframe() functions
3. Import the Python file into this jupyter notebook below the same way you normally import packages and give it the alias gd
4. Execute the get_data() function and retrieve the top 5 rows from flights

**Important: To make sure we're using the function from our sql_functions.py module and not the one we defined earlier in this jupyter notebook, we need to give the imported module an alias and use it to reference the get_data() function.**

In [None]:
# Import get_data() from sql_functions.py
import sql_functions as gd

# Print top 5 rows from flights
gd.________(f'SELECT * FROM {schema}.flights LIMIT 5')

Worked like a charm!  
Great, let's summarise: Whenever we want to use a custom function in any other jupyter notebook in our repository, we  

1. create a new Python module,
2. add the necessary packages and function definitions,
3. and import it into our jupyter notebook using the standard import syntax.
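
The three steps above can be sketched in a self-contained way as follows. The module name ```demo_functions``` and the function ```shout``` are hypothetical and exist only for this example. The module is written to a temporary folder, which is why the sketch has to add that folder to ```sys.path```; a module that sits in the same folder as your notebook, like ```sql_functions.py```, can usually be imported without this extra step.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# 1. create a new Python module (here: a throwaway file in a temporary folder)
module_dir = tempfile.mkdtemp()
module_path = Path(module_dir) / 'demo_functions.py'

# 2. add the function definitions to it
module_path.write_text(
    "def shout(text):\n"
    "    '''Return the text in upper case.'''\n"
    "    return text.upper()\n"
)

# 3. import it with the standard import machinery
sys.path.insert(0, module_dir)   # make the temporary folder importable
importlib.invalidate_caches()    # let Python notice the newly created file
demo = importlib.import_module('demo_functions')  # like: import demo_functions as demo

print(demo.shout('it works'))
```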

If you still don't understand what modules are and why we did what we did, check out the official [documentation](https://docs.python.org/3/tutorial/modules.html) about modules or reach out to us. It's important that you understand this concept since we will be working with Python modules in future jupyter notebooks.

Congratulations for making it through this notebook, you deserve to call yourself a badass Python hacker!