# Connecting to SQL with Python
Previously, you have learned how to connect to a SQL database by using a SQL client such as DBeaver.  
Apart from connecting to databases, DBeaver also allows you to run SQL queries against the database, create new tables, populate them with data and retrieve that data.  
Populating tables with data that you have locally on your machine usually requires you to save it in a file, like a CSV, and import it using the DBeaver UI.  
Often, before you reach the final step of uploading your dataset, you will perform data cleaning procedures to bring your data into shape.  
This is usually done in R or Python. This means we would import the data into Python, clean it, export it to a CSV file, import it into DBeaver and upload the data into the database.  
This process is very tedious. Fortunately, there is a way to make it more efficient by connecting to the database from Python directly, eliminating the need for a separate SQL client. 
  
**This notebook will not only teach you how to connect to a database and query its data in Python directly, but also how to automate the process using custom functions.**

## Create Database Connection
The go-to package in Python for connecting to a PostgreSQL database is called <ins>psycopg2</ins>. Check out their official website here: https://www.psycopg.org/docs/.  
Complete the code below to import the package.

In [9]:
import _________


In order to create a connection to our PostgreSQL database you need the following information:
- host = the address of the machine the database is hosted on
- port = the virtual gate number through which communication will be allowed
- database = the name of the database
- user = the name of the user
- password = the password of the user

Don't worry, if we haven't already, we will provide this information to you.  
The function from the psycopg2 package to create a connection is called <ins>connect()</ins>.  
connect() expects the parameters listed above as input in order to connect to the database.  
Enter the parameters into the function below.

In [None]:
## Creating a connection to the SQL database
conn = psycopg2.connect(host="_____________",
                        port="_____________",
                        database="_________",
                        user="_____________",
                        password="_________")
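If you would rather not type the five values into the notebook itself, one option is to read them from environment variables. The sketch below only builds a dictionary and does not connect to anything. The helper name `load_connection_params` and the fallback values are assumptions made for illustration; the variable names follow the ones PostgreSQL's own client tools use.

```python
import os


def load_connection_params(env=os.environ):
    """Collect the five connection settings from environment variables.

    Hypothetical helper: the fallback values are illustrative assumptions.
    """
    return {
        "host": env.get("PGHOST", "localhost"),
        "port": int(env.get("PGPORT", "5432")),
        "dbname": env.get("PGDATABASE", "postgres"),
        "user": env.get("PGUSER", "postgres"),
        # No fallback for the password: a missing value raises a KeyError.
        "password": env["PGPASSWORD"],
    }


# Made-up values standing in for the real environment.
example_env = {
    "PGHOST": "db.example.com",
    "PGPORT": "25060",
    "PGDATABASE": "defaultdb",
    "PGUSER": "analyst",
    "PGPASSWORD": "not-a-real-password",
}
params = load_connection_params(example_env)

# Print everything except the password.
print({key: value for key, value in params.items() if key != "password"})

# With psycopg2 installed, the dictionary could then be unpacked like this:
# conn = psycopg2.connect(**params)
```

Keeping credentials out of the notebook also means they do not end up in the repository when the notebook is committed.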

Next, let's have a look at the conn variable to see what we're working with.

In [None]:
print(conn)

<connection object at 0x7fd5fb521190; dsn: 'user=dauser password=xxx dbname=defaultdb host=db-postgresql-fra1-70962-do-user-8861194-0.b.db.ondigitalocean.com port=25060', closed: 0>


The conn variable is a connection object or, more precisely, a database session. This means that currently we have one open connection to our database.  
This connection will stay open until we manually close it. 
Before we can use our connection to retrieve data, we have to create a cursor. A cursor allows Python code to execute PostgreSQL commands in a database session.  
A cursor has to be created with the <ins>cursor()</ins> method of our connection object conn.

In [None]:
cur = conn.cursor()

Now, in order to retrieve data from our database, we have to use our cursor and its <ins>execute()</ins> method. The execute() method takes a SQL query as parameter and executes it.  
Complete the code below to select all columns and the first 5 rows from the ny_flights table.

In [None]:
cur.execute('SELECT ___ FROM ___ LIMIT ___')
print(cur)

When printing the cursor, we still don't get any data output.  
Why? Because our cursor behaves similarly to our connection. The result is stored inside the cursor object and can't be accessed by simply printing it.  
Instead we need to use the <ins>fetchall()</ins> method, which fetches all rows of a query result.  
Complete the code below and print the results of the SQL query.

In [None]:
print(cur.________)

There we go! Finally we have the output of our SQL query.  
Now, before we summarise everything we have done, let's close our database connection to free up database resources for our colleagues.

In order to close a database connection we have to use the <ins>close()</ins> method of our connection object.  
Complete the code below and terminate the database connection.

In [None]:
conn.________
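Forgetting to close a connection is easy, so here is a small sketch of one way to let Python do it for us. It uses `contextlib.closing` from the standard library, which calls close() when the `with` block ends. To keep the example runnable without a server, it uses an in-memory SQLite database from the built-in `sqlite3` module as a stand-in. That is an assumption for illustration, not the PostgreSQL setup of this notebook.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the with block ends, even after an error.
with closing(sqlite3.connect(":memory:")) as conn:
    cur = conn.cursor()
    cur.execute("SELECT 1 + 1")
    value = cur.fetchall()[0][0]
    print(value)

# Outside the block the connection is closed, so using it again fails.
try:
    conn.cursor()
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(still_open)
```

The same `closing(...)` wrapper works with any object that has a close() method, which includes a psycopg2 connection.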

Perfect, let's summarise the steps we have performed above:
1. (Install and) Import the psycopg2 package
2. Create a database connection object with the connect() method
3. Create a cursor for the database connection with the cursor() method
4. Use the execute() method of the cursor to execute a SQL query
5. Use the fetchall() method of the cursor to retrieve the results of the SQL query
6. Close the database connection with the close() method
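The steps above can be sketched end to end without a PostgreSQL server. The example below swaps in Python's built-in `sqlite3` module, which follows the same connect / cursor / execute / fetchall / close pattern as psycopg2. The small ny_flights table it creates is made up for illustration and only borrows a few column names from the real table.

```python
import sqlite3

# Step 2: create a connection (an in-memory SQLite database as a stand-in).
conn = sqlite3.connect(":memory:")

# Step 3: create a cursor for the connection.
cur = conn.cursor()

# Illustration only: build a tiny table so there is something to query.
cur.execute(
    "CREATE TABLE ny_flights "
    "(flight_date TEXT, origin TEXT, dest TEXT, dep_delay INTEGER)"
)
# Note: sqlite3 uses ? as placeholder, other drivers may use a different style.
cur.executemany(
    "INSERT INTO ny_flights VALUES (?, ?, ?, ?)",
    [
        ("2021-01-01", "SYR", "JFK", -5),
        ("2021-01-02", "SYR", "JFK", -9),
        ("2021-01-03", "SYR", "JFK", 10),
    ],
)

# Step 4: execute a SQL query.
cur.execute("SELECT * FROM ny_flights ORDER BY flight_date LIMIT 2")

# Step 5: fetch the results of the query.
rows = cur.fetchall()
print(rows)

# Step 6: close the connection.
conn.close()
```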

In total we needed 6 steps to connect to the database and retrieve data.  
Even though we probably won't have to go through all the steps over and over again when querying data, this is still somewhat of a tedious process. On top of that, there is another inconvenience when it comes to the output. The format of the data we retrieve from the database is a list of tuples, one tuple per row, without any column names. As analysts we almost always want to have our data in a dataframe, since it makes exploring and cleaning data a lot easier.
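To see why the raw output is inconvenient, here is a small sketch of what fetchall() hands back and where the column names are hiding. It again uses the built-in `sqlite3` module as a stand-in (an assumption for illustration). The `description` attribute of the cursor is part of the common Python database interface, so a psycopg2 cursor offers it as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("SELECT '2021-01-01' AS flight_date, -5 AS dep_delay")

# fetchall() returns a list with one tuple per row.
rows = cur.fetchall()

# The column names are not part of the rows; the first entry of each
# item in cur.description holds the name of a column.
column_names = [column[0] for column in cur.description]
conn.close()

print(type(rows), type(rows[0]))
print(column_names)

# Pairing names and values by hand works, but quickly becomes tedious.
records = [dict(zip(column_names, row)) for row in rows]
print(records)
```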

Let's fix these problems and do the following:
1. Write a custom function that performs all of the steps above
2. Change the code so the SQL output is in a dataframe

### Automating the process with a custom function: get_data()
Instead of having to write multiple lines of code whenever we want to query data from Python, we're going to make our lives easier by writing a custom function that will execute all of the necessary steps automatically. For this, we're going to define a custom function called <ins>get_data()</ins> below. This function should only expect one argument: query. The function should be able to take any query as a string, create a connection to the database, execute the query, output the data and close the connection.

Complete the code below so that the get_data() function prints the output of any SQL query we pass it.

In [None]:
def get_data(_____):
    """ Connect to the PostgreSQL database server """
    conn = None
    try:
        # connect to the PostgreSQL server
        conn = ___________________
		
        # create a cursor
        cur = __________
        
        # execute a statement
        cur.____________

        # display the results of the query
        print(cur.__________)

        # close the connection to the PostgreSQL database
        _________

    # the code below makes the function more robust, you can ignore this part
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
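If you are curious about the part you were told to ignore, the sketch below shows the idea behind try / except / finally: the finally branch runs whether or not an error occurred, so the connection gets closed either way. `FakeConnection` and `run()` are hypothetical stand-ins written for this illustration and do not talk to a database.

```python
class FakeConnection:
    """Hypothetical stand-in for a connection; it only remembers close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run(should_fail):
    """Mimic the structure of get_data() without a real database."""
    conn = None
    try:
        conn = FakeConnection()
        if should_fail:
            raise RuntimeError("query failed")
        return "ok", conn
    except Exception as error:
        # Errors are reported instead of crashing the notebook cell.
        return f"error: {error}", conn
    finally:
        # Runs in both cases, right before the function hands back its result.
        if conn is not None:
            conn.close()


status_ok, conn_ok = run(should_fail=False)
status_err, conn_err = run(should_fail=True)
print(status_ok, conn_ok.closed)
print(status_err, conn_err.closed)
```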

Now it's time to check if your function works. Use the get_data() function below to print the first five rows of the ny_flights table.

In [None]:
get_data('____________')

It works, awesome! Now whenever we want to connect to our database and retrieve data we can simply use the get_data() function, how convenient! Unfortunately, we can't call ourselves Python hackers yet, because we still need to solve the problem with the output format.

### Change the SQL query output format from list to dataframe
We already know that the output format of the fetchall() method is a list, which is inconvenient to work with. Luckily there is a function that lets us read a SQL query directly into a dataframe. It's called <ins>read_sql_query()</ins> and can be found in the pandas package.
The function expects a SQL query as the first argument and a connection (object) to a database as the second argument.
Complete the code below and return the output in a dataframe.

In [None]:
import ______ __ __
print(__.read_sql_query('_________________', ____))

This output looks like a dataframe. Was it really that easy? Let's check if the output really is a dataframe.
Complete the code below and check if the output is of type dataframe.

In [None]:
print(____(______))
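For comparison, here is a minimal sketch of the same idea that runs without a PostgreSQL server. It assumes pandas is installed and uses an in-memory SQLite database with a made-up two-column ny_flights table as a stand-in. One hedge worth knowing: pandas documents SQLAlchemy connections and sqlite3 connections as its supported inputs, so recent pandas versions may show a warning when given a raw psycopg2 connection, even though the query usually still works.

```python
import sqlite3

import pandas as pd

# Stand-in database with a tiny, made-up version of the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ny_flights (flight_date TEXT, dep_delay INTEGER)")
conn.executemany(
    "INSERT INTO ny_flights VALUES (?, ?)",
    [("2021-01-01", -5), ("2021-01-02", -9)],
)

# First argument: the SQL query. Second argument: the open connection.
df = pd.read_sql_query(
    "SELECT * FROM ny_flights ORDER BY flight_date LIMIT 5", conn
)
conn.close()

print(type(df))
print(df)
```

Note that no cursor appears anywhere in this sketch; read_sql_query() handles executing the query and fetching the rows for us.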

It really is a dataframe! Wow, that was way easier than expected and what's even better: we don't need a cursor and also don't need to run the execute() and fetchall() methods anymore!  
The only thing left is to change our get_data() function and replace the steps that are not needed anymore with our new read_sql_query function.  
Copy the content of the code block where we define the get_data() function and paste it into the code block below. Then, adjust the code by replacing redundant steps with our new read_sql_query function.

In [None]:
### Insert get_data() source code here

Let's make sure the function works by returning the first 5 rows of the ny_flights table below.

In [None]:
print(______(_______________))

Now that we have the data in a dataframe, it's easy to apply all the different data exploration and cleaning techniques we have learned already. We don't have to do that right now, but it will become very useful in the future!  
Congratulations, from now on you can call yourself a Python hacker!  

But wait..., what if I told you there is a way of making your get_data() function even more powerful?  

## Creating a Python module
Before I tell you how to do that, let's take a step back and look at the prerequisites for using the get_data() function and what limitations apply.  
In order to use our get_data() function anywhere in our jupyter notebook, we need to define it once with the correct connection details.  
But what if we have multiple jupyter notebooks in our repository and we want to use this function across all of the other notebooks as well? How could we do that?  
As with all things in life: <ins>it depends</ins>.  

Remember that when you start a jupyter notebook, you simultaneously start a kernel that will then load your virtual environment and execute any Python code you feed it.  
Every package you import and every function you define lives in that kernel. By default each jupyter notebook gets its own kernel, so a definition is only shared between notebooks that are attached to the same kernel.  
What this means for our get_data() function is that, as long as our notebooks share one kernel, we technically only need to define the function once, preferably at the beginning, and we will be able to use it in any of those notebooks.  
Now imagine you have defined the get_data() function in only one of your jupyter notebooks, but you have multiple files in your repository and, for the majority of the time, you work on a single jupyter notebook that doesn't include the definition of the get_data() function.  
In that particular case you have two options: whenever you start a new session, either find and open the jupyter notebook that has the get_data() function definition and run it (option 1), or add the definition to every single jupyter notebook (option 2).  
While it would not be completely unreasonable to go with one of the two options, they each have significant downsides. For option 1, if you don't know in which file and on what line the function definition is, you would have to find it first and hope that the location will not change in the future.  
If we go with option 2, we would have to write the same code over and over, which should always be avoided. Additionally, if we ever wanted to change the connection details inside the function, we would have to find and change each function definition in each file.  

Now that we know about the downsides, can you think of another, better solution? If you don't have any idea, think about Python packages or more specifically how and why we import Python packages.  
We import Python packages because they include predefined functions that we can then use in our Python script!  
This is exactly what we want to do with our function. But we don't have a package and we also don't want to create a package for just one function and publish it to the official Python Package Index (PyPI). Fortunately, we don't have to. Instead we can use a built-in Python functionality called <ins>modules</ins>.  
If none of this makes any sense to you, don't worry. Going through the steps will help you understand everything.

Let's get started!

Please perform the following steps:
1. Create a new empty Python file called <ins>get_data.py</ins> in the repository (we already did that for you)
2. Insert the import commands for all Python packages necessary for the get_data() function to work
3. Insert the definition of the get_data() function below the import commands in the get_data.py file
4. Import the Python file into this jupyter notebook below
5. Execute the get_data() function below

In [1]:
# To make sure we're using the function from our get_data.py module and not the one we defined earlier we reference it by alias
import get_data as gd
gd.get_data('SELECT * from ny_flights LIMIT 5')

  flight_date  dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  \
0  2021-01-01       655             700         -5       747             817   
1  2021-01-02       651             700         -9       847             817   
2  2021-01-03       710             700         10       832             817   
3  2021-01-04       700             700          0       816             817   
4  2021-01-07       656             700         -4       757             835   

   arr_delay airline tail_number  flight_number origin dest  air_time  \
0        -30      9E      N301PQ           4632    SYR  JFK        38   
1         30      9E      N296PQ           4632    SYR  JFK        50   
2         15      9E      N918XJ           4632    SYR  JFK        45   
3         -1      9E      N919XJ           4632    SYR  JFK        46   
4        -38      9E      N340CA           4632    SYR  JFK        45   

   distance  cancelled  diverted  
0       209          0         0  
1       20

Worked like a charm!  
Great, let's summarise: Whenever we want to use a custom function in any of our new or existing jupyter notebooks in our repository, we  

1. create a new Python file
2. add the function's definition and the necessary package imports
3. import it into our jupyter notebook using the import statement in combination with the name of the file (without the .py ending) that includes the function's definition
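The three steps above can be sketched in a self-contained way. The example writes a small Python file into a temporary folder, makes that folder visible to Python and imports the file as a module. The file name `hypothetical_helpers.py` and its `shout()` function are made up for illustration. In our repository the temporary folder is not needed, since Python normally also searches the folder the notebook runs in.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Step 1 and 2: create a new Python file and add a function definition to it.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "hypothetical_helpers.py").write_text(
    "def shout(text):\n"
    "    return text.upper() + '!'\n"
)

# Python looks for modules in the folders listed in sys.path.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# Step 3: import the file by its name (without .py) and use the function.
import hypothetical_helpers as hh

print(hh.shout("it works"))
```

Because the function now lives in exactly one file, a change to it only has to be made in one place.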

We will use this technique for the remaining jupyter notebooks in this repository.

Congratulations on making it through this notebook, you deserve to call yourself a badass Python hacker!