# Connecting to PostgreSQL with Python

In this notebook, I want to showcase how the psycopg2 library lets us execute SQL queries directly from a Python terminal, IDE or notebook.

The 'SuperMarketTransactions' CSV file has been loaded into a PostgreSQL database on my localhost. It contains fictional transactions from a pan-American supermarket, and we will use SQL to retrieve insights into the behaviour of its customers.


In [30]:
import psycopg2
import pandas as pd

To connect to a PostgreSQL database, we need the following parameters:

In [12]:
DB_NAME = "TestData"
DB_USER = "postgres"
DB_HOST = "localhost"
DB_PORT = "5432"
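The values above are hardcoded in the notebook. As a minimal sketch of an alternative, the same settings could be read from the environment variables that libpq recognises (`PGDATABASE`, `PGUSER`, `PGHOST`, `PGPORT`), falling back to the notebook's values. The helper `connection_params` and the host name in the example are hypothetical and not part of psycopg2.

```python
import os

# Fallbacks mirror the constants defined in the cell above.
DEFAULTS = {
    "database": "TestData",
    "user": "postgres",
    "host": "localhost",
    "port": "5432",
}

# libpq environment variable that corresponds to each keyword.
ENV_NAMES = {
    "database": "PGDATABASE",
    "user": "PGUSER",
    "host": "PGHOST",
    "port": "PGPORT",
}


def connection_params(environ=None):
    """Hypothetical helper: build keyword arguments for psycopg2.connect."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(ENV_NAMES[key], DEFAULTS[key]) for key in DEFAULTS}


# Example with an explicit mapping instead of the real environment.
params = connection_params({"PGHOST": "db.example.internal"})
print(params)
```

The resulting dictionary could then be passed on as `psycopg2.connect(**connection_params())`.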

psycopg2 first opens a connection to the database. From that connection we can later create a "cursor", which executes queries and lets us "fetch" the results.

In [197]:
try:
    conn = psycopg2.connect(database = DB_NAME, user = DB_USER,
                            host = DB_HOST, port = DB_PORT)
    print('connected')
except psycopg2.OperationalError as err:
    # Catch only connection failures and show why they happened
    print('not connected:', err)

connected


Now that we've established a connection, let's write a function that retrieves data for a given query (here called Q). It fetches the raw rows and turns them into a DataFrame with the matching column names.

In [201]:
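def SQL_Query(Q):
    """Run the query Q and return the result as a pandas DataFrame."""
    try:
        conn = psycopg2.connect(database = DB_NAME, user = DB_USER,
                                host = DB_HOST, port = DB_PORT)
        print('connected')
    except psycopg2.OperationalError:
        print('not connected')
        raise  # without a connection there is nothing to query

    try:
        cur = conn.cursor()
        cur.execute(Q)
        rows = cur.fetchall()
        # cur.description has one entry per result column; item 0 is its name
        colname = [desc[0] for desc in cur.description]
        Data = pd.DataFrame(rows, columns = colname)
        cur.close()
        return Data
    finally:
        # Runs even if the query fails, so the connection is always released
        conn.close()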
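To make the cursor steps easier to follow without a running PostgreSQL server, here is a minimal sketch that swaps in Python's built-in `sqlite3` module for psycopg2. Both follow the Python DB-API, so `cursor()`, `execute()`, `fetchall()` and `description` play the same roles. The miniature `super` table and its three rows are made up for illustration.

```python
import sqlite3

# Stand-in database: sqlite3 is used here instead of psycopg2 (assumption
# made so the example runs anywhere); the cursor workflow is the same.
conn = sqlite3.connect(":memory:")
try:
    cur = conn.cursor()
    # Hypothetical miniature of the sales table, for illustration only.
    cur.execute("CREATE TABLE super (country TEXT, children INTEGER)")
    cur.executemany(
        "INSERT INTO super VALUES (?, ?)",
        [("USA", 2), ("USA", 3), ("Mexico", 1)],
    )
    cur.execute(
        "SELECT country, avg(children) AS avg_children "
        "FROM super GROUP BY country ORDER BY country"
    )
    rows = cur.fetchall()
    # Each description entry is a sequence whose first item is the column name.
    colnames = [desc[0] for desc in cur.description]
    cur.close()
finally:
    conn.close()

# Pair every row with the column names; pd.DataFrame(rows, columns=colnames)
# would label the columns the same way.
records = [dict(zip(colnames, row)) for row in rows]
print(colnames)
print(records)
```

Note that the alias `avg_children` shows up in `description`, which is why aliases carry over to the DataFrame columns.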

Let's test SQL_Query! We start simply and compute the average number of children per customer, grouped by country.

In [202]:
QUERY = '''SELECT country,round(avg(children),2) as avg_children
           FROM sales.super 
           GROUP BY country 
           LIMIT 5 '''

In [203]:
SQL_Query(QUERY)

connected


Unnamed: 0,country,avg_children
0,Canada,2.51
1,Mexico,2.54
2,USA,2.57


It works! We see that even the alias was included. Let's try something more complicated, like incorporating a window function. In the query below, we look at total revenue per city, ranked from highest to lowest, together with each city's rank within its country (Hidalgo is the top performer in Mexico but only 7th overall).

In [204]:
QUERY1 = '''SELECT country, city, round(sum(revenue),2) as tot_rev,
                   RANK() OVER(ORDER BY round(sum(revenue),2)DESC),
                   RANK() OVER(PARTITION BY Country ORDER BY round(sum(revenue),2) DESC)
            FROM sales.super
            GROUP BY city, Country
            ORDER BY 3 DESC
            LIMIT 10'''

In [205]:
SQL_Query(QUERY1)

connected


Unnamed: 0,country,city,tot_rev,rank,rank.1
0,USA,Salem,4647.42,1,1
1,USA,Tacoma,4428.9,2,2
2,USA,Portland,3118.7,3,3
3,USA,Bremerton,3060.5,4,4
4,USA,Seattle,2972.34,5,5
5,USA,Los Angeles,2727.3,6,6
6,Mexico,Hidalgo,2427.7,7,1
7,USA,San Diego,2275.54,8,7
8,Mexico,Merida,2239.84,9,2
9,USA,Spokane,2193.92,10,8
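To see what the two `RANK()` columns compute, here is a minimal pure-Python sketch that recomputes them from the ten rows shown above. It is an illustration of the ranking logic, not how PostgreSQL evaluates window functions internally, and `rank_desc` is a hypothetical helper name. Because the ten rows are the overall top ten, every city that outranks a listed city within its country is also in the list, so the per-country ranks match the query output.

```python
# (country, city, tot_rev) rows copied from the query output above.
rows = [
    ("USA", "Salem", 4647.42),
    ("USA", "Tacoma", 4428.9),
    ("USA", "Portland", 3118.7),
    ("USA", "Bremerton", 3060.5),
    ("USA", "Seattle", 2972.34),
    ("USA", "Los Angeles", 2727.3),
    ("Mexico", "Hidalgo", 2427.7),
    ("USA", "San Diego", 2275.54),
    ("Mexico", "Merida", 2239.84),
    ("USA", "Spokane", 2193.92),
]


def rank_desc(value, values):
    """Rank as in RANK() ... ORDER BY value DESC: 1 + count of larger values."""
    return 1 + sum(other > value for other in values)


# RANK() OVER (ORDER BY tot_rev DESC): compare against every row.
all_revenues = [rev for _, _, rev in rows]
overall = [rank_desc(rev, all_revenues) for _, _, rev in rows]

# RANK() OVER (PARTITION BY country ...): compare only within the same country.
by_country = [
    rank_desc(rev, [r for c, _, r in rows if c == country])
    for country, _, rev in rows
]

for (country, city, _), o, c in zip(rows, overall, by_country):
    print(country, city, o, c)
```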


What drives revenue in these top-performing cities? We can look at the customer base to get a better understanding.


In [247]:
Query2 =''' WITH sub AS (
                 SELECT *
                 FROM sales.super
                 WHERE city IN ('Salem','Tacoma','Portland','Bremerton','Seattle','Los Angeles'))
            
            SELECT city,
                   round(avg(numerical_income),2) as avg_income,
                   round(avg(children),2) as avg_children,
                   round(avg(units_sold),2) as avg_basket_size,
                   round(avg(revenue),2) as avg_spend,
                   round((avg(CASE WHEN homeowner ='Y' THEN 1
                        ELSE 0 END) * 100),2) as home_ownership_percentage
            FROM sub
            GROUP BY city
            ORDER BY avg_spend DESC
            '''

In [248]:
SQL_Query(Query2)

connected


Unnamed: 0,city,avg_income,avg_children,avg_basket_size,avg_spend,home_ownership_percentage
0,Salem,58967.74,2.5,4.1,14.99,60.65
1,Portland,56037.74,2.57,4.13,14.71,61.32
2,Tacoma,55375.0,2.86,4.2,13.84,60.0
3,Seattle,62142.86,2.65,4.15,13.27,57.14
4,Los Angeles,55809.52,2.63,4.21,12.99,66.67
5,Bremerton,56500.0,2.04,3.99,12.75,69.17
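The `home_ownership_percentage` column relies on a common trick: averaging a 0/1 indicator gives the share of rows where the condition holds. A minimal sketch of the same arithmetic in Python, using five made-up homeowner flags (not taken from the dataset):

```python
# Hypothetical homeowner flags for five customers, for illustration only.
homeowner = ["Y", "N", "Y", "Y", "N"]

# CASE WHEN homeowner = 'Y' THEN 1 ELSE 0 END
indicators = [1 if flag == "Y" else 0 for flag in homeowner]

# avg(...) * 100, rounded to two decimals as round(..., 2) does in the query
home_ownership_percentage = round(100 * sum(indicators) / len(indicators), 2)
print(home_ownership_percentage)  # 3 of 5 flags are 'Y', so this prints 60.0
```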


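In these six high-revenue cities, average income is above 55K per year, the average number of children is 2.5 or more everywhere except Bremerton (2.04), and between 57% and 69% of customers own their home. Note that the country-level averages for children seen earlier are also around 2.5, so we would need the same figures for lower-revenue cities before calling any of these traits distinctive.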