# From Pandas to PostgreSQL: Simple Insert Example

*By Naysan Saran, November 2019.*

## 1 - Introduction

In this tutorial we will go through all the steps required to:

- turn a csv into a pandas dataframe
- create the corresponding PostgreSQL database and table
- insert the pandas dataframe in the PostgreSQL table

The data for this tutorial is freely available from the World Bank website: https://data.worldbank.org/indicator/en.atm.co2e.pc. The file stored in the data/ directory of this repo is a simplified version of the zip file available there.

## 2 - From csv file to pandas dataframe

In [1]:
import pandas as pd

csv_file = "../data/global_CO2_emissions.csv"
df = pd.read_csv(csv_file)
df.head(n=3)

Unnamed: 0,Country Name,Indicator Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
0,Aruba,CO2 emissions (metric tons per capita),,,,,,,,,...,27.200708,26.947726,27.895023,26.229553,25.915322,24.670529,24.507516,13.157722,8.353561,8.410064
1,Afghanistan,CO2 emissions (metric tons per capita),0.046057,0.053589,0.073721,0.074161,0.086174,0.101285,0.107399,0.12341,...,0.051744,0.062428,0.083893,0.151721,0.238399,0.289988,0.406424,0.345149,0.310341,0.293946
2,Angola,CO2 emissions (metric tons per capita),0.100835,0.082204,0.210531,0.202737,0.21356,0.205891,0.268941,0.172102,...,0.985736,1.105019,1.203134,1.185,1.234425,1.244092,1.252681,1.330219,1.253776,1.290307


## 3 - PostgreSQL database, table and user setup

First we create the database. I'm assuming you already have PostgreSQL installed on your system. If not, see https://www.postgresql.org/download/ first.

Creating the database - Ubuntu command line instructions 

For the sake of simplicity, we are going to create one table only to store everything.

Lastly, let's create a user and give them access to our new table.

One last permission to grant, so that 'myuser' can insert rows and let the 'id' primary key autoincrement without having to specify it.
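As a rough sketch of the table and permission steps, the statements below could be run from Python over a superuser connection. The column types, the privilege list and the `run_setup` helper are assumptions inferred from the INSERT statement used in part 4, not the author's actual schema. The sequence name `emissions_id_seq` follows PostgreSQL's default naming for a `SERIAL` column called `id` on a table called `emissions`. Creating the database and the user is left out here.

```python
# Hypothetical setup statements: column types and privileges are assumptions
SETUP_STATEMENTS = [
    """
    CREATE TABLE IF NOT EXISTS emissions (
        id           SERIAL PRIMARY KEY,
        country_name TEXT NOT NULL,
        year         INTEGER NOT NULL,
        co2          DOUBLE PRECISION
    )
    """,
    # Let 'myuser' read and write the table
    "GRANT SELECT, INSERT ON emissions TO myuser",
    # Let 'myuser' draw new 'id' values from the sequence behind the SERIAL column
    "GRANT USAGE, SELECT ON SEQUENCE emissions_id_seq TO myuser",
]


def run_setup(conn, statements=SETUP_STATEMENTS):
    """Execute each statement on an open DB-API connection, then commit."""
    cursor = conn.cursor()
    try:
        for statement in statements:
            cursor.execute(statement)
        conn.commit()
    except Exception:
        # Undo any partial setup and let the caller see the error
        conn.rollback()
        raise
    finally:
        cursor.close()
```

With psycopg2, `run_setup` would be called with a connection opened by a user allowed to create tables in the `worldbankdata` database.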

## 4 - Uploading the dataframe in PostgreSQL

Alright, back to Python. Here are all the functions we will need. For complete, working code, please refer to the src/ subdirectory.

In [2]:
import sys  # needed for sys.exit() in connect()

import psycopg2
import numpy as np

def connect(params_dic):
    """ Connect to the PostgreSQL database server """
    conn = None
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        conn = psycopg2.connect(**params_dic)

    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        sys.exit(1) 
    return conn


def single_insert(conn, insert_req):
    """ Execute a single INSERT request """
    cursor = conn.cursor()
    try:
        cursor.execute(insert_req)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    cursor.close()
    

Now let's specify the connection parameters. The database, username and password are the same ones we created in part 3.

In [3]:
param_dic = {
    "host"      : "localhost",
    "database"  : "worldbankdata",
    "user"      : "myuser",
    "password"  : "Passw0rd"
}
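Hardcoding the password is fine for a local tutorial, but the same dictionary can also be built from environment variables. The sketch below is one possible way to do that; the `load_params` helper is hypothetical, the variable names `PGHOST`, `PGDATABASE`, `PGUSER` and `PGPASSWORD` are the ones libpq itself recognizes, and the fallback values are simply the ones used in this tutorial.

```python
import os


def load_params(env=os.environ):
    """Build the connection parameters, preferring environment variables.

    Hypothetical helper: falls back to this tutorial's values when a
    variable is not set.
    """
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", "worldbankdata"),
        "user": env.get("PGUSER", "myuser"),
        "password": env.get("PGPASSWORD", "Passw0rd"),
    }


param_dic = load_params()
```

The resulting `param_dic` has the same keys as the literal dictionary above, so it can be passed to `connect()` unchanged.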

Testing the database connection

In [4]:
conn = connect(param_dic)

Connecting to the PostgreSQL database...


Just before we insert rows in our table, we can clean up our Pandas dataframe a little bit.

In [5]:
# Drop the 'Indicator Name' column as we don't need it
df.drop('Indicator Name', axis=1, inplace=True)

In [6]:
# Replace NaN values with 'NULL'
years = [x for x in df.columns if x != 'Country Name']
for year in years:
    df[year] = df[year].apply(lambda x: 'NULL' if np.isnan(x) else x)

In [7]:
# Also drop any special character in the country name (a quote would break the INSERT string we build later)
df['Country Name'] = df['Country Name'].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
df.head()

Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
0,Aruba,,,,,,,,,,...,27.2007,26.9477,27.895,26.2296,25.9153,24.6705,24.5075,13.1577,8.35356,8.41006
1,Afghanistan,0.0460567,0.0535888,0.0737208,0.0741607,0.0861736,0.101285,0.107399,0.12341,0.115142,...,0.051744,0.0624275,0.0838928,0.151721,0.238399,0.289988,0.406424,0.345149,0.310341,0.293946
2,Angola,0.100835,0.0822038,0.210531,0.202737,0.21356,0.205891,0.268941,0.172102,0.289718,...,0.985736,1.10502,1.20313,1.185,1.23443,1.24409,1.25268,1.33022,1.25378,1.29031
3,Albania,1.25819,1.37419,1.43996,1.18168,1.11174,1.1661,1.33306,1.36375,1.51955,...,1.4125,1.30258,1.32233,1.48431,1.4956,1.57857,1.80371,1.69291,1.74921,1.97876
4,Andorra,,,,,,,,,,...,7.29987,6.74605,6.51939,6.42781,6.12158,6.12259,5.86741,5.91688,5.90178,5.83291
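The emissions table stores one row per country and year, while the dataframe has one column per year. As an aside, the same (country, year, value) triples can be listed up front by reshaping the dataframe with `melt`. This is only a sketch on a tiny stand-in dataframe whose values are copied from the 2013 and 2014 columns above; the names `wide_df` and `long_df` are made up for the illustration.

```python
import pandas as pd

# Tiny stand-in for the cleaned dataframe: one row per country, one column per year
wide_df = pd.DataFrame({
    "Country Name": ["Aruba", "Angola"],
    "2013": [8.353561, 1.253776],
    "2014": [8.410064, 1.290307],
})

# Turn the year columns into rows: one (country, year, co2) triple per row
long_df = wide_df.melt(id_vars=["Country Name"], var_name="year", value_name="co2")
long_df["year"] = long_df["year"].astype(int)
long_df = long_df.rename(columns={"Country Name": "country_name"})

print(long_df)
```

Each row of `long_df` then corresponds to one INSERT into the emissions table.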


In [11]:
# For each country, upload the yearly CO2 emissions
for i in df.index:
    country_name = df['Country Name'][i]
    
    # Loop through each year
    for year in years:
        co2 = df[year][i]
        # Build the insert query
        query = """
        INSERT into emissions(country_name, year, co2) values('%s',%s,%s);
        """ % (country_name, year, co2)
        # Insert into the database
        single_insert(conn, query)
    
print("All rows were successfully inserted in the emissions table")

All rows were successfully inserted in the emissions table
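The loop above builds each INSERT with Python string formatting, which is why the country names had to be stripped of special characters and the NaN values replaced with the text 'NULL'. A different technique, sketched below, is to pass the values as query parameters and let psycopg2 do the quoting, with `None` standing in for SQL NULL. The helpers `to_sql_value` and `insert_rows` are hypothetical and not part of the original code; `cursor.executemany` is the standard DB-API call.

```python
import math

INSERT_SQL = "INSERT INTO emissions(country_name, year, co2) VALUES (%s, %s, %s)"


def to_sql_value(value):
    """Map NaN or the 'NULL' placeholder string to None (sent as SQL NULL)."""
    if isinstance(value, str) and value == "NULL":
        return None
    if isinstance(value, float):
        # Also turns numpy floats into plain Python floats
        return None if math.isnan(value) else float(value)
    return value


def insert_rows(conn, rows):
    """Insert (country_name, year, co2) tuples in one transaction.

    Returns the number of rows sent; rolls back and re-raises on error.
    """
    params = [(name, int(year), to_sql_value(co2)) for name, year, co2 in rows]
    cursor = conn.cursor()
    try:
        cursor.executemany(INSERT_SQL, params)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cursor.close()
    return len(params)

# Possible usage with the dataframe from this tutorial:
# rows = [(df['Country Name'][i], year, df[year][i]) for i in df.index for year in years]
# insert_rows(conn, rows)
```

Because the whole batch is committed once, a failure leaves the table unchanged instead of partially filled.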


Done!

In [13]:
conn.close()

## 5 - (Optional) Back to the database to see what was inserted
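One way to check the result without leaving Python is to read a few rows back. The sketch below assumes a connection that is still open (the one above was closed, so a new call to `connect(param_dic)` would be needed) and uses a hypothetical helper, `fetch_sample`, which is not part of the original code.

```python
def fetch_sample(conn, limit=5):
    """Return the first few rows of the emissions table, ordered by country and year."""
    cursor = conn.cursor()
    try:
        cursor.execute(
            "SELECT country_name, year, co2 FROM emissions "
            "ORDER BY country_name, year LIMIT %s",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        cursor.close()

# Possible usage:
# conn = connect(param_dic)
# for row in fetch_sample(conn):
#     print(row)
# conn.close()
```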