# Before we go...

If you have already run the code, you may need to delete the file "biggerDB.db" to avoid errors when running it again. The cell below tests whether the file exists and deletes it automatically.


In [1]:
import os

# Look for biggerDB.db and remove it if the file exists:
if os.path.isfile("biggerDB.db"):
    os.remove("biggerDB.db")
    print("Removed biggerDB.db")

Removed biggerDB.db
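
The same cleanup can be sketched with `pathlib`, assuming Python 3.8 or later for the `missing_ok` parameter. The helper name `remove_if_exists` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def remove_if_exists(path):
    """Delete the file at `path` if present; return True when a file was there."""
    target = Path(path)
    existed = target.is_file()
    # missing_ok=True (Python 3.8+) makes unlink a no-op when the file is absent
    target.unlink(missing_ok=True)
    return existed


print(remove_if_exists("biggerDB.db"))
```

Returning a boolean lets the caller decide whether to print a message such as "Removed biggerDB.db".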


# Example - Creating a Database using Faker and converting it to a Pandas DataFrame

We will now create a larger database using the Faker module and see how pandas can build a DataFrame directly from a SQL cursor.

Users may need to run the respective pip install commands:

pip install pandas

pip install Faker


## Generating Fake Data

We will use Faker to generate fake data to fill our table. This process may take some time if the number of rows to generate is large.

To keep it fast while still getting a reasonable volume of data, we will generate 50,000 rows for our database.


In [2]:
from faker import Faker
import random

# The keys are the faker codes of each country:
dataQuantity = {
    "it_IT": 10_000, # Italy
    "en_US":  8_000, # USA
    "pt_BR": 15_000, # Brazil
    "es_AR": 10_000, # Argentina
    "fr_FR":  7_000  # France
}
TOTAL_DATA_SIZE = sum(dataQuantity.values())
MIN_AGE = 18
MAX_AGE = 70

# Generating the data using the values defined above:
data = []
id_base_val = 0

for locale, quantity in dataQuantity.items():
    fake = Faker(locale)
    data.extend(
        [[i, fake.name(), random.randint(MIN_AGE, MAX_AGE), fake.address().replace("\n", " - ")]
         for i in range(id_base_val, id_base_val + quantity)]
    )
    id_base_val += quantity

# The data was generated ordered by country. Let's shuffle it:
random.shuffle(data)

# Resetting the IDs after shuffling the rows:
for i in range(TOTAL_DATA_SIZE):
    data[i][0] = i
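
To see the ID bookkeeping in isolation, here is a minimal sketch that replaces the Faker calls with placeholder strings and uses much smaller, made-up quantities. The names `quantities` and `rows` are illustrative assumptions; only the offset, shuffle, and renumbering logic mirrors the cell above.

```python
import random

# Hypothetical, much smaller quantities than the notebook's 50,000 rows
quantities = {"it_IT": 3, "en_US": 2, "pt_BR": 4}

rows = []
id_base = 0
for locale, quantity in quantities.items():
    # Each locale gets a contiguous block of IDs starting at id_base
    rows.extend([i, f"{locale}-person"] for i in range(id_base, id_base + quantity))
    id_base += quantity

random.shuffle(rows)  # rows are no longer grouped by locale

# Renumber so the IDs follow the shuffled order again
for new_id, row in enumerate(rows):
    row[0] = new_id

print([row[0] for row in rows])  # IDs 0 through 8, in order
```

Without the renumbering step, the ID column would still reveal which locale each row came from, since every locale occupied one contiguous ID block.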


## Viewing the data in a pandas DataFrame

Before adding it to the SQL database, let's see how our data looks in a DataFrame.

In [3]:
import pandas as pd

df = pd.DataFrame(data, columns=["ID", "Name", "Age", "Address"])
df.set_index("ID", inplace=True)

df.head(10)

ID,Name,Age,Address
0,Catalina Ambar Soria Paz,38,"Diag. 7 N° 121 - Neuquén 8300, Neuquén"
1,Lauretta Garzoni,47,"Contrada Serraglio, 58 Appartamento 24 - 50035..."
2,Ana Paula Rodriguez Godoy,20,"Calle 3 N° 6699 Torre 9 Dto. 8 - Mendoza 5500,..."
3,Jacob Gilbert MD,25,"76487 Campos Lake - Martinborough, WA 82158"
4,Antoinette Cousin,18,"33, avenue Richard Adam - 97383 Leroy"
5,Priscila Joaquin Paez,68,"Diagonal J.M. de Rosas N° 329 - Córdoba 5000, ..."
6,Erica Kelley,48,"PSC 0195, Box 2913 - APO AP 55690"
7,Abril Guillermina Molina,54,"Calle San Luis N° 67 - Paraná 3100, Entre Ríos"
8,Christopher Hogan,55,"331 Cameron Groves - Garciaton, MI 48035"
9,Rebeca da Mata,29,"Morro Nina Castro, 66 - Lorena - 91948-743 da ..."


## Creating the new database

We will now create a new database and a table to hold our fake data, then insert all the rows with a single INSERT statement using the sqlite3 module.


In [4]:
import sqlite3

conn = sqlite3.connect('biggerDB.db')
try:
    conn.execute('''
            CREATE TABLE USERS
             (ID         INT PRIMARY KEY   NOT NULL,
             NAME        TEXT              NOT NULL,
             AGE         INT               NOT NULL,
             ADDRESS     CHAR(50));
        ''')
    print("Table created successfully!")
except sqlite3.Error as err:
    # Catch only database errors and show the reason
    print("Table not created!", err)

# As noted above, we do not need to insert the rows into the database one by one.
# Instead, we first build one long INSERT statement as a Python string
# and then execute it with our SQLite connection:
try:
    valuesString = ["({}, \"{}\", {}, \"{}\")".format(row[0], row[1], row[2], row[3]) for row in data]
    valuesString = ",".join(valuesString)

    insertString = "INSERT INTO USERS (ID,NAME,AGE, ADDRESS)\n VALUES " + valuesString

    conn.execute(insertString)
    conn.commit()
    print("Records created successfully")
except sqlite3.Error as err:
    # Catch only database errors and show the reason
    print("Data not included in the table!", err)

conn.close()

Table created successfully!
Records created successfully
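
Building the statement with string formatting works for this data, but it would break on a value that contains a double quote. A hedged alternative, not the approach used in this notebook, is to pass the rows to `executemany` with `?` placeholders so that sqlite3 handles the quoting. The sketch below uses an in-memory database and three made-up rows instead of `biggerDB.db`.

```python
import sqlite3

# Made-up sample rows; the second name contains quotes on purpose
sample_rows = [
    [0, "Ana Souza", 31, "Rua A, 10 - Lorena"],
    [1, 'Dwayne "Rock" O\'Neil', 45, "1 Main St - Springfield"],
    [2, "Luc Martin", 27, None],
]

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute("""
    CREATE TABLE USERS
    (ID      INT PRIMARY KEY NOT NULL,
     NAME    TEXT            NOT NULL,
     AGE     INT             NOT NULL,
     ADDRESS CHAR(50));
""")

# One call inserts every row; each ? is bound to a value, not pasted into the SQL
conn.executemany(
    "INSERT INTO USERS (ID, NAME, AGE, ADDRESS) VALUES (?, ?, ?, ?)",
    sample_rows,
)
conn.commit()

stored = conn.execute("SELECT NAME FROM USERS WHERE ID = 1").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM USERS").fetchone()[0]
print(row_count, stored)
conn.close()
```

With the notebook's data, the same call would take `data` in place of `sample_rows`.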


The pandas DataFrame constructor can receive a SQLite cursor as its data argument. Notice that the result is exactly the same DataFrame we built from the original data, as shown in the two cells below.

In [5]:
conn = sqlite3.connect('biggerDB.db')
cursor = conn.execute("SELECT * FROM USERS")

new_df = pd.DataFrame(cursor, columns=["ID", "Name", "Age", "Address"])
new_df.set_index("ID", inplace=True)
conn.close()  # The rows are already loaded into the DataFrame

new_df.head(10)

ID,Name,Age,Address
0,Catalina Ambar Soria Paz,38,"Diag. 7 N° 121 - Neuquén 8300, Neuquén"
1,Lauretta Garzoni,47,"Contrada Serraglio, 58 Appartamento 24 - 50035..."
2,Ana Paula Rodriguez Godoy,20,"Calle 3 N° 6699 Torre 9 Dto. 8 - Mendoza 5500,..."
3,Jacob Gilbert MD,25,"76487 Campos Lake - Martinborough, WA 82158"
4,Antoinette Cousin,18,"33, avenue Richard Adam - 97383 Leroy"
5,Priscila Joaquin Paez,68,"Diagonal J.M. de Rosas N° 329 - Córdoba 5000, ..."
6,Erica Kelley,48,"PSC 0195, Box 2913 - APO AP 55690"
7,Abril Guillermina Molina,54,"Calle San Luis N° 67 - Paraná 3100, Entre Ríos"
8,Christopher Hogan,55,"331 Cameron Groves - Garciaton, MI 48035"
9,Rebeca da Mata,29,"Morro Nina Castro, 66 - Lorena - 91948-743 da ..."


In [6]:
df.head(10)

ID,Name,Age,Address
0,Catalina Ambar Soria Paz,38,"Diag. 7 N° 121 - Neuquén 8300, Neuquén"
1,Lauretta Garzoni,47,"Contrada Serraglio, 58 Appartamento 24 - 50035..."
2,Ana Paula Rodriguez Godoy,20,"Calle 3 N° 6699 Torre 9 Dto. 8 - Mendoza 5500,..."
3,Jacob Gilbert MD,25,"76487 Campos Lake - Martinborough, WA 82158"
4,Antoinette Cousin,18,"33, avenue Richard Adam - 97383 Leroy"
5,Priscila Joaquin Paez,68,"Diagonal J.M. de Rosas N° 329 - Córdoba 5000, ..."
6,Erica Kelley,48,"PSC 0195, Box 2913 - APO AP 55690"
7,Abril Guillermina Molina,54,"Calle San Luis N° 67 - Paraná 3100, Entre Ríos"
8,Christopher Hogan,55,"331 Cameron Groves - Garciaton, MI 48035"
9,Rebeca da Mata,29,"Morro Nina Castro, 66 - Lorena - 91948-743 da ..."
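
pandas also provides `read_sql_query`, which runs the query and builds the DataFrame in one step, taking the column names from the result set. This is a sketch on a small in-memory table with made-up rows, not the notebook's `biggerDB.db`; note that the column names then keep the upper-case spelling used in the `CREATE TABLE` statement.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
conn.execute(
    "CREATE TABLE USERS "
    "(ID INT PRIMARY KEY NOT NULL, NAME TEXT NOT NULL, AGE INT NOT NULL, ADDRESS CHAR(50))"
)
conn.executemany(
    "INSERT INTO USERS VALUES (?, ?, ?, ?)",
    [
        (0, "Ana Souza", 31, "Rua A, 10"),
        (1, "Luc Martin", 27, "2, rue B"),
        (2, "Eva Rossi", 52, "Via C, 3"),
    ],
)

# index_col plays the role of set_index("ID") in the cells above
sql_df = pd.read_sql_query("SELECT * FROM USERS ORDER BY ID", conn, index_col="ID")
conn.close()

print(sql_df)
```

The explicit `ORDER BY ID` is a choice made for this sketch so that the row order does not depend on how SQLite happens to return the rows.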
