[Reference](https://medium.com/@avi_chawla/the-downsides-of-pandasql-that-no-one-talks-about-9b63c664bef4)

# Step 1: Install PandaSQL


In [1]:
!pip install pandasql

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandasql
  Downloading pandasql-0.7.3.tar.gz (26 kB)
Building wheels for collected packages: pandasql
  Building wheel for pandasql (setup.py) ... done
  Created wheel for pandasql: filename=pandasql-0.7.3-py3-none-any.whl size=26784 sha256=36126acadefa3ccd2466a8edb4ef341c467b47cec5713594e9ed90bf90df11b2
  Stored in directory: /root/.cache/pip/wheels/5c/4b/ec/41f4e116c8053c3654e2c2a47c62b4fca34cc67ef7b55deb7f
Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3


# Step 2: Import Requirements


In [2]:
import pandas as pd
import numpy as np
from pandasql import sqldf
import random

# Step 3: Create a Dummy Dataset


In [3]:
city_list = ["New York", "Manchester", "California", "Munich", "Bombay", 
             "Sydeny", "London", "Moscow", "Dubai", "Tokyo"]

job_list = ["Software Development Engineer", "Research Engineer", 
            "Test Engineer", "Software Development Engineer-II", 
            "Python Developer", "Back End Developer", 
            "Front End Developer", "Data Scientist", 
            "IOS Developer", "Android Developer"]

cmp_list = ["Amazon", "Google", "Infosys", "Mastercard", "Microsoft", 
            "Uber", "IBM", "Apple", "Wipro", "Cognizant"]

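data = []
for _ in range(10000):
    # Pick each categorical field uniformly at random.
    company = random.choice(cmp_list)
    job = random.choice(job_list)
    city = random.choice(city_list)
    # Salary: a multiple of 1,000 between 0 and 1,000,000.
    # Integer arithmetic avoids truncation errors from int() on a float product.
    salary = round(np.random.rand() * 1000) * 1000
    # Roughly 80% full-time and 20% intern.
    employment = random.choices(["Full Time", "Intern"], weights=(80, 20))[0]
    # Rating between 0.0 and 5.0, rounded to one decimal place.
    rating = round((np.random.rand()*5), 1)

    data.append([company, job, city, salary, employment, rating])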
    
data = pd.DataFrame(data, columns=["Company_Name", "Employee_Job_Title",
                                   "Employee_Work_Location",  "Employee_Salary", 
                                   "Employment_Status", "Employee_Rating"])
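
Because the dataset is drawn at random, the counts and rows shown in later cells will differ from run to run. A minimal sketch of making the draws repeatable by seeding the generator; `SEED` and `sample_statuses` are hypothetical names introduced only for this illustration, and only the standard-library generator is covered here:

```python
import random

SEED = 42  # hypothetical value; any fixed integer works


def sample_statuses(n):
    """Draw n employment statuses with the same 80/20 weights used above."""
    return random.choices(["Full Time", "Intern"], weights=(80, 20), k=n)


# Seeding before each run makes the two runs produce identical draws.
random.seed(SEED)
first_run = sample_statuses(5)

random.seed(SEED)
second_run = sample_statuses(5)

print(first_run == second_run)
```

Note that `random.seed` does not affect NumPy. The salary and rating columns come from `np.random.rand()`, so a fully reproducible dataset would also need `np.random.seed(SEED)` before the loop.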

# Step 4: Create the PandaSQL Environment


In [4]:
pysqldf = lambda q: sqldf(q, globals())

In [5]:
type(globals())

dict

In [6]:
globals()["data"].head(2)

|   | Company_Name | Employee_Job_Title | Employee_Work_Location | Employee_Salary | Employment_Status | Employee_Rating |
|---|---|---|---|---|---|---|
| 0 | Uber | Back End Developer | Moscow | 3000 | Intern | 1.8 |
| 1 | Apple | Python Developer | Dubai | 551000 | Full Time | 2.8 |


# Step 5: Run SQL Queries on Pandas DataFrame


In [7]:
query = "select count(*) from data"
pysqldf(query)

|   | count(*) |
|---|---|
| 0 | 10000 |


In [8]:
query = """select * from data 
           where Employee_Rating >4.9 
           limit 5;"""
pysqldf(query)

|   | Company_Name | Employee_Job_Title | Employee_Work_Location | Employee_Salary | Employment_Status | Employee_Rating |
|---|---|---|---|---|---|---|
| 0 | Amazon | IOS Developer | California | 217000 | Full Time | 5.0 |
| 1 | Google | IOS Developer | New York | 523000 | Intern | 5.0 |
| 2 | Cognizant | Software Development Engineer | Moscow | 851000 | Full Time | 5.0 |
| 3 | Wipro | Back End Developer | Sydeny | 230000 | Full Time | 5.0 |
| 4 | Mastercard | Android Developer | New York | 225000 | Intern | 5.0 |


In [9]:
query = """select Company_Name, count(*) 
        from data
        group by Company_Name;
        """
pysqldf(query)

|   | Company_Name | count(*) |
|---|---|---|
| 0 | Amazon | 993 |
| 1 | Apple | 979 |
| 2 | Cognizant | 1028 |
| 3 | Google | 1018 |
| 4 | IBM | 982 |
| 5 | Infosys | 974 |
| 6 | Mastercard | 977 |
| 7 | Microsoft | 1020 |
| 8 | Uber | 994 |
| 9 | Wipro | 1035 |


In [10]:
# "create" is a reserved SQL keyword, so using it as a table name
# makes the query below fail.
create = data.head()

pysqldf("select * from create where Employee_Rating >4.9;")

PandaSQLException: ignored
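
The failure above comes from the DataFrame's variable name, `create`, which is a reserved SQL keyword. pandasql uses SQLite by default, so the underlying error can be reproduced with the standard-library `sqlite3` module alone. This is a sketch rather than what pandasql does internally; the table name `top_rows` is a hypothetical replacement chosen for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table literally named "create" can exist only if the name is quoted.
conn.execute('CREATE TABLE "create" (Employee_Rating REAL)')
conn.execute('INSERT INTO "create" VALUES (5.0)')

# Referring to it without quotes is a syntax error, because the parser
# reads CREATE as the start of a statement rather than a table name.
try:
    conn.execute("SELECT * FROM create WHERE Employee_Rating > 4.9")
    unquoted_failed = False
except sqlite3.OperationalError as exc:
    unquoted_failed = True
    print("unquoted reserved name failed:", exc)

# The same query against a non-reserved name runs without trouble.
conn.execute("CREATE TABLE top_rows (Employee_Rating REAL)")
conn.execute("INSERT INTO top_rows VALUES (5.0)")
rows = conn.execute(
    "SELECT * FROM top_rows WHERE Employee_Rating > 4.9"
).fetchall()
print(rows)

conn.close()
```

In the notebook, the simplest fix is to bind the DataFrame to a name that is not an SQL keyword, for example `top_rows = data.head()`, and query that name instead.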

In [11]:
%timeit pysqldf("select count(*) from data;")

%timeit data.shape[0]

10 loops, best of 5: 112 ms per loop
The slowest run took 16.23 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 5: 868 ns per loop


In [12]:
%timeit pysqldf("select Company_Name, count(*) from data group by Company_Name;")

%timeit data.groupby("Company_Name").size()

10 loops, best of 5: 119 ms per loop
The slowest run took 6.14 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 1.03 ms per loop


In [13]:
%timeit pysqldf("select * from data where Employee_Rating >4.9;")

%timeit data[data.Employee_Rating>4.9]

10 loops, best of 5: 116 ms per loop
1000 loops, best of 5: 360 µs per loop


In [14]:
%timeit pysqldf("select * from data as a join data as b on (a.Company_Name = b.Company_Name and a.Employee_Job_Title = b.Employee_Job_Title);")

%timeit pd.merge(data, data, on = ["Company_Name", "Employee_Job_Title"])

1 loop, best of 5: 5.87 s per loop
1 loop, best of 5: 207 ms per loop
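
One plausible reason for the roughly constant ~110 ms floor in the simple queries above is per-call setup cost. This assumes that `sqldf` copies every referenced DataFrame into a fresh in-memory SQLite database on each call, which is an assumption about pandasql's default behaviour and not something these timings prove. The sketch below imitates that workflow with `sqlite3` and plain tuples in place of a DataFrame; `load_and_query`, `query_only`, and the two-column schema are hypothetical names for this illustration:

```python
import sqlite3
import time

# Stand-in for the 10,000-row DataFrame: (company, salary) tuples.
rows = [(f"company_{i % 10}", i * 1000) for i in range(10_000)]

SCHEMA = "CREATE TABLE data (Company_Name TEXT, Employee_Salary INTEGER)"


def load_and_query(rows):
    """Build a fresh database, copy every row in, then run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
    (count,) = conn.execute("SELECT count(*) FROM data").fetchone()
    conn.close()
    return count


def query_only(conn):
    """Run the same query against a table that is already loaded."""
    (count,) = conn.execute("SELECT count(*) FROM data").fetchone()
    return count


# Load once up front so the second measurement excludes the copy.
persistent = sqlite3.connect(":memory:")
persistent.execute(SCHEMA)
persistent.executemany("INSERT INTO data VALUES (?, ?)", rows)

start = time.perf_counter()
n_fresh = load_and_query(rows)
fresh_seconds = time.perf_counter() - start

start = time.perf_counter()
n_reuse = query_only(persistent)
reuse_seconds = time.perf_counter() - start

print(f"copy + query: {fresh_seconds * 1e3:.2f} ms ({n_fresh} rows)")
print(f"query only:   {reuse_seconds * 1e3:.2f} ms ({n_reuse} rows)")
```

If the copy dominates, as this sketch suggests it can, the cost of a pandasql query grows with the size of the DataFrame even when the query itself is trivial. The native pandas calls operate on data already in memory and pay no such cost.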
