# Why do we need SQL?

In the real world, data is rarely stored in a single file (CSV, TXT, ...). It is usually stored in a database, and SQL is used to retrieve data from databases.

# What is SQL?

<ul>
    <li> SQL stands for Structured Query Language </li>
    <li> Used for storing, manipulating and retrieving data in an RDBMS (relational database management system, which stores data in related tables)</li>
    <li> Even though SQL is a standard language, the syntax varies a bit for each database (MySQL, Oracle, SQLite, ...)</li>
</ul>
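
To make "storing, manipulating and retrieving" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table `demo_clients` and its rows are hypothetical and live in an in-memory database, not in the course database.

```python
import sqlite3

# In-memory database so nothing is written to disk (illustrative only).
con_demo = sqlite3.connect(":memory:")

# Storing: create a hypothetical table and insert two rows.
con_demo.execute("CREATE TABLE demo_clients (client_id INTEGER PRIMARY KEY, name TEXT)")
con_demo.executemany(
    "INSERT INTO demo_clients (client_id, name) VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Manipulating: change an existing row.
con_demo.execute("UPDATE demo_clients SET name = 'Acme Ltd' WHERE client_id = 1")

# Retrieving: read the rows back.
rows = con_demo.execute(
    "SELECT client_id, name FROM demo_clients ORDER BY client_id"
).fetchall()
print(rows)
con_demo.close()
```

Only the last step, retrieving, is what the rest of this notebook practises.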

### Scope of this SQL course

In this course we will only look at the data retrieval part of SQL. Data storage and manipulation are usually taken care of by data engineers.

### Creation of SQLite database

Execute all the cells of the "Database & Table Creation" notebook, which you can find in the SQL section of the course. This will create a SQLite database on your local machine.

### Importing Libraries

In [58]:
import sqlite3 as sl
import pandas as pd

### Creating connection to DB

In [59]:
con = sl.connect('sql_invoicing_1.db')

In [60]:
# List the tables in the database

cursorObj = con.cursor()

# Single quotes mark the string literal 'table'; double quotes are meant for identifiers
cursorObj.execute("SELECT name FROM sqlite_master WHERE type = 'table'")

print(cursorObj.fetchall())


[('payment_methods',), ('sqlite_sequence',), ('clients',), ('invoices',), ('payments',), ('products',), ('shippers',), ('customers',), ('order_statuses',), ('orders',), ('order_items',), ('order_item_notes',), ('offices',), ('employees',)]


#### Retrieve all rows and columns of a table

In [61]:
Query = """
        select *
        from invoices;
        """
df = pd.read_sql_query(Query,con)
df.head()

Unnamed: 0,invoice_id,number,client_id,invoice_total,payment_total,invoice_date,due_date,payment_date
0,1,91-953-3396,2,101.79,0.0,2019-03-09,2019-03-29,
1,2,03-898-6735,5,175.32,8.18,2019-06-11,2019-07-01,2019-06-30
2,3,20-228-0335,5,147.99,0.0,2019-07-31,2019-08-20,
3,4,56-934-0748,3,152.21,0.0,2019-03-08,2019-03-28,
4,5,87-052-3121,5,169.36,0.0,2019-07-18,2019-08-07,


In [62]:
type(df)

pandas.core.frame.DataFrame

##### Class Task: Retrieve all columns and rows from any table

#### Selecting specific columns from a table

In [63]:
Query = """
       select number,invoice_total,invoice_date,due_date,payment_date
       from invoices ;
        """
df = pd.read_sql_query(Query,con)
df.head()

Unnamed: 0,number,invoice_total,invoice_date,due_date,payment_date
0,91-953-3396,101.79,2019-03-09,2019-03-29,
1,03-898-6735,175.32,2019-06-11,2019-07-01,2019-06-30
2,20-228-0335,147.99,2019-07-31,2019-08-20,
3,56-934-0748,152.21,2019-03-08,2019-03-28,
4,87-052-3121,169.36,2019-07-18,2019-08-07,


##### Class Task: Query a table by using specific column names

#### LIMIT restricts the output to a given number of rows, which helps when exploring the data

In [64]:
Query = """
       select number,invoice_total,invoice_date,due_date,payment_date
       from invoices 
       limit 5;
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,number,invoice_total,invoice_date,due_date,payment_date
0,91-953-3396,101.79,2019-03-09,2019-03-29,
1,03-898-6735,175.32,2019-06-11,2019-07-01,2019-06-30
2,20-228-0335,147.99,2019-07-31,2019-08-20,
3,56-934-0748,152.21,2019-03-08,2019-03-28,
4,87-052-3121,169.36,2019-07-18,2019-08-07,


#### ORDER BY sorts the result based on a column

In [65]:
Query = """
       select number,invoice_total,invoice_date,due_date,payment_date
       from invoices 
       order by invoice_total desc;
        """
df = pd.read_sql_query(Query,con)
df.head()

Unnamed: 0,number,invoice_total,invoice_date,due_date,payment_date
0,78-145-1093,189.12,2019-05-20,2019-06-09,
1,52-269-9803,180.17,2019-05-23,2019-06-12,2019-06-08
2,03-898-6735,175.32,2019-06-11,2019-07-01,2019-06-30
3,77-593-0081,172.17,2019-07-09,2019-07-29,
4,87-052-3121,169.36,2019-07-18,2019-08-07,


#### By default, ORDER BY sorts the data in ascending order

In [66]:
Query = """
       select number,invoice_total,invoice_date,due_date,payment_date
       from invoices 
       order by invoice_total;
        """
df = pd.read_sql_query(Query,con)
df.head()

Unnamed: 0,number,invoice_total,invoice_date,due_date,payment_date
0,91-953-3396,101.79,2019-03-09,2019-03-29,
1,20-848-0181,126.15,2019-01-07,2019-01-27,2019-01-27
2,33-615-4694,126.38,2019-07-30,2019-08-19,2019-08-15
3,68-093-9863,133.87,2019-09-04,2019-09-24,
4,83-559-4105,134.47,2019-11-23,2019-12-13,
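
ORDER BY and LIMIT are often combined to get the "top N" rows, and ORDER BY can also take more than one column. The following is a minimal sketch on a hypothetical `demo_invoices` table in an in-memory database; the rows are made up for illustration.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE demo_invoices (number TEXT, client_id INTEGER, invoice_total REAL)"
)
con_demo.executemany(
    "INSERT INTO demo_invoices VALUES (?, ?, ?)",
    [("A-1", 2, 101.79), ("A-2", 5, 175.32), ("A-3", 5, 147.99), ("A-4", 3, 152.21)],
)

# Sorting happens before LIMIT is applied, so this returns the two largest invoices.
top_two = con_demo.execute(
    """
    select number, invoice_total
    from demo_invoices
    order by invoice_total desc
    limit 2;
    """
).fetchall()

# Sort by client first, then by invoice total (largest first) within each client.
by_client = con_demo.execute(
    """
    select client_id, invoice_total
    from demo_invoices
    order by client_id, invoice_total desc;
    """
).fetchall()

print(top_two)
print(by_client)
con_demo.close()
```

The same query strings can be passed to `pd.read_sql_query` if a DataFrame is preferred.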


#### The WHERE clause filters rows based on a condition

In [67]:
# Query the table to extract data of one particular client
Query = """
       select *
       from invoices 
       where client_id = 5;
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,invoice_id,number,client_id,invoice_total,payment_total,invoice_date,due_date,payment_date
0,2,03-898-6735,5,175.32,8.18,2019-06-11,2019-07-01,2019-06-30
1,3,20-228-0335,5,147.99,0.0,2019-07-31,2019-08-20,
2,5,87-052-3121,5,169.36,0.0,2019-07-18,2019-08-07,
3,9,77-593-0081,5,172.17,0.0,2019-07-09,2019-07-29,
4,13,41-666-1035,5,135.01,87.44,2019-06-25,2019-07-15,2019-07-13
5,18,52-269-9803,5,180.17,42.77,2019-05-23,2019-06-12,2019-06-08


##### Class Task: Retrieve all invoices for which no payment has been made

#### <, >, <=, >= and <> are other comparison operators that can be used with the WHERE clause

In [68]:
Query = """
       select *
       from invoices 
       where invoice_total  >= 150;
        """
df = pd.read_sql_query(Query,con)
df.head()

Unnamed: 0,invoice_id,number,client_id,invoice_total,payment_total,invoice_date,due_date,payment_date
0,2,03-898-6735,5,175.32,8.18,2019-06-11,2019-07-01,2019-06-30
1,4,56-934-0748,3,152.21,0.0,2019-03-08,2019-03-28,
2,5,87-052-3121,5,169.36,0.0,2019-07-18,2019-08-07,
3,6,75-587-6626,1,157.78,74.55,2019-01-29,2019-02-18,2019-02-20
4,8,78-145-1093,1,189.12,0.0,2019-05-20,2019-06-09,


In [69]:
Query = """
       select *
       from invoices 
       where client_id  <> 5;
        """
df = pd.read_sql_query(Query,con)
df.head()

Unnamed: 0,invoice_id,number,client_id,invoice_total,payment_total,invoice_date,due_date,payment_date
0,1,91-953-3396,2,101.79,0.0,2019-03-09,2019-03-29,
1,4,56-934-0748,3,152.21,0.0,2019-03-08,2019-03-28,
2,6,75-587-6626,1,157.78,74.55,2019-01-29,2019-02-18,2019-02-20
3,7,68-093-9863,3,133.87,0.0,2019-09-04,2019-09-24,
4,8,78-145-1093,1,189.12,0.0,2019-05-20,2019-06-09,


#### Logical operators (AND, OR, NOT, BETWEEN, IN)

In [70]:
# Using AND operator
Query = """
       select *
       from invoices 
       where invoice_total  >= 150 and payment_total = 0;
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,invoice_id,number,client_id,invoice_total,payment_total,invoice_date,due_date,payment_date
0,4,56-934-0748,3,152.21,0,2019-03-08,2019-03-28,
1,5,87-052-3121,5,169.36,0,2019-07-18,2019-08-07,
2,8,78-145-1093,1,189.12,0,2019-05-20,2019-06-09,
3,9,77-593-0081,5,172.17,0,2019-07-09,2019-07-29,
4,10,48-266-1517,1,159.5,0,2019-06-30,2019-07-20,
5,16,10-451-8824,1,162.02,0,2019-03-30,2019-04-19,


In [71]:
# Using BETWEEN operator
# Let's have a look at the employees table
Query = """
       select *
       from employees;
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,employee_id,first_name,last_name,job_title,salary,reports_to,office_id
0,37270,Yovonnda,Magrannell,Executive Secretary,63996,,10
1,33391,D'arcy,Nortunen,Account Executive,62871,37270.0,1
2,37851,Sayer,Matterson,Statistician III,98926,37270.0,1
3,40448,Mindy,Crissil,Staff Scientist,94860,37270.0,1
4,56274,Keriann,Alloisi,VP Marketing,110150,37270.0,1
5,63196,Alaster,Scutchin,Assistant Professor,32179,37270.0,2
6,67009,North,de Clerc,VP Product Management,114257,37270.0,2
7,67370,Elladine,Rising,Social Worker,96767,37270.0,2
8,68249,Nisse,Voysey,Financial Advisor,52832,37270.0,2
9,72540,Guthrey,Iacopetti,Office Assistant I,117690,37270.0,3


In [72]:
Query = """
       select *
       from employees
       where salary between 70000 and 100000;
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,employee_id,first_name,last_name,job_title,salary,reports_to,office_id
0,37851,Sayer,Matterson,Statistician III,98926,37270,1
1,40448,Mindy,Crissil,Staff Scientist,94860,37270,1
2,67370,Elladine,Rising,Social Worker,96767,37270,2
3,72913,Kass,Hefferan,Computer Systems Analyst IV,96401,37270,3
4,80529,Lynde,Aronson,Junior Executive,77182,37270,4
5,84791,Hazel,Tarbert,General Manager,93760,37270,4
6,95213,Cole,Kesterton,Pharmacist,86119,37270,4
7,98374,Estrellita,Daleman,Staff Accountant IV,70187,37270,5
8,115357,Ivy,Fearey,Structural Engineer,92710,37270,5


In [73]:
# Using NOT operator
Query = """
       select *
       from employees
       where salary not between 70000 and 100000
       order by salary;
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,employee_id,first_name,last_name,job_title,salary,reports_to,office_id
0,63196,Alaster,Scutchin,Assistant Professor,32179,37270.0,2
1,96513,Theresa,Binney,Food Chemist,47354,37270.0,5
2,68249,Nisse,Voysey,Financial Advisor,52832,37270.0,2
3,75900,Virge,Goodrum,Information Systems Manager,54578,37270.0,3
4,33391,D'arcy,Nortunen,Account Executive,62871,37270.0,1
5,37270,Yovonnda,Magrannell,Executive Secretary,63996,,10
6,80679,Mildrid,Sokale,Geologist II,67987,37270.0,4
7,56274,Keriann,Alloisi,VP Marketing,110150,37270.0,1
8,67009,North,de Clerc,VP Product Management,114257,37270.0,2
9,72540,Guthrey,Iacopetti,Office Assistant I,117690,37270.0,3


In [74]:
# Using IN Operator
Query = """
       select *
       from employees
       where office_id in (2,3);
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,employee_id,first_name,last_name,job_title,salary,reports_to,office_id
0,63196,Alaster,Scutchin,Assistant Professor,32179,37270,2
1,67009,North,de Clerc,VP Product Management,114257,37270,2
2,67370,Elladine,Rising,Social Worker,96767,37270,2
3,68249,Nisse,Voysey,Financial Advisor,52832,37270,2
4,72540,Guthrey,Iacopetti,Office Assistant I,117690,37270,3
5,72913,Kass,Hefferan,Computer Systems Analyst IV,96401,37270,3
6,75900,Virge,Goodrum,Information Systems Manager,54578,37270,3
7,76196,Mirilla,Janowski,Cost Accountant,119241,37270,3


#### LIKE & WILDCARDS

In [75]:
# '%' matches any number of characters (including none)
Query = """
       select *
       from employees
       where first_name like 'm%';
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,employee_id,first_name,last_name,job_title,salary,reports_to,office_id
0,40448,Mindy,Crissil,Staff Scientist,94860,37270,1
1,76196,Mirilla,Janowski,Cost Accountant,119241,37270,3
2,80679,Mildrid,Sokale,Geologist II,67987,37270,4


##### Class Task: Output all the observations where the job title ends with 'manager' or 'management'

In [76]:
# '_' matches exactly one character
Query = """
       select *
       from employees
       where first_name like '_i%';
        """
df = pd.read_sql_query(Query,con)
df



Unnamed: 0,employee_id,first_name,last_name,job_title,salary,reports_to,office_id
0,40448,Mindy,Crissil,Staff Scientist,94860,37270,1
1,68249,Nisse,Voysey,Financial Advisor,52832,37270,2
2,75900,Virge,Goodrum,Information Systems Manager,54578,37270,3
3,76196,Mirilla,Janowski,Cost Accountant,119241,37270,3
4,80679,Mildrid,Sokale,Geologist II,67987,37270,4


#### Working with DATES

In [77]:
Query = """
       select *,strftime('%m',invoice_date) as month_number
    
       from invoices;
        """
df = pd.read_sql_query(Query,con)
df.head()

Unnamed: 0,invoice_id,number,client_id,invoice_total,payment_total,invoice_date,due_date,payment_date,month_number
0,1,91-953-3396,2,101.79,0.0,2019-03-09,2019-03-29,,03
1,2,03-898-6735,5,175.32,8.18,2019-06-11,2019-07-01,2019-06-30,06
2,3,20-228-0335,5,147.99,0.0,2019-07-31,2019-08-20,,07
3,4,56-934-0748,3,152.21,0.0,2019-03-08,2019-03-28,,03
4,5,87-052-3121,5,169.36,0.0,2019-07-18,2019-08-07,,07


##### Class Task: Output all the cases where the payment month is July

You can find more about the strftime function here:<br> https://www.sqlite.org/lang_datefunc.html
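
One detail worth knowing is that `strftime` returns text, so `'%m'` gives a zero-padded string such as `'03'` rather than the number 3. The sketch below uses a literal date instead of a table column and shows a cast to integer as one possible way to get a number.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")

# strftime returns text, so the month comes back zero-padded ('03', not 3).
month_text, year_text, month_int = con_demo.execute(
    """
    select strftime('%m', '2019-03-09'),
           strftime('%Y', '2019-03-09'),
           cast(strftime('%m', '2019-03-09') as integer);
    """
).fetchone()

print(month_text, year_text, month_int)
con_demo.close()
```

This matters when filtering: a comparison against the text result needs a string such as `'03'`, while a comparison against the cast value needs a number.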

In [82]:
# Difference in days between two dates ('%J' is the Julian day number)
Query = """
       select *,strftime('%J',payment_date)-strftime('%J',invoice_date) as Date_Diff
    
       from invoices;
        """
df = pd.read_sql_query(Query,con)
df

Unnamed: 0,invoice_id,number,client_id,invoice_total,payment_total,invoice_date,due_date,payment_date,Date_Diff
0,1,91-953-3396,2,101.79,0.0,2019-03-09,2019-03-29,,
1,2,03-898-6735,5,175.32,8.18,2019-06-11,2019-07-01,2019-06-30,19.0
2,3,20-228-0335,5,147.99,0.0,2019-07-31,2019-08-20,,
3,4,56-934-0748,3,152.21,0.0,2019-03-08,2019-03-28,,
4,5,87-052-3121,5,169.36,0.0,2019-07-18,2019-08-07,,
5,6,75-587-6626,1,157.78,74.55,2019-01-29,2019-02-18,2019-02-20,22.0
6,7,68-093-9863,3,133.87,0.0,2019-09-04,2019-09-24,,
7,8,78-145-1093,1,189.12,0.0,2019-05-20,2019-06-09,,
8,9,77-593-0081,5,172.17,0.0,2019-07-09,2019-07-29,,
9,10,48-266-1517,1,159.5,0.0,2019-06-30,2019-07-20,,
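
SQLite also provides `julianday()`, which returns the Julian day number directly as a number and can be used for the same subtraction. The sketch below, on a hypothetical `demo_invoices` table, also shows why `Date_Diff` is empty for unpaid invoices: any arithmetic involving NULL yields NULL.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE demo_invoices (invoice_id INTEGER, invoice_date TEXT, payment_date TEXT)"
)
con_demo.executemany(
    "INSERT INTO demo_invoices VALUES (?, ?, ?)",
    [(1, "2019-03-09", None), (2, "2019-06-11", "2019-06-30")],
)

diffs = con_demo.execute(
    """
    select invoice_id,
           julianday(payment_date) - julianday(invoice_date) as date_diff
    from demo_invoices
    order by invoice_id;
    """
).fetchall()

print(diffs)  # NULL arrives in Python as None
con_demo.close()
```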


#### Handling NULL values

<i>Extract sample data</i>

In [None]:
Query = """
           select *
           from invoices
        """
df = pd.read_sql_query(Query,con)
df

<i>Extracting rows where payment_date is NULL</i>

In [None]:
Query = """
          select client_id,invoice_id,invoice_total,invoice_date,payment_date
          from invoices
          where payment_date is null;
        """
df = pd.read_sql_query(Query,con)
df
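
The query uses `is null` rather than `= null` for a reason: NULL means "unknown", so comparing anything to it with `=` is never true and returns no rows. A minimal sketch on a hypothetical `demo_invoices` table:

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE demo_invoices (invoice_id INTEGER, payment_date TEXT)")
con_demo.executemany(
    "INSERT INTO demo_invoices VALUES (?, ?)",
    [(1, None), (2, "2019-06-30"), (3, None)],
)

# '= NULL' never evaluates to true, so no rows match.
equals_null = con_demo.execute(
    "select count(*) from demo_invoices where payment_date = NULL;"
).fetchone()[0]

# 'is null' is the correct test for missing values.
is_null = con_demo.execute(
    "select count(*) from demo_invoices where payment_date is null;"
).fetchone()[0]

print(equals_null, is_null)
con_demo.close()
```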

##### Class Task: Extract rows where the payment has been made

<i>Replacing NULL with "0000-00-00" using "ifnull"</i>

In [None]:
Query = """
          select ifnull(payment_date,'0000-00-00') as payment_v1
          from invoices;
        """
df = pd.read_sql_query(Query,con)
df
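
`ifnull` takes exactly two arguments; the standard SQL function `coalesce` does the same job and is available in most databases. The sketch below compares the two on a hypothetical `demo_invoices` table, using the made-up placeholder text `'not paid'` as the replacement value.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE demo_invoices (invoice_id INTEGER, payment_date TEXT)")
con_demo.executemany(
    "INSERT INTO demo_invoices VALUES (?, ?)", [(1, None), (2, "2019-06-30")]
)

rows = con_demo.execute(
    """
    select invoice_id,
           ifnull(payment_date, 'not paid')   as payment_v1,
           coalesce(payment_date, 'not paid') as payment_v2
    from demo_invoices
    order by invoice_id;
    """
).fetchall()

print(rows)
con_demo.close()
```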

##### Class Task: Extract the "invoices" table with a new column "Date_Diff" that holds the number of days between the invoice date and the payment date. Where no payment has been made, use the current date as the payment date