In [1]:
import sqlite3
import pandas as pd

### SQL

SQL stands for *structured query language* and is the de facto language to know for working with *relational databases*.

In this and the next notebook we will barely skim the surface of all that there is to know about relational databases and SQL.

In this course, we will use IPython magic commands to run SQL in our Jupyter Notebooks. To do this, we need to make sure that the necessary software is installed. To get the SQL magic commands to work on my machine, I had to run the following command, which you can run in a Jupyter Notebook code cell.

%pip install jupysql duckdb-engine --quiet

Once this is installed, we load the extension with the following command, which allows us to use SQL magic commands.

In [3]:
%load_ext sql


Now we create a connection object that connects to the built-in sqlite3 database engine. Once we have a connection object, we create a cursor. I believe (though I don't know for sure) that cursors get their name because, much like the cursor in a word processor, they keep track of where you are in the database.

In [4]:
connection = sqlite3.connect("HR.db")
cursor = connection.cursor()
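
As a minimal sketch of what a cursor does, the snippet below runs a query and then reads the result one step at a time. It assumes an in-memory database (`":memory:"`) and a hypothetical table named `demo`, both chosen only so that the example does not touch HR.db.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM,
# so this sketch does not modify HR.db.
demo_connection = sqlite3.connect(":memory:")
demo_cursor = demo_connection.cursor()

# "demo" is a hypothetical table used only for illustration.
demo_cursor.execute("create table demo (id int, label varchar(20))")
demo_cursor.execute("insert into demo values (1, 'first'), (2, 'second')")

# After a select, we fetch rows from the cursor; each fetch continues
# from where the previous one stopped.
demo_cursor.execute("select id, label from demo order by id")
first_row = demo_cursor.fetchone()        # the next row of the result
remaining_rows = demo_cursor.fetchall()   # whatever rows are left

print(first_row)
print(remaining_rows)

demo_connection.close()
```

The second fetch returns only the rows that the first fetch did not consume, which fits the idea of a cursor marking a position in the result.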

Next, we point the SQL magic at the same SQLite database file.

In [5]:
%sql sqlite:///HR.db

In [5]:
%config SqlMagic.displaylimit = 20

#### SQL Magic Commands

To run SQL commands in a Jupyter Notebook, we need to do one of two things:
+ Start each line with %sql
or
+ Start a code block with %%sql

Starting a code cell with %%sql tells Jupyter that the entire cell contains SQL.

#### Creating A Table

We now create a table in our HR database.

##### Example 1

In [6]:
%%sql

create table JOBS (
    ID INT,
    TITLE VARCHAR(50),
    MIN_SALARY INT,
    MAX_SALARY INT
);

insert into JOBS (ID, TITLE, MIN_SALARY, MAX_SALARY)
VALUES
(100, 'Sr. Architect', 60000, 100000),
(200, 'Sr. Software Developer', 60000, 80000),
(300, 'Sr. Designer', 60000, 80000),
(400, 'Sr. Architect', 60000, 100000),
(500, 'Sr. Software Developer', 60000, 80000),
(600, 'Sr. Designer', 60000, 80000),
(650, 'Sr. Architect', 60000, 100000),
(660, 'Sr. Software Developer', 60000, 80000),
(234, 'Sr. Designer', 60000, 80000),
(220, 'Sr. Architect', 60000, 100000);

$\Box$

#### Select Statement

##### Example 2

The * means all columns. Using * is equivalent to writing out every column name.

In [6]:
%sql select * from jobs;

ID,TITLE,MIN_SALARY,MAX_SALARY
100,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
300,Data Engineer,60000,500000
400,Sr. Architect,60000,100000
500,Data Scientist,60000,250000
600,Data Scientist,60000,250000
650,Sr. Architect,60000,100000
660,Sr. Software Developer,60000,80000
234,Data Analyst,60000,150000
220,Sr. Architect,60000,100000


When we write code to produce results from a relational database, the code is called a *query*.
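
The claim that * stands for every column can be checked with a small sketch. It assumes an in-memory database holding a two-row copy of the JOBS table (the rows are taken from Example 1) and uses the standard `sqlite3` module rather than the SQL magic.

```python
import sqlite3

# In-memory stand-in for HR.db, so nothing on disk is changed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table jobs (ID int, TITLE varchar(50), MIN_SALARY int, MAX_SALARY int)"
)
conn.executemany(
    "insert into jobs values (?, ?, ?, ?)",
    [
        (100, "Sr. Architect", 60000, 100000),
        (200, "Sr. Software Developer", 60000, 80000),
    ],
)

# Query 1: select every column with *.
star_cursor = conn.execute("select * from jobs order by ID")
column_names = [column[0] for column in star_cursor.description]
star_rows = star_cursor.fetchall()

# Query 2: list the same columns by name.
named_rows = conn.execute(
    "select ID, TITLE, MIN_SALARY, MAX_SALARY from jobs order by ID"
).fetchall()

print(column_names)
print(star_rows == named_rows)

conn.close()
```

Both queries return the same rows, and `cursor.description` shows which columns the * expanded to.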

$\Box$

##### Example 3

In [7]:
%sql select TITLE, MIN_SALARY from jobs;

TITLE,MIN_SALARY
Sr. Architect,60000
Sr. Software Developer,60000
Data Engineer,60000
Sr. Architect,60000
Data Scientist,60000
Data Scientist,60000
Sr. Architect,60000
Sr. Software Developer,60000
Data Analyst,60000
Sr. Architect,60000


$\Box$

#### The Where Clause

##### Example 4

In [6]:
%sql select ID, TITLE from jobs where MAX_SALARY > 90000;

ID,TITLE
100,Sr. Architect
300,Data Engineer
400,Sr. Architect
500,Data Scientist
600,Data Scientist
650,Sr. Architect
234,Data Analyst
220,Sr. Architect
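
When the threshold in a WHERE clause comes from a Python variable, one option is a parameterized query. The sketch below assumes an in-memory database with a trimmed-down, three-column version of the jobs table; `salary_threshold` is a hypothetical variable name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table jobs (ID int, TITLE varchar(50), MAX_SALARY int)")
conn.executemany(
    "insert into jobs values (?, ?, ?)",
    [
        (100, "Sr. Architect", 100000),
        (200, "Sr. Software Developer", 80000),
        (300, "Data Engineer", 500000),
    ],
)

# Hypothetical threshold; the ? placeholder is filled in by sqlite3,
# so the value never has to be pasted into the SQL string by hand.
salary_threshold = 90000
rows = conn.execute(
    "select ID, TITLE from jobs where MAX_SALARY > ? order by ID",
    (salary_threshold,),
).fetchall()

print(rows)

conn.close()
```

Only the rows whose MAX_SALARY exceeds the threshold are returned, just as in the query above.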


$\Box$

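OK, our dataset is a little boring. Let's change some rows using the UPDATE statement. (The outputs shown in Examples 2 through 4 were captured after these updates had been run, which is why they already show the new titles and salaries.)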

##### Example 5

In [16]:
%sql update jobs set title = 'Data Scientist', max_salary = 250000 where id = 500;

In [18]:
%sql update jobs set title = 'Data Scientist', max_salary = 250000 where id = 600;

In [19]:
%sql update jobs set title = 'Data Analyst', max_salary = 150000 where id = 234;

In [20]:
%sql update jobs set title = 'Data Engineer', max_salary = 500000 where id = 300;

In [21]:
%sql select * from jobs;

ID,TITLE,MIN_SALARY,MAX_SALARY
100,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
300,Data Engineer,60000,500000
400,Sr. Architect,60000,100000
500,Data Scientist,60000,250000
600,Data Scientist,60000,250000
650,Sr. Architect,60000,100000
660,Sr. Software Developer,60000,80000
234,Data Analyst,60000,150000
220,Sr. Architect,60000,100000
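
As a sketch of how the same kind of update looks through the `sqlite3` module, the snippet below changes two rows with one statement and checks how many rows were affected. It assumes an in-memory database and a trimmed-down jobs table, so HR.db is left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table jobs (ID int, TITLE varchar(50), MAX_SALARY int)")
conn.executemany(
    "insert into jobs values (?, ?, ?)",
    [
        (100, "Sr. Architect", 100000),
        (500, "Sr. Software Developer", 80000),
        (600, "Sr. Designer", 80000),
    ],
)

# One UPDATE can change several rows when the WHERE clause matches them all.
update_cursor = conn.execute(
    "update jobs set TITLE = ?, MAX_SALARY = ? where ID in (500, 600)",
    ("Data Scientist", 250000),
)
rows_changed = update_cursor.rowcount  # number of rows the UPDATE modified
conn.commit()                          # make the change permanent

updated = conn.execute(
    "select ID, TITLE, MAX_SALARY from jobs order by ID"
).fetchall()

print(rows_changed)
print(updated)

conn.close()
```

Checking `rowcount` is a cheap way to confirm that the WHERE clause matched the rows you intended before moving on.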


$\Box$

#### The Count Statement

##### Example 6

In [9]:
%sql select count(*) as 'num_rows' from jobs;

num_rows
10


$\Box$

#### String Patterns

We can use the LIKE keyword to look for patterns in a string. In a LIKE pattern, the % sign is a wildcard that matches any sequence of zero or more characters.

##### Example 7

In [10]:
%sql select * from jobs;

ID,TITLE,MIN_SALARY,MAX_SALARY
100,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
300,Data Engineer,60000,500000
400,Sr. Architect,60000,100000
500,Data Scientist,60000,250000
600,Data Scientist,60000,250000
650,Sr. Architect,60000,100000
660,Sr. Software Developer,60000,80000
234,Data Analyst,60000,150000
220,Sr. Architect,60000,100000


Suppose that we want to select all titles that end with the word Architect. (The pattern '%Architect%' would match the word anywhere in the title.)

In [11]:
%sql select title, max_salary from jobs where title like '%Architect';

TITLE,MAX_SALARY
Sr. Architect,100000
Sr. Architect,100000
Sr. Architect,100000
Sr. Architect,100000


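Maybe we are interested in jobs that have to do with data. In SQLite, LIKE is case-insensitive for ASCII characters by default, so the pattern 'data%' matches titles that start with 'Data'.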

In [12]:
%sql select title, max_salary from jobs where title like 'data%';

TITLE,MAX_SALARY
Data Engineer,500000
Data Scientist,250000
Data Scientist,250000
Data Analyst,150000
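
The position of the % wildcard changes what a pattern matches. The sketch below compares three patterns on an in-memory table; the title 'Architect Lead' and the helper `titles_like` are hypothetical and were added only to make the difference visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table jobs (TITLE varchar(50))")
conn.executemany(
    "insert into jobs values (?)",
    [
        ("Sr. Architect",),
        ("Architect Lead",),  # hypothetical title, not in the course data
        ("Data Analyst",),
        ("Sr. Software Developer",),
    ],
)

def titles_like(pattern):
    """Return the titles matching a LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "select TITLE from jobs where TITLE like ? order by TITLE", (pattern,)
    ).fetchall()
    return [row[0] for row in rows]

ends_with = titles_like("%Architect")       # title ends with Architect
contains = titles_like("%Architect%")       # Architect appears anywhere
starts_with_data = titles_like("data%")     # case-insensitive in SQLite

print(ends_with)
print(contains)
print(starts_with_data)

conn.close()
```

Note that the lowercase pattern still finds 'Data Analyst' because SQLite's LIKE ignores case for ASCII letters by default; other database systems may behave differently.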


$\Box$

#### Order By

We can write a query that returns a table ordered by a given criterion.

##### Example 8

In [13]:
%sql select * from jobs;

ID,TITLE,MIN_SALARY,MAX_SALARY
100,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
300,Data Engineer,60000,500000
400,Sr. Architect,60000,100000
500,Data Scientist,60000,250000
600,Data Scientist,60000,250000
650,Sr. Architect,60000,100000
660,Sr. Software Developer,60000,80000
234,Data Analyst,60000,150000
220,Sr. Architect,60000,100000


In [14]:
%sql select * from jobs order by ID;

ID,TITLE,MIN_SALARY,MAX_SALARY
100,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
220,Sr. Architect,60000,100000
234,Data Analyst,60000,150000
300,Data Engineer,60000,500000
400,Sr. Architect,60000,100000
500,Data Scientist,60000,250000
600,Data Scientist,60000,250000
650,Sr. Architect,60000,100000
660,Sr. Software Developer,60000,80000


By default, ORDER BY orders in ascending order.  Using the DESC keyword, we can order in descending order.

In [15]:
%sql select * from jobs order by ID desc;

ID,TITLE,MIN_SALARY,MAX_SALARY
660,Sr. Software Developer,60000,80000
650,Sr. Architect,60000,100000
600,Data Scientist,60000,250000
500,Data Scientist,60000,250000
400,Sr. Architect,60000,100000
300,Data Engineer,60000,500000
234,Data Analyst,60000,150000
220,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
100,Sr. Architect,60000,100000
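
ORDER BY also accepts more than one column, with the later columns breaking ties in the earlier ones. The following sketch, which assumes an in-memory table with a few rows from the jobs data, sorts by MAX_SALARY in descending order and then by ID in ascending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table jobs (ID int, TITLE varchar(50), MAX_SALARY int)")
conn.executemany(
    "insert into jobs values (?, ?, ?)",
    [
        (100, "Sr. Architect", 100000),
        (200, "Sr. Software Developer", 80000),
        (300, "Data Engineer", 500000),
        (400, "Sr. Architect", 100000),
    ],
)

# Highest MAX_SALARY first; rows with equal salaries are ordered by ID.
ordered = conn.execute(
    "select ID, MAX_SALARY from jobs order by MAX_SALARY desc, ID asc"
).fetchall()

print(ordered)

conn.close()
```

Jobs 100 and 400 share the same MAX_SALARY, so the second sort column decides that 100 comes before 400.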


$\Box$

#### Group By

SQL has a GROUP BY operation similar to groupby in pandas.

##### Example 9

In [10]:
%sql select * from jobs;

ID,TITLE,MIN_SALARY,MAX_SALARY
100,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
300,Data Engineer,60000,500000
400,Sr. Architect,60000,100000
500,Data Scientist,60000,250000
600,Data Scientist,60000,250000
650,Sr. Architect,60000,100000
660,Sr. Software Developer,60000,80000
234,Data Analyst,60000,150000
220,Sr. Architect,60000,100000


In [16]:
%sql select TITLE, count(*) from jobs group by TITLE;

TITLE,count(*)
Data Analyst,1
Data Engineer,1
Data Scientist,2
Sr. Architect,4
Sr. Software Developer,2


In the table above, the count(*) column could use another name. We can fix this using an *alias*.

In [17]:
%sql select TITLE, count(*) as 'num_employees' from jobs group by TITLE;

TITLE,num_employees
Data Analyst,1
Data Engineer,1
Data Scientist,2
Sr. Architect,4
Sr. Software Developer,2


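There are other aggregate functions we can use besides COUNT, including AVG, SUM, MIN, and MAX. SQL also provides scalar functions such as ROUND, LENGTH, and UPPER (called UCASE in some database systems, though not in SQLite), as well as the DISTINCT keyword for removing duplicate values.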

In [19]:
%sql select TITLE, avg(MAX_SALARY) as 'average_salary' from jobs group by TITLE;

TITLE,average_salary
Data Analyst,150000.0
Data Engineer,500000.0
Data Scientist,250000.0
Sr. Architect,100000.0
Sr. Software Developer,80000.0
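
To see what GROUP BY is doing, the sketch below computes several aggregates per title in one query and compares the counts with a manual tally in plain Python. It assumes an in-memory table with five rows; the aliases `num_jobs`, `lowest`, and `highest` are hypothetical names.

```python
import sqlite3
from collections import Counter

job_rows = [
    ("Sr. Architect", 100000),
    ("Sr. Architect", 100000),
    ("Data Scientist", 250000),
    ("Data Scientist", 250000),
    ("Data Analyst", 150000),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table jobs (TITLE varchar(50), MAX_SALARY int)")
conn.executemany("insert into jobs values (?, ?)", job_rows)

# Several aggregates can be computed for each group in a single query.
sql_summary = conn.execute(
    "select TITLE, count(*) as num_jobs, "
    "min(MAX_SALARY) as lowest, max(MAX_SALARY) as highest "
    "from jobs group by TITLE order by TITLE"
).fetchall()

# The same counts done by hand: put rows into buckets by title, then count.
manual_counts = Counter(title for title, _ in job_rows)

print(sql_summary)
print(dict(manual_counts))

conn.close()
```

The manual tally agrees with the count column, which is the sense in which GROUP BY splits the rows into buckets and applies the aggregate to each bucket.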


$\Box$

#### Implicit Joins

Now that we have been introduced to SQL and relational databases, we begin to delve into a much deeper topic: Joins.  In this notebook, we will only look at implicit joins.  In the next notebook, we will look at explicit joins.

A *join* is a way to combine information from two (or more) tables into one table.  Thus, we need another table in our database.  The code below accomplishes this task. 

In [21]:
employees = pd.read_csv('Employees.csv')
employees.columns = ['emp_ID', 'first_name', 'last_name', 'SSN', 'birthday', 'sex', 'address', 'job_ID', 'salary', 'manager_ID', 'dep_ID']

In [24]:
employees.to_sql('employees', connection, if_exists='replace', index=False,method="multi")

9

In [20]:
%sql select * from employees;

emp_ID,first_name,last_name,SSN,birthday,sex,address,job_ID,salary,manager_ID,dep_ID
E1002,Alice,James,123457,1972-07-31,F,"980 Berry ln, Elgin,IL",200,80000,30002,5
E1003,Steve,Wells,123458,1980-10-08,M,"291 Springs, Gary,IL",300,50000,30002,5
E1004,Santosh,Kumar,123459,1985-07-20,M,"511 Aurora Av, Aurora,IL",400,60000,30004,5
E1005,Ahmed,Hussain,123410,1981-04-01,M,"216 Oak Tree, Geneva,IL",500,70000,30001,2
E1006,Nancy,Allen,123411,1978-06-02,F,"111 Green Pl, Elgin,IL",600,90000,30001,2
E1007,Mary,Thomas,123412,1975-05-05,F,"100 Rose Pl, Gary,IL",650,65000,30003,7
E1008,Bharath,Gupta,123413,1985-06-05,M,"145 Berry Ln, Naperville,IL",660,65000,30003,7
E1009,Andrea,Jones,123414,1990-09-07,F,"120 Fall Creek, Gary,IL",234,70000,30003,7
E1010,Ann,Jacob,123415,1982-03-30,F,"111 Britany Springs,Elgin,IL",220,70000,30004,5


In [21]:
%sql select * from jobs;

ID,TITLE,MIN_SALARY,MAX_SALARY
100,Sr. Architect,60000,100000
200,Sr. Software Developer,60000,80000
300,Data Engineer,60000,500000
400,Sr. Architect,60000,100000
500,Data Scientist,60000,250000
600,Data Scientist,60000,250000
650,Sr. Architect,60000,100000
660,Sr. Software Developer,60000,80000
234,Data Analyst,60000,150000
220,Sr. Architect,60000,100000


#### Implicit Cartesian Join

The *Cartesian Product* of the sets $A = \{1,2\}$ and $B = \{3,4\}$ is the set $\{(1,3),(1,4),(2,3),(2,4)\}$.  Note that this product contains all possible first coordinates paired up with all possible second coordinates.

In the same way, a *Cartesian Join* pairs up every row in one table with every row in the other table.
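
The set example can be reproduced in Python with `itertools.product`. In the sketch below, the short `employee_names` and `job_titles` lists are hypothetical stand-ins for table rows.

```python
from itertools import product

# The sets from the definition above.
A = {1, 2}
B = {3, 4}
cartesian_product = set(product(A, B))
print(sorted(cartesian_product))

# The same idea applied to "rows": every name is paired with every title.
employee_names = ["Alice", "Steve", "Santosh"]      # hypothetical rows
job_titles = ["Sr. Architect", "Data Engineer"]     # hypothetical rows
pairs = list(product(employee_names, job_titles))
print(len(pairs))
```

The number of pairs is the product of the two sizes (here 3 times 2), which is why a Cartesian Join grows so quickly.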

##### Example 10

In [22]:
%sql select first_name, last_name, title from employees, jobs;

first_name,last_name,TITLE
Alice,James,Sr. Architect
Alice,James,Sr. Software Developer
Alice,James,Data Engineer
Alice,James,Sr. Architect
Alice,James,Data Scientist
Alice,James,Data Scientist
Alice,James,Sr. Architect
Alice,James,Sr. Software Developer
Alice,James,Data Analyst
Alice,James,Sr. Architect


Only the first few rows are shown here. Since employees has 9 rows and jobs has 10 rows, the Cartesian Join of employees and jobs has 90 rows, as the code below confirms.

In [23]:
%sql select count(*) from employees, jobs;

count(*)
90


The interesting and somewhat odd thing about the code

select first_name, last_name, title from employees, jobs;

is that we do not specify which tables first_name, last_name, and title come from. SQL resolves each name for us because each one appears in only one of the two tables. If the employees and jobs tables shared a column name, an unqualified reference to that column would be ambiguous, and SQLite would raise an error rather than guess.
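
What happens when two tables do share a column name can be checked directly. The sketch below assumes a hypothetical schema in which both tables have a column called ID (the course's employees table uses emp_ID instead) and runs against an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: both tables deliberately have an ID column.
conn.execute("create table employees (ID int, first_name varchar(20), job_ID int)")
conn.execute("create table jobs (ID int, TITLE varchar(50))")
conn.execute("insert into employees values (1, 'Alice', 200)")
conn.execute("insert into jobs values (200, 'Sr. Software Developer')")

# An unqualified ID could mean either table, so SQLite refuses to guess.
try:
    conn.execute("select ID from employees, jobs")
    error_message = None
except sqlite3.OperationalError as error:
    error_message = str(error)

# Qualifying each column with a table alias removes the ambiguity.
qualified = conn.execute(
    "select E.ID, J.ID from employees E, jobs J"
).fetchall()

print(error_message)
print(qualified)

conn.close()
```

This is one reason to get in the habit of prefixing columns with a table alias in any query that involves more than one table.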

The code below produces the same results, but is a little more specific about where the columns reside.

In [24]:
%sql select E.first_name, E.last_name, J.title from employees E, jobs J;

first_name,last_name,TITLE
Alice,James,Sr. Architect
Alice,James,Sr. Software Developer
Alice,James,Data Engineer
Alice,James,Sr. Architect
Alice,James,Data Scientist
Alice,James,Data Scientist
Alice,James,Sr. Architect
Alice,James,Sr. Software Developer
Alice,James,Data Analyst
Alice,James,Sr. Architect


$\Box$

#### Implicit Inner Join

The reason that our Cartesian Join in Example 10 contains all possible ways to concatenate rows from employees and rows from jobs is that we did not specify how these two tables are connected.  Looking at both tables, we see that the job_ID column in employees has the same information as the ID column in jobs.

In an *inner join* we specify the link, a *key*, that connects the info in one table to another. Only the combinations of rows whose key values match are kept.

##### Example 11

In [12]:
%sql select E.first_name, E.last_name, J.title from employees E, jobs J where E.job_ID = J.ID;

first_name,last_name,TITLE
Alice,James,Sr. Software Developer
Steve,Wells,Data Engineer
Santosh,Kumar,Sr. Architect
Ahmed,Hussain,Data Scientist
Nancy,Allen,Data Scientist
Mary,Thomas,Sr. Architect
Bharath,Gupta,Sr. Software Developer
Andrea,Jones,Data Analyst
Ann,Jacob,Sr. Architect


In [13]:
%sql select count(*) from employees E, jobs J where E.job_ID = J.ID;

count(*)
9
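
One way to think about an implicit inner join is as a Cartesian Join followed by a filter. The sketch below illustrates this on an in-memory database with two employees and three jobs (a small subset of the course data, trimmed to the columns needed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employees (first_name varchar(20), job_ID int)")
conn.execute("create table jobs (ID int, TITLE varchar(50))")
conn.executemany(
    "insert into employees values (?, ?)",
    [("Alice", 200), ("Steve", 300)],
)
conn.executemany(
    "insert into jobs values (?, ?)",
    [(100, "Sr. Architect"), (200, "Sr. Software Developer"), (300, "Data Engineer")],
)

# Without a WHERE clause we get every pairing: 2 employees x 3 jobs.
cartesian_count = conn.execute(
    "select count(*) from employees, jobs"
).fetchone()[0]

# The WHERE clause keeps only the pairings whose keys match.
joined = conn.execute(
    "select E.first_name, J.TITLE from employees E, jobs J "
    "where E.job_ID = J.ID order by E.first_name"
).fetchall()

print(cartesian_count)
print(joined)

conn.close()
```

Job 100 does not appear in the joined result because no employee has a matching job_ID; an inner join drops rows that have no partner in the other table.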


$\Box$

##### Exercise 1

A new table is loaded into the HR database below.

In [5]:
departments = pd.read_csv('Departments.csv')
departments.columns = ['dep_ID', 'dep_name', 'manager_ID', 'location_ID']

In [6]:
departments.to_sql('departments', connection, if_exists='replace', index=False,method="multi")

2

In [11]:
%sql select * from departments;

dep_ID,dep_name,manager_ID,location_ID
5,Software Group,30002,L0002
7,Design Team,30003,L0003
2,Architect Group,300001,L0001


Use an implicit inner join to display each employee's first name, last name, and the name of the department they work in.