Skip to content

FeynmanDNA/IntroToSQL_Socratica

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 

Repository files navigation

from Socratica 2019 SQL tutorials

Introduction

SQL generally ends with semi-colons.

In tables, there are columns with names ending in ID, these are numbers that uniquely identify each row of data. Such values are called "primary keys".

Generally you will use software to insert data.

The SELECT statement gives you the ability to choose data from multiple tables in a process called "JOINs".

A related solution would be to create a virtual table called a "View". This virtual table or table-like object collects matching data from multiple tables and make it easily accessible as if you are working with a single table.

Speed: when you execute a SELECT statement with a WHERE clause, the database may have to look at every single record in the table to find matching data. An Index can be created to ensure your queries are fast and efficient.

Transactions provide you the ability to make several changes and ensure that your data is safe if there is a problem part-way along the process.

So main features: Joins, Views, Indexes, Transactions

postgreSQL for linux

We need a relational database, because we use SQL to talk to such systems.

Transferrable Skills:

  • SQL
  • Schema Design
  • Database Optimization

Postgres mascot is elephant, because "elephants never forget"

install postgreSQL client and pgadmin4 for debian

IMPORTANT: set up postgres user and password for pgadmin4

sudo su postgres
postgres@debian:psql
postgres=# \password postgres (password = xxx)

using pgadmin4

in pgadmin4, create new server: name = localhost, connection/hostname = localhost

Postgres GUI tool is pgadmin

pgAdmin sidebar shows "server group", a server group is a container for organizing your servers.

whenever you make changes in the pgadmin, look at the SQL tab.

To see how many tables there are, open the "schemas" folder. By default database has a public schema.

SELECT

databases can have many many tables

SELECT can retrieve data from one table or multiple tables with a technique called JOINS

the earthquake csv is from US Geological Science Survey, it contains:

  • Event Type: Earthquake
  • Magnitude Range: 5.5+
  • Date Range: 1969 - 2018

This table contains 10 columns: earthquake_id, occurred_on, latitude, longitude, depth, magnitude, calculation_method, network_id, place, and cause

earthquake_id is the primary key of the table, primary key is a value that uniquely identifies each row in the table

how to import the 'earthquake' table:

  1. in pgAdmin: Tools - Query Tool

  2. input this in the Query Editor:

    CREATE TABLE public.earthquake
    (
        earthquake_id integer NOT NULL,
        occurred_on timestamp without time zone,
        latitude numeric,
        longitude numeric,
        depth numeric,
        magnitude numeric,
        calculation_method character varying,
        network_id character varying,
        place character varying,
        cause character varying,
        CONSTRAINT earthquake_pkey PRIMARY KEY (earthquake_id)
    )
    WITH (
        OIDS = FALSE
    )
  3. now execute (F5)

  4. right-click the 'earthquake' table

  5. use Import (make sure the 'Header' switch is set to 'Yes')

 "/usr/bin/psql" --command " "\\copy public.earthquake (earthquake_id, occurred_on, latitude, longitude, depth, magnitude, calculation_method, network_id, place, cause) FROM '/home/svd/Documents/Learn/SQL/earthquake.csv' CSV HEADER QUOTE '\"' ESCAPE '''';""

Q: How to see the whole table? *

SELECT * FROM earthquake;

this has two parts, Select xxx lists the columns you want data for, the FROM xxx specifies which tables to select data from.

Q: How many rows are there in this table? COUNT

there are functions you can use in your queries~

SELECT COUNT(*) FROM earthquake;

for this query, we select a COUNT rather than the data itself.

Notice that the name of the columnin the rowset is "count", this is the name of the function, not the name of any column in our table

speed:

  • SELECT *: 279 msec
  • SELECT COUNT(*): 104 msec

Q: specific select with WHERE

SELECT magnitude, place, occurred_on FROM earthquake;

to see the specific columns (the column order does not matter in the query)

In addition to the SELECT and FROM keywords, there is a third valuable part of queries: WHERE

  • SELECT: columns
  • FROM: table
  • WHERE: rows condition AND condition OR condition
  • ORDER BY: asc/desc
  • LIMIT: restrict returned rows

Q: all earthquakes that occurred on or after jan 1 2000:

SELECT *
FROM earthquake
WHERE occurred_on >= '2000-01-01';

Q: What was the largest earthquake in 2010?

SELECT *
FROM earthquake
WHERE occurred_on >= '2010-01-01' AND occurred_on <= '2010-12-31'
ORDER BY magnitude DESC
LIMIT 1;

To restrict the number of rows returned, use LIMIT keyword

Q: how to see the columns of a database?

we only need to select 1 row to see all the columns SELECT * FROM earthquake LIMIT 1;

Q: SQL functions:

some other common SQL functions are COUNT, MIN, MAX, AVG, SUM

SELECT MIN(occurred_on), MAX(occurred_on) FROM earthquake;

Q: What magnitude range is covered by the table?

SELECT MIN(magnitude), MAX(magnitude) FROM earthquake;

Q: what are the causes for earthquake? DISTINCT

many earthquakes have the same cause, so a lot of duplicates

SELECT DISTINCT will make sure we do not see duplicate rows

SELECT DISTINCT cause FROM earthquake;

Q: how many earthquakes are caused by nuclear explosion?

SELECT COUNT(*) FROM earthquake WHERE cause='nuclear explosion';

NOTE: SQL ' and " are different! you should use SINGLE QUOTE: '

Q: Find the most recent earthquake caused by a nuclear explosion?

SELECT magnitude, place, occurred_on
FROM earthquake
WHERE cause = 'nuclear explosion'
ORDER BY occurred_on DESC
LIMIT 5;

Q: How can we count the number of aftershocks?

Idea; find quakes with "Honshu" and "Japan" in the 'place' text and occurred within a week of the March 11th quake

SELECT * FROM earthquake WHERE place LIKE '%Honshu%Japan%'
AND occurred_on BETWEEN '2011-03-11' AND '2011-3-18';

'%x' is a string pattern, the % symbol matches zero or more characters

BETWEEN...AND is another operator

INSERT

The sample database will consist of several tables:

  • chitter_user table
    • columns: user_id, username, encrypted_password, email, and date_joined
    • user_id will be the auto-generated primary key
  • post table
    • columns: post_id, user_id, post_text, and posted_on
    • post_id will be the auto-generated primary key
  • follower table
    • columns: user_id and follower_id

for chitter_user table:

CREATE TABLE public.chitter_user
(
    username text,
    user_id serial NOT NULL,
    encrypted_password text,
    email text,
    date_joined timestamp without time zone,
    PRIMARY KEY (user_id)
)
WITH (
    OIDS = FALSE
);

ALTER TABLE public.chitter_user
    OWNER to postgres;

for post table:

CREATE TABLE public.post
(
    post_id serial NOT NULL,
    user_id integer,
    post_text text,
    posted_on timestamp without time zone DEFAULT current_timestamp,
    PRIMARY KEY (post_id),
    CONSTRAINT user_id_constraint FOREIGN KEY (user_id)
        REFERENCES public.chitter_user (user_id) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE CASCADE
)
WITH (
    OIDS = FALSE
);

ALTER TABLE public.post
    OWNER to postgres;

for follower table:

CREATE TABLE public.follower
(
    user_id integer NOT NULL,
    follower_id integer NOT NULL,
    PRIMARY KEY (user_id, follower_id),
    CONSTRAINT user_id_constraint FOREIGN KEY (user_id)
        REFERENCES public.chitter_user (user_id) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE CASCADE,
    CONSTRAINT follower_id_constraint FOREIGN KEY (follower_id)
        REFERENCES public.chitter_user (user_id) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE CASCADE
)
WITH (
    OIDS = FALSE
);

ALTER TABLE public.follower
    OWNER to postgres;

the listed columns for which you have data and the VALUES must be in the same order, the order of the data must match the order of the columns

user DEFAULT keyword as the user_id, this is because our database will generate this value for us

INSERT INTO chitter_user
	(user_id, username, encrypted_password, email, date_joined)
VALUES
	(DEFAULT, 'firstuser', 'd3dasdfi3', '1@1.com', '2019-02-21');

now try add user without specifying all fields, only insert for username and encrypted password

INSERT INTO chitter_user
    (username, encrypted_password)
VALUES
    ('seconduser', '878732ddui');

the user_id is generated for us, but other columns are NULL because we did not provide the values

insert into post table, using comma , to insert multiple entries

INSERT INTO post
    (user_id, post_text)
VALUES
    (1, 'hi, first post!'),
    (1, 'second post!!');

speed: measure speed of INSERT queries TEST: insert 10,000 rows in two different ways, using Python with a Postgres database

in order to talk to the database, we will use the popular 'psyco PG2' module import psycopg2

NOTE: in order for psycopg2 to work, you need to change pg_hba.conf file (/etc/postgresql/9.1/main/pg_hba.conf*).

This line:

local   all             postgres                                peer

Should be:

local   all             postgres                                md5

so the login by password can be done via psycopg2

then restart postgresql sudo service postgresql restart

single query method took: 1.3434667587280273 seconds

big query method took: 0.11228609085083008 seconds

BUT NOTE: when inserting large volumes of data, the way you word your queries can make a big impact on how quickly they will run.

UPDATE statement

a sample secret_user database, with the following schema: columns: user_id, first_name, last_name, code_name, country, organization, salary, and knows_kung_fu

to update, use UPDATE tablename SET clause, and specify the ROWS to make this change. Use "WHERE" clause to specify rows to update

UPDATE secret_user
SET first_name = 'James'
WHERE user_id = 1;

like insert, you can make changes to multiple columns with one query using comma

UPDATE secret_user
SET code_name = 'Neo 2.0', salary = 115000
WHERE first_name = 'Jack' AND last_name = 'Ryan;
UPDATE table_name
SET column1 = value1, column2 = value2, column3 = value3, ...
WHERE conditions for rows;

if you do not include a WHERE clause, the change will apply to EVERY ROW in the table!!!

New operator: IN

UPDATE secret_user
SET knows_kung_fu = TRUE
WHERE user_id IN (5,7,8);

You can do calculations in the query!

UPDATE secret_user
SET salary = 1.1*salary;

Calculate the sum of the salaries:

SELECT SUM(salary)
FROM secret_user;

DELETE statements

the sample song table has 8 columns: song_id (primary key), title, artist, album, year_released, duration, tempo, loudness

-- Table: public.song

-- DROP TABLE public.song;

CREATE TABLE public.song
(
    song_id integer NOT NULL,
    title character varying COLLATE pg_catalog."default",
    artist character varying COLLATE pg_catalog."default",
    album character varying COLLATE pg_catalog."default",
    year_released integer,
    duration numeric,
    tempo numeric,
    loudness numeric,
    CONSTRAINT song_pkey PRIMARY KEY (song_id)
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.song
    OWNER to postgres;

See the range of songs first:

SELECT MIN(year_released), MAX(year_released) FROM song;

The Min(year_released) is 0! Let's see the distinct values for year released and order by year

SELECT DISTINCT year_released
FROM song
ORDER BY year_released ASC;

How many songs have an unknown release year, thus 0 in the column? 5320 of them!

SELECT COUNT(*)
FROM song
WHERE year_released = 0;

We should remove these songs from our table.

DELETE FROM song
WHERE year_released= 0;

IMPORTANT NOTE: a DELETE query is used delete ROWS! NOT COLUMNS!, the WHERE clause determines which rows will be deleted

If you do not specify WHERE, it will DELETE ALL ROWS in your table.

Q: Has tempo changed over time?

To answer this, we will select the year released and the average tempo from the song table. we will GROUP the rows by the year released and average tempo on a year-by-year basis.

SELECT year_released, AVG(tempo)
FROM song
GROUP BY year_released
ORDER BY year_released;

This query aggregate the data by year_released, and then return a single row for each batch of rows.

ROUND will round the numeric value to integer

Comment in SQL

single line comment: --

multi-line comment /* .... */

JOINs

five tables for this demo:

  • martian: all people living on Mars
    • columns: martian_id, first_name, last_name, base_id, super_id
    • super_id is the ID of the Martian they report to
  • base: all habitats on Mars
    • columns: base_id, base_name, founded
    • linked with martian table because each Martian has a base_id idenfitying where they live and work
  • visitor: tracking all current visitors to Mars
  • inventory: supplies available at each base
  • supply: items available at Central Martian Storage Distribution Center

Q: How do you generate a report listing the full names of all Martians and the name of their home base?

The Martian name is the martian table, while the base name is the base table. The solution is to join these two tables together by the base_id

BUT WHEN a martian's base_id is NULL, do we include him in the JOIN, or do we leave him out? The answer depends on the TYPE OF JOIN you do. For now, we will leave him out.

Having joined these two tables, you can select data from these rows like an ordinary select query.

SELECT *
FROM martian
INNER JOIN base
ON martian.base_id = base.base_id;

Here martian is the "Left Table" and base is the "Right Table"

Specify HOW to connect two rows from these tables with an ON clause. In the ON clause you have to specify both table name and column name.

Q: How do you tell SQL whether or not to include rows from a table that do not have a match in the other?

We have two tables, and we have two choices per table: include rows without a match or exclude them -> so 4 options in total. SQL has a JOIN for each of the four options!

SELECT column1, column2, ...
FROM leftTable
____ JOIN rightTable
ON leftTable.xxx = rightTable.xxx
WHERE condition(s)
ORDER BY value;

Four types of JOINs:

  1. INNER: only return connected, matching rows
  2. LEFT: returns all connected rows, and unconnected rows from left table (nulls for ALL the columns in the right table)
  3. RIGHT: reverse of LEFT, null values will be used for ALL columns in the left table if no match
  4. FULL: or FULL OUTER == LEFT + RIGHT, nulls will fill the gap

NOTE: not every database supports the four joins, for example, MySQL and SQLite do not support FULL JOINs

To avoid ambiguity, you specify the table AND column name in the SELECT clause: martian.martian_id, base.base_id

ALIAS

create an "alias" for table name, use the AS keyword

SELECT m.martian_id, b.base_id, b.base_name
FROM martian AS m
INNER JOIN base AS b
ON m.base_id = b.base_id

you can give column names aliases too

SELECT v.first_name AS visitor_fn, v.last_name AS visitor_ln,
    m.first_name AS martian_fn, m.last_name AS martian_ln
FROM visitor AS v
LEFT JOIN martian AS m
ON v.host_id = m.host_id;

JOIN table with itself (Self Join)

left table is martian, and the right table is their boss (super id) whom they report to

You want a report of EVERY martian, but not every martian has a superior, so we can use LEFT JOIN:

SELECT m.first_name AS fn, m.last_name AS ln,
       s.first_name AS super_fn, s.last_name AS super_ln
FROM martian AS m
LEFT JOIN martian AS s
ON m.super_id = s.martian_id
ORDER BY m.martian_id;

Inventory report

the two tables containing the data we need:

  • inventory: supplies available at each base
  • supply: items available at Central Martian Storage Distribution Center

we want name of the supplies and what items they do NOT have in storage, with sub-query

SELECT *
FROM (SELECT * FROM inventory WHERE base_id = 1) AS i
RIGHT JOIN supply AS s
ON i.supply_id = s.supply_id
ORDER BY s.supply_id;

Views

you can create virtual tables to avoid super-long queries

How to create a view:

  1. write a query returning the data to appear in the view
  2. above write CREATE VIEW and give your view a name AS the query
CREATE VIEW new_view_name AS
SELECT column1, column2, ...
FROM table_name;

View benefit: security and permission to a view instead of whole table, simplicity,

Union

To have the list of ALL people, both martian and visitor on Mars:

One way to combine two queries is the UNION command

You only need to be sure to line up columns of the same type in your two SELECT queries

CREATE VIEW people_on_mars AS
SELECT CONCAT('m', martian_id) AS id, first_name, last_name, 'Martian' AS status
FROM martian_public
    UNION
SELECT CONCAT('v', visitor_id) AS id, first_name, last_name, 'Visitor' AS status
FROM visitor;

Cross join and Coalesce

Task: create a VIEW called "base_storage" showing the quantity of all supply items in stock at each base

For all possible combinations of Base and supply item, with a CROSS JOIN: Performs cross product between two tables

Cross join connects each row from the first table with each row in the second table. This gives us all possible base/supply combinations. 5 bases x 10 supply items = 50 possible combinations

CREATE VIEW base_storage AS
SELECT b.base_id, s.supply_id, s.name,
    COALESCE(
        (SELECT quantity FROM inventory
         WHERE base_id = b.base_id AND supply_id = s.supply_id),
    0) AS quantity
FROM base AS b
CROSS JOIN supply AS s;

coalesce(a1, a2, a3 ...) returns the first non-null argument

Indexes in SQL

index = how an index works in a book

When you create an index, the database will generate a method to rapidly find data based on one or more columns

eg1. create index on first_name column of person table

CREATE INDEX person_first_name_idx
ON person (first_name);

It took over four minutes to build the index for the first-name column

The database will do its work not in the order we wrote the queries!

Query Optimizer, the database will consider all possible ways to execute your query, then chooses the optimal path!

Each query possibility is a query plan

A database can have more than one index, and an index can be made for more than one colum.

CREATE INDEX person_first_name_last_name_idex
ON person (first_name, last_name);

This index took over 7 minutes to build

indexes are not free!

  • Just as indexes in a book require paper, indexes in a database require storage
  • Furthermore, when you add data to a database, it needs to create new records AND update ALL RELEVANT the indexes

About

SQL playlist from YouTube Socratica: http://bit.ly/Socratica_SQL

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages