# SQL Tutorial

This Notebook uses the *Cellphone Recommendations* example from Kaggle.com


https://www.kaggle.com/datasets/meirnizri/cellphones-recommendations

About Dataset - from the above website (Accessed 16/11/2022):

*This dataset contains three files:*

*The cellphone data.csv contains data on the most popular cell phones in the US in 2022. The data for each cell phone consists of the most notable features such as performance rating (AnTuTu), memory size, camera's resolution, battery size, screen size, release date, etc. The price of each cell phone collected from Amazon and Best-Buy (in Aug 22). Overall, in our dataset there are 34 cell phones with 13 features.*

*The user's data and their ratings are in cellphones users.csv and cellphones ratings.csv. To elicit the ratings, we conducted a survey on Mechanical Turk. Each participant was presented with 10 random cell phones, and she was asked to indicate how likely she is to purchase each of the cell phones at the given price, on a scale from 1 (very unlikely) to 10 (very likely). We also asked each participant to add personal information: age, gender, and occupation.*

*This dataset can be used for building a recommendation system model that relies mainly on the features of the items.*

Enable access to the PostgreSQL database engine via SQL cell magic.

In [None]:
%load_ext sql

Use the sql_init.ipynb file provided by Tutorial 08.2 to log in (user tm351).

In [None]:
# Make the connection - this file is available from the Notebooks 08 folder

%run sql_init.ipynb

If this has run properly it will have set up a connection string (DB_CONNECTION_STRING), which can be used to create a connection to the database (DB_CONNECTION).

See the Notebooks in Part 08 for further examples.

In [None]:
DB_CONNECTION_STRING

In [None]:
import sqlalchemy 
DB_CONNECTION=sqlalchemy.create_engine(DB_CONNECTION_STRING)
DB_CONNECTION

In [None]:
%sql DB_CONNECTION

Drop the tables if they already exist. The IF EXISTS clause means this cell is safe to run even when they do not:

In [None]:
%%sql
/* need to drop cellphone_ratings first, since it is related to the other two tables */
DROP TABLE IF EXISTS cellphone_ratings;
DROP TABLE IF EXISTS cellphone_data;
DROP TABLE IF EXISTS cellphone_users;


The cellphones dataset is stored in three CSV files. 

*Notebook 08.3 Adding column constraints to tables* shows how we can load these into a pandas DataFrame and then convert it to an SQL table. Reading CSV files into a pandas DataFrame is first shown in *Notebook 02.2.1 Data file formats - CSV*.

First import pandas:

In [None]:
import pandas as pd

Next import the CSV files into pandas DataFrames:

In [None]:
# Import the 'cellphones data.csv' file into a DataFrame (named cellphone rather than cellphones)
cellphone_data_df=pd.read_csv('./data/cellphones data.csv',
                       parse_dates=['release date'])

#Look at the first few rows of the resulting DataFrame
cellphone_data_df.head()

In [None]:
# change spaces to underscore (_) in column names, makes life easier when querying the tables

cellphone_data_df = cellphone_data_df.rename(columns={'operating system': 'operating_system', 'internal memory': 'internal_memory',
                                                       'main camera' : 'main_camera', 'selfie camera' : 'selfie_camera',
                                                       'battery size' : 'battery_size','screen size' : 'screen_size',
                                                       'release date' : 'release_date'})

cellphone_data_df.head()

In [None]:
# Import the cellphones ratings.csv file into a DataFrame
cellphone_ratings_df=pd.read_csv('./data/cellphones ratings.csv')

#Look at the first few rows of the resulting DataFrame
cellphone_ratings_df.head()

In [None]:
# Import the cellphones users.csv file into a DataFrame
cellphone_users_df=pd.read_csv('./data/cellphones users.csv')

#Look at the first few rows of the resulting DataFrame
cellphone_users_df.head()

Now convert to tables

pandas can write a DataFrame directly to a PostgreSQL table using the DataFrame's to_sql() method.

See Notebook 08.2 for some examples


In [None]:
cellphone_data_df.to_sql('cellphone_data',
                  DB_CONNECTION,
                  if_exists='replace',
                  index=False
                  )

In [None]:
cellphone_ratings_df.to_sql('cellphone_ratings',
                  DB_CONNECTION,
                  if_exists='replace',
                  index=False
                  )

In [None]:
cellphone_users_df.to_sql('cellphone_users',
                  DB_CONNECTION,
                  if_exists='replace',
                  index=False
                  )

Check that the tables have been created correctly.

In [None]:
%%sql
SELECT * FROM cellphone_users;

In [None]:
%%sql
SELECT * FROM cellphone_ratings;

In [None]:
%%sql
SELECT * FROM cellphone_data;

When there are a lot of rows, LIMIT can be used to restrict how many are returned, similar to the DataFrame .head() method:

In [None]:
%%sql

SELECT *
FROM cellphone_data
LIMIT 5;

You can use OFFSET to skip n records before applying the LIMIT:

In [None]:
%%sql

SELECT *
FROM cellphone_data
LIMIT 5 OFFSET 5;

LIMIT does not take negative arguments, so you cannot ask directly for, say, the last 5 records. Instead, sort the records in ascending or descending order and take the first n to achieve the same effect:

In [None]:
%%sql
SELECT * FROM cellphone_ratings
ORDER by rating, cellphone_id
LIMIT 10;

In [None]:
%%sql
/*  
    Note if we want both columns in descending order, you need to use DESC twice, otherwise it will default to ascending
*/
SELECT * FROM cellphone_ratings
ORDER by rating DESC, cellphone_id DESC
LIMIT 10;
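
The LIMIT, OFFSET and ORDER BY behaviour described above can be tried without a database server. The following is a minimal sketch, assuming an in-memory SQLite database from Python's standard library rather than the PostgreSQL server used in this notebook. The `ratings` table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table standing in for cellphone_ratings
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, cellphone_id INTEGER, rating INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(u, c, (u + c) % 10 + 1) for u in range(1, 4) for c in range(1, 5)],
)

# LIMIT sets the page size, OFFSET skips rows; ORDER BY fixes the row order
page_1 = conn.execute(
    "SELECT user_id, cellphone_id FROM ratings "
    "ORDER BY user_id, cellphone_id LIMIT 5 OFFSET 0"
).fetchall()
page_2 = conn.execute(
    "SELECT user_id, cellphone_id FROM ratings "
    "ORDER BY user_id, cellphone_id LIMIT 5 OFFSET 5"
).fetchall()

# "Last 5" rows: sort in descending order, then take the first 5
last_5 = conn.execute(
    "SELECT user_id, cellphone_id FROM ratings "
    "ORDER BY user_id DESC, cellphone_id DESC LIMIT 5"
).fetchall()

print(page_1)
print(page_2)
print(last_5)
```

Without an ORDER BY, SQL makes no promise about the order in which rows come back, so paging with LIMIT and OFFSET is only repeatable when the sort order is stated explicitly.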

You can use the data dictionary tables to check that the tables have been created:

In [None]:
%%sql
/* check if tables created */

SELECT *
FROM information_schema.tables
WHERE table_type = 'BASE TABLE' and table_schema <> 'pg_catalog'
and table_name LIKE 'cellphone%';

The SQL *INSERT* command can also be used to add data.

We can check the structure of the tables first

In [None]:
%reload_ext schemadisplay_magic

In [None]:
%schema --connection_string $DB_CONNECTION_STRING -t cellphone_data

In [None]:
%schema --connection_string $DB_CONNECTION_STRING -t cellphone_ratings

In [None]:
%schema --connection_string $DB_CONNECTION_STRING -t cellphone_users

Before making any changes to the data, let's add some constraints, such as a PRIMARY KEY for each table:


In [None]:
%%sql

ALTER TABLE cellphone_data
ADD CONSTRAINT cellphone_data_pk
    PRIMARY KEY(cellphone_id);

In [None]:
%%sql

ALTER TABLE cellphone_users
ADD CONSTRAINT cellphone_users_pk
    PRIMARY KEY(user_id);

cellphone_ratings needs a composite key, since a user may have rated more than one cellphone and a cellphone may have been rated by more than one user.

In [None]:
%%sql

ALTER TABLE cellphone_ratings
ADD CONSTRAINT cellphone_ratings_pk
    PRIMARY KEY(user_id, cellphone_id);

Next, add foreign keys to cellphone_ratings that reference cellphone_users and cellphone_data:

In [None]:
%%sql

ALTER TABLE cellphone_ratings
ADD CONSTRAINT cellphone_users_fk
    FOREIGN KEY(user_id) REFERENCES cellphone_users(user_id);

In [None]:
%%sql

ALTER TABLE cellphone_ratings
ADD CONSTRAINT cellphone_data_fk
    FOREIGN KEY(cellphone_id) REFERENCES cellphone_data(cellphone_id);

In [None]:
%schema --connection_string $DB_CONNECTION_STRING -t cellphone_data

In [None]:
%%sql
/* add a user and their rating for a phone */
INSERT INTO cellphone_users VALUES (300, 61, 'Female', 'OU Associate Lecturer');
INSERT INTO cellphone_ratings VALUES (300, 8, 9);
/* add a user who has not made a rating and a phone that has not been rated yet */
INSERT INTO cellphone_users VALUES (350, 40, 'Male', 'Contract Administrator');
INSERT INTO cellphone_data VALUES (40, 'OPPO', 'A79', 'Android', 128, 8, 5.76, 50, 2, 5000, 6.72, 218, 154.99, '2023-10-28');
COMMIT;

Phone spec taken from: https://specs-tech.com/en/oppo-a79/ and price Amazon UK 14/11/23

Test that the primary key and foreign keys work

The following three inserts should generate an integrity error - can you see why?


In [None]:
%%sql
INSERT INTO cellphone_users VALUES (300, 21, 'Male', 'IT Consultant');

In [None]:
%%sql
INSERT INTO cellphone_ratings VALUES (400, 9, 9);

In [None]:
%%sql
INSERT INTO cellphone_ratings VALUES (300, 50, 9);
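
The three kinds of failure above can be reproduced in isolation. This is a minimal sketch, assuming an in-memory SQLite database instead of PostgreSQL, with made-up, cut-down tables (`users`, `phones`, `ratings`). The error messages will differ from PostgreSQL's, but the causes are the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE phones (cellphone_id INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE ratings (
        user_id INTEGER REFERENCES users(user_id),
        cellphone_id INTEGER REFERENCES phones(cellphone_id),
        rating INTEGER,
        PRIMARY KEY (user_id, cellphone_id)
    );
    INSERT INTO users VALUES (300, 61);
    INSERT INTO phones VALUES (8, 'Samsung');
    INSERT INTO ratings VALUES (300, 8, 9);
""")

# Each statement breaks one constraint
attempts = {
    "duplicate primary key": "INSERT INTO users VALUES (300, 21)",
    "unknown user_id": "INSERT INTO ratings VALUES (400, 8, 9)",
    "unknown cellphone_id": "INSERT INTO ratings VALUES (300, 50, 9)",
}

errors = {}
for label, statement in attempts.items():
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as exc:
        errors[label] = str(exc)
        print(f"{label}: {exc}")

# None of the rejected rows were stored
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
rating_count = conn.execute("SELECT COUNT(*) FROM ratings").fetchone()[0]
```

The first insert reuses an existing primary key value, while the other two refer to a user or a phone that does not exist in the parent table.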

Some queries
=======

Simple selects


In [None]:
%%sql
SELECT * FROM cellphone_data;

In [None]:
%%sql
/* restrict columns */

SELECT cellphone_id, brand, model FROM cellphone_data;

In [None]:
%%sql
/* restrict rows */

SELECT * FROM cellphone_data WHERE brand = 'Samsung';

In [None]:
%%sql
/* combination */

SELECT cellphone_id, brand, model FROM cellphone_data
WHERE brand = 'Apple';

Joins
===

In [None]:
%%sql
/* traditional way to join tables, using table aliases */

SELECT cr.cellphone_id, brand, model, rating 
FROM cellphone_data cd, cellphone_ratings cr
WHERE cd.cellphone_id = cr.cellphone_id;

Hint: always check the number of rows returned from a join. When joining on a foreign key, as here, you should never get more rows than there are in the largest of the tables used (cellphone_ratings in this case). If you do, check that the tables are joined correctly.

In [None]:
%%sql
/* traditional way to join, restricting to Samsung only */

SELECT cr.cellphone_id, brand, model, rating 
FROM cellphone_data cd, cellphone_ratings cr
WHERE cd.cellphone_id = cr.cellphone_id
AND brand = 'Samsung';


In [None]:
%%sql
/* ANSI join */

SELECT cr.cellphone_id, brand, model, rating 
FROM cellphone_data cd JOIN cellphone_ratings cr 
ON cd.cellphone_id = cr.cellphone_id
AND brand = 'Samsung';


In [None]:
%%sql
/* ANSI join */

SELECT cu.user_id, age, occupation, cr.user_id, cellphone_id, rating 
FROM cellphone_users cu JOIN cellphone_ratings cr 
ON cu.user_id = cr.user_id;

In [None]:
%%sql
SELECT COUNT(*) AS data_count FROM cellphone_data;

In [None]:
%%sql
SELECT COUNT(*) AS ratings_count FROM cellphone_ratings;

In [None]:
%%sql
SELECT COUNT(*) AS users_count FROM cellphone_users;

In [None]:
%%sql
/*
    outer join to include cellphones with no ratings
*/

SELECT cellphone_data.cellphone_id, brand, model, cellphone_ratings.cellphone_id, rating 
FROM cellphone_data LEFT OUTER JOIN cellphone_ratings 
ON cellphone_data.cellphone_id = cellphone_ratings.cellphone_id;

In [None]:
%%sql
/* outer join to show only cellphones with no ratings:
   the IS NULL test must go in the WHERE clause, not the ON clause */

SELECT cellphone_data.cellphone_id, brand, model, cellphone_ratings.cellphone_id, rating 
FROM cellphone_data LEFT OUTER JOIN cellphone_ratings 
ON cellphone_data.cellphone_id = cellphone_ratings.cellphone_id
WHERE cellphone_ratings.cellphone_id IS NULL;


Our newly added OPPO phone appears here because it has no ratings yet.
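
Where the IS NULL test is placed matters in an outer join. The following minimal sketch assumes an in-memory SQLite database and two made-up tables (`phones`, `ratings`). It compares the test in the ON clause with the same test in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phones (cellphone_id INTEGER PRIMARY KEY, model TEXT);
    CREATE TABLE ratings (user_id INTEGER, cellphone_id INTEGER, rating INTEGER);
    INSERT INTO phones VALUES (1, 'rated phone'), (2, 'unrated phone');
    INSERT INTO ratings VALUES (10, 1, 7), (11, 1, 9);
""")

# Test in the ON clause: no ratings row can satisfy the equality and IS NULL
# together, so nothing matches and every phone is kept, padded with NULLs.
in_on = conn.execute("""
    SELECT p.cellphone_id, r.rating
    FROM phones p LEFT OUTER JOIN ratings r
      ON p.cellphone_id = r.cellphone_id AND r.cellphone_id IS NULL
    ORDER BY p.cellphone_id
""").fetchall()

# Test in the WHERE clause: applied after the join, so only unmatched phones remain.
in_where = conn.execute("""
    SELECT p.cellphone_id, r.rating
    FROM phones p LEFT OUTER JOIN ratings r
      ON p.cellphone_id = r.cellphone_id
    WHERE r.cellphone_id IS NULL
    ORDER BY p.cellphone_id
""").fetchall()

print(in_on)     # both phones, each with a NULL rating
print(in_where)  # only the unrated phone
```

With the test in the ON clause, even the rated phone looks unrated, which is why the filter belongs in the WHERE clause when the aim is to list only the unmatched rows.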

In [None]:
%%sql
/*  outer join
    to include users who have not rated any phones
    our added user is at the end
*/

SELECT cu.user_id, age, gender, rating
FROM cellphone_ratings cr RIGHT OUTER JOIN cellphone_users cu 
ON cr.user_id = cu.user_id ;

In [None]:
%%sql
/*  outer join
    to only show users who have not rated any phones
    (the IS NULL test goes in the WHERE clause and checks the ratings side of the join)
*/

SELECT cu.user_id, age, gender, rating
FROM cellphone_ratings cr RIGHT OUTER JOIN cellphone_users cu ON cr.user_id = cu.user_id 
WHERE cr.user_id IS NULL
ORDER BY cu.user_id;

The other type of OUTER JOIN is a *FULL OUTER JOIN*, which would be useful if we had some rows in the cellphone_ratings table that did not match any row in the cellphone_users table, but the foreign key constraints prevent this.

In [None]:
%%sql
/*  full outer join
    to show all users and all ratings, including users who have not rated any phones
    and cellphone ratings without a user. 
    In this case the foreign key constraint means the results contain the same rows as the first
    RIGHT OUTER JOIN above (the one without the test for nulls)
*/

SELECT cu.user_id, age, gender, rating
FROM cellphone_ratings cr FULL OUTER JOIN cellphone_users cu ON cr.user_id = cu.user_id 
ORDER BY user_id;

In [None]:
%%sql
/*
    cartesian product
    look what happens if you forget to join the tables.
*/

SELECT *
FROM cellphone_ratings, cellphone_users;


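Always check how many rows are returned. A Cartesian product pairs every row of one table with every row of the other, so the number of rows returned is the product of the two tables' row counts, far more than either table holds.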
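
The row-count check can be automated. This is a minimal sketch, assuming an in-memory SQLite database and two small made-up tables. The `count` helper is a hypothetical convenience function, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phones (cellphone_id INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE ratings (user_id INTEGER, cellphone_id INTEGER, rating INTEGER);
    INSERT INTO phones VALUES (1, 'Apple'), (2, 'Samsung'), (3, 'OPPO');
    INSERT INTO ratings VALUES (10, 1, 7), (10, 2, 4), (11, 1, 9), (12, 2, 6);
""")

def count(sql):
    """Hypothetical helper: run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

n_phones = count("SELECT COUNT(*) FROM phones")
n_ratings = count("SELECT COUNT(*) FROM ratings")

# Correct join on the key: one output row per rating
n_joined = count(
    "SELECT COUNT(*) FROM phones p JOIN ratings r ON p.cellphone_id = r.cellphone_id"
)

# Join condition forgotten: every phone is paired with every rating
n_cross = count("SELECT COUNT(*) FROM phones, ratings")

print(n_phones, n_ratings, n_joined, n_cross)
```

Here the keyed join returns as many rows as there are ratings, whereas the Cartesian product returns `n_phones * n_ratings` rows.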

In [None]:
%%sql
/* statistics */

SELECT cellphone_id, count(*) as rating_count
FROM cellphone_ratings
GROUP BY cellphone_id
ORDER BY cellphone_id;

Would the following query be correct?

In [None]:
%%sql
SELECT cellphone_id, count(*) as rating_count
FROM cellphone_ratings
WHERE count(*) > 35
GROUP BY cellphone_id
ORDER BY cellphone_id;

No, the above query does not work, because the WHERE clause references a group (aggregate) function. WHERE filters individual rows before they are grouped; to filter on the result of a group function, use HAVING instead:

In [None]:
%%sql
/* HAVING is like a WHERE clause applied to the groups after aggregation: */

SELECT cellphone_id, count(*) as rating_count
FROM cellphone_ratings
GROUP BY cellphone_id
HAVING count(*) > 25
ORDER BY cellphone_id;


In [None]:
%%sql
/* Can have a WHERE and HAVING clause together */

SELECT cellphone_id, count(*) as rating_count
FROM cellphone_ratings
WHERE rating > 6
GROUP BY cellphone_id
HAVING count(*) > 25
ORDER BY cellphone_id;
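
The order in which WHERE and HAVING are applied can be checked on a few rows. This is a minimal sketch, assuming an in-memory SQLite database and a made-up `ratings` table. The thresholds are chosen to suit the tiny sample, not the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, cellphone_id INTEGER, rating INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", [
    (1, 100, 9), (2, 100, 8), (3, 100, 2),
    (1, 200, 7), (2, 200, 3),
    (1, 300, 10),
])

high_counts = conn.execute("""
    SELECT cellphone_id, COUNT(*) AS rating_count
    FROM ratings
    WHERE rating > 6            -- row filter: applied before grouping
    GROUP BY cellphone_id
    HAVING COUNT(*) >= 2        -- group filter: applied after aggregation
    ORDER BY cellphone_id
""").fetchall()
print(high_counts)

# An aggregate in the WHERE clause is rejected by the database
try:
    conn.execute(
        "SELECT cellphone_id FROM ratings WHERE COUNT(*) > 1 GROUP BY cellphone_id"
    )
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)
print(where_error)
```

Only phone 100 survives: it has two ratings above 6, while phones 200 and 300 have one each after the WHERE filter and are removed by HAVING.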

Subqueries:

In [None]:
%%sql
/* Who is the eldest user? */

SELECT user_id, age, gender FROM cellphone_users WHERE age = 
    (SELECT MAX(age) FROM cellphone_users);

In [None]:
%%sql
/*
    Which users work in IT.
    Note, you need to use IN rather than equals (=) for comparison, 
    since more than one user_id will be returned.
*/

SELECT user_id, age, gender
FROM cellphone_users 
WHERE user_id IN 
    (SELECT user_id FROM cellphone_users WHERE occupation = 'IT');

In [None]:
%%sql
/* 
    Which phone has the highest rating:
*/

SELECT cellphone_data.cellphone_id, brand, model, rating 
FROM cellphone_data JOIN cellphone_ratings 
ON cellphone_data.cellphone_id = cellphone_ratings.cellphone_id
WHERE rating = (SELECT MAX(rating) FROM cellphone_ratings);

Perhaps an anomaly, since the other ratings seem to be 1-10

In [None]:
%%sql
/*
    Column subquery:
*/

SELECT cellphone_id, brand, model, 
    (SELECT MAX(price) FROM cellphone_data) AS max_price, 
    (SELECT MAX(price) FROM cellphone_data) - price AS difference 
FROM cellphone_data;


Correlated subquery

This will find the phones whose price is higher than the average price of phones of the same brand.

This requires comparing each phone's price with an aggregate (the average price for its brand), so the subquery refers to the row currently being examined by the outer query:


In [None]:
%%sql
/*
    correlated subquery
    which phones cost more than average price
    of phones of the same brand
*/

SELECT brand, model, price FROM cellphone_data cd1 
WHERE price > 
    (SELECT AVG(price) AS Avg_price 
     FROM cellphone_data cd2 
     WHERE cd1.brand = cd2.brand);
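
To see what the correlation adds, compare it with a plain subquery on a handful of rows. This is a minimal sketch, assuming an in-memory SQLite database and a made-up `phones` table with invented brands and prices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (brand TEXT, model TEXT, price REAL)")
conn.executemany("INSERT INTO phones VALUES (?, ?, ?)", [
    ("A", "a1", 100), ("A", "a2", 300),   # brand A average: 200
    ("B", "b1", 500), ("B", "b2", 700),   # brand B average: 600
    ("C", "c1", 400),                     # brand C average: 400
])

# Correlated: the inner query is re-evaluated for each outer row's brand
above_brand_avg = conn.execute("""
    SELECT brand, model, price FROM phones p1
    WHERE price > (SELECT AVG(price) FROM phones p2 WHERE p2.brand = p1.brand)
    ORDER BY brand, model
""").fetchall()

# Uncorrelated: one overall average (400) is used for every row
above_overall_avg = conn.execute("""
    SELECT brand, model FROM phones
    WHERE price > (SELECT AVG(price) FROM phones)
    ORDER BY brand, model
""").fetchall()

print(above_brand_avg)
print(above_overall_avg)
```

Model a2 is cheap overall but expensive for brand A, so it appears only in the correlated result. Model b1 is the reverse case.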


# Exercises from the tutorial

In [None]:
%%sql
/* retrieve all rows from cellphone_users */

SELECT * FROM cellphone_users;

In [None]:
%%sql
/*  retrieve cellphones released after the 1st day of 2022 */

SELECT brand, model, release_date 
FROM cellphone_data
WHERE release_date > '2022-01-01';

In [None]:
%%sql
/*  retrieve brand and price of all Samsung and OnePlus phones priced at 600 or more */
SELECT brand, price 
FROM cellphone_data
WHERE (brand = 'Samsung' OR brand = 'OnePlus') 
AND price >= 600;


In [None]:
%%sql
/*  retrieve brand and price of all Samsung phones (at any price), plus OnePlus phones priced at 600 or more
    subtle difference from previous query */

SELECT brand, price 
FROM cellphone_data
WHERE brand = 'Samsung' OR (brand = 'OnePlus' 
AND price >= 600);
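
The difference between the two queries comes from operator precedence: AND binds more tightly than OR, so leaving the parentheses out gives the second form. This is a minimal sketch, assuming an in-memory SQLite database and a made-up `phones` table with invented prices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (brand TEXT, price REAL)")
conn.executemany("INSERT INTO phones VALUES (?, ?)", [
    ("Samsung", 500), ("Samsung", 900),
    ("OnePlus", 400), ("OnePlus", 700),
    ("Apple", 1000),
])

def run(where_clause):
    """Hypothetical helper: return (brand, price) rows matching the given condition."""
    sql = f"SELECT brand, price FROM phones WHERE {where_clause} ORDER BY brand, price"
    return conn.execute(sql).fetchall()

# Price test applies to both brands
both_brands = run("(brand = 'Samsung' OR brand = 'OnePlus') AND price >= 600")

# Price test applies to OnePlus only
oneplus_only = run("brand = 'Samsung' OR (brand = 'OnePlus' AND price >= 600)")

# No parentheses: AND is evaluated before OR, so this matches the second form
no_parentheses = run("brand = 'Samsung' OR brand = 'OnePlus' AND price >= 600")

print(both_brands)
print(oneplus_only)
print(no_parentheses)
```

The cheaper Samsung appears in the second and third results but not in the first, so it is safer to write the parentheses explicitly whenever AND and OR are mixed.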
                                                                   

Misc Exercises
=====
Why do these generate errors?

In [None]:
%%sql
SELECT brand, model, rating 
FROM cellphone_data, cellphone_ratings 
WHERE cellphone_id = cellphone_id;


In [None]:
%%sql
SELECT cellphone_id, brand, model, rating 
FROM cellphone_data cd, cellphone_ratings cr
WHERE cd.cellphone_id = cr.cellphone_id;


Neither of the above works, because cellphone_id appears in both of the tables in the FROM clause, so you need to tell the database which one you want by qualifying it with the table name or alias (for example, cd.cellphone_id).

In [None]:
%%sql
SELECT brand, AVG(price) FROM cellphone_data GROUP BY price;


You can mix fields and group functions, but you need to *GROUP BY* the field that is not inside a group function, brand in this case, rather than price.
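
The effect of the grouping column can be seen on a few rows. This is a minimal sketch, assuming an in-memory SQLite database and a made-up `phones` table. SQLite is more permissive than PostgreSQL about ungrouped columns, so the sketch does not try to reproduce the error. It only shows what each grouping computes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (brand TEXT, price REAL)")
conn.executemany("INSERT INTO phones VALUES (?, ?)", [
    ("A", 100), ("A", 300), ("B", 500),
])

# Grouping by brand: one row per brand with that brand's average price
by_brand = conn.execute(
    "SELECT brand, AVG(price) FROM phones GROUP BY brand ORDER BY brand"
).fetchall()

# Grouping by price: one group per distinct price, so each "average" is just that price
by_price = conn.execute(
    "SELECT price, AVG(price) FROM phones GROUP BY price ORDER BY price"
).fetchall()

print(by_brand)
print(by_price)
```

Grouping by price produces one group for every distinct price, which says nothing about brands, whereas grouping by brand gives the per-brand average the query was after.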

Further exercises
=====

In [None]:
%%sql
SELECT brand, ROUND(AVG(price),2) AS average_price FROM cellphone_data 
GROUP BY brand
ORDER BY average_price DESC;

In [None]:
%%sql
SELECT AVG(price) FROM cellphone_data GROUP BY brand;


Subqueries:

In [None]:
%%sql
/* latest or earliest release? */
SELECT * FROM cellphone_data WHERE release_date =
(SELECT MIN(release_date) FROM cellphone_data);


In [None]:
%%sql
SELECT * FROM cellphone_data WHERE release_date =
(SELECT MAX(release_date) FROM cellphone_data);
