# JOINs

### 1. INNER JOIN
- Definition: Combines rows from two tables where there is a match in the specified columns.

- Key Point: Only returns rows that have matching values in both tables.
- Use Case: Fetch data that exists in both tables (e.g., employees and their departments).

Example:

In [None]:
SELECT e.employee_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

-- Result: Employees with valid department assignments.

-----------------

### 2. FULL OUTER JOIN
- Definition: Combines the results of both LEFT JOIN and RIGHT JOIN. Returns all rows from both tables, with NULLs where there is no match.

- Key Point: Useful for identifying unmatched data in both tables.
- Use Case: Find all employees and projects, including those without matches.

Example:

In [None]:
SELECT e.employee_name, p.project_name
FROM employees e
FULL OUTER JOIN projects p ON e.employee_id = p.employee_id;

-- Result: All employees and all projects, including employees without a project and projects without an employee.

------------

### 3. LEFT JOIN (or LEFT OUTER JOIN)
- Definition: Returns all rows from the left table and the matching rows from the right table. If no match is found, NULLs are returned for the right table.

- Key Point: Useful for finding rows in the left table that have no match in the right table.
- Use Case: List every employee with their projects, including employees who are not assigned to any project.

Example:



In [None]:
SELECT e.employee_name, p.project_name
FROM employees e
LEFT JOIN projects p ON e.employee_id = p.employee_id;

-- Result: All employees with their assigned projects, including those without any project.
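
To keep only the unmatched employees, the LEFT JOIN above can be combined with an `IS NULL` filter on the right table's key. The following is a minimal sketch, assuming SQLite (through Python's `sqlite3` module) as a stand-in for PostgreSQL; the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL,
# and the three employees / two projects are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, employee_name TEXT);
    CREATE TABLE projects  (project_id INTEGER PRIMARY KEY, project_name TEXT, employee_id INTEGER);
    INSERT INTO employees VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO projects  VALUES (10, 'Compiler', 1), (11, 'Kernel', 3);
""")

# LEFT JOIN keeps every employee; unmatched ones get NULL project columns.
all_rows = conn.execute("""
    SELECT e.employee_name, p.project_name
    FROM employees e
    LEFT JOIN projects p ON e.employee_id = p.employee_id
    ORDER BY e.employee_id
""").fetchall()

# Filtering on the NULL right-side key keeps only employees with no project.
unassigned = conn.execute("""
    SELECT e.employee_name
    FROM employees e
    LEFT JOIN projects p ON e.employee_id = p.employee_id
    WHERE p.employee_id IS NULL
""").fetchall()

print(all_rows)    # [('Ada', 'Compiler'), ('Grace', None), ('Linus', 'Kernel')]
print(unassigned)  # [('Grace',)]
```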

-------------------

### 4. RIGHT JOIN (or RIGHT OUTER JOIN)
- Definition: Returns all rows from the right table and the matching rows from the left table. If no match is found, NULLs are returned for the left table.

- Key Point: The mirror image of LEFT JOIN: every row of the right table is kept.
- Use Case: List every project with its assigned employee, including projects that are not assigned to any employee.

Example:

In [None]:
SELECT p.project_name, e.employee_name
FROM employees e
RIGHT JOIN projects p ON e.employee_id = p.employee_id;

-- Result: All projects with their assigned employees, including those without any employee.

----------

### 5. CROSS JOIN
- Definition: Produces a Cartesian product of two tables (every row in the first table is combined with every row in the second table).

- Key Point: Use sparingly, as it can generate a large number of rows.
- Use Case: Generate all possible combinations of employees and projects.

Example:

In [None]:
SELECT e.employee_name, p.project_name
FROM employees e
CROSS JOIN projects p;

-- Result: Every employee paired with every project, regardless of assignment.
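
The row count of a CROSS JOIN is the product of the two table sizes, which is why it grows so quickly. A small sketch, assuming SQLite via Python's `sqlite3` as a stand-in database and invented sample rows:

```python
import sqlite3

# Assumption: in-memory SQLite with made-up rows (3 employees, 2 projects).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, employee_name TEXT);
    CREATE TABLE projects  (project_id INTEGER PRIMARY KEY, project_name TEXT);
    INSERT INTO employees VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO projects  VALUES (10, 'Compiler'), (11, 'Kernel');
""")

employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
project_count = conn.execute("SELECT COUNT(*) FROM projects").fetchone()[0]

# Every employee is paired with every project: 3 x 2 rows.
pair_count = conn.execute("""
    SELECT COUNT(*)
    FROM employees e
    CROSS JOIN projects p
""").fetchone()[0]

print(employee_count, project_count, pair_count)  # 3 2 6
```

At the scale used later in this document (1 million employees and 100 departments), the same pattern would produce 100 million rows.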

---------

### 6. SELF JOIN
- Definition: A table is joined with itself.

- Key Point: Useful for hierarchical or relational data within the same table.
- Use Case: List each employee alongside their manager.

Example:



In [None]:
SELECT e1.employee_name AS Employee, e2.employee_name AS Manager
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.employee_id;

-- Result: List of employees with their respective managers.
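
Because the self join above is an INNER JOIN, an employee whose `manager_id` is NULL (for example, the head of the company) drops out of the result. A minimal sketch of the difference, assuming SQLite via Python's `sqlite3` and invented sample rows:

```python
import sqlite3

# Assumption: in-memory SQLite; Ada has no manager, the others report to Ada.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        employee_name TEXT,
        manager_id INTEGER
    );
    INSERT INTO employees VALUES (1, 'Ada', NULL), (2, 'Grace', 1), (3, 'Linus', 1);
""")

# INNER self join: rows without a matching manager are dropped.
inner_rows = conn.execute("""
    SELECT e1.employee_name AS Employee, e2.employee_name AS Manager
    FROM employees e1
    INNER JOIN employees e2 ON e1.manager_id = e2.employee_id
    ORDER BY e1.employee_id
""").fetchall()

# LEFT self join: every employee is kept, with NULL for a missing manager.
left_rows = conn.execute("""
    SELECT e1.employee_name AS Employee, e2.employee_name AS Manager
    FROM employees e1
    LEFT JOIN employees e2 ON e1.manager_id = e2.employee_id
    ORDER BY e1.employee_id
""").fetchall()

print(inner_rows)  # [('Grace', 'Ada'), ('Linus', 'Ada')]
print(left_rows)   # [('Ada', None), ('Grace', 'Ada'), ('Linus', 'Ada')]
```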

----------

### 7. NATURAL JOIN
- Definition: Automatically joins tables based on columns with the same name and compatible data types.

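- Key Point: The join uses every column name the two tables share, so an unrelated shared column silently changes the result. Avoid it if column names are ambiguous or inconsistent.
- Use Case: Simplify queries when column names are standardized.

Example: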

In [None]:
SELECT *
FROM employees
NATURAL JOIN departments;

-- Result: Employees with their department information, based on common column names.

-------

### 8. USING Clause
- Definition: Simplifies joins by specifying the column(s) to join on, assuming the column names are the same in both tables.

- Key Point: Cleaner syntax than ON for common column names; the shared column appears only once in the output.
- Use Case: Join tables with shared column names.

Example:

In [None]:
SELECT e.employee_name, d.department_name
FROM employees e
INNER JOIN departments d USING (department_id);

-- Result: Employees with their department information, using the common column name.

-------

### 9. Advanced Join Use Cases
#### A. Joining More Than Two Tables
- Combine multiple tables in a single query.

Example:

In [None]:
SELECT e.employee_name, d.department_name, p.project_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
LEFT JOIN projects p ON e.employee_id = p.employee_id;

-- result: Employees with their department and project information, including those without projects.

#### B. Filtering with Joins
- Add `WHERE` or `HAVING` clauses to filter results. 

Example:

In [None]:
SELECT e.employee_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
WHERE d.department_name = 'Engineering';

-- result: Employees in the Engineering department.

#### C. Aggregation with Joins
- Combine joins with aggregate functions like `COUNT`, `SUM`, etc. 

Example:

In [None]:
SELECT d.department_name, COUNT(e.employee_id) AS employee_count
FROM departments d
LEFT JOIN employees e ON d.department_id = e.department_id
GROUP BY d.department_name;

-- result: Count of employees in each department, including departments with no employees.
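
The query above counts `e.employee_id` rather than `*` for a reason: after a LEFT JOIN, a department with no employees still produces one row (with NULLs on the employee side), and `COUNT(*)` would count that row. A short sketch of the difference, assuming SQLite via Python's `sqlite3` and made-up sample data:

```python
import sqlite3

# Assumption: in-memory SQLite; Marketing deliberately has no employees.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, employee_name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Marketing');
    INSERT INTO employees VALUES (1, 'Ada', 1), (2, 'Grace', 1);
""")

counts = conn.execute("""
    SELECT d.department_name,
           COUNT(e.employee_id) AS employee_count,  -- ignores NULLs from unmatched rows
           COUNT(*)             AS row_count        -- counts the NULL-extended row too
    FROM departments d
    LEFT JOIN employees e ON d.department_id = e.department_id
    GROUP BY d.department_name
    ORDER BY d.department_name
""").fetchall()

print(counts)  # [('Engineering', 2, 2), ('Marketing', 0, 1)]
```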

Important Notes:

- Performance: Joins can be resource-intensive. Index the join columns to improve performance.

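- NULL Handling: Be cautious with NULL values, especially in OUTER JOINs. Unmatched rows carry NULLs, and a `WHERE` condition on a column from the NULL-filled side removes them, which effectively turns the outer join into an inner join.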
- Aliasing: Use table aliases (e, d, etc.) for readability and to avoid ambiguity.
- Avoid Cartesian Products: Unless intentional, ensure join conditions are specified to avoid unintended CROSS JOINs.

In [None]:
EXPLAIN ANALYZE
SELECT e.employee_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
WHERE d.department_name = 'Engineering';

-- result: Execution plan for the query. EXPLAIN ANALYZE actually runs the statement and reports real row counts and timings next to the planner's estimates.

## Indexes on JOINS

To demonstrate the practical impact of using indexes on joins, I'll provide a step-by-step example with performance comparisons. 

This will include the problem, solution, and performance analysis using PostgreSQL's `EXPLAIN ANALYZE`.

### Scenario: Joining Employees and Departments
- Problem:
    - You have two tables:
        - employees (1 million rows)
        - departments (100 rows)

You want to find the department name for each employee using an INNER JOIN on the department_id column.

Without an index on `employees.department_id`, PostgreSQL has no option but to read the entire `employees` table with a sequential scan whenever a query joins or filters on that column.


#### Step 1: Create Tables and Populate Data

In [None]:
-- Create the departments table
CREATE TABLE departments (
    department_id SERIAL PRIMARY KEY,
    department_name VARCHAR(50) NOT NULL
);

-- Create the employees table
CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    employee_name VARCHAR(50) NOT NULL,
    department_id INT
);

-- Insert data into departments
INSERT INTO departments (department_name)
SELECT 'Department ' || g
FROM generate_series(1, 100) g;

-- Insert data into employees
INSERT INTO employees (employee_name, department_id)
SELECT 'Employee ' || g, (g % 100) + 1
FROM generate_series(1, 1000000) g;

#### Step 2: Run the Query Without Index

In [None]:
-- Query to join employees and departments
EXPLAIN ANALYZE
SELECT e.employee_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

Query Plan (Without Index):

- Execution Time: ~2000 ms (illustrative; actual timings depend on hardware, configuration, and caching)
- Reason: PostgreSQL reads both tables with sequential scans, so all 1 million `employees` rows are scanned.

#### Step 3: Add Index on Join Column

In [None]:
-- Create an index on the department_id column in employees
CREATE INDEX idx_employees_department_id ON employees(department_id);

#### Step 4: Run the Query With Index

In [None]:
-- Query to join employees and departments
EXPLAIN ANALYZE
SELECT e.employee_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

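Query Plan (With Index):

- Execution Time: ~500 ms (illustrative; actual timings depend on hardware, configuration, and caching)
- Reason: PostgreSQL can use the index on `employees.department_id` to locate matching rows instead of reading the whole table. The planner does this only when it estimates the index to be cheaper. For a query like this one, which returns every employee, it may still choose a sequential scan with a hash join; the benefit is largest when the query is selective (for example, filtered to a single department).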

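#### Step 5: Compare Performance

- Without index (Step 2): ~2000 ms, sequential scan.
- With index (Step 4): ~500 ms, index scan.

These are the illustrative figures from the steps above. Compare the `EXPLAIN ANALYZE` output on your own data, since both the planner's choice and the timings vary.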

#### Step 6: Practical Insights
- 1) When to Use Indexes:
    - Use indexes on columns frequently used in JOIN conditions.
    - Indexes are especially useful for large tables.

- 2) When Not to Use Indexes:
    - For small tables, sequential scans may be faster than using an index.
    - Avoid creating too many indexes, as they can slow down INSERT, UPDATE, and DELETE operations.

- 3) How to Verify Index Usage:
    - Use `EXPLAIN ANALYZE` to check if PostgreSQL is using the index (Index Scan in the query plan).

- 4) Composite Indexes:
    - If a query involves multiple columns in the JOIN condition, create a composite index:

In [None]:
CREATE INDEX idx_composite ON employees(department_id, employee_name);

Creating an index on the `department_id` column can improve query performance by changing how PostgreSQL retrieves data during JOIN operations. Here is what happens internally:

- Without Index: When a query involves a JOIN condition on `department_id`, PostgreSQL reads every row of the `employees` table with a sequential scan and matches each row against `departments` (typically through a hash table built from the smaller table). When only a small part of a large table is needed, this is inefficient and time-consuming.

- With Index: When an index exists on `department_id`, PostgreSQL can use it to locate the rows in the `employees` table that match a given `department_id` in the `departments` table. Instead of scanning the entire table, it performs an index scan, which is much faster when few rows match.

- How Index Works: The index is a data structure (by default a B-tree) that stores the values of the `department_id` column in sorted order together with pointers to the table rows. PostgreSQL descends the tree, much like a binary search, to find matching rows, reducing the number of rows it needs to read.
- Query Plan Changes: After creating the index, `EXPLAIN ANALYZE` may show an 'Index Scan' (or 'Bitmap Index Scan') instead of a 'Seq Scan'. This indicates that the database is using the index. If the planner estimates that most of the table has to be read anyway, it keeps the sequential scan.
- Performance Improvement: The execution time drops when the index lets PostgreSQL skip unnecessary rows and go directly to the relevant data. In the scenario above, the illustrative execution time dropped from ~2000 ms (sequential scan) to ~500 ms (index scan).
- Additional Benefits: Indexes also improve performance for other operations involving the `department_id` column, such as filtering (`WHERE department_id = X`) or sorting (`ORDER BY department_id`). Use `EXPLAIN ANALYZE` to verify the query plan and observe the performance improvement.
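
The plan change described above can be observed on a small scale. The sketch below is an assumption-laden illustration, not a PostgreSQL benchmark: it uses SQLite through Python's `sqlite3`, whose `EXPLAIN QUERY PLAN` is a different command from PostgreSQL's `EXPLAIN ANALYZE`, and it uses a selective filter on `department_id` because that is where an index helps most. The helper name `plan` is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite with 10,000 generated rows, mirroring the
# (g % 100) + 1 department assignment used in Step 1.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        department_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO employees (employee_name, department_id) VALUES (?, ?)",
    [(f"Employee {g}", g % 100 + 1) for g in range(1, 10001)],
)

def plan(sql):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

query = "SELECT employee_name FROM employees WHERE department_id = 5"

plan_before = plan(query)  # full table scan: no usable index yet
conn.execute("CREATE INDEX idx_employees_department_id ON employees(department_id)")
plan_after = plan(query)   # search that names the new index

match_count = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE department_id = 5"
).fetchone()[0]

print(plan_before)
print(plan_after)
print(match_count)  # 100
```

The exact plan text differs between SQLite versions, so the useful signal is simply whether the index name appears in the plan.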

---------------------

## AGGREGATE FUNCTIONS

Aggregate functions are used to perform calculations on a set of rows and return a single value. They are commonly used with GROUP BY and HAVING clauses.

In [None]:
SELECT COUNT(*) AS total_employees FROM employees;

-- Counts the number of rows.

SELECT COUNT(DISTINCT department_id) AS total_departments FROM employees;

-- Counts the number of unique department IDs.

In [None]:
SELECT SUM(salary) AS total_salary FROM employees;

-- calculates the total salary of all employees.

SELECT AVG(salary) AS average_salary FROM employees;

-- calculates the average salary of all employees.

SELECT MIN(salary) AS min_salary FROM employees;

-- finds the minimum salary among all employees.

SELECT MAX(salary) AS max_salary FROM employees;

-- finds the maximum salary among all employees.


In [None]:

-- STRING_AGG(COLUMN, DELIMITER) -> Concatenates strings from multiple rows into a single string, with a delimiter.

SELECT STRING_AGG(employee_name, ', ') AS employee_list FROM employees;
-- Concatenates employee names into a single string, separated by commas.



-- ARRAY_AGG(COLUMN) -> Aggregates values into an array.

SELECT ARRAY_AGG(employee_name) AS employee_array FROM employees;
-- Creates an array of employee names.


-- JSON_AGG(COLUMN) -> Aggregates values into a JSON array.

SELECT JSON_AGG(employee_name) AS employee_json FROM employees;
-- Creates a JSON array of employee names.


-- JSONB_AGG(COLUMN) -> Aggregates values into a JSONB array.

SELECT JSONB_AGG(employee_name) AS employee_jsonb FROM employees;
-- Creates a JSONB array of employee names.


-- XMLAGG(XML_VALUE) -> Concatenates XML values from multiple rows.

SELECT XMLAGG(XMLELEMENT(NAME employee, employee_name)) AS employee_xml FROM employees;
-- XMLAGG takes xml input, so each name is first wrapped in an <employee> element with XMLELEMENT.



-- COALESCE(VALUE, DEFAULT, ...) -> Returns the first non-null argument. (A scalar expression, not an aggregate; often used to supply defaults around aggregates.)

SELECT COALESCE(department_id, 0) AS department_id FROM employees;
-- Returns the department_id or 0 if null.



-- NULLIF(VALUE1, VALUE2) -> Returns null if the two values are equal, otherwise returns VALUE1. (Also a scalar expression, not an aggregate.)

SELECT NULLIF(department_id, 0) AS department_id FROM employees;
-- Returns null if department_id is 0, otherwise returns department_id.
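
`COALESCE` and `NULLIF` matter around aggregates because most aggregates skip NULL inputs. The following sketch, assuming SQLite via Python's `sqlite3` and invented sample rows, shows how `AVG(salary)` and `AVG(COALESCE(salary, 0))` answer different questions:

```python
import sqlite3

# Assumption: in-memory SQLite; Grace has no recorded salary and a placeholder
# department_id of 0.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_name TEXT, department_id INTEGER, salary REAL);
    INSERT INTO employees VALUES ('Ada', 1, 100), ('Grace', 0, NULL), ('Linus', 2, 200);
""")

# COUNT(*) counts rows; COUNT(salary) and AVG(salary) skip NULL salaries.
stats = conn.execute("""
    SELECT COUNT(*),
           COUNT(salary),
           AVG(salary),
           AVG(COALESCE(salary, 0))
    FROM employees
""").fetchone()

# NULLIF turns the placeholder 0 back into NULL.
dept_ids = [row[0] for row in conn.execute("""
    SELECT NULLIF(department_id, 0)
    FROM employees
    ORDER BY employee_name
""")]

print(stats)     # (3, 2, 150.0, 100.0)
print(dept_ids)  # [1, None, 2]
```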

#### Using Aggregate Functions with GROUP BY
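- Aggregate functions are often used with GROUP BY to group rows based on one or more columns. The examples below assume that `employees` also has `department`, `job_title`, and `salary` columns.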

In [None]:
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department;
-- result: Total salary for each department.


SELECT department, COUNT(employee_id) AS employee_count
FROM employees
GROUP BY department;

-- result: Count of employees in each department.


SELECT department, AVG(salary) AS average_salary
FROM employees
GROUP BY department;

-- result: Average salary for each department.



SELECT department, job_title, COUNT(employee_id) AS employee_count
FROM employees
GROUP BY department, job_title;
-- result: Count of employees grouped by department and job title.

#### Using Aggregate Functions with HAVING
- The HAVING clause filters groups based on aggregate values.

In [None]:
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department
HAVING SUM(salary) > 100000;

-- result: Departments with total salary greater than 100000.
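
`WHERE` and `HAVING` filter at different stages: `WHERE` removes rows before grouping, while `HAVING` removes groups after the aggregates are computed. A minimal sketch of the difference, assuming SQLite via Python's `sqlite3` and made-up salaries:

```python
import sqlite3

# Assumption: in-memory SQLite with four invented employees in two departments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ada', 'Eng', 90000), ('Grace', 'Eng', 60000),
        ('Linus', 'Ops', 70000), ('Ken', 'Ops', 40000);
""")

# HAVING only: all rows are summed, then small groups are dropped.
having_only = conn.execute("""
    SELECT department, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department
    HAVING SUM(salary) > 100000
    ORDER BY department
""").fetchall()

# WHERE first: Ken's row is removed before summing, so Ops no longer qualifies.
where_then_having = conn.execute("""
    SELECT department, SUM(salary) AS total_salary
    FROM employees
    WHERE salary > 55000
    GROUP BY department
    HAVING SUM(salary) > 100000
    ORDER BY department
""").fetchall()

print(having_only)        # [('Eng', 150000), ('Ops', 110000)]
print(where_then_having)  # [('Eng', 150000)]
```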

Advanced Examples

In [None]:
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;
-- result: Count of employees in each department.


SELECT employee_name, salary
FROM employees
ORDER BY salary DESC
LIMIT 3;
-- result: Top 3 highest-paid employees.



SELECT department, STRING_AGG(employee_name, ', ') AS employee_names
FROM employees
GROUP BY department;
-- result: List of employee names grouped by department.
-- concatenates employee names into a single string for each department.


## SUB-QUERIES

##### 1. Single-Row Sub-Query
- Definition: Returns a single row with a single column as the result.

- Use Case: Used when the outer query requires a single value (e.g., comparison with a specific value).

In [None]:
-- Find employees who earn more than the average salary
SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Use Case: Identify employees earning more than the average salary.

In [None]:
-- Equivalent query using a CTE

WITH avg_salary AS (
    SELECT AVG(salary) AS avg_salary
    FROM employees
)
SELECT e.employee_name, e.salary
FROM employees e
CROSS JOIN avg_salary a
WHERE e.salary > a.avg_salary;

-- The WITH clause (common table expression) names the average so the outer query can reuse it. PostgreSQL already evaluates an uncorrelated scalar sub-query only once, so this rewrite is mainly about readability rather than speed.

##### 2. Multi-Row Sub-Query
- Definition: Returns multiple rows as the result.

- Use Case: Used when the outer query needs to compare a column with multiple values (e.g., using IN, ANY, or ALL).

In [None]:
-- Find employees who work in departments located in 'New York'
SELECT employee_name
FROM employees
WHERE department_id IN (
    SELECT department_id
    FROM departments
    WHERE location = 'New York'
);

-- Use Case: Fetch employees working in specific locations.

In [None]:
-- Optimized Query

SELECT e.employee_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
WHERE d.location = 'New York';

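-- The join form can be more efficient in some cases, although PostgreSQL often plans IN (sub-query) as a semi-join with similar cost. The two queries return the same rows only because department_id is unique in departments; with a non-unique key, the join would repeat employees.

-- For a simple lookup like this, a join is usually clearer than a CTE.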
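
Rewriting `IN (sub-query)` as a join preserves the result only when the joined key is unique. The sketch below illustrates the failure case with a hypothetical `department_offices` table (not part of the original schema) in which one department has two New York rows; it assumes SQLite via Python's `sqlite3` and invented data.

```python
import sqlite3

# Assumption: in-memory SQLite; department_offices is a hypothetical table
# where department_id is NOT unique.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_name TEXT, department_id INTEGER);
    CREATE TABLE department_offices (department_id INTEGER, location TEXT);
    INSERT INTO employees VALUES ('Ada', 1), ('Grace', 2);
    INSERT INTO department_offices VALUES (1, 'New York'), (1, 'New York'), (2, 'Boston');
""")

# IN asks "is there at least one match?", so each employee appears at most once.
in_rows = conn.execute("""
    SELECT employee_name
    FROM employees
    WHERE department_id IN (
        SELECT department_id
        FROM department_offices
        WHERE location = 'New York'
    )
""").fetchall()

# The join emits one row per matching pair, so Ada appears twice.
join_rows = conn.execute("""
    SELECT e.employee_name
    FROM employees e
    INNER JOIN department_offices o ON e.department_id = o.department_id
    WHERE o.location = 'New York'
""").fetchall()

print(in_rows)    # [('Ada',)]
print(join_rows)  # [('Ada',), ('Ada',)]
```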

##### 3. Correlated Sub-Query
- Definition: A sub-query that references columns from the outer query. Conceptually, it is evaluated once for each row of the outer query.

- Use Case: Used for row-by-row comparisons or filtering based on related data.


In [None]:
-- Find employees whose salary is greater than the average salary of their department
SELECT employee_name, salary
FROM employees e
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department_id = e.department_id
);

-- Use Case: Identify employees who earn more than the average of their own department.

In [None]:
-- Optimized Query

WITH department_avg_salary AS (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT e.employee_name, e.salary
FROM employees e
INNER JOIN department_avg_salary das ON e.department_id = das.department_id
WHERE e.salary > das.avg_salary;

-- The rewritten query uses a WITH clause to compute the average salary for each department once, then joins that result to the employees table, so the average does not have to be recomputed for every row of the outer query.
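
A quick way to gain confidence in such a rewrite is to run both forms on the same data and compare the results. A minimal sketch, assuming SQLite via Python's `sqlite3` and invented salaries (the inner alias `x` is added here only for clarity):

```python
import sqlite3

# Assumption: in-memory SQLite; department 1 averages 150, department 2 averages 300.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_name TEXT, department_id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ada', 1, 100), ('Grace', 1, 200),
        ('Linus', 2, 300), ('Ken', 2, 300);
""")

# Correlated sub-query: the inner query refers to the outer row's department.
correlated = conn.execute("""
    SELECT e.employee_name, e.salary
    FROM employees e
    WHERE e.salary > (
        SELECT AVG(x.salary)
        FROM employees x
        WHERE x.department_id = e.department_id
    )
    ORDER BY e.employee_name
""").fetchall()

# CTE + join: per-department averages are computed once, then joined.
cte_join = conn.execute("""
    WITH department_avg_salary AS (
        SELECT department_id, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department_id
    )
    SELECT e.employee_name, e.salary
    FROM employees e
    INNER JOIN department_avg_salary das ON e.department_id = das.department_id
    WHERE e.salary > das.avg_salary
    ORDER BY e.employee_name
""").fetchall()

print(correlated)  # [('Grace', 200)]
print(cte_join)    # [('Grace', 200)]
```

Nobody in department 2 is returned because both salaries equal the department average and the comparison is strict.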


Key Points:
- Single-Row Sub-Query: Use operators like =, <, >.

- Multi-Row Sub-Query: Use operators like IN, ANY, ALL.

- Correlated Sub-Query: Executes for each row in the outer query, making it slower for large datasets.

Performance Tip:
- Optimize sub-queries by using indexes on columns involved in the sub-query conditions.