In [None]:
# A) SQL Basics & Advanced:
'''
1. What are window functions in SQL? Provide examples of how they are used.
   - rank(), dense_rank(), row_number(), percent_rank()?
2. Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
3. How would you optimize a slow SQL query?
4. Discuss the differences between OLAP and OLTP databases.
5. Find out duplicates in a table?
6. WHERE vs HAVING
 # A WHERE clause filters individual rows. The filter is applied before any grouping takes place.
 # A HAVING clause filters groups after GROUP BY, so it can test aggregate values such as COUNT(*).
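
A minimal sketch of questions 5 and 6 together, run against an in-memory SQLite database. The `payments` table and its rows are hypothetical and exist only for this illustration: WHERE removes single rows first, then HAVING keeps the email groups that still occur more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, email TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO payments (email, amount) VALUES (?, ?)",
    [("a@x.com", 50), ("a@x.com", 5), ("a@x.com", 70),
     ("b@x.com", 80), ("c@x.com", 20), ("c@x.com", 30)],
)

# WHERE drops individual rows (amount < 10) before grouping;
# HAVING then keeps only the groups that still hold more than one row.
duplicates = conn.execute(
    """
    SELECT email, COUNT(*) AS n
    FROM payments
    WHERE amount >= 10
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
    """
).fetchall()
print(duplicates)
```

Moving the `COUNT(*) > 1` test into WHERE would be rejected, because aggregates are not available until the rows have been grouped.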
'''

### **Data Integrity and Constraints**
'''
Q) What are the different types of constraints in SQL, and how do they ensure data integrity?
A)
  - **PRIMARY KEY**: Ensures each record is unique and identifiable (no duplicates, no NULLs).
  - **FOREIGN KEY**: Ensures referential integrity between tables.
  - **UNIQUE**: Ensures all values in a column are unique.
  - **CHECK**: Ensures that all values in a column meet specific conditions.
  - **NOT NULL**: Ensures a column cannot have NULL values.
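
The five constraints can be seen rejecting bad rows in a short sketch. It assumes SQLite (where foreign keys are enforced only after `PRAGMA foreign_keys = ON`); the `departments` and `staff` tables are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE departments (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE staff (
        staff_id INTEGER PRIMARY KEY,
        email    TEXT NOT NULL UNIQUE,
        salary   INTEGER CHECK (salary > 0),
        dept_id  INTEGER REFERENCES departments(dept_id)
    );
    INSERT INTO departments (dept_id, name) VALUES (1, 'Data');
    INSERT INTO staff VALUES (1, 'a@x.com', 100, 1);
    """
)

# One deliberately invalid INSERT per constraint type.
violations = {
    "PRIMARY KEY": "INSERT INTO staff VALUES (1, 'b@x.com', 100, 1)",
    "FOREIGN KEY": "INSERT INTO staff VALUES (2, 'c@x.com', 100, 99)",
    "UNIQUE": "INSERT INTO staff VALUES (3, 'a@x.com', 100, 1)",
    "CHECK": "INSERT INTO staff VALUES (4, 'd@x.com', -5, 1)",
    "NOT NULL": "INSERT INTO staff VALUES (5, NULL, 100, 1)",
}
rejected = []
for constraint, statement in violations.items():
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        rejected.append(constraint)

staff_rows = conn.execute("SELECT COUNT(*) FROM staff").fetchone()[0]
print(rejected, staff_rows)
```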

'''

In [None]:
##1) Complex Joins
## You have two tables: Customers(customer_id, customer_name) and Orders(order_id, customer_id).
## Write a SQL query to find all customers who have placed more than 5 orders. Ensure that your query is optimized for large datasets.
'''
SELECT 
    C.customer_id,
    C.customer_name
FROM 
    Customers C
JOIN 
    (SELECT 
         O.customer_id,
         COUNT(O.order_id) AS order_count
     FROM 
         Orders O
     GROUP BY 
         O.customer_id
     HAVING 
         COUNT(O.order_id) > 5
    ) OC
ON 
    C.customer_id = OC.customer_id;
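
The query above can be sanity-checked on a tiny in-memory SQLite database. This is only a sketch: the customer names and order counts are made up, with one customer above the threshold, one exactly at 5, and one with no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy');
    """
)
# Hypothetical data: Ann has 6 orders, Ben exactly 5, Cy none.
order_counts = {1: 6, 2: 5, 3: 0}
for customer_id, n in order_counts.items():
    conn.executemany("INSERT INTO Orders (customer_id) VALUES (?)", [(customer_id,)] * n)

frequent_customers = conn.execute(
    """
    SELECT C.customer_id, C.customer_name
    FROM Customers C
    JOIN (
        SELECT O.customer_id, COUNT(O.order_id) AS order_count
        FROM Orders O
        GROUP BY O.customer_id
        HAVING COUNT(O.order_id) > 5
    ) OC ON C.customer_id = OC.customer_id
    """
).fetchall()
print(frequent_customers)
```

Only the customer with 6 orders is returned, since the condition is strictly "more than 5".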

'''
# Optimization Considerations:
'''
1. Indexes:
    * Ensure that customer_id is indexed in both the Orders and Customers tables (in Customers it is normally the primary key).
      The index on Orders.customer_id helps the engine group and count the orders for each customer, and the index on
      Customers.customer_id speeds up the join.
    * If the Orders table is very large, consider creating a composite index on (customer_id, order_id) so that the subquery
      can be answered from the index alone.

2. Partitioning:
    * If the Orders table is very large, consider partitioning it by customer_id or by date (if applicable), and make sure the
      partitioning strategy matches the query's access patterns. Partitioning reduces the amount of data scanned only when the
      query can skip partitions, for example when a date filter is added.

3. Query Execution Plan:
    * Always check the query execution plan to confirm that the indexes are actually used and that large tables are not read
      with avoidable full table scans.
    * If the query is still slow, consider using database-specific optimizations, such as materialized views,
      to store the result of the subquery OC for reuse.
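
As a rough illustration of checking a plan, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` before and after creating the composite index. The index name `idx_orders_customer` is an assumption made for this example, and other engines expose plans through their own commands with different output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")

query = """
    SELECT customer_id, COUNT(order_id)
    FROM Orders
    GROUP BY customer_id
    HAVING COUNT(order_id) > 5
"""

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(query)
conn.execute("CREATE INDEX idx_orders_customer ON Orders (customer_id, order_id)")
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two lists shows whether the planner picked up the new index for the grouping step.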
'''

# Alternative: join first, then aggregate (also returns the order count)
'''
SELECT 
    c.customer_id,
    c.customer_name,
    COUNT(DISTINCT o.order_id) AS total_orders
FROM Customers c 
JOIN Orders o 
    ON c.customer_id = o.customer_id
GROUP BY 
    c.customer_id,
    c.customer_name
HAVING 
    COUNT(DISTINCT o.order_id) > 5;

'''

In [None]:
#2) Data Aggregation
#Q) Given a table sales(date, product_id, amount), write a SQL query to calculate the 7-day rolling average of sales for each product.
'''
A) To calculate the 7-day rolling average of sales for each product, use `AVG()` as a window function with a row-based frame. The query below assumes one row per product per day with no missing dates, so that 7 rows correspond to 7 days:

```sql
SELECT
    date,
    product_id,
    amount,
    AVG(amount) OVER (
        PARTITION BY product_id
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_7_days
FROM
    sales
ORDER BY
    product_id,
    date;
```

### Explanation:

1. **`SELECT date, product_id, amount`**: This selects the `date`, `product_id`, and `amount` columns from the `sales` table.

2. **`AVG(amount) OVER (...) AS rolling_avg_7_days`**: The `AVG()` function calculates the average of the `amount` values over a specified window of rows. The result is aliased as `rolling_avg_7_days`.

3. **`PARTITION BY product_id`**: This clause partitions the data by `product_id`, so that the rolling average is calculated separately for each product.

4. **`ORDER BY date`**: This orders the data within each partition by `date`, which is necessary for calculating the rolling average.

5. **`ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`**: This specifies the window frame for the rolling average calculation. It includes the current row and the 6 preceding rows, that is, a 7-row window. This equals a 7-day window only when each product has exactly one row per day; with gaps or several rows per day, aggregate to daily totals first or use a date-based `RANGE` frame where the dialect supports it.

6. **`FROM sales`**: This specifies the `sales` table as the source of the data.

7. **`ORDER BY product_id, date`**: This ensures the final result is ordered by `product_id` and `date`, making it easier to interpret the rolling averages over time.

### Sample Output:

| date       | product_id | amount | rolling_avg_7_days |
|------------|------------|--------|--------------------|
| 2024-08-01 | 1         | 100    | 100                |
| 2024-08-02 | 1         | 150    | 125                |
| 2024-08-03 | 1         | 200    | 150                |
| 2024-08-04 | 1         | 250    | 175                |
| 2024-08-05 | 1         | 300    | 200                |
| 2024-08-06 | 1         | 350    | 225                |
| 2024-08-07 | 1         | 400    | 250                |
| 2024-08-08 | 1         | 450    | 300                |

### Optimization Considerations:

1. **Indexes**: 
   - Create a composite index on `(product_id, date)` in the `sales` table. It matches the `PARTITION BY product_id ORDER BY date` clause, so the engine can often read rows in window order instead of sorting them.

2. **Data Size**:
   - For very large datasets, consider partitioning the table by date ranges or using other techniques to manage large data volumes efficiently.

3. **Query Execution Plan**:
   - Check the execution plan to make sure that the query is using indexes appropriately and not performing full table scans or other costly operations.

This query will compute the 7-day rolling average of sales for each product, considering the sales data over the specified window of time.
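
The sample table above can be reproduced with a short sketch on an in-memory SQLite database (chosen here only because it needs no setup; it assumes an SQLite build with window-function support, version 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, product_id INTEGER, amount INTEGER)")

# The eight daily amounts from the sample table, all for product 1.
amounts = [100, 150, 200, 250, 300, 350, 400, 450]
conn.executemany(
    "INSERT INTO sales VALUES (?, 1, ?)",
    [("2024-08-%02d" % day, amount) for day, amount in enumerate(amounts, start=1)],
)

rows = conn.execute(
    """
    SELECT date, amount,
           AVG(amount) OVER (
               PARTITION BY product_id
               ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_avg_7_days
    FROM sales
    ORDER BY product_id, date
    """
).fetchall()

rolling = [avg for _, _, avg in rows]
print(rolling)
```

For the first six days the frame holds fewer than 7 rows, so the average covers only the rows seen so far; from the seventh day on it covers exactly 7 rows.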
'''

In [None]:
#3) Window Functions
#Q) Write a SQL query that ranks customers based on their total purchase amount. The ranking should reset for each region.
# Orders(customer_id,region,purchase_amount)
'''
To rank customers based on their total purchase amount, with the ranking resetting for each region, 
you can use the `RANK()` window function in SQL. This function allows you to assign a rank to each customer within 
their respective region based on their total purchase amount.

### SQL Query:
```sql
SELECT
    customer_id,
    region,
    total_purchase_amount,
    RANK() OVER (PARTITION BY region ORDER BY total_purchase_amount DESC) AS rank
FROM
(
        SELECT
            customer_id,
            region,
            SUM(purchase_amount) AS total_purchase_amount
        FROM
            Orders
        GROUP BY
            customer_id,
            region
) AS customer_purchases;
```

### Explanation:

1. **Inner Query (`customer_purchases`)**:
   - **`SELECT customer_id, region, SUM(purchase_amount) AS total_purchase_amount`**: This selects the `customer_id`, `region`, and calculates the total purchase amount for each customer by summing their `purchase_amount` values.
   - **`FROM Orders`**: Specifies the `Orders` table as the source of data.
   - **`GROUP BY customer_id, region`**: Groups the results by `customer_id` and `region` to calculate the total purchase amount for each customer within their respective region.

2. **Outer Query**:
   - **`RANK() OVER (PARTITION BY region ORDER BY total_purchase_amount DESC) AS rank`**: The `RANK()` function assigns a rank to each customer based on their `total_purchase_amount`. The `PARTITION BY region` clause ensures that the ranking resets for each region. The `ORDER BY total_purchase_amount DESC` clause orders the customers within each region from highest to lowest total purchase amount.

3. **Final Output**:
   - The query returns the `customer_id`, `region`, `total_purchase_amount`, and the computed rank for each customer within their region.

### Sample Output:

| customer_id | region | total_purchase_amount | rank |
|-------------|--------|-----------------------|------|
| 1           | East   | 5000                  | 1    |
| 2           | East   | 3000                  | 2    |
| 3           | East   | 2000                  | 3    |
| 4           | West   | 7000                  | 1    |
| 5           | West   | 4000                  | 2    |

### Optimization Considerations:

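1. **Indexes**: 
   - The aggregation reads `customer_id`, `region`, and `purchase_amount` from `Orders`, so one composite index covering these columns, for example on `(region, customer_id, purchase_amount)`, is usually more useful than three separate single-column indexes. The ranking itself sorts the aggregated totals, which an index on `Orders` cannot speed up.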

2. **Partitioning**:
   - If the `Orders` table is very large, consider partitioning it by `region` to improve the efficiency of the query, especially if the query is run frequently.

3. **Query Execution Plan**:
   - Always check the execution plan to ensure that the query is using the appropriate indexes and is not performing unnecessary full table scans.

This query efficiently ranks customers based on their total purchase amount within each region, ensuring the ranking resets for every region as required.
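
A small sketch of the same query on an in-memory SQLite database, with `DENSE_RANK()` added next to `RANK()` to show how the two differ on ties. The order rows are hypothetical; customers 2 and 3 are given equal totals on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (customer_id INTEGER, region TEXT, purchase_amount INTEGER)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        (1, "East", 3000), (1, "East", 2000),  # total 5000
        (2, "East", 3000),                     # total 3000
        (3, "East", 1000), (3, "East", 2000),  # total 3000, ties with customer 2
        (4, "East", 500),                      # total 500
        (5, "West", 7000),
        (6, "West", 4000),
    ],
)

ranked = conn.execute(
    """
    SELECT customer_id, region, total_purchase_amount,
           RANK() OVER (PARTITION BY region ORDER BY total_purchase_amount DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY region ORDER BY total_purchase_amount DESC) AS dense_rnk
    FROM (
        SELECT customer_id, region, SUM(purchase_amount) AS total_purchase_amount
        FROM Orders
        GROUP BY customer_id, region
    ) AS customer_purchases
    ORDER BY region, rnk, customer_id
    """
).fetchall()
for row in ranked:
    print(row)
```

After the tie at rank 2, `RANK()` skips to 4 while `DENSE_RANK()` continues with 3, and both restart at 1 in the West region.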

'''


In [None]:
##4) Recursive Query
##Question: Write a SQL query to find the manager of each employee in a company, where the Employees table has columns EmployeeID, 
# ManagerID, and Name. Use a recursive CTE to achieve this.
'''

To find the manager of each employee using a recursive Common Table Expression (CTE), you can use the following SQL query. The CTE walks the hierarchy from the top down and records the chain of managers in `ManagerPath`; the final SELECT returns each employee with their direct manager.

### SQL Query:
```sql
WITH RECURSIVE EmployeeHierarchy AS (
    -- Anchor member: Select employees who are at the top of the hierarchy (or have no manager)
    SELECT
        EmployeeID,
        ManagerID,
        Name,
        CAST(Name AS VARCHAR(255)) AS ManagerPath
    FROM
        Employees
    WHERE
        ManagerID IS NULL  -- Assuming employees with no manager are at the top of the hierarchy

    UNION ALL

    -- Recursive member: Join with the Employees table to find the direct managers
    SELECT
        e.EmployeeID,
        e.ManagerID,
        e.Name,
        CAST(h.ManagerPath || ' -> ' || e.Name AS VARCHAR(255)) AS ManagerPath
    FROM
        Employees e
    INNER JOIN
        EmployeeHierarchy h
    ON
        e.ManagerID = h.EmployeeID
)

-- Final selection
SELECT
    e.EmployeeID,
    e.Name AS EmployeeName,
    m.Name AS ManagerName
FROM
    Employees e
LEFT JOIN
    EmployeeHierarchy eh
ON
    e.EmployeeID = eh.EmployeeID
LEFT JOIN
    Employees m
ON
    eh.ManagerID = m.EmployeeID
ORDER BY
    e.EmployeeID;
```

### Explanation:

1. **CTE Definition**:
   - **`WITH RECURSIVE EmployeeHierarchy AS (...)`**: Defines a recursive CTE named `EmployeeHierarchy`.

2. **Anchor Member**:
   - **`SELECT EmployeeID, ManagerID, Name, CAST(Name AS VARCHAR(255)) AS ManagerPath`**: Selects the employees at the top of the hierarchy. These employees have no manager (`ManagerID IS NULL`).
   - **`CAST(Name AS VARCHAR(255)) AS ManagerPath`**: Initializes the `ManagerPath` with the employee's name. This column will store the path of managers for each employee.

3. **Recursive Member**:
   - **`SELECT e.EmployeeID, e.ManagerID, e.Name, CAST(h.ManagerPath || ' -> ' || e.Name AS VARCHAR(255)) AS ManagerPath`**: Recursively joins the `Employees` table to the `EmployeeHierarchy` CTE to find direct reports and build the `ManagerPath`.
   - **`FROM Employees e INNER JOIN EmployeeHierarchy h ON e.ManagerID = h.EmployeeID`**: Joins each employee to the row of their manager produced by the previous iteration, so every pass descends one level in the hierarchy.

4. **Final Selection**:
   - **`SELECT e.EmployeeID, e.Name AS EmployeeName, m.Name AS ManagerName`**: Selects the final result showing each employee and their manager's name.
   - **`LEFT JOIN EmployeeHierarchy eh ON e.EmployeeID = eh.EmployeeID`**: Joins the employees with the hierarchy CTE to get the manager information.
   - **`LEFT JOIN Employees m ON eh.ManagerID = m.EmployeeID`**: Joins again with the `Employees` table to get the manager's name.

5. **Ordering**:
   - **`ORDER BY e.EmployeeID`**: Orders the final result by `EmployeeID` for readability.

### Note:

- The `CAST` for `ManagerPath` and the `||` concatenation operator vary by SQL dialect: SQL Server concatenates strings with `+`, and MySQL uses `CONCAT()` by default. Adjust the data type and length accordingly.
- PostgreSQL, MySQL 8+, and SQLite use `WITH RECURSIVE`; SQL Server and Oracle write the same recursive CTE with plain `WITH`.

This query returns each employee along with their direct manager. The CTE already builds the full chain of indirect managers in `ManagerPath`; to expose it, add `eh.ManagerPath` to the final SELECT list.
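
A compact sketch of the same idea on an in-memory SQLite database, selecting directly from the CTE so that `ManagerPath` is visible. The four employees are hypothetical, and the `CAST` is omitted because SQLite does not need it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, ManagerID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(1, None, "Ada"), (2, 1, "Ben"), (3, 1, "Cleo"), (4, 2, "Dev")],
)

hierarchy = conn.execute(
    """
    WITH RECURSIVE EmployeeHierarchy AS (
        -- Anchor member: employees without a manager
        SELECT EmployeeID, ManagerID, Name, Name AS ManagerPath
        FROM Employees
        WHERE ManagerID IS NULL
        UNION ALL
        -- Recursive member: direct reports of the rows found so far
        SELECT e.EmployeeID, e.ManagerID, e.Name, h.ManagerPath || ' -> ' || e.Name
        FROM Employees e
        JOIN EmployeeHierarchy h ON e.ManagerID = h.EmployeeID
    )
    SELECT eh.Name, m.Name AS ManagerName, eh.ManagerPath
    FROM EmployeeHierarchy eh
    LEFT JOIN Employees m ON eh.ManagerID = m.EmployeeID
    ORDER BY eh.EmployeeID
    """
).fetchall()
for row in hierarchy:
    print(row)
```

The top-level employee has no manager, so `ManagerName` is NULL for that row, while `ManagerPath` lists every level from the top down to the employee.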

'''

In [1]:
#5) Data Cleaning
 #You have a table with customer records, but some rows contain duplicate customer information. 
 #Write a SQL query to delete duplicate records, keeping only the first instance of each customer based on their email address.
'''
To delete duplicate customer records while keeping only the first instance based on their email address,
 you can use a common approach involving the `ROW_NUMBER()` window function to identify and remove duplicates. Here's how you can do it:

### SQL Query:
```sql
WITH RankedCustomers AS (
    SELECT
        CustomerID,
        Email,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY CustomerID) AS rn
    FROM
        Customers
)
DELETE FROM Customers
WHERE CustomerID IN (
    SELECT CustomerID
    FROM RankedCustomers
    WHERE rn > 1
);
```

### Explanation:

1. **CTE Definition (`RankedCustomers`)**:
   - **`ROW_NUMBER() OVER (PARTITION BY Email ORDER BY CustomerID) AS rn`**: This assigns a sequential number to each row within each `Email` partition. The numbering is ordered by `CustomerID`, so the first instance of each email (the lowest `CustomerID`) has `rn = 1`.
   - **`PARTITION BY Email`**: This clause groups rows by `Email` so that `ROW_NUMBER()` is computed within each email group.
   - **`ORDER BY CustomerID`**: This orders the rows within each email group by `CustomerID`, ensuring that the lowest `CustomerID` gets rank `1`.

2. **Delete Operation**:
   - **`DELETE FROM Customers WHERE CustomerID IN (...)`**: This deletes rows from the `Customers` table where the `CustomerID` is in the set of IDs that are identified as duplicates (i.e., those with `rn > 1`).

3. **Subquery**:
   - **`SELECT CustomerID FROM RankedCustomers WHERE rn > 1`**: This selects the `CustomerID` values of rows that are considered duplicates, i.e., those with a rank greater than 1.

### Additional Considerations:

- **Backup**: Before performing a delete operation, back up your data or run the query inside a transaction so that you can roll back if needed.

- **Indexes**: Ensure that there are indexes on the `Email` and `CustomerID` columns to improve the performance of the query, especially if the table is large.

- **Testing**: It's a good practice to test the `SELECT` part of the query first to verify the rows that would be deleted before executing the `DELETE` statement.

### Testing Query:
To see which rows would be deleted before actually removing them, you can run:

```sql
WITH RankedCustomers AS (
    SELECT
        CustomerID,
        Email,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY CustomerID) AS rn
    FROM
        Customers
)
SELECT *
FROM RankedCustomers
WHERE rn > 1;
```

This will show you the duplicate records that have a rank greater than 1, allowing you to verify which records are considered duplicates.
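
The preview-then-delete workflow can be sketched end to end on an in-memory SQLite database. The five customer rows are hypothetical, and the sketch relies on SQLite accepting a `WITH` clause in front of `DELETE`; other engines may need the statement adapted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Email TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "a@x.com"), (2, "b@x.com"), (3, "a@x.com"), (4, "a@x.com"), (5, "b@x.com")],
)
conn.commit()

ranked_cte = """
    WITH RankedCustomers AS (
        SELECT CustomerID,
               ROW_NUMBER() OVER (PARTITION BY Email ORDER BY CustomerID) AS rn
        FROM Customers
    )
"""

# Step 1: preview the rows that would be removed.
to_delete = [row[0] for row in conn.execute(
    ranked_cte + "SELECT CustomerID FROM RankedCustomers WHERE rn > 1 ORDER BY CustomerID"
)]

# Step 2: run the delete with the same CTE, then confirm what is left.
conn.execute(
    ranked_cte
    + "DELETE FROM Customers "
    + "WHERE CustomerID IN (SELECT CustomerID FROM RankedCustomers WHERE rn > 1)"
)
conn.commit()
remaining = conn.execute("SELECT CustomerID, Email FROM Customers ORDER BY CustomerID").fetchall()
print(to_delete, remaining)
```

Reusing one CTE string for both steps keeps the preview and the delete in agreement about which rows count as duplicates.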

'''

In [None]:
# Q) Write a SQL query to pivot a table, turning rows into columns.
'''
To pivot a table in SQL Server, you can use the `PIVOT` operator. It turns the distinct values of one column into separate columns and aggregates a second column under each of them, which is typically used for summarizing data.

### Example Scenario:
Suppose you have a table named `Sales` with the following structure:

| SalesPerson | Product  | Amount |
|-------------|----------|--------|
| Alice       | ProductA | 100    |
| Alice       | ProductB | 150    |
| Bob         | ProductA | 120    |
| Bob         | ProductC | 130    |

You want to pivot this table to show the total sales amount by each salesperson for each product.

### Pivot Query:

```sql
SELECT SalesPerson, 
       [ProductA], 
       [ProductB], 
       [ProductC]
FROM 
(
    SELECT SalesPerson, Product, Amount
    FROM Sales
) AS SourceTable
PIVOT
(
    SUM(Amount)
    FOR Product IN ([ProductA], [ProductB], [ProductC])
) AS PivotTable;
```

### Explanation:
- The inner query (`SourceTable`) selects the `SalesPerson`, `Product`, and `Amount` columns.
- The `PIVOT` operator aggregates the `Amount` values by `SalesPerson` for each `Product`.
- The `FOR Product IN ([ProductA], [ProductB], [ProductC])` clause specifies the values from the `Product` column that should become new columns.

### Result:

| SalesPerson | ProductA | ProductB | ProductC |
|-------------|----------|----------|----------|
| Alice       | 100      | 150      | NULL     |
| Bob         | 120      | NULL     | 130      |

This query transforms the rows into columns based on the `Product` values.
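
Where the `PIVOT` operator is not available, the same result can be produced with conditional aggregation (`SUM` over a `CASE` expression). The sketch below swaps in that technique and runs it on an in-memory SQLite database with the sample rows from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SalesPerson TEXT, Product TEXT, Amount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("Alice", "ProductA", 100), ("Alice", "ProductB", 150),
     ("Bob", "ProductA", 120), ("Bob", "ProductC", 130)],
)

# Each CASE yields Amount for the matching product and NULL otherwise;
# SUM ignores NULLs and returns NULL when no row matched at all.
pivoted = conn.execute(
    """
    SELECT SalesPerson,
           SUM(CASE WHEN Product = 'ProductA' THEN Amount END) AS ProductA,
           SUM(CASE WHEN Product = 'ProductB' THEN Amount END) AS ProductB,
           SUM(CASE WHEN Product = 'ProductC' THEN Amount END) AS ProductC
    FROM Sales
    GROUP BY SalesPerson
    ORDER BY SalesPerson
    """
).fetchall()
for row in pivoted:
    print(row)
```

As with `PIVOT`, the list of output columns has to be written out in advance; neither form discovers new product values on its own.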
'''