

---

## 1. What is SQL’s role in an AI/ML pipeline?

**Answer:**
SQL is primarily used in the **data preparation and validation stages** of an AI/ML pipeline. It is used to:

* Extract raw data from databases
* Perform aggregations, joins, and filtering for **feature creation**
* Validate data quality (nulls, duplicates, outliers)
* Generate labels and training datasets
* Monitor data drift and model outputs post-deployment

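SQL is preferred here because filtering, grouping, and joining are usually **more efficient when pushed down to the database** than after pulling raw rows into application memory: the engine can use indexes and statistics, and only the reduced result leaves the database.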

---

## 2. Why do AI/ML interviews emphasize SQL more than advanced ML theory?

**Answer:**
Because **poor data preparation leads to poor models**, regardless of algorithm choice.
Interviewers use SQL to test:

* Data reasoning
* Edge case handling
* Understanding of data relationships
* Ability to avoid data leakage
* Analytical thinking

In real-world ML systems, most of the effort (commonly estimated at **70–80%**) goes into data work before modeling, and SQL dominates that phase.

---

## 3. Explain NULL and why it is dangerous in analytics.

**Answer:**
NULL represents **unknown or missing data**, not zero or empty.

It is dangerous because:

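* Arithmetic with NULL results in NULL
* Comparisons with NULL return UNKNOWN, so `col = NULL` never matches; use `IS NULL` instead
* Most aggregate functions (`SUM`, `AVG`, `COUNT(col)`) skip NULLs while `COUNT(*)` counts every row, so averages and counts can silently disagree
* INNER JOINs silently drop rows whose key is NULL, because NULL never equals anything, including another NULL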

In ML, mishandling NULLs leads to **biased features and incorrect labels**, which directly affects model performance.
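
The NULL behaviors listed above can be checked directly. The following is a minimal sketch that assumes Python's built-in `sqlite3` module as a stand-in for a real database; the `orders` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (2, None), (3, 50.0)],  # order 2 has a missing amount
)

# Arithmetic with NULL yields NULL (returned as None in Python).
null_sum = conn.execute("SELECT 1 + NULL").fetchone()[0]

# NULL = NULL evaluates to UNKNOWN, which SQLite also returns as NULL.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]

# COUNT(*) counts rows; COUNT(amount) and AVG(amount) skip NULLs.
# Treating the missing amount as 0 changes the average.
total_rows, non_null, avg_skip, avg_zero = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(amount),
           AVG(amount),
           AVG(COALESCE(amount, 0))
    FROM orders
    """
).fetchone()

print(null_sum, null_eq, total_rows, non_null, avg_skip, avg_zero)
```

Whether skipping the NULL or imputing 0 is the right choice depends on what the missing value means for the feature; the point is that the two averages differ.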

---

## 4. Difference between WHERE and HAVING (deep explanation)?

**Answer:**

* **WHERE** filters individual rows **before aggregation**
* **HAVING** filters groups **after aggregation**

Logical execution order:

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY

This is why aggregate functions cannot be used in WHERE: the groups they summarize do not exist yet when WHERE runs.
In analytics, HAVING is commonly used to remove statistically insignificant groups.
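
As an illustration, the sketch below filters rows with WHERE and then filters groups with HAVING. It assumes Python's `sqlite3` module and a hypothetical `purchases` table; the minimum group size of 2 is an arbitrary stand-in for a real significance threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, 10.0, "completed"),
        (1, 20.0, "completed"),
        (1, 5.0, "refunded"),
        (2, 30.0, "completed"),
        (3, 7.0, "completed"),
        (3, 8.0, "completed"),
        (3, 9.0, "completed"),
    ],
)

rows = conn.execute(
    """
    SELECT user_id, COUNT(*) AS n_purchases, AVG(amount) AS avg_amount
    FROM purchases
    WHERE status = 'completed'   -- row-level filter, applied before grouping
    GROUP BY user_id
    HAVING COUNT(*) >= 2         -- group-level filter, applied after aggregation
    ORDER BY user_id
    """
).fetchall()

print(rows)
```

User 1's refunded row is removed before grouping, and user 2 is removed afterwards because only one completed purchase remains in that group.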

---

## 5. Why must all non-aggregated columns be in GROUP BY?

**Answer:**
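A grouped query returns one row per group, so every selected column must have a single well-defined value for that group.
If a column is neither aggregated nor grouped, SQL cannot determine **which value** to display from the group.

Most engines (for example PostgreSQL, SQL Server, and MySQL with `ONLY_FULL_GROUP_BY` enabled) reject such queries to avoid **ambiguous analytical results**; a few permissive ones, such as SQLite, instead return a value from an arbitrary row.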

---

## 6. How do JOINs impact row counts and why is this critical for ML?

**Answer:**
JOINs can:

* Increase row count (one-to-many or many-to-many matches)
* Reduce row count (INNER JOIN drops unmatched rows)
* Introduce NULLs (LEFT JOIN fills unmatched right-side columns)

In ML, incorrect joins can:

* Duplicate samples
* Skew distributions
* Leak future information
* Inflate feature importance

This is one of the **most common silent ML bugs** in production systems.
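
A cheap guard is to compare row counts before and after each join. The sketch below assumes Python's `sqlite3` module and hypothetical `users` and `sessions` tables; it shows a one-to-many join duplicating a training sample and one possible fix, pre-aggregating the many side before joining.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, churned INTEGER);
    CREATE TABLE sessions (session_id TEXT, user_id INTEGER);
    INSERT INTO users VALUES (1, 1), (2, 0);
    INSERT INTO sessions VALUES ('s1', 1), ('s2', 1), ('s3', 1);
    """
)

# INNER JOIN: user 1 appears three times, user 2 (no sessions) disappears.
inner_count = conn.execute(
    "SELECT COUNT(*) FROM users u JOIN sessions s ON s.user_id = u.user_id"
).fetchone()[0]

# LEFT JOIN keeps user 2, but user 1 is still triplicated.
left_count = conn.execute(
    "SELECT COUNT(*) FROM users u LEFT JOIN sessions s ON s.user_id = u.user_id"
).fetchone()[0]

# Pre-aggregate sessions to one row per user, then LEFT JOIN:
# exactly one row per training sample, with 0 for inactive users.
safe_rows = conn.execute(
    """
    SELECT u.user_id, u.churned, COALESCE(a.n_sessions, 0) AS n_sessions
    FROM users u
    LEFT JOIN (
        SELECT user_id, COUNT(*) AS n_sessions
        FROM sessions
        GROUP BY user_id
    ) AS a ON a.user_id = u.user_id
    ORDER BY u.user_id
    """
).fetchall()

print(inner_count, left_count, safe_rows)
```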

---

## 7. INNER JOIN vs LEFT JOIN – when would LEFT JOIN be preferred?

**Answer:**
LEFT JOIN is preferred when:

* You want to **preserve all records** from the primary table
* Missing data is meaningful (e.g., users without transactions)

In ML, LEFT JOIN is often used to avoid **dropping negative or inactive samples**, which are crucial for unbiased learning.

---

## 8. What is a correlated subquery and why is it expensive?

**Answer:**
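A correlated subquery references columns from the outer query, so conceptually it is evaluated **once per row** of the outer query.

It is expensive because:

* It can limit what the optimizer is able to rewrite (not every engine decorrelates it into a join)
* Evaluated naively, its cost grows with outer rows × inner rows
* It scales poorly on large datasets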

FAANG interviewers expect candidates to **replace correlated subqueries with JOINs or window functions** when possible.
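
One such rewrite is sketched below, assuming Python's `sqlite3` module (bundled SQLite 3.25 or newer for window functions) and a hypothetical `orders` table. Both queries return the orders whose amount exceeds that user's average; the second computes the average once per partition instead of once per outer row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 30.0), (3, 2, 5.0), (4, 2, 5.0), (5, 2, 20.0)],
)

# Correlated form: the inner query refers to o.user_id from the outer query.
correlated = conn.execute(
    """
    SELECT o.order_id, o.user_id, o.amount
    FROM orders o
    WHERE o.amount > (
        SELECT AVG(i.amount) FROM orders i WHERE i.user_id = o.user_id
    )
    ORDER BY o.order_id
    """
).fetchall()

# Window form: compute the per-user average alongside each row, then filter.
windowed = conn.execute(
    """
    SELECT order_id, user_id, amount
    FROM (
        SELECT order_id, user_id, amount,
               AVG(amount) OVER (PARTITION BY user_id) AS user_avg
        FROM orders
    ) AS t
    WHERE amount > user_avg
    ORDER BY order_id
    """
).fetchall()

print(correlated == windowed, windowed)
```

Whether the rewrite is actually faster depends on the engine and data volume, so it is worth confirming with the query plan rather than assuming.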

---

## 9. Window functions vs GROUP BY – core conceptual difference?

**Answer:**

* **GROUP BY** reduces rows
* **Window functions** retain row granularity

Window functions allow analytics **without collapsing data**, making them ideal for:

* Time-series features
* Rankings
* Rolling metrics
* Behavioral analytics

They are essential for **feature engineering in ML**.
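
The difference in row granularity is easy to see side by side. This sketch assumes Python's `sqlite3` module (SQLite 3.25 or newer) and a hypothetical `daily_spend` table; the 3-row rolling window is an arbitrary choice for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_spend (user_id INTEGER, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_spend VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10.0),
        (1, "2024-01-02", 20.0),
        (1, "2024-01-03", 30.0),
        (1, "2024-01-04", 40.0),
        (2, "2024-01-01", 5.0),
    ],
)

# GROUP BY collapses each user's rows into a single row.
grouped = conn.execute(
    "SELECT user_id, SUM(amount) FROM daily_spend GROUP BY user_id ORDER BY user_id"
).fetchall()

# The window function keeps every row and adds a rolling sum over
# the current row and the two preceding rows of the same user.
rolling = conn.execute(
    """
    SELECT user_id, day, amount,
           SUM(amount) OVER (
               PARTITION BY user_id
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS spend_last_3_rows
    FROM daily_spend
    ORDER BY user_id, day
    """
).fetchall()

print(len(grouped), len(rolling))
```

Note that `ROWS BETWEEN` counts rows, not calendar days, so gaps in the data would need separate handling.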

---

## 10. Why can’t window functions be used in WHERE?

**Answer:**
Window functions are evaluated **after WHERE**.

Since WHERE executes earlier, the window results do not exist yet.
Filtering window results must be done using:

* Subqueries
* CTEs
* QUALIFY (in some databases)

This tests understanding of **query execution order**.
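
A small sketch of the CTE approach follows, assuming Python's `sqlite3` module (SQLite 3.25 or newer) and a hypothetical `events` table. It first shows that SQLite rejects a window function placed directly in WHERE, then keeps each user's latest event by filtering in an outer query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_ts TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", "view"),
        (1, "2024-01-03", "purchase"),
        (2, "2024-01-02", "view"),
    ],
)

# Referencing a window function directly in WHERE is rejected by the engine.
try:
    conn.execute(
        "SELECT * FROM events "
        "WHERE ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) = 1"
    )
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

# Compute the window value in a CTE first, then filter it in the outer query.
latest = conn.execute(
    """
    WITH ranked AS (
        SELECT user_id, event_ts, event_type,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY event_ts DESC
               ) AS rn
        FROM events
    )
    SELECT user_id, event_ts, event_type
    FROM ranked
    WHERE rn = 1
    ORDER BY user_id
    """
).fetchall()

print(where_rejected, latest)
```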

---

## 11. Difference between ROW_NUMBER, RANK, and DENSE_RANK?

**Answer:**

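* **ROW_NUMBER** → unique sequential number per row; ties are broken arbitrarily unless the ORDER BY is fully deterministic
* **RANK** → tied rows share a rank, and the following ranks are skipped (1, 1, 3)
* **DENSE_RANK** → tied rows share a rank with no gaps (1, 1, 2)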

Used in ML to:

* Select top-K events
* Remove duplicates deterministically
* Rank features or entities
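
The three functions can be compared on the same tied data. This is a minimal sketch assuming Python's `sqlite3` module (SQLite 3.25 or newer) and a hypothetical `scores` table; `name` is added to the ROW_NUMBER ordering only to make the tie-break deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 90), ("b", 90), ("c", 80), ("d", 70)],  # a and b are tied
)

ranked = conn.execute(
    """
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()

for row in ranked:
    print(row)
```

For deduplication, filtering on `row_num = 1` keeps exactly one row per partition, whereas filtering on `rnk = 1` would keep every tied row.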

---

## 12. What is data leakage and how can SQL cause it?

**Answer:**
Data leakage occurs when training features contain information that would not be available at prediction time, most often data from the future relative to the label timestamp.

SQL causes leakage when:

* Joins use future timestamps
* Aggregates include future data
* Labels or dataset-wide statistics are computed over all rows before the train/test split, so test-period information reaches training

FAANG companies heavily penalize candidates who ignore this.
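
A common safeguard is a point-in-time join, where each feature only sees rows dated before the prediction timestamp. The sketch below assumes Python's `sqlite3` module, hypothetical `labels` and `transactions` tables, and ISO-8601 date strings so that text comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE labels (user_id INTEGER, prediction_ts TEXT);
    CREATE TABLE transactions (user_id INTEGER, txn_ts TEXT, amount REAL);
    INSERT INTO labels VALUES (1, '2024-02-01');
    INSERT INTO transactions VALUES
        (1, '2024-01-10', 100.0),
        (1, '2024-01-20', 50.0),
        (1, '2024-02-15', 500.0);  -- happens after the prediction time
    """
)

# Leaky: aggregates every transaction, including the future one.
leaky = conn.execute(
    """
    SELECT l.user_id, SUM(t.amount) AS total_spend
    FROM labels l
    JOIN transactions t ON t.user_id = l.user_id
    GROUP BY l.user_id
    """
).fetchall()

# Point-in-time: only transactions strictly before prediction_ts are joined.
safe = conn.execute(
    """
    SELECT l.user_id, COALESCE(SUM(t.amount), 0) AS spend_before
    FROM labels l
    LEFT JOIN transactions t
           ON t.user_id = l.user_id
          AND t.txn_ts < l.prediction_ts
    GROUP BY l.user_id, l.prediction_ts
    """
).fetchall()

print(leaky, safe)
```

Placing the time condition in the `ON` clause rather than in `WHERE` keeps users with no prior transactions in the result, with a spend of 0.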

---

## 13. Why is SELECT * discouraged in production analytics?

**Answer:**
Because it:

* Increases I/O
* Breaks schema-dependent pipelines
* Makes queries fragile to column changes
* Loads unnecessary data into ML pipelines

Explicit column selection is a **best practice in scalable ML systems**.

---

## 14. What is an index and when does it hurt performance?

**Answer:**
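An index is an auxiliary data structure (commonly a B-tree) that lets the database find matching rows without scanning the whole table. It improves read performance but:

* Slows INSERT/UPDATE/DELETE, because every index must be maintained on each write
* Consumes storage and memory
* Can be ignored by the optimizer when a full scan is estimated to be cheaper

Indexing low-cardinality columns (e.g., gender) rarely helps: each value matches a large fraction of the table, so the optimizer often prefers a scan while writes still pay the maintenance cost.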
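
Whether an index is actually used can be checked from the query plan. The sketch below assumes Python's `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the `events` table and the index name `idx_events_user` are hypothetical, and other engines expose plans through their own `EXPLAIN` variants with different output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 1000, "x") for i in range(10000)],  # 1000 distinct user_ids
)

query = "SELECT payload FROM events WHERE user_id = 42"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable detail string.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(query)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = plan(query)   # the plan now searches via the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the scan/search distinction and the index name than to match the full string.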

---

## 15. OLTP vs OLAP – why does it matter for ML?

**Answer:**

* **OLTP** → transactional, normalized, frequent writes
* **OLAP** → analytical, denormalized, read-heavy

ML workloads rely on **OLAP systems** because:

* They support large scans
* They optimize aggregations
* They enable fast feature extraction

---

## 16. Why are data warehouses denormalized?

**Answer:**
Denormalization reduces joins, improves read performance, and simplifies analytical queries.

In ML pipelines, faster feature extraction is more valuable than strict normalization.

---

## 17. How does SQL help detect data drift?

**Answer:**
Using:

* Distribution comparisons
* Aggregates over time
* Percentile changes
* NULL rate monitoring

SQL is used for **continuous validation of model inputs**.
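
Two of these checks, NULL rate and a shifting mean, can be computed per period in a single query. This sketch assumes Python's `sqlite3` module and a hypothetical `model_inputs` table; the 10% NULL-rate threshold is an arbitrary illustrative value, not a recommended setting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model_inputs (week TEXT, feature REAL)")
conn.executemany(
    "INSERT INTO model_inputs VALUES (?, ?)",
    [
        ("w1", 10.0), ("w1", 12.0), ("w1", 14.0), ("w1", 12.0),
        ("w2", 20.0), ("w2", None), ("w2", 22.0), ("w2", None),
    ],
)

stats = conn.execute(
    """
    SELECT week,
           AVG(CASE WHEN feature IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate,
           AVG(feature) AS mean_value   -- NULLs are skipped here
    FROM model_inputs
    GROUP BY week
    ORDER BY week
    """
).fetchall()

NULL_RATE_THRESHOLD = 0.10  # hypothetical alerting threshold
flagged = [week for week, null_rate, _ in stats if null_rate > NULL_RATE_THRESHOLD]

print(stats, flagged)
```

In this toy data both signals move in the second week: half the values are missing and the mean of the remaining ones has shifted.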

---

## 18. Why is SQL still relevant despite Spark and Pandas?

**Answer:**
Because:

* Warehouses process data where it lives and scale past a single machine's memory, which is the practical limit for Pandas
* SQL is declarative
* Optimizers handle execution
* Data often already resides in warehouses

FAANG ML systems use SQL + Python, not one or the other.

---

## 19. Explain ACID and why it matters less in analytics but more in labeling.

**Answer:**
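ACID (Atomicity, Consistency, Isolation, Durability) is the set of guarantees that keeps transactions correct even under concurrency and failures.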

For analytics:

* Eventual consistency is usually acceptable, since aggregates over large datasets tolerate slightly stale rows

For labeling:

* Incorrect labels corrupt training data
* Atomicity and consistency are critical

---

## 20. What does a strong SQL answer look like in FAANG interviews?

**Answer:**
A strong answer:

* Explains *why*, not just *what*
* Mentions performance implications
* Connects SQL decisions to ML outcomes
* Acknowledges edge cases

---

## FINAL ADVICE (IMPORTANT)

For FAANG-level AI/ML roles:

* SQL is judged as **data intelligence**
* Syntax matters less than **reasoning**
* Weak window-function or join skills are deal breakers

---


