In [None]:
### 1. Explain the differences between Hive and relational databases (RDBMS).
'''
- Schema on Read vs. Write: In Hive, the schema is applied when reading data, whereas in RDBMS, the schema is enforced when writing data.

- Performance: Hive is not optimized for low-latency queries and is primarily used for batch processing on large datasets, 
whereas RDBMS are designed for high-speed transactions.

- SQL Compatibility: Hive supports HiveQL, which is similar to SQL, but historically lacked features that RDBMS users expect, such as 
row-level updates and ACID transactions (ACID support was added in version 0.14 and has improved in newer versions).

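- Storage: Hive stores data as files in a distributed file system such as HDFS or in cloud object storage such as S3, 
while an RDBMS manages its own storage files on local or attached disks.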
'''
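
# The schema-on-read point above can be sketched in plain Python. This is only a toy model, not Hive code:
# RAW_DATA, SCHEMA, and read_with_schema are hypothetical names, and the behaviour shown (a value that does
# not fit the declared type is read as NULL) mirrors how Hive treats malformed fields in text files.

```python
import csv
import io

# Raw file contents as they might sit in storage; nothing validated them on write.
RAW_DATA = "1,19.99\n2,not_a_number\n3,5.00\n"

# Hypothetical schema: (column name, converter) pairs applied only at read time.
SCHEMA = [("id", int), ("amount", float)]


def read_with_schema(raw, schema):
    """Apply the schema while reading; values that do not fit become None (NULL)."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for (name, convert), value in zip(schema, fields):
            try:
                row[name] = convert(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_DATA, SCHEMA)
print(rows)
```

# A schema-on-write system would instead reject the second row at insert time.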

In [None]:
### 2. What are the different file formats supported by Hive?
#Hive supports multiple file formats for reading and writing data:
'''
   - TextFile: Default format; plain text and row-based, uncompressed unless a compression codec is configured.
   - SequenceFile: Binary key-value pairs, row-based.
   - ORC (Optimized Row Columnar): Columnar format, optimized for reading and space efficiency.
   - Parquet: Columnar format, supports efficient compression and encoding.
   - RCFile (Record Columnar File): Combines columnar and row-based storage.
   - Avro: Row-based, used for schema evolution and serialization.
'''
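
# The row-based versus columnar distinction above can be illustrated with a small Python sketch. The records
# and variable names are made up for illustration; real ORC and Parquet files add encoding, compression,
# and indexes on top of this basic idea.

```python
records = [
    {"id": 1, "amount": 10.0, "region": "EU"},
    {"id": 2, "amount": 25.5, "region": "US"},
    {"id": 3, "amount": 7.25, "region": "EU"},
]

# Row-based layout: each record's fields are stored together.
row_layout = [tuple(r.values()) for r in records]

# Columnar layout: all values of one column are stored together.
column_layout = {name: [r[name] for r in records] for name in records[0]}

# Summing one column has to walk every full row in the row layout...
total_from_rows = sum(row[1] for row in row_layout)
# ...but reads a single contiguous list in the columnar layout.
total_from_columns = sum(column_layout["amount"])

print(total_from_rows, total_from_columns)
```

# Storing similar values next to each other is also what lets columnar formats compress well.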

In [None]:
### 3. How can you optimize performance in Hive?
'''   
1. Partitioning: Split large datasets into smaller logical divisions based on columns to reduce scan size.

2. Bucketing: Hash rows on a column into a fixed number of "buckets" so that joins and sampling can work bucket by bucket.

3. Compression: Use efficient file formats like ORC or Parquet with compression techniques such as Snappy, LZO, or Zlib.

4. Tez Execution Engine: Use Tez instead of MapReduce for faster query execution.

5. Caching Metadata: Hive Metastore caching can improve performance by reducing the load on the metastore database.

6. Query Optimization: Enable Cost-Based Optimization (CBO) and collect table statistics with ANALYZE TABLE so the optimizer has data to work with.
'''

In [None]:
### 4. Explain how partitioning works in Hive.
# Partitioning stores a table's rows in separate directories, one per distinct combination of partition column values. When you query 
# a partitioned table with a filter on those columns, Hive scans only the relevant partitions instead of the whole table, reducing the amount of data read.

'''
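CREATE TABLE sales (id INT, amount DECIMAL(10,2), `date` STRING)
  PARTITIONED BY (year STRING, month STRING);

When querying:

SELECT * FROM sales WHERE year = '2023' AND month = '09';

Hive reads only the partition directory `year=2023/month=09`. Note that `date` is a reserved word and must be quoted with backticks, 
and that a bare DECIMAL defaults to DECIMAL(10,0), which would drop the fractional part of the amount.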
'''
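
# Partition pruning as described above can be sketched in Python. This is a simplified model: PARTITION_FILES,
# parse_partition, and prune are hypothetical names, and the dictionary stands in for the directory listing
# that Hive would obtain from the metastore.

```python
# Hypothetical layout: one directory per partition, named the way Hive names them.
PARTITION_FILES = {
    "sales/year=2023/month=08": ["part-0000"],
    "sales/year=2023/month=09": ["part-0000", "part-0001"],
    "sales/year=2022/month=09": ["part-0000"],
}


def parse_partition(path):
    """Extract {'year': '2023', 'month': '09'} from a partition directory path."""
    return dict(segment.split("=", 1) for segment in path.split("/") if "=" in segment)


def prune(partitions, **predicates):
    """Return only the directories whose partition values match every predicate."""
    return [
        path
        for path in partitions
        if all(parse_partition(path).get(k) == v for k, v in predicates.items())
    ]


selected = prune(PARTITION_FILES, year="2023", month="09")
print(selected)
```

# Only the files under the selected directories would be read; the other partitions are never opened.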

In [None]:
### 5. How does bucketing differ from partitioning in Hive?
# Partitioning: Splits a table into directories, one per distinct value (or combination of values) of the partition columns.
# Bucketing: Splits the data within a table or partition into a fixed number of files ("buckets") based on a hash of a column. This makes 
# joins, sampling, and grouping on that column more efficient.

'''
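CREATE TABLE sales_bucketed (id INT, amount DECIMAL(10,2))
   CLUSTERED BY (id) INTO 10 BUCKETS;

This creates 10 buckets based on the hash of the `id` column. When two tables are bucketed on the join key, Hive can join them bucket by bucket 
instead of comparing every row with every other row.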
'''
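
# The bucket assignment and the bucket-by-bucket join can be sketched as follows. This is a simplified model:
# bucket_for and bucketize are hypothetical helpers, and treating an integer key as its own hash is an
# assumption made for illustration. The hash Hive actually uses depends on the column type and Hive version.

```python
NUM_BUCKETS = 10


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Simplified model: use the integer key as its own hash, then take it modulo the bucket count."""
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(keys):
    """Group keys by bucket number, like the files of a bucketed table."""
    buckets = {}
    for key in keys:
        buckets.setdefault(bucket_for(key), []).append(key)
    return buckets


left = bucketize([3, 13, 27, 40])
right = bucketize([13, 23, 40, 57])

# Equal keys always land in the same bucket number, so a join only needs to
# compare bucket N of one table with bucket N of the other.
matches = []
for b in sorted(left.keys() & right.keys()):
    matches.extend(sorted(set(left[b]) & set(right[b])))

print(matches)
```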

In [None]:
### 6. What is the use of `distribute by`, `sort by`, and `cluster by` in Hive?
# - `DISTRIBUTE BY`: Sends rows with the same column values to the same reducer. It controls how data is spread across reducers 
# but does not sort it.
'''
SELECT * FROM table_name DISTRIBUTE BY col;
'''

# - `SORT BY`: Orders data within each reducer. It does not guarantee global ordering across all reducers.
'''
SELECT * FROM table_name SORT BY col;
'''

# - `CLUSTER BY`: Shorthand for `DISTRIBUTE BY` and `SORT BY` on the same columns. It distributes rows to reducers and sorts them 
# within each reducer.
'''
SELECT * FROM table_name CLUSTER BY col;
'''
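
# The three clauses can be modelled with a few lines of Python. This is a toy simulation, not Hive internals:
# the function names, the two reducers, and the modulo-based assignment are all assumptions for illustration.

```python
NUM_REDUCERS = 2
rows = [5, 2, 8, 1, 4, 7]


def distribute_by(values, num_reducers=NUM_REDUCERS):
    """Send equal values to the same reducer (here: value modulo reducer count)."""
    reducers = [[] for _ in range(num_reducers)]
    for v in values:
        reducers[v % num_reducers].append(v)
    return reducers


def sort_by(reducers):
    """Sort within each reducer only; no ordering across reducers."""
    return [sorted(r) for r in reducers]


def cluster_by(values, num_reducers=NUM_REDUCERS):
    """DISTRIBUTE BY followed by SORT BY on the same key."""
    return sort_by(distribute_by(values, num_reducers))


distributed = distribute_by(rows)
clustered = cluster_by(rows)
# Concatenating reducer outputs shows why the result is not globally sorted.
combined = [v for reducer in clustered for v in reducer]
print(distributed, clustered, combined)
```

# A globally sorted result needs `ORDER BY`, which funnels all rows through a single reducer.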


In [None]:
### 7. Explain the concept of ACID transactions in Hive. How do they work?
# Hive introduced ACID (Atomicity, Consistency, Isolation, Durability) transactions in version 0.14. These allow row-level insert, 
# update, and delete operations on transactional tables.
'''
   - Insert: `INSERT INTO` appends rows to a table.
   - Update/Delete: Change or remove individual records.
   - Atomicity: Transactions either complete fully or roll back on failure.
   - Isolation: Readers see a consistent snapshot, so concurrent transactions do not interfere with each other.
'''   

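#   To enable ACID transactions (for update and delete, the table must also be stored as ORC and created with 
#   TBLPROPERTIES ('transactional'='true')):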
'''
   SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
   SET hive.support.concurrency=true;

'''
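
# As for how this works: Hive does not rewrite data files in place. Changes are written to separate delta files
# next to a base file, readers merge base and deltas, and compaction later folds the deltas into a new base.
# The sketch below is a heavily simplified Python model of that idea. The data structures and the function
# names read_snapshot and compact are hypothetical and do not reflect Hive's actual file layout.

```python
# Hypothetical stand-ins for a base file and two committed delta files.
base = {1: 100.0, 2: 250.0, 3: 75.0}  # row_id -> amount
deltas = [
    [("update", 2, 300.0), ("delete", 3, None)],  # transaction 1
    [("insert", 4, 42.0)],                        # transaction 2
]


def read_snapshot(base_rows, committed_deltas):
    """Merge the base with the given deltas at read time; the base is never modified."""
    rows = dict(base_rows)
    for delta in committed_deltas:
        for op, row_id, value in delta:
            if op == "delete":
                rows.pop(row_id, None)
            else:  # insert or update
                rows[row_id] = value
    return rows


def compact(base_rows, committed_deltas):
    """Fold all deltas into a new base and return it with an empty delta list."""
    return read_snapshot(base_rows, committed_deltas), []


after_first_txn = read_snapshot(base, deltas[:1])
latest = read_snapshot(base, deltas)
new_base, remaining_deltas = compact(base, deltas)
print(after_first_txn, latest)
```

# A reader that started before transaction 2 committed keeps seeing `after_first_txn`, which is the snapshot isolation idea.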


In [None]:
### 8. What are dynamic partitions in Hive, and when would you use them?
#   Dynamic partitioning lets Hive create partitions automatically from the values in the rows being inserted, as opposed to naming 
# each partition statically in the INSERT statement. It is useful when loading data that spans many partitions at once.

'''
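SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, `date`, date_format(`date`, 'yyyy'), date_format(`date`, 'MM') FROM transactions;

'''
#  Hive automatically creates a partition for each distinct year and month in the query result. The partition columns must come last in 
#  the SELECT, in the order listed in PARTITION (...). The example assumes `date` holds strings like '2023-09-14'; date_format with 'MM' 
#  keeps the leading zero so the partition is named month=09 rather than month=9.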
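
# The effect of a dynamic-partition insert can be sketched in Python. This is a toy model only: the rows and
# the function dynamic_partition_insert are hypothetical, and the dictionary keys stand in for the partition
# directories Hive would create.

```python
from collections import defaultdict
from datetime import date

# Hypothetical source rows standing in for the `transactions` table.
transactions = [
    {"id": 1, "amount": 10.0, "date": date(2023, 9, 14)},
    {"id": 2, "amount": 4.5, "date": date(2023, 9, 30)},
    {"id": 3, "amount": 8.0, "date": date(2023, 10, 1)},
]


def dynamic_partition_insert(rows):
    """Derive the partition from each row's own values and route the row there."""
    partitions = defaultdict(list)
    for row in rows:
        d = row["date"]
        partitions[f"year={d:%Y}/month={d:%m}"].append((row["id"], row["amount"]))
    return dict(partitions)


partitions = dynamic_partition_insert(transactions)
for path, part_rows in sorted(partitions.items()):
    print(path, part_rows)
```

# No partition was named in advance; the two directories exist only because the data contained those months.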

In [None]:
### 9. How do you handle schema evolution in Hive?
'''
Schema evolution in Hive can be handled using formats like Avro or ORC, which allow you to add or remove columns from tables without 
breaking compatibility.
   
# Example of evolving an ORC schema:
ALTER TABLE table_name ADD COLUMNS (new_col STRING);

# Files written before the change do not contain new_col, so Hive returns NULL for it when reading those rows.
'''
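
# The read-side behaviour after adding a column can be sketched in Python. This is a simplified model:
# read_table and the sample records are hypothetical, and dictionaries stand in for data files written
# before and after the ALTER TABLE.

```python
# Records written before the column was added (no new_col field).
old_records = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
# Records written after the column was added.
new_records = [{"id": 3, "amount": 7.0, "new_col": "promo"}]

# Current table schema, as recorded in the metastore after the ALTER TABLE.
table_columns = ["id", "amount", "new_col"]


def read_table(records, columns):
    """Project every record onto the current schema; missing fields read as None (NULL)."""
    return [{c: rec.get(c) for c in columns} for rec in records]


result = read_table(old_records + new_records, table_columns)
print(result)
```

# The old files are never rewritten; only the table metadata changes.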

In [None]:
### 10. What are some common challenges in using Hive, and how can they be mitigated?
'''
- Slow Query Performance: Can be mitigated by:
   * using columnar data formats (ORC/Parquet),
   * partitioning, 
   * bucketing, and 
   * enabling vectorization.

- Handling Small Files: Too many small files slow down MapReduce jobs and put pressure on the HDFS NameNode. 
     * Use compaction or combine small files into larger ones.

- Metadata Bottleneck: Hive Metastore can become a bottleneck. 
    * Caching or using a faster database for the metastore can help.

'''
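
# The idea of combining small files can be sketched with a simple greedy grouping in Python. This is an
# illustration only, not how Hive's own merge step is implemented: plan_merges, the file sizes, and the
# 128 MB target are all assumptions.

```python
def plan_merges(file_sizes_mb, target_mb=128):
    """Group consecutive files so each merged output stays at or under the target size."""
    groups, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new output file when adding this one would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [40, 50, 30, 60, 100, 10]
merge_plan = plan_merges(small_files)
print(len(small_files), "files ->", len(merge_plan), "files:", merge_plan)
```

# Fewer, larger files mean fewer map tasks to schedule and fewer file entries for the NameNode to track.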