# Hands-On Exercise: Implementing a Data Warehouse in Apache Hive Using Retail Big Data

This exercise will guide students through setting up a simple Hive data warehouse on the previously created Hadoop cluster. The use case will focus on retail big data, and students will implement a dimensional model (star schema) in Hive, explore OLAP concepts, and practice Hive partitioning, bucketing, and external table creation.

## Step 1: Designing and Implementing a Simple Hive Data Warehouse

**What is a Data Warehouse?**
A data warehouse is a system used for reporting and data analysis, integrating data from multiple sources to provide a consolidated view for business intelligence.

**Retail Use Case:**
In this exercise, we will work with a sample retail dataset that contains the following tables:

- Customers: Information about customers.
- Products: Product catalog.
- Sales: Sales transaction records.
- Stores: Details about store locations.

### Task 1: Create a Hive Database for the Data Warehouse

1. Launch the Hive shell:

In [None]:
$ hive

2. Create a Database: To organize the data, create a database named `retail_dw`:

In [None]:
CREATE DATABASE IF NOT EXISTS retail_dw;
USE retail_dw;

## Step 2: Hive Data Warehouse Architectures and Design

### Task 2: Create Hive Tables
Design and implement the basic structure for your Hive data warehouse using tables for the retail use case. We'll follow the star schema design pattern for efficient querying and reporting.

Create the Fact and Dimension Tables:

1. Customers Dimension Table:

In [None]:
CREATE TABLE customers (
  customer_id INT,
  customer_name STRING,
  customer_email STRING,
  customer_phone STRING,
  customer_address STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

2. Products Dimension Table:

In [None]:
CREATE TABLE products (
  product_id INT,
  product_name STRING,
  category STRING,
  price DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

3. Stores Dimension Table:

In [None]:
CREATE TABLE stores (
  store_id INT,
  store_name STRING,
  store_city STRING,
  store_state STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

4. Sales Fact Table:

In [None]:
CREATE TABLE sales (
  transaction_id INT,
  transaction_date STRING,
  customer_id INT,
  product_id INT,
  store_id INT,
  quantity INT,
  total_amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

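5. Load data into tables. `LOAD DATA LOCAL INPATH` reads from the local filesystem, and the relative paths below resolve against the directory from which you launched the Hive shell (assumed here to be your home directory):
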
In [None]:
LOAD DATA LOCAL INPATH './Documents/datatech_labs/datatech_lab_de_course_public/week3/data_warehousing/hands_on_data/customers_data.csv' INTO TABLE customers;
LOAD DATA LOCAL INPATH './Documents/datatech_labs/datatech_lab_de_course_public/week3/data_warehousing/hands_on_data/products_data.csv' INTO TABLE products;
LOAD DATA LOCAL INPATH './Documents/datatech_labs/datatech_lab_de_course_public/week3/data_warehousing/hands_on_data/stores_data.csv' INTO TABLE stores;
LOAD DATA LOCAL INPATH './Documents/datatech_labs/datatech_lab_de_course_public/week3/data_warehousing/hands_on_data/sales_data.csv' INTO TABLE sales;
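
As a quick sanity check after loading, one might count the rows in each table and preview a few records. This is a minimal sketch; the expected counts depend on the course data files, so none are shown here.

```sql
-- Row counts per table; compare them with the line counts of the CSV files.
SELECT COUNT(*) FROM customers;
SELECT COUNT(*) FROM products;
SELECT COUNT(*) FROM stores;
SELECT COUNT(*) FROM sales;

-- Preview a few fact rows. If a CSV file starts with a header line
-- (an assumption, not confirmed by the exercise), that line shows up
-- as a data row with NULL in the numeric columns.
SELECT * FROM sales LIMIT 5;
```

If a header row does appear, the table property `skip.header.line.count` can be set to `1` so that Hive skips it when reading.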

## Step 3: Hive Dimensional Modeling and Star Schema

In dimensional modeling, we use fact and dimension tables. The fact table (sales) stores quantitative data for analysis, while the dimension tables (customers, products, stores) provide context to the facts.

### Task 3: Implement the Star Schema

In this star schema:
- Fact Table: `sales`
- Dimension Tables: `customers`, `products`, `stores`

After creating the tables in Step 2, your Hive data warehouse follows a star schema with `sales` at the center, surrounded by the dimensions (`customers`, `products`, `stores`).
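
To see the star schema in action, a query can join the fact table to each dimension through its key column. The following is a minimal sketch using only the tables and columns defined in Step 2; the `LIMIT` is there to keep the output short.

```sql
-- Join the sales fact table to all three dimensions.
SELECT
  c.customer_name,
  p.product_name,
  st.store_city,
  s.quantity,
  s.total_amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id
JOIN stores st ON s.store_id = st.store_id
LIMIT 10;
```

Each join follows one "spoke" of the star: the fact row supplies the measures (`quantity`, `total_amount`) and the dimension rows supply the descriptive context.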

## Step 4: Hive OLAP Concepts and Cube Operations

### Task 4: Run OLAP Queries Using GROUP BY and CUBE

1. Simple OLAP Query: Calculate total sales by product category:

In [None]:
SELECT p.category, SUM(s.total_amount) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category
;

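2. Cube Operation: Use `CUBE` to aggregate across every combination of the grouped dimensions. The result contains totals per (category, state) pair, per category, per state, and a grand total; a column that is rolled up in a subtotal row appears as `NULL`: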

In [None]:
SELECT p.category, st.store_state, SUM(s.total_amount) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN stores st ON s.store_id = st.store_id
GROUP BY CUBE(p.category, st.store_state)
;
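
Older Hive releases may not accept the `GROUP BY CUBE(...)` form. As a hedged alternative, the long-standing `WITH CUBE` and `WITH ROLLUP` modifiers can be used; which form your cluster accepts depends on its Hive version, so treat the following as a sketch to try rather than a guaranteed fix.

```sql
-- Same cube as above, written with the older WITH CUBE modifier.
SELECT p.category, st.store_state, SUM(s.total_amount) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN stores st ON s.store_id = st.store_id
GROUP BY p.category, st.store_state WITH CUBE;

-- ROLLUP produces a hierarchy instead of every combination:
-- (category, state), (category), and the grand total.
SELECT p.category, st.store_state, SUM(s.total_amount) AS total_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN stores st ON s.store_id = st.store_id
GROUP BY p.category, st.store_state WITH ROLLUP;
```

`ROLLUP` is the cheaper choice when the dimensions form a natural hierarchy, because it skips the per-state subtotals that `CUBE` would add.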

## Step 5: Hive Partitioning and Bucketing

**Partitioning:**
Partitioning splits a table into separate directories based on the values of one or more columns. Queries that filter on a partition column read only the matching partitions, which reduces the amount of data scanned.

### Task 5: Implement Partitioning on the `sales` Table

1. Create a Partitioned Sales Table: Partition the sales data by `store_state`:

In [None]:
CREATE TABLE sales_partitioned (
  transaction_id INT,
  transaction_date STRING,
  customer_id INT,
  product_id INT,
  store_id INT,
  quantity INT,
  total_amount DOUBLE
)
PARTITIONED BY (store_state STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

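2. Add Data to Partitions: Load data into a partition using a static partition specification. Note that `LOAD DATA` does not inspect the rows: every record in the file is placed in the `store_state = 'CA'` partition, so in practice the file should contain only California sales: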

In [None]:
LOAD DATA LOCAL INPATH './Documents/datatech_labs/datatech_lab_de_course_public/week3/data_warehousing/hands_on_data/sales_data.csv'
INTO TABLE sales_partitioned
PARTITION (store_state = 'CA');

SELECT * FROM sales_partitioned;
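
Because the static load above labels every row as `CA`, a more faithful way to fill the table is a dynamic partition insert that takes each sale's state from the `stores` dimension. The sketch below assumes the `sales` and `stores` tables from Step 2 are loaded; it first drops the `CA` partition created by the static load so that rows are not counted twice.

```sql
-- Allow Hive to create partitions from the query output.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Remove the partition created by the static LOAD DATA above.
ALTER TABLE sales_partitioned DROP IF EXISTS PARTITION (store_state = 'CA');

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE sales_partitioned PARTITION (store_state)
SELECT
  s.transaction_id,
  s.transaction_date,
  s.customer_id,
  s.product_id,
  s.store_id,
  s.quantity,
  s.total_amount,
  st.store_state
FROM sales s
JOIN stores st ON s.store_id = st.store_id;

-- List the partitions that were created, one per state.
SHOW PARTITIONS sales_partitioned;

-- A filter on the partition column reads only that partition's directory.
SELECT SUM(total_amount) AS ca_sales
FROM sales_partitioned
WHERE store_state = 'CA';
```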

**Bucketing:**
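Bucketing further divides a table (or each partition) into a fixed number of files, called buckets, by hashing the value of a column. Rows with the same value always land in the same bucket, which supports sampling and more efficient joins on the bucketed column.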

### Task 6: Implement Bucketing on the Customers Table

1. Create a Bucketed Customers Table: Bucket the `customers` data by `customer_id`:

In [None]:
CREATE TABLE customers_bucketed (
  customer_id INT,
  customer_name STRING,
  customer_email STRING,
  customer_phone STRING,
  customer_address STRING
)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

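2. Load Data into the Bucketed Table: `LOAD DATA` only moves files into the table directory and does not hash rows into buckets, so populate the table with an `INSERT ... SELECT` from the `customers` table instead:

In [None]:
-- On Hive versions before 2.0, first run: SET hive.enforce.bucketing = true;
INSERT INTO TABLE customers_bucketed
SELECT * FROM customers;

SELECT * FROM customers_bucketed;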
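
One practical use of bucketing is sampling. The sketch below reads a single bucket out of the four defined above; it assumes the table was populated through an `INSERT ... SELECT`, so that rows really are distributed by the hash of `customer_id`.

```sql
-- Read only the first of the four buckets (roughly a quarter of the rows).
SELECT *
FROM customers_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON customer_id);
```

Since the sampling column matches the `CLUSTERED BY` column, Hive can satisfy this query from one bucket file rather than scanning the whole table.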

## Step 6: How Hive Data is Stored in HDFS

Hive stores each table as a directory in HDFS. In a partitioned table, each partition is a subdirectory named `column=value` (for example, `store_state=CA`), and the data files sit inside it.

### Task 7: Explore HDFS Storage for Hive Tables

1. Check the HDFS location of a table: Run the following and look for the `Location` field in the output:

In [None]:
DESCRIBE FORMATTED sales_partitioned;

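2. Explore the HDFS Directory: Use HDFS commands to explore the directory structure. The path below assumes the default warehouse directory; use the `Location` value from the previous step if yours differs: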

In [None]:
$ hdfs dfs -ls /user/hive/warehouse/retail_dw.db/sales_partitioned/
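
The same layout can also be inspected from inside the Hive shell. The following sketch lists the partitions Hive knows about and then shows the directory backing one of them; it assumes the `store_state = 'CA'` partition from Task 5 exists.

```sql
-- Each partition listed here corresponds to one subdirectory in HDFS.
SHOW PARTITIONS sales_partitioned;

-- The Location field shows the directory for this specific partition.
DESCRIBE FORMATTED sales_partitioned PARTITION (store_state = 'CA');
```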

## Step 7: Hive Metastore Query Sample

The Hive metastore stores metadata about tables, such as their schemas, storage formats, and locations. You can inspect this metadata with the `SHOW` and `DESCRIBE` commands.

### Task 8: Query the Hive Metastore

1. View All Tables in the Database:

In [None]:
SHOW TABLES IN retail_dw;

2. Describe a Table’s Schema: Get detailed information about the `sales` table:

In [None]:
DESCRIBE FORMATTED sales;
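
A few other metadata commands are often useful alongside the two above. This is a short sketch of standard HiveQL statements run against the objects created in this exercise.

```sql
-- List all databases registered in the metastore.
SHOW DATABASES;

-- Show the database's HDFS location and owner.
DESCRIBE DATABASE retail_dw;

-- Filter the table list with a wildcard pattern.
SHOW TABLES LIKE 'sales*';

-- Print the full DDL Hive would use to recreate the table.
SHOW CREATE TABLE sales;
```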

## Step 8: Load Data from Local Path or HDFS Path

### Task 9: Load Data into Hive Tables

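1. Loading Data from a Local Path: Load local data into the `products` table. Because the table was already loaded in Task 2 and `INTO TABLE` appends, use `OVERWRITE` to replace the existing rows instead of duplicating them:

In [None]:
LOAD DATA LOCAL INPATH './Documents/datatech_labs/datatech_lab_de_course_public/week3/data_warehousing/hands_on_data/products_data.csv'
OVERWRITE INTO TABLE products;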

2. Loading Data from HDFS: Create the target directory, then upload the data files to HDFS:

In [None]:
$ hdfs dfs -mkdir -p /user/datatech-labs/hive-data/
$ hdfs dfs -put ~/Documents/datatech_labs/datatech_lab_de_course_public/week3/data_warehousing/hands_on_data/* /user/datatech-labs/hive-data/

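3. Then, load the HDFS data into the table. Without `LOCAL`, `LOAD DATA` moves the file from its HDFS location into the table's directory rather than copying it:

In [None]:
LOAD DATA INPATH '/user/datatech-labs/hive-data/products_data.csv'
OVERWRITE INTO TABLE products;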
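
To observe the effect of loading from HDFS, the source directory can be listed from inside the Hive shell with its built-in `dfs` command. This is a small sketch that assumes the upload to `/user/datatech-labs/hive-data/` succeeded.

```sql
-- After the HDFS load, products_data.csv should no longer be listed here,
-- because LOAD DATA INPATH moved it into the table's directory.
dfs -ls /user/datatech-labs/hive-data/;

-- Compare the row count with the number of lines in the CSV file.
SELECT COUNT(*) FROM products;
```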

## Step 9: Create Hive External Tables

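External tables let Hive query data that lives outside the Hive warehouse directory. Hive manages only the table's metadata: dropping an external table removes the definition but leaves the data files in place.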

### Task 10: Create an External Table

1. Create an External Table for Stores Data:

In [None]:
CREATE EXTERNAL TABLE external_stores (
  store_id INT,
  store_name STRING,
  store_city STRING,
  store_state STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/datatech-labs/hive-data/stores'
;
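
The `LOCATION` clause points at a directory, not a file, so the stores data needs to be placed inside it. One way to stage it, assuming `stores_data.csv` was uploaded to `/user/datatech-labs/hive-data/` in Task 9, is to copy the file with the Hive shell's `dfs` command:

```sql
-- Create the external table's directory if it does not exist yet.
dfs -mkdir -p /user/datatech-labs/hive-data/stores;

-- Copy the stores file into that directory (path assumed from Task 9).
dfs -cp /user/datatech-labs/hive-data/stores_data.csv /user/datatech-labs/hive-data/stores/;

-- Confirm the file is where the table expects it.
dfs -ls /user/datatech-labs/hive-data/stores/;
```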

2. Read Data through the External Table: The `LOCATION` must be an HDFS directory that contains the data files; if the directory contains no files, the query returns no rows.

In [None]:
SELECT * FROM external_stores;
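
The difference between an external and a managed table is visible in the metadata. The sketch below compares the two table types without changing anything; the `DROP TABLE` at the end is left commented out and is optional.

```sql
-- Table Type should read EXTERNAL_TABLE, with Location outside the warehouse.
DESCRIBE FORMATTED external_stores;

-- For comparison, the managed stores table should read MANAGED_TABLE.
DESCRIBE FORMATTED stores;

-- Optional: dropping the external table removes only its metadata.
-- The files under /user/datatech-labs/hive-data/stores remain in HDFS.
-- DROP TABLE external_stores;
```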

## Step 10: Specify Storing Format in Table Creation

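Hive supports several file formats, including plain text, ORC, and Parquet. Columnar formats such as ORC and Parquet compress well and let queries read only the columns they need, which usually improves performance for analytical workloads.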

### Task 11: Create Tables with Different Storage Formats

1. Create a Table Using `Parquet` Format: In this task, we will create a Hive table for the `sales` data and specify that it should be stored in Parquet format, which is columnar and optimized for big data analytics.

In [None]:
CREATE TABLE sales_parquet (
  transaction_id INT,
  transaction_date STRING,
  customer_id INT,
  product_id INT,
  store_id INT,
  quantity INT,
  total_amount DOUBLE
)
STORED AS PARQUET
;

2. Load Data into the Parquet Table: `LOAD DATA` does not convert file formats, so loading a CSV file directly would leave text data in a table that expects Parquet. Instead, populate `sales_parquet` from the existing `sales` table with an `INSERT ... SELECT`, which writes the rows as Parquet:

In [None]:
INSERT INTO TABLE sales_parquet
SELECT * FROM sales
;
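
The same approach works for ORC. As a hedged variation, the table can be created and populated in one statement with `CREATE TABLE ... AS SELECT`; the name `sales_orc` is hypothetical and not part of the original exercise.

```sql
-- Create an ORC copy of the sales data in a single step.
-- sales_orc is an illustrative name, not a table the exercise defines.
CREATE TABLE sales_orc
STORED AS ORC
AS
SELECT * FROM sales;

-- Confirm the storage format: the InputFormat and SerDe fields mention ORC.
DESCRIBE FORMATTED sales_orc;
```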

3. Verify the Data in HDFS: You can check the storage of the Parquet table in HDFS using the following command:

In [None]:
$ hdfs dfs -ls /user/hive/warehouse/retail_dw.db/sales_parquet/
