## Create Database Statement

A database in Hive is a namespace or a collection of tables.

Syntax - `CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name;`

In [0]:
CREATE DATABASE IF NOT EXISTS dataenggdatascfreelance1247_db;
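
To confirm that the statement above took effect, the database can be listed and described. This is a minimal sketch that reuses the database name from the cell above; the exact output columns depend on the Hive version.

```sql
-- List databases whose name matches the one created above.
SHOW DATABASES LIKE 'dataenggdatascfreelance1247_db';

-- Show the database's comment, location and owner.
DESCRIBE DATABASE dataenggdatascfreelance1247_db;
```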

## Drop Database Statement

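Deletes a database. By default (RESTRICT) the statement fails if the database still contains tables; with CASCADE, Hive drops the tables first and then deletes the database.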


Syntax - `DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];`



### Drop an empty database (no tables):

In [0]:
-- DATABASE and SCHEMA are interchangeable; use either form.
DROP DATABASE IF EXISTS database_name;

DROP SCHEMA IF EXISTS database_name;

### Drop database with tables:

- The following query drops the database using CASCADE, which drops the tables in the database before dropping the database itself.

- The default mode is RESTRICT, which blocks the deletion of a database that still holds tables.

In [0]:
DROP DATABASE database_name CASCADE;
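
The difference between RESTRICT and CASCADE can be tried on a throwaway database. This is a sketch only: `scratch_db` and `scratch_db.t1` are hypothetical names that do not appear elsewhere in this notebook.

```sql
-- scratch_db and scratch_db.t1 are hypothetical names used only for this illustration.
CREATE DATABASE IF NOT EXISTS scratch_db;
CREATE TABLE IF NOT EXISTS scratch_db.t1 (id INT);

-- Expected to fail under the default RESTRICT mode because scratch_db is not empty:
-- DROP DATABASE scratch_db;

-- Drops t1 first, then the database.
DROP DATABASE IF EXISTS scratch_db CASCADE;
```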

## Create Table Statement

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name\
 (col_name data_type [column_constraint_specification] [COMMENT col_comment],\
  col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])\
[COMMENT table_comment]\
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]\
[CLUSTERED BY (col_name, col_name, ...)\
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]\
[SKEWED BY (col_name, col_name, ...)\
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)\
[STORED AS DIRECTORIES]]\
[ROW FORMAT row_format]\
[STORED AS file_format]\
[LOCATION hdfs_path]

In [0]:
USE dataenggdatascfreelance1247_db;

In [0]:
CREATE TABLE IF NOT EXISTS employee 
(
eid int COMMENT 'Employee ID', 
name string COMMENT 'Employee Name',
salary float COMMENT 'Employee Salary', 
designation String COMMENT 'Employee Designation')
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

**file_format:**
  - SEQUENCEFILE
  - TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  - RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  - ORC         -- (Note: Available in Hive 0.11.0 and later)
  - PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  - AVRO        -- (Note: Available in Hive 0.14.0 and later)
  - JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
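
As an illustration of choosing a non-default file format, the sketch below creates an ORC copy of the employee table defined above. The table name `employee_orc` is hypothetical, and the example assumes Hive 0.11.0 or later, where ORC is available.

```sql
-- employee_orc is a hypothetical table name; the columns mirror the employee table above.
CREATE TABLE IF NOT EXISTS employee_orc
(
eid int COMMENT 'Employee ID',
name string COMMENT 'Employee Name',
salary float COMMENT 'Employee Salary',
designation string COMMENT 'Employee Designation')
COMMENT 'Employee details stored as ORC'
STORED AS ORC;

-- Writing through a query lets Hive convert the rows to ORC;
-- LOAD DATA generally places files in the table as-is.
INSERT INTO TABLE employee_orc SELECT eid, name, salary, designation FROM employee;
```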

**column_constraint_specification:**
  - PRIMARY KEY
  - UNIQUE
  - NOT NULL
  - DEFAULT [default_value]
  - CHECK  [check_expression] 
  - ENABLE|DISABLE 
  
**constraint_specification:**
  - PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY
  - CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE 
  - CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY
  - CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY  
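
The table-level constraints listed above might be used as in the following sketch. The tables `department` and `employee_dept` and the constraint name `fk_employee_dept` are hypothetical, and the example assumes a Hive release with constraint support (2.1.0 or later). The constraints are declared DISABLE NOVALIDATE, so Hive records them as metadata without enforcing them.

```sql
-- department and employee_dept are hypothetical tables used only for illustration.
CREATE TABLE IF NOT EXISTS department
(
dept_id int,
dept_name string,
PRIMARY KEY (dept_id) DISABLE NOVALIDATE
);

-- The foreign key refers to the primary key declared on department.
CREATE TABLE IF NOT EXISTS employee_dept
(
eid int,
dept_id int,
CONSTRAINT fk_employee_dept FOREIGN KEY (dept_id) REFERENCES department(dept_id) DISABLE NOVALIDATE
);
```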

## Load Data Statement

In Hive, we can insert data using the LOAD DATA statement.

LOAD DATA is the preferred way to store records in bulk. Data can be loaded from two sources: **from the local file system** or **from the Hadoop file system**.

The syntax for load data is as follows:\
`LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename 
[PARTITION (partcol1=val1, partcol2=val2 ...)]`

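- LOCAL is optional. When present, filepath refers to the local file system and the file is copied into the table; when omitted, filepath refers to the Hadoop file system and the file is moved.
- OVERWRITE is optional. When present, the existing data in the table (or partition) is replaced; otherwise the new files are added to it.
- PARTITION is optional and names the target partition of a partitioned table.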

In [0]:
LOAD DATA LOCAL INPATH '/home/dataenggdatascfreelance1247/hive_data/sample.txt' INTO TABLE employee;
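
For comparison, a load from the Hadoop file system omits LOCAL. The sketch below uses a hypothetical HDFS path and adds OVERWRITE to replace the rows loaded above.

```sql
-- The HDFS path below is hypothetical; substitute a file that exists in your cluster.
-- Without LOCAL the file is moved (not copied) from its HDFS location into the table's directory.
LOAD DATA INPATH '/user/hive/staging/sample.txt' OVERWRITE INTO TABLE employee;
```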

## Alter Table

ALTER TABLE name RENAME TO new_name\
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])\
ALTER TABLE name CHANGE [COLUMN] column_name new_name new_type\
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Hive has no `ALTER TABLE ... DROP COLUMN` statement; to remove a column, list only the columns to keep in REPLACE COLUMNS.

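Rename table employee to emp, then rename it back so that the examples below can keep using the name employee:

In [0]:
ALTER TABLE employee RENAME TO emp;

ALTER TABLE emp RENAME TO employee;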

Rename column name (string) to ename (string) in the employee table:

In [0]:
DESCRIBE employee;

ALTER TABLE employee CHANGE name ename String;

DESCRIBE employee;

Change the type of column salary from float to double in the employee table:

In [0]:
DESCRIBE employee;

ALTER TABLE employee CHANGE salary salary Double;

DESCRIBE employee;

Add column named dept to the employee table

In [0]:
DESCRIBE employee;

ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');

DESCRIBE employee;

Remove all existing columns from the employee table and replace them with the empid and empname columns:

In [0]:
DESCRIBE employee;

ALTER TABLE employee REPLACE COLUMNS (empid Int, empname String);

DESCRIBE employee;

## Drop Table

The syntax is as follows:

`DROP TABLE [IF EXISTS] table_name;`

In [0]:
DROP TABLE IF EXISTS employee;

## Hive - Partitioning

### What are the Hive Partitions
Partitioning divides a table into related parts based on the values of particular columns such as date, city, or department. Each table in Hive can have one or more partition keys that identify a particular partition. Partitions make it easy to query slices of the data.

### Why is Partitioning Important
- HDFS stores huge amounts of data, often in the range of petabytes, which makes it difficult for Hadoop users to query.
- Hive was introduced to lower this burden. Apache Hive converts SQL queries into MapReduce jobs and submits them to the Hadoop cluster. Without partitions, a query reads the entire data set, so running MapReduce jobs over a large table is inefficient.
- Partitions resolve this. Hive makes them easy to implement: the partition columns are declared when the table is created, and Hive manages the partition layout automatically.
- With partitioning, the table data is divided into multiple partitions, each corresponding to specific value(s) of the partition column(s). Each partition is kept as a sub-directory inside the table’s directory in HDFS. A query that filters on a partition column therefore reads only the matching partitions, which reduces I/O time and improves performance.

### Partitioning Example
A table named Tab1 contains employee data such as id, name, dept, and yoj (i.e. year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. Without partitions, the query searches the whole table. If you partition the employee data by year and store each year in a separate file, the query only has to search the 2012 partition, which reduces processing time.

The following example shows how a file and its data are partitioned.

The file below contains the employeedata table.

/tab1/employeedata/file1

id, name, dept, yoj\
1, gopal, TP, 2012\
2, kiran, HR, 2012\
3, kaleel, SC, 2013\
4, Prasanth, SC, 2013

The above data is partitioned into two files by year.

/tab1/employeedata/2012/file2

1, gopal, TP, 2012\
2, kiran, HR, 2012


/tab1/employeedata/2013/file3

3, kaleel, SC, 2013\
4, Prasanth, SC, 2013



### How to Create Partitions in Hive
To create a partitioned table in Hive, use the following command:

`CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2 data_type, ...);`

In [0]:
USE dataenggdatascfreelance1247_db;

CREATE TABLE partitioned_table (id INT, name STRING, dept STRING, yoj INT) 
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH 'hive_data/file_partition1.txt' OVERWRITE INTO TABLE partitioned_table PARTITION (year='2012');
LOAD DATA LOCAL INPATH 'hive_data/file_partition2.txt' OVERWRITE INTO TABLE partitioned_table PARTITION (year='2013');
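
Once both files are loaded, the partitions can be inspected and queried. This is a minimal sketch that assumes the two LOAD DATA statements above succeeded.

```sql
-- List the partitions created by the two LOAD DATA statements above.
SHOW PARTITIONS partitioned_table;

-- Filtering on the partition column lets Hive read only the year=2012 directory.
SELECT id, name, dept FROM partitioned_table WHERE year = '2012';
```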

### Types of Hive Partitioning

There are two types of partitioning in Apache Hive:

- Static Partitioning
- Dynamic Partitioning

**Hive Static Partitioning**
- Static partitioning means inserting input data files individually into a partition of the table, naming the partition explicitly.
- Static partitions are usually preferred when loading big files into Hive tables.
- Static partitioning saves time in loading data compared to dynamic partitioning.
- You “statically” add a partition to the table and move the file into that partition.
- A static partition can be altered.
- You can get the partition column value from the filename, day of date, etc. without reading the whole big file.
- ***Static partitioning works in strict mode (`hive.exec.dynamic.partition.mode = strict`), which is the default.***
- In strict mode, at least one partition column of an insert must be given a static value.
- When `hive.mapred.mode = strict` is set, a query on a partitioned table must filter on a partition column in its WHERE clause.
- Static partitioning can be performed on Hive managed tables and external tables.
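
The points above can be sketched against the partitioned_table created earlier. The year value 2014 and the inserted row are made up for this illustration, and INSERT ... VALUES assumes Hive 0.14 or later.

```sql
-- Statically add a partition; its value is named in the statement itself.
ALTER TABLE partitioned_table ADD IF NOT EXISTS PARTITION (year='2014');

-- A static-partition insert: every row goes to the partition written here.
-- The row values are hypothetical sample data.
INSERT INTO TABLE partitioned_table PARTITION (year='2014')
VALUES (5, 'asha', 'HR', 2014);

-- Altering a static partition: remove the partition and its data again.
ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (year='2014');
```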

**Hive Dynamic Partitioning**
- A dynamic-partition insert loads many partitions with a single INSERT; Hive derives each row's partition from the partition column values in the data.
- Usually, a dynamic-partition insert loads the data from a non-partitioned table.
- Dynamic partitioning takes more time in loading data compared to static partitioning.
- Dynamic partitioning is suitable when a large amount of data is already stored in a table.
- It is also suitable when you do not know in advance which partition values the data contains.
- Dynamic partitioning does not require a WHERE clause to limit the data.
- The partitions are created by the insert itself rather than by ALTER TABLE.
- Dynamic partitioning can be performed on Hive external tables and managed tables.
- ***To make every partition column dynamic, Hive must run in non-strict mode (`hive.exec.dynamic.partition.mode = nonstrict`).***
- Here are Hive dynamic partition properties you should allow