# Setting Up a Basic Hadoop Environment for ETL

In this notebook, we'll set up a basic Hadoop environment for ETL operations using HDFS, YARN, MapReduce, and Apache Hive. We'll execute commands directly from this notebook using the `%%bash` cell magic.

## Section 1: Prerequisites

### 1.1 Operating System

This setup can be performed on Windows, macOS, or Unix-based operating systems. Since Hadoop is designed to run on Unix-like systems, Windows users will need to set up a Unix-like environment using tools like Windows Subsystem for Linux (WSL) or Cygwin.

**For Windows Users**:

**Option 1: Windows Subsystem for Linux (WSL)**

WSL allows you to run a Linux distribution directly on Windows. You can install Ubuntu from the Microsoft Store.

-   Enable WSL:

Open PowerShell as Administrator and run:

`Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux`

Restart your computer when prompted.

- Install Ubuntu:

Download and install Ubuntu from the [Microsoft Store](https://apps.microsoft.com/detail/9nblggh4msv6?rtc=1&hl=en-us&gl=US).

- Launch Ubuntu and Update Packages:



In [None]:
%%bash
sudo apt-get update
sudo apt-get upgrade -y

**Option 2: Cygwin**

Cygwin provides a Unix-like environment on Windows.

- Download and install Cygwin from www.cygwin.com.
- During installation, select the packages required (e.g., OpenSSH, OpenJDK).
- Proceed with the instructions in the Cygwin terminal.

**Note**: Using WSL is recommended for simplicity and compatibility.

**For macOS Users:**

Most Unix commands are compatible with macOS. You may need to install some packages using Homebrew.

- Install Homebrew (if not already installed)

In [None]:
%%bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

- Update Homebrew

In [None]:
%%bash
brew update

### 1.2 Install Java Development Kit

Hadoop requires Java to run. We'll install JDK 8.

**Check if Java is Already Installed**

Let's verify if Java is already installed on your system.

In [None]:
%%bash
java -version

**Install JDK 8**

If Java is not installed or you need to install JDK 8, follow the instructions below:

**For Windows Users (Using WSL or Cygwin):**

- Install OpenJDK 8:

In [None]:
%%bash
sudo apt-get install -y openjdk-8-jdk

- Set Java Environment Variables:

In [None]:
%%bash
echo "export JAVA_HOME=$(readlink -f /usr/bin/java | sed 's:/bin/java::')" >> ~/.bashrc
source ~/.bashrc

**For macOS Users:**

- Install OpenJDK 8 Using Homebrew:

In [None]:
%%bash
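# Note: AdoptOpenJDK has been superseded by Eclipse Temurin. If the cask below
# is no longer available, look for a Temurin 8 cask (brew search temurin).
brew tap homebrew/cask-versions
brew install --cask adoptopenjdk8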

- Set Java Environment Variables:

In [None]:
%%bash
echo "export JAVA_HOME=$(/usr/libexec/java_home -v1.8)" >> ~/.bash_profile
source ~/.bash_profile

**Verify Java Installation**

After installation, confirm that Java is correctly installed:

In [None]:
%%bash
java -version

You should see output similar to:

`openjdk version "1.8.0_xxx"`

`OpenJDK Runtime Environment (build 1.8.0_xxx)`

`OpenJDK 64-Bit Server VM (build xx.x-bxx, mixed mode)`
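When several JDKs are installed, it can help to confirm programmatically that the active one is Java 8. The sketch below shows one way to do that from a Python cell. `java_major_version` is a hypothetical helper (not part of Hadoop or the JDK), and the sample string is illustrative. Note that `java -version` writes to standard error, so redirect it with `2>&1` when capturing real output.

```python
import re

def java_major_version(version_output):
    """Return the Java major version parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_292).
    if first == "1" and second is not None:
        return int(second)
    return int(first)

# Illustrative sample text, not captured from a real machine.
sample = 'openjdk version "1.8.0_292"\nOpenJDK Runtime Environment (build 1.8.0_292-b10)'
print(java_major_version(sample))  # prints 8
```

A result other than 8 suggests that `JAVA_HOME` or `PATH` points at a different JDK than the one installed above.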


### 1.3 Configure SSH

Hadoop's start and stop scripts use SSH to launch the daemons on each node, even in a single-node setup. We'll set up passwordless SSH access to `localhost`.

First, ensure that SSH is installed:

**For Windows Users (Using WSL or Cygwin):**

In [None]:
%%bash
sudo apt-get install -y openssh-server openssh-client

**For macOS Users:**

SSH is typically pre-installed on macOS. Verify with:

In [None]:
%%bash
ssh -V

**Generate SSH Keys**

Generate a new SSH key pair without a passphrase:

In [None]:
%%bash
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

**Configure Authorized Keys**

Add your public key to the list of authorized keys:

In [None]:
%%bash
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

**Start SSH Service (If Necessary)**

For Windows Users (Using WSL or Cygwin):

In [None]:
%%bash
sudo service ssh start

**Test SSH Connection to localhost**

Verify that you can SSH to localhost without a password:

In [None]:
%%bash
ssh -o StrictHostKeyChecking=no localhost echo "SSH connection established."

If successful, you should see:

`SSH connection established.`

## Section 2: Install Hadoop

In this section, we'll download and install Hadoop on your system. We'll also set up the necessary environment variables. The instructions are tailored for Windows (using WSL or Cygwin), macOS, and Unix-based systems.

### 2.1 Download Hadoop

We'll download the latest stable release of Hadoop from the Apache website.

**Determine the Latest Stable Version**

Please verify the latest version from the Apache Hadoop Releases page.

In [None]:
%%bash
# Replace with the latest version number
HADOOP_VERSION=3.3.1

**Download Hadoop**

For All Users:

In [None]:
%%bash
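# Each %%bash cell runs in its own shell, so set the version again here
HADOOP_VERSION=3.3.1

# Download Hadoop. Releases that are no longer current are moved to
# https://archive.apache.org/dist/hadoop/common/
wget "https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" -P ~/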

**Note**: If wget is not installed, you can install it using:

- For Ubuntu/WSL/Cygwin:

In [None]:
%%bash
sudo apt-get install -y wget

- For macOS:

In [None]:
%%bash
brew install wget

### 2.2 Extract Hadoop

Unpack the downloaded tarball and move it to a convenient location.

In [None]:
%%bash
# Each %%bash cell runs in its own shell, so set the version again here
HADOOP_VERSION=3.3.1

# Extract Hadoop
tar -xzvf ~/hadoop-$HADOOP_VERSION.tar.gz -C ~/

# Move Hadoop to /usr/local/hadoop (may require sudo)
sudo mv ~/hadoop-$HADOOP_VERSION /usr/local/hadoop

**Note for Windows Users**: If you encounter permission issues, you might need to adjust the permissions or run the commands with appropriate privileges.

### 2.3 Set Environment Variables

We'll set up the environment variables required for Hadoop to run properly.

**For All Users:**

Edit your shell profile file to include Hadoop and Java environment variables.

**Determine Your Shell Profile File**

- For Bash shell: ~/.bashrc or ~/.bash_profile
- For Zsh shell (common on macOS): ~/.zshrc

For this guide, we'll assume you're using ~/.bashrc.

**Edit the Shell Profile**

Append the following lines to your shell profile file.

In [None]:
%%bash
# Define shell profile file
SHELL_PROFILE=~/.bashrc

# Append Hadoop environment variables
echo "
# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
" >> "$SHELL_PROFILE"

**Set JAVA_HOME Environment Variable**

We need to set JAVA_HOME so that Hadoop knows where Java is installed.

**For Ubuntu/WSL/Cygwin Users:**

In [None]:
%%bash
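# Append JAVA_HOME to shell profile
# (each %%bash cell runs in its own shell, so define the profile path again)
SHELL_PROFILE=~/.bashrc
echo "
export JAVA_HOME=\$(readlink -f /usr/bin/java | sed 's:/bin/java::')
" >> "$SHELL_PROFILE"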

**For macOS Users:**

In [None]:
%%bash
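# Append JAVA_HOME to shell profile
# (each %%bash cell runs in its own shell, so define the profile path again)
SHELL_PROFILE=~/.bashrc  # Use ~/.bash_profile or ~/.zshrc if appropriate
echo "
export JAVA_HOME=\$(/usr/libexec/java_home -v1.8)
" >> "$SHELL_PROFILE"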

**Apply the Changes**

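Reload your shell profile to apply the environment variable changes. Keep in mind that each `%%bash` cell runs in a separate process, so a `source` in one cell does not carry over to later cells. For the variables to be visible throughout the notebook, start Jupyter from a shell that has already loaded the profile, or set them on the notebook kernel.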

In [None]:
%%bash
source ~/.bashrc

**Note for macOS Users**: If you edited ~/.bash_profile or ~/.zshrc, use that file in the source command.
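Each `%%bash` cell starts a fresh shell that inherits its environment from the notebook kernel, so variables exported by `source` in one cell are not visible in the next. One possible workaround, sketched here on the assumption that the kernel is Python, is to set the variables on the kernel process itself. `with_hadoop_env` is a hypothetical helper, and the default path matches the install location used above.

```python
import os

def with_hadoop_env(env, hadoop_home="/usr/local/hadoop"):
    """Return a copy of env with HADOOP_HOME set and Hadoop's bin/sbin on PATH."""
    updated = dict(env)
    updated["HADOOP_HOME"] = hadoop_home
    path_entries = [p for p in updated.get("PATH", "").split(os.pathsep) if p]
    for sub in ("bin", "sbin"):
        entry = os.path.join(hadoop_home, sub)
        if entry not in path_entries:  # avoid duplicates when the cell is re-run
            path_entries.append(entry)
    updated["PATH"] = os.pathsep.join(path_entries)
    return updated

# Apply to the kernel's environment; later %%bash cells inherit it.
os.environ.update(with_hadoop_env(os.environ))
print(os.environ["HADOOP_HOME"])
```

`JAVA_HOME` can be set the same way by assigning `os.environ["JAVA_HOME"]` to the JDK path found in Section 1.2.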

### 2.4 Verify Hadoop Installation

In [None]:
%%bash
hadoop version

You should see output similar to:

`Hadoop 3.3.1`

`Source code repository https://github.com/apache/hadoop -r ...`

`Compiled by ... on ...`

`Compiled with protoc 3.x.x`

`From source with checksum ...`

`This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar`


If you see this output, Hadoop is installed successfully.

## Section 3: Configure HDFS

### 3.1 Edit `core-site.xml`

The `core-site.xml` file contains configuration settings for Hadoop core, such as the default filesystem.

**Locate the `core-site.xml` File**

The file is located in the Hadoop configuration directory:

In [None]:
$HADOOP_HOME/etc/hadoop/core-site.xml

**Create a Backup (Optional)**

It's a good practice to create a backup before modifying configuration files.

In [None]:
%%bash
cp $HADOOP_HOME/etc/hadoop/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml.backup

**Edit `core-site.xml`**

We'll add configuration settings to specify the default filesystem.

In [None]:
%%bash
cat <<EOL > $HADOOP_HOME/etc/hadoop/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
       <name>fs.defaultFS</name>
       <value>hdfs://localhost:9000</value>
   </property>
</configuration>
EOL
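To double-check what ended up in a configuration file, the properties can be read back from Python. This is a minimal sketch using only the standard library; `read_hadoop_properties` is a hypothetical helper, and the sample string mirrors the `core-site.xml` written above.

```python
import xml.etree.ElementTree as ET

def read_hadoop_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries without a <name>
        properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties

sample = """<?xml version="1.0"?>
<configuration>
   <property>
       <name>fs.defaultFS</name>
       <value>hdfs://localhost:9000</value>
   </property>
</configuration>
"""
print(read_hadoop_properties(sample))  # {'fs.defaultFS': 'hdfs://localhost:9000'}
```

To inspect the real file, pass its contents instead of `sample`. Hadoop's own `hdfs getconf -confKey fs.defaultFS` command reports the value the cluster will actually use.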

### 3.2 Edit `hdfs-site.xml`

The `hdfs-site.xml` file contains settings specific to HDFS, such as the replication factor.

**Create a backup (optional)**

In [None]:
%%bash
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml.backup

**Edit `hdfs-site.xml`**

We'll set the replication factor to 1 since we're setting up a single-node cluster.

In [None]:
%%bash
cat <<EOL > $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
       <name>dfs.replication</name>
       <value>1</value>
   </property>
</configuration>
EOL

### 3.3 Format the NameNode

Before starting HDFS, we need to format the NameNode.

In [None]:
%%bash
hdfs namenode -format

You should see output indicating that the NameNode has been formatted successfully.

## Section 4: Configure YARN and MapReduce

In this section, we'll configure YARN (Yet Another Resource Negotiator) and MapReduce. YARN is Hadoop's cluster resource management system, and MapReduce is the programming model for data processing. We'll edit the necessary configuration files to enable these components.

### 4.1 Edit `yarn-site.xml`

The `yarn-site.xml` file contains configuration settings for YARN.

**Locate the `yarn-site.xml` File**

The file is located in the Hadoop configuration directory:

In [None]:
$HADOOP_HOME/etc/hadoop/yarn-site.xml

**Create a Backup (Optional)**

It's a good practice to create a backup before modifying configuration files.

In [None]:
%%bash
cp $HADOOP_HOME/etc/hadoop/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml.backup

**Edit `yarn-site.xml`**

We'll add configuration settings to specify the YARN NodeManager auxiliary services and enable the MapReduce shuffle service.

In [None]:
%%bash
cat <<EOL > $HADOOP_HOME/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
EOL

### 4.2 Edit `mapred-site.xml`

The `mapred-site.xml` file contains configuration settings for MapReduce.

**Create the `mapred-site.xml` File**

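Hadoop 3.x ships `mapred-site.xml` by default. On older releases that provide only a template, create the file by copying the template.

In [None]:
%%bash
# Copy the template only if mapred-site.xml does not already exist
if [ ! -f "$HADOOP_HOME/etc/hadoop/mapred-site.xml" ]; then
    cp "$HADOOP_HOME/etc/hadoop/mapred-site.xml.template" "$HADOOP_HOME/etc/hadoop/mapred-site.xml"
fi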

**Create a Backup (Optional)**

In [None]:
%%bash
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml.backup

**Edit `mapred-site.xml`**

We'll configure MapReduce to run on YARN.

In [None]:
%%bash
cat <<EOL > $HADOOP_HOME/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
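<configuration>
    <!-- Site specific MapReduce configuration properties -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- On Hadoop 3.x, YARN containers may not find the MapReduce JARs unless the classpath is set explicitly -->
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>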
EOL

### 4.3 Verify Configuration Files

You can verify the contents of the configuration files to ensure they have been set correctly.

**View `yarn-site.xml`**

In [None]:
%%bash
cat $HADOOP_HOME/etc/hadoop/yarn-site.xml

**View `mapred-site.xml`**

In [None]:
%%bash
cat $HADOOP_HOME/etc/hadoop/mapred-site.xml

**Explanation:**

- `yarn-site.xml`: We specified the yarn.nodemanager.aux-services property and set its value to mapreduce_shuffle. This enables the shuffle service, which is necessary for MapReduce jobs to run properly on YARN.
- `mapred-site.xml`: We set the mapreduce.framework.name property to yarn, indicating that MapReduce should use YARN as its resource manager.

By configuring these files, we've set up Hadoop to use YARN for resource management and prepared the system to run MapReduce jobs.
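All of the `*-site.xml` files share the same `<configuration>`/`<property>` layout, so they can also be generated from a dictionary instead of hand-written heredocs. The following is a sketch of that idea using the standard library; `render_hadoop_config` is a hypothetical helper, not a Hadoop tool.

```python
import xml.etree.ElementTree as ET

def render_hadoop_config(properties):
    """Render a {name: value} dict as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root, space="    ")
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode") + "\n"

print(render_hadoop_config({"mapreduce.framework.name": "yarn"}))
```

Writing the returned string to `$HADOOP_HOME/etc/hadoop/mapred-site.xml` would be equivalent to the heredoc approach used in this section, with the advantage that special characters in values are escaped automatically.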

## Section 5: Start Hadoop Services

In this section, we'll start the Hadoop services, including HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator). We'll also verify that the services are running correctly by accessing their web interfaces.

### 5.1 Start HDFS (NameNode and DataNode)

We need to start the HDFS daemons: the NameNode and DataNode.

In [None]:
%%bash
start-dfs.sh

**Note**: If you encounter a "command not found" error, ensure that $HADOOP_HOME/bin and $HADOOP_HOME/sbin are in your PATH. Alternatively, use the full path to the script:

In [None]:
%%bash
$HADOOP_HOME/sbin/start-dfs.sh

**Expected Output**

You should see output indicating that the NameNode and DataNode are starting:

In [None]:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-yourusername-namenode-yourhostname.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-yourusername-datanode-yourhostname.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-yourusername-secondarynamenode-yourhostname.out

### 5.2 Start YARN (ResourceManager and NodeManager)

Next, we'll start the YARN daemons: the ResourceManager and NodeManager.

In [None]:
%%bash
start-yarn.sh

Alternatively, use the full path:

In [None]:
%%bash
$HADOOP_HOME/sbin/start-yarn.sh

**Expected Output**

You should see output indicating that the ResourceManager and NodeManager are starting:

In [None]:
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-yourusername-resourcemanager-yourhostname.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-yourusername-nodemanager-yourhostname.out

### 5.3 Verify Running Services

We can verify that the Hadoop services are running by checking the web interfaces provided by HDFS and YARN.

**5.3.1 HDFS NameNode Web Interface**

- URL: http://localhost:9870

Open a web browser and navigate to http://localhost:9870. You should see the HDFS NameNode status page.

**5.3.2 YARN ResourceManager Web Interface**

- URL: http://localhost:8088

Open a web browser and navigate to http://localhost:8088. You should see the YARN ResourceManager status page.

**Note**: If you cannot access these pages, ensure that your firewall settings allow connections to these ports.
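If no browser is at hand, a quick TCP probe can tell whether anything is listening on the two web UI ports. This is a minimal sketch with the standard library; `port_is_open` is a hypothetical helper, and a successful connection only shows that the port accepts connections, not that the page itself is healthy.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports taken from the URLs above: NameNode UI and ResourceManager UI.
for name, port in (("NameNode UI", 9870), ("ResourceManager UI", 8088)):
    status = "listening" if port_is_open("localhost", port) else "not reachable"
    print(f"{name} on port {port}: {status}")
```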

### 5.4 Verify Hadoop Processes

You can check if the Hadoop daemons are running by listing Java processes.

In [None]:
%%bash
jps

**Expected Output**

You should see output similar to the following. In practice, `jps` prefixes each name with its process ID, and the order may differ:

`ResourceManager`

`NameNode`

`DataNode`

`NodeManager`

`SecondaryNameNode`

`Jps`
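Scanning the `jps` listing by eye works for a single node, but the check is easy to automate. The sketch below reports which of the five expected daemons are absent from a `jps` listing; `missing_daemons` is a hypothetical helper, and the process IDs in the sample are made up for illustration.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}

def missing_daemons(jps_output):
    """Return the expected Hadoop daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:  # jps prints "<pid> <name>"
            running.add(fields[1])
    return sorted(EXPECTED_DAEMONS - running)

# Illustrative listing in which YARN has not been started yet.
sample = """4242 NameNode
4377 DataNode
4581 SecondaryNameNode
5012 Jps
"""
print(missing_daemons(sample))  # ['NodeManager', 'ResourceManager']
```

Comparing whole names rather than substrings matters here, because `NameNode` is also a substring of `SecondaryNameNode`.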


### 5.5 Troubleshooting

**Common Issues**

- Environment Variables Not Set: Ensure that `JAVA_HOME` and `HADOOP_HOME` are set correctly and that `$HADOOP_HOME/bin` and `$HADOOP_HOME/sbin` are in your `PATH`.
- SSH Issues: If you encounter SSH connection issues, ensure that passwordless SSH is set up correctly (refer back to Section 1.3).
- Permission Denied Errors: Make sure you have the necessary permissions to start Hadoop services. Avoid running Hadoop as the root user; instead, use a regular user account.

**Log Files**

Check the Hadoop log files for detailed error messages:

- HDFS Logs: Located in `$HADOOP_HOME/logs/`, files like `hadoop-yourusername-namenode-yourhostname.out`.
- YARN Logs: Located in `$HADOOP_HOME/logs/`, files like `yarn-yourusername-resourcemanager-yourhostname.out`.

**Explanation**:

- Starting HDFS Services: The `start-dfs.sh` script starts the HDFS daemons, which include the NameNode, DataNode, and SecondaryNameNode.
- Starting YARN Services: The `start-yarn.sh` script starts the YARN daemons, including the ResourceManager and NodeManager.
- Web Interfaces: Hadoop provides web interfaces to monitor the cluster's status. The NameNode UI shows the status of HDFS, and the ResourceManager UI shows the status of YARN and running applications.
- Process Verification: Using the `jps` command, we can list all Java processes to confirm that the necessary Hadoop daemons are running.

## Section 6: Run a Sample MapReduce Job

In this section, we'll run a sample MapReduce job to ensure that our Hadoop setup is functioning correctly. We'll use the built-in WordCount example that comes with Hadoop. This example will count the frequency of words in a set of input files.
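Before running the job on the cluster, it may help to see what WordCount computes. The sketch below imitates the map, shuffle, and reduce steps in plain Python on two made-up lines. It is an illustration of the programming model only, not Hadoop's implementation, and `map_words` and `reduce_counts` are hypothetical names.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]

def reduce_counts(pairs):
    """Shuffle and reduce steps: group the pairs by word and sum the ones."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [pair for line in lines for pair in map_words(line)]
print(sorted(reduce_counts(pairs).items()))
# [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

On a real cluster the map calls run in parallel across input splits, and the framework performs the grouping between the two steps.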

### 6.1 Prepare the HDFS Directory Structure

We need to create directories in HDFS to store our input files and output results.

#### 6.1.1 Create a User Directory in HDFS

The command below uses `whoami` to insert your username automatically, so no manual substitution is needed.

In [None]:
%%bash
hdfs dfs -mkdir -p /user/$(whoami)

**Verify the Directory Creation**

In [None]:
%%bash
hdfs dfs -ls /user

You should see your username listed in the output.

### 6.2 Copy Sample Files to HDFS

We'll use the Hadoop configuration files as sample input data for the WordCount example.

In [None]:
%%bash
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/$(whoami)/

**Verify Files in HDFS**

In [None]:
%%bash
hdfs dfs -ls /user/$(whoami)/

You should see a list of .xml files that were copied to your HDFS directory.

### 6.3 Run the WordCount MapReduce Job

We'll execute the WordCount example using the sample files we just uploaded.

#### 6.3.1 Identify the Hadoop MapReduce Examples JAR

First, find the exact version of the Hadoop MapReduce examples JAR file.

In [None]:
%%bash
ls $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar

This command will output the full path to the examples JAR, such as:

In [None]:
/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar

#### 6.3.2 Execute the WordCount Job

Run the WordCount job using the identified JAR file.

In [None]:
%%bash
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/$(whoami) /user/$(whoami)/output

**Explanation**:

- Input Path: `/user/$(whoami)` (the directory containing your input files)
- Output Path: `/user/$(whoami)/output` (the directory where the results will be stored)

**Expected Output**

The terminal will display logs showing the progress of the MapReduce job. Look for lines indicating that the job has completed successfully.

### 6.4 View the Output Results

After the job completes, we can check the output stored in HDFS.

#### 6.4.1 List the Output Directory

In [None]:
%%bash
hdfs dfs -ls /user/$(whoami)/output

You should see files like part-r-00000, which contain the word count results.

#### 6.4.2 Display the Results

In [None]:
%%bash
hdfs dfs -cat /user/$(whoami)/output/part-r-00000

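This command prints each word and its count, separated by a tab. The words below are illustrative; because WordCount splits on whitespace only, tokens from the XML input will include markup such as `<property>`: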

In [None]:
Configuration    2
Filesystem       3
Hadoop           5
MapReduce        1
...

### 6.5 Verify Job Status on YARN ResourceManager UI

You can also verify that the job ran successfully by checking the YARN web interface.

- URL: http://localhost:8088

Navigate to this URL in your web browser. Under the "**Finished Applications**" section, you should see your WordCount job listed.

### 6.6 Clean Up the Output Directory (Optional)

If you want to run the job again, you need to remove the existing output directory; otherwise, Hadoop will throw an error because the output directory already exists.

In [None]:
%%bash
hdfs dfs -rm -r /user/$(whoami)/output

**Explanation**:

- **Creating Directories**: We created necessary directories in HDFS to organize our data.

- **Uploading Data**: We uploaded sample files to HDFS to serve as input for our MapReduce job.

- **Running the Job**: We executed the WordCount example, which processed the input files and produced word counts.

- **Viewing Results**: We retrieved the output from HDFS and displayed it.

- **Web Interface**: The YARN ResourceManager UI provides a graphical interface to monitor and manage jobs.

## Section 7: Install Apache Hive

In this section, we'll install Apache Hive, a data warehouse system that facilitates reading, writing, and managing large datasets in distributed storage using SQL. Hive lets you write SQL-like queries to process and analyze data stored in HDFS.

### 7.1 Download Apache Hive

We'll download the latest stable version of Apache Hive from the Apache website.

**Determine the Latest Stable Version**

Please check the Apache Hive Releases page to confirm the latest version.

In [None]:
%%bash
# Replace with the latest Hive version number if necessary
HIVE_VERSION=3.1.2

**Download Apache Hive**

In [None]:
%%bash
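# Each %%bash cell runs in its own shell, so set the version again here
HIVE_VERSION=3.1.2

# Download Apache Hive. Releases that are no longer current are moved to
# https://archive.apache.org/dist/hive/
wget "https://downloads.apache.org/hive/hive-$HIVE_VERSION/apache-hive-$HIVE_VERSION-bin.tar.gz" -P ~/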

**Note**: If wget is not installed, you can install it using:

- For Ubuntu/WSL/Cygwin:

In [None]:
%%bash
sudo apt-get install -y wget

- For macOS:

In [None]:
%%bash
brew install wget

### 7.2 Extract Apache Hive

Unpack the downloaded tarball and move it to a suitable location.

In [None]:
%%bash
# Each %%bash cell runs in its own shell, so set the version again here
HIVE_VERSION=3.1.2

# Extract Hive
tar -xzvf ~/apache-hive-$HIVE_VERSION-bin.tar.gz -C ~/

# Move Hive to /usr/local/hive (may require sudo)
sudo mv ~/apache-hive-$HIVE_VERSION-bin /usr/local/hive

**Note for Windows Users**: If you encounter permission issues, you might need to adjust permissions or run the commands with administrative privileges.

### 7.3 Set Environment Variables

We need to set up environment variables so that the system recognizes Hive commands.

**Edit Your Shell Profile**

Append the following lines to your shell profile file.

In [None]:
%%bash
# Define shell profile file
SHELL_PROFILE=~/.bashrc  # Use ~/.bash_profile or ~/.zshrc if appropriate

# Append Hive environment variables
echo "
# Hive Environment Variables
export HIVE_HOME=/usr/local/hive
export PATH=\$PATH:\$HIVE_HOME/bin
" >> "$SHELL_PROFILE"

**Apply the Changes**

Reload your shell profile to apply the environment variable changes.

In [None]:
%%bash
source ~/.bashrc

**Note for macOS Users**: If you edited ~/.bash_profile or ~/.zshrc, replace ~/.bashrc with the appropriate file in the source command.

### 7.4 Configure Hive

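Hive requires a metastore to store metadata about tables and partitions. We'll use the embedded Derby database for simplicity. With Hive 2.0 and later, the metastore schema must be initialized once with `schematool -dbType derby -initSchema` before the first Hive session, after `hive-site.xml` is in place.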

#### 7.4.1 Create Hive Configuration Directory

In [None]:
%%bash
mkdir -p $HIVE_HOME/conf

#### 7.4.2 Copy Template Configuration File

In [None]:
%%bash
cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml

#### 7.4.3 Edit `hive-site.xml`

We'll configure Hive to use the embedded Derby database for the metastore.

In [None]:
%%bash
cat <<EOL > $HIVE_HOME/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<configuration>
    <!-- Hive Metastore Database Settings -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
        <description>Driver class name for the JDBC metastore</description>
    </property>
    <!-- Hive Metastore Warehouse Directory -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/$(whoami)/hive/warehouse</value>
        <description>Location of default database for the warehouse</description>
    </property>
</configuration>
EOL

#### 7.4.4 Create Warehouse Directory in HDFS

In [None]:
%%bash
hdfs dfs -mkdir -p /user/$(whoami)/hive/warehouse
hdfs dfs -chmod g+w /user/$(whoami)/hive/warehouse

### 7.5 Verify Apache Hive Installation

Let's confirm that Hive is installed and configured correctly.

In [None]:
%%bash
hive --version

You should see output similar to:

In [None]:
Hive 3.1.2
Subversion git://... (r...)
Compiled by ... on ...

## Section 8: Run an ETL Process Using Apache Hive

In this section, we'll perform an ETL (Extract, Transform, Load) operation using Apache Hive. We'll create a Hive table, load data into it, perform transformations using SQL queries, and store the results.

### 8.1 Prepare Sample Data

We'll use a small sample dataset of student records.

#### 8.1.1 Create a Sample Data File

Create the `student_data.txt` file in your home directory.

In [None]:
%%bash
cat <<EOL > ~/student_data.txt
1,John,Doe,85
2,Jane,Smith,92
3,Bob,Johnson,76
4,Alice,Williams,89
5,Tom,Brown,95
EOL

#### 8.1.2 Upload the Data File to HDFS

We'll copy the student_data.txt file from the local filesystem to HDFS.

In [None]:
%%bash
hdfs dfs -mkdir -p /user/$(whoami)/hive_data
hdfs dfs -put ~/student_data.txt /user/$(whoami)/hive_data/

**Verify the File in HDFS**

In [None]:
%%bash
hdfs dfs -ls /user/$(whoami)/hive_data/

You should see the student_data.txt file listed in the output.

### 8.2 Start Hive CLI

We'll use the Hive command-line interface to interact with Hive.

In [None]:
%%bash
hive

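This starts the Hive shell. Because the shell is interactive, run `hive` in a terminal rather than in a notebook cell; a single statement can also be passed non-interactively with `hive -e`. From here on, commands prefixed with `hive>` should be run inside the Hive shell.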

### 8.3 Create a Hive Table

We'll create an external table that maps to the data file we uploaded.

#### 8.3.1 Create External Table

In [None]:
hive> CREATE EXTERNAL TABLE student_data(
    id INT,
    first_name STRING,
    last_name STRING,
    score INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/${system:user.name}/hive_data';

**Explanation**:

- CREATE EXTERNAL TABLE: Creates a table without moving the data; the data remains in the specified location.
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',': Specifies that fields are separated by commas.
- STORED AS TEXTFILE: Indicates the data is stored in plain text files.
- LOCATION: Specifies the directory in HDFS where the data is located. Shell substitutions such as `$(whoami)` are not expanded inside the Hive shell, so the path uses Hive's `${system:user.name}` variable, which resolves to the user running Hive.

#### 8.3.2 Verify Table Creation

In [None]:
hive> SHOW TABLES;

You should see student_data listed.

### 8.4 Run SQL Queries to Transform Data

We'll perform transformations using SQL queries.

#### 8.4.1 Select All Data

In [None]:
hive> SELECT * FROM student_data;

This should display all records in the table.

#### 8.4.2 Filter Records with Score Greater than 80

In [None]:
hive> CREATE TABLE high_scores AS
SELECT * FROM student_data WHERE score > 80;

This query creates a new table high_scores containing students with scores greater than 80.

#### 8.4.3 Calculate Average Score

In [None]:
hive> SELECT AVG(score) AS average_score FROM student_data;

This query calculates the average score of all students.
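With only five rows, the results of the two queries above can be cross-checked locally. The sketch below applies the same filter and average in plain Python to the sample data from Section 8.1. It is a sanity check under the assumption that the file contents are unchanged, not a substitute for running the queries in Hive; `load_students` is a hypothetical helper.

```python
import csv
import io

# Same rows as ~/student_data.txt
SAMPLE = """1,John,Doe,85
2,Jane,Smith,92
3,Bob,Johnson,76
4,Alice,Williams,89
5,Tom,Brown,95
"""

def load_students(text):
    """Parse comma-delimited rows into dicts matching the Hive table schema."""
    rows = []
    for student_id, first, last, score in csv.reader(io.StringIO(text)):
        rows.append({"id": int(student_id), "first_name": first,
                     "last_name": last, "score": int(score)})
    return rows

students = load_students(SAMPLE)
high_scores = [s for s in students if s["score"] > 80]             # WHERE score > 80
average_score = sum(s["score"] for s in students) / len(students)  # AVG(score)

print([s["id"] for s in high_scores])  # [1, 2, 4, 5]
print(average_score)                   # 87.4
```

Four of the five students pass the filter; only the record with score 76 is excluded.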

### 8.5 View the Results

#### 8.5.1 Select from `high_scores` Table

In [None]:
hive> SELECT * FROM high_scores;

This should display the records of students with high scores.

### 8.6 Export the Results to HDFS

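If you wish to export the high_scores table data to an HDFS directory, you can use the following command. The path uses Hive's `${system:user.name}` variable because shell substitutions such as `$(whoami)` are not expanded inside the Hive shell:

In [None]:
hive> INSERT OVERWRITE DIRECTORY '/user/${system:user.name}/hive_output/high_scores'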
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM high_scores;

**Explanation**:

- INSERT OVERWRITE DIRECTORY: Writes the results of the query to the specified HDFS directory.
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',': Specifies the output format.

### 8.7 Exit Hive Shell

Type the following command to leave the Hive shell:

In [None]:
hive> EXIT;

### 8.8 Verify the Output in HDFS

#### 8.8.1 List the Output Directory

In [None]:
%%bash
hdfs dfs -ls /user/$(whoami)/hive_output/high_scores/

You should see files containing the results.

#### 8.8.2 Display the Output

In [None]:
%%bash
hdfs dfs -cat /user/$(whoami)/hive_output/high_scores/000000_0

**Expected Output:**

In [None]:
1,John,Doe,85
2,Jane,Smith,92
4,Alice,Williams,89
5,Tom,Brown,95

### 8.9 Clean Up (Optional)

If you wish to remove the tables and output directories:

#### 8.9.1 Drop Tables in Hive

Start the Hive shell:

In [None]:
%%bash
hive

Then run:

In [None]:
hive> DROP TABLE student_data;
hive> DROP TABLE high_scores;
hive> EXIT;

#### 8.9.2 Remove Output Directories in HDFS

In [None]:
%%bash
hdfs dfs -rm -r /user/$(whoami)/hive_output/

**Explanation**:

- Data Preparation: We used the sample dataset representing student records and uploaded it to HDFS.
- Creating Tables: We created Hive tables mapped to our data in HDFS.
- SQL Queries: We used SQL queries to filter data and perform calculations.
- Data Export: We exported query results to HDFS directories.
- Hive Shell: Hive provides a command-line interface for executing queries similar to SQL.

**Key Concepts**:

- Apache Hive: A data warehouse system for Hadoop that facilitates querying and managing large datasets using SQL.
- ETL Process: Extracting data from a source, transforming it according to business logic, and loading it into a destination.
- HiveQL: Hive's query language, which is similar to SQL and allows for complex data manipulations.

## Section 9: Cleanup and Shutdown

In this final section, we'll cover how to properly shut down the Hadoop services and clean up any temporary files or directories created during this exercise. Regular cleanup helps maintain system performance and frees up resources.

### 9.1 Stop Hadoop Services

To gracefully shut down the Hadoop services, we'll stop both HDFS and YARN daemons.

#### 9.1.1 Stop YARN Services

In [None]:
%%bash
stop-yarn.sh

Alternatively, if the command is not found:

In [None]:
%%bash
$HADOOP_HOME/sbin/stop-yarn.sh

**Expected Output**

You should see messages indicating that the ResourceManager and NodeManager are stopping:

In [None]:
stopping resourcemanager
localhost: stopping nodemanager

#### 9.1.2 Stop HDFS Services

In [None]:
%%bash
stop-dfs.sh

Alternatively:

In [None]:
%%bash
$HADOOP_HOME/sbin/stop-dfs.sh

**Expected Output**

You should see messages indicating that the NameNode and DataNode are stopping:

In [None]:
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode

### 9.2 Verify that Services Have Stopped

You can check that the Hadoop daemons are no longer running by listing Java processes.

In [None]:
%%bash
jps

**Expected Output**

You should only see the Jps process listed:

In [None]:
Jps

### 9.3 Clean Up Temp Files (Optional)

During Hadoop's operation, temporary files and directories are created. You may choose to clean these up to free disk space.

#### 9.3.1 Remove HDFS Data Directories

If you want to completely remove all data stored in HDFS (including the NameNode and DataNode data), you can delete the Hadoop data directories. By default, these are located in /tmp/hadoop-$(whoami).

**Warning**: This will delete all data in HDFS.

In [None]:
%%bash
rm -rf /tmp/hadoop-$(whoami)*

#### 9.3.2 Remove HDFS Files and Directories

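If you only want to remove the files and directories created in HDFS during this exercise, run the following while HDFS is still running (that is, before stopping the services in Section 9.1):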

In [None]:
%%bash
hdfs dfs -rm -r /user/$(whoami)/*

### 9.4 Remove Sample Data Files (Optional)

If you wish to remove the sample data files from your local filesystem:

In [None]:
%%bash
rm ~/student_data.txt

### 9.5 Reset Environment Variables (Optional)

If you want to remove the environment variables set during this setup:

#### 9.5.1 Edit Your Shell Profile

Open your shell profile file (e.g., ~/.bashrc, ~/.bash_profile, or ~/.zshrc) in a text editor and remove the lines related to HADOOP_HOME, HIVE_HOME, and their additions to the PATH.

Alternatively, you can use the following commands to remove them automatically:

In [None]:
%%bash
sed -i '/# Hadoop Environment Variables/,+2d' ~/.bashrc
sed -i '/# Hive Environment Variables/,+2d' ~/.bashrc

**Note**: The sed command may differ between systems, and the above command works for GNU sed. On macOS, you might need to use sed -i '' instead.
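Because the `sed` syntax differs between GNU and macOS, a portable alternative is to do the same edit from Python. The sketch below removes a marker comment and the lines that follow it, mirroring what `/marker/,+2d` does; `remove_marked_block` is a hypothetical helper, and the sample profile text is illustrative.

```python
def remove_marked_block(text, marker, following_lines=2):
    """Drop the line containing marker plus the next following_lines lines."""
    kept = []
    skip = 0
    for line in text.splitlines(keepends=True):
        if skip > 0:
            skip -= 1
            continue
        if marker in line:
            skip = following_lines
            continue
        kept.append(line)
    return "".join(kept)

profile = (
    "alias ll='ls -l'\n"
    "\n"
    "# Hadoop Environment Variables\n"
    "export HADOOP_HOME=/usr/local/hadoop\n"
    "export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin\n"
    "\n"
    "export EDITOR=vi\n"
)
print(remove_marked_block(profile, "# Hadoop Environment Variables"))
```

To apply it to a real profile, read the file, pass its contents through the function, and write the result back, ideally after saving a backup copy.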

### 9.6 Restart Your Shell

After making changes to your shell profile, restart your terminal or reload the shell configuration:

In [None]:
%%bash
source ~/.bashrc

### 9.7 Additional Cleanup (Optional)

If you installed any software specifically for this exercise and wish to remove it:

#### 9.7.1 Remove Hadoop and Hive Directories

In [None]:
%%bash
sudo rm -rf /usr/local/hadoop
sudo rm -rf /usr/local/hive

#### 9.7.2 Remove Downloaded Tarballs

In [None]:
%%bash
rm ~/hadoop-*.tar.gz
rm ~/apache-hive-*.tar.gz


**Explanation**:

- **Stopping Services**: It's important to properly stop Hadoop services to ensure all processes terminate gracefully.
- **Cleaning Up**: Removing temporary files and directories helps maintain system health and frees up disk space.
- **Environment Variables**: Resetting environment variables cleans up your shell environment if you no longer need Hadoop and Hive commands accessible globally.
- **Data Preservation**: Be cautious when deleting data directories to avoid accidental loss of important data.

## Conclusion

Congratulations! You've completed setting up a basic Hadoop environment and performed ETL operations using Apache Hive. You've learned how to:

- Install and configure Hadoop components: HDFS, YARN, and MapReduce.
- Start and stop Hadoop services.
- Run a sample MapReduce job.
- Install and use Apache Hive for data processing.
- Clean up your environment after use.

This exercise has provided hands-on experience with Hadoop's ecosystem, and you're now better equipped to explore more advanced features and large-scale data processing tasks.

Next Steps:

- Explore More Hive Queries: Try modifying the Hive queries to perform different transformations or analyses on your data.
- Try Apache Pig: Consider setting up Apache Pig for script-based data processing on Hadoop.
- Set Up a Multi-Node Cluster: If you're interested in scaling up, explore setting up a multi-node Hadoop cluster to handle larger workloads.