# Hands-On Exercise: Setting Up a Hadoop Cluster on Your Local Machine

**Objective**: By the end of this hands-on exercise, students will have set up a single-node Hadoop cluster on an Ubuntu machine, understand the basics of the Hadoop ecosystem (HDFS, MapReduce, YARN), and run a simple MapReduce job and an Apache Spark application.

## Step 1: Introduction to Big Data and Distributed Systems

**Description**: Big Data refers to datasets that are too large or complex for traditional, single-machine data processing software to handle efficiently. Distributed computing addresses this by dividing the data into smaller chunks and processing them in parallel on multiple machines.

Key Concepts:
- Volume: The size of the data.

- Velocity: The speed at which data is generated.

- Variety: Different types of data (structured, unstructured).

- Veracity: The reliability and accuracy of the data.

Distributed Systems:
- A cluster of machines (nodes) works together to process data faster.

- Hadoop is one such system built for distributed processing of large data.
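
The divide-and-process idea described above can be illustrated on a single machine. The following is a minimal sketch, not how Hadoop itself is implemented: worker threads stand in for cluster nodes, and the function names (`split_into_chunks`, `process_chunk`, `distributed_sum`) are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Stand-in for per-node work: here, sum the numbers in one chunk."""
    return sum(chunk)


def distributed_sum(records, chunk_size=3, workers=2):
    chunks = split_into_chunks(records, chunk_size)
    # Each worker thread plays the role of one node in the cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 11))))  # prints 55
```

In a real cluster the chunks live on different machines and the partial results travel over the network, but the split, process, and combine structure is the same.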


## Step 2: Setting Up a Hadoop Cluster on Ubuntu

**Pre-requisites**:
- Ubuntu (Virtual Machine or Native Installation)

- Java installed (`sudo apt install openjdk-8-jdk`)

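- SSH server installed (`sudo apt install openssh-server`), with passwordless SSH to `localhost` configured, because the Hadoop start scripts launch the daemons over SSH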

### Task 1: Install Hadoop
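1. Download Hadoop: Download the Hadoop 3.3.5 binary release from the official Apache Hadoop site. Older releases are moved to `https://archive.apache.org/dist/hadoop/common/`, so use that host if the link below is no longer available: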

In [None]:
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz

2. Extract Hadoop:

In [None]:
$ tar -xvf hadoop-3.3.5.tar.gz

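3. Set Environment Variables: Add the following to your `~/.bashrc` file, then run `source ~/.bashrc` to apply the changes. The `sbin` directory is included because it contains the `start-dfs.sh` and `start-yarn.sh` scripts used later:

In [None]:
export HADOOP_HOME=/path/to/hadoop-3.3.5
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin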
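
Before moving on, it can help to confirm that the Hadoop commands used in the rest of this exercise resolve on your `PATH`. The check below is a small sketch using only the Python standard library; `REQUIRED_COMMANDS` and `locate_commands` are hypothetical names for this example, and the list of commands is taken from the steps in this exercise.

```python
import shutil

# Commands the later steps rely on; the start scripts live in $HADOOP_HOME/sbin.
REQUIRED_COMMANDS = ["hadoop", "hdfs", "start-dfs.sh", "start-yarn.sh"]


def locate_commands(commands=REQUIRED_COMMANDS):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in locate_commands().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Any command reported as not found usually means the corresponding directory is missing from `PATH` or the shell has not reloaded `~/.bashrc`.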

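4. Configure Hadoop: The configuration files are in `$HADOOP_HOME/etc/hadoop/`. Set `JAVA_HOME` in `hadoop-env.sh` in that directory, then edit the following files.
- Update `core-site.xml`: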

In [None]:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

- Update `hdfs-site.xml`:

In [None]:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

- Configure `mapred-site.xml`:

In [None]:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

- Configure `yarn-site.xml`:

In [None]:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
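
Typos in these XML files are a common cause of startup failures, so it can be useful to read the properties back before starting any services. The sketch below parses a `*-site.xml` document with the standard library; `read_hadoop_properties` is a hypothetical helper for this exercise, and the embedded `CORE_SITE` string simply repeats the `core-site.xml` content shown earlier.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return a {name: value} dict from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text.strip())
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


# Same content as the core-site.xml shown in this exercise.
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(read_hadoop_properties(CORE_SITE))
    # {'fs.defaultFS': 'hdfs://localhost:9000'}
```

To check the real files, read each one from `$HADOOP_HOME/etc/hadoop/` and pass its text to the same function.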

5. Format the NameNode: Run the following command once, before the first start. Reformatting later erases the existing HDFS metadata:

In [None]:
$ hdfs namenode -format

6. Start Hadoop Services: Start the HDFS daemons (NameNode and DataNode) and the YARN daemons (ResourceManager and NodeManager):

In [None]:
$ start-dfs.sh
$ start-yarn.sh

To verify that the Hadoop services are running, run `jps` to list the Java daemons, or visit:

- HDFS Web UI: `http://localhost:9870`

- YARN Resource Manager: `http://localhost:8088`
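
If you are working on a machine without a browser (for example, over SSH), the same check can be scripted. This is a minimal sketch that only tests whether something is listening on the two ports listed above; it does not confirm that the service behind the port is healthy. `is_port_open` and `HADOOP_WEB_UIS` are hypothetical names for this example.

```python
import socket

# Default web UI ports named in this exercise.
HADOOP_WEB_UIS = {"HDFS NameNode": 9870, "YARN ResourceManager": 8088}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in HADOOP_WEB_UIS.items():
        status = "up" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {status}")
```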


## Step 3: Hadoop Ecosystem: HDFS, MapReduce, YARN
### Task 2: Introduction to HDFS (Hadoop Distributed File System)

- HDFS: Hadoop's distributed file system. It splits large files into blocks and stores them across the nodes of the cluster.
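
To get a feel for how HDFS lays out a file, the arithmetic can be worked through in a few lines. The sketch below assumes the Hadoop 3.x default block size of 128 MB (`dfs.blocksize`) and takes the replication factor as a parameter; this exercise sets `dfs.replication` to 1. `hdfs_block_layout` is a hypothetical helper, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # Hadoop 3.x default dfs.blocksize (128 MB)


def hdfs_block_layout(file_size_bytes, replication=1, block_size=BLOCK_SIZE_BYTES):
    """Return (number of blocks, total bytes stored across the cluster)."""
    num_blocks = math.ceil(file_size_bytes / block_size) if file_size_bytes else 0
    # The last block only occupies as much space as the data it holds,
    # so raw storage is simply file size times the replication factor.
    stored_bytes = file_size_bytes * replication
    return num_blocks, stored_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(hdfs_block_layout(one_gib))              # (8, 1073741824)
    print(hdfs_block_layout(300 * 1024 ** 2, 3))   # (3, 943718400)
```

With a replication factor of 1, as configured here, losing the single DataNode loses the data; multi-node clusters typically use a higher factor.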

1. Create a Directory in HDFS:

In [None]:
$ hdfs dfs -mkdir -p /user/student

2. Copy Files to HDFS: Upload a sample dataset to HDFS:

In [None]:
$ hdfs dfs -put /path/to/local/file.txt /user/student/file.txt

3. List Files in HDFS: Check the files in HDFS:

In [None]:
$ hdfs dfs -ls /user/student/

### Task 3: MapReduce Overview and a Simple MapReduce Job
- MapReduce: A programming model for processing large datasets in parallel on a cluster. A map step emits key-value pairs, and a reduce step aggregates the values for each key.
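
The model is easier to follow when the phases are written out by hand. The following is a plain-Python sketch of a word count that runs on one machine; it is an illustration of the idea, not Hadoop's implementation, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical. The grouping step in the middle is the work the framework performs between the map and reduce steps.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, like a mapper would."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


if __name__ == "__main__":
    lines = ["Hello", "World", "Hello", "Hadoop"]
    print(reduce_phase(shuffle_phase(map_phase(lines))))
    # {'hello': 2, 'world': 1, 'hadoop': 1}
```

On a cluster, many mappers and reducers run at once on different nodes, and the shuffle moves each key's values to the reducer responsible for that key.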

1. Create a text file (word count example):

In [None]:
$ echo -e "Hello\nWorld\nHello\nHadoop" > words.txt
$ hdfs dfs -put words.txt /user/student/words.txt

2. Write a Python MapReduce Job: Install the `mrjob` library (`pip3 install mrjob`), then create a Python file named `wordcount.py`:

In [None]:
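from mrjob.job import MRJob


class MRWordCount(MRJob):
    """Count how often each word occurs in the input."""

    def mapper(self, _, line):
        # The input key is unused; each value is one line of text.
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # counts iterates over every 1 emitted for this word.
        yield (word, sum(counts))


if __name__ == "__main__":
    MRWordCount.run()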


3. Run the Job: Execute the MapReduce job on the dataset in HDFS. To test the script locally first, omit `-r hadoop` and pass the local `words.txt` instead:

In [None]:
$ python3 wordcount.py -r hadoop hdfs:///user/student/words.txt

------------------------------------------------------------