# Hands-On Exercise: Setting Up a Hadoop Cluster on Your Local Machine

**Objective**: By the end of this hands-on exercise, students will have set up a single-node Hadoop cluster on an Ubuntu machine, learned the basics of the Hadoop ecosystem (HDFS, MapReduce, YARN), and run a simple MapReduce job and an Apache Spark application.

## Step 1: Introduction to Big Data and Distributed Systems

**Description**: Big Data refers to datasets that are too large or complex for traditional data processing software to handle efficiently. Distributed computing addresses this by dividing a large dataset into smaller chunks and processing them on multiple machines simultaneously.

Key Concepts:
- Volume: The size of the data.

- Velocity: The speed at which data is generated.

- Variety: Different types of data (structured, unstructured).

- Veracity: The uncertainty about how accurate and trustworthy the data is.

Distributed Systems:
- A cluster of machines (nodes) works together to process data faster.

- Hadoop is one such system, built for distributed storage and processing of large datasets.
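
The divide-and-process idea described above can be sketched in a few lines of Python. This is only an illustration on one machine: the function names (`split_into_chunks`, `process_chunk`, `distributed_sum`) are hypothetical, and worker threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Divide a list of records into consecutive chunks of at most chunk_size items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Work done independently on one chunk; here it is just a sum."""
    return sum(chunk)


def distributed_sum(records, chunk_size=4, workers=3):
    """Sum the records by processing chunks in parallel and combining the results."""
    chunks = split_into_chunks(records, chunk_size)
    # Each worker thread stands in for one node of a cluster (an assumption
    # made for illustration; a real cluster uses separate machines).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    data = list(range(1, 11))
    print(distributed_sum(data))  # prints 55
```

Each chunk is processed without knowledge of the others, which is what lets a real cluster spread the same work over many machines.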


## Step 2: Setting Up a Hadoop Cluster on Ubuntu

**Pre-requisites**:
- Ubuntu (Virtual Machine or Native Installation)

- Java installed (`sudo apt install openjdk-8-jdk`)

- SSH setup (`sudo apt install openssh-server`)
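
Before running the installer, it may help to confirm that the prerequisite commands are on the `PATH`. The following is a minimal sketch; the helper name `missing_tools` and the list of commands checked are assumptions, not part of the course material.

```python
import shutil


def missing_tools(commands):
    """Return the commands from the given list that cannot be found on the PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # Assumed command names: `java` from the JDK package, `ssh` and
    # `ssh-keygen` from the OpenSSH packages.
    missing = missing_tools(["java", "ssh", "ssh-keygen"])
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisite commands were found.")
```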


**Installation**:
- Run the `hadoop-install-script.sh` script.

## Step 3: Hadoop Ecosystem: HDFS, MapReduce, YARN
### Task 1: Introduction to HDFS (Hadoop Distributed File System)

- HDFS: The distributed file system for storing large datasets across multiple nodes.
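
To make "storing large datasets across multiple nodes" more concrete, here is a simplified sketch of how a file could be split into blocks and replicated. The defaults of 128 MB blocks and a replication factor of 3 match recent Hadoop releases, but the round-robin placement and the `plan_blocks` helper are assumptions for illustration; real HDFS placement is rack-aware and decided by the NameNode.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size in recent Hadoop versions
REPLICATION = 3      # HDFS default replication factor


def plan_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a list of (block_index, block_size_mb, holder_nodes) tuples."""
    # A block cannot have more replicas than there are nodes.
    replicas = min(replication, len(nodes))
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    plan = []
    for b in range(num_blocks):
        # The last block holds whatever is left over.
        size = min(block_size_mb, file_size_mb - b * block_size_mb)
        # Simplified round-robin placement (not the real HDFS policy).
        holders = [nodes[(b + r) % len(nodes)] for r in range(replicas)]
        plan.append((b, size, holders))
    return plan


if __name__ == "__main__":
    for block in plan_blocks(300, ["node1", "node2", "node3", "node4"]):
        print(block)
```

On the single-node cluster built in this exercise there is only one node to hold data, so every block has a single copy.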

1. Create a Directory in HDFS:

```bash
# -p creates missing parent directories and does not fail if they already exist
$ hdfs dfs -mkdir -p /user/datatech-labs
```

2. List Files in HDFS: Check the contents of the HDFS root directory:

```bash
$ hdfs dfs -ls /
```

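### Task 2: MapReduce Overview and a Simple MapReduce Job
- MapReduce: A programming model for processing large datasets with a parallel, distributed algorithm on a cluster. A map step turns input records into key-value pairs, and a reduce step aggregates the values that share a key.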

1. Create a text file for the word count example and upload it to HDFS:

```bash
$ echo -e "Hello\nWorld\nHello\nHadoop" > words.txt
$ hdfs dfs -put words.txt /user/datatech-labs/words.txt
```

2. Review the Python MapReduce Job: Open the Python file named `wordcount_mapRed.py`.

3. Run the Job: Execute the MapReduce job on the dataset in HDFS:

```bash
$ python3 wordcount_mapRed.py -r hadoop hdfs:///user/datatech-labs/words.txt
```
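
The contents of `wordcount_mapRed.py` are not reproduced here. As a rough sketch of what a word count job computes, the map, shuffle, and reduce phases can be simulated in plain Python on the same four lines written to `words.txt`. The function names below are hypothetical and the code runs on one machine without Hadoop. The `-r hadoop` flag suggests the real script may be built on a library such as mrjob, but that is an assumption.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over the lines and return {word: count}."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    # Sort the keys so the output order is deterministic.
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    lines = ["Hello", "World", "Hello", "Hadoop"]
    print(word_count(lines))  # prints {'Hadoop': 1, 'Hello': 2, 'World': 1}
```

In a real Hadoop job the mappers and reducers run on different nodes, and the framework performs the shuffle between them.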

------------------------------------------------------------