# Assignment

## Q1. Explain the core components of the Hadoop ecosystem and their respective roles in processing and storing big data. Provide a brief overview of HDFS, MapReduce, and YARN.
The Hadoop ecosystem is a suite of tools designed to handle large-scale data processing and storage. The core components are:

HDFS (Hadoop Distributed File System): A distributed file system that stores data across multiple machines in large blocks (default 128 MB), ensuring high availability and fault tolerance through replication.

MapReduce: A programming model for processing large datasets in parallel. A job is expressed as a Map function and a Reduce function, which the framework runs as distributed tasks across multiple nodes.

YARN (Yet Another Resource Negotiator): A cluster management framework that allocates resources and schedules jobs across the nodes. YARN provides a more flexible and efficient resource management layer compared to the older Hadoop 1.x architecture.

## Q2. Discuss the Hadoop Distributed File System (HDFS) in detail. Explain how it stores and manages data in a distributed environment. Describe the key concepts of HDFS, such as NameNode, DataNode, and blocks, and how they contribute to data reliability and fault tolerance.
HDFS is the primary storage system in Hadoop, designed to store very large datasets. It splits data into blocks (default 128 MB) and distributes them across different DataNodes in the cluster.

NameNode: The master node that manages metadata (file names, permissions, locations of blocks). It does not store actual data but coordinates access to the files.

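DataNodes: Worker nodes that store the actual data blocks. DataNodes send periodic heartbeats and block reports to the NameNode, so it knows which nodes are alive and which blocks each one holds.

Blocks: Files in HDFS are split into fixed-size blocks (128 MB by default, often configured to 256 MB); only the last block of a file may be smaller. Each block is replicated across multiple DataNodes for fault tolerance.

HDFS achieves fault tolerance through replication (3 copies of each block by default). If a DataNode fails, the NameNode directs reads to the surviving replicas and schedules new copies of the affected blocks on other nodes, so data remains available.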
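
The block and replication figures above can be turned into a quick storage estimate. The following is a minimal sketch, assuming the default 128 MB block size and a replication factor of 3; the function name `hdfs_footprint` and the 1000 MB file are hypothetical and are not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size cited above
REPLICATION = 3      # default HDFS replication factor cited above


def hdfs_footprint(file_size_mb):
    """Estimate block count and raw storage for one file (illustrative only)."""
    if file_size_mb <= 0:
        raise ValueError("file size must be positive")
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block only occupies as much space as the data it holds.
    last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB
    return {
        "blocks": num_blocks,
        "last_block_mb": last_block_mb,
        "block_replicas": num_blocks * REPLICATION,
        "raw_storage_mb": file_size_mb * REPLICATION,
    }


# Hypothetical 1000 MB file: 7 full blocks plus one partial block.
print(hdfs_footprint(1000))
```

Under these assumptions, a 1000 MB file occupies 8 blocks and roughly 3000 MB of raw cluster storage once every block has been replicated.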

## Q3. Write a step-by-step explanation of how the MapReduce framework works. Use a real-world example to illustrate the Map and Reduce phases. Discuss the advantages and limitations of MapReduce for processing large datasets.
MapReduce processes data in three steps: a Map phase, a shuffle-and-sort step, and a Reduce phase.

Map Phase:

- The input dataset is divided into input splits, and each record in a split is presented to a mapper as a key-value pair.
- Mappers process their splits in parallel and emit intermediate key-value pairs.
- Example: In a word count problem, each mapper reads lines of text and outputs a pair (word, 1) for every word.

Shuffle and Sort:

- Intermediate key-value pairs are transferred to the reducers (shuffled), sorted, and grouped by key, so that all values for a given key reach the same reducer.

Reduce Phase:

- The reducer processes each key's group of intermediate values to generate the final output.
- Example: In the word count, the reducer sums the 1s for each word to get its total number of occurrences.
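
The three steps can be sketched in plain Python on a single machine. This is only an illustration of the data flow, not Hadoop code: the function names `map_phase`, `shuffle_and_sort`, and `reduce_phase` and the sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group intermediate values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: sum the 1s emitted for one word."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could go to a different mapper.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

intermediate = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(intermediate)
word_counts = dict(reduce_phase(key, values) for key, values in grouped)

print(word_counts)
```

On a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network; here everything runs in one process so that the grouping step is easy to inspect.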

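Advantages:

- Scalability: Scales horizontally across many nodes to handle petabytes of data.
- Fault tolerance: Failed tasks are automatically re-executed on other nodes, so a job survives hardware failures.

Limitations:

- High latency: Job startup, writing intermediate results to disk, and the shuffle add overhead, which makes MapReduce a poor fit for interactive or low-latency workloads.
- Inefficient iterative processing: Algorithms that make multiple passes over the data, such as many machine learning algorithms, must run as a chain of separate jobs that each read from and write to disk.
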
## Q4. Explore the role of YARN in Hadoop. Explain how it manages cluster resources and schedules applications. Compare YARN with the earlier Hadoop 1.x architecture and highlight the benefits of YARN.
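YARN is the resource management layer in Hadoop. It separates cluster-wide resource management from per-application scheduling and monitoring, which is delegated to an ApplicationMaster that runs for each application. Cluster resources are managed by two components:

- ResourceManager (RM): The cluster-wide master that allocates resources (containers) to applications.
- NodeManager (NM): Runs on each node, launches containers, and monitors their resource usage and execution.

In Hadoop 1.x, a single JobTracker handled both resource allocation and the scheduling and monitoring of every MapReduce job. This made it a scalability bottleneck and tied the cluster to MapReduce workloads.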

Benefits of YARN:

- Better resource utilization: Allows multiple applications to run in parallel on the same cluster.
- Scalability: Can scale to thousands of nodes.
- Supports diverse workloads: Not limited to MapReduce (can run Spark, Tez, etc.).

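
The division of labour between the ResourceManager and the NodeManagers can be pictured with a toy model. This is a rough sketch under simplifying assumptions: the classes below are hypothetical stand-ins rather than the YARN API, memory is the only resource tracked, and placement is plain first-fit, whereas real YARN schedulers (such as the Capacity and Fair schedulers) apply queues and policies.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Hypothetical stand-in for one worker node and its free memory."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Hypothetical stand-in for the cluster-wide resource allocator."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, memory_mb):
        """Place one container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((app, memory_mb))
                return node.name
        return None  # no capacity left; a real scheduler would queue the request


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])

# Different frameworks request containers from the same cluster.
placements = [
    rm.allocate("mapreduce-job", 3072),
    rm.allocate("spark-job", 2048),
    rm.allocate("spark-job", 2048),
    rm.allocate("tez-job", 2048),
]
print(placements)
```

The point of the sketch is that the allocator is agnostic about which framework asks for a container, which is what lets MapReduce, Spark, and Tez share one cluster.
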
## Q5. Provide an overview of some popular components within the Hadoop ecosystem, such as HBase, Hive, Pig, and Spark. Describe the use cases and differences between these components. Choose one component and explain how it can be integrated into a Hadoop ecosystem for specific data processing tasks.
HBase: A NoSQL database that provides real-time read/write access to large datasets stored in HDFS.

Hive: A data warehouse infrastructure that allows querying and managing large datasets using SQL-like syntax (HiveQL). Best suited for batch processing.

Pig: A high-level data-flow platform for processing large datasets. Scripts are written in the Pig Latin language and compiled into jobs that run on the cluster.

Spark: A fast, in-memory data processing engine that supports batch processing, near-real-time stream processing, machine learning, and iterative algorithms.

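Hive Example: Hive can be integrated into a Hadoop ecosystem to perform SQL-based queries on structured data stored in HDFS. Tables are defined over files that already reside in HDFS, and HiveQL queries are compiled into jobs for an execution engine such as MapReduce, Tez, or Spark. This allows analysts familiar with SQL to work with large datasets using a declarative approach.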

## Q6. Explain the key differences between Apache Spark and Hadoop MapReduce. How does Spark overcome some of the limitations of MapReduce for big data processing tasks?
Key differences between Apache Spark and Hadoop MapReduce:

- In-memory processing: Spark keeps intermediate data in memory where possible, whereas MapReduce writes intermediate results to disk between stages, so Spark is typically faster.
- Ease of use: Spark provides high-level APIs for Java, Scala, Python, and R, while MapReduce jobs are typically written as more verbose Java code.
- Stream processing: Spark supports near-real-time stream processing (e.g., Spark Streaming), whereas MapReduce is batch-oriented.
- Iterative algorithms: Spark can cache a dataset in memory and reuse it across iterations, which suits machine learning; MapReduce must re-read the data from disk on every pass.

Spark overcomes MapReduce’s limitations by reducing disk I/O, which gives faster performance for iterative and near-real-time workloads.

## Q7. Write a Spark application in Scala or Python that reads a text file, counts the occurrences of each word, and returns the top 10 most frequent words. Explain the key components and steps involved in this application.

    from pyspark import SparkContext

    # Initialize Spark context (local mode, single JVM)
    sc = SparkContext("local", "WordCountApp")

    # Read text file as an RDD of lines
    text_file = sc.textFile("path/to/textfile.txt")

    # Word count logic: split lines into words, pair each with 1, sum per word
    word_counts = text_file.flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

    # Get top 10 most frequent words (negated count gives descending order)
    top_10_words = word_counts.takeOrdered(10, key=lambda x: -x[1])

    # Print the results
    for word, count in top_10_words:
        print(f"{word}: {count}")

    # Release the resources held by the Spark context
    sc.stop()


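Key components:

- flatMap: Splits each line into words and flattens the results into a single RDD of words.
- map: Maps each word to a pair (word, 1).
- reduceByKey: Sums the 1s for each word to produce (word, count) pairs.
- takeOrdered: Returns the 10 pairs with the highest counts; the key negates the count because takeOrdered sorts in ascending order.
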
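
The negated key used with takeOrdered can be tried without Spark. The snippet below is a plain-Python analogue built on the standard library's `heapq.nsmallest`; the sample counts are made up for illustration.

```python
import heapq

# Hypothetical (word, count) pairs standing in for the reduceByKey output.
word_counts = [("spark", 4), ("hadoop", 7), ("yarn", 2), ("hdfs", 5)]

# Taking the "smallest" items by negated count yields the largest counts first.
top_2 = heapq.nsmallest(2, word_counts, key=lambda pair: -pair[1])

print(top_2)
```

Selecting only the top N this way avoids sorting the whole dataset, which matters when the number of distinct words is large.
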
## Q8. Using Spark RDDs (Resilient Distributed Datasets), perform the following tasks on a dataset of your choice:
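a. Filter: rdd.filter(lambda row: row['value'] > 100)

b. Map: rdd.map(lambda row: (row['column1'], row['column2'] * 2))

c. Reduce: rdd.map(lambda row: row['value']).reduce(lambda a, b: a + b) (the numeric field is extracted first, because whole rows cannot be summed)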
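
The behaviour of these three operations can be checked on a small in-memory dataset before running them on a cluster. The following is a plain-Python analogue that uses built-ins instead of an RDD; the rows and their values are invented for the example, and the column names are taken from the snippets above.

```python
from functools import reduce

# Hypothetical dataset: each row is a dict, as the row['...'] lookups imply.
rows = [
    {"column1": "a", "column2": 10, "value": 50},
    {"column1": "b", "column2": 20, "value": 150},
    {"column1": "c", "column2": 30, "value": 250},
]

# a. Filter: keep rows whose value exceeds 100.
filtered = list(filter(lambda row: row["value"] > 100, rows))

# b. Map: turn each remaining row into a (column1, column2 * 2) pair.
mapped = list(map(lambda row: (row["column1"], row["column2"] * 2), filtered))

# c. Reduce: sum the numeric field; the values are extracted first
#    because dict rows cannot be added together.
total = reduce(lambda a, b: a + b, [row["value"] for row in filtered])

print(filtered, mapped, total)
```

With an actual RDD, filter and map are lazy transformations and reduce is the action that triggers the computation; the built-ins here evaluate immediately.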

## Q9. Create a Spark DataFrame in Python or Scala by loading a dataset (e.g., CSV or JSON) and perform the following operations:

    from pyspark.sql import SparkSession

    # Initialize Spark session
    spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

    # Load CSV dataset
    df = spark.read.csv("path/to/dataset.csv", header=True, inferSchema=True)

    # a. Select specific columns
    df.select("column1", "column2").show()

    # b. Filter rows
    df.filter(df["column1"] > 100).show()

    # c. Group by and aggregate
    df.groupBy("column3").agg({"column2": "sum"}).show()

    # d. Join two DataFrames
    # other_df is a hypothetical second dataset; both are assumed to have an "id" column
    other_df = spark.read.csv("path/to/other_dataset.csv", header=True, inferSchema=True)
    df.join(other_df, on="id", how="inner").show()



## Q11. Explain the fundamental concepts of Apache Kafka. What is it, and what problems does it aim to solve in the context of big data and real-time data processing?
Apache Kafka is a distributed event streaming platform that handles high-throughput real-time data feeds. Kafka solves the problem of ingesting, processing, and storing large streams of data by providing:

Publish/subscribe messaging model for real-time processing.
Decoupling of producers and consumers to scale applications.
**Fault tolerance