# Hadoop Overview
Apache Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment.



## Key Components of Hadoop
#### HDFS (Hadoop Distributed File System)

Distributed storage system for big data.
Data is split into blocks and stored across multiple nodes.

#### MapReduce

Processing framework for parallel computation.

Map: Processes and filters data, emitting intermediate key-value pairs.

Reduce: Aggregates the intermediate results by key.

#### YARN (Yet Another Resource Negotiator)

Resource manager that allocates cluster resources and schedules distributed tasks.

#### Hadoop Common

Libraries and utilities supporting the other modules.
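
The Map and Reduce roles described above can be sketched in plain Python. This is only an illustration of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input are hypothetical, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


# Hypothetical input; in Hadoop each line would be read from an HDFS block.
lines = ["big data big clusters", "data moves to clusters"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

In a real job the map calls run in parallel on different nodes, and the shuffle step moves data over the network so that each reducer sees every value for its keys.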


In [None]:
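# Start the HDFS daemons (NameNode and DataNodes)
start-dfs.sh
# Start the YARN daemons (ResourceManager and NodeManagers)
start-yarn.sh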


# How Hadoop Works
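Input Data → Split into fixed-size blocks (e.g., 128 MB each, the HDFS default).

Distributed Storage → Blocks are stored across multiple nodes in HDFS.

Parallel Processing → MapReduce runs map tasks on the blocks in parallel, preferably on the nodes that already hold the data.

Fault Tolerance → HDFS replicates each block (three copies by default), so data survives the loss of a node.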
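
The splitting and replication steps can be made concrete with a small calculation. The sketch below assumes a 128 MB block size and a replication factor of 3. The helper names, the 300 MB file, and the node names are hypothetical, and the round-robin placement is a toy policy chosen for readability; HDFS itself uses a rack-aware placement strategy.

```python
import math

BLOCK_SIZE_MB = 128   # example block size from the text
REPLICATION = 3       # assumed replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; the last block may be smaller."""
    if file_size_mb <= 0:
        return []
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * (n_blocks - 1)
    sizes.append(file_size_mb - block_size_mb * (n_blocks - 1))
    return sizes


def place_replicas(n_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement; needs replication <= len(nodes) for distinct nodes."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


blocks = split_into_blocks(300)  # hypothetical 300 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)
print(placement)
```

With this placement every block lives on three different nodes, so losing any single node still leaves two copies of each block available.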

In [None]:
# List Files in HDFS:

hdfs dfs -ls /

# Copy Local File to HDFS (create the target directory first):

hdfs dfs -mkdir -p /data
hdfs dfs -put file.txt /data/

# View File Content in HDFS:

hdfs dfs -cat /data/file.txt

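# Run Word Count Example (MapReduce):
# /input must already exist in HDFS and /output must not exist yet.
# Adjust the jar path to your installation.

hadoop jar hadoop-mapreduce-examples.jar wordcount /input /output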

# Hadoop Use Cases

Data Warehousing: Storing massive datasets for analytics.

Log Processing: Analyzing logs from servers and applications.

Recommendation Systems: Building product recommendations.

Sentiment Analysis: Processing social media data.

# Apache Spark

### Core Components of Spark
#### Spark Core

Provides the basic functionality: task scheduling, memory management, and fault recovery.
Exposes the RDD (Resilient Distributed Dataset) API for distributed data processing.

#### Spark SQL

Allows querying structured data using SQL.
Supports DataFrames for data manipulation.

#### Spark Streaming

Processes real-time data streams from sources like Kafka and Flume.

#### MLlib (Machine Learning Library)

Built-in tools for machine learning and statistical analysis.

#### GraphX

Library for graph processing and analytics.

In [None]:
%pip install pyspark


In [None]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName('SparkExample').getOrCreate()


In [None]:
# Load data
data = spark.read.csv('data.csv', header=True, inferSchema=True)

# View data
data.show()

# Select columns
data.select('column1', 'column2').show()


# Working with RDDs
### Create RDDs

In [None]:
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
rdd = spark.sparkContext.parallelize(data)

# View data
print(rdd.collect())


# Transformations and Actions

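# Map Transformation: add 10 to every age (lazy, nothing runs yet)
rdd_map = rdd.map(lambda x: (x[0], x[1] + 10))

# Filter Transformation: keep only people older than 30 (also lazy)
rdd_filter = rdd.filter(lambda x: x[1] > 30)

# Actions: collect() triggers the computation and returns the results to the driver
print(rdd_map.collect())
print(rdd_filter.collect())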
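
To check the expected results without a Spark session, the same map and filter logic can be mirrored in plain Python. This is an analogy rather than Spark code: generator expressions stand in for lazy transformations, and `list()` stands in for an action such as `collect()`. The variable names are chosen for this sketch only.

```python
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]

# Generator expressions are lazy, like RDD transformations: nothing is computed yet.
mapped = ((name, age + 10) for name, age in data)
filtered = ((name, age) for name, age in data if age > 30)

# list() plays the role of an action: it forces evaluation of the pipeline.
mapped_result = list(mapped)
filtered_result = list(filtered)

print(mapped_result)    # [('Alice', 44), ('Bob', 55), ('Cathy', 39)]
print(filtered_result)  # [('Alice', 34), ('Bob', 45)]
```

The Spark version differs in that the work is partitioned across executors, but the element-wise results should match these lists.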


# Spark SQL with DataFrames
### Create DataFrame

In [None]:
from pyspark.sql import Row

data = [Row(name="Alice", age=34), Row(name="Bob", age=45)]
df = spark.createDataFrame(data)

# Show Data
df.show()


In [None]:
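# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")

# Only rows with age > 35 are returned (Bob, in the data above)
result = spark.sql("SELECT * FROM people WHERE age > 35")
result.show()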

# Machine Learning with MLlib
### Example: Linear Regression

In [None]:

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Prepare Data
data = [(1, 2.0), (2, 4.0), (3, 6.0)]
df = spark.createDataFrame(data, ["feature", "label"])

# Assemble features
assembler = VectorAssembler(inputCols=["feature"], outputCol="features")
output = assembler.transform(df)

# Train Model
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(output)

# Predict
predictions = model.transform(output)
predictions.show()
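
As a sanity check on the model above, the ordinary least-squares fit for the same three points can be computed by hand in plain Python. The toy data follows label = 2 × feature exactly, so an unregularized linear regression should recover a slope close to 2 and an intercept close to 0. The variable names here are illustrative and not part of the MLlib API.

```python
# Same toy data as the MLlib example: (feature, label) pairs.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form simple linear regression: slope = cov(x, y) / var(x).
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

print(slope, intercept)
```

If the fitted Spark model's `model.coefficients` and `model.intercept` differ noticeably from these values, the feature assembly or the column names are the first things to check.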

# Use Cases for Spark
#### ETL (Extract, Transform, Load):

Process and clean large datasets efficiently.

#### Real-time Analytics:

Stream processing for IoT data or logs.

#### Machine Learning Pipelines:

Train scalable ML models with large datasets.

#### Graph Processing:

Social network analysis or recommendation engines.