
# Processing a Large CSV Dataset Using MapReduce

This guide demonstrates how to:
1. Load a CSV dataset into HDFS
2. Run a MapReduce job to extract specific fields
3. Aggregate data using a Reducer

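The example uses **Hadoop Streaming with Python**. Hadoop Streaming lets any executable that reads standard input and writes standard output act as a mapper or reducer, so no Java code is required.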

---

## Prerequisites

- Hadoop installed and configured
- HDFS running
- Python 3 installed on all nodes
- Environment variables set:
  - `HADOOP_HOME`
  - `PATH` includes `$HADOOP_HOME/bin`

Verify Hadoop:
```bash
hadoop version
```

---

## Example Dataset

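Assume a CSV file named `sales.csv` whose header row begins with `order_id`, with `region` in the second column and `quantity` in the fourth. These are the positions the mapper in Step 2 reads.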


### Goal

* **Extract**: `region` and `quantity`
* **Aggregate**: Total quantity sold per region

---

## Step 1: Load CSV into HDFS

Create a directory in HDFS:

```bash
hdfs dfs -mkdir -p /data/input
```

Upload the CSV file:

```bash
hdfs dfs -put sales.csv /data/input/
```

Verify upload:

```bash
hdfs dfs -ls /data/input
```

---

## Step 2: Create the Mapper

The mapper:

* Skips the header row
* Extracts `region` and `quantity`
* Outputs key-value pairs: `<region, quantity>`

Create `mapper.py`:

```python
#!/usr/bin/env python3
import sys

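for line in sys.stdin:
    line = line.strip()
    # Skip blank lines and the header row
    if not line or line.startswith("order_id"):
        continue

    fields = line.split(",")
    try:
        region = fields[1]           # second column
        quantity = int(fields[3])    # fourth column
    except (IndexError, ValueError):
        continue  # skip malformed rows instead of failing the task

    # Emit a tab-separated key/value pair
    print(f"{region}\t{quantity}")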
```

Make it executable:

```bash
chmod +x mapper.py
```
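
Splitting on commas works only while no field contains a quoted comma. If the data may include values such as `"US, West"`, Python's standard `csv` module can parse each line instead. The sketch below uses a hypothetical helper, `parse_row`, that is not part of the mapper above; it assumes every record fits on a single line (no newlines inside quoted fields), and the sample row is made up.

```python
import csv

def parse_row(line):
    """Parse one CSV line into a list of fields, honoring quoted commas."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]), [])

# Made-up row for illustration only.
sample = '1001,"US, West",widget,3'

# A naive split cuts the quoted region in two; csv.reader keeps it whole.
print(sample.split(","))   # 5 pieces
print(parse_row(sample))   # 4 fields
```

If this approach fits the data, `fields = parse_row(line)` could stand in for `line.split(",")` in the mapper.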

---

## Step 3: Create the Reducer

The reducer:

* Sums quantities per region
* Relies on Hadoop delivering its input sorted by key, so all lines for one region arrive together

Create `reducer.py`:

```python
#!/usr/bin/env python3
import sys

current_region = None
total_quantity = 0

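# Input arrives sorted by key, so all lines for one region are adjacent
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue  # skip blank lines
    region, quantity = line.split("\t", 1)
    quantity = int(quantity)

    if current_region == region:
        total_quantity += quantity
    else:
        # Key changed: emit the finished region before starting the next
        if current_region is not None:
            print(f"{current_region}\t{total_quantity}")
        current_region = region
        total_quantity = quantity

# Emit the last region
if current_region is not None:
    print(f"{current_region}\t{total_quantity}")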
```

Make it executable:

```bash
chmod +x reducer.py
```
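
Hadoop Streaming sorts the mapper output by key before it reaches the reducer, which is why the reducer can detect a new region simply by comparing each line with the previous one. Before submitting a job, the same map, sort, reduce flow can be imitated in a single process. The sketch below is a simplified stand-in rather than the scripts themselves: `map_line`, `reduce_sorted`, and `simulate` are hypothetical names, and the sample rows (including the third column) are made up.

```python
from itertools import groupby

def map_line(line):
    """Return (region, quantity) for a data row, or None for header, blank, or bad rows."""
    line = line.strip()
    if not line or line.startswith("order_id"):
        return None
    fields = line.split(",")
    try:
        return fields[1], int(fields[3])
    except (IndexError, ValueError):
        return None

def reduce_sorted(pairs):
    """Sum quantities for each run of equal keys; the input must be sorted by key."""
    for region, group in groupby(pairs, key=lambda kv: kv[0]):
        yield region, sum(quantity for _, quantity in group)

def simulate(lines):
    """Map every line, sort by key (standing in for the shuffle), then reduce."""
    mapped = [kv for kv in map(map_line, lines) if kv is not None]
    return dict(reduce_sorted(sorted(mapped)))

# Made-up rows; the third column is a placeholder.
rows = [
    "order_id,region,item,quantity",
    "1,US,a,3",
    "2,EU,b,2",
    "3,US,c,2",
]
print(simulate(rows))  # {'EU': 2, 'US': 5}
```

Removing the `sorted` call shows why ordering matters: `groupby` only merges adjacent equal keys, so the two `US` rows would no longer be summed together.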

---

## Step 4: Run the MapReduce Job

The job fails if the output directory already exists, so remove any previous one first:

```bash
hdfs dfs -rm -r -f /data/output
```

Run Hadoop Streaming:

```bash
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -files mapper.py,reducer.py \
  -input /data/input/sales.csv \
  -output /data/output \
  -mapper mapper.py \
  -reducer reducer.py
```

---

## Step 5: View the Results

Display the output:

```bash
hdfs dfs -cat '/data/output/part-*'
```

### Sample Output

```text
EU      2
US      5
```
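
Each output line holds a region and its total separated by a tab, the same format the reducer prints. To use the totals in a downstream script, the text can be turned into a dictionary. This is a minimal sketch, assuming the output has already been captured as a string (for example from the `hdfs dfs -cat` command above); `parse_totals` is a hypothetical helper.

```python
def parse_totals(text):
    """Turn 'region<TAB>total' lines into a {region: total} dict."""
    totals = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        region, total = line.split("\t", 1)
        totals[region] = int(total)
    return totals

print(parse_totals("EU\t2\nUS\t5\n"))  # {'EU': 2, 'US': 5}
```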

---
