# Lesson 12: MapReduce Practical Projects

Welcome to the practical part of our MapReduce module. In the previous lesson, you learned how to use our `FakeMapReduce` framework to solve specific, well-defined problems. Today, it's your turn to design and implement MapReduce solutions from scratch.

This lesson contains a series of problems that you can solve individually or in groups. For each problem, you will need to:
1.  Create a new data file for the input.
2.  Create a new Python solver file (e.g., `log_analyzer.py`).
3.  Write the `mapper` and `reducer` functions to solve the problem.
4.  Use the `FakeMapReduce` framework to run your job.

## Quick Reminder: The `FakeMapReduce` Workflow

* **`mapper(line)`**: Your function must process a single line of input and `yield` one or more `(key, value)` pairs.
* **`reducer(accumulator, next_value)`**: Your function takes two values associated with the same key and must return a single, combined value. Our framework applies this function repeatedly, feeding each result back in as the `accumulator`, until only one value per key remains.
* **Framework**: You will import and use the `DataLoader` and `Job` classes from `FakeMapReduce.py` to run your solution.
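
To make the iterative, pairwise reduction concrete, here is a minimal sketch that mimics it with Python's `functools.reduce` on a word count. The `run_job` helper is a hypothetical stand-in written for this illustration only; it is not the `FakeMapReduce` API, and your real solutions should still use `DataLoader` and `Job`.

```python
from collections import defaultdict
from functools import reduce


def mapper(line):
    """Emit (word, 1) for every word on the line."""
    for word in line.split():
        yield (word.lower(), 1)


def reducer(accumulator, next_value):
    """Combine two counts that belong to the same key."""
    return accumulator + next_value


def run_job(lines, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce pairwise."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # reduce() calls reducer(acc, value) left to right over each key's values
    return {key: reduce(reducer, values) for key, values in grouped.items()}


counts = run_job(["a b a", "b a"], mapper, reducer)
print(counts)
```

Because the reducer only ever sees two values at a time, its return value must have the same shape as its inputs. That constraint drives the design of all three projects below.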

---

## Project 1: Log File Analyzer

### Description
You are given a server log file where each line represents a log entry. Each entry has a log level (`INFO`, `WARN`, `ERROR`), a service name, and a message. Your task is to count how many `ERROR` and `WARN` messages were produced by each service.

### Data Format (`logs.txt`)
Each line follows the format: `LEVEL - [Service Name] Message`

**Example Data:**
```
INFO - [AuthService] User 'alice' logged in successfully
WARN - [DatabaseService] Connection pool is reaching its limit
ERROR - [PaymentService] Credit card transaction failed for user 'bob'
INFO - [AuthService] User 'bob' updated profile
ERROR - [DatabaseService] Query timed out
ERROR - [AuthService] Failed login attempt for user 'eve'
WARN - [DatabaseService] High latency detected on replica 2
```

### Task
Write a MapReduce job that outputs the total count of `WARN` and `ERROR` logs for each service.

**Expected Output:**
```
('AuthService', 'ERROR'): 1
('DatabaseService', 'WARN'): 2
('DatabaseService', 'ERROR'): 1
('PaymentService', 'ERROR'): 1
```

### Hints

* **Mapper**:
    * Parse each line to extract the log level and the service name.
    * If the level is `ERROR` or `WARN`, `yield` a key-value pair.
    * What would be a good key to group the data correctly? A tuple `(service, level)` would be perfect.
    * The value you emit should be `1`, just like in the word count example.

* **Reducer**:
    * Your reducer will receive two counts at a time for a given key.
    * The logic should be identical to the word count reducer: simply sum the two values.
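
If you get stuck, the hints above could be sketched roughly as follows. This is one possible approach, not the reference solution: the regular expression `LOG_PATTERN` is an assumption about how strictly lines follow the stated format, and the grouping loop at the bottom is a hypothetical stand-in for the `FakeMapReduce` job, used here only so the example runs on its own.

```python
import re
from collections import defaultdict
from functools import reduce

# Assumed pattern for "LEVEL - [Service Name] Message"
LOG_PATTERN = re.compile(r"^(\w+) - \[([^\]]+)\]")


def mapper(line):
    """Emit ((service, level), 1) for WARN and ERROR entries only."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return  # skip lines that do not follow the expected format
    level, service = match.groups()
    if level in ("WARN", "ERROR"):
        yield ((service, level), 1)


def reducer(accumulator, next_value):
    """Sum two counts for the same (service, level) key."""
    return accumulator + next_value


sample = [
    "INFO - [AuthService] User 'alice' logged in successfully",
    "WARN - [DatabaseService] Connection pool is reaching its limit",
    "ERROR - [PaymentService] Credit card transaction failed for user 'bob'",
    "INFO - [AuthService] User 'bob' updated profile",
    "ERROR - [DatabaseService] Query timed out",
    "ERROR - [AuthService] Failed login attempt for user 'eve'",
    "WARN - [DatabaseService] High latency detected on replica 2",
]

# Stand-in for the framework: group mapped values by key, then reduce pairwise
grouped = defaultdict(list)
for line in sample:
    for key, value in mapper(line):
        grouped[key].append(value)
results = {key: reduce(reducer, values) for key, values in grouped.items()}

for key, count in sorted(results.items()):
    print(f"{key}: {count}")
```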

---

## Project 2: Average Score Calculator

### Description
You have a file containing student scores for different subjects. Your task is to calculate the average score for each subject.

### Data Format (`scores.txt`)
Each line follows the format: `StudentID,Subject,Score`

**Example Data:**
```
101,Math,85
102,Physics,92
101,Physics,88
103,Math,95
102,Math,76
103,Physics,90
```

### Task
Write a MapReduce job that outputs the average score for each subject.

**Expected Output:**
```
Math: 85.33
Physics: 90.0
```

### Hints

This project is more complex because a pairwise reducer cannot average directly: combining two averages loses track of how many scores each one represents. To calculate an average, you need the **total sum** and the **total count**, so the trick is to have the reducer carry both and to divide only once, at the very end.
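
A small numeric example shows why averaging averages goes wrong once the groups differ in size. The values below are made up for illustration and do not come from `scores.txt`.

```python
# Two partial batches of scores for the same subject (illustrative values)
batch_a = [80, 90, 100]  # average 90.0
batch_b = [60]           # average 60.0

# Wrong: treats both batches as equally important
avg_of_avgs = (sum(batch_a) / len(batch_a) + sum(batch_b) / len(batch_b)) / 2

# Right: total sum divided by total count
true_avg = (sum(batch_a) + sum(batch_b)) / (len(batch_a) + len(batch_b))

print(avg_of_avgs)  # 75.0
print(true_avg)     # 82.5
```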

* **Mapper**:
    * Parse the line to get the `Subject` and `Score`.
    * The key should be the `Subject`.
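    * For the value, you need to emit both the score and a count of `1`. A tuple is perfect for this: `yield (subject, (int(score), 1))`. Convert the score to a number first, because it is read from the file as a string and adding two strings would concatenate them.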

* **Reducer (`reducer(accumulator, next_value)`)**:
    * The `accumulator` will be a tuple like `(total_score_so_far, count_so_far)`.
    * The `next_value` will also be a tuple `(score, 1)`.
    * Your reducer must combine them by adding the scores and adding the counts: `return (accumulator[0] + next_value[0], accumulator[1] + next_value[1])`.

* **Final Processing**:
    * With the example data, the final output from your MapReduce job for 'Math' will be `('Math', (256, 3))`.
    * You will need to write a small formatting function *after* the job completes to loop through the results and calculate the final average (`total_score / count`).
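
Put together, the three steps might look like the sketch below. It is one possible approach under stated assumptions: `format_averages` is a hypothetical name for the post-processing function, rounding to two decimals is chosen to match the expected output, and the grouping loop stands in for the `FakeMapReduce` job so that the example is self-contained.

```python
from collections import defaultdict
from functools import reduce


def mapper(line):
    """Emit (subject, (score, 1)) for one 'StudentID,Subject,Score' line."""
    _student_id, subject, score = line.strip().split(",")
    yield (subject, (int(score), 1))


def reducer(accumulator, next_value):
    """Add running totals and running counts element-wise."""
    return (accumulator[0] + next_value[0], accumulator[1] + next_value[1])


def format_averages(results):
    """Hypothetical final step: turn (total, count) pairs into averages."""
    return {subject: round(total / count, 2)
            for subject, (total, count) in results.items()}


sample = [
    "101,Math,85",
    "102,Physics,92",
    "101,Physics,88",
    "103,Math,95",
    "102,Math,76",
    "103,Physics,90",
]

# Stand-in for the framework: group mapped values by key, then reduce pairwise
grouped = defaultdict(list)
for line in sample:
    for key, value in mapper(line):
        grouped[key].append(value)
results = {key: reduce(reducer, values) for key, values in grouped.items()}

averages = format_averages(results)
for subject, average in sorted(averages.items()):
    print(f"{subject}: {average}")
```

Note that the mapper's value and the reducer's return value are both `(sum, count)` tuples. Keeping them the same shape is what allows the reducer to be applied repeatedly.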

---

## Project 3: Building an Inverted Index

### Description
An inverted index is a core component of a search engine. It maps content, such as words, to its locations in a document or a set of documents. Your task is to build an inverted index for a small collection of documents, where each line in the input file is considered a separate document.

### Data Format (`documents.txt`)
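Each line is a document. The line number can be used as the document ID. Note that the expected output below does not index the `Line N:` labels shown in the example data, so treat them as labels rather than document text.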

**Example Data:**
```
Line 1: the quick brown fox
Line 2: the lazy brown dog
Line 3: the quick fox jumps
```

### Task
Write a MapReduce job that creates an inverted index. The output should be a word followed by a sorted list of unique document IDs (line numbers) where that word appears.

**Expected Output:**
```
brown: [1, 2]
dog: [2]
fox: [1, 3]
jumps: [3]
lazy: [2]
quick: [1, 3]
the: [1, 2, 3]
```

### Hints

* **Data Input**: Your `DataLoader` needs to be slightly modified, or you need to keep track of the line number. A simple way is to use `enumerate` when reading the file.
    ```python
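    # In your main solver script
    input_file = 'documents.txt'
    with open(input_file, 'r') as f:
        # enumerate(f, start=1) pairs each line with its 1-based line number
        numbered_lines = [(n, line.strip()) for n, line in enumerate(f, start=1)]
    # Now you can pass each (line_number, line_content) pair to the mapper
    ```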
    This means your `mapper` will need to accept two arguments: `mapper(line_number, line_content)`.

* **Mapper `mapper(line_number, line_content)`**:
    * For each word in `line_content`, `yield (word, [line_number])`.
    * Note that you are yielding a list containing a single line number.

* **Reducer `reducer(accumulator, next_value)`**:
    * Both `accumulator` and `next_value` will be lists of line numbers.
    * Your reducer should simply combine the two lists: `return accumulator + next_value`.

* **Final Processing**:
    * The final output for a word might be an unsorted list with duplicates, like `('the', [1, 2, 1, 3])`. Duplicates appear whenever a word occurs more than once in the same document.
    * Write a final formatting function that loops through the results, converts the list to a `set` to get unique values, and then converts it back to a sorted `list` for the final output.
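
As a starting point, the hints above could be sketched like this. It is one possible approach, not the reference solution: `format_index` is a hypothetical name for the final formatting step, the documents are assumed to contain only their text (no `Line N:` prefix), lowercasing words is an added assumption, and the loop that feeds `(line_number, line_content)` pairs to the mapper stands in for the `FakeMapReduce` job.

```python
from collections import defaultdict
from functools import reduce


def mapper(line_number, line_content):
    """Emit (word, [line_number]) for every word in the document."""
    for word in line_content.split():
        yield (word.lower(), [line_number])


def reducer(accumulator, next_value):
    """Concatenate two lists of line numbers."""
    return accumulator + next_value


def format_index(results):
    """Hypothetical final step: deduplicate and sort each list of IDs."""
    return {word: sorted(set(ids)) for word, ids in sorted(results.items())}


documents = [
    "the quick brown fox",
    "the lazy brown dog",
    "the quick fox jumps",
]

# Stand-in for the framework: number the lines, map, group by key, reduce pairwise
grouped = defaultdict(list)
for line_number, line_content in enumerate(documents, start=1):
    for key, value in mapper(line_number, line_content):
        grouped[key].append(value)
results = {key: reduce(reducer, values) for key, values in grouped.items()}

index = format_index(results)
for word, ids in index.items():
    print(f"{word}: {ids}")
```

Emitting a one-element list from the mapper, rather than a bare integer, keeps the mapper's value and the reducer's result the same type, so list concatenation works at every step.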