<a href="https://colab.research.google.com/github/Prosper1325/BarCoktail/blob/main/MAPREDUCER_amouzou.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2>MapReduce Mini-Project: Analyzing Amazon Movie Reviews</h2>

<p>
In this exercise, you will work as a data engineer for a streaming platform.
Your goal is to perform several analytics tasks on a free and publicly
available dataset of Amazon Movie Reviews using MapReduce in Hadoop.
</p>

<p>
You will complete four tasks:
</p>

<ol>
  <li><b>Count total number of reviews per movie</b></li>
  <li><b>Compute average rating per movie</b></li>
  <li><b>Extract frequent keywords from reviews</b></li>
  <li><b>Join average ratings with top keywords</b></li>
</ol>

<p>
For each task, you will write a MapReduce program (Python Streaming or Java)
and run it using Hadoop in local mode. Your final outputs will help the
company understand which movies are popular, how viewers rate them, and what
keywords often appear in the reviews.
</p>
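
<p>
The map, sort and reduce flow that Hadoop Streaming applies can be imitated in a
few lines of plain Python, which is handy for checking logic before launching a
job. The sketch below is only an illustration:
<code>run_local_mapreduce</code>, <code>demo_mapper</code> and
<code>demo_reducer</code> are hypothetical helper names, and the toy
"movie_id rating" input format is an assumption, not the dataset's format.
</p>

```python
from itertools import groupby


def run_local_mapreduce(lines, mapper, reducer):
    """Simulate map -> sort -> reduce the way a streaming job chains them."""
    # Map phase: each input line may produce zero or more (key, value) pairs.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: bring identical keys next to each other.
    mapped.sort(key=lambda kv: kv[0])
    # Reduce phase: call the reducer once per key with all of its values.
    results = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        results.append(reducer(key, [value for _, value in group]))
    return results


def demo_mapper(line):
    # Hypothetical toy input: "movie_id rating" separated by whitespace.
    movie_id, _rating = line.split()
    yield movie_id, 1


def demo_reducer(key, values):
    # Count how many values were seen for this key.
    return key, sum(values)


sample = ["A 5", "B 3", "A 4"]
print(run_local_mapreduce(sample, demo_mapper, demo_reducer))
```

<p>
In a real streaming job the sort step is done by the framework between the
mapper and the reducer; here a plain in-memory sort stands in for it.
</p>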

In [None]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
!tar -xzf hadoop-3.3.6.tar.gz

Hit:1 https://cli.github.com/packages stable InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [83.6 kB]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [2,201 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Get:11 http://security.ubu

In [None]:
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["HADOOP_HOME"] = "/content/hadoop-3.3.6"
os.environ["PATH"] += f":{os.environ['HADOOP_HOME']}/bin:{os.environ['HADOOP_HOME']}/sbin"

In [None]:
%%bash
cat > /content/hadoop-3.3.6/etc/hadoop/core-site.xml << EOF
<configuration>
 <property>
   <name>fs.defaultFS</name>
   <value>file:///</value>
 </property>
</configuration>
EOF

<h2>About the Dataset</h2>

<p>
We will use the <b>Amazon Movies &amp; TV 5-core dataset</b>, which is publicly
available and contains movie reviews from Amazon. Each entry in the dataset
is stored as a JSON object with fields such as:
</p>

<ul>
  <li><code>reviewerID</code> – the ID of the reviewer</li>
  <li><code>asin</code> – unique movie identifier</li>
  <li><code>reviewText</code> – full written review</li>
  <li><code>overall</code> – the star rating (1 to 5)</li>
  <li><code>vote</code> – how many users found the review helpful</li>
  <li><code>category</code> – always “Movies &amp; TV” in this dataset</li>
</ul>

<p>
You will download the dataset and inspect a few records to understand its
structure before starting the tasks.
</p>
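
<p>
Once the archive has been downloaded, a few records can be inspected without
loading the whole file into memory. The helper below is a minimal sketch;
<code>peek_records</code> is a hypothetical name, and it assumes the archive is
gzip-compressed JSON Lines (one JSON object per line), as described above.
</p>

```python
import gzip
import itertools
import json


def peek_records(path, n=3):
    """Return the first n JSON objects from a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # Skip blank lines lazily so only the needed lines are read.
        non_empty = (line for line in f if line.strip())
        return [json.loads(line) for line in itertools.islice(non_empty, n)]
```

<p>
For example, <code>peek_records("Movies_and_TV_small.json.gz")</code> would
return the first three reviews as dictionaries, whose keys can then be compared
with the field list above.
</p>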

In [None]:
import gzip
import json
import os
import sys # Import sys for printing warnings to stderr

# -------------------------------------------------------------------
# 1) Download the SMALL Movies & TV dataset (correct version)
# -------------------------------------------------------------------
print("Downloading SMALL Movies & TV 5-core dataset...")

URL = "https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Movies_and_TV_5.json.gz"
FILE_GZ = "Movies_and_TV_small.json.gz"

!wget --no-check-certificate -O {FILE_GZ} {URL}

if os.path.getsize(FILE_GZ) == 0:
    raise ValueError("Downloaded file is empty!")

print("Download complete.\n")

# -------------------------------------------------------------------
# 2) Load JSON data (each line is a JSON object)
# -------------------------------------------------------------------
print("Loading JSON data from JSON Lines format...")

data = []
with gzip.open(FILE_GZ, "rt", encoding="utf-8") as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if line: # Only process non-empty lines
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Warning: Could not decode JSON on line {line_num}: {line}. Error: {e}", file=sys.stderr)
                # Continue to the next line to be robust against malformed lines
                continue

print(f"Total records loaded: {len(data)}") # Should be ~3.4 million records
print()

# -------------------------------------------------------------------
# 3) Convert to JSON-LINES format for MapReduce (if not already done)
#    This step ensures 'movies.json' is a clean JSON-Lines file.
# -------------------------------------------------------------------
print("Converting to JSON-lines format (outputting to movies.json with 900,000 records)...")

# Limit to 900,000 records to shorten run times on Colab (on a real cluster, remove this limit)
limited_data = data[:900000]

with open("movies.json", "w", encoding="utf-8") as out:
    for entry in limited_data:
        out.write(json.dumps(entry) + "\n")

print(f"Conversion complete. Saved as movies.json with {len(limited_data)} records\n")

# -------------------------------------------------------------------
# 4) Preview
# -------------------------------------------------------------------
print("Sample entries:\n")

with open("movies.json", "r", encoding="utf-8") as f:
    for i in range(3):
        line = f.readline()
        if not line: # Check for end of file
            print("Not enough lines in movies.json to display 3 samples.")
            break
        print(json.loads(line))

Downloading SMALL Movies & TV 5-core dataset...
--2025-12-10 08:18:36--  https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Movies_and_TV_5.json.gz
Resolving jmcauley.ucsd.edu (jmcauley.ucsd.edu)... 137.110.160.73
Connecting to jmcauley.ucsd.edu (jmcauley.ucsd.edu)|137.110.160.73|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 791322468 (755M) [application/x-gzip]
Saving to: ‘Movies_and_TV_small.json.gz’


2025-12-10 08:20:22 (7.17 MB/s) - ‘Movies_and_TV_small.json.gz’ saved [791322468/791322468]

Download complete.

Loading JSON data from JSON Lines format...
Total records loaded: 3410019

Converting to JSON-lines format (outputting to movies.json with 900,000 records)...
Conversion complete. Saved as movies.json with 900000 records

Sample entries:

{'overall': 5.0, 'verified': True, 'reviewTime': '11 9, 2012', 'reviewerID': 'A2M1CU2IRZG0K9', 'asin': '0005089549', 'style': {'Format:': ' VHS Tape'}, 

<h2>Task 1 — Count Total Number of Reviews per Movie</h2>

<p>
Your first task is to count how many reviews each movie has received. You will
write a MapReduce program where:
</p>

<ul>
  <li>The <b>mapper</b> reads each JSON record, extracts the <code>asin</code>
      field, and emits <code>(asin, 1)</code>.</li>
  <li>The <b>reducer</b> sums the counts for each movie and outputs
      <code>(asin, total_reviews)</code>.</li>
</ul>

<p>
This task is conceptually similar to a word count, but applied to movie IDs.
Complete the mapper and reducer code in the following cell.
</p>
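
<p>
One possible shape for the Task 1 logic is sketched here as two generator
functions so it can be tried in memory first. The names
<code>map_counts</code> and <code>reduce_counts</code> and the sample records
are assumptions made for illustration; for a streaming job each function would
be wrapped in its own script that reads <code>sys.stdin</code> and prints every
yielded line.
</p>

```python
import json
from itertools import groupby


def map_counts(lines):
    """Mapper: emit 'asin<TAB>1' for every parseable review."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON
        asin = record.get("asin") if isinstance(record, dict) else None
        if asin:
            yield f"{asin}\t1"


def reduce_counts(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share an asin."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines if "\t" in line)
    for asin, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{asin}\t{sum(int(count) for _, count in group)}"


# Made-up sample records, including one malformed line
sample = [
    '{"asin": "B001", "overall": 5.0}',
    '{"asin": "B002", "overall": 3.0}',
    "not json",
    '{"asin": "B001", "overall": 4.0}',
]
mapped = sorted(map_counts(sample))  # sorted() stands in for the shuffle
for out in reduce_counts(mapped):
    print(out)
```

<p>
The reducer relies on its input being sorted by key, which is what the
framework guarantees between the map and reduce phases.
</p>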

<h2>Task 2 — Compute Average Rating per Movie</h2>

<p>
In this task, you will compute the <b>average rating</b> for each movie.
</p>

<p>The mapper should:</p>
<ul>
  <li>Extract <code>asin</code> and <code>overall</code> (rating)</li>
  <li>Emit <code>(asin, rating)</code></li>
</ul>

<p>The reducer should:</p>
<ul>
  <li>Sum all ratings for each movie</li>
  <li>Count how many ratings were received</li>
  <li>Compute and output the average rating</li>
</ul>

<p>
Use a MapReduce job to generate a list of movies with their average ratings.
</p>
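
<p>
To sanity-check the job output on a handful of records, the same averages can
be computed directly in Python. This is only a reference sketch:
<code>average_ratings</code> is a hypothetical helper and the sample pairs are
made up.
</p>

```python
from collections import defaultdict


def average_ratings(pairs):
    """Return {asin: mean rating} from an iterable of (asin, rating) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # asin -> [rating_sum, rating_count]
    for asin, rating in pairs:
        totals[asin][0] += float(rating)
        totals[asin][1] += 1
    return {asin: s / n for asin, (s, n) in totals.items()}


sample = [("B001", 5.0), ("B002", 3.0), ("B001", 4.0)]
print(average_ratings(sample))
```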

In [None]:
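%%writefile mapper.py
#!/usr/bin/env python3
# Task 2 mapper: emit (asin, rating) for every review
import sys, json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip malformed JSON
    asin = record.get("asin")
    rating = record.get("overall")
    if asin and rating is not None:
        print(f"{asin}\t{rating}")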



Writing mapper.py


In [None]:
%%writefile reducer.py
#!/usr/bin/env python3
# Task 2 reducer: average the ratings of each movie (input is sorted by asin)
import sys

current_asin = None
rating_sum = 0.0
rating_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        asin, rating = line.split("\t")
        rating = float(rating)
    except ValueError:
        continue  # skip malformed lines

    if asin == current_asin:
        rating_sum += rating
        rating_count += 1
    else:
        # A new movie starts: flush the previous one first
        if current_asin is not None:
            print(f"{current_asin}\t{rating_sum / rating_count}")
        current_asin = asin
        rating_sum = rating
        rating_count = 1

# Flush the last movie
if current_asin is not None:
    print(f"{current_asin}\t{rating_sum / rating_count}")


Writing reducer.py


Task 2: run the job with Hadoop Streaming

In [None]:
%%bash
rm -rf output
chmod +x mapper.py reducer.py
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -input movies.json \
    -output output \
    -file mapper.py \
    -file reducer.py

packageJobJar: [mapper.py, reducer.py] [] /tmp/streamjob4267695710050286770.jar tmpDir=null


2025-12-10 08:24:22,041 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2025-12-10 08:24:22,579 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 08:24:22,662 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 08:24:22,662 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 08:24:22,684 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 08:24:22,892 INFO mapred.FileInputFormat: Total input files to process : 1
2025-12-10 08:24:22,928 INFO mapreduce.JobSubmitter: number of splits:24
2025-12-10 08:24:23,082 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1002289391_0001
2025-12-10 08:24:23,082 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-12-10 08:24:23,366 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local1

<h2>Task 3 — Extract Frequent Keywords from Reviews</h2>

<p>
Now you will perform text analysis on the <code>reviewText</code> field.
Your task is to extract meaningful keywords for each movie.
</p>

<p>The mapper should:</p>
<ul>
  <li>Clean and tokenize the text</li>
  <li>Remove punctuation and stopwords</li>
  <li>Emit <code>(asin:word, 1)</code> for each keyword</li>
</ul>

<p>The reducer should:</p>
<ul>
  <li>Sum the counts for each <code>(asin, word)</code> pair</li>
  <li>Output the total frequency of each keyword per movie</li>
</ul>

<p>
This task combines text preprocessing with distributed computation.
</p>
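
<p>
The effect of the cleaning steps is easiest to see on a single sentence. The
sketch below uses a tiny hand-written stopword set instead of a full stopword
list; <code>DEMO_STOPWORDS</code> and <code>emit_keywords</code> are
hypothetical names introduced for illustration.
</p>

```python
import re
from string import punctuation

# Hypothetical mini stopword list, for illustration only
DEMO_STOPWORDS = {"the", "a", "is", "it", "and", "this", "i"}
_PUNCT_RE = re.compile("[{}]".format(re.escape(punctuation)))


def emit_keywords(asin, review_text, stopwords=DEMO_STOPWORDS):
    """Return the 'asin:word<TAB>1' lines a mapper would emit for one review."""
    # Lowercase, then replace every punctuation character with a space.
    text = _PUNCT_RE.sub(" ", review_text.lower())
    return [f"{asin}:{tok}\t1" for tok in text.split() if tok not in stopwords]


for out in emit_keywords("B001", "This movie is great, and the cast is GREAT!"):
    print(out)
```

<p>
Repeated words are emitted once per occurrence; adding them up is left to the
reducer.
</p>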

In [None]:
%%writefile mapper_tokenize.py
#!/usr/bin/env python3
import sys
import json
import re
from string import punctuation

# Import NLTK stopwords (download quietly so nothing leaks into the mapper output)
import nltk
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def clean_and_tokenize(text):
    text = text.lower()
    # Replace punctuation with spaces
    text = re.sub(r"[{}]".format(re.escape(punctuation)), " ", text)
    # Split into words
    tokens = text.split()
    # Drop stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
        asin = record.get("asin")
        review = record.get("reviewText", "")
        if asin and review:
            tokens = clean_and_tokenize(review)
            for token in tokens:
                print(f"{asin}:{token}\t1")
    except (ValueError, AttributeError):
        # Skip malformed JSON and records whose fields have unexpected types
        continue


Writing mapper_tokenize.py


In [None]:
%%writefile reducer_tokenize.py
#!/usr/bin/env python3
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        key, count = line.split("\t")
        count = int(count)
    except ValueError:
        continue  # Ignore malformed lines

    if current_key is None:
        current_key = key
        current_count = count
    elif key == current_key:
        current_count += count
    else:
        print(f"{current_key}\t{current_count}")
        current_key = key
        current_count = count

# Flush the last key
if current_key is not None:
    print(f"{current_key}\t{current_count}")


Writing reducer_tokenize.py


In [None]:
# Mapper -> sort -> Reducer
!python3 mapper_tokenize.py < movies.json | sort | python3 reducer_tokenize.py

The output stream was truncated and contains only the last 5000 lines.
6300182924:gun	1
6300182924:hard	1
6300182924:hazzard	1
6300182924:head	1
6300182924:heart	1
6300182924:helps	1
6300182924:hero	1
6300182924:high	1
6300182924:hill	1
6300182924:hired	1
6300182924:hot	1
6300182924:hunter	1
6300182924:inept	1
6300182924:injured	1
6300182924:jack	1
6300182924:jobs	1
6300182924:john	7
6300182924:justice	2
6300182924:keep	1
6300182924:kill	1
6300182924:kirk	6
6300182924:knew	1
6300182924:know	1
6300182924:kurt	2
6300182924:largely	1
6300182924:last	1
6300182924:lawyer	2
6300182924:leave	3
6300182924:lee	1
6300182924:left	1
6300182924:les	1
6300182924:lesson	1
6300182924:lightweight	1
6300182924:like	2
6300182924:line	1
6300182924:lines	1
6300182924:live	1
6300182924:looks	1
6300182924:lot	1
6300182924:lots	1
6300182924:love	4
6300182924:loves	1
6300182924:macon	3
6300182924:mainly	1
6300182924:man	5
6300182924:manages	1
6300182924:many	1
6300182924:mexico	2
630018292

In [None]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    -mapper mapper_tokenize.py \
    -reducer reducer_tokenize.py \
    -input movies.json \
    -output output_tokenize \
    -file mapper_tokenize.py \
    -file reducer_tokenize.py


2025-12-10 08:47:59,998 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper_tokenize.py, reducer_tokenize.py] [] /tmp/streamjob8468439551810667545.jar tmpDir=null
2025-12-10 08:48:00,543 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 08:48:00,618 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 08:48:00,618 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 08:48:00,638 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 08:48:00,853 INFO mapred.FileInputFormat: Total input files to process : 1
2025-12-10 08:48:00,871 INFO mapreduce.JobSubmitter: number of splits:24
2025-12-10 08:48:01,024 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local516068429_0001
2025-12-10 08:48:01,024 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-12-10 08:48:01,306 INFO mapred.Loc

First 5 lines of output_tokenize

In [None]:
!hdfs dfs -cat output_tokenize/part-00000 | head -n 5

0001526863:19	1
0001526863:3	3
0001526863:adults	1
0001526863:advertised	1
0001526863:alike	1
cat: Unable to write to output stream.


<h2>Task 4 — Join Ratings with Top Keywords</h2>

<p>
For this task, you will combine the results of Task 2 (average ratings) and
Task 3 (keyword frequencies) using a <b>reduce-side join</b>.
</p>

<p>
You will provide two inputs to your MapReduce job:
</p>

<ul>
  <li><b>Ratings file</b> with <code>(asin, average_rating)</code></li>
  <li><b>Keywords file</b> with <code>(asin:keyword, count)</code>, as produced by Task 3</li>
</ul>

<p>Each mapper should tag its data:</p>

<ul>
  <li><code>("R", rating)</code> for ratings</li>
  <li><code>("K", keyword:count)</code> for keywords</li>
</ul>

<p>
The reducer will receive all entries for a given movie and combine them to
produce an output containing:
</p>

<ul>
  <li>The movie identifier (<code>asin</code>)</li>
  <li>Its average rating</li>
  <li>Its most frequent keywords</li>
</ul>
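
<p>
The tagging and grouping steps above can be tried on a few lines in memory
before running the job. This is a sketch under stated assumptions: the function
names (<code>tag_rating</code>, <code>tag_keyword</code>,
<code>reduce_side_join</code>) and the sample lines are hypothetical, and an
in-memory sort stands in for the shuffle.
</p>

```python
from itertools import groupby


def tag_rating(line):
    # "asin<TAB>avg" -> (asin, ("R", avg))
    asin, avg = line.split("\t")
    return asin, ("R", avg)


def tag_keyword(line):
    # "asin:word<TAB>count" -> (asin, ("K", (word, count)))
    key, count = line.split("\t")
    asin, word = key.split(":", 1)
    return asin, ("K", (word, int(count)))


def reduce_side_join(rating_lines, keyword_lines, top_n=2):
    """Join tagged ratings and keywords on asin, keeping the top_n keywords."""
    tagged = [tag_rating(l) for l in rating_lines]
    tagged += [tag_keyword(l) for l in keyword_lines]
    tagged.sort(key=lambda kv: kv[0])  # stand-in for the shuffle
    joined = []
    for asin, group in groupby(tagged, key=lambda kv: kv[0]):
        avg, counts = None, []
        for _, (tag, payload) in group:
            if tag == "R":
                avg = payload
            else:
                counts.append(payload)
        top = sorted(counts, key=lambda wc: wc[1], reverse=True)[:top_n]
        joined.append((asin, avg, [word for word, _ in top]))
    return joined


# Made-up sample lines in the formats of the two input files
ratings = ["B001\t4.5", "B002\t3.0"]
keywords = ["B001:great\t7", "B001:cast\t2", "B001:plot\t4", "B002:slow\t3"]
for row in reduce_side_join(ratings, keywords):
    print(row)
```

<p>
A movie that appears only in the keywords input ends up with a rating of
<code>None</code> in this sketch, which makes missing ratings easy to spot.
</p>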

In [None]:
%%writefile ratings_mapper.py
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        asin, avg_rating = line.split("\t")
        print(f"{asin}\tR:{avg_rating}")
    except ValueError:
        continue


Writing ratings_mapper.py


In [None]:
%%writefile keywords_mapper.py
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        asin_keyword, count = line.split("\t")
        asin, keyword = asin_keyword.split(":", 1)
        print(f"{asin}\tK:{keyword}:{count}")
    except ValueError:
        continue


Writing keywords_mapper.py


In [None]:
%%writefile join_reducer.py
#!/usr/bin/env python3
import sys
from collections import defaultdict

current_asin = None
avg_rating = None
keywords = defaultdict(int)

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        asin, value = line.split("\t")
    except ValueError:
        continue  # skip malformed lines

    if current_asin != asin and current_asin is not None:
        # Emit the previous movie
        # Keep its top 5 keywords by count
        top_keywords = sorted(keywords.items(), key=lambda x: x[1], reverse=True)[:5]
        keywords_str = ",".join([k for k, c in top_keywords])
        print(f"{current_asin}\t{avg_rating}\t{keywords_str}")
        # Reset the state for the next movie
        keywords = defaultdict(int)
        avg_rating = None

    current_asin = asin

    if value.startswith("R:"):
        avg_rating = value[2:]
    elif value.startswith("K:"):
        parts = value[2:].split(":", 1)
        if len(parts) == 2:
            keyword, count = parts
            keywords[keyword] += int(count)

# Emit the last movie
if current_asin is not None:
    top_keywords = sorted(keywords.items(), key=lambda x: x[1], reverse=True)[:5]
    keywords_str = ",".join([k for k, c in top_keywords])
    print(f"{current_asin}\t{avg_rating}\t{keywords_str}")


Writing join_reducer.py


In [None]:
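# Fails as written: ratings.txt and keywords.txt are only created further below (see the error in the log)
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \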
    -mapper "cat ratings.txt | python3 ratings_mapper.py" \
    -mapper "cat keywords.txt | python3 keywords_mapper.py" \
    -reducer join_reducer.py \
    -input ratings.txt \
    -input keywords.txt \
    -output output_join \
    -file ratings_mapper.py \
    -file keywords_mapper.py \
    -file join_reducer.py


2025-12-10 09:02:37,474 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [ratings_mapper.py, keywords_mapper.py, join_reducer.py] [] /tmp/streamjob5551752325259798341.jar tmpDir=null
2025-12-10 09:02:38,049 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 09:02:38,177 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 09:02:38,177 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 09:02:38,212 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 09:02:38,419 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/tmp/hadoop/mapred/staging/root448488793/.staging/job_local448488793_0001
2025-12-10 09:02:38,419 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: file:/content/ratings.txt
Input path does not exist: file:/content/keywords.txt
Streaming Command Failed!

<p>
Inspect the first five lines of <code>output_tokenize/part-00000</code> to
confirm that the keyword extraction job produced a count for each
<code>asin:keyword</code> pair. The cell below runs <code>hdfs dfs -cat</code>
on that file and pipes the result to <code>head -n 5</code>.
</p>

In [None]:
import subprocess

# Construct the command to display the first 5 lines of the output file
command = "hdfs dfs -cat output_tokenize/part-00000 | head -n 5"

try:
    # Execute the command
    result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
    print(result.stdout)
except subprocess.CalledProcessError as e:
    print(f"Error executing command: {e.cmd}")
    print(f"Stdout: {e.stdout}")
    print(f"Stderr: {e.stderr}")


0001526863:19	1
0001526863:3	3
0001526863:adults	1
0001526863:advertised	1
0001526863:alike	1



In [None]:
%%bash
rm -f ratings.txt
cat output/part* > ratings.txt


In [None]:
%%bash
rm -f keywords.txt
cat output_tokenize/part* > keywords.txt

In [None]:
%%bash
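rm -rf output_join
# Hadoop Streaming takes a single -mapper and feeds each input split to it on
# stdin, so this wrapper picks the tagging script from the name of the file the
# map task is reading (exposed by Streaming as $mapreduce_map_input_file).
cat > join_mapper.sh << 'EOF'
#!/bin/bash
case "$mapreduce_map_input_file" in
  *ratings.txt) exec python3 ratings_mapper.py ;;
  *)            exec python3 keywords_mapper.py ;;
esac
EOF
chmod +x join_mapper.sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    -mapper join_mapper.sh \
    -reducer "python3 join_reducer.py" \
    -input ratings.txt \
    -input keywords.txt \
    -output output_join \
    -file join_mapper.sh \
    -file ratings_mapper.py \
    -file keywords_mapper.py \
    -file join_reducer.py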

packageJobJar: [ratings_mapper.py, keywords_mapper.py, join_reducer.py] [] /tmp/streamjob4508713821108549065.jar tmpDir=null


2025-12-10 09:07:51,656 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2025-12-10 09:07:52,208 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-12-10 09:07:52,296 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-12-10 09:07:52,296 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-12-10 09:07:52,315 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-12-10 09:07:52,529 INFO mapred.FileInputFormat: Total input files to process : 2
2025-12-10 09:07:52,549 INFO mapreduce.JobSubmitter: number of splits:11
2025-12-10 09:07:52,776 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1034547026_0001
2025-12-10 09:07:52,776 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-12-10 09:07:53,112 INFO mapred.LocalDistributedCacheManager: Localized file:/content/ratings_mapper.py as file:/tmp/hadoop-root/mapred/local/jo