# Big Data Analytics — Assignment 01
> Author : Badr TAJINI - Big Data Analytics - ESIEE 2025-2026

**Chapter 1:** Introduction to Big Data  
**Chapter 2:** MapReduce Algorithm Design

**Tools:** Spark or PySpark.  
**Advice:** Keep evidence of each run and make the results reproducible.

## 0. Bootstrap
Use Profile A from the `BDA_Installation_Guide.md`. Log versions and key Spark configs.

In [26]:
spark.stop()  # stop the previous session; only valid when `spark` is already defined

In [1]:
import sys
import platform
from pyspark.sql import SparkSession
import pyspark

spark = (
    SparkSession.builder
    .appName("BDA-AssignmentLab-01")
    .config("spark.ui.port", "4046")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")

print(f"Spark version: {spark.version}")
print(f"PySpark version: {pyspark.__version__}")
print(f"Python version: {sys.version.split()[0]}")
print(f"Session timezone: {spark.conf.get('spark.sql.session.timeZone')}")


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/10/23 13:41:26 WARN Utils: Your hostname, Remi, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/10/23 13:41:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/23 13:41:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark version: 4.0.1
PySpark version: 4.0.1
Python version: 3.10.19
Session timezone: Europe/Paris


In [2]:
print(spark.sparkContext.uiWebUrl)

http://10.255.255.254:4046


## 1. Load dataset

In [3]:
from pathlib import Path
import urllib.request

BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / "data"
OUTPUTS_DIR = BASE_DIR / "outputs"
PROOF_DIR = BASE_DIR / "proof"
for directory in (DATA_DIR, OUTPUTS_DIR, PROOF_DIR):
    directory.mkdir(exist_ok=True)

SHAKESPEARE_URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
TEXT_PATH = DATA_DIR / "shakespeare.txt"
if not TEXT_PATH.exists():
    urllib.request.urlretrieve(SHAKESPEARE_URL, TEXT_PATH)

raw_rdd = spark.sparkContext.textFile(str(TEXT_PATH)).cache()
lines_df = spark.read.text(str(TEXT_PATH)).withColumnRenamed("value", "line").cache()

# Materialize caches
raw_rdd.count()
lines_df.count()

print(f"Data loaded from: {TEXT_PATH}")
lines_df.show(5, truncate=False)


Data loaded from: /mnt/c/Users/rerel/OneDrive/Bureau/Esiee/Esiee/E5/BDA/Lab_1/Assigment/data/shakespeare.txt
+----------------------+
|line                  |
+----------------------+
|1609                  |
|                      |
|THE SONNETS           |
|                      |
|by William Shakespeare|
+----------------------+
only showing top 5 rows


## 2. Part A — “perfect x” follower counts

In [4]:
import contextlib
import io

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Tokenize: lowercase, split on non-letters, drop empty tokens
tokenized = (
    lines_df
    .withColumn("token", F.explode(F.split(F.lower(F.col("line")), "[^a-z]+")))
    .filter(F.col("token") != "")
)

# Pair each token with the token that follows it.
# Note: this window is global (no partitionBy), so the last token of a line is
# paired with the first token of the next line, and the pairing relies on
# monotonically_increasing_id() following the order of the file.
window = Window.orderBy(F.monotonically_increasing_id())
with_next = tokenized.withColumn("next_token", F.lead("token").over(window))

# Count the followers of "perfect" and discard those with count = 1
followers = (
    with_next
    .filter(F.col("token") == "perfect")
    .groupBy("next_token")
    .count()
    .filter(F.col("count") > 1)
    .orderBy(F.desc("count"))
)

# Write outputs/perfect_followers.csv (Spark writes a directory of part files)
output_path = OUTPUTS_DIR / "perfect_followers.csv"
followers.write.mode("overwrite").option("header", True).csv(str(output_path))
print(f"✅ Results written to {output_path}")

# Save explain('formatted') to proof/plan_perfect.txt.
# DataFrame.explain() prints to stdout, so capture it once and reuse the text.
plan_path = PROOF_DIR / "plan_perfect.txt"
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    followers.explain(mode="formatted")
plan_text = buffer.getvalue()
print(plan_text)  # also show the plan in the notebook
plan_path.write_text(plan_text)

print(f"🧠 Execution plan saved to {plan_path}")


25/10/23 13:41:38 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/10/23 13:41:39 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/10/23 13:41:39 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/10/23 13:41:39 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/10/23 13:41:41 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/10/23 13:41:41 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

✅ Results written to /mnt/c/Users/rerel/OneDrive/Bureau/Esiee/Esiee/E5/BDA/Lab_1/Assigment/outputs/perfect_followers.csv
== Physical Plan ==
AdaptiveSparkPlan (17)
+- Sort (16)
   +- Filter (15)
      +- HashAggregate (14)
         +- HashAggregate (13)
            +- Project (12)
               +- Filter (11)
                  +- Window (10)
                     +- Sort (9)
                        +- Exchange (8)
                           +- Project (7)
                              +- Filter (6)
                                 +- Generate (5)
                                    +- InMemoryTableScan (1)
                                          +- InMemoryRelation (2)
                                                +- * Project (4)
                                                   +- Scan text  (3)


(1) InMemoryTableScan
Output [1]: [line#2]
Arguments: [line#2]

(2) InMemoryRelation
Arguments: [line#2], StorageLevel(disk, memory, deserialized, 1 replicas)

(3) Scan text 
Output

25/10/23 13:41:43 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/10/23 13:41:43 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

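The task description asks for the follower of "perfect" within each line, while the window above runs over the whole file. As a cross-check, here is a minimal plain-Python sketch of the per-line rule. It does not use Spark; the function name `perfect_followers` and the sample lines are illustrative assumptions, not part of the assignment template.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")

def perfect_followers(lines, target="perfect", min_count=2):
    """Count the tokens that directly follow `target` within the same line."""
    counts = Counter()
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())
        # zip pairs tokens[i] with tokens[i+1]; it never crosses a line boundary
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    # keep only followers seen at least `min_count` times, most frequent first
    return [(word, c) for word, c in counts.most_common() if c >= min_count]

# Hypothetical sample lines, only to exercise the function
sample = [
    "A perfect love, a perfect love indeed",
    "so perfect a day",
    "love is not perfect: love alters",
]
print(perfect_followers(sample))
```

Running the same function on the lines of `data/shakespeare.txt` gives a reference to compare against `outputs/perfect_followers.csv`; any difference would come from pairs that span two lines.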

## 3. Part B — PMI with RDDs: pairs

In [6]:
# Part B (pairs):
# - threshold K (could also be parsed from a --threshold argument)
# - keep the first 40 tokens per line
# - compute counts for x and (x, y); then PMI = log10(P(x,y) / (P(x) P(y)))
# - filter by threshold; write outputs/pmi_pairs_sample.csv
# - save the plan text to proof/plan_pmi_pairs.txt if a DataFrame is used

import contextlib
import io
import math
import re

from pyspark.sql import functions as F

# --- Helper to tokenize a line ---
WORD_RE = re.compile(r"[a-z]+")
def tokenize(line):
    tokens = WORD_RE.findall(line.lower())
    return tokens[:40]  # keep the first 40 tokens

# 1️⃣ Read and tokenize the text; keep each token once per line
docs = raw_rdd.map(tokenize).map(lambda ts: list(dict.fromkeys(ts)))

# 2️⃣ Threshold parameter
K = 5  # example value; could also be passed as an argument

# 3️⃣ Unigram counts (number of lines in which each word appears)
unigrams = (docs.flatMap(lambda ts: [(t, 1) for t in ts])
                 .reduceByKey(lambda a, b: a + b))
uni_counts = dict(unigrams.collect())
bc_uni = spark.sparkContext.broadcast(uni_counts)

# 4️⃣ Co-occurrence counts for ordered pairs (x, y) within a line
def pairify(tokens):
    s = set(tokens)
    return [((x, y), 1) for x in s for y in s if x != y]

pairs = docs.flatMap(pairify).reduceByKey(lambda a, b: a + b)

# 5️⃣ Total number of lines (denominator of P(x), P(y) and P(x,y))
N = docs.count()

# 6️⃣ PMI for each pair: log10(c(x,y) * N / (c(x) * c(y)))
def compute_pmi(rec):
    (x, y), cxy = rec
    cx = bc_uni.value.get(x, 1)
    cy = bc_uni.value.get(y, 1)
    pmi = math.log10((cxy * N) / (cx * cy))
    return (x, y, int(cxy), float(pmi))

pmi_rdd = pairs.filter(lambda kv: kv[1] >= K).map(compute_pmi)

# 7️⃣ Convert to a DataFrame
pmi_df = spark.createDataFrame(pmi_rdd, ["x", "y", "count", "pmi"])
sample_df = pmi_df.orderBy(F.desc("pmi"), F.desc("count")).limit(50)

# 8️⃣ Write the CSV
out_path = OUTPUTS_DIR / "pmi_pairs_sample.csv"
sample_df.coalesce(1).write.mode("overwrite").option("header", True).csv(str(out_path))
print(f"✅ Sample written to {out_path}")

# 9️⃣ Save the plan to proof/plan_pmi_pairs.txt
plan_path = PROOF_DIR / "plan_pmi_pairs.txt"
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    sample_df.explain("formatted")
plan_path.write_text(buf.getvalue())

print(f"🧠 Plan saved to {plan_path}")


✅ Sample written to /mnt/c/Users/rerel/OneDrive/Bureau/Esiee/Esiee/E5/BDA/Lab_1/Assigment/outputs/pmi_pairs_sample.csv
🧠 Plan saved to /mnt/c/Users/rerel/OneDrive/Bureau/Esiee/Esiee/E5/BDA/Lab_1/Assigment/proof/plan_pmi_pairs.txt

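To make the PMI arithmetic easy to verify by hand, the sketch below recomputes it in plain Python on a tiny corpus. It assumes the same conventions as the cell above: probabilities are fractions of lines, each word counts once per line, and PMI(x, y) = log10(c(x,y) · N / (c(x) · c(y))). The function name `pmi_pairs` and the toy lines are illustrative assumptions.

```python
import math
import re
from collections import Counter
from itertools import permutations

WORD_RE = re.compile(r"[a-z]+")

def pmi_pairs(lines, threshold=1, max_tokens=40):
    """Return {(x, y): PMI} for ordered pairs co-occurring in >= threshold lines."""
    # one set of distinct tokens per line, limited to the first `max_tokens` tokens
    docs = [set(WORD_RE.findall(line.lower())[:max_tokens]) for line in lines]
    n = len(docs)                                   # total number of lines
    uni = Counter(t for d in docs for t in d)       # c(x): lines containing x
    co = Counter(p for d in docs for p in permutations(d, 2))  # c(x, y)
    return {
        (x, y): math.log10(c * n / (uni[x] * uni[y]))
        for (x, y), c in co.items()
        if c >= threshold
    }

# Toy corpus: "a" appears in 3 of 4 lines, "b" in 2, and they co-occur in 2
toy = ["a b", "a b", "a c", "d"]
for pair, value in sorted(pmi_pairs(toy, threshold=2).items()):
    print(pair, round(value, 4))
```

With the toy corpus, the only pairs that pass `threshold=2` are (a, b) and (b, a), and both evaluate log10(2 · 4 / (3 · 2)). PMI is symmetric, so the ordered pairs always come in twins with the same value.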

## 4. Part B — PMI with RDDs: stripes

In [8]:
# Part B (stripes):
# - build stripes x -> map[y -> count] with combiners
# - reuse univariate counts; compute PMI with log10
# - threshold K; write outputs/pmi_stripes_sample.csv
# - save the plan to proof/plan_pmi_stripes.txt if a DataFrame is used

import contextlib
import io
import math
import re

from pyspark.sql import functions as F

# --- Helper to tokenize a line ---
WORD_RE = re.compile(r"[a-z]+")
def tokenize(line):
    return WORD_RE.findall(line.lower())[:40]  # same 40-token cap as the pairs version

# 1️⃣ Build the documents (list of distinct tokens per line)
docs = raw_rdd.map(tokenize).map(lambda tokens: list(dict.fromkeys(tokens)))

# 2️⃣ Count the lines and compute the unigram counts
# N counts every line, as in the pairs version, so both approaches yield the same PMI
N = docs.count()
unigrams = (docs.flatMap(lambda ts: [(t, 1) for t in ts])
                 .reduceByKey(lambda a, b: a + b))
uni_counts = dict(unigrams.collect())
bc_uni = spark.sparkContext.broadcast(uni_counts)

# 3️⃣ Build the stripes x -> {y: count}; reduceByKey combines them map-side
def stripes_from_line(tokens):
    pairs = []
    s = set(tokens)
    for x in s:
        stripe = {}
        for y in s:
            if y != x:
                stripe[y] = 1
        if stripe:
            pairs.append((x, stripe))
    return pairs

def merge_maps(a, b):
    # element-wise sum of two stripes
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

stripes = docs.flatMap(stripes_from_line).reduceByKey(merge_maps)

# 4️⃣ Compute PMI from the stripes and the unigram counts
def to_pmi(record):
    x, stripe = record
    results = []
    cx = bc_uni.value.get(x, 1)
    for y, cxy in stripe.items():
        cy = bc_uni.value.get(y, 1)
        pmi = math.log10((cxy * N) / (cx * cy))
        results.append((x, y, cxy, pmi))
    return results

pairs = stripes.flatMap(to_pmi)

# 5️⃣ Filter by the threshold K
K = 5
filtered = pairs.filter(lambda r: r[2] >= K)

# 6️⃣ Convert to a DataFrame and save a sample
pmi_df = spark.createDataFrame(filtered, ["x", "y", "count", "pmi"])
sample_df = pmi_df.orderBy(F.desc("pmi")).limit(50)

out_path = OUTPUTS_DIR / "pmi_stripes_sample.csv"
sample_df.coalesce(1).write.mode("overwrite").option("header", True).csv(str(out_path))
print(f"✅ Sample written to {out_path}")

# 7️⃣ Save the formatted plan to proof/plan_pmi_stripes.txt
plan_path = PROOF_DIR / "plan_pmi_stripes.txt"
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    sample_df.explain("formatted")  # capture Spark's output
plan_path.write_text(buf.getvalue())

print(f"🧠 Plan saved to {plan_path}")


✅ Sample written to /mnt/c/Users/rerel/OneDrive/Bureau/Esiee/Esiee/E5/BDA/Lab_1/Assigment/outputs/pmi_stripes_sample.csv
🧠 Plan saved to /mnt/c/Users/rerel/OneDrive/Bureau/Esiee/Esiee/E5/BDA/Lab_1/Assigment/proof/plan_pmi_stripes.txt

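Pairs and stripes are two layouts of the same co-occurrence counts, so flattening the stripes should reproduce the pair counts exactly. The plain-Python sketch below illustrates that equivalence on a toy input. The helper names `count_pairs`, `count_stripes`, and `flatten` are illustrative assumptions and are not used by the Spark cells.

```python
from collections import Counter
from itertools import permutations

def count_pairs(docs):
    """Pairs layout: one counter keyed by the ordered pair (x, y)."""
    return Counter(p for d in docs for p in permutations(d, 2))

def count_stripes(docs):
    """Stripes layout: one counter of neighbours per word x."""
    stripes = {}
    for d in docs:
        for x in d:
            # merging a line's stripe into the running stripe is an element-wise sum
            stripes.setdefault(x, Counter()).update(y for y in d if y != x)
    return stripes

def flatten(stripes):
    """Turn {x: {y: c}} back into {(x, y): c} for comparison."""
    return Counter({(x, y): c for x, stripe in stripes.items() for y, c in stripe.items()})

# Toy documents: each set holds the distinct tokens of one line
toy_docs = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
pair_counts = count_pairs(toy_docs)
stripe_counts = count_stripes(toy_docs)
print(flatten(stripe_counts) == pair_counts)
print(dict(stripe_counts["a"]))
```

The stripes layout emits one record per word instead of one per pair, which is why it usually shuffles fewer, larger records. Comparing the Shuffle Read/Write figures of the two Spark cells is a direct way to observe that trade-off.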

## 5. Spark UI evidence
Open the Spark UI at http://localhost:4046 (or at the URL printed by `spark.sparkContext.uiWebUrl` in section 0) while the jobs run, and capture Files Read, Input Size, and Shuffle Read/Write.

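Screenshots can be complemented by numbers pulled from Spark's monitoring REST API, which the UI serves under `/api/v1`. The sketch below is a hedged example: it assumes the stage records expose `inputBytes`, `shuffleReadBytes`, and `shuffleWriteBytes`, which should be checked against the Spark version in use. The function names are illustrative, and the sketch does not cover Files Read, which is shown in the SQL tab of the UI.

```python
import json
import urllib.request

def fetch_stages(ui_url):
    """Return the stage records of the first application served by this UI."""
    with urllib.request.urlopen(f"{ui_url}/api/v1/applications") as resp:
        app_id = json.load(resp)[0]["id"]
    with urllib.request.urlopen(f"{ui_url}/api/v1/applications/{app_id}/stages") as resp:
        return json.load(resp)

def summarize_stages(stages):
    """Sum input and shuffle volumes (in bytes) over all stage records."""
    # field names are assumed from the monitoring REST API; missing fields count as 0
    keys = ("inputBytes", "shuffleReadBytes", "shuffleWriteBytes")
    return {key: sum(stage.get(key, 0) for stage in stages) for key in keys}

# Usage inside the notebook (requires the running session from section 0):
# print(summarize_stages(fetch_stages(spark.sparkContext.uiWebUrl)))
```

The summary only works while the session is alive, because the UI stops with `spark.stop()`. Saving the returned dictionary next to the screenshots in `proof/` keeps the figures available afterwards.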
## 6. Environment and reproducibility

In [9]:
# write some code here
# - print Java version, Spark conf, OS info
# - save ENV.md: versions + key configs
import subprocess

def get_java_version():
    try:
        output = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
        return output.decode("utf-8").strip().splitlines()[0]
    except Exception as exc:
        return f"Unavailable ({exc})"

java_output = get_java_version()
print(f"Java: {java_output}")

print("Spark configuration (all entries):")
conf_items = sorted(spark.sparkContext.getConf().getAll())
for key, value in conf_items:
    print(f" - {key} = {value}")

env_summary = {
    "python": sys.version,
    "spark": spark.version,
    "pyspark": pyspark.__version__,
    "java": java_output,
    "os": platform.platform(),
    "spark_conf": {k: v for k, v in conf_items if k.startswith("spark.")}
}

env_lines = [
    "# Environment Summary",
    "",
    f"- Python: {sys.version.split()[0]}",
    f"- Spark: {spark.version}",
    f"- PySpark: {pyspark.__version__}",
    f"- Java: {java_output}",
    f"- OS: {platform.platform()}",
    "",
    "## Spark Configuration"
]

env_lines.extend(f"- {k} = {v}" for k, v in env_summary["spark_conf"].items())

ENV_PATH = Path("ENV.md")
ENV_PATH.write_text("\n".join(env_lines) + "\n")

print(f"Environment details saved to {ENV_PATH.resolve()}")