# Big Data Analytics — Lab 0 Starter (v2)
> Author: Badr TAJINI - Big Data Analytics - ESIEE 2025-2026

Verify your PySpark setup end‑to‑end and capture evidence.

## 1. Environment bootstrap

In [1]:
from datetime import datetime
print("Run timestamp (UTC):", datetime.utcnow().isoformat())

try:
    from pyspark.sql import SparkSession
    import pyspark, sys, platform, os
    spark = (
        SparkSession.builder
        .appName("BDA-Lab0")
        .config("spark.sql.session.timeZone","UTC")
        .config("spark.sql.shuffle.partitions","8")
        .getOrCreate()
    )
    print("Spark:", spark.version)
    print("PySpark:", pyspark.__version__)
    print("Python:", sys.version.split()[0], "|", platform.platform())
    print("SPARK_HOME:", os.environ.get("SPARK_HOME", "<pip-only>"))
except Exception as e:
    print("Spark init failed:", e)
    spark = None


Run timestamp (UTC): 2025-10-23T07:40:56.816983


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/10/23 09:41:01 WARN Utils: Your hostname, Remi, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/10/23 09:41:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/23 09:41:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/10/23 09:41:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/10/23 09:41:03 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/10/23 09:41:03 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


Spark: 4.0.1
PySpark: 4.0.1
Python: 3.10.19 | Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
SPARK_HOME: <pip-only>


In [3]:
# Requires `df` from Section 2 (cell In [2]); run that cell first.
df.createOrReplaceTempView("perfect_followers")
df.cache()
df.count()   # ⚠️ important: cache() is lazy, this action materializes it

4

## 2. DataFrame sanity check

In [2]:
if spark is None:
    raise SystemExit("Spark not available. Fix setup and re-run Section 1.")

data = [("a",1),("b",2),("c",3),("a",2)]
df = spark.createDataFrame(data, ["key","val"])
df.show()
df.groupBy("key").count().show()

print("\n--- formatted plan ---")
df.groupBy("key").count().explain(mode="formatted")


+---+---+
|key|val|
+---+---+
|  a|  1|
|  b|  2|
|  c|  3|
|  a|  2|
+---+---+



+---+-----+
|key|count|
+---+-----+
|  a|    2|
|  b|    1|
|  c|    1|
+---+-----+


--- formatted plan ---
== Physical Plan ==
AdaptiveSparkPlan (6)
+- HashAggregate (5)
   +- Exchange (4)
      +- HashAggregate (3)
         +- Project (2)
            +- Scan ExistingRDD (1)


(1) Scan ExistingRDD
Output [2]: [key#0, val#1L]
Arguments: [key#0, val#1L], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Project
Output [1]: [key#0]
Input [2]: [key#0, val#1L]

(3) HashAggregate
Input [1]: [key#0]
Keys [1]: [key#0]
Functions [1]: [partial_count(1)]
Aggregate Attributes [1]: [count#26L]
Results [2]: [key#0, count#27L]

(4) Exchange
Input [2]: [key#0, count#27L]
Arguments: hashpartitioning(key#0, 8), ENSURE_REQUIREMENTS, [plan_id=71]

(5) HashAggregate
Input [2]: [key#0, count#27L]
Keys [1]: [key#0]
Functions [1]: [count(1)]
Aggregate Attributes [1]: [count(1)#25L]
Results [2]: [key#0, count(1)#25L AS count#22L]

(6) A

## 3. Spark UI metrics (screenshot)
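Open the Spark UI after running an action and record Files Read, Input Size, and Shuffle Read/Write. The UI defaults to http://localhost:4040, but Spark falls back to the next free port when that one is taken (4043 in the Section 1 log), so check `spark.sparkContext.uiWebUrl` for the actual address.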
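
The shuffle and input numbers can also be pulled programmatically. The following is a minimal sketch, assuming Spark's monitoring REST API is reachable under the UI address at `/api/v1/applications/<app-id>/stages` and that each stage record carries `inputBytes`, `shuffleReadBytes`, and `shuffleWriteBytes` fields. The helper names `fetch_stages` and `summarize_stages` and the sample records are hypothetical. Files Read is a per-scan metric shown in the SQL / DataFrame tab, so it still has to be read from the UI.

```python
import json
from urllib.request import urlopen


def fetch_stages(ui_url: str, app_id: str) -> list:
    """Fetch stage records from the Spark monitoring REST API (assumed reachable)."""
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/stages") as resp:
        return json.load(resp)


def summarize_stages(stages: list) -> dict:
    """Sum input and shuffle byte counters over all stage records."""
    fields = ("inputBytes", "shuffleReadBytes", "shuffleWriteBytes")
    # Missing counters are treated as 0 so partial records do not break the sum.
    return {f: sum(s.get(f, 0) for s in stages) for f in fields}


# Hypothetical sample records, shaped like the API's stage entries.
sample = [
    {"stageId": 0, "inputBytes": 1024, "shuffleReadBytes": 0, "shuffleWriteBytes": 256},
    {"stageId": 1, "inputBytes": 0, "shuffleReadBytes": 256, "shuffleWriteBytes": 0},
]
print(summarize_stages(sample))

# In the notebook (requires the live session from Section 1):
# stages = fetch_stages(spark.sparkContext.uiWebUrl, spark.sparkContext.applicationId)
# print(summarize_stages(stages))
```

These totals are summed over every stage of the application, so run the action you want to measure in a fresh session if you need per-query figures.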

## 4. Optional: RDD quick check (for Hadoop+Spark profile)

In [None]:
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
print(rdd.map(lambda x: x*2).collect())


## 5. Save evidence

In [None]:
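from contextlib import redirect_stdout
from io import StringIO
from pathlib import Path

from pyspark.sql import functions as F

# explain() prints the plan instead of returning it, so capture stdout.
buf = StringIO()
with redirect_stdout(buf):
    # DataFrame has no groupByExpr(); group by a SQL expression via F.expr().
    spark.range(10).groupBy(F.expr("id % 2")).count().explain(mode="formatted")

Path("lab0_plan.txt").write_text(buf.getvalue(), encoding="utf-8")
print("Saved lab0_plan.txt")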
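
The version details printed in Section 1 can be saved alongside the plan. The following is a minimal sketch using only the standard library. The helper `save_env_evidence` and the file name `lab0_env.json` are hypothetical, and the Spark versions are passed in as plain strings so the helper also works when Spark failed to start.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def save_env_evidence(path, spark_version=None, pyspark_version=None):
    """Write a small JSON record of the environment and return it."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "spark": spark_version,      # None when Spark is unavailable
        "pyspark": pyspark_version,  # None when PySpark is unavailable
    }
    Path(path).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record


# Hypothetical file name; version strings copied from the Section 1 output.
record = save_env_evidence("lab0_env.json", spark_version="4.0.1", pyspark_version="4.0.1")
print("Saved lab0_env.json")
```

In the notebook, pass `spark.version` and `pyspark.__version__` instead of the literal strings.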


In [8]:
!ls

BDA_Lab0_Starter_v2.ipynb


In [11]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BDA-A01").getOrCreate()

# 💾 Path to your file (relative to the notebook's working directory)
file_path = "tiny_shakespeare.txt"


df = spark.read.text(file_path)

df.createOrReplaceTempView("shakespeare")
df.cache()
print(f"✅ {df.count()} lines loaded and cached!")

df.show(5, truncate=False)


✅ 40000 lines loaded and cached!
+---------------------------------------------+
|value                                        |
+---------------------------------------------+
|First Citizen:                               |
|Before we proceed any further, hear me speak.|
|                                             |
|All:                                         |
|Speak, speak.                                |
+---------------------------------------------+
only showing top 5 rows


In [12]:
print(spark.sparkContext.uiWebUrl)


http://10.255.255.254:4043


In [13]:
spark.sql("SELECT * FROM shakespeare LIMIT 5")

DataFrame[value: string]