# Installation Tutorial & Lab
## Dr. Aurelle TCHAGNA
# Course: Advanced AI / Big Data Lab (Landmark University, M.Tech)  
**Goal:** Install and validate **Apache Spark**, **PySpark**, **MongoDB**, **PyMongo**, **MongoDB Spark Connector**, and **MongoDB Compass**, then practice PySpark with a **WordCount** pipeline and DataFrame analytics.

This notebook contains:
1. Installation steps (Windows/Linux/macOS)
2. Verification commands
3. PySpark lab (RDD + DataFrame)
4. MongoDB operations (PyMongo + Spark Connector config)
5. Exercises


## 0) What you will install
### Required
- **Java (JDK 17 recommended)**: Spark runs on the JVM. Spark 3.5.x also runs on JDK 11, but Spark 4.x requires JDK 17 or later.
- **Apache Spark**: distributed compute engine.
- **Python 3.9+**: for PySpark scripts and notebooks.
- **PySpark**: Python API for Spark.
- **MongoDB Community Server**: NoSQL database.
- **PyMongo**: Python driver for MongoDB.
- **MongoDB Compass**: GUI to view/edit MongoDB data.

### Optional (Recommended)
- **MongoDB Spark Connector**: enables Spark to read/write MongoDB efficiently.


## 1) Install Java (JDK)
### Windows
1. Install **JDK 17** (or 11).
2. Set environment variables:
   - `JAVA_HOME=C:\Program Files\Java\jdk-17`
   - Add `%JAVA_HOME%\bin` to `PATH`

### Ubuntu/Debian
```bash
sudo apt update
sudo apt install -y openjdk-17-jdk
java -version
```

### macOS (Homebrew)
```bash
brew install openjdk@17
# openjdk@17 is keg-only: follow the caveats Homebrew prints to put it on PATH
java -version
```

✅ **Check**
```bash
java -version
echo $JAVA_HOME    # Linux/macOS
```
On Windows (Command Prompt), run `echo %JAVA_HOME%` instead.
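
The same check can be scripted from Python, which is convenient inside a notebook. The sketch below is a minimal example using only the standard library; the helper names `parse_java_major` and `check_java` are hypothetical, and the version parsing assumes the usual `version "17.0.9"` (or legacy `"1.8.0_392"`) format printed by `java -version`.

```python
import os
import re
import shutil
import subprocess


def parse_java_major(version_output: str):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 reports itself as "1.8.0_xxx".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_java():
    """Report where java is found, the JAVA_HOME value, and the major version."""
    java_path = shutil.which("java")
    report = {
        "java_on_path": java_path,
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "major": None,
    }
    if java_path:
        try:
            # `java -version` writes to stderr, so both streams are inspected.
            result = subprocess.run(
                [java_path, "-version"],
                capture_output=True, text=True, timeout=30,
            )
            report["major"] = parse_java_major(result.stderr + result.stdout)
        except (OSError, subprocess.TimeoutExpired):
            pass  # leave "major" as None if java could not be executed
    return report


print(check_java())
```

If `java_on_path` or `JAVA_HOME` comes back as `None`, revisit the environment variables for your platform before moving on to Spark.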


## 2) Install Apache Spark
Download Spark (prebuilt for Hadoop) from the official Apache Spark site.

After extracting, set:
- `SPARK_HOME` to the Spark folder
- Add `$SPARK_HOME/bin` to `PATH`

**Windows example**
- `SPARK_HOME=C:\spark\spark-3.5.1-bin-hadoop3`
- Add `C:\spark\spark-3.5.1-bin-hadoop3\bin` to PATH

**Linux/macOS example**
```bash
export SPARK_HOME=$HOME/spark/spark-3.5.1-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
```

✅ **Check**
```bash
spark-submit --version
pyspark --version
```


## 3) Install Python packages (PySpark + PyMongo)
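Create a virtual environment (recommended), then install the packages. An unpinned `pip install pyspark` pulls the latest release; if you also use the standalone Spark download from step 2, pin PySpark to the same version (for example `pip install pyspark==3.5.1`).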

### Windows (PowerShell)
```powershell
python -m venv .venv
.venv\Scripts\activate
pip install --upgrade pip
pip install pyspark pymongo
```

### Linux/macOS
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install pyspark pymongo
```
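
To confirm that the notebook kernel is actually using the virtual environment, a short standard-library check can help. This is a sketch only; the function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the venv folder, while
    # sys.base_prefix points at the interpreter the venv was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If this prints `False` in Jupyter, the kernel is probably running a different interpreter than the one where `pyspark` and `pymongo` were installed.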


## 4) Install MongoDB Community Server + Compass
### MongoDB Server
Install MongoDB Community Server (Windows/macOS/Linux) from the official MongoDB download page.

✅ After installation, ensure MongoDB is running.

**Windows**
- Open **Services** → start **MongoDB Server**  
- Default URI: `mongodb://localhost:27017`

**Linux (systemd)**
```bash
sudo systemctl start mongod
sudo systemctl status mongod
```

### MongoDB Compass
Install **MongoDB Compass** and connect using:
- `mongodb://localhost:27017`
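
Before opening Compass, it can be useful to confirm that something is listening on the default port. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and it only tests TCP reachability, so it cannot tell whether the listener is really MongoDB.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 27017,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False


print("Port 27017 reachable:", is_port_open())
```

A `False` result usually means the MongoDB service is not started, or it is bound to a different host or port than the default URI assumes.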


## 5) MongoDB Spark Connector (important notes)
Spark needs the MongoDB connector **JAR** when reading/writing MongoDB via Spark.

### Option A: Use Maven coordinate (easy)
Set `spark.jars.packages` in the SparkSession config to the connector's Maven coordinate, for example:
- `org.mongodb.spark:mongo-spark-connector_2.12:10.3.0`

### Option B: Download connector JAR manually
Provide it to Spark using `--jars` or `spark.jars`.
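
The `_2.12` suffix in the coordinate is the Scala binary version the connector was built for, and it should match the Scala version of your Spark build (the default Spark 3.5.x downloads use Scala 2.12, while Spark 4.x uses Scala 2.13). The sketch below only assembles the coordinate string; `connector_coordinate` is a hypothetical helper, and whether a given connector version is published for a given Scala suffix is something to confirm on Maven Central.

```python
def connector_coordinate(scala_binary: str, connector_version: str) -> str:
    """Build the Maven coordinate: group:artifact_<scala>:version."""
    return (
        "org.mongodb.spark:"
        f"mongo-spark-connector_{scala_binary}:{connector_version}"
    )


coordinate = connector_coordinate("2.12", "10.3.0")
print(coordinate)

# The string is what Option A passes to Spark, for example:
#   SparkSession.builder.config("spark.jars.packages", coordinate)
```

With Option A, Spark resolves and downloads the JAR at session start, so the machine needs internet access the first time.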


## 6) Quick verification inside Jupyter
Run the next cells. They check PySpark/PyMongo and start a local Spark session.


In [1]:
import sys
print("Python:", sys.version)

try:
    import pyspark
    print("PySpark version:", pyspark.__version__)
except Exception as e:
    print("PySpark error:", e)

try:
    import pymongo
    print("PyMongo version:", pymongo.__version__)
except Exception as e:
    print("PyMongo error:", e)


Python: 3.14.2 (tags/v3.14.2:df79316, Dec  5 2025, 17:18:21) [MSC v.1944 64 bit (AMD64)]
PySpark version: 4.1.1
PyMongo version: 4.16.0


In [None]:
from pyspark.sql import SparkSession
import os
import sys

spark = (
    SparkSession.builder
    .appName("MongoSparkSession")
    .master("local[*]")

    # Local MongoDB Spark Connector JAR (adjust the path to your machine;
    # the _2.12 suffix must match the Scala version of your Spark build)
    .config(
        "spark.jars",
        r"C:\spark-jars\mongo-spark-connector_2.12-10.3.0.jar"
    )

    # MongoDB connection
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")

    # Python path: fall back to the current interpreter if PYSPARK_PYTHON is unset
    .config("spark.pyspark.python", os.environ.get("PYSPARK_PYTHON", sys.executable))

    .getOrCreate()
)

spark



In [None]:
spark.version



# Part A — PySpark Lab (RDD): WordCount 


In [None]:
text = [
    "Spark is fast. Spark is general-purpose.",
    "PySpark lets you use Spark with Python.",
    "Big data processing with Spark is scalable and efficient.",
    "MongoDB is a NoSQL database. PyMongo connects Python to MongoDB."
]

rdd = spark.sparkContext.parallelize(text)
rdd.take(3)



In [None]:
import re

def tokenize(line: str):
    return re.findall(r"[a-z0-9]+", line.lower())

word_counts = (
    rdd.flatMap(tokenize)
       .map(lambda w: (w, 1))
       .reduceByKey(lambda a, b: a + b)
       .sortBy(lambda x: x[1], ascending=False)
)

word_counts.take(20)
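
To sanity-check the RDD result, the same counts can be computed on the driver with plain Python. This is a cross-check sketch, not part of the Spark pipeline: it reuses the corpus and the `tokenize` function from the cells above and replaces `flatMap` / `map` / `reduceByKey` with `collections.Counter`.

```python
import re
from collections import Counter

text = [
    "Spark is fast. Spark is general-purpose.",
    "PySpark lets you use Spark with Python.",
    "Big data processing with Spark is scalable and efficient.",
    "MongoDB is a NoSQL database. PyMongo connects Python to MongoDB.",
]


def tokenize(line: str):
    # Lowercase, then keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", line.lower())


# Flatten all lines into one token stream and count each token.
expected = Counter(token for line in text for token in tokenize(line))

print("Total tokens:", sum(expected.values()))
print("spark:", expected["spark"], "| is:", expected["is"])
```

The Spark output should agree with `expected` for every word; ties in count may appear in a different order, since `sortBy` only orders by the count.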



In [None]:
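# Save results (local folder)
import os
import shutil

out_dir = "wordcount_out"

# saveAsTextFile fails if the target folder already exists, so remove it first
if os.path.exists(out_dir):
    shutil.rmtree(out_dir)

# coalesce(1) merges all partitions so that a single part file is written
word_counts.coalesce(1).saveAsTextFile(out_dir)
print("Saved to:", out_dir)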
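
`saveAsTextFile` creates a folder containing `part-*` files (plus a `_SUCCESS` marker), and each line is the string form of one element, here a `(word, count)` tuple such as `('spark', 4)`. The sketch below reads those files back with the standard library; `read_wordcount_output` is a hypothetical helper and assumes every non-empty line is a valid tuple literal.

```python
import ast
import glob
import os


def read_wordcount_output(out_dir: str):
    """Parse the part-* files written by saveAsTextFile into (word, count) pairs."""
    pairs = []
    # Hidden checksum files (.part-*.crc) and _SUCCESS do not match "part-*".
    for path in sorted(glob.glob(os.path.join(out_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    # literal_eval safely parses the tuple text, unlike eval.
                    word, count = ast.literal_eval(line)
                    pairs.append((word, count))
    return pairs


if os.path.isdir("wordcount_out"):
    print(read_wordcount_output("wordcount_out")[:5])
```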



# Part B — PySpark Lab (DataFrames): Cleaning + analytics


In [None]:
from pyspark.sql import functions as F

df_wc = word_counts.toDF(["word", "count"])
df_wc.show(10)



In [None]:
# Top 10 words
df_wc.orderBy(F.desc("count")).limit(10).show()



In [None]:
# Add derived features
total_words = df_wc.agg(F.sum("count").alias("total")).collect()[0]["total"]
df_features = (df_wc
               .withColumn("length", F.length("word"))
               .withColumn("freq", F.col("count") / F.lit(total_words)))

df_features.orderBy(F.desc("count")).show(10)



In [None]:
# Spark SQL
df_features.createOrReplaceTempView("wc")

query = '''
SELECT word, count, length, freq
FROM wc
WHERE length >= 6
ORDER BY count DESC
LIMIT 10
'''
spark.sql(query).show()



# Part C — MongoDB with PyMongo
Make sure MongoDB is running and Compass can connect to `mongodb://localhost:27017`.


In [None]:
from pymongo import MongoClient

MONGO_URI = "mongodb://localhost:27017"
client = MongoClient(MONGO_URI)

db = client["landmark_ai_lab"]
col = db["wordcount"]

# reset collection
col.delete_many({})

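# Collect the top 50 rows to the driver as plain dicts (avoids needing pandas,
# which is not part of the packages installed in step 3)
top50 = df_features.orderBy(F.desc("count")).limit(50).collect()
records = [row.asDict() for row in top50]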
col.insert_many(records)

print("Inserted documents:", col.count_documents({}))
records[:2]



In [None]:
# Read back (PyMongo)
list(col.find({}, {"_id": 0}).sort("count", -1).limit(10))



# Part D — MongoDB Spark Connector 
This requires the connector JAR (via Maven package or local file).
If it fails, students still get full marks using Part C with PyMongo.


In [None]:
# OPTIONAL: SparkSession configured for Mongo Spark Connector
# spark.stop()

# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .appName("Landmark-PySpark-Mongo-Connector")
#          .master("local[*]")
#          .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.3.0")
#          .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017/landmark_ai_lab.wordcount")
#          .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017/landmark_ai_lab.wordcount")
#          .getOrCreate())

# df_mongo = spark.read.format("mongodb").load()
# df_mongo.show(10)



# Exercises (lab) — to be completed 
## Exercise 1 (RDD)
Using the same corpus:
1. Remove stopwords: `{"is","a","the","to","with","and"}`  
2. Recompute word counts  
3. Compare top 10 before vs after stopword removal

## Exercise 2 (DataFrames) 
Create a DataFrame with columns: `word, count, length, freq` and:
1. Compute average word length weighted by frequency  
2. Return the 10 longest words and their counts  
3. Filter words with `count >= 2` and show their share of total frequency

## Exercise 3 (MongoDB + PyMongo) 
1. Store all words (not only top 50) into MongoDB  
2. Create an index on `count` descending  
3. Query: return words with `length >= 7` sorted by count

## Exercise 4
Download any public text dataset (or scrape a few articles), and build a Spark pipeline:
- tokenization + cleaning  
- word count  
- top bigrams (pairs of consecutive words)  
- store results in MongoDB, visualize in Compass

## Exercise 5 
Use Spark ML to compute TF‑IDF:
- `pyspark.ml.feature.Tokenizer`, `HashingTF`, `IDF`


## Exercise 6
1. Build a Spark pipeline that reads **multiple text files** from a folder and produces:
   - per-file word count
   - global word count
   - top 20 keywords per file
2. Extend the pipeline to store both results in MongoDB:
   - Collection 1: `global_wordcount`
   - Collection 2: `per_file_wordcount`

