![Lancaster University](https://www.lancaster.ac.uk/media/lancaster-university/content-assets/images/fst/logos/SCC-Logo.svg)

# SCC.454: Large Scale Platforms for AI and Data Analysis
## Practice Quiz

**Duration:** 1 Hour  
**Total Marks:** 100  

---

### Instructions

1. This is a **practice quiz** to help you prepare for the actual assessment.
2. **Write your code** in the designated code cells below each question.
3. **All questions are independent** — if you cannot answer one question, move on to the next.
4. **Run your code** to verify correctness.

### API Documentation

- **NumPy:** [https://numpy.org/doc/stable/reference/](https://numpy.org/doc/stable/reference/)
- **Pandas:** [https://pandas.pydata.org/docs/reference/](https://pandas.pydata.org/docs/reference/)
- **Scikit-learn:** [https://scikit-learn.org/stable/api/](https://scikit-learn.org/stable/api/)
- **PySpark SQL Functions:** [https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/functions.html](https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/functions.html)
- **PySpark DataFrame:** [https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/dataframe.html](https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/dataframe.html)
- **PySpark ML Feature:** [https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.ml.html](https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.ml.html)

---

| Section | Topic | Marks |
|---------|-------|-------|
| **A** | Python, NumPy, Pandas & Scikit-learn | **30** |
| **B** | Apache Spark (RDDs, DataFrames, SQL) | **35** |
| **C** | Data Preprocessing & Similarity Search | **35** |
| | **Total** | **100** |


---
# Section A: Python, NumPy, Pandas & Scikit-learn (30 marks)
---


## Question 1 — NumPy Array Operations [10 marks]

Consider the following 3×4 matrix **M**:

```
      Col0  Col1  Col2  Col3
Row0    4    12     7     3
Row1    8     5    14    10
Row2    6    11     2     9
```

**API Reference:** [numpy.org/doc/stable/reference](https://numpy.org/doc/stable/reference/)

**(a)** Create the matrix `M` as a NumPy array exactly as shown above. Print its shape and data type. **[2 marks]**

**(b)** Extract and print: (i) the second row, (ii) the third column, and (iii) the element at `Row1`, `Col2` (zero-based labels, as in the matrix above). **[2 marks]**

**(c)** Compute and print the **sum** of each row and the **mean** of each column. **[3 marks]**

**(d)** Using boolean indexing, find and print all elements in `M` that are **greater than 7**. Then, create a copy of `M` and replace all elements greater than 7 with `0`. Print the modified matrix. **[3 marks]**


In [None]:
# Q1 — Write your code here
import numpy as np

# (a) Create matrix M, print shape and dtype
nums = [4, 12, 7, 3, 8, 5, 14, 10, 6, 11, 2, 9]

M = np.array(nums).reshape(3, 4)

print(M.shape)
print(M.dtype)

# (b) Extract second row, third column, element at [1,2]
print(M[1, :])
print(M[:, 2])
print(M[1, 2])

# (c) Sum of each row, mean of each column
row_sum = M.sum(axis=1)
col_mean = M.mean(axis=0)

print("Sum of each row =", row_sum)
print("Mean of each column =", col_mean)

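# (d) Elements > 7, then replace > 7 with 0 in a copy
print("Elements > 7 =", M[M > 7])

M_copy = M.copy()        # the question asks for a copy, so M itself stays unchanged
M_copy[M_copy > 7] = 0

print("Updated Matrix:")
print(M_copy)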


(3, 4)
int64
[ 8  5 14 10]
[ 7 14  2]
14
Sum of each row = [26 37 28]
Mean of each column = [6.         9.33333333 7.66666667 7.33333333]
Elements > 7 = [12  8 14 10 11  9]
Updated Matrix:
[[4 0 7 3]
 [0 5 0 0]
 [6 0 2 0]]
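
Part (d) asks for the replacement to happen in a copy of `M`. One alternative way to get that is `np.where`, which returns a new array and leaves its input untouched. The sketch below is an illustration of that idea rather than the cell's own approach, and the name `M_zeroed` is made up for this example.

```python
import numpy as np

M = np.array([[4, 12, 7, 3],
              [8, 5, 14, 10],
              [6, 11, 2, 9]])

# np.where(condition, x, y) builds a new array: x where the condition holds, y elsewhere
M_zeroed = np.where(M > 7, 0, M)

print(M_zeroed)
print(M[0, 1])  # the original matrix is unchanged at this position
```

Boolean-mask assignment on `M.copy()` gives the same result; `np.where` is simply a one-expression form of it.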


## Question 2 — Pandas Data Manipulation [10 marks]

A shop has recorded the following sales data:

```
order_id  product      category     price   quantity  date
1001      Laptop       Electronics  999.99  1         2025-03-01
1002      Mouse        Electronics  29.99   3         2025-03-01
1003      Notebook     Stationery   5.99    10        2025-03-02
1004      Keyboard     Electronics  79.99   2         2025-03-02
1005      Pen Set      Stationery   12.99   5         2025-03-03
1006      Monitor      Electronics  349.99  1         2025-03-03
1007      Stapler      Stationery   8.99    NaN       2025-03-04
1008      Headphones   Electronics  149.99  2         2025-03-04
```

**API Reference:** [pandas.pydata.org/docs/reference](https://pandas.pydata.org/docs/reference/)

**(a)** Create this DataFrame in pandas exactly as shown (use `np.nan` for the missing value). Print the DataFrame and its info. **[2 marks]**

**(b)** Fill the missing `quantity` value with the **median** quantity of all products. Print the updated DataFrame. **[2 marks]**

**(c)** Add a new column called `total` computed as `price × quantity`. Then filter and display only rows where `total > 100`. **[3 marks]**

**(d)** Using `groupby`, calculate the **total revenue** (sum of `total`) and the **number of orders** per category. Sort by total revenue descending. **[3 marks]**


In [2]:
# Q2 — Write your code here
import pandas as pd
import numpy as np

# (a) Create DataFrame and print info
data = {
    "order_id": [1001,1002,1003,1004,1005,1006,1007,1008],
    "product": ["Laptop","Mouse","Notebook","Keyboard","Pen Set","Monitor","Stapler","Headphones"],
    "category": ["Electronics","Electronics","Stationery","Electronics","Stationery","Electronics","Stationery","Electronics"],
    "price": [999.99,29.99,5.99,79.99,12.99,349.99,8.99,149.99],
    "quantity": [1,3,10,2,5,1,np.nan,2],
    "date": pd.to_datetime([
        "2025-03-01","2025-03-01","2025-03-02","2025-03-02",
        "2025-03-03","2025-03-03","2025-03-04","2025-03-04"
    ])
}

df = pd.DataFrame(data)

print(df)
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()

# (b) Fill missing quantity with median
median_qty = df["quantity"].median()
df["quantity"] = df["quantity"].fillna(median_qty)

print("\nAfter filling missing quantity:")
print(df)

# (c) Add total column, filter where total > 100
df["total"] = df["price"] * df["quantity"]

filtered = df[df["total"] > 100]

print("\nRows where total > 100:")
print(filtered)

# (d) Groupby category: total revenue and order count
summary = (
    df.groupby("category")
      .agg(
          total_revenue=("total", "sum"),
          number_of_orders=("order_id", "count")
      )
      .sort_values(by="total_revenue", ascending=False)
)

print("\nRevenue summary by category:")
print(summary)


   order_id     product     category   price  quantity       date
0      1001      Laptop  Electronics  999.99       1.0 2025-03-01
1      1002       Mouse  Electronics   29.99       3.0 2025-03-01
2      1003    Notebook   Stationery    5.99      10.0 2025-03-02
3      1004    Keyboard  Electronics   79.99       2.0 2025-03-02
4      1005     Pen Set   Stationery   12.99       5.0 2025-03-03
5      1006     Monitor  Electronics  349.99       1.0 2025-03-03
6      1007     Stapler   Stationery    8.99       NaN 2025-03-04
7      1008  Headphones  Electronics  149.99       2.0 2025-03-04
<class 'pandas.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   order_id  8 non-null      int64         
 1   product   8 non-null      str           
 2   category  8 non-null      str           
 3   price     8 non-null      float64       
 4   quantity  7 non-null      float64   
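
The printed output above is cut off, so parts (b) to (d) can be cross-checked by hand. The sketch below recomputes them in plain Python from the table in the question. It is a sanity check under that assumption, not a replacement for the pandas solution, and all variable names are illustrative.

```python
from statistics import median

# (product, category, price, quantity); None marks the missing quantity
orders = [
    ("Laptop", "Electronics", 999.99, 1),
    ("Mouse", "Electronics", 29.99, 3),
    ("Notebook", "Stationery", 5.99, 10),
    ("Keyboard", "Electronics", 79.99, 2),
    ("Pen Set", "Stationery", 12.99, 5),
    ("Monitor", "Electronics", 349.99, 1),
    ("Stapler", "Stationery", 8.99, None),
    ("Headphones", "Electronics", 149.99, 2),
]

# (b) median of the seven known quantities, used to fill the gap
median_qty = median(q for _, _, _, q in orders if q is not None)

# (c) total = price * quantity, then keep the products whose total exceeds 100
totals = [
    (product, category, price * (q if q is not None else median_qty))
    for product, category, price, q in orders
]
over_100 = [product for product, _, total in totals if total > 100]

# (d) revenue and order count per category
revenue, n_orders = {}, {}
for _, category, total in totals:
    revenue[category] = revenue.get(category, 0.0) + total
    n_orders[category] = n_orders.get(category, 0) + 1

print("median quantity:", median_qty)
print("total > 100:", over_100)
for category in sorted(revenue, key=revenue.get, reverse=True):
    print(category, round(revenue[category], 2), n_orders[category])
```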

## Question 3 — Scikit-learn Classification [10 marks]

You will use the Iris dataset for this question. **Run the setup cell first.**

**API Reference:** [scikit-learn.org/stable/api](https://scikit-learn.org/stable/api/)


In [3]:
# === RUN THIS CELL FIRST ===
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target

print(f"Dataset shape: {df_iris.shape}")
print(f"Target classes: {list(iris.target_names)}")
print(f"Class distribution:\n{df_iris['target'].value_counts().sort_index()}")
df_iris.head()


Dataset shape: (150, 5)
Target classes: [np.str_('setosa'), np.str_('versicolor'), np.str_('virginica')]
Class distribution:
target
0    50
1    50
2    50
Name: count, dtype: int64


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


**(a)** Split the data into training (80%) and testing (20%) sets with `random_state=42` and stratified sampling. Print the shapes. **[2 marks]**

**(b)** Apply `StandardScaler` to the features. Fit on training data only, then transform both sets. **[2 marks]**

**(c)** Train a **K-Nearest Neighbours** classifier with `n_neighbors=3`. Print the accuracy on the test set. **[3 marks]**

**(d)** Print the **confusion matrix** and the **classification report** for the KNN model. **[3 marks]**


In [4]:
# Q3 — Write your code here
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# (a) Train-test split with stratification
X = df_iris.drop("target", axis=1)
y = df_iris["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# (b) StandardScaler - fit on train, transform both
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# (c) Train KNN with n_neighbors=3, print accuracy
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

# (d) Confusion matrix and classification report
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)
Test Accuracy: 0.9333333333333333

Confusion Matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  2  8]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.83      1.00      0.91        10
           2       1.00      0.80      0.89        10

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.93        30
weighted avg       0.94      0.93      0.93        30
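
Every figure in the classification report follows from the confusion matrix. The sketch below recomputes accuracy, precision, recall and F1 in plain Python from the matrix printed above (rows are true classes, columns are predicted classes, which is the scikit-learn convention). Variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 2, 8],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

precision, recall, f1 = [], [], []
for k in range(n_classes):
    tp = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum: everything predicted as k
    actual_k = sum(cm[k])                                   # row sum: everything that truly is k
    p = tp / predicted_k
    r = tp / actual_k
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

print("accuracy:", round(accuracy, 2))
for k in range(n_classes):
    print(k, round(precision[k], 2), round(recall[k], 2), round(f1[k], 2))
```

The two class-2 samples predicted as class 1 explain both the lower precision of class 1 (10 of 12 predictions correct) and the lower recall of class 2 (8 of 10 samples found).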



---
# Section B: Apache Spark — RDDs, DataFrames & SQL (35 marks)
---

### ⚙️ Spark Setup

Run the two setup cells below before attempting the Spark questions.


In [5]:
# === SETUP CELL 1: Install PySpark and Java ===
!pip install pyspark==3.5.0 -q
!apt-get install openjdk-11-jdk-headless -qq > /dev/null 2>&1

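import os
import sys

# JAVA_HOME must point at a JDK installed on this machine; keep the line that matches your platform.
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"  # Linux (apt-installed OpenJDK 11)
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-17"          # local Windows install
print("PySpark and Java installed successfully!")

# Local runs only (remove before submitting): use the same interpreter for driver and workers
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print("Driver Python:", sys.executable)
print("Worker Python:", os.environ["PYSPARK_PYTHON"])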



PySpark and Java installed successfully!
Driver Python: d:\Lancaster University Coursework\Term 2\SSC 454 - Large scale platforms for AI and Data Analysis\Labs\venv\Scripts\python.exe
Worker Python: d:\Lancaster University Coursework\Term 2\SSC 454 - Large scale platforms for AI and Data Analysis\Labs\venv\Scripts\python.exe


The system cannot find the path specified.


In [6]:
# === SETUP CELL 2: Create SparkSession ===
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SCC454-Practice") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

sc = spark.sparkContext
print(f"Spark version: {spark.version}")


Spark version: 3.5.0


## Question 4 — RDD Transformations and Actions [12 marks]

**Run the setup cell first**, then answer the questions below.


In [7]:
# === RUN THIS CELL FIRST ===
sentences = [
    "Apache Spark is fast",
    "Spark is used for big data",
    "Big data processing is important",
    "Spark and Hadoop are popular",
    "Data science uses Spark",
]

sentences_rdd = sc.parallelize(sentences, 2)
print(f"RDD created with {sentences_rdd.count()} sentences")


RDD created with 5 sentences


**(a)** Using `flatMap`, split each sentence into words (lowercase) and collect all words as a list. Print the total number of words. **[3 marks]**

**(b)** Using `map` and `reduceByKey`, count the occurrences of each word. Print all word counts. **[3 marks]**

**(c)** Find the **top 5 most frequent words** using `sortBy`. Print them with their counts. **[3 marks]**

**(d)** Using `filter`, find all words that contain the letter `'a'`. Print the count and the list of words. **[3 marks]**


In [8]:
# Q4 — Write your code here

# (a) Split sentences into words, count total words
words_rdd = sentences_rdd.flatMap(lambda x: x.lower().split())

total_words = words_rdd.count()
print("Total number of words:", total_words)

# (b) Word count using map and reduceByKey
word_counts = (
    words_rdd
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

print("\nWord Counts:")
for word, count in word_counts.collect():
    print(word, count)

# (c) Top 5 most frequent words
top5 = (
    word_counts
    .sortBy(lambda x: x[1], ascending=False)
    .take(5)
)

print("\nTop 5 Most Frequent Words:")
for word, count in top5:
    print(word, count)

# (d) Words containing letter 'a'
words_with_a = words_rdd.filter(lambda word: 'a' in word)

filtered_words = words_with_a.collect()

print("\nWords containing 'a':")
print("Count:", len(filtered_words))
print("Words:", filtered_words)


Total number of words: 24

Word Counts:
apache 1
fast 1
used 1
for 1
big 2
important 1
and 1
hadoop 1
are 1
popular 1
science 1
uses 1
spark 4
is 3
data 3
processing 1

Top 5 Most Frequent Words:
spark 4
is 3
data 3
big 2
apache 1

Words containing 'a':
Count: 14
Words: ['apache', 'spark', 'fast', 'spark', 'data', 'data', 'important', 'spark', 'and', 'hadoop', 'are', 'popular', 'data', 'spark']
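
The RDD pipeline can be cross-checked without Spark. The sketch below mirrors each step in plain Python with `collections.Counter`; it assumes the same five sentences as the setup cell and is only a local sanity check, not a distributed computation.

```python
from collections import Counter

sentences = [
    "Apache Spark is fast",
    "Spark is used for big data",
    "Big data processing is important",
    "Spark and Hadoop are popular",
    "Data science uses Spark",
]

# flatMap equivalent: one flat list of lowercase words
words = [word for sentence in sentences for word in sentence.lower().split()]

# map + reduceByKey equivalent: count occurrences per word
counts = Counter(words)

# sortBy + take(5) equivalent
top5 = counts.most_common(5)

# filter equivalent: keeps duplicates, like filtering words_rdd
words_with_a = [word for word in words if "a" in word]

# distinct words only, for comparison
distinct_with_a = sorted(set(words_with_a))

print(len(words), len(counts))
print(top5)
print(len(words_with_a), len(distinct_with_a))
```

Filtering `words_rdd` counts repeated words such as `spark` once per occurrence. If distinct words are wanted instead, filtering the keys of `word_counts` or calling `.distinct()` on the filtered RDD would remove the repeats.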


## Question 5 — Spark DataFrame Operations [12 marks]

Consider the following student grades data:

```
student_id  name      subject     score   semester
S001        Alice     Maths       85      Fall
S001        Alice     Physics     78      Fall
S002        Bob       Maths       92      Fall
S002        Bob       Physics     88      Fall
S003        Carol     Maths       76      Fall
S003        Carol     Physics     82      Fall
S001        Alice     Maths       88      Spring
S001        Alice     Physics     84      Spring
S002        Bob       Maths       90      Spring
S002        Bob       Physics     91      Spring
```

**Run the setup cell first.**


In [9]:
# === RUN THIS CELL FIRST ===
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

grades_data = [
    ("S001", "Alice", "Maths", 85, "Fall"),
    ("S001", "Alice", "Physics", 78, "Fall"),
    ("S002", "Bob", "Maths", 92, "Fall"),
    ("S002", "Bob", "Physics", 88, "Fall"),
    ("S003", "Carol", "Maths", 76, "Fall"),
    ("S003", "Carol", "Physics", 82, "Fall"),
    ("S001", "Alice", "Maths", 88, "Spring"),
    ("S001", "Alice", "Physics", 84, "Spring"),
    ("S002", "Bob", "Maths", 90, "Spring"),
    ("S002", "Bob", "Physics", 91, "Spring"),
]

grades_schema = StructType([
    StructField("student_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("score", IntegerType(), True),
    StructField("semester", StringType(), True),
])

grades_df = spark.createDataFrame(grades_data, grades_schema)
print("Grades DataFrame created:")
grades_df.show()


Grades DataFrame created:
+----------+-----+-------+-----+--------+
|student_id| name|subject|score|semester|
+----------+-----+-------+-----+--------+
|      S001|Alice|  Maths|   85|    Fall|
|      S001|Alice|Physics|   78|    Fall|
|      S002|  Bob|  Maths|   92|    Fall|
|      S002|  Bob|Physics|   88|    Fall|
|      S003|Carol|  Maths|   76|    Fall|
|      S003|Carol|Physics|   82|    Fall|
|      S001|Alice|  Maths|   88|  Spring|
|      S001|Alice|Physics|   84|  Spring|
|      S002|  Bob|  Maths|   90|  Spring|
|      S002|  Bob|Physics|   91|  Spring|
+----------+-----+-------+-----+--------+



**(a)** Select only `name`, `subject`, and `score` columns. Then filter to show only rows where `score >= 85`. **[3 marks]**

**(b)** Add a new column `grade` based on score: `'A'` if score >= 90, `'B'` if score >= 80, `'C'` otherwise. Show the result. **[3 marks]**

**(c)** Using `groupBy`, calculate the **average score** per student (by `name`). Order by average score descending. **[3 marks]**

**(d)** Using `groupBy`, calculate the **average score** per subject per semester. Show the result ordered by semester then subject. **[3 marks]**


In [10]:
# Q5 — Write your code here
from pyspark.sql.functions import col, when, avg, round as spark_round

# (a) Select columns and filter score >= 85
filtered_df = (
    grades_df
    .select("name", "subject", "score")
    .filter(col("score") >= 85)
)

print("Scores >= 85:")
filtered_df.show()

# (b) Add grade column (A/B/C based on score)
graded_df = (
    grades_df
    .withColumn(
        "grade",
        when(col("score") >= 90, "A")
        .when(col("score") >= 80, "B")
        .otherwise("C")
    )
)

print("With Grade Column:")
graded_df.show()

# (c) Average score per student
avg_per_student = (
    grades_df
    .groupBy("name")
    .agg(spark_round(avg("score"), 2).alias("avg_score"))
    .orderBy(col("avg_score").desc())
)

print("Average Score per Student:")
avg_per_student.show()

# (d) Average score per subject per semester
avg_subject_semester = (
    grades_df
    .groupBy("semester", "subject")
    .agg(spark_round(avg("score"), 2).alias("avg_score"))
    .orderBy("semester", "subject")
)

print("Average Score per Subject per Semester:")
avg_subject_semester.show()


Scores >= 85:
+-----+-------+-----+
| name|subject|score|
+-----+-------+-----+
|Alice|  Maths|   85|
|  Bob|  Maths|   92|
|  Bob|Physics|   88|
|Alice|  Maths|   88|
|  Bob|  Maths|   90|
|  Bob|Physics|   91|
+-----+-------+-----+

With Grade Column:
+----------+-----+-------+-----+--------+-----+
|student_id| name|subject|score|semester|grade|
+----------+-----+-------+-----+--------+-----+
|      S001|Alice|  Maths|   85|    Fall|    B|
|      S001|Alice|Physics|   78|    Fall|    C|
|      S002|  Bob|  Maths|   92|    Fall|    A|
|      S002|  Bob|Physics|   88|    Fall|    B|
|      S003|Carol|  Maths|   76|    Fall|    C|
|      S003|Carol|Physics|   82|    Fall|    B|
|      S001|Alice|  Maths|   88|  Spring|    B|
|      S001|Alice|Physics|   84|  Spring|    B|
|      S002|  Bob|  Maths|   90|  Spring|    A|
|      S002|  Bob|Physics|   91|  Spring|    A|
+----------+-----+-------+-----+--------+-----+

Average Score per Student:
+-----+---------+
| name|avg_score|
+-----+---
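
The output above is cut off before the averages appear, so parts (c) and (d) can be recomputed by hand. The sketch below groups the same ten rows in plain Python; `mean_by` is a hypothetical helper written for this example, not a Spark or pandas function.

```python
from collections import defaultdict

# (name, subject, score, semester), same rows as grades_data
grades = [
    ("Alice", "Maths", 85, "Fall"), ("Alice", "Physics", 78, "Fall"),
    ("Bob", "Maths", 92, "Fall"), ("Bob", "Physics", 88, "Fall"),
    ("Carol", "Maths", 76, "Fall"), ("Carol", "Physics", 82, "Fall"),
    ("Alice", "Maths", 88, "Spring"), ("Alice", "Physics", 84, "Spring"),
    ("Bob", "Maths", 90, "Spring"), ("Bob", "Physics", 91, "Spring"),
]

def mean_by(key_fn):
    """Group scores by key_fn(row) and return the mean per group, rounded to 2 decimals."""
    groups = defaultdict(list)
    for row in grades:
        groups[key_fn(row)].append(row[2])
    return {key: round(sum(scores) / len(scores), 2) for key, scores in groups.items()}

# (c) average per student, highest first
avg_per_student = mean_by(lambda row: row[0])
for name, avg_score in sorted(avg_per_student.items(), key=lambda kv: kv[1], reverse=True):
    print(name, avg_score)

# (d) average per (semester, subject), ordered by semester then subject
avg_per_semester_subject = mean_by(lambda row: (row[3], row[1]))
for key in sorted(avg_per_semester_subject):
    print(key, avg_per_semester_subject[key])
```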

## Question 6 — Spark SQL [11 marks]

Register the grades DataFrame as a temporary view and answer using **Spark SQL**.


In [11]:
# Register the DataFrame as a temp view
grades_df.createOrReplaceTempView("grades")
print("View 'grades' registered.")


View 'grades' registered.


**(a)** Write a SQL query to find all students who scored **above 85** in **Maths**. Return: `name`, `score`, `semester`. **[3 marks]**

**(b)** Write a SQL query to calculate the **average score per subject**. Return: `subject`, `avg_score` (rounded to 2 decimals). **[3 marks]**

**(c)** Write a SQL query to find the **highest score** achieved by each student across all subjects and semesters. Return: `name`, `max_score`. Order by `max_score` descending. **[5 marks]**


In [12]:
# Q6 — Write your SQL queries here

# (a) Students scoring above 85 in Maths
result_a = spark.sql("""
    SELECT name, score, semester
    FROM grades
    WHERE subject = 'Maths' AND score > 85
""")
result_a.show()


# (b) Average score per subject
result_b = spark.sql("""
    SELECT subject,
           ROUND(AVG(score), 2) AS avg_score
    FROM grades
    GROUP BY subject
""")
result_b.show()


# (c) Highest score per student
result_c = spark.sql("""
    SELECT name,
           MAX(score) AS max_score
    FROM grades
    GROUP BY name
    ORDER BY max_score DESC
""")
result_c.show()

+-----+-----+--------+
| name|score|semester|
+-----+-----+--------+
|  Bob|   92|    Fall|
|Alice|   88|  Spring|
|  Bob|   90|  Spring|
+-----+-----+--------+

+-------+---------+
|subject|avg_score|
+-------+---------+
|  Maths|     86.2|
|Physics|     84.6|
+-------+---------+

+-----+---------+
| name|max_score|
+-----+---------+
|  Bob|       92|
|Alice|       88|
|Carol|       82|
+-----+---------+
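
The three queries use only standard SQL, so they can also be tried against an in-memory SQLite database from Python's standard library. This is a stand-in for quick experimentation, assuming the same ten rows; SQLite and Spark SQL are different dialects, so it is not a substitute for running the queries through `spark.sql`.

```python
import sqlite3

rows = [
    ("S001", "Alice", "Maths", 85, "Fall"), ("S001", "Alice", "Physics", 78, "Fall"),
    ("S002", "Bob", "Maths", 92, "Fall"), ("S002", "Bob", "Physics", 88, "Fall"),
    ("S003", "Carol", "Maths", 76, "Fall"), ("S003", "Carol", "Physics", 82, "Fall"),
    ("S001", "Alice", "Maths", 88, "Spring"), ("S001", "Alice", "Physics", 84, "Spring"),
    ("S002", "Bob", "Maths", 90, "Spring"), ("S002", "Bob", "Physics", 91, "Spring"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grades (student_id TEXT, name TEXT, subject TEXT, score INTEGER, semester TEXT)"
)
conn.executemany("INSERT INTO grades VALUES (?, ?, ?, ?, ?)", rows)

# (a) students scoring above 85 in Maths
above_85 = conn.execute(
    "SELECT name, score, semester FROM grades WHERE subject = 'Maths' AND score > 85"
).fetchall()

# (b) average score per subject
avg_by_subject = dict(conn.execute(
    "SELECT subject, ROUND(AVG(score), 2) AS avg_score FROM grades GROUP BY subject"
).fetchall())

# (c) highest score per student, highest first
max_by_student = conn.execute(
    "SELECT name, MAX(score) AS max_score FROM grades GROUP BY name ORDER BY max_score DESC"
).fetchall()
conn.close()

print(above_85)
print(avg_by_subject)
print(max_by_student)
```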



---
# Section C: Data Preprocessing & Similarity Search (35 marks)
---


## Question 7 — Text Preprocessing & Regular Expressions [12 marks]

Consider the following product data with messy text:

```
id   raw_text
1    "Product: LAPTOP-2025 | Price: $999.99 | Stock: 50"
2    "Product: mouse-2024 | Price: $29.50 | Stock: 200"
3    "Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75"
4    "Product: Monitor-2023 | Price: $349.99 | Stock: 30"
5    "Product: HEADSET-2025 | Price: $149.00 | Stock: 100"
```

**Run the setup cell first.**


In [14]:
# === RUN THIS CELL FIRST ===
product_data = [
    (1, "Product: LAPTOP-2025 | Price: $999.99 | Stock: 50"),
    (2, "Product: mouse-2024 | Price: $29.50 | Stock: 200"),
    (3, "Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75"),
    (4, "Product: Monitor-2023 | Price: $349.99 | Stock: 30"),
    (5, "Product: HEADSET-2025 | Price: $149.00 | Stock: 100"),
]

products_df = spark.createDataFrame(product_data, ["id", "raw_text"])
print("Products DataFrame created:")
products_df.show(truncate=False)


Products DataFrame created:
+---+---------------------------------------------------+
|id |raw_text                                           |
+---+---------------------------------------------------+
|1  |Product: LAPTOP-2025 | Price: $999.99 | Stock: 50  |
|2  |Product: mouse-2024 | Price: $29.50 | Stock: 200   |
|3  |Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75 |
|4  |Product: Monitor-2023 | Price: $349.99 | Stock: 30 |
|5  |Product: HEADSET-2025 | Price: $149.00 | Stock: 100|
+---+---------------------------------------------------+



**(a)** Using `regexp_extract`, extract the **product name** (e.g., "LAPTOP-2025") into a new column called `product_name`. Show the result. **[3 marks]**

**(b)** Using `regexp_extract`, extract the **price** (the numeric value after $, e.g., "999.99") into a column called `price`. Cast it to `DoubleType`. **[3 marks]**

**(c)** Using `lower()`, convert the `product_name` to lowercase. Then use `regexp_replace` to remove the year part (e.g., "-2025") from the product name. **[3 marks]**

**(d)** Using `rlike`, filter to show only products from year **2025** (i.e., product name contains "2025"). **[3 marks]**


In [15]:
# Q7 — Write your code here
from pyspark.sql.functions import regexp_extract, regexp_replace, lower, col
from pyspark.sql.types import DoubleType

# (a) Extract product name
products_with_name = products_df.withColumn(
    "product_name",
    regexp_extract(col("raw_text"), r"Product:\s*([A-Za-z]+-\d{4})", 1)
)

print("Extracted Product Name:")
products_with_name.show(truncate=False)

# (b) Extract price and cast to Double
products_with_price = products_with_name.withColumn(
    "price",
    regexp_extract(col("raw_text"), r"Price:\s*\$(\d+\.\d+)", 1).cast(DoubleType())
)

print("Extracted Price:")
products_with_price.show(truncate=False)

# (c) Lowercase product name and remove year
cleaned_products = products_with_price.withColumn(
    "product_name",
    lower(col("product_name"))
).withColumn(
    "product_name",
    regexp_replace(col("product_name"), r"-\d{4}", "")
)

print("Cleaned Product Name:")
cleaned_products.show(truncate=False)

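# (d) Filter products from 2025 (match the year in the extracted product name,
#     so a "2025" elsewhere in raw_text, e.g. in the stock count, cannot match)
products_2025 = products_with_name.filter(
    col("product_name").rlike(r"-2025$")
)

print("Products from 2025:")
products_2025.show(truncate=False)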


Extracted Product Name:
+---+---------------------------------------------------+-------------+
|id |raw_text                                           |product_name |
+---+---------------------------------------------------+-------------+
|1  |Product: LAPTOP-2025 | Price: $999.99 | Stock: 50  |LAPTOP-2025  |
|2  |Product: mouse-2024 | Price: $29.50 | Stock: 200   |mouse-2024   |
|3  |Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75 |KEYBOARD-2025|
|4  |Product: Monitor-2023 | Price: $349.99 | Stock: 30 |Monitor-2023 |
|5  |Product: HEADSET-2025 | Price: $149.00 | Stock: 100|HEADSET-2025 |
+---+---------------------------------------------------+-------------+

Extracted Price:
+---+---------------------------------------------------+-------------+------+
|id |raw_text                                           |product_name |price |
+---+---------------------------------------------------+-------------+------+
|1  |Product: LAPTOP-2025 | Price: $999.99 | Stock: 50  |LAPTOP-2025  |99
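
The regular expressions can be tested locally with Python's `re` module before they are used in Spark. The sketch below applies the same extraction patterns to the five raw strings, plus an anchored year check on the extracted name. Spark evaluates patterns with Java's regex engine, so this assumes the patterns behave identically in both, which holds for the simple constructs used here.

```python
import re

raw_rows = [
    "Product: LAPTOP-2025 | Price: $999.99 | Stock: 50",
    "Product: mouse-2024 | Price: $29.50 | Stock: 200",
    "Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75",
    "Product: Monitor-2023 | Price: $349.99 | Stock: 30",
    "Product: HEADSET-2025 | Price: $149.00 | Stock: 100",
]

name_pattern = re.compile(r"Product:\s*([A-Za-z]+-\d{4})")   # group 1 = name with year
price_pattern = re.compile(r"Price:\s*\$(\d+\.\d+)")         # group 1 = digits after the $

parsed = []
for text in raw_rows:
    name = name_pattern.search(text).group(1)                 # (a) product name
    price = float(price_pattern.search(text).group(1))        # (b) price as a float
    clean = re.sub(r"-\d{4}", "", name.lower())               # (c) lowercase, year removed
    is_2025 = re.search(r"-2025$", name) is not None          # (d) year check on the name
    parsed.append((name, price, clean, is_2025))

for row in parsed:
    print(row)
```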

## Question 8 — Shingling & Jaccard Similarity [12 marks]

Consider these three short documents:

```
Doc A: "the cat sat on the mat"
Doc B: "the cat sat on the hat"
Doc C: "the dog ran in the park"
```


**(a)** Write a Python function `word_shingles(text, n)` that returns a **set** of word n-grams. Apply it to all three documents with `n=2`. Print the shingle sets for each document. **[3 marks]**

**(b)** Write a function `jaccard_similarity(set_a, set_b)` that computes Jaccard similarity. Calculate and print the similarity between: (A, B), (A, C), and (B, C). **[3 marks]**

**(c)** Based on your results, which pair of documents is **most similar**? Which is **least similar**? **[2 marks]**

**(d)** Write a simple `MinHash` function that takes a set and `num_hashes` parameter, and returns a signature (list of minimum hash values). Use Python's built-in `hash()` function with different salts. Compare the estimated Jaccard (from signatures) with the true Jaccard for documents A and B using `num_hashes=50`. **[4 marks]**


In [16]:
# Q8 — Write your code here
import builtins

# Documents
doc_a = "the cat sat on the mat"
doc_b = "the cat sat on the hat"
doc_c = "the dog ran in the park"


# (a) Word shingles function, apply with n=2
def word_shingles(text, n):
    words = text.split()
    return set(tuple(words[i:i+n]) for i in range(len(words) - n + 1))

shingles_a = word_shingles(doc_a, 2)
shingles_b = word_shingles(doc_b, 2)
shingles_c = word_shingles(doc_c, 2)

print("Shingles A:", shingles_a)
print("Shingles B:", shingles_b)
print("Shingles C:", shingles_c)


# (b) Jaccard similarity function and compute for all pairs
def jaccard_similarity(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

sim_ab = jaccard_similarity(shingles_a, shingles_b)
sim_ac = jaccard_similarity(shingles_a, shingles_c)
sim_bc = jaccard_similarity(shingles_b, shingles_c)

print("\nJaccard Similarities:")
print("A vs B:", sim_ab)
print("A vs C:", sim_ac)
print("B vs C:", sim_bc)


# (c) Most similar and least similar pairs
print("\nMost similar pair: A & B")
print("Least similar pair: A & C (or B & C)")


# (d) Simple MinHash function and comparison
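def minhash_signature(shingle_set, num_hashes):
    # Note: hash() of a str is salted per Python process (unless PYTHONHASHSEED is set),
    # so the signatures, and therefore the estimate, can differ between runs.
    signature = []
    for i in range(num_hashes):
        # Appending the index i to each shingle simulates a different hash function per slot
        min_hash = min(hash(str(shingle) + str(i)) for shingle in shingle_set)
        signature.append(min_hash)
    return signature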

def estimated_jaccard(sig1, sig2):
    matches = builtins.sum(1 for i in range(len(sig1)) if sig1[i] == sig2[i])
    return matches / len(sig1)

num_hashes = 50

sig_a = minhash_signature(shingles_a, num_hashes)
sig_b = minhash_signature(shingles_b, num_hashes)

est_sim_ab = estimated_jaccard(sig_a, sig_b)

print("\nTrue Jaccard (A vs B):", sim_ab)
print("Estimated Jaccard (A vs B) using MinHash:", est_sim_ab)


Shingles A: {('on', 'the'), ('sat', 'on'), ('the', 'cat'), ('cat', 'sat'), ('the', 'mat')}
Shingles B: {('on', 'the'), ('sat', 'on'), ('the', 'cat'), ('cat', 'sat'), ('the', 'hat')}
Shingles C: {('the', 'dog'), ('ran', 'in'), ('in', 'the'), ('the', 'park'), ('dog', 'ran')}

Jaccard Similarities:
A vs B: 0.6666666666666666
A vs C: 0.0
B vs C: 0.0

Most similar pair: A & B
Least similar pair: A & C (or B & C)

True Jaccard (A vs B): 0.6666666666666666
Estimated Jaccard (A vs B) using MinHash: 0.62
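
The estimate of 0.62 above comes from one particular run: Python salts `hash()` for strings per process, so a rerun can give a different value. For a reproducible signature, one option is to swap in a hash from `hashlib`. The sketch below does that with SHA-256; this deviates from the question's instruction to use `hash()`, and `stable_hash` is a hypothetical helper introduced only for this illustration.

```python
import hashlib

def word_shingles(text, n):
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def stable_hash(shingle, salt):
    """Deterministic 64-bit hash of a shingle; the salt selects the 'hash function'."""
    data = f"{salt}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(shingle_set, num_hashes):
    # One minimum per salted hash function
    return [min(stable_hash(s, salt) for s in shingle_set) for salt in range(num_hashes)]

def estimated_jaccard(sig1, sig2):
    # Fraction of signature positions where the two minima agree
    matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return matches / len(sig1)

shingles_a = word_shingles("the cat sat on the mat", 2)
shingles_b = word_shingles("the cat sat on the hat", 2)

true_jaccard = len(shingles_a & shingles_b) / len(shingles_a | shingles_b)
sig_a = minhash_signature(shingles_a, 50)
sig_b = minhash_signature(shingles_b, 50)
estimate = estimated_jaccard(sig_a, sig_b)

print("True Jaccard:", true_jaccard)
print("Estimated Jaccard:", estimate)
```

With 50 hash functions the estimate is a multiple of 0.02 and typically lands within a few hundredths of the true value; increasing `num_hashes` tightens it at the cost of more hashing.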


## Question 9 — LSH with Spark ML [11 marks]

Using the same three documents from Question 8, build an LSH pipeline with Spark ML.


**(a)** Create a Spark DataFrame with columns `id` and `text` for the three documents. Use `Tokenizer` to split into words, then `CountVectorizer` with `binary=True` to create feature vectors. Show the schema. **[3 marks]**

**(b)** Fit a `MinHashLSH` model with `numHashTables=3`. Transform the data and show the hash values. **[3 marks]**

**(c)** Use `approxSimilarityJoin` with a distance threshold of `0.6` to find similar document pairs, i.e. pairs whose Jaccard distance is below 0.6. Display the results. **[3 marks]**

**(d)** Use `approxNearestNeighbors` to find the 2 nearest neighbours of document A. Print their IDs and distances. **[2 marks]**


In [17]:
# Q9 — Write your code here
from pyspark.ml.feature import Tokenizer, CountVectorizer, MinHashLSH
from pyspark.sql.functions import col

# (a) Create DataFrame, tokenize, vectorize
docs = [
    (1, "the cat sat on the mat"),
    (2, "the cat sat on the hat"),
    (3, "the dog ran in the park"),
]

df = spark.createDataFrame(docs, ["id", "text"])

# Tokenize
tokenizer = Tokenizer(inputCol="text", outputCol="words")
df_tokens = tokenizer.transform(df)

# CountVectorizer (binary=True)
cv = CountVectorizer(inputCol="words", outputCol="features", binary=True)
cv_model = cv.fit(df_tokens)
df_features = cv_model.transform(df_tokens)

print("Schema after vectorization:")
df_features.printSchema()


# (b) Fit MinHashLSH and show hashes
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
mh_model = mh.fit(df_features)

df_hashed = mh_model.transform(df_features)

print("Hashed Data:")
df_hashed.select("id", "hashes").show(truncate=False)


# (c) approxSimilarityJoin with threshold 0.6
similar_pairs = mh_model.approxSimilarityJoin(
    df_features, df_features, 0.6, distCol="JaccardDistance"
).filter(col("datasetA.id") < col("datasetB.id"))

print("Similar document pairs (threshold=0.6):")
similar_pairs.select(
    col("datasetA.id").alias("doc1"),
    col("datasetB.id").alias("doc2"),
    "JaccardDistance"
).show()


# (d) approxNearestNeighbors for document A (id=1)
doc_a = df_features.filter(col("id") == 1).select("features").first()["features"]

nearest = mh_model.approxNearestNeighbors(
    df_features, doc_a, 2, distCol="JaccardDistance"
)

print("2 Nearest Neighbours of Document A:")
nearest.select("id", "JaccardDistance").show()


Schema after vectorization:
root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)

Hashed Data:
+---+-----------------------------------------------+
|id |hashes                                         |
+---+-----------------------------------------------+
|1  |[[1.4369549E7], [7.35564722E8], [3.14786228E8]]|
|2  |[[1.4369549E7], [7.35564722E8], [3.14786228E8]]|
|3  |[[1.4369549E7], [7.35564722E8], [3.11263053E8]]|
+---+-----------------------------------------------+

Similar document pairs (threshold=0.6):
+----+----+-------------------+
|doc1|doc2|    JaccardDistance|
+----+----+-------------------+
|   1|   2|0.33333333333333337|
+----+----+-------------------+

2 Nearest Neighbours of Document A:
+---+-------------------+
| id|    JaccardDistance|
+---+-------------------+
|  1|                0.0|
|  2|0.33333333333333337|
+---+----------
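
The distances reported by `MinHashLSH` are Jaccard distances between the binary word vectors, so they can be verified directly on word sets. The sketch below does this in plain Python, assuming whitespace tokenisation of the already lowercase documents (which matches what `Tokenizer` produces here). It also shows why only the pair (1, 2) passes the 0.6 threshold.

```python
from itertools import combinations

docs = {
    1: "the cat sat on the mat",
    2: "the cat sat on the hat",
    3: "the dog ran in the park",
}

# binary=True in CountVectorizer means only word presence matters, i.e. a set of words
word_sets = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

distances = {
    (i, j): jaccard_distance(word_sets[i], word_sets[j])
    for i, j in combinations(sorted(word_sets), 2)
}
within_threshold = [pair for pair, d in distances.items() if d < 0.6]

for pair, d in distances.items():
    print(pair, round(d, 4))
print("pairs below 0.6:", within_threshold)
```

Documents 1 and 2 share four of six distinct words, giving a distance of about 0.333. Document 3 shares only `the` with either of them (one of nine distinct words), giving about 0.889, which is well above the threshold.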

---
## Cleanup


In [18]:
# Stop Spark session
spark.stop()
print("Spark session stopped. Practice quiz complete!")


Spark session stopped. Practice quiz complete!


---
### End of Practice Quiz

**Review your answers and check:**
- [ ] All code cells execute without errors
- [ ] Outputs match what you expect
- [ ] You understand the concepts tested

---
*SCC.454: Large Scale Platforms for AI and Data Analysis — Lancaster University*
