## Data Upload

_Because the dataset is large, we upload it at the top of the notebook every session (GitHub didn't accept it because it's larger than 50 MB)._

In [25]:
# Upload the books dataset manually
from google.colab import files
import io

uploaded = files.upload()

# Get the uploaded file name
csv_filename = next(iter(uploaded))
print(f"Uploaded: {csv_filename}")


Saving Books.csv to Books.csv
Uploaded: Books.csv


---
## Spark Setup & Load Dataset

### Spark Initialization

In [4]:
!pip install pyspark --quiet  # install PySpark in Colab, suppressing pip output

from pyspark.sql import SparkSession  # entry point to Spark SQL functionality


# Spark config
spark = SparkSession.builder \
    .appName("BookRecommendationCF") \
    .getOrCreate()

spark


### Load the Dataset into Spark

In [26]:
# Load the uploaded CSV into a Spark DataFrame
ratings_raw = spark.read.csv(
    csv_filename,
    header=True,
    inferSchema=True, # auto detect types
    escape="\"",
    multiLine=True # important, otherwise the loading will be corrupt
)

print("Rows:", ratings_raw.count())
ratings_raw.printSchema()
ratings_raw.show(5, truncate=False)


Rows: 499
root
 |-- _c0: integer (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- location: string (nullable = true)
 |-- age: double (nullable = true)
 |-- isbn: string (nullable = true)
 |-- rating: integer (nullable = true)
 |-- book_title: string (nullable = true)
 |-- book_author: string (nullable = true)
 |-- year_of_publication: integer (nullable = true)
 |-- publisher: string (nullable = true)
 |-- img_s: string (nullable = true)
 |-- img_m: string (nullable = true)
 |-- img_l: string (nullable = true)
 |-- Summary: string (nullable = true)
 |-- Language: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)


Some notes:
- `rating`: 0 means no rating.
- `isbn`: stored as a string because some ISBNs include hyphens, and a numeric type would drop leading zeros.
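A small plain-Python check of the `isbn` note above. The value `074322678X` appears in the `describe` output of this notebook; the ISBN with a leading zero is a made-up example, used only for illustration.

```python
# ISBN taken from the describe() output of this notebook.
isbn_with_x = "074322678X"

# Hypothetical ISBN with a leading zero (illustration only).
isbn_leading_zero = "0123456789"

# Casting to int silently drops the leading zero.
as_int = int(isbn_leading_zero)
print(as_int)            # 123456789
print(len(str(as_int)))  # 9 characters instead of 10

# An ISBN ending in the check character 'X' cannot be cast to int at all.
try:
    int(isbn_with_x)
    castable = True
except ValueError:
    castable = False
print(castable)  # False
```

Keeping the column as a string avoids both problems.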

### Quick Inspection

In [28]:
# Num of distinct users and books
print("Unique users:", ratings_raw.select("user_id").distinct().count())
print("Unique ISBNs:", ratings_raw.select("isbn").distinct().count())

# Detect potential issues early
ratings_raw.describe(['user_id', 'age', 'isbn', 'state', 'Category']).show()


Unique users: 228
Unique ISBNs: 20
+-------+-----------------+------------------+--------------------+---------+------------------+
|summary|          user_id|               age|                isbn|    state|          Category|
+-------+-----------------+------------------+--------------------+---------+------------------+
|  count|              249|               249|                 249|      249|               249|
|   mean|80763.51004016065|  37.0054842872289|4.9289581070445347E8|     NULL|               9.0|
| stddev|70227.23992765018|10.110548475294626|3.2815658792424786E8|     NULL|               0.0|
|    min|                2|              14.0|          074322678X|        ,|                 9|
|    max|           274634|              74.0|           887841740|wisconsin|['Social Science']|
+-------+-----------------+------------------+--------------------+---------+------------------+



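Notes:
- 228 distinct users and 20 distinct ISBNs across 499 rows, so both users and books repeat (expected for a ratings table).
- `describe` counts only non-null values, so a count of 249 means about half of the rows have nulls in these columns.
- Some ISBNs contain non-numeric characters (e.g. `074322678X`); we normalize them to digit-only strings so that ratings and books match on `isbn`.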

---
## Data Cleaning / Preparation

### Cleaning overview

Some columns are not needed: `_c0`, the location fields, `age`, the image URL columns (`img_s`, `img_m`, `img_l`), `Language`, and `Summary`.

Keeping them wastes memory, slows down computation, and makes the notebook harder to read.

Some data would also distort the similarity calculations, such as the 0 ratings (which mean "no rating").

We also remove inactive users and rarely rated books, because the Pearson correlation is only meaningful when users have rated books in common.
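To make the overlap requirement concrete, here is a minimal plain-Python sketch of a Pearson correlation computed only over the books two users have both rated. The function name and the toy users are hypothetical and not part of the notebook; the sketch only illustrates why sparse users contribute nothing to the similarity step.

```python
from math import sqrt

def pearson_on_overlap(ratings_a, ratings_b):
    """Pearson correlation over the books both users rated.

    ratings_a / ratings_b map isbn -> rating. Returns None when the
    correlation is undefined (fewer than 2 common books, or zero variance).
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return None
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)

# Toy users (hypothetical data).
alice = {"b1": 8, "b2": 6, "b3": 10}
bob = {"b1": 4, "b2": 3, "b3": 5}
carol = {"b3": 7, "b9": 9}

r_ab = pearson_on_overlap(alice, bob)    # three common books
r_ac = pearson_on_overlap(alice, carol)  # only one common book
print(r_ab, r_ac)
```

With a single common book there is no variance to correlate, so the pair has no usable similarity. Users and books with very few ratings mostly produce such pairs.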

### Drop unneeded features (columns)

In [30]:
# Keep only columns needed for ratings and minimal metadata
ratings_cols = ["user_id", "isbn", "rating"]
books_cols = ["isbn", "book_title", "book_author"]

ratings_df = ratings_raw.select(ratings_cols)
books_df = ratings_raw.select(books_cols).dropDuplicates(["isbn"])

print("[a] Ratings DF shape:", ratings_df.count(), "rows")
print("[b] Books DF shape:", books_df.count(), "unique books")

[a] Ratings DF shape: 499 rows
[b] Books DF shape: 20 unique books


_You have [a] ratings spread across [b] books._

### Clean ISBNs

Remove non-numeric characters with a regex, then drop ISBNs that are empty after cleaning.
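What the pattern `[^0-9]` does to individual values can be checked in plain Python first. This is only a stand-in for the Spark `regexp_replace` call used in this step; `clean_isbn` and the hyphenated and `n/a` samples are hypothetical.

```python
import re

def clean_isbn(raw):
    """Plain-Python stand-in for regexp_replace(col("isbn"), "[^0-9]", "")."""
    return re.sub(r"[^0-9]", "", raw)

# "074322678X" and "887841740" appear in this notebook's describe() output;
# the other two samples are made up for illustration.
samples = ["0-7432-2678-X", "074322678X", "887841740", "n/a"]
cleaned = [clean_isbn(s) for s in samples]
print(cleaned)  # ['074322678', '074322678', '887841740', '']

# Values that end up empty are dropped afterwards.
kept = [c for c in cleaned if c != ""]
print(kept)  # ['074322678', '074322678', '887841740']
```

Note that the check character `X` is stripped as well, so the cleaned value is no longer a complete ISBN-10. Because the same rule is applied to both the ratings and the books table, the values still match each other.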

In [31]:
from pyspark.sql.functions import regexp_replace, col

print("Before ISBN cleaning:", ratings_df.count(), "ratings,", books_df.count(), "books")

# Remove non-numeric characters from ISBN
ratings_df = ratings_df.withColumn("isbn", regexp_replace(col("isbn"), "[^0-9]", ""))
books_df = books_df.withColumn("isbn", regexp_replace(col("isbn"), "[^0-9]", ""))

# Drop rows where ISBN is empty after cleaning
ratings_df = ratings_df.filter(col("isbn") != "")
books_df = books_df.filter(col("isbn") != "")

print("After ISBN cleaning:", ratings_df.count(), "ratings,", books_df.count(), "books")



Before ISBN cleaning: 499 ratings, 20 books
After ISBN cleaning: 249 ratings, 19 books


### Remove zero or invalid ratings

In [33]:
ratings_df = ratings_df.filter((col("rating") > 0) & (col("rating") <= 10))
print("Ratings after removing zeros/invalid:", ratings_df.count())


Ratings after removing zeros/invalid: 108


### Filter active users and books (≥5 ratings)

_Active users and books are those that appear at least 5 times in the dataset._
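The cell for this step is not shown here, so the following is only a plain-Python sketch of the rule, not the notebook's actual code. `filter_active`, `MIN_RATINGS`, and the toy rows are hypothetical names and data. In Spark the per-user and per-book counts would typically come from a `groupBy(...).count()` followed by a join back onto the ratings.

```python
from collections import Counter

MIN_RATINGS = 5  # threshold from the heading: at least 5 ratings

def filter_active(ratings, min_ratings=MIN_RATINGS):
    """Keep (user_id, isbn, rating) rows whose user and book both
    have at least `min_ratings` rows in the input."""
    user_counts = Counter(user for user, _, _ in ratings)
    book_counts = Counter(isbn for _, isbn, _ in ratings)
    return [
        (user, isbn, rating)
        for user, isbn, rating in ratings
        if user_counts[user] >= min_ratings and book_counts[isbn] >= min_ratings
    ]

# Toy data: user 1 rates five books, and book "A" is rated by five users.
toy = [(1, b, 7) for b in "ABCDE"] + [(u, "A", 8) for u in (2, 3, 4, 5)]
active = filter_active(toy)
print(active)  # [(1, 'A', 7)]
```

Only one of nine rows survives in this toy case. On a small sample, a threshold of 5 on both users and books can remove almost everything, which is worth keeping in mind when reading the counts later in this section.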

### Clean titles and authors

In [35]:
from pyspark.sql.functions import when, trim

books_df = books_df.withColumn("book_title", trim(col("book_title")))
books_df = books_df.withColumn("book_title", when(col("book_title") == "", "Unknown Title").otherwise(col("book_title")))

books_df = books_df.withColumn("book_author", trim(col("book_author")))
books_df = books_df.withColumn("book_author", when(col("book_author") == "", "Unknown Author").otherwise(col("book_author")))

# Keep only books present in cleaned ratings
books_df = books_df.join(ratings_df.select("isbn").distinct(), on="isbn", how="inner")

print("After cleaning titles/authors:", books_df.count(), "books")



After cleaning titles/authors: 0 books


### Remove duplicate ratings per user/book

We want only **one rating per user per book**.  
If duplicates exist, we keep the highest rating. (Keeping the most recent rating by timestamp would be better, but the dataset has no timestamp column.)

In [36]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Keep only the highest rating if duplicates exist
window = Window.partitionBy("user_id", "isbn").orderBy(col("rating").desc())

ratings_df = ratings_df.withColumn("row_num", row_number().over(window)) \
                       .filter(col("row_num") == 1) \
                       .drop("row_num")

print("After removing duplicate ratings:", ratings_df.count(), "ratings")



After removing duplicate ratings: 0 ratings
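The deduplication rule, and the timestamp-based alternative mentioned above, can be sketched in plain Python. Both helper functions are hypothetical, and the `ts` field is an assumption: the dataset in this notebook has no timestamp column.

```python
def keep_highest(ratings):
    """One row per (user_id, isbn), keeping the highest rating."""
    best = {}
    for user, isbn, rating in ratings:
        key = (user, isbn)
        if key not in best or rating > best[key]:
            best[key] = rating
    return [(u, i, r) for (u, i), r in best.items()]

def keep_latest(ratings_with_ts):
    """One row per (user_id, isbn), keeping the most recent rating.

    Assumes a hypothetical `ts` field as the fourth tuple element.
    """
    latest = {}
    for user, isbn, rating, ts in ratings_with_ts:
        key = (user, isbn)
        if key not in latest or ts > latest[key][1]:
            latest[key] = (rating, ts)
    return [(u, i, r) for (u, i), (r, _) in latest.items()]

rows = [(1, "A", 6), (1, "A", 9), (2, "A", 5)]
highest = keep_highest(rows)
print(highest)  # [(1, 'A', 9), (2, 'A', 5)]

rows_ts = [(1, "A", 9, 100), (1, "A", 6, 200)]
latest = keep_latest(rows_ts)
print(latest)  # [(1, 'A', 6)]
```

The two rules can disagree, as the second example shows: the highest rating is 9, but the most recent one is 6.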


### Cleaning Result

In [38]:
print("Final ratings:", ratings_df.count())
print("Final books:", books_df.count())
print("Active users:", active_users.count())


Final ratings: 0
Final books: 0
Active users: 1
