## Data Upload

_Because the dataset is large, we upload it at the top of the notebook every session (GitHub didn't accept it because it is larger than 50 MB)._

In [6]:
# Upload the books dataset manually
from google.colab import files

uploaded = files.upload()

# Get the uploaded file name
csv_filename = next(iter(uploaded))
print(f"Uploaded: {csv_filename}")


Saving Books.csv to Books.csv
Uploaded: Books.csv


---
## Spark Setup & Load Dataset

### Spark Initialization

In [4]:
!pip install pyspark --quiet  # install PySpark in Colab (quiet output)

from pyspark.sql import SparkSession  # the entry point to Spark functionality


# Spark config
spark = SparkSession.builder \
    .appName("BookRecommendationCF") \
    .getOrCreate()

spark


### Load the Dataset into Spark

In [12]:
# Load the uploaded CSV into a Spark DataFrame
ratings_raw = spark.read.csv(
    csv_filename,
    header=True,
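    inferSchema=True,  # infer column types automatically
    escape="\"",  # use the double quote as the escape character inside quoted fields
    multiLine=True  # important: quoted fields may span several lines, otherwise the load is corrupted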
)

print("Rows:", ratings_raw.count())
ratings_raw.printSchema()
ratings_raw.show(5, truncate=False)


Rows: 499
root
 |-- _c0: integer (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- location: string (nullable = true)
 |-- age: double (nullable = true)
 |-- isbn: string (nullable = true)
 |-- rating: integer (nullable = true)
 |-- book_title: string (nullable = true)
 |-- book_author: string (nullable = true)
 |-- year_of_publication: integer (nullable = true)
 |-- publisher: string (nullable = true)
 |-- img_s: string (nullable = true)
 |-- img_m: string (nullable = true)
 |-- img_l: string (nullable = true)
 |-- Summary: string (nullable = true)
 |-- Language: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)

+---+-------+-------------------------+-----------+---------+------+-------------------+--------------------+-------------------+-----------------------+------------------------------------------------------------+------------------------

Some notes:
- `rating`: 0 means no rating.
- `isbn`: stored as a string, because some ISBNs include hyphens and a numeric type would drop leading zeros.
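
To illustrate the second note, here is a minimal plain-Python sketch of what happens when an ISBN is treated as a number instead of a string. The sample values and the helper `parses_as_int` are hypothetical and are not taken from the dataset.

```python
# Hypothetical ISBN values, for illustration only (not from Books.csv)
isbn_with_leading_zero = "0123456789"
isbn_with_hyphens = "0-12-345678-9"

# Parsing as an integer silently drops the leading zero,
# so converting back to text gives a different, shorter identifier.
as_int = int(isbn_with_leading_zero)
round_tripped = str(as_int)


def parses_as_int(value: str) -> bool:
    """Return True if the value can be parsed as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


print(round_tripped, len(round_tripped))
print(parses_as_int(isbn_with_hyphens))
```

A hyphenated ISBN cannot be parsed as an integer at all, which is a second reason to keep the column as text.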

### Quick Inspection

In [14]:
# Number of distinct users and books
print("Unique users:", ratings_raw.select("user_id").distinct().count())
print("Unique ISBNs:", ratings_raw.select("isbn").distinct().count())

# Detect potential issues early
ratings_raw.describe(['user_id', 'age', 'isbn', 'state', 'Category']).show()


Unique users: 228
Unique ISBNs: 20
+-------+-----------------+------------------+--------------------+---------+------------------+
|summary|          user_id|               age|                isbn|    state|          Category|
+-------+-----------------+------------------+--------------------+---------+------------------+
|  count|              249|               249|                 249|      249|               249|
|   mean|80763.51004016065|  37.0054842872289|4.9289581070445347E8|     NULL|               9.0|
| stddev|70227.23992765018|10.110548475294626|3.2815658792424786E8|     NULL|               0.0|
|    min|                2|              14.0|          074322678X|        ,|                 9|
|    max|           274634|              74.0|           887841740|wisconsin|['Social Science']|
+-------+-----------------+------------------+--------------------+---------+------------------+



Notes:
- Number of distinct ids = number of records (expected).
- Number of ISBNs = number of records (expected).
- Some ISBNs contain non-numeric characters (for example, `074322678X`); they must be numeric strings for matching.
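
One possible way to bring ISBNs into a consistent form before matching is sketched below in plain Python. The function names `normalize_isbn` and `is_numeric_isbn` are hypothetical helpers, not part of the notebook. The sketch assumes that hyphens and whitespace are the only separators, and it keeps a trailing `X` (the ISBN-10 check character) while flagging it as non-numeric, so that the decision to drop or keep such rows can be made separately.

```python
import re
from typing import Optional


def normalize_isbn(raw: Optional[str]) -> Optional[str]:
    """Strip separators and upper-case an ISBN.

    Returns None when the cleaned value does not look like an
    ISBN-10 (9 digits plus a digit or X) or an ISBN-13 (13 digits).
    """
    if raw is None:
        return None
    cleaned = re.sub(r"[\s-]", "", raw).upper()
    if re.fullmatch(r"\d{9}[\dX]", cleaned) or re.fullmatch(r"\d{13}", cleaned):
        return cleaned
    return None


def is_numeric_isbn(isbn: Optional[str]) -> bool:
    """Return True only for normalized ISBNs made of digits alone."""
    return isbn is not None and isbn.isdigit()


sample = normalize_isbn("0-7432-2678-X")
print(sample, is_numeric_isbn(sample))
```

In a Spark pipeline the same logic could be applied with column expressions or a UDF; that step is left out here because it depends on how the cleaning stage is organized.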

---
## Data Cleaning / Preparation

### Drop unneeded columns

Some features are not needed: `_c0`, the location columns, `age`, the image columns (`img_s`, `img_m`, `img_l`), `Language`, and `Summary`.

Keeping them wastes memory, slows down computation, and makes the notebook harder to read.

Some data would also distort the similarity calculations, such as the 0 ratings.

We also remove inactive users and rarely rated books, because the Pearson correlation needs overlap in ratings.
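
The filtering described above can be sketched in plain Python as follows. This is an illustration of the logic only, not the notebook's Spark implementation. The function name `filter_ratings`, the thresholds `min_user_ratings` and `min_book_ratings`, and the sample data are all hypothetical. Repeating the filter until nothing changes is also an assumption of this sketch: removing rare books can push a user below the threshold, and the reverse.

```python
from collections import Counter
from typing import List, Tuple

Rating = Tuple[int, str, int]  # (user_id, isbn, rating)


def filter_ratings(
    ratings: List[Rating],
    min_user_ratings: int = 2,  # hypothetical threshold
    min_book_ratings: int = 2,  # hypothetical threshold
) -> List[Rating]:
    """Drop 0 ratings, then drop inactive users and rarely rated books."""
    # A rating of 0 means "no rating", so it carries no preference signal.
    kept = [r for r in ratings if r[2] > 0]

    # Repeat until stable, since each removal changes the counts.
    while True:
        user_counts = Counter(user for user, _, _ in kept)
        book_counts = Counter(isbn for _, isbn, _ in kept)
        filtered = [
            (user, isbn, rating)
            for user, isbn, rating in kept
            if user_counts[user] >= min_user_ratings
            and book_counts[isbn] >= min_book_ratings
        ]
        if len(filtered) == len(kept):
            return filtered
        kept = filtered


# Hypothetical sample data, for illustration only
sample_ratings = [
    (1, "A", 8), (1, "B", 0), (1, "C", 7),
    (2, "A", 9), (2, "C", 6),
    (3, "A", 5),
    (4, "D", 10),
]
result = filter_ratings(sample_ratings)
print(result)
```

In this sample, users 3 and 4 have a single rating each and book `D` is rated only once, so only users 1 and 2 with books `A` and `C` remain, which gives every remaining pair of users two co-rated books.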