# Hybrid Recommendation System

**Team Structure:**
- Member 1: Infrastructure, Data Loading, Fusion & Evaluation
- Member 2: Collaborative Filtering (ALS)
- Member 3: Content-Based Filtering (TF-IDF + LSH)

## 1. Setup

### 1.1 Imports

In [1]:
import os
import sys
import urllib.request
import zipfile
from math import log2

# Fix for Windows
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType, FloatType, StructType, StructField

### 1.2 Download Data

In [2]:
DATA_URL = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"
DATA_DIR = "data"
DATASET_DIR = os.path.join(DATA_DIR, "ml-1m")
ZIP_PATH = os.path.join(DATA_DIR, "ml-1m.zip")

In [3]:
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

In [4]:
if not os.path.exists(DATASET_DIR):

    if not os.path.exists(ZIP_PATH):
        print("Downloading MovieLens ml-1m...")
        urllib.request.urlretrieve(DATA_URL, ZIP_PATH)

    print("Extracting...")
    with zipfile.ZipFile(ZIP_PATH, 'r') as zip_ref:
        zip_ref.extractall(DATA_DIR)

### 1.3 Spark Session

In [5]:
spark = SparkSession.builder.appName("MMDS") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.sql.shuffle.partitions", "20") \
    .config("spark.default.parallelism", "20") \
    .config("spark.driver.maxResultSize", "8g") \
    .config("spark.sql.autoBroadcastJoinThreshold", "400m") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/18 20:44:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### 1.4 Load Data

In [6]:
users_df = spark.read.text(os.path.join(DATASET_DIR, "users.dat")).select(
    F.split(F.col("value"), "::").getItem(0).cast(IntegerType()).alias("user_id"),
    F.split(F.col("value"), "::").getItem(1).alias("gender"),
    F.split(F.col("value"), "::").getItem(2).cast(IntegerType()).alias("age"),
    F.split(F.col("value"), "::").getItem(3).cast(IntegerType()).alias("occupation"),
    F.split(F.col("value"), "::").getItem(4).alias("zip_code")
)
users_df.count()

6040

In [7]:
items_df = spark.read.text(os.path.join(DATASET_DIR, "movies.dat")).select(
    F.split(F.col("value"), "::").getItem(0).cast(IntegerType()).alias("item_id"),
    F.split(F.col("value"), "::").getItem(1).alias("title"),
    F.split(F.col("value"), "::").getItem(2).alias("genres")
)
items_df.count()

3883

In [8]:
ratings_df = spark.read.text(os.path.join(DATASET_DIR, "ratings.dat")).select(
    F.split(F.col("value"), "::").getItem(0).cast(IntegerType()).alias("user_id"),
    F.split(F.col("value"), "::").getItem(1).cast(IntegerType()).alias("item_id"),
    F.split(F.col("value"), "::").getItem(2).cast(FloatType()).alias("rating"),
    F.split(F.col("value"), "::").getItem(3).cast(IntegerType()).alias("timestamp")
)
ratings_df.count()

1000209

In [9]:
users_df.show(5)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
|      3|     M| 25|        15|   55117|
|      4|     M| 45|         7|   02460|
|      5|     M| 25|        20|   55455|
+-------+------+---+----------+--------+
only showing top 5 rows


In [10]:
items_df.show(5)

+-------+--------------------+--------------------+
|item_id|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Animation|Childre...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|        Comedy|Drama|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows


In [11]:
ratings_df.show(5)

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|      1|   1193|   5.0|978300760|
|      1|    661|   3.0|978302109|
|      1|    914|   3.0|978301968|
|      1|   3408|   4.0|978300275|
|      1|   2355|   5.0|978824291|
+-------+-------+------+---------+
only showing top 5 rows


## 2. Exploratory Data Analysis

### 2.1 Rating matrix

In [12]:
num_users = users_df.count()
num_items = items_df.count()
num_ratings = ratings_df.count()
sparsity = (1 - (num_ratings / (num_users * num_items))) * 100

print(f"Users:            {num_users:,}")
print(f"Movies:           {num_items:,}")
print(f"Ratings:          {num_ratings:,}")
print(f"Sparsity:         {sparsity:.2f}%")
print(f"Avg ratings/user: {num_ratings/num_users:.1f}")
print(f"Avg ratings/movie:{num_ratings/num_items:.1f}")

Users:            6,040
Movies:           3,883
Ratings:          1,000,209
Sparsity:         95.74%
Avg ratings/user: 165.6
Avg ratings/movie:257.6


### 2.2 Rating distribution

In [13]:
ratings_df.groupBy("rating").count().orderBy("rating").show()

+------+------+
|rating| count|
+------+------+
|   1.0| 56174|
|   2.0|107557|
|   3.0|261197|
|   4.0|348971|
|   5.0|226310|
+------+------+



                                                                                

### 2.3 Genre distribution

In [14]:
items_df.select(F.explode(F.split(F.col("genres"), "\\|")).alias("genre")) \
    .groupBy("genre").count().orderBy(F.desc("count")).show()

+-----------+-----+
|      genre|count|
+-----------+-----+
|      Drama| 1603|
|     Comedy| 1200|
|     Action|  503|
|   Thriller|  492|
|    Romance|  471|
|     Horror|  343|
|  Adventure|  283|
|     Sci-Fi|  276|
| Children's|  251|
|      Crime|  211|
|        War|  143|
|Documentary|  127|
|    Musical|  114|
|    Mystery|  106|
|  Animation|  105|
|    Western|   68|
|    Fantasy|   68|
|  Film-Noir|   44|
+-----------+-----+



### 2.4 User gender distribution

In [15]:
users_df.groupBy("gender").count().show()

+------+-----+
|gender|count|
+------+-----+
|     F| 1709|
|     M| 4331|
+------+-----+



### 2.5 User age distribution

In [16]:
users_df.groupBy("age").count().orderBy("age").show()

+---+-----+
|age|count|
+---+-----+
|  1|  222|
| 18| 1103|
| 25| 2096|
| 35| 1193|
| 45|  550|
| 50|  496|
| 56|  380|
+---+-----+



## 3. Train/Test Split

In [17]:
def chronological_split(ratings_df, train_ratio=0.8, min_train_ratings=5):
    user_time_window = Window.partitionBy("user_id").orderBy("timestamp")
    user_count_window = Window.partitionBy("user_id")

    ratings_with_rank = ratings_df.withColumn(
        "row_num", F.row_number().over(user_time_window)
    ).withColumn(
        "user_total", F.count("*").over(user_count_window)
    ).withColumn(
        "train_threshold", F.floor(F.col("user_total") * train_ratio)
    )

    ratings_valid = ratings_with_rank.filter(
        F.col("train_threshold") >= min_train_ratings
    )

    ratings_labeled = ratings_valid.withColumn(
        "split",
        F.when(F.col("row_num") <= F.col("train_threshold"), "train").otherwise("test")
    )

    original_columns = ["user_id", "item_id", "rating", "timestamp"]
    train_df = ratings_labeled.filter(F.col("split") == "train").select(original_columns)
    test_df = ratings_labeled.filter(F.col("split") == "test").select(original_columns)

    return train_df, test_df

In [18]:
train_df, test_df = chronological_split(ratings_df, train_ratio=0.8, min_train_ratings=5)
train_df = train_df.cache()
test_df = test_df.cache()

In [19]:
print(f"Train: {train_df.count():,} ratings")
print(f"Test: {test_df.count():,} ratings")
print(f"Users in train: {train_df.select('user_id').distinct().count():,}")
print(f"Users in test: {test_df.select('user_id').distinct().count():,}")

                                                                                

Train: 797,758 ratings
Test: 202,451 ratings
Users in train: 6,040
Users in test: 6,040


## 4. Evaluation Code


### 4.1 Hyperparameters

In [20]:
K = 10
RELEVANCE_THRESHOLD = 4.0

### 4.2 Ground Truth

In [21]:
ground_truth = test_df.filter(F.col("rating") >= RELEVANCE_THRESHOLD) \
    .groupBy("user_id").agg(F.collect_list("item_id").alias("relevant_items"))

### 4.3 Functions

In [22]:
def get_top_k(recs_df, score_col, k):
    window = Window.partitionBy("user_id").orderBy(F.desc(score_col))
    ranked = recs_df.withColumn("rank", F.row_number().over(window)) \
        .filter(F.col("rank") <= k)

    return ranked.groupBy("user_id").agg(
        F.array_sort(F.collect_list(F.struct("rank", "item_id"))).alias("ranked_structs")
    ).withColumn(
        "recommended_items",
        F.expr("transform(ranked_structs, x -> x.item_id)")
    ).drop("ranked_structs")

In [23]:
def precision_at_k(top_k_df, ground_truth_df, k):
    joined = top_k_df.join(ground_truth_df, "user_id")
    result = joined.withColumn("hits", F.size(F.array_intersect("recommended_items", "relevant_items"))) \
        .agg(F.avg(F.col("hits") / k)).collect()[0][0]
    return result or 0.0

In [24]:
def recall_at_k(top_k_df, ground_truth_df):
    joined = top_k_df.join(ground_truth_df, "user_id")
    result = joined.withColumn("hits", F.size(F.array_intersect("recommended_items", "relevant_items"))) \
        .withColumn("recall", F.when(F.size("relevant_items") > 0, F.col("hits") / F.size("relevant_items")).otherwise(0)) \
        .agg(F.avg("recall")).collect()[0][0]
    return result or 0.0

In [25]:
def ndcg_at_k(top_k_df, ground_truth_df, k):
    joined = top_k_df.join(ground_truth_df, "user_id")
    exploded = joined.select(
        "user_id",
        "relevant_items",
        F.posexplode("recommended_items").alias("pos", "item_id")
    ).withColumn("item_id", F.col("item_id").cast(IntegerType()))

    with_dcg = exploded \
        .withColumn("rel", F.when(F.array_contains("relevant_items", F.col("item_id")), 1.0).otherwise(0.0)) \
        .withColumn("dcg", F.col("rel") / F.log2(F.col("pos") + 2)) \
        .groupBy("user_id", "relevant_items").agg(F.sum("dcg").alias("dcg"))

    idcg_vals = [sum(1.0 / log2(i + 2) for i in range(n)) for n in range(k + 1)]
    idcg_map = F.create_map(*[x for i, v in enumerate(idcg_vals) for x in (F.lit(i), F.lit(v))])

    result = with_dcg \
        .withColumn("num_rel", F.least(F.size("relevant_items"), F.lit(k))) \
        .withColumn("idcg", idcg_map[F.col("num_rel")]) \
        .withColumn("ndcg", F.when(F.col("idcg") > 0, F.col("dcg") / F.col("idcg")).otherwise(0)) \
        .agg(F.avg("ndcg")).collect()[0][0]
    return result or 0.0

In [26]:
def evaluate(recs_df, score_col, name):
    if recs_df.count() == 0:
        print(f"{name}: No recommendations (not implemented)")
        return {"Precision@10": 0.0, "Recall@10": 0.0, "NDCG@10": 0.0}

    top_k = get_top_k(recs_df, score_col, K)
    p = precision_at_k(top_k, ground_truth, K)
    r = recall_at_k(top_k, ground_truth)
    n = ndcg_at_k(top_k, ground_truth, K)

    print(f"{name}: P@{K}={p:.4f}, R@{K}={r:.4f}, NDCG@{K}={n:.4f}")
    return {"Precision@10": p, "Recall@10": r, "NDCG@10": n}

## 5. Collaborative Filtering (ALS)

Implement using `pyspark.ml.recommendation.ALS`

In [27]:
class CollaborativeFilter:

    def __init__(self, rank=10, regParam=0.1, maxIter=10):
        self.als = ALS(
            userCol="user_id",
            itemCol="item_id",
            ratingCol="rating",
            rank=rank,
            regParam=regParam,
            maxIter=maxIter,
            coldStartStrategy="drop",
            nonnegative=True
        )
        self.model = None

    def train(self, df):
        self.model = self.als.fit(df)

    def get_recommendations(self, df, k=10):
        """Get top-K recommendations"""
        if self.model is None:
            raise ValueError("Call train() first.")

        users = df.select("user_id").distinct()
        user_recs = self.model.recommendForUserSubset(users, k)

        return user_recs.select(
            F.col("user_id"),
            F.explode("recommendations").alias("rec")
        ).select(
            F.col("user_id"),
            F.col("rec.item_id").cast(IntegerType()).alias("item_id"),
            F.col("rec.rating").alias("prediction")
        )

    def predict(self, df):
        """Predict ratings for user-item pairs in test"""
        if self.model is None:
            raise ValueError("Train should be called first")
        return self.model.transform(df)

### Bonus: Hyperparameter Tuning

In [28]:
#
import time
from itertools import product

# --- Inner split for tuning (chronological, like your main split)
train_inner, val_df = chronological_split(train_df, train_ratio=0.9, min_train_ratings=5)

# Ground truth = relevant in our validation set
val_ground_truth = val_df.filter(F.col("rating") >= RELEVANCE_THRESHOLD) \
    .groupBy("user_id").agg(F.collect_list("item_id").alias("relevant_items"))

# Cache
train_inner.cache()
val_df.cache()
_ = train_inner.count()
_ = val_df.count()

def eval_als_on_val(rank, regParam, maxIter, k_candidates=100):
    """
    Train ALS on train_inner, recommend on val_df, compute ranking metrics
    vs val_ground_truth.
    """
    cf = CollaborativeFilter(rank=rank, regParam=regParam, maxIter=maxIter)

    t0 = time.perf_counter()
    cf.train(train_inner)
    train_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    recs = cf.get_recommendations(val_df, k=k_candidates).withColumnRenamed("prediction", "als_score").cache()
    _ = recs.count()  # materialize
    infer_s = time.perf_counter() - t0

    # metrics@K
    top_k = get_top_k(recs, "als_score", K)
    p = precision_at_k(top_k, val_ground_truth, K)
    r = recall_at_k(top_k, val_ground_truth)
    n = ndcg_at_k(top_k, val_ground_truth, K)

    return {
        "rank": rank, "regParam": regParam, "maxIter": maxIter,
        "P@K": p, "R@K": r, "NDCG@K": n,
        "train_s": train_s, "infer_s": infer_s,
    }

In [29]:
def find_best_als_params():
    ranks = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    regParams = [0.01, 0.05, 0.1, 0.2]
    maxIters = [10, 20, 30]

    als_tune_results = []
    als_best = None

    for rank, reg, iters in product(ranks, regParams, maxIters):
        out = eval_als_on_val(rank, reg, iters, k_candidates=100)
        als_tune_results.append(out)

        if als_best is None or out["NDCG@K"] > als_best["NDCG@K"]:
            als_best = out

        print(
            f"ALS tune rank={rank:>2}, reg={reg:<4}, iters={iters:<2} | "
            f"NDCG@{K}={out['NDCG@K']:.4f} P@{K}={out['P@K']:.4f} R@{K}={out['R@K']:.4f} | "
            f"train={out['train_s']:.2f}s infer={out['infer_s']:.2f}s"
        )

# re-running it is lenghty. Uncomment it on your own risk
# als_best = find_best_als_params()
als_best = {"rank": 80, "regParam": 0.05, "maxIter": 30}
print("\nBEST ALS PARAMS (by NDCG@K):")
print(als_best)


BEST ALS PARAMS (by NDCG@K):
{'rank': 80, 'regParam': 0.05, 'maxIter': 30}


<pre>

ALS tune rank=10, reg=0.01, iters=10 | NDCG@10=0.0005 P@10=0.0004 R@10=0.0006 | train=3.37s infer=4.49s


ALS tune rank=10, reg=0.01, iters=20 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0009 | train=2.53s infer=2.65s


ALS tune rank=10, reg=0.01, iters=30 | NDCG@10=0.0009 P@10=0.0010 R@10=0.0012 | train=3.96s infer=3.06s


ALS tune rank=10, reg=0.05, iters=10 | NDCG@10=0.0030 P@10=0.0028 R@10=0.0047 | train=2.17s infer=3.19s


ALS tune rank=10, reg=0.05, iters=20 | NDCG@10=0.0052 P@10=0.0047 R@10=0.0072 | train=2.67s infer=3.27s


ALS tune rank=10, reg=0.05, iters=30 | NDCG@10=0.0064 P@10=0.0057 R@10=0.0086 | train=3.36s infer=1.85s


ALS tune rank=10, reg=0.1 , iters=10 | NDCG@10=0.0037 P@10=0.0035 R@10=0.0062 | train=2.52s infer=2.71s


ALS tune rank=10, reg=0.1 , iters=20 | NDCG@10=0.0069 P@10=0.0063 R@10=0.0105 | train=2.58s infer=4.16s


ALS tune rank=10, reg=0.1 , iters=30 | NDCG@10=0.0078 P@10=0.0071 R@10=0.0116 | train=4.65s infer=3.38s


ALS tune rank=10, reg=0.2 , iters=10 | NDCG@10=0.0003 P@10=0.0003 R@10=0.0004 | train=1.57s infer=2.45s


ALS tune rank=10, reg=0.2 , iters=20 | NDCG@10=0.0006 P@10=0.0006 R@10=0.0011 | train=2.58s infer=2.73s


ALS tune rank=10, reg=0.2 , iters=30 | NDCG@10=0.0006 P@10=0.0007 R@10=0.0012 | train=4.23s infer=3.12s


ALS tune rank=20, reg=0.01, iters=10 | NDCG@10=0.0016 P@10=0.0017 R@10=0.0013 | train=1.97s infer=2.56s


ALS tune rank=20, reg=0.01, iters=20 | NDCG@10=0.0023 P@10=0.0023 R@10=0.0023 | train=4.28s infer=3.33s


ALS tune rank=20, reg=0.01, iters=30 | NDCG@10=0.0026 P@10=0.0027 R@10=0.0026 | train=5.39s infer=3.46s


ALS tune rank=20, reg=0.05, iters=10 | NDCG@10=0.0087 P@10=0.0069 R@10=0.0113 | train=2.35s infer=2.83s


ALS tune rank=20, reg=0.05, iters=20 | NDCG@10=0.0104 P@10=0.0083 R@10=0.0129 | train=3.65s infer=3.38s


ALS tune rank=20, reg=0.05, iters=30 | NDCG@10=0.0110 P@10=0.0086 R@10=0.0135 | train=4.86s infer=3.17s


ALS tune rank=20, reg=0.1 , iters=10 | NDCG@10=0.0068 P@10=0.0062 R@10=0.0105 | train=2.31s infer=2.67s


ALS tune rank=20, reg=0.1 , iters=20 | NDCG@10=0.0100 P@10=0.0087 R@10=0.0142 | train=2.86s infer=3.25s


ALS tune rank=20, reg=0.1 , iters=30 | NDCG@10=0.0112 P@10=0.0094 R@10=0.0151 | train=4.99s infer=3.13s


ALS tune rank=20, reg=0.2 , iters=10 | NDCG@10=0.0005 P@10=0.0005 R@10=0.0009 | train=2.32s infer=3.49s


ALS tune rank=20, reg=0.2 , iters=20 | NDCG@10=0.0006 P@10=0.0007 R@10=0.0012 | train=3.59s infer=3.01s


ALS tune rank=20, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=4.73s infer=3.05s


ALS tune rank=30, reg=0.01, iters=10 | NDCG@10=0.0033 P@10=0.0031 R@10=0.0027 | train=2.68s infer=2.63s


ALS tune rank=30, reg=0.01, iters=20 | NDCG@10=0.0042 P@10=0.0037 R@10=0.0041 | train=4.47s infer=2.66s


ALS tune rank=30, reg=0.01, iters=30 | NDCG@10=0.0044 P@10=0.0040 R@10=0.0041 | train=6.52s infer=2.93s


ALS tune rank=30, reg=0.05, iters=10 | NDCG@10=0.0118 P@10=0.0093 R@10=0.0145 | train=2.87s infer=2.09s


ALS tune rank=30, reg=0.05, iters=20 | NDCG@10=0.0142 P@10=0.0106 R@10=0.0168 | train=3.98s infer=2.56s


ALS tune rank=30, reg=0.05, iters=30 | NDCG@10=0.0143 P@10=0.0109 R@10=0.0165 | train=6.52s infer=2.99s


ALS tune rank=30, reg=0.1 , iters=10 | NDCG@10=0.0079 P@10=0.0069 R@10=0.0121 | train=2.67s infer=3.21s


ALS tune rank=30, reg=0.1 , iters=20 | NDCG@10=0.0111 P@10=0.0094 R@10=0.0155 | train=5.59s infer=3.68s


ALS tune rank=30, reg=0.1 , iters=30 | NDCG@10=0.0122 P@10=0.0103 R@10=0.0165 | train=6.32s infer=2.25s


ALS tune rank=30, reg=0.2 , iters=10 | NDCG@10=0.0004 P@10=0.0004 R@10=0.0006 | train=1.72s infer=1.92s


ALS tune rank=30, reg=0.2 , iters=20 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=3.25s infer=2.31s


ALS tune rank=30, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=5.43s infer=2.19s


ALS tune rank=40, reg=0.01, iters=10 | NDCG@10=0.0038 P@10=0.0037 R@10=0.0036 | train=2.41s infer=1.92s


ALS tune rank=40, reg=0.01, iters=20 | NDCG@10=0.0060 P@10=0.0054 R@10=0.0054 | train=5.72s infer=2.94s


ALS tune rank=40, reg=0.01, iters=30 | NDCG@10=0.0060 P@10=0.0053 R@10=0.0058 | train=8.32s infer=3.01s


ALS tune rank=40, reg=0.05, iters=10 | NDCG@10=0.0143 P@10=0.0109 R@10=0.0164 | train=3.48s infer=3.22s


ALS tune rank=40, reg=0.05, iters=20 | NDCG@10=0.0153 P@10=0.0115 R@10=0.0176 | train=4.94s infer=2.68s


ALS tune rank=40, reg=0.05, iters=30 | NDCG@10=0.0158 P@10=0.0118 R@10=0.0181 | train=7.69s infer=2.96s


ALS tune rank=40, reg=0.1 , iters=10 | NDCG@10=0.0087 P@10=0.0073 R@10=0.0130 | train=2.94s infer=3.14s


ALS tune rank=40, reg=0.1 , iters=20 | NDCG@10=0.0120 P@10=0.0099 R@10=0.0156 | train=6.40s infer=3.53s


ALS tune rank=40, reg=0.1 , iters=30 | NDCG@10=0.0128 P@10=0.0104 R@10=0.0165 | train=7.70s infer=3.80s


ALS tune rank=40, reg=0.2 , iters=10 | NDCG@10=0.0005 P@10=0.0004 R@10=0.0009 | train=3.05s infer=3.30s


ALS tune rank=40, reg=0.2 , iters=20 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=4.60s infer=3.00s


ALS tune rank=40, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=7.12s infer=3.37s


ALS tune rank=50, reg=0.01, iters=10 | NDCG@10=0.0052 P@10=0.0046 R@10=0.0046 | train=4.07s infer=3.17s


ALS tune rank=50, reg=0.01, iters=20 | NDCG@10=0.0066 P@10=0.0059 R@10=0.0065 | train=6.87s infer=3.06s


ALS tune rank=50, reg=0.01, iters=30 | NDCG@10=0.0075 P@10=0.0062 R@10=0.0075 | train=10.16s infer=2.83s


ALS tune rank=50, reg=0.05, iters=10 | NDCG@10=0.0147 P@10=0.0108 R@10=0.0179 | train=3.73s infer=3.33s


ALS tune rank=50, reg=0.05, iters=20 | NDCG@10=0.0164 P@10=0.0119 R@10=0.0190 | train=6.03s infer=3.08s


ALS tune rank=50, reg=0.05, iters=30 | NDCG@10=0.0171 P@10=0.0122 R@10=0.0196 | train=9.21s infer=3.05s


ALS tune rank=50, reg=0.1 , iters=10 | NDCG@10=0.0091 P@10=0.0078 R@10=0.0134 | train=3.48s infer=2.95s


ALS tune rank=50, reg=0.1 , iters=20 | NDCG@10=0.0125 P@10=0.0102 R@10=0.0165 | train=5.85s infer=3.33s


ALS tune rank=50, reg=0.1 , iters=30 | NDCG@10=0.0137 P@10=0.0110 R@10=0.0176 | train=8.61s infer=3.15s


ALS tune rank=50, reg=0.2 , iters=10 | NDCG@10=0.0004 P@10=0.0004 R@10=0.0009 | train=3.22s infer=2.97s


ALS tune rank=50, reg=0.2 , iters=20 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=5.47s infer=2.94s


ALS tune rank=50, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=8.05s infer=3.44s


ALS tune rank=60, reg=0.01, iters=10 | NDCG@10=0.0054 P@10=0.0048 R@10=0.0050 | train=4.87s infer=3.02s


ALS tune rank=60, reg=0.01, iters=20 | NDCG@10=0.0081 P@10=0.0067 R@10=0.0082 | train=8.53s infer=2.93s


ALS tune rank=60, reg=0.01, iters=30 | NDCG@10=0.0089 P@10=0.0070 R@10=0.0093 | train=11.79s infer=3.30s


ALS tune rank=60, reg=0.05, iters=10 | NDCG@10=0.0142 P@10=0.0105 R@10=0.0167 | train=4.28s infer=2.89s


ALS tune rank=60, reg=0.05, iters=20 | NDCG@10=0.0153 P@10=0.0114 R@10=0.0178 | train=7.22s infer=3.27s


ALS tune rank=60, reg=0.05, iters=30 | NDCG@10=0.0161 P@10=0.0119 R@10=0.0184 | train=11.17s infer=2.83s


ALS tune rank=60, reg=0.1 , iters=10 | NDCG@10=0.0093 P@10=0.0080 R@10=0.0137 | train=4.05s infer=2.75s


ALS tune rank=60, reg=0.1 , iters=20 | NDCG@10=0.0123 P@10=0.0103 R@10=0.0165 | train=6.97s infer=2.77s


ALS tune rank=60, reg=0.1 , iters=30 | NDCG@10=0.0132 P@10=0.0110 R@10=0.0176 | train=9.83s infer=3.43s


ALS tune rank=60, reg=0.2 , iters=10 | NDCG@10=0.0004 P@10=0.0004 R@10=0.0008 | train=3.82s infer=2.69s


ALS tune rank=60, reg=0.2 , iters=20 | NDCG@10=0.0006 P@10=0.0007 R@10=0.0012 | train=6.64s infer=3.66s


ALS tune rank=60, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=9.28s infer=2.80s


ALS tune rank=70, reg=0.01, iters=10 | NDCG@10=0.0077 P@10=0.0062 R@10=0.0074 | train=5.44s infer=2.72s


ALS tune rank=70, reg=0.01, iters=20 | NDCG@10=0.0110 P@10=0.0085 R@10=0.0120 | train=10.04s infer=2.61s


ALS tune rank=70, reg=0.01, iters=30 | NDCG@10=0.0116 P@10=0.0087 R@10=0.0114 | train=14.41s infer=3.16s


ALS tune rank=70, reg=0.05, iters=10 | NDCG@10=0.0145 P@10=0.0107 R@10=0.0177 | train=4.86s infer=3.01s


ALS tune rank=70, reg=0.05, iters=20 | NDCG@10=0.0164 P@10=0.0121 R@10=0.0194 | train=8.96s infer=2.81s


ALS tune rank=70, reg=0.05, iters=30 | NDCG@10=0.0167 P@10=0.0126 R@10=0.0197 | train=12.37s infer=3.05s


ALS tune rank=70, reg=0.1 , iters=10 | NDCG@10=0.0092 P@10=0.0078 R@10=0.0139 | train=4.48s infer=3.35s


ALS tune rank=70, reg=0.1 , iters=20 | NDCG@10=0.0122 P@10=0.0100 R@10=0.0166 | train=8.17s infer=2.81s


ALS tune rank=70, reg=0.1 , iters=30 | NDCG@10=0.0134 P@10=0.0108 R@10=0.0175 | train=11.81s infer=3.21s


ALS tune rank=70, reg=0.2 , iters=10 | NDCG@10=0.0004 P@10=0.0004 R@10=0.0009 | train=4.55s infer=3.38s


ALS tune rank=70, reg=0.2 , iters=20 | NDCG@10=0.0006 P@10=0.0007 R@10=0.0012 | train=7.64s infer=2.87s


ALS tune rank=70, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=10.78s infer=3.36s


ALS tune rank=80, reg=0.01, iters=10 | NDCG@10=0.0083 P@10=0.0065 R@10=0.0084 | train=6.59s infer=2.85s


ALS tune rank=80, reg=0.01, iters=20 | NDCG@10=0.0115 P@10=0.0088 R@10=0.0120 | train=12.08s infer=2.76s


ALS tune rank=80, reg=0.01, iters=30 | NDCG@10=0.0125 P@10=0.0092 R@10=0.0127 | train=17.78s infer=3.15s


ALS tune rank=80, reg=0.05, iters=10 | NDCG@10=0.0156 P@10=0.0114 R@10=0.0173 | train=5.73s infer=3.61s


ALS tune rank=80, reg=0.05, iters=20 | NDCG@10=0.0174 P@10=0.0127 R@10=0.0200 | train=11.86s infer=2.78s


ALS tune rank=80, reg=0.05, iters=30 | NDCG@10=0.0178 P@10=0.0129 R@10=0.0209 | train=15.38s infer=3.03s


ALS tune rank=80, reg=0.1 , iters=10 | NDCG@10=0.0095 P@10=0.0081 R@10=0.0140 | train=5.24s infer=2.92s


ALS tune rank=80, reg=0.1 , iters=20 | NDCG@10=0.0131 P@10=0.0108 R@10=0.0175 | train=9.90s infer=2.62s


ALS tune rank=80, reg=0.1 , iters=30 | NDCG@10=0.0147 P@10=0.0117 R@10=0.0186 | train=14.16s infer=2.98s


ALS tune rank=80, reg=0.2 , iters=10 | NDCG@10=0.0005 P@10=0.0005 R@10=0.0009 | train=5.03s infer=2.83s


ALS tune rank=80, reg=0.2 , iters=20 | NDCG@10=0.0006 P@10=0.0007 R@10=0.0012 | train=9.10s infer=2.82s


ALS tune rank=80, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=12.87s infer=3.07s


ALS tune rank=90, reg=0.01, iters=10 | NDCG@10=0.0088 P@10=0.0070 R@10=0.0083 | train=7.82s infer=2.76s


ALS tune rank=90, reg=0.01, iters=20 | NDCG@10=0.0122 P@10=0.0097 R@10=0.0121 | train=14.84s infer=2.74s


ALS tune rank=90, reg=0.01, iters=30 | NDCG@10=0.0130 P@10=0.0103 R@10=0.0138 | train=21.76s infer=2.82s


ALS tune rank=90, reg=0.05, iters=10 | NDCG@10=0.0155 P@10=0.0114 R@10=0.0182 | train=7.10s infer=3.65s


ALS tune rank=90, reg=0.05, iters=20 | NDCG@10=0.0167 P@10=0.0122 R@10=0.0192 | train=12.25s infer=3.18s


ALS tune rank=90, reg=0.05, iters=30 | NDCG@10=0.0167 P@10=0.0123 R@10=0.0191 | train=18.04s infer=3.34s


ALS tune rank=90, reg=0.1 , iters=10 | NDCG@10=0.0096 P@10=0.0081 R@10=0.0142 | train=6.38s infer=3.06s


ALS tune rank=90, reg=0.1 , iters=20 | NDCG@10=0.0134 P@10=0.0108 R@10=0.0176 | train=11.58s infer=3.04s


ALS tune rank=90, reg=0.1 , iters=30 | NDCG@10=0.0143 P@10=0.0113 R@10=0.0184 | train=18.57s infer=3.46s


ALS tune rank=90, reg=0.2 , iters=10 | NDCG@10=0.0005 P@10=0.0005 R@10=0.0009 | train=6.03s infer=2.95s


ALS tune rank=90, reg=0.2 , iters=20 | NDCG@10=0.0006 P@10=0.0007 R@10=0.0012 | train=11.30s infer=3.32s


ALS tune rank=90, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=15.87s infer=3.73s


ALS tune rank=100, reg=0.01, iters=10 | NDCG@10=0.0103 P@10=0.0082 R@10=0.0100 | train=11.73s infer=3.33s


ALS tune rank=100, reg=0.01, iters=20 | NDCG@10=0.0136 P@10=0.0103 R@10=0.0140 | train=19.53s infer=2.88s


ALS tune rank=100, reg=0.01, iters=30 | NDCG@10=0.0143 P@10=0.0110 R@10=0.0141 | train=28.47s infer=3.02s


ALS tune rank=100, reg=0.05, iters=10 | NDCG@10=0.0154 P@10=0.0115 R@10=0.0173 | train=7.77s infer=2.66s


ALS tune rank=100, reg=0.05, iters=20 | NDCG@10=0.0168 P@10=0.0121 R@10=0.0189 | train=14.37s infer=2.89s


ALS tune rank=100, reg=0.05, iters=30 | NDCG@10=0.0171 P@10=0.0123 R@10=0.0192 | train=21.23s infer=2.94s


ALS tune rank=100, reg=0.1 , iters=10 | NDCG@10=0.0101 P@10=0.0084 R@10=0.0144 | train=7.34s infer=3.04s


ALS tune rank=100, reg=0.1 , iters=20 | NDCG@10=0.0135 P@10=0.0108 R@10=0.0177 | train=14.60s infer=3.02s


ALS tune rank=100, reg=0.1 , iters=30 | NDCG@10=0.0145 P@10=0.0115 R@10=0.0188 | train=20.25s infer=2.86s


ALS tune rank=100, reg=0.2 , iters=10 | NDCG@10=0.0005 P@10=0.0005 R@10=0.0009 | train=6.53s infer=2.97s


ALS tune rank=100, reg=0.2 , iters=20 | NDCG@10=0.0006 P@10=0.0007 R@10=0.0012 | train=12.13s infer=2.91s


ALS tune rank=100, reg=0.2 , iters=30 | NDCG@10=0.0007 P@10=0.0007 R@10=0.0012 | train=17.06s infer=2.52s

BEST ALS PARAMS (by NDCG@K):
{'rank': 80, 'regParam': 0.05, 'maxIter': 30, 'P@K': 0.01288811795316568, 'R@K': 0.02094471027340997, 'NDCG@K': 0.01782000260548794, 'train_s': 15.377774499997031, 'infer_s': 3.02562895801384}
l
</pre>

In [30]:
# Retrain final ALS on full train_df with the best params
cf = CollaborativeFilter(rank=als_best["rank"], regParam=als_best["regParam"], maxIter=als_best["maxIter"])

t0 = time.perf_counter()
cf.train(train_df)
als_train_s = time.perf_counter() - t0

t0 = time.perf_counter()
als_recs = cf.get_recommendations(test_df, k=100).withColumnRenamed("prediction", "als_score").cache()
als_num_recs = als_recs.count()
als_infer_s = time.perf_counter() - t0

print(f"\nFINAL ALS train time: {als_train_s:.2f}s")
print(f"FINAL ALS inference time: {als_infer_s:.2f}s | rec rows: {als_num_recs:,}")

26/01/18 20:44:45 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


FINAL ALS train time: 20.46s
FINAL ALS inference time: 4.39s | rec rows: 604,000


                                                                                

In [31]:
als_recs.show(10)

+-------+-------+---------+
|user_id|item_id|als_score|
+-------+-------+---------+
|     95|    260|4.6372614|
|     95|   1198|4.6299286|
|     95|   1250|4.5401225|
|     95|    527|4.4914494|
|     95|    953|4.4664316|
|     95|    318| 4.438609|
|     95|   1197| 4.424387|
|     95|   2028| 4.410882|
|     95|    110|4.3907533|
|     95|   1304| 4.390076|
+-------+-------+---------+
only showing top 10 rows


## 6. Content-Based Filtering (TF-IDF + LSH)

Implement using `pyspark.ml.feature` (Tokenizer, HashingTF, IDF, BucketedRandomProjectionLSH)

In [32]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, NGram, MinHashLSH
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import (
    col, split, concat_ws, regexp_extract, udf, array_intersect, 
    size, sum as _sum, desc, row_number
)

class ContentBasedFilter:
    def __init__(self, num_features=5000, bigram_features=3000, num_hash_tables=10):
        self.num_features = num_features
        self.bigram_features = bigram_features
        self.num_hash_tables = num_hash_tables
        self.jaccard_threshold = 0.1
        self.genre_weight = 4
        self.min_genre_overlap = 1
        

        self.lsh_model = None
        self.vector_model = None
        self.movies_binary = None
        self.similar_pairs = None

    def combine_binarize_udf(self):
        def process_vectors(v1, v2):
            if v1 is None: 
                return None
            indices = [int(i) for i in v1.indices]
            
            if v2 is not None:
                offset = int(v1.size)
                indices += [int(i) + offset for i in v2.indices]
                total_size = offset + int(v2.size)
            else:
                total_size = int(v1.size)
            
            values = [1.0] * len(indices)
            return Vectors.sparse(total_size, sorted(indices), values)
            
        return udf(process_vectors, VectorUDT())

    def train_features(self, items_df):
        print("building content features (tfidf + bigrams)")
        
        df = (
            items_df
            .withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)", 1))
            .withColumn("decade", regexp_extract(col("title"), r"\((\d{3})\d\)", 1))
            .withColumn("genres_spaced", F.regexp_replace(col("genres"), r"\|", " "))
            .withColumn("content", concat_ws(
                " ", 
                col("title"),
                col("genres_spaced"), col("genres_spaced"), col("genres_spaced"), col("genres_spaced"),
                col("year"), col("decade")
            ))
        )

        stages = []

        stages += [
            Tokenizer(inputCol="content", outputCol="raw_words"),
            StopWordsRemover(inputCol="raw_words", outputCol="words")
        ]
        
        stages += [
            HashingTF(inputCol="words", outputCol="tf_uni", numFeatures=self.num_features),
            IDF(inputCol="tf_uni", outputCol="tfidf_uni", minDocFreq=1)
        ]
        
        stages += [
            NGram(n=2, inputCol="words", outputCol="bigrams"),
            HashingTF(inputCol="bigrams", outputCol="tf_bi", numFeatures=self.bigram_features),
            IDF(inputCol="tf_bi", outputCol="tfidf_bi", minDocFreq=1)
        ]
        
        pipeline = Pipeline(stages=stages)
        self.vector_model = pipeline.fit(df)
        features_df = self.vector_model.transform(df)
        
        combiner = self.combine_binarize_udf()
        self.movies_binary = (
            features_df
            .withColumn("binary_features", combiner("tfidf_uni", "tfidf_bi"))
            .select("item_id", "title", "genres", "binary_features")
            .cache()
        )
        
        print(f"checked {self.movies_binary.count()} movies")

    def build_lsh_index(self):
        print("indexing with minhash LSH")
        
        mh = MinHashLSH(
            inputCol="binary_features", 
            outputCol="hashes", 
            numHashTables=self.num_hash_tables,
            seed=42
        )
        self.lsh_model = mh.fit(self.movies_binary)
        
        dist_threshold = 1.0 - self.jaccard_threshold
        
        print(f"cmputing similarity graph")
        raw_pairs = self.lsh_model.approxSimilarityJoin(
            self.movies_binary, self.movies_binary, 
            threshold=dist_threshold, 
            distCol="jaccard_dist"
        )
        
        pairs = raw_pairs.select(
            col("datasetA.item_id").alias("item_a"),
            col("datasetB.item_id").alias("item_b"),
            (1.0 - col("jaccard_dist")).alias("similarity")
        ).filter("item_a != item_b")
        
        self.similar_pairs = pairs.cache()
        print(f"indexed {self.similar_pairs.count()} similar item pairs")

    def recommend_for_users(self, train_df, items_df, k=10):
        print("generating content-based recommendations")
        
        user_history = train_df.filter(col("rating") >= 4.0).select(
            col("user_id"), col("item_id").alias("seed_item"), col("rating")
        )

        candidates = user_history.join(
            self.similar_pairs,
            user_history.seed_item == self.similar_pairs.item_a
        )
        
        genre_df = items_df.select("item_id", split(col("genres"), r"\|").alias("genres_arr"))        
        
        candidates_enriched = (
            candidates.alias("c")
            .join(
                genre_df.alias("seed_g"), 
                col("c.seed_item") == col("seed_g.item_id")
            )
            .join(
                genre_df.alias("cand_g"), 
                col("c.item_b") == col("cand_g.item_id")
            )
            .select(
                "c.user_id", 
                col("c.item_b").alias("candidate_item"),
                "c.rating", 
                "c.similarity",
                size(array_intersect(
                    col("seed_g.genres_arr"), 
                    col("cand_g.genres_arr")
                )).alias("genre_overlap")
            )
        )

        filtered = candidates_enriched.filter(col("genre_overlap") >= self.min_genre_overlap)

        scored = filtered.withColumn(
            "score", 
            col("rating") * col("similarity") * (1.0 + 0.25 * col("genre_overlap"))
        )

        recs = scored.groupBy("user_id", "candidate_item").agg(_sum("score").alias("content_score"))
        
        seen_items = train_df.select("user_id", "item_id").distinct().alias("seen")
        recs_alias = recs.alias("recs")
        
        final_recs = recs_alias.join(
            seen_items,
            (col("recs.user_id") == col("seen.user_id")) & 
            (col("recs.candidate_item") == col("seen.item_id")),
            "left_anti"
        ).select(
            col("recs.user_id"), 
            col("recs.candidate_item").alias("item_id"), 
            col("recs.content_score")
        )

        window = Window.partitionBy("user_id").orderBy(desc("content_score"))
        return (
            final_recs
            .withColumn("rank", row_number().over(window))
            .filter(col("rank") <= k)
            .drop("rank")
        )

cb_filter = ContentBasedFilter()
cb_filter.train_features(items_df)
cb_filter.build_lsh_index()


building content features (tfidf + bigrams)


                                                                                

checked 3883 movies
indexing with minhash LSH
cmputing similarity graph




indexed 988212 similar item pairs


                                                                                

In [33]:
content_recs = cb_filter.recommend_for_users(train_df, items_df, k=100)

generating content-based recommendations


In [34]:
content_recs = content_recs.cache()
content_recs.show(5)



+-------+-------+------------------+
|user_id|item_id|     content_score|
+-------+-------+------------------+
|     95|   1744| 22.83967523704366|
|     95|   1591|22.473956043956044|
|     95|    849|21.873400389932648|
|     95|   2334|21.532246786394385|
|     95|   2058| 21.53224678639438|
+-------+-------+------------------+
only showing top 5 rows


                                                                                

## 7.Fusion & Evaluation

In [35]:
ALPHA = 0.7

### 7.1 Normalization

In [36]:
def normalize(df, col_name):
    stats = df.agg(F.min(col_name).alias("min"), F.max(col_name).alias("max")).collect()[0]
    if stats["max"] == stats["min"]:
        return df.withColumn(col_name + "_norm", F.lit(0.5))
    return df.withColumn(col_name + "_norm", (F.col(col_name) - stats["min"]) / (stats["max"] - stats["min"]))

### 7.2 Hybrid Fusion

In [37]:
als_norm = normalize(als_recs, "als_score")
content_norm = normalize(content_recs, "content_score")

hybrid_recs = als_norm.select("user_id", "item_id", "als_score_norm") \
    .join(content_norm.select("user_id", "item_id", "content_score_norm"), ["user_id", "item_id"], "full_outer") \
    .fillna(0) \
    .withColumn("final_score", ALPHA * F.col("als_score_norm") + (1 - ALPHA) * F.col("content_score_norm"))

In [38]:
hybrid_recs.orderBy(F.desc("final_score")).show(10)

+-------+-------+------------------+-------------------+------------------+
|user_id|item_id|    als_score_norm| content_score_norm|       final_score|
+-------+-------+------------------+-------------------+------------------+
|   4277|     53|0.7813458212323617|  0.590465080351217|0.7240815989680184|
|   2441|   3851|               1.0|                0.0|               0.7|
|   5246|     69|0.9988184773816583|                0.0|0.6991729341671608|
|   1812|   2305|0.8879036352052065|0.24769294234208108|0.6958404273462688|
|   1200|   2964|0.9780825664404157|0.03430897380639135|0.6949504886502084|
|    210|   2571|0.9030695940179349| 0.2009174378415168|0.6924239471650094|
|   4277|   2579|0.7558030874462422| 0.5430159889973912|0.6919669579115869|
|   3539|    635|0.7296950332243268| 0.6024709281442315|0.6915278017002983|
|   1111|   3847|0.9842770556661816|                0.0|0.6889939389663271|
|   3032|   2332|0.9836119100891428|                0.0|0.6885283370623999|
+-------+---

In [39]:
hybrid_recs.orderBy(F.desc("final_score")).filter((F.col("als_score_norm") > 0.2) & (F.col("content_score_norm") > 0.2)).show(10)

+-------+-------+------------------+-------------------+------------------+
|user_id|item_id|    als_score_norm| content_score_norm|       final_score|
+-------+-------+------------------+-------------------+------------------+
|   4277|     53|0.7813458212323617|  0.590465080351217|0.7240815989680184|
|   1812|   2305|0.8879036352052065|0.24769294234208108|0.6958404273462688|
|    210|   2571|0.9030695940179349| 0.2009174378415168|0.6924239471650094|
|   4277|   2579|0.7558030874462422| 0.5430159889973912|0.6919669579115869|
|   3539|    635|0.7296950332243268| 0.6024709281442315|0.6915278017002983|
|   4086|   3159|  0.89033927173392| 0.2142719162945873|0.6875190651021201|
|   4028|   2858|0.8366232663135118| 0.3306435727439082|0.6848293582426307|
|   1448|     53|0.7366513918047629| 0.5233646290001703|0.6726653629633851|
|    195|     53|0.7981084007992989|0.36953719846743455|0.6695370400997396|
|   1835|     53|0.7863732285641016|   0.39215405067471|0.6681074751972841|
+-------+---

In [40]:
als_pairs = als_recs.select("user_id", "item_id").distinct()
content_pairs = content_recs.select("user_id", "item_id").distinct()

overlap = als_pairs.intersect(content_pairs).count()
print(f"als pairs: {als_pairs.count()}")
print(f"content pairs: {content_pairs.count()}")
print(f"overlap: {overlap}")

als pairs: 604000
content pairs: 603761
overlap: 12453


### 7.3 Evaluation

In [41]:
als_metrics = evaluate(als_recs, "als_score", "ALS")

ALS: P@10=0.0281, R@10=0.0205, NDCG@10=0.0307


In [42]:
content_metrics = evaluate(content_recs, "content_score", "Content-Based")

Content-Based: P@10=0.0150, R@10=0.0167, NDCG@10=0.0188


In [43]:
hybrid_metrics = evaluate(hybrid_recs, "final_score", "Hybrid")

Hybrid: P@10=0.0292, R@10=0.0215, NDCG@10=0.0324


### Bonus: GBT Re-Ranking

In [44]:
# TODO

## Results Summary

In [45]:
summary = [
    ("ALS", als_metrics["Precision@10"], als_metrics["Recall@10"], als_metrics["NDCG@10"]),
    ("Content-Based", content_metrics["Precision@10"], content_metrics["Recall@10"], content_metrics["NDCG@10"]),
    ("Hybrid", hybrid_metrics["Precision@10"], hybrid_metrics["Recall@10"], hybrid_metrics["NDCG@10"]),
]
spark.createDataFrame(summary, ["Model", "Precision@10", "Recall@10", "NDCG@10"]).show()

                                                                                

+-------------+--------------------+--------------------+--------------------+
|        Model|        Precision@10|           Recall@10|             NDCG@10|
+-------------+--------------------+--------------------+--------------------+
|          ALS|0.028066945606694073|0.020494325036241722|0.030663505538513897|
|Content-Based| 0.01497907949790795|0.016735285093672912| 0.01879822686955161|
|       Hybrid|0.029205020920502162| 0.02149253217102322| 0.03244395940155893|
+-------------+--------------------+--------------------+--------------------+



In [46]:
# Sample User Recommendations Visualization
import random

# Pick a random user from test set
test_users = test_df.select("user_id").distinct().collect()
sample_user_id = random.choice(test_users)["user_id"]
print(f"Sample User ID: {sample_user_id}\n")

# Get user's highly-rated movies from training set (rating >= 4)
user_liked = train_df.filter(
    (F.col("user_id") == sample_user_id) & (F.col("rating") >= 4.0)
).join(items_df, "item_id").select("title", "genres", "rating").orderBy(F.desc("rating"))

print("=== Movies rated highly by this user (training data) ===")
user_liked.show(10, truncate=False)

# Get hybrid recommendations for this user
user_recs = hybrid_recs.filter(F.col("user_id") == sample_user_id) \
    .join(items_df, "item_id") \
    .select("title", "genres", "final_score") \
    .orderBy(F.desc("final_score"))

print("=== Hybrid model recommendations ===")
user_recs.show(10, truncate=False)

Sample User ID: 4598

=== Movies rated highly by this user (training data) ===
+--------------------------------------+---------------------------+------+
|title                                 |genres                     |rating|
+--------------------------------------+---------------------------+------+
|Anatomy of a Murder (1959)            |Drama|Mystery              |5.0   |
|Best Years of Our Lives, The (1946)   |Drama|War                  |5.0   |
|Silence of the Lambs, The (1991)      |Drama|Thriller             |5.0   |
|Rosewood (1997)                       |Drama                      |5.0   |
|Bridge on the River Kwai, The (1957)  |Drama|War                  |5.0   |
|Babe (1995)                           |Children's|Comedy|Drama    |5.0   |
|Pulp Fiction (1994)                   |Crime|Drama                |5.0   |
|Toy Story (1995)                      |Animation|Children's|Comedy|5.0   |
|One Flew Over the Cuckoo's Nest (1975)|Drama                      |5.0   |
|North by

In [47]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import Evaluator
from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import time

class ContentBasedEstimator(Estimator, DefaultParamsReadable, DefaultParamsWritable):
    numFeatures = Param(Params._dummy(), "numFeatures", "num of tf-idf features", TypeConverters.toInt)
    bigramFeatures = Param(Params._dummy(), "bigramFeatures", "num of bigram features", TypeConverters.toInt)
    numHashTables = Param(Params._dummy(), "numHashTables", "num of LSH hash tables", TypeConverters.toInt)
    jaccardThreshold = Param(Params._dummy(), "jaccardThreshold", "jaccard similarity threshold", TypeConverters.toFloat)
    minGenreOverlap = Param(Params._dummy(), "minGenreOverlap", "minimum genre overlap", TypeConverters.toInt)
    
    def __init__(self, numFeatures=5000, bigramFeatures=3000, 
                numHashTables=10, jaccardThreshold=0.1, minGenreOverlap=1):
        super(ContentBasedEstimator, self).__init__()
        self._setDefault(numFeatures=numFeatures, 
                        bigramFeatures=bigramFeatures, 
                        numHashTables=numHashTables,
                        jaccardThreshold=jaccardThreshold,
                        minGenreOverlap=minGenreOverlap)
        
        self._set(numFeatures=numFeatures, 
                bigramFeatures=bigramFeatures, 
                numHashTables=numHashTables,
                jaccardThreshold=jaccardThreshold, 
                minGenreOverlap=minGenreOverlap)
    
    def _fit(self, dataset):
        cb = ContentBasedFilter(
            num_features=self.getOrDefault(self.numFeatures),
            bigram_features=self.getOrDefault(self.bigramFeatures),
            num_hash_tables=self.getOrDefault(self.numHashTables)
        )
        cb.jaccard_threshold = self.getOrDefault(self.jaccardThreshold)
        cb.min_genre_overlap = self.getOrDefault(self.minGenreOverlap)
        cb.train_features(items_df)
        cb.build_lsh_index()
        return ContentBasedModel(cb)

class ContentBasedModel(Model, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, cb_filter=None):
        super(ContentBasedModel, self).__init__()
        self.cb_filter = cb_filter
    
    def _transform(self, dataset):
        if self.cb_filter is None:
            raise ValueError("model not fitted")
        return self.cb_filter.recommend_for_users(dataset, items_df, k=100)

class RecSysEvaluator(Evaluator):
    def __init__(self, test_df, ground_truth, k=10):
        super(RecSysEvaluator, self).__init__()
        self.test_df = test_df
        self.ground_truth = ground_truth
        self.k = k
    
    def _evaluate(self, dataset):
        top_k = get_top_k(dataset, "content_score", self.k)
        return ndcg_at_k(top_k, self.ground_truth, self.k)
    
    def isLargerBetter(self):
        return True

paramGrid = ParamGridBuilder() \
    .addGrid(ContentBasedEstimator.numFeatures, [3000, 5000, 7000]) \
    .addGrid(ContentBasedEstimator.bigramFeatures, [2000, 3000]) \
    .addGrid(ContentBasedEstimator.numHashTables, [10, 20, 30]) \
    .addGrid(ContentBasedEstimator.jaccardThreshold, [0.1, 0.2]) \
    .addGrid(ContentBasedEstimator.minGenreOverlap, [1, 2]) \
    .build()

estimator = ContentBasedEstimator()
evaluator = RecSysEvaluator(test_df, ground_truth, k=10)
items_df.cache()
train_df.cache()
cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, 
                    evaluator=evaluator, numFolds=2, parallelism=10)

start = time.time()
cvModel = cv.fit(train_df)
train_time = time.time() - start

best_model = cvModel.bestModel
print(f"Training time: {train_time:.2f}s")
print(f"Best params: numFeatures={best_model.cb_filter.num_features}, bigramFeatures={best_model.cb_filter.bigram_features}, numHashTables={best_model.cb_filter.num_hash_tables}, jaccardThreshold={best_model.cb_filter.jaccard_threshold:.3f}, minGenreOverlap={best_model.cb_filter.min_genre_overlap}")

start = time.time()
tuned_recs = best_model.transform(train_df).cache()
inference_time = time.time() - start

tuned_metrics = evaluate(tuned_recs, "content_score", "Tuned Content-Based")
print(f"Inference time: {inference_time:.2f}s")

building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)


26/01/18 20:45:58 WARN CacheManager: Asked to cache already cached data.
                                                                                

checked 3883 movies
indexing with minhash LSH
cmputing similarity graph


                                                                                

checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph


                                                                                

indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations


[Stage 3595:(0 + 10) / 20][Stage 3597:>(0 + 0) / 20][Stage 3599:>(0 + 0) / 20]

KeyboardInterrupt: 

<pre>
26/01/17 19:33:09 WARN CacheManager: Asked to cache already cached data.
26/01/17 19:33:09 WARN CacheManager: Asked to cache already cached data.
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
checked 3883 moviesgenerating content-based recommendations
indexed 988212 similar item pairs

indexing with minhash LSH
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
checked 3883 movies
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendationsgenerating content-based recommendations
generating content-based recommendations
generating content-based recommendations

generating content-based recommendations
indexing with minhash LSH
generating content-based recommendations
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
indexed 988212 similar item pairsindexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations

generating content-based recommendations
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 moviesindexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
checked 3883 movies
indexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
indexing with minhash LSH
indexing with minhash LSH
cmputing similarity graphcmputing similarity graph

                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
generating content-based recommendations
checked 3883 movies
indexing with minhash LSH
generating content-based recommendations

generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
...
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
building content features (tfidf + bigrams)
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
cmputing similarity graph
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
                                                                                
building content features (tfidf + bigrams)
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
generating content-based recommendations

generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
                                                                                
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
checked 3883 movies
...
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movieschecked 3883 movies
indexing with minhash LSH

indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graphcmputing similarity graph

cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
indexed 988212 similar item pairsindexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
generating content-based recommendations
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
checked 3883 movieschecked 3883 movies
indexing with minhash LSH

indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
Training time: 5721.41s
Best params: numFeatures=5000, bigramFeatures=3000, numHashTables=10, jaccardThreshold=0.100, minGenreOverlap=1
generating content-based recommendations
                                                                                
Tuned Content-Based: P@10=0.0150, R@10=0.0167, NDCG@10=0.0188
Inference time: 0.09s
</pre>