# Hybrid Recommendation System

**Team Structure:**
- Member 1: Infrastructure, Data Loading, Fusion & Evaluation
- Member 2: Collaborative Filtering (ALS)
- Member 3: Content-Based Filtering (TF-IDF + LSH)

## 1. Setup

### 1.1 Imports

In [1]:
import os
import sys
import urllib.request
import zipfile
from math import log2

# Fix for Windows
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType, FloatType, StructType, StructField

### 1.2 Download Data

In [2]:
DATA_URL = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"
DATA_DIR = "data"
DATASET_DIR = os.path.join(DATA_DIR, "ml-1m")
ZIP_PATH = os.path.join(DATA_DIR, "ml-1m.zip")

In [3]:
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

In [4]:
if not os.path.exists(DATASET_DIR):

    if not os.path.exists(ZIP_PATH):
        print("Downloading MovieLens ml-1m...")
        urllib.request.urlretrieve(DATA_URL, ZIP_PATH)

    print("Extracting...")
    with zipfile.ZipFile(ZIP_PATH, 'r') as zip_ref:
        zip_ref.extractall(DATA_DIR)

### 1.3 Spark Session

In [5]:
spark = SparkSession.builder.appName("MMDS") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.sql.shuffle.partitions", "20") \
    .config("spark.default.parallelism", "20") \
    .config("spark.driver.maxResultSize", "8g") \
    .config("spark.sql.autoBroadcastJoinThreshold", "400m") \
    .getOrCreate()

### 1.4 Load Data

In [6]:
users_df = spark.read.text(os.path.join(DATASET_DIR, "users.dat")).select(
    F.split(F.col("value"), "::").getItem(0).cast(IntegerType()).alias("user_id"),
    F.split(F.col("value"), "::").getItem(1).alias("gender"),
    F.split(F.col("value"), "::").getItem(2).cast(IntegerType()).alias("age"),
    F.split(F.col("value"), "::").getItem(3).cast(IntegerType()).alias("occupation"),
    F.split(F.col("value"), "::").getItem(4).alias("zip_code")
)
users_df.count()

6040

In [7]:
items_df = spark.read.text(os.path.join(DATASET_DIR, "movies.dat")).select(
    F.split(F.col("value"), "::").getItem(0).cast(IntegerType()).alias("item_id"),
    F.split(F.col("value"), "::").getItem(1).alias("title"),
    F.split(F.col("value"), "::").getItem(2).alias("genres")
)
items_df.count()

3883

In [8]:
ratings_df = spark.read.text(os.path.join(DATASET_DIR, "ratings.dat")).select(
    F.split(F.col("value"), "::").getItem(0).cast(IntegerType()).alias("user_id"),
    F.split(F.col("value"), "::").getItem(1).cast(IntegerType()).alias("item_id"),
    F.split(F.col("value"), "::").getItem(2).cast(FloatType()).alias("rating"),
    F.split(F.col("value"), "::").getItem(3).cast(IntegerType()).alias("timestamp")
)
ratings_df.count()

1000209

In [9]:
users_df.show(5)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
|      3|     M| 25|        15|   55117|
|      4|     M| 45|         7|   02460|
|      5|     M| 25|        20|   55455|
+-------+------+---+----------+--------+
only showing top 5 rows


In [10]:
items_df.show(5)

+-------+--------------------+--------------------+
|item_id|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Animation|Childre...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|        Comedy|Drama|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows


In [11]:
ratings_df.show(5)

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|      1|   1193|   5.0|978300760|
|      1|    661|   3.0|978302109|
|      1|    914|   3.0|978301968|
|      1|   3408|   4.0|978300275|
|      1|   2355|   5.0|978824291|
+-------+-------+------+---------+
only showing top 5 rows


## 2. Exploratory Data Analysis

### 2.1 Rating matrix

In [12]:
num_users = users_df.count()
num_items = items_df.count()
num_ratings = ratings_df.count()
sparsity = (1 - (num_ratings / (num_users * num_items))) * 100

print(f"Users:            {num_users:,}")
print(f"Movies:           {num_items:,}")
print(f"Ratings:          {num_ratings:,}")
print(f"Sparsity:         {sparsity:.2f}%")
print(f"Avg ratings/user: {num_ratings/num_users:.1f}")
print(f"Avg ratings/movie:{num_ratings/num_items:.1f}")

Users:            6,040
Movies:           3,883
Ratings:          1,000,209
Sparsity:         95.74%
Avg ratings/user: 165.6
Avg ratings/movie:257.6


### 2.2 Rating distribution

In [13]:
ratings_df.groupBy("rating").count().orderBy("rating").show()

+------+------+
|rating| count|
+------+------+
|   1.0| 56174|
|   2.0|107557|
|   3.0|261197|
|   4.0|348971|
|   5.0|226310|
+------+------+



### 2.3 Genre distribution

In [14]:
items_df.select(F.explode(F.split(F.col("genres"), "\\|")).alias("genre")) \
    .groupBy("genre").count().orderBy(F.desc("count")).show()

+-----------+-----+
|      genre|count|
+-----------+-----+
|      Drama| 1603|
|     Comedy| 1200|
|     Action|  503|
|   Thriller|  492|
|    Romance|  471|
|     Horror|  343|
|  Adventure|  283|
|     Sci-Fi|  276|
| Children's|  251|
|      Crime|  211|
|        War|  143|
|Documentary|  127|
|    Musical|  114|
|    Mystery|  106|
|  Animation|  105|
|    Western|   68|
|    Fantasy|   68|
|  Film-Noir|   44|
+-----------+-----+



### 2.4 User gender distribution

In [15]:
users_df.groupBy("gender").count().show()

+------+-----+
|gender|count|
+------+-----+
|     F| 1709|
|     M| 4331|
+------+-----+



### 2.5 User age distribution

In [16]:
users_df.groupBy("age").count().orderBy("age").show()

+---+-----+
|age|count|
+---+-----+
|  1|  222|
| 18| 1103|
| 25| 2096|
| 35| 1193|
| 45|  550|
| 50|  496|
| 56|  380|
+---+-----+



## 3. Train/Test Split

In [17]:
def chronological_split(ratings_df, train_ratio=0.8, min_train_ratings=5):
    user_time_window = Window.partitionBy("user_id").orderBy("timestamp")
    user_count_window = Window.partitionBy("user_id")

    ratings_with_rank = ratings_df.withColumn(
        "row_num", F.row_number().over(user_time_window)
    ).withColumn(
        "user_total", F.count("*").over(user_count_window)
    ).withColumn(
        "train_threshold", F.floor(F.col("user_total") * train_ratio)
    )

    ratings_valid = ratings_with_rank.filter(
        F.col("train_threshold") >= min_train_ratings
    )

    ratings_labeled = ratings_valid.withColumn(
        "split",
        F.when(F.col("row_num") <= F.col("train_threshold"), "train").otherwise("test")
    )

    original_columns = ["user_id", "item_id", "rating", "timestamp"]
    train_df = ratings_labeled.filter(F.col("split") == "train").select(original_columns)
    test_df = ratings_labeled.filter(F.col("split") == "test").select(original_columns)

    return train_df, test_df

In [18]:
train_df, test_df = chronological_split(ratings_df, train_ratio=0.8, min_train_ratings=5)
train_df = train_df.cache()
test_df = test_df.cache()

In [19]:
print(f"Train: {train_df.count():,} ratings")
print(f"Test: {test_df.count():,} ratings")
print(f"Users in train: {train_df.select('user_id').distinct().count():,}")
print(f"Users in test: {test_df.select('user_id').distinct().count():,}")

Train: 797,758 ratings
Test: 202,451 ratings
Users in train: 6,040
Users in test: 6,040


## 4. Collaborative Filtering (ALS)

Implement using `pyspark.ml.recommendation.ALS`

In [20]:
class CollaborativeFilter:

    def __init__(self, rank=10, regParam=0.1, maxIter=10):
        self.als = ALS(
            userCol="user_id",
            itemCol="item_id",
            ratingCol="rating",
            rank=rank,
            regParam=regParam,
            maxIter=maxIter,
            coldStartStrategy="drop",
            nonnegative=True
        )
        self.model = None

    def train(self, df):
        self.model = self.als.fit(df)

    def get_recommendations(self, df, k=10):
        """Get top-K recommendations"""
        if self.model is None:
            raise ValueError("Call train() first.")

        users = df.select("user_id").distinct()
        user_recs = self.model.recommendForUserSubset(users, k)

        return user_recs.select(
            F.col("user_id"),
            F.explode("recommendations").alias("rec")
        ).select(
            F.col("user_id"),
            F.col("rec.item_id").cast(IntegerType()).alias("item_id"),
            F.col("rec.rating").alias("prediction")
        )

    def predict(self, df):
        """Predict ratings for user-item pairs in test"""
        if self.model is None:
            raise ValueError("Train should be called first")
        return self.model.transform(df)

In [21]:
cf = CollaborativeFilter(rank=10, regParam=0.1, maxIter=10)
cf.train(train_df)

In [22]:
als_recs = cf.get_recommendations(test_df, k=100).withColumnRenamed("prediction", "als_score")

In [23]:
als_recs.show(10)

+-------+-------+---------+
|user_id|item_id|als_score|
+-------+-------+---------+
|     95|   3905| 4.470003|
|     95|   1851| 4.413235|
|     95|   3092|4.3593006|
|     95|   2905| 4.310626|
|     95|    318|4.2543035|
|     95|    260|4.2513843|
|     95|   1198|4.2234735|
|     95|    527|4.1753764|
|     95|   2931|4.1601114|
|     95|     50| 4.154743|
+-------+-------+---------+
only showing top 10 rows


### Bonus: Hyperparameter Tuning

In [24]:
# TODO

## 5. Content-Based Filtering (TF-IDF + LSH)

Implement using `pyspark.ml.feature` (Tokenizer, HashingTF, IDF, BucketedRandomProjectionLSH)

In [25]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, NGram, MinHashLSH
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import (
    col, split, concat_ws, regexp_extract, udf, array_intersect, 
    size, sum as _sum, desc, row_number
)

class ContentBasedFilter:
    def __init__(self, num_features=5000, bigram_features=3000, num_hash_tables=10):
        self.num_features = num_features
        self.bigram_features = bigram_features
        self.num_hash_tables = num_hash_tables
        self.jaccard_threshold = 0.1
        self.genre_weight = 4
        self.min_genre_overlap = 1
        

        self.lsh_model = None
        self.vector_model = None
        self.movies_binary = None
        self.similar_pairs = None

    def combine_binarize_udf(self):
        def process_vectors(v1, v2):
            if v1 is None: 
                return None
            indices = [int(i) for i in v1.indices]
            
            if v2 is not None:
                offset = int(v1.size)
                indices += [int(i) + offset for i in v2.indices]
                total_size = offset + int(v2.size)
            else:
                total_size = int(v1.size)
            
            values = [1.0] * len(indices)
            return Vectors.sparse(total_size, sorted(indices), values)
            
        return udf(process_vectors, VectorUDT())

    def train_features(self, items_df):
        print("building content features (tfidf + bigrams)")
        
        df = (
            items_df
            .withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)", 1))
            .withColumn("decade", regexp_extract(col("title"), r"\((\d{3})\d\)", 1))
            .withColumn("genres_spaced", F.regexp_replace(col("genres"), r"\|", " "))
            .withColumn("content", concat_ws(
                " ", 
                col("title"),
                col("genres_spaced"), col("genres_spaced"), col("genres_spaced"), col("genres_spaced"),
                col("year"), col("decade")
            ))
        )

        stages = []

        stages += [
            Tokenizer(inputCol="content", outputCol="raw_words"),
            StopWordsRemover(inputCol="raw_words", outputCol="words")
        ]
        
        stages += [
            HashingTF(inputCol="words", outputCol="tf_uni", numFeatures=self.num_features),
            IDF(inputCol="tf_uni", outputCol="tfidf_uni", minDocFreq=1)
        ]
        
        stages += [
            NGram(n=2, inputCol="words", outputCol="bigrams"),
            HashingTF(inputCol="bigrams", outputCol="tf_bi", numFeatures=self.bigram_features),
            IDF(inputCol="tf_bi", outputCol="tfidf_bi", minDocFreq=1)
        ]
        
        pipeline = Pipeline(stages=stages)
        self.vector_model = pipeline.fit(df)
        features_df = self.vector_model.transform(df)
        
        combiner = self.combine_binarize_udf()
        self.movies_binary = (
            features_df
            .withColumn("binary_features", combiner("tfidf_uni", "tfidf_bi"))
            .select("item_id", "title", "genres", "binary_features")
            .cache()
        )
        
        print(f"checked {self.movies_binary.count()} movies")

    def build_lsh_index(self):
        print("indexing with minhash LSH")
        
        mh = MinHashLSH(
            inputCol="binary_features", 
            outputCol="hashes", 
            numHashTables=self.num_hash_tables,
            seed=42
        )
        self.lsh_model = mh.fit(self.movies_binary)
        
        dist_threshold = 1.0 - self.jaccard_threshold
        
        print(f"cmputing similarity graph")
        raw_pairs = self.lsh_model.approxSimilarityJoin(
            self.movies_binary, self.movies_binary, 
            threshold=dist_threshold, 
            distCol="jaccard_dist"
        )
        
        pairs = raw_pairs.select(
            col("datasetA.item_id").alias("item_a"),
            col("datasetB.item_id").alias("item_b"),
            (1.0 - col("jaccard_dist")).alias("similarity")
        ).filter("item_a != item_b")
        
        self.similar_pairs = pairs.cache()
        print(f"indexed {self.similar_pairs.count()} similar item pairs")

    def recommend_for_users(self, train_df, items_df, k=10):
        print("generating content-based recommendations")
        
        user_history = train_df.filter(col("rating") >= 4.0).select(
            col("user_id"), col("item_id").alias("seed_item"), col("rating")
        )

        candidates = user_history.join(
            self.similar_pairs,
            user_history.seed_item == self.similar_pairs.item_a
        )
        
        genre_df = items_df.select("item_id", split(col("genres"), r"\|").alias("genres_arr"))        
        
        candidates_enriched = (
            candidates.alias("c")
            .join(
                genre_df.alias("seed_g"), 
                col("c.seed_item") == col("seed_g.item_id")
            )
            .join(
                genre_df.alias("cand_g"), 
                col("c.item_b") == col("cand_g.item_id")
            )
            .select(
                "c.user_id", 
                col("c.item_b").alias("candidate_item"),
                "c.rating", 
                "c.similarity",
                size(array_intersect(
                    col("seed_g.genres_arr"), 
                    col("cand_g.genres_arr")
                )).alias("genre_overlap")
            )
        )

        filtered = candidates_enriched.filter(col("genre_overlap") >= self.min_genre_overlap)

        scored = filtered.withColumn(
            "score", 
            col("rating") * col("similarity") * (1.0 + 0.25 * col("genre_overlap"))
        )

        recs = scored.groupBy("user_id", "candidate_item").agg(_sum("score").alias("content_score"))
        
        seen_items = train_df.select("user_id", "item_id").distinct().alias("seen")
        recs_alias = recs.alias("recs")
        
        final_recs = recs_alias.join(
            seen_items,
            (col("recs.user_id") == col("seen.user_id")) & 
            (col("recs.candidate_item") == col("seen.item_id")),
            "left_anti"
        ).select(
            col("recs.user_id"), 
            col("recs.candidate_item").alias("item_id"), 
            col("recs.content_score")
        )

        window = Window.partitionBy("user_id").orderBy(desc("content_score"))
        return (
            final_recs
            .withColumn("rank", row_number().over(window))
            .filter(col("rank") <= k)
            .drop("rank")
        )

cb_filter = ContentBasedFilter()
cb_filter.train_features(items_df)
cb_filter.build_lsh_index()


building content features (tfidf + bigrams)
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
indexed 988212 similar item pairs


In [26]:
content_recs = cb_filter.recommend_for_users(train_df, items_df, k=100)

generating content-based recommendations


In [27]:
content_recs = content_recs.cache()
content_recs.show(5)

+-------+-------+------------------+
|user_id|item_id|     content_score|
+-------+-------+------------------+
|     95|   1744| 22.83967523704366|
|     95|   1591|22.473956043956044|
|     95|    849|21.873400389932648|
|     95|   2334|21.532246786394385|
|     95|   2058| 21.53224678639438|
+-------+-------+------------------+
only showing top 5 rows


## 6.Fusion & Evaluation

In [28]:
ALPHA = 0.7
K = 10
RELEVANCE_THRESHOLD = 4.0

### 6.1 Normalization

In [29]:
def normalize(df, col_name):
    stats = df.agg(F.min(col_name).alias("min"), F.max(col_name).alias("max")).collect()[0]
    if stats["max"] == stats["min"]:
        return df.withColumn(col_name + "_norm", F.lit(0.5))
    return df.withColumn(col_name + "_norm", (F.col(col_name) - stats["min"]) / (stats["max"] - stats["min"]))

### 6.2 Hybrid Fusion

In [30]:
als_norm = normalize(als_recs, "als_score")
content_norm = normalize(content_recs, "content_score")

hybrid_recs = als_norm.select("user_id", "item_id", "als_score_norm") \
    .join(content_norm.select("user_id", "item_id", "content_score_norm"), ["user_id", "item_id"], "full_outer") \
    .fillna(0) \
    .withColumn("final_score", ALPHA * F.col("als_score_norm") + (1 - ALPHA) * F.col("content_score_norm"))

In [31]:
hybrid_recs.orderBy(F.desc("final_score")).show(10)

+-------+-------+------------------+------------------+------------------+
|user_id|item_id|    als_score_norm|content_score_norm|       final_score|
+-------+-------+------------------+------------------+------------------+
|   2155|   2964|               1.0|               0.0|               0.7|
|    283|   2964|0.9698151700844673|               0.0| 0.678870619059127|
|   5258|   2964|0.9697667223390415|               0.0| 0.678836705637329|
|     46|   2332|0.9629095211067962|               0.0|0.6740366647747573|
|    356|   2964|0.9624401982957583|               0.0|0.6737081388070308|
|   2867|   2964|0.9327210792037601|               0.0| 0.652904755442632|
|   4751|   2197|0.9295460636668155|               0.0|0.6506822445667708|
|     46|   1421|0.9260141680602427|               0.0|0.6482099176421698|
|     46|   2964| 0.924343074189654|               0.0|0.6470401519327578|
|   4751|   2823|0.9179485214437122|               0.0|0.6425639650105985|
+-------+-------+--------

In [32]:
hybrid_recs.orderBy(F.desc("final_score")).filter((F.col("als_score_norm") > 0.2) & (F.col("content_score_norm") > 0.2)).show(10)

+-------+-------+------------------+-------------------+------------------+
|user_id|item_id|    als_score_norm| content_score_norm|       final_score|
+-------+-------+------------------+-------------------+------------------+
|   4277|     53|0.6619954703713671| 0.5904650803512171|0.6405363533653221|
|   4169|     53|0.5907159593292562| 0.7100814164276092|0.6265255964587622|
|   4169|   2197|0.5363586097405134| 0.7420784076418976|0.5980745491109287|
|   4277|   2579|0.6177101479358188| 0.5430159889973913|0.5953019002542905|
|   1835|     53| 0.648898089598216|0.39215405067471004|0.5718748779211642|
|    187|   2197|0.6132271217938433|0.46941409482066127|0.5700832137018887|
|   1448|     53|0.5899127632385575| 0.5233646290001704|0.5699483229670413|
|    195|     53|0.6529252201000901| 0.3695371984674346|0.5679088136102934|
|   1680|   3817|0.6208199121333001|0.41946522137832376|0.5604135049068072|
|   5878|   2197| 0.675955173900218| 0.2770115983667207|0.5562721012401688|
+-------+---

In [33]:
als_pairs = als_recs.select("user_id", "item_id").distinct()
content_pairs = content_recs.select("user_id", "item_id").distinct()

overlap = als_pairs.intersect(content_pairs).count()
print(f"als pairs: {als_pairs.count()}")
print(f"content pairs: {content_pairs.count()}")
print(f"overlap: {overlap}")

als pairs: 604000
content pairs: 603761
overlap: 10692


### 6.3 Ground Truth

In [34]:
ground_truth = test_df.filter(F.col("rating") >= RELEVANCE_THRESHOLD) \
    .groupBy("user_id").agg(F.collect_list("item_id").alias("relevant_items"))

### Evaluation Functions

In [35]:
def get_top_k(recs_df, score_col, k):
    window = Window.partitionBy("user_id").orderBy(F.desc(score_col))
    ranked = recs_df.withColumn("rank", F.row_number().over(window)) \
        .filter(F.col("rank") <= k)

    return ranked.groupBy("user_id").agg(
        F.array_sort(F.collect_list(F.struct("rank", "item_id"))).alias("ranked_structs")
    ).withColumn(
        "recommended_items",
        F.expr("transform(ranked_structs, x -> x.item_id)")
    ).drop("ranked_structs")

In [36]:
def precision_at_k(top_k_df, ground_truth_df, k):
    joined = top_k_df.join(ground_truth_df, "user_id")
    result = joined.withColumn("hits", F.size(F.array_intersect("recommended_items", "relevant_items"))) \
        .agg(F.avg(F.col("hits") / k)).collect()[0][0]
    return result or 0.0

In [37]:
def recall_at_k(top_k_df, ground_truth_df):
    joined = top_k_df.join(ground_truth_df, "user_id")
    result = joined.withColumn("hits", F.size(F.array_intersect("recommended_items", "relevant_items"))) \
        .withColumn("recall", F.when(F.size("relevant_items") > 0, F.col("hits") / F.size("relevant_items")).otherwise(0)) \
        .agg(F.avg("recall")).collect()[0][0]
    return result or 0.0

In [38]:
def ndcg_at_k(top_k_df, ground_truth_df, k):
    joined = top_k_df.join(ground_truth_df, "user_id")
    exploded = joined.select(
        "user_id",
        "relevant_items",
        F.posexplode("recommended_items").alias("pos", "item_id")
    ).withColumn("item_id", F.col("item_id").cast(IntegerType()))

    with_dcg = exploded \
        .withColumn("rel", F.when(F.array_contains("relevant_items", F.col("item_id")), 1.0).otherwise(0.0)) \
        .withColumn("dcg", F.col("rel") / F.log2(F.col("pos") + 2)) \
        .groupBy("user_id", "relevant_items").agg(F.sum("dcg").alias("dcg"))

    idcg_vals = [sum(1.0 / log2(i + 2) for i in range(n)) for n in range(k + 1)]
    idcg_map = F.create_map(*[x for i, v in enumerate(idcg_vals) for x in (F.lit(i), F.lit(v))])

    result = with_dcg \
        .withColumn("num_rel", F.least(F.size("relevant_items"), F.lit(k))) \
        .withColumn("idcg", idcg_map[F.col("num_rel")]) \
        .withColumn("ndcg", F.when(F.col("idcg") > 0, F.col("dcg") / F.col("idcg")).otherwise(0)) \
        .agg(F.avg("ndcg")).collect()[0][0]
    return result or 0.0

In [39]:
def evaluate(recs_df, score_col, name):
    if recs_df.count() == 0:
        print(f"{name}: No recommendations (not implemented)")
        return {"Precision@10": 0.0, "Recall@10": 0.0, "NDCG@10": 0.0}

    top_k = get_top_k(recs_df, score_col, K)
    p = precision_at_k(top_k, ground_truth, K)
    r = recall_at_k(top_k, ground_truth)
    n = ndcg_at_k(top_k, ground_truth, K)

    print(f"{name}: P@{K}={p:.4f}, R@{K}={r:.4f}, NDCG@{K}={n:.4f}")
    return {"Precision@10": p, "Recall@10": r, "NDCG@10": n}

### Evaluation

In [40]:
als_metrics = evaluate(als_recs, "als_score", "ALS")

ALS: P@10=0.0099, R@10=0.0081, NDCG@10=0.0088


In [41]:
content_metrics = evaluate(content_recs, "content_score", "Content-Based")

Content-Based: P@10=0.0149, R@10=0.0167, NDCG@10=0.0188


In [42]:
hybrid_metrics = evaluate(hybrid_recs, "final_score", "Hybrid")

Hybrid: P@10=0.0108, R@10=0.0088, NDCG@10=0.0100


### Bonus: GBT Re-Ranking

In [43]:
# TODO

## Results Summary

In [44]:
summary = [
    ("ALS", als_metrics["Precision@10"], als_metrics["Recall@10"], als_metrics["NDCG@10"]),
    ("Content-Based", content_metrics["Precision@10"], content_metrics["Recall@10"], content_metrics["NDCG@10"]),
    ("Hybrid", hybrid_metrics["Precision@10"], hybrid_metrics["Recall@10"], hybrid_metrics["NDCG@10"]),
]
spark.createDataFrame(summary, ["Model", "Precision@10", "Recall@10", "NDCG@10"]).show()

+-------------+--------------------+--------------------+--------------------+
|        Model|        Precision@10|           Recall@10|             NDCG@10|
+-------------+--------------------+--------------------+--------------------+
|          ALS| 0.00989121338912142|0.008075643298256809|0.008834882243884373|
|Content-Based| 0.01492887029288703|0.016718364543279556| 0.01877152620672991|
|       Hybrid|0.010778242677824252|0.008833484494750564|0.009974498398269882|
+-------------+--------------------+--------------------+--------------------+



In [52]:
# Sample User Recommendations Visualization
import random

# Pick a random user from test set
test_users = test_df.select("user_id").distinct().collect()
sample_user_id = random.choice(test_users)["user_id"]
print(f"Sample User ID: {sample_user_id}\n")

# Get user's highly-rated movies from training set (rating >= 4)
user_liked = train_df.filter(
    (F.col("user_id") == sample_user_id) & (F.col("rating") >= 4.0)
).join(items_df, "item_id").select("title", "genres", "rating").orderBy(F.desc("rating"))

print("=== Movies rated highly by this user (training data) ===")
user_liked.show(10, truncate=False)

# Get hybrid recommendations for this user
user_recs = hybrid_recs.filter(F.col("user_id") == sample_user_id) \
    .join(items_df, "item_id") \
    .select("title", "genres", "final_score") \
    .orderBy(F.desc("final_score"))

print("=== Hybrid model recommendations ===")
user_recs.show(10, truncate=False)

Sample User ID: 2246

=== Movies rated highly by this user (training data) ===
+-------------------------------------------------------------------+--------------------+------+
|title                                                              |genres              |rating|
+-------------------------------------------------------------------+--------------------+------+
|Shawshank Redemption, The (1994)                                   |Drama               |5.0   |
|Henry V (1989)                                                     |Drama|War           |5.0   |
|Three Colors: Red (1994)                                           |Drama               |5.0   |
|Hud (1963)                                                         |Drama|Western       |5.0   |
|Godfather: Part II, The (1974)                                     |Action|Crime|Drama  |5.0   |
|Godfather, The (1972)                                              |Action|Crime|Drama  |5.0   |
|Manhattan (1979)                      

In [None]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import Evaluator
from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import time

class ContentBasedEstimator(Estimator, DefaultParamsReadable, DefaultParamsWritable):
    numFeatures = Param(Params._dummy(), "numFeatures", "num of tf-idf features", TypeConverters.toInt)
    bigramFeatures = Param(Params._dummy(), "bigramFeatures", "num of bigram features", TypeConverters.toInt)
    numHashTables = Param(Params._dummy(), "numHashTables", "num of LSH hash tables", TypeConverters.toInt)
    jaccardThreshold = Param(Params._dummy(), "jaccardThreshold", "jaccard similarity threshold", TypeConverters.toFloat)
    minGenreOverlap = Param(Params._dummy(), "minGenreOverlap", "minimum genre overlap", TypeConverters.toInt)
    
    def __init__(self, numFeatures=5000, bigramFeatures=3000, 
                numHashTables=10, jaccardThreshold=0.1, minGenreOverlap=1):
        super(ContentBasedEstimator, self).__init__()
        self._setDefault(numFeatures=numFeatures, 
                        bigramFeatures=bigramFeatures, 
                        numHashTables=numHashTables,
                        jaccardThreshold=jaccardThreshold,
                        minGenreOverlap=minGenreOverlap)
        
        self._set(numFeatures=numFeatures, 
                bigramFeatures=bigramFeatures, 
                numHashTables=numHashTables,
                jaccardThreshold=jaccardThreshold, 
                minGenreOverlap=minGenreOverlap)
    
    def _fit(self, dataset):
        cb = ContentBasedFilter(
            num_features=self.getOrDefault(self.numFeatures),
            bigram_features=self.getOrDefault(self.bigramFeatures),
            num_hash_tables=self.getOrDefault(self.numHashTables)
        )
        cb.jaccard_threshold = self.getOrDefault(self.jaccardThreshold)
        cb.min_genre_overlap = self.getOrDefault(self.minGenreOverlap)
        cb.train_features(items_df)
        cb.build_lsh_index()
        return ContentBasedModel(cb)

class ContentBasedModel(Model, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, cb_filter=None):
        super(ContentBasedModel, self).__init__()
        self.cb_filter = cb_filter
    
    def _transform(self, dataset):
        if self.cb_filter is None:
            raise ValueError("model not fitted")
        return self.cb_filter.recommend_for_users(dataset, items_df, k=100)

class RecSysEvaluator(Evaluator):
    def __init__(self, test_df, ground_truth, k=10):
        super(RecSysEvaluator, self).__init__()
        self.test_df = test_df
        self.ground_truth = ground_truth
        self.k = k
    
    def _evaluate(self, dataset):
        top_k = get_top_k(dataset, "content_score", self.k)
        return ndcg_at_k(top_k, self.ground_truth, self.k)
    
    def isLargerBetter(self):
        return True

paramGrid = ParamGridBuilder() \
    .addGrid(ContentBasedEstimator.numFeatures, [3000, 5000, 7000]) \
    .addGrid(ContentBasedEstimator.bigramFeatures, [2000, 3000]) \
    .addGrid(ContentBasedEstimator.numHashTables, [10, 20, 30]) \
    .addGrid(ContentBasedEstimator.jaccardThreshold, [0.1, 0.2]) \
    .addGrid(ContentBasedEstimator.minGenreOverlap, [1, 2]) \
    .build()

estimator = ContentBasedEstimator()
evaluator = RecSysEvaluator(test_df, ground_truth, k=10)
items_df.cache()
train_df.cache()
cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, 
                    evaluator=evaluator, numFolds=2, parallelism=10)

start = time.time()
cvModel = cv.fit(train_df)
train_time = time.time() - start

best_model = cvModel.bestModel
print(f"Training time: {train_time:.2f}s")
print(f"Best params: numFeatures={best_model.cb_filter.num_features}, bigramFeatures={best_model.cb_filter.bigram_features}, numHashTables={best_model.cb_filter.num_hash_tables}, jaccardThreshold={best_model.cb_filter.jaccard_threshold:.3f}, minGenreOverlap={best_model.cb_filter.min_genre_overlap}")

start = time.time()
tuned_recs = best_model.transform(train_df).cache()
inference_time = time.time() - start

tuned_metrics = evaluate(tuned_recs, "content_score", "Tuned Content-Based")
print(f"Inference time: {inference_time:.2f}s")

<pre>
26/01/17 19:33:09 WARN CacheManager: Asked to cache already cached data.
26/01/17 19:33:09 WARN CacheManager: Asked to cache already cached data.
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
checked 3883 moviesgenerating content-based recommendations
indexed 988212 similar item pairs

indexing with minhash LSH
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
checked 3883 movies
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendationsgenerating content-based recommendations
generating content-based recommendations
generating content-based recommendations

generating content-based recommendations
indexing with minhash LSH
generating content-based recommendations
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
indexed 988212 similar item pairsindexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations

generating content-based recommendations
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
building content features (tfidf + bigrams)
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
                                                                                
checked 3883 moviesindexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
checked 3883 movies
indexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
indexing with minhash LSH
indexing with minhash LSH
cmputing similarity graphcmputing similarity graph

                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
generating content-based recommendations
checked 3883 movies
indexing with minhash LSH
generating content-based recommendations

generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
...
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
building content features (tfidf + bigrams)
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
building content features (tfidf + bigrams)
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
cmputing similarity graph
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
                                                                                
building content features (tfidf + bigrams)
                                                                                
indexed 988212 similar item pairsindexed 988212 similar item pairs
generating content-based recommendations

generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
                                                                                
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
checked 3883 movies
...
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)

building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movieschecked 3883 movies
indexing with minhash LSH

indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graphcmputing similarity graph

cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
indexed 988212 similar item pairsindexed 988212 similar item pairs

indexed 988212 similar item pairs
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
generating content-based recommendations
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
indexed 988212 similar item pairs
generating content-based recommendations
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
                                                                                
cmputing similarity graph
                                                                                
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
building content features (tfidf + bigrams)
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
checked 3883 movieschecked 3883 movies
indexing with minhash LSH

indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
indexed 988212 similar item pairs
generating content-based recommendations
                                                                                
building content features (tfidf + bigrams)
                                                                                
checked 3883 movies
indexing with minhash LSH
cmputing similarity graph
                                                                                
indexed 988212 similar item pairs
Training time: 5721.41s
Best params: numFeatures=5000, bigramFeatures=3000, numHashTables=10, jaccardThreshold=0.100, minGenreOverlap=1
generating content-based recommendations
                                                                                
Tuned Content-Based: P@10=0.0150, R@10=0.0167, NDCG@10=0.0188
Inference time: 0.09s
</pre>