**Task:** Recommend the top 3 posts for each user based on profile interests, past engagement, and content attributes.

In [6]:
!pip install chromadb

Collecting chromadb
  Using cached chromadb-1.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Using cached onnxruntime-1.23.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.9 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Using cached opentelemetry_exporter_otlp_proto_grpc-1.37.0-py3-none-any.whl.metadata (2.4 kB)
Collecting kubernetes>=28.1.0 (from chromadb)
  Using cached kubernetes-34.1.0-py2.py3-none-any.whl.metadata (1.7 kB)
Using cached chromadb-1.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.8 MB)
Using cached kubernetes-34.1.0-py2.py3-none-any.whl (2.0 MB)
Using cached onnxruntime-1.23.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (17.3 MB)
Using cached opentelemetry_exporter_otlp_proto_grpc-1.37.0-py3-none-any.whl (19 kB)
Installing collected packages: onnxruntime, kubernetes, opentelemetry-exporter-otlp-pro

ChromaDB is used as the vector database.

*   It stores items as numerical representations (embeddings) and searches efficiently for similar items.
*   Usage: store the vectorized representations of our posts, then find posts that are similar to a user's profile or interests.




In [7]:
import pandas as pd  # used for data manipulation and analysis
import chromadb
from chromadb.utils import embedding_functions  # pre-built functions that convert text into the vector embeddings needed for storing and searching in a vector database

In [8]:
users = pd.read_csv('/content/Users.csv')
posts = pd.read_csv('/content/Posts.csv')
engagements = pd.read_csv('/content/Engagements.csv')

print(users.head())
print(posts.head())
print(engagements.head())

  user_id  age gender          top_3_interests  past_engagement_score
0      U1   24      F      sports, art, gaming                   0.61
1      U2   32      F    travel, food, fashion                   0.93
2      U3   28  Other  sports, travel, fashion                   0.40
3      U4   25      M     fashion, music, tech                   0.53
4      U5   24      M   fashion, food, fitness                   0.80
  post_id creator_id content_type            tags
0      P1        U44        video    sports, food
1      P2        U26        video   music, travel
2      P3        U32         text  sports, travel
3      P4         U6        image   music, gaming
4      P5        U32        image   food, fashion
  user_id post_id  engagement
0      U1     P52           1
1      U1     P44           0
2      U1      P1           1
3      U1      P4           1
4      U1     P65           0


In [9]:
Top = 3  # number of recommendations to return per user
def normalize_list_field(s):
    if pd.isna(s): return []
    return [x.strip().lower() for x in str(s).split(",") if x.strip()]

# Preprocess posts
posts["tags_list"] = posts["tags"].apply(normalize_list_field)
# fillna("") keeps missing tags from turning into the literal string "nan"
posts["doc_text"] = posts["tags"].fillna("").str.lower() + " " + posts["content_type"].fillna("").str.lower()

# Map post_id → tags text (for user profile building)
post_tag_map = posts.set_index("post_id")["tags"].to_dict()

In [31]:
# Compute popularity (global engagement count)
popularity = engagements[engagements["engagement"] == 1].groupby("post_id").size().to_dict()
if popularity:
    min_pop, max_pop = min(popularity.values()), max(popularity.values())
    post_popularity = {k: (v - min_pop) / (max_pop - min_pop + 1e-9) for k, v in popularity.items()}
else:
    post_popularity = {}

In [32]:
# Initialize Chroma and the embedding function
client = chromadb.Client()  # in-memory client used to interact with ChromaDB
embedding_model = embedding_functions.DefaultEmbeddingFunction()  # converts doc_text into vector embeddings that capture the semantic meaning of the text

collection = client.get_or_create_collection(
    name="posts_collection",
    metadata={"hnsw:space": "cosine"},
    embedding_function=embedding_model,
)

In [33]:
#Insert posts into vector DB
documents = posts["doc_text"].tolist()
metadatas = posts[["post_id", "creator_id", "content_type", "tags"]].to_dict(orient="records")
ids = posts["post_id"].astype(str).tolist()

collection.upsert(documents=documents, metadatas=metadatas, ids=ids)

In [35]:
# Build recommendations
recommendations = []

for _, u in users.iterrows():
    uid = u["user_id"]
    interests = normalize_list_field(u.get("top_3_interests", ""))

    # Positive engagements
    pos_posts = engagements[(engagements["user_id"] == uid) & (engagements["engagement"] == 1)]["post_id"].tolist()
    pos_tags = []
    for pid in pos_posts:
        pos_tags.extend(normalize_list_field(post_tag_map.get(pid, "")))

    # Combine into one profile text
    profile_terms = interests + pos_tags
    profile_text = " ".join(profile_terms).strip()

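    if not profile_text:
        # Fallback: most popular posts (posts with no positive engagement count as 0)
        top_ids = sorted(posts["post_id"], key=lambda p: post_popularity.get(p, 0.0), reverse=True)[:Top]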
    else:
        engaged_posts = set(pos_posts)
        results = collection.query(query_texts=[profile_text], n_results=Top * 5)  # fetch extra candidates

        candidates = []
        for dist, meta in zip(results["distances"][0], results["metadatas"][0]):
            pid = meta["post_id"]
            if pid in engaged_posts:
                continue

            # 1. Embedding similarity (Chroma returns cosine distance → convert to similarity)
            emb_sim = 1 - dist

            # 2. Popularity score
            pop_score = post_popularity.get(pid, 0.0)

            # 3. Interest overlap
            user_int = set(interests)
            post_tags = set(normalize_list_field(meta["tags"]))
            overlap = len(user_int & post_tags) / len(user_int) if user_int else 0.0

            # Hybrid score
            final_score = 0.5 * emb_sim + 0.3 * pop_score + 0.2 * overlap
            candidates.append((pid, final_score))

        # Sort and pick top-K
        candidates.sort(key=lambda x: x[1], reverse=True)
        top_ids = [pid for pid, score in candidates[:Top]]

    recommendations.append({"user_id": uid, "recommended_posts": top_ids})

rec_df = pd.DataFrame(recommendations)
print(rec_df.head())

  user_id recommended_posts
0      U1   [P46, P22, P39]
1      U2   [P42, P53, P69]
2      U3   [P39, P46, P53]
3      U4    [P53, P1, P46]
4      U5   [P69, P53, P58]


In [39]:

def normalize_list_field(s):
    if pd.isna(s):
        return []
    return [x.strip().lower() for x in str(s).split(",") if x.strip()]

# Map post_id to tags
post_tag_map = posts.set_index("post_id")["tags"].to_dict()

hits, total_users = 0, 0
precision_sum, recall_sum = 0, 0

for uid, group in engagements.groupby("user_id"):
    engaged_posts = set(group[group["engagement"] == 1]["post_id"].tolist())
    if not engaged_posts:
        continue

    # Tags from engaged posts
    engaged_tags = set()
    for pid in engaged_posts:
        engaged_tags.update(normalize_list_field(post_tag_map.get(pid, "")))

    # Recommended posts for this user
    recs = rec_df.loc[rec_df["user_id"] == uid, "recommended_posts"].values
    if len(recs) == 0:
        continue
    recs = set(recs[0])

    # Tags from recommended posts
    rec_tags = set()
    for pid in recs:
        rec_tags.update(normalize_list_field(post_tag_map.get(pid, "")))

    # True positives = tag overlap
    true_pos = len(engaged_tags & rec_tags)

    # Precision
    precision = true_pos / len(rec_tags) if rec_tags else 0
    precision_sum += precision

    # Recall
    recall = true_pos / len(engaged_tags) if engaged_tags else 0
    recall_sum += recall

    # Accuracy (did we hit at least 1 tag?)
    if true_pos > 0:
        hits += 1

    total_users += 1

# Averages
precision = precision_sum / total_users if total_users else None
recall = recall_sum / total_users if total_users else None
accuracy = hits / total_users if total_users else None

print("Precision:", precision)
print("Recall:", recall)
print("Accuracy:", accuracy)

Precision: 0.8366666666666664
Recall: 0.4180714285714285
Accuracy: 1.0


# Interest-Based Content Recommendation System

## 1. Goal / Task
The goal of this project is to **recommend the top 3 posts to each user** based on:  
- Their profile interests  
- Past engagement behavior  
- Post content attributes (tags, type)  

This helps users discover content they are more likely to engage with, improving user experience and engagement rates.

---

## 2. Datasets

We use three datasets which are linked together:

### Users dataset
| Column | Description |
|--------|-------------|
| user_id | Unique identifier for each user |
| age | Age of the user |
| gender | Gender of the user |
| top_3_interests | Top three interests of the user |
| past_engagement_score | A score representing the user's previous engagement level |

### Posts dataset
| Column | Description |
|--------|-------------|
| post_id | Unique identifier for each post |
| creator_id | ID of the user who created the post |
| content_type | Type of content (e.g., video, text, image) |
| tags | Tags describing the content or topics of the post |

### Engagements dataset
| Column | Description |
|--------|-------------|
| user_id | ID of the user who engaged with a post |
| post_id | ID of the engaged post |
| engagement | Engagement type or level (1 = positive, 0 = none/negative) |

**How the datasets are linked:**  
- engagements.user_id → links to users.user_id  
- engagements.post_id → links to posts.post_id
- posts.creator_id → links to users.user_id (who created the post)  

These links allow us to track which users interacted with which posts and which posts were created by which users.

---

## 3. Approach / Methodology

We followed a **step-by-step approach** to generate recommendations:

### Step 1: Preprocessing
- Normalize tags: convert to lowercase and remove extra spaces  
- Create a combined `doc_text` for each post: `tags + content_type`  
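
The preprocessing step can be sketched in plain Python as follows. The function names `normalize_tags` and `build_doc_text` are illustrative and do not appear in the notebook. Unlike the notebook, which keeps the raw comma-separated tag string, this sketch joins the cleaned tags with spaces.

```python
def normalize_tags(raw):
    """Split a comma-separated tag string into clean lowercase tags."""
    if raw is None:
        return []
    return [t.strip().lower() for t in str(raw).split(",") if t.strip()]


def build_doc_text(tags, content_type):
    """Join the normalized tags and the content type into one string."""
    parts = normalize_tags(tags)
    if content_type:
        parts.append(str(content_type).strip().lower())
    return " ".join(parts)


tags = normalize_tags(" Sports,  FOOD ,")   # extra spaces and empty entries are dropped
doc = build_doc_text("Sports, Food", "Video")
print(tags)
print(doc)
```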

### Step 2: Build user profiles
- Combine each user’s **top 3 interests** with the **tags of posts they engaged with**  
- This gives a textual profile representing their preferences  
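
As a rough illustration of profile building, the sketch below concatenates a user's interests with the tags of posts they engaged with. `build_profile_text` and the `post_tags` dictionary are hypothetical names with made-up sample values. Repeated tags are kept on purpose, so topics the user engages with often appear more than once in the profile text.

```python
def split_terms(raw):
    """Lowercase and trim a comma-separated string into a list of terms."""
    return [t.strip().lower() for t in raw.split(",") if t.strip()]


def build_profile_text(interests, engaged_post_ids, post_tags):
    """Combine stated interests with tags from positively engaged posts."""
    terms = split_terms(interests)
    for pid in engaged_post_ids:
        terms.extend(split_terms(post_tags.get(pid, "")))
    return " ".join(terms)


# Sample data for illustration only
post_tags = {"P1": "sports, food", "P4": "music, gaming"}
profile = build_profile_text("sports, art, gaming", ["P1", "P4"], post_tags)
print(profile)
```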

### Step 3: Represent posts & profiles using embeddings
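- Use **ChromaDB**'s default embedding function to convert each post's `doc_text` into a **vector embedding**, stored in a collection configured for cosine distance
- Embeddings capture the semantic meaning of posts and user profiles, so related topics end up close together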
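
To make the idea of "closeness" concrete, here is a minimal sketch of cosine distance on hand-made three-dimensional vectors. Real embeddings come from the embedding model and have far more dimensions; the vectors and the helper `cosine_distance` below are assumptions for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for post embeddings
post_vectors = {
    "P1": [1.0, 0.0, 1.0],
    "P2": [0.0, 1.0, 0.0],
    "P3": [1.0, 0.0, 0.0],
}
profile_vector = [1.0, 0.0, 1.0]

# Nearest posts first
ranked = sorted(post_vectors, key=lambda pid: cosine_distance(profile_vector, post_vectors[pid]))
print(ranked)
```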

### Step 4: Generate candidate recommendations
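- For each user, query the vector database with the profile text to retrieve the 15 nearest posts (`Top * 5`), skipping posts the user has already engaged with
- Compute a **hybrid score** for each candidate post:
  1. Embedding similarity, computed as `1 - cosine distance` (0.5 weight)
  2. Post popularity: the min-max normalized count of positive engagements (0.3 weight)
  3. Overlap: the fraction of the user's interests that appear in the post's tags (0.2 weight)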
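
The weighted combination can be worked through on a single candidate. The sketch below mirrors the formula in the notebook; the helper names and the sample counts and distance are made up for illustration.

```python
def min_max_normalize(counts):
    """Scale raw engagement counts to the range [0, 1]."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    return {k: (v - lo) / (hi - lo + 1e-9) for k, v in counts.items()}


def hybrid_score(distance, popularity, user_interests, post_tags, weights=(0.5, 0.3, 0.2)):
    """Blend embedding similarity, popularity, and interest overlap."""
    emb_sim = 1.0 - distance  # cosine distance -> similarity
    interests = set(user_interests)
    overlap = len(interests & set(post_tags)) / len(interests) if interests else 0.0
    w_sim, w_pop, w_overlap = weights
    return w_sim * emb_sim + w_pop * popularity + w_overlap * overlap


# Sample positive-engagement counts per post
popularity = min_max_normalize({"P1": 2, "P2": 6, "P3": 4})

# P3: distance 0.2, popularity about 0.5, one of three interests matched
score = hybrid_score(0.2, popularity["P3"], ["sports", "art", "gaming"], ["sports", "travel"])
print(round(score, 4))
```

Here the score is roughly 0.5 × 0.8 + 0.3 × 0.5 + 0.2 × (1/3) ≈ 0.617, which shows that embedding similarity dominates while overlap acts as a small tie-breaker.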

### Step 5: Select top 3 posts
- Sort candidate posts by their hybrid score  
- Pick the **top 3 posts** as recommendations for the user  
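
A compact way to express the selection step is shown below. This is a sketch with made-up scores; `select_top_k` is a hypothetical helper, and it uses `heapq.nlargest` instead of the full sort used in the notebook.

```python
import heapq


def select_top_k(scored, engaged, k=3):
    """Return the ids of the k highest-scoring posts the user has not engaged with."""
    fresh = [(pid, score) for pid, score in scored if pid not in engaged]
    best = heapq.nlargest(k, fresh, key=lambda item: item[1])
    return [pid for pid, _ in best]


# (post_id, hybrid_score) pairs for one user; values are illustrative
scored = [("P1", 0.91), ("P2", 0.40), ("P3", 0.75), ("P4", 0.62), ("P5", 0.58)]
top = select_top_k(scored, engaged={"P1"})
print(top)
```

Excluding already engaged posts keeps the recommendations focused on content the user has not seen.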

---

## 4. Evaluation Metrics

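To measure recommendation quality, we computed three tag-level metrics that compare the tags of each user's recommended posts with the tags of the posts they engaged with:

- **Precision** (≈ 0.84): the share of recommended tags that also appear in the user's engaged posts. Most recommended topics match what the user has engaged with.
- **Recall** (≈ 0.42): the share of engaged tags covered by the recommendations. With only three posts per user, some engaged topics go uncovered; more personalization may improve this.
- **Accuracy** (1.0): the share of users whose recommendations share at least one tag with their engaged posts. Every user received at least one recommendation on a familiar topic.

Because user profiles are built from those same engaged-post tags, these scores measure topical consistency rather than performance on held-out engagements.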
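
For reference, the evaluation cell computes these metrics on tags rather than on post ids. The sketch below reproduces that calculation for one hypothetical user; the tag lists are made up.

```python
def tag_metrics(engaged_tags, rec_tags):
    """Tag-level precision, recall, and hit flag for one user."""
    engaged, recs = set(engaged_tags), set(rec_tags)
    shared = engaged & recs
    precision = len(shared) / len(recs) if recs else 0.0
    recall = len(shared) / len(engaged) if engaged else 0.0
    hit = len(shared) > 0  # counts toward "accuracy" in the notebook
    return precision, recall, hit


precision, recall, hit = tag_metrics(
    engaged_tags=["sports", "food", "music", "gaming", "travel"],
    rec_tags=["sports", "travel", "fashion"],
)
print(precision, recall, hit)
```

In this example two of the three recommended tags are familiar (precision 2/3) and two of the five engaged tags are covered (recall 2/5). The notebook averages these per-user values over all users with at least one positive engagement.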

---

## 5. Possible Extensions
- Include **full post content embeddings** for better similarity matching  
- Combine **collaborative filtering** with this content-based method   
- Include **trending posts** or **time decay** for popularity scoring
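
One possible shape for time-decayed popularity is an exponential decay with a half-life. This is only a sketch: the current datasets contain no timestamps, so the engagement ages, the `half_life_days` parameter, and the function name are all assumptions.

```python
def decayed_popularity(engagement_ages_days, half_life_days=7.0):
    """Sum of engagements, each weighted by 0.5 ** (age / half_life)."""
    return sum(0.5 ** (age / half_life_days) for age in engagement_ages_days)


# Two recent engagements vs. three older ones (ages in days, illustrative)
recent = decayed_popularity([0.0, 7.0])
stale = decayed_popularity([14.0, 14.0, 14.0])
print(recent, stale)
```

With a 7-day half-life, a post with two recent engagements can outrank a post with three engagements from two weeks ago. The resulting values could then be min-max normalized in the same way as the current popularity score.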
