# 🛠️ Step-by-Step Explanation: AI-Powered Feed Matching System
This notebook walks through a feed-matching algorithm step by step, combining **LDA-based topic modeling** with **Word2Vec-powered semantic similarity**.

**Key Features:**
- 🧠 **LDA (Latent Dirichlet Allocation)** for automatic topic classification.
- 🔍 **Word2Vec embeddings** for semantic similarity measurement.
- ⏳ **Recency Weighting** to prioritize fresh content.
- 🤝 **Best Matching Feed Selection** using similarity + recency.


## 🔹 Step 1: Import Required Libraries

In [4]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from datetime import datetime, timedelta
from collections import defaultdict

# Download required NLTK resources
# (on newer NLTK releases, word_tokenize may need nltk.download('punkt_tab') instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/heiley/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 🔹 Step 2: Define Sample User Feeds

In [6]:
feeds = {
    "Alice": [
        {"text": "Bill Clinton fired 377,000 federal employees.", "timestamp": datetime.now() - timedelta(days=0.4)},
        {"text": "Transportation is changing rapidly after years of slow progress.", "timestamp": datetime.now() - timedelta(days=2)},
        {"text": "AI can help improve medical diagnosis through advanced image analysis.", "timestamp": datetime.now() - timedelta(hours=12)},
        {"text": "Think Trump was aggressive to Zelenskyy? Watch the full talk.", "timestamp": datetime.now() - timedelta(hours=6)},
    ],
    "Bob": [
        {"text": "I enjoy learning about political decisions and their impact.", "timestamp": datetime.now() - timedelta(days=1)},
        {"text": "I find machine learning and its medical applications fascinating.", "timestamp": datetime.now() - timedelta(hours=5)},
        {"text": "Exploring transportation advancements is an exciting topic for me.", "timestamp": datetime.now() - timedelta(days=3)},
    ]
}

## 🔹 Step 3: LDA Topic Modeling for Automatic Classification

In [8]:
# Extract text data for LDA topic modeling
all_posts = [post["text"] for posts in feeds.values() for post in posts]

# Convert text into a document-term matrix using CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(all_posts)

# Apply LDA for topic modeling with 3 topics
lda_model = LatentDirichletAllocation(n_components=3, random_state=42)
lda_model.fit(X)

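# Function to predict the most relevant topic for a given post
def get_topic(text):
    text_vec = vectorizer.transform([text])
    topic_distribution = lda_model.transform(text_vec)  # shape (1, n_topics)
    return int(np.argmax(topic_distribution))  # Index of the highest-probability topic

# Assign topics to posts
# Note: LDA topics are unlabeled. These names are placeholder labels for
# topic indices 0-2 and are not derived from the fitted model.
topic_names = {0: "Technology", 1: "Politics", 2: "AI"}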
for user, posts in feeds.items():
    for post in posts:
        predicted_topic_index = get_topic(post["text"])
        post["topic"] = topic_names[predicted_topic_index]
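
LDA topics are unlabeled clusters of word weights, so the names in `topic_names` only carry meaning once the top words of each topic have been inspected. A minimal sketch of that inspection, assuming a hypothetical helper `top_words_per_topic` (not part of the notebook) that accepts any topic-by-word weight matrix and a vocabulary. The toy weights below are made up for illustration:

```python
def top_words_per_topic(topic_word_weights, vocabulary, n_top=3):
    """Return the n_top highest-weighted words for each topic row."""
    summaries = []
    for weights in topic_word_weights:
        # Sort word indices by descending weight and keep the first n_top
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        summaries.append([vocabulary[i] for i in ranked[:n_top]])
    return summaries

# Toy data (illustrative only, not produced by the notebook's model)
toy_vocab = ["ai", "diagnosis", "employees", "federal", "transportation"]
toy_weights = [
    [2.1, 1.8, 0.1, 0.2, 0.3],
    [0.2, 0.1, 1.9, 2.2, 0.4],
]
toy_summary = top_words_per_topic(toy_weights, toy_vocab, n_top=2)
print(toy_summary)
```

With the fitted objects from this step, the call would be `top_words_per_topic(lda_model.components_, vectorizer.get_feature_names_out())`, after which the labels in `topic_names` can be adjusted to match what each topic actually contains.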

## 🔹 Step 4: Word2Vec for Semantic Similarity Matching

In [10]:
# Tokenize text for Word2Vec training
all_texts = [post["text"] for posts in feeds.values() for post in posts]
tokenized_sentences = [word_tokenize(text.lower()) for text in all_texts]

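# Train Word2Vec model on the sample posts
# (min_count=1 keeps every token because the corpus is tiny;
#  with workers > 1, results can vary slightly between runs)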
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Embed a text as the mean of the Word2Vec vectors of its in-vocabulary tokens;
# fall back to a zero vector if none of its tokens are known to the model
def get_embedding(text):
    words = word_tokenize(text.lower())
    word_vectors = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(word2vec_model.vector_size)
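
The embedding of a post is the element-wise mean of its word vectors, and Step 5 compares two such means by cosine similarity. The sketch below reproduces that arithmetic in plain Python, assuming hypothetical helpers `mean_pool` and `cosine` and made-up 3-dimensional toy vectors. It illustrates the calculation and is not the notebook's implementation:

```python
import math

def mean_pool(vectors, dim):
    """Element-wise mean of equal-length vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity, defined here as 0.0 when either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Toy word vectors (illustrative only, not trained embeddings)
toy_vectors = {
    "medical": [1.0, 0.0, 1.0],
    "ai": [1.0, 1.0, 0.0],
    "learning": [0.0, 1.0, 1.0],
}
post_a = mean_pool([toy_vectors["ai"], toy_vectors["medical"]], dim=3)
post_b = mean_pool([toy_vectors["learning"], toy_vectors["medical"]], dim=3)
empty = mean_pool([], dim=3)  # no known words -> zero vector

sim_ab = cosine(post_a, post_b)
sim_empty = cosine(post_a, empty)
print(sim_ab, sim_empty)
```

The zero-norm guard matters for the fallback case: a post with no known words cannot be normalized, so the sketch assigns it a similarity of 0 rather than dividing by zero.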

## 🔹 Step 5: Computing the Best Matching Feed Using Similarity & Recency Weighting

In [12]:
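# Function to calculate recency weight: exponential decay over whole days elapsed.
# timedelta.days truncates, so all posts less than a day old get the same weight,
# and the +1 means even a brand-new post is weighted exp(-decay_rate), not 1.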
def recency_weight(timestamp, decay_rate=0.1):
    days_since_post = (datetime.now() - timestamp).days + 1
    return np.exp(-decay_rate * days_since_post)
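
Because `.days` truncates to whole days, posts that are 6 and 12 hours old receive the same weight. If finer resolution is wanted, one possible variant decays over fractional days. The sketch below assumes a hypothetical `recency_weight_fractional` and an explicit `now` argument (neither is in the notebook) so the example is repeatable:

```python
import math
from datetime import datetime, timedelta

def recency_weight_fractional(timestamp, now, decay_rate=0.1):
    """Exponential decay over fractional days elapsed (hypothetical variant)."""
    days_since_post = (now - timestamp).total_seconds() / 86400
    return math.exp(-decay_rate * days_since_post)

now = datetime(2025, 1, 1, 12, 0, 0)  # fixed reference time for a repeatable example
w_6h = recency_weight_fractional(now - timedelta(hours=6), now)
w_12h = recency_weight_fractional(now - timedelta(hours=12), now)
w_3d = recency_weight_fractional(now - timedelta(days=3), now)
print(w_6h, w_12h, w_3d)
```

This variant also drops the `+ 1` offset, so a post published at `now` gets a weight of exactly 1. Whether that is preferable depends on how strongly fresh posts should dominate the final score.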

In [13]:
# Compute the best matching feed between Alice and Bob
best_match = {}
# Start at -inf so a pair is still selected when every similarity is zero or negative
max_weighted_score = -np.inf

# Group posts by predicted LDA topic and author, so each post keeps its owner
topic_groups = defaultdict(lambda: defaultdict(list))
for user, posts in feeds.items():
    for post in posts:
        topic_groups[post["topic"]][user].append(post)

# Find best match based on LDA topics, similarity, and recency
for topic, user_posts in topic_groups.items():
    # user_posts maps a user name to that user's posts in this topic;
    # a user with no posts in the topic yields an empty list

    # Compare only within the same LDA topic
    for alice_post in user_posts["Alice"]:
        for bob_post in user_posts["Bob"]:
            alice_embedding = get_embedding(alice_post["text"])
            bob_embedding = get_embedding(bob_post["text"])

            similarity_score = cosine_similarity([alice_embedding], [bob_embedding])[0][0]
            alice_weight = recency_weight(alice_post["timestamp"])
            bob_weight = recency_weight(bob_post["timestamp"])

            # Weighted score = similarity * average recency weight
            weighted_score = similarity_score * ((alice_weight + bob_weight) / 2)

            if weighted_score > max_weighted_score:
                max_weighted_score = weighted_score
                best_match = {
                    "LDA Topic": topic,
                    "Alice's Best Matched Post": alice_post["text"],
                    "Bob's Best Matched Post": bob_post["text"],
                    # Similarity of the best weighted pair, which is not
                    # necessarily the highest raw similarity across all pairs
                    "Highest Similarity Score": round(similarity_score, 2),
                    "Final Weighted Score": round(weighted_score, 2),
                }

# Convert results to DataFrame and display
best_match_df = pd.DataFrame([best_match])
print(best_match_df)

  LDA Topic                          Alice's Best Matched Post  \
0  Politics  Think Trump was aggressive to Zelenskyy? Watch...   

                             Bob's Best Matched Post  \
0  Exploring transportation advancements is an ex...   

   Highest Similarity Score  Final Weighted Score  
0                      0.27                  0.21  
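
The printed weighted score can be checked by hand from the recency formula in Step 5. Alice's matched post is 6 hours old (0 whole days, plus 1) and Bob's is 3 days old (3 whole days, plus 1). A short sketch of that arithmetic, using the rounded similarity of 0.27 from the output above:

```python
import math

decay_rate = 0.1
alice_weight = math.exp(-decay_rate * (0 + 1))  # 6 hours old -> 0 whole days, plus 1
bob_weight = math.exp(-decay_rate * (3 + 1))    # 3 days old -> 3 whole days, plus 1
similarity = 0.27                               # rounded value from the printed output

# Weighted score = similarity * average recency weight
weighted_score = similarity * (alice_weight + bob_weight) / 2
print(round(weighted_score, 2))  # 0.21
```

The matched pair shares an LDA topic label but not an obvious subject, which is not surprising for models fit on only seven short posts. The result is best read as a demonstration of the pipeline rather than a meaningful match.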
