**Social Media Post Recommendation Using Reddit Dataset**                       
In this project, we developed a Reddit post recommendation system using semantic search and fine-tuned lightweight language models. We used SBERT embeddings and FAISS indexing to retrieve top-k similar posts based on user input. To enhance contextual generation, we fine-tuned three different LLMs — DeepSeek-7B, TinyLlama-1.1B, and Mistral-7B-Instruct — using LoRA for efficient adaptation. These models generate human-like post recommendations that reflect community-driven responses. We further evaluated the quality of generated outputs using BLEU, ROUGE, and BERTScore metrics to assess relevance, fluency, and semantic similarity. This system enables context-aware content recommendation using scalable and resource-efficient architectures.

Installing Libraries

In [None]:
!pip install transformers
!pip install torch

Reading Dataset From CSV File

In [None]:
import pandas as pd
import numpy as np
import re

df = pd.read_csv("G2FinalDatasetReddit.csv")
print(df.head())


Preprocessing and Cleaning Dataset

In [None]:
def clean_text(text):
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['clean_body'] = df['body'].apply(clean_text)
df['clean_body']
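
To see what each substitution in clean_text removes, the sketch below repeats the same regex steps without pandas and runs them on a made-up comment. The sample string and the name clean_text_demo are illustrative assumptions, and the None check stands in for pd.isna.

```python
import re

def clean_text_demo(text):
    """Standalone copy of the cleaning steps (a None check replaces pd.isna)."""
    if text is None:
        return ""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                # drop [bracketed] spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>+', '', text)                 # drop HTML-like tags
    text = re.sub(r'[^a-z\s]', '', text)               # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()           # collapse runs of whitespace
    return text

# Made-up sample comment, not taken from the dataset
sample = "Check [this](https://example.com) out!! <b>Great</b> post #1"
cleaned = clean_text_demo(sample)
print(cleaned)  # check out great post
```

Digits and punctuation are removed entirely, so a word such as "don't" becomes "dont" and numbers disappear from the cleaned text.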

In [None]:
df = df[df['clean_body'].str.len() >= 10].reset_index(drop=True)
df

Saving Cleaned Dataset Into New CSV File

In [None]:
df.to_csv("Cleaned_Reddit_Comments.csv", index=False)

Installing sentence-transformers

In [None]:
!pip install sentence-transformers


Embedding Of Dataset

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load model and encode
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = sbert_model.encode(df['clean_body'].tolist(), show_progress_bar=True)
np.save("sbert_embeddings.npy", embeddings)
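
The saved embeddings are later searched with a flat L2 index. As a rough illustration of what an exhaustive L2 search computes, the sketch below ranks toy 2-D vectors by squared Euclidean distance to a query in pure Python. The vectors and the function name top_k_l2 are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and a FAISS index performs this kind of search far more efficiently.

```python
def top_k_l2(vectors, query, k):
    """Return (squared distance, index) pairs for the k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return scored[:k]

# Toy "embeddings"; the query is identical to the vector at index 1
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
neighbours = top_k_l2(toy_vectors, query=[1.0, 0.0], k=3)
for dist, idx in neighbours:
    print(idx, dist)
```

A vector that is already in the collection comes back as its own nearest neighbour at distance 0, which is why the dataset-building step later in this notebook searches for k + 1 results and drops the post's own index.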


In [None]:
!pip install transformers accelerate
!pip install safetensors


**Loading Deepseek Model For Recommending Posts**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-llm-7b-base",
    device_map="auto",
    torch_dtype=torch.float16,
    offload_folder="offload"
)


Recommending Posts With Prompt-Based Generation Using the Base DeepSeek Model

In [None]:
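def get_top_k_similar_posts(query, k=5):
    # Assumed helper (it is not defined elsewhere in this notebook): brute-force
    # cosine-similarity search over the SBERT embeddings computed above.
    query_vec = sbert_model.encode([query])[0]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    sims = (embeddings @ query_vec) / (norms + 1e-12)
    top_idx = np.argsort(-sims)[:k]
    return df.iloc[top_idx]['clean_body'].tolist()

def recommend_with_deepseek(query, k=5):
    top_posts = get_top_k_similar_posts(query, k)
    prompt = f"""You're a Reddit post recommender. Given a user's post: "{query}", and the following similar posts:\n\n"""
    for i, post in enumerate(top_posts):
        prompt += f"{i+1}. {post}\n"
    prompt += "\nGenerate a recommendation summary or suggest the most relevant content for the user."

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, top_p=0.9, temperature=0.7)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response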


In [None]:
query = "I'm feeling stressed about exams, what should I do?"
print(recommend_with_deepseek(query, k=5))


Implementing LoRA Fine-Tuning of the DeepSeek Model and Saving the Trained Model

In [None]:
!pip install faiss-cpu


In [None]:
import os
import json
import torch
import pandas as pd
import numpy as np
import faiss
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType,
)

# === STEP 1: Load cleaned dataset and SBERT embeddings ===
df = pd.read_csv("Cleaned_Reddit_Comments.csv")
embeddings = np.load("sbert_embeddings.npy").astype("float32")

assert len(df) == len(embeddings), "Mismatch between embeddings and dataset!"

# === STEP 2: Build FAISS index for similarity search ===
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# === STEP 3: Construct prompt-completion pairs using top-k similar posts ===
k = 5
records = []

for i in range(len(df)):
    query_vector = embeddings[i].reshape(1, -1)
    _, top_k = index.search(query_vector, k + 1)  # +1 because it includes the post itself
    top_k = [idx for idx in top_k[0] if idx != i][:k]

    similar_posts = df.iloc[top_k]['clean_body'].tolist()
    prompt = (
        f"User post from subreddit r/{df.iloc[i]['subreddit']}:\n\n"
        f"{df.iloc[i]['clean_body']}\n\n"
        f"Similar posts:\n" +
        "\n".join([f"{j+1}. {p}" for j, p in enumerate(similar_posts)]) +
        "\n\n###"
    )

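    # NOTE: this completion is a fixed placeholder, so every training example shares
    # the same target text; substitute real recommendation text for meaningful training.
    completion = " Recommended content: [Insert relevant recommendation here]"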
    records.append({"prompt": prompt, "completion": completion})

# Save to JSONL
jsonl_path = "reddit_lora_dataset.jsonl"
with open(jsonl_path, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

print(f"✅ Dataset prepared and saved to {jsonl_path}")

# === STEP 4: Load model & tokenizer ===
model_name = "deepseek-ai/deepseek-llm-7b-base"
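tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    # padding="max_length" in STEP 6 needs a pad token; fall back to EOS if none is set
    tokenizer.pad_token = tokenizer.eos_token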
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    offload_folder="offload"
)

# === STEP 5: Apply LoRA via PEFT ===
base_model = prepare_model_for_kbit_training(base_model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# === STEP 6: Load and tokenize dataset ===
dataset = load_dataset("json", data_files=jsonl_path)

def format_example(e):
    return f"{e['prompt']}{e['completion']}"

def tokenize_example(e):
    return tokenizer(format_example(e), truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize_example, remove_columns=dataset["train"].column_names)

# === STEP 7: Training setup ===
training_args = TrainingArguments(
    output_dir="./deepseek-lora-reddit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# === STEP 8: Finetune the model ===
print("🚀 Starting finetuning...")
trainer.train()

# === STEP 9: Save the finetuned model ===
model.save_pretrained("deepseek-lora-reddit")
tokenizer.save_pretrained("deepseek-lora-reddit")
print("✅ Finetuned model saved to ./deepseek-lora-reddit")
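
For a sense of how small the LoRA update is, the sketch below counts the weights that rank-r adapters add to a set of linear layers: each adapted layer gains two low-rank matrices with r * (d_in + d_out) weights in total, and the update is scaled by lora_alpha / r. The hidden size and layer count are hypothetical round numbers, and the projections are assumed to be square; none of this is read from the real DeepSeek configuration, and model.print_trainable_parameters() reports the true figure.

```python
def lora_param_count(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

r, lora_alpha = 8, 16      # same values as the LoraConfig in this notebook
hidden_size = 4096         # hypothetical hidden size (assumption)
num_layers = 32            # hypothetical number of transformer layers (assumption)
targets_per_layer = 2      # q_proj and v_proj, assumed hidden_size x hidden_size

per_module = lora_param_count(hidden_size, hidden_size, r)
total = per_module * targets_per_layer * num_layers
scaling = lora_alpha / r   # factor applied to the low-rank update

print(per_module, total, scaling)  # 65536 4194304 2.0
```

Under these assumptions the adapters add roughly 4.2 million weights, a small fraction of a multi-billion-parameter base model, which is what makes the approach resource-efficient.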


Inference: Recommending Posts Using the LoRA Fine-Tuned DeepSeek Model

In [None]:
import torch  # needed for torch.float16 when loading the base model below
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# === Load preprocessed Reddit data and embeddings ===
df = pd.read_csv("Cleaned_Reddit_Comments.csv")
embeddings = np.load("sbert_embeddings.npy").astype("float32")

# === Load FAISS index ===
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# === Load SBERT model to encode query ===
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# === Load base + LoRA-finetuned DeepSeek model ===
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-base", device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("deepseek-lora-reddit", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "deepseek-lora-reddit")

# === Text generation pipeline ===
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# === Inference Function ===
def recommend_reddit_post(user_input, k=5):
    # Step 1: Embed user query
    query_embed = sbert_model.encode([user_input]).astype("float32")

    # Step 2: Get top-k similar posts from FAISS
    _, top_k_indices = index.search(query_embed, k)
    similar_posts = df.iloc[top_k_indices[0]]["clean_body"].tolist()

    # Step 3: Construct prompt for generation
    prompt = (
        f"User post:\n\n"
        f"{user_input}\n\n"
        f"Similar posts:\n" +
        "\n".join([f"{i+1}. {post}" for i, post in enumerate(similar_posts)]) +
        "\n\n###\n"
    )

    # Step 4: Generate output
    output = pipe(prompt, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.7)
    return output[0]['generated_text']

# === Example Usage ===
user_input = "My cat climbs on my laptop every time I open it."
result = recommend_reddit_post(user_input)
print(result)
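
By default the text-generation pipeline returns the prompt followed by the continuation, so the printed result repeats the retrieved posts. A minimal sketch of separating the two on the ### marker used in the prompt follows; build_prompt and split_generation are hypothetical helper names, and the generated string is a hand-written stand-in rather than real model output.

```python
SEPARATOR = "\n\n###\n"

def build_prompt(user_input, similar_posts):
    """Assemble the same prompt layout used by the inference function."""
    numbered = "\n".join(f"{i + 1}. {post}" for i, post in enumerate(similar_posts))
    return f"User post:\n\n{user_input}\n\nSimilar posts:\n{numbered}{SEPARATOR}"

def split_generation(generated_text):
    """Return only the text after the first separator (empty string if absent)."""
    _, found, continuation = generated_text.partition(SEPARATOR)
    return continuation.strip() if found else ""

prompt = build_prompt(
    "my cat sits on my laptop",
    ["cats love warm spots", "try a decoy keyboard"],
)
# Hand-written stand-in for what a model might append to the prompt
fake_output = prompt + " Recommended content: give the cat its own warm spot"
print(split_generation(fake_output))
```

An alternative that avoids string splitting is to pass return_full_text=False when calling the pipeline, which returns only the newly generated text.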


Fine-Tuning a Second LLM, TinyLlama-1.1B, For Recommending Posts

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from datasets import load_dataset
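import torch  # needed for torch.float16 below

# === Load Tokenizer and Base Model ===
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    # padding="max_length" in tokenize() needs a pad token; fall back to EOS if none is set
    tokenizer.pad_token = tokenizer.eos_token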
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# === Prepare for LoRA Training ===
base_model = prepare_model_for_kbit_training(base_model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, lora_config)

# === Load JSONL Dataset ===
dataset = load_dataset("json", data_files="reddit_lora_dataset.jsonl")

def format_prompt(e):
    return f"{e['prompt']}{e['completion']}"

def tokenize(example):
    return tokenizer(format_prompt(example), truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

# === Training Setup ===
training_args = TrainingArguments(
    output_dir="./tinyllama-lora-reddit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

# === Save ===
model.save_pretrained("tinyllama-lora-reddit")
tokenizer.save_pretrained("tinyllama-lora-reddit")


Inference With the TinyLlama Model

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch  # needed for torch.float16 when loading the base model below
import numpy as np
import faiss
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load
df = pd.read_csv("Cleaned_Reddit_Comments.csv")
embeddings = np.load("sbert_embeddings.npy").astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

sbert = SentenceTransformer('all-MiniLM-L6-v2')
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("tinyllama-lora-reddit")
model = PeftModel.from_pretrained(base, "tinyllama-lora-reddit")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

def recommend(user_input, k=5):
    query = sbert.encode([user_input]).astype("float32")
    _, top_k = index.search(query, k)
    similar = df.iloc[top_k[0]]["clean_body"].tolist()
    prompt = f"User post:\n\n{user_input}\n\nSimilar posts:\n" + "\n".join([f"{i+1}. {p}" for i, p in enumerate(similar)]) + "\n\n###\n"
    return pipe(prompt, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.7)[0]['generated_text']

print(recommend("Why do dogs tilt their heads when you talk to them?"))


Fine-Tuning a Third Model, Mistral-7B-Instruct, For Recommending Posts

In [None]:
!pip install transformers peft accelerate sentence-transformers faiss-cpu


In [None]:
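# Reuses the imports and the tokenize() helper from the TinyLlama training cell above
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    # padding="max_length" in tokenize() needs a pad token; fall back to EOS if none is set
    tokenizer.pad_token = tokenizer.eos_token
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")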

base_model = prepare_model_for_kbit_training(base_model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, lora_config)

dataset = load_dataset("json", data_files="reddit_lora_dataset.jsonl")
tokenized_dataset = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="./mistral-lora-reddit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

model.save_pretrained("mistral-lora-reddit")
tokenizer.save_pretrained("mistral-lora-reddit")


Inference With the Mistral Model

In [None]:
import numpy as np
import pandas as pd
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
from sentence_transformers import SentenceTransformer

# === Load Reddit Post Data and Embeddings ===
df = pd.read_csv("Cleaned_Reddit_Comments.csv")  # Must have a 'clean_body' column
embeddings = np.load("sbert_embeddings.npy").astype("float32")

# === Create FAISS Index for Fast Similarity Search ===
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# === Load SBERT for encoding user query ===
sbert = SentenceTransformer('all-MiniLM-L6-v2')

# === Load Mistral Base and LoRA Fine-Tuned Model ===
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("mistral-lora-reddit")
model = PeftModel.from_pretrained(base, "mistral-lora-reddit")

# === Setup Pipeline ===
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")

# === Recommender Function ===
def recommend_posts(user_input, top_k=5):
    # Get SBERT embedding of user input
    query_embedding = sbert.encode([user_input]).astype("float32")

    # Search most similar posts
    _, indices = index.search(query_embedding, top_k)
    similar_posts = df.iloc[indices[0]]["clean_body"].tolist()

    # Create prompt
    prompt = f"User post:\n\n{user_input}\n\nSimilar posts:\n"
    prompt += "\n".join([f"{i+1}. {post}" for i, post in enumerate(similar_posts)])
    prompt += "\n\n###\n"

    # Generate response using Mistral
    result = pipe(prompt, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.7)
    return result[0]['generated_text']

# === Test Example ===
user_input = "Why do cats suddenly run around like crazy at night?"
response = recommend_posts(user_input)
print("🔍 Mistral Recommender Response:\n")
print(response)


Computing the Average Cosine Similarity Score of Each Model

In [None]:
!pip install scikit-learn


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_model(pipe, model_name, test_queries, top_k=5):
    total_score = 0.0

    for query in test_queries:
        # Step 1: Get query embedding
        query_embedding = sbert.encode([query]).astype("float32")

        # Step 2: Get top-k similar ground truth posts
        _, indices = index.search(query_embedding, top_k)
        ground_truth_posts = df.iloc[indices[0]]["clean_body"].tolist()

        # Step 3: Format prompt
        prompt = f"User post:\n\n{query}\n\nSimilar posts:\n"
        prompt += "\n".join([f"{i+1}. {post}" for i, post in enumerate(ground_truth_posts)])
        prompt += "\n\n###\n"

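        # Step 4: Generate output (return_full_text=False keeps the prompt, which
        # already contains the ground-truth posts, out of the scored text)
        result = pipe(prompt, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.7, return_full_text=False)
        generated = result[0]['generated_text']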

        # Step 5: Get embedding of generated output
        gen_embedding = sbert.encode([generated]).astype("float32")

        # Step 6: Compute cosine similarity with top-1 ground truth post
        gt_embedding = sbert.encode([ground_truth_posts[0]]).astype("float32")
        score = cosine_similarity(gen_embedding, gt_embedding)[0][0]
        total_score += score

    avg_score = total_score / len(test_queries)
    print(f"🔍 {model_name} - Avg Cosine Similarity: {avg_score:.4f}")
    return avg_score
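
The score computed in evaluate_model is the cosine similarity between two embedding vectors, cos(a, b) = a·b / (|a| |b|), which ranges from -1 to 1. A small pure-Python sketch on toy vectors shows the two extremes most relevant here; the vectors and the function name cosine are illustrative and are not SBERT outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])      # unrelated directions, 0
print(round(same_direction, 6), orthogonal)
```

Because the value is a similarity rather than a fraction of correct predictions, it is better read as semantic closeness between the generated text and the top retrieved post than as accuracy.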


In [None]:
test_queries = [
    "My cat keeps sitting on my laptop.",
    "I feel really low after breaking up.",
    "Why do people ghost others online?",
    "Any tips to save money in college?",
    "How to deal with work anxiety?"
]

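# NOTE: pipe_deepseek, pipe_tinyllama, and pipe_mistral are assumed to be the three
# pipelines built in the inference cells above, each saved under its own name.
# Those cells all assign to `pipe`, so re-assign (e.g. pipe_deepseek = pipe) first.
# === DeepSeek Evaluation ===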
evaluate_model(pipe_deepseek, "DeepSeek", test_queries)

# === TinyLlama Evaluation ===
evaluate_model(pipe_tinyllama, "TinyLlama", test_queries)

# === Mistral Evaluation ===
evaluate_model(pipe_mistral, "Mistral", test_queries)
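
The project summary also mentions ROUGE. As a rough illustration of what a unigram-overlap score measures, the sketch below computes a ROUGE-1-style F1 between a reference and a candidate string. It is a simplified stand-in, not the official ROUGE implementation and not necessarily how the scores for this project were computed; the function name rouge1_f1 and the sample sentences are made up.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """F1 over whitespace-split, lowercased unigrams (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped matching unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("take short breaks while studying", "take breaks while you study")
print(round(score, 3))  # 0.6
```

Unlike the embedding-based cosine score, this measure only rewards exact word matches, so "studying" and "study" count as different tokens.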
