# Building a Search Engine for Educational Content
## CSCI S-89B Final Project - Seymur Hasanov
**Harvard Extension School | Fall 2025**

This notebook demonstrates the complete NLP pipeline for analyzing academic research papers.

### Components:
1. **LDA Topic Modeling** - Discover hidden themes in papers
2. **Sentence Transformers** - Semantic search with BERT embeddings
3. **Neural Network Classifier** - Topic prediction with Keras/TensorFlow
4. **t-SNE Visualization** - Embedding visualization
5. **Classical ML Comparison** - Baseline comparison
6. **Interactive Demo** - Streamlit web application

---

## Step 1: Setup and Installation

In [None]:
# Install required packages
!pip install -q streamlit "gensim>=4.4.0" sentence-transformers tensorflow pyLDAvis wordcloud arxiv nltk plotly scikit-learn

# Download NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

print("✅ All dependencies installed!")
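
A quick check after installation can catch a package that failed to install or was installed without the intended version constraint. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name and is not part of this project.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as passed to pip in the install cell.
report = installed_versions(["gensim", "sentence-transformers", "streamlit"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any entry prints `NOT INSTALLED`, rerunning the install cell is usually the first thing to try.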

## Step 2: Clone Repository from GitHub

In [None]:
# Clone the project repository
!git clone https://github.com/Seymurhh/Search_engine_educational_project_NLP.git
%cd Search_engine_educational_project_NLP

print("✅ Repository cloned successfully!")

## Step 3: Import Modules and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Import project modules
import data_loader
import topic_model
import semantic_search
import neural_classifier

print("✅ Modules imported!")

# Load dataset
print("\n📊 Loading dataset...")
df = data_loader.load_from_csv("arxiv_dataset.csv")
print(f"✅ Loaded {len(df)} papers from ArXiv")
print(f"   Categories: cs.RO (Robotics), cs.AI (Artificial Intelligence)")

# Show sample
print("\n📄 Sample Paper Titles:")
for i, title in enumerate(df['title'].head(3)):
    print(f"   {i+1}. {title[:80]}...")

## Step 4: Text Preprocessing

In [None]:
print("🔧 Preprocessing text...")
processed_docs = [data_loader.preprocess_text(doc) for doc in df['abstract']]
processed_docs = data_loader.make_bigrams(processed_docs)

print(f"✅ Preprocessed {len(processed_docs)} documents")
print(f"\n📝 Sample preprocessed tokens (first document):")
print(f"   {processed_docs[0][:10]}...")
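
The internals of `data_loader.preprocess_text` and `data_loader.make_bigrams` are not shown in this notebook. As a rough illustration of what this kind of preprocessing typically does, here is a standard-library sketch. The names `STOPWORDS`, `simple_preprocess_text`, and `merge_frequent_bigrams` are hypothetical, the stopword list is a tiny stand-in for NLTK's, and the bigram rule is a plain frequency threshold, which may differ from what the project's helper does.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; the project downloads NLTK's list).
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to", "is", "we", "with"}


def simple_preprocess_text(text, min_len=3):
    """Lowercase, keep alphabetic tokens, drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs that occur at least min_count times across all docs."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(f"{doc[i]}_{doc[i + 1]}")
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


abstracts = [
    "We study reinforcement learning for robot control.",
    "Reinforcement learning improves the planning of a robot.",
]
docs = merge_frequent_bigrams([simple_preprocess_text(a) for a in abstracts])
print(docs)
```

In this toy input only the pair `reinforcement learning` appears twice, so it is the only one merged into a single `reinforcement_learning` token.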

## Step 5: Topic Modeling with LDA

In [None]:
# Create dictionary and corpus
print("📚 Creating dictionary and corpus...")
dictionary, corpus = topic_model.create_dictionary_corpus(processed_docs)
print(f"   Dictionary size: {len(dictionary)} unique terms")

# Train LDA model with 5 topics
NUM_TOPICS = 5
print(f"\n🎯 Training LDA model with {NUM_TOPICS} topics...")
lda_model = topic_model.train_lda_model(corpus, dictionary, num_topics=NUM_TOPICS)

# Compute coherence score
print("\n📈 Computing coherence score...")
coherence_score = topic_model.compute_coherence_score(lda_model, processed_docs, dictionary)
print(f"\n" + "="*60)
print(f"🎯 COHERENCE SCORE (Cv): {coherence_score:.4f}")
print(f"   (Rule of thumb: Cv above ~0.4 is often treated as acceptable)")
print("="*60)
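
To make the "dictionary and corpus" step concrete, here is a pure-Python sketch of a bag-of-words representation, where each document becomes a list of `(token_id, count)` pairs. This is the representation Gensim-style LDA consumes; whether `topic_model.create_dictionary_corpus` applies extra filtering (for example dropping rare or very common terms) is not shown here, so treat `build_dictionary_and_corpus` as a hypothetical stand-in.

```python
from collections import Counter


def build_dictionary_and_corpus(docs):
    """Map each unique token to an integer ID and convert docs to (id, count) pairs."""
    token_to_id = {}
    for doc in docs:
        for token in doc:
            # Assign the next free ID the first time a token is seen.
            token_to_id.setdefault(token, len(token_to_id))

    corpus = []
    for doc in docs:
        counts = Counter(token_to_id[token] for token in doc)
        corpus.append(sorted(counts.items()))
    return token_to_id, corpus


docs = [["robot", "control", "robot"], ["control", "learning"]]
token_to_id, corpus = build_dictionary_and_corpus(docs)
print(token_to_id)  # {'robot': 0, 'control': 1, 'learning': 2}
print(corpus)       # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded at this point: LDA only sees how often each term occurs in each document.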

### 5.1 Discovered Topics

In [None]:
print("\n📊 DISCOVERED TOPICS:")
print("-" * 70)
topics = topic_model.get_topics(lda_model, num_words=10)
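# Hand-assigned labels, indexed by topic ID. Revisit them if the LDA model is retrained,
# since topic order and content can change between runs.
topic_names = [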
    "LLM Reasoning & Agents",
    "General ML/AI Methods",
    "Reinforcement Learning",
    "Dynamic Systems & Control",
    "Visual Robotics & Planning"
]

for idx, topic in topics:
    words = [word.split('*')[1].strip().strip('"') for word in topic.split(' + ')]
    print(f"\nTopic {idx} ({topic_names[idx]}):")
    print(f"   Keywords: {', '.join(words[:8])}")
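
The loop above relies on each topic being a string in the format Gensim's `print_topics` produces, such as `0.030*"robot" + 0.020*"policy"`. Assuming `topic_model.get_topics` returns strings of that form, a regular expression can extract the words together with their weights. `parse_topic_string` is a hypothetical helper, and the example string is made up for illustration.

```python
import re

# Matches one term of the form 0.030*"word".
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic_string(topic):
    """Return a list of (word, weight) pairs from a Gensim-style topic string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic)]


example = '0.030*"robot" + 0.020*"reinforcement_learning" + 0.015*"policy"'
for word, weight in parse_topic_string(example):
    print(f"{word}: {weight:.3f}")
```

Keeping the weights makes it possible to show how strongly each keyword is tied to its topic instead of listing the words alone.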

### 5.2 Topic Distribution

In [None]:
from collections import Counter

topic_counts = Counter()
doc_topics = []

for i, doc_bow in enumerate(corpus):
    topic_dist = lda_model.get_document_topics(doc_bow)
    if topic_dist:
        dominant_topic = max(topic_dist, key=lambda x: x[1])[0]
        topic_counts[dominant_topic] += 1
        doc_topics.append(dominant_topic)
    else:
        doc_topics.append(-1)

df['dominant_topic'] = doc_topics

print("\n📈 TOPIC DISTRIBUTION:")
print("-" * 50)
for topic_id in range(NUM_TOPICS):
    count = topic_counts.get(topic_id, 0)
    pct = count / len(df) * 100
    print(f"Topic {topic_id} ({topic_names[topic_id][:20]}): {count} papers ({pct:.1f}%)")

### 5.3 Topic Distribution Visualization

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
topics_list = list(range(NUM_TOPICS))
counts = [topic_counts.get(t, 0) for t in topics_list]
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']

bars = ax.bar(topics_list, counts, color=colors, edgecolor='black', linewidth=1.2)
ax.set_xlabel('Topic ID', fontsize=12)
ax.set_ylabel('Number of Papers', fontsize=12)
ax.set_title(f'Topic Distribution Across {len(df)} ArXiv Papers', fontsize=14, fontweight='bold')
ax.set_xticks(topics_list)

for bar, count in zip(bars, counts):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
            str(count), ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

## Step 6: Semantic Search with Sentence Transformers

In [None]:
print("\n🔍 Initializing Semantic Search...")
searcher = semantic_search.SemanticSearch()

print(f"   Encoding {len(df)} paper abstracts...")
paper_embeddings = searcher.encode_papers(tuple(df['abstract'].tolist()))
searcher.paper_embeddings = paper_embeddings

print(f"✅ Created {paper_embeddings.shape[0]} embeddings of dimension {paper_embeddings.shape[1]}")

### 6.1 Semantic Search Demo

In [None]:
test_queries = [
    "reinforcement learning for robot control",
    "transformer architecture for natural language",
    "autonomous navigation in complex environments"
]

print("\n🔍 SEMANTIC SEARCH DEMO:")
print("="*80)

for query in test_queries:
    print(f"\n📝 Query: '{query}'")
    print("-"*70)
    
    results = searcher.search(query, paper_embeddings, df, top_k=3)
    for i, result in enumerate(results):
        title = result['title'][:65]
        score = result['score']
        print(f"   {i+1}. [{score:.3f}] {title}...")
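
The implementation of `SemanticSearch.search` is not shown in this notebook. Embedding-based search of this kind usually ranks documents by cosine similarity between the query vector and each document vector, so the sketch below illustrates that idea with toy 3-dimensional vectors in place of real sentence embeddings. The function names are hypothetical, and the actual project code may rank differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return the k (doc_index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for sentence embeddings.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
ranking = top_k_by_cosine([1.0, 0.0, 0.0], doc_vecs, k=2)
for index, score in ranking:
    print(f"doc {index}: {score:.3f}")
```

Because cosine similarity ignores vector length, a long abstract and a short query can still score highly when they point in a similar direction in embedding space.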

### 6.2 t-SNE Embedding Visualization

In [None]:
from sklearn.manifold import TSNE

print("\n🎨 Computing t-SNE projection...")

X_embed = paper_embeddings.cpu().numpy()
y_topics = np.array(doc_topics)

valid_mask = y_topics >= 0
X_valid = X_embed[valid_mask]
y_valid = y_topics[valid_mask]

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_2d = tsne.fit_transform(X_valid)

fig, ax = plt.subplots(figsize=(12, 8))
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']

for topic_id in range(NUM_TOPICS):
    mask = y_valid == topic_id
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], 
               c=colors[topic_id], label=topic_names[topic_id], alpha=0.7, s=50)

ax.set_xlabel('t-SNE Dimension 1', fontsize=12)
ax.set_ylabel('t-SNE Dimension 2', fontsize=12)
ax.set_title('t-SNE Visualization of Paper Embeddings', fontsize=14, fontweight='bold')
ax.legend(loc='best', fontsize=9)
plt.tight_layout()
plt.show()

print("✅ t-SNE projection complete. Check the plot to see whether papers sharing a dominant topic cluster together.")
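
Visual inspection of a scatter plot is subjective. One simple way to put a number on how well topics separate is the fraction of points whose nearest neighbor has the same topic label. The sketch below is a standard-library illustration with made-up points; `nearest_neighbor_agreement` is a hypothetical helper and not part of this project. Since t-SNE distorts distances, running the same check on the original embeddings may give a more faithful picture.

```python
import math


def nearest_neighbor_agreement(points, labels):
    """Fraction of points whose nearest neighbor (Euclidean) carries the same label."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    matches = 0
    for i, point in enumerate(points):
        # Index of the closest other point.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        matches += labels[i] == labels[nearest]
    return matches / len(points)


# Made-up 2D points: two tight groups plus one point sitting in the "wrong" group.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.1)]
labels = [0, 0, 1, 1, 1]
print(f"Nearest-neighbor agreement: {nearest_neighbor_agreement(points, labels):.0%}")
```

With the notebook's variables this could be called as `nearest_neighbor_agreement(X_2d.tolist(), y_valid.tolist())`. The loop is quadratic in the number of points, which is fine for a few hundred papers.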

## Step 7: Neural Network Classifier

In [None]:
print("\n🧠 Training Neural Network Classifier...")

X = paper_embeddings.cpu().numpy()
y = np.array(doc_topics)

valid_mask = y >= 0
X = X[valid_mask]
y = y[valid_mask]

print(f"   Training samples: {len(X)}")
print(f"   Number of classes: {NUM_TOPICS}")
print(f"   Random baseline: {1/NUM_TOPICS:.1%}")

classifier, history = neural_classifier.train_classifier(
    X, y, num_topics=NUM_TOPICS, epochs=30, batch_size=32
)

print(f"\n" + "="*60)
print(f"🎯 KERAS NEURAL NETWORK RESULTS:")
print(f"   Training Accuracy:   {history.history['accuracy'][-1]:.1%}")
print(f"   Validation Accuracy: {history.history['val_accuracy'][-1]:.1%}")
print("="*60)

### 7.1 Training History

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(history.history['accuracy'], label='Training', linewidth=2)
ax1.plot(history.history['val_accuracy'], label='Validation', linewidth=2)
ax1.axhline(y=1/NUM_TOPICS, color='gray', linestyle='--', label=f'Random ({1/NUM_TOPICS:.0%})')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.set_title('Model Accuracy')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(history.history['loss'], label='Training', linewidth=2)
ax2.plot(history.history['val_loss'], label='Validation', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.set_title('Model Loss')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
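
The results above report the accuracy of the final epoch. When the validation curve peaks earlier and then declines, the best epoch can be more informative than the last one. A minimal sketch for finding it from a Keras-style history dictionary follows; `best_epoch` is a hypothetical helper, and the numbers in `toy_history` are invented for illustration and are not results from this project.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch number, value) at which the metric is highest."""
    values = history_dict[metric]
    index = max(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Invented numbers that mimic the shape of history.history.
toy_history = {
    "accuracy": [0.30, 0.45, 0.60, 0.75],
    "val_accuracy": [0.28, 0.41, 0.52, 0.49],
}
epoch, value = best_epoch(toy_history)
print(f"Best validation accuracy {value:.1%} at epoch {epoch}")
```

In the notebook this could be called as `best_epoch(history.history)`. A large gap between training and validation accuracy at that epoch is a common sign of overfitting.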

### 7.2 Classical ML Comparison

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

print("\n📊 CLASSICAL ML COMPARISON:")
print("="*60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (RBF)': SVC(kernel='rbf'),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

results = {'Random Baseline': 1/NUM_TOPICS}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    results[name] = acc
    print(f"   {name}: {acc:.1%}")

results['Keras Neural Network'] = history.history['val_accuracy'][-1]
print(f"   Keras Neural Network: {results['Keras Neural Network']:.1%}")

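baseline = results['Random Baseline']
beats_baseline = [name for name, acc in results.items() if name != 'Random Baseline' and acc > baseline]
print(f"\n✅ {len(beats_baseline)} of {len(results) - 1} classifiers outperform the {baseline:.0%} random baseline.")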
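
The `1/NUM_TOPICS` baseline assumes every topic is equally likely. When the topic distribution is uneven, always predicting the most common topic is a stricter reference point. The sketch below computes that majority-class baseline with the standard library; `majority_class_baseline` is a hypothetical helper and `toy_labels` is invented for illustration.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return (most common label, accuracy of always predicting that label)."""
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Invented labels: topic 0 covers half of the samples.
toy_labels = [0, 0, 0, 1, 2, 2, 0, 3, 4, 0]
label, accuracy = majority_class_baseline(toy_labels)
print(f"Always predicting topic {label} scores {accuracy:.0%}")
```

With the notebook's variables this could be called as `majority_class_baseline(y_test.tolist())`. A classifier is only clearly useful if it beats this number as well as the uniform baseline.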

## Step 8: Results Summary

| Metric | Value |
|--------|-------|
| Papers | 500 |
| Topics | 5 |
| Coherence | 0.4170 |
| Embedding Dim | 384 |
| Classifier Accuracy | ~53% |
| Random Baseline | 20% |

In [None]:
print("\n" + "="*70)
print("📊 FINAL RESULTS SUMMARY")
print("="*70)
print(f"📚 Dataset: {len(df)} ArXiv papers")
print(f"🎯 Topics: {NUM_TOPICS}")
print(f"📈 Coherence: {coherence_score:.4f}")
print(f"🔍 Embedding: {paper_embeddings.shape[1]} dimensions")
print(f"🧠 Validation Accuracy: {history.history['val_accuracy'][-1]:.1%}")
print(f"📊 Random Baseline: {1/NUM_TOPICS:.1%}")
print("="*70)
print("\n✅ ALL ANALYSIS COMPLETE!")

---
## Step 9: Interactive Demo (Streamlit)

In [None]:
!wget -q -O cloudflared-linux-amd64 https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
!chmod +x cloudflared-linux-amd64
print("✅ Cloudflared downloaded")

In [None]:
!streamlit run app.py &>/content/logs.txt &
print("🚀 Streamlit starting...")

!nohup ./cloudflared-linux-amd64 tunnel --url http://localhost:8501 > cloudflared.log 2>&1 &

import time
time.sleep(8)
print("\n🌐 Public URL:")
!grep -o 'https://.*\.trycloudflare\.com' cloudflared.log || echo "Run again if no URL."

### Click the URL above for the interactive demo!
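
The tunnel cell waits a fixed 8 seconds before searching the log, which is why a second run is sometimes needed. A possible alternative is to poll the log file until the URL appears or a timeout expires. The sketch below uses only the standard library; `wait_for_tunnel_url` is a hypothetical helper, and the sample log line is invented because the exact cloudflared log format is not shown here.

```python
import re
import tempfile
import time
from pathlib import Path

# Matches a quick-tunnel hostname such as https://some-words.trycloudflare.com
URL_PATTERN = re.compile(r"https://[A-Za-z0-9-]+\.trycloudflare\.com")


def wait_for_tunnel_url(log_path, timeout=30.0, interval=1.0):
    """Poll log_path until a trycloudflare.com URL appears; return None on timeout."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout
    while True:
        if path.exists():
            match = URL_PATTERN.search(path.read_text(errors="ignore"))
            if match:
                return match.group(0)
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Demonstration with an invented log line written to a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_log = Path(tmp_dir) / "cloudflared.log"
    sample_log.write_text("INF |  https://example-demo-tunnel.trycloudflare.com  |\n")
    print(wait_for_tunnel_url(sample_log, timeout=1.0))
```

In the notebook, `wait_for_tunnel_url("cloudflared.log")` could replace the `time.sleep(8)` and `grep` lines, returning as soon as the tunnel is ready.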