# 🧠 AI-Powered Course Search with OpenAI and Pinecone

This project builds a semantic search engine that lets users search through course data using natural language queries. It leverages **OpenAI embeddings** to capture the meaning of course descriptions and **Pinecone** as the vector database for fast and scalable similarity search.


### 1. Import Libraries and Initialize Environment

This section imports the required Python libraries and loads the environment variables needed for API access (the OpenAI and Pinecone API keys). This setup must run before any embedding or vector database operation.


In [38]:
import os
import uuid
from pprint import pprint

import pandas as pd
import pinecone
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec

# Load API keys from the local "env" file into the process environment
load_dotenv("env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
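
Because `os.getenv` returns `None` when a variable is missing, a misconfigured key would only surface later as an API error. The sketch below shows one way to fail early instead. `require_env` is a hypothetical helper and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

It could replace the two `os.getenv` calls above, for example `PINECONE_API_KEY = require_env("PINECONE_API_KEY")`.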

### 2. Load and Process Course Data

This section loads course data from a CSV file and builds a structured metadata dictionary for each course. It also creates a formatted description for each course that combines the course name, slug, technology, and topic, which will later be embedded.


In [39]:
files = pd.read_csv("course_descriptions.csv", encoding="latin1")

def create_course_metadata(row):
    return {
        "course_name": row["course_name"],
        "technology": row["course_technology"],
        "description": row["course_description_short"]
    }

def create_course_description(row):
    return f'''
     the course name is {row["course_name"]},
     the slug is {row["course_slug"]}, 
     the technology is {row["course_technology"]}
     and the course topic is {row["course_topic"]}
    '''
    
files["course_metadata"] = files.apply(create_course_metadata, axis=1)
files["course_description_new"] = files.apply(create_course_description, axis=1)
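
The triple-quoted template above keeps its line breaks and indentation in the resulting text. As a possible variation, the sketch below builds the same sentence on a single line by collapsing whitespace. `build_course_text` is a hypothetical name, and the sample slug and topic values are made up for illustration. A plain dict stands in for a DataFrame row, since both support `row["field"]` lookups.

```python
def build_course_text(row) -> str:
    """Combine the same four fields into one single-line sentence."""
    text = f'''
     the course name is {row["course_name"]},
     the slug is {row["course_slug"]},
     the technology is {row["course_technology"]}
     and the course topic is {row["course_topic"]}
    '''
    # split() with no arguments drops newlines and repeated spaces
    return " ".join(text.split())


# Hypothetical sample row; the slug and topic are illustrative values only
sample_row = {
    "course_name": "Introduction to Python",
    "course_slug": "introduction-to-python",
    "course_technology": "python",
    "course_topic": "programming",
}

print(build_course_text(sample_row))
```

Whether the extra whitespace affects embedding quality in practice is not something the notebook measures, so treat this as a tidiness option rather than a required fix.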

### 3. Initialize Pinecone and Embeddings

This section sets up the Pinecone client and configures the OpenAI embeddings model. The embeddings model converts course descriptions into numerical vector representations, which will be stored and queried in the Pinecone index.


In [40]:
# The current Pinecone client needs only the API key; cloud and region are set per index via ServerlessSpec
pc = Pinecone(api_key=PINECONE_API_KEY)
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

### 4. Generate Document Texts and Embeddings

This section processes the formatted course descriptions to create clean text inputs suitable for embedding. These texts are then passed to the OpenAI embeddings model to generate their corresponding vector representations, which capture the semantic meaning of each course.


In [41]:
def get_doc_texts_from_df(df):
    fields = ["course_description_new"]
    doc_texts = []
    for _, row in df.iterrows():
        combined = " ".join([str(row.get(field, "")) for field in fields])
        doc_texts.append(combined.strip())
    return doc_texts

doc_texts = get_doc_texts_from_df(files)
embeddings = embedder.embed_documents(doc_texts)
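
To make "capturing semantic meaning" concrete: texts with similar meaning map to vectors that point in similar directions, and cosine similarity measures that alignment. The sketch below uses toy 3-dimensional vectors that are made up for illustration. Real vectors from `text-embedding-3-small` have 1536 dimensions by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings)
python_query = [0.8, 0.2, 0.0]
python_course = [0.9, 0.1, 0.0]
sql_course = [0.0, 0.2, 0.9]

# The query is closer in direction to the Python course than to the SQL course
print(cosine_similarity(python_query, python_course))
print(cosine_similarity(python_query, sql_course))
```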


### 5. Create and Populate Pinecone Index

This section checks whether the specified Pinecone index exists and creates it if necessary. It then upserts the generated embeddings into the index in batches, attaching relevant metadata (such as course name, technology, and description) to each vector for efficient semantic search.


In [42]:
existing_indexes = pc.list_indexes().names()
index_name = "semantic-search-final"
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=len(embeddings[0]),  # must match the embedding model's output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)
# Format data for Pinecone
vectors = [
    {
        "id": str(uuid.uuid4()),
        "values": embedding,
        "metadata": files["course_metadata"].iloc[i]
    }
    for i, embedding in enumerate(embeddings)
]


batch_size = 100
for i in range(0, len(vectors), batch_size):
    batch = vectors[i:i+batch_size]
    index.upsert(vectors=batch)
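
One thing to be aware of: `upsert` overwrites a vector only when its ID already exists, so random `uuid.uuid4()` IDs add a fresh copy of every course each time the cell is rerun. A possible alternative, sketched below, derives a stable ID from a field that identifies the course. It assumes `course_slug` is unique per course, which the notebook does not confirm. `COURSE_NAMESPACE` and `stable_vector_id` are hypothetical names.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
COURSE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "course-search-demo")


def stable_vector_id(course_slug: str) -> str:
    """Return the same ID every time for the same slug."""
    return str(uuid.uuid5(COURSE_NAMESPACE, course_slug))


# The same slug always maps to the same ID, so a rerun overwrites instead of duplicating
print(stable_vector_id("introduction-to-python"))
print(stable_vector_id("introduction-to-python"))
```

In the vector list above, this would mean replacing `str(uuid.uuid4())` with something like `stable_vector_id(files["course_slug"].iloc[i])`.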

### 6. Perform Semantic Search

This section executes a semantic search by embedding a user-provided query and comparing it against the vectors stored in the Pinecone index. It retrieves the top matching results based on similarity scores and displays their metadata, allowing for meaningful search over course content.
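
The threshold step can be tried without a live index. The sketch below applies a minimum score to a list of match dictionaries shaped like the ones the query returns (`score` plus `metadata`). `filter_matches` is a hypothetical helper, and the third sample entry is invented to show a result being dropped.

```python
def filter_matches(matches, score_threshold=0.3):
    """Keep matches at or above the threshold, highest score first."""
    kept = [m for m in matches if m["score"] >= score_threshold]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


# Sample data for illustration; the last entry is invented
sample_matches = [
    {"score": 0.3104, "metadata": {"course_name": "Introduction to Python"}},
    {"score": 0.3279, "metadata": {"course_name": "A/B Testing in Python"}},
    {"score": 0.2500, "metadata": {"course_name": "Hypothetical low-scoring course"}},
]

# Numbering after filtering keeps the ranks consecutive
for rank, match in enumerate(filter_matches(sample_matches), start=1):
    print(rank, match["metadata"]["course_name"], f"{match['score']:.4f}")
```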



In [31]:
index = pc.Index(index_name)
query = "python beginner"
query_vector = embedder.embed_query(query)


In [37]:
score_threshold = 0.3
search_results = index.query(
    vector=query_vector,
    top_k=5,
    include_metadata=True
)

for i, match in enumerate(search_results['matches'], start=1):
    if match['score'] >= score_threshold:
        print(f"🔹 Match #{i}")
        print(f"   - Score      : {match['score']:.4f}")
        print(f"   - Course Name: {match['metadata'].get('course_name', 'N/A')}")
        print(f"   - Technology : {match['metadata'].get('technology', 'N/A')}")
        print(f"   - Description: {match['metadata'].get('description', 'N/A')}\n")

🔹 Match #1
   - Score      : 0.3279
   - Course Name: A/B Testing in Python
   - Technology : python
   - Description: A world-class professional teaches you how to perform A/B test experiments to create real business value and improve product experience.

🔹 Match #2
   - Score      : 0.3247
   - Course Name: Working with Text Files in Python
   - Technology : python
   - Description: Covering the essentials and providing hands-on experience working with *.csv, *.txt, *.json, and other types of text files in Python. With these tools under your belt, you'll be an independent analyst, ready to gain more insights from your data.

🔹 Match #3
   - Score      : 0.3104
   - Course Name: Introduction to Python
   - Technology : python
   - Description: Laying the foundations of programming in Python to prepare you for deploying machine and deep learning algorithms later in the training.

🔹 Match #4
   - Score      : 0.3047
   - Course Name: Web Scraping and API Fundamentals in Python
   - Technol