## Courses and Sections Semantic Search

In [1]:
%load_ext dotenv
%dotenv

In [2]:
import pandas as pd
import os
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv, find_dotenv
import pinecone
from sentence_transformers import SentenceTransformer

In [12]:
load_dotenv(find_dotenv(), override=True)

True

In [5]:
files = pd.read_csv("course_section_descriptions.csv", encoding="cp1252")

In [6]:
files['unique_id'] = files['course_id'].astype(str) + '-' + files['section_id'].astype(str)

In [7]:
files['metadata'] = files.apply(
    lambda row: {
        'course_name': row['course_name'],
        'section_name': row['section_name'],
        'section_description': row['section_description'],
    }, axis=1
)

In [8]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [9]:
def create_embeddings(row):
    combined_text = f'''
            {row['course_name']} {row['course_technology']},
            {row['course_description']} {row['section_name']}{row['section_description']} ,
            '''
    return model.encode(combined_text, show_progress_bar=False)

In [10]:
files['embedding'] = files.apply(create_embeddings, axis=1)

In [11]:
files.head()

Unnamed: 0,course_id,course_name,course_slug,course_description,course_description_short,course_technology,course_topic,course_instructor_quote,section_id,section_name,section_description,unique_id,metadata,embedding
0,2,Introduction to Tableau,tableau,Tableau is now one of the most popular busines...,Teaching you how to tell compelling stories wi...,tableau,data visualization,Data scientists don’t just need to deal with d...,9,Introduction to Tableau,While Tableau is an indispensable tool in the ...,2-9,"{'course_name': 'Introduction to Tableau', 'se...","[0.0028394142, -0.023841972, -0.08445899, -0.0..."
1,2,Introduction to Tableau,tableau,Tableau is now one of the most popular busines...,Teaching you how to tell compelling stories wi...,tableau,data visualization,Data scientists don’t just need to deal with d...,10,Tableau Functionalities,"In this section, you will create your first Ta...",2-10,"{'course_name': 'Introduction to Tableau', 'se...","[0.009216177, -0.018384973, -0.04621931, 0.014..."
2,2,Introduction to Tableau,tableau,Tableau is now one of the most popular busines...,Teaching you how to tell compelling stories wi...,tableau,data visualization,Data scientists don’t just need to deal with d...,11,The Tableau Exercise,This section is a practical example that will ...,2-11,"{'course_name': 'Introduction to Tableau', 'se...","[0.014854905, -0.027043158, -0.040676318, 0.02..."
3,3,The Complete Data Visualization Course with Py...,data-visualization,The Data Visualization course is designed for ...,Teaching you how to master the art of creating...,python,data visualization,Data visualization is the face of data. Many p...,12,Introduction,"In this section, you will learn about the impo...",3-12,{'course_name': 'The Complete Data Visualizati...,"[0.07602877, -0.02480618, -0.013246953, -0.028..."
4,3,The Complete Data Visualization Course with Py...,data-visualization,The Data Visualization course is designed for ...,Teaching you how to master the art of creating...,python,data visualization,Data visualization is the face of data. Many p...,13,Setting Up the Environments,"Here, we set up different environments for the...",3-13,{'course_name': 'The Complete Data Visualizati...,"[0.08091247, -0.02829237, -0.015455507, -0.032..."


In [13]:
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"), environment= os.environ.get("PINECONE_ENV"))

In [14]:
index_name = "ben-start-index"
dimension = 384
metric = "cosine"

In [15]:
if index_name in [index.name for index in pc.list_indexes()]:
    pc.delete_index(index_name)
    print(f"{index_name} successfully delete.")

else:
    print(f"{index_name} not in index list.")


ben-start-index successfully delete.


In [16]:
pc.create_index(
    name=index_name,
    dimension=dimension,
    metric=metric,
    spec=ServerlessSpec(
        cloud='aws',
        region='us-east-1'
    )
)

{
    "name": "ben-start-index",
    "metric": "cosine",
    "host": "ben-start-index-i2oc4nb.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 384,
    "deletion_protection": "disabled",
    "tags": null
}

In [17]:
index = pc.Index(index_name)

In [18]:
vectors_to_upsert = [(row['unique_id'], row['embedding'].tolist(), row['metadata']) for index, row in files.iterrows()]

In [19]:
index.upsert(vectors=vectors_to_upsert)

print('Data upserted to Pinecone Index ', index_name)

Data upserted to Pinecone Index  ben-start-index


## Semantic Search

In [26]:
query = 'regression'

query_embedding= model.encode(query, show_progress_bar=False).tolist()

In [37]:
query_results = index.query(
    vector=[query_embedding], top_k=6, include_metadata=True
)

In [38]:
query_results

{'matches': [{'id': '37-369',
              'metadata': {'course_name': 'Machine Learning in Python',
                           'section_description': 'While there are many '
                                                  'libraries that can compute '
                                                  'a regression model, the '
                                                  'most numerically stable one '
                                                  'is sklearn. It is also the '
                                                  'preferred choice of many '
                                                  'machine learning '
                                                  'professionals. In this '
                                                  'section, we implement all '
                                                  'we know about regressions '
                                                  'in this amazing library.',
                           'section_name': 'Li

In [39]:
score_threshold= 0.3

In [40]:
for match in query_results['matches']:

    if match['score'] >= score_threshold:
        course_details = match.get('metadata', {})
        course_name = course_details.get('course_name', 'N/A')
        section_name = course_details.get('section_name', 'N/A')
        section_description = course_details.get('section_description', 'No description available')

        print(f"Matched item ID: {match['id']}, score: {match['score']}")
        print(f"Course: {course_name} \nSection: {section_name} \nDescription: {section_description}")

Matched item ID: 37-369, score: 0.504379392
Course: Machine Learning in Python 
Section: Linear Regression with sklearn 
Description: While there are many libraries that can compute a regression model, the most numerically stable one is sklearn. It is also the preferred choice of many machine learning professionals. In this section, we implement all we know about regressions in this amazing library.
Matched item ID: 51-465, score: 0.48060903
Course: Machine Learning in Excel 
Section: Simple Linear Regression 
Description: Join us to create your first simple regression in Excel and get familiar with a very important statistical concept – the Ordinary least squares framework. You will learn about OLS assumptions, how to interpret regression results, as well as how to decompose variability. 
Matched item ID: 51-466, score: 0.473860651
Course: Machine Learning in Excel 
Section: Multiple Linear Regression 
Description: In section 3 you will discover multiple linear regression. We will exp