<a href="https://colab.research.google.com/github/Shubham91999/VectorDB-Semantic-Searching/blob/main/Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Representing words as numbers**




In [None]:
def integer_encode(text):
    # Split the text into words
    words = text.split()

    # Create a dictionary to store word-to-integer mappings
    word_to_int = {}

    # Create a list to store the integer representation
    integer_encoded = []

    # Counter for assigning new integers
    current_integer = 0

    for word in words:
        # Convert word to lowercase for consistency
        word = word.lower()

        # Strip leading and trailing punctuation (only these four characters;
        # a real application would need more thorough tokenization)
        word = word.strip('.,!?')

        if word not in word_to_int:
            # If it's a new word, assign it the next integer
            word_to_int[word] = current_integer
            current_integer += 1

        # Append the integer representation of the word
        integer_encoded.append(word_to_int[word])

    return integer_encoded, word_to_int

# Example usage
text = "The dog looked at the other dog."
encoded, word_map = integer_encode(text)

print("Original text:", text)
print("Integer encoded:", encoded)
print("Word to integer mapping:", word_map)

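# Decode the integers back to words (for verification).
# Build the inverse mapping once instead of searching the dictionary per token.
int_to_word = {index: word for word, index in word_map.items()}
decoded = ' '.join(int_to_word[i] for i in encoded)
print("Decoded text:", decoded)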

Original text: The dog looked at the other dog.
Integer encoded: [0, 1, 2, 3, 0, 4, 1]
Word to integer mapping: {'the': 0, 'dog': 1, 'looked': 2, 'at': 3, 'other': 4}
Decoded text: the dog looked at the other dog
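
The mapping above only covers words seen in the original sentence. As a minimal sketch of how it could be reused on new text, the hypothetical helper `encode_with_vocab` below (not part of the original notebook) applies the same lowercasing and punctuation stripping and maps unseen words to an assumed placeholder ID of `-1`.

```python
def encode_with_vocab(text, word_to_int, unknown_id=-1):
    """Encode text with an existing vocabulary.

    Words missing from word_to_int are mapped to unknown_id
    (an assumed placeholder, not something the notebook defines).
    """
    ids = []
    for word in text.split():
        # Same normalization as integer_encode: lowercase, strip punctuation
        word = word.lower().strip('.,!?')
        ids.append(word_to_int.get(word, unknown_id))
    return ids


# Vocabulary produced by integer_encode for "The dog looked at the other dog."
word_map = {'the': 0, 'dog': 1, 'looked': 2, 'at': 3, 'other': 4}

new_ids = encode_with_vocab("The cat looked at the dog!", word_map)
print(new_ids)  # [0, -1, 2, 3, 0, 1]
```

Note that the integers carry no meaning beyond identity: `dog` is 1 and `looked` is 2 only because of the order in which they first appeared.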


**TF-IDF Matrix Representation**

In [None]:
import spacy
import pandas as pd

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
nlp = spacy.load("en_core_web_md")

corpus = [
    "The dog likes the other dog.",
    "The lazy dog sleeps all day.",
    "The quick brown fox is cunning."
]

tokenized_corpus = []

for text in corpus:
    doc = nlp(text)

    # Keep lowercased tokens that are neither stop words nor punctuation
    tokens = [
        token.text.lower()
        for token in doc
        if not token.is_stop and not token.is_punct
    ]
    tokenized_corpus.append(' '.join(tokens))

print(tokenized_corpus)

['dog likes dog', 'lazy dog sleeps day', 'quick brown fox cunning']


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tokenized_corpus)

print("Feature names:", vectorizer.get_feature_names_out())
print("Document-term matrix:\n", X.toarray())

Feature names: ['brown' 'cunning' 'day' 'dog' 'fox' 'lazy' 'likes' 'quick' 'sleeps']
Document-term matrix:
 [[0 0 0 2 0 0 1 0 0]
 [0 0 1 1 0 1 0 0 1]
 [1 1 0 0 1 0 0 1 0]]
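
To make the document-term matrix less of a black box, here is a minimal sketch that rebuilds the same counts with only the standard library. It assumes the preprocessed sentences with the punctuation token left out (`CountVectorizer`'s default token pattern ignores single-character tokens anyway), and the names `vocab` and `matrix` are illustrative.

```python
from collections import Counter

# Preprocessed sentences (stop words removed, punctuation left out)
tokenized_corpus = [
    'dog likes dog',
    'lazy dog sleeps day',
    'quick brown fox cunning',
]

# Vocabulary: unique words in alphabetical order,
# the same ordering CountVectorizer uses for its feature names
vocab = sorted({word for doc in tokenized_corpus for word in doc.split()})

# One row per document, one column per vocabulary word
matrix = []
for doc in tokenized_corpus:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is a fixed-length vector regardless of sentence length, which is what makes the documents directly comparable.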


A better way to view this matrix is as a pandas DataFrame, with one row per document and one column per vocabulary term.

In [None]:
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

   brown  cunning  day  dog  fox  lazy  likes  quick  sleeps
0      0        0    0    2    0     0      1      0       0
1      0        0    1    1    0     1      0      0       1
2      1        1    0    0    1     0      0      1       0
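
Raw counts treat every term as equally informative, even though `dog` appears in two of the three documents. As a sketch of how TF-IDF reweights the counts, the block below applies the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. This is the variant scikit-learn's `TfidfVectorizer` uses with its default settings. The code is a hand-rolled illustration, not the notebook's own implementation, and the variable names are assumptions.

```python
import math

vocab = ['brown', 'cunning', 'day', 'dog', 'fox', 'lazy', 'likes', 'quick', 'sleeps']
counts = [
    [0, 0, 0, 2, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 1, 0],
]

n_docs = len(counts)

# Document frequency: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: terms found in fewer documents weigh more
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    # Term frequency (raw count) times IDF
    weighted = [tf * w for tf, w in zip(row, idf)]
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

In the second document, `dog` ends up with a lower weight than `lazy`, `sleeps`, or `day`, because it is the only one of those terms shared with another document.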
