This project improves a Netflix show search engine by changing how search queries are matched against show descriptions in the Netflix catalog. It combines text preprocessing with vector space modeling so that results are more relevant to the user's query. The sections below describe each component:

### **Objective**
The objective is to build a search engine that retrieves Netflix shows relevant to a user's search query. It extends a basic cosine-similarity search with three additions: stop-word removal and lemmatization during preprocessing, TF-IDF weighting over unigrams and bigrams, and dimensionality reduction with Truncated SVD.

### **Application of the Vector Space Model (VSM)**

The Vector Space Model (VSM) is the core of the search engine: it is the framework used to match user queries against Netflix show descriptions. VSM represents each text document as a vector in a multidimensional space, where each dimension corresponds to a unique term in the dataset. Similarity between texts can then be measured numerically, which is the basis for ranking documents against a query.

#### **Incorporation into Preprocessing**

During preprocessing, each document (a show's title, genres, and description combined into one string) is prepared for vectorization. The text is lowercased, stripped of punctuation, tokenized, filtered for English stop words, and lemmatized, so that variants of the same word map to the same dimension in the vector space.
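The cleaning steps can be illustrated with a minimal, standard-library sketch. It is a simplification and not the project's code: `clean_text` and the tiny `STOP_WORDS` set are hypothetical names introduced here, whitespace splitting stands in for NLTK's tokenizer, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the project itself uses NLTK's full English list.
STOP_WORDS = {"a", "an", "and", "the", "when", "one", "of", "in"}

def clean_text(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    # Remove every character that is neither a word character nor whitespace
    # (hyphens, brackets, quotes, '&', commas, ...).
    text = re.sub(r"[^\w\s]", "", text.lower())
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = clean_text(
    "Stranger Things ['Sci-Fi & Fantasy', 'Mystery'] When a young boy vanishes"
)
print(tokens)
```

Note that stripping punctuation before tokenizing turns "Sci-Fi" into the single token `scifi`, which is why that token shows up in the processed text.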

#### **Vectorization and the Essence of VSM**

TF-IDF vectorization assigns each document a vector whose elements are term frequencies weighted by inverse document frequency. Terms that occur in many documents are down-weighted, and terms that are distinctive for a document are up-weighted. The vectorizer also includes bigrams (`ngram_range=(1, 2)`), so the representation captures pairs of adjacent terms as well as individual terms.
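To make the weighting concrete, here is a toy pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how scikit-learn documents its default settings. The helper names `ngrams` and `tfidf_vectors` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return unigrams and bigrams (as space-joined strings) for a token list."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf_vectors(docs):
    """Toy TF-IDF: raw term counts times smoothed IDF, then L2 normalisation."""
    term_lists = [ngrams(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for terms in term_lists for term in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for terms in term_lists:
        weights = {t: count * idf[t] for t, count in Counter(terms).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = ["small town mystery", "small town comedy", "space mystery"]
vecs = tfidf_vectors(docs)

# Terms of the first document, highest weight first
for term, weight in sorted(vecs[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} {weight:.3f}")
```

In the first document the bigram "town mystery" occurs in no other document, so it gets a higher weight than "small town", which also appears in the second document.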

#### **Dimensionality Reduction through the Lens of VSM**

Truncated SVD reduces the TF-IDF vectors to 100 components. Applied to a TF-IDF matrix, this is the technique known as latent semantic analysis: terms that tend to co-occur are merged into shared dimensions, so a query and a document can be similar even when they share few exact terms. Working in fewer dimensions also lessens the sparsity problems of the full term space.
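As a sketch of the linear algebra involved (the notation is introduced here for illustration and does not appear in the project code): let $X$ be the $n \times m$ TF-IDF matrix for $n$ documents and $m$ terms, and let $k$ be the number of retained components. The rank-$k$ factorization, the reduced document matrix $X_k$, and the projection of a TF-IDF query row vector $q$ are then

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad X_k = U_k \Sigma_k = X V_k, \qquad q_k = q V_k
$$

Here $U_k$ is $n \times k$, $\Sigma_k$ is $k \times k$ and diagonal, and $V_k$ is $m \times k$. The equality $U_k \Sigma_k = X V_k$ holds for an exact truncated SVD; a randomized solver satisfies it only approximately. Projecting the query with the same $V_k$ is what places queries and documents in the same reduced space, so they can be compared directly.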

#### **Enhancing Cosine Similarity with VSM**

Cosine similarity is the metric used to compare the query vector with each document vector. It measures the cosine of the angle between two vectors, so it reflects direction rather than magnitude: a long description and a short one that use terms in the same proportions score as similar. The quality of the ranking therefore depends on the preprocessing and vectorization steps that produce the vectors.
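The direction-versus-magnitude point can be checked with a small pure-Python sketch. The `cosine` helper and the sample vectors are illustrative, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1.0, 1.0, 0.0]
scaled_doc = [2.0, 2.0, 0.0]  # same direction as the query, twice the magnitude
other_doc = [0.0, 0.0, 3.0]   # shares no dimension with the query

print(cosine(query, scaled_doc))  # approximately 1: identical direction
print(cosine(query, other_doc))   # 0.0: orthogonal vectors
```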

### **Results and Impact through VSM**

Applying these VSM techniques is intended to make the search engine return more relevant results than plain keyword matching. Because bigram TF-IDF and SVD capture term context and co-occurrence, a query can match shows that are related in content and not only those that repeat its exact words. In the usage example below, the query "stranger" returns Stranger Things as the top result.



#### **Importing necessary libraries**

In [3]:
import pandas as pd
import json
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

# NLTK setup
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/AnonymousStudent/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/AnonymousStudent/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/AnonymousStudent/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## The Vector Space Model (VSM)

The `EnhancedVSMModel` class below implements this pipeline: it loads the show data from JSON, maps genre IDs to names, preprocesses the combined text, builds the TF-IDF matrix, reduces it with Truncated SVD, and ranks shows against a query by cosine similarity.

In [13]:

class EnhancedVSMModel:
    def __init__(self, data_path):
        self.data_path = data_path
        self.df = None
        self.tfidf_matrix = None
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # Use n-grams
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.genre_mapping = {
            10759: 'Action & Adventure', 16: 'Animation', 35: 'Comedy', 80: 'Crime',
            99: 'Documentary', 18: 'Drama', 10751: 'Family', 10762: 'Kids',
            9648: 'Mystery', 10763: 'News', 10764: 'Reality', 10765: 'Sci-Fi & Fantasy',
            10766: 'Soap', 10767: 'Talk', 10768: 'War & Politics', 37: 'Western'
        }
        self.svd = TruncatedSVD(n_components=100)  # Dimensionality reduction component
        self._load_data()
        self._preprocess_data()
        self._vectorize()

    def _load_data(self):
        with open(self.data_path) as f:
            data = json.load(f)
        self.df = pd.DataFrame(data)
        # Map numeric genre IDs to readable names; unknown IDs become 'Other'
        self.df['genre_names'] = self.df['genre_ids'].apply(lambda ids: [self.genre_mapping.get(genre_id, 'Other') for genre_id in ids])

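    def _preprocess_text(self, text):
        # Lowercase and strip punctuation (this also turns "Sci-Fi" into "scifi")
        text = re.sub(r'[^\w\s]', '', text.lower())
        words = nltk.word_tokenize(text)
        # Drop English stop words, then reduce each word to its lemma
        words = [word for word in words if word not in self.stop_words]
        words = [self.lemmatizer.lemmatize(word) for word in words]
        return ' '.join(words)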

    def _preprocess_data(self):
        self.df['combined_text'] = self.df['name'] + ' ' + self.df['genre_names'].astype(str) + ' ' + self.df['overview']
        self.df['processed_text'] = self.df['combined_text'].apply(self._preprocess_text)

    def _vectorize(self):
        tfidf_matrix_raw = self.vectorizer.fit_transform(self.df['processed_text'])
        self.tfidf_matrix = self.svd.fit_transform(tfidf_matrix_raw)  # Apply SVD


    def search(self, query, top_n=500):
        # Project the query into the same reduced space as the documents
        preprocessed_query = self._preprocess_text(query)
        query_vector_raw = self.vectorizer.transform([preprocessed_query])
        query_vector = self.svd.transform(query_vector_raw)  # Reduce query vector dimensions
        cosine_similarities = cosine_similarity(query_vector, self.tfidf_matrix).flatten()
        # Indices of the top_n most similar documents, best match first
        top_indices = cosine_similarities.argsort()[-top_n:][::-1]
        # Copy the selected rows so that adding a column does not raise SettingWithCopyWarning
        results_with_scores = self.df.iloc[top_indices].copy()
        results_with_scores['cosine_similarity'] = cosine_similarities[top_indices]
        return results_with_scores.to_dict(orient='records')

# Usage example
vsm_model = EnhancedVSMModel('../data/test_data.json')
search_results = vsm_model.search("stranger")

print(search_results)



[{'adult': False, 'backdrop_path': '/56v2KjBlU4XaOv9rVYEQypROD7P.jpg', 'genre_ids': [10765, 9648, 18], 'id': 66732, 'origin_country': ['US'], 'original_language': 'en', 'original_name': 'Stranger Things', 'overview': 'When a young boy vanishes, a small town uncovers a mystery involving secret experiments, terrifying supernatural forces, and one strange little girl.', 'popularity': 498.395, 'poster_path': '/rbnuP7hlynAMLdqcQRCpZW9qDkV.jpg', 'first_air_date': '2016-07-15', 'name': 'Stranger Things', 'vote_average': 8.615, 'vote_count': 16805, 'genre_names': ['Sci-Fi & Fantasy', 'Mystery', 'Drama'], 'combined_text': "Stranger Things ['Sci-Fi & Fantasy', 'Mystery', 'Drama'] When a young boy vanishes, a small town uncovers a mystery involving secret experiments, terrifying supernatural forces, and one strange little girl.", 'processed_text': 'stranger thing scifi fantasy mystery drama young boy vanishes small town uncovers mystery involving secret experiment terrifying supernatural force on



### **Conclusion: VSM's Role in Revolutionizing Search**

In conclusion, the Vector Space Model is the central design choice in this Netflix show search engine, not an incidental technical detail. Combining VSM with text preprocessing, n-gram TF-IDF, and dimensionality reduction lets the engine go beyond exact keyword matching. The project illustrates how established mathematical models and text-processing techniques can be combined to improve search in content-rich domains.