## **Proof of Concept**

#### This notebook explores the cleaned IMDb US movies dataset to demonstrate the key steps: loading the data, evaluating the preprocessing and TF-IDF pipelines, and implementing a simple search function based on cosine similarity.


### Importing Libraries

In [31]:
#%pip install pyarrow

import os
import numpy as np
import pandas as pd
from dotenv import load_dotenv
import polars as pl
import joblib

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MultiLabelBinarizer, FunctionTransformer, KBinsDiscretizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import set_config
from sklearn.metrics.pairwise import cosine_similarity

from src.custom_transformers import *


## **Data Loading**

In [32]:
df = pl.read_parquet("imdb_us_movies_cleaned.parquet").to_pandas()
print("Data loaded successfully.")

Data loaded successfully.


In [33]:
df.shape

(390855, 11)

In [34]:
df.sample(5)

|        | num__isAdult | num__startYear | num__runtimeMinutes | num__averageRating | num__numVotes | cat__title | cat__types | cat__genres | remainder__cast | remainder__directors | remainder__writers |
|-------:|-------------:|---------------:|--------------------:|-------------------:|--------------:|:-----------|:-----------|:------------|:----------------|:---------------------|:-------------------|
| 192062 | 0.0 | 2006.0 | -1.0 | -1.0 | -1.0 | Last Night in D-Town | missing | drama | [{'category': 'actor', 'job': 'missing', 'char... | [{'primaryName': 'Lisa Carter', 'birthYear': -... | [] |
| 358002 | 0.0 | 1936.0 | 54.0 | 4.4 | 12.0 | Child Marriage | missing | drama | [{'category': 'cinematographer', 'job': 'missi... | [{'primaryName': 'Pat Carlyle', 'birthYear': 1... | [{'primaryName': 'Lillian Gaffney', 'birthYear... |
| 99993 | 1.0 | 1983.0 | 73.0 | -1.0 | -1.0 | Trick Time | imdbdisplay | adult | [{'category': 'actor', 'job': 'missing', 'char... | [] | [] |
| 349020 | 0.0 | 1941.0 | 85.0 | 6.7 | 2513.0 | Danger Harbor | working | crime,drama,film-noir | [{'category': 'cinematographer', 'job': 'cinem... | [{'primaryName': 'Anatole Litvak', 'birthYear'... | [{'primaryName': 'Robert Rossen', 'birthYear':... |
| 372608 | 0.0 | -1.0 | -1.0 | -1.0 | -1.0 | The Leading Man | missing | action,comedy | [{'category': 'writer', 'job': 'missing', 'cha... | [] | [{'primaryName': 'Jon Hoeber', 'birthYear': -1... |


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390855 entries, 0 to 390854
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   num__isAdult          390855 non-null  float64
 1   num__startYear        390855 non-null  float64
 2   num__runtimeMinutes   390855 non-null  float64
 3   num__averageRating    390855 non-null  float64
 4   num__numVotes         390855 non-null  float64
 5   cat__title            390855 non-null  object 
 6   cat__types            390855 non-null  object 
 7   cat__genres           390855 non-null  object 
 8   remainder__cast       390855 non-null  object 
 9   remainder__directors  390855 non-null  object 
 10  remainder__writers    390855 non-null  object 
dtypes: float64(5), object(6)
memory usage: 32.8+ MB


### Loading preprocessing pipelines and TF-IDF Vectorizer using joblib 

In [36]:
PREPROCESSOR_FILE = 'preprocessor.joblib'
TFIDF_VECTORIZER_FILE = 'tfidf_vectorizer.joblib'

FULL_PREPROCESSOR = joblib.load(PREPROCESSOR_FILE)
TFIDF_VECTORIZER = joblib.load(TFIDF_VECTORIZER_FILE)

print("Preprocessor and TF-IDF Vectorizer loaded successfully.")

Preprocessor and TF-IDF Vectorizer loaded successfully.
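
For readers unfamiliar with `joblib`, the following minimal sketch shows the save-and-load round trip that the cell above relies on. The toy corpus and the file name are hypothetical and are not part of this project; the sketch only illustrates that a fitted vectorizer keeps its vocabulary after being reloaded.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the IMDb data used in this notebook.
corpus = ["toy story adventure animation", "child marriage drama"]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, chosen only for this illustration.
    path = os.path.join(tmp_dir, "toy_tfidf_vectorizer.joblib")
    joblib.dump(vectorizer, path)
    restored = joblib.load(path)

# The restored object carries the fitted vocabulary,
# so transform() can be called without refitting.
same_vocabulary = restored.vocabulary_ == vectorizer.vocabulary_
query_matrix = restored.transform(["toy story"])
print(same_vocabulary, query_matrix.shape)
```

Whether the files loaded in this notebook contain fitted or unfitted objects depends on how they were saved, which is not shown here.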


### Checking whether the dataset contains infinite values

In [37]:
print((df.isin([np.inf, -np.inf])).sum())

num__isAdult            0
num__startYear          0
num__runtimeMinutes     0
num__averageRating      0
num__numVotes           0
cat__title              0
cat__types              0
cat__genres             0
remainder__cast         0
remainder__directors    0
remainder__writers      0
dtype: int64
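
An alternative way to run the same check is to restrict it to numeric columns and use `np.isinf`. The sketch below uses a small hypothetical frame (the values are made up for illustration) rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two injected infinite values.
toy = pd.DataFrame({
    "num__runtimeMinutes": [54.0, np.inf, 85.0],
    "num__numVotes": [12.0, 2513.0, -np.inf],
    "cat__title": ["Child Marriage", "Trick Time", "Danger Harbor"],
})

# Only numeric columns can hold infinities, so the text column is skipped.
numeric = toy.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()
print(inf_counts)
```

Both approaches should report the same counts on numeric columns; the `isin` version used above simply also scans the object columns.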


### Listing the columns by data type and configuring scikit-learn to return transformed data as a DataFrame

In [38]:
numeric_cols = ['num__isAdult', 'num__startYear', 'num__runtimeMinutes', 'num__averageRating', 'num__numVotes']
corpus_cols = ['cat__title', 'cat__genres', 'remainder__cast', 'remainder__directors', 'remainder__writers']
binner_cols = ['num__startYear', 'num__runtimeMinutes', 'num__averageRating']
onehot_cols = ['cat__types']
multilabel_cols = ['cat__genres']

set_config(transform_output="pandas")


## **Processing the Data**

### Implementation of the preprocessing pipeline:

In [39]:
df[numeric_cols] = df[numeric_cols].replace(-1.0, np.nan)
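
The cell above turns the `-1.0` placeholder seen in the sampled rows into `NaN`, presumably so that later steps treat it as a missing value rather than a real number. A minimal sketch of the same operation on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that uses -1.0 as a placeholder for unknown values.
toy = pd.DataFrame({
    "num__startYear": [2006.0, -1.0, 1941.0],
    "num__averageRating": [-1.0, 4.4, 6.7],
})
toy_numeric_cols = ["num__startYear", "num__averageRating"]

# Replace the placeholder with NaN so pandas and scikit-learn see it as missing.
toy[toy_numeric_cols] = toy[toy_numeric_cols].replace(-1.0, np.nan)

missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Note that this replacement would also affect any column where `-1.0` is a legitimate value, so it is only safe when the placeholder cannot occur naturally.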

In [40]:
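# Note: fit_transform refits the loaded preprocessor on this data.
# If the saved pipeline was already fitted, transform(df) would reuse it instead.
df_preprocessed = FULL_PREPROCESSOR.fit_transform(df)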

In [41]:
df_preprocessed.shape

(390855, 72)

### Converting the preprocessed data into a DataFrame

In [42]:
df_processed = pd.DataFrame(df_preprocessed)

In [43]:
df_processed.sample(5)

|        | search_corpus__searchable_text | ranking_numeric__num__isAdult | ranking_numeric__num__startYear | ranking_numeric__num__runtimeMinutes | ranking_numeric__num__averageRating | ranking_numeric__is_missing_num__startYear | ranking_numeric__is_missing_num__runtimeMinutes | ranking_numeric__is_missing_num__averageRating | ranking_numeric__is_missing_num__numVotes | ranking_numeric__num__numVotes_log | ... | filter_multilabel_genres__news | filter_multilabel_genres__reality-tv | filter_multilabel_genres__romance | filter_multilabel_genres__sci-fi | filter_multilabel_genres__short | filter_multilabel_genres__sport | filter_multilabel_genres__talk-show | filter_multilabel_genres__thriller | filter_multilabel_genres__war | filter_multilabel_genres__western |
|-------:|:-------------------------------|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 113280 | Superior Orders documentary | -0.150341 | 0.486798 | -0.338749 | -1.194338 | -0.464029 | -0.614289 | 1.26885 | 1.26885 | -1.038056 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 183159 | The Poisoner drama mystery romance | -0.150341 | 0.488109 | 0.177105 | -1.194338 | -0.464029 | -0.614289 | 1.26885 | 1.26885 | -1.038056 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 283541 | W.L Dow, Architect documentary | -0.150341 | 0.486798 | -0.069608 | -1.194338 | -0.464029 | -0.614289 | 1.26885 | 1.26885 | -1.038056 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3736 | The Other Side drama | -0.150341 | 0.367431 | -0.114465 | -1.194338 | -0.464029 | -0.614289 | 1.26885 | 1.26885 | -1.038056 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 188857 | A Hole in the Head comedy drama | -0.150341 | 0.415965 | 1.231243 | 0.79094 | -0.464029 | -0.614289 | -0.788115 | -0.788115 | 1.308379 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Implementation of TF-IDF Vectorization:

In [44]:
# Note: fit_transform relearns the vocabulary and IDF weights from this corpus.
tfidf_vectors = TFIDF_VECTORIZER.fit_transform(df_processed['search_corpus__searchable_text'])

In [45]:
tfidf_vectors.shape

(390855, 2000)
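
Before building the search function, it may help to see TF-IDF and cosine similarity working together on a tiny scale. The sketch below uses a hypothetical three-document corpus modeled on the searchable text shown earlier, and a default `TfidfVectorizer` rather than the one loaded in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus in the same "title + genres" style as the searchable text.
corpus = [
    "Toy Story adventure animation comedy",
    "Child Marriage drama",
    "Danger Harbor crime drama film-noir",
]

# Default settings are an assumption; the notebook's vectorizer may be configured differently.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

# The query is projected into the same vocabulary space as the documents.
query_vector = vectorizer.transform(["toy story"])

# One similarity score per document; documents sharing no terms with the query score 0.
scores = cosine_similarity(query_vector, doc_vectors).flatten()
best_index = int(scores.argmax())
print(corpus[best_index])
```

Each row of the TF-IDF matrix is a document vector, so the `(390855, 2000)` shape above corresponds to one 2000-dimensional vector per movie.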

## **Building a first simple search function using cosine similarity:**

**Description of search_data_with_title_priority function:**

Ranks rows by cosine similarity to the query, placing first every row whose `title_column` text contains the full query string (case-insensitive substring match).

**Args:**
* **query** (str or list): The search query (keywords).
* **tfidf\_vectors**: The TF-IDF matrix containing document vectors.
* **df\_processed** (pd.DataFrame): The DataFrame with processed data (including titles).
* **TFIDF\_VECTORIZER**: The TfidfVectorizer object used for transformation.
* **title\_column** (str): The name of the column in `df_processed` that is checked for the query string (by default the searchable text, which includes the movie title).
* **top\_n** (int): The number of results to return.

**Returns:**
* **tuple**: (pd.DataFrame with the best results, numpy.array with their similarities)

In [46]:
def search_data_with_title_priority(query, tfidf_vectors, df_processed, TFIDF_VECTORIZER, title_column='search_corpus__searchable_text', top_n=5):

    # Collapse the query into one string so that exactly one query vector is built,
    # even when a list of keywords is passed.
    query_text = " ".join(query) if isinstance(query, list) else query
    search_string = query_text.lower()

    query_vector = TFIDF_VECTORIZER.transform([query_text])
    cosine_similarities = cosine_similarity(query_vector, tfidf_vectors).flatten()

    results_df = df_processed.copy()
    results_df['similarity'] = cosine_similarities

    # Literal, case-insensitive substring match. regex=False keeps characters such as
    # '(' or '+' in the query from being interpreted as a pattern; na=False treats
    # missing text as a non-match.
    title_matches = results_df[title_column].str.lower().str.contains(search_string, regex=False, na=False)
    results_df['title_match_priority'] = title_matches.astype(int)

    # Rows that contain the query come first; within each group, order by similarity.
    results_df_sorted = results_df.sort_values(
        by=['title_match_priority', 'similarity'],
        ascending=[False, False]
    )

    top_results = results_df_sorted.head(top_n)

    return top_results.drop(columns=['similarity', 'title_match_priority']), top_results['similarity'].values
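
The ranking rule used by this function can be seen in isolation in the following sketch. The three rows and their similarity scores are made up for illustration; only the sorting logic mirrors the function above.

```python
import pandas as pd

# Hypothetical rows with invented similarity scores.
toy = pd.DataFrame({
    "searchable_text": [
        "Charlie: A Toy Story drama family",
        "Toy Soldiers action drama",
        "Toy Story adventure animation comedy",
    ],
    "similarity": [0.40, 0.55, 0.86],
})

search_string = "toy story"

# Flag rows whose text contains the whole query as a literal substring.
toy["title_match"] = (
    toy["searchable_text"].str.lower().str.contains(search_string, regex=False)
)

# Matching rows first, then higher similarity first within each group.
ranked = toy.sort_values(by=["title_match", "similarity"], ascending=[False, False])
ranked_texts = ranked["searchable_text"].tolist()
print(ranked_texts)
```

In this toy case the row without a substring match is ranked last even though its similarity is higher than that of one matching row, which is the intended effect of the title priority.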


**Description of interactive_search function:**

This is an interactive search function that prompts the user to enter a query and then uses `search_data_with_title_priority` to display the results.

**Args:**
* **tfidf\_vectors**: The TF-IDF matrix containing document vectors.
* **df\_processed** (pd.DataFrame): The DataFrame with the processed data.
* **TFIDF\_VECTORIZER**: The TfidfVectorizer object.
* **title\_column** (str): The name of the column containing the titles.
* **top\_n** (int): The number of results to return.

**Returns:**
* **tuple**: (pd.DataFrame with the best results, numpy.array with their similarities), or (None, None) if the search raises an exception

In [47]:
def interactive_search(tfidf_vectors, df_processed, TFIDF_VECTORIZER, title_column='search_corpus__searchable_text', top_n=5):

    print("--- Interactive Movie Search---")
    user_query = input("Enter keywords (e.g., Forrest Gump Tom Hanks): ")
    print(f"Searching for: '{user_query}'...")

    try:
        results_df, similarities = search_data_with_title_priority(
            query=user_query,
            tfidf_vectors=tfidf_vectors,
            df_processed=df_processed,
            TFIDF_VECTORIZER=TFIDF_VECTORIZER,
            title_column=title_column,
            top_n=top_n
        )

        print("\n Best Results:")

        display_df = results_df.copy()
        display_df['Cosine Similarity'] = [f"{s:.4f}" for s in similarities]

        if title_column in display_df.columns:
            display_cols = [title_column, 'Cosine Similarity']
        else:
            display_cols = ['Cosine Similarity']
            print(f"Warning: Column '{title_column}' not found for displaying titles.")


        print(display_df[display_cols].to_markdown(index=False))

        return results_df, similarities

    except Exception as e:
        print(f"An error occurred during search: {e}")
        return None, None

### Evaluating the interactive search function

In [48]:
interactive_search(tfidf_vectors, df_processed, TFIDF_VECTORIZER, title_column='search_corpus__searchable_text', top_n=15)

--- Interactive Movie Search---
Searching for: 'Toy story'...

 Best Results:
| search_corpus__searchable_text                |   Cosine Similarity |
|:----------------------------------------------|--------------------:|
| Toy Story in 3-D adventure animation comedy   |               0.856 |
| Toy Story adventure animation comedy          |               0.856 |
| Toy Story 2 adventure animation comedy        |               0.856 |
| Toy Story Replayed adventure animation comedy |               0.856 |
| Toy Story 3 adventure animation comedy        |               0.856 |
| Toy Story 5 adventure animation comedy        |               0.856 |
| Toy Story adventure animation comedy          |               0.856 |
| Toy Story 5 adventure animation comedy        |               0.856 |
| Toy Story 2 adventure animation comedy        |               0.856 |
| Toy Story 4 adventure animation comedy        |               0.856 |
| Toy Story 3 adventure animation comedy        |         

(                          search_corpus__searchable_text  \
 20164     Toy Story in 3-D adventure animation comedy      
 67449            Toy Story adventure animation comedy      
 74757          Toy Story 2 adventure animation comedy      
 80449   Toy Story Replayed adventure animation comedy      
 92725          Toy Story 3 adventure animation comedy      
 119686         Toy Story 5 adventure animation comedy      
 183826           Toy Story adventure animation comedy      
 206308         Toy Story 5 adventure animation comedy      
 228275         Toy Story 2 adventure animation comedy      
 231066         Toy Story 4 adventure animation comedy      
 296571         Toy Story 3 adventure animation comedy      
 346234         Toy Story 4 adventure animation comedy      
 363241  Toy Story 2 in 3-D adventure animation comedy      
 378072         Toy Story 5 adventure animation comedy      
 266918              Charlie: A Toy Story drama family      
 
         ranking_numer

#### The combination of **TF-IDF vectorization** and **cosine similarity** works as expected for this simple search. For the query "Toy story", all 15 returned rows contain the query in their searchable text, because the function ranks such rows ahead of everything else before ordering by similarity. The results are shown in a compact table with each row's searchable text and its cosine similarity score. This suggests that TF-IDF with cosine similarity is a workable baseline for information retrieval on this movie dataset.