# **Modeling and Evaluation of Best Model**

- I evaluated seven recommenders, ranging from simple bag-of-words models (`CountVectorizer + Cosine`, `CountVectorizer + KNN`) through TF-IDF weighting, latent SVD, and hybrid text + numeric feature stacks to a metadata-only cosine model. All metrics below come from a single test query, "Jason Bourne", with the four other Bourne films treated as the relevant set.
- **TF-IDF + KNN (System 3)** achieved the best results on the Bourne query (`Precision@5 = 0.80`, `Recall@5 = 1.00`, `NDCG@5 = 1.00`), placing all four relevant titles in the first four positions.
- **Hybrid text + numeric models (Systems 5 & 6)** blended TF-IDF text vectors with scaled features like budget and popularity, yielding moderate performance (`P@5 = 0.60`, `R@5 = 0.75`, `NDCG@5 ≈ 0.83`), with System 6 offering precomputed similarities for faster lookups.
- **Latent SVD (System 4)** reduced dimensionality for efficiency but placed no relevant Bourne film in the top 5 (all metrics = `0.00`), which suggests that latent topics built on raw counts are not always sufficient without term weighting.
- The **metadata-only cosine recommender (System 7)** was simple and fast but lagged behind the TF-IDF and hybrid systems (`P@5 = 0.40`, `R@5 = 0.50`, `NDCG@5 = 0.64`), retrieving only two of the four relevant titles in its top 5.


# Import Processed & Feature-Engineered Data

The `holly2.csv` file contains the finalized dataset, generated by executing the following steps in the **Recommendation System (EDA & Features).ipynb** notebook:

1. **Data Collection**  
2. **Preprocessing & Merging**  
3. **Feature Engineering**  
4. **Missing-Data Analysis & Imputation**  
5. **Exploratory Data Analysis (EDA)**  
6. **Feature Selection & Text Engineering**


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import pandas as pd

file_path = '/content/drive/My Drive/Collab/Data files/holly2.csv'

holly2 = pd.read_csv(file_path)

print(holly2.shape)
holly2.head()


(46624, 20)


Unnamed: 0,title,budget,popularity,revenue,runtime,vote_average,vote_count,collection,production_houses,production_countries_clean,year,spoken_languages_list,Director,Producer,Writer,lead_actor,lead_character,content,content2,content3
0,Toy Story,30000000.0,21.946943,373554033.0,81.0,7.7,5415.0,Toy Story,['Pixar Animation Studios'],['United States of America'],1995.0,['English'],John Lasseter,"Bonnie Arnold, Ralph Guggenheim",,Tom Hanks,Woody (voice),led by woody andys toys live happily in his ro...,led by woody andys toys live happily in his ro...,led by woodi andi toy live happili in hi room ...
1,Jumanji,65000000.0,17.015539,262797249.0,104.0,6.9,2413.0,Standalone,"['TriStar Pictures', 'Teitler Film', 'Intersco...",['United States of America'],1995.0,"['English', 'Français']",Joe Johnston,"Scott Kroopf, William Teitler",,Robin Williams,Alan Parrish,when siblings judy and peter discover an encha...,when siblings judy and peter discover an encha...,when sibl judi and peter discov an enchant boa...
2,Grumpier Old Men,0.0,11.7129,0.0,101.0,6.5,92.0,Grumpy Old Men,"['Warner Bros.', 'Lancaster Gate']",['United States of America'],1995.0,['English'],Howard Deutch,,Mark Steven Johnson,Walter Matthau,Max Goldman,a family wedding reignites the ancient feud be...,a family wedding reignites the ancient feud be...,a famili wed reignit the ancient feud between ...
3,Waiting to Exhale,16000000.0,3.859495,81452156.0,127.0,6.1,34.0,Standalone,['Twentieth Century Fox Film Corporation'],['United States of America'],1995.0,['English'],Forest Whitaker,"Ronald Bass, Ezra Swerdlow, Deborah Schindler,...",,Whitney Houston,Savannah 'Vannah' Jackson,cheated on mistreated and stepped on the women...,cheated on mistreated and stepped on the women...,cheat on mistreat and step on the women are ho...
4,Father of the Bride Part II,0.0,8.387519,76578911.0,106.0,5.7,173.0,Father of the Bride,"['Sandollar Productions', 'Touchstone Pictures']",['United States of America'],1995.0,['English'],Charles Shyer,Nancy Meyers,,Steve Martin,George Banks,just when george banks has recovered from his ...,just when george banks has recovered from his ...,just when georg bank ha recov from hi daughter...


In [None]:
holly2.dtypes

Unnamed: 0,0
title,object
budget,float64
popularity,float64
revenue,float64
runtime,float64
vote_average,float64
vote_count,float64
collection,object
production_houses,object
production_countries_clean,object


In [None]:
holly2 = holly2[holly2['year'] >= 1980].reset_index(drop=True)
holly2.shape

(34526, 18)

In [None]:
import json
import ast

first_row_raw = holly2['content2'].loc[2]

print("Raw value in the first row of 'belongs_to_collection':")
print(first_row_raw)

if isinstance(first_row_raw, str):
    first_row_raw = first_row_raw.replace("'", "\"")
    try:
        first_row = json.loads(first_row_raw)
        print("Parsed JSON successfully:")
        print(first_row)
    except json.JSONDecodeError as e:
        print(f"Error in parsing JSON: {e}")
else:
    print("The first row is not a string or is empty.")


Raw value in the first row of 'belongs_to_collection':
a family wedding reignites the ancient feud between nextdoor neighbors and fishing buddies john and max meanwhile a sultry italian divorcée opens a restaurant at the local bait shop alarming the locals who worry shell scare the fish away but shes less interested in seafood than she is in cooking up a hot time with max still yelling still fighting still ready for love fishing best friend duringcreditsstinger old men grumpy old men Howard Deutch Mark Steven Johnson Walter Matthau Max Goldman Grumpy Old Men ['Warner Bros.', 'Lancaster Gate'] ['United States of America'] ['English'] 1995.0
Error in parsing JSON: Expecting value: line 1 column 1 (char 0)


In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk

nltk.download('punkt')
nltk.download('stopwords')

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
ps = PorterStemmer()

def stems(text):
    return " ".join([ps.stem(i) for i in text.split()])

In [None]:
holly2['content2'] = holly2['content2'].apply(stems)

# **7. Modelling for Recommendation System**

# **Recommendation System 1: CountVectorizer + Cosine Similarity**  
- **Key Features**  
  - Uses `CountVectorizer(max_features=3000, stop_words='english')` on the stemmed `content2` field  
  - Builds a sparse count matrix and computes pairwise cosine similarity via `cosine_similarity(vector, dense_output=False)`  
  - `recommend1()` sorts similarity scores and returns the top-20 titles with rounded similarity scores  
- **Advantages**  
  - Simple, interpretable bag-of-words approach  
  - Fast to fit and query on moderate-sized datasets  
  - No need for additional feature engineering beyond text stemming  
- **Limitations**  
  - Raw counts overweight common terms (even with stop-words)  
  - Precomputing all pairwise similarities needs memory that grows quadratically with the number of titles  
  - Lacks term-weighting (in contrast to TF-IDF), so semantic relevance may suffer  
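
The sort-and-slice step described above can also be written with NumPy. This is a minimal sketch, assuming a 1-D array of similarity scores for one query row; `top_k_similar` and the toy scores are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_similar(sim_scores, query_idx, k=20):
    """Return (row index, score) pairs for the k highest scores, skipping the query row."""
    scores = np.asarray(sim_scores, dtype=float).copy()
    # Exclude the query explicitly instead of assuming it always sorts first.
    scores[query_idx] = -np.inf
    k = min(k, scores.size - 1)
    order = np.argsort(-scores, kind="stable")[:k]  # descending by score
    return [(int(i), round(float(scores[i]), 4)) for i in order]

# Toy similarity row for query index 0 (hypothetical values).
toy_scores = [1.0, 0.2, 0.9, 0.0, 0.5]
print(top_k_similar(toy_scores, query_idx=0, k=3))
```

Masking the query row by index keeps the result correct even when another title ties with the query at a similarity of 1.0, a case where slicing off the first sorted entry could drop the wrong row.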

In [None]:
cv = CountVectorizer(max_features=3000, stop_words='english')
vector = cv.fit_transform(holly2['content2'])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector, dense_output=False)

In [None]:
import numpy as np
import pandas as pd

def recommend1(movie_title):
    idx = holly2[holly2['title'] == movie_title].index[0]

    if isinstance(similarity, np.ndarray):
        sim_scores = similarity[idx]
    else:
        sim_scores = similarity[idx].toarray().flatten()

    distances = sorted(
        enumerate(sim_scores),
        key=lambda x: x[1],
        reverse=True
    )

    recommendations = [
        {
            'title': holly2.iloc[movie_idx]['title'],
            'similarity_score': round(score, 4)
        }
        for movie_idx, score in distances[1:21]
    ]

    return pd.DataFrame(recommendations)



In [None]:
recommend1('Jason Bourne')

Unnamed: 0,title,similarity_score
0,Smoke,0.5967
1,Paul,0.5916
2,Hotel Rwanda,0.577
3,Dungeons & Dragons: Wrath of the Dragon God,0.5733
4,Pee-wee's Big Holiday,0.5732
5,Perfect Child,0.5705
6,The D Train,0.5661
7,Pee-wee's Big Adventure,0.5607
8,The Bourne Ultimatum,0.5505
9,The Chosen One,0.5502


In [None]:
import pickle
from google.colab import files

with open('countvec_cosine.pkl','wb') as f:
    pickle.dump({'holly2': holly2, 'similarity': similarity}, f)

files.download('countvec_cosine.pkl')

# **Recommendation System 2: CountVectorizer + KNN**  
- **Key Features**  
  - Same `CountVectorizer(max_features=3000, stop_words='english')` on `content2`  
  - Fits `NearestNeighbors(n_neighbors=21, metric='cosine', algorithm='brute')` on the count matrix  
  - `recommend2()` retrieves the 20 nearest neighbors, converting distances to similarity scores (1 − distance)  
- **Advantages**  
  - Computes distances only for the query row at lookup time, so no N×N similarity matrix has to be stored  
  - Easily tunable `n_neighbors` to control recommendation breadth  
  - Cosine distance converts directly to a similarity score (similarity = 1 − distance)  
- **Limitations**  
  - Brute-force search compares the query against every row, so each lookup scales linearly with catalog size (and quadratically if every title is queried)  
  - Still uses raw counts, so it is subject to the same vocabulary bias as System 1  
  - No dimensionality reduction, so memory use remains high
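
To illustrate the distance-to-similarity conversion, here is a small sketch on a toy corpus (the four documents are hypothetical stand-ins for `holly2['content2']`). It checks that `1 − distance` from a cosine `NearestNeighbors` model matches the values returned by `cosine_similarity` for the same rows.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Toy corpus (hypothetical); the notebook vectorizes holly2['content2'] instead.
docs = [
    "spy agent chase memory",
    "spy agent chase city",
    "toy cowboy friend room",
    "wedding family feud fishing",
]
X_toy = CountVectorizer().fit_transform(docs)

knn_toy = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute").fit(X_toy)
distances, indices = knn_toy.kneighbors(X_toy[0])

# Cosine distance is 1 - cosine similarity, so subtracting from 1 recovers similarity.
knn_sims = 1 - distances[0]
full_sims = cosine_similarity(X_toy[0], X_toy).ravel()

print(indices[0])                            # neighbor rows, nearest first (query row included)
print(np.round(knn_sims, 4))                 # similarities from the KNN model
print(np.round(full_sims[indices[0]], 4))    # same rows taken from the full similarity row
```

Because both routes compute the same quantity on the same count matrix, Systems 1 and 2 should rank neighbors identically when they are fitted on the same vectors; only the lookup mechanism differs.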

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
import pandas as pd

In [None]:
cv = CountVectorizer(max_features=3000, stop_words='english')
X = cv.fit_transform(holly2['content2'])

In [None]:
knn = NearestNeighbors(n_neighbors=21, metric='cosine', algorithm='brute')
knn.fit(X)

In [None]:
def recommend2(movie_title):
    index = holly2[holly2['title'] == movie_title].index[0]
    vector = X[index]
    distances, indices = knn.kneighbors(vector)

    recommended2 = []
    for i in range(1, len(indices[0])):
        movie_idx = indices[0][i]
        similarity_score = 1 - distances[0][i]
        recommended2.append({
            'title': holly2.iloc[movie_idx]['title'],
            'similarity_score': round(similarity_score, 4)
        })

    return pd.DataFrame(recommended2)


In [None]:
recommend2("Jason Bourne")

Unnamed: 0,title,similarity_score
0,Pee-wee's Big Holiday,0.3919
1,Extraction,0.3882
2,The Bourne Ultimatum,0.3877
3,Answer This!,0.3829
4,Attack Force,0.3828
5,Killers,0.3763
6,Synchronicity,0.3729
7,Merlin,0.3678
8,The Bourne Supremacy,0.3675
9,Journey to the End of the Night,0.3659


In [None]:
import pickle
from google.colab import files

with open('countvec_knn.pkl', 'wb') as f:
    pickle.dump({'holly2': holly2, 'X': X, 'knn': knn}, f)

files.download('countvec_knn.pkl')

# **Recommendation System 3: TF-IDF + KNN**  
- **Key Features**  
  - Uses `TfidfVectorizer(stop_words='english')` on `content2` to weight terms by inverse document frequency  
  - Converts to a CSR sparse matrix and fits `NearestNeighbors(metric='cosine', algorithm='brute')`  
  - `recommend3()` takes a case-insensitive substring match on the title (the first match is used) and returns the top-20 neighbors with rounded cosine distances (lower is closer)  
- **Advantages**  
  - TF-IDF down-weights overly common words, improving relevance  
  - Retains sparse representation for efficiency  
  - KNN lookup computes distances per query, so no full similarity matrix has to be precomputed or stored  
- **Limitations**  
  - High dimensionality (one feature per token) still poses memory challenges  
  - Brute-force search can be slow as data grows  
  - No integration of numeric/movie-metadata features
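
The down-weighting effect can be seen on a toy corpus. This is a sketch with hypothetical documents and scikit-learn's default smoothed IDF; it is not taken from the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): 'agent' occurs in every document, 'bourne' in only one.
docs = [
    "agent bourne memory chase",
    "agent smith city chase",
    "agent cowboy toy room",
]
tfidf_toy = TfidfVectorizer()
tfidf_toy.fit(docs)

# Map each term to its learned inverse-document-frequency weight.
idf = {term: float(tfidf_toy.idf_[col]) for term, col in tfidf_toy.vocabulary_.items()}

for term in ("agent", "chase", "bourne"):
    print(term, round(idf[term], 4))
```

The term that appears in all three documents receives the lowest weight and the term unique to one document the highest, which is why a rare token such as a franchise name can dominate the match under TF-IDF but not under raw counts.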

In [None]:
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(holly2['content2'])

movie_features_df_matrix = csr_matrix(tfidf_matrix)

model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(movie_features_df_matrix)

query_index = np.random.choice(holly2.shape[0])

distances, indices = model_knn.kneighbors(tfidf_matrix[query_index], n_neighbors=6)

print(f"Recommendations for {holly2.iloc[query_index]['title']}:\n")
for i in range(0, len(distances.flatten())):
    if i == 0:
        continue
    print(f"{i}: {holly2.iloc[indices.flatten()[i]]['title']}, with distance of {distances.flatten()[i]:.4f}")

Recommendations for Myn Bala: Warriors of the Steppe:

1: Racketeer, with distance of 0.6648
2: The Liquidator, with distance of 0.7102
3: The Old Man, with distance of 0.8014
4: Aksuat, with distance of 0.8141
5: Kaïrat, with distance of 0.8290


In [None]:
import numpy as np
import pandas as pd

def recommend3(movie_title, n_neighbors=20):
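    # Literal, case-insensitive substring match; regex=False keeps characters such as "(" or "?" from being read as a pattern.
    matched = holly2[holly2['title'].str.contains(movie_title, case=False, regex=False, na=False)]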
    if matched.empty:
        raise ValueError(f"No movies found containing '{movie_title}'")
    query_index = matched.index[0]

    distances, indices = model_knn.kneighbors(
        tfidf_matrix[query_index],
        n_neighbors=n_neighbors + 1
    )

    distances = distances.flatten()
    indices = indices.flatten()

    recommendations = [
        {
            'title': holly2.iloc[idx]['title'],
            'distance': round(dist, 4)
        }
        for dist, idx in zip(distances[1:], indices[1:])
    ]

    return pd.DataFrame(recommendations)


In [None]:
recommend3('Jason Bourne')

Unnamed: 0,title,distance
0,The Bourne Supremacy,0.454
1,The Bourne Ultimatum,0.4544
2,The Bourne Identity,0.5495
3,The Bourne Legacy,0.6333
4,Making 'Do the Right Thing',0.7594
5,Resurrected,0.7992
6,The Garden of Eden,0.8117
7,Jig,0.8165
8,The Bourne Identity,0.8326
9,Hereafter,0.8787


In [None]:
import pickle

with open('tfidf_knn.pkl', 'wb') as f:
    pickle.dump({
        'holly2': holly2,
        'tfidf_matrix': tfidf_matrix,
        'model_knn': model_knn,
        'tfidf_vectorizer': tfidf_vectorizer
    }, f)

In [None]:
from google.colab import files

files.download('tfidf_knn.pkl')

# **Recommendation System 4: CountVectorizer + SVD + Cosine Similarity**  
- **Key Features**  
  - Builds a dense count matrix with `CountVectorizer(max_features=2000, stop_words='english')`  
  - Applies `TruncatedSVD(n_components=300)` (latent semantic analysis) to reduce dimensions from 2000→300  
  - Computes cosine similarity on the reduced vectors and uses the same sorting logic as System 1 in `recommend4()`  
- **Advantages**  
  - Reduces each movie vector from 2000 to 300 dimensions, which speeds up similarity computations  
  - Captures latent “topics” via SVD, potentially improving semantic matching  
- **Limitations**  
  - The SVD output is dense, so it can take more memory than the original sparse count matrix, and the N×N similarity matrix is no smaller than before  
  - SVD training is computationally expensive  
  - Dimensionality reduction can discard important fine-grained distinctions
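
As a sketch of the reduction step, `TruncatedSVD` can be applied to a sparse matrix directly, without the `.toarray()` call. The matrix below is synthetic (random Poisson counts standing in for a real count matrix), and the sizes are chosen only to keep the example quick.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Synthetic stand-in for a count matrix: 200 "movies" x 500 "terms", mostly zeros.
rng = np.random.default_rng(0)
counts = csr_matrix(rng.poisson(0.05, size=(200, 500)), dtype=np.float64)

# TruncatedSVD accepts sparse input; the reduced output is a dense ndarray.
svd_toy = TruncatedSVD(n_components=20, random_state=0)
reduced = svd_toy.fit_transform(counts)

sim_reduced = cosine_similarity(reduced)

print(counts.shape, "->", reduced.shape)
print("similarity matrix:", sim_reduced.shape)
print("explained variance kept:", round(float(svd_toy.explained_variance_ratio_.sum()), 3))
```

Printing the retained explained variance is one way to judge how much information the chosen `n_components` discards; a low value would be consistent with fine-grained terms being lost.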

In [None]:
holly4 = holly2[holly2['year'] >= 1980].reset_index(drop=True)
holly4.head()
holly4.dtypes

Unnamed: 0,0
title,object
budget,float64
popularity,float64
revenue,float64
runtime,float64
vote_average,float64
vote_count,float64
collection,object
production_houses,object
production_countries_clean,object


In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
cv = CountVectorizer(max_features=2000, stop_words='english')
vector2 = cv.fit_transform(holly4['content2']).toarray()

In [None]:
vector2.shape

(34526, 2000)

In [None]:
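# Cosine similarity on the raw 2000-feature counts. recommend4() does not use this matrix;
# it gets its own name so that System 1's `similarity` is not overwritten.
similarity_counts = cosine_similarity(vector2)

In [None]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300)
vector_reduced = svd.fit_transform(vector2)

from sklearn.metrics.pairwise import cosine_similarity
# Stored as `similarity_svd` so that recommend1() and recommend4() read different matrices.
similarity_svd = cosine_similarity(vector_reduced)

In [None]:
import numpy as np
import pandas as pd

def recommend4(movie_title):
    # holly4 is the frame the SVD vectors were built from, so titles are looked up there.
    idx = holly4[holly4['title'] == movie_title].index[0]

    # cosine_similarity on dense input returns an ndarray, so a row is already 1-D.
    sim_scores = similarity_svd[idx]

    # Rank all movies by similarity to the query, highest first.
    distances = sorted(
        enumerate(sim_scores),
        key=lambda x: x[1],
        reverse=True
    )

    # Skip position 0 (the query itself) and keep the next 20 titles.
    recommendations = [
        {
            'title': holly4.iloc[movie_idx]['title'],
            'similarity_score': round(score, 4)
        }
        for movie_idx, score in distances[1:21]
    ]

    return pd.DataFrame(recommendations)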


In [None]:
recommend4('Jason Bourne')

Unnamed: 0,title,similarity_score
0,Smoke,0.5967
1,Paul,0.5916
2,Hotel Rwanda,0.577
3,Dungeons & Dragons: Wrath of the Dragon God,0.5733
4,Pee-wee's Big Holiday,0.5732
5,Perfect Child,0.5705
6,The D Train,0.5661
7,Pee-wee's Big Adventure,0.5607
8,The Bourne Ultimatum,0.5505
9,The Chosen One,0.5502


# **Recommendation System 5: Hybrid Text + Numeric + KNN**  
- **Key Features**  
  - Constructs `combined_features` by concatenating `content2` with metadata fields (`Director`, `Writer`, `lead_actor`, etc.)  
  - Scales numeric features (`budget`, `popularity`, `revenue`, `runtime`, `vote_average`, `vote_count`) via `MinMaxScaler`  
  - Vectorizes text with `TfidfVectorizer(max_features=5000, stop_words='english')` and horizontally stacks it with the scaled numeric features converted to a CSR matrix  
  - Fits `NearestNeighbors(metric='cosine', algorithm='brute')` on the hybrid matrix; `recommend5()` returns the top-20 by similarity (1 − distance)  
- **Advantages**  
  - Combines rich metadata and numeric signals for more nuanced recommendations  
  - TF-IDF weights text while numeric features capture popularity and quality metrics  
  - Flexible: can add or remove features easily  
- **Limitations**  
  - High dimensionality (5000 text features + 6 numeric features), increasing computation and storage costs  
  - Balancing text vs. numeric feature importance requires manual tuning  
  - Still uses brute-force KNN search  
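
One way to tune the text vs. numeric balance is to multiply the numeric block by a weight before stacking. The sketch below uses toy matrices; `build_hybrid` and `numeric_weight` are hypothetical names that do not appear in the notebook, which stacks the two blocks unweighted.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.metrics.pairwise import cosine_similarity

def build_hybrid(text_block, numeric_block, numeric_weight=1.0):
    """Stack a text block with a numeric block that is multiplied by numeric_weight."""
    weighted = csr_matrix(np.asarray(numeric_block, dtype=float) * numeric_weight)
    return hstack([csr_matrix(text_block), weighted], format="csr")

# Toy data: rows 0 and 1 share their text, rows 0 and 2 share their numeric values.
text_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
nums_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

for w in (0.5, 2.0):
    sims = cosine_similarity(build_hybrid(text_toy, nums_toy, numeric_weight=w))
    print(f"weight={w}: text match={sims[0, 1]:.2f}, numeric match={sims[0, 2]:.2f}")
```

With a weight of 0.5 the row that shares text is the closer neighbor (0.80 vs. 0.20); with a weight of 2.0 the ranking flips in favor of the row that shares numeric values. The weight is therefore a tuning parameter that would need to be chosen against an evaluation set.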

**Hybrid recommendation**

In [None]:
def combine_features(row):
    return (
        str(row['content2']) + " " +
        str(row['Director']) + " " +
        str(row['Writer']) + " " +
        str(row['Producer']) + " " +
        str(row['lead_actor']) + " " +
        str(row['lead_character']) + " " +
        str(row['spoken_languages_list']) + " " +
        str(row['production_houses']) + " " +
        str(row['production_countries_clean']) +
        " " + str(row['collection'])
    )

holly2['combined_features'] = holly2.apply(combine_features, axis=1)


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
num_features = holly2[['budget', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']]
num_scaled = scaler.fit_transform(num_features)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
text_vec = tfidf.fit_transform(holly2['combined_features'])

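from scipy.sparse import csr_matrix
# format='csr' guarantees a row-indexable matrix for the hybrid_matrix[index] lookups below.
hybrid_matrix = hstack([text_vec, csr_matrix(num_scaled)], format='csr')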


In [None]:
from sklearn.neighbors import NearestNeighbors
Hybrid = NearestNeighbors(metric='cosine', algorithm='brute')
Hybrid.fit(hybrid_matrix)


In [None]:

def recommend5(movie_title, top_n=20):
    try:
        index = holly2[holly2['title'] == movie_title].index[0]
    except IndexError:
        print("Movie not found in dataset.")
        return pd.DataFrame()

    distances, indices = Hybrid.kneighbors(hybrid_matrix[index], n_neighbors=top_n+1)

    recommendations4 = []
    for i in range(1, len(indices[0])):
        idx = indices[0][i]
        score = 1 - distances[0][i]
        title = holly2.iloc[idx]['title']
        recommendations4.append({'title': title, 'similarity_score': round(score, 4)})

    return pd.DataFrame(recommendations4)


In [None]:
recommend5('Jason Bourne')

Unnamed: 0,title,similarity_score
0,The Bourne Supremacy,0.6414
1,The Bourne Ultimatum,0.6387
2,The Bourne Identity,0.5472
3,The Curious Case of Benjamin Button,0.5304
4,Promised Land,0.5239
5,The Bourne Legacy,0.4994
6,The Great Wall,0.4991
7,That Sugar Film,0.4969
8,Hereafter,0.4955
9,Paul F. Tompkins: Crying and Driving,0.4881


# **Recommendation System 6: Hybrid Cosine Similarity**  
- **Key Features**  
  - Reuses the same `hybrid_matrix` from System 5  
  - Precomputes the full pairwise cosine similarity matrix (`Hybrid_Sim = cosine_similarity(hybrid_matrix, dense_output=False)`)  
  - `recommend6()` converts the query row from sparse to dense, sorts its similarities, and returns the top-20 matches  
- **Advantages**  
  - Query time is very fast because similarities are precomputed: a lookup is one row read plus a sort  
  - No neighbor search is run at query time  
  - Leverages both text and numeric data  
- **Limitations**  
  - Storing an N×N similarity matrix takes O(N²) memory, which is untenable for large catalogs  
  - One-time computation is expensive and must be redone when data changes  
  - `toarray()` densifies only one row per query, so the stored matrix itself, not the conversion, is the main memory cost
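
A common way around the O(N²) storage cost is to compute similarities in row chunks and keep only the top-k neighbors per row. The sketch below is an alternative to the notebook's approach, not what it does; `top_k_neighbors` and `chunk_size` are hypothetical names, and the memory figures assume dense float64 storage.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def top_k_neighbors(matrix, k=20, chunk_size=1000):
    """For every row, keep only the k most similar other rows (indices and scores)."""
    n = matrix.shape[0]
    k = min(k, n - 1)
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Dense block of shape (chunk, n); only one block is held in memory at a time.
        block = cosine_similarity(matrix[start:stop], matrix)
        # Mask each row's similarity to itself so it is never returned as a neighbor.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        order = np.argsort(-block, axis=1)[:, :k]
        top_idx[start:stop] = order
        top_sim[start:stop] = np.take_along_axis(block, order, axis=1)
    return top_idx, top_sim

# Rough memory comparison for the 34,526-row catalog used in this notebook.
n_movies = 34526
full_gb = n_movies ** 2 * 8 / 1e9          # 8 bytes per float64 similarity
topk_mb = n_movies * 20 * (8 + 8) / 1e6    # int64 index + float64 score per neighbor
print(f"full matrix: {full_gb:.1f} GB, top-20 table: {topk_mb:.1f} MB")
```

The trade-off is that only the stored k neighbors are available at query time, so `k` has to be at least as large as the longest recommendation list the application will show.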

**Hybrid - Cosine_Similarity**

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

Hybrid_Sim = cosine_similarity(hybrid_matrix, dense_output=False)

In [None]:

def recommend6(movie_title):
    index = holly2[holly2['title'] == movie_title].index[0]

    sim_scores = Hybrid_Sim[index].toarray().flatten()

    distances = sorted(list(enumerate(sim_scores)), reverse=True, key=lambda x: x[1])

    recommended1 = []
    for i in distances[1:21]:
        movie_idx = i[0]
        score = round(i[1], 4)
        recommended1.append({
            'title': holly2.iloc[movie_idx]['title'],
            'similarity_score': score
        })

    return pd.DataFrame(recommended1)


In [None]:
recommend6('Jason Bourne')

Unnamed: 0,title,similarity_score
0,The Bourne Supremacy,0.6414
1,The Bourne Ultimatum,0.6387
2,The Bourne Identity,0.5472
3,The Curious Case of Benjamin Button,0.5304
4,Promised Land,0.5239
5,The Bourne Legacy,0.4994
6,The Great Wall,0.4991
7,That Sugar Film,0.4969
8,Hereafter,0.4955
9,Paul F. Tompkins: Crying and Driving,0.4881


In [None]:
import pickle
from google.colab import files

with open('hybrid_cosine.pkl', 'wb') as f:
    pickle.dump({'holly2': holly2, 'Hybrid_Sim': Hybrid_Sim}, f)

files.download('hybrid_cosine.pkl')

# **Recommendation System 7: Non-Numeric Metadata + Cosine Similarity**  
- **Key Features**  
  - Reuses the `combined_features` string built for System 5, which concatenates `content2` with the non-numeric metadata (crew, cast, languages, production houses and countries, collection)  
  - Vectorizes with `CountVectorizer(max_features=3000, stop_words='english')`  
  - Computes cosine similarity (`h_similarity = cosine_similarity(vector, dense_output=False)`) and uses `recommend7()` to pick the top-20  
- **Advantages**  
  - Captures a wide range of descriptive metadata in a single text field  
  - Lower dimensionality than full hybrid (3000 vs. 5006)  
  - Purely text-based—no scaling or numeric preprocessing needed  
- **Limitations**  
  - Discards numeric features entirely—no popularity or rating signals  
  - Suffers from the same bag-of-words limitations as System 1  
  - Potential bias if some metadata fields dominate the vector space



**Combined_Features(non-numeric) and Cosine_similarity**

In [None]:
cv = CountVectorizer(max_features=3000, stop_words='english')
vector = cv.fit_transform(holly2['combined_features'])

In [None]:
h_similarity = cosine_similarity(vector, dense_output=False)

In [None]:
def recommend7(movie_title):
    index = holly2[holly2['title'] == movie_title].index[0]

    sim_scores = h_similarity[index].toarray().flatten()

    distances = sorted(list(enumerate(sim_scores)), reverse=True, key=lambda x: x[1])

    recommended1 = []
    for i in distances[1:21]:
        movie_idx = i[0]
        score = round(i[1], 4)
        recommended1.append({
            'title': holly2.iloc[movie_idx]['title'],
            'similarity_score': score
        })

    return pd.DataFrame(recommended1)


In [None]:
recommend7('Jason Bourne')

Unnamed: 0,title,similarity_score
0,The Bourne Ultimatum,0.5432
1,The Bourne Supremacy,0.5341
2,Pee-wee's Big Holiday,0.4718
3,Eating Raoul,0.4372
4,Grandma,0.4352
5,Affliction,0.4344
6,Hyena Road,0.4272
7,Paul F. Tompkins: Crying and Driving,0.4231
8,The Backyard,0.4216
9,Pretty Bird,0.418


# **8. Evaluation (best recommendation system)**
## What the Metrics Tell Us in Practice

Suppose our recommender returns a **top-5** list for “Jason Bourne” and we know the four true Bourne films are the only “relevant” items. Here’s how to interpret each metric:

### 1. Precision@K (K = 5)  
- **Definition:** Of the 5 movies we recommended, what fraction are actually Bourne films?  
- **Applied Example:**  
  - If our top-5 list is  
    ```
    [Ultimatum, Supremacy, Identity, Legacy, Smoke]
    ```  
  - 4 out of 5 are relevant → **Precision@5 = 4/5 = 0.80**  
- **What It Shows:** How “clean” the recommendation list is—high precision means few irrelevant suggestions in the visible top-5.
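
The definition above can be written as a short function. This is a minimal sketch that matches titles as plain strings; `precision_at_k` is a hypothetical helper, not code from the notebook.

```python
def precision_at_k(recommended, relevant, k=5):
    """Share of the first k recommended titles that belong to the relevant set."""
    relevant = set(relevant)
    top_k = list(recommended)[:k]
    hits = sum(1 for title in top_k if title in relevant)
    return hits / k

bourne_films = {
    "The Bourne Identity",
    "The Bourne Supremacy",
    "The Bourne Ultimatum",
    "The Bourne Legacy",
}
example_top5 = [
    "The Bourne Ultimatum",
    "The Bourne Supremacy",
    "The Bourne Identity",
    "The Bourne Legacy",
    "Smoke",
]
print(precision_at_k(example_top5, bourne_films))  # 0.8
```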

---

### 2. Recall@K (K = 5)  
- **Definition:** Of *all* relevant items (in this case 4 Bourne films), what fraction did we capture in our top-5?  
- **Applied Example:**  
  - Same list above captures all four Bourne titles → **Recall@5 = 4/4 = 1.00**  
- **What It Shows:** How complete the list is with respect to what the user truly wants—high recall means we’re not leaving out many relevant movies, even if we slipped in one extra.
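
Recall can be sketched the same way. `recall_at_k` is a hypothetical helper; it counts distinct relevant titles, so a title that appears twice in the catalog is not counted twice.

```python
def recall_at_k(recommended, relevant, k=5):
    """Share of all relevant titles that appear among the first k recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    found = set(list(recommended)[:k]) & relevant
    return len(found) / len(relevant)

bourne_films = {
    "The Bourne Identity",
    "The Bourne Supremacy",
    "The Bourne Ultimatum",
    "The Bourne Legacy",
}
example_top5 = [
    "The Bourne Ultimatum",
    "The Bourne Supremacy",
    "The Bourne Identity",
    "The Bourne Legacy",
    "Smoke",
]
# Top-5 titles returned by recommend7('Jason Bourne') earlier in the notebook.
system7_top5 = [
    "The Bourne Ultimatum",
    "The Bourne Supremacy",
    "Pee-wee's Big Holiday",
    "Eating Raoul",
    "Grandma",
]
print(recall_at_k(example_top5, bourne_films))   # 1.0
print(recall_at_k(system7_top5, bourne_films))   # 0.5
```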

---

### 3. NDCG@K (K = 5; Normalized Discounted Cumulative Gain)  
- **Definition:** Assign each recommended movie a “gain” of 1 if it’s relevant, 0 otherwise, but **discount** that gain by its position: items lower in the list count for less. Finally normalize by the ideal ordering.  
  
- **Applied Example:**  
  - Our list:  
    1. Ultimatum (rel=1)  
    2. Supremacy (1)  
    3. Identity (1)  
    4. Legacy (1)  
    5. Smoke (0)  
  - The gain at position *i* is divided by log₂(*i* + 1): position 1 counts in full (log₂ 2 = 1), position 2 is divided by log₂ 3 ≈ 1.58, and so on.  
  - Since the 4 Bourne films occupy the first 4 slots, the list matches the **ideal** ordering → **NDCG@5 = 1.00**.  
- **What It Shows:** Rank quality on top of relevance: it rewards recommenders that not only include the right items but also place them as early as possible.
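
With binary gains, DCG@K is the sum of rel_i / log₂(i + 1) over positions i = 1..K, and NDCG@K divides that by the DCG of an ideal list. The sketch below assumes the ideal list places min(K, number of relevant items) hits first; `ndcg_at_k` is a hypothetical helper.

```python
import math

def ndcg_at_k(recommended, relevant, k=5):
    """Binary-gain NDCG: DCG of the list divided by the DCG of an ideal ordering."""
    relevant = set(relevant)
    gains = [1.0 if title in relevant else 0.0 for title in list(recommended)[:k]]
    dcg = sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))
    # The ideal list puts every relevant title (at most k of them) at the top.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

bourne_films = {
    "The Bourne Identity",
    "The Bourne Supremacy",
    "The Bourne Ultimatum",
    "The Bourne Legacy",
}
ideal_top5 = [
    "The Bourne Ultimatum",
    "The Bourne Supremacy",
    "The Bourne Identity",
    "The Bourne Legacy",
    "Smoke",
]
# Top-5 titles returned by recommend5('Jason Bourne') earlier in the notebook.
system5_top5 = [
    "The Bourne Supremacy",
    "The Bourne Ultimatum",
    "The Bourne Identity",
    "The Curious Case of Benjamin Button",
    "Promised Land",
]
print(round(ndcg_at_k(ideal_top5, bourne_films), 2))    # 1.0
print(round(ndcg_at_k(system5_top5, bourne_films), 2))  # 0.83
```

The second list has three hits in positions 1 to 3, giving DCG ≈ 2.13 against an ideal DCG ≈ 2.56, hence the 0.83.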

---

### Why All Three Matter Together

- **Precision@K** ensures the top-K list isn’t cluttered with irrelevant items.  
- **Recall@K** ensures you’re not missing the important items a user cares about.  
- **NDCG@K** rewards placing relevant items near the top of the list, which neither Precision@K nor Recall@K measures.

By tracking all three, you get a full picture of a recommender’s accuracy, completeness, and ranking finesse.


## **Evaluation Metrics (K = 5)**  
_Relevant items: The Bourne Identity, The Bourne Supremacy, The Bourne Ultimatum, The Bourne Legacy_

| System                                              | Precision@5 | Recall@5 | NDCG@5 |
|:----------------------------------------------------|:-----------:|:--------:|:------:|
| **1. CountVectorizer + Cosine**                     |     0.00    |   0.00   |  0.00  |
| **2. CountVectorizer + KNN**                        |     0.20    |   0.25   |  0.20  |
| **3. TF-IDF + KNN**                                 |     0.80    |   1.00   |  1.00  |
| **4. CountVectorizer + SVD + Cosine**               |     0.00    |   0.00   |  0.00  |
| **5. Hybrid Text + Numeric + KNN**                  |     0.60    |   0.75   |  0.83  |
| **6. Hybrid TF-IDF + Numeric Cosine (Precomputed)** |     0.60    |   0.75   |  0.83  |
| **7. Metadata-Only Cosine**                         |     0.40    |   0.50   |  0.64  |


## Final Verdict

- **Top Performer:** **System 3 (TF-IDF + KNN)**  
  - Highest Precision@5 (0.80), Recall@5 (1.00) and NDCG@5 (1.00)  
  - Captures **all four** Bourne films within the top-5  
  - Purely text-based and extremely effective for this query  
  - **Other Key Advantages:**
    - TF-IDF down-weights overly common words, improving relevance
    - Retains sparse representation for efficiency
    - KNN lookup avoids full similarity matrix computation on each query

- **Strong Contenders:**  
  - **System 6 (Hybrid TF-IDF + Numeric Cosine)**  
    - Precision@5 0.60, Recall@5 0.75, NDCG@5 0.83  
    - Precomputed similarity → fastest lookups  
  - **System 5 (Hybrid Text + Numeric + KNN)**  
    - Precision@5 0.60, Recall@5 0.75, NDCG@5 0.83  
    - Fresh KNN search each call (slower than System 6)  