# Movie Recommendation System: Data Preprocessing

Converting movie information into numerical vectors for KNN-based recommendations:
* `Genres`: Multi-hot encoded categories
* `Overview`: TF-IDF vectors
* `Release Years`: *(Optional, for future use)*

In [1]:
import matplotlib
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('movies.csv')  # read_csv already returns a DataFrame
df.head(5)

Unnamed: 0,id,title,overview,genres,release_year
0,846422,The Old Guard 2,Andy and her team of immortal warriors fight w...,"['Action', 'Fantasy']",2025
1,541671,Ballerina,Taking place during the events of John Wick: C...,"['Action', 'Thriller', 'Crime']",2025
2,749170,Heads of State,The UK Prime Minister and US President have a ...,"['Action', 'Thriller', 'Comedy']",2025
3,1011477,Karate Kid: Legends,"After a family tragedy, kung fu prodigy Li Fon...","['Action', 'Adventure', 'Drama']",2025
4,1119878,Ice Road: Vengeance,Big rig ice road driver Mike McCann travels to...,"['Action', 'Thriller', 'Drama']",2025


## **Why do we need vectors?**
### **KNN is a distance-based algorithm:**
* No matter what your data is (images, text, genres), it must become numbers in a vector space.
* Otherwise, you can't calculate distance or similarity.
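
To make the distance idea concrete, here is a minimal sketch with made-up three-dimensional vectors (the numbers and movie labels are illustrative, not taken from the dataset). It computes cosine similarity, one common choice for comparing feature vectors; the helper name `cosine_similarity` is local to this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: [action, thriller, comedy]
movie_a = [1, 1, 0]
movie_b = [1, 1, 0]
movie_c = [0, 0, 1]

sim_ab = cosine_similarity(movie_a, movie_b)  # identical features
sim_ac = cosine_similarity(movie_a, movie_c)  # no features in common
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Movies with the same features score close to 1, and movies with nothing in common score 0, which is the signal a nearest-neighbor search relies on.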

### How this works together
**Each movie's vector:**

`[ TF-IDF vector for overview | genre multi-hot vector ]`

So you're mixing text data + categorical data, which is how many content-based recommenders work!

### **🔹 STEP 1: Multi-Hot Encode Genres**
### *Think:*
* Find all unique genres in your whole dataset.
* For each movie, mark 1 if that genre is present, 0 if not.
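
As a rough sketch of these two steps: the `genres` column in the table above is stored as the string form of a Python list, so one option is to parse it with `ast.literal_eval` before testing membership. The helper `multi_hot` and the shortened genre list are hypothetical and only serve this illustration.

```python
import ast

def multi_hot(genres_str, all_genres):
    """Return a 0/1 list aligned with all_genres for one movie."""
    # "['Action', 'Fantasy']" -> {'Action', 'Fantasy'}
    present = set(ast.literal_eval(genres_str))
    return [1 if genre in present else 0 for genre in all_genres]

# Sorted so that the column order is reproducible between runs
all_genres = sorted(['Action', 'Fantasy', 'Thriller', 'Crime'])

row = multi_hot("['Action', 'Fantasy']", all_genres)
print(dict(zip(all_genres, row)))
```

Parsing the string first means membership is tested against whole genre names rather than substrings of the raw text.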

In [5]:
# from scrape_data import genre_lookup
# genre_names = genre_lookup
from api_auth import genre_ids,genre_url
genre_names = genre_ids(genre_url)
keys = genre_names.keys()
keys = list(keys)

In [6]:
# Collect all unique genre names from the genre lookup
all_genre_names = set()
for i in keys:
    all_genre_names.add(genre_names[i])

print(all_genre_names)

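# Create binary columns for each genre using genre names.
# df['genres'] holds strings like "['Action', 'Fantasy']", so match the quoted
# name to avoid partial substring hits.
for genre_name in all_genre_names:
    column_name = f'g_{genre_name}'
    df[column_name] = df['genres'].apply(lambda x: 1 if f"'{genre_name}'" in x else 0)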

df

{'Romance', 'Thriller', 'Crime', 'TV Movie', 'Comedy', 'War', 'Action', 'Mystery', 'Adventure', 'Family', 'Western', 'Horror', 'Drama', 'Science Fiction', 'Documentary', 'Fantasy', 'Music', 'Animation', 'History'}


Unnamed: 0,id,title,overview,genres,release_year,g_Romance,g_Thriller,g_Crime,g_TV Movie,g_Comedy,...,g_Family,g_Western,g_Horror,g_Drama,g_Science Fiction,g_Documentary,g_Fantasy,g_Music,g_Animation,g_History
0,846422,The Old Guard 2,Andy and her team of immortal warriors fight w...,"['Action', 'Fantasy']",2025,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,541671,Ballerina,Taking place during the events of John Wick: C...,"['Action', 'Thriller', 'Crime']",2025,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,749170,Heads of State,The UK Prime Minister and US President have a ...,"['Action', 'Thriller', 'Comedy']",2025,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1011477,Karate Kid: Legends,"After a family tragedy, kung fu prodigy Li Fon...","['Action', 'Adventure', 'Drama']",2025,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,1119878,Ice Road: Vengeance,Big rig ice road driver Mike McCann travels to...,"['Action', 'Thriller', 'Drama']",2025,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,1323784,Bad Influence,An ex-con gets a fresh start when hired to pro...,"['Thriller', 'Drama', 'Romance']",2025,1,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
496,20352,Despicable Me,Villainous Gru lives up to his reputation as a...,"['Family', 'Comedy', 'Animation']",2010,0,0,0,0,1,...,1,0,0,0,0,0,0,0,1,0
497,1276073,Bullet Train Explosion,When panic erupts on a Tokyo-bound bullet trai...,"['Action', 'Thriller', 'Crime', 'Drama']",2025,0,1,1,0,0,...,0,0,0,1,0,0,0,0,0,0
498,439079,The Nun,A priest with a haunted past and a novice on t...,['Horror'],2018,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


### **🔹 STEP 2: Convert Movie Descriptions to Numbers**

Convert text descriptions into numerical vectors that capture the importance of each word:
* Common words get lower weights (e.g., "the", "and")
* Distinctive words get higher weights (e.g., "apocalypse", "superhero")
* Similar movies will use similar important words
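
The weighting can be worked through by hand on a tiny corpus. The sketch below uses the smoothed IDF formula documented for scikit-learn's `TfidfVectorizer` defaults, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization. The three sentences are invented, tokenization is a plain whitespace split, and no stop words are removed, so the numbers will not match the vectorizer exactly on real text.

```python
import math
from collections import Counter

docs = [
    "heroes fight the apocalypse",
    "heroes save the city",
    "the city sleeps",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_weights(tokens):
    """Smoothed TF-IDF with L2 normalization for one tokenized document."""
    term_counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

first = tfidf_weights(tokenized[0])
for term, weight in sorted(first.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "the" appears in every document and ends up with the lowest weight, while "apocalypse" appears in only one and ends up among the highest.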

In [38]:
# Create TF-IDF vectors from movie overviews
from sklearn.feature_extraction.text import TfidfVectorizer

# Clean missing descriptions
df['overview'] = df['overview'].fillna('')

# Convert text to numbers (vocabulary capped at 500 terms)
tfidf_vectorizer = TfidfVectorizer(
    max_features=500,      # Keep the 500 most frequent terms across the corpus
    stop_words='english'   # Remove common words like 'the', 'and'
)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['overview'])

# Show matrix shape (number of movies × number of words)
print("Matrix shape:", tfidf_matrix.shape)

# Show example: words with highest weights in first movie
terms = tfidf_vectorizer.get_feature_names_out()
first_movie = tfidf_matrix[0].toarray()[0]
top_words = [(terms[i], first_movie[i]) for i in first_movie.argsort()[-5:][::-1]]
print("\nTop 5 important words in first movie:")
for word, weight in top_words:
    print(f"'{word}': {weight:.3f}")

Matrix shape: (500, 500)

Top 5 important words in first movie:
'andy': 0.427
'humanity': 0.387
'face': 0.355
'protect': 0.332
'powerful': 0.325


In [39]:
import joblib

# Save the TF-IDF vectorizer for later use
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.joblib')

['tfidf_vectorizer.joblib']

Now each movie is represented by the words that make it unique. The higher the number:
* The more important that word is to the movie
* The more useful it is for finding similar movies

### Quick Comparison: Sklearn vs Custom TF-IDF

Let's compare both implementations to verify they give similar results:

In [35]:
# Import both implementations
from sklearn.feature_extraction.text import TfidfVectorizer
from tf_idf import CustomTfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Initialize both with same parameters
sklearn_tfidf = TfidfVectorizer(max_features=500)
custom_tfidf = CustomTfidfVectorizer(max_features=500)

# Get movie descriptions
descriptions = df['overview'].fillna('')

# Create vectors using both methods
sklearn_vectors = sklearn_tfidf.fit_transform(descriptions)
custom_vectors = custom_tfidf.fit_transform(descriptions)

# Compare dimensions
print("Matrix Shapes:")
print(f"Sklearn: {sklearn_vectors.shape}")
print(f"Custom:  {custom_vectors.shape}")

# Compare first movie's top terms
print("\nTop 5 important terms in first movie:")
print("\nSklearn implementation:")
sklearn_terms = sklearn_tfidf.get_feature_names_out()
sklearn_weights = sklearn_vectors[0].toarray()[0]
sklearn_top = [(sklearn_terms[i], sklearn_weights[i]) 
               for i in sklearn_weights.argsort()[-5:][::-1]]
for term, weight in sklearn_top:
    print(f"'{term}': {weight:.3f}")

print("\nCustom implementation:")
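custom_terms = custom_tfidf.vocabulary_
# Handle both sparse matrix and numpy array formats
custom_weights = custom_vectors[0] if isinstance(custom_vectors, np.ndarray) else custom_vectors[0].toarray()[0]
# NOTE: this lookup assumes the key order of vocabulary_ matches the column order;
# if vocabulary_ maps term -> column index, the mapping should be inverted instead.
custom_top = [(list(custom_terms.keys())[i], custom_weights[i]) 
              for i in custom_weights.argsort()[-5:][::-1]]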
for term, weight in custom_top:
    print(f"'{term}': {weight:.3f}")

# Calculate similarity between implementations
# Ensure both are in the same format for comparison
sklearn_vec = sklearn_vectors[0].toarray()
custom_vec = custom_vectors[0].reshape(1, -1) if isinstance(custom_vectors, np.ndarray) else custom_vectors[0].toarray()
similarity = cosine_similarity(sklearn_vec, custom_vec)[0][0]
print(f"\nSimilarity between implementations: {similarity:.3f}")

Matrix Shapes:
Sklearn: (500, 500)
Custom:  (500, 500)

Top 5 important terms in first movie:

Sklearn implementation:
'andy': 0.388
'humanity': 0.351
'face': 0.322
'protect': 0.302
'powerful': 0.295

Custom implementation:
'renewed': 1.000
'eradicate': 0.000
'darkness': 0.000
'yunzhou': 0.000
'wei': 0.000

Similarity between implementations: 0.000


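### What the Comparison Shows:

1. **Same Dimensions**: Both implementations produce a 500 × 500 matrix.
2. **Different Top Terms**: For the first movie, sklearn ranks 'andy', 'humanity', 'face', 'protect', and 'powerful' highest, while the custom implementation reports a single term ('renewed') with weight 1.000 and zeros for the rest.
3. **No Similarity**: The cosine similarity between the two vectors for the first movie is 0.000, so the implementations do not agree in this run.
4. **Possible Causes** (not verified here):
   - The two vocabularies may assign different column indices to the same term, in which case a column-by-column comparison is not meaningful
   - The term lookup `list(custom_terms.keys())[i]` assumes the key order of `vocabulary_` matches the column order
   - Different preprocessing or a different choice of the top 500 terms

This output does not confirm that the custom implementation matches sklearn's. The columns need to be aligned by term before the comparison can say anything about correctness, and sklearn's version remains the more production-ready choice.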
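
A cosine similarity between two implementations' vectors is only meaningful when column `i` refers to the same term in both. The sketch below shows one way to align by term first. It assumes both vocabularies are dicts mapping term to column index (sklearn's `vocabulary_` has this form; whether `CustomTfidfVectorizer` does is an assumption), and `aligned_cosine` is a hypothetical helper, not part of either library.

```python
import numpy as np

def aligned_cosine(vec_a, vocab_a, vec_b, vocab_b):
    """Cosine similarity restricted to the terms both vocabularies share.

    vocab_a / vocab_b map term -> column index (assumed for both implementations).
    """
    shared = sorted(set(vocab_a) & set(vocab_b))
    if not shared:
        return 0.0
    a = np.array([vec_a[vocab_a[t]] for t in shared], dtype=float)
    b = np.array([vec_b[vocab_b[t]] for t in shared], dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy example: identical weights per term, but different column order
vocab_a = {"andy": 0, "humanity": 1, "face": 2}
vocab_b = {"face": 0, "andy": 1, "humanity": 2}
vec_a = np.array([0.6, 0.5, 0.4])
vec_b = np.array([0.4, 0.6, 0.5])

naive = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
aligned = aligned_cosine(vec_a, vocab_a, vec_b, vocab_b)
print(f"naive: {naive:.3f}, aligned: {aligned:.3f}")
```

Restricting to shared terms ignores words that only one vocabulary contains, so the size of the overlap is worth reporting next to the score.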

### Why Use Sklearn's TF-IDF?

While building our own TF-IDF was a great learning exercise, we'll use sklearn's implementation because it:
1. Is highly optimized for performance
2. Has been thoroughly tested in production
3. Integrates well with other ML tools
4. Handles edge cases automatically

### **🔹 STEP 3: Combine TF-IDF with Genres**

Combining text and categorical features:
* TF-IDF gives us word importance vectors
* Genre columns give us binary category vectors
* Stack them together for a complete movie representation

Think:
* TF-IDF output is a sparse matrix → we need to stack it with genre columns
* Use scipy or numpy to do it efficiently
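
Here is a small sketch of the stacking step on toy matrices. With `TfidfVectorizer`'s default `norm='l2'`, each non-empty TF-IDF row has length 1, while a movie with k genres contributes a genre block of length √k, so the genre columns can dominate distances. The `genre_weight` factor below is a hypothetical tuning knob for rebalancing the two blocks and is not part of the notebook's pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: 2 movies, 3 TF-IDF columns (unit-length rows), 2 genre columns
tfidf_block = csr_matrix(np.array([[0.6, 0.8, 0.0],
                                   [0.0, 0.0, 1.0]]))
genre_block = csr_matrix(np.array([[1, 0],
                                   [1, 1]]))

genre_weight = 0.5  # hypothetical; 1.0 reproduces plain stacking

# format='csr' keeps the result row-indexable whatever SciPy's default output format is
combined = hstack([tfidf_block, genre_block * genre_weight], format='csr')

print(combined.shape)
print(combined[0].toarray()[0])
```

Requesting CSR explicitly matters later on, because row lookups such as `final_matrix[movie_index]` are not supported by every sparse format.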

In [36]:
# Import for matrix operations
from scipy import sparse
from scipy.sparse import hstack, csr_matrix
import numpy as np

# Get genre columns (columns that start with 'g_')
genre_cols = [col for col in df.columns if col.startswith('g_')]
genre_matrix = csr_matrix(df[genre_cols].values)

# Combine TF-IDF matrix with genre matrix (CSR so rows can be indexed after loading)
final_matrix = hstack([tfidf_matrix, genre_matrix], format='csr')

# Save the combined matrix
sparse.save_npz('final_matrix.npz', final_matrix)

# Save feature names (both words and genres) for reference
feature_names = list(tfidf_vectorizer.get_feature_names_out()) + genre_cols
np.save('feature_names.npy', feature_names)

print("Matrix shapes:")
print(f"TF-IDF matrix:  {tfidf_matrix.shape}")
print(f"Genre matrix:   {genre_matrix.shape}")
print(f"Final matrix:   {final_matrix.shape}")

print("\nSaved files:")
print("- final_matrix.npz: Combined TF-IDF and genre vectors")
print("- feature_names.npy: Names of all features (words + genres)")
print(f"\nTotal features: {len(feature_names)} ({tfidf_matrix.shape[1]} words + {len(genre_cols)} genres)")

Matrix shapes:
TF-IDF matrix:  (500, 500)
Genre matrix:   (500, 19)
Final matrix:   (500, 519)

Saved files:
- final_matrix.npz: Combined TF-IDF and genre vectors
- feature_names.npy: Names of all features (words + genres)

Total features: 519 (500 words + 19 genres)


### Using the Saved Data

#### 1. Files Created
- `final_matrix.npz`: Combined movie features (TF-IDF + genres)
- `feature_names.npy`: Names of all features for interpretation
- `tfidf_vectorizer.joblib`: For processing new movie descriptions

#### 2. Loading Code
```python
import joblib
import scipy.sparse as sparse
import numpy as np

# Load all saved data
final_matrix = sparse.load_npz('final_matrix.npz')
feature_names = np.load('feature_names.npy', allow_pickle=True)
tfidf = joblib.load('tfidf_vectorizer.joblib')

# Understanding the matrix structure
n_movies, n_features = final_matrix.shape
n_words = len(tfidf.get_feature_names_out())
n_genres = len([f for f in feature_names if f.startswith('g_')])

print(f"Dataset contains {n_movies} movies with {n_features} features:")
print(f"- {n_words} TF-IDF features (words)")
print(f"- {n_genres} genre features")
```

#### 3. Matrix Structure
Each row in `final_matrix` represents a movie with:
- First part (columns 0 to n_words-1): TF-IDF weights for words
- Second part (remaining columns): Binary genre indicators (1=has genre, 0=doesn't)

#### 4. Common Operations
```python
# Get a specific movie's features
movie_index = 0  # Change this to the movie you want
movie_vector = final_matrix[movie_index].toarray()[0]

# Get feature importances
word_weights = movie_vector[:n_words]
genre_flags = movie_vector[n_words:]

# Find most important words
top_words = [(feature_names[i], word_weights[i]) 
             for i in word_weights.argsort()[-5:][::-1]]

# Check movie's genres
movie_genres = [feature_names[n_words + i] for i in range(n_genres) 
                if genre_flags[i] == 1]
```
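
#### 5. Finding Similar Movies (Sketch)
The combined matrix is meant to feed a KNN search. The following is a minimal sketch of what that step could look like, using scikit-learn's `NearestNeighbors` with cosine distance on toy data. The titles, overviews, and two-genre layout are invented, and the choice of `metric='cosine'` is an assumption rather than something this notebook fixes.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the saved data (titles and overviews are made up)
titles = ["Space War", "Galaxy Battle", "Love in Paris", "Paris Romance"]
overviews = [
    "soldiers fight aliens in space",
    "a space fleet battles aliens",
    "two strangers fall in love in paris",
    "a romance blooms in paris",
]
genre_flags = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # [g_Action, g_Romance]

vectorizer = TfidfVectorizer(stop_words='english')
features = hstack(
    [vectorizer.fit_transform(overviews), csr_matrix(genre_flags)],
    format='csr',
)

# Brute-force search with cosine distance accepts sparse input directly
knn = NearestNeighbors(n_neighbors=2, metric='cosine', algorithm='brute')
knn.fit(features)

# Query with movie 0; the closest hit is the movie itself, so drop it
query_index = 0
distances, indices = knn.kneighbors(features[query_index])
recommended = [titles[i] for i in indices[0] if i != query_index]
print(recommended)
```

For a movie that is not in the matrix, the same query would use `vectorizer.transform([new_overview])` stacked with that movie's genre flags in the same column order.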