## üé¨ Movie Recommender System - Model Development & Analysis

---

### üìã Overview

This notebook demonstrates the complete development process of a **Content-Based Movie Recommendation System** using Machine Learning techniques. The system analyzes movie metadata (genres, keywords, cast, crew, and plot) to recommend similar movies.

### üéØ Objectives
1. **Data Loading & Exploration**: Load and understand the TMDB movie datasets
2. **Data Preprocessing**: Clean, merge, and transform raw data
3. **Feature Engineering**: Extract and combine relevant features into tags
4. **Vectorization**: Convert text data into numerical vectors
5. **Similarity Computation**: Calculate cosine similarity between movies
6. **Model Export**: Save processed data for production use

### üìä Dataset Information
- **Source**: TMDB 5000 Movie Dataset
- **Movies**: ~4,800 movies
- **Features**: Title, overview, genres, keywords, cast, crew, budget, revenue, ratings

### üîß Technology Stack
- **Data Processing**: Pandas, NumPy
- **Machine Learning**: Scikit-learn (CountVectorizer, Cosine Similarity)
- **Serialization**: Pickle

---

### üìö Step 1: Import Required Libraries

First, we import all necessary libraries for data manipulation, text processing, and machine learning.

In [1]:
import pandas as pd
import numpy as np
import pickle
import ast
import json
from sklearn.feature_extraction.text import CountVectorizer # Machine Learning
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')
print("‚úÖ All libraries imported successfully!")

‚úÖ All libraries imported successfully!


### üìÇ Step 2: Load Datasets

In [2]:
print("="*70)
print("üìÇ LOADING DATASETS")
print("="*70)

# Load movies dataset
movies = pd.read_csv('tmdb_5000_movies.csv')
print(f"‚úÖ Movies loaded: {len(movies):,} entries")

# Load credits dataset
credits = pd.read_csv('tmdb_5000_credits.csv')
print(f"‚úÖ Credits loaded: {len(credits):,} entries")

print("Movies Dataset Shape:", movies.shape)
print("Credits Dataset Shape:", credits.shape)

üìÇ LOADING DATASETS
‚úÖ Movies loaded: 4,803 entries
‚úÖ Credits loaded: 4,803 entries
Movies Dataset Shape: (4803, 20)
Credits Dataset Shape: (4803, 4)


## üîç Step 3: Exploratory Data Analysis (EDA)

Let's examine the structure and content of our datasets.

In [3]:
# Display first few rows of movies dataset
print("\n" + "="*70)
print("üé¨ MOVIES DATASET - FIRST 3 ROWS")
print("="*70)
movies.head(3)


üé¨ MOVIES DATASET - FIRST 3 ROWS


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",10-12-2009,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",19-05-2007,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond‚Äôs past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",26-10-2015,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466


In [4]:
# Display first few rows of credits dataset
print("\n" + "="*70)
print("CREDITS DATASET - FIRST 3 ROWS")
print("="*70)
credits.head(3)


CREDITS DATASET - FIRST 3 ROWS


Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [5]:
# Check for missing values in movies
print("\n" + "="*70)
print("üîç MISSING VALUES IN MOVIES DATASET")
print("="*70)
movies.isnull().sum()


üîç MISSING VALUES IN MOVIES DATASET


budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

In [6]:
# Check for missing values in credits
print("\n" + "="*70)
print("üîç MISSING VALUES IN CREDITS DATASET")
print("="*70)
credits.isnull().sum()


üîç MISSING VALUES IN CREDITS DATASET


movie_id    0
title       0
cast        0
crew        0
dtype: int64

In [7]:
# Data types
print("\n" + "="*70)
print("üìä DATA TYPES")
print("="*70)
print("\nüé¨ Movies Dataset:")
print(movies.dtypes)
print("Credits Dataset:")
print(credits.dtypes)


üìä DATA TYPES

üé¨ Movies Dataset:
budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
status                   object
tagline                  object
title                    object
vote_average            float64
vote_count                int64
dtype: object
Credits Dataset:
movie_id     int64
title       object
cast        object
crew        object
dtype: object


In [8]:
# Example of genres column (JSON-like string)
print("\n" + "="*70)
print("üîç EXAMPLE: GENRES COLUMN (RAW FORMAT)")
print("="*70)
print(movies['genres'].iloc[0])
print("\nNote: Stored as string, needs parsing to extract genre names")


üîç EXAMPLE: GENRES COLUMN (RAW FORMAT)
[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]

Note: Stored as string, needs parsing to extract genre names


### Step 4: Merge Datasets
Combine movies and credits datasets on the 'title' column to have all information in one place.

In [9]:
print("\n" + "="*70)
print("üîó MERGING DATASETS")
print("="*70)

# Merge on title
movies = movies.merge(credits, on='title')

print(f"‚úÖ Merged dataset size: {len(movies):,} movies")
print(f"üìä Total columns: {len(movies.columns)}")
print(f"\n Column names:\n{movies.columns.tolist()}")


üîó MERGING DATASETS
‚úÖ Merged dataset size: 4,807 movies
üìä Total columns: 23

 Column names:
['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count', 'movie_id', 'cast', 'crew']


### üéØ Step 5: Feature Selection

Select only the columns necessary for building recommendations:
- **movie_id**: Unique identifier
- **title**: Movie name
- **overview**: Plot summary
- **genres**: Movie genres
- **keywords**: Associated keywords
- **cast**: Actors
- **crew**: Directors and crew members

In [10]:
print("\n" + "="*70)
print("üéØ FEATURE SELECTION")
print("="*70)

# Select important columns
required_cols = ['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']

# Optional columns for statistics
optional_cols = ['vote_average', 'release_date', 'runtime', 'budget', 'revenue']

# Keep available columns
available_cols = [col for col in required_cols + optional_cols if col in movies.columns]
movies = movies[available_cols]

print(f"üìä Selected columns: {', '.join(available_cols)}")
print(f"\nüìã Dataset shape after selection: {movies.shape}")


üéØ FEATURE SELECTION
üìä Selected columns: movie_id, title, overview, genres, keywords, cast, crew, vote_average, release_date, runtime, budget, revenue

üìã Dataset shape after selection: (4807, 12)


In [11]:
# Remove rows with missing values in required columns
print("\nüßπ Cleaning data...")
initial_count = len(movies)
movies.dropna(subset=[col for col in required_cols if col in movies.columns], inplace=True)
removed = initial_count - len(movies)

print(f"‚úÖ Removed {removed} movies with missing data")
print(f"‚úÖ Final dataset: {len(movies):,} movies")


üßπ Cleaning data...
‚úÖ Removed 3 movies with missing data
‚úÖ Final dataset: 4,804 movies


### Step 6: Helper Functions for Data Parsing
Define utility functions to extract useful information from JSON-like string columns.

In [12]:
def convert_to_list(obj):
    """
    Convert string representation of list to actual list.
    Extracts 'name' field from JSON-like structures.
    
    Example:
    '[{"id": 28, "name": "Action"}]' -> ['Action']
    """
    try:
        L = []
        data = ast.literal_eval(obj)
        for i in data:
            if isinstance(i, dict):
                L.append(i.get('name', ''))
            else:
                L.append(str(i))
        return L
    except:
        return []

def convert_cast(obj):
    """
    Extract top 3 cast members.
    Limits to 3 to focus on main actors.
    
    Example:
    '[{"name": "Sam Worthington"}, ...]' -> ['SamWorthington', ...]
    """
    try:
        L = []
        counter = 0
        for i in ast.literal_eval(obj):
            if counter < 3:
                name = i.get('name', '') if isinstance(i, dict) else str(i)
                if name:
                    L.append(name)
                    counter += 1
            else:
                break
        return L
    except:
        return []

def fetch_director(obj):
    """
    Extract director name from crew.
    Directors have significant influence on movie style.
    
    Example:
    '[{"name": "James Cameron", "job": "Director"}]' -> ['JamesCameron']
    """
    try:
        L = []
        for i in ast.literal_eval(obj):
            if isinstance(i, dict) and i.get('job') == 'Director':
                director = i.get('name', '')
                if director:
                    L.append(director)
                    break
        return L
    except:
        return []

def collapse_list(L):
    """
    Remove spaces from list items for better matching.
    'Sam Worthington' -> 'SamWorthington'
    This prevents confusion in vectorization.
    """
    if isinstance(L, list):
        return [i.replace(" ", "") for i in L if i]
    return []

print("‚úÖ Helper functions defined successfully!")
print("\nüìù Functions:")
print("  ‚Ä¢ convert_to_list(): Parse JSON-like strings")
print("  ‚Ä¢ convert_cast(): Extract top 3 actors")
print("  ‚Ä¢ fetch_director(): Extract director name")
print("  ‚Ä¢ collapse_list(): Remove spaces from names")

‚úÖ Helper functions defined successfully!

üìù Functions:
  ‚Ä¢ convert_to_list(): Parse JSON-like strings
  ‚Ä¢ convert_cast(): Extract top 3 actors
  ‚Ä¢ fetch_director(): Extract director name
  ‚Ä¢ collapse_list(): Remove spaces from names


### üîÑ Step 7: Apply Feature Transformations

Transform raw data into clean, structured features.

In [13]:
print("\n" + "="*70)
print("üîÑ APPLYING FEATURE TRANSFORMATIONS")
print("="*70)

# 1Ô∏è Genres
print("\n1Ô∏è‚É£ Processing genres...")
movies['genres'] = movies['genres'].apply(convert_to_list)
print("   Example:", movies['genres'].iloc[0])

# 2Ô∏è Keywords
print("\n2Ô∏è‚É£ Processing keywords...")
movies['keywords'] = movies['keywords'].apply(convert_to_list)
print("   Example:", movies['keywords'].iloc[0][:5])

# 3Ô∏è Cast (top 3 actors)
print("\n3Ô∏è‚É£ Processing cast (top 3 actors)...")
if 'cast' in movies:
    movies['cast'] = movies['cast'].apply(convert_cast)
    print("   Example:", movies['cast'].iloc[0])
else:
    print("   ‚ùå 'cast' column missing")

# 4Ô∏è Crew (director)
print("\n4Ô∏è‚É£ Processing crew (director)...")
if 'crew' in movies:
    movies['crew'] = movies['crew'].apply(fetch_director)
    print("   Example:", movies['crew'].iloc[0])
else:
    print("   ‚ùå 'crew' column missing")

# 5Ô∏è Overview
print("\n5Ô∏è‚É£ Processing overview...")
movies['overview'] = movies['overview'].apply(
    lambda x: x.split() if isinstance(x, str) else []
)
print("   Example:", movies['overview'].iloc[0][:10])

print("\n‚úÖ All transformations applied successfully!")



üîÑ APPLYING FEATURE TRANSFORMATIONS

1Ô∏è‚É£ Processing genres...
   Example: ['Action', 'Adventure', 'Fantasy', 'Science Fiction']

2Ô∏è‚É£ Processing keywords...
   Example: ['culture clash', 'future', 'space war', 'space colony', 'society']

3Ô∏è‚É£ Processing cast (top 3 actors)...
   Example: ['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']

4Ô∏è‚É£ Processing crew (director)...
   Example: ['James Cameron']

5Ô∏è‚É£ Processing overview...
   Example: ['In', 'the', '22nd', 'century,', 'a', 'paraplegic', 'Marine', 'is', 'dispatched', 'to']

‚úÖ All transformations applied successfully!


In [14]:
# Remove spaces from names for consistent matching
print("\n" + "="*70)
print("üßπ CLEANING TAG DATA (REMOVING SPACES)")
print("="*70)

movies['genres'] = movies['genres'].apply(collapse_list)
movies['keywords'] = movies['keywords'].apply(collapse_list)
movies['cast'] = movies['cast'].apply(collapse_list)
movies['crew'] = movies['crew'].apply(collapse_list)

print("‚úÖ Space removal complete!")
print(f"\nExample (genres after cleanup): {movies['genres'].iloc[0]}")
print(f"Example (cast after cleanup): {movies['cast'].iloc[0]}")



üßπ CLEANING TAG DATA (REMOVING SPACES)
‚úÖ Space removal complete!

Example (genres after cleanup): ['Action', 'Adventure', 'Fantasy', 'ScienceFiction']
Example (cast after cleanup): ['SamWorthington', 'ZoeSaldana', 'SigourneyWeaver']


## üè∑Ô∏è Step 8: Create Tags Column

Combine all features (overview, genres, keywords, cast, crew) into a single 'tags' column for similarity computation.

In [15]:
print("\n" + "="*70)
print("üè∑Ô∏è CREATING TAGS COLUMN")
print("="*70)

# Combine all features
movies['tags'] = (
    movies['overview'] + 
    movies['genres'] + 
    movies['keywords'] + 
    movies['cast'] + 
    movies['crew']
)

# Convert to lowercase string
movies['tags'] = movies['tags'].apply(lambda x: " ".join(x).lower())

print("‚úÖ Tags created successfully!")
print(f"\nüìä Statistics:")
print(f"   ‚Ä¢ Average tag length: {movies['tags'].str.len().mean():.0f} characters")
print(f"   ‚Ä¢ Min tag length: {movies['tags'].str.len().min():.0f} characters")
print(f"   ‚Ä¢ Max tag length: {movies['tags'].str.len().max():.0f} characters")


üè∑Ô∏è CREATING TAGS COLUMN
‚úÖ Tags created successfully!

üìä Statistics:
   ‚Ä¢ Average tag length: 457 characters
   ‚Ä¢ Min tag length: 23 characters
   ‚Ä¢ Max tag length: 1481 characters


In [16]:
# Display example of tags for 'Avatar'
print("\n" + "="*70)
print("üìù EXAMPLE: TAGS FOR 'AVATAR'")
print("="*70)
avatar_tags = movies[movies['title'] == 'Avatar']['tags'].iloc[0]
print(f"\nTags (first 500 characters):\n{avatar_tags[:500]}...")
print(f"\nTotal length: {len(avatar_tags)} characters")


üìù EXAMPLE: TAGS FOR 'AVATAR'

Tags (first 500 characters):
in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron...

Total length: 455 characters


In [17]:
# Display final processed dataset
print("\n" + "="*70)
print("üìä FINAL PROCESSED DATASET")
print("="*70)
movies[['movie_id', 'title', 'tags']].head(10)


üìä FINAL PROCESSED DATASET


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond‚Äôs past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."
5,559,Spider-Man 3,the seemingly invincible spider-man goes up ag...
6,38757,Tangled,when the kingdom's most wanted-and most charmi...
7,99861,Avengers: Age of Ultron,when tony stark tries to jumpstart a dormant p...
8,767,Harry Potter and the Half-Blood Prince,"as harry begins his sixth year at hogwarts, he..."
9,209112,Batman v Superman: Dawn of Justice,fearing the actions of a god-like super hero l...


### Step 9: Vectorization using CountVectorizer

Convert text tags into numerical vectors using **Bag of Words** approach.

**Why CountVectorizer?**
- Converts text to numerical representation
- Focuses on word frequency
- Simple and effective for content-based filtering

**Parameters:**
- `max_features=5000`: Keep only top 5000 most frequent words
- `stop_words='english'`: Remove common words (the, is, at, etc.)
- `max_df=0.85`: Ignore words appearing in >85% of documents
- `min_df=2`: Ignore words appearing in <2 documents

In [18]:
print("\n" + "="*70)
print("VECTORIZATION - CONVERTING TEXT TO NUMBERS")
print("="*70)

# Initialize CountVectorizer
cv = CountVectorizer(
    max_features=5000,
    stop_words='english',
    max_df=0.85,
    min_df=2
)

print("\n‚öôÔ∏è CountVectorizer Parameters:")
print(f"   ‚Ä¢ max_features: 5000 (top 5000 words)")
print(f"   ‚Ä¢ stop_words: english (remove common words)")
print(f"   ‚Ä¢ max_df: 0.85 (ignore if in >85% of docs)")
print(f"   ‚Ä¢ min_df: 2 (ignore if in <2 docs)")

print("\nüîÑ Fitting and transforming tags...")
vectors = cv.fit_transform(movies['tags']).toarray()

print("\n‚úÖ Vectorization complete!")
print(f"\nüìä Vector Matrix Information:")
print(f"   ‚Ä¢ Shape: {vectors.shape}")
print(f"   ‚Ä¢ Movies: {vectors.shape[0]:,}")
print(f"   ‚Ä¢ Features: {vectors.shape[1]:,}")
print(f"   ‚Ä¢ Matrix size: {vectors.nbytes / (1024**2):.2f} MB")


VECTORIZATION - CONVERTING TEXT TO NUMBERS

‚öôÔ∏è CountVectorizer Parameters:
   ‚Ä¢ max_features: 5000 (top 5000 words)
   ‚Ä¢ stop_words: english (remove common words)
   ‚Ä¢ max_df: 0.85 (ignore if in >85% of docs)
   ‚Ä¢ min_df: 2 (ignore if in <2 docs)

üîÑ Fitting and transforming tags...

‚úÖ Vectorization complete!

üìä Vector Matrix Information:
   ‚Ä¢ Shape: (4804, 5000)
   ‚Ä¢ Movies: 4,804
   ‚Ä¢ Features: 5,000
   ‚Ä¢ Matrix size: 183.26 MB


In [19]:
# Display feature names (vocabulary)
print("\n" + "="*70)
print("üìù EXTRACTED FEATURES (VOCABULARY)")
print("="*70)

feature_names = cv.get_feature_names_out()
print(f"\nTotal features extracted: {len(feature_names):,}")
print(f"\nFirst 50 features:\n{list(feature_names[:50])}")
print(f"\nLast 50 features:\n{list(feature_names[-50:])}")


üìù EXTRACTED FEATURES (VOCABULARY)

Total features extracted: 5,000

First 50 features:
['000', '007', '10', '100', '11', '12', '13', '14', '15', '16', '17', '17th', '18', '18th', '18thcentury', '19', '1930s', '1940s', '1950s', '1960s', '1970s', '1974', '1976', '1980', '1980s', '1985', '1990s', '19th', '19thcentury', '20', '200', '2009', '20th', '24', '25', '30', '300', '3d', '40', '50', '500', '60', '60s', '70', 'aaron', 'aaroneckhart', 'abandoned', 'abducted', 'abigailbreslin', 'abilities']

Last 50 features:
['workers', 'working', 'works', 'world', 'worlds', 'worldwari', 'worldwarii', 'worldwide', 'worse', 'worst', 'worth', 'wounded', 'wreak', 'wrestling', 'wretch', 'wright', 'write', 'writer', 'writes', 'writing', 'written', 'wrong', 'wwii', 'wyoming', 'xenophobia', 'yacht', 'yakuza', 'yard', 'year', 'years', 'yellow', 'york', 'young', 'youngadult', 'younger', 'youngest', 'youth', 'yuppie', 'zacefron', 'zachgalifianakis', 'zebra', 'zeus', 'zoe', 'zoesaldana', 'zombie', 'zombieap

In [20]:
# Example: Vector representation of 'Avatar'
print("\n" + "="*70)
print("üîç EXAMPLE: VECTOR REPRESENTATION OF 'AVATAR'")
print("="*70)

avatar_idx = movies[movies['title'] == 'Avatar'].index[0]
avatar_vector = vectors[avatar_idx]

print(f"\nAvatar vector shape: {avatar_vector.shape}")
print(f"Non-zero elements: {np.count_nonzero(avatar_vector)}")
print(f"\nFirst 20 vector values:\n{avatar_vector[:20]}")
print(f"\nWords with highest frequency in Avatar:")
top_indices = np.argsort(avatar_vector)[-10:][::-1]
for idx in top_indices:
    print(f"   ‚Ä¢ {feature_names[idx]}: {avatar_vector[idx]}")


üîç EXAMPLE: VECTOR REPRESENTATION OF 'AVATAR'

Avatar vector shape: (5000,)
Non-zero elements: 31

First 20 vector values:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Words with highest frequency in Avatar:
   ‚Ä¢ marine: 2
   ‚Ä¢ alien: 2
   ‚Ä¢ soldier: 1
   ‚Ä¢ sigourneyweaver: 1
   ‚Ä¢ zoesaldana: 1
   ‚Ä¢ unique: 1
   ‚Ä¢ spacetravel: 1
   ‚Ä¢ moon: 1
   ‚Ä¢ torn: 1
   ‚Ä¢ mission: 1


### Step 10: Calculate Cosine Similarity

Compute similarity between all movies using **Cosine Similarity**.

**What is Cosine Similarity?**
- Measures the cosine of the angle between two vectors
- Range: -1 (opposite) to 1 (identical)
- Used to find how similar two movies are based on their features

**Formula:**
```
cosine_similarity(A, B) = (A ¬∑ B) / (||A|| * ||B||)
```

**Result:**
- A matrix where `similarity[i][j]` = similarity between movie i and movie j
- Diagonal elements = 1 (movie is 100% similar to itself)

In [21]:
print("\n" + "="*70)
print("CALCULATING COSINE SIMILARITY MATRIX")
print("="*70)
print("‚è≥ This may take 2-5 minutes depending on your system...\n")

import time
start_time = time.time()

# Calculate cosine similarity
similarity = cosine_similarity(vectors)

elapsed_time = time.time() - start_time

print(f"‚úÖ Similarity matrix created in {elapsed_time:.2f} seconds!")
print(f"\nüìä Similarity Matrix Information:")
print(f"   ‚Ä¢ Shape: {similarity.shape}")
print(f"   ‚Ä¢ Type: {type(similarity)}")
print(f"   ‚Ä¢ Data type: {similarity.dtype}")
print(f"   ‚Ä¢ Size: {similarity.nbytes / (1024**2):.2f} MB")
print(f"   ‚Ä¢ Memory layout: C-contiguous")


CALCULATING COSINE SIMILARITY MATRIX
‚è≥ This may take 2-5 minutes depending on your system...

‚úÖ Similarity matrix created in 1.10 seconds!

üìä Similarity Matrix Information:
   ‚Ä¢ Shape: (4804, 4804)
   ‚Ä¢ Type: <class 'numpy.ndarray'>
   ‚Ä¢ Data type: float64
   ‚Ä¢ Size: 176.07 MB
   ‚Ä¢ Memory layout: C-contiguous


In [22]:
# Display example similarities
print("\n" + "="*70)
print("üîç EXAMPLE: SIMILARITY SCORES FOR 'AVATAR'")
print("="*70)

avatar_idx = movies[movies['title'] == 'Avatar'].index[0]
avatar_similarities = list(enumerate(similarity[avatar_idx]))
avatar_similarities = sorted(avatar_similarities, reverse=True, key=lambda x: x[1])

print(f"\nTop 10 movies similar to Avatar:\n")
print(f"{'Rank':<6} {'Movie Title':<40} {'Similarity':<12}")
print("="*70)
for rank, (idx, score) in enumerate(avatar_similarities[:10], 1):
    title = movies.iloc[idx]['title']
    print(f"{rank:<6} {title:<40} {score:.4f} ({score*100:.2f}%)")


üîç EXAMPLE: SIMILARITY SCORES FOR 'AVATAR'

Top 10 movies similar to Avatar:

Rank   Movie Title                              Similarity  
1      Avatar                                   1.0000 (100.00%)
2      Titan A.E.                               0.2537 (25.37%)
3      Small Soldiers                           0.2511 (25.11%)
4      Ender's Game                             0.2442 (24.42%)
5      Independence Day                         0.2438 (24.38%)
6      Aliens vs Predator: Requiem              0.2426 (24.26%)
7      Battle: Los Angeles                      0.2373 (23.73%)
8      Krull                                    0.2373 (23.73%)
9      Predators                                0.2369 (23.69%)
10     Lifeforce                                0.2339 (23.39%)


In [23]:
# Visualize similarity distribution
print("\n" + "="*70)
print("üìä SIMILARITY SCORE DISTRIBUTION")
print("="*70)

# Flatten similarity matrix (excluding diagonal)
flattened = similarity[np.triu_indices_from(similarity, k=1)]

print(f"\nStatistics (excluding self-similarity):")
print(f"   ‚Ä¢ Mean similarity: {flattened.mean():.4f}")
print(f"   ‚Ä¢ Median similarity: {np.median(flattened):.4f}")
print(f"   ‚Ä¢ Std deviation: {flattened.std():.4f}")
print(f"   ‚Ä¢ Min similarity: {flattened.min():.4f}")
print(f"   ‚Ä¢ Max similarity: {flattened.max():.4f}")
print(f"   ‚Ä¢ 25th percentile: {np.percentile(flattened, 25):.4f}")
print(f"   ‚Ä¢ 75th percentile: {np.percentile(flattened, 75):.4f}")


üìä SIMILARITY SCORE DISTRIBUTION

Statistics (excluding self-similarity):
   ‚Ä¢ Mean similarity: 0.0384
   ‚Ä¢ Median similarity: 0.0316
   ‚Ä¢ Std deviation: 0.0404
   ‚Ä¢ Min similarity: 0.0000
   ‚Ä¢ Max similarity: 0.9836
   ‚Ä¢ 25th percentile: 0.0000
   ‚Ä¢ 75th percentile: 0.0582


### üéØ Step 11: Build Recommendation Function

Create a function to recommend movies based on similarity scores.

In [24]:
def recommend_movies(movie_title, n=10):
    """
    Recommend movies similar to the given movie.
    
    Parameters:
    -----------
    movie_title : str
        Title of the movie to base recommendations on
    n : int, default=10
        Number of recommendations to return
    
    Returns:
    --------
    list of tuples: [(movie_title, similarity_score), ...]
    """
    # Check if movie exists
    if movie_title not in movies['title'].values:
        return f"‚ùå Movie '{movie_title}' not found in dataset!"
    
    # Get movie index
    movie_idx = movies[movies['title'] == movie_title].index[0]
    
    # Get similarity scores for this movie
    distances = list(enumerate(similarity[movie_idx]))
    
    # Sort by similarity (descending)
    distances = sorted(distances, reverse=True, key=lambda x: x[1])
    
    # Get top N recommendations (skip first one - it's the movie itself)
    recommendations = []
    for idx, score in distances[1:n+1]:
        title = movies.iloc[idx]['title']
        recommendations.append((title, score))
    
    return recommendations

print("‚úÖ Recommendation function created!")

‚úÖ Recommendation function created!


### üé¨ Step 12: Test Recommendations

Test the recommendation system with different movies.

In [25]:
# Test 1: Avatar
print("="*70)
print("üé¨ TEST 1: RECOMMENDATIONS FOR 'AVATAR'")
print("="*70)

recs = recommend_movies('Avatar', n=10)
print(f"\n{'Rank':<6} {'Movie Title':<50} {'Match %':<10}")
print("="*70)
for rank, (title, score) in enumerate(recs, 1):
    print(f"{rank:<6} {title:<50} {score*100:.2f}%")

üé¨ TEST 1: RECOMMENDATIONS FOR 'AVATAR'

Rank   Movie Title                                        Match %   
1      Titan A.E.                                         25.37%
2      Small Soldiers                                     25.11%
3      Ender's Game                                       24.42%
4      Independence Day                                   24.38%
5      Aliens vs Predator: Requiem                        24.26%
6      Battle: Los Angeles                                23.73%
7      Krull                                              23.73%
8      Predators                                          23.69%
9      Lifeforce                                          23.39%
10     Falcon Rising                                      23.25%


In [26]:
# Test 2: The Dark Knight
print("\n" + "="*70)
print("üé¨ TEST 2: RECOMMENDATIONS FOR 'THE DARK KNIGHT'")
print("="*70)

recs = recommend_movies('The Dark Knight', n=10)
print(f"\n{'Rank':<6} {'Movie Title':<50} {'Match %':<10}")
print("="*70)
for rank, (title, score) in enumerate(recs, 1):
    print(f"{rank:<6} {title:<50} {score*100:.2f}%")


üé¨ TEST 2: RECOMMENDATIONS FOR 'THE DARK KNIGHT'

Rank   Movie Title                                        Match %   
1      The Dark Knight Rises                              42.39%
2      Batman Begins                                      39.80%
3      Batman Returns                                     32.16%
4      Batman Forever                                     29.01%
5      Batman & Robin                                     28.17%
6      Batman                                             26.76%
7      Batman                                             24.91%
8      Amidst the Devil's Wings                           24.81%
9      Batman v Superman: Dawn of Justice                 24.59%
10     Batman: The Dark Knight Returns, Part 2            23.44%


In [27]:
# Test 3: Inception
print("\n" + "="*70)
print("üé¨ TEST 3: RECOMMENDATIONS FOR 'INCEPTION'")
print("="*70)

recs = recommend_movies('Inception', n=10)
print(f"\n{'Rank':<6} {'Movie Title':<50} {'Match %':<10}")
print("="*70)
for rank, (title, score) in enumerate(recs, 1):
    print(f"{rank:<6} {title:<50} {score*100:.2f}%")


üé¨ TEST 3: RECOMMENDATIONS FOR 'INCEPTION'

Rank   Movie Title                                        Match %   
1      Duplex                                             22.10%
2      The Helix... Loaded                                20.41%
3      Star Trek II: The Wrath of Khan                    20.07%
4      Transformers: Revenge of the Fallen                19.76%
5      Timecop                                            19.76%
6      Chicago Overcoat                                   19.61%
7      Looper                                             18.75%
8      Premium Rush                                       18.26%
9      Cypher                                             18.19%
10     Flatliners                                         17.93%


In [28]:
# Test 4: The Godfather
print("\n" + "="*70)
print("üé¨ TEST 4: RECOMMENDATIONS FOR 'THE GODFATHER'")
print("="*70)

recs = recommend_movies('The Godfather', n=10)
print(f"\n{'Rank':<6} {'Movie Title':<50} {'Match %':<10}")
print("="*70)
for rank, (title, score) in enumerate(recs, 1):
    print(f"{rank:<6} {title:<50} {score*100:.2f}%")


üé¨ TEST 4: RECOMMENDATIONS FOR 'THE GODFATHER'

Rank   Movie Title                                        Match %   
1      Desert Dancer                                      51.01%
2      Take the Lead                                      39.47%
3      Step Up 2: The Streets                             34.08%
4      Center Stage                                       33.66%
5      Step Up                                            33.26%
6      Footloose                                          33.00%
7      ABCD (Any Body Can Dance)                          32.46%
8      Step Up Revolution                                 31.35%
9      Tango                                              31.11%
10     Dancin' It's On                                    26.65%


### üíæ Step 13: Save Processed Data

Export the model for use in production (Streamlit app).

**Files to save:**
1. **movie_dict.pkl** - Movie data as dictionary (for easy lookup)
2. **similarity.pkl** - Similarity matrix (for recommendations)

In [29]:
print("\n" + "="*70)
print("üíæ SAVING PROCESSED DATA")
print("="*70)

# Prepare movies dictionary
# Keep only necessary columns for the app
save_columns = ['movie_id', 'title', 'tags']

# Add optional columns if they exist
optional_save_cols = ['genres', 'vote_average', 'release_date', 'runtime']
for col in optional_save_cols:
    if col in movies.columns:
        save_columns.append(col)

movies_to_save = movies[save_columns]

print(f"\nüìã Columns to save: {save_columns}")


üíæ SAVING PROCESSED DATA

üìã Columns to save: ['movie_id', 'title', 'tags', 'genres', 'vote_average', 'release_date', 'runtime']


In [30]:
# Save movie_dict.pkl
print("Saving movie_dict.pkl...")
movie_dict = movies_to_save.to_dict()

with open('movie_dict.pkl', 'wb') as f:
    pickle.dump(movie_dict, f)

import os
dict_size = os.path.getsize('movie_dict.pkl') / (1024**2)
print(f"   ‚úÖ movie_dict.pkl saved successfully ({dict_size:.2f} MB)")
print(f"   üìä Contains: {len(movies_to_save):,} movies")
print(f"   üìã Columns: {list(movie_dict.keys())}")

Saving movie_dict.pkl...
   ‚úÖ movie_dict.pkl saved successfully (2.58 MB)
   üìä Contains: 4,804 movies
   üìã Columns: ['movie_id', 'title', 'tags', 'genres', 'vote_average', 'release_date', 'runtime']


In [31]:
# Save similarity.pkl
print("Saving similarity.pkl...")

with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

sim_size = os.path.getsize('similarity.pkl') / (1024**2)
print(f"   ‚úÖ similarity.pkl saved successfully ({sim_size:.2f} MB)")
print(f"   üìä Matrix shape: {similarity.shape}")
print(f"   üî¢ Data type: {similarity.dtype}")

Saving similarity.pkl...
   ‚úÖ similarity.pkl saved successfully (176.07 MB)
   üìä Matrix shape: (4804, 4804)
   üî¢ Data type: float64


In [32]:
# Verify saved files
print("\n" + "="*70)
print("‚úîÔ∏è VERIFYING SAVED FILES")
print("="*70)

print("\nüìÇ Checking file existence:")
files = ['movie_dict.pkl', 'similarity.pkl']
for filename in files:
    exists = os.path.exists(filename)
    size = os.path.getsize(filename) / (1024**2) if exists else 0
    status = "‚úÖ" if exists else "‚ùå"
    print(f"   {status} {filename}: {size:.2f} MB")

print("\nüîÑ Testing pickle loading:")
try:
    test_movies = pickle.load(open('movie_dict.pkl', 'rb'))
    test_similarity = pickle.load(open('similarity.pkl', 'rb'))
    print("   ‚úÖ movie_dict.pkl loaded successfully")
    print("   ‚úÖ similarity.pkl loaded successfully")
    print("\n‚úÖ All files are valid and ready for production!")
except Exception as e:
    print(f"   ‚ùå Error loading files: {e}")


‚úîÔ∏è VERIFYING SAVED FILES

üìÇ Checking file existence:
   ‚úÖ movie_dict.pkl: 2.58 MB
   ‚úÖ similarity.pkl: 176.07 MB

üîÑ Testing pickle loading:
   ‚úÖ movie_dict.pkl loaded successfully
   ‚úÖ similarity.pkl loaded successfully

‚úÖ All files are valid and ready for production!


### üìä Step 14: Final Summary & Statistics

Display comprehensive information about the trained model.

In [33]:
print("\n" + "="*70)
print("üéâ MODEL DEVELOPMENT COMPLETE!")
print("="*70)

print(f"""
üìä FINAL SUMMARY:
================

üé¨ Dataset:
   ‚Ä¢ Total movies processed: {len(movies):,}
   ‚Ä¢ Features used: genres, keywords, cast, director, overview
   ‚Ä¢ Data quality: {(len(movies)/4803*100):.1f}% of original dataset

üßÆ Model Architecture:
   ‚Ä¢ Algorithm: Content-Based Filtering
   ‚Ä¢ Vectorization: CountVectorizer (Bag of Words)
   ‚Ä¢ Similarity metric: Cosine Similarity
   ‚Ä¢ Feature space: {vectors.shape[1]:,} dimensions
   ‚Ä¢ Matrix size: {similarity.shape[0]:,} √ó {similarity.shape[1]:,}

üìê Vectorizer Parameters:
   ‚Ä¢ max_features: 5,000
   ‚Ä¢ stop_words: english
   ‚Ä¢ max_df: 0.85 (85%)
   ‚Ä¢ min_df: 2

üíæ Output Files:
   ‚úÖ movie_dict.pkl ({dict_size:.2f} MB)
   ‚úÖ similarity.pkl ({sim_size:.2f} MB)
   üìä Total size: {dict_size + sim_size:.2f} MB

üéØ Model Performance:
   ‚Ä¢ Average similarity: {flattened.mean():.4f}
   ‚Ä¢ Similarity std dev: {flattened.std():.4f}
   ‚Ä¢ Similarity range: [{flattened.min():.4f}, {flattened.max():.4f}]

üöÄ Next Steps:
   1. Run: streamlit run app.py
   2. Access: http://localhost:8501
   3. Enjoy personalized movie recommendations!

üí° Notes:
   ‚Ä¢ Keep .pkl files in same directory as app.py
   ‚Ä¢ Files are compatible with existing Streamlit app
   ‚Ä¢ No code changes needed in app.py
""")

print("="*70)
print("‚ú® Thank you for using Movie Recommender System! ‚ú®")
print("="*70)


üéâ MODEL DEVELOPMENT COMPLETE!

üìä FINAL SUMMARY:

üé¨ Dataset:
   ‚Ä¢ Total movies processed: 4,804
   ‚Ä¢ Features used: genres, keywords, cast, director, overview
   ‚Ä¢ Data quality: 100.0% of original dataset

üßÆ Model Architecture:
   ‚Ä¢ Algorithm: Content-Based Filtering
   ‚Ä¢ Vectorization: CountVectorizer (Bag of Words)
   ‚Ä¢ Similarity metric: Cosine Similarity
   ‚Ä¢ Feature space: 5,000 dimensions
   ‚Ä¢ Matrix size: 4,804 √ó 4,804

üìê Vectorizer Parameters:
   ‚Ä¢ max_features: 5,000
   ‚Ä¢ stop_words: english
   ‚Ä¢ max_df: 0.85 (85%)
   ‚Ä¢ min_df: 2

üíæ Output Files:
   ‚úÖ movie_dict.pkl (2.58 MB)
   ‚úÖ similarity.pkl (176.07 MB)
   üìä Total size: 178.65 MB

üéØ Model Performance:
   ‚Ä¢ Average similarity: 0.0384
   ‚Ä¢ Similarity std dev: 0.0404
   ‚Ä¢ Similarity range: [0.0000, 0.9836]

üöÄ Next Steps:
   1. Run: streamlit run app.py
   2. Access: http://localhost:8501
   3. Enjoy personalized movie recommendations!

üí° Notes:
   ‚Ä¢ Keep .pkl 

### Step 15: Model Analysis & Insights

Analyze the recommendation quality and model behavior.

In [34]:
# Analyze genre distribution
print("="*70)
print("GENRE DISTRIBUTION ANALYSIS")
print("="*70)

from collections import Counter

all_genres = []
for genres_list in movies['genres']:
    if isinstance(genres_list, list):
        all_genres.extend(genres_list)

genre_counts = Counter(all_genres)
top_10_genres = genre_counts.most_common(10)

print(f"\nTop 10 Most Common Genres:\n")
print(f"{'Rank':<6} {'Genre':<30} {'Count':<10}")
print("="*50)
for rank, (genre, count) in enumerate(top_10_genres, 1):
    print(f"{rank:<6} {genre:<30} {count:<10}")

GENRE DISTRIBUTION ANALYSIS

Top 10 Most Common Genres:

Rank   Genre                          Count     
1      Drama                          2298      
2      Comedy                         1723      
3      Thriller                       1273      
4      Action                         1156      
5      Romance                        895       
6      Adventure                      792       
7      Crime                          696       
8      ScienceFiction                 538       
9      Horror                         519       
10     Family                         514       


In [35]:
# Analyze recommendation diversity
print("\n" + "="*70)
print("üîç RECOMMENDATION DIVERSITY ANALYSIS")
print("="*70)

# Sample 10 random movies and check diversity of their recommendations
sample_movies = movies['title'].sample(n=10, random_state=42)

print(f"\nAnalyzing recommendations for 10 random movies...\n")
for movie_title in sample_movies:
    recs = recommend_movies(movie_title, n=5)
    avg_similarity = np.mean([score for _, score in recs])
    print(f"‚Ä¢ {movie_title:<40} ‚Üí Avg similarity: {avg_similarity:.4f}")

print("Insight: Higher diversity (lower avg similarity) means more varied recommendations")


üîç RECOMMENDATION DIVERSITY ANALYSIS

Analyzing recommendations for 10 random movies...

‚Ä¢ I Spy                                    ‚Üí Avg similarity: 0.2650
‚Ä¢ Rabbit-Proof Fence                       ‚Üí Avg similarity: 0.2269
‚Ä¢ Little Children                          ‚Üí Avg similarity: 0.2158
‚Ä¢ Hot Fuzz                                 ‚Üí Avg similarity: 0.2063
‚Ä¢ Harry Potter and the Half-Blood Prince   ‚Üí Avg similarity: 0.3481
‚Ä¢ AVP: Alien vs. Predator                  ‚Üí Avg similarity: 0.2495
‚Ä¢ Down to You                              ‚Üí Avg similarity: 0.1909
‚Ä¢ Meet the Parents                         ‚Üí Avg similarity: 0.2890
‚Ä¢ Never Back Down                          ‚Üí Avg similarity: 0.2289
‚Ä¢ The Other End of the Line                ‚Üí Avg similarity: 0.2572
Insight: Higher diversity (lower avg similarity) means more varied recommendations


In [36]:
# Find movies with unique characteristics
print("\n" + "="*70)
print("üåü UNIQUE MOVIES ANALYSIS")
print("="*70)

# Calculate average similarity for each movie
avg_similarities = []
for i in range(len(movies)):
    # Exclude self-similarity (diagonal)
    avg_sim = (similarity[i].sum() - 1) / (len(movies) - 1)
    avg_similarities.append((i, avg_sim))

# Sort by average similarity
avg_similarities.sort(key=lambda x: x[1])

print("\nüéØ Top 10 Most Unique Movies (lowest avg similarity):")
print(f"\n{'Rank':<6} {'Movie Title':<50} {'Avg Sim':<10}")
print("="*70)
for rank, (idx, avg_sim) in enumerate(avg_similarities[:10], 1):
    title = movies.iloc[idx]['title']
    print(f"{rank:<6} {title:<50} {avg_sim:.4f}")

print("\nüî• Top 10 Most Common Movies (highest avg similarity):")
print(f"\n{'Rank':<6} {'Movie Title':<50} {'Avg Sim':<10}")
print("="*70)
for rank, (idx, avg_sim) in enumerate(reversed(avg_similarities[-10:]), 1):
    title = movies.iloc[idx]['title']
    print(f"{rank:<6} {title:<50} {avg_sim:.4f}")


üåü UNIQUE MOVIES ANALYSIS

üéØ Top 10 Most Unique Movies (lowest avg similarity):

Rank   Movie Title                                        Avg Sim   
1      Sardaarji                                          0.0028
2      Taxi to the Dark Side                              0.0050
3      A LEGO Brickumentary                               0.0053
4      Open Range                                         0.0055
5      The Hadza:  Last of the First                      0.0058
6      Murderball                                         0.0059
7      The Gatekeepers                                    0.0065
8      Roger & Me                                         0.0071
9      Standard Operating Procedure                       0.0071
10     Harrison Montgomery                                0.0078

üî• Top 10 Most Common Movies (highest avg similarity):

Rank   Movie Title                                        Avg Sim   
1      Four Single Fathers                                0.0928
2

### üéì Conclusion & Key Takeaways

#### ‚úÖ What We Accomplished:
1. **Loaded and merged** 5,000 movies with their cast and crew information
2. **Extracted features** from genres, keywords, cast, director, and plot
3. **Created tags** by combining all features into a single text representation
4. **Vectorized text** using CountVectorizer (Bag of Words approach)
5. **Calculated similarity** using Cosine Similarity metric
6. **Built recommendation function** to suggest similar movies
7. **Exported model** as pickle files for production use

#### üìä Model Characteristics:
- **Type**: Content-Based Filtering
- **Features**: 5,000 dimensions
- **Algorithm**: Cosine Similarity
- **Coverage**: ~4,800 movies

#### üéØ Strengths:
- ‚úÖ No cold start problem (doesn't need user history)
- ‚úÖ Explainable recommendations (based on movie features)
- ‚úÖ Works for new users immediately
- ‚úÖ Handles niche movies well

#### ‚ö†Ô∏è Limitations:
- ‚ùå Cannot discover completely different genres
- ‚ùå Limited by feature quality
- ‚ùå Doesn't learn from user feedback
- ‚ùå May recommend very similar movies (lacks diversity)

#### üöÄ Possible Improvements:
1. **Add TF-IDF** instead of CountVectorizer for better feature weighting
2. **Include user ratings** to build hybrid recommender
3. **Add more features**: budget, revenue, language, runtime
4. **Implement stemming/lemmatization** for better text processing
5. **Use neural embeddings** (Word2Vec, BERT) for semantic understanding
6. **Add diversity penalty** to avoid too similar recommendations

#### üìö References:
- TMDB Dataset: https://www.themoviedb.org/
- Cosine Similarity: https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
- Content-Based Filtering: https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering

---

**üé¨ The model is now ready!**                                                                                                         