# Movie Recommendation System
## Content-Based Filtering Approach

**Author**: [Your Name]  
**Date**: 2024-11-13  
**Project**: Classic Content-Based Movie Recommender

---

### Project Overview

This notebook implements a content-based movie recommendation system using the MovieLens 25M dataset. The system recommends movies based on content features (genres and user-generated tags) rather than user ratings or collaborative filtering.

**Approach**:
- **Feature Engineering**: Combine genres and tags into text features
- **Vectorization**: Use TF-IDF to convert text to numerical vectors
- **Similarity**: Calculate cosine similarity between movies
- **Recommendation**: Find top N most similar movies

**Success Criteria**: Precision@10 ≥ 0.7 (at least 70% of the top 10 recommendations share a genre with the input movie)
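
One way to state this criterion formally, as a sketch: it assumes a recommendation counts as relevant when it shares at least one genre with the input movie, which the criterion implies but does not spell out. The symbols $G(\cdot)$, $q$, and $r_i$ are introduced here for illustration only.

$$
\mathrm{Precision@}K(q) = \frac{1}{K} \sum_{i=1}^{K} \mathbf{1}\big[\, G(q) \cap G(r_i) \neq \emptyset \,\big]
$$

Here $G(m)$ is the set of genres of movie $m$, $q$ is the input movie, and $r_1, \dots, r_K$ are its top $K$ recommendations. For a single input movie with $K = 10$, a score of 0.7 means that 7 of the 10 recommendations share a genre with $q$.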

---

## Table of Contents

1. [Setup and Imports](#1-setup-and-imports)
2. [Data Loading](#2-data-loading)
3. [Exploratory Data Analysis](#3-exploratory-data-analysis)
4. [Data Cleaning and Preprocessing](#4-data-cleaning-and-preprocessing)
5. [Feature Engineering](#5-feature-engineering)
6. [Model Building](#6-model-building)
7. [Recommendation Function](#7-recommendation-function)
8. [Evaluation](#8-evaluation)
9. [Results and Conclusions](#9-results-and-conclusions)
10. [Future Work](#10-future-work)

---

## 1. Setup and Imports

### 1.1 Environment Information

In [None]:
# Document environment
import sys
print(f"Python version: {sys.version}")

import pandas as pd
print(f"Pandas version: {pd.__version__}")

import numpy as np
print(f"NumPy version: {np.__version__}")

import sklearn
print(f"scikit-learn version: {sklearn.__version__}")

### 1.2 Library Imports

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("✓ All libraries imported successfully")

### 1.3 Configuration

In [None]:
# File paths
DATA_PATH = '../data/ml-25m/'
MOVIES_FILE = DATA_PATH + 'movies.csv'
TAGS_FILE = DATA_PATH + 'tags.csv'

# Hyperparameters
MAX_FEATURES = 5000  # TF-IDF max features
N_RECOMMENDATIONS = 10  # Default number of recommendations

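# Visualization settings
# 'seaborn-v0_8-darkgrid' exists in matplotlib >= 3.6; older releases call it 'seaborn-darkgrid'
STYLE = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(STYLE)
sns.set_palette("husl")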

print("✓ Configuration complete")

---

## 2. Data Loading

We'll load two files from the MovieLens 25M dataset:
- **movies.csv**: Movie IDs, titles, and genres
- **tags.csv**: User-generated tags for movies

### 2.1 Load Movies Dataset

In [None]:
# Load movies
movies = pd.read_csv(MOVIES_FILE)
print(f"Loaded {len(movies):,} movies")
movies.head()

**Observations:**
- Dataset has three columns; the row count is printed by the cell above
- Columns: movieId (unique identifier), title (name + year), genres (pipe-separated list)

### 2.2 Load Tags Dataset

In [None]:
# Load tags
tags = pd.read_csv(TAGS_FILE)
print(f"Loaded {len(tags):,} tag entries")
tags.head()

### 2.3 Data Loading Checkpoint

In [None]:
# === CHECKPOINT: Data Loading ===
assert 'movieId' in movies.columns, "movieId column missing from movies!"
assert 'title' in movies.columns, "title column missing from movies!"
assert 'genres' in movies.columns, "genres column missing from movies!"
assert 'movieId' in tags.columns, "movieId column missing from tags!"
assert 'tag' in tags.columns, "tag column missing from tags!"
assert len(movies) > 0, "Movies dataset is empty!"
assert len(tags) > 0, "Tags dataset is empty!"

print("✓ Data loading checkpoint passed")

---

## 3. Exploratory Data Analysis

Understanding our data before processing is crucial. We'll explore:
- Dataset structure and statistics
- Genre distribution
- Tag coverage
- Data quality issues

### 3.1 Movies Dataset Overview

In [None]:
# Basic statistics
print("=== Movies Dataset Info ===")
movies.info()

print("\n=== Missing Values ===")
print(movies.isnull().sum())

print("\n=== Data Types ===")
print(movies.dtypes)

### 3.2 Genre Distribution

In [None]:
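# Analyze genres (split pipe-separated values and count)
# MovieLens marks movies without genres as "(no genres listed)"; exclude that placeholder
genre_counts = (
    movies.loc[movies['genres'] != '(no genres listed)', 'genres']
    .str.split('|')
    .explode()
    .value_counts()
)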
print("=== Top 10 Genres ===")
print(genre_counts.head(10))

# Visualize
plt.figure(figsize=(12, 6))
genre_counts.head(15).plot(kind='bar')
plt.title('Top 15 Movie Genres', fontsize=16, fontweight='bold')
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Insights:**
- Drama is the most common genre
- Action, Comedy, Thriller are also very common
- Film-Noir is rare (important for our future niche project!)

### 3.3 Tags Analysis

In [None]:
# How many movies have tags? Count only IDs present in movies so coverage cannot exceed 100%
movies_with_tags = movies['movieId'].isin(tags['movieId']).sum()
total_movies = movies['movieId'].nunique()
coverage = (movies_with_tags / total_movies) * 100

print(f"Movies with tags: {movies_with_tags:,}")
print(f"Total movies: {total_movies:,}")
print(f"Tag coverage: {coverage:.1f}%")

# Tag distribution per movie
tags_per_movie = tags.groupby('movieId').size()
print("\n=== Tags per Movie Statistics ===")
print(tags_per_movie.describe())

---

## 4. Data Cleaning and Preprocessing

*This section will be completed in Phase 3*

Tasks:
- Handle missing values
- Aggregate tags per movie
- Merge movies and tags datasets
- Create master DataFrame
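
The tasks above could be approached along the following lines. This is a minimal sketch on invented toy rows, not the final Phase 3 implementation; the names `tags_clean`, `tags_agg`, and `master` are hypothetical, and the choice to lowercase and de-duplicate tags per movie is an assumption.

```python
import pandas as pd

# Toy stand-ins for the real movies and tags DataFrames (invented rows)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
    ],
})
tags = pd.DataFrame({
    "userId": [10, 11, 12, 13],
    "movieId": [1, 1, 3, 3],
    "tag": ["pixar", "Pixar", None, "heist"],
    "timestamp": [0, 0, 0, 0],
})

# Handle missing values: drop rows that have no tag text
tags_clean = tags.dropna(subset=["tag"]).copy()
tags_clean["tag"] = tags_clean["tag"].astype(str).str.strip().str.lower()

# Aggregate: one row per movie, unique tags joined by spaces
tags_agg = (
    tags_clean.drop_duplicates(subset=["movieId", "tag"])
    .groupby("movieId")["tag"]
    .agg(" ".join)
    .reset_index()
    .rename(columns={"tag": "tags"})
)

# Merge: a left join keeps movies that have no tags at all
master = movies.merge(tags_agg, on="movieId", how="left")
master["tags"] = master["tags"].fillna("")

print(master[["movieId", "title", "tags"]])
```

The left join is the key design choice here: an inner join would silently drop every movie without tags, which would shrink the catalogue before feature engineering begins.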

---

## 5. Feature Engineering

*This section will be completed in Phase 4*

Tasks:
- Create "soup" column combining genres and tags
- Text preprocessing (lowercase, remove special chars)
- Handle movies without tags
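
A possible shape for the "soup" step is sketched below on invented rows. The helper name `make_soup` and the cleaning rules are assumptions rather than the planned Phase 4 code; in particular, the regular expression keeps only ASCII letters and digits, so accented characters in tags would be dropped.

```python
import re
import pandas as pd

# Toy master table (invented rows); the real one would come from Section 4
master = pd.DataFrame({
    "title": ["Heat (1995)", "Blade Runner (1982)", "Unknown Short (2001)"],
    "genres": ["Action|Crime|Thriller", "Action|Sci-Fi|Thriller", "(no genres listed)"],
    "tags": ["heist Al Pacino", "dystopia, cyberpunk!", ""],
})

def make_soup(genres: str, tags: str) -> str:
    """Combine genres and tags into one lowercase, space-separated string."""
    # Treat the MovieLens placeholder for missing genres as empty
    if genres == "(no genres listed)":
        genres = ""
    # Keep hyphenated genres such as Sci-Fi as a single token, then split on pipes
    genre_text = genres.replace("-", "").replace("|", " ")
    text = f"{genre_text} {tags}".lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse repeated whitespace
    return " ".join(text.split())

master["soup"] = [make_soup(g, t) for g, t in zip(master["genres"], master["tags"])]
print(master[["title", "soup"]])
```

A movie without tags falls back to its genres alone, and a movie with neither ends up with an empty soup; such rows would probably need to be filtered out or flagged before vectorization.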

---

## 6. Model Building

*This section will be completed in Phase 5*

Tasks:
- Apply TF-IDF vectorization
- Calculate cosine similarity matrix
- Verify matrix properties
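
The three tasks above can be illustrated on a handful of invented soups. This is a sketch under the assumption that the soup strings are already cleaned; the variable names are hypothetical, and `max_features=5000` simply mirrors `MAX_FEATURES` from the configuration cell.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy soups (invented); the real input would be the soup column from Section 5
soups = [
    "action crime thriller heist",
    "action scifi thriller dystopia",
    "comedy romance",
    "crime drama heist",
]

# TF-IDF vectorization: sparse matrix of shape (n_movies, n_terms)
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(soups)

# Dense pairwise cosine similarity; practical only for small catalogues
cosine_sim = cosine_similarity(tfidf_matrix)

# Verify matrix properties
n = len(soups)
assert cosine_sim.shape == (n, n)                 # square
assert np.allclose(cosine_sim, cosine_sim.T)      # symmetric
assert np.allclose(np.diag(cosine_sim), 1.0)      # self-similarity is 1 for non-empty soups
assert cosine_sim.min() >= -1e-9                  # TF-IDF weights are non-negative
assert cosine_sim.max() <= 1.0 + 1e-9

print(np.round(cosine_sim, 2))
```

A dense float64 matrix needs n² × 8 bytes, which is roughly 29 GB for 60,000 movies. At that scale it may be more practical to keep only the sparse TF-IDF matrix and compute similarities one query row at a time.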

---

## 7. Recommendation Function

*This section will be completed in Phase 6*

Tasks:
- Implement get_recommendations() function
- Handle edge cases
- Test with sample movies
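
One way `get_recommendations()` might look is sketched below on an invented four-movie catalogue. The signature, the `catalog` DataFrame, and the decision to raise `KeyError` for unknown titles are all assumptions; the function computes similarities for the query row on demand instead of relying on a precomputed full matrix.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue (invented rows)
catalog = pd.DataFrame({
    "title": ["Heat (1995)", "Blade Runner (1982)", "Notting Hill (1999)", "The Town (2010)"],
    "soup": [
        "action crime thriller heist",
        "action scifi thriller dystopia",
        "comedy romance",
        "crime drama thriller heist",
    ],
})
tfidf_matrix = TfidfVectorizer().fit_transform(catalog["soup"])

def get_recommendations(title, catalog, tfidf_matrix, n=10):
    """Return the n titles most similar to `title`, with similarity scores."""
    positions = np.flatnonzero((catalog["title"] == title).to_numpy())
    if len(positions) == 0:
        # Edge case: unknown title
        raise KeyError(f"Title not found: {title!r}")
    idx = int(positions[0])  # first match if a title appears more than once
    # Similarity of the query row against every row in the catalogue
    scores = cosine_similarity(tfidf_matrix[idx:idx + 1], tfidf_matrix).ravel()
    scores[idx] = -1.0  # exclude the query movie itself
    n = max(0, min(n, len(catalog) - 1))  # edge case: n larger than the catalogue
    top = np.argsort(-scores, kind="stable")[:n]
    return pd.DataFrame({
        "title": catalog["title"].to_numpy()[top],
        "score": scores[top],
    })

recs = get_recommendations("Heat (1995)", catalog, tfidf_matrix, n=2)
print(recs)
```

In this toy example "The Town (2010)" ranks first because it shares three terms with the query, while "Blade Runner (1982)" shares two.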

---

## 8. Evaluation

*This section will be completed in Phase 7*

Tasks:
- Implement Precision@K metric
- Evaluate on test set
- Create visualizations
- Qualitative analysis
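
A Precision@K metric consistent with the success criterion could be sketched as follows. It assumes that a recommendation is relevant when it shares at least one genre with the query movie; the function names are hypothetical and the genre strings are invented.

```python
def genre_set(genres: str) -> set:
    """Split a pipe-separated MovieLens genre string into a set of genres."""
    return set(genres.split("|")) - {"(no genres listed)"}

def precision_at_k(query_genres: str, recommended_genres: list, k: int = 10) -> float:
    """Fraction of the top-k recommendations sharing at least one genre with the query."""
    top_k = recommended_genres[:k]
    if not top_k:
        # No recommendations returned: score the query as zero
        return 0.0
    query = genre_set(query_genres)
    hits = sum(1 for g in top_k if query & genre_set(g))
    return hits / len(top_k)

# Invented example: 3 of the 4 recommendations overlap with the query's genres
query = "Action|Crime|Thriller"
recs = ["Crime|Drama", "Action|Sci-Fi", "Comedy|Romance", "Thriller"]
score = precision_at_k(query, recs, k=4)
print(f"Precision@4 = {score:.2f}")  # Precision@4 = 0.75
```

This version divides by the number of recommendations actually returned, so a movie with fewer than `k` recommendations is not penalized for the shortfall; dividing by `k` instead would be the stricter choice. Averaging the score over a sample of query movies would give the figure to compare against the 0.7 target.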

---

## 9. Results and Conclusions

*This section will be completed in Phase 7*

Summary:
- Key findings
- Performance metrics
- Limitations discovered

---

## 10. Future Work

*This section will be completed in Phase 7*

Next steps:
- Add more features (directors, actors)
- Try different similarity metrics
- Implement diversity in recommendations
- Create web API with FastAPI