Discover your next favourite watch — powered by TF-IDF & Cosine Similarity
- About
- Features
- How It Works
- Tech Stack
- Dataset
- Project Structure
- Installation
- Usage
- Test Cases
- Deployment
CineMatch is a content-based movie and web series recommendation system built with a Netflix-inspired dark UI. Search for any title and instantly get intelligent recommendations based on genre, theme, era, rating tier, director, and cast — not just popularity.
Built as a Data Mining Project using TF-IDF vectorization and Cosine Similarity on 20,000 IMDb titles.
| Feature | Description |
|---|---|
| 🔍 Smart Search | Substring matching with "Did you mean?" suggestions |
| 🎬 Source Card | Full details of your searched title — rating, cast, director, IMDb link |
| 🃏 Recommendation Cards | Top N results with genre tags, similarity %, rating bars |
| 📊 Similarity Chart | Interactive Plotly bar chart showing cosine similarity scores |
| 🎛️ Filters | Filter by content type — Movie, TV Series, Mini Series, TV Movie |
| 🕐 Will Be Added Soon | Graceful screen when a title isn't in the library |
| 🌑 Netflix Dark Theme | Full Netflix-style UI with red accents and dark backgrounds |
| ⚡ Auto Model Build | Model builds automatically on first run — no manual setup needed |
CineMatch uses a hybrid weighted TF-IDF + Cosine Similarity approach:
Search Query
│
▼
┌─────────────────────────────────────┐
│ Substring Matching │
│ Finds all titles containing query │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Feature Soup (Weighted) │
│ │
│ Inferred Subgenre ████████ 4x │
│ Genre Tags ██████████ 5x │
│ Content Type ██████ 3x │
│ Decade Bucket ████ 2x │
│ Rating Tier ████ 2x │
│ Director ██ 1x │
│ Lead Actor ██ 1x │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ TF-IDF Vectorization │
│ 20,000 features, bigrams │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Cosine Similarity │
│ Ranked Top-N Results │
└─────────────────────────────────────┘
│
▼
Results + Similarity Chart
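The pipeline above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the code in movie_analysis.py: the field names, the `build_soup` and `recommend` helpers, and the toy rows are all assumptions. Only the weights, the bigram setting, and the 20,000-feature cap come from the diagram. Weighting is done here by repeating a field's tokens, which is one common way to bias TF-IDF toward a field.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Weights copied from the Feature Soup diagram; the field names are hypothetical.
FIELD_WEIGHTS = {
    "subgenre": 4,
    "genre": 5,
    "type": 3,
    "decade": 2,
    "rating_tier": 2,
    "director": 1,
    "lead_actor": 1,
}


def to_token(value):
    """Collapse a value into one lowercase token, e.g. 'TV Series' -> 'tv_series'."""
    return "_".join(str(value).lower().split())


def build_soup(row):
    """Repeat each field's tokens `weight` times so TF-IDF counts them more heavily."""
    parts = []
    for field, weight in FIELD_WEIGHTS.items():
        raw = row.get(field) or []
        values = raw if isinstance(raw, list) else [raw]
        for value in values:
            parts.extend([to_token(value)] * weight)
    return " ".join(parts)


def recommend(rows, index, top_n=5):
    """Return (title, similarity) pairs for the rows most similar to rows[index]."""
    soups = [build_soup(row) for row in rows]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)
    matrix = vectorizer.fit_transform(soups)
    # Slice (not index) so the query stays a 2-D, single-row matrix.
    scores = cosine_similarity(matrix[index:index + 1], matrix).ravel()
    ranked = [i for i in scores.argsort()[::-1] if i != index]
    return [(rows[i]["title"], float(scores[i])) for i in ranked[:top_n]]


# Toy rows with placeholder titles and made-up values, not taken from the dataset.
rows = [
    {"title": "Fantasy Show A", "subgenre": "medieval_fantasy",
     "genre": ["Action", "Adventure", "Drama"], "type": "TV Series", "decade": "2010s"},
    {"title": "Fantasy Show B", "subgenre": "medieval_fantasy",
     "genre": ["Action", "Adventure", "Drama"], "type": "TV Series", "decade": "2020s"},
    {"title": "Space Show C", "subgenre": "scifi_space",
     "genre": ["Action", "Adventure", "Drama"], "type": "TV Series", "decade": "1960s"},
]

for title, score in recommend(rows, 0, top_n=2):
    print(f"{title}: {score:.2f}")
```

All three toy rows share the same genre tags, so the shared subgenre token is what ranks Fantasy Show B above Space Show C.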
The IMDb dataset uses broad genre tags: the combination Action, Adventure, Drama is shared by Game of Thrones, Star Trek, and Mahabharat, so genre alone cannot tell them apart. To fix this, CineMatch infers thematic subgenres from title text:
| Subgenre Token | Trigger Keywords |
|---|---|
| medieval_fantasy | dragon, throne, knight, viking, witch, magic... |
| scifi_space | space, galaxy, alien, robot, future... |
| crime_thriller | heist, murder, detective, cop, mafia... |
| superhero | avenger, batman, spider, marvel... |
| horror | zombie, demon, haunted, vampire... |
As a result, a search for Game of Thrones surfaces House of the Dragon and Vikings rather than unrelated action shows.
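A rough sketch of how keyword-based subgenre inference could work is shown below. The keyword lists contain only the examples listed in this README (the full lists are longer), and the `infer_subgenres` helper and its prefix-matching rule are assumptions for illustration, not the project's actual implementation.

```python
import re

# Only the example keywords shown in the table; the real lists are longer.
SUBGENRE_KEYWORDS = {
    "medieval_fantasy": ["dragon", "throne", "knight", "viking", "witch", "magic"],
    "scifi_space": ["space", "galaxy", "alien", "robot", "future"],
    "crime_thriller": ["heist", "murder", "detective", "cop", "mafia"],
    "superhero": ["avenger", "batman", "spider", "marvel"],
    "horror": ["zombie", "demon", "haunted", "vampire"],
}


def infer_subgenres(title):
    """Return every subgenre token whose keywords prefix-match a word in the title."""
    words = re.findall(r"[a-z]+", title.lower())
    found = []
    for subgenre, keywords in SUBGENRE_KEYWORDS.items():
        # Prefix matching (an assumption) lets "thrones" hit the keyword "throne".
        if any(word.startswith(key) for word in words for key in keywords):
            found.append(subgenre)
    return found


print(infer_subgenres("Game of Thrones"))      # ['medieval_fantasy']
print(infer_subgenres("House of the Dragon"))  # ['medieval_fantasy']
print(infer_subgenres("Star Trek"))            # []
```

Prefix matching keeps plurals working but can over-match short keywords (for example, "cop" would also hit "copper"), so a stricter rule may be preferable in practice. Titles with no keyword hit, such as Star Trek here, simply get no subgenre token.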
| Component | Technology |
|---|---|
| Frontend | Streamlit + Custom CSS (Netflix Theme) |
| ML Model | scikit-learn TF-IDF + Cosine Similarity |
| Visualisation | Plotly interactive bar charts |
| Data Processing | Pandas, NumPy |
| Model Storage | Python Pickle |
| Language | Python 3.9+ |
- Source: IMDb Top 20,000 Titles
- Size: 20,000 titles
- Fields:
title,year,type,genre,rating,votes,director,cast,runtime,imdb_url
| Content Type | Count |
|---|---|
| 🎬 Movies | 15,978 |
| 📺 TV Series | 3,110 |
| 📽️ Mini Series | 637 |
| 🎥 TV Movies | 275 |
| Total | 20,000 |
Rating range: 1.0 ⭐ to 9.6 ⭐
CINEMATCH/
│
├── app.py # Streamlit UI — Netflix dark theme
├── movie_analysis.py # Recommendation engine — TF-IDF model
├── imdb_dataset.csv # IMDb dataset (20,000 titles)
├── requirements.txt # Python dependencies
└── README.md # You are here
recommendation_model.pkl is auto-generated on first run and is not committed to the repo.
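A load-or-build pattern like the following is one way the "generated on first run" behaviour could be implemented with pickle. The `load_or_build` helper and its `build_fn` parameter are hypothetical names; only the file name comes from this README.

```python
import pickle
from pathlib import Path

MODEL_PATH = Path("recommendation_model.pkl")


def load_or_build(build_fn, path=MODEL_PATH):
    """Load the pickled model if it exists; otherwise build it once and cache it."""
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    model = build_fn()  # build_fn is a hypothetical zero-argument builder
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return model


# Usage (hypothetical builder): model = load_or_build(build_model)
```

Because unpickling can execute arbitrary code, only load a .pkl file that was produced locally by this project.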
1. Clone the repository:
git clone https://github.com/Lakshya438/CINEMATCH.git
cd CINEMATCH
2. Install dependencies:
pip install -r requirements.txt
3. Build the model (first time only):
python movie_analysis.py
4. Run the app:
streamlit run app.py
Open your browser at http://localhost:8501 🎉
- Type any movie or series name in the search bar
- Click a suggestion from the "Did you mean?" list if multiple matches appear
- View the source card with full details of your searched title
- Browse the recommendation cards below
- Analyse the cosine similarity chart to understand match strength
- Filter by content type using the sidebar
- Adjust the number of recommendations (5–20) using the slider
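The search and filter steps above might look roughly like this in plain Python. Both helpers (`find_matches`, `pick_top`), the ordering rule for suggestions, and the dictionary keys are assumptions made for illustration. The 5 to 20 range and the content-type filter come from the list above.

```python
def find_matches(query, titles, limit=5):
    """Case-insensitive substring search; exact matches first, then shorter titles."""
    needle = query.strip().lower()
    if not needle:
        return []
    hits = [t for t in titles if needle in t.lower()]
    hits.sort(key=lambda t: (t.lower() != needle, len(t), t.lower()))
    return hits[:limit]


def pick_top(scored, content_type=None, n=10):
    """Keep rows of the chosen content type and return the n best by score."""
    n = max(5, min(20, n))  # clamp to the slider range of 5 to 20
    rows = [r for r in scored if content_type in (None, r["type"])]
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:n]


library = ["Devasuram", "Inception", "Asuran", "Asur"]
print(find_matches("asur", library))            # ['Asur', 'Asuran', 'Devasuram']
print(find_matches("xyznonexistent", library))  # []
```

An empty result from `find_matches` is the case where the app would show the "Will Be Added Soon" screen.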
| Search | What You Get |
|---|---|
| Game of Thrones | House of the Dragon, Vikings, The Last Kingdom |
| Breaking Bad | Better Call Saul, Ozark, Narcos |
| Inception | Interstellar, The Matrix, Tenet |
| Stranger Things | Dark, The OA, Haunting of Hill House |
| Parasite | Memories of Murder, Oldboy, The Host |
| ID | Query | Expected Output |
|---|---|---|
| TC-01 | Game of Thrones | Medieval fantasy series: House of the Dragon, Vikings |
| TC-02 | Inception | Sci-Fi/Action movies: Mad Max, Pacific Rim |
| TC-03 | Breaking Bad | Crime drama series, 9.5/10 rating shown |
| TC-04 | Asur | "Did you mean?" → Asur, Asuran, Devasuram |
| TC-05 | Avengers | MCU superhero cluster |
| TC-06 | xyznonexistent | 🕐 "Will Be Added Soon" screen |
| TC-07 | The Dark Knight + Movie filter | Only movies, 96% match for Dark Knight Rises |
| TC-08 | Stranger Things + TV Series filter | Only TV series, horror/supernatural cluster |
Deployed on Streamlit Community Cloud — free hosting.
To deploy your own:
- Fork this repo
- Go to Website
- Connect your GitHub and select this repo
- Set main file as app.py
- Click Deploy!
Lakshya
- GitHub: @Lakshya438
This project was built for educational purposes as part of a Data Mining course.
Made with ❤️ and 🎬 | Data Mining Project 2026
⭐ Star this repo if you found it useful!