# Content-Based Movie Recommendation System

## Overview
This project builds a content-based recommendation system that suggests movies matching a user's text query. It uses a small movie dataset and ranks titles by textual similarity. In the implementation below, similarity is computed on the Genres column, since the Description column is empty in the rows previewed in Step 1.

### About Dataset
The dataset used in this project is the **Top 100 Movies Dataset** from [IMDb](https://www.imdb.com/list/ls053251213/). It contains one row per movie with fields such as title, genres, rating, and a description column; the text fields are what the similarity comparison relies on.

### Steps
1. **Load the dataset** - Read the movie data from a CSV file.
2. **Preprocess the text** - Clean the text by converting it to lowercase, removing special characters, tokenizing words, and removing stopwords.
3. **Convert text into numerical vectors** - Use TF-IDF to transform the text into numerical vectors.
4. **Compute similarity and return the top 5 recommendations** - Calculate cosine similarity between the user query and every movie's text, and retrieve the most relevant movie titles.

## Step 1: Load the Dataset
I start by loading the dataset and displaying a preview.

In [2]:
# Import necessary libraries
import pandas as pd
from csv import reader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download( 'stopwords' )
nltk.download( 'punkt' )

# Load dataset (the with-block closes the file once it has been read)
with open( 'Top_movies.csv', newline='' ) as opened_file:
    movies_data = list( reader( opened_file ) )
movies_header = movies_data[0]
movies_data = movies_data[1:]

# Check loading 
print ( movies_header )
print ( '\n' )
print ( movies_data[:5] )

['Position', 'Const', 'Created', 'Modified', 'Description', 'Title', 'Original Title', 'URL', 'Title Type', 'IMDb Rating', 'Runtime (mins)', 'Year', 'Genres', 'Num Votes', 'Release Date', 'Directors', 'Your Rating', 'Date Rated']


[['1', 'tt0111161', '2013-06-24', '2013-06-24', '', 'The Shawshank Redemption', 'The Shawshank Redemption', 'https://www.imdb.com/title/tt0111161/', 'Movie', '9.3', '142', '1994', 'Drama', '3009964', '1994-10-14', 'Frank Darabont', '', ''], ['2', 'tt0110912', '2013-06-24', '2013-06-24', '', 'Pulp Fiction', 'Pulp Fiction', 'https://www.imdb.com/title/tt0110912/', 'Movie', '8.9', '154', '1994', 'Crime, Drama', '2308040', '1994-10-14', 'Quentin Tarantino', '', ''], ['3', 'tt0120689', '2013-06-24', '2013-06-24', '', 'The Green Mile', 'The Green Mile', 'https://www.imdb.com/title/tt0120689/', 'Movie', '8.6', '189', '1999', 'Crime, Drama, Fantasy, Mystery', '1469373', '1999-12-10', 'Frank Darabont', '', ''], ['4', 'tt0109830', '2013-06-24', '2013-06-24', '', 'Forr

[nltk_data] Downloading package stopwords to /Users/polly/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/polly/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
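
Looking rows up by position (`row[5]`, `row[12]`) is easy to get wrong. As a sketch of an alternative, `csv.DictReader` keys each row by column name. The `load_movies` helper and the in-memory sample are illustrative assumptions, not part of the notebook; the two sample rows are copied from the preview above and trimmed to four columns.

```python
import csv
import io

# Two rows copied from the preview above, trimmed to four columns for brevity.
SAMPLE_CSV = """Title,URL,Genres,IMDb Rating
The Shawshank Redemption,https://www.imdb.com/title/tt0111161/,Drama,9.3
Pulp Fiction,https://www.imdb.com/title/tt0110912/,"Crime, Drama",8.9
"""

def load_movies(file_obj):
    """Hypothetical helper: return one dict per movie, keyed by column name."""
    return list(csv.DictReader(file_obj))

movies = load_movies(io.StringIO(SAMPLE_CSV))
for movie in movies:
    print(movie["Title"], "|", movie["Genres"])
```

With the real file, the same helper could be called as `load_movies(f)` inside `with open('Top_movies.csv', newline='') as f:`.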


## Step 2: Preprocess Text Data
Perform text cleaning, including:
- Converting text to lowercase
- Removing special characters
- Tokenizing words
- Removing stopwords
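
The four steps listed above can also be chained in a single helper. This is a minimal sketch: the name `preprocess` and the short `FALLBACK_STOPWORDS` set are assumptions made for illustration, whereas the notebook cells that follow apply each step separately and use NLTK's English stopword list.

```python
import re

# Small hand-picked stopword set for illustration only (assumption);
# the notebook itself uses NLTK's English list when it is available.
FALLBACK_STOPWORDS = {"the", "a", "an", "in", "on", "at", "to", "is", "and", "of"}

def preprocess(text, stop_words=FALLBACK_STOPWORDS):
    """Lowercase, strip special characters, tokenize on whitespace, drop stopwords."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = text.split()
    return [token for token in tokens if token not in stop_words]

print(preprocess("The Green Mile"))           # ['green', 'mile']
print(preprocess("Tucker and Dale vs Evil"))  # ['tucker', 'dale', 'vs', 'evil']
```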

In [4]:
# Function to convert text to lowercase
def to_lowercase( text ):
    return str( text ).lower()

# Apply transformation to title and genre columns
for row in movies_data:
    row[5] = to_lowercase( row[5] )  # Title
    row[12] = to_lowercase( row[12] )  # Genre

# Check transformation
print ( movies_data[:5] )

[['1', 'tt0111161', '2013-06-24', '2013-06-24', '', 'the shawshank redemption', 'The Shawshank Redemption', 'https://www.imdb.com/title/tt0111161/', 'Movie', '9.3', '142', '1994', 'drama', '3009964', '1994-10-14', 'Frank Darabont', '', ''], ['2', 'tt0110912', '2013-06-24', '2013-06-24', '', 'pulp fiction', 'Pulp Fiction', 'https://www.imdb.com/title/tt0110912/', 'Movie', '8.9', '154', '1994', 'crime, drama', '2308040', '1994-10-14', 'Quentin Tarantino', '', ''], ['3', 'tt0120689', '2013-06-24', '2013-06-24', '', 'the green mile', 'The Green Mile', 'https://www.imdb.com/title/tt0120689/', 'Movie', '8.6', '189', '1999', 'crime, drama, fantasy, mystery', '1469373', '1999-12-10', 'Frank Darabont', '', ''], ['4', 'tt0109830', '2013-06-24', '2013-06-24', '', 'forrest gump', 'Forrest Gump', 'https://www.imdb.com/title/tt0109830/', 'Movie', '8.8', '142', '1994', 'drama, romance', '2353685', '1994-07-06', 'Robert Zemeckis', '', ''], ['5', 'tt1403865', '2013-06-24', '2013-06-24', '', 'true grit'

In [5]:
# Function to remove special characters from text
def remove_special_chars(text):
    return re.sub( r'[^a-zA-Z0-9\s]', '', text )

# Apply transformation to title column
for row in movies_data:
    row[5] = remove_special_chars( row[5] )

# Check transformation
print ( movies_data[:5] )

[['1', 'tt0111161', '2013-06-24', '2013-06-24', '', 'the shawshank redemption', 'The Shawshank Redemption', 'https://www.imdb.com/title/tt0111161/', 'Movie', '9.3', '142', '1994', 'drama', '3009964', '1994-10-14', 'Frank Darabont', '', ''], ['2', 'tt0110912', '2013-06-24', '2013-06-24', '', 'pulp fiction', 'Pulp Fiction', 'https://www.imdb.com/title/tt0110912/', 'Movie', '8.9', '154', '1994', 'crime, drama', '2308040', '1994-10-14', 'Quentin Tarantino', '', ''], ['3', 'tt0120689', '2013-06-24', '2013-06-24', '', 'the green mile', 'The Green Mile', 'https://www.imdb.com/title/tt0120689/', 'Movie', '8.6', '189', '1999', 'crime, drama, fantasy, mystery', '1469373', '1999-12-10', 'Frank Darabont', '', ''], ['4', 'tt0109830', '2013-06-24', '2013-06-24', '', 'forrest gump', 'Forrest Gump', 'https://www.imdb.com/title/tt0109830/', 'Movie', '8.8', '142', '1994', 'drama, romance', '2353685', '1994-07-06', 'Robert Zemeckis', '', ''], ['5', 'tt1403865', '2013-06-24', '2013-06-24', '', 'true grit'

In [6]:
# Function to tokenize words (fallback approach using split)
def tokenize_text( text ):
    return text.split()  # Splitting by spaces

# Apply transformation to title column
for row in movies_data:
    row[5] = tokenize_text( row[5] )
    
# Check transformation
print( movies_data[:5] )

[['1', 'tt0111161', '2013-06-24', '2013-06-24', '', ['the', 'shawshank', 'redemption'], 'The Shawshank Redemption', 'https://www.imdb.com/title/tt0111161/', 'Movie', '9.3', '142', '1994', 'drama', '3009964', '1994-10-14', 'Frank Darabont', '', ''], ['2', 'tt0110912', '2013-06-24', '2013-06-24', '', ['pulp', 'fiction'], 'Pulp Fiction', 'https://www.imdb.com/title/tt0110912/', 'Movie', '8.9', '154', '1994', 'crime, drama', '2308040', '1994-10-14', 'Quentin Tarantino', '', ''], ['3', 'tt0120689', '2013-06-24', '2013-06-24', '', ['the', 'green', 'mile'], 'The Green Mile', 'https://www.imdb.com/title/tt0120689/', 'Movie', '8.6', '189', '1999', 'crime, drama, fantasy, mystery', '1469373', '1999-12-10', 'Frank Darabont', '', ''], ['4', 'tt0109830', '2013-06-24', '2013-06-24', '', ['forrest', 'gump'], 'Forrest Gump', 'https://www.imdb.com/title/tt0109830/', 'Movie', '8.8', '142', '1994', 'drama, romance', '2353685', '1994-07-06', 'Robert Zemeckis', '', ''], ['5', 'tt1403865', '2013-06-24', '20

In [7]:
# Function to remove stopwords from tokenized text
def remove_stopwords( tokenized_text ):
    return [ word for word in tokenized_text if word not in stop_words ]

# Fall back to a small manual stopword set if the NLTK corpus is unavailable
# (NLTK raises LookupError when a resource has not been downloaded)
try:
    stop_words = set( stopwords.words('english') )
except LookupError:
    stop_words = { "the", "a", "an", "in", "on", "at", "to", "is", "and", "it", "of", "for", "with", "as", "by", "this", "that" }

# Apply transformation to title column
for row in movies_data:
    row[5] = remove_stopwords( row[5] )

# Check transformation
print( movies_data[:5] )

[['1', 'tt0111161', '2013-06-24', '2013-06-24', '', ['shawshank', 'redemption'], 'The Shawshank Redemption', 'https://www.imdb.com/title/tt0111161/', 'Movie', '9.3', '142', '1994', 'drama', '3009964', '1994-10-14', 'Frank Darabont', '', ''], ['2', 'tt0110912', '2013-06-24', '2013-06-24', '', ['pulp', 'fiction'], 'Pulp Fiction', 'https://www.imdb.com/title/tt0110912/', 'Movie', '8.9', '154', '1994', 'crime, drama', '2308040', '1994-10-14', 'Quentin Tarantino', '', ''], ['3', 'tt0120689', '2013-06-24', '2013-06-24', '', ['green', 'mile'], 'The Green Mile', 'https://www.imdb.com/title/tt0120689/', 'Movie', '8.6', '189', '1999', 'crime, drama, fantasy, mystery', '1469373', '1999-12-10', 'Frank Darabont', '', ''], ['4', 'tt0109830', '2013-06-24', '2013-06-24', '', ['forrest', 'gump'], 'Forrest Gump', 'https://www.imdb.com/title/tt0109830/', 'Movie', '8.8', '142', '1994', 'drama, romance', '2353685', '1994-07-06', 'Robert Zemeckis', '', ''], ['5', 'tt1403865', '2013-06-24', '2013-06-24', '',

## Step 3: Vectorize Text Using TF-IDF
Transform the cleaned text into numerical vectors using TF-IDF. The cell below vectorizes the lowercased genre strings, one string per movie.

In [9]:
# Convert genre text into TF-IDF format
genres_cleaned = [row[12] for row in movies_data]
genre_vectorizer = TfidfVectorizer()
tfidf_genre_matrix = genre_vectorizer.fit_transform( genres_cleaned )
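
To see what the vectorizer produces, here is a small sketch that fits `TfidfVectorizer` on the four lowercased genre strings from the preview in Step 2. The variable names are illustrative. It relies on scikit-learn's defaults: tokens are runs of two or more word characters, so commas are dropped, and each row is L2-normalized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Genre strings of the first four movies, as shown in the Step 2 output.
genres = [
    "drama",
    "crime, drama",
    "crime, drama, fantasy, mystery",
    "drama, romance",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(genres)

# One column per distinct genre word, one row per movie.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)    # ['crime', 'drama', 'fantasy', 'mystery', 'romance']
print(matrix.shape)  # (4, 5)
```

Under the default tokenizer a hyphenated genre would be split into separate tokens, so a custom `token_pattern` may be worth considering if genres should be matched as whole labels.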

## Step 4: Compute Similarity & Recommend Movies
Compare the user's input with each movie's genre vector using cosine similarity and return the most relevant matches.
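
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The sketch below computes it by hand with NumPy on toy vectors; the `cosine` helper is an illustrative assumption, and the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v; defined as 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(round(cosine(a, a), 3))  # 1.0   (same direction)
print(round(cosine(a, b), 3))  # 0.707 (45 degrees apart)
print(cosine(a, np.zeros(2)))  # 0.0   (no shared terms)
```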

In [19]:
# Function to recommend movies based on user input using genre similarity
def recommend_movies_by_genre(user_input, top_n=5):
    # Preprocess user input
    user_input_processed = " ".join(remove_stopwords(user_input.lower().split()))
    
    # Transform user input into TF-IDF vector
    user_vec = genre_vectorizer.transform([user_input_processed])
    
    # Compute cosine similarity between user input and movie genres
    similarities = cosine_similarity(user_vec, tfidf_genre_matrix).flatten()
    
    # Get top N indices with highest similarity scores
    top_indices = similarities.argsort()[-top_n:][::-1]
    
    # Retrieve recommended movies with only name and link
    recommendations = [(movies_data[i][6], movies_data[i][7]) for i in top_indices]  # Index 6 = original title, 7 = URL
    
    return recommendations

# Allow user input
user_query = input("Enter movie preferences: ")

# Get recommendations
recommended_movies = recommend_movies_by_genre(user_query)

# Display top 5 recommended movies (Title and Link)
for movie in recommended_movies:
    print(movie)

Enter movie preferences:  horror


('Dawn of the Dead', 'https://www.imdb.com/title/tt0363547/')
('Tucker and Dale vs Evil', 'https://www.imdb.com/title/tt1465522/')
('Shaun of the Dead', 'https://www.imdb.com/title/tt0365748/')
('Insidious', 'https://www.imdb.com/title/tt1591095/')
('The Mothman Prophecies', 'https://www.imdb.com/title/tt0265349/')
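
As written, the cell above returns `top_n` titles even when the query shares no terms with any genre string, because `argsort` still yields indices for an all-zero score vector. One possible refinement, sketched here under assumptions, is to return the score with each title and drop zero-score matches. The function name `recommend_with_scores` and the four-movie sample (taken from the Step 2 preview) are illustrative, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# First four movies from the preview, with their lowercased genres.
titles = ["The Shawshank Redemption", "Pulp Fiction", "The Green Mile", "Forrest Gump"]
genres = ["drama", "crime, drama", "crime, drama, fantasy, mystery", "drama, romance"]

def recommend_with_scores(query, titles, genres, top_n=5):
    """Hypothetical variant: return (title, score) pairs, skipping zero-score rows."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(genres)
    query_vec = vectorizer.transform([query.lower()])
    scores = cosine_similarity(query_vec, matrix).flatten()
    ranked = scores.argsort()[::-1][:top_n]
    return [(titles[i], round(float(scores[i]), 3)) for i in ranked if scores[i] > 0]

romance_hits = recommend_with_scores("romance", titles, genres)
horror_hits = recommend_with_scores("horror", titles, genres)

print(romance_hits)  # only Forrest Gump has a non-zero score
print(horror_hits)   # [] because no sample movie lists horror
```

Returning an empty list lets the caller show a "no matches" message instead of unrelated titles.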
