# 🎬 Movie Recommendation System

This project suggests movies based on user preferences using Machine Learning techniques. It combines **collaborative filtering** and **content-based filtering** to provide personalized recommendations.

**Tools Used:** Python, Pandas, Scikit-learn, Streamlit  
**Dataset:** [MovieLens 100K](https://grouplens.org/datasets/movielens/100k/)


## Libraries Used

The following Python libraries are used in this project to build the Movie Recommendation System:

- **pandas**: For loading and manipulating movie and user data in tabular form.
- **numpy**: For efficient numerical computations and array handling.
- **scikit-learn**:
  - `CountVectorizer`: Converts movie metadata (like genre, keywords, etc.) into numerical vectors for content-based filtering.
  - `cosine_similarity`: Measures the similarity between users or movies based on vector representations.
- **streamlit**: For building an interactive and user-friendly web app interface to display recommendations.

These libraries together enable both the recommendation logic and the front-end interface.


In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import streamlit as st


## Loading the Movie Dataset

We load the movie metadata using the Pandas library:

```python
movies = pd.read_csv("movies.csv")
movies.head()
```


In [2]:
movies = pd.read_csv("movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings = pd.read_csv("ratings.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Merging Datasets

We merge the ratings and movies datasets to combine user ratings with movie details.

`movie_data = pd.merge(ratings, movies, on='movieId')`  
→ Merges the `ratings` and `movies` DataFrames using the `movieId` column as the key.  
  This associates each rating with its corresponding movie title and genres.

`movie_data.dropna(inplace=True)`  
→ Removes any rows with missing values so that later steps work on complete records.

`movie_data.head()`  
→ Displays the first five rows of the merged dataset.
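
To make the merge behavior concrete, here is a minimal sketch on made-up data. The `toy_*` frames and their values are hypothetical and are not taken from the MovieLens files. Since `pd.merge` performs an inner join by default, a rating whose `movieId` has no match in the movies table is dropped rather than kept with missing values.

```python
import pandas as pd

# Hypothetical toy data; not part of the MovieLens files.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 99],   # movieId 99 has no entry in toy_movies
    "rating": [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy", None],  # one missing genre value
})

# Default how="inner": only ratings with a matching movieId survive.
toy_merged = pd.merge(toy_ratings, toy_movies, on="movieId")

# dropna() then removes the row whose genres value is missing.
toy_clean = toy_merged.dropna()

print(len(toy_ratings), len(toy_merged), len(toy_clean))
```

Comparing row counts before and after each step, as the last line does, is a quick way to see how much data the merge and `dropna` discard.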


In [4]:
# Merge datasets
movie_data = pd.merge(ratings, movies, on='movieId')
movie_data.dropna(inplace=True)
movie_data.head()


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


## Creating the User-Item Matrix

We now create a matrix where each row represents a user and each column represents a movie.

`user_movie_matrix = movie_data.pivot_table(index='userId', columns='title', values='rating')`  
→ Creates a pivot table whose rows are `userId` values, whose columns are movie titles, and whose cells hold the corresponding ratings.  
  This structure is known as a **user-item matrix** and is commonly used in collaborative filtering.

`user_movie_matrix.fillna(0, inplace=True)`  
→ Replaces every missing rating (`NaN`, meaning the user has not rated that movie) with `0`.  
  Cosine similarity cannot be computed on `NaN` values, so this step is required; note that it treats "not rated" the same as a rating of 0.
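
The reshaping can be seen on a small scale in the sketch below. The `toy` frame is hypothetical data in the same long format as `movie_data`, not real ratings.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as movie_data.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie C"],
    "rating": [4.0, 2.0, 5.0, 3.0],
})

# Rows become users, columns become titles; unrated pairs are NaN.
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

# Treat "not rated" as 0 so that cosine similarity can be computed.
toy_matrix = toy_matrix.fillna(0)
print(toy_matrix)
```

By default, `pivot_table` averages the ratings if the same user and title pair occurs more than once. This matters if two different movies share a title.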


In [5]:
# Create user-item matrix
user_movie_matrix = movie_data.pivot_table(index='userId', columns='title', values='rating')

# Fill NaN with 0 for collaborative filtering
user_movie_matrix.fillna(0, inplace=True)

## Computing Similarity Between Movies (Mathematical Explanation)

This step calculates how similar two movies are based on user ratings using **cosine similarity**.

### Cosine Similarity Formula (Plain Markdown)

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:

- A · B = dot product of vectors A and B  
- ||A|| = magnitude (length) of vector A  
- ||B|| = magnitude (length) of vector B  
- For non-negative vectors such as ratings, the resulting value lies between 0 and 1 (in general, cosine similarity ranges from -1 to 1)

A value close to 1 means the two movies are very similar (liked by the same users), and close to 0 means they are dissimilar.

### Intuition

If two movies have similar user ratings — for example, if the same users gave both movies high ratings — then their vectors will be close in direction, resulting in a high cosine similarity.

This similarity is used to identify which movies are related to each other based on **user behavior**, not just content.
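
As an illustration, the formula can be written directly in NumPy and checked against scikit-learn. The helper `cosine_sim` and the three rating vectors are hypothetical and exist only for this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors, straight from the formula."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical rating columns: each entry is one user's rating (0 = not rated).
movie_x = [5, 4, 0, 0]
movie_y = [4, 5, 0, 1]
movie_z = [0, 0, 5, 4]

print(cosine_sim(movie_x, movie_y))  # high: rated highly by the same users
print(cosine_sim(movie_x, movie_z))  # 0.0: no users in common

# scikit-learn expects 2-D input (one row per vector).
sk = cosine_similarity([movie_x], [movie_y])[0, 0]
print(sk)
```

The helper does not guard against all-zero vectors, which would cause a division by zero. It is meant to mirror the formula, not to replace the library function.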


In [8]:
# Compute similarity between movies
movie_similarity = cosine_similarity(user_movie_matrix.T)
movie_similarity_df = pd.DataFrame(movie_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
movie_similarity_df.shape


(9719, 9719)

In [9]:
movie_similarity_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,1.0,0.857493,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.857493,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Genre-Based Similarity – Mathematical Explanation

This section calculates how similar two movies are based on their genres using **cosine similarity** on bag-of-words vectors.

---

### Step 1: Genre Vector Representation (Bag of Words)

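Each movie’s genre string (like `"Action Adventure Sci-Fi"`) is converted into a vector of token counts using `CountVectorizer()`. Because a genre appears at most once per movie, each vector is effectively binary.

For example, given these genres:
- Movie A: "Action Adventure"
- Movie B: "Action Sci-Fi"

Suppose the vocabulary is: `[Action, Adventure, Sci-Fi]`. (This is a simplification: with its default settings, `CountVectorizer` lowercases the text and splits `Sci-Fi` into the two tokens `sci` and `fi`.)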

Then their vectors become:
- Movie A → **[1, 1, 0]**
- Movie B → **[1, 0, 1]**

---

### Step 2: Cosine Similarity Formula

To compute how similar these vectors are, we use the **cosine similarity** formula:

**cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)**

Where:
- A · B = dot product of the two genre vectors
- ||A|| = Euclidean norm (magnitude) of A
- ||B|| = Euclidean norm (magnitude) of B

---

### Example:

Let’s compute similarity between Movie A `[1, 1, 0]` and Movie B `[1, 0, 1]`.

1. **Dot product:**

   A · B = (1×1 + 1×0 + 0×1) = 1

2. **Magnitudes:**

   ||A|| = √(1² + 1² + 0²) = √2  
   ||B|| = √(1² + 0² + 1²) = √2

3. **Cosine Similarity:**

   cosine_similarity = 1 / (√2 × √2) = 1 / 2 = **0.5**

So, the similarity between these two movies based on genres is **0.5** (moderate).

---

### Summary:

- Each movie is represented as a vector of genre features.
- Cosine similarity tells us **how close** two movies are in terms of genres.
- Values range from 0 (not similar at all) to 1 (exact same genres).


In [10]:
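# Process genres: turn "Action|Comedy" into "Action Comedy" so CountVectorizer can tokenize it.
# regex=False treats '|' as a literal character rather than as regex alternation.
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)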
cv = CountVectorizer()
genre_matrix = cv.fit_transform(movies['genres'])
genre_similarity = cosine_similarity(genre_matrix)

# Genre similarity dataframe
genre_similarity_df = pd.DataFrame(genre_similarity, index=movies['title'], columns=movies['title'])
genre_similarity_df.shape

(9742, 9742)

In [11]:
genre_similarity_df.head()

title,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Gintama: The Movie (2010),anohana: The Flower We Saw That Day - The Movie (2013),Silver Spoon (2014),Love Live! The School Idol Movie (2015),Jon Stewart Has Left the Building (2015),Black Butler: Book of the Atlantic (2017),No Game No Life: Zero (2017),Flint (2017),Bungo Stray Dogs: Dead Apple (2018),Andrew Dice Clay: Dice Rules (1991)
Toy Story (1995),1.0,0.774597,0.316228,0.258199,0.447214,0.0,0.316228,0.632456,0.0,0.258199,...,0.4,0.316228,0.316228,0.447214,0.0,0.67082,0.774597,0.0,0.316228,0.447214
Jumanji (1995),0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
Grumpier Old Men (1995),0.316228,0.0,1.0,0.816497,0.707107,0.0,1.0,0.0,0.0,0.0,...,0.316228,0.0,0.5,0.0,0.0,0.353553,0.408248,0.0,0.0,0.707107
Waiting to Exhale (1995),0.258199,0.0,0.816497,1.0,0.57735,0.0,0.816497,0.0,0.0,0.0,...,0.258199,0.408248,0.816497,0.0,0.0,0.288675,0.333333,0.57735,0.0,0.57735
Father of the Bride Part II (1995),0.447214,0.0,0.707107,0.57735,1.0,0.0,0.707107,0.0,0.0,0.0,...,0.447214,0.0,0.707107,0.0,0.0,0.5,0.57735,0.0,0.0,1.0


## Recommendation Function

This function recommends the top 5 movies similar to a given movie using a hybrid approach that combines collaborative and content-based filtering:

- It first checks whether the input movie exists in both the collaborative similarity matrix and the genre-based similarity matrix. If not, it returns `"Movie not found"`.

- It retrieves similarity scores for the input movie from both:
  - **Collaborative filtering matrix**, which is based on user rating patterns.
  - **Genre similarity matrix**, which is based on movie genres using bag-of-words representation.

- It averages the two similarity scores with equal weight to form a **final hybrid score**. This balances how users rated the movies (collaborative) with content similarity (genre).

- It sorts the movies by this final score in descending order and selects the top 5 most similar movies, excluding the input movie itself.

- Finally, it returns the list of recommended movie titles as the output.

This hybrid method aims to improve recommendation quality by drawing on both user behavior and movie content.
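
The scoring and ranking steps described above can be traced on a handful of made-up scores. All titles and numbers in this sketch are hypothetical, and it keeps the top 2 instead of the top 5 only because the toy data is so small.

```python
import pandas as pd

titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

# Hypothetical similarity scores of every title to "Movie A".
collab_scores = pd.Series([1.0, 0.2, 0.8, 0.0], index=titles)
genre_scores = pd.Series([1.0, 0.9, 0.5, 0.3], index=titles)

# Equal-weight average, as in the description above.
final_scores = (collab_scores + genre_scores) / 2

# Drop the query movie by label, then keep the best matches.
top = final_scores.drop("Movie A").sort_values(ascending=False).head(2)
print(list(top.index))
```

Here "Movie C" outranks "Movie B": its strong collaborative score outweighs Movie B's closer genre match once the two are averaged.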


In [12]:
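def recommend(movie_title):
    # The title must be present in both similarity matrices
    if movie_title not in movie_similarity_df or movie_title not in genre_similarity_df:
        return ["Movie not found"]

    # Hybrid score: equal-weight average of collaborative and genre similarity
    collab_scores = movie_similarity_df[movie_title]
    genre_scores = genre_similarity_df[movie_title]
    final_scores = (collab_scores + genre_scores) / 2

    # Exclude the input movie by label, then take the 5 highest-scoring titles
    final_scores = final_scores.drop(movie_title, errors='ignore')
    recommendations = final_scores.sort_values(ascending=False).head(5)
    return list(recommendations.index)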


## Streamlit App Interface (Explanation Only)

This part of the project builds an interactive user interface using **Streamlit**, allowing users to select a movie and get recommendations.

- First, the app displays a title at the top of the page.

- It then loads all the movie titles from the dataset and shows them in a dropdown menu where the user can select a movie they like.

- Once the user selects a movie and clicks the "Recommend" button, the app calls the `recommend` function using the selected movie as input.

- The app then displays the top 5 recommended movies one by one as a list.

This interface allows users to interact with the recommendation engine without writing any code, making it easy and intuitive.

---
### **How to Run the App**

To run the Streamlit app:

1. Open your terminal (Command Prompt, Anaconda Prompt, or VS Code terminal).
2. Navigate to the directory where your Python file is saved.
3. Run the following command, substituting the name of your own script if it differs:
   `streamlit run movie_recommendation_without_sentiment_score.py`

In [None]:
# streamlit_app.py
# To run: `streamlit run streamlit_app.py`

st.title("Movie Recommendation System")
movie_list = movies['title'].values
selected_movie = st.selectbox("Select a movie you like:", movie_list)

if st.button("Recommend"):
    recommendations = recommend(selected_movie)
    st.write("Top 5 recommendations:")
    for i, rec in enumerate(recommendations):
        st.write(f"{i+1}. {rec}")
