## Movie Recommendation System Using TF-IDF and Cosine Similarity in Python

```python
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample movie dataset
data = {
    'movie_id': [1, 2, 3, 4, 5],
    'title': ['The Matrix', 'John Wick', 'Inception', 'Interstellar', 'The Dark Knight'],
    'description': [
        'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
        'An ex-hit-man comes out of retirement to track down the gangsters that killed his dog.',
        'A thief who steals corporate secrets through dream-sharing technology is given a chance to erase his criminal record.',
        'A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival.',
        'Batman faces the Joker, a criminal mastermind who wants to plunge Gotham City into anarchy.'
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Convert the text data into numerical vectors using TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['description'])

# Compute cosine similarity between movies
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Function to get recommendations
def recommend_movie(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = df[df['title'] == title].index[0]

    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 3 most similar movies (excluding the movie itself)
    sim_scores = sim_scores[1:4]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 3 most similar movies
    return df['title'].iloc[movie_indices]

# Example usage
print("Recommended Movies:")
print(recommend_movie('Inception'))
```

```
   movie_id            title  \
0         1       The Matrix
1         2        John Wick
2         3        Inception
3         4     Interstellar
4         5  The Dark Knight

                                         description
0  A computer hacker learns about the true nature...
1  An ex-hit-man comes out of retirement to track...
2  A thief who steals corporate secrets through d...
3  A team of explorers travel through a wormhole ...
4  Batman faces the Joker, a criminal mastermind ...
Recommended Movies:
4    The Dark Knight
0         The Matrix
1          John Wick
Name: title, dtype: object
```


Let's break down the code **line by line** for a clear understanding:

---

### **1. Importing Libraries**
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
```
- **`pandas`**: Used for creating and manipulating data in a tabular format (DataFrames).  
- **`TfidfVectorizer`**: Converts text data into numerical vectors using the **TF-IDF (Term Frequency-Inverse Document Frequency)** method, which measures how important a word is in a document relative to the entire dataset.  
- **`cosine_similarity`**: Calculates the **cosine similarity** between vectors to measure the similarity between two movies based on their descriptions.
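
To make "how important a word is" concrete, the sketch below compares a hand-computed inverse document frequency with the values scikit-learn learns. It assumes the vectorizer's default settings (`smooth_idf=True`), under which the weight is `ln((1 + n) / (1 + df)) + 1`; the three toy documents and the `doc_freq` counts are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only): 'criminal' appears in 2 documents, 'space' in 1.
docs = ['criminal mastermind', 'criminal record', 'space wormhole']
vec = TfidfVectorizer().fit(docs)

n_docs = len(docs)
doc_freq = {'criminal': 2, 'space': 1}  # document frequencies counted by hand

# Smoothed IDF, assuming the default smooth_idf=True.
manual_idf = {t: np.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
learned_idf = {t: vec.idf_[vec.vocabulary_[t]] for t in doc_freq}

for term in doc_freq:
    print(term, round(manual_idf[term], 4), round(learned_idf[term], 4))
```

The rarer term (`space`) gets the larger weight, which is the behavior the description above refers to.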

---

### **2. Creating a Sample Dataset**
```python
data = {
    'movie_id': [1, 2, 3, 4, 5],
    'title': ['The Matrix', 'John Wick', 'Inception', 'Interstellar', 'The Dark Knight'],
    'description': [
        'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
        'An ex-hit-man comes out of retirement to track down the gangsters that killed his dog.',
        'A thief who steals corporate secrets through dream-sharing technology is given a chance to erase his criminal record.',
        'A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival.',
        'Batman faces the Joker, a criminal mastermind who wants to plunge Gotham City into anarchy.'
    ]
}
```
- A Python dictionary is created with three keys:
  - **`'movie_id'`**: Unique identifiers for each movie.  
  - **`'title'`**: The names of the movies.  
  - **`'description'`**: Short summaries or descriptions of each movie.  

---

### **3. Creating a DataFrame**
```python
df = pd.DataFrame(data)
```
- Converts the `data` dictionary into a **pandas DataFrame** for easy manipulation and analysis.

---

### **4. Displaying the DataFrame**
```python
print(df)
```
- Prints the entire DataFrame to visualize the data.

---

### **5. Vectorizing Text Using TF-IDF**
```python
tfidf = TfidfVectorizer(stop_words='english')
```
- Initializes the **`TfidfVectorizer`**.  
- **`stop_words='english'`** removes common English words like *"the", "is", "in"*, which don’t add significant meaning to the text.
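
One way to see the stop-word filter in action is to call the vectorizer's analyzer directly on a sentence. This is a small sketch; the sample sentence is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the callable the vectorizer uses to
# lowercase, tokenize, and filter a single string.
analyzer = TfidfVectorizer(stop_words='english').build_analyzer()

tokens = analyzer("The thief is in the dream")
print(tokens)  # only the content words 'thief' and 'dream' remain
```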

---

```python
tfidf_matrix = tfidf.fit_transform(df['description'])
```
- **`fit_transform()`**: 
  - **`fit`**: Learns the vocabulary from the movie descriptions.  
  - **`transform`**: Converts the text data into numerical vectors.  
- **`tfidf_matrix`**: A sparse matrix where rows represent movies and columns represent unique terms from the descriptions.

---

### **6. Calculating Cosine Similarity**
```python
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
```
- Computes the **cosine similarity** between all pairs of movie descriptions.  
- This results in a **similarity matrix**, where:
  - Each row and column represent a movie.  
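  - The values range from **0 to 1** because TF-IDF weights are never negative:
    - **1** means the two vectors point in the same direction (for example, identical descriptions).  
    - **0** means the two descriptions share no terms.  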
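
As a quick check of what the function computes, this sketch evaluates the cosine formula (dot product divided by the product of the vector lengths) by hand and compares it with `cosine_similarity`. The vectors `a` and `b` are arbitrary illustrative values, not TF-IDF output.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two arbitrary non-negative vectors (illustrative only).
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 1.0, 0.0]])

# cos(a, b) = (a . b) / (|a| * |b|) = 4 / (sqrt(5) * sqrt(5))
manual = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
library = cosine_similarity(a, b)

print(manual[0, 0], library[0, 0])  # both approximately 0.8
```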

---

### **7. Defining the Recommendation Function**
```python
def recommend_movie(title, cosine_sim=cosine_sim):
```
- Defines a function called **`recommend_movie`**.  
- It takes:
  - **`title`**: The title of the movie for which we want recommendations.  
  - **`cosine_sim`**: The similarity matrix (default is the precomputed one).

---

```python
idx = df[df['title'] == title].index[0]
```
- Finds the **index** of the movie that matches the given `title` in the DataFrame.  
- **`.index[0]`** takes the first match if a title appears more than once. If the title is not in the DataFrame at all, the selection is empty and this line raises an `IndexError`.
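
A lookup that fails with a clearer message could look like the sketch below. The helper name `find_title_index` and the small `df` are hypothetical and not part of the original code.

```python
import pandas as pd

# Small stand-in DataFrame (illustrative only).
df = pd.DataFrame({'title': ['The Matrix', 'John Wick', 'Inception']})

def find_title_index(df, title):
    """Hypothetical helper: return the index label of the first matching title."""
    matches = df.index[df['title'] == title]
    if len(matches) == 0:
        # Raise a descriptive error instead of a bare IndexError.
        raise KeyError(f"Title not found: {title!r}")
    return matches[0]

print(find_title_index(df, 'Inception'))
```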

---

```python
sim_scores = list(enumerate(cosine_sim[idx]))
```
- **`cosine_sim[idx]`**: Retrieves the similarity scores of the selected movie with all others.  
- **`enumerate()`**: Adds indices to the similarity scores, resulting in pairs like *(index, similarity_score)*.  
- **`list()`**: Converts the enumerated object into a list.

---

```python
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
```
- Sorts the list of similarity scores in **descending order**.  
- **`key=lambda x: x[1]`**: Sorts based on the similarity score (second element of the tuple).  
- The query movie itself (score 1) normally comes first, followed by the other movies from most to least similar.

---

```python
sim_scores = sim_scores[1:4]
```
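- Skips the first item, which is normally the movie itself (with a similarity score of 1).  
- Selects the next **3 most similar movies**. Movies that share no terms with the query all score 0; because `sorted()` is stable, they keep their original row order rather than reflecting any real similarity.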

---

```python
movie_indices = [i[0] for i in sim_scores]
```
- Extracts the **indices** of the top 3 similar movies.

---

```python
return df['title'].iloc[movie_indices]
```
- Retrieves the **titles** of the recommended movies using the extracted indices.  
- **`.iloc`** selects rows by integer position. Positions and index labels coincide here only because `df` uses the default `0..n-1` index.
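
The difference between positional and label-based selection shows up as soon as the index is not the default one. The sketch below uses a made-up index (`10, 20, 30`) purely for illustration.

```python
import pandas as pd

# A Series whose index labels differ from its positions (illustrative only).
titles = pd.Series(['The Matrix', 'John Wick', 'Inception'], index=[10, 20, 30])

by_position = titles.iloc[[0, 2]].tolist()   # first and third rows
by_label = titles.loc[[10, 30]].tolist()     # rows labelled 10 and 30

print(by_position)
print(by_label)
```

Both selections return the same two titles, but `titles.loc[[0, 2]]` would fail here because no row carries the label `0` or `2`.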

---

### **8. Example Usage**
```python
print("Recommended Movies:")
print(recommend_movie('Inception'))
```
- Prints the message **"Recommended Movies:"**.  
- Calls the **`recommend_movie`** function with **'Inception'** as the input.  
- Prints the titles of the top 3 recommended movies based on similarity.

---

### **How the Recommendation Works**
1. **TF-IDF Vectorization**: Converts movie descriptions into numerical vectors.  
2. **Cosine Similarity**: Measures how similar each movie's description is to others.  
3. **Recommendation Logic**:  
   - The function identifies the given movie's index.  
   - It retrieves and sorts other movies based on similarity scores.  
   - Finally, it returns the top 3 most similar movies.
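
The steps above can also be sketched so that each recommendation carries its score, which makes zero-similarity ties visible. This is an illustrative variant, not the original function: `recommend_with_scores`, the `movies` frame, and its toy descriptions are assumptions, and the sketch assumes titles are unique.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (illustrative only): A and B share 'criminal'; C shares nothing.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'description': [
        'criminal mastermind city',
        'criminal record thief',
        'space wormhole explorers',
    ],
})
sim = cosine_similarity(TfidfVectorizer().fit_transform(movies['description']))

def recommend_with_scores(title, top_n=2):
    """Hypothetical variant: return the top_n other titles with their scores."""
    # Row position of the first movie with this title.
    pos = int(np.flatnonzero((movies['title'] == title).to_numpy())[0])
    # Label the scores by title and drop the query movie itself.
    scores = pd.Series(sim[pos], index=movies['title']).drop(title)
    return scores.sort_values(ascending=False).head(top_n)

result = recommend_with_scores('A')
print(result)
```

A score of 0 in the output signals that the "recommendation" shares no terms with the query and is only there to fill the list.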

---

This is a **content-based recommendation system** because it recommends movies based on the similarity of their descriptions.

## Movie Recommendation System Using K-Nearest Neighbors (KNN) with TF-IDF and Cosine Similarity

```python
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Sample movie dataset
data = {
    'movie_id': [1, 2, 3, 4, 5],
    'title': ['The Matrix', 'John Wick', 'Inception', 'Interstellar', 'The Dark Knight'],
    'description': [
        'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
        'An ex-hit-man comes out of retirement to track down the gangsters that killed his dog.',
        'A thief who steals corporate secrets through dream-sharing technology is given a chance to erase his criminal record.',
        'A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival.',
        'Batman faces the Joker, a criminal mastermind who wants to plunge Gotham City into anarchy.'
    ]
}

# Step 1: Create DataFrame
df = pd.DataFrame(data)

# Step 2: Convert the text data into numerical vectors using TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['description'])

# Step 3: Fit the nearest-neighbor index using cosine distance as the metric
knn = NearestNeighbors(n_neighbors=3, metric='cosine')
knn.fit(tfidf_matrix)

# Step 4: Recommendation function using the KNN model
def recommend_movie(title):
    # Check if the movie title exists
    if title not in df['title'].values:
        return "Movie not found in the dataset."

    # Get index of the movie
    idx = df[df['title'] == title].index[0]

    # Get the vector for the movie
    movie_vector = tfidf_matrix[idx]

    # Ask for 4 neighbors: the query movie itself plus 3 recommendations
    distances, indices = knn.kneighbors(movie_vector, n_neighbors=4)

    # Exclude the movie itself and return recommendations
    recommended_indices = [i for i in indices[0] if i != idx]

    # Return recommended movie titles
    return df['title'].iloc[recommended_indices].tolist()

# Step 5: Example usage
print("Recommended Movies for 'Inception':")
print(recommend_movie('Inception'))
```

```
Recommended Movies for 'Inception':
['The Dark Knight', 'Interstellar', 'John Wick']
```


Here's a detailed line-by-line explanation of the provided code:

---

### **1. Importing Necessary Libraries**  
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
```
- **`pandas`**: Used for data manipulation and analysis.  
- **`TfidfVectorizer`**: Converts text data into numerical vectors based on the importance of words (Term Frequency-Inverse Document Frequency).  
- **`NearestNeighbors`**: Performs unsupervised nearest-neighbor search, which returns the items closest to a query vector under a chosen distance metric.

---

### **2. Sample Movie Dataset**  
```python
data = {
    'movie_id': [1, 2, 3, 4, 5],
    'title': ['The Matrix', 'John Wick', 'Inception', 'Interstellar', 'The Dark Knight'],
    'description': [
        'A computer hacker learns about the true nature of reality and his role in the war against its controllers.',
        'An ex-hit-man comes out of retirement to track down the gangsters that killed his dog.',
        'A thief who steals corporate secrets through dream-sharing technology is given a chance to erase his criminal record.',
        'A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival.',
        'Batman faces the Joker, a criminal mastermind who wants to plunge Gotham City into anarchy.'
    ]
}
```
- Defines a dictionary `data` with three keys:  
  - **`movie_id`**: Unique ID for each movie.  
  - **`title`**: Name of each movie.  
  - **`description`**: Brief summary of the movie.

---

### **3. Creating the DataFrame**  
```python
df = pd.DataFrame(data)
```
- Converts the dictionary `data` into a **Pandas DataFrame** for easier manipulation and analysis.

---

### **4. Converting Text Data into Numerical Vectors (TF-IDF)**  
```python
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['description'])
```
- **`TfidfVectorizer(stop_words='english')`**: Initializes the vectorizer and removes common English stop words (like "the," "is," etc.).  
- **`tfidf.fit_transform()`**: Fits the model and transforms the `description` text into **TF-IDF vectors**.  
- **`tfidf_matrix`**: A sparse matrix where each row represents a movie and each column corresponds to a term; each cell holds that term's TF-IDF weight in that movie's description.

---

### **5. Training the K-Nearest Neighbors (KNN) Model**  
```python
knn = NearestNeighbors(n_neighbors=3, metric='cosine')
knn.fit(tfidf_matrix)
```
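- **`NearestNeighbors(n_neighbors=3, metric='cosine')`**: Initializes the nearest-neighbor search with a default of 3 neighbors. With `metric='cosine'`, it ranks items by **cosine distance**, which is `1 - cosine similarity`, so smaller values mean more similar vectors.  
- **`knn.fit(tfidf_matrix)`**: Stores and indexes the TF-IDF matrix for later queries. No labels are involved, so nothing is trained in the supervised sense.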

---

### **6. Movie Recommendation Function**  
```python
def recommend_movie(title):
```
- Defines a function `recommend_movie` that takes a movie `title` as input.

---

#### **Step 1: Check if the Movie Exists**  
```python
    if title not in df['title'].values:
        return "Movie not found in the dataset."
```
- Checks if the given movie title exists in the dataset.  
- If not, it returns a "Movie not found" message.

---

#### **Step 2: Get the Index of the Movie**  
```python
    idx = df[df['title'] == title].index[0]
```
- Finds the index of the movie in the DataFrame.

---

#### **Step 3: Extract the Movie Vector**  
```python
    movie_vector = tfidf_matrix[idx]
```
- Retrieves the **TF-IDF vector** of the given movie based on its index.

---

#### **Step 4: Find Nearest Neighbors**  
```python
    distances, indices = knn.kneighbors(movie_vector, n_neighbors=4)
```
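- Requests 4 neighbors, overriding the default of 3 set in the constructor. The query movie is normally its own nearest neighbor (distance 0), which leaves 3 other movies.  
- **`distances`**: Contains cosine distances (`1 - cosine similarity`) in ascending order, so smaller values mean more similar movies.  
- **`indices`**: Contains the row positions of those neighbors in `tfidf_matrix`.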
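
The relationship between the returned distances and cosine similarity can be checked on a few hand-picked vectors. This is a minimal sketch; the array `X` and the names `nn` and `similarities` are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Three 2-D vectors (illustrative only): 0, 45, and 90 degrees from the first.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(X)
distances, indices = nn.kneighbors(X[[0]])  # query with the first row

# Cosine distance is 1 - cosine similarity, so this recovers the similarities.
similarities = 1.0 - distances

print(indices[0])               # neighbors ordered from nearest to farthest
print(distances[0].round(3))    # ascending distances, starting at 0 for the query itself
print(similarities[0].round(3)) # descending similarities, starting at 1
```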

---

#### **Step 5: Exclude the Input Movie Itself**  
```python
    recommended_indices = [i for i in indices[0] if i != idx]
```
- Excludes the input movie from the recommendations.  
- Ensures that the movie itself is not returned as a recommendation.

---

#### **Step 6: Return Recommended Movie Titles**  
```python
    return df['title'].iloc[recommended_indices].tolist()
```
- Retrieves the titles of the recommended movies based on their indices.  
- Converts the result to a **list** for better readability.

---

### **7. Example Usage**  
```python
print("Recommended Movies for 'Inception':")
print(recommend_movie('Inception'))
```
- Calls the `recommend_movie` function with the title `'Inception'`.  
- Prints out the list of recommended movies similar to **Inception**.

---

### **How the Recommendation Works:**  
1. **TF-IDF Vectorization**: Converts the descriptions into vectors, where more significant words have higher weights.  
2. **Cosine Similarity in KNN**: Measures the angle between vectors to determine similarity (0° means the vectors point in the same direction, 90° means the descriptions share no terms).  
3. **Nearest Neighbors**: Identifies movies with the closest vector angles, meaning the most similar content.  
4. **Recommendation**: Returns the top 3 similar movies excluding the original one. 
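
To connect the angle interpretation with actual numbers, the sketch below converts a cosine value into degrees. The helper `angle_degrees` and the sample vectors are hypothetical and only illustrate the geometry.

```python
import numpy as np

def angle_degrees(u, v):
    """Hypothetical helper: angle between two vectors, in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against tiny floating-point overshoot outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

same_direction = angle_degrees([1.0, 0.0], [2.0, 0.0])   # about 0 degrees
partial_overlap = angle_degrees([1.0, 0.0], [1.0, 1.0])  # about 45 degrees
no_overlap = angle_degrees([1.0, 0.0], [0.0, 1.0])       # about 90 degrees

print(same_direction, partial_overlap, no_overlap)
```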

---

✅ **Result Example:**  
```
Recommended Movies for 'Inception':
['The Dark Knight', 'Interstellar', 'John Wick']
```

---

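This approach bases recommendations on **lexical similarity**: two descriptions score above 0 only when they share terms after stop-word removal. TF-IDF does not capture meaning, so synonyms and paraphrases do not count as matches. With descriptions this short, most pairs share no terms, so recommendations beyond the first can be arbitrary ties at similarity 0, which is why this version and the earlier cosine-similarity version return different lists for 'Inception'.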