<a href="https://colab.research.google.com/github/SandySingh72/DATA_Analytics/blob/main/Project_Suggest_Movies_Based_on_Liking_(1)_Github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project - Suggest Movies Based on Liking**

Task at hand:

1. Consider the dataset wiki_movie_plots_deduped.csv

2. Filter the dataset, keeping only rows with Origin/Ethnicity = Bollywood and Release Year >= 1990.

3. Create embeddings for the Plot variable using any feature-extraction / sentence-transformer model from Hugging Face.

4. Cluster the dataset of embeddings using any clustering algorithm of choice with any number of clusters.

5. Suppose someone liked the movie Lakshya (2004). Which top 3 movies can be suggested to that person?

**Step 1: Load Required Libraries**

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

**Step 2: Load and Filter the Dataset**

Dataset Used: wiki_movie_plots_deduped.csv

The dataset contains movie titles, release years, plot summaries, and other metadata.

We only consider movies where Origin/Ethnicity is Bollywood and Release Year is 1990 or later.

Note:

The 'Release Year' column was originally a string, so we convert it to numeric using:

pd.to_numeric(df['Release Year'], errors='coerce')

With errors='coerce', values that cannot be parsed become NaN. Rows with missing or invalid years are then removed.

In [None]:
#Load dataset
df = pd.read_csv('/content/wiki_movie_plots_deduped.csv')

#Convert Release Year to numeric
df['Release Year'] = pd.to_numeric(df['Release Year'], errors='coerce')

#Drop rows with a missing or invalid Release Year
df = df.dropna(subset=['Release Year'])

#Keep Bollywood movies from 1990 onwards
df_filtered = df[(df['Origin/Ethnicity'] == 'Bollywood') & (df['Release Year'] >= 1990)].copy()

#Reset index
df_filtered.reset_index(drop=True, inplace=True)

df_filtered[['Title', 'Release Year', 'Origin/Ethnicity', 'Plot']].head()
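
To see what errors='coerce' does, here is a small sketch on a made-up Series. The values are purely illustrative and are not taken from the dataset: entries that cannot be parsed become NaN and can then be dropped before filtering by year.

```python
import pandas as pd

# Hypothetical values for illustration only (not from the dataset)
years = pd.Series(['1995', '2004', 'unknown', None])

# Unparseable or missing entries become NaN instead of raising an error
numeric_years = pd.to_numeric(years, errors='coerce')

# Drop the NaN rows, then keep years from 1990 onwards
valid_years = numeric_years.dropna()
recent_years = valid_years[valid_years >= 1990].astype(int).tolist()
print(recent_years)
```

Because NaN forces the column to a float dtype, the years are cast back to integers only after the invalid rows have been removed.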

**Step 3: Generate Embeddings for Plots**

We use a SentenceTransformer model from Hugging Face to convert each movie plot into a numerical vector (called an embedding).

Model used: all-MiniLM-L6-v2

These embeddings help capture the meaning of the plot in numerical form for comparison.

Why use embeddings? Text data like plots cannot be directly compared or clustered.

Embeddings allow us to measure similarity between plots based on their meaning.

In [None]:
#Load sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

#Get plot texts (missing plots become empty strings)
plots = df_filtered['Plot'].fillna('').tolist()

#Compute embedding values (one vector per plot)
embeddings = model.encode(plots, show_progress_bar=True)
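
As a rough illustration of how embedding vectors are compared, the sketch below computes cosine similarity by hand with NumPy. The three short vectors and their names are invented stand-ins; real all-MiniLM-L6-v2 embeddings have 384 dimensions, but the calculation is the same.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings" for illustration only
war_drama = [0.9, 0.1, 0.0]
army_story = [0.8, 0.2, 0.1]
romantic_comedy = [0.0, 0.2, 0.9]

sim_close = cosine_sim(war_drama, army_story)
sim_far = cosine_sim(war_drama, romantic_comedy)

# Vectors pointing in similar directions score higher
print(sim_close > sim_far)
```

Cosine similarity depends only on direction, not on vector length, which is one reason it is a common choice for comparing text embeddings.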

**Step 4: Clustering the Movies**

We use the KMeans algorithm to group similar movie plots into clusters.

Number of clusters chosen: 5

After clustering, each movie is assigned a cluster label.

This allows us to categorize movies based on the type of story they contain.

In [None]:
num_clusters = 5   #Set number of clusters

kmeans = KMeans(n_clusters=num_clusters, random_state=42)   #Apply KMeans clustering
df_filtered['cluster'] = kmeans.fit_predict(embeddings)

df_filtered[['Title', 'Release Year', 'cluster']].head()
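
To make the cluster assignment concrete, here is a minimal sketch that runs KMeans on a handful of made-up 2-D points. The points, the choice of two clusters, and the explicit n_init=10 are assumptions for illustration, not settings used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two well-separated groups (illustration only)
points = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],   # group near the origin
    [5.0, 5.1], [5.1, 5.0], [4.95, 5.05],   # group near (5, 5)
])

# n_init is set explicitly so the run does not depend on the library default
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = toy_kmeans.fit_predict(points)

# Count how many points landed in each cluster
cluster_sizes = np.bincount(labels)
print(labels, cluster_sizes)
```

The label numbers themselves are arbitrary; only the grouping matters. On the real embeddings, a similar count (for example df_filtered['cluster'].value_counts()) is a quick way to check whether any cluster is very small or dominates the rest.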

**Step 5: Recommend Similar Movies**

If a person likes the movie "Lakshya (2004)", we use cosine similarity to compare its plot embedding with the embeddings of all other filtered movies, not only those in the same cluster.

The top 3 most similar movie plots (excluding itself) are recommended.

This method finds movies with semantically similar storylines.

In [None]:
#Find index of Lakshya (2004)
lakshya_index = df_filtered[df_filtered['Title'].str.lower().str.contains('lakshya') & (df_filtered['Release Year'] == 2004)].index

if len(lakshya_index) == 0:
    print("Lakshya (2004) not found!")
else:
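    lakshya_pos = lakshya_index[0]   #Usable as a row position because the index was reset
    lakshya_vec = embeddings[lakshya_pos]

    #Compute similarities
    similarities = cosine_similarity([lakshya_vec], embeddings)[0]

    #Get top 3 similar movies, explicitly skipping Lakshya itself
    ranked = similarities.argsort()[::-1]
    top_indices = [i for i in ranked if i != lakshya_pos][:3]
    recommendations = df_filtered.iloc[top_indices][['Title', 'Release Year', 'Plot']]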
    print("Top 3 Recommendations if user liked 'Lakshya (2004)':")
    display(recommendations)
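
A general-purpose version of this lookup can be sketched as a small helper. The function name top_k_similar and the toy vectors are hypothetical, introduced only for illustration. The helper masks out the query row explicitly, so a movie with an identical plot embedding cannot cause the liked movie to appear in its own recommendations.

```python
import numpy as np

def top_k_similar(vectors, query_idx, k=3):
    """Return indices of the k rows most similar to row query_idx (cosine)."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    # Exclude the query itself explicitly rather than assuming it ranks first
    sims[query_idx] = -np.inf
    return np.argsort(sims)[::-1][:k].tolist()

# Made-up 2-D vectors for illustration only
toy = [
    [1.0, 0.0],   # 0: the liked movie (query)
    [1.0, 0.0],   # 1: exact duplicate of the query
    [0.9, 0.1],   # 2: very close to the query
    [0.5, 0.5],   # 3: somewhat related
    [0.0, 1.0],   # 4: unrelated
]
result = top_k_similar(toy, query_idx=0, k=3)
print(result)
```

With the real data, the same idea would be called with the embeddings array and the row position of Lakshya in place of the toy inputs.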

# **Summary**

This project demonstrates how to:

1. Clean and filter text data

2. Generate sentence embeddings using a pre-trained model

3. Apply clustering to understand categories in unstructured text

4. Build a basic content-based recommendation system using plot similarity