<a href="https://colab.research.google.com/github/SandySingh72/DATA_Analytics/blob/main/Lab_Assessment_Solution__GROUP_11_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project - Suggest Movies Based on Liking**

Task at hand:
1. Consider the dataset wiki_movie_plots_deduped.csv.
2. Filter the dataset, keeping only rows with Origin/Ethnicity = Bollywood and Release Year >= 1990.
3. Create embeddings for the Plot variable using any one feature-extraction / sentence-transformer model from Hugging Face.
4. Cluster the embeddings using any clustering algorithm, with any number of clusters.
5. Suppose someone liked the movie Lakshya (2004). Which top 3 movies can be suggested to that person?

**Step 1: Load Required Libraries**

In [1]:
#!pip install -q sentence-transformers scikit-learn pandas tqdm

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import numpy as np

**Step 2: Load and Filter the Dataset**

Dataset Used: wiki_movie_plots_deduped.csv

The dataset contains movie titles, release years, plot summaries, and other metadata.

We keep only movies where:

- Origin/Ethnicity is Bollywood
- Release Year is 1990 or later

Note: the 'Release Year' column was originally a string. We convert it to numeric with pd.to_numeric(df['Release Year'], errors='coerce'), which turns any value that cannot be parsed into NaN. Rows with a missing or invalid year are then removed.
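As a minimal sketch of the cleaning and filtering logic described above, the toy DataFrame below (its titles and values are made up for illustration and are not taken from the dataset) shows how errors='coerce' turns unparseable years into NaN before the two filter conditions are applied:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Release Year": ["1985", "2004", "unknown", None, "1990"],
    "Origin/Ethnicity": ["Bollywood", "Bollywood", "Bollywood", "American", "American"],
})

# Values that cannot be parsed as numbers become NaN instead of raising an error
toy["Release Year"] = pd.to_numeric(toy["Release Year"], errors="coerce")

# Drop rows whose year is missing or invalid
toy_clean = toy.dropna(subset=["Release Year"])

# Keep only Bollywood movies released in 1990 or later
toy_filtered = toy_clean[
    (toy_clean["Origin/Ethnicity"] == "Bollywood") & (toy_clean["Release Year"] >= 1990)
].copy()

print(toy_filtered)
```

Because NaN forces the column to a floating-point dtype, years display as values such as 2004.0; casting with .astype(int) after dropping the NaN rows would restore whole numbers if that is preferred.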

In [2]:
#Load dataset
df = pd.read_csv('/content/wiki_movie_plots_deduped.csv')

#Convert Release Year to numeric
df['Release Year'] = pd.to_numeric(df['Release Year'], errors='coerce')

#Drop rows with a missing or invalid release year
df = df.dropna(subset=['Release Year'])

#Keep only Bollywood movies released in 1990 or later
df_filtered = df[(df['Origin/Ethnicity'] == 'Bollywood') & (df['Release Year'] >= 1990)].copy()

#Reset index
df_filtered.reset_index(drop=True, inplace=True)

df_filtered[['Title', 'Release Year', 'Origin/Ethnicity', 'Plot']].head()


| | Title | Release Year | Origin/Ethnicity | Plot |
|---|---|---|---|---|
| 0 | Aadmi Aur Apsara | 1990.0 | Bollywood | Raju (Chiranjeevi), a courageous and spirited ... |
| 1 | Aaj Ka Arjun | 1990.0 | Bollywood | Thakur Bhupendra Singh (Amrish Puri) and his s... |
| 2 | Aag Ka Gola | 1990.0 | Bollywood | Young Shankar is framed for a theft he did not... |
| 3 | Aaj Ke Shahenshah | 1990.0 | Bollywood | Saawan (Chunkey Pandey) and Barkha (Sonam) stu... |
| 4 | Aandhiyan | 1990.0 | Bollywood | The story revolves around Shakuntala (Mumtaz) ... |


**Step 3: Generate Embeddings for Plots**

We use a pre-trained Sentence Transformers model from Hugging Face to convert each movie plot into a fixed-length numerical vector (an embedding).

Model used: all-MiniLM-L6-v2

This model maps each text to a 384-dimensional vector. By default it truncates input longer than 256 word pieces, so only the opening part of a long plot is encoded.

Why use embeddings? Raw plot text cannot be compared or clustered directly. Embeddings place plots with similar meaning close together in vector space, so similarity between plots can be measured numerically.

In [3]:
#Load sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')


#Get plot texts, replacing missing plots with empty strings
plots = df_filtered['Plot'].fillna('').tolist()

#Compute one embedding vector per plot (encode shows its own progress bar)
embeddings = model.encode(plots, show_progress_bar=True)



**Step 4: Clustering the Movies**

We use the KMeans algorithm to group similar movie plots into clusters.

Number of clusters chosen: 5 (an arbitrary choice; the task allows any number of clusters)

After clustering, each movie is assigned a cluster label.

This allows us to categorize movies based on the type of story they contain.
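The notebook fixes the number of clusters by hand. One common way to compare candidate values is the silhouette score; the sketch below illustrates the idea on synthetic 2-D points from make_blobs, not on the actual plot embeddings. The variable names X, scores, and best_k are illustrative assumptions, and nothing here suggests which value would score best on the real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embeddings: three well-separated groups of points
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit KMeans for several candidate cluster counts and score each partition
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette score:", best_k)
```

To try this on the notebook's data, X would be replaced by the embeddings array; on high-dimensional text embeddings the scores are often low and close together, so the result is a rough guide rather than a definitive answer.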

In [4]:
#Set number of clusters
num_clusters = 5

#Apply KMeans clustering to the plot embeddings
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
df_filtered['cluster'] = kmeans.fit_predict(embeddings)

df_filtered[['Title', 'Release Year', 'cluster']].head()

| | Title | Release Year | cluster |
|---|---|---|---|
| 0 | Aadmi Aur Apsara | 1990.0 | 3 |
| 1 | Aaj Ka Arjun | 1990.0 | 4 |
| 2 | Aag Ka Gola | 1990.0 | 2 |
| 3 | Aaj Ke Shahenshah | 1990.0 | 0 |
| 4 | Aandhiyan | 1990.0 | 1 |


**Step 5: Recommend Similar Movies**

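If a person likes the movie "Lakshya (2004)", we use cosine similarity to compare the embedding of its plot with the embeddings of all other movies.

The 3 movies with the most similar plots (excluding Lakshya itself) are recommended.

This ranking is computed over all filtered movies; the cluster labels from Step 4 are not used here. It is intended to surface movies with semantically similar storylines.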
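To make the ranking step concrete, here is a minimal NumPy sketch of cosine-similarity retrieval on toy 2-D vectors. The helper top_k_similar and its optional labels argument are hypothetical names introduced for illustration and are not part of the notebook. When labels are passed, candidates are limited to the query's own cluster, which is one possible way to combine the clustering step with the recommendation step.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=3, labels=None):
    """Return indices of the k rows most cosine-similar to row `query_idx`.

    If `labels` is given, only rows in the same cluster as the query are considered.
    """
    emb = np.asarray(embeddings, dtype=float)

    # Normalise each row so that a dot product equals cosine similarity
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]

    # Exclude the query itself and, optionally, every row outside its cluster
    candidates = np.ones(len(emb), dtype=bool)
    candidates[query_idx] = False
    if labels is not None:
        labels = np.asarray(labels)
        candidates &= labels == labels[query_idx]
    sims = np.where(candidates, sims, -np.inf)

    # Sort from most to least similar and drop the excluded rows
    order = np.argsort(sims)[::-1]
    order = order[np.isfinite(sims[order])]
    return order[:k]

# Toy "embeddings" and cluster labels, invented for illustration
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-1.0, 0.0]])
toy_labels = np.array([0, 0, 1, 1, 0])

print(top_k_similar(toy_emb, query_idx=0, k=2))
print(top_k_similar(toy_emb, query_idx=0, k=2, labels=toy_labels))
```

Excluding the query by its index, rather than by skipping the first ranked result, stays correct even if another row has an identical embedding.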

In [5]:
#Find the row for Lakshya (2004)
lakshya_index = df_filtered[df_filtered['Title'].str.lower().str.contains('lakshya') & (df_filtered['Release Year'] == 2004)].index

if len(lakshya_index) == 0:
    print("Lakshya (2004) not found!")
else:
    #Index label equals row position because the index was reset in Step 2
    lakshya_pos = lakshya_index[0]
    lakshya_vec = embeddings[lakshya_pos]

    #Compute cosine similarity between Lakshya and every movie
    similarities = cosine_similarity([lakshya_vec], embeddings)[0]

    #Rank by similarity (highest first), drop Lakshya itself, keep the top 3
    ranked = similarities.argsort()[::-1]
    top_indices = ranked[ranked != lakshya_pos][:3]
    recommendations = df_filtered.iloc[top_indices][['Title', 'Release Year', 'Plot']]
    print("Top 3 Recommendations if user liked 'Lakshya (2004)':")
    display(recommendations)

Top 3 Recommendations if user liked 'Lakshya (2004)':


| | Title | Release Year | Plot |
|---|---|---|---|
| 665 | Kaise Kahoon Ke... Pyaar Hai | 2003.0 | Karan (Amit Hongorani) is the son of Lakshmi (... |
| 1046 | Life Partner | 2009.0 | Karan (Fardeen Khan) and Bhavesh (Tusshar Kapo... |
| 736 | Lakeer - Forbidden Lines | 2004.0 | Karan (Sohail Khan) and Bindiya (Nauheed Cyrus... |


# **Summary**

This project demonstrates how to:

1. Clean and filter text data

2. Generate sentence embeddings using a pre-trained model

3. Apply clustering to understand categories in unstructured text

4. Build a basic content-based recommendation system using plot similarity