I took the dataset from Kaggle. Here is the link:

https://www.kaggle.com/datasets/ramjasmaurya/top-250s-in-imdb

I've renamed the dataset file to "imdb_1000_movies.csv".

**Code**

Install any missing dependencies first (for example, !pip install nltk).
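
As a minimal sketch, one way to see which packages still need installing before running the cells below. The helper name missing_packages and the list of required modules are illustrative assumptions, not part of the original notebook; note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names used later in this notebook (assumed list)
required = ["pandas", "nltk", "sklearn", "tabulate"]
print("Missing:", missing_packages(required))
```

Anything reported as missing can then be installed with !pip install in a notebook cell.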

**1.**

I imported pandas, a Python library for data analysis and manipulation.

I then used pandas to read the dataset, a comma-separated (CSV) file, into a DataFrame.

df.head() displays the first five rows of the DataFrame.

In [2]:
import pandas as pd

# Load the dataset
file_path = "imdb_1000_movies.csv"
df = pd.read_csv(file_path)

# Display the first few rows to understand the structure
df.head()

Unnamed: 0,ranking of movie\r\n,movie name\r\n,Year,certificate,runtime,genre,RATING,metascore,DETAIL ABOUT MOVIE\n,DIRECTOR\r\n,ACTOR 1\n,ACTOR 2\n,ACTOR 3,ACTOR 4,votes,GROSS COLLECTION\r\n
0,1,The Shawshank Redemption,-1994,15,142 min,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M
1,2,The Godfather,-1972,X,175 min,"Crime, Drama",9.2,100.0,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1798731,$134.97M
2,3,The Dark Knight,-2008,12A,152 min,"Action, Crime, Drama",9.0,84.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2574810,$534.86M
3,4,The Lord of the Rings: The Return of the King,-2003,12A,201 min,"Action, Adventure, Drama",9.0,94.0,Gandalf and Aragorn lead the World of Men agai...,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1787701,$377.85M
4,5,Schindler's List,-1993,15,195 min,"Biography, Drama, History",9.0,94.0,"In German-occupied Poland during World War II,...",Steven Spielberg,Liam Neeson,Ralph Fiennes,Ben Kingsley,Caroline Goodall,1323776,$96.90M


**2.**

The raw column names are inconsistent and contain stray newline characters (for example "movie name\r\n"), so I renamed the columns to make them clearer.

In [None]:
# Clean column names
df.columns = [
    "ranking", "movie_name", "year", "certificate", "runtime", "genre",
    "rating", "metascore", "movie_detail", "director", "actor_1",
    "actor_2", "actor_3", "actor_4", "votes", "gross_collection"]

df.head()

Unnamed: 0,ranking,movie_name,year,certificate,runtime,genre,rating,metascore,movie_detail,director,actor_1,actor_2,actor_3,actor_4,votes,gross_collection
0,1,The Shawshank Redemption,-1994,15,142 min,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M
1,2,The Godfather,-1972,X,175 min,"Crime, Drama",9.2,100.0,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1798731,$134.97M
2,3,The Dark Knight,-2008,12A,152 min,"Action, Crime, Drama",9.0,84.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2574810,$534.86M
3,4,The Lord of the Rings: The Return of the King,-2003,12A,201 min,"Action, Adventure, Drama",9.0,94.0,Gandalf and Aragorn lead the World of Men agai...,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1787701,$377.85M
4,5,Schindler's List,-1993,15,195 min,"Biography, Drama, History",9.0,94.0,"In German-occupied Poland during World War II,...",Steven Spielberg,Liam Neeson,Ralph Fiennes,Ben Kingsley,Caroline Goodall,1323776,$96.90M
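
As an alternative to typing the names by hand, the raw headers could be normalized programmatically. This is only a sketch: clean_column_name is a hypothetical helper, and the sample headers are copied from the first df.head() output above.

```python
import re

def clean_column_name(raw_name):
    """Strip surrounding whitespace/newlines, lowercase, and join words with underscores."""
    name = raw_name.strip().lower()
    return re.sub(r"\s+", "_", name)

# A few of the raw headers from the dataset, including the stray \r\n characters
raw_columns = ["ranking of movie\r\n", "movie name\r\n", "Year",
               "DETAIL ABOUT MOVIE\n", "GROSS COLLECTION\r\n"]
cleaned = [clean_column_name(c) for c in raw_columns]
print(cleaned)
# ['ranking_of_movie', 'movie_name', 'year', 'detail_about_movie', 'gross_collection']
```

Applied to the DataFrame this would be df.columns = [clean_column_name(c) for c in df.columns]. The resulting names differ from the hand-picked ones (detail_about_movie instead of movie_detail, for instance), so later cells would have to use the matching names.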


**3.**

Since the task is to build a recommender system from a list of movies with plot summaries, I kept only the columns needed for that.

In [None]:
# Select only the required columns: movie_name and movie_detail
df = df[["movie_name", "movie_detail"]]

df.head()

Unnamed: 0,movie_name,movie_detail
0,The Shawshank Redemption,Two imprisoned men bond over a number of years...
1,The Godfather,The aging patriarch of an organized crime dyna...
2,The Dark Knight,When the menace known as the Joker wreaks havo...
3,The Lord of the Rings: The Return of the King,Gandalf and Aragorn lead the World of Men agai...
4,Schindler's List,"In German-occupied Poland during World War II,..."


**4.**

Here I dropped any rows with null/missing values and reset the index.

In [None]:
# Drop rows with missing values in these columns
df = df.dropna().reset_index(drop=True)

# Display cleaned dataset
df.head()

Unnamed: 0,movie_name,movie_detail
0,The Shawshank Redemption,Two imprisoned men bond over a number of years...
1,The Godfather,The aging patriarch of an organized crime dyna...
2,The Dark Knight,When the menace known as the Joker wreaks havo...
3,The Lord of the Rings: The Return of the King,Gandalf and Aragorn lead the World of Men agai...
4,Schindler's List,"In German-occupied Poland during World War II,..."
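
To see how many values are actually missing before dropping them, a check along these lines could be run first. It is a sketch on a toy DataFrame with made-up rows, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "movie_name": ["Movie A", "Movie B", "Movie C"],
    "movie_detail": ["A plot summary.", None, "Another plot summary."],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop incomplete rows and renumber the index from 0
cleaned = toy.dropna().reset_index(drop=True)
print(len(toy) - len(cleaned), "row(s) dropped")
```

On the real data the same two steps would be df.isna().sum() followed by the dropna() call shown above.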


**5.**

I counted the rows left after removing those with missing values. df.count() reports the number of non-null values in each column.

In [None]:
df.count()

Unnamed: 0,0
movie_name,1000
movie_detail,1000


**6.**

import re: Loads the regular expressions module for text pattern matching and manipulation.

import string: Provides string constants (such as the punctuation characters) for easier text processing.

from sklearn.feature_extraction.text import TfidfVectorizer: Imports the class that converts text documents into a TF-IDF feature matrix.

from nltk.corpus import stopwords: Gives access to predefined lists of common stopwords for text cleaning and filtering.

import nltk: Loads the Natural Language Toolkit, used here to download the stopword list.

In [None]:
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk

# Ensure NLTK stopwords are available
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**7.**

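Before computing TF-IDF, I defined a function to preprocess the text: it lowercases the text, removes punctuation, splits it on whitespace, and removes English stopwords.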

In [None]:
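# Build the stopword set and punctuation pattern once, not on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")

# Define a preprocessing function
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = PUNCT_PATTERN.sub("", text)  # Remove punctuation
    words = text.split()  # Tokenize on whitespace
    words = [word for word in words if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(words)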
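
To make the effect of these steps concrete, here is a small self-contained sketch that runs the same pipeline on one of the test queries. It uses a tiny hard-coded stopword set (DEMO_STOPWORDS, an assumption for illustration) in place of NLTK's full English list, so it runs without downloading any NLTK data.

```python
import re
import string

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration)
DEMO_STOPWORDS = {"i", "me", "a", "the", "of", "and", "with"}

def preprocess_demo(text):
    text = text.lower()  # Lowercase
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Drop punctuation
    words = [w for w in text.split() if w not in DEMO_STOPWORDS]  # Drop stopwords
    return " ".join(words)

print(preprocess_demo("I like mind-bending thrillers with unexpected twists."))
# like mindbending thrillers unexpected twists
```

Because the hyphen is removed rather than replaced with a space, hyphenated words are merged ("mind-bending" becomes "mindbending"), which is also why "germanoccupied" appears in the processed descriptions.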

**8.**

Applying the above preprocessing function to the movie_detail column.

In [None]:
# Apply preprocessing to the movie details
df["processed_detail"] = df["movie_detail"].apply(preprocess_text)

# Display a few processed descriptions
df[["movie_name", "processed_detail"]].head(10)

Unnamed: 0,movie_name,processed_detail
0,The Shawshank Redemption,two imprisoned men bond number years finding s...
1,The Godfather,aging patriarch organized crime dynasty postwa...
2,The Dark Knight,menace known joker wreaks havoc chaos people g...
3,The Lord of the Rings: The Return of the King,gandalf aragorn lead world men saurons army dr...
4,Schindler's List,germanoccupied poland world war ii industriali...
5,The Godfather Part II,early life career vito corleone 1920s new york...
6,12 Angry Men,jury new york city murder trial frustrated sin...
7,Jai Bhim,tribal man arrested case alleged theft wife tu...
8,Pulp Fiction,lives two mob hitmen boxer gangster wife pair ...
9,Inception,thief steals corporate secrets use dreamsharin...


**9.**

Importing the cosine_similarity function and initializing the TF-IDF vectorizer.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

**10.**

fit_transform learns the vocabulary and inverse document frequencies from the processed descriptions, then transforms the text into a sparse matrix of numerical TF-IDF features.

In [None]:
# Fit and transform the processed movie details into TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(df["processed_detail"])
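
For intuition about what the matrix contains, here is a simplified pure-Python sketch of the weighting. It assumes whitespace tokenization and the smoothed IDF with L2 normalization that, as far as I understand, are TfidfVectorizer's documented defaults (smooth_idf=True, norm='l2'). The function tfidf_vectors and the toy documents are hypothetical, and the real vectorizer tokenizes differently and returns a sparse matrix.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, each L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

toy_docs = ["prison escape drama", "mafia crime drama", "space adventure"]
toy_vectors = tfidf_vectors(toy_docs)
print(toy_vectors[0])
```

In the first toy document, "drama" receives a lower weight than "prison" because it also occurs in the second document. That is the property that lets rarer, more distinctive plot words dominate the similarity scores.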

**11.**

The function below preprocesses the user query (input) and transforms it into a TF-IDF vector.

It then computes the cosine similarity between the query and every movie description and selects the top 5 movies by similarity score, in descending order.

The result is formatted as a table with the columns Movie and Cosine Similarity.

In [None]:
# Create a formatted table for recommendations

from tabulate import tabulate

def recommend_movies_table(user_query, top_n=5):

    # Preprocess the user's query
    processed_query = preprocess_text(user_query)

    # Convert the processed query into a TF-IDF vector
    query_vector = tfidf_vectorizer.transform([processed_query])

    # Compute cosine similarity between the query and all movie descriptions
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Get indices of top N similar movies
    top_indices = similarity_scores.argsort()[-top_n:][::-1]

    # Retrieve movie names and their similarity scores
    recommendations = [(df.iloc[i]["movie_name"], similarity_scores[i]) for i in top_indices]

    # Create a table format
    table = tabulate(recommendations, headers=["Movie", "Cosine Similarity"], tablefmt="grid")

    return table
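
The ranking step can be illustrated without any libraries. The sketch below computes cosine similarity by hand on toy vectors and returns the best-scoring indices; cosine and top_matches are hypothetical helper names, and the vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_matches(query_vec, doc_vecs, top_n=5):
    """Return (index, score) pairs in descending score order, skipping zero scores."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_n] if scores[i] > 0]

toy_docs = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
toy_query = [1, 0, 1]
matches = top_matches(toy_query, toy_docs, top_n=2)
print(matches)  # document 0 ranks first, then document 2
```

One design choice here differs from recommend_movies_table: zero scores are filtered out. A query that shares no words with the learned vocabulary scores 0 against every movie, and the argsort-based selection would still return top_n rows in that case, so a filter like this may be worth considering.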

**Sample Output 1**

In [None]:
# Example test query
user_query = "I like mind-bending thrillers with unexpected twists."
recommendations_table = recommend_movies_table(user_query)

# Display the table output
print(recommendations_table)

+--------------------------------------------+---------------------+
| Movie                                      |   Cosine Similarity |
+============================================+=====================+
| Billy Elliot                               |            0.270999 |
+--------------------------------------------+---------------------+
| Drishyam                                   |            0.238301 |
+--------------------------------------------+---------------------+
| Porco Rosso                                |            0.20238  |
+--------------------------------------------+---------------------+
| Zelig                                      |            0.194547 |
+--------------------------------------------+---------------------+
| Spring, Summer, Fall, Winter... And Spring |            0.182593 |
+--------------------------------------------+---------------------+


**Sample Output 2**

In [None]:
# Example test query
user_query = "Suggest me a crime saga featuring a powerful mafia family."
recommendations_table = recommend_movies_table(user_query)

# Display the table output
print(recommendations_table)

+---------------+---------------------+
| Movie         |   Cosine Similarity |
+===============+=====================+
| The Big Heat  |            0.219361 |
+---------------+---------------------+
| Drishyam      |            0.20585  |
+---------------+---------------------+
| Donnie Brasco |            0.18151  |
+---------------+---------------------+
| Veer Zaara    |            0.167887 |
+---------------+---------------------+
| Drishyam      |            0.144755 |
+---------------+---------------------+


**Sample Output 3**

In [None]:
# Example test query
user_query = "I like prison escape drama with strong character development."
recommendations_table = recommend_movies_table(user_query)

# Display the table output
print(recommendations_table)

+--------------------------+---------------------+
| Movie                    |   Cosine Similarity |
+==========================+=====================+
| A Prophet                |            0.14678  |
+--------------------------+---------------------+
| The Purple Rose of Cairo |            0.143726 |
+--------------------------+---------------------+
| Beasts of No Nation      |            0.123257 |
+--------------------------+---------------------+
| Drive                    |            0.113375 |
+--------------------------+---------------------+
| In the Mood for Love     |            0.112645 |
+--------------------------+---------------------+


**Sample Output 4**

In [None]:
# Example test query
user_query = "Suggest me a romantic comedy-drama with a mix of humor and emotion."
recommendations_table = recommend_movies_table(user_query)

# Display the table output
print(recommendations_table)

+--------------------------------+---------------------+
| Movie                          |   Cosine Similarity |
+================================+=====================+
| Life Is Beautiful              |            0.211793 |
+--------------------------------+---------------------+
| Control                        |            0.178296 |
+--------------------------------+---------------------+
| Crouching Tiger, Hidden Dragon |            0.174416 |
+--------------------------------+---------------------+
| 500 Days of Summer             |            0.152213 |
+--------------------------------+---------------------+
| Fiddler on the Roof            |            0.143421 |
+--------------------------------+---------------------+


**Salary Expectation per Month**

Thank you for considering me for this role.

My salary expectation is around $2000 per month.

I'm flexible and also open to a lower offer for the opportunity to gain startup experience.