# Content-Based ML Model

DATASET: imdb_top_1000.csv<br>
TF-IDF vectorization.<br>
Recommend the movies with the highest similarity from the dataset.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [5]:
# Load dataset
df = pd.read_csv('imdb_top_1000.csv')
df.head(10)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
5,https://m.media-amazon.com/images/M/MV5BNzA5ZD...,The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905
6,https://m.media-amazon.com/images/M/MV5BNGNhMD...,Pulp Fiction,1994,A,154 min,"Crime, Drama",8.9,"The lives of two mob hitmen, a boxer, a gangst...",94.0,Quentin Tarantino,John Travolta,Uma Thurman,Samuel L. Jackson,Bruce Willis,1826188,107928762
7,https://m.media-amazon.com/images/M/MV5BNDE4OT...,Schindler's List,1993,A,195 min,"Biography, Drama, History",8.9,"In German-occupied Poland during World War II,...",94.0,Steven Spielberg,Liam Neeson,Ralph Fiennes,Ben Kingsley,Caroline Goodall,1213505,96898818
8,https://m.media-amazon.com/images/M/MV5BMjAxMz...,Inception,2010,UA,148 min,"Action, Adventure, Sci-Fi",8.8,A thief who steals corporate secrets through t...,74.0,Christopher Nolan,Leonardo DiCaprio,Joseph Gordon-Levitt,Elliot Page,Ken Watanabe,2067042,292576195
9,https://m.media-amazon.com/images/M/MV5BMmEzNT...,Fight Club,1999,A,139 min,Drama,8.8,An insomniac office worker and a devil-may-car...,66.0,David Fincher,Brad Pitt,Edward Norton,Meat Loaf,Zach Grenier,1854740,37030102


In [6]:
# Rename and select relevant columns for clarity and easier access
df = df.rename(columns={
    'Series_Title': 'title',
    'Genre': 'genre',
    'Overview': 'overview',
    'Star1': 'star1', 'Star2': 'star2',
    'Star3': 'star3', 'Star4': 'star4',
})

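# Keep only the columns we will use for building the content-based model
# .copy() gives an independent frame, so the column assignments below don't trigger SettingWithCopyWarning
df = df[['title', 'genre', 'overview', 'star1', 'star2', 'star3', 'star4']].copy()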

# Fill missing values with empty strings so TF-IDF doesn't break on NaNs
for col in ['genre', 'overview', 'star1', 'star2', 'star3', 'star4']:
    df[col] = df[col].fillna('')

In [7]:
# Combine genre, overview, and star names into one text string per movie
# This forms the "content" that will be vectorized for similarity comparison
df['content'] = (
    df['genre'] + ' ' +
    df['overview'] + ' ' +
    df['star1'] + ' ' +
    df['star2'] + ' ' +
    df['star3'] + ' ' +
    df['star4']
)

*Text vectorization*: TF-IDF turns each movie's text into a vector in which every word is weighted by how often it appears in that movie, discounted by how common the word is across the whole dataset.

In [8]:
# Initialize TF-IDF vectorizer, remove common English stop words, and limit max features for speed
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

# Apply TF-IDF to the 'content' column, converting text to numerical feature vectors
tfidf_matrix = tfidf.fit_transform(df['content'])
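
To make the weighting concrete, here is a minimal sketch on a toy corpus. The three documents are hypothetical and not taken from the dataset, and the vectorizer uses its default settings rather than the `stop_words` and `max_features` chosen above. It shows that a word appearing in only one document (`prison`) gets a higher weight than a word shared by several documents (`crime`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
docs = [
    "crime drama prison",
    "crime drama mafia",
    "animation family adventure",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column in the TF-IDF matrix
vocab = vectorizer.vocabulary_
first_doc = matrix.toarray()[0]

weight_crime = first_doc[vocab["crime"]]    # appears in two documents
weight_prison = first_doc[vocab["prison"]]  # appears in one document

print(f"crime: {weight_crime:.3f}, prison: {weight_prison:.3f}")
print("prison outweighs crime:", weight_prison > weight_crime)
```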

*Cosine similarity*: Measures how similar two vectors (movies) are, from 0 (no terms in common) to 1 (vectors pointing in the same direction).

In [9]:
# Compute the cosine similarity matrix between all movies based on their TF-IDF vectors
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
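
`linear_kernel` computes plain dot products, yet it yields cosine similarity here because `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`). The sketch below checks the underlying identity with NumPy on two made-up vectors: the cosine of the original vectors equals the dot product of their unit-length versions.

```python
import numpy as np

# Two made-up vectors, only for illustration
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Cosine similarity from its definition
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after scaling both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(np.isclose(cosine, dot_of_units))
```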

In [10]:
# Create a mapping from movie title to its index in the dataframe for quick lookup
# Strip whitespace to avoid key errors due to extra spaces
indices = pd.Series(df.index, index=df['title'].str.strip())
# Keep only the first entry for any repeated title so a lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]
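
One subtlety with a title-to-index Series: `Series.drop_duplicates()` compares the values (the row numbers, which are already unique), not the index labels (the titles). The sketch below uses hypothetical titles to show the difference between that call and filtering on `index.duplicated()`, which is what actually removes repeated titles.

```python
import pandas as pd

# Hypothetical titles with one repeat, only for illustration
titles = pd.Series(["Alpha", "Beta", "Alpha"])
lookup = pd.Series(titles.index, index=titles)

# Values are 0, 1, 2 (all unique), so nothing is removed
by_value = lookup.drop_duplicates()

# Filtering on the index labels keeps the first row for each title
by_label = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value), len(by_label))
print(by_label["Alpha"])
```

With a repeated title left in place, `lookup["Alpha"]` returns a Series of two row numbers instead of a single integer, which would break the row lookup in a similarity matrix.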

*Recommendation function*: Retrieves movies closest in content space to the chosen one.<br>
Returns the top-N most similar movie titles.

In [11]:
# Function to get recommendations given a movie title and number of recommendations
def get_recommendations(title, num_recommendations=10):
    title = title.strip()  # Clean whitespace
    if title not in indices:
        print(f"Movie '{title}' not found.")
        return None

    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get a list of tuples (movie_index, similarity_score) for this movie compared to all others
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the queried movie itself explicitly instead of assuming it sorts first,
    # then sort the remaining tuples by similarity score in descending order
    sim_scores = [s for s in sim_scores if s[0] != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:num_recommendations]

    # Extract just the indices of the most similar movies
    recommended_idxs = [i[0] for i in sim_scores]

    # Return the titles of the recommended movies
    return df['title'].iloc[recommended_idxs].tolist()
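
The ranking step can also be written with NumPy instead of sorting a list of tuples. This is a sketch of an alternative, not the function used in this notebook: `top_n_similar` and the 3x3 `toy` matrix are hypothetical names and data introduced for illustration.

```python
import numpy as np

def top_n_similar(sim_matrix, idx, n=5):
    """Return the row indices of the n most similar items to item idx."""
    scores = sim_matrix[idx].copy()
    scores[idx] = -np.inf                 # never recommend the item itself
    order = np.argsort(scores)[::-1][:n]  # highest similarity first
    return order.tolist()

# Hypothetical similarity matrix for three items
toy = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

print(top_n_similar(toy, 0, n=2))  # [2, 1]
```

Masking the queried row with `-inf` removes it by position, so the result does not depend on the movie's own score sorting first.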

Trying this with the movie title: *The Shawshank Redemption*

In [12]:
# Example usage: Get top 5 movies similar to 'The Shawshank Redemption'
recs = get_recommendations('The Shawshank Redemption', num_recommendations=5)

if recs:
    print("Top 5 recommendations:")
    for r in recs:
        print("-", r)


Top 5 recommendations:
- Mystic River
- Unforgiven
- Short Cuts
- Dev.D
- Million Dollar Baby


Showing recommendations based on user input

In [13]:
def main():
    while True:
        user_title = input("Enter a movie title (or 'exit' to quit): ").strip()
        if user_title.lower() == 'exit':
            print("Goodbye!")
            break

        recommendations = get_recommendations(user_title, num_recommendations=5)

        if recommendations:
            print(f"Top 5 recommendations similar to '{user_title}':")
            for i, rec in enumerate(recommendations, start=1):
                print(f"{i}. {rec}")
        else:
            print("Try another movie title or check your spelling.")

if __name__ == "__main__":
    main()

Enter a movie title (or 'exit' to quit): The Dark Knight
Top 5 recommendations similar to 'The Dark Knight':
1. Batman Begins
2. The Dark Knight Rises
3. Brokeback Mountain
4. The Prestige
5. Kill Bill: Vol. 1
Enter a movie title (or 'exit' to quit): Beauty and the Beast
Top 5 recommendations similar to 'Beauty and the Beast':
1. The Little Mermaid
2. Hauru no ugoku shiro
3. Before Sunset
4. The Nightmare Before Christmas
5. Belle de jour
Enter a movie title (or 'exit' to quit): exit
Goodbye!


#### Using Gradio with the same code

In [14]:
import gradio as gr

In [15]:
def recommend_movies_gradio(title):
    recs = get_recommendations(title, num_recommendations=5)
    if recs:
        # Join into a single string so the output Textbox shows one title per line
        return "\n".join(recs)
    else:
        return "Movie not found. Try another title."

# Create Gradio interface
iface = gr.Interface(
    fn=recommend_movies_gradio,
    inputs=gr.Textbox(label="Enter a movie title"),
    outputs=gr.Textbox(label="Recommended movies"),
    title="Movie Recommender",
    description="Type a movie title to get recommendations based on similar content."
)

# Launch the web app
iface.launch()

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://d3d4e238610df1f17b.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




Some drawbacks of this code:
- It only works for movies available in the dataset.
- The *TF-IDF vectorization*: <br>
  - A *content-based recommendation* technique.
  - Converts movie metadata (like description, genre, cast) into TF-IDF vectors.
  - Compares the similarity between vectors using cosine similarity.
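
The exact-title requirement can be softened a little with fuzzy matching from the standard library. The sketch below is one possible approach, not part of the notebook's code: `suggest_titles` is a hypothetical helper, the title list is a small sample, and the `cutoff` of 0.6 is an arbitrary choice. It still only suggests titles that exist in the list it is given.

```python
from difflib import get_close_matches

def suggest_titles(query, known_titles, n=3, cutoff=0.6):
    """Return up to n known titles that closely match a possibly misspelled query."""
    # Compare in lowercase, but return the titles in their original casing
    lowered = {t.lower(): t for t in known_titles}
    matches = get_close_matches(query.strip().lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[m] for m in matches]

# Small sample of titles, only for illustration
sample_titles = ["The Dark Knight", "The Dark Knight Rises", "Inception"]

suggestions = suggest_titles("the dark knigt", sample_titles)
print(suggestions)
```

In the notebook, the list of known titles could come from `indices.index`, and the best suggestion could be passed on to `get_recommendations`.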


This approach does not involve training a machine learning model. It's rule-based and unsupervised. It doesn't learn patterns from user preferences or behavior.


Some of the movies you can try:
1. Andaz Apna Apna
2. Baby Driver
3. Big Hero 6
4. Captain America: Civil War
5. Catch Me If You Can
6. Chak De! India
7. Days of Heaven
8. Die Hard
9. English Vinglish
10. Enter the Dragon
11. Fight Club
12. Gravity
13. Guardians of the Galaxy
14. Harry Potter and the Goblet of Fire
15. Home Alone
16. How to Train Your Dragon
17. Incredibles 2
18. Iron Man
19. Interstellar
20. Joker
21. Kai po che!
22. Kingsman: The Secret Service
23. Les yeux sans visage
24. Life of Pi
25. Mission: Impossible - Fallout
26. Mulan
27. Night of the Living Dead
28. Once Upon a Time in the West
29. Pirates of the Caribbean: The Curse of the Black Pearl
30. Predator
31. Pride & Prejudice
32. Queen
33. Raazi
34. Sherlock Holmes
35. Sin City
36. Star Trek Into Darkness
37. Star Wars
38. Taxi Driver
39. The Avengers
40. The Exorcist
41. The Hangover
42. The Martian
43. The Matrix
44. The Notebook
45. The Wolf of Wall Street
46. Thor: Ragnarok
47. Titanic
48. To Be or Not to Be
49. To Have and Have Not
50. To Kill a Mockingbird
51. Udaan
52. Udta Punjab
53. Vicky Donor
54. Victoria
55. When Harry Met Sally…
56. Wreck-It Ralph
57. X: First Class
58. Yip Man
59. Zindagi Na Milegi Dobara
60. Zootopia

