In [1]:
import numpy as np
import pandas as pd
import random

def generate_100k_dataset(output_csv="final_100k_movies_webseries.csv", number_of_records=100_000):
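    """
    Create a synthetic CSV file with `number_of_records` rows for the
    recommendation system. Each row has these columns:
      - title
      - Type (Movie or Webseries)
      - Year (a single year for movies; "start–end" or "start–Present" for web series)
      - genres
      - keywords
      - tagline
      - cast
      - director
      - Country
      - index (auto-assigned row number)

    Titles are real, but every other field is sampled at random, so titles
    repeat many times and the metadata does not describe the actual works.
    """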

    # ---------------------------------------------------------------------
    # 1) Extended lists of real(ish) items for more variety
    # ---------------------------------------------------------------------

    movie_titles = [
        # 58 well-known movies
        "The Shawshank Redemption","The Godfather","Pulp Fiction","Fight Club","Inception","The Dark Knight",
        "Forrest Gump","Parasite","Interstellar","Schindler's List","The Matrix","Goodfellas","Se7en","Gladiator",
        "Titanic","Avatar","Braveheart","Saving Private Ryan","Avengers: Endgame","La La Land","Joker","Black Panther",
        "WALL-E","Up","Toy Story","Spirited Away","Amélie","City of God","Whiplash","The Prestige","Memento","Django Unchained",
        "The Lion King","The Silence of the Lambs","Jurassic Park","Terminator 2: Judgment Day","The Green Mile","Back to the Future",
        "Casablanca","Star Wars","The Pianist","12 Angry Men","Once Upon a Time in the West","Alien","Psycho","The Departed",
        "The Usual Suspects","Indiana Jones and the Last Crusade","Die Hard","Raiders of the Lost Ark","The Great Dictator",
        "Cinema Paradiso","Slumdog Millionaire","Inglourious Basterds","1917","Shutter Island","The Wolf of Wall Street","Gone Girl"
    ]

    webseries_titles = [
        # 40 well-known web series
        "Breaking Bad","Game of Thrones","Stranger Things","The Crown","Friends","Sherlock","Money Heist","Dark",
        "The Mandalorian","The Witcher","House of Cards","Westworld","Narcos","The Handmaid's Tale","Peaky Blinders",
        "Better Call Saul","The Boys","Chernobyl","Ozark","The Office","Brooklyn Nine-Nine","Modern Family","Vikings",
        "Black Mirror","Succession","Lucifer","Fargo","Dexter","The Big Bang Theory","Homeland","Prison Break",
        "True Detective","Lost","Mr. Robot","Hannibal","Cobra Kai","Invincible","Band of Brothers","Ted Lasso","Arcane"
    ]

    genres_list = [
        "Drama","Crime","Thriller","Action","Romance","Comedy","Sci-Fi","Fantasy","Mystery","Horror",
        "Adventure","Animation","Biography","Family","Musical","War","Superhero"
    ]

    keywords_list = [
        "revenge","love","betrayal","hero","villain","friendship","war","magic","time-travel","epic",
        "survival","espionage","heist","dystopian","robots","zombies","murder","coming-of-age","conspiracy","courtroom",
        "gangster","space","multiverse","pirates","historical","martial-arts","alien","apocalypse","cyberpunk","disaster"
    ]

    tagline_list = [
        "An unexpected journey.","A battle of wills.","Destiny awaits.","Nothing is what it seems.","Love conquers all.",
        "The world will tremble.","A mind-bending adventure.","Truth lies within.","Beyond the horizon.","Two worlds collide.",
        "A race against time.","Unleash your imagination.","The legend begins.","Where heroes are born.",
        "One secret can change everything."
    ]

    cast_list = [
        "Leonardo DiCaprio","Morgan Freeman","Brad Pitt","Tom Hanks","Scarlett Johansson","Natalie Portman","Johnny Depp",
        "Kate Winslet","Al Pacino","Keanu Reeves","Marlon Brando","Christian Bale","Heath Ledger","Benedict Cumberbatch",
        "Emilia Clarke","Aaron Paul","Bryan Cranston","Pedro Pascal","Henry Cavill","Jennifer Aniston","Courteney Cox",
        "Matthew Perry","Matt Damon","Julia Roberts","Angelina Jolie","Samuel L. Jackson","Robert Downey Jr.","Chris Evans",
        "Chris Hemsworth","Gal Gadot","Ryan Reynolds","Denzel Washington","Will Smith","Anne Hathaway","Jessica Chastain",
        "Harrison Ford","Keira Knightley","Ian McKellen","Patrick Stewart","Emma Stone","Amy Adams","Bruce Willis",
        "Zoe Saldana","Tom Holland","Zac Efron","Joaquin Phoenix"
    ]

    director_list = [
        "Christopher Nolan","Steven Spielberg","Quentin Tarantino","Martin Scorsese","David Fincher","James Cameron",
        "Francis Ford Coppola","Tim Burton","Ridley Scott","Denis Villeneuve","Alfred Hitchcock","Bong Joon Ho",
        "Coen Brothers","Peter Jackson","Guy Ritchie","Michael Bay","Clint Eastwood","Alejandro G. Inarritu","Kathryn Bigelow",
        "Taika Waititi","Wes Anderson","Robert Zemeckis","Stanley Kubrick","Ron Howard","George Lucas","J.J. Abrams",
        "Gore Verbinski","Sam Mendes","Anthony & Joe Russo"
    ]

    countries = [
        "USA","India","China","Japan","France","UK","Germany","Italy","Spain","Korea","Russia","Brazil","Australia",
        "Canada","Mexico","New Zealand","Sweden","Denmark","South Africa","Argentina"
    ]

    # ---------------------------------------------------------------------
    # 2) Generate the rows
    # ---------------------------------------------------------------------
    data_rows = []
    for idx in range(number_of_records):
        # Randomly pick whether it's a Movie or a Web Series
        content_type = random.choice(["Movie", "Webseries"])

        # Pick a title based on the content type
        if content_type == "Movie":
            title = random.choice(movie_titles)
        else:
            title = random.choice(webseries_titles)

        # Random year:
        # For movies, single year 1960-2023
        # For web series, possibly "start–end" or "start–Present"
        if content_type == "Movie":
            year_val = str(random.randint(1960, 2023))
        else:
            start_year = random.randint(2000, 2023)
            if random.random() < 0.5:
                # e.g. 2010–2015
                end_year = random.randint(start_year, 2023)
                year_val = f"{start_year}–{end_year}"
            else:
                # e.g. 2018–Present
                year_val = f"{start_year}–Present"

        # Randomly pick 1 or 2 genres
        num_genres = random.randint(1,2)
        chosen_genres = random.sample(genres_list, num_genres)
        genres_str = ", ".join(chosen_genres)

        # Randomly pick 1 to 3 keywords
        num_keywords = random.randint(1,3)
        chosen_keywords = random.sample(keywords_list, num_keywords)
        keywords_str = ", ".join(chosen_keywords)

        # Choose exactly one tagline
        tagline_str = random.choice(tagline_list)

        # We'll pick 2 or 3 names from cast_list
        chosen_cast = random.sample(cast_list, k=random.choice([2, 3]))
        cast_str = ", ".join(chosen_cast)

        director_str = random.choice(director_list)
        country_str = random.choice(countries)

        row_dict = {
            "title": title,
            "Type": content_type,
            "Year": year_val,
            "genres": genres_str,
            "keywords": keywords_str,
            "tagline": tagline_str,
            "cast": cast_str,
            "director": director_str,
            "Country": country_str,
            "index": idx  # row number
        }

        data_rows.append(row_dict)

    # 3) Create the DataFrame and save
    df = pd.DataFrame(data_rows)
    df.to_csv(output_csv, index=False)
    print(f"Generated '{output_csv}' with {number_of_records:,} rows.")

# Run directly if desired:
if __name__ == "__main__":
    generate_100k_dataset()


Generated 'final_100k_movies_webseries.csv' with 100,000 rows.
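
The generator above draws from Python's `random` module without a seed, so each run writes a different CSV. A minimal sketch of how a seed could make the sampling repeatable follows. `sample_rows` is a hypothetical, cut-down stand-in for the row loop (type and year only), not the notebook's function, and the `seed` parameter is an assumption.

```python
import random

def sample_rows(n, seed=None):
    """Hypothetical, reduced version of the row loop: type and year only."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    rows = []
    for idx in range(n):
        content_type = rng.choice(["Movie", "Webseries"])
        if content_type == "Movie":
            year_val = str(rng.randint(1960, 2023))
        else:
            start_year = rng.randint(2000, 2023)
            year_val = f"{start_year}–Present"
        rows.append({"Type": content_type, "Year": year_val, "index": idx})
    return rows

first = sample_rows(5, seed=42)
second = sample_rows(5, seed=42)
print(first == second)  # same seed, same rows
```

Passing a `random.Random` instance around, rather than calling `random.seed()` globally, keeps the reproducibility local to the generator.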


In [3]:
import numpy as np
import pandas as pd
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend_content(expanded_csv="final_100k_movies_webseries.csv"):
    """
    Loads the dataset written by generate_100k_dataset() and runs an interactive recommendation system.

    Steps:
      1. User picks content type: "Movie" or "Webseries".
      2. Filter dataset accordingly.
      3. Combine text features (genres, keywords, tagline, cast, director).
      4. Build TF-IDF vectors for the filtered set.
      5. Prompt user for a specific title (fuzzy matched).
      6. Compute 1 x N similarity for that single item vs. entire set.
      7. Print up to 30 recommendations, most similar first.
    """
    # Load the expanded dataset
    content_data = pd.read_csv(expanded_csv, engine='python')
    content_data.reset_index(drop=True, inplace=True)

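    # Ensure columns exist and contain no missing values
    for col in ['genres','keywords','tagline','cast','director','title','Type']:
        if col not in content_data.columns:
            content_data[col] = ''
        else:
            # Assign back instead of calling fillna(inplace=True) on a column
            # selection, which pandas warns will stop working in pandas 3.0.
            content_data[col] = content_data[col].fillna('')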

    # Ask user for content preference
    user_pref = ""
    while user_pref.lower() not in ["movie", "webseries"]:
        user_pref = input("Enter your preferred content type (Movie/Webseries): ").strip()

    # Filter dataset
    filtered_data = content_data[content_data['Type'].str.lower() == user_pref.lower()].copy()
    if filtered_data.empty:
        print(f"Sorry, no entries found for '{user_pref}'.")
        return

    # Combine text fields
    filtered_data['combined_features'] = (
        filtered_data['genres'] + " " +
        filtered_data['keywords'] + " " +
        filtered_data['tagline'] + " " +
        filtered_data['cast'] + " " +
        filtered_data['director']
    )

    # Build TF-IDF vectors
    vectorizer = TfidfVectorizer()
    feature_vectors = vectorizer.fit_transform(filtered_data['combined_features'])

    # Show some sample titles
    list_of_titles = filtered_data['title'].tolist()
    print("\nHere are a few titles in the selected category:")
    for idx, title in enumerate(list_of_titles[:10]):
        print(f"{idx+1}. {title}")

    # Ask user for their favorite title
    user_title = input("\nEnter your favorite title (from the list above or any known title): ").strip()
    # Fuzzy match
    close_matches = difflib.get_close_matches(user_title, list_of_titles)
    if not close_matches:
        print("No close match found. Check the spelling and try again.")
        return

    chosen_title = close_matches[0]
    print(f"\nWe matched your input to: {chosen_title}\n")

    # Use the first row that carries the chosen title (titles repeat in the dataset)
    row_of_chosen_item = filtered_data[filtered_data['title'] == chosen_title]
    if row_of_chosen_item.empty:
        print("Couldn't locate the chosen title. Please try again.")
        return

    index_of_content = row_of_chosen_item.index[0]

    # 1 x N similarity: compare the chosen item to all others
    user_vector = feature_vectors[filtered_data.index.get_loc(index_of_content)]  # shape (1, features)
    similarity_vector = cosine_similarity(user_vector, feature_vectors).flatten() # shape (N,)

    # Sort by descending similarity
    item_sim_scores = list(enumerate(similarity_vector))
    sorted_similar_items = sorted(item_sim_scores, key=lambda x: x[1], reverse=True)

    print("Content recommendations for you:\n")
    rec_count = 0
    for item_idx, score in sorted_similar_items:
        # Skip every row that shares the chosen title, not only the chosen row
        if filtered_data.iloc[item_idx]['title'] == chosen_title:
            continue

        title = filtered_data.iloc[item_idx]['title']
        genres = filtered_data.iloc[item_idx]['genres']
        cast = filtered_data.iloc[item_idx]['cast']
        director = filtered_data.iloc[item_idx]['director']

        print(f"{rec_count+1}. {title}")
        print(f"   Genres  : {genres}")
        print(f"   Cast    : {cast}")
        print(f"   Director: {director}\n")

        rec_count += 1
        if rec_count >= 30:
            break

if __name__ == "__main__":
    # Just call the function directly
    print("Running the Recommendation Engine...")
    recommend_content("final_100k_movies_webseries.csv")


Running the Recommendation Engine...


Enter your preferred content type (Movie/Webseries): Movie

Here are a few titles in the selected category:
1. Inglourious Basterds
2. Star Wars
3. Slumdog Millionaire
4. Slumdog Millionaire
5. Die Hard
6. Jurassic Park
7. Titanic
8. Se7en
9. Se7en
10. Toy Story

Enter your favorite title (from the list above or any known title): Inglorious Basterds

We matched your input to: Inglourious Basterds

Content recommendations for you:

1. The Dark Knight
   Genres  : Fantasy
   Cast    : Bruce Willis, Tom Holland, Zac Efron
   Director: Christopher Nolan

2. Psycho
   Genres  : Drama
   Cast    : Matthew Perry, Tom Holland, Courteney Cox
   Director: Christopher Nolan

3. Interstellar
   Genres  : Animation
   Cast    : Harrison Ford, Tom Holland
   Director: Christopher Nolan

4. The Godfather
   Genres  : War
   Cast    : Brad Pitt, Tom Holland
   Director: Denis Villeneuve

5. Toy Story
   Genres  : Action
   Cast    : Gal Gadot, Tom Hanks, Christian Bale
   Director: Christopher Nolan
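
Steps 4 to 6 of `recommend_content` (TF-IDF vectors, then a 1 x N cosine similarity) can be sketched on a toy corpus. The four strings below are made-up stand-ins for the `combined_features` column, and `np.argsort` is used here in place of the notebook's `sorted(enumerate(...))`; both are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the combined_features column (made-up rows).
docs = [
    "Drama revenge Christopher Nolan",
    "Drama revenge Christopher Nolan",
    "Comedy pirates Tim Burton",
    "Drama heist Christopher Nolan",
]

vectors = TfidfVectorizer().fit_transform(docs)          # sparse, shape (4, vocabulary size)
scores = cosine_similarity(vectors[0], vectors).ravel()  # row 0 against every row, shape (4,)

# Indices ordered from most to least similar to row 0, then drop row 0 itself.
order = np.argsort(-scores, kind="stable")
top = [int(i) for i in order if i != 0][:2]
print(top)  # [1, 3]
```

Row 1 scores highest because its feature string is identical to row 0, and row 2 scores zero because it shares no tokens. Since the title is not part of the features and every field is sampled at random, the ranking reflects chance overlap in tokens such as cast or director names, which may be why one director recurs in the output above.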


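The fuzzy-matching step maps the misspelled input "Inglorious Basterds" to "Inglourious Basterds" through `difflib.get_close_matches`. A small sketch of that call follows. The title list is a toy sample, and deduplicating it first is an optional tweak that the notebook does not apply.

```python
import difflib

# Toy title column with repeats, as in the generated dataset.
titles = ["Inglourious Basterds", "Star Wars", "Slumdog Millionaire",
          "Slumdog Millionaire", "Die Hard"]

# Optional tweak (not in the notebook): compare against each title only once.
unique_titles = sorted(set(titles))

# n=1 keeps only the best candidate; 0.6 is difflib's default cutoff.
matches = difflib.get_close_matches("Inglorious Basterds", unique_titles, n=1, cutoff=0.6)
print(matches)   # ['Inglourious Basterds']

# Nothing clears the cutoff, so the list is empty and the caller can bail out.
no_match = difflib.get_close_matches("zzzz", unique_titles, n=1, cutoff=0.6)
print(no_match)  # []
```

With roughly a hundred distinct titles spread over 100,000 rows, matching against the unique titles avoids scoring the same string thousands of times.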