* Course: DSC630
* Title: Assignment 10.2
* Author: Nels Findley
* Date: 11/16/2025
* Description: Movie Recommendation

In [11]:
# Setup libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

In [27]:
# Load the dataset
movies_df = pd.read_csv('movies.csv')


In [31]:
# Remove the trailing release year, e.g. " (1995)", from each title; it makes user input harder to match
movies_df['title'] = movies_df['title'].str.replace(r' \(\d{4}\)$', '', regex=True)
# Replace the '|' genre separator with spaces and blank out the '(no genres listed)' placeholder
movies_df['genres'] = movies_df['genres'].str.replace('|', ' ', regex=False)
movies_df['genres'] = movies_df['genres'].replace('(no genres listed)', '', regex=False)
# Drop duplicate titles and rebuild a contiguous index
movies_df.drop_duplicates(subset=['title'], inplace=True)
movies_df.reset_index(drop=True, inplace=True)

# Use TfidfVectorizer for feature extraction
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['genres'])

# Initialize the Nearest Neighbors model
# 11 neighbors = the queried movie itself plus 10 recommendations
K_NEIGHBORS = 11
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=K_NEIGHBORS)

# Train the model
model_knn.fit(tfidf_matrix)


# Build a Series that maps each movie title to its row index
# (titles are already unique after the drop_duplicates call above)
indices = pd.Series(movies_df.index, index=movies_df['title'])

# Function to get recommended movies. Returns 10 recommended movies
def get_recommendations_knn(title, model_knn=model_knn, df=movies_df, indices=indices):
  
    # Check if the movie title is in the dataset
    if title not in indices:
        print(f"Movie '{title}' not found in the dataset.")
        print("Please check the spelling or choose a different movie.")
        return []

    # Get the index of the movie that matches the title
    idx = indices[title]

    # Find the K_NEIGHBORS closest movies
    distances, neighbor_indices = model_knn.kneighbors(
        tfidf_matrix[idx],
        n_neighbors=K_NEIGHBORS
    )

    # neighbor_indices[0] gives the indices array.
    # Many movies share an identical genre vector (distance 0), so the queried
    # movie is not guaranteed to be the first neighbor; exclude it by index.
    movie_indices = [i for i in neighbor_indices[0] if i != idx][:K_NEIGHBORS - 1]
    # Return up to 10 recommended movie titles
    return df['title'].iloc[movie_indices].tolist()
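
The cosine metric used by the KNN model can be illustrated with a minimal pure-Python sketch. It is a simplification, not the notebook's method: it uses raw genre counts instead of TF-IDF weights, and the genre strings below are hypothetical examples rather than rows from movies.csv. It shows that two movies with identical genre lists sit at distance 0, which is why several neighbors can tie with the queried movie.

```python
import math
from collections import Counter


def cosine_distance(genres_a: str, genres_b: str) -> float:
    """Cosine distance between two space-separated genre strings (raw counts)."""
    a = Counter(genres_a.lower().split())
    b = Counter(genres_b.lower().split())
    # Dot product over the genres of a; a Counter returns 0 for missing keys
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no genre information: treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical genre strings, in the space-separated form produced by the cleaning step
query = "Adventure Animation Children Comedy Fantasy"
same_genres = "Adventure Animation Children Comedy Fantasy"
partial_overlap = "Animation Comedy"
no_overlap = "Horror Thriller"

for label, other in [("same", same_genres), ("partial", partial_overlap), ("none", no_overlap)]:
    print(f"{label}: {cosine_distance(query, other):.3f}")
```

Smaller distances mean more similar genre profiles, so the model ranks the identical genre list first and the non-overlapping one last.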

In [36]:
# Prompt the user for movie titles until they type 'exit'
while True:
    # Strip surrounding whitespace so a stray space does not break the exact-title lookup
    user_movie = input("\nEnter the title of a movie you like (or type 'exit' to exit): ").strip()
    if user_movie.lower() == 'exit':
        print("\nExiting program.")
        break

    # Get recommendations using the KNN function
    recommendations = get_recommendations_knn(user_movie)

    # Check if recommendations were found
    if recommendations:
        print(f"\n Recommendations for {user_movie}")
        for i, movie in enumerate(recommendations, 1):
            print(f"{i}. {movie}")


Enter the title of a movie you like (or type 'exit' to exit):  Toy Story



 Recommendations for Toy Story
1. Olaf's Frozen Adventure
2. The Good Dinosaur
3. UglyDolls
4. Casper's Scare School
5. Turbo
6. Toy Story Toons: Small Fry
7. The Magic Crystal
8. Toy Story Toons: Hawaiian Vacation
9. Legends of Valhalla: Thor
10. The SpongeBob Movie: Sponge on the Run



Enter the title of a movie you like (or type 'exit' to exit):  Bad data


Movie 'Bad data' not found in the dataset.
Please check the spelling or choose a different movie.



Enter the title of a movie you like (or type 'exit' to exit):  exit



Exiting program.


# Summary
The recommender was built in four stages. First, Data Preparation cleaned the movie titles by removing the release year, which makes user lookups easier, and normalized the genre strings by replacing the separator (|) with spaces. Next, Feature Extraction used the TF-IDF Vectorizer to transform the cleaned genre text into a sparse numerical matrix that represents each movie's genres as weighted terms. A key optimization occurred in the Similarity Model stage: an initial attempt to compute the full movie-by-movie similarity matrix with the linear_kernel function failed with a MemoryError because the matrix was too large to hold in memory. This was resolved by switching to a K-Nearest Neighbors (KNN) model with a cosine metric, which finds the 10 most similar movies for a single query without allocating the entire matrix. Finally, in Recommendation Generation, the user enters a title and the fitted KNN model returns the 10 movies whose genre profiles are most similar.
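
The memory argument can be made concrete with some rough arithmetic. The sketch below assumes a dense matrix of 8-byte floats and uses a hypothetical movie count of 60,000 for illustration; the actual row count of movies.csv is not stated here. The helper names are likewise illustrative.

```python
def dense_similarity_bytes(n_movies: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold a dense n x n similarity matrix."""
    return n_movies * n_movies * bytes_per_value


def knn_result_bytes(k: int, bytes_per_value: int = 8) -> int:
    """Bytes returned for one query: k distances plus k neighbor indices."""
    return 2 * k * bytes_per_value


n_movies = 60_000  # hypothetical dataset size, for illustration only
k_neighbors = 11   # same value as K_NEIGHBORS in the notebook

print(f"Full similarity matrix: {dense_similarity_bytes(n_movies) / 1e9:.1f} GB")
print(f"One KNN query result:   {knn_result_bytes(k_neighbors)} bytes")
```

The full matrix grows with the square of the number of movies, while a single KNN query only needs distances from one movie to the others and keeps just the k closest, which is why the second approach fits in memory.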