<a href="https://colab.research.google.com/github/Sanaa-3/lumaa-spring-2025-ai-ml/blob/main/Lumaa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

//Code by: Sanaa Stanezai
//Date created: 02/21/2025
//Date last updated: 02/21/2025
//Lumaa Spring 2025 AI/Machine Learning Intern Challenge: Simple Content-Based Recommendation

In [53]:
# Install necessary libraries (kagglehub is needed for the dataset download below)
!pip install pandas numpy scikit-learn kagglehub



In [54]:
# Import os to list the downloaded files and kagglehub to download the dataset
import os
import kagglehub

# Download the dataset
path = kagglehub.dataset_download("harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows")

# Check the dataset folder
print("Path to dataset files:", path)
print("Files in dataset folder:", os.listdir(path))

Path to dataset files: /root/.cache/kagglehub/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows/versions/1
Files in dataset folder: ['imdb_top_1000.csv']
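
The file name is hardcoded in a later cell, with a note to update it if it differs. As a minimal sketch of an alternative, the CSV files could be discovered from the downloaded folder instead. The helper name `find_csv_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_csv_files(folder):
    """Return the CSV files directly inside `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".csv")


# Hypothetical usage with the `path` returned by kagglehub:
# csv_files = find_csv_files(path)
# df = pd.read_csv(csv_files[0])
```

This assumes the dataset folder holds a single relevant CSV, which matches the listing above.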


In [55]:

# Import pandas for data handling, os for file paths,
# and scikit-learn for TF-IDF vectorization and cosine similarity
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [56]:
# Define the filename of the dataset
filename = "imdb_top_1000.csv"  # Update this if the actual file name is different

# Load the dataset into a pandas DataFrame
df = pd.read_csv(os.path.join(path, filename))

# Display the first few rows of the dataset
df.head()

# Print the column names to inspect the dataset structure
print(df.columns)

# Select only the relevant columns: Series_Title (movie title), Overview (movie summary), and Genre
# Also, drop any rows that contain missing values
df = df[['Series_Title', 'Overview', 'Genre']].dropna()

# Display the first few rows of the cleaned dataset
df.head()

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')


   Series_Title              Overview                                           Genre
0  The Shawshank Redemption  Two imprisoned men bond over a number of years...  Drama
1  The Godfather             An organized crime dynasty's aging patriarch t...  Crime, Drama
2  The Dark Knight           When the menace known as the Joker wreaks havo...  Action, Crime, Drama
3  The Godfather: Part II    The early life and career of Vito Corleone in ...  Crime, Drama
4  12 Angry Men              A jury holdout attempts to prevent a miscarria...  Crime, Drama
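
The cleaning step keeps three columns and drops rows with missing values. The sketch below uses a made-up three-row DataFrame (not the IMDB data) to show what `dropna()` does to the index, and how `reset_index(drop=True)` could be used so that row labels and row positions stay aligned. The TF-IDF matrix shape reported later has 1000 rows, which suggests no rows were actually dropped in this dataset, so this is a precaution rather than a fix.

```python
import pandas as pd

# Toy data for illustration only; the second row has a missing overview
toy = pd.DataFrame({
    "Series_Title": ["A", "B", "C"],
    "Overview": ["first plot", None, "third plot"],
    "Genre": ["Drama", "Crime", "Action"],
})

# Same selection and cleaning pattern as in the notebook
cleaned = toy[["Series_Title", "Overview", "Genre"]].dropna()
print(list(cleaned.index))  # [0, 2]: the labels keep a gap where row 1 was dropped

# Renumber the rows so that label i is also position i
cleaned = cleaned.reset_index(drop=True)
print(list(cleaned.index))  # [0, 1]
```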


In [57]:
# Create a TF-IDF vectorizer instance, which converts text to numerical vectors
vectorizer = TfidfVectorizer(stop_words='english')

# Transform the 'Overview' column into a TF-IDF matrix (numerical vectors)
tfidf_matrix = vectorizer.fit_transform(df['Overview'])

# Output the shape of the TF-IDF matrix to understand its size
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)


TF-IDF Matrix Shape: (1000, 5426)


In [58]:
# Function to recommend movies based on user query
# (relies on the global `vectorizer` and `tfidf_matrix` built in the previous cell)
def recommend_movies(query, df, top_n=5):
    # Convert the user query into a TF-IDF vector
    query_vector = vectorizer.transform([query])

    # Compute cosine similarity with all movie descriptions
    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Get indices of top N matches
    top_indices = similarities.argsort()[-top_n:][::-1]

    # Return top matching movies with their similarity scores
    # (.copy() avoids pandas' SettingWithCopyWarning when the new column is added)
    recommendations = df.iloc[top_indices][['Series_Title', 'Overview']].copy()
    recommendations['Similarity'] = similarities[top_indices]
    return recommendations
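
The ranking step can be sketched in isolation: vectorize the texts, vectorize the query with the same vocabulary, compute cosine similarity, and sort the scores in descending order. The version below is a self-contained illustration on made-up texts. The function name `rank_by_similarity` is hypothetical, and unlike `recommend_movies` it builds its own vectorizer instead of using the notebook's globals.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_similarity(query, texts, top_n=2):
    """Return the top_n (text, score) pairs most similar to the query."""
    vec = TfidfVectorizer(stop_words="english")
    matrix = vec.fit_transform(texts)

    # The query must be transformed with the already fitted vocabulary
    scores = cosine_similarity(vec.transform([query]), matrix).flatten()

    # argsort is ascending, so reverse it and keep the first top_n indices
    order = np.argsort(scores)[::-1][:top_n]
    return [(texts[i], float(scores[i])) for i in order]


# Made-up texts for illustration
toy_texts = [
    "A detective hunts a killer in the city",
    "A family of superheroes saves the world",
    "Soldiers fight a war in a distant country",
]

for text, score in rank_by_similarity("superheroes save the world", toy_texts):
    print(f"{score:.3f}  {text}")
```

A text that shares no vocabulary terms with the query scores 0, and a query whose TF-IDF vector points in the same direction as a text's vector scores 1. Note that "save" and "saves" count as different terms here, since no stemming is applied.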

In [64]:
# Main function to get user input and display recommendations
def main():
    # User query input
    query = input("Enter a movie description or keywords: ")

    print("\nExplanation of Similarity Scores:")
    print("A similarity score of 1 means the movie is identical to your entered description.")
    print("A similarity score of 0 means it is not a match to your description.")

    # Get movie recommendations based on query
    recommendations = recommend_movies(query, df)

    # Display recommendations, numbered by rank (1 = best match)
    # rather than by the row's index label in the dataset
    print("\nTop movie recommendations based on your query:\n")
    for rank, (_, row) in enumerate(recommendations.iterrows(), start=1):
        print(f"{rank}. {row['Series_Title']}\n   Overview: {row['Overview']}\n   Similarity Score: {row['Similarity']:.3f}\n")

In [65]:
# Run the program
if __name__ == "__main__":
    main()

Enter a movie description or keywords: action and thriller dystopian movies

Explanation of Similarity Scores:
A similarity score of 1 means the movie is identical to your entered description.
A similarity score of 0 means it is not a match to your description.

Top movie recommendations based on your query:

1. The Incredibles
   Overview: A family of undercover superheroes, while trying to live the quiet suburban life, are forced into action to save the world.
   Similarity Score: 0.196

2. Kokuhaku
   Overview: A psychological thriller of a grieving mother turned cold-blooded avenger with a twisty master plan to pay back those who were responsible for her daughter's death.
   Similarity Score: 0.185

3. Barton Fink
   Overview: A renowned New York playwright is enticed to California to write for the movies and discovers the hellish truth of Hollywood.
   Similarity Score: 0.172

4. Saving Private Ryan
   Overview: Following the Normandy Landings, a group of U.S. soldiers go be