> ## **Movie Recommendation System Project** ##

First, we will import the necessary libraries and the dataset and create a Lemmatizer object.

In [91]:
import pandas as pd
import re
from nltk.stem import WordNetLemmatizer
from contractions import fix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [92]:
lemmatizer = WordNetLemmatizer()

In [93]:
# Read CSV files
df1 = pd.read_csv('movie_data_train.csv')
df2 = pd.read_csv('movie_data_solution.csv')

We have two datasets, which both have data of movie names, genres and plot summaries. We need a large dataset so we will merge the two datasets together. This would help us get a large dataset with a lot of movie names and plot summaries. We will use the movie names and plot summaries to create a recommendation system. The model would work more efficiently if we have a large dataset.

In [94]:
# Concatenate DataFrames
df = pd.concat([df1, df2], axis=0)

Now, we would refine and preprocess the data. We will remove any special words, symbols, numbers, etc. from the plot summaries. We will also remove any stopwords from the plot summaries. We will also lemmatize the words in the plot summaries. 

In [95]:
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # remove non-alpha
    text = fix(text)  # expand contractions using the 'fix' function from contractions
    text = lemmatizer.lemmatize(text)  # lemmatize
    return text

In [96]:
# Preprocess plots: lower case, remove punctuation, lemmatize
df['Plot'] = df['Plot Summary'].apply(preprocess)

After the data has been preprocessed, it is now time for the main part of the project. We will create a TF-IDF vectorizer object and fit it on the plot summaries. We will then create a cosine similarity matrix. We will use the cosine similarity matrix to find the cosine similarity between the plot summaries. We will then create a function that takes in a movie name and returns the most similar movie to it. We will then test the function by passing in a movie name and printing the top 10 most similar movies to it.

In [97]:
# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=10_000)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Plot'])

In [98]:
def recommend_and_print(plot):
    processed_plot = preprocess(plot)
    plot_vec = tfidf_vectorizer.transform([processed_plot])
    
    # Calculate cosine similarities
    similarities = cosine_similarity(tfidf_matrix, plot_vec)
    
    # Get top indices
    indices = similarities.argsort(axis=0)[-10:][::-1]
    
    # Get movie titles using the indices
    recommended_movies = []
    for idx in indices:
        original_idx = df.index[idx]  # Get the original index
        movie_title = df.loc[original_idx, 'Title']
        recommended_movies.append(movie_title)
    
    # Print recommended movies
    for idx, movie_title in enumerate(recommended_movies, start=1):
        print(f"{idx}. {movie_title}")
    
    return recommended_movies


In [99]:
# Example plot 
plot = input("Enter a plot of a type of movie you would like to see: ")

In [100]:
recommend_and_print(plot)

1. 48604    Discovering the Real World of Harry Potter (2001)
48604                              Friendly Enemies (1942)
Name: Title, dtype: object
2. 16724      I'm Not Obsessed (2008)
16724    "Common Sense: AU" (2017)
Name: Title, dtype: object
3. 5547             Les galeries Lévy et Cie (1932)
5547    The Seekers Guide to Harry Potter (2010)
Name: Title, dtype: object
4. 8328    "Hogwarts: The Truth About Potter" (2014)
8328       Ladies' Night in a Turkish Bath (1928)
Name: Title, dtype: object
5. 7886                            Sun Valley Cyclone (1946)
7886    An Exclusive Inside Look at 'Harry Potter and ...
Name: Title, dtype: object
6. 4885     The Sound of Life (2008)
4885    Choihui jeongmumun (1977)
Name: Title, dtype: object
7. 16360      I bambini della miniera (2016)
16360    JK Rowling: The Interview (2003)
Name: Title, dtype: object
8. 47163                          Virtual Encounters 2 (1998)
47163    The Wizard Rockumentary: A Movie About Rocking...
Name: Title, dt

[48604    Discovering the Real World of Harry Potter (2001)
 48604                              Friendly Enemies (1942)
 Name: Title, dtype: object,
 16724      I'm Not Obsessed (2008)
 16724    "Common Sense: AU" (2017)
 Name: Title, dtype: object,
 5547             Les galeries Lévy et Cie (1932)
 5547    The Seekers Guide to Harry Potter (2010)
 Name: Title, dtype: object,
 8328    "Hogwarts: The Truth About Potter" (2014)
 8328       Ladies' Night in a Turkish Bath (1928)
 Name: Title, dtype: object,
 7886                            Sun Valley Cyclone (1946)
 7886    An Exclusive Inside Look at 'Harry Potter and ...
 Name: Title, dtype: object,
 4885     The Sound of Life (2008)
 4885    Choihui jeongmumun (1977)
 Name: Title, dtype: object,
 16360      I bambini della miniera (2016)
 16360    JK Rowling: The Interview (2003)
 Name: Title, dtype: object,
 47163                          Virtual Encounters 2 (1998)
 47163    The Wizard Rockumentary: A Movie About Rocking...
 Name: Ti