# Movie Recommendation System

### 1. Import Libraries

In [1]:
# Import necessary libraries for data analysis and visualization
import numpy as np  # Numerical computations
import pandas as pd  # Data manipul* ation
import matplotlib.pyplot as plt  # Visualizations
import seaborn as sns  # Advanced visualizations

### 2. Load Dataset

In [2]:
data = pd.read_csv("C:/Users/muge/Dropbox/GMU/Portfolio/AIT 580/Movies/cleaned_movies_data.csv")

### 3. Dataset Structure & Cleaning

In [3]:
#1. Check the structure of the dataset:
#Look for missing values, data types, and basic statistics.
# Check the data types and missing values
data.info()

# Display basic statistics for numerical columns
data.describe()

# Check for missing values
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2544 entries, 0 to 2543
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Title                   2544 non-null   object 
 1   US Gross                2544 non-null   float64
 2   Worldwide Gross         2544 non-null   float64
 3   US DVD Sales            2544 non-null   float64
 4   Production Budget       2544 non-null   float64
 5   Release Date            2537 non-null   object 
 6   MPAA Rating             2452 non-null   object 
 7   Running Time (min)      2544 non-null   float64
 8   Distributor             2544 non-null   object 
 9   Source                  2544 non-null   object 
 10  Major Genre             2544 non-null   object 
 11  Creative Type           2544 non-null   object 
 12  Director                2544 non-null   object 
 13  Rotten Tomatoes Rating  2544 non-null   float64
 14  IMDB Rating             2544 non-null   

Title                      0
US Gross                   0
Worldwide Gross            0
US DVD Sales               0
Production Budget          0
Release Date               7
MPAA Rating               92
Running Time (min)         0
Distributor                0
Source                     0
Major Genre                0
Creative Type              0
Director                   0
Rotten Tomatoes Rating     0
IMDB Rating                0
IMDB Votes                 0
Release Year               7
Profit                     0
ROI                        0
dtype: int64

In [4]:
# Convert 'Release Date' back to datetime
data["Release Date"] = pd.to_datetime(data["Release Date"], errors='coerce')

# Convert 'MPAA Rating' back to category
rating_order = ["unknown", "g", "pg", "pg-13", "r", "nc-17"]
data["MPAA Rating"] = pd.Categorical(data["MPAA Rating"], categories=rating_order, ordered=True)

# Confirm the data types
print(data.dtypes)

Title                             object
US Gross                         float64
Worldwide Gross                  float64
US DVD Sales                     float64
Production Budget                float64
Release Date              datetime64[ns]
MPAA Rating                     category
Running Time (min)               float64
Distributor                       object
Source                            object
Major Genre                       object
Creative Type                     object
Director                          object
Rotten Tomatoes Rating           float64
IMDB Rating                      float64
IMDB Votes                       float64
Release Year                     float64
Profit                           float64
ROI                              float64
dtype: object


In [5]:
# Handle missing values:
# Decide how to handle missing values: remove rows, impute missing values, or ignore certain columns.
# Drop rows with missing 'Release Year'
data = data.dropna(subset=['Release Year'])

# We could fill missing values in MPAA Rating column. Here I chose to not do it.
# data['MPAA Rating'].fillna('Unknown', inplace=True)

### 4. Movie Recommendation Function Based on Similarity

#### 1. Content-Based Filtering

Content-based filtering recommends movies based on the similarity of movie features (like Genre, Director, Creative Type). We will calculate the similarity between movies using these features and recommend similar movies to a user based on a given movie.

Steps:
* Create a feature vector for each movie based on relevant features like Genre, Director, and Creative Type.
* Calculate similarity between movies using cosine similarity (or any other distance measure).
* Recommend movies similar to a given movie.

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# If the 'features' column does not exist, we can still combine the other columns like this:
data['features'] = data['Major Genre'] + ' ' + data['Director'] + ' ' + data['Creative Type']

# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Create the TF-IDF matrix for the 'features' column
tfidf_matrix = tfidf.fit_transform(data['features'])

# Compute the cosine similarity between all movies
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Normalize titles
data['Title_normalized'] = data['Title'].str.strip().str.lower()

# Now check for the normalized version of 'The Matrix'
print('wilson' in data['Title_normalized'].values)  # Should return True if the title exists in any case or with spaces

# Function to get movie recommendations based on similarity
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = data.index[data['Title_normalized'] == title].tolist()[0]
    
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the top 10 most similar movies
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return data['Title_normalized'].iloc[movie_indices]

# Example: Get recommendations for a given movie title
recommended_movies = get_recommendations('the land girls')  # Replace 'The Matrix' with a movie in the dataset
print(recommended_movies)

True
5                    mississippi mermaid
9                        duel in the sun
10                             tom jones
12                 to kill a mockingbird
14                     hollywood shuffle
15        over the hill to the poorhouse
16                                wilson
33                       51 birch street
58    sweet sweetback's baad asssss song
63                       barbarians, the
Name: Title_normalized, dtype: object


The output shows the 10 movies that are most similar to "the land girls" based on the content of their 'features' column (genre, director, creative type):

Key Observations and Insights:
* Content-Based Recommendations: The recommendations are based on the similarity of the movie's genre, director, and creative type, not on user ratings or popularity.
* Potential for Further Improvement: The recommendations might not always be perfect, as they are based on a limited set of features. we could improve the system by including more features, such as actors, plot keywords, or user reviews.
* TF-IDF Effectiveness: TF-IDF vectorization is a simple but effective way to represent text data for similarity calculations.
* Cosine Similarity Suitability: Cosine similarity is a good metric for measuring the similarity between text documents.
* Normalization Importance: Normalizing titles ensures that the recommendation system works correctly even with variations in capitalization and spacing.

This code demonstrates a basic content-based movie recommendation system that uses TF-IDF and cosine similarity to find movies similar to a given movie. The recommendations are based on the movie's genre, director, and creative type, and the system can be further improved by incorporating more features.

In [7]:
print(data['Title_normalized'].head(20))  # This will print the first 20 movie titles in the dataset

0                     the land girls
1             first love, last rites
2         i married a strange person
3               let's talk about sex
4                               slam
5                mississippi mermaid
6                          following
7                            foolish
8                            pirates
9                    duel in the sun
10                         tom jones
11                           oliver!
12             to kill a mockingbird
13                  tora, tora, tora
14                 hollywood shuffle
15    over the hill to the poorhouse
16                            wilson
17                      darling lili
18                      12 angry men
19                    twelve monkeys
Name: Title_normalized, dtype: object


### 2. Collaborative Filtering Using Surprise Library

#### A. Collaborative Filtering (User-Based or Item-Based)

Collaborative filtering makes recommendations based on the interactions (ratings) between users and items (movies). Since we don’t have user ratings data in the dataset, let’s assume that we would have ratings data, and we will implement user-based collaborative filtering.

Steps:
* Prepare the ratings dataset (assumed to have userId, movieId, and rating).
* Train a collaborative filtering model (e.g., using KNNBasic, SVD, or other models in Surprise).
* Recommend movies based on similar users.

In [8]:
data.columns

Index(['Title', 'US Gross', 'Worldwide Gross', 'US DVD Sales',
       'Production Budget', 'Release Date', 'MPAA Rating',
       'Running Time (min)', 'Distributor', 'Source', 'Major Genre',
       'Creative Type', 'Director', 'Rotten Tomatoes Rating', 'IMDB Rating',
       'IMDB Votes', 'Release Year', 'Profit', 'ROI', 'features',
       'Title_normalized'],
      dtype='object')

In [9]:
# pip install scikit-surprise

##### Step 1: Generate the Synthetic Ratings Dataset

We will create synthetic ratings using userId, movieId, and rating. To make it more realistic, we will assume that each movie has a unique movieId corresponding to the Title column.

Our dataset does not currently contain user IDs, movie IDs, or explicit user ratings, which are required for collaborative filtering. So this code generates a synthetic dataset rather than using our actual dataset. If we want to apply collaborative filtering to our real dataset, we'll need to replace the synthetic data generation with real dataset containing these.

In [10]:
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import KNNBasic
from surprise import accuracy

# Example data with some features we have (simulating user-based ratings)
# Assume that the movie dataset is called 'data'

np.random.seed(42)  # For reproducibility

# Create user IDs and movie IDs (assuming 5 users for this example)
num_users = 5
num_movies = len(data)  # Number of movies

# Generate random user IDs for each movie (movies rated by users)
user_ids = np.random.randint(1, num_users + 1, size=num_movies)

# Generate movie IDs based on the index
movie_ids = np.arange(1, num_movies + 1)

# Generate random ratings (between 1 and 5)
ratings = np.random.randint(1, 6, size=num_movies)

# Create a synthetic ratings dataframe
ratings_df = pd.DataFrame({
    'userId': user_ids,
    'movieId': movie_ids,
    'rating': ratings
})

# Show the first few rows of the ratings dataset
print(ratings_df.head())

   userId  movieId  rating
0       4        1       3
1       5        2       4
2       3        3       2
3       5        4       2
4       5        5       3


##### Step 2: Train the Collaborative Filtering Model

We can now train a KNNBasic collaborative filtering model from the Surprise library using the synthetic ratings data.

In [11]:
# Initialize the Reader for Surprise
reader = Reader(rating_scale=(1, 5))

# Load the ratings dataset into the Surprise format
data_surprise = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Split the dataset into training and testing
trainset, testset = train_test_split(data_surprise, test_size=0.2)

# Initialize the KNNBasic model
sim_options = {
    'name': 'cosine',
    'user_based': True  # Use user-based collaborative filtering
}
model = KNNBasic(sim_options=sim_options)

# Train the model
model.fit(trainset)

# Evaluate the model using RMSE (Root Mean Square Error)
predictions = model.test(testset)
print(f'RMSE: {accuracy.rmse(predictions)}')

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.4352
RMSE: 1.4351523890664715


* User-Based Collaborative Filtering: The model uses user-based collaborative filtering, meaning it makes predictions based on the ratings of users who are similar to the target user.
* Cosine Similarity: Cosine similarity is used to measure user similarity.
* RMSE Evaluation: The RMSE is 1.4352. This means that, on average, the model's predictions are about 1.4352 rating points away from the actual ratings.

RMSE Interpretation:
* The RMSE value of 1.4352 indicates a relatively high error.
* Given that the rating scale is from 1 to 5, an error of 1.4352 suggests that the model's predictions are not very accurate.
* This could be due to several factors, such as:
    * Sparsity of the data (users may have rated only a few movies).
    * Lack of strong user similarities.
    * The inherent difficulty of predicting movie ratings.

Possible Improvements:
* Try Different Similarity Metrics: Experiment with other similarity metrics, such as Pearson correlation.
* Tune K: Adjust the number of neighbors (K) used in the KNN algorithm.
* Use Item-Based Collaborative Filtering: Try item-based collaborative filtering instead of user-based.
* Try Different Algorithms: Explore other collaborative filtering algorithms, such as SVD or NMF.
* Feature Engineering: If we had additional information about users or movies, we could incorporate it into the model.

The results show that the user-based KNN collaborative filtering model has an RMSE of 1.4352, indicating that its predictions are not very accurate. Further experimentation and model tuning are needed to improve the performance of the recommendation system.

#### Movie Recommendations Based on User Preferences

##### Step 3: Recommend Movies for a User

Now that we have trained the collaborative filtering model, we can make movie recommendations for a user based on similar users’ preferences. We will generate movie recommendations for a specific userId.

In [12]:
# Function to get movie recommendations for a given user
def get_movie_recommendations(user_id, model, num_recommendations=5):
    # Get all movie ids
    movie_ids = np.arange(1, num_movies + 1)
    
    # Predict ratings for all movies for the given user
    predictions = [model.predict(user_id, movie_id) for movie_id in movie_ids]
    
    # Sort predictions by estimated rating (in descending order)
    predictions.sort(key=lambda x: x.est, reverse=True)
    
    # Get the top N recommendations
    top_recommendations = predictions[:num_recommendations]
    
    # Get the movie IDs corresponding to the recommended movie titles
    recommended_movie_ids = [pred.iid for pred in top_recommendations]
    
    # Adjust for 0-based indexing
    recommended_titles = data['Title'].iloc[recommended_movie_ids].values  # No need to subtract 1
    
    return recommended_titles

# Example: Get movie recommendations for user 1
recommended_movies = get_movie_recommendations(1, model, num_recommendations=5)
print(recommended_movies)

['20,000 Leagues Under the Sea' "April Fool's Day" 'Chain Reaction'
 'Clockwatchers' 'Dawn of the Dead']


Explanation:
* model.predict(user_id, movie_id): This generates predicted ratings for each movie based on the user's past interactions (even though it's simulated here, it works for a collaborative filtering model).
* pred.iid: This is the predicted movie ID. The assumption is that movie IDs match the dataset's index or are aligned in some way.
* Now, when we call get_movie_recommendations(1, model, num_recommendations=5), it should print the top 5 recommended movie titles for user 1. 

* Synthetic Ratings: We generated synthetic ratings for users based on movie features (this can be modified based on more realistic user interactions).
* Collaborative Filtering: We used the KNNBasic model from the Surprise library to train the collaborative filtering algorithm.
* Movie Recommendations: We made movie recommendations based on similar users.

#### B. Collaborative Filtering (User-Based or Item-Based) using IMDB rating

The dataset doesn't contain explicit user ratings, so we'll need to simulate user interactions to make collaborative filtering work. Since IMDB Rating is already a numerical value, we can treat it as the "rating."

Plan for Collaborative Filtering with the Data
Since we don't have user IDs, we will:
1. Simulate user IDs by assigning random users to each movie.
2. Use IMDB Ratings as the "rating" column.
3. Train a User-Based Collaborative Filtering Model using Surprise.

##### Step 1: Prepare the Data


In [13]:
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import KNNBasic, accuracy

# Drop rows with missing IMDB Ratings
df = data.dropna(subset=['IMDB Rating'])

# Simulate User IDs (assign random users)
num_users = 1000  # Simulate 1000 different users
num_movies = len(df)

# Assign a random user ID to each movie (as if different users rated different movies)
df['userId'] = np.random.randint(1, num_users + 1, size=num_movies)

# Rename columns for compatibility with Surprise
df = df.rename(columns={'Title_normalized': 'movieId', 'IMDB Rating': 'rating'})

# Select relevant columns
ratings_df = df[['userId', 'movieId', 'rating']]

# Display first few rows
print(ratings_df.head())

   userId                     movieId  rating
0     990              the land girls     6.1
1     626      first love, last rites     6.9
2     491  i married a strange person     6.8
3     255        let's talk about sex     6.4
4     381                        slam     3.4


##### Step 2: Convert to Surprise Format

In [14]:
# Define rating scale for IMDB Ratings (1-10)
reader = Reader(rating_scale=(1, 10))

# Load dataset into Surprise
data_surprise = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Split into training and test sets
trainset, testset = train_test_split(data_surprise, test_size=0.2)

##### Step 3: Train a User-Based Collaborative Filtering Model

In [15]:
# Define similarity options (cosine similarity, user-based)
sim_options = {'name': 'cosine', 'user_based': True}

# Initialize and train the model
model = KNNBasic(sim_options=sim_options)
model.fit(trainset)

# Make predictions and evaluate
predictions = model.test(testset)
print(f'RMSE: {accuracy.rmse(predictions)}')
print(f'MAE: {accuracy.mae(predictions)}')

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.1348
RMSE: 1.1348243466511567
MAE:  0.8889
MAE: 0.8889284508485232


The model's performance is decent, but there's room for improvement.
* An RMSE of 1.13 and an MAE of 0.89 suggest that the model's predictions have a reasonable level of accuracy, but there is still a notable average error.
* The fact that the MAE is lower than the RMSE is normal. RMSE penalizes large errors more heavily than MAE.
* Whether these values are "good" depends on the specific context and the nature of the rating scale. For example, in a 1-5 star rating system, an average error of around 1.13 stars might be considered moderate.

Possible Improvements:
* Parameter Tuning: Experiment with different values for the number of neighbors (k) in the KNN algorithm.
* Different Similarity Metrics: Try other similarity metrics, such as Pearson correlation.
* Item-Based Filtering: Consider using item-based collaborative filtering instead of user-based filtering.
* Data Preprocessing: Explore techniques like normalization or scaling of ratings to improve similarity calculations.
* Hybrid Approaches: Combine user-based and item-based filtering or incorporate other machine learning models.
* Increase training data: increasing the size of the training dataset can improve the models accuracy.

📌 Expected Outcome: 
The system will learn which users (randomly assigned) rate movies similarly and predict ratings for other movies.
The RMSE and MAE scores will help us assess how well the model predicts ratings.

❓ Want Recommendations for a Specific Movie?
we can generate predictions for a specific movie like this:

In [16]:
movie_id = "Mississippi Mermaid"  # Movie title to search for

# Extract the actual IMDB rating for the movie
actual_rating = df[df["Title"] == movie_id]["rating"].values

if len(actual_rating) > 0:
    actual_rating = actual_rating[0]
    print(f'Actual IMDB Rating for Movie "{movie_id}": {actual_rating}')
else:
    print(f'Actual IMDB Rating for "{movie_id}" not found in dataset.')

Actual IMDB Rating for Movie "Mississippi Mermaid": 6.4


In [17]:
movie_id = "Mississippi Mermaid"  # Replace with a movie's normalized title
user_id = 123  # A random user

pred = model.predict(user_id, movie_id)
print(f'Predicted Rating for Movie {movie_id}: {pred.est}')

Predicted Rating for Movie Mississippi Mermaid: 6.3213405618531295


In [18]:
movie_id = "Mississippi Mermaid"  # Movie title to search for
user_id = 123  # Random user for prediction

# Predict the rating using the trained model
pred = model.predict(user_id, movie_id)

# Extract the actual IMDB rating for the movie
actual_rating = df[df["Title"] == movie_id]["rating"].values

if len(actual_rating) > 0:
    actual_rating = actual_rating[0]
    print(f'Predicted Rating for Movie "{movie_id}": {pred.est:.2f}')
    print(f'Actual IMDB Rating for Movie "{movie_id}": {actual_rating}')
    print(f'Error (Difference): {abs(pred.est - actual_rating):.2f}')
else:
    print(f'Actual IMDB Rating for "{movie_id}" not found in dataset.')

Predicted Rating for Movie "Mississippi Mermaid": 6.32
Actual IMDB Rating for Movie "Mississippi Mermaid": 6.4
Error (Difference): 0.08


Predicted Rating:
* The model predicted that user 123 would rate "Mississippi Mermaid" a 6.32.
Actual Rating:
* The actual IMDB rating for "Mississippi Mermaid" in the dataset is 6.4.
Error:
* The difference between the predicted and actual ratings is 0.08. This is a very small error.

Significance:

Accuracy for a Single Prediction:
* The very small error (0.08) indicates that, for this specific prediction, the model performed very well.
Context:
* It is important to remember that this is only one single prediction. The overall performance of the model is measured by the RMSE and MAE that were previously provided. This result shows that for this particular user and movie combination, the model was very accurate.
Good Prediction:
* A difference of 0.08 on a typical rating scale (like a 1-10 scale) is negligible. This indicates that the model's prediction was very close to the actual rating.

In summary:
The model made a very accurate prediction for user 123's rating of "Mississippi Mermaid." This specific result is a good example of the model's ability to provide close predictions, but it is important to remember that the overall performance of the model should be judged by the RMSE and MAE values.

#### 3. Clustering for Movie Recommendations (K-Means)

Clustering techniques like K-Means can group similar movies based on features (e.g., genre, director, and creative type), and then we can recommend movies from the same cluster.

Steps:
* Use K-Means clustering to group movies based on selected features.
* Recommend movies from the same cluster.

In [19]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Use selected features: Genre, Director, Creative Type
# Convert categorical data to numerical using LabelEncoder
encoder = LabelEncoder()
data['Genre_encoded'] = encoder.fit_transform(data['Major Genre'])
data['Director_encoded'] = encoder.fit_transform(data['Director'])
data['Creative_Type_encoded'] = encoder.fit_transform(data['Creative Type'])

# Select the relevant features for clustering
features = data[['Genre_encoded', 'Director_encoded', 'Creative_Type_encoded']]

# Apply K-Means clustering
kmeans = KMeans(n_clusters=5, random_state=42)  # Adjust number of clusters as needed
data['Cluster'] = kmeans.fit_predict(features)

# Function to recommend movies based on the same cluster
def recommend_cluster_movies(movie_title, data, n_recommendations=5):
    movie_idx = data[data['Title'] == movie_title].index[0]
    cluster = data.loc[movie_idx, 'Cluster']
    
    # Get all movies from the same cluster
    cluster_movies = data[data['Cluster'] == cluster]
    
    # Exclude the original movie
    cluster_movies = cluster_movies[cluster_movies['Title'] != movie_title]
    
    # Get top N recommendations from the same cluster
    return cluster_movies['Title'].head(n_recommendations)

# Example: Get movie recommendations based on clustering
recommended_movies_cluster = recommend_cluster_movies('The Land Girls', data)
print(recommended_movies_cluster)

1        First Love, Last Rites
2    I Married a Strange Person
3          Let's Talk About Sex
4                          Slam
5           Mississippi Mermaid
Name: Title, dtype: object


* Cluster-Based Recommendations: The recommendations are based on the cluster assigned to the input movie. Movies in the same cluster are considered similar based on their genre, director, and creative type.
* Categorical Feature Similarity: The clustering is based on the similarity of categorical features, which might not always capture the full complexity of movie similarity.
* Label Encoding: Label encoding is a simple way to convert categorical data to numerical data, but it can introduce arbitrary ordering between categories.
* Limited Feature Set: The recommendation system is based on a limited set of features. Including more features, such as actors, plot keywords, or user ratings, could improve the recommendations.
* Cluster Interpretation: To better understand the recommendations, we should analyze the characteristics of the movies within each cluster. For example, we could examine the most common genres, directors, and creative types in each cluster.

The results show that the cluster-based recommendation system has recommended movies that are in the same cluster as "The Land Girls" based on their genre, director, and creative type. The recommendations might not always be perfect, and further experimentation and feature engineering are needed to improve the system.

##### Summary of Recommendation System:
* Content-Based Filtering: Recommends movies based on similarities in attributes like Genre, Director, and Creative Type.
* Collaborative Filtering: Recommends movies based on the ratings and preferences of similar users. We used the Surprise library for implementing this.
* Clustering (K-Means): Groups similar movies based on features and recommends movies from the same cluster.