Movie Recommendation System

-------------

## **Objective**

The objective of this project is to develop a movie recommendation system using Python that can suggest movies to users based on their preferences and past viewing history. This system will leverage collaborative filtering techniques to predict user preferences and recommend movies that are likely to interest them.
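
To make "collaborative filtering" concrete: the notebook later uses Surprise's `SVD` model, a matrix-factorization method that estimates a rating as the global mean plus a user bias, a movie bias, and the dot product of a user factor vector and a movie factor vector. The sketch below shows that prediction rule in plain Python. The helper name `predict_rating` and every number in it are made up for illustration; nothing here is learned from the ratings data.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   low=0.5, high=5.0):
    """Estimate one rating as mu + b_u + b_i + dot(p_u, q_i), clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + dot
    return min(high, max(low, estimate))


# Hypothetical values (illustrative only, not fitted parameters)
estimate = predict_rating(
    global_mean=3.5,            # average of all ratings
    user_bias=0.2,              # this user tends to rate slightly above average
    item_bias=-0.1,             # this movie tends to be rated slightly below average
    user_factors=[0.5, -0.25],  # latent taste vector for the user
    item_factors=[1.0, 0.5],    # latent trait vector for the movie
)
print(round(estimate, 3))  # roughly 3.975
```

Training amounts to choosing the biases and factor vectors that minimize the error on the known ratings; the library handles that step.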

## **Data Source**

ratings.csv: Contains user ratings for movies (userId, movieId, rating, timestamp).

movies.csv: Contains movie information (movieId, title, genres).
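
As a quick illustration of the two layouts, the sketch below parses one hand-written row per file with Python's standard `csv` module. The rows are illustrative stand-ins rather than an extract to rely on. It assumes, as in the MovieLens files, that `timestamp` is in Unix seconds and that `genres` is a single pipe-separated string (the preprocessing step further down splits on `|`).

```python
import csv
import io
from datetime import datetime, timezone

# Hand-written sample rows in the same column layout (illustrative values only)
ratings_csv = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"
movies_csv = "movieId,title,genres\n1,Toy Story (1995),Adventure|Animation|Children\n"

rating_row = next(csv.DictReader(io.StringIO(ratings_csv)))
movie_row = next(csv.DictReader(io.StringIO(movies_csv)))

# Assumption: timestamp is seconds since the Unix epoch (UTC)
rated_at = datetime.fromtimestamp(int(rating_row["timestamp"]), tz=timezone.utc)

# genres is one pipe-separated string; split it into a list
genres = movie_row["genres"].split("|")

print(rating_row["userId"], movie_row["title"], float(rating_row["rating"]))
print(rated_at.year, genres)
```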

## **Import Libraries**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for machine learning algorithms
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

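# Surprise library for collaborative filtering
# (its train_test_split is aliased so it does not shadow scikit-learn's version imported above)
from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split as surprise_train_test_split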

# System
import os

# Warnings
import warnings

## **Import Data**

In [None]:
import pandas as pd

# Load the data
ratings = pd.read_csv('path/to/ml-latest-small/ratings.csv')
movies = pd.read_csv('path/to/ml-latest-small/movies.csv')

# Display the first few rows of the datasets
print("Ratings DataFrame:")
print(ratings.head())

print("\nMovies DataFrame:")
print(movies.head())


## **Describe Data**

In [None]:
import pandas as pd

# Load the data
ratings = pd.read_csv('path/to/ml-latest-small/ratings.csv')
movies = pd.read_csv('path/to/ml-latest-small/movies.csv')

# Display basic information about the dataframes
print("Ratings DataFrame Info:")
print(ratings.info())

print("\nMovies DataFrame Info:")
print(movies.info())

# Display basic statistics about the numerical columns
print("\nRatings DataFrame Description:")
print(ratings.describe())

print("\nMovies DataFrame Description:")
print(movies.describe())

# Display the first few rows of each DataFrame
print("\nFirst 5 Rows of Ratings DataFrame:")
print(ratings.head())

print("\nFirst 5 Rows of Movies DataFrame:")
print(movies.head())


## **Data Visualization**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
ratings = pd.read_csv('path/to/ml-latest-small/ratings.csv')
movies = pd.read_csv('path/to/ml-latest-small/movies.csv')

# Merge ratings and movies data on movieId
merged_data = pd.merge(ratings, movies, on='movieId')

# Visualize the distribution of movie ratings
plt.figure(figsize=(10, 6))
sns.histplot(merged_data['rating'], bins=10, kde=True)
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

# Visualize the number of ratings per movie
ratings_per_movie = merged_data['title'].value_counts().head(10)
plt.figure(figsize=(12, 8))
sns.barplot(x=ratings_per_movie.values, y=ratings_per_movie.index, palette='viridis')
plt.title('Top 10 Movies by Number of Ratings')
plt.xlabel('Number of Ratings')
plt.ylabel('Movie Title')
plt.show()

# Visualize the number of ratings given by users
ratings_per_user = merged_data['userId'].value_counts().head(10)
plt.figure(figsize=(12, 8))
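# Cast the numeric user IDs to strings so seaborn treats them as categories (horizontal bars)
sns.barplot(x=ratings_per_user.values, y=ratings_per_user.index.astype(str), palette='viridis')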
plt.title('Top 10 Users by Number of Ratings')
plt.xlabel('Number of Ratings')
plt.ylabel('User ID')
plt.show()

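# Visualize the average rating per movie
# Note: titles with only one or two ratings can top this ranking, so the list is
# sensitive to movies that very few users have rated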
avg_ratings_per_movie = merged_data.groupby('title')['rating'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(12, 8))
sns.barplot(x=avg_ratings_per_movie.values, y=avg_ratings_per_movie.index, palette='viridis')
plt.title('Top 10 Movies by Average Rating')
plt.xlabel('Average Rating')
plt.ylabel('Movie Title')
plt.show()


## **Data Preprocessing**

In [None]:
import pandas as pd

# Load the data
ratings = pd.read_csv('path/to/ml-latest-small/ratings.csv')
movies = pd.read_csv('path/to/ml-latest-small/movies.csv')

# Merge ratings and movies data on movieId
merged_data = pd.merge(ratings, movies, on='movieId')

# Check for missing values
print("Missing values in ratings dataset:\n", ratings.isnull().sum())
print("Missing values in movies dataset:\n", movies.isnull().sum())
print("Missing values in merged dataset:\n", merged_data.isnull().sum())

# Drop any rows with missing values (if any)
merged_data.dropna(inplace=True)

# Encode genres into individual 0/1 indicator columns
# Create a column for each unique genre (genres are separated by '|')
genres = merged_data['genres'].str.get_dummies(sep='|')

# Merge genres with the original dataset
merged_data = pd.concat([merged_data, genres], axis=1)

# Drop the original genres column
merged_data.drop('genres', axis=1, inplace=True)

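# Normalize the ratings to the range [0, 1] (optional, depending on the model you plan to use)
# Note: the Surprise cells below reload the raw ratings and do not use this scaled column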
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
merged_data['rating'] = scaler.fit_transform(merged_data[['rating']])

# Display the first few rows of the preprocessed data
print(merged_data.head())
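
For readers who want to see what the two transformations in this cell do, here is a plain-Python sketch of the same ideas on toy values. The helper names `genre_indicators` and `min_max_scale` are hypothetical and only approximate the pandas and scikit-learn calls used above; the inputs are made up.

```python
def genre_indicators(genre_strings):
    """Turn pipe-separated genre strings into one 0/1 indicator dict per row."""
    all_genres = sorted({g for s in genre_strings for g in s.split("|")})
    return [{g: int(g in s.split("|")) for g in all_genres} for s in genre_strings]


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


# Toy inputs (illustrative only)
indicators = genre_indicators(["Comedy|Drama", "Drama"])
scaled = min_max_scale([0.5, 2.75, 5.0])

print(indicators)  # [{'Comedy': 1, 'Drama': 1}, {'Comedy': 0, 'Drama': 1}]
print(scaled)      # [0.0, 0.5, 1.0]
```

Because min-max scaling depends on the smallest and largest values present, the scaled ratings only map back to the 0.5 to 5.0 scale if both extremes occur in the data.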


## **Model Evaluation**

In [None]:
import pandas as pd
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Load the data
ratings = pd.read_csv('path/to/ml-latest-small/ratings.csv')

# Load the data into Surprise dataset
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Build the SVD model
model = SVD()

# Evaluate the model using cross-validation
results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Print the results
print("Cross-validation results:")
print("Mean RMSE: ", results['test_rmse'].mean())
print("Mean MAE: ", results['test_mae'].mean())
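
For reference, the two reported metrics can be computed by hand. The sketch below uses a few made-up (true, predicted) rating pairs, not output from the model above: RMSE is the square root of the mean squared error, and MAE is the mean absolute error.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


# Made-up (true rating, predicted rating) pairs
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]

rmse_value = rmse(pairs)
mae_value = mae(pairs)
print(round(rmse_value, 4), mae_value)  # 0.7906 0.75
```

RMSE weights large errors more heavily than MAE, so it is never smaller than MAE on the same predictions; a wide gap between the two suggests a few large misses.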


## **Prediction**

In [None]:
import pandas as pd
from surprise import Reader, Dataset, SVD

# Load the data
ratings = pd.read_csv('path/to/ml-latest-small/ratings.csv')

# Load the data into Surprise dataset
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Build the SVD model and train it on the full dataset
model = SVD()
trainset = data.build_full_trainset()
model.fit(trainset)

# Example prediction: Predict rating for user 1 and movie 1
userId = 1
movieId = 1
prediction = model.predict(userId, movieId)

# Print the prediction
print(f"Predicted rating for user {userId} and movie {movieId}: {prediction.est}")


## **Explanation**

Load Data: Load your ratings data into a pandas DataFrame.

Prepare Data: Use the Reader class from Surprise to define the rating scale and load data into a Surprise Dataset.

Train Model: Build and train the SVD model using the Surprise library.

Make Predictions: Specify a user ID and a movie ID for which you want to predict the rating using the trained model's predict method.
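
The steps above end with a single predicted rating. Turning that into recommendations usually means scoring the movies a user has not rated yet and keeping the highest estimates. A minimal, library-agnostic sketch follows. The helper `top_n_unseen`, the toy scores, and `fake_predict` are hypothetical names introduced here for illustration; with the trained Surprise model one could pass something like `lambda u, m: model.predict(u, m).est` as the `predict` argument.

```python
import heapq


def top_n_unseen(predict, user_id, all_movie_ids, seen_movie_ids, n=10):
    """Return (movie_id, score) for the n unseen movies with the highest predicted rating."""
    seen = set(seen_movie_ids)
    candidates = (m for m in all_movie_ids if m not in seen)
    scored = ((predict(user_id, m), m) for m in candidates)
    best = heapq.nlargest(n, scored)  # highest predicted score first
    return [(movie_id, score) for score, movie_id in best]


# Toy stand-in for a trained model (illustrative scores only)
toy_scores = {10: 4.5, 20: 3.0, 30: 4.9, 40: 2.5}


def fake_predict(user_id, movie_id):
    return toy_scores[movie_id]


recommendations = top_n_unseen(
    fake_predict,
    user_id=1,
    all_movie_ids=[10, 20, 30, 40],
    seen_movie_ids=[30],  # movie 30 is already rated, so it is excluded
    n=2,
)
print(recommendations)  # [(10, 4.5), (20, 3.0)]
```

Passing the prediction function in as an argument keeps the ranking logic independent of the model, so the same helper would work with any estimator that maps a (user, movie) pair to a score.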