# 🎬 MovieLens Recommendation System

## **📌 Business Understanding**

### **Context**

In the modern entertainment industry, streaming platforms and digital content providers face a significant challenge: helping users find content that matches their preferences. A robust **movie recommendation system** improves the user experience, increases engagement, and supports business growth.

## **📌 Problem Statement**

Users struggle to find relevant movies from vast catalogs, often leading to dissatisfaction and disengagement. A data-driven recommendation system is needed to personalize movie suggestions, improving user satisfaction and retention.

## **📌 Objectives**

1. **Develop a movie recommendation system** using advanced machine learning techniques.
2. **Implement multiple recommendation approaches**, including:
   - **Collaborative Filtering (SVD)** to analyze user-movie interactions.
   - **Content-Based Filtering (TF-IDF & Cosine Similarity)** to recommend movies based on their genre.
   - **Clustering (K-Means)** for user segmentation and diversity in recommendations.
3. **Evaluate model performance** using RMSE for the SVD model, the Silhouette Score for K-Means, and mean cosine similarity for the content-based model.
4. **Deploy the system** as an interactive **Streamlit web app** for real-time recommendations.
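
The content-based approach in objective 2 can be illustrated with a minimal, standard-library sketch. It makes two simplifying assumptions: it uses binary genre vectors rather than TF-IDF weights, and the catalog titles (`Movie A` to `Movie D`) and helper names are hypothetical, not part of the MovieLens data or of this notebook's pipeline.

```python
import math

def genre_vector(genres, vocabulary):
    """Binary bag-of-genres vector for a pipe-separated genre string."""
    present = set(genres.split("|"))
    return [1.0 if g in present else 0.0 for g in vocabulary]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical catalog rows, for illustration only
catalog = {
    "Movie A": "Adventure|Comedy",
    "Movie B": "Adventure|Comedy",
    "Movie C": "Drama|Romance",
    "Movie D": "Comedy|Romance",
}
vocabulary = sorted({g for genres in catalog.values() for g in genres.split("|")})
vectors = {title: genre_vector(genres, vocabulary) for title, genres in catalog.items()}

def most_similar(title, top_n=2):
    """Rank every other title by cosine similarity to `title`."""
    scores = [(other, cosine(vectors[title], vec))
              for other, vec in vectors.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(most_similar("Movie A"))
```

Movies that share every genre score close to 1, movies with no genre in common score 0, and partial overlap falls in between. The TF-IDF variant used later in the notebook follows the same logic but down-weights genres that appear in most of the catalog.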



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import streamlit as st
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate, GridSearchCV
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings("ignore")


In [2]:
# Load datasets
ratings_path = "ratings.csv"
movies_path = "movies.csv"
tags_path = "tags.csv"

ratings = pd.read_csv(ratings_path)
movies = pd.read_csv(movies_path)
tags = pd.read_csv(tags_path)


In [None]:
# Data Exploration
print("Ratings Dataset:")
print(ratings.head())
print("\nMovies Dataset:")
print(movies.head())
print("\nTags Dataset:")
print(tags.head())

# Commit Message: "Performed initial exploration of ratings, movies, and tags datasets."

# Data Preprocessing Pipeline
def preprocess_movies(movies_df):
    # Treat '|' as a literal separator rather than a regex alternation
    movies_df['genres'] = movies_df['genres'].str.replace('|', ' ', regex=False)
    vectorizer = TfidfVectorizer(stop_words='english')
    genre_matrix = vectorizer.fit_transform(movies_df['genres'])
    return genre_matrix

movies_tfidf_matrix = preprocess_movies(movies)

# Commit Message: "Implemented TF-IDF transformation in a preprocessing pipeline."

# Collaborative Filtering Pipeline
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
param_grid = {'n_factors': [50, 100, 150], 'n_epochs': [20, 30], 'lr_all': [0.002, 0.005], 'reg_all': [0.02, 0.1]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
gs.fit(data)
best_svd = gs.best_estimator['rmse']
cross_validate(best_svd, data, cv=5, verbose=True)

# Commit Message: "Integrated collaborative filtering with hyperparameter tuning into a pipeline."

# Model Explainability using SHAP
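trainset = data.build_full_trainset()
best_svd.fit(trainset)

# all_ratings() yields inner ids; predict() expects raw ids, so convert them back
id_pairs = np.array([(trainset.to_raw_uid(u), trainset.to_raw_iid(i))
                     for u, i, _ in trainset.all_ratings()])

def predict_ratings(pairs):
    # SHAP passes rows of (userId, movieId) and expects a 1-D array of predictions
    return np.array([best_svd.predict(int(u), int(i)).est for u, i in pairs])

# Model-agnostic KernelExplainer on small samples. userId and movieId are categorical
# identifiers, so treat the resulting attributions as a rough indication only.
rng = np.random.default_rng(42)
background = id_pairs[rng.choice(len(id_pairs), size=50, replace=False)]
sample = id_pairs[rng.choice(len(id_pairs), size=100, replace=False)]
explainer = shap.KernelExplainer(predict_ratings, background)
shap_values = explainer.shap_values(sample)
shap.summary_plot(shap_values, sample, feature_names=['userId', 'movieId'])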

# Commit Message: "Added SHAP explainability to Collaborative Filtering model."

# K-Means Clustering Pipeline
def cluster_movies(movies_df, feature_matrix, num_clusters=5):
    # Take the feature matrix as an argument instead of relying on a global variable
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    movies_df['Cluster'] = kmeans.fit_predict(feature_matrix)
    return movies_df

movies = cluster_movies(movies, movies_tfidf_matrix)

# Commit Message: "Implemented K-Means clustering in a structured pipeline."

# Compute Silhouette Score
sil_score = silhouette_score(movies_tfidf_matrix, movies['Cluster'])
print(f"Silhouette Score for {movies['Cluster'].nunique()} clusters: {sil_score:.4f}")

# Commit Message: "Computed silhouette score within the clustering pipeline."

# Content-Based Filtering Pipeline
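def recommend_movies(movie_title, movies_df, similarity_matrix, top_n=5):
    matches = movies_df.index[movies_df['title'] == movie_title]
    if len(matches) == 0:
        # Unknown title: return an empty frame instead of raising IndexError
        return movies_df.iloc[0:0][['title', 'genres']]
    idx = movies_df.index.get_loc(matches[0])  # positional row in the similarity matrix
    scores = sorted(enumerate(similarity_matrix[idx]), key=lambda x: x[1], reverse=True)
    # Drop the query movie explicitly; with ties at 1.0 it is not guaranteed to rank first
    movie_indices = [i for i, _ in scores if i != idx][:top_n]
    return movies_df.iloc[movie_indices][['title', 'genres']]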

similarity_matrix = cosine_similarity(movies_tfidf_matrix, movies_tfidf_matrix)

# Commit Message: "Refactored content-based filtering into a modular pipeline."

# Model Performance Comparison
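# The three numbers measure different things on different scales,
# so read them side by side rather than ranking the models by them.
cv_results = cross_validate(best_svd, data, measures=['rmse'], cv=5)
performance_summary = {
    "Collaborative Filtering (SVD), mean CV RMSE": cv_results['test_rmse'].mean(),
    "Content-Based Filtering, mean pairwise cosine similarity": np.mean(similarity_matrix),
    "Clustering (K-Means), silhouette score": sil_score
}
print("Model Performance Summary:")
for model, score in performance_summary.items():
    print(f"{model}: {score:.4f}")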

# Commit Message: "Added final model performance comparison section."

# Deployment with Streamlit
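st.title("🎬 Movie Recommendation System")
user_input = st.text_input("Enter a movie title:")
if user_input:
    # Check the title first so an unknown movie shows a message instead of crashing the app
    if user_input in movies['title'].values:
        recommendations = recommend_movies(user_input, movies, similarity_matrix)
        st.write("Top Recommendations:")
        st.write(recommendations)
    else:
        st.warning("Movie not found. MovieLens titles include the release year, e.g. 'Toy Story (1995)'.")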

# Commit Message: "Integrated Streamlit app for interactive recommendations."
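
Exact title matching is brittle for free-text input, because MovieLens titles carry the release year and users rarely type it. One possible way to make the lookup more forgiving is sketched below with the standard-library `difflib` module. The helper `resolve_title`, the `cutoff` value, and the three-title sample list are assumptions for illustration and are not part of the notebook above.

```python
import difflib

def resolve_title(query, titles, cutoff=0.6):
    """Return the catalog title closest to `query`, or None if nothing is close enough."""
    # Compare in lowercase, but map back to the original spelling
    lowered = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(
        query.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

# Small sample of titles for illustration
titles = ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"]
print(resolve_title("toy story", titles))
```

In the Streamlit app, the resolved title could be passed to `recommend_movies` in place of the raw input. Very short titles may fall below the default cutoff once the year is appended, so the threshold would need tuning against the real catalog.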