
# SWE485 - Group 1

**Group 1 Team Members:**  
- Rama Alayed  
- Renad Almutred  
- Layan Almunasser  
- Sara Almogren  
- Nada Alkubra  



# Predicting Movie Success and Building a Recommendation System Using the IMDB Movie Dataset

## Introduction
The goal of this project is to develop a machine learning system that predicts the success of a movie from its features and provides personalized movie recommendations. 
It uses the IMDB Movie Dataset, which includes variables such as title, genre, description, director, actors, year, runtime, rating, and votes.

## Problem Statement
The movie industry is highly competitive. Predicting the success of a movie is crucial for stakeholders to plan effectively. 
This project classifies movies into three categories based on their IMDB ratings. Because ratings are continuous (for example 3.5 or 7.2), the categories are defined as ranges with no gaps:

- **Below 4 (Flop):** Low-rated movies  
- **4 to 7 (Average/Niche):** Moderately rated movies or movies with niche appeal  
- **Above 7 (Success):** Highly rated and successful movies  
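
The boundary handling for fractional ratings can be illustrated with a minimal sketch. It assumes the same thresholds as the classification cell later in this notebook (below 4, 4 to 7 inclusive, above 7); the function name `rating_to_category` and the sample ratings are hypothetical.

```python
import pandas as pd

def rating_to_category(rating: float) -> str:
    """Map a continuous IMDB rating to one of the three classes."""
    if rating < 4:
        return "Flop"
    if rating <= 7:
        return "Average/Niche"
    return "Success"

# Hypothetical ratings chosen to sit on or next to the class boundaries
ratings = pd.Series([3.9, 4.0, 7.0, 7.1])
labels = ratings.apply(rating_to_category)
print(labels.tolist())
```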

Additionally, clustering techniques will be used to group similar movies for recommendation purposes.



## Dataset
The IMDB Movie Dataset (available on Kaggle) contains 1,000 movies with features such as title, genre, description, director, actors, year, runtime, rating, and votes.  
The dataset will be preprocessed to handle missing values, encode categorical variables, and normalize numerical features.

**Dataset link:** [IMDB Movie Dataset on Kaggle](https://www.kaggle.com/datasets/yusufdelikkaya/imdb-movie-dataset)
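
The three preprocessing steps named above could look roughly like the following sketch. It runs on a small made-up frame rather than the real data; the column names (`Genre`, `Runtime`, `Revenue`), the comma-separated genre format, and the choice of median imputation and standard scaling are assumptions for illustration, not the project's confirmed method.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; column names and values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "Genre": ["Action,Sci-Fi", "Drama", "Comedy,Romance", "Drama"],
    "Runtime": [120, 95, 110, 130],
    "Revenue": [300.5, None, 45.2, 12.0],
})

# 1. Missing values: fill numeric gaps with the column median
toy["Revenue"] = toy["Revenue"].fillna(toy["Revenue"].median())

# 2. Categorical encoding: one indicator column per genre (multi-hot)
genre_dummies = toy["Genre"].str.get_dummies(sep=",")

# 3. Normalization: rescale numeric columns to zero mean and unit variance
numeric_cols = ["Runtime", "Revenue"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# Final feature matrix: scaled numeric columns plus genre indicators
features = pd.concat([scaled, genre_dummies], axis=1)
print(features.round(2))
```

Free-text columns such as the description or the actor list would need a different treatment (for example text vectorization), which this sketch leaves out.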


In [None]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans


In [None]:

# Load the dataset
file_path = "/mnt/data/imdb_movie_dataset.csv"  
df = pd.read_csv(file_path)

# Display first few rows
df.head()


In [None]:

# Check for missing values (printed explicitly, since a notebook cell only displays its last expression)
print(df.isnull().sum())

# Summary statistics
df.describe()


In [None]:

# Visualizing IMDB Ratings Distribution
plt.figure(figsize=(8,5))
sns.histplot(df["Rating"], bins=10, kde=True, color='blue')
plt.title("Distribution of IMDB Ratings")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()


In [None]:

# Convert ratings into categories: 0 (Flop), 1 (Average/Niche), 2 (Success)
df['Success_Category'] = df['Rating'].apply(lambda x: 0 if x < 4 else (1 if x <= 7 else 2))

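# Define features and target. Rating is dropped because the target is derived from it;
# only numeric columns are kept because RandomForestClassifier cannot fit raw text columns.
X = df.drop(columns=["Rating", "Success_Category"]).select_dtypes(include=[np.number])
y = df["Success_Category"]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fill missing values using training-set medians so no test information leaks into training
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)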

# Predictions
y_pred = clf.predict(X_test)

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Classification Model Accuracy: {accuracy:.2f}")
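
Accuracy alone can be misleading when one class (here most likely the flops) is rare, so a per-class view may be worth adding. The sketch below is not run on the IMDB data: it uses synthetic, deliberately imbalanced data from `make_classification` as a stand-in, and all `_demo` names and class weights are hypothetical. It shows a stratified split together with a confusion matrix and a per-class report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data standing in for the movie features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.05, 0.65, 0.30], random_state=42,
)

# stratify=y_demo keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model_demo = RandomForestClassifier(n_estimators=100, random_state=42)
model_demo.fit(X_tr, y_tr)
pred_demo = model_demo.predict(X_te)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_te, pred_demo, labels=[0, 1, 2])
print(cm_demo)

# Precision, recall, and F1 for each class separately
print(classification_report(
    y_te, pred_demo, labels=[0, 1, 2],
    target_names=["Flop", "Average/Niche", "Success"], zero_division=0,
))
```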


In [None]:

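# Unsupervised Learning - Clustering with K-Means
from sklearn.preprocessing import StandardScaler

# Cluster on numeric features only: exclude derived labels, fill missing values,
# and scale so that large-valued columns do not dominate the distances
cluster_features = df.select_dtypes(include=[np.number]).drop(columns=["Success_Category", "Cluster"], errors="ignore")
cluster_features = cluster_features.fillna(cluster_features.median())
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(StandardScaler().fit_transform(cluster_features))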

# Visualizing Clusters
plt.figure(figsize=(8,5))
sns.scatterplot(x=df['Runtime (Minutes)'], y=df['Rating'], hue=df['Cluster'], palette='viridis')
plt.title("Movie Clusters based on Runtime and Rating")
plt.xlabel("Runtime (Minutes)")
plt.ylabel("IMDB Rating")
plt.show()
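
One simple way the cluster labels above might feed a recommender is to suggest the highest-rated movies from the same cluster as a movie the user liked. The sketch below assumes `Title`, `Rating`, and `Cluster` columns; the helper `recommend_from_cluster` and the demo rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def recommend_from_cluster(movies: pd.DataFrame, title: str, n: int = 3) -> list:
    """Return up to n titles from the same cluster as `title`, highest rated first."""
    cluster_id = movies.loc[movies["Title"] == title, "Cluster"].iloc[0]
    same_cluster = movies[(movies["Cluster"] == cluster_id) & (movies["Title"] != title)]
    return same_cluster.sort_values("Rating", ascending=False)["Title"].head(n).tolist()

# Hypothetical rows standing in for the clustered dataset
demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Rating": [8.1, 6.5, 7.9, 5.0, 8.8],
    "Cluster": [0, 1, 0, 1, 0],
})

recs = recommend_from_cluster(demo, "A", n=2)
print(recs)
```

Ranking by rating inside a cluster is only one possible choice; ranking by distance to the liked movie in feature space would be an alternative.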



## Conclusion
This project applies machine learning techniques to predict movie success and build a recommendation system.  
- **Supervised Learning**: A classification model predicts whether a movie is a flop, average, or successful based on its features.  
- **Unsupervised Learning**: Clustering groups similar movies to enhance recommendations.  

These insights can help movie industry stakeholders make data-driven decisions.
