# **Movie Recommendation System**

-------------

## **Objective**

The objective of a movie recommender system is to provide personalized suggestions to users based on their preferences, viewing history, and behavior. Here are the key objectives of such systems:

1. Enhance User Experience: By recommending movies that align with users' tastes and interests, the system aims to improve overall user satisfaction and engagement.

2. Increase Engagement: Recommender systems aim to keep users on platforms longer by continuously offering relevant content, thereby increasing user interaction and retention.

3. Personalization: Tailoring recommendations to individual users ensures that they receive content that matches their specific preferences, leading to a more enjoyable and relevant experience.

4. Diversity of Content: Besides personalization, recommender systems also strive to introduce users to a diverse range of movies they might not have discovered on their own, thereby broadening their viewing horizons.

5. Revenue Generation: In commercial applications like streaming services, recommending popular or new content can potentially increase revenue through increased subscriptions, rentals, or purchases.

6. Improve Content Discovery: Recommender systems help users navigate through vast catalogs of movies efficiently, making it easier for them to discover new and interesting content.

7. Optimize Resources: By predicting user preferences, recommender systems can optimize resource allocation, such as server bandwidth or storage, based on anticipated demand for specific movies.

Overall, the goal is to create a win-win situation where users find relevant content they enjoy, while platforms benefit from increased user engagement and satisfaction.

## **Data Source**

The dataset was collected from The Movie Database (TMDB) using a valid API key.
The records were retrieved from the https://api.themoviedb.org/3/movie/top_rated/ endpoint with authorized access and saved as a CSV file.

The raw data obtained from API responses was processed to extract relevant information. This may include parsing JSON responses, handling pagination, and cleaning the data to ensure consistency.
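
The pagination and JSON-parsing steps described above might look roughly like the following sketch. It assumes the TMDB v3 `top_rated` endpoint accepts `api_key` and `page` query parameters and returns a JSON object with `results` and `total_pages` keys. The helper names (`fetch_page`, `rows_from_page`, `collect_rows`), the `max_pages` parameter, and the de-duplication step are illustrative and are not taken from the original collection script. Only the standard library is used.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"
# Columns kept from each movie record (the same columns as the CSV used below).
FIELDS = ["id", "title", "overview", "popularity",
          "release_date", "vote_average", "vote_count"]


def fetch_page(api_key, page):
    """Download one page of results and return the decoded JSON payload."""
    query = urllib.parse.urlencode({"api_key": api_key, "page": page})
    with urllib.request.urlopen(f"{BASE_URL}?{query}", timeout=30) as resp:
        return json.load(resp)


def rows_from_page(payload):
    """Keep only the fields of interest; missing fields become None."""
    return [{field: movie.get(field) for field in FIELDS}
            for movie in payload.get("results", [])]


def collect_rows(api_key, max_pages=5, fetch=fetch_page):
    """Walk the pages in order, stopping at max_pages or the last page."""
    rows, seen_ids = [], set()
    page = 1
    while page <= max_pages:
        payload = fetch(api_key, page)
        for row in rows_from_page(payload):
            if row["id"] not in seen_ids:  # drop duplicates across pages
                seen_ids.add(row["id"])
                rows.append(row)
        if page >= payload.get("total_pages", page):
            break
        page += 1
    return rows
```

Under these assumptions, the list returned by `collect_rows` could be passed to `pd.DataFrame(...)` and written out with `to_csv`. The `fetch` argument exists so the paging loop can be exercised with canned payloads instead of live requests.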

## **Import Library**

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from datetime import datetime

## **Import Data**

In [12]:
df = pd.read_csv("Top_rated_movies1.csv")

In [13]:
df.head()

| | id | title | overview | popularity | release_date | vote_average | vote_count |
|---|---|---|---|---|---|---|---|
| 0 | 168705 | BloodRayne | In 18th-century Romania, after spending much o... | 17.499 | 2005-10-22 | 4.105 | 501 |
| 1 | 19766 | Inspector Gadget 2 | After capturing Claw, all the criminals have g... | 20.772 | 2003-03-11 | 4.1 | 342 |
| 2 | 248705 | The Visitors: Bastille Day | Stuck in the corridors of time, Godefroy de Mo... | 18.828 | 2016-03-23 | 4.09 | 636 |
| 3 | 17711 | The Adventures of Rocky & Bullwinkle | Rocky and Bullwinkle have been living off the ... | 16.436 | 2000-06-30 | 4.075 | 335 |
| 4 | 580 | Jaws: The Revenge | After another deadly shark attack, Ellen Brody... | 30.996 | 1987-07-17 | 4.064 | 931 |


## **Describe Data**

In [14]:
# Dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8831 entries, 0 to 8830
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            8831 non-null   int64  
 1   title         8831 non-null   object 
 2   overview      8830 non-null   object 
 3   popularity    8831 non-null   float64
 4   release_date  8831 non-null   object 
 5   vote_average  8831 non-null   float64
 6   vote_count    8831 non-null   int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 483.1+ KB


In [15]:
# Describing Dataframe
df.describe()

| | id | popularity | vote_average | vote_count |
|---|---|---|---|---|
| count | 8831.0 | 8831.0 | 8831.0 | 8831.0 |
| mean | 177555.5 | 33.136177 | 6.639059 | 1969.126486 |
| std | 235240.0 | 49.108903 | 0.794924 | 3059.560533 |
| min | 5.0 | 0.6 | 2.106 | 1.0 |
| 25% | 9927.5 | 16.729 | 6.1135 | 469.0 |
| 50% | 33875.0 | 23.804 | 6.662 | 846.0 |
| 75% | 334521.5 | 35.7515 | 7.205 | 1966.0 |
| max | 1151534.0 | 1766.305 | 8.708 | 34794.0 |


## **Define Target Variable (y) and Feature Variables (X)**

In [16]:
# Selecting features and target variables
X_text = df['overview'].fillna('')  # Textual feature for TF-IDF
X_numerical = df[['id', 'popularity', 'release_date']]  # Numeric features plus release_date (converted to numeric parts during preprocessing)
y = df[['vote_average', 'vote_count']]  # Target variables

## **Data Preprocessing**

In [17]:
# Parse release_date once, then derive numeric year, month, and day features
release_dates = pd.to_datetime(X_numerical['release_date'])

# Replace the original release_date column with its numeric parts.
# drop() and assign() return new DataFrames, which avoids the pandas
# SettingWithCopyWarning raised when columns are assigned on a slice of df.
X_numerical = X_numerical.drop(columns=['release_date']).assign(
    release_year=release_dates.dt.year,
    release_month=release_dates.dt.month,
    release_day=release_dates.dt.day,
)

## **Train Test Split**

In [18]:
# Train-test split
X_text_train, X_text_test, X_numerical_train, X_numerical_test, y_train, y_test = train_test_split(X_text, X_numerical, y, test_size=0.2, random_state=42)

## **Modeling**

In [19]:
# Text vectorization for overview using TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
X_text_train_tfidf = tfidf.fit_transform(X_text_train)
X_text_test_tfidf = tfidf.transform(X_text_test)

# Convert TF-IDF vectors to DataFrame (sparse matrix to array)
X_text_train_df = pd.DataFrame(X_text_train_tfidf.toarray(), columns=tfidf.get_feature_names_out())
X_text_test_df = pd.DataFrame(X_text_test_tfidf.toarray(), columns=tfidf.get_feature_names_out())

# Concatenate numerical features with TF-IDF vectors
X_train = pd.concat([X_numerical_train.reset_index(drop=True), X_text_train_df], axis=1)
X_test = pd.concat([X_numerical_test.reset_index(drop=True), X_text_test_df], axis=1)

## **Model Evaluation**

In [20]:
# Example Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

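# Model evaluation (Example: RMSE for vote_average prediction)
y_pred = model.predict(X_test)
# Take the square root of the MSE directly; the squared=False argument
# is deprecated and has been removed in newer scikit-learn releases.
rmse = mean_squared_error(y_test['vote_average'], y_pred[:, 0]) ** 0.5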
print(f"RMSE for vote_average: {rmse}")

RMSE for vote_average: 2767.8207991596355
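
Because `vote_average` only ranges from about 2.1 to 8.7 in this dataset, an RMSE in the thousands suggests the fitted model is not generalizing. One quick sanity check, sketched here on made-up numbers, is to compare a model's RMSE against a baseline that always predicts the training mean. The `rmse` helper and all sample values are illustrative assumptions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up ratings standing in for y_train / y_test['vote_average'].
train_ratings = np.array([6.0, 7.0, 6.5, 7.5, 6.0])
test_ratings = np.array([6.0, 7.0, 8.0])

# Baseline: predict the training mean for every test movie.
baseline_pred = np.full_like(test_ratings, train_ratings.mean())
baseline_rmse = rmse(test_ratings, baseline_pred)

# Hypothetical model output to compare against the baseline.
model_pred = np.array([6.2, 6.9, 7.6])
model_rmse = rmse(test_ratings, model_pred)

print(f"baseline RMSE: {baseline_rmse:.3f}, model RMSE: {model_rmse:.3f}")
print("model beats baseline:", model_rmse < baseline_rmse)
```

A model whose test RMSE is well above this baseline is doing worse than ignoring the features entirely, which is a signal to revisit the feature set or the choice of model.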


## **Prediction**

In [21]:
# Example prediction, reusing the first test row as a stand-in for a new movie
# (X_new must have the same columns as X_train)
X_new = X_test.iloc[0:1]
y_new_pred = model.predict(X_new)
print(f"Predicted vote_average: {y_new_pred[0, 0]}, Predicted vote_count: {y_new_pred[0, 1]}")

Predicted vote_average: 2957.1132614504922, Predicted vote_count: 285496.6170125321


## **Explanation**

This notebook combines textual and numerical features of movies as a first step toward a recommender system. Movie overviews are converted to TF-IDF vectors, which weight each term by how often it appears in an overview relative to how common it is across all overviews. These vectors are joined with numerical attributes such as popularity and release date, and a linear regression model illustrates how the combined features can be used to predict `vote_average` and `vote_count`.

The dataset contains only movie metadata, so the model does not yet use individual user preferences or viewing history; personalized recommendations would require that data. The reported RMSE and the sample prediction also fall far outside the range of `vote_average` in the data (about 2.1 to 8.7), which indicates that the model as fitted does not generalize. Further work, for example feature scaling or regularization, may be needed before its predictions can support the objectives of relevant, diverse recommendations and improved content discovery.
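
To make the TF-IDF weighting concrete, the sketch below ranks movies by cosine similarity between their overview vectors, which is one common way to turn such vectors into content-based recommendations. It is an illustrative alternative on toy data, not the method used in the cells above; the `recommend` helper, the titles, and the overviews are all made up for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up catalog standing in for the real title/overview columns.
movies = pd.DataFrame({
    "title": ["Shark Bay", "Deep Water", "City Detective", "Night Patrol"],
    "overview": [
        "A giant shark terrorizes swimmers at a quiet beach",
        "Divers are hunted by a shark far from the beach",
        "A detective hunts a killer through the city streets",
        "Police officers patrol dangerous city streets at night",
    ],
})

tfidf = TfidfVectorizer(stop_words="english")
overview_vectors = tfidf.fit_transform(movies["overview"])
similarity = cosine_similarity(overview_vectors)  # shape: (n_movies, n_movies)


def recommend(title, top_n=2):
    """Return the top_n titles whose overviews are most similar to `title`."""
    idx = movies.index[movies["title"] == title][0]
    # Score every movie against the chosen one, then exclude the movie itself.
    scores = pd.Series(similarity[idx], index=movies["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend("Shark Bay", top_n=1))
```

This approach needs no rating or user data, only the overviews, so it could be applied to the same CSV by replacing the toy frame with the `title` and `overview` columns.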